Scala Code to Read Json File in Hdfs

EXPEDIA GROUP TECHNOLOGY — DATA

Working with JSON in Apache Spark

Denormalising human-readable JSON for sweet data processing

JSON is omnipresent. However, it isn't always easy to process JSON datasets because of their nested structure. In this tutorial, I discuss working with JSON datasets using Apache Spark™️. Previously, I've published several blogs about Apache Spark, listed below:

  • Start Your Journey with Apache Spark — Part 1
  • Start Your Journey with Apache Spark — Part 2
  • Start Your Journey with Apache Spark — Part 3
  • Deep Dive into Apache Spark DateTime Functions
  • Deep Dive into Apache Spark Window Functions
  • Deep Dive into Apache Spark Array Functions
  • Apache Spark Structured Streaming

Please have a look if you haven't already.

Let's play with sample JSON. We will use the sample data below in this blog. You can find the data file here at GitHub® also.

Import Required Libraries

Before we begin to read the JSON file, let's import useful libraries.
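The original import cell is not preserved in this copy. A minimal sketch of the imports the later steps rely on, assuming Spark SQL is on the classpath; the object name, app name, and `local[*]` master are illustrative choices, and notebooks or spark-shell normally provide a ready-made `spark` session instead:

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
// col and explode are imported here because the flattening steps use them.
import org.apache.spark.sql.functions.{col, explode}

object ImportsExample {
  // Assumption: a local session is good enough for experimenting.
  def localSession(): SparkSession =
    SparkSession.builder().appName("json-demo").master("local[*]").getOrCreate()

  def main(args: Array[String]): Unit = {
    val spark = localSession()
    // Trivial DataFrame just to confirm the session works.
    val df: DataFrame = spark.range(3).select(col("id"))
    println(df.count())
    spark.stop()
  }
}
```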

Read Sample JSON File

Now let's read the JSON file. You can save the above data as a JSON file or you can get the file from here. We will use the json function under the DataFrameReader class. It returns a nested DataFrame.

You must provide the location of the file to be read. Also, we used multiLine = true because our JSON record spans multiple lines. You can find a detailed list of options here which can be used in the above json function.
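As a rough sketch of this read step: the record below is a hypothetical stand-in shaped like the schema described in this post (it is not the article's data file), and it is written to a temporary local file so the example runs on its own. The object and helper names are assumptions; on a cluster the location could just as well be an `hdfs://` URI.

```scala
import java.nio.charset.StandardCharsets
import java.nio.file.Files
import org.apache.spark.sql.{DataFrame, SparkSession}

object ReadJsonExample {
  // Hypothetical stand-in record spanning several lines.
  val sampleJson: String =
    """{
      |  "id": "0001",
      |  "type": "donut",
      |  "name": "Cake",
      |  "ppu": 0.55,
      |  "batters": { "batter": [
      |    { "id": "1001", "type": "Regular" },
      |    { "id": "1002", "type": "Chocolate" }
      |  ] },
      |  "topping": [
      |    { "id": "5001", "type": "None" },
      |    { "id": "5002", "type": "Glazed" }
      |  ]
      |}""".stripMargin

  // Writes the stand-in record to a temporary file and returns its path.
  def writeSample(): String = {
    val path = Files.createTempFile("sample", ".json")
    Files.write(path, sampleJson.getBytes(StandardCharsets.UTF_8))
    path.toString
  }

  // multiLine = true lets a single JSON record span multiple lines.
  def readJson(spark: SparkSession, location: String): DataFrame =
    spark.read.option("multiLine", true).json(location)

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("read-json").master("local[*]").getOrCreate()
    val sampleDF = readJson(spark, writeSample())
    sampleDF.show(false)
    spark.stop()
  }
}
```

Without `multiLine`, Spark expects one complete JSON record per line, so a pretty-printed file like this one would not parse as intended.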

Explore DataFrame Schema

We use printSchema() to display the schema of the DataFrame.
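A small sketch of the schema check, assuming a hypothetical one-record stand-in for the sample data that is parsed from an in-memory string rather than a file (object and helper names are illustrative):

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object SchemaExample {
  // Hypothetical stand-in record, not the article's original data file.
  val sampleJson: String =
    """{"id":"0001","type":"donut","name":"Cake","ppu":0.55,"batters":{"batter":[""" +
    """{"id":"1001","type":"Regular"},{"id":"1002","type":"Chocolate"}]},"topping":[""" +
    """{"id":"5001","type":"None"},{"id":"5002","type":"Glazed"}]}"""

  // Parses the in-memory record; each string in the Dataset is one JSON record.
  def sampleDF(spark: SparkSession): DataFrame = {
    import spark.implicits._
    spark.read.json(Seq(sampleJson).toDS())
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("schema").master("local[*]").getOrCreate()
    // Prints the inferred schema as an indented tree.
    sampleDF(spark).printSchema()
    spark.stop()
  }
}
```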

Output

Looking at the above output, you can see that this is a nested DataFrame containing a struct, array, strings, etc. Feel free to compare the above schema with the JSON data to better understand the data before proceeding.

For example, column batters is a struct of an array of a struct. Column topping is an array of a struct. Columns id, name, ppu, and type are simple string, string, double, and string columns respectively.

Convert Nested "batters" to Structured DataFrame

Now let's work with the batters column, which is a nested column. First of all, let's rename the top-level "id" column because we have another "id" as a key of the element struct under batters.

Let's try to explore the "batters" column now. Extract the batter element from batters, which is a Struct of an Array, and check the schema.
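A minimal sketch of the rename, using a hypothetical stand-in record. The new column name `key` is an assumption made for illustration; any name that does not clash with the nested "id" would do.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object RenameExample {
  // Hypothetical stand-in record, not the article's original data file.
  val sampleJson: String =
    """{"id":"0001","type":"donut","name":"Cake","ppu":0.55,"batters":{"batter":[""" +
    """{"id":"1001","type":"Regular"},{"id":"1002","type":"Chocolate"}]},"topping":[""" +
    """{"id":"5001","type":"None"},{"id":"5002","type":"Glazed"}]}"""

  def sampleDF(spark: SparkSession): DataFrame = {
    import spark.implicits._
    spark.read.json(Seq(sampleJson).toDS())
  }

  // Renames only the top-level "id"; the nested "id" fields are untouched.
  def renamed(spark: SparkSession): DataFrame =
    sampleDF(spark).withColumnRenamed("id", "key")

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("rename").master("local[*]").getOrCreate()
    renamed(spark).printSchema()
    spark.stop()
  }
}
```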
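One way this extraction might look, assuming the top-level "id" has been renamed to a hypothetical `key` column and using a stand-in record. The dotted path `batters.batter` reaches into the struct and returns the array it contains:

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object ExtractBatterExample {
  // Hypothetical stand-in record, not the article's original data file.
  val sampleJson: String =
    """{"id":"0001","type":"donut","name":"Cake","ppu":0.55,"batters":{"batter":[""" +
    """{"id":"1001","type":"Regular"},{"id":"1002","type":"Chocolate"}]},"topping":[""" +
    """{"id":"5001","type":"None"},{"id":"5002","type":"Glazed"}]}"""

  def sampleDF(spark: SparkSession): DataFrame = {
    import spark.implicits._
    spark.read.json(Seq(sampleJson).toDS())
  }

  // Keeps the renamed key plus the "batter" array nested inside "batters".
  def batDF(spark: SparkSession): DataFrame =
    sampleDF(spark).withColumnRenamed("id", "key").select("key", "batters.batter")

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("extract").master("local[*]").getOrCreate()
    batDF(spark).printSchema()
    spark.stop()
  }
}
```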

Output

Let's check the content of the DataFrame "batDF". You can find more details about the show function here.
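A short sketch of inspecting the content, again on a hypothetical stand-in record with the assumed `key` column. Passing `false` as the second argument of `show` turns off truncation so the whole array is visible:

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object ShowBatterExample {
  // Hypothetical stand-in record, not the article's original data file.
  val sampleJson: String =
    """{"id":"0001","type":"donut","name":"Cake","ppu":0.55,"batters":{"batter":[""" +
    """{"id":"1001","type":"Regular"},{"id":"1002","type":"Chocolate"}]},"topping":[""" +
    """{"id":"5001","type":"None"},{"id":"5002","type":"Glazed"}]}"""

  def batDF(spark: SparkSession): DataFrame = {
    import spark.implicits._
    spark.read.json(Seq(sampleJson).toDS())
      .withColumnRenamed("id", "key")
      .select("key", "batters.batter")
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("show").master("local[*]").getOrCreate()
    // Show at most one row without truncating long cell values.
    batDF(spark).show(1, false)
    spark.stop()
  }
}
```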

Output

We have got all the batter details in a single row because the batter is an Array of Struct. Let's try to create a separate row for each batter.

Let's create a separate row for each element of the "batter" array by exploding the "batter" column.
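A minimal sketch of the explode step on a hypothetical stand-in record. The `key` column name is an assumption; `new_batter` and `bat2DF` follow the names used in this post. In the Scala API `explode` takes a `Column`, so the inputs are wrapped with `col`:

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, explode}

object ExplodeBatterExample {
  // Hypothetical stand-in record, not the article's original data file.
  val sampleJson: String =
    """{"id":"0001","type":"donut","name":"Cake","ppu":0.55,"batters":{"batter":[""" +
    """{"id":"1001","type":"Regular"},{"id":"1002","type":"Chocolate"}]},"topping":[""" +
    """{"id":"5001","type":"None"},{"id":"5002","type":"Glazed"}]}"""

  def batDF(spark: SparkSession): DataFrame = {
    import spark.implicits._
    spark.read.json(Seq(sampleJson).toDS())
      .withColumnRenamed("id", "key")
      .select("key", "batters.batter")
  }

  // One output row per array element; each element lands in "new_batter".
  def bat2DF(spark: SparkSession): DataFrame =
    batDF(spark).select(col("key"), explode(col("batter")).alias("new_batter"))

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("explode").master("local[*]").getOrCreate()
    bat2DF(spark).show(false)
    spark.stop()
  }
}
```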

Output

Let's check the schema of the bat2DF.

Output

Now we can extract the individual elements from the "new_batter" struct. We can use a dot (".") operator to extract the individual element or we can use "*" with dot (".") operator to select all the elements.
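Both selection styles can be sketched as follows, on a hypothetical stand-in record with the assumed `key` column. The helper names are illustrative only:

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, explode}

object StructFieldsExample {
  // Hypothetical stand-in record, not the article's original data file.
  val sampleJson: String =
    """{"id":"0001","type":"donut","name":"Cake","ppu":0.55,"batters":{"batter":[""" +
    """{"id":"1001","type":"Regular"},{"id":"1002","type":"Chocolate"}]},"topping":[""" +
    """{"id":"5001","type":"None"},{"id":"5002","type":"Glazed"}]}"""

  def bat2DF(spark: SparkSession): DataFrame = {
    import spark.implicits._
    spark.read.json(Seq(sampleJson).toDS())
      .withColumnRenamed("id", "key")
      .select(col("key"), explode(col("batters.batter")).alias("new_batter"))
  }

  // Dot operator: pick a single field out of the struct.
  def oneField(df: DataFrame): DataFrame =
    df.select(col("key"), col("new_batter.id"))

  // "*" with the dot operator: expand every field of the struct into columns.
  def allFields(df: DataFrame): DataFrame =
    df.select("key", "new_batter.*")

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("fields").master("local[*]").getOrCreate()
    val df = bat2DF(spark)
    oneField(df).show()
    allFields(df).show()
    spark.stop()
  }
}
```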

Output

Now we have converted the JSON to a structured DataFrame.

Let's put together everything we discussed so far.
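The steps discussed so far could be chained roughly like this. It is a sketch on a hypothetical stand-in record; the output column names `key`, `bat_id`, and `bat_type` are assumptions chosen for readability, not names confirmed by this post:

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, explode}

object FlattenBattersExample {
  // Hypothetical stand-in record, not the article's original data file.
  val sampleJson: String =
    """{"id":"0001","type":"donut","name":"Cake","ppu":0.55,"batters":{"batter":[""" +
    """{"id":"1001","type":"Regular"},{"id":"1002","type":"Chocolate"}]},"topping":[""" +
    """{"id":"5001","type":"None"},{"id":"5002","type":"Glazed"}]}"""

  def sampleDF(spark: SparkSession): DataFrame = {
    import spark.implicits._
    spark.read.json(Seq(sampleJson).toDS())
  }

  // Rename, explode the nested array, expand the struct, then rename its fields.
  def flattenBatters(df: DataFrame): DataFrame =
    df.withColumnRenamed("id", "key")
      .select(col("key"), explode(col("batters.batter")).alias("new_batter"))
      .select("key", "new_batter.*")
      .withColumnRenamed("id", "bat_id")
      .withColumnRenamed("type", "bat_type")

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("flatten").master("local[*]").getOrCreate()
    flattenBatters(sampleDF(spark)).show(false)
    spark.stop()
  }
}
```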

Output

Convert Nested "toppings" to Structured DataFrame

Let's convert the "toppings" nested structure to a simple DataFrame. Here we use the techniques that we learned so far to extract elements from a Struct and an Array.
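A sketch of the same technique applied to the topping column, on a hypothetical stand-in record. Since `topping` is already a top-level array of structs, it can be exploded directly; the names `key`, `new_topping`, `top_id`, and `top_type` are assumptions:

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, explode}

object FlattenToppingExample {
  // Hypothetical stand-in record, not the article's original data file.
  val sampleJson: String =
    """{"id":"0001","type":"donut","name":"Cake","ppu":0.55,"batters":{"batter":[""" +
    """{"id":"1001","type":"Regular"},{"id":"1002","type":"Chocolate"}]},"topping":[""" +
    """{"id":"5001","type":"None"},{"id":"5002","type":"Glazed"}]}"""

  def sampleDF(spark: SparkSession): DataFrame = {
    import spark.implicits._
    spark.read.json(Seq(sampleJson).toDS())
  }

  // Explode the array, expand each struct, and rename the resulting columns.
  def flattenTopping(df: DataFrame): DataFrame =
    df.withColumnRenamed("id", "key")
      .select(col("key"), explode(col("topping")).alias("new_topping"))
      .select("key", "new_topping.*")
      .withColumnRenamed("id", "top_id")
      .withColumnRenamed("type", "top_type")

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("topping").master("local[*]").getOrCreate()
    flattenTopping(sampleDF(spark)).show(false)
    spark.stop()
  }
}
```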

Output

I hope you have enjoyed learning about working with JSON data in Apache Spark. For easy reference, a notebook containing the examples above is available on GitHub.

Apache Spark, Spark, Apache, the Apache feather logo, and the Apache Spark project logo are either registered trademarks or trademarks of The Apache Software Foundation in the United States and other countries.

Photo by Anna Sullivan on Unsplash

http://lifeatexpediagroup.com

Source: https://medium.com/expedia-group-tech/working-with-json-in-apache-spark-1ecf553c2a8c
