EXPEDIA GROUP TECHNOLOGY — DATA
Working with JSON in Apache Spark
Denormalising human-readable JSON for easier data processing
JSON is omnipresent. However, it isn't always easy to process JSON datasets because of their nested structure. In this tutorial, I discuss working with JSON datasets using Apache Spark™️. Previously, I've published several blogs about Apache Spark, as mentioned below:
- Start Your Journey with Apache Spark — Part 1
- Start Your Journey with Apache Spark — Part 2
- Start Your Journey with Apache Spark — Part 3
- Deep Dive into Apache Spark DateTime Functions
- Deep Dive into Apache Spark Window Functions
- Deep Dive into Apache Spark Array Functions
- Apache Spark Structured Streaming
Please have a look if you haven't already.
Let's play with some sample JSON. We will use the sample data below in this blog. You can also find the data file on GitHub.
{
"id": "0001",
"type": "donut",
"name": "Cake",
"ppu": 0.55,
"batters":
{
"batter":
[
{ "id": "1001", "type": "Regular" },
{ "id": "1002", "type": "Chocolate" },
{ "id": "1003", "type": "Blueberry" }
]
},
"topping":
[
{ "id": "5001", "type": "None" },
{ "id": "5002", "type": "Glazed" },
{ "id": "5005", "blazon": "Sugar" },
{ "id": "5007", "type": "Powdered Saccharide" },
{ "id": "5006", "type": "Chocolate with Sprinkles" },
{ "id": "5003", "type": "Chocolate" },
{ "id": "5004", "blazon": "Maple" }
]
}
Import Required Libraries
Before we begin to read the JSON file, let's import the required libraries.
from pyspark.sql.functions import *
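A quick aside: the wildcard import above is convenient for a tutorial, but in your own code it is often cleaner to import only the functions you need. A minimal, equivalent sketch for the functions used in the examples that follow:

# Explicit imports keep the namespace tidy and make dependencies obvious
from pyspark.sql.functions import explode, col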
Read Sample JSON File
Now let's read the JSON file. You can save the above data as a JSON file, or you can get the file from the GitHub link mentioned above. We will use the json function of the DataFrameReader class. It returns a nested DataFrame.
rawDF = spark.read.json("<PATH_to_JSON_File>", multiLine="true")
You must provide the location of the file to be read. Also, we used multiLine="true" because our JSON record spans multiple lines. You can find a detailed list of options that the json function accepts in the Spark documentation.
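If you already know the structure of your data, you can also supply an explicit schema instead of letting Spark infer one, which saves an extra pass over the file. Below is an illustrative sketch, not part of the original walk-through; the partial field list is an assumption based on the sample data:

from pyspark.sql.types import StructType, StructField, StringType, DoubleType

# Partial, hypothetical schema covering only the simple top-level fields
partial_schema = StructType([
    StructField("id", StringType(), True),
    StructField("name", StringType(), True),
    StructField("ppu", DoubleType(), True),
    StructField("type", StringType(), True),
])

# option()-style reader call, equivalent to passing multiLine as a keyword
typedDF = (spark.read
           .option("multiLine", "true")
           .schema(partial_schema)
           .json("<PATH_to_JSON_File>"))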
Explore DataFrame Schema
We use printSchema() to display the schema of the DataFrame.
rawDF.printSchema()
Output
root
 |-- batters: struct (nullable = true)
 |    |-- batter: array (nullable = true)
 |    |    |-- element: struct (containsNull = true)
 |    |    |    |-- id: string (nullable = true)
 |    |    |    |-- type: string (nullable = true)
 |-- id: string (nullable = true)
 |-- name: string (nullable = true)
 |-- ppu: double (nullable = true)
 |-- topping: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- id: string (nullable = true)
 |    |    |-- type: string (nullable = true)
 |-- type: string (nullable = true)
Looking at the above output, you can see that this is a nested DataFrame containing a struct, an array, strings, etc. Feel free to compare the above schema with the JSON data to better understand the data before proceeding.
For example, column batters is a struct of an array of a struct. Column topping is an array of a struct. Columns id, name, ppu, and type are simple string, string, double, and string columns respectively.
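If you prefer to inspect the types programmatically rather than visually, the dtypes attribute returns the column names and types as a list of tuples; a small sketch:

# Returns pairs such as ('ppu', 'double'), including the full nested type strings
print(rawDF.dtypes)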
Convert Nested "batters" to Structured DataFrame
Now let's work with the batters column, which is a nested column. First of all, let's rename the top-level "id" column, because we have another "id" as a key of the element struct under batters.
sampleDF = rawDF.withColumnRenamed("id", "key")
Let's try to explore the "batters" column now. Extract the batter element from batters, which is a struct of an array, and check the schema.
batDF = sampleDF.select("key", "batters.batter")
batDF.printSchema()
Output
root
 |-- key: string (nullable = true)
 |-- batter: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- id: string (nullable = true)
 |    |    |-- type: string (nullable = true)
Let's check the content of the DataFrame "batDF". You can find more details about the show function in the Spark documentation.
batDF.show(1, False)
Output
+----+-------------------------------------------------------+
|key |batter                                                 |
+----+-------------------------------------------------------+
|0001|[[1001, Regular], [1002, Chocolate], [1003, Blueberry]]|
+----+-------------------------------------------------------+
We have got all the batter details in a single row because batter is an array of structs. Let's create a separate row for each element of the "batter" array by exploding the "batter" column.
bat2DF = batDF.select("key", explode("batter").alias("new_batter"))
bat2DF.show()
Output
+----+--------------------+
| key|          new_batter|
+----+--------------------+
|0001|     [1001, Regular]|
|0001|   [1002, Chocolate]|
|0001|   [1003, Blueberry]|
+----+--------------------+
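As an aside, if you also need the position of each element within the array, posexplode generates an extra column holding the zero-based index; a minimal sketch using the same batDF:

from pyspark.sql.functions import posexplode

# 'pos' holds the zero-based index of each batter within the array
batDF.select("key", posexplode("batter").alias("pos", "new_batter")).show()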
Let's check the schema of bat2DF.
bat2DF.printSchema()
Output
root
 |-- key: string (nullable = true)
 |-- new_batter: struct (nullable = true)
 |    |-- id: string (nullable = true)
 |    |-- type: string (nullable = true)
Now we can extract the individual elements from the "new_batter" struct. We can use the dot (".") operator to extract an individual element, or we can use "*" with the dot (".") operator to select all the elements.
bat2DF.select("key", "new_batter.*").show()
Output
+----+----+------------+
| key|  id|        type|
+----+----+------------+
|0001|1001|     Regular|
|0001|1002|   Chocolate|
|0001|1003|   Blueberry|
+----+----+------------+
Now we have converted the JSON to a structured DataFrame.
Let's put together everything we discussed so far.
finalBatDF = (sampleDF
.select("key",
explode("batters.batter").allonym("new_batter"))
.select("key", "new_batter.*")
.withColumnRenamed("id", "bat_id")
.withColumnRenamed("type", "bat_type"))
finalBatDF.show()
Output
+----+------+------------+
| key|bat_id|    bat_type|
+----+------+------------+
|0001|  1001|     Regular|
|0001|  1002|   Chocolate|
|0001|  1003|   Blueberry|
+----+------+------------+
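The same flattening can also be written as a single select using col(...).alias(...) instead of chained withColumnRenamed calls. This is just an alternative sketch, not the approach used above:

from pyspark.sql.functions import col, explode

# One select: explode the array, then pull out and rename the struct fields
compactBatDF = (sampleDF
                .select("key", explode("batters.batter").alias("nb"))
                .select("key",
                        col("nb.id").alias("bat_id"),
                        col("nb.type").alias("bat_type")))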
Convert Nested "toppings" to Structured DataFrame
Let's convert the "topping" nested structure to a simple DataFrame. Here we use the techniques that we have learned so far to extract elements from a struct and an array.
topDF = (sampleDF
.select("key", explode("topping").alias("new_topping"))
.select("key", "new_topping.*")
.withColumnRenamed("id", "top_id")
.withColumnRenamed("type", "top_type")
)
topDF.show(10, False)
Output
+----+------+------------------------+
|key |top_id|top_type                |
+----+------+------------------------+
|0001|5001  |None                    |
|0001|5002  |Glazed                  |
|0001|5005  |Sugar                   |
|0001|5007  |Powdered Sugar          |
|0001|5006  |Chocolate with Sprinkles|
|0001|5003  |Chocolate               |
|0001|5004  |Maple                   |
+----+------+------------------------+
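To complete the denormalisation, the two flattened DataFrames can be combined on the shared key. This final step is not part of the original walk-through, just an illustrative sketch; joining on "key" produces one row per batter-and-topping combination:

# Hypothetical final step: one denormalised row per (batter, topping) pair
flatDF = finalBatDF.join(topDF, "key")
flatDF.show(5, False)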
I hope you have enjoyed learning about working with JSON data in Apache Spark. For easy reference, a notebook containing the examples above is available on GitHub.
Apache Spark, Spark, Apache, the Apache feather logo, and the Apache Spark project logo are either registered trademarks or trademarks of The Apache Software Foundation in the United States and other countries.