EXPEDIA GROUP TECHNOLOGY — DATA
Working with JSON in Apache Spark
Denormalising human-readable JSON for easier data processing
JSON is omnipresent. However, it isn't always easy to process JSON datasets because of their nested structure. In this tutorial, I discuss working with JSON datasets using Apache Spark™️. Previously, I've published several blogs about Apache Spark, as mentioned below:
- Start Your Journey with Apache Spark — Part 1
- Start Your Journey with Apache Spark — Part 2
- Start Your Journey with Apache Spark — Part 3
- Deep Dive into Apache Spark DateTime Functions
- Deep Dive into Apache Spark Window Functions
- Deep Dive into Apache Spark Array Functions
- Apache Spark Structured Streaming
Please have a look if you haven't already.
Let's play with some sample JSON. We will use the sample data below in this blog. You can also find the data file on GitHub.
{
"id": "0001",
"type": "donut",
"name": "Cake",
"ppu": 0.55,
"batters":
{
"batter":
[
{ "id": "1001", "type": "Regular" },
{ "id": "1002", "type": "Chocolate" },
{ "id": "1003", "type": "Blueberry" }
]
},
"topping":
[
{ "id": "5001", "type": "None" },
{ "id": "5002", "type": "Glazed" },
{ "id": "5005", "blazon": "Sugar" },
{ "id": "5007", "type": "Powdered Saccharide" },
{ "id": "5006", "type": "Chocolate with Sprinkles" },
{ "id": "5003", "type": "Chocolate" },
{ "id": "5004", "blazon": "Maple" }
]
}
Import Required Libraries
Before we begin to read the JSON file, let's import the required libraries.
from pyspark.sql.functions import *
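A quick aside: the wildcard import above is convenient for a tutorial, but in your own code it is often cleaner to import only the functions you need. A minimal, equivalent sketch for the functions used in the examples that follow:

# Explicit imports keep the namespace tidy and make dependencies obvious
from pyspark.sql.functions import explode, col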
Read Sample JSON File
Now let's read the JSON file. You can save the above data as a JSON file, or you can get the file from the GitHub link mentioned above. We will use the json function of the DataFrameReader class. It returns a nested DataFrame.
rawDF = spark.read.json("<PATH_to_JSON_File>", multiLine="true")
You must provide the location of the file to be read. Also, we used multiLine="true" because our JSON record spans multiple lines. You can find a detailed list of options that the json function accepts in the Spark documentation.
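If you already know the structure of your data, you can also supply an explicit schema instead of letting Spark infer one, which saves an extra pass over the file. Below is an illustrative sketch, not part of the original walk-through; the partial field list is an assumption based on the sample data:

from pyspark.sql.types import StructType, StructField, StringType, DoubleType

# Partial, hypothetical schema covering only the simple top-level fields
partial_schema = StructType([
    StructField("id", StringType(), True),
    StructField("name", StringType(), True),
    StructField("ppu", DoubleType(), True),
    StructField("type", StringType(), True),
])

# option()-style reader call, equivalent to passing multiLine as a keyword
typedDF = (spark.read
           .option("multiLine", "true")
           .schema(partial_schema)
           .json("<PATH_to_JSON_File>"))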
Explore DataFrame Schema
We use printSchema() to display the schema of the DataFrame.
rawDF.printSchema()
Output
root
 |-- batters: struct (nullable = true)
 |    |-- batter: array (nullable = true)
 |    |    |-- element: struct (containsNull = true)
 |    |    |    |-- id: string (nullable = true)
 |    |    |    |-- type: string (nullable = true)
 |-- id: string (nullable = true)
 |-- name: string (nullable = true)
 |-- ppu: double (nullable = true)
 |-- topping: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- id: string (nullable = true)
 |    |    |-- type: string (nullable = true)
 |-- type: string (nullable = true)
Looking at the above output, you can see that this is a nested DataFrame containing a struct, an array, strings, etc. Feel free to compare the above schema with the JSON data to better understand the data before proceeding.
For example, column batters is a struct of an array of a struct. Column topping is an array of a struct. Columns id, name, ppu, and type are simple string, string, double, and string columns respectively.
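If you prefer to inspect the types programmatically rather than visually, the dtypes attribute returns the column names and types as a list of tuples; a small sketch:

# Returns pairs such as ('ppu', 'double'), including the full nested type strings
print(rawDF.dtypes)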
Convert Nested "batters" to Structured DataFrame
Now let's work with the batters column, which is a nested column. First of all, let's rename the top-level "id" column, because we have another "id" as a key of the element struct under batters.
sampleDF = rawDF.withColumnRenamed("id", "key")
Let's try to explore the "batters" column now. Extract the batter element from batters, which is a struct of an array, and check the schema.
batDF = sampleDF.select("key", "batters.batter")
batDF.printSchema()
Output
root
 |-- key: string (nullable = true)
 |-- batter: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- id: string (nullable = true)
 |    |    |-- type: string (nullable = true)
Let's check the content of the DataFrame "batDF". You can find more details about the show function in the Spark documentation.
batDF.show(1, False)
Output
+----+-------------------------------------------------------+
|key |batter                                                 |
+----+-------------------------------------------------------+
|0001|[[1001, Regular], [1002, Chocolate], [1003, Blueberry]]|
+----+-------------------------------------------------------+
We have got all the batter details in a single row because batter is an array of structs. Let's create a separate row for each element of the "batter" array by exploding the "batter" column.
bat2DF = batDF.select("key", explode("batter").alias("new_batter"))
bat2DF.show()
Output
+----+--------------------+
| key|          new_batter|
+----+--------------------+
|0001|     [1001, Regular]|
|0001|   [1002, Chocolate]|
|0001|   [1003, Blueberry]|
+----+--------------------+
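As an aside, if you also need the position of each element within the array, posexplode generates an extra column holding the zero-based index; a minimal sketch using the same batDF:

from pyspark.sql.functions import posexplode

# 'pos' holds the zero-based index of each batter within the array
batDF.select("key", posexplode("batter").alias("pos", "new_batter")).show()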
Let's check the schema of bat2DF.
bat2DF.printSchema()
Output
root
 |-- key: string (nullable = true)
 |-- new_batter: struct (nullable = true)
 |    |-- id: string (nullable = true)
 |    |-- type: string (nullable = true)
Now we can extract the individual elements from the "new_batter" struct. We can use the dot (".") operator to extract an individual element, or we can use "*" with the dot (".") operator to select all the elements.
bat2DF.select("key", "new_batter.*").show()
Output
+----+----+------------+
| key|  id|        type|
+----+----+------------+
|0001|1001|     Regular|
|0001|1002|   Chocolate|
|0001|1003|   Blueberry|
+----+----+------------+
Now we have converted the JSON to a structured DataFrame.
Let's put together everything we discussed so far.
finalBatDF = (sampleDF
.select("key",
explode("batters.batter").allonym("new_batter"))
.select("key", "new_batter.*")
.withColumnRenamed("id", "bat_id")
.withColumnRenamed("type", "bat_type"))
finalBatDF.show()
Output
+----+------+------------+
| key|bat_id|    bat_type|
+----+------+------------+
|0001|  1001|     Regular|
|0001|  1002|   Chocolate|
|0001|  1003|   Blueberry|
+----+------+------------+
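The same flattening can also be written as a single select using col(...).alias(...) instead of chained withColumnRenamed calls. This is just an alternative sketch, not the approach used above:

from pyspark.sql.functions import col, explode

# One select: explode the array, then pull out and rename the struct fields
compactBatDF = (sampleDF
                .select("key", explode("batters.batter").alias("nb"))
                .select("key",
                        col("nb.id").alias("bat_id"),
                        col("nb.type").alias("bat_type")))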
Convert Nested "toppings" to Structured DataFrame
Let's convert the "topping" nested structure to a simple DataFrame. Here we use the techniques that we have learned so far to extract elements from a struct and an array.
topDF = (sampleDF
.select("key", explode("topping").alias("new_topping"))
.select("key", "new_topping.*")
.withColumnRenamed("id", "top_id")
.withColumnRenamed("type", "top_type")
)
topDF.show(10, False)
Output
+----+------+------------------------+
|key |top_id|top_type                |
+----+------+------------------------+
|0001|5001  |None                    |
|0001|5002  |Glazed                  |
|0001|5005  |Sugar                   |
|0001|5007  |Powdered Sugar          |
|0001|5006  |Chocolate with Sprinkles|
|0001|5003  |Chocolate               |
|0001|5004  |Maple                   |
+----+------+------------------------+
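To complete the denormalisation, the two flattened DataFrames can be combined on the shared key. This final step is not part of the original walk-through, just an illustrative sketch; joining on "key" produces one row per batter-and-topping combination:

# Hypothetical final step: one denormalised row per (batter, topping) pair
flatDF = finalBatDF.join(topDF, "key")
flatDF.show(5, False)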
I hope you have enjoyed learning about working with JSON data in Apache Spark. For easy reference, a notebook containing the examples above is available on GitHub.
Apache Spark, Spark, Apache, the Apache feather logo, and the Apache Spark project logo are either registered trademarks or trademarks of The Apache Software Foundation in the United States and other countries.