Transform structured data to JSON format using Spark Scala

I have my structured data as shown below and need to transform it into the shape shown under "Expected results"; my output schema is shown as well. I'd appreciate any help on how I can achieve this using Spark Scala code.
Note: grouping on the structured data should be done on the columns SN and VIN.
There should be one row per SN and VIN combination; if either SN or VIN changes, the data should go into the next row.
Structured data:
+-----------------+-------------+-----+---+
|VIN              |ST           |SV   |SN |
+-----------------+-------------+-----+---+
|FU74HZ501740XXXXX|1566799999225|44.0 |APP|
|FU74HZ501740XXXXX|1566800002758|61.0 |APP|
|FU74HZ501740XXXXX|1566800009446|23.39|ASP|
+-----------------+-------------+-----+---+
Expected results:
Output schema:
val outputSchema = StructType(
  List(
    StructField("VIN", StringType, true),
    StructField("EVENTS", ArrayType(
      StructType(Array(
        StructField("SN", StringType, true),
        StructField("ST", IntegerType, true),
        StructField("SV", DoubleType, true)
      ))))
  )
)
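One thing worth flagging: the sample ST values (e.g. 1566799999225) look like epoch milliseconds and exceed the Int range, so IntegerType cannot hold them. Below is a minimal sketch of the same schema with ST switched to LongType; this is my adjustment, not part of the original post.
import org.apache.spark.sql.types._

// Same shape as the schema above, but ST as LongType so epoch-millisecond
// timestamps such as 1566799999225 fit without overflow (an assumption on my part).
val outputSchemaLong = StructType(
  List(
    StructField("VIN", StringType, true),
    StructField("EVENTS", ArrayType(
      StructType(Array(
        StructField("SN", StringType, true),
        StructField("ST", LongType, true),
        StructField("SV", DoubleType, true)
      ))), true)
  )
)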

From Spark 2.1 you can achieve this using struct and collect_list.
import org.apache.spark.sql.functions.{col, collect_list, struct, to_json}
import spark.implicits._

val df_2 = Seq(
  ("FU74HZ501740XXXX", 1566799999225.0, 44.0, "APP"),
  ("FU74HZ501740XXXX", 1566800002758.0, 61.0, "APP"),
  ("FU74HZ501740XXXX", 1566800009446.0, 23.39, "ASP")
).toDF("vin", "st", "sv", "sn")
df_2.show(false)
+----------------+-----------------+-----+---+
|vin |st |sv |sn |
+----------------+-----------------+-----+---+
|FU74HZ501740XXXX|1.566799999225E12|44.0 |APP|
|FU74HZ501740XXXX|1.566800002758E12|61.0 |APP|
|FU74HZ501740XXXX|1.566800009446E12|23.39|ASP|
+----------------+-----------------+-----+---+
Use collect_list with struct:
df_2.groupBy("vin", "sn")
  .agg(collect_list(struct($"st", $"sv", $"sn")).as("events"))
  .withColumn("events", to_json($"events"))
  .drop(col("sn"))
  .show(false)
This will give the wanted output:
+----------------+---------------------------------------------------------------------------------------------+
|vin |events |
+----------------+---------------------------------------------------------------------------------------------+
|FU74HZ501740XXXX|[{"st":1.566800009446E12,"sv":23.39,"sn":"ASP"}] |
|FU74HZ501740XXXX|[{"st":1.566799999225E12,"sv":44.0,"sn":"APP"},{"st":1.566800002758E12,"sv":61.0,"sn":"APP"}]|
+----------------+---------------------------------------------------------------------------------------------+
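If you want the result to match the output schema from the question (EVENTS as an array of structs rather than a JSON string), here is a minimal sketch along the same lines; the output path is just a placeholder.
import org.apache.spark.sql.functions.{collect_list, struct}

// Group by both vin and sn (as the question requires), collect the events as
// an array of structs, and keep only vin plus the events array at the top level.
val nested = df_2
  .groupBy($"vin", $"sn")
  .agg(collect_list(struct($"sn", $"st", $"sv")).as("events"))
  .drop("sn")

// Writing this out as JSON gives one JSON object per row with a nested events
// array; "/tmp/out" is a placeholder path.
nested.write.json("/tmp/out")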

You can get it via the SparkSession:
val df = spark.read.json("/path/to/json/file/test.json")
Here spark is the SparkSession object.
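If the JSON files already follow the schema from the question, you can also pass it explicitly so Spark skips schema inference; a small sketch, assuming the outputSchema value defined earlier is in scope:
// Supplying the schema up front avoids an extra pass over the data for inference.
val dfTyped = spark.read
  .schema(outputSchema)
  .json("/path/to/json/file/test.json")
dfTyped.printSchema()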

Related

What is the order guarantee when joining two columns of a Spark dataframe which are processed separately?

I have a dataframe with 3 columns:
date
jsonString1
jsonString2
I want to expand the attributes inside the JSON into columns, so I did something like this:
val json1 = spark.read.json(dataframe.select(col("jsonString1")).rdd.map(_.getString(0)))
val json2 = spark.read.json(dataframe.select(col("jsonString2")).rdd.map(_.getString(0)))
val json1Table = json1.selectExpr("id", "status")
val json2Table = json2.selectExpr("name", "address")
Now I want to put these tables together, so I did the following:
val json1TableWithIndex = addColumnIndex(json1Table)
val json2TableWithIndex = addColumnIndex(json2Table)
var finalResult = json1TableWithIndex
  .join(json2TableWithIndex, Seq("columnindex"))
  .drop("columnindex")
def addColumnIndex(df: DataFrame) = spark.createDataFrame(
df.rdd.zipWithIndex.map { case (row, columnindex) => Row.fromSeq(row.toSeq :+ columnindex) },
StructType(df.schema.fields :+ StructField("columnindex", LongType, false))
)
After sampling a few rows I observe that the rows match exactly as in the source dataframe.
I did not find any information on the order guarantee when joining two columns of a dataframe which are processed separately. Is this the right way to solve my problem? Any help is appreciated.
It is always risky to rely on undocumented behaviours, as your code might not work as you intended, because you only have a partial understanding of it.
You can do the same thing in a much more efficient way without any split-and-join approach: use the from_json function to create two nested columns, then flatten out the nested columns, and finally drop the intermediate JSON string columns and nested columns.
Here is an example of the whole process.
import org.apache.spark.sql.functions.{col, from_json}
import org.apache.spark.sql.types.{StringType, StructType, StructField}
val df = (Seq(
("09-02-2020","{\"id\":\"01\", \"status\":\"Active\"}","{\"name\":\"Abdullah\", \"address\":\"Jumeirah\"}"),
("10-02-2020","{\"id\":\"02\", \"status\":\"Dormant\"}","{\"name\":\"Ali\", \"address\":\"Jebel Ali\"}")
).toDF("date","jsonString1","jsonString2"))
scala> df.show()
+----------+--------------------+--------------------+
| date| jsonString1| jsonString2|
+----------+--------------------+--------------------+
|09-02-2020|{"id":"01", "stat...|{"name":"Abdullah...|
|10-02-2020|{"id":"02", "stat...|{"name":"Ali", "a...|
+----------+--------------------+--------------------+
val schema1 = StructType(Seq(
  StructField("id", StringType, true),
  StructField("status", StringType, true)
))
val schema2 = StructType(Seq(
  StructField("name", StringType, true),
  StructField("address", StringType, true)
))
val dfFlattened = (df
  .withColumn("jsonData1", from_json(col("jsonString1"), schema1))
  .withColumn("jsonData2", from_json(col("jsonString2"), schema2))
  .withColumn("id", col("jsonData1.id"))
  .withColumn("status", col("jsonData1.status"))
  .withColumn("name", col("jsonData2.name"))
  .withColumn("address", col("jsonData2.address"))
  .drop("jsonString1")
  .drop("jsonString2")
  .drop("jsonData1")
  .drop("jsonData2"))
scala> dfFlattened.show()
+----------+---+-------+--------+---------+
| date| id| status| name| address|
+----------+---+-------+--------+---------+
|09-02-2020| 01| Active|Abdullah| Jumeirah|
|10-02-2020| 02|Dormant| Ali|Jebel Ali|
+----------+---+-------+--------+---------+
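As a side note, the four withColumn/drop pairs can be collapsed by expanding the nested structs with a star; here is a shorter sketch of the same flattening:
import org.apache.spark.sql.functions.{col, from_json}

// Expand both structs with .* and keep date; this yields the same five columns.
val dfFlattened2 = df
  .withColumn("jsonData1", from_json(col("jsonString1"), schema1))
  .withColumn("jsonData2", from_json(col("jsonString2"), schema2))
  .select(col("date"), col("jsonData1.*"), col("jsonData2.*"))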

Reading comma separated text file in spark 1.6

I have a text file which is similar to below
20190920
123456789,6325,NN5555,123,4635,890,C,9
985632465,6467,KK6666,654,9780,636,B,8
258063464,6754,MM777,789,9461,895,N,5
And I am using Spark 1.6 with Scala to read this text file:
val df = sqlcontext.read.format("com.databricks.spark.csv")
  .option("header", "false").option("inferSchema", "false").load(path)
df.show()
When I use the above command it reads only the first column. Is there anything I can add to read that file with all the column values?
Output I got :
20190920
123456789
985632465
258063464
In this case you should provide a schema, so your code will look like this:
val mySchema = StructType(
  List(
    StructField("col1", StringType, true),
    StructField("col2", StringType, true)
    // ... and the remaining columns
  )
)
val df = sqlcontext.read
  .format("com.databricks.spark.csv")
  .schema(mySchema)
  .option("header", "false")
  .option("inferSchema", "false")
  .load(path)
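For the sample rows above there would be eight columns in total. Here is a rough sketch of the full read; the colN names are placeholders, and DROPMALFORMED is only a suggestion for skipping the single-field 20190920 line, assuming your spark-csv version supports that mode.
import org.apache.spark.sql.types.{StringType, StructField, StructType}

// Placeholder names for the eight fields in each data row.
val fullSchema = StructType((1 to 8).map(i => StructField(s"col$i", StringType, true)))

val df = sqlcontext.read
  .format("com.databricks.spark.csv")
  .schema(fullSchema)
  .option("header", "false")
  .option("inferSchema", "false")
  // DROPMALFORMED should skip the lone "20190920" line, which has only one field.
  .option("mode", "DROPMALFORMED")
  .load(path)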

Spark dataframe with complex & nested data

I have 3 dataframes currently.
Call them dfA, dfB, and dfC.
dfA has 3 cols:
|Id | Name | Age |
dfB has, say, 5 cols; the 2nd col (AId) is a FK reference back to a dfA record:
|Id | AId | Street | City | Zip |
Similarly, dfC has 3 cols, also with a reference back to dfA:
|Id | AId | SomeField |
Using Spark SQL I can do a JOIN across the 3:
%sql
SELECT * FROM dfA
INNER JOIN dfB ON dfA.Id = dfB.AId
INNER JOIN dfC ON dfA.Id = dfC.AId
And I'll get my resultset, but it's been "flattened", as SQL does with tabular results like this.
I want to load it into a complex schema like this:
val destinationSchema = new StructType()
  .add("id", IntegerType)
  .add("name", StringType)
  .add("age", StringType)
  .add("b",
    new StructType()
      .add("street", StringType, true)
      .add("city", StringType, true)
      .add("zip", StringType, true)
  )
  .add("c",
    new StructType()
      .add("somefield", StringType, true)
  )
Any ideas how to take the results of the SELECT and save them to a dataframe with the schema specified above?
I ultimately want to save the complex StructType, or JSON, and load it into MongoDB using the Mongo Spark Connector.
Or, is there a better way to accomplish this from the 3 separate dataframes (which were originally 3 separate CSV files that were read in)?
Given three dataframes loaded in from the CSV files, you can do this:
import org.apache.spark.sql.functions._
val destDF = atableDF
  .join(btableDF, atableDF("id") === btableDF("id")).drop(btableDF("id"))
  .join(ctableDF, atableDF("id") === ctableDF("id")).drop(ctableDF("id"))
  .select($"id", $"name", $"age", struct($"street", $"city", $"zip") as "b", struct($"somefield") as "c")
val jsonDestDF = destDF.select(to_json(struct($"*")).as("row"))
which will output:
row
"{""id"":100,""name"":""John"",""age"":""43"",""b"":{""street"":""Dark Road"",""city"":""Washington"",""zip"":""98002""},""c"":{""somefield"":""appples""}}"
"{""id"":101,""name"":""Sally"",""age"":""34"",""b"":{""street"":""Light Ave"",""city"":""Los Angeles"",""zip"":""90210""},""c"":{""somefield"":""bananas""}}"
"{""id"":102,""name"":""Damian"",""age"":""23"",""b"":{""street"":""Short Street"",""city"":""New York"",""zip"":""70701""},""c"":{""somefield"":""pears""}}"
The previous approach works if all the records have a 1:1 relationship.
Here is how you can achieve it for 1:M (hint: use collect_set to group rows):
import org.apache.spark.sql.functions._
val destDF = atableDF
  .join(btableDF, atableDF("id") === btableDF("id")).drop(btableDF("id"))
  .join(ctableDF, atableDF("id") === ctableDF("id")).drop(ctableDF("id"))
  .groupBy($"id", $"name", $"age")
  .agg(collect_set(struct($"street", $"city", $"zip")) as "b", collect_set(struct($"somefield")) as "c")
val jsonDestDF = destDF.select(to_json(struct($"*")).as("row"))
display(jsonDestDF)
which will give you the following output:
row
"{""id"":102,""name"":""Damian"",""age"":""23"",""b"":[{""street"":""Short Street"",""city"":""New York"",""zip"":""70701""}],""c"":[{""somefield"":""pears""},{""somefield"":""pineapples""}]}"
"{""id"":100,""name"":""John"",""age"":""43"",""b"":[{""street"":""Dark Road"",""city"":""Washington"",""zip"":""98002""}],""c"":[{""somefield"":""appples""}]}"
"{""id"":101,""name"":""Sally"",""age"":""34"",""b"":[{""street"":""Light Ave"",""city"":""Los Angeles"",""zip"":""90210""}],""c"":[{""somefield"":""grapes""},{""somefield"":""peaches""},{""somefield"":""bananas""}]}"
Sample data I used, just in case anyone wants to play:
atable.csv
100,"John",43
101,"Sally",34
102,"Damian",23
104,"Rita",14
105,"Mohit",23
btable.csv:
100,"Dark Road","Washington",98002
101,"Light Ave","Los Angeles",90210
102,"Short Street","New York",70701
104,"Long Drive","Buffalo",80345
105,"Circular Quay","Orlando",65403
ctable.csv:
100,"appples"
101,"bananas"
102,"pears"
101,"grapes"
102,"pineapples"
101,"peaches"

Import data with a column of type Pig map into a Spark DataFrame?

So I'm trying to import data that has a column of type Pig map into a Spark dataframe, and I couldn't find anything on how to explode the map data into 3 columns named street, city, and state. I'm probably searching for the wrong thing. Right now I can import the data into 3 columns using StructType and StructField options.
val schema = StructType(Array(
  StructField("id", IntegerType, true),
  StructField("name", StringType, true),
  StructField("address", StringType, true))) // this is the part that I need to explode
val data = sqlContext.read.format("com.databricks.spark.csv")
  .option("header", "false")
  .option("delimiter", ";")
  .schema(schema)
  .load("hdfs://localhost:8020/filename")
Example row of the data that I need to make 5 columns from:
328;Some Name;[street#streetname,city#Chicago,state#IL]
What do I need to do to explode the map into 3 columns, so I'd essentially have a new dataframe with 5 columns? I just started Spark and I've never used Pig; I only figured out it was a Pig map by searching for the [key#value] structure.
I'm using Spark 1.6 with Scala, by the way. Thank you for any help.
I'm not too familiar with the pig format (there may even be libraries for it), but some good ol' fashioned string manipulation seems to work. In practice you may have to do some error checking, or you'll get index out of range errors.
import spark.implicits._

val data = spark.createDataset(Seq(
  (328, "Some Name", "[street#streetname,city#Chicago,state#IL]")
)).toDF("id", "name", "address")

data.as[(Long, String, String)].map(r => {
  val addr = r._3.substring(1, r._3.length - 1).split(",")
  val street = addr(0).split("#")(1)
  val city = addr(1).split("#")(1)
  val state = addr(2).split("#")(1)
  (r._1, r._2, street, city, state)
}).toDF("id", "name", "street", "city", "state").show()
which results in
+---+---------+----------+-------+-----+
| id| name| street| city|state|
+---+---------+----------+-------+-----+
|328|Some Name|streetname|Chicago| IL|
+---+---------+----------+-------+-----+
I'm not 100% certain of the compatibility with Spark 1.6, however. You may end up having to map the Dataframe (as opposed to the Dataset, as I'm converting it with the .as[] call) and extract the individual values from the Row object in your anonymous .map() function. The overall concept should be the same, though.
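Picking up on that last point, here is a rough sketch of the Row-based variant for Spark 1.6, assuming data is the DataFrame loaded with the schema from the question (in 1.6, map on a DataFrame returns an RDD, so toDF needs the sqlContext implicits):
import sqlContext.implicits._

val exploded = data.map { r =>
  // address looks like "[street#streetname,city#Chicago,state#IL]":
  // strip the brackets, split the entries, and keep the part after each '#'.
  val addr   = r.getString(2).stripPrefix("[").stripSuffix("]").split(",")
  val street = addr(0).split("#")(1)
  val city   = addr(1).split("#")(1)
  val state  = addr(2).split("#")(1)
  (r.getInt(0), r.getString(1), street, city, state)
}.toDF("id", "name", "street", "city", "state")

exploded.show()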

How to use double pipe as delimiter in CSV?

Spark 1.5 and Scala 2.10.6
I have a data file that is using "¦¦" as the delimiter. I am having a hard time parsing through this to create a data frame. Can multiple delimiters be used to create a data frame? The code works with a single broken pipe but not with multiple delimiters.
My Code:
val customSchema_1 = StructType(Array(
StructField("ID", StringType, true),
StructField("FILLER", StringType, true),
StructField("CODE", StringType, true)));
val df_1 = sqlContext.read
.format("com.databricks.spark.csv")
.schema(customSchema_1)
.option("delimiter", "¦¦")
.load("example.txt")
Sample file:
12345¦¦ ¦¦10
I ran into this and found a good solution. I am using Spark 2.3; I have a feeling it should work on all of Spark 2.2+, but I have not tested it. The way it works is that I replace the || with a tab, and then the built-in CSV reader can take a Dataset[String]. I used tab because I have commas in my data.
var df = spark.sqlContext.read
.option("header", "true")
.option("inferSchema", "true")
.option("delimiter", "\t")
.csv(spark.sqlContext.read.textFile("filename")
.map(line => line.split("\\|\\|").mkString("\t")))
Hope this helps someone else.
EDIT:
As of Spark 3.0.1 this works out of the box.
example:
val ds = List("name||id", "foo||12", "brian||34", """"cray||name"||123""", "cray||name||123").toDS
ds: org.apache.spark.sql.Dataset[String] = [value: string]
val csv = spark.read.option("header", "true").option("inferSchema", "true").option("delimiter", "||").csv(ds)
csv: org.apache.spark.sql.DataFrame = [name: string, id: string]
csv.show
+----------+----+
| name| id|
+----------+----+
| foo| 12|
| brian| 34|
|cray||name| 123|
| cray|name|
+----------+----+
So the actual error being emitted here is:
java.lang.IllegalArgumentException: Delimiter cannot be more than one character: ¦¦
The docs corroborate this limitation, and I checked the Spark 2.0 CSV reader; it has the same requirement.
Given all of this, if your data is simple enough that you won't have entries containing ¦¦, I would load your data like so:
scala> :pa
// Entering paste mode (ctrl-D to finish)
val customSchema_1 = StructType(Array(
StructField("ID", StringType, true),
StructField("FILLER", StringType, true),
StructField("CODE", StringType, true)));
// Exiting paste mode, now interpreting.
customSchema_1: org.apache.spark.sql.types.StructType = StructType(StructField(ID,StringType,true), StructField(FILLER,StringType,true), StructField(CODE,StringType,true))
scala> val rawData = sc.textFile("example.txt")
rawData: org.apache.spark.rdd.RDD[String] = example.txt MapPartitionsRDD[1] at textFile at <console>:31
scala> import org.apache.spark.sql.Row
import org.apache.spark.sql.Row
scala> val rowRDD = rawData.map(line => Row.fromSeq(line.split("¦¦")))
rowRDD: org.apache.spark.rdd.RDD[org.apache.spark.sql.Row] = MapPartitionsRDD[3] at map at <console>:34
scala> val df = sqlContext.createDataFrame(rowRDD, customSchema_1)
df: org.apache.spark.sql.DataFrame = [ID: string, FILLER: string, CODE: string]
scala> df.show
+-----+------+----+
| ID|FILLER|CODE|
+-----+------+----+
|12345| | 10|
+-----+------+----+
We tried to read data having custom delimiters and to customize the column names for the data frame in the following way:
# Hold the new column names separately
headers = "JC_^!~_*>Year_^!~_*>Date_^!~_*>Service_Type^!~_*>KMs_Run^!~_*>"
# '^!~_*>' is the field delimiter, so split the string on it
head = headers.split("^!~_*>")
# The command below splits the S3 file on the custom delimiter and converts it into a dataframe
df = sc.textFile("s3://S3_Path/sample.txt").map(lambda x: x.split("^!~_*>")).toDF(head)
Passing head as a parameter to toDF() assigns the new column names to the dataframe created from the text file with custom delimiters.
Hope this helps.
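For reference, here is roughly the same split-and-rename idea in Scala; this is only a sketch: Pattern.quote is used because String.split takes a regex (unlike Python's literal split), and the column names are my guess at what the header string contains.
import java.util.regex.Pattern
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{StringType, StructField, StructType}

// Escape the delimiter so its regex metacharacters (such as ^ and *) are taken literally.
val delim  = Pattern.quote("^!~_*>")
val names  = Seq("JC", "Year", "Date", "Service_Type", "KMs_Run") // assumed names
val schema = StructType(names.map(n => StructField(n, StringType, true)))

val rows = sc.textFile("s3://S3_Path/sample.txt")
  .map(line => Row.fromSeq(line.split(delim).toSeq))

val df = sqlContext.createDataFrame(rows, schema)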
Starting from Spark 3.0, support for multiple-character delimiters has been added:
https://issues.apache.org/jira/browse/SPARK-24540
The above solution proposed by @lockwobr works in Scala. Anyone working below Spark 3.0 and looking for a solution in PySpark can refer to the below:
ratings_schema = StructType([
StructField("user_id", StringType(), False)
, StructField("movie_id", StringType(), False)
, StructField("rating", StringType(), False)
, StructField("rating_timestamp", StringType(), True)
])
#movies_df = spark.read.csv("ratings.dat", header=False, sep="::", schema=ratings_schema)
movies_df = spark.createDataFrame(
spark.read.text("ratings.dat").rdd.map(lambda line: line[0].split("::")),
ratings_schema)
I have provided an example, but you can modify it for your logic.