Loading JSON to Spark SQL - scala

I'm doing self study about JSON with Spark SQL in v2.1 and am using the data from the link
https://catalog.data.gov/dataset/air-quality-measures-on-the-national-environmental-health-tracking-network
The problem I have is that when I use:
val lines = spark.read
.option("multiLine", true).option("mode", "PERMISSIVE")
.json("E:/VW/meta_plus_sample_Data.json")
I get Spark SQL returning all the data as one row.
+--------------------+--------------------+
| data| meta|
+--------------------+--------------------+
|[[row-8eh8_xxkx-u...|[[[[1439474950, t...|
+--------------------+--------------------+
And when I remove:
.option("multiLine", true).option("mode", "PERMISSIVE")
I get the following error:
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Exception in thread "main" org.apache.spark.sql.AnalysisException: Since Spark 2.3, the queries from raw JSON/CSV files are disallowed when the
referenced columns only include the internal corrupt record column
(named _corrupt_record by default). For example:
spark.read.schema(schema).json(file).filter($"_corrupt_record".isNotNull).count()
and spark.read.schema(schema).json(file).select("_corrupt_record").show().
Instead, you can cache or save the parsed results and then send the same query.
For example, val df = spark.read.schema(schema).json(file).cache() and then
df.filter($"_corrupt_record".isNotNull).count().;
Is there an option in Spark SQL to load each record from the file as one row in the table?

This is expected behavior: the file (from the link provided in the question) contains only one JSON record, with meta (an object) and data (an array).
Because that single JSON record spans multiple lines, we need to include the multiLine option.
val df = spark.read.option("multiLine", true).option("mode", "PERMISSIVE").json("tmp.json")
df.show()
//sample data
//+--------------------+--------------------+
//| data| meta|
//+--------------------+--------------------+
//|[[row-8eh8_xxkx-u...|[[[[1439474950, t...|
//+--------------------+--------------------+
//access meta struct columns
df.select("meta.view.*").show()
//+--------------------+-------------+--------------------+--------------------+----------+--------------------+-----------+-------------+--------------------+--------------------+---------------+----------------+---------+--------------+--------------------+--------------------+----------+----------------+--------+--------------------+----------+------------------------+---------------+----------------+----------------+--------------------+------+--------+-------------+-------------+--------------------+-------+--------------------+---------------+---------+----------------+--------+
//| approvals|averageRating| category| columns| createdAt| description|displayType|downloadCount| flags| grants|hideFromCatalog|hideFromDataJson| id|indexUpdatedAt| metadata| name|newBackend|numberOfComments| oid| owner|provenance|publicationAppendEnabled|publicationDate|publicationGroup|publicationStage| query|rights|rowClass|rowsUpdatedAt|rowsUpdatedBy| tableAuthor|tableId| tags|totalTimesRated|viewCount|viewLastModified|viewType|
//+--------------------+-------------+--------------------+--------------------+----------+--------------------+-----------+-------------+--------------------+--------------------+---------------+----------------+---------+--------------+--------------------+--------------------+----------+----------------+--------+--------------------+----------+------------------------+---------------+----------------+----------------+--------------------+------+--------+-------------+-------------+--------------------+-------+--------------------+---------------+---------+----------------+--------+
//|[[1439474950, tru...| 0|Environmental Hea...|[[, meta_data,, :...|1439381433|The Environmental...| table| 26159|[default, restora...|[[[public], false...| false| false|cjae-szjv| 1528204279|[[table, fatrow, ...|Air Quality Measu...| true| 0|12801487|[Tracking, 94g5-7...| official| false| 1439474950| 3957835| published|[[[true, [2171820...|[read]| | 1439402317| 94g5-7as2|[Tracking, 94g5-7...|3960642|[environmental ha...| 0| 3843| 1528203875| tabular|
//+--------------------+-------------+--------------------+--------------------+----------+--------------------+-----------+-------------+--------------------+--------------------+---------------+----------------+---------+--------------+--------------------+--------------------+----------+----------------+--------+--------------------+----------+------------------------+---------------+----------------+----------------+--------------------+------+--------+-------------+-------------+--------------------+-------+--------------------+---------------+---------+----------------+--------+
//to access the data array we need to explode it
df.selectExpr("explode(data)").show()
//+--------------------+
//| col|
//+--------------------+
//|[row-8eh8_xxkx-u3...|
//|[row-u2v5_78j5-px...|
//|[row-68zj_7qfn-sx...|
//|[row-8b4d~zt5j~da...|
//|[row-5gee.63td_z6...|
//|[row-tzyx.ssxh_pz...|
//|[row-3yj2_u42c_mr...|
//|[row-va7z.p2v8.7p...|
//|[row-r7kk_e3dm-z2...|
//|[row-bnrc~w34s-4a...|
//|[row-ezrk~m5dc_5n...|
//|[row-nyya.dvnz~c6...|
//|[row-dq3i_wt6d_c6...|
//|[row-u6rc-k3mf-cn...|
//|[row-t9c6-4d4b_r6...|
//|[row-vq6r~mxzj-e6...|
//|[row-vxqn-mrpc~5b...|
//|[row-3akn_5nzm~8v...|
//|[row-ugxn~bhax.a2...|
//|[row-ieav.mdz9-m8...|
//+--------------------+
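If you need individual fields out of each exploded element, note that the data rows are positional arrays, so you can pull values out by index with getItem. A minimal sketch; the indices and aliases below are only illustrative assumptions, since the real positions come from meta.view.columns:
//hypothetical field extraction by position (indices and names are assumptions)
df.selectExpr("explode(data) as rec")
  .select(
    $"rec".getItem(0).as("sid"),    //assumed: the row identifier
    $"rec".getItem(1).as("id"),     //assumed: the row uuid
    $"rec".getItem(8).as("measure") //assumed: one of the measure fields
  )
  .show(5, false)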
Load multiple json records:
//json array with two records
spark.read.json(Seq(("""
[{"id":1,"name":"a"},
{"id":2,"name":"b"}]
""")).toDS).show()
//the json array has 2 objects, so they are loaded as 2 rows
//+---+----+
//| id|name|
//+---+----+
//| 1| a|
//| 2| b|
//+---+----+
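For comparison, if the input were in JSON Lines format (one complete object per line, which is what Spark's JSON reader expects by default), each object would land in its own row without the multiLine option. A small sketch of that case:
//json lines: one object per element, no multiLine needed
spark.read.json(Seq(
"""{"id":1,"name":"a"}""",
"""{"id":2,"name":"b"}"""
).toDS).show()
//+---+----+
//| id|name|
//+---+----+
//| 1| a|
//| 2| b|
//+---+----+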

Related

Spark RDD to Dataframe

Below is the data in a file
PREFIX|Description|Destination|Num_Type
1|C1|IDD|NA
7|C2|IDDD|NA
20|C3|IDDD|NA
27|C3|IDDD|NA
30|C5|IDDD|NA
I am trying to read it and convert into Dataframe.
val file=sc.textFile("/user/cloudera-scm/file.csv")
val list=file.collect.toList
list.toDF.show
+--------------------+
| value|
+--------------------+
|PREFIX|Descriptio...|
| 1|C1|IDD|NA|
| 7|C2|IDDD|NA|
| 20|C3|IDDD|NA|
| 27|C3|IDDD|NA|
| 30|C5|IDDD|NA|
+--------------------+
I am not able to convert this to a dataframe with the exact table form.
Let's first consider your code.
// reading a potentially big file
val file=sc.textFile("/user/cloudera-scm/file.csv")
// collecting everything to the driver
val list=file.collect.toList
// converting the local list to a dataframe (this only gives a single string column, as shown above)
list.toDF.show
There are ways to make your code work, but the very logic is awkward. You are reading data with the executors, putting all of it on the driver simply to convert it to a dataframe (and send it back to the executors). That's a lot of network communication, and the driver will most likely run out of memory for any reasonably large dataset.
What you can do is read the data directly as a dataframe like this (the driver does nothing and there is no unnecessary IO):
spark.read
.option("sep", "|") // specify the delimiter
.option("header", true) // to tell spark that there is a header
.option("inferSchema", true) // optional, infer the types of the columns
.csv(".../data.csv").show
+------+-----------+-----------+--------+
|PREFIX|Description|Destination|Num_Type|
+------+-----------+-----------+--------+
| 1| C1| IDD| NA|
| 7| C2| IDDD| NA|
| 20| C3| IDDD| NA|
| 27| C3| IDDD| NA|
| 30| C5| IDDD| NA|
+------+-----------+-----------+--------+
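If you would rather not rely on schema inference, you can also declare the schema explicitly. A sketch; the column types are an assumption based on the sample rows shown above:
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

//explicit schema instead of inferSchema (types are an assumption)
val schema = StructType(Seq(
  StructField("PREFIX", IntegerType),
  StructField("Description", StringType),
  StructField("Destination", StringType),
  StructField("Num_Type", StringType)
))

spark.read
  .option("sep", "|")
  .option("header", true)
  .schema(schema)
  .csv(".../data.csv")
  .show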

Combine two RDDs in Scala

The first RDD, user_person, is a Hive table which records every person's information:
+---------+---+----+
|person_id|age| bmi|
+---------+---+----+
| -100| 1|null|
| 3| 4|null|
...
Below is my second RDD, a Hive table that only has 40 rows and only includes basic information:
| id|startage|endage|energy|
| 1| 0| 0.2| 1|
| 1| 2| 10| 3|
| 1| 10| 20| 5|
I want to compute every person's energy requirement based on their age range.
For example, a person whose age is 4 requires 3 energy. I want to add that info into the RDD user_person.
How can I do this?
First, initialize the Spark session with enableHiveSupport() and copy the Hive config files (hive-site.xml, core-site.xml, and hdfs-site.xml) to the Spark conf/ directory, to enable Spark to read from Hive.
val sparkSession = SparkSession.builder()
.appName("spark-scala-read-and-write-from-hive")
.config("hive.metastore.warehouse.dir", params.hiveHost + "user/hive/warehouse")
.enableHiveSupport()
.getOrCreate()
Read the Hive tables as Dataframes as below:
val personDF = sparkSession.sql("SELECT * from user_person")
val infoDF = sparkSession.sql("SELECT * from person_info")
Join these two dataframes using below expression:
val outputDF = personDF.join(infoDF, $"age" >= $"startage" && $"age" < $"endage")
The outputDF dataframe contains all the columns of both input dataframes.
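For reference, here is a self-contained sketch of that range join with made-up sample rows (the values are assumptions for illustration; the column names follow the question):
import sparkSession.implicits._

//made-up sample data (bmi omitted for brevity)
val personDF = Seq((-100, 1), (3, 4)).toDF("person_id", "age")
val infoDF = Seq((1, 0.0, 0.2, 1), (1, 2.0, 10.0, 3), (1, 10.0, 20.0, 5))
  .toDF("id", "startage", "endage", "energy")

//each person matches the age bracket that contains their age
val outputDF = personDF.join(infoDF, $"age" >= $"startage" && $"age" < $"endage")
outputDF.select("person_id", "age", "energy").show()
//the person with age 4 falls in the [2, 10) bracket, so energy = 3;
//the person with age 1 has no matching bracket in this sample and is dropped by the inner join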

Remove all records which are duplicate in spark dataframe

I have a spark dataframe with multiple columns in it. I want to find out and remove rows which have duplicated values in a column (the other columns can be different).
I tried using dropDuplicates(col_name), but it only drops the duplicate entries while still keeping one record in the dataframe. What I need is to remove all entries that initially contained duplicate values.
I am using Spark 1.6 and Scala 2.10.
I would use window functions for this. Let's say you want to remove rows with a duplicate id:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.count
df
.withColumn("cnt", count("*").over(Window.partitionBy($"id")))
.where($"cnt"===1).drop($"cnt")
.show()
This can be done by grouping by the column (or columns) to look for duplicates in, and then aggregating and filtering the results.
Example dataframe df:
+---+---+
| id|num|
+---+---+
| 1| 1|
| 2| 2|
| 3| 3|
| 4| 4|
| 4| 5|
+---+---+
Grouping by the id column to remove its duplicates (the last two rows):
val df2 = df.groupBy("id")
.agg(first($"num").as("num"), count($"id").as("count"))
.filter($"count" === 1)
.select("id", "num")
This will give you:
+---+---+
| id|num|
+---+---+
| 1| 1|
| 2| 2|
| 3| 3|
+---+---+
Alternatively, it can be done using a join. It will be slower, but if there are a lot of columns there is no need to use first($"num").as("num") for each one to keep them.
val df2 = df.groupBy("id").agg(count($"id").as("count")).filter($"count" === 1).select("id")
val df3 = df.join(df2, Seq("id"), "inner")
I added a killDuplicates() method to the open source spark-daria library that uses @Raphael Roth's solution. Here's how to use the code:
import com.github.mrpowers.spark.daria.sql.DataFrameExt._
df.killDuplicates(col("id"))
// you can also supply multiple Column arguments
df.killDuplicates(col("id"), col("another_column"))
Here's the code implementation:
import org.apache.spark.sql.{Column, DataFrame}
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, count}

object DataFrameExt {

  implicit class DataFrameMethods(df: DataFrame) {

    def killDuplicates(cols: Column*): DataFrame = {
      df
        .withColumn(
          "my_super_secret_count",
          count("*").over(Window.partitionBy(cols: _*))
        )
        .where(col("my_super_secret_count") === 1)
        .drop(col("my_super_secret_count"))
    }

  }

}
You might want to leverage the spark-daria library to keep this logic out of your codebase.

Spark creates a extra column when reading a dataframe

I am reading a JSON file into a Spark Dataframe and it creates an extra column at the end.
var df : DataFrame = Seq(
(1.0, "a"),
(0.0, "b"),
(0.0, "c"),
(1.0, "d")
).toDF("col1", "col2")
df.write.mode(SaveMode.Overwrite).format("json").save("/home/neelesh/year=2018/")
val newDF = sqlContext.read.json("/home/neelesh/year=2018/*")
newDF.show
The output of newDF.show is:
+----+----+----+
|col1|col2|year|
+----+----+----+
| 1.0| a|2018|
| 0.0| b|2018|
| 0.0| c|2018|
| 1.0| d|2018|
+----+----+----+
However the JSON file is stored as:
{"col1":1.0,"col2":"a"}
{"col1":0.0,"col2":"b"}
{"col1":0.0,"col2":"c"}
{"col1":1.0,"col2":"d"}
The extra column is not added if year=2018 is removed from the path. What can be the issue here?
I am running Spark 1.6.2 with Scala 2.10.5
Could you try:
val newDF = sqlContext.read.json("/home/neelesh/year=2018")
newDF.show
+----+----+
|col1|col2|
+----+----+
| 1.0| a|
| 0.0| b|
| 0.0| c|
| 1.0| d|
+----+----+
Quoting from the Spark 1.6 documentation:
Starting from Spark 1.6.0, partition discovery only finds partitions
under the given paths by default. For the above example, if users pass
path/to/table/gender=male to either SQLContext.read.parquet or
SQLContext.read.load, gender will not be considered as a partitioning
column
Spark uses the directory structure field=value as partition information; see https://spark.apache.org/docs/2.1.0/sql-programming-guide.html#partition-discovery
So in your case the year=2018 directory is considered a year partition and thus adds an additional column.
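A sketch of both behaviours, assuming the same layout as in the question; per the partition-discovery section of the docs, the basePath option controls which directory Spark treats as the table root:
//reading the leaf directory directly: year is not discovered as a partition
val dfNoPartition = sqlContext.read.json("/home/neelesh/year=2018")
dfNoPartition.printSchema // col1, col2 only

//setting basePath one level up: year=2018 is treated as a partition again,
//even though only the leaf directory is listed in the path
val dfWithPartition = sqlContext.read
  .option("basePath", "/home/neelesh")
  .json("/home/neelesh/year=2018")
dfWithPartition.printSchema // col1, col2, year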

Spark SQL Dataframe API -build filter condition dynamically

I have two Spark dataframe's, df1 and df2:
+-------+-----+---+
| name|empNo|age|
+-------+-----+---+
|shankar|12121| 28|
| ramesh| 1212| 29|
| suresh| 1111| 30|
| aarush| 0707| 15|
+-------+-----+---+
+------+-----+---+-----+
| eName| eNo|age| city|
+------+-----+---+-----+
|aarush|12121| 15|malmo|
|ramesh| 1212| 29|malmo|
+------+-----+---+-----+
I need to get the non-matching records from df1, based on a number of columns which are specified in another file.
For example, the column look up file is something like below:
df1col,df2col
name,eName
empNo, eNo
Expected output is:
+-------+-----+---+
| name|empNo|age|
+-------+-----+---+
|shankar|12121| 28|
| suresh| 1111| 30|
| aarush| 0707| 15|
+-------+-----+---+
The question is how to build the where condition dynamically for the above scenario, because the lookup file is configurable and might have 1 to n fields.
You can use the except dataframe method. I'm assuming that the columns to use are in two lists for simplicity. It's necessary that the order of both lists is correct; the columns at the same position in the lists will be compared (regardless of column name). After except, use join to get the missing columns from the first dataframe.
val df1 = Seq(("shankar","12121",28),("ramesh","1212",29),("suresh","1111",30),("aarush","0707",15))
.toDF("name", "empNo", "age")
val df2 = Seq(("aarush", "12121", 15, "malmo"),("ramesh", "1212", 29, "malmo"))
.toDF("eName", "eNo", "age", "city")
val df1Cols = List("name", "empNo")
val df2Cols = List("eName", "eNo")
val tempDf = df1.select(df1Cols.head, df1Cols.tail: _*)
.except(df2.select(df2Cols.head, df2Cols.tail: _*))
val df = df1.join(broadcast(tempDf), df1Cols)
The resulting dataframe will look as wanted:
+-------+-----+---+
| name|empNo|age|
+-------+-----+---+
| aarush| 0707| 15|
| suresh| 1111| 30|
|shankar|12121| 28|
+-------+-----+---+
If you're doing this from a SQL query, I would remap the column names in the SQL query itself, along the lines of "Changing a SQL column title via query". You could do a simple text replace in the query to normalize them to the df1 or df2 column names.
Once you have that, you can diff using something like "How to obtain the difference between two DataFrames?".
If you need more columns that wouldn't be used in the diff (e.g. age), you can reselect the data again based on your diff results. This may not be the optimal way of doing it, but it would probably work.
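For completeness, here is a sketch of actually building the condition dynamically from the lookup file, as the question asks. colPairs stands in for the parsed lookup file (an assumption), and the left_anti join type assumes Spark 2.0+; with older versions, the except approach above is the way to go:
//colPairs stands in for the parsed df1col,df2col lookup file (assumption)
val colPairs = Seq(("name", "eName"), ("empNo", "eNo"))

//fold the pairs into a single AND-ed equality condition
val joinCond = colPairs
  .map { case (c1, c2) => df1(c1) === df2(c2) }
  .reduce(_ && _)

//left_anti keeps the df1 rows that have no match in df2 (Spark 2.0+)
df1.join(df2, joinCond, "left_anti").show()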