How do I use a from_json() dataframe in Spark? - scala

I'm trying to create a dataset from a json-string within a dataframe in Databricks 3.5 (Spark 2.2.1). In the code block below, 'jsonSchema' is a StructType with the correct layout for the json-string, which is in the 'body' column of the dataframe.
val newDF = oldDF.select(from_json($"body".cast("string"), jsonSchema))
This returns a dataframe where the root object is
jsontostructs(CAST(body AS STRING)):struct
followed by the fields in the schema (looks correct). When I try another select on the newDF
val transform = newDF.select($"propertyNameInTheParsedJsonObject")
it throws the exception
org.apache.spark.sql.AnalysisException: cannot resolve '`columnName`' given
input columns: [jsontostructs(CAST(body AS STRING))];;
I'm apparently missing something. I hoped from_json would return a dataframe I could manipulate further.
My ultimate objective is to cast the json-string within the oldDF body-column to a dataset.

from_json returns a struct (or array<struct<...>>) column, i.e. a nested object. If you've provided a meaningful name:
val newDF = oldDF.select(from_json($"body".cast("string"), jsonSchema) as "parsed")
and the schema describes a plain struct, you can use standard methods like
newDF.select($"parsed.propertyNameInTheParsedJsonObject")
otherwise please follow the instructions for accessing arrays.
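Putting it together, a minimal sketch (assuming a SparkSession named spark is in scope, and Payload is a hypothetical case class mirroring jsonSchema):
import org.apache.spark.sql.functions.from_json
import spark.implicits._

// Hypothetical case class mirroring jsonSchema
case class Payload(propertyNameInTheParsedJsonObject: String)

val newDF = oldDF.select(from_json($"body".cast("string"), jsonSchema) as "parsed")

// Nested fields are reached through the alias
newDF.select($"parsed.propertyNameInTheParsedJsonObject").show()

// Flatten the struct and convert to a Dataset, matching the original objective
val ds = newDF.select($"parsed.*").as[Payload]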

Related

Create an empty DF using schema from another DF (Scala Spark)

I have to compare a DF with another one that has the same schema, read from a specific path, but maybe there are no files in that path, so I've thought that I have to compare it with an empty DF with the same columns as the original.
So I am trying to create a DF with the schema from another DF that contains a lot of columns, but I can't find a solution for this. I have been reading the following posts, but none of them helps me:
How to create an empty DataFrame with a specified schema?
How to create an empty DataFrame? Why "ValueError: RDD is empty"?
How to create an empty dataFrame in Spark
How can I do it in Scala? Or is it better to take another approach?
originalDF.limit(0) will return an empty dataframe with the same schema.
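If you prefer an explicit construction instead, a short sketch (assuming a SparkSession named spark) is:
import org.apache.spark.sql.Row

// Empty dataframe that reuses originalDF's schema
val emptyDF = spark.createDataFrame(spark.sparkContext.emptyRDD[Row], originalDF.schema)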

How to parse a csv string into a Spark dataframe using scala?

I would like to convert an RDD containing records of strings, like below, to a Spark dataframe.
"Mike,2222-003330,NY,34"
"Kate,3333-544444,LA,32"
"Abby,4444-234324,MA,56"
....
The schema line is not inside the same RDD, but in another variable:
val header = "name,account,state,age"
So now my question is, how do I use the above two, to create a dataframe in Spark? I am using Spark version 2.2.
I did search and saw a post:
Can I read a CSV represented as a string into Apache Spark using spark-csv
However, it's not exactly what I need, and I can't figure out a way to modify this piece of code to work in my case.
Your help is greatly appreciated.
The easier way would probably be to start from the CSV file and read it directly as a dataframe (by specifying the schema). You can see an example here: Provide schema while reading csv file as a dataframe.
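A rough sketch of that approach (the file path and the column types are assumptions based on the sample rows, with a SparkSession named spark):
import org.apache.spark.sql.types._

val schema = StructType(Seq(
  StructField("name", StringType),
  StructField("account", StringType),
  StructField("state", StringType),
  StructField("age", IntegerType)
))

// Hypothetical file location
val df = spark.read.schema(schema).csv("/path/to/people.csv")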
When the data already exists in an RDD you can use toDF() to convert to a dataframe. This function also accepts column names as input. To use this functionality, first import the spark implicits using the SparkSession object:
val spark: SparkSession = SparkSession.builder.getOrCreate()
import spark.implicits._
Since the RDD contains strings, it first needs to be converted to tuples representing the columns in the dataframe. In this case, this will be an RDD[(String, String, String, Int)] since there are four columns (the last column, age, is changed to Int to illustrate how it can be done).
Assuming the input data are in rdd:
val header = "name,account,state,age"
val df = rdd.map(row => row.split(","))
  .map { case Array(name, account, state, age) => (name, account, state, age.toInt) }
  .toDF(header.split(","): _*)
Resulting dataframe:
+----+-----------+-----+---+
|name| account|state|age|
+----+-----------+-----+---+
|Mike|2222-003330| NY| 34|
|Kate|3333-544444| LA| 32|
|Abby|4444-234324| MA| 56|
+----+-----------+-----+---+

How to read Hive table with column with JSON strings?

I have a Hive table column (Json_String String) with some 1000 rows, where each row is a JSON of the same structure. I am trying to read the JSON into a Dataframe as below:
val df = sqlContext.read.json("select Json_String from json_table")
but it is throwing up the below exception
java.io.IOException: No input paths specified in job
Is there any way to read all the rows into a dataframe, as we do with JSON files using a wildcard?
val df = sqlContext.read.json("file:///home/*.json")
I think what you're asking for is to read the Hive table as usual and transform the JSON column using from_json function.
from_json(e: Column, schema: StructType): Column Parses a column containing a JSON string into a StructType with the specified schema. Returns null, in the case of an unparseable string.
Given that you use sqlContext in your code, I'm afraid you're on Spark < 2.1.0, which does not offer from_json (it was added in 2.1.0).
The solution then is to use a custom user-defined function (UDF) to do the parsing yourself.
val df = sqlContext.read.json("select Json_String from json_table")
The above won't work since the json operator expects a path or paths to JSON files on disk (not the result of executing a query against a Hive table).
json(paths: String*): DataFrame Loads a JSON file (JSON Lines text format or newline-delimited JSON) and returns the result as a DataFrame.
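As a rough sketch of the UDF approach (the "name" field here is purely hypothetical, and json4s, which ships with Spark, is used for the parsing):
import org.apache.spark.sql.functions.udf
import org.json4s._
import org.json4s.jackson.JsonMethods.parse
import sqlContext.implicits._

// Read the Hive table itself rather than a path
val jsonDF = sqlContext.sql("select Json_String from json_table")

// Extract a hypothetical "name" field from each JSON string
val extractName = udf { json: String =>
  implicit val formats: Formats = DefaultFormats
  (parse(json) \ "name").extractOpt[String]
}

val parsed = jsonDF.withColumn("name", extractName($"Json_String"))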

Column having list datatype : Spark HiveContext

The following code does aggregation and creates a column with a list datatype:
groupBy(
    "column_name_1"
).agg(
    expr("collect_list(column_name_2) AS column_name_3")
)
So it seems it is possible to have 'list' as column datatype in a dataframe.
I was wondering if I can write a udf that returns custom datatype, for example a python dict?
The list is a representation of Spark's ArrayType. You can try using the Map datatype (pyspark.sql.types.MapType).
An example of something that creates one is pyspark.sql.functions.create_map, which builds a map from several columns.
That said, if you want to create a custom aggregation function to do anything not already available in pyspark.sql.functions, you will need to use Scala.
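On the Scala side, a small sketch of building a MapType column with the built-in map function (assuming Spark 2.0+, a SparkSession named spark, and hypothetical key/value columns k and v in df):
import org.apache.spark.sql.functions.map
import spark.implicits._

// map(...) pairs up key and value columns into a single MapType column
val withMap = df.withColumn("kv", map($"k", $"v"))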

reading from a spark.structType in scala

I am running the following scala code:
val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
val df=hiveContext.sql("SELECT * FROM hl7.all_index")
val rows=df.rdd
val firstStruct=rows.first.get(4)
//I know the column with index 4 IS a StructType
val fs=firstStruct.asInstanceOf[StructType]
//now it fails
//what I'm trying to achieve is
log.println(fs.apply("name"))
I know that firstStruct is of StructType and that one of the StructFields is named "name", but it seems to fail when trying to cast.
I've been told that Spark/Hive structs differ from Scala ones, but in order to use StructType I needed to
import org.apache.spark.sql.types._
so I assume they actually should be the same type.
I looked here: https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/types/StructType.scala
in order to see how to get to the structField.
Thanks!
Schema types are logical types. They don't map one-to-one to the type of objects in the column with that schema type.
For example, Hive/SQL use BIGINT for 64 bit integers while SparkSQL uses LongType. The actual type of the data in Scala is Long. This is the issue you are having.
A struct in Hive (StructType in SparkSQL) is represented by a Row in a dataframe. So, what you want to do is one of the following:
row.getStruct(4)
or
import org.apache.spark.sql.Row
row.getAs[Row](4)
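For example, to reach the "name" field from the question (a sketch building on the rows RDD above):
import org.apache.spark.sql.Row

// Column 4 is a struct, so it comes back as a Row with its own schema
val firstStruct = rows.first.getAs[Row](4)
val name = firstStruct.getAs[String]("name")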