apache spark - Inserting a dataframe as nested struct into other dataframe - scala

I have two dataframes created in Spark.
xml_df:
root
|-- _defaultedgetype: string (nullable = true)
|-- _mode: string (nullable = true)
and nodes_df:
root
|-- nodes: struct (nullable = false)
| |-- _id: string (nullable = true)
| |-- _label: string (nullable = true)
xml_df will always have exactly one row, as follows:
+----------------+------+
|_defaultedgetype| _mode|
+----------------+------+
| undirected|static|
+----------------+------+
and the nodes_df data:
+-----+
|nodes|
+-----+
|[1,1]|
|[2,2]|
|[3,3]|
|[4,4]|
|[5,5]|
+-----+
I used the struct function on nodes_df to put _id and _label inside a struct. Based on that, I would like to add a third column to xml_df that contains the structs created in nodes_df.
I tried a join, creating a literal id for each entry in nodes_df, but the resulting column is null.
Any ideas?

I found out why my join was not working.
I needed to aggregate the nodes column first; after that I was able to join the dataframes properly.
I created an id for xml_df:
StructType(List(StructField("id",IntegerType, true),
StructField("_defaultedgetype",StringType, true),
StructField("_mode",StringType, true)))
and did the same for nodes_df:
val nodes_schema = StructType(List(
StructField("id",IntegerType, true),
StructField("_id",StringType, true),
StructField("_label",StringType, true)
))
I used the id 666 for both of them and aggregated nodes_df:
nodes_df = nodes_df.groupBy("id").agg(collect_set("nodes").as("node"))
and joined with xml_df:
xml_df = xml_df.join(nodes_df, Seq("id"),"right").drop("id")
the result is:
+----------------+------+--------------------+
|_defaultedgetype| _mode| node|
+----------------+------+--------------------+
| undirected|static|[[2,2], [3,3], [5...|
+----------------+------+--------------------+
root
|-- _defaultedgetype: string (nullable = true)
|-- _mode: string (nullable = true)
|-- node: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- _id: string (nullable = true)
| | |-- _label: string (nullable = true)
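If xml_df is guaranteed to hold a single row, the dummy id may not be needed at all. A minimal sketch of that idea, assuming Spark 2.1 or later (for crossJoin) and a local SparkSession. The object NestNodes and the helper nest are made-up names, and collect_list is used instead of collect_set so that duplicate nodes are kept:

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, collect_list, struct}

object NestNodes {
  // Hypothetical helper: collapse every row of nodesDf into one array column
  // and attach it to the (single-row) xmlDf.
  def nest(xmlDf: DataFrame, nodesDf: DataFrame): DataFrame = {
    // agg without groupBy aggregates over the whole dataframe, giving one row
    val nodesAgg = nodesDf.agg(collect_list(col("nodes")).as("node"))
    // one row x one row: the cross join simply glues the columns together
    xmlDf.crossJoin(nodesAgg)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[1]").appName("nest-nodes").getOrCreate()
    import spark.implicits._

    // Sample data shaped like the dataframes in the question
    val xmlDf = Seq(("undirected", "static")).toDF("_defaultedgetype", "_mode")
    val nodesDf = Seq(("1", "1"), ("2", "2"), ("3", "3"))
      .toDF("_id", "_label")
      .select(struct($"_id", $"_label").as("nodes"))

    val result = nest(xmlDf, nodesDf)
    result.printSchema()
    result.show(false)
    spark.stop()
  }
}
```

If xml_df ever holds more than one row, the cross join repeats the node array on each of them, which may or may not be what is wanted.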

Related

spark scala: extracting xml from one column

Assume df has the following structure:
root
|-- id: decimal(38,0) (nullable = true)
|-- text: string (nullable = true)
Here, text contains strings of roughly XML-like records. I'm then able to apply the following steps to extract the necessary entries into a flat table:
First, append the root node, since there is none originally. (Question #1: is this step necessary, or can it be omitted?)
val df2 = df.withColumn("text", concat(lit("<root>"),$"text",lit("</root>")))
Next, parsing the XML:
val payloadSchema = schema_of_xml(df.select("text").as[String])
val df3 = spark.read.option("rootTag","root").option("rowTag","row").schema(payloadSchema).xml(df2.select("text").as[String])
This generates df3:
root
|-- row: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- key: string (nullable = true)
| | |-- value: string (nullable = true)
which I finally explode:
val df4 = df3.withColumn("exploded_cols", explode($"row"))
into
root
|-- row: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- key: string (nullable = true)
| | |-- value: string (nullable = true)
|-- exploded_cols: struct (nullable = true)
| |-- key: string (nullable = true)
| |-- value: string (nullable = true)
My goal is the following table:
val df5 = df4.select("exploded_cols.*")
with
root
|-- key: string (nullable = true)
|-- value: string (nullable = true)
Main question:
I want the final table to also contain the id: decimal(38,0) (nullable = true) entries along with the exploded key, value columns, e.g.,
root
|-- id: decimal(38,0) (nullable = true)
|-- key: string (nullable = true)
|-- value: string (nullable = true)
however, I'm not sure how to call spark.read.option without selecting df2.select("text").as[String] separately into the method (see df3). Is it possible to simplify this script?
This should be straightforward, so I'm not sure a reproducible example is necessary. Also, I'm coming from an R background, so I'm missing the Scala basics, but I'm trying to learn as I go.
Use the from_xml function of the spark-xml library.
val df = // Read source data
val schema = // Define schema of XML text
df.withColumn("xmlData", from_xml($"xmlColName", schema))
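Because from_xml works column by column, the id column stays on the same row as the parsed payload and can be carried through the explode. A rough sketch under several assumptions: the spark-xml package (com.databricks.spark.xml) is on the classpath, every payload is a sequence of row elements holding key and value, and the schema is written out by hand rather than inferred. The names XmlWithId, payloadSchema and flatten are made up, and the snippet has not been run against the asker's data:

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, concat, explode, lit}
import org.apache.spark.sql.types.{ArrayType, StringType, StructField, StructType}
import com.databricks.spark.xml.functions.from_xml

object XmlWithId {
  // Assumed payload shape: <row><key>..</key><value>..</value></row>, repeated.
  val payloadSchema: StructType = StructType(Seq(
    StructField("row", ArrayType(StructType(Seq(
      StructField("key", StringType),
      StructField("value", StringType)
    ))))
  ))

  // Expects columns id and text; returns id, key, value.
  def flatten(df: DataFrame): DataFrame =
    df
      // a well-formed XML document needs a single root element, so wrap the rows
      .withColumn("text", concat(lit("<root>"), col("text"), lit("</root>")))
      // parse in place: id stays on the same row as the parsed struct
      .withColumn("parsed", from_xml(col("text"), payloadSchema))
      // one output row per <row> element, with id repeated
      .select(col("id"), explode(col("parsed.row")).as("kv"))
      .select(col("id"), col("kv.key"), col("kv.value"))
}
```

With this shape there is no need to pass df2.select("text").as[String] to spark.read separately, and XmlWithId.flatten(df) should give the id, key, value table from the main question.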

How to change datatype of a field in a two-level schema tree?

Now I have a dataframe with schema:
root
|-- id: string (nullable = true)
|-- st_one: struct (nullable = true)
| |-- tid: long (nullable = true)
| |-- st_two: struct (nullable = true)
| | |-- name: string (nullable = true)
| | |-- score: long (nullable = true)
|-- ts: double (nullable = true)
|-- date: string (nullable = true)
I want to change score's type from long to double. Is there any good solution?
BTW, I'm using Scala.
I already know how to do it by listing all the fields. I want a more general method that would still work if st_two contained a thousand fields or more.
You can update the struct type column st_one like this:
val df1 = df.withColumn(
"st_one",
struct(
$"st_one.tid",
struct(
$"st_one.st_two.name",
$"st_one.st_two.score".cast("double").as("score")
).as("st_two")
)
)
You can do a complex cast:
val df2 = df.withColumn("st_one", $"st_one".cast("struct<tid:long, st_two:struct<name:string, score:double>>"))
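Both snippets above still spell out every field. One way to avoid that, sketched here as an assumption rather than an established recipe, is to derive the target type from the existing schema, change only the one field, and cast the whole struct to the result. RetypeNested, retype and castNested are hypothetical names:

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.{DataType, StructType}

object RetypeNested {
  // Return a copy of dt in which the field reached by `path` has type
  // `newType`; every other field is copied unchanged.
  def retype(dt: DataType, path: List[String], newType: DataType): DataType =
    (dt, path) match {
      case (_, Nil) => newType
      case (st: StructType, head :: tail) =>
        StructType(st.fields.map { f =>
          if (f.name == head) f.copy(dataType = retype(f.dataType, tail, newType)) else f
        })
      case _ => dt // the path does not lead through a struct: leave it as is
    }

  // Cast a top-level struct column so that one nested field changes type.
  def castNested(df: DataFrame, column: String, path: List[String], newType: DataType): DataFrame = {
    val target = retype(df.schema(column).dataType, path, newType)
    df.withColumn(column, col(column).cast(target))
  }
}
```

For the schema in the question the call would be RetypeNested.castNested(df, "st_one", List("st_two", "score"), DoubleType). The helper only walks through structs; arrays or maps along the path are left untouched.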

Change column names of nested data in bigquery using spark

I'm trying to write some data into BigQuery using Spark with Scala. My Spark dataframe looks like this:
root
|-- id: string (nullable = true)
|-- cost: double (nullable = false)
|-- nodes: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- settled: string (nullable = true)
| | |-- constant: string (nullable = true)
|-- status: string (nullable = true)
I tried to change the struct of the data frame.
val schema = StructType(Array(
StructField("id", StringType, true),
StructField("cost", DoubleType, true),
StructField("nodes", StructType(Array(StructField("settled", StringType), StructField("constant", StringType)))),
StructField("status", StringType, true)))
val actualDf = spark.createDataFrame(results, schema)
But it didn't work. When this is written to BigQuery, the column names look as follows:
id, cost, nodes.list.element.settled, nodes.list.element.constant, status
Is there a way to change these column names to:
id, cost, settled, constant, status
You can explode the nodes array to get a flattened column structure, then write the dataframe to BigQuery.
Example:
val jsn_ds=Seq("""{"id":1, "cost": "2.0","nodes":[{"settled":"u","constant":"p"}],"status":"s"}""").toDS
spark.read.json(jsn_ds).printSchema
// root
// |-- cost: string (nullable = true)
// |-- id: long (nullable = true)
// |-- nodes: array (nullable = true)
// | |-- element: struct (containsNull = true)
// | | |-- constant: string (nullable = true)
// | | |-- settled: string (nullable = true)
// |-- status: string (nullable = true)
spark.read.json(jsn_ds).
withColumn("expld",explode('nodes)).
select("*","expld.*").
drop("expld","nodes").
show()
//+----+---+------+--------+-------+
//|cost| id|status|constant|settled|
//+----+---+------+--------+-------+
//| 2.0| 1| s| p| u|
//+----+---+------+--------+-------+

How can I perform ETL on a Spark Row and return it to a dataframe?

I'm currently using Spark with Scala for some ETL and have a base dataframe with the following schema:
|-- round: string (nullable = true)
|-- Id : string (nullable = true)
|-- questions: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- tag: string (nullable = true)
| | |-- bonusQuestions: array (nullable = true)
| | | |-- element: string (containsNull = true)
| | |-- difficulty : string (nullable = true)
| | |-- answerOptions: array (nullable = true)
| | | |-- element: string (containsNull = true)
| | |-- followUpAnswers: array (nullable = true)
| | | |-- element: string (containsNull = true)
|-- school: string (nullable = true)
I only need to perform ETL on rows where the round type is primary (there are two types, primary and secondary). However, I need both types of rows in my final table.
I'm stuck on the ETL, which should follow this rule:
If tag is non-bonus, the bonusQuestions should be set to null and difficulty should be null.
I'm currently able to access most fields of the DF like
val round = tr.getAs[String]("round")
Next, I'm able to get the questions array using
val questionsArray = tr.getAs[Seq[StructType]]("questions")
and can iterate using for (question <- questionsArray) {...}. However, I cannot access struct fields like question.bonusQuestions or question.tag, which returns an error:
error: value tag is not a member of org.apache.spark.sql.types.StructType
StructType only describes a schema. At runtime Spark represents a struct value as a Row (concretely a GenericRowWithSchema). So instead of Seq[StructType] you have to use Seq[Row]:
val questionsArray = tr.getAs[Seq[Row]]("questions")
and in the loop for (question <- questionsArray) {...} you can get the data of Row as
for (question <- questionsArray) {
val tag = question.getAs[String]("tag")
val bonusQuestions = question.getAs[Seq[String]]("bonusQuestions")
val difficulty = question.getAs[String]("difficulty")
val answerOptions = question.getAs[Seq[String]]("answerOptions")
val followUpAnswers = question.getAs[Seq[String]]("followUpAnswers")
}
I hope the answer is helpful
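To get the modified rows back into a dataframe, one option is to go through a typed Dataset instead of raw Row objects. This is only a sketch: the case classes Question and Round and the object QuestionEtl are made up to mirror the schema in the question, the column names are assumed to be exactly round, Id, questions and school, and the literals "primary" and "non-bonus" are assumed to be the actual values in the data:

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

// Hypothetical case classes mirroring the schema in the question.
// Option fields become null in the dataframe when set to None.
case class Question(
  tag: String,
  bonusQuestions: Option[Seq[String]],
  difficulty: Option[String],
  answerOptions: Seq[String],
  followUpAnswers: Seq[String])

case class Round(round: String, Id: String, questions: Seq[Question], school: String)

object QuestionEtl {
  // Rows that are not primary (or have no questions) pass through unchanged.
  def clean(r: Round): Round =
    if (r.round != "primary" || r.questions == null) r
    else r.copy(questions = r.questions.map { q =>
      if (q.tag == "non-bonus") q.copy(bonusQuestions = None, difficulty = None) else q
    })

  // Typed round trip: DataFrame -> Dataset[Round] -> DataFrame
  def run(spark: SparkSession, df: DataFrame): DataFrame = {
    import spark.implicits._
    df.as[Round].map(clean).toDF()
  }
}
```

Since secondary rows are returned as they are, both round types end up in the result without a separate filter and union.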

Spark/Scala: join dataframes when id is nested in an array of structs

I'm using Spark's MLlib DataFrame ALS functionality on Spark 2.2.0. I had to run my userId and itemId fields through a StringIndexer to get things going.
The method 'recommendForAllUsers' returns the following schema
root
|-- recommendations: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- itemIdIndex: long (nullable = true)
| | |-- rating: double (nullable = true)
|-- userIdIndex: string (nullable = true)
This is perfect for my needs (I would love not to flatten it), but I need to replace userIdIndex and itemIdIndex with their actual values.
Replacing userIdIndex was straightforward (I couldn't simply reverse it with IndexToString, as the ALS fitting seems to erase the link between index and value):
df.join(df2, df2("userIdIndex")===df("userIdIndex"), "left")
.select(df2("userId"), df("recommendations"))
where df2 looks like this:
+------------------+--------------------+----------+-----------+-----------+
| userId| itemId| rating|userIdIndex|itemIdIndex|
+------------------+--------------------+----------+-----------+-----------+
|glorified-consumer| item-22302| 3.0| 15.0| 4.0|
the result is this schema:
root
|-- userId: string (nullable = true)
|-- recommendations: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- itemIdIndex: integer (nullable = true)
| | |-- rating: float (nullable = true)
QUESTION: how do I do the same for itemIdIndex, which sits inside an array of structs?
You can explode the array so that only the struct remains:
val tempdf2 = df2.withColumn("recommendations", explode('recommendations))
which should leave you with schema as
root
|-- userId: string (nullable = true)
|-- recommendations: struct (nullable = true)
| |-- itemIdIndex: string (nullable = true)
| |-- rating: string (nullable = true)
Do the same for df (the first dataframe)
Then after that you can join them as
tempdf1.join(tempdf2, tempdf1("recommendations.itemIdIndex") === tempdf2("recommendations.itemIdIndex"))
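If the nested shape should survive, the exploded rows can be grouped back into an array after the join. A sketch of that round trip, assuming a lookup dataframe with one row per item and the columns itemId and itemIdIndex (for example df2.select("itemId", "itemIdIndex").distinct()). RecommendationIds and withItemIds are hypothetical names:

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, collect_list, explode, struct}

object RecommendationIds {
  // recs:   userIdIndex, recommendations: array<struct<itemIdIndex, rating>>
  // lookup: itemId, itemIdIndex (assumed to hold one row per item)
  def withItemIds(recs: DataFrame, lookup: DataFrame): DataFrame = {
    // rename the key so it cannot clash with the nested field of the same name
    val items = lookup.select(col("itemId"), col("itemIdIndex").as("lookupIndex"))
    recs
      // one row per (user, recommendation)
      .select(col("userIdIndex"), explode(col("recommendations")).as("rec"))
      // swap the index for the real item id
      .join(items, col("rec.itemIdIndex") === col("lookupIndex"), "left")
      // rebuild the array of structs per user
      .groupBy("userIdIndex")
      .agg(collect_list(struct(col("itemId"), col("rec.rating").as("rating"))).as("recommendations"))
  }
}
```

The order of elements produced by collect_list after a join is not guaranteed, so if the ranking matters the array would need to be sorted by rating afterwards. The userIdIndex column can then be replaced with the join already shown in the question.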