Spark/Scala: create a map from a list in a DataFrame

I have the schema as below:
root
|-- id: string (nullable = true)
|-- info: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- _1: string (nullable = true)
| | |-- _2: long (nullable = false)
| | |-- _3: string (nullable = true)
I want the output to be:
id | info
111|[{aaa:{12,abc}},{xxx:{14,def}}]
222|[{ddd:{23,fgh}},{jjj:{13,ijk}}]
333|[{aaa:{96,wer}]
Please help

It seems that your "info" field contains a list of structs, and you want to use the first field of each struct as the key for the rest of that struct.
Maybe try:
dataset.map(row =>
  row.getAs[Seq[Row]]("info")
    .map(elem => Map(elem.getString(0) -> (elem.getLong(1), elem.getString(2)))))
I'm new to Scala too. And since I don't know your specific schema, the above code might not work as expected. Hope this is helpful.
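If you would rather stay in the DataFrame API, here is a minimal sketch, assuming Spark 2.4+ (for the transform higher-order function) and that df is your original dataframe; it turns each element of info into a single-entry map keyed by _1, with the remaining fields kept together in a struct:

import org.apache.spark.sql.functions.expr

// rebuild each array element as map(_1 -> struct(_2, _3))
val result = df.withColumn(
  "info",
  expr("transform(info, x -> map(x._1, struct(x._2, x._3)))")
)
result.show(false)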

Related

Adding new column for DataFrame with complex column (Array&lt;Map&lt;String,String&gt;&gt;)

I am loading a Dataframe from an external source with the following schema:
|-- A: string (nullable = true)
|-- B: timestamp (nullable = true)
|-- C: long (nullable = true)
|-- METADATA: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- M_1: integer (nullable = true)
| | |-- M_2: string (nullable = true)
| | |-- M_3: string (nullable = true)
| | |-- M_4: string (nullable = true)
| | |-- M_5: double (nullable = true)
| | |-- M_6: string (nullable = true)
| | |-- M_7: double (nullable = true)
| | |-- M_8: boolean (nullable = true)
| | |-- M_9: boolean (nullable = true)
|-- E: string (nullable = true)
Now I need to add a new column, METADATA_PARSED, whose type is an Array of the following case class:
case class META_DATA_COL(M_1: String, M_2: String, M_3: String, M_10: String)
My approach, based on examples I have found, is to create a UDF and pass in the METADATA column. But since it is a complex type, I am having a lot of trouble parsing it.
On top of that, in the UDF I need to do some string manipulation to derive the "new" field M_10, so I need to access each of the elements in the METADATA column.
What would be the best way to approach this issue? I attempted to convert the source dataframe (+METADATA) to a case class, but that did not work, as it was translated back to Spark WrappedArray types upon entering the UDF.
You can use something like this:
import org.apache.spark.sql.functions._

val tempdf = df.select(
  explode(col("METADATA")).as("flat")
)
val processedDf = tempdf.select(col("flat.M_1"), col("flat.M_2"), col("flat.M_3"))
Now write a UDF:
def processudf = udf((col1: Int, col2: String, col3: String) => { /* do the processing */ })
This should help. I can provide more help if you can share more details about the processing.
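If you would rather keep the array intact (no explode) and build METADATA_PARSED directly, here is a minimal sketch, assuming the UDF receives the array-of-struct column as Seq[Row]; the M_10 derivation below is only a placeholder for your real string manipulation:

import org.apache.spark.sql.Row
import org.apache.spark.sql.functions.{col, udf}

case class META_DATA_COL(M_1: String, M_2: String, M_3: String, M_10: String)

// Spark hands an array<struct<...>> column to a Scala UDF as Seq[Row];
// rebuild each element as a META_DATA_COL
val parseMetadata = udf { metadata: Seq[Row] =>
  Option(metadata).getOrElse(Seq.empty).map { m =>
    val m2 = m.getAs[String]("M_2")
    META_DATA_COL(
      Option(m.getAs[Any]("M_1")).map(_.toString).orNull, // M_1 is an integer in the source schema
      m2,
      m.getAs[String]("M_3"),
      if (m2 != null) m2.trim.toUpperCase else null       // placeholder for the real M_10 logic
    )
  }
}

val withParsed = df.withColumn("METADATA_PARSED", parseMetadata(col("METADATA")))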

Selecting fields of structs inside an array inside a dataframe

I have a PySpark dataframe loaded from a 3 GB json.gz file, with the following schema:
root
|-- _id: long (nullable = false)
|-- quote: string (nullable = true)
|-- occurrences: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- articleID: string (nullable = true)
| | |-- title: string (nullable = true)
| | |-- date: string (nullable = true)
| | |-- author: string (nullable = true)
| | |-- source: string (nullable = true)
I need to drop the title, author and date fields, or create a new dataFrame that does not include these fields.
So far I've managed to get the following schema:
root
|-- _id: long (nullable = false)
|-- quote: string (nullable = true)
|-- occurrences: array (nullable = false)
| |-- element: struct (containsNull = false)
| | |-- articleID: array (nullable = true)
| | | |-- element: string (containsNull = true)
| | |-- source: array (nullable = true)
| | | |-- element: string (containsNull = true)
using
df.select(df._id, df.quote,
    array(
        struct(
            col("occurrences.articleID"),
            col("occurrences.source")
        )
    ).alias("occurrences"))
But I need a way to keep articleIDs and sources together in the same struct. How can I do this?
Okay, I found something that works:
from pyspark.sql.functions import col, collect_set, explode, struct

clean_df = (df
    .withColumn("exploded", explode("occurrences"))
    .select(
        "_id",
        "quote",
        col("exploded.articleID").alias("articleID"),
        col("exploded.source").alias("source"))
    .withColumn("occs", struct(col("articleID"), col("source")))
    .groupBy("_id", "quote")
    .agg(collect_set("occs").alias("occurrences")))
But if anyone has a better solution I'd love to hear it, since this seems very round-about. (And as a side note, collect_set only seems to work with Java 8.)
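On Spark 2.4+ there is a tidier route that avoids the explode/groupBy round trip: rebuild each array element with only the wanted fields using the transform higher-order function, which keeps articleID and source paired per element. The sketch below is in Scala to match the rest of this thread, but the same expression string works from PySpark via expr(...):

import org.apache.spark.sql.functions.{col, expr}

// keep only articleID and source inside each occurrence struct
val cleanDf = df.select(
  col("_id"),
  col("quote"),
  expr("transform(occurrences, o -> named_struct('articleID', o.articleID, 'source', o.source))")
    .alias("occurrences")
)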

Spark: pruning nested columns/fields

I have a question about the possibility of pruning nested fields.
I'm developing a data source for a High Energy Physics data format (ROOT).
Below is the schema of a file read with the DataSource that I'm developing.
root
|-- EventAuxiliary: struct (nullable = true)
| |-- processHistoryID_: struct (nullable = true)
| | |-- hash_: string (nullable = true)
| |-- id_: struct (nullable = true)
| | |-- run_: integer (nullable = true)
| | |-- luminosityBlock_: integer (nullable = true)
| | |-- event_: long (nullable = true)
| |-- processGUID_: string (nullable = true)
| |-- time_: struct (nullable = true)
| | |-- timeLow_: integer (nullable = true)
| | |-- timeHigh_: integer (nullable = true)
| |-- luminosityBlock_: integer (nullable = true)
| |-- isRealData_: boolean (nullable = true)
| |-- experimentType_: integer (nullable = true)
| |-- bunchCrossing_: integer (nullable = true)
| |-- orbitNumber_: integer (nullable = true)
| |-- storeNumber_: integer (nullable = true)
The DataSource is here https://github.com/diana-hep/spark-root/blob/master/src/main/scala/org/dianahep/sparkroot/experimental/package.scala#L62
When building a reader using the buildReaderWithPartitionValues method of the FileFormat:
override def buildReaderWithPartitionValues(
    sparkSession: SparkSession,
    dataSchema: StructType,
    partitionSchema: StructType,
    requiredSchema: StructType,
    filters: Seq[Filter],
    options: Map[String, String],
    hadoopConf: Configuration): PartitionedFile => Iterator[InternalRow]
I see that requiredSchema always contains all of the fields/members of the top-level column that is being looked at. That is, when I select a particular nested field with df.select("EventAuxiliary.id_.run_"), requiredSchema is again the full struct for that top-level column ("EventAuxiliary"). I would expect the schema to be something like this:
root
|-- EventAuxiliary: struct...
| |-- id_: struct ...
| | |-- run_: integer
since this is the only schema that has been required by the select statement.
Basically, I want to know how I can prune nested fields at the data source level. I thought requiredSchema would contain only the fields that come from the df.select.
I'm trying to see what Avro/Parquet are doing and found this: https://github.com/apache/spark/pull/14957/files
Any suggestions/comments would be appreciated!
Thanks!
VK
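A hedged aside: for the built-in Parquet source, nested-field pruning of this kind eventually sits behind an optimizer flag, and whether a custom FileFormat ever receives a pruned requiredSchema depends on that rule covering the format. One way to see which read schema the planner actually requests, assuming a Spark version that has the flag and a hypothetical Parquet copy of the data for comparison:

// compare against Parquet, where the nested-pruning rule applies
spark.conf.set("spark.sql.optimizer.nestedSchemaPruning.enabled", "true")

spark.read.parquet("/path/to/events.parquet")   // hypothetical path
  .select("EventAuxiliary.id_.run_")
  .explain(true)                                // check ReadSchema in the physical plan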

Spark/Scala: join dataframes when id is nested in an array of structs

I'm using Spark MLlib's DataFrame ALS functionality on Spark 2.2.0. I had to run my userId and itemId fields through a StringIndexer to get things going.
The method recommendForAllUsers returns the following schema:
root
|-- recommendations: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- itemIdIndex: long (nullable = true)
| | |-- rating: double (nullable = true)
|-- userIdIndex: string (nullable = true)
This is perfect for my needs (I would love not to flatten it), but I need to replace userIdIndex and itemIdIndex with their actual values.
For userIdIndex this was OK (I couldn't simply reverse it with IndexToString, as the ALS fitting seems to erase the link between index and value):
df.join(df2, df2("userIdIndex")===df("userIdIndex"), "left")
.select(df2("userId"), df("recommendations"))
where df2 looks like this:
+------------------+--------------------+----------+-----------+-----------+
| userId| itemId| rating|userIdIndex|itemIdIndex|
+------------------+--------------------+----------+-----------+-----------+
|glorified-consumer| item-22302| 3.0| 15.0| 4.0|
the result is this schema:
root
|-- userId: string (nullable = true)
|-- recommendations: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- itemIdIndex: integer (nullable = true)
| | |-- rating: float (nullable = true)
QUESTION: how do I do the same for itemIdIndex, which is inside an array of structs?
You can explode the array so that only the struct remains:
val tempdf2 = df2.withColumn("recommendations", explode('recommendations))
which should leave you with a schema like:
root
|-- userId: string (nullable = true)
|-- recommendations: struct (nullable = true)
| |-- itemIdIndex: integer (nullable = true)
| |-- rating: float (nullable = true)
Do the same for df (the first dataframe)
After that you can join them as:
tempdf1.join(tempdf2, tempdf1("recommendations.itemIdIndex") === tempdf2("recommendations.itemIdIndex"))
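Alternatively, a minimal sketch of the whole remaining step (hedged; column names are taken from the schemas above): join the exploded recommendation directly to a lookup of itemId/itemIdIndex built from df2, swap in the real itemId, and re-assemble the array:

import org.apache.spark.sql.functions.{col, collect_list, struct}

// distinct itemId <-> itemIdIndex pairs from the indexed data
val itemLookup = df2.select("itemId", "itemIdIndex").distinct()

val resolved = tempdf1
  .join(itemLookup,
        tempdf1("recommendations.itemIdIndex") === itemLookup("itemIdIndex"),
        "left")
  .withColumn("recommendations",
              struct(col("itemId"), col("recommendations.rating").as("rating")))
  .groupBy("userIdIndex")
  .agg(collect_list("recommendations").as("recommendations"))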

How to access elements in a Row RDD in Scala

My row RDD looks like this:
Array[org.apache.spark.sql.Row] = Array([1,[example1,WrappedArray([Standford,Organisation,NNP], [is,O,VP], [good,LOCATION,ADP])]])
I got this from converting a dataframe to an RDD; the dataframe schema was:
root
|-- article_id: long (nullable = true)
|-- sentence: struct (nullable = true)
| |-- sentence: string (nullable = true)
| |-- attributes: array (nullable = true)
| | |-- element: struct (containsNull = true)
| | | |-- tokens: string (nullable = true)
| | | |-- ner: string (nullable = true)
| | | |-- pos: string (nullable = true)
Now how do I access elements in the row RDD? In the dataframe I could use df.select("sentence"). I want to access nested elements like Standford and the other attributes.
As @SarveshKumarSingh wrote in a comment, you can access the rows in an RDD[Row] like you would access any other element in an RDD. Accessing the elements in a row can be done in a couple of ways. Either simply call get like this:
rowRDD.map(row => row.get(2).asInstanceOf[MyType])
or, if it is a built-in type, you can avoid the type cast:
rowRDD.map(row => row.getList(4))
or you might want to simply use pattern matching, like:
rowRDD.map{case Row(field1: Long, field2: MyType) => field2}
I hope this helps :)
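Applied to the schema in the question, a minimal sketch (hedged: the field names come from the printed schema, and it assumes the RDD was produced with df.rdd so the rows still carry their schema):

import org.apache.spark.sql.Row

// pull the token/ner/pos triples out of the nested attributes array
val attributesRDD = rowRDD.map { row =>
  val sentenceStruct = row.getAs[Row]("sentence")
  val attributes     = sentenceStruct.getAs[Seq[Row]]("attributes")
  attributes.map(a => (a.getAs[String]("tokens"), a.getAs[String]("ner"), a.getAs[String]("pos")))
}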