How to access elements in a Row RDD in Scala

My row RDD looks like this:
Array[org.apache.spark.sql.Row] = Array([1,[example1,WrappedArray([Standford,Organisation,NNP], [is,O,VP], [good,LOCATION,ADP])]])
I got this by converting a DataFrame to an RDD. The DataFrame schema was:
root
|-- article_id: long (nullable = true)
|-- sentence: struct (nullable = true)
| |-- sentence: string (nullable = true)
| |-- attributes: array (nullable = true)
| | |-- element: struct (containsNull = true)
| | | |-- tokens: string (nullable = true)
| | | |-- ner: string (nullable = true)
| | | |-- pos: string (nullable = true)
How do I access elements in the row RDD? With the DataFrame I can use df.select("sentence"). I would like to access nested elements such as "Standford" and the other attribute values.

As @SarveshKumarSingh wrote in a comment, you can access the rows in an RDD[Row] like you would access any other element in an RDD. Accessing the elements in a row can be done in a couple of ways. Either simply call get, like this:
rowRDD.map(row => row.get(2).asInstanceOf[MyType])
or, if it is a built-in type, you can avoid the type cast:
rowRDD.map(row => row.getList(4))
or you might want to simply use pattern matching, like:
rowRDD.map{case Row(field1: Long, field2: MyType) => field2}
I hope this helps :)
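
To make this concrete for the schema in the question, here is a minimal sketch. The row is built by hand to mirror the sample data, so it carries no schema and fields are read by position; rows that come from df.rdd also accept getAs[T]("fieldName"). The value names (articleId, tokens and so on) are only illustrative.

```scala
import org.apache.spark.sql.Row

// A hand-built row shaped like the question's schema:
// (article_id, sentence = (sentence, attributes = [(tokens, ner, pos)]))
val row = Row(1L, Row("example1", Seq(
  Row("Standford", "Organisation", "NNP"),
  Row("is", "O", "VP"),
  Row("good", "LOCATION", "ADP"))))

val articleId = row.getLong(0)
val sentence  = row.getStruct(1)           // a nested struct comes back as a Row
val text      = sentence.getString(0)
val attrs     = sentence.getSeq[Row](1)    // array<struct> comes back as Seq[Row]
val tokens    = attrs.map(_.getString(0))  // first field of every attribute struct
```

On the RDD itself the same chain can go inside map, for example rowRDD.map(r => r.getStruct(1).getSeq[Row](1).map(_.getString(0))).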

Related

Adding new column for DataFrame with complex column (Array<Map<String,String>>)

I am loading a Dataframe from an external source with the following schema:
|-- A: string (nullable = true)
|-- B: timestamp (nullable = true)
|-- C: long (nullable = true)
|-- METADATA: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- M_1: integer (nullable = true)
| | |-- M_2: string (nullable = true)
| | |-- M_3: string (nullable = true)
| | |-- M_4: string (nullable = true)
| | |-- M_5: double (nullable = true)
| | |-- M_6: string (nullable = true)
| | |-- M_7: double (nullable = true)
| | |-- M_8: boolean (nullable = true)
| | |-- M_9: boolean (nullable = true)
|-- E: string (nullable = true)
Now, I need to add a new column, METADATA_PARSED, with column type Array and the following case class:
case class META_DATA_COL(M_1: String, M_2: String, M_3: String, M_10: String)
My approach here, based on examples is to create a UDF and pass in the METADATA column. But since it is of a complex type I am having a lot of trouble parsing it.
On top of that in the UDF, for the "new" variable M_10, I need to do some string manipulation on the method as well. So I need to access each of the elements in the metadata column.
What would be the best way to approach this issue? I attempted to convert the source dataframe (+METADATA) to a case class; but that did not work as it was translated back to spark WrappedArray types upon entering the UDF.
You can use something like this:
import org.apache.spark.sql.functions._
val tempdf = df.select(
explode( col("METADATA")).as("flat")
)
val processedDf = tempdf.select( col("flat.M_1"),col("flat.M_2"),col("flat.M_3"))
Now write a UDF:
def processudf = udf((col1:Int,col2:String,col3:String) => /* do the processing*/)
This should help. I can provide more help if you share more details on the processing.
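
If the array should stay nested rather than be exploded, one possible sketch is a UDF that takes the whole METADATA column. An array<struct> column reaches a Scala UDF as a sequence of Row objects, which is why a case-class parameter does not work there. Everything below is an assumption for illustration: the tiny input DataFrame, the reduced field set (M_1 to M_3 only), and the upper-casing used as a stand-in for the real M_10 logic.

```scala
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.functions.{col, udf}

val spark = SparkSession.builder().master("local[1]").appName("metadata-udf").getOrCreate()
import spark.implicits._

// Stand-in for the real source: only M_1..M_3 are kept to stay short.
val df = Seq(("a", Seq((1, "x", "y"), (2, "p", "q"))))
  .toDF("A", "METADATA")
  .withColumn("METADATA", col("METADATA").cast("array<struct<M_1:int,M_2:string,M_3:string>>"))

// Each struct arrives as a Row with its schema, so fields can be read by name.
// Note: getAs[Int] would fail on a null M_1; real data may need a null check.
val parseMetadata = udf { (metadata: scala.collection.Seq[Row]) =>
  metadata.map { m =>
    val m1 = m.getAs[Int]("M_1").toString
    val m2 = m.getAs[String]("M_2")
    val m3 = m.getAs[String]("M_3")
    val m10 = s"$m2-$m3".toUpperCase // placeholder for the real string manipulation
    (m1, m2, m3, m10)
  }.toSeq
}

// The UDF returns tuples (_1.._4); the cast gives the struct fields their names.
val parsed = df.withColumn(
  "METADATA_PARSED",
  parseMetadata(col("METADATA")).cast("array<struct<M_1:string,M_2:string,M_3:string,M_10:string>>"))
```

Returning tuples and casting afterwards avoids defining a case class for the result, at the cost of repeating the field names in the cast string.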

How can I perform ETL on a Spark Row and return it to a dataframe?

I'm currently using Scala Spark for some ETL and have a base dataframe that has the following schema:
|-- round: string (nullable = true)
|-- Id : string (nullable = true)
|-- questions: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- tag: string (nullable = true)
| | |-- bonusQuestions: array (nullable = true)
| | | |-- element: string (containsNull = true)
| | |-- difficulty : string (nullable = true)
| | |-- answerOptions: array (nullable = true)
| | | |-- element: string (containsNull = true)
| | |-- followUpAnswers: array (nullable = true)
| | | |-- element: string (containsNull = true)
|-- school: string (nullable = true)
I only need to perform ETL on rows where the round type is primary (there are two types, primary and secondary). However, I need both types of rows in my final table.
I'm stuck on the ETL itself, which should follow this rule:
If tag is non-bonus, the bonusQuestions should be set to null and difficulty should be null.
I'm currently able to access most fields of the DF like
val round = tr.getAs[String]("round")
Next, I'm able to get the questions array using
val questionsArray = tr.getAs[Seq[StructType]]("questions")
and can iterate using for (question <- questionsArray) {...}. However, I cannot access struct fields like question.bonusQuestions or question.tag, which returns an error:
error: value tag is not a member of org.apache.spark.sql.types.StructType
Spark represents struct values as GenericRowWithSchema, or more generally as Row (StructType only describes the schema). So instead of Seq[StructType] you have to use Seq[Row]:
val questionsArray = tr.getAs[Seq[Row]]("questions")
Inside the loop for (question <- questionsArray) {...} you can then read the fields of each Row:
for (question <- questionsArray) {
val tag = question.getAs[String]("tag")
val bonusQuestions = question.getAs[Seq[String]]("bonusQuestions")
val difficulty = question.getAs[String]("difficulty")
val answerOptions = question.getAs[Seq[String]]("answerOptions")
val followUpAnswers = question.getAs[Seq[String]]("followUpAnswers")
}
I hope the answer is helpful
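
Building on the Seq[Row] access, a possible sketch of the full ETL step: a UDF rebuilds the questions array, and when/otherwise applies it only to primary rows so that secondary rows pass through unchanged. The sample data, the reduced struct (three fields only) and the literal tag value "non-bonus" are assumptions made for the example.

```scala
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.functions.{col, udf, when}

val spark = SparkSession.builder().master("local[1]").appName("questions-etl").getOrCreate()
import spark.implicits._

// Reduced stand-in for the real data: each question keeps only three fields.
val questionsType = "array<struct<tag:string,bonusQuestions:array<string>,difficulty:string>>"
val df = Seq(
  ("primary", Seq(("non-bonus", Seq("b1"), "hard"), ("bonus", Seq("b2"), "easy"))),
  ("secondary", Seq(("non-bonus", Seq("b3"), "hard")))
).toDF("round", "questions")
  .withColumn("questions", col("questions").cast(questionsType))

// Rebuild every question; None ends up as null in the resulting struct.
val cleanQuestions = udf { (questions: scala.collection.Seq[Row]) =>
  questions.map { q =>
    val tag = q.getAs[String]("tag")
    val keep = tag != "non-bonus" // assumed literal tag value
    val bonus = q.getAs[scala.collection.Seq[String]]("bonusQuestions")
    val difficulty = q.getAs[String]("difficulty")
    (tag,
      if (keep) Option(bonus).map(_.toSeq) else None,
      if (keep) Option(difficulty) else None)
  }.toSeq
}

// Only primary rows go through the UDF; the cast restores the field names
// so both branches of when/otherwise have the same type.
val result = df.withColumn(
  "questions",
  when(col("round") === "primary", cleanQuestions(col("questions")).cast(questionsType))
    .otherwise(col("questions")))
```

With the full schema, the tuple would need one slot per struct field (answerOptions, followUpAnswers and so on), in the same order as in the cast string.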

spark scala create map from list in dataframe

I have the schema as below:
root
|-- id: string (nullable = true)
|-- info: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- _1: string (nullable = true)
| | |-- _2: long (nullable = false)
| | |-- _3: string (nullable = true)
I want the output to be:
id | info
111|[{aaa:{12,abc}},{xxx:{14,def}}]
222|[{ddd:{23,fgh}},{jjj:{13,ijk}}]
333|[{aaa:{96,wer}}]
Please help
It seems that your "info" field contains a list, and you want to turn the first element of each list to be the key of that list.
Maybe try:
dataset.rdd.map(row => row.getAs[Seq[Row]]("info")
// each struct is a Row: _1 becomes the key, (_2, _3) the value
.map(r => Map(r.getString(0) -> (r.getLong(1), r.getString(2)))))
I'm new to Scala too. And since I don't know your specific schema, the above code might not work as expected. Hope this is helpful.
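
An alternative sketch that stays in the DataFrame API, assuming Spark 2.4 or later (where the SQL higher-order function transform is available): each struct in info is turned into a one-entry map from _1 to a struct of (_2, _3). The sample rows are taken from the desired output in the question; the column is overwritten in place.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.expr

val spark = SparkSession.builder().master("local[1]").appName("info-to-map").getOrCreate()
import spark.implicits._

// Tuples become structs with fields _1, _2, _3, matching the schema in the question.
val df = Seq(
  ("111", Seq(("aaa", 12L, "abc"), ("xxx", 14L, "def"))),
  ("333", Seq(("aaa", 96L, "wer")))
).toDF("id", "info")

// For every struct x in the array, build a one-entry map: _1 -> struct(_2, _3).
val result = df.withColumn("info", expr("transform(info, x -> map(x._1, struct(x._2, x._3)))"))
```

The resulting column has type array<map<string,struct<_2:bigint,_3:string>>>. A null _1 would fail at runtime, since map keys cannot be null.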

Spark/Scala: join dataframes when id is nested in an array of structs

I'm using Spark's MLlib DataFrame ALS functionality on Spark 2.2.0. I had to run my userId and itemId fields through a StringIndexer to get things going.
The method 'recommendForAllUsers' returns the following schema:
root
|-- recommendations: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- itemIdIndex: long (nullable = true)
| | |-- rating: double (nullable = true)
|-- userIdIndex: string (nullable = true)
This is perfect for my needs (I would love not to flatten it), but I need to replace userIdIndex and itemIdIndex with their actual values.
For userIdIndex a join was enough (I couldn't simply reverse it with IndexToString, as the ALS fitting seems to erase the link between index and value):
df.join(df2, df2("userIdIndex")===df("userIdIndex"), "left")
.select(df2("userId"), df("recommendations"))
where df2 looks like this:
+------------------+--------------------+----------+-----------+-----------+
| userId| itemId| rating|userIdIndex|itemIdIndex|
+------------------+--------------------+----------+-----------+-----------+
|glorified-consumer| item-22302| 3.0| 15.0| 4.0|
the result is this schema:
root
|-- userId: string (nullable = true)
|-- recommendations: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- itemIdIndex: integer (nullable = true)
| | |-- rating: float (nullable = true)
QUESTION: how can I do the same for itemIdIndex, which sits inside an array of structs?
You can explode the array so that only the struct remains:
val tempdf2 = df2.withColumn("recommendations", explode('recommendations))
which should leave you with this schema:
root
|-- userId: string (nullable = true)
|-- recommendations: struct (nullable = true)
| |-- itemIdIndex: string (nullable = true)
| |-- rating: string (nullable = true)
Do the same for df (the first dataframe)
After that you can join them:
tempdf1.join(tempdf2, tempdf1("recommendations.itemIdIndex") === tempdf2("recommendations.itemIdIndex"))
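
Since the question asks to keep the nested shape, here is a sketch of the full round trip: explode, join on the nested field, then collapse back into one array per user with collect_list. The sample data and the items lookup table (including the id "item-1") are made up for illustration; in the question, the lookup would come from the distinct (itemIdIndex, itemId) pairs of df2.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, collect_list, explode, struct}

val spark = SparkSession.builder().master("local[1]").appName("nested-join").getOrCreate()
import spark.implicits._

// Stand-in for the ALS output; the cast only names the struct fields.
val recs = Seq(("15", Seq((4, 3.0f), (7, 2.5f))))
  .toDF("userIdIndex", "recommendations")
  .withColumn("recommendations", col("recommendations").cast("array<struct<itemIdIndex:int,rating:float>>"))

// Hypothetical lookup from index back to the original item id.
val items = Seq((4, "item-22302"), (7, "item-1")).toDF("itemIdIndex", "itemId")

// 1. One row per recommendation.
val exploded = recs.select(col("userIdIndex"), explode(col("recommendations")).as("rec"))
// 2. Join on the field nested inside the struct.
val joined = exploded.join(items, col("rec.itemIdIndex") === items("itemIdIndex"), "left")
// 3. Collapse back to one array per user (collect_list does not guarantee order).
val regrouped = joined
  .groupBy("userIdIndex")
  .agg(collect_list(struct(col("itemId"), col("rec.rating").as("rating"))).as("recommendations"))
```

If the order of recommendations matters, it has to be restored explicitly after collect_list, for example by sorting on rating.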

How to rename elements of an array of structs in Spark DataFrame API

I have a UDF which returns an array of tuples:
val df = spark.range(1).toDF("i")
val myUDF = udf((l:Long) => {
Seq((1,2))
})
df.withColumn("udf_result",myUDF($"i"))
.printSchema
gives
root
|-- i: long (nullable = false)
|-- udf_result: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- _1: integer (nullable = false)
| | |-- _2: integer (nullable = false)
I want to rename the elements of the struct to something meaningful instead of _1 and _2. How can this be achieved? Note that I'm aware that returning a Seq of case classes would let me give proper field names, but using Spark-Notebook (REPL) with Yarn we have many issues with case classes, so I'm looking for a solution without them.
I'm using Spark 2 but with untyped DataFrames, the solution should also be applicable for Spark 1.6
It is possible to cast the output of the udf. E.g. to rename the structfields to x and y, you can do:
type-safe:
import org.apache.spark.sql.types._
val schema = ArrayType(
StructType(
Array(
StructField("x",IntegerType),
StructField("y",IntegerType)
)
)
)
df.withColumn("udf_result",myUDF($"i").cast(schema))
or unsafe but shorter, using a string argument to cast:
df.withColumn("udf_result",myUDF($"i").cast("array<struct<x:int,y:int>>"))
Both give the schema:
root
|-- i: long (nullable = false)
|-- udf_result: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- x: integer (nullable = true)
| | |-- y: integer (nullable = true)