Spark Dataframe of WrappedArray to Dataframe[Vector] - scala

I have a spark Dataframe df with the following schema:
root
|-- features: array (nullable = true)
| |-- element: double (containsNull = false)
I would like to create a new Dataframe where each row is a Vector of Doubles, and I expect to get the following schema:
root
|-- features: vector (nullable = true)
So far I have the following piece of code (influenced by this post: Converting Spark Dataframe(with WrappedArray) to RDD[labelPoint] in scala), but I fear something is wrong with it because it takes a very long time to compute even a reasonable number of rows.
Also, if there are too many rows the application will crash with a heap space exception.
val clustSet = df.rdd.map(r => {
  val arr = r.getAs[mutable.WrappedArray[Double]]("features")
  val features: Vector = Vectors.dense(arr.toArray)
  features
}).map(Tuple1(_)).toDF()
I suspect that the instruction arr.toArray is not a good Spark practice in this case. Any clarification would be very helpful.
Thank you!

It's because .rdd has to deserialize objects from Spark's internal in-memory format, and that is very time consuming.
It's ok to use .toArray - you are operating at the row level, not collecting everything to the driver node.
You can do this very easily with a UDF:
import org.apache.spark.ml.linalg._
val convertUDF = udf((array: Seq[Double]) => {
  Vectors.dense(array.toArray)
})
val withVector = dataset
  .withColumn("features", convertUDF('features))
Code is from this answer: Convert ArrayType(FloatType,false) to VectorUTD
However, the author of that question didn't ask about the differences.
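Applied to the df from the question, a quick sanity check (a minimal sketch, assuming spark.implicits._ is in scope for the $ column syntax and reusing the convertUDF defined above):
val withVector = df.withColumn("features", convertUDF($"features"))
withVector.printSchema()
// root
//  |-- features: vector (nullable = true)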

Related

Need to apply correlation matrix on the sql query on spark dataframe

I have a sample dataset which contains the data related to employees of an organisation. Please find the schema below for the dataset.
The problem I am trying to solve here is: what is the most important criterion for an employee to stay with an organisation, based on a correlation matrix?
I am trying to solve this with a SQL query in Spark/Scala.
Schema of the Dataset
|-- satisfaction_level: float
|-- last_evaluation: float
|-- number_project: integer
|-- average_monthly_hours: integer
|-- time_spend_company: integer
|-- work_accident: integer
|-- left: integer
|-- promotion_last_5years: integer
|-- department: string
|-- salary: string
I tried the query below, but it is not yielding any results. Based on my understanding and analysis of the data, I can show that as satisfaction_level goes down, employees tend to leave the organisation.
val correlationVal = employeesDF.stat.corr("satisfaction_level","left")
I am having trouble writing the SQL query to solve the above problem - can anybody help me with this? What is the correct way to apply a correlation matrix to this problem?
Note: If there is a better/simpler approach to solving this problem using Spark, please share your inputs.
Here is some minimal code which works for me:
import org.apache.spark.sql.Row
import org.apache.spark.sql.types._

val schema = StructType(Array(
  StructField("col1", IntegerType, true),
  StructField("col2", FloatType, true)
))
val rdd = sc.parallelize(Seq(Row(1, 1.34.toFloat), Row(2, 2.02.toFloat), Row(3, 3.4.toFloat), Row(4, 4.2.toFloat)))
val dataFrame = spark.createDataFrame(rdd, schema)
dataFrame.stat.corr("col1","col2")
The result is 0.9914, which is close to 1, indicating that the columns are strongly correlated.
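If what you are really after is the full correlation matrix rather than a single pairwise coefficient, a hedged sketch (assuming Spark 2.2+, that employeesDF contains the numeric columns from the schema above, and that none of them contain nulls) would use VectorAssembler together with the ML Correlation helper:
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.ml.linalg.Matrix
import org.apache.spark.ml.stat.Correlation
import org.apache.spark.sql.Row

// assemble the numeric columns into a single vector column
val numericCols = Array("satisfaction_level", "last_evaluation", "number_project",
  "average_monthly_hours", "time_spend_company", "work_accident", "left",
  "promotion_last_5years")
val assembled = new VectorAssembler()
  .setInputCols(numericCols)
  .setOutputCol("corrFeatures")
  .transform(employeesDF)

// Pearson correlation matrix across all assembled columns
val Row(corrMatrix: Matrix) = Correlation.corr(assembled, "corrFeatures").head
println(corrMatrix)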

How to extract an array column from spark dataframe [duplicate]

This question already has answers here:
Access Array column in Spark
I have a spark dataframe with the following schema and class data:
>ab
ab: org.apache.spark.sql.DataFrame = [block_number: bigint, collect_list(to): array<string> ... 1 more field]
>ab.printSchema
root
|-- block_number: long (nullable = true)
|-- collect_list(to): array (nullable = true)
| |-- element: string (containsNull = true)
|-- collect_list(from): array (nullable = true)
| |-- element: string (containsNull = true)
I want to simply merge the arrays from these two columns. I have tried to find a simple solution for this online but have not had any luck. Basically my issue comes down to two problems.
First, I know that the solution probably involves the map function. I have not been able to find any syntax that actually compiles, so for now please accept my best attempt:
ab.rdd.map(
  row => {
    val block = row.getLong(0)
    val array1 = row(1).getAs[Array<string>]
    val array2 = row(2).getAs[Array<string>]
  }
)
Basically, issue number 1 is very simple, and it is an issue that has been recurring since the day I first started using map in Scala: I can't figure out how to extract a field of an arbitrary type from a column. I know that for primitive types you have things like row.getLong(0), but I don't understand how this should be done for things like array types.
I have seen somewhere that something like row.getAs[Array<string>](1) should work, but when I try it I get the error
error: identifier expected but ']' found.
val array1 = row.getAs[Array<string>](1)
As far as I can tell, this is exactly the syntax I have seen in other situations, but I can't tell why it's not working. I think I have also seen some other syntax that looks like row(1).getAs[Type], but I am not sure.
The second issue is: once I can extract these two arrays, what is the best way of merging them? Using the intersect function? Or is there a better approach to this whole process? For example, using the brickhouse package?
Any help would be appreciated.
Best,
Paul
You don't need to switch to the RDD API; you can do it with DataFrame UDFs like this:
val mergeArrays = udf((arr1: Seq[String], arr2: Seq[String]) => arr1 ++ arr2)
df
  .withColumn("merged", mergeArrays($"collect_list(from)", $"collect_list(to)"))
  .show()
The above UDF just concatenates the arrays (using the ++ operator); you can also use union or intersect etc., depending on what you want to achieve.
Using the RDD API, the solution would look like this:
df.rdd.map(
  row => {
    val block = row.getLong(0)
    val array1 = row.getAs[Seq[String]](1)
    val array2 = row.getAs[Seq[String]](2)
    (block, array1 ++ array2)
  }
).toDF("block", "merged") // back to DataFrames
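As a side note, if you are on Spark 2.4 or later, the built-in array functions can do this without any UDF at all (a sketch, assuming the same column names as above; concat keeps duplicates while array_union removes them):
import org.apache.spark.sql.functions.{array_union, concat}

df.withColumn("merged", concat($"collect_list(from)", $"collect_list(to)"))
df.withColumn("merged_distinct", array_union($"collect_list(from)", $"collect_list(to)"))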

UDF to Concatenate Arrays of Undefined Case Class Buried in a Row Object

I have a dataframe, called sessions, with columns that may change over time. (Edit to Clarify: I do not have a case class for the columns - only a reflected schema.) I will consistently have a uuid and clientId in the outer scope with some other inner and outer scope columns that might constitute a tracking event so ... something like:
root
|-- runtimestamp: long (nullable = true)
|-- clientId: long (nullable = true)
|-- uuid: string (nullable = true)
|-- oldTrackingEvents: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- timestamp: long (nullable = true)
| | |-- actionid: integer (nullable = true)
| | |-- actiontype: string (nullable = true)
| | |-- <tbd ... maps, arrays and other stuff matches sibling> section
...
|-- newTrackingEvents: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- timestamp: long (nullable = true)
| | |-- actionid: integer (nullable = true)
| | |-- actiontype: string (nullable = true)
| | |-- <tbd ... maps, arrays and other stuff matches sibling>
...
I'd like to now merge oldTrackingEvents and newTrackingEvents with a UDF containing these parameters and yet-to-be resolved code logic:
val mergeTEs = udf((oldTEs: Seq[Row], newTEs: Seq[Row]) =>
  // do some stuff - figure out the best way
  //  - to merge both groups of tracking events
  //  - to remove duplicate tracking-event structures
  //  - to limit total tracking events to < 500
  result // same type as the UDF input params
)
The UDF's return value would be an array of the same structure - the resulting list of the two concatenated fields.
QUESTION:
My question is how to construct such a UDF: (1) using the correct passed-in parameter types, (2) a way to manipulate these collections within the UDF, and (3) a clear way to return a value that doesn't cause a compiler error. I unsuccessfully tested Seq[Row] for the input/output (with val testUDF = udf((trackingEvents: Seq[Row]) => trackingEvents)) and received the error java.lang.UnsupportedOperationException: Schema for type org.apache.spark.sql.Row is not supported for a direct return of trackingEvents. However, I get no error when returning Some(1) instead of trackingEvents. What is the best way to manipulate the collections so that I can concatenate two lists of identical structures, as suggested by the schema above, with a UDF performing the activity described in the comments? The goal is to use this operation:
sessions.select(mergeTEs('oldTrackingEvents, 'newTrackingEvents).as("cleanTrackingEvents"))
And, in each row, get back a single array of the 'trackingEvents' structure in a memory- and speed-efficient manner.
SUPPLEMENTAL:
Looking at a question shown to me, there is a possible hint, if it is relevant: Defining a UDF that accepts an Array of objects in a Spark DataFrame? - "To create struct, function passed to udf has to return Product type (Tuple* or case class), not Row."
Perhaps ... this other post is relevant / useful.
I think that the question you've linked explains it all, so just to reiterate - when working with a udf:
The input representation for a StructType is a weakly typed Row object.
The output type for a StructType has to be a Scala Product. You cannot return a Row object.
If this is too much of a burden, you should use a strongly typed Dataset:
val f: T => U
sessions.as[T].map(f): Dataset[U]
where T is an algebraic data type representing Session schema, and U is algebraic data type representing the result.
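For example, a minimal sketch of that typed route, using hypothetical case classes that mirror the reflected schema (the field names and the 500-event cap are illustrative only, not the real schema):
// hypothetical ADTs standing in for the real, reflected schema
case class TrackingEvent(timestamp: Long, actionid: Int, actiontype: String)
case class Session(runtimestamp: Long, clientId: Long, uuid: String,
  oldTrackingEvents: Seq[TrackingEvent],
  newTrackingEvents: Seq[TrackingEvent])
case class CleanSession(uuid: String, cleanTrackingEvents: Seq[TrackingEvent])

import spark.implicits._

val cleaned = sessions.as[Session].map { s =>
  // merge, de-duplicate and cap the tracking events
  CleanSession(s.uuid, (s.oldTrackingEvents ++ s.newTrackingEvents).distinct.take(500))
}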
Alternatively ... if your goal is to merge sequences of some arbitrary row structure / schema with some manipulation, this is an alternative, generally-stated approach that avoids the partitioning discussion:
From the master dataframe, create dataframes for each trackingEvents section, new and old. With each, select the exploded 'trackingEvents' section's columns. Save these val dataframe declarations as newTE and oldTE.
Create another dataframe whose columns uniquely identify each tracking event in the oldTrackingEvents and newTrackingEvents arrays, such as each event's uuid, clientId and timestamp. Your pseudo-schema would be:
(uuid: String, clientId : Long, newTE : Seq[Long], oldTE : Seq[Long])
Use a UDF to join the two simple sequences of your structure, both Seq[Long] - something like this untested example:
val limitEventsUDF = udf { (newTE: Seq[Long], oldTE: Seq[Long], limit: Int, tooOld: Long) =>
  (newTE ++ oldTE).filter(_ > tooOld).sortWith(_ > _).distinct.take(limit)
}
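A hedged usage sketch for that UDF - note that the constant arguments have to be wrapped in lit when the UDF is applied; timestampsDF and tooOldCutoff are hypothetical names:
import org.apache.spark.sql.functions.lit

// tooOldCutoff: Long - a hypothetical timestamp threshold
val slimDF = timestampsDF.withColumn("cleanTimestamps",
  limitEventsUDF($"newTE", $"oldTE", lit(500), lit(tooOldCutoff)))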
The UDF will give you a dataframe of cleaned tracking-event timestamps - a very slim dataframe, with the removed events gone, that you can self-join back to your exploded newTE and oldTE frames after they are unioned back together.
GroupBy as needed thereafter using collect_list.
Still ... this seems like a lot of work. Should this be voted as "the answer"? I'm not sure.

Applying a structure-preserving UDF to a column of structs in a dataframe

I have the schema
|-- count: struct (nullable = true)
| |-- day1: long (nullable = true)
| |-- day2: long (nullable = true)
| |-- day3: long (nullable = true)
| |-- day4: long (nullable = true)
|-- key: string (nullable = true)
and I would like to do a transformation on the data such that the structure of count is preserved, i.e., it still has four fields (day1, day2,...) of type long. The transformation I'd like to do is add the value of the day1 field to the other fields. My idea was to use a UDF but I'm not sure how 1) to have the UDF return a struct with the same structure and 2) how, within the UDF, to access the fields of the struct that it's transforming (in order to get the value of the day1 field). The logic for the UDF should be simple, something like
s : StructType => StructType(s.day1, s.day1+s.day2, s.day1+s.day3,s.day1+s.day4)
but I don't know how to get the correct types/preserve the field names of the original structure. I'm very new to Spark so any guidance is much appreciated.
Also, I would greatly appreciate it if anyone could point me to the right documentation for this type of thing. I feel that this type of simple transformation should be very simple but I was reading the Spark docs and it wasn't clear to me how this is done.
I wouldn't use a udf. Just use select / withColumn:
import org.apache.spark.sql.functions._
import spark.implicits._
df.withColumn("count",
struct(
$"count.day1".alias("day1"),
($"count.day1" + $"count.day2").alias("day2"),
($"count.day1" + $"count.day3").alias("day3"),
($"count.day1" + $"count.day4").alias("day4")))

Matching two dataframes in scala

I have two RDDs in Scala and have converted them to dataframes.
Now I have two dataframes. The first, prodUniqueDF, has two columns named prodid and uid, and holds the master data for products:
scala> prodUniqueDF.printSchema
root
|-- prodid: string (nullable = true)
|-- uid: long (nullable = false)
Second, ratingsDF, which has columns named prodid, custid and ratings:
scala> ratingsDF.printSchema
root
|-- prodid: string (nullable = true)
|-- custid: string (nullable = true)
|-- ratings: integer (nullable = false)
I want to join the two and replace ratingsDF.prodid with prodUniqueDF.uid in ratingsDF.
To do this, I first registered them as temp tables:
prodUniqueDF.registerTempTable("prodUniqueDF")
ratingsDF.registerTempTable("ratingsDF")
And I run the code
val testSql = sql("SELECT prodUniqueDF.uid, ratingsDF.custid, ratingsDF.ratings FROM prodUniqueDF, ratingsDF WHERE prodUniqueDF.prodid = ratingsDF.prodid")
But I get the error:
org.apache.spark.sql.AnalysisException: Table not found: prodUniqueDF; line 1 pos 66
Please help! How can I achieve the join? Is there another method to map RDDs instead?
Joining the DataFrames can easily be achieved.
The format is
DataFrameA.join(DataFrameB)
By default it performs an inner join, but you can also specify the type of join that you want, and there are APIs for that.
You can look here for more information:
http://spark.apache.org/docs/latest/api/scala/#org.apache.spark.sql.DataFrame
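Applied to the frames in the question, a minimal sketch of that join (assuming the default inner join and spark.implicits._ for the $ syntax) could look like:
val joined = ratingsDF
  .join(prodUniqueDF, Seq("prodid"))          // inner join on prodid
  .select($"uid", $"custid", $"ratings")      // uid now stands in for prodid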
For replacing the values in an existing column, you can use the withColumn method from the API.
It would be something like this:
val newDF = dfA.withColumn("newColumnName", dfB("columnName")).drop("columnName").withColumnRenamed("newColumnName", "columnName")
I think this might do the trick!