Converting multiple different columns to Map column with Spark Dataframe scala - scala

I have a data frame with column: user, address1, address2, address3, phone1, phone2 and so on.
I want to convert this data frame to - user, address, phone where address = Map("address1" -> address1.value, "address2" -> address2.value, "address3" -> address3.value)
I was able to convert the columns to map using:
val mapData = List("address1", "address2", "address3")
df.map(_.getValuesMap[Any](mapData))
but I am not sure how to add this to my df.
I am new to spark and scala and could really use some help here.

Spark >= 2.0
You can skip udf and use map (create_map in Python) SQL function:
import org.apache.spark.sql.functions.map
df.select(
map(mapData.map(c => lit(c) :: col(c) :: Nil).flatten: _*).alias("a_map")
)
Spark < 2.0
As far as I know there is no direct way to do it. You can use an UDF like this:
import org.apache.spark.sql.functions.{udf, array, lit, col}
val df = sc.parallelize(Seq(
(1L, "addr1", "addr2", "addr3")
)).toDF("user", "address1", "address2", "address3")
val asMap = udf((keys: Seq[String], values: Seq[String]) =>
keys.zip(values).filter{
case (k, null) => false
case _ => true
}.toMap)
val keys = array(mapData.map(lit): _*)
val values = array(mapData.map(col): _*)
val dfWithMap = df.withColumn("address", asMap(keys, values))
Another option, which doesn't require UDFs, is to struct field instead of map:
val dfWithStruct = df.withColumn("address", struct(mapData.map(col): _*))
The biggest advantage is that it can easily handle values of different types.

Related

Aggregating all Column values within a Map after groupBy in Apache Spark

I've been trying this all day long with a Dataframe but no luck so far. Already did it with a RDD but it isn't really readable, so this approach would be much better when it comes to code readability.
Take this initial and result DF, both the starting DF and what I would like to obtain after peforming .groupBy().
case class SampleRow(name:String, surname:String, age:Int, city:String)
case class ResultRow(name: String, surnamesAndAges: Map[String, (Int, String)])
val df = List(
SampleRow("Rick", "Fake", 17, "NY"),
SampleRow("Rick", "Jordan", 18, "NY"),
SampleRow("Sandy", "Sample", 19, "NY")
).toDF()
val resultDf = List(
ResultRow("Rick", Map("Fake" -> (17, "NY"), "Jordan" -> (18, "NY"))),
ResultRow("Sandy", Map("Sample" -> (19, "NY")))
).toDF()
What I've tried so far is performing the following .groupBy...
val resultDf = df
.groupBy(
Name
)
.agg(
functions.map(
selectColumn(Surname),
functions.array(
selectColumn(Age),
selectColumn(City)
)
)
)
However, the following is prompt into console.
Exception in thread "main" org.apache.spark.sql.AnalysisException: expression '`surname`' is neither present in the group by, nor is it an aggregate function. Add to group by or wrap in first() (or first_value) if you don't care which value you get.;;
However, doing that would result in a single entry per surname and I would like to accumulate those in a single Map as you can see in resultDf. Is there an easy way to achieve this using DFs?
you can achieve it with a single UDF to convert your data to map:
val toMap = udf((keys: Seq[String], values1: Seq[String], values2: Seq[String]) => {
keys.zip(values1.zip(values2)).toMap
})
val myResultDF = df.groupBy("name").agg(collect_list("surname") as "surname", collect_list("age") as "age", collect_list("city") as "city").withColumn("surnamesAndAges", toMap($"surname", $"age", $"city")).drop("age", "city", "surname").show(false)
+-----+--------------------------------------+
|name |surnamesAndAges |
+-----+--------------------------------------+
|Sandy|[Sample -> [19, NY]] |
|Rick |[Fake -> [17, NY], Jordan -> [18, NY]]|
+-----+--------------------------------------+
If you are not concerned about typecasting the Dataframe to DataSet (In this case ResultRow you could do something like this
val grouped =df.withColumn("surnameAndAge",struct($"surname",$"age"))
.groupBy($"name")
.agg(collect_list("surnameAndAge").alias("surnamesAndAges"))
Then you could create a User defined function which would look like
import org.apache.spark.sql._
val arrayToMap = udf[Map[String, String], Seq[Row]] {
array => array.map {
case Row(key: String, value: String) => (key, value) }.toMap
}
Now you could use a .withColumn and call this udf
val finalData = grouped.withColumn("surnamesAndAges",arrayToMap($"surnamesAndAges"))
The Dataframe would look something like this
finalData: org.apache.spark.sql.DataFrame = [name: string, surnamesAndAges: map<string,string>]
Since Spark 2.4, you don't need to use a Spark user-defined function:
import org.apache.spark.sql.functions.{col, collect_set, map_from_entries, struct}
df.withColumn("mapEntry", struct(col("surname"), struct(col("age"), col("city"))))
.groupBy("name")
.agg(map_from_entries(collect_set("mapEntry")).as("surnameAndAges"))
Explanation
You first add a column containing a Map entry from desired columns. a Map entry is merely a struct containing two columns: first column is the key and the second column is the value. You can put another struct as the value. So here your Map entry will use column surname as key, and a struct of columns age and city as value:
struct(col("surname"), struct(col("age"), col("city")))
Then, you collect all the Map entries grouped by your groupBy key, which is column name using function collect_set, and you convert this list of Map entries to a Map using function map_from_entries

How to use non-column value in UserDefinedFunction (UDF) for adding a column to a DataFrame? [duplicate]

I want to parse the date columns in a DataFrame, and for each date column, the resolution for the date may change (i.e. 2011/01/10 => 2011 /01 if the resolution is set to "Month").
I wrote the following code:
def convertDataFrame(dataframe: DataFrame, schema : Array[FieldDataType], resolution: Array[DateResolutionType]) : DataFrame =
{
import org.apache.spark.sql.functions._
val convertDateFunc = udf{(x:String, resolution: DateResolutionType) => SparkDateTimeConverter.convertDate(x, resolution)}
val convertDateTimeFunc = udf{(x:String, resolution: DateResolutionType) => SparkDateTimeConverter.convertDateTime(x, resolution)}
val allColNames = dataframe.columns
val allCols = allColNames.map(name => dataframe.col(name))
val mappedCols =
{
for(i <- allCols.indices) yield
{
schema(i) match
{
case FieldDataType.Date => convertDateFunc(allCols(i), resolution(i)))
case FieldDataType.DateTime => convertDateTimeFunc(allCols(i), resolution(i))
case _ => allCols(i)
}
}
}
dataframe.select(mappedCols:_*)
}}
However it doesn't work. It seems that I can only pass Columns to UDFs. And I wonder if it will be very slow if I convert the DataFrame to RDD and apply the function on each row.
Does anyone know the correct solution? Thank you!
Just use a little bit of currying:
def convertDateFunc(resolution: DateResolutionType) = udf((x:String) =>
SparkDateTimeConverter.convertDate(x, resolution))
and use it as follows:
case FieldDataType.Date => convertDateFunc(resolution(i))(allCols(i))
On a side note you should take a look at sql.functions.trunc and sql.functions.date_format. These should at least part of the job without using UDFs at all.
Note:
In Spark 2.2 or later you can use typedLit function:
import org.apache.spark.sql.functions.typedLit
which support a wider range of literals like Seq or Map.
You can create a literal Column to pass to a udf using the lit(...) function defined in org.apache.spark.sql.functions
For example:
val takeRight = udf((s: String, i: Int) => s.takeRight(i))
df.select(takeRight($"stringCol", lit(1)))

how to convert RDD[(String, Any)] to Array(Row)?

I've got a unstructured RDD with keys and values. The values is of RDD[Any] and the keys are currently Strings, RDD[String] and mainly contain Maps. I would like to make them of type Row so I can make a dataframe eventually. Here is my rdd :
removed
Most of the rdd follows a pattern except for the last 4 keys, how should this be dealt with ? Perhaps split them into their own rdd, especially for reverseDeltas ?
Thanks
Edit
This is what I've tired so far based on the first answer below.
case class MyData(`type`: List[String], libVersion: Double, id: BigInt)
object MyDataBuilder{
def apply(s: Any): MyData = {
// read the input data and convert that to the case class
s match {
case Array(x: List[String], y: Double, z: BigInt) => MyData(x, y, z)
case Array(a: BigInt, Array(x: List[String], y: Double, z: BigInt)) => MyData(x, y, z)
case _ => null
}
}
}
val parsedRdd: RDD[MyData] = rdd.map(x => MyDataBuilder(x))
how it doesn't see to match any of those cases, how can I match on Map in scala ? I keep getting nulls back when printing out parsedRdd
To convert the RDD to a dataframe you need to have fixed schema. If you define the schema for the RDD rest is simple.
something like
val rdd2:RDD[Array[String]] = rdd.map( x => getParsedRow(x))
val rddFinal:RDD[Row] = rdd2.map(x => Row.fromSeq(x))
Alternate
case class MyData(....) // all the fields of the Schema I want
object MyDataBuilder {
def apply(s:Any):MyData ={
// read the input data and convert that to the case class
}
}
val rddFinal:RDD[MyData] = rdd.map(x => MyDataBuilder(x))
import spark.implicits._
val myDF = rddFinal.toDF
there is a method for converting an rdd to dataframe
use it like below
val rdd = sc.textFile("/pathtologfile/logfile.txt")
val df = rdd.toDF()
no you have dataframe do what ever you want on it using sql queries like below
val textFile = sc.textFile("hdfs://...")
// Creates a DataFrame having a single column named "line"
val df = textFile.toDF("line")
val errors = df.filter(col("line").like("%ERROR%"))
// Counts all the errors
errors.count()
// Counts errors mentioning MySQL
errors.filter(col("line").like("%MySQL%")).count()
// Fetches the MySQL errors as an array of strings
errors.filter(col("line").like("%MySQL%")).collect()

Loading several input files into one Dataframe in Scala / Spark 1.6

I'm trying to load several input files in to a single dataframe:
val inputs = List[String]("input1.txt", "input2.txt", "input3.txt")
val dataFrames = for (
i <- inputs;
df <- sc.textFile(i).toDF()
) yield {df}
val inputDataFrame = unionAll(dataFrames, sqlContext)
// union of all given DataFrames
private def unionAll(dataFrames: Seq[DataFrame], sqlContext: SQLContext): DataFrame = dataFrames match {
case Nil => sqlContext.emptyDataFrame
case head :: Nil => head
case head :: tail => head.unionAll(unionAll(tail, sqlContext))
}
Compiler says
Error:(40, 8) type mismatch;
found : org.apache.spark.rdd.RDD[org.apache.spark.sql.Row]
required: scala.collection.GenTraversableOnce[?]
df <- sc.textFile(i).toDF()
^
Any idea?
First, SQLContext.read.text(...) accepts multiple filename arguments, so you can simply do:
val inputs = List[String]("input1.txt", "input2.txt", "input3.txt")
val inputDataFrame = sqlContext.read.text(inputs: _*)
Or:
val inputDataFrame = sqlContext.read.text("input1.txt", "input2.txt", "input3.txt")
As for your code - when you write:
val dataFrames = for (
i <- inputs;
df <- sc.textFile(i).toDF()
) yield df
It is translated into:
inputs.flatMap(i => sc.textFile(i).toDF().map(df => df))
Which can't compile, because flatMap expects a function that returns a GenTraversableOnce[?], while the supplied function returns an RDD[Row] (See signature of DataFrame.map). In other words, when you write df <- sc.textFile(i).toDF() you're actually taking each row in the dataframe, and yielding a new RDD with these rows, which isn't what you intended.
What you were trying to do is simpler:
val dataFrames = for (
i <- inputs;
) yield sc.textFile(i).toDF()
But, as mentioned at the beginning, the recommended approach is using sqlContext.read.text.

How can I pass extra parameters to UDFs in Spark SQL?

I want to parse the date columns in a DataFrame, and for each date column, the resolution for the date may change (i.e. 2011/01/10 => 2011 /01 if the resolution is set to "Month").
I wrote the following code:
def convertDataFrame(dataframe: DataFrame, schema : Array[FieldDataType], resolution: Array[DateResolutionType]) : DataFrame =
{
import org.apache.spark.sql.functions._
val convertDateFunc = udf{(x:String, resolution: DateResolutionType) => SparkDateTimeConverter.convertDate(x, resolution)}
val convertDateTimeFunc = udf{(x:String, resolution: DateResolutionType) => SparkDateTimeConverter.convertDateTime(x, resolution)}
val allColNames = dataframe.columns
val allCols = allColNames.map(name => dataframe.col(name))
val mappedCols =
{
for(i <- allCols.indices) yield
{
schema(i) match
{
case FieldDataType.Date => convertDateFunc(allCols(i), resolution(i)))
case FieldDataType.DateTime => convertDateTimeFunc(allCols(i), resolution(i))
case _ => allCols(i)
}
}
}
dataframe.select(mappedCols:_*)
}}
However it doesn't work. It seems that I can only pass Columns to UDFs. And I wonder if it will be very slow if I convert the DataFrame to RDD and apply the function on each row.
Does anyone know the correct solution? Thank you!
Just use a little bit of currying:
def convertDateFunc(resolution: DateResolutionType) = udf((x:String) =>
SparkDateTimeConverter.convertDate(x, resolution))
and use it as follows:
case FieldDataType.Date => convertDateFunc(resolution(i))(allCols(i))
On a side note you should take a look at sql.functions.trunc and sql.functions.date_format. These should at least part of the job without using UDFs at all.
Note:
In Spark 2.2 or later you can use typedLit function:
import org.apache.spark.sql.functions.typedLit
which support a wider range of literals like Seq or Map.
You can create a literal Column to pass to a udf using the lit(...) function defined in org.apache.spark.sql.functions
For example:
val takeRight = udf((s: String, i: Int) => s.takeRight(i))
df.select(takeRight($"stringCol", lit(1)))