Collecting unique elements during Spark aggregation - scala

Problem
How do I update this line in my code
case StringType => concat_ws(",", collect_list(col(c)))
so that it only appends strings that are not already in the existing field? In this example, the letter "b" would not appear twice.
Code
val df = Seq(
  (1, 1.0, true, "a"),
  (2, 2.0, false, "b"),
  (3, 2.0, false, "b"),
  (3, 2.0, false, "c")
).toDF("id", "d", "b", "s")
val dataTypes: Map[String, DataType] =
  df.schema.map(sf => (sf.name, sf.dataType)).toMap

def genericAgg(c: String) = {
  dataTypes(c) match {
    case DoubleType  => sum(col(c))
    case StringType  => concat_ws(",", collect_list(col(c)))
    case BooleanType => max(col(c))
  }
}

val aggExprs: Seq[Column] = df.columns.filterNot(_ == "id")
  .map(c => genericAgg(c))

df
  .groupBy("id")
  .agg(aggExprs.head, aggExprs.tail: _*)
  .show()

You probably want to use collect_set() instead of collect_list(). This will automatically remove duplicates during the collection.
I am not sure why you want to turn the array of unique strings into a comma-delimited list. Spark can easily handle array columns and they are displayed such that each element can be seen. Still, if you absolutely must have the array converted into a comma-delimited string, use array_join in Spark 2.4+ or a UDF in earlier versions of Spark.
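A minimal sketch of the changed branch, assuming Spark 2.4+ for array_join (on earlier versions you could keep collect_set on its own and leave the result as an array column, or convert it with a small UDF):

case StringType => array_join(collect_set(col(c)), ",")

collect_set deduplicates during the aggregation, and array_join then produces the comma-delimited string.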

Related

Spark 2.3: Reading dataframe inside rdd.map()

I want to iterate through each row of an RDD using .map() and I want to use a dataframe inside the map function as follows:
val rdd = ... // rdd holding seq of ids in each row
val df = ... // df with columns `id: String` and `value: Double`
rdd
  .map { case Row(listOfStrings: Seq[String]) =>
    listOfStrings.foldLeft(Seq[Double]())((temp, curr) => {
      // calling df here
      val extractValue: Double = df.filter(s"id == $curr").first()(1)
      temp :+ extractValue
    })
  }
Above is pseudocode which I made up, and it results in an exception because I cannot reference a dataframe inside .map().
The only way I can think of to overcome this is to collect df before the .map() so that it is no longer a dataframe. Is there a way to do this without collecting? Note that joining the rdd and df is not suitable.
Basically you have an RDD of lists of IDs, RDD[Seq[String]], and a dataframe of tuples (id, value). You are trying to replace the IDs in the RDD with the corresponding values from the dataframe.
The way you are trying to do it is impossible in Spark. You cannot reference a dataframe or an RDD inside a map. They are objects that you manipulate in the driver to parallelize jobs that are executed by the workers. However, the code inside map is executed by a worker, and a worker cannot delegate work to other workers; only the driver can. This is why (intuitively) what you are trying to do is not possible.
You say that a join is not suitable. I am not sure why, but that is exactly what I propose, in combination with a flatMap. I use the RDD API, but similar code could be written with the dataframe API (a rough sketch of that version follows the RDD example below).
// generating data
val data = Seq(Seq("a", "b", "c"), Seq("d", "e"), Seq("f"))
val rdd = sc.parallelize(data)
val df = Seq("a" -> 1d, "b" -> 2d, "c" -> 3d,
             "d" -> 4d, "e" -> 5d, "f" -> 6d)
  .toDF("id", "value")

// transforming the dataframe into an RDD[(String, Double)]
val rdd_df = df.rdd
  .map(row => row.getAs[String]("id") -> row.getAs[Double]("value"))

val result = rdd
  // we start with zipWithUniqueId to remember how the lists were arranged
  .zipWithUniqueId
  // we flatten the lists, remembering for each id which list it came from
  .flatMap { case (ids, unique_id) => ids.map(id => id -> unique_id) }
  .join(rdd_df)
  .map { case (_, (unique_id, value)) => unique_id -> value }
  // we reform the lists by grouping by list id
  .groupByKey
  .map(_._2.toArray)

scala> result.collect
res: Array[Array[Double]] = Array(Array(1.0, 2.0, 3.0), Array(4.0, 5.0), Array(6.0))
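For completeness, here is a rough, untested sketch of the dataframe-API version mentioned above. It assumes the lists sit in an array column named ids and uses monotonically_increasing_id to tag each original list; note that collect_list gives no ordering guarantee within the reassembled lists.

import org.apache.spark.sql.functions._

// the same lists as above, but as a dataframe with an array column `ids`
val listsDf = data.toDF("ids")
  .withColumn("list_id", monotonically_increasing_id())

val resultDf = listsDf
  .withColumn("id", explode($"ids"))         // one row per id, keeping the list tag
  .join(df, Seq("id"))                       // look up each id's value
  .groupBy("list_id")                        // reassemble the lists
  .agg(collect_list($"value").as("values"))

If the in-list order matters, use posexplode to carry along each element's position and sort on it before collecting.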

Infer Schema from rdd to Dataframe in Spark Scala

This question refers to (Spark - creating schema programmatically with different data types).
I am trying to infer the schema from an RDD and create a DataFrame from it. Below is my code:
def inferType(field: String) = field.split(":")(1) match {
  case "Integer"   => IntegerType
  case "Double"    => DoubleType
  case "String"    => StringType
  case "Timestamp" => TimestampType
  case "Date"      => DateType
  case "Long"      => LongType
  case _           => StringType
}

val header = "c1:String|c2:String|c3:Double|c4:Integer|c5:String|c6:Timestamp|c7:Long|c8:Date"

val df1 = Seq("a|b|44.44|5|c|2018-01-01 01:00:00|456|2018-01-01").toDF("data")
val rdd1 = df1.rdd.map(x => Row(x.getString(0).split("\\|"): _*))
val schema = StructType(header.split("\\|").map(column =>
  StructField(column.split(":")(0), inferType(column), true)))
val df = spark.createDataFrame(rdd1, schema)
df.show()
When I call show(), it throws the error below. I have to perform this operation on larger-scale data and am having trouble finding the right solution. Can anybody please help me find a solution for this, or any other way I can achieve it?
java.lang.RuntimeException: Error while encoding: java.lang.RuntimeException: java.lang.String is not a valid external type for schema of int
Thanks in advance
Short answer: string values cannot be coerced into custom types/formats just by specifying a schema. What you are trying to do is parse a raw string into typed SQL columns. The difference from the other example is that it loads from CSV, where Spark does the parsing, while here you have to convert each field to a compatible Scala type yourself.
A working version can be achieved like this:
// skipped other details such as imports, the inferType function, spark session...
val header = "c1:String|c2:String|c3:Double|c4:Integer"

// Create a `Row` holding the raw string
val row = Row.fromSeq(Seq("a|b|44.44|12|"))

// Create an `RDD[Row]` with properly typed values
val rdd: RDD[Row] = spark.sparkContext
  .makeRDD(List(row))
  .map { row =>
    row.getString(0).split("\\|") match {
      case Array(col1, col2, col3, col4) =>
        Row(col1, col2, col3.toDouble, col4.toInt)
    }
  }

val stt: StructType = StructType(
  header
    .split("\\|")
    .map(column => StructField(column.split(":")(0), inferType(column), true))
)

val dataFrame = spark.createDataFrame(rdd, stt)
dataFrame.show()
The reason for building the Row from Scala types is that this is where you introduce values whose types are compatible with the schema, i.e. the types the Row is expected to hold.
Note that I skipped the date- and time-related fields; date conversions are tricky. You can check my other answer on how to use formatted dates and timestamps here.
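As an illustration, assuming the input uses the formats shown in the question, the date and timestamp fields could be converted while building the Row with the java.sql helpers:

import java.sql.{Date, Timestamp}

// "yyyy-MM-dd HH:mm:ss" maps to TimestampType, "yyyy-MM-dd" maps to DateType
val ts: Timestamp = Timestamp.valueOf("2018-01-01 01:00:00")
val dt: Date      = Date.valueOf("2018-01-01")
// place ts and dt into the Row alongside the other typed fields

Anything more exotic than these formats would need an explicit formatter.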

how to get Dataframe from list spark scala

I have a list of RDDs. I iterate over the RDDs and, for each element of an RDD, I apply some parsing logic. Finally I get:
val mRdd = nRdd.map { ele =>
  // parsing logic; I end up with the fields below
  colum  = Array[String]       // example ['id', 'name', 'dept']
  c_type = Array[String]       // example ['Int', 'String', 'String']
  value  = ArrayBuffer[String] // [1,lucy,it][2,denis,cs]
}
How can I get the list of DataFrames in mRdd?
I tried some logic to create a DataFrame, but in that case I have to create an RDD first, and I can't create an RDD inside an RDD.
I am new to Spark. I am using Spark 1.6.3.
Please help me.
In order to convert an RDD into a Dataframe, you would need to do the following:
Approach 1 - use the createDataFrame function:
val mRdd: Seq[DataFrame] = nRdd.map { ele =>
  val parsedRDD = ele // apply parse logic here
  val schema = StructType(Seq(
    StructField("id", IntegerType),
    StructField("name", StringType),
    StructField("dept", StringType)
  ))
  sqlContext.createDataFrame(parsedRDD, schema)
}
Read more about this approach here: https://spark.apache.org/docs/latest/sql-programming-guide.html#programmatically-specifying-the-schema
Approach 2 - use the toDF implicit function:
import sqlContext.implicits._

val mRdd: Seq[DataFrame] = nRdd.map { ele =>
  val parsedRDD = ele // apply parse logic here
  val columns = Seq("id", "name", "dept")
  parsedRDD.toDF(columns: _*)
}
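As a usage illustration (hypothetical data, assuming the parsing logic yields an RDD of tuples whose arity matches the column names):

// hypothetical parsed content for one element of the list
val parsedRDD = sc.parallelize(Seq((1, "lucy", "it"), (2, "denis", "cs")))
val oneDf = parsedRDD.toDF("id", "name", "dept")
oneDf.show() // two rows with columns id, name, dept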

Add a list as a column to dataframe in scala [duplicate]

I have a List[Double]. How can I convert it to org.apache.spark.sql.Column? I am trying to insert it as a column into an existing DataFrame using .withColumn().
It cannot be done directly. Column is not a data structure but a representation of a specific SQL expression; it is not bound to specific data. You'll have to transform your data first. One way to approach this is to parallelize and join by index:
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{StructField, DoubleType}

val df = Seq(("a", 2), ("b", 1), ("c", 0)).toDF("x", "y")
val aList = List(1.0, -1.0, 0.0)

val rows = df.rdd.zipWithIndex.map(_.swap)
  .join(sc.parallelize(aList).zipWithIndex.map(_.swap))
  .values
  .map { case (row: Row, x: Double) => Row.fromSeq(row.toSeq :+ x) }

sqlContext.createDataFrame(rows, df.schema.add("z", DoubleType, false))
Another similar approach is to index and use a UDF to handle the rest:
import scala.util.Try

val indexedDf = sqlContext.createDataFrame(
  df.rdd.zipWithIndex.map {
    case (row: Row, i: Long) => Row.fromSeq(row.toSeq :+ i)
  },
  df.schema.add("idx_", "long")
)

def addValue(vs: Vector[Double]) = udf((i: Long) => Try(vs(i.toInt)).toOption)

indexedDf.withColumn("z", addValue(aList.toVector)($"idx_"))
Unfortunately both solutions suffer from issues. First of all, passing local data through the driver introduces a serious bottleneck in your program; typically data should be accessed directly from the executors. Another problem is the growing RDD lineage if you want to perform this operation iteratively.
While the second issue can be addressed by checkpointing (see the sketch below), the first one makes this idea useless in general. I would strongly suggest that you either build the complete structure first and read it into Spark, or rebuild your pipeline in a way that leverages the Spark architecture. For example, if the data comes from an external source, perform the reads directly for each chunk of data using map / mapPartitions.
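A minimal sketch of that checkpointing mitigation, assuming the operation is repeated in a loop (the directory path is just an illustration):

// truncate the growing lineage every few iterations
sc.setCheckpointDir("/tmp/spark-checkpoints") // illustrative path
rows.checkpoint()  // marks the RDD[Row] built above for checkpointing
rows.count()       // an action forces evaluation, so the lineage is actually cut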

Dataframe state before save and after load - what's different?

I have a DF that contains some SQL expressions (coalesce, case/when etc.).
I later try to map/flatMap this DF and get a Task not serializable error, due to the fields that contain the SQL expressions.
(Why I need to map/flatMap this DF is a separate question)
When I save this DF to a Parquet file and load it afterwards, the error is gone and I can convert to RDD and do transformations no problem!
How is the DF different before saving and after loading? In some way, the SQL expressions must have been evaluated and made persistent. How can I achieve the same thing without saving/loading? (df.persist() did not do the trick ;()
Here's test code:
val data = Seq(
  (1, "sku1", "EUR", 99.0, 89.0),
  (2, "sku2", "USD", 89.0, 79.0),
  (3, "sku3", "USD", 49.0, 39.9)
)
val aditionalStuffForCertainSKUsMap = Map("sku1" -> List(10, 20, 30))

val listedPrice = coalesce(
  List("EUR", "USD").map(c => when($"currency" === c, col(c)).otherwise(lit(null))): _*)

val df = (sc.parallelize(data)
  .toDF("id", "sku", "currency", "EUR", "USD")
  .withColumn("price_in_given_currency", when($"currency" === "EUR", $"EUR" * 2).otherwise(1))
  // .withColumn("fails_price_in_given_currency", listedPrice)
)

df.show
df.write.mode(SaveMode.Overwrite).parquet("test_df")
The data contains a sku, and some SKUs represent bundles, like sku1, for which I want to add some other fields to the DF. Only when I try to access this Map[String, List[Int]] within map() do I get complaints, and only with the fails_price_in_given_currency column, not with price_in_given_currency:
// If I load the df first, the map() works even when using `fails_price_in_given_currency`
// val df = sqlContext.read.parquet("test_df")
val out = df.map { d =>
  val key = d.getAs[String]("sku")
  aditionalStuffForCertainSKUsMap.getOrElse(key, None)
}
The error is thrown when I use fails_price_in_given_currency instead. However, if I load df before the map, it runs fine!