SPARK SQL : How to convert List[List[Any]) to Data Frame

SPARK SQL : How to convert List[List[Any]) to Data Frame - scala

val list = List(List(1,"Ankita"),List(2,"Kunal"))
and now I want to convert it into the data frame -
val list = List(List(1,"Ankita"),List(2,"Kunal")).toDF("id","name")
but it throws an error -
java.lang.ClassNotFoundException: Scala.any

AFAIK, List[List[Any]] cannot be converted to DataFrame directly, It need to convert to some type (here I took example to Person) List[Person]
case class Person(id: Int, name: String)
val list = List(List(1,"Ankita"),List(2,"Kunal"))
val listDf = list.map(x => Person(x(0).asInstanceOf[Int], x(1).toString)).toDF("id","name")
Another way is per the comment of user8371915, Create list of pairs and convert to DataFrame
val listDf = list.map {
case List(id: Int, name: String) => (id, name) } toDF("id", "name")

Because List (inside List) can be arbitrary size and can not using implicit type conversion.
It can be converted if you change to List of Tuple.
val list = List((1,"Ankita"),(2,"Kunal")).toDF("id","name")

Related

ClassCastException: org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema cannot be cast

I am using an Aggregator to apply some custom merge on a DataFrame after grouping its records by their primary key:
case class Player(
pk: String,
ts: String,
first_name: String,
date_of_birth: String
)
case class PlayerProcessed(
var ts: String,
var first_name: String,
var date_of_birth: String
)
// Cutomer Aggregator -This just for the example, actual one is more complex
object BatchDedupe extends Aggregator[Player, PlayerProcessed, PlayerProcessed] {
def zero: PlayerProcessed = PlayerProcessed("0", null, null)
def reduce(bf: PlayerProcessed, in : Player): PlayerProcessed = {
bf.ts = in.ts
bf.first_name = in.first_name
bf.date_of_birth = in.date_of_birth
bf
}
def merge(bf1: PlayerProcessed, bf2: PlayerProcessed): PlayerProcessed = {
bf1.ts = bf2.ts
bf1.first_name = bf2.first_name
bf1.date_of_birth = bf2.date_of_birth
bf1
}
def finish(reduction: PlayerProcessed): PlayerProcessed = reduction
def bufferEncoder: Encoder[PlayerProcessed] = Encoders.product
def outputEncoder: Encoder[PlayerProcessed] = Encoders.product
}
val ply1 = Player("12121212121212", "10000001", "Rogger", "1980-01-02")
val ply2 = Player("12121212121212", "10000002", "Rogg", null)
val ply3 = Player("12121212121212", "10000004", null, "1985-01-02")
val ply4 = Player("12121212121212", "10000003", "Roggelio", "1982-01-02")
val seq_users = sc.parallelize(Seq(ply1, ply2, ply3, ply4)).toDF.as[Player]
val grouped = seq_users.groupByKey(_.pk)
val non_sorted = grouped.agg(BatchDedupe.toColumn.name("deduped"))
non_sorted.show(false)
This returns:
+--------------+--------------------------------+
|key |deduped |
+--------------+--------------------------------+
|12121212121212|{10000003, Roggelio, 1982-01-02}|
+--------------+--------------------------------+
Now, I would like to order the records based on ts before aggregating them. From here I understand that .sortBy("ts") do not guarantee the order after the .groupByKey(_.pk). So I was trying to apply the .sortBy between the .groupByKey and the .agg
The output of the .groupByKey(_.pk) is a KeyValueGroupedDataset[String,Player], being the second element an Iterator. So to apply some sorting logic there I convert it into a Seq:
val sorted = grouped.mapGroups{case(k, iter) => (k, iter.toSeq.sortBy(_.ts))}.agg(BatchDedupe.toColumn.name("deduped"))
sorted.show(false)
However, the output of .mapGroups after adding the sorting logic is a Dataset[(String, Seq[Player])]. So when I try to invoke the .agg function on it I am getting the following exception:
Caused by: ClassCastException: org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema cannot be cast to $line050e0d37885948cd91f7f7dd9e3b4da9311.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$Player
How could I convert back the output of my .mapGroups(...) into a KeyValueGroupedDataset[String,Player]?
I tried to cast back to Iterator as follows:
val sorted = grouped.mapGroups{case(k, iter) => (k, iter.toSeq.sortBy(_.ts).toIterator)}.agg(BatchDedupe.toColumn.name("deduped"))
But this approach produced the following exception:
UnsupportedOperationException: No Encoder found for Iterator[Player]
- field (class: "scala.collection.Iterator", name: "_2")
- root class: "scala.Tuple2"
How else can I add the sort logic between the .groupByKey and .agg methods?

Based on the discussion above, the purpose of the Aggregator is to get the latest field values per Player by ts ignoring null values.
This can be achieved fairly easily aggregating all fields individually using max_by. With that there's no need for a custom Aggregator nor the mutable aggregation buffer.
import org.apache.spark.sql.functions._
val players: Dataset[Player] = ...
// aggregate all columns except the key individually by ts
// NULLs will be ignored (SQL standard)
val aggColumns = players.columns
.filterNot(_ == "pk")
.map(colName => expr(s"max_by($colName, if(isNotNull($colName), ts, null))").as(colName))
val aggregatedPlayers = players
.groupBy(col("pk"))
.agg(aggColumns.head, aggColumns.tail: _*)
.as[Player]
On the most recent versions of Spark you can also use the build in max_by expression:
import org.apache.spark.sql.functions._
val players: Dataset[Player] = ...
// aggregate all columns except the key individually by ts
// NULLs will be ignored (SQL standard)
val aggColumns = players.columns
.filterNot(_ == "pk")
.map(colName => max_by(col(colName), when(col(colName).isNotNull, col("ts"))).as(colName))
val aggregatedPlayers = players
.groupBy(col("pk"))
.agg(aggColumns.head, aggColumns.tail: _*)
.as[Player]

Transforming RDD[String] to RDD[myclass]

I am trying to transform RDD[String] to RDD[Picture] but could not do it. If I could manage to convert RDD to RDD[Picture] I would use the def hasValidCountry to check if the values latitude and longitude of the picture meta valid. And after that I am trying to check if user Tags are valid with def hasTags in Picture class. The problem I encounter :
Implicit conversion found: row ⇒ augmentString(row): scala.collection.immutable.StringOps
type mismatch; found : String required: Array[String]
value InterestingPics is not a member of Array[Nothing] possible cause: maybe a semicolon is missing before `value InterestingPics'?
My intention is to choose line which has valid country and tags and transform all the line to new RDD[Picture] class.
ScalaFile1 (I have updated the ScalaFile):
object Part2 {
def main(args: Array[String]): Unit = {
var spark: SparkSession = null
try {
spark = SparkSession.builder().appName("Flickr using dataframes").config("spark.master", "local[*]").getOrCreate()
val originalFlickrMeta: RDD[String] = spark.sparkContext.textFile("flickrSample.txt")
val InterestingPics = originalFlickrMeta.map(row => row.split('\t')).map(field => Picture(field(0).toString())
InterestingPics.collect
InterestingPics.take(5).foreach(println)

This works, as an example:
case class case_for_rdd(c1: Int, c2: String, c3: String)
val rdd_data = spark.sparkContext.textFile("/FileStore/tables/csv01-4.txt")
val rdd = rdd_data.map(row => row.split(',')).map(field => case_for_rdd(field(0).toInt, field(1), field(2)))
rdd.collect
More complicated example with reading into RDD from file with array. Array needs a delimiter.
1,10,100,aa|bb|cc
2,20,200,xxxxxx|yyyyyyyy|z|aaa
Some sample code, but use List as otherwise you get to see array addresses, that's what those strange strings are, courtesy of smarter
people here:
case class case_for_rdd(c1: Int, c2: String, c3: String, a4: List[String])
val rdd_data = spark.sparkContext.textFile("/FileStore/tables/csv03.txt")
val myCaseRdd = rdd_data.map(row => row.split(',')).map(field => case_for_rdd(field(0).toInt, field(1), field(2), (field(3).split("\\|").toList)))
myCaseRdd.collect
My advice is to use a DF and the splitting stuff is then easier. Also, manipulation of the rdd via transformation, then the case class is lost. Array with DF api has no such issue.

I have an solution to my question in accordence with help of #thebluephantom. Thank you very much.
val InterestingPics = originalFlickrMeta.map(line => (new Picture(line.split("\t")))).filter(f => f.c != null && f.userTags.length > 0)
InterestingPics.collect().foreach(println)

how to convert RDD[(String, Any)] to Array(Row)?

I've got a unstructured RDD with keys and values. The values is of RDD[Any] and the keys are currently Strings, RDD[String] and mainly contain Maps. I would like to make them of type Row so I can make a dataframe eventually. Here is my rdd :
removed
Most of the rdd follows a pattern except for the last 4 keys, how should this be dealt with ? Perhaps split them into their own rdd, especially for reverseDeltas ?
Thanks
Edit
This is what I've tired so far based on the first answer below.
case class MyData(`type`: List[String], libVersion: Double, id: BigInt)
object MyDataBuilder{
def apply(s: Any): MyData = {
// read the input data and convert that to the case class
s match {
case Array(x: List[String], y: Double, z: BigInt) => MyData(x, y, z)
case Array(a: BigInt, Array(x: List[String], y: Double, z: BigInt)) => MyData(x, y, z)
case _ => null
}
}
}
val parsedRdd: RDD[MyData] = rdd.map(x => MyDataBuilder(x))
how it doesn't see to match any of those cases, how can I match on Map in scala ? I keep getting nulls back when printing out parsedRdd

To convert the RDD to a dataframe you need to have fixed schema. If you define the schema for the RDD rest is simple.
something like
val rdd2:RDD[Array[String]] = rdd.map( x => getParsedRow(x))
val rddFinal:RDD[Row] = rdd2.map(x => Row.fromSeq(x))
Alternate
case class MyData(....) // all the fields of the Schema I want
object MyDataBuilder {
def apply(s:Any):MyData ={
// read the input data and convert that to the case class
}
}
val rddFinal:RDD[MyData] = rdd.map(x => MyDataBuilder(x))
import spark.implicits._
val myDF = rddFinal.toDF

there is a method for converting an rdd to dataframe
use it like below
val rdd = sc.textFile("/pathtologfile/logfile.txt")
val df = rdd.toDF()
no you have dataframe do what ever you want on it using sql queries like below
val textFile = sc.textFile("hdfs://...")
// Creates a DataFrame having a single column named "line"
val df = textFile.toDF("line")
val errors = df.filter(col("line").like("%ERROR%"))
// Counts all the errors
errors.count()
// Counts errors mentioning MySQL
errors.filter(col("line").like("%MySQL%")).count()
// Fetches the MySQL errors as an array of strings
errors.filter(col("line").like("%MySQL%")).collect()

Return a tuple of Map and String from UDF

I am trying to construct a temporary column from an expensive UDF that I need to run on each row of my Dataset[Row]. Currently it looks something like:
val myUDF = udf((values: Array[Byte], schema: String) => {
val list = new MyDecoder(schema).decode(values)
val myMap = list.map(
(value: SomeStruct) => (value.field1, value.field2)
).toMap
val field3 = list.head.field3
return (myMap, field3)
})
val decoded = myDF.withColumn("decoded_tmp", myUDF(col("data"), lit(schema))
.withColumn("myMap", col("decoded_tmp._1"))
.withColumn("field3", col("decoded_tmp._2"))
.drop("decoded_tmp")
However, when I try to compile this, I get a type mismatch error:
type mismatch;
found : (scala.collection.immutable.Map[String,Double], String)
required: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row]
How can I get around this, or will I have to have 2 expensive UDF functions, one to produce myMap and the other to produce the field3 column?

Filtering in Scala

So suppose I have the following data (only the first few rows, this data covers an entire year) -
(2014-08-31T00:05:00.000+01:00, John)
(2014-08-31T00:11:00.000+01:00, Sarah)
(2014-08-31T00:12:00.000+01:00, George)
(2014-08-31T00:05:00.000+01:00, John)
(2014-09-01T00:05:00.000+01:00, Sarah)
(2014-09-01T00:05:00.000+01:00, George)
(2014-09-01T00:05:00.000+01:00, Jason)
I would like to filter the data so that I only see what the names are for a specific date (say, 2014-09-05). I've tried doing this using the filter function in Scala but I keep receiving the following error -
error: value xxxx is not a member of (org.joda.time.DateTime, String)
Is there another way of doing this?

The filter method takes a function, called a predicate, that takes as parameter an element of your (I'm assuming) RDD, and returns a Boolean.
The returned RDD will keep only the rows for which the predicate evaluates to true.
In your case, it seems that what you want is something like
rdd.filter{
case (date, _) => date.withTimeAtStartOfDay() == new DateTime("2017-03-31")
}

I presume from the tag your question is in the context of Spark and not pure Scala. Given that, you could filter a dataframe on a date and get the associated name(s) like this:
import org.apache.spark.sql.functions._
import sparkSession.implicits._
Seq(
("2014-08-31T00:05:00.000+01:00", "John"),
("2014-08-31T00:11:00.000+01:00", "Sarah")
...
)
.toDF("date", "name")
.filter(to_date('date).equalTo(Date.valueOf("2014-09-05")))
.select("name")
Note that the Date above is java.sql.Date.

Here's a function that takes a date, a list of datetime-name pairs, and returns a list of names for the date:
def getNames(d: String, l: List[(String, String)]): List[String] = {
val date = """^([^T]*).*""".r
val dateMap = list.map {
case (x, y) => ( x match { case date(z) => z }, y )
}.
groupBy(_._1) mapValues( _.map(_._2) )
dateMap.getOrElse(d, List[String]())
}
val list = List(
("2014-08-31T00:05:00.000+01:00", "John"),
("2014-08-31T00:11:00.000+01:00", "Sarah"),
("2014-08-31T00:12:00.000+01:00", "George"),
("2014-08-31T00:05:00.000+01:00", "John"),
("2014-09-01T00:05:00.000+01:00", "Sarah"),
("2014-09-01T00:05:00.000+01:00", "George"),
("2014-09-01T00:05:00.000+01:00", "Jason")
)
getNames("2014-09-01", list)
res1: List[String] = List(Sarah, George, Jason)

val dateTimeStringZero = "2014-08-12T00:05:00.000+01:00"
val dateTimeOne:DateTime = org.joda.time.format.ISODateTimeFormat.dateTime.withZoneUTC.parseDateTime(dateTimeStringZero)
import java.text.SimpleDateFormat
val df = new DateTime(new SimpleDateFormat("yyyy-MM-dd").parse("2014-08-12"))
println(dateTimeOne.getYear==df.getYear)
println(dateTimeOne.getMonthOfYear==df.getYear)
...