Create nested case class instance from a DataFrame - scala

I have these two case classes:
case class Inline_response_200(
  nodeid: Option[String],
  data: Option[List[ReadingsByEpoch_data]]
)
and
case class ReadingsByEpoch_data(
  timestamp: Option[Int],
  value: Option[String]
)
And I have a Cassandra table that has data like nodeid|timestamp|value. Basically, each nodeid has multiple timestamp-value pairs.
All I want to do is create instances of Inline_response_200 with their proper List of ReadingsByEpoch_data so Jackson can serialize them properly to JSON.
I've tried
val res = sc.cassandraTable[Inline_response_200]("test", "taghistory").limit(100).collect()
But I get this error
java.lang.IllegalArgumentException: Failed to map constructor parameter data in com.wordnik.client.model.Inline_response_200 to a column of test.taghistory
That makes total sense, because there is no column named data in my Cassandra table. But then how can I create the instances correctly?
Cassandra table looks like this:
CREATE TABLE test.taghistory (
    nodeid text,
    timestamp text,
    value text,
    PRIMARY KEY (nodeid, timestamp)
) WITH CLUSTERING ORDER BY (timestamp DESC);
EDIT
As per Alex Ott's suggestion:
val grouped = data.groupByKey.map {
  case (k, v) =>
    Inline_response_200(k.getString(0), v.map(x => ReadingsByEpoch_data(x.getInt(1), x.getString(2))).toList)
}
grouped.collect().toList
I'm close but not there yet. This gives me the format I expect; however, it's creating one instance of Inline_response_200 per record:
[{"nodeid":"Tag3","data":[{"timestamp":1519411780,"value":"80.0"}]},{"nodeid":"Tag3","data":[{"timestamp":1519411776,"value":"76.0"}]}]
In this example I need to have one nodeid key, and an array of two timestamp-value pairs, like this:
[{"nodeid":"Tag3","data":[{"timestamp":1519411780,"value":"80.0"},{"timestamp":1519411776,"value":"76.0"}]}]`
Maybe I'm grouping the wrong way?

If you have data like nodeid|timestamp|value in your DB (yes, according to the schema), you can't directly map it into the structure that you created. Read the data from the table as a pair RDD:
val data = sc.cassandraTable[(String,String,Option[String])]("test", "taghistory")
.select("nodeid","timestamp","value").keyBy[String]("nodeid")
and then transform it into the structure you need by calling groupByKey on that pair RDD and mapping each group into the Inline_response_200 class, like this:
val grouped = data.groupByKey.map { case (k, v) =>
  Inline_response_200(Some(k),
    Some(v.map(x => ReadingsByEpoch_data(Some(x._2.toInt), x._3)).toList))
}
grouped.collect
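On the Jackson side, a minimal sketch of the serialization step (assuming the jackson-module-scala dependency is available; the mapper setup below is an illustration, not part of the original answer):

import com.fasterxml.jackson.databind.ObjectMapper
import com.fasterxml.jackson.module.scala.DefaultScalaModule

// Register the Scala module so case classes, Options and Lists serialize as plain JSON
val mapper = new ObjectMapper().registerModule(DefaultScalaModule)
val json = mapper.writeValueAsString(grouped.collect().toList)
// json should now contain one object per nodeid, with all of its timestamp/value pairs in data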

Related

Spark dataframe ".as" function does not drop columns not present in matched case class

I am a little bit confused about Spark's dataframe .as[] function;
in the documentation it says
returns a new Dataset where each record has been mapped to the specified type.
but for example, if I do:
case class Person(id: Int, name: String)
case class NewPerson(id: Int)
val person1 = Person(1, "a")
val df = Seq(person1).toDF()
val ds = df.as[NewPerson]
the ds dataset I get will still have the two columns id and name of the class Person. I would expect to have only the id column of the class NewPerson.
What did the function do here?
Actually, the as method only changes the view of the data, not the data itself, as explained in the documentation:
Note that as[] only changes the view of the data that is passed into typed operations, such as map(), and does not eagerly project away any columns that are not present in the specified class.
So as does not remove columns that are not present in your case class; it just creates a view of your rows that you can use in typed operations.
Adding to Vincent Doba's answer, if you have used a case class A to create a value ds of type Dataset[A] (for example via df.as[A]), then you can truncate it to the fields you need with the following:
val ds_clean: Dataset[A] = ds.map(identity)
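To illustrate the difference, here is a small sketch (assuming a SparkSession named spark and the Person/NewPerson classes from the question):

import spark.implicits._

val df = Seq(Person(1, "a")).toDF()
df.as[NewPerson].columns               // Array(id, name): the view still exposes both columns
df.as[NewPerson].map(identity).columns // Array(id): the typed map projects the extra column away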

Spark scala data frame udf returning rows

Say I have a dataframe which contains a column (called colA) which is a seq of rows. I want to append a new field to each record of colA. (And the new field is associated with the existing record, so I have to write a udf.)
How should I write this udf?
I have tried to write a udf which takes colA as input and outputs a Seq[Row] where each record contains the new field. But the problem is that the udf cannot return Seq[Row]. The exception is 'Schema for type org.apache.spark.sql.Row is not supported'.
What should I do?
The udf that I wrote:
val convert = udf[Seq[Row], Seq[Row]](blablabla...)
And the exception is java.lang.UnsupportedOperationException: Schema for type org.apache.spark.sql.Row is not supported
Since Spark 2.0 you can create UDFs which return Row / Seq[Row], but you must provide the schema for the return type, e.g. if you work with an Array of Doubles:
import org.apache.spark.sql.Row
import org.apache.spark.sql.functions.udf
import org.apache.spark.sql.types.{ArrayType, DoubleType}
val schema = ArrayType(DoubleType)
val myUDF = udf((s: Seq[Row]) => {
  s // just pass data without modification
}, schema)
But I can't really imagine where this is useful; I would rather return tuples or case classes (or a Seq thereof) from the UDFs.
EDIT: It could be useful if your row contains more than 22 fields (the limit of fields for tuples/case classes).
This is an old question, I just wanted to update it according to the new version of Spark.
Since Spark 3.0.0, the method that @Raphael Roth has mentioned is deprecated. Hence, you might get an AnalysisException. The reason is that the input closure using this method doesn't have type checking, and the behavior might be different from what we expect in SQL when it comes to null values.
If you really know what you're doing, you need to set spark.sql.legacy.allowUntypedScalaUDF configuration to true.
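If you do go that route, one way to enable it at runtime (an illustrative sketch, assuming an active SparkSession named spark) is:

spark.conf.set("spark.sql.legacy.allowUntypedScalaUDF", "true")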
Another solution is to use a case class instead of a schema. For example:
case class Foo(field1: String, field2: String)

val convertFunction: Seq[Row] => Seq[Foo] = input => {
  input.map { x =>
    // do something with x and convert it to a Foo, e.g.
    Foo(x.getString(0), x.getString(1))
  }
}

val myUdf = udf(convertFunction)
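The UDF is then applied like any other UDF (a sketch; the DataFrame df and the column name colA are assumptions taken from the question, not from the original answer):

import org.apache.spark.sql.functions.col

val result = df.withColumn("colB", myUdf(col("colA")))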

How to create udf containing Array (case class) for complex column in a dataframe

I have a dataframe which has a complex column whose datatype is an ArrayType of structs. For transforming this dataframe I have created a udf which can consume this column using an Array[case class] as a parameter. The main bottleneck here is that when I create the case class according to the StructType, the StructField name contains special characters, for example "##field". So I give the case class field the same name (##field) and attach this to the udf parameter. Once interpreted, the Spark udf definition changes the name of the case class field to "$hash$hashfield". When performing the transform using this dataframe it fails because of this mismatch. Please help ...
Due to JVM limitations, Scala stores identifiers in encoded form and currently Spark can't map ##field to $hash$hashfield.
One possible solution is to extract the fields manually from the raw row (but you need to know the order of the fields in the df; you can use df.schema for that):
val myUdf = udf { (struct: Row) =>
  // Pattern match the struct:
  struct match {
    case Row(a: String) => Foo(a)
  }
  // ... or extract values from the Row by position:
  // val `##a` = struct.getAs[String](0)
}
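To find the field order needed for that positional access, you can inspect the struct column's schema; a small sketch (the column name colA is an assumption, not from the question):

import org.apache.spark.sql.types.StructType

// Field names of the struct column, in the order that getAs[...](index) sees them
df.schema("colA").dataType.asInstanceOf[StructType].fieldNames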

Convert Dataframe back to RDD of case class in Spark

I am trying to convert a dataframe of multiple case classes to an rdd of these multiple case classes. I can't find any solution. This WrappedArray has driven me crazy :P
For example, assuming I am having the following:
case class randomClass(a:String,b: Double)
case class randomClass2(a:String,b: Seq[randomClass])
case class randomClass3(a:String,b:String)
val anRDD = sc.parallelize(Seq(
  (randomClass2("a", Seq(randomClass("a1", 1.1), randomClass("a2", 1.1))), randomClass3("aa", "aaa")),
  (randomClass2("b", Seq(randomClass("b1", 1.2), randomClass("b2", 1.2))), randomClass3("bb", "bbb")),
  (randomClass2("c", Seq(randomClass("c1", 3.2), randomClass("c2", 1.2))), randomClass3("cc", "Ccc"))))
val aDF = anRDD.toDF()
Assuming that I have the aDF, how can I get the anRDD back?
I tried something like this just to get the second column but it was giving an error:
aDF.map { case r:Row => r.getAs[randomClass3]("_2")}
You can convert indirectly using Dataset[randomClass3]:
aDF.select($"_2.*").as[randomClass3].rdd
Spark DataFrame / Dataset[Row] represents data as Row objects using the mapping described in the Spark SQL, DataFrames and Datasets Guide. Any call to getAs should use this mapping.
For the second column, which is struct<a: string, b: string>, it would be a Row as well:
aDF.rdd.map { _.getAs[Row]("_2") }
As commented by Tzach Zohar, to get back a full RDD you'll need:
aDF.as[(randomClass2, randomClass3)].rdd
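Note that for a struct column the underlying value is always a Row, so getAs can only hand back a Row, never a randomClass3 directly. A sketch of rebuilding the case class manually from that Row (assuming the field order a, b shown above):

import org.apache.spark.sql.Row

val secondColumn = aDF.rdd.map { r =>
  val s = r.getAs[Row]("_2")
  randomClass3(s.getString(0), s.getString(1))
}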
I don't know the Scala API, but have you considered the rdd value?
Maybe something like:
aDF.rdd.map { case r: Row => r.getAs[randomClass3]("_2") }

Listing columns on a Slick table

I have a Slick 3.0 table definition similar to the following:
case class Simple(a: String, b: Int, c: Option[String])

trait Tables { this: JdbcDriver =>
  import api._

  class Simples(tag: Tag) extends Table[Simple](tag, "simples") {
    def a = column[String]("a")
    def b = column[Int]("b")
    def c = column[Option[String]]("c")
    def * = (a, b, c) <> (Simple.tupled, Simple.unapply)
  }

  lazy val simples = TableQuery[Simples]
}

object DB extends Tables with MyJdbcDriver
I would like to be able to do 2 things:
Get a list of the column names as Seq[String]
For an instance of Simple, generate a Seq[String] that would correspond to how the data would be inserted into the database using a raw query (e.g. Simple("hello", 1, None) becomes Seq("'hello'", "1", "NULL"))
What would be the best way to do this using the Slick table definition?
First of all, it is not possible to trick Slick and change the order on the left side of the <> operator in the * method without changing the order of values in Simple, the row type of Simples, i.e. what Ben assumed is not possible. The ProvenShape return type of the * projection method ensures that there is a Shape available for translating between the Column-based type in * and the client-side type. If you write def * = (c, b, a) <> (Simple.tupled, Simple.unapply) with Simple defined as case class Simple(a: String, b: Int, c: Option[String]), Slick will complain with the error "No matching Shape found. Slick does not know how to map the given types...". Ergo, you can iterate over all the elements of an instance of Simple with its productIterator.
Secondly, you already have the definition of the Simples table in your code, and querying meta tables to get the same information you already have is not sensible. You can get all your column names with the one-liner simples.baseTableRow.create_*.map(_.name). Note that the * projection of the table also defines the columns generated when you create the table schema, so the columns not mentioned in the projection are not created, and the statement above is guaranteed to return exactly what you need and not to drop anything.
To recap briefly:
To get a list of the column names of the Simples table as Seq[String], use
simples.baseTableRow.create_*.map(_.name).toSeq
To generate a Seq[String] that would correspond to how the data would be inserted into the database using a raw query for aSimple, an instance of Simple, use
aSimple.productIterator.toSeq
To get the column names, try this:
db.run(for {
  metaTables <- slick.jdbc.meta.MTable.getTables("simples")
  columns <- metaTables.head.getColumns
} yield columns.map(_.name)) foreach println
This will print
Vector(a, b, c)
And for the case class values, you can use productIterator:
Simple("hello", 1, None).productIterator.toVector
is
Vector(hello, 1, None)
You still have to do the value mapping, and guarantee that the order of the columns in the table and the values in the case class are the same.
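For that mapping, here is a minimal sketch (an illustration, not from the original answers, assuming only String, Int and Option fields and that the projection order matches the case class):

def sqlLiterals(s: Simple): Seq[String] =
  s.productIterator.map {
    case None            => "NULL"
    case Some(v: String) => s"'$v'"
    case Some(v)         => v.toString
    case v: String       => s"'$v'"
    case v               => v.toString
  }.toSeq

// sqlLiterals(Simple("hello", 1, None)) == Seq("'hello'", "1", "NULL")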