Scala dataset map fails with exception No applicable constructor/method found for zero actual parameters

I have the following case classes
case class FeedbackData(prefix: String, position: Int, click: Boolean,
                        suggestion: Suggestion,
                        history: List[RequestHistory],
                        eventTimestamp: Long)
case class Suggestion(clicks: Long, sources: List[String], ctr: Float)
case class RequestHistory(timestamp: Long, url: String)
I use it to perform a map operation on my dataset
val sqlContext = ss.sqlContext
import sqlContext.implicits._
val input: Dataset[FeedbackData] = ss.read.json("filename").as(Encoders.bean(classOf[FeedbackData]))
input.map(row => transformRow(row))
At runtime I see the exception
java.util.concurrent.ExecutionException: org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 24, Column 81: failed to compile:
No applicable constructor/method found for zero actual parameters; candidates are: "package.FeedbackData(java.lang.String, int, boolean, package.Suggestion, scala.collection.immutable.List, long)"
What am I doing wrong?

The context is fine here; the issue is with the case class: Scala's Long has to be used instead of Java's long:
case class A(num1 : Long, num2 : Long, num3 : Long)
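For completeness, a minimal sketch of the corrected reading path (assuming a SparkSession named ss and the transformRow function from the question): the encoder is derived from the case class via the implicits import, rather than Encoders.bean, which expects a JavaBean-style zero-argument constructor.

```scala
import ss.implicits._  // brings the Product encoder for case classes into scope

// Sketch: as[FeedbackData] picks up the implicitly derived case-class encoder,
// so no bean-style zero-argument constructor is needed.
val input: Dataset[FeedbackData] = ss.read.json("filename").as[FeedbackData]
val transformed = input.map(row => transformRow(row))
```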

Inspired by @pasha701, a use case could be:
case class Student(id: Int, name: String)
import spark.implicits._
val df = Seq((1, "james"), (2, "tony")).toDF("id", "name")
df.printSchema()
df.as[Student].rdd.map { stu =>
  stu.id + "\t" + stu.name
}.collect().foreach(println)
output:
root
|-- id: integer (nullable = false)
|-- name: string (nullable = true)
1 james
2 tony
Reference: https://spark.apache.org/docs/2.4.0/sql-getting-started.html

Related

zip function with 3 parameters

I want to transpose multiple columns in Spark SQL table
I found this solution for only two columns; I want to know how to use the zip function with three columns: varA, varB and varC.
import org.apache.spark.sql.functions.{udf, explode}
val zip = udf((xs: Seq[Long], ys: Seq[Long]) => xs.zip(ys))
df.withColumn("vars", explode(zip($"varA", $"varB"))).select(
$"userId", $"someString",
$"vars._1".alias("varA"), $"vars._2".alias("varB")).show
this is my dataframe schema :
root
 |-- owningcustomerid: string (nullable = true)
 |-- event_stoptime: string (nullable = true)
 |-- balancename: string (nullable = false)
 |-- chargedvalue: string (nullable = false)
 |-- newbalance: string (nullable = false)
I tried this code:
val zip = udf((xs: Seq[String], ys: Seq[String], zs: Seq[String]) => (xs, ys, zs).zipped.toSeq)
df.printSchema
val df4 = df.withColumn("vars", explode(zip($"balancename", $"chargedvalue", $"newbalance"))).select(
  $"owningcustomerid", $"event_stoptime",
  $"vars._1".alias("balancename"), $"vars._2".alias("chargedvalue"), $"vars._3".alias("newbalance"))
I got this error:
cannot resolve 'UDF(balancename, chargedvalue, newbalance)' due to data type mismatch: argument 1 requires array<string> type, however, '`balancename`' is of string type. argument 2 requires array<string> type, however, '`chargedvalue`' is of string type. argument 3 requires array<string> type, however, '`newbalance`' is of string type.;;
'Project [owningcustomerid#1085, event_stoptime#1086, balancename#1159, chargedvalue#1160, newbalance#1161, explode(UDF(balancename#1159, chargedvalue#1160, newbalance#1161)) AS vars#1167]
In Scala in general you can use Tuple3.zipped
val zip = udf((xs: Seq[Long], ys: Seq[Long], zs: Seq[Long]) =>
(xs, ys, zs).zipped.toSeq)
zip($"varA", $"varB", $"varC")
Specifically in Spark SQL (>= 2.4) you can use arrays_zip function:
import org.apache.spark.sql.functions.arrays_zip
arrays_zip($"varA", $"varB", $"varC")
However, note that your data doesn't contain array<string> columns but plain strings, so arrays_zip / explode cannot be applied directly; you should parse your data into arrays first. For four columns, an index-based UDF works the same way:
val zip = udf((a: Seq[String], b: Seq[String], c: Seq[String], d: Seq[String]) =>
  a.indices.map(i => (a(i), b(i), c(i), d(i))))
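If the string columns actually hold delimited lists, one possible sketch (the comma delimiter is an assumption about the data) is to split them into arrays first, and only then zip and explode:

```scala
import org.apache.spark.sql.functions.{split, explode, arrays_zip}

// Assumption: each column contains comma-separated values; adjust the
// delimiter to match the real data.
val arrays = df.select(
  $"owningcustomerid", $"event_stoptime",
  split($"balancename", ",").as("balancename"),
  split($"chargedvalue", ",").as("chargedvalue"),
  split($"newbalance", ",").as("newbalance"))

// arrays_zip (Spark >= 2.4) names the struct fields after the input columns.
val df4 = arrays
  .withColumn("vars", explode(arrays_zip($"balancename", $"chargedvalue", $"newbalance")))
  .select(
    $"owningcustomerid", $"event_stoptime",
    $"vars.balancename", $"vars.chargedvalue", $"vars.newbalance")
```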

Spark Frameless withColumnRenamed nested field

Let's say I have the following code
case class MyTypeInt(a: String, b: MyType2)
case class MyType2(v: Int)
case class MyTypeLong(a: String, b: MyType3)
case class MyType3(v: Long)
val typedDataset = TypedDataset.create(Seq(MyTypeInt("v", MyType2(1))))
typedDataset.withColumnRenamed(???, typedDataset.colMany('b, 'v).cast[Long]).as[MyTypeLong]
How can I implement this transformation when the field that I am trying to transform is nested? The signature of withColumnRenamed asks for a Symbol as the first parameter, so I don't know how to do this...
withColumnRenamed does not allow you to transform a column. To do that, you should use withColumn. One approach would then be to cast the column and recreate the struct.
scala> val new_ds = ds.withColumn("b", struct($"b.v" cast "long" as "v")).as[MyTypeLong]
scala> new_ds.printSchema
root
|-- a: string (nullable = true)
|-- b: struct (nullable = false)
| |-- v: long (nullable = true)
Another approach would be to use map and build the object yourself:
ds.map{ case MyTypeInt(a, MyType2(b)) => MyTypeLong(a, MyType3(b)) }
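A fuller sketch of that map route on a plain Spark Dataset (assuming a SparkSession named spark; frameless' TypedDataset exposes a similar map):

```scala
import spark.implicits._

val ds = Seq(MyTypeInt("v", MyType2(1))).toDS()

// Rebuild the nested structure with the widened Long field.
val converted: Dataset[MyTypeLong] =
  ds.map { case MyTypeInt(a, MyType2(v)) => MyTypeLong(a, MyType3(v.toLong)) }
```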

How to create a Row from a given case class?

Imagine that you have the following case classes:
case class B(key: String, value: Int)
case class A(name: String, data: B)
Given an instance of A, how do I create a Spark Row? e.g.
val a = A("a", B("b", 0))
val row = ???
NOTE: Given row I need to be able to get data with:
val name: String = row.getAs[String]("name")
val b: Row = row.getAs[Row]("data")
The following seems to match what you're looking for.
scala> spark.version
res0: String = 2.3.0
scala> val a = A("a", B("b", 0))
a: A = A(a,B(b,0))
import org.apache.spark.sql.Encoders
val schema = Encoders.product[A].schema
scala> schema.printTreeString
root
|-- name: string (nullable = true)
|-- data: struct (nullable = true)
| |-- key: string (nullable = true)
| |-- value: integer (nullable = false)
val values = a.productIterator.toSeq.toArray
import org.apache.spark.sql.Row
import org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema
val row: Row = new GenericRowWithSchema(values, schema)
scala> val name: String = row.getAs[String]("name")
name: String = a
// the following won't work since B =!= Row
scala> val b: Row = row.getAs[Row]("data")
java.lang.ClassCastException: B cannot be cast to org.apache.spark.sql.Row
... 55 elided
Very short, but probably not the fastest, as it first creates a DataFrame and then collects it again:
import session.implicits._
val row = Seq(a).toDF().first()
@Jacek Laskowski's answer is great!
To complete it, here is some syntactic sugar:
val row = Row(a.productIterator.toSeq: _*)
And a recursive method if you happen to have nested case classes
def productToRow(product: Product): Row = {
  val sequence = product.productIterator.toSeq.map {
    case product: Product => productToRow(product)
    case e => e
  }
  Row(sequence: _*)
}
I don't think there exists a public API that can do this directly. Internally, Spark uses the Encoder.toRow method to convert objects to org.apache.spark.sql.catalyst.expressions.UnsafeRow, but this method is private. You could try to:
Obtain Encoder for the class:
val enc: Encoder[A] = ExpressionEncoder()
Use reflection to access toRow method and set it to accessible.
Call it to convert object to UnsafeRow.
Obtain RowEncoder for the expected schema (enc.schema).
Convert UnsafeRow to Row.
I haven't tried this, so I cannot guarantee it will work or not.
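A rough, untested sketch of those steps against the Spark 2.x internals (assuming the case class A from the question; ExpressionEncoder's toRow/fromRow are internal APIs that changed in Spark 3, so treat this purely as an illustration):

```scala
import org.apache.spark.sql.Row
import org.apache.spark.sql.catalyst.encoders.{ExpressionEncoder, RowEncoder}

// Internal API: serialize the object into Spark's internal row format...
val enc = ExpressionEncoder[A]().resolveAndBind()
val internalRow = enc.toRow(a)  // UnsafeRow, internal representation

// ...then deserialize it back out as an external Row with the same schema.
val rowEnc = RowEncoder(enc.schema).resolveAndBind()
val row: Row = rowEnc.fromRow(internalRow)
```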

Spark DataFrame not supporting Char datatype

I am creating a Spark DataFrame from a text file, say an Employee file which contains String, Int and Char fields.
I created a class:
case class Emp(
  Name: String,
  eid: Int,
  Age: Int,
  Sex: Char,
  Sal: Int,
  City: String)
I created textFileRDD1 using split, then created textFileRDD2:
val textFileRDD2 = textFileRDD1.map(attributes => Emp(
  attributes(0),
  attributes(1).toInt,
  attributes(2).toInt,
  attributes(3).charAt(0),
  attributes(4).toInt,
  attributes(5)))
And the final DataFrame as:
val finalRDD = textFileRDD2.toDF
When I create it, it throws the error:
java.lang.UnsupportedOperationException: No Encoder found for scala.Char
Can anyone help me understand why, and how to resolve it?
Spark SQL doesn't provide Encoders for Char and generic Encoders are not very useful.
You can either use a StringType:
attributes(3).slice(0, 1)
or a ShortType (or BooleanType / ByteType, if you accept only a binary response):
attributes(3)(0) match {
  case 'F' => 1: Short
  ...
  case _ => 0: Short
}
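Putting the StringType variant back into the original pipeline, a sketch (Sex is kept as a one-character String so Spark's product encoder can handle it):

```scala
// Sex is modelled as String because Spark SQL has no Char encoder.
case class Emp(
  Name: String,
  eid: Int,
  Age: Int,
  Sex: String,
  Sal: Int,
  City: String)

val textFileRDD2 = textFileRDD1.map(attributes => Emp(
  attributes(0),
  attributes(1).toInt,
  attributes(2).toInt,
  attributes(3).slice(0, 1),  // first character, kept as a String
  attributes(4).toInt,
  attributes(5)))

val finalDF = textFileRDD2.toDF
```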

How to convert datatypes in Spark SQL to a specific datatype, with the RDD result mapped to a specific class

I am reading a csv file and need to create an RDD with a schema.
I read the file by using sqlContext.csvFile:
val testfile = sqlContext.csvFile("file")
testfile.registerTempTable("testtable")
I wanted to pick some of the fields and return an RDD of those fields.
For example : class Test(ID: String, order_date: Date, Name: String, value: Double)
Using sqlContext.sql("SELECT col1, col2, col3, col4 FROM ..."):
val testfile = sqlContext.sql("SELECT col1, col2, col3, col4 FROM testtable").collect
testfile.getClass
Class[_ <: Array[org.apache.spark.sql.Row]] = class [Lorg.apache.spark.sql.Row;
So I wanted to change col1 to a double, col2 to a date, and col3 to a string.
Is there a way to do this in sqlContext.sql, or do I have to run a map function on the result and then turn it back into an RDD?
I tried to do it in one statement and I got this error:
val old_rdd : RDD[Test] = sqlContext.sql("SELECT col, col2, col3,col4 FROM testtable").collect.map(t => (t(0) : String ,dateFormat.parse(dateFormat.format(1)),t(2) : String, t(3) : Double))
The issue I am having is that the assignment does not result in RDD[Test], where Test is a defined class.
The error is saying that the map output is an Array, not an RDD:
found : Array[edu.model.Test]
[error] required: org.apache.spark.rdd.RDD[edu.model.Test]
Let's say you have a case class like this:
case class Test(
ID: String, order_date: java.sql.Date, Name: String, value: Double)
Since you load your data with csvFile with default parameters, it doesn't perform any schema inference, and your data is stored as plain strings. Let's assume that there are no other fields:
val df = sc.parallelize(
("ORD1", "2016-01-02", "foo", "2.23") ::
("ORD2", "2016-07-03", "bar", "9.99") :: Nil
).toDF("col1", "col2", "col3", "col4")
Your attempt to use map is wrong for more than one reason:
the function you use annotates individual values with incorrect types. Not only is Row.apply of type Int => Any, but your table also shouldn't contain any Double values at this point
since you collect (which doesn't make sense here), you fetch all the data to the driver and the result is a local Array, not an RDD
finally, even if all the previous issues were resolved, (String, Date, String, Double) is clearly not a Test
One way to handle this:
import org.apache.spark.sql.Row
import org.apache.spark.rdd.RDD
val casted = df.select(
$"col1".alias("ID"),
$"col2".cast("date").alias("order_date"),
$"col3".alias("name"),
$"col4".cast("double").alias("value")
)
val tests: RDD[Test] = casted.rdd.map {
case Row(id: String, date: java.sql.Date, name: String, value: Double) =>
Test(id, date, name, value)
}
You can also try to use the new Dataset API, but it is far from stable:
casted.as[Test].rdd
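For reference, on Spark 2.x the same pipeline can be written with the built-in CSV reader and the Dataset API end to end (the header option is an assumption about the file):

```scala
import spark.implicits._

// Assumption: the file has a header row; drop the option if it does not.
val df = spark.read
  .option("header", "true")
  .csv("file")

val tests: Dataset[Test] = df.select(
  $"col1".alias("ID"),
  $"col2".cast("date").alias("order_date"),
  $"col3".alias("Name"),
  $"col4".cast("double").alias("value")
).as[Test]
```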