cas Exception:scala.collection.mutable.WrappedArray$ofRef cannot be cast to [D - scala

throwing ClassCastExpection when applying knn classifier
val df = training.map{ r =>
(Vectors.dense(r.getAs[Array[Double]]("features")),r.getAs[Int]("id"))
}.toDF("features","id")
error appear
scala.collection.mutable.WrappedArray$ofRef cannot be cast to [D
I try Seq, WrappedArray but does't work.

I am going to assume the following schema for training:
id:Integer
features: Array[Double]
Try:
val df = training.map(r => (Vectors.dense(r.getAs[Seq[Double]]("features").toArray),r.getAs[Integer]("id"))).toDF("features","id")
Datasets internally store Array objects are WrappedArray, a quick intro of which can be found here.
Array vs Wrapped array
So, you should "extract" your array of doubles by casting it to Seq[Double] instead of Array[Double]. However, the method dense needs Array[Double]. So, convert the Seq[Double] to Array[Double] using the toArray method.
val training = List((Seq(0.0,0.0),2),(Seq(1.0,1.0),5)).toDF("features","id")
training.show
+----------+---+
| features| id|
+----------+---+
|[0.0, 0.0]| 2|
|[1.0, 1.0]| 5|
+----------+---+
training: org.apache.spark.sql.DataFrame = [features: array<double>, id: int]
val df = training.map(r => (Vectors.dense(r.getAs[Seq[Double]]("training").toArray),r.getAs[Integer]("id"))).toDF("features","id")
df.show
+---------+---+
| features| id|
+---------+---+
|[0.0,0.0]| 2|
|[1.0,1.0]| 5|
+---------+---+
df: org.apache.spark.sql.DataFrame = [features: vector, id: int]
Hope this helps.

Related

How to add a new column to my DataFrame such that values of new column are populated by some other function in scala?

myFunc(Row): String = {
//process row
//returns string
}
appendNewCol(inputDF : DataFrame) : DataFrame ={
inputDF.withColumn("newcol",myFunc(Row))
inputDF
}
But no new column got created in my case. My myFunc passes this row to a knowledgebasesession object and that returns a string after firing rules. Can I do it this way? If not, what is the right way? Thanks in advance.
I saw many StackOverflow solutions using expr() sqlfunc(col(udf(x)) and other techniques but here my newcol is not derived directly from existing column.
Dataframe:
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{StringType, StructField, StructType}
val myFunc = (r: Row) => {r.getAs[String]("col1") + "xyz"} // example transformation
val testDf = spark.sparkContext.parallelize(Seq(
(1, "abc"), (2, "def"), (3, "ghi"))).toDF("id", "col1")
testDf.show
val rddRes = testDf
.rdd
.map{x =>
val y = myFunc (x)
Row.fromSeq (x.toSeq ++ Seq(y) )
}
val newSchema = StructType(testDf.schema.fields ++ Array(StructField("col2", dataType =StringType, nullable =false)))
spark.sqlContext.createDataFrame(rddRes, newSchema).show
Results:
+---+----+
| id|col1|
+---+----+
| 1| abc|
| 2| def|
| 3| ghi|
+---+----+
+---+----+------+
| id|col1| col2|
+---+----+------+
| 1| abc|abcxyz|
| 2| def|defxyz|
| 3| ghi|ghixyz|
+---+----+------+
With Dataset:
case class testData(id: Int, col1: String)
case class transformedData(id: Int, col1: String, col2: String)
val test: Dataset[testData] = List(testData(1, "abc"), testData(2, "def"), testData(3, "ghi")).toDS
val transformedData: Dataset[transformedData] = test
.map { x: testData =>
val newCol = x.col1 + "xyz"
transformedData(x.id, x.col1, newCol)
}
transformedData.show
As you can see datasets is more readable, plus provides strong type casting.
Since I'm unaware of your spark version, providing both solutions here. However if you're using spark v>=1.6, you should look into Datasets. Playing with rdd is fun, but can quickly devolve into longer job runs and a host of other issues that you wont foresee

mock spark column functions in scala

My code is using monotonically_increasing_id function is scala
val df = List(("oleg"), ("maxim")).toDF("first_name")
.withColumn("row_id", monotonically_increasing_id)
I want to mock it in my unit test so that it returns integers 0, 1, 2, 3, ...
In my spark-shell it returns the desired result.
scala> df.show
+----------+------+
|first_name|row_id|
+----------+------+
| oleg| 0|
| maxim| 1|
+----------+------+
But in my scala applications the results are different.
How can I mock column functions?
Mocking such a function so that it produces a sequence is not simple. Indeed, spark is a parallel computing engine and accessing the data in sequence is therefore complicated.
Here is a solution you could try.
Let's define a function that zips a dataframe:
def zip(df : DataFrame, name : String) = {
df.withColumn(name, monotonically_increasing_id)
}
Then let's rewrite the function we want to test using this zip function by default:
def fun(df : DataFrame,
zipFun : (DataFrame, String) => DataFrame = zip) : DataFrame = {
zipFun(df, "id_row")
}
// let 's see what it does
fun(spark.range(5).toDF).show()
+---+----------+
| id| id_row|
+---+----------+
| 0| 0|
| 1| 1|
| 2|8589934592|
| 3|8589934593|
| 4|8589934594|
+---+----------+
It's the same as before, let's write a new function that uses zipWithIndex from the RDD API. It's a bit tedious because we have to go back and forth between the two APIs.
def zip2(df : DataFrame, name : String) = {
val rdd = df.rdd.zipWithIndex
.map{ case (row, i) => Row.fromSeq(row.toSeq :+ i) }
val newSchema = df.schema.add(StructField(name, LongType, false))
df.sparkSession.createDataFrame(rdd, newSchema)
}
fun(spark.range(5).toDF, zip2)
+---+------+
| id|id_row|
+---+------+
| 0| 0|
| 1| 1|
| 2| 2|
| 3| 3|
| 4| 4|
+---+------+
You can adapt zip2, for instance multiplying i by 2, to get what you want.
Based on answer from #Oli I came up with the following workaround:
val df = List(("oleg"), ("maxim")).toDF("first_name")
.withColumn("row_id", monotonically_increasing_id)
.withColumn("test_id", row_number().over(Window.orderBy("row_id")))
It solves my problem but I'm still interested in mocking column functions.
I mock my spark functions with this code :
val s = typedLit[Timestamp](Timestamp.valueOf("2021-05-07 15:00:46.394"))
implicit val ds = DefaultAnswer(CALLS_REAL_METHODS)
withObjectMocked[functions.type] {
when(functions.current_timestamp()).thenReturn(s)
// spark logic
}

How to get the name of a Spark Column as String?

I want to write a method to round a numeric column without doing something like:
df
.select(round($"x",2).as("x"))
Therefore I need to have a reusable column-expression like:
def roundKeepName(c:Column,scale:Int) = round(c,scale).as(c.name)
Unfortunately c.name does not exist, therefore the above code does not compile. I've found a solution for ColumName:
def roundKeepName(c:ColumnName,scale:Int) = round(c,scale).as(c.string.name)
But how can I do that with Column (which is generated if I use col("x") instead of $"x")
Not sure if the question has really been answered. Your function could be implemented like this (toString returns the name of the column):
def roundKeepname(c:Column,scale:Int) = round(c,scale).as(c.toString)
In case you don't like relying on toString, here is a more robust version. You can rely on the underlying expression, cast it to a NamedExpression and take its name.
import org.apache.spark.sql.catalyst.expressions.NamedExpression
def roundKeepname(c:Column,scale:Int) =
c.expr.asInstanceOf[NamedExpression].name
And it works:
scala> spark.range(2).select(roundKeepname('id, 2)).show
+---+
| id|
+---+
| 0|
| 1|
+---+
EDIT
Finally, if that's OK for you to use the name of the column instead of the Column object, you can change the signature of the function and that yields a much simpler implementation:
def roundKeepName(columnName:String, scale:Int) =
round(col(columnName),scale).as(columnName)
Update:
With the solution way given by BlueSheepToken, here is how you can do it dynamically assuming you have all "double" columns.
scala> val df = Seq((1.22,4.34,8.93),(3.44,12.66,17.44),(5.66,9.35,6.54)).toDF("x","y","z")
df: org.apache.spark.sql.DataFrame = [x: double, y: double ... 1 more field]
scala> df.show
+----+-----+-----+
| x| y| z|
+----+-----+-----+
|1.22| 4.34| 8.93|
|3.44|12.66|17.44|
|5.66| 9.35| 6.54|
+----+-----+-----+
scala> df.columns.foldLeft(df)( (acc,p) => (acc.withColumn(p+"_t",round(col(p),1)).drop(p).withColumnRenamed(p+"_t",p))).show
+---+----+----+
| x| y| z|
+---+----+----+
|1.2| 4.3| 8.9|
|3.4|12.7|17.4|
|5.7| 9.4| 6.5|
+---+----+----+
scala>

Trying to create dataframe with two columns [Seq(), String] - Spark

When I run the following on the spark-shell, I get a dataframe:
scala> val df = Seq(Array(1,2)).toDF("a")
scala> df.show(false)
+------+
|a |
+------+
|[1, 2]|
+------+
But when I run the following to create a dataframe with two columns:
scala> val df1 = Seq(Seq(Array(1,2)),"jf").toDF("a","b")
<console>:23: error: value toDF is not a member of Seq[Object]
val df1 = Seq(Seq(Array(1,2)),"jf").toDF("a","b")
I get the error:
Value toDF is not a member of Seq[Object].
How do I go about this? Is toDF only supported for sequences with primitive datatypes?
You need a Seq of Tuple for the toDF method to work:
val df1 = Seq((Array(1,2),"jf")).toDF("a","b")
// df1: org.apache.spark.sql.DataFrame = [a: array<int>, b: string]
df1.show
+------+---+
| a| b|
+------+---+
|[1, 2]| jf|
+------+---+
Add more tuples for more rows:
val df1 = Seq((Array(1,2),"jf"), (Array(2), "ab")).toDF("a","b")
// df1: org.apache.spark.sql.DataFrame = [a: array<int>, b: string]
df1.show
+------+---+
| a| b|
+------+---+
|[1, 2]| jf|
| [2]| ab|
+------+---+

Spark - How to convert map function output (Row,Row) tuple to one Dataframe

I need to write one scenario in Spark using Scala API.
I am passing a user defined function to a Dataframe which processes each row of data frame one by one and returns tuple(Row, Row). How can i change RDD ( Row, Row) to Dataframe (Row)? See below code sample -
**Calling map function-**
val df_temp = df_outPut.map { x => AddUDF.add(x,date1,date2)}
**UDF definition.**
def add(x: Row,dates: String*): (Row,Row) = {
......................
........................
var result1,result2:Row = Row()
..........
return (result1,result2)
Now df_temp is a RDD(Row1, Row2). my requirement is to make it one RDD or Dataframe by breaking tuple elements to 1 record of RDD or Dataframe
RDD(Row). Appreciate your help.
You can use flatMap to flatten your Row tuples, say if we start from this example rdd:
rddExample.collect()
// res37: Array[(org.apache.spark.sql.Row, org.apache.spark.sql.Row)] = Array(([1,2],[3,4]), ([2,1],[4,2]))
val flatRdd = rddExample.flatMap{ case (x, y) => List(x, y) }
// flatRdd: org.apache.spark.rdd.RDD[org.apache.spark.sql.Row] = MapPartitionsRDD[45] at flatMap at <console>:35
To convert it to data frame.
import org.apache.spark.sql.types.{StructType, StructField, IntegerType}
val schema = StructType(StructField("x", IntegerType, true)::
StructField("y", IntegerType, true)::Nil)
val df = sqlContext.createDataFrame(flatRdd, schema)
df.show
+---+---+
| x| y|
+---+---+
| 1| 2|
| 3| 4|
| 2| 1|
| 4| 2|
+---+---+