cas Exception:scala.collection.mutable.WrappedArray$ofRef cannot be cast to [D

cas Exception:scala.collection.mutable.WrappedArray$ofRef cannot be cast to [D - scala

throwing ClassCastExpection when applying knn classifier
val df = training.map{ r =>
(Vectors.dense(r.getAs[Array[Double]]("features")),r.getAs[Int]("id"))
}.toDF("features","id")
error appear
scala.collection.mutable.WrappedArray$ofRef cannot be cast to [D
I try Seq, WrappedArray but does't work.

I am going to assume the following schema for training:
id:Integer
features: Array[Double]
Try:
val df = training.map(r => (Vectors.dense(r.getAs[Seq[Double]]("features").toArray),r.getAs[Integer]("id"))).toDF("features","id")
Datasets internally store Array objects are WrappedArray, a quick intro of which can be found here.
Array vs Wrapped array
So, you should "extract" your array of doubles by casting it to Seq[Double] instead of Array[Double]. However, the method dense needs Array[Double]. So, convert the Seq[Double] to Array[Double] using the toArray method.
val training = List((Seq(0.0,0.0),2),(Seq(1.0,1.0),5)).toDF("features","id")
training.show
+----------+---+
| features| id|
+----------+---+
|[0.0, 0.0]| 2|
|[1.0, 1.0]| 5|
+----------+---+
training: org.apache.spark.sql.DataFrame = [features: array<double>, id: int]
val df = training.map(r => (Vectors.dense(r.getAs[Seq[Double]]("training").toArray),r.getAs[Integer]("id"))).toDF("features","id")
df.show
+---------+---+
| features| id|
+---------+---+
|[0.0,0.0]| 2|
|[1.0,1.0]| 5|
+---------+---+
df: org.apache.spark.sql.DataFrame = [features: vector, id: int]
Hope this helps.

Related

How to add a new column to my DataFrame such that values of new column are populated by some other function in scala?

myFunc(Row): String = {
//process row
//returns string
}
appendNewCol(inputDF : DataFrame) : DataFrame ={
inputDF.withColumn("newcol",myFunc(Row))
inputDF
}
But no new column got created in my case. My myFunc passes this row to a knowledgebasesession object and that returns a string after firing rules. Can I do it this way? If not, what is the right way? Thanks in advance.
I saw many StackOverflow solutions using expr() sqlfunc(col(udf(x)) and other techniques but here my newcol is not derived directly from existing column.

Dataframe:
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{StringType, StructField, StructType}
val myFunc = (r: Row) => {r.getAs[String]("col1") + "xyz"} // example transformation
val testDf = spark.sparkContext.parallelize(Seq(
(1, "abc"), (2, "def"), (3, "ghi"))).toDF("id", "col1")
testDf.show
val rddRes = testDf
.rdd
.map{x =>
val y = myFunc (x)
Row.fromSeq (x.toSeq ++ Seq(y) )
}
val newSchema = StructType(testDf.schema.fields ++ Array(StructField("col2", dataType =StringType, nullable =false)))
spark.sqlContext.createDataFrame(rddRes, newSchema).show
Results:
+---+----+
| id|col1|
+---+----+
| 1| abc|
| 2| def|
| 3| ghi|
+---+----+
+---+----+------+
| id|col1| col2|
+---+----+------+
| 1| abc|abcxyz|
| 2| def|defxyz|
| 3| ghi|ghixyz|
+---+----+------+
With Dataset:
case class testData(id: Int, col1: String)
case class transformedData(id: Int, col1: String, col2: String)
val test: Dataset[testData] = List(testData(1, "abc"), testData(2, "def"), testData(3, "ghi")).toDS
val transformedData: Dataset[transformedData] = test
.map { x: testData =>
val newCol = x.col1 + "xyz"
transformedData(x.id, x.col1, newCol)
}
transformedData.show
As you can see datasets is more readable, plus provides strong type casting.
Since I'm unaware of your spark version, providing both solutions here. However if you're using spark v>=1.6, you should look into Datasets. Playing with rdd is fun, but can quickly devolve into longer job runs and a host of other issues that you wont foresee

mock spark column functions in scala

My code is using monotonically_increasing_id function is scala
val df = List(("oleg"), ("maxim")).toDF("first_name")
.withColumn("row_id", monotonically_increasing_id)
I want to mock it in my unit test so that it returns integers 0, 1, 2, 3, ...
In my spark-shell it returns the desired result.
scala> df.show
+----------+------+
|first_name|row_id|
+----------+------+
| oleg| 0|
| maxim| 1|
+----------+------+
But in my scala applications the results are different.
How can I mock column functions?

Mocking such a function so that it produces a sequence is not simple. Indeed, spark is a parallel computing engine and accessing the data in sequence is therefore complicated.
Here is a solution you could try.
Let's define a function that zips a dataframe:
def zip(df : DataFrame, name : String) = {
df.withColumn(name, monotonically_increasing_id)
}
Then let's rewrite the function we want to test using this zip function by default:
def fun(df : DataFrame,
zipFun : (DataFrame, String) => DataFrame = zip) : DataFrame = {
zipFun(df, "id_row")
}
// let 's see what it does
fun(spark.range(5).toDF).show()
+---+----------+
| id| id_row|
+---+----------+
| 0| 0|
| 1| 1|
| 2|8589934592|
| 3|8589934593|
| 4|8589934594|
+---+----------+
It's the same as before, let's write a new function that uses zipWithIndex from the RDD API. It's a bit tedious because we have to go back and forth between the two APIs.
def zip2(df : DataFrame, name : String) = {
val rdd = df.rdd.zipWithIndex
.map{ case (row, i) => Row.fromSeq(row.toSeq :+ i) }
val newSchema = df.schema.add(StructField(name, LongType, false))
df.sparkSession.createDataFrame(rdd, newSchema)
}
fun(spark.range(5).toDF, zip2)
+---+------+
| id|id_row|
+---+------+
| 0| 0|
| 1| 1|
| 2| 2|
| 3| 3|
| 4| 4|
+---+------+
You can adapt zip2, for instance multiplying i by 2, to get what you want.

Based on answer from #Oli I came up with the following workaround:
val df = List(("oleg"), ("maxim")).toDF("first_name")
.withColumn("row_id", monotonically_increasing_id)
.withColumn("test_id", row_number().over(Window.orderBy("row_id")))
It solves my problem but I'm still interested in mocking column functions.

I mock my spark functions with this code :
val s = typedLit[Timestamp](Timestamp.valueOf("2021-05-07 15:00:46.394"))
implicit val ds = DefaultAnswer(CALLS_REAL_METHODS)
withObjectMocked[functions.type] {
when(functions.current_timestamp()).thenReturn(s)
// spark logic
}

How to get the name of a Spark Column as String?

I want to write a method to round a numeric column without doing something like:
df
.select(round($"x",2).as("x"))
Therefore I need to have a reusable column-expression like:
def roundKeepName(c:Column,scale:Int) = round(c,scale).as(c.name)
Unfortunately c.name does not exist, therefore the above code does not compile. I've found a solution for ColumName:
def roundKeepName(c:ColumnName,scale:Int) = round(c,scale).as(c.string.name)
But how can I do that with Column (which is generated if I use col("x") instead of $"x")

Not sure if the question has really been answered. Your function could be implemented like this (toString returns the name of the column):
def roundKeepname(c:Column,scale:Int) = round(c,scale).as(c.toString)
In case you don't like relying on toString, here is a more robust version. You can rely on the underlying expression, cast it to a NamedExpression and take its name.
import org.apache.spark.sql.catalyst.expressions.NamedExpression
def roundKeepname(c:Column,scale:Int) =
c.expr.asInstanceOf[NamedExpression].name
And it works:
scala> spark.range(2).select(roundKeepname('id, 2)).show
+---+
| id|
+---+
| 0|
| 1|
+---+
EDIT
Finally, if that's OK for you to use the name of the column instead of the Column object, you can change the signature of the function and that yields a much simpler implementation:
def roundKeepName(columnName:String, scale:Int) =
round(col(columnName),scale).as(columnName)

Update:
With the solution way given by BlueSheepToken, here is how you can do it dynamically assuming you have all "double" columns.
scala> val df = Seq((1.22,4.34,8.93),(3.44,12.66,17.44),(5.66,9.35,6.54)).toDF("x","y","z")
df: org.apache.spark.sql.DataFrame = [x: double, y: double ... 1 more field]
scala> df.show
+----+-----+-----+
| x| y| z|
+----+-----+-----+
|1.22| 4.34| 8.93|
|3.44|12.66|17.44|
|5.66| 9.35| 6.54|
+----+-----+-----+
scala> df.columns.foldLeft(df)( (acc,p) => (acc.withColumn(p+"_t",round(col(p),1)).drop(p).withColumnRenamed(p+"_t",p))).show
+---+----+----+
| x| y| z|
+---+----+----+
|1.2| 4.3| 8.9|
|3.4|12.7|17.4|
|5.7| 9.4| 6.5|
+---+----+----+
scala>

Trying to create dataframe with two columns [Seq(), String] - Spark

When I run the following on the spark-shell, I get a dataframe:
scala> val df = Seq(Array(1,2)).toDF("a")
scala> df.show(false)
+------+
|a |
+------+
|[1, 2]|
+------+
But when I run the following to create a dataframe with two columns:
scala> val df1 = Seq(Seq(Array(1,2)),"jf").toDF("a","b")
<console>:23: error: value toDF is not a member of Seq[Object]
val df1 = Seq(Seq(Array(1,2)),"jf").toDF("a","b")
I get the error:
Value toDF is not a member of Seq[Object].
How do I go about this? Is toDF only supported for sequences with primitive datatypes?

You need a Seq of Tuple for the toDF method to work:
val df1 = Seq((Array(1,2),"jf")).toDF("a","b")
// df1: org.apache.spark.sql.DataFrame = [a: array<int>, b: string]
df1.show
+------+---+
| a| b|
+------+---+
|[1, 2]| jf|
+------+---+
Add more tuples for more rows:
val df1 = Seq((Array(1,2),"jf"), (Array(2), "ab")).toDF("a","b")
// df1: org.apache.spark.sql.DataFrame = [a: array<int>, b: string]
df1.show
+------+---+
| a| b|
+------+---+
|[1, 2]| jf|
| [2]| ab|
+------+---+

Spark - How to convert map function output (Row,Row) tuple to one Dataframe

I need to write one scenario in Spark using Scala API.
I am passing a user defined function to a Dataframe which processes each row of data frame one by one and returns tuple(Row, Row). How can i change RDD ( Row, Row) to Dataframe (Row)? See below code sample -
**Calling map function-**
val df_temp = df_outPut.map { x => AddUDF.add(x,date1,date2)}
**UDF definition.**
def add(x: Row,dates: String*): (Row,Row) = {
......................
........................
var result1,result2:Row = Row()
..........
return (result1,result2)
Now df_temp is a RDD(Row1, Row2). my requirement is to make it one RDD or Dataframe by breaking tuple elements to 1 record of RDD or Dataframe
RDD(Row). Appreciate your help.

You can use flatMap to flatten your Row tuples, say if we start from this example rdd:
rddExample.collect()
// res37: Array[(org.apache.spark.sql.Row, org.apache.spark.sql.Row)] = Array(([1,2],[3,4]), ([2,1],[4,2]))
val flatRdd = rddExample.flatMap{ case (x, y) => List(x, y) }
// flatRdd: org.apache.spark.rdd.RDD[org.apache.spark.sql.Row] = MapPartitionsRDD[45] at flatMap at <console>:35
To convert it to data frame.
import org.apache.spark.sql.types.{StructType, StructField, IntegerType}
val schema = StructType(StructField("x", IntegerType, true)::
StructField("y", IntegerType, true)::Nil)
val df = sqlContext.createDataFrame(flatRdd, schema)
df.show
+---+---+
| x| y|
+---+---+
| 1| 2|
| 3| 4|
| 2| 1|
| 4| 2|
+---+---+

We Keep Coding

iphone swift flutter scala powershell matlab mongodb postgresql perl eclipse

cas Exception:scala.collection.mutable.WrappedArray$ofRef cannot be cast to [D - scala

throwing ClassCastExpection when applying knn classifier val df = training.map{ r => (Vectors.dense(r.getAs[Array[Double]]("features")),r.getAs[Int]("id")) }.toDF("features","id") error appear scala.collection.mutable.WrappedArray$ofRef cannot be cast to [D I try Seq, WrappedArray but does't work.

Related

How to add a new column to my DataFrame such that values of new column are populated by some other function in scala?

mock spark column functions in scala

How to get the name of a Spark Column as String?

Trying to create dataframe with two columns [Seq(), String] - Spark

Spark - How to convert map function output (Row,Row) tuple to one Dataframe

Categories

Resources