I am writing a UDAF to be applied to a Spark DataFrame column of type Vector (spark.ml.linalg.Vector). I rely on the spark.ml.linalg package so that I do not have to go back and forth between DataFrames and RDDs.
Inside the UDAF, I have to specify a data type for the input, buffer, and output schemas:
def inputSchema = new StructType().add("features", new VectorUDT())
def bufferSchema: StructType =
  StructType(StructField("list_of_similarities", ArrayType(new VectorUDT(), true), true) :: Nil)
override def dataType: DataType = ArrayType(DoubleType, true)
VectorUDT is what I would use with spark.mllib.linalg.Vector:
https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/linalg/Vectors.scala
However, when I try to import it from spark.ml instead: import org.apache.spark.ml.linalg.VectorUDT
I get a runtime error (no errors during the build):
class VectorUDT in package linalg cannot be accessed in package org.apache.spark.ml.linalg
Is this expected? Can you suggest a workaround?
I am using Spark 2.0.0
In Spark 2.0.0, the proper way to go is to use org.apache.spark.ml.linalg.SQLDataTypes.VectorType instead of VectorUDT. It was introduced in this issue.
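For illustration, here is a minimal sketch of how the question's schemas could look with SQLDataTypes.VectorType; the class name and the initialize/update/merge/evaluate bodies are placeholders (they just collect and flatten the vectors), not the asker's actual similarity logic:
import org.apache.spark.ml.linalg.{SQLDataTypes, Vector}
import org.apache.spark.sql.Row
import org.apache.spark.sql.expressions.{MutableAggregationBuffer, UserDefinedAggregateFunction}
import org.apache.spark.sql.types._

class CollectVectors extends UserDefinedAggregateFunction {
  // SQLDataTypes.VectorType is the public handle to the (package-private) ml VectorUDT
  def inputSchema: StructType = new StructType().add("features", SQLDataTypes.VectorType)

  def bufferSchema: StructType =
    StructType(StructField("list_of_similarities",
      ArrayType(SQLDataTypes.VectorType, true), true) :: Nil)

  def dataType: DataType = ArrayType(DoubleType, true)
  def deterministic: Boolean = true

  // Placeholder aggregation: accumulate the input vectors in the buffer
  def initialize(buffer: MutableAggregationBuffer): Unit = buffer(0) = Seq.empty[Vector]

  def update(buffer: MutableAggregationBuffer, input: Row): Unit =
    buffer(0) = buffer.getSeq[Vector](0) :+ input.getAs[Vector](0)

  def merge(buffer1: MutableAggregationBuffer, buffer2: Row): Unit =
    buffer1(0) = buffer1.getSeq[Vector](0) ++ buffer2.getSeq[Vector](0)

  // Placeholder result: flatten the collected vectors into an array of doubles
  def evaluate(buffer: Row): Any = buffer.getSeq[Vector](0).flatMap(_.toArray)
}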
Related
I'm trying to add side-effect functionality to Spark's DataFrame by extending the DataFrame class with Scala's implicit class feature, because the "Dataset transform method" only allows returning a DataFrame.
From Wikipedia - "The term monkey patch ... referred to changing code sneakily – and possibly incompatibly with other such patches – at runtime"
In this post the writer warns against "Monkey Patching with Implicit Classes", but I'm not sure his claims are correct, because we are not changing any classes.
Is the following example potentially "monkey patching", and could it somehow be incompatible with future Spark versions? Or, since I'm not overwriting the existing DataFrame class but only extending it, can it do no harm?
import org.apache.spark.sql.{DataFrame, SaveMode}
import org.apache.spark.sql.functions.{col, get_json_object}

object dataFrameSql {
  implicit class DataFrameExSql(dataFrame: DataFrame) {
    def writeDFbyPartition(repartition: Int, output: String): Unit = {
      dataFrame
        .repartition(repartition)
        .write
        .option("partitionOverwriteMode", "dynamic")
        .mode(SaveMode.Overwrite)
        .parquet(output)
    }
  }
}
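For context, a minimal usage sketch of the above (the session setup, path, and partition count here are made up for illustration): the implicit class only needs to be in scope, and the compiler wraps the DataFrame rather than modifying the DataFrame class itself:
import org.apache.spark.sql.SparkSession
import dataFrameSql._ // brings the implicit class DataFrameExSql into scope

val spark = SparkSession.builder.appName("example").getOrCreate()
val df = spark.read.parquet("/tmp/events")

// Desugars to: new DataFrameExSql(df).writeDFbyPartition(8, "/tmp/events_by_partition")
df.writeDFbyPartition(8, "/tmp/events_by_partition")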
I am trying to get column data into a collection without the RDD map API (doing it the pure DataFrame way):
object CommonObject {
  def doSomething(...) {
    .......
    val releaseDate = tableDF
      .where(tableDF("item") <=> "releaseDate")
      .select("value")
      .map(r => r.getString(0))
      .collect
      .toList
      .head
  }
}
This all works except that Spark 2.3 complains:
No implicits found for parameter evidence$6: Encoder[String]
between map and collect
map(r => r.getString(0))(...).collect
I understand that adding
import spark.implicits._
before the call fixes it, but that requires a SparkSession instance.
It's pretty annoying, especially when there is no SparkSession instance available inside the method. As a Spark newbie, how do I nicely resolve the implicit Encoder parameter in this context?
You can always add a call to SparkSession.builder.getOrCreate() inside your method. Spark will find the already existing SparkSession and won't create a new one, so there is no performance impact. Then you can import its implicits, which work for all primitive types and case classes. This is the easiest way to add encoding. Alternatively, an explicit encoder can be supplied via the Encoders class (see the sketch after the code below).
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder
  .appName("name")
  .master("local[2]")
  .getOrCreate()

import spark.implicits._
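For the explicit-encoder alternative, a minimal sketch (reusing the question's tableDF and column names; the method name is mine) passes Encoders.STRING directly in map's second parameter list, so no implicit import is needed:
import org.apache.spark.sql.{DataFrame, Encoders}

def releaseDateOf(tableDF: DataFrame): String =
  tableDF.where(tableDF("item") <=> "releaseDate")
    .select("value")
    .map(r => r.getString(0))(Encoders.STRING) // encoder supplied explicitly
    .collect
    .head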
The other way is to get the SparkSession from the DataFrame itself via dataframe.sparkSession:
def dummy(df: DataFrame) = {
  val spark = df.sparkSession
  import spark.implicits._ // encoders for primitives and case classes are now in scope
}
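As a hedged sketch (the method name is mine), the question's releaseDate extraction can then live entirely inside a method that receives only the DataFrame:
import org.apache.spark.sql.DataFrame

def releaseDateFrom(tableDF: DataFrame): String = {
  val spark = tableDF.sparkSession
  import spark.implicits._ // provides the Encoder[String] needed by map
  tableDF.where(tableDF("item") <=> "releaseDate")
    .select("value")
    .map(_.getString(0))
    .collect
    .head
}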
I am still very new to Spark and Scala, but very familiar with Java. I have a Java jar with a function that returns a List (java.util.List) of Integers, but I want to convert these to a Spark Dataset so I can append it to another column and then perform a join. Is there any easy way to do this? I've tried things similar to this code:
val testDSArray : java.util.List[Integer] = new util.ArrayList[Integer]()
testDSArray.add(4)
testDSArray.add(7)
testDSArray.add(10)
val testDS : Dataset[Integer] = spark.createDataset(testDSArray, Encoders.INT())
but it gives me a compiler error (cannot resolve overloaded method).
If you look at the type signature, you will see that in Scala the encoder is passed in a second (and implicit) parameter list.
You may:
Pass it in another parameter list.
val testDS = spark.createDataset(testDSArray)(Encoders.INT)
Don't pass it, and let Scala's implicit mechanism resolve it.
import spark.implicits._
val testDS = spark.createDataset(testDSArray)
Convert the Java List to a Scala one first.
import collection.JavaConverters._
import spark.implicits._
val testDS = testDSArray.asScala.toDS()
I am using Spark 2.1 with Scala 2.11 on a Databricks notebook
What exactly is TimestampType?
We know from Spark SQL's documentation that the official timestamp type is TimestampType, which is apparently an alias for java.sql.Timestamp:
TimestampType can be found in the Spark SQL Scala API.
We see a difference in behavior between using a schema and using the Dataset API.
When parsing {"time":1469501297,"action":"Open"} from the Databricks' Scala Structured Streaming example
Using a JSON schema --> OK (though I do prefer the more elegant Dataset API):
val jsonSchema = new StructType().add("time", TimestampType).add("action", StringType)
val staticInputDF =
  spark
    .read
    .schema(jsonSchema)
    .json(inputPath)
Using the Dataset API --> KO: No Encoder found for TimestampType
Creating the Event class
import org.apache.spark.sql.types._
case class Event(action: String, time: TimestampType)
--> defined class Event
This errors out when reading the events from DBFS on Databricks.
Note: we don't get the error when using java.sql.Timestamp as the type for "time".
val path = "/databricks-datasets/structured-streaming/events/"
val events = spark.read.json(path).as[Event]
Error message
java.lang.UnsupportedOperationException: No Encoder found for org.apache.spark.sql.types.TimestampType
- field (class: "org.apache.spark.sql.types.TimestampType", name: "time")
- root class:
Combining the schema read method .schema(jsonSchema) with the as[Type] method, where the type uses java.sql.Timestamp, solves this issue. The idea came after reading the Structured Streaming documentation, Creating streaming DataFrames and streaming Datasets:
These examples generate streaming DataFrames that are untyped, meaning
that the schema of the DataFrame is not checked at compile time, only
checked at runtime when the query is submitted. Some operations like
map, flatMap, etc. need the type to be known at compile time. To do
those, you can convert these untyped streaming DataFrames to typed
streaming Datasets using the same methods as static DataFrame.
val path = "/databricks-datasets/structured-streaming/events/"
val jsonSchema = new StructType().add("time", TimestampType).add("action", StringType)
case class Event(action: String, time: java.sql.Timestamp)
val staticInputDS =
  spark
    .read
    .schema(jsonSchema)
    .json(path)
    .as[Event]
staticInputDS.printSchema
will output:
root
|-- time: timestamp (nullable = true)
|-- action: string (nullable = true)
TimestampType is not an alias for java.sql.Timestamp, but rather a representation of a timestamp type for Spark's internal usage. In general you don't want to use TimestampType in your code. The idea is that java.sql.Timestamp is supported by Spark SQL natively, so you can define your event class as follows:
case class Event(action: String, time: java.sql.Timestamp)
Internally, Spark will then use TimestampType to model the type of a value at runtime, when compiling and optimizing your query, but this is not something you're interested in most of the time.
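As a quick sanity check (a sketch assuming a shell or notebook session where a SparkSession named spark and its implicits are available; the epoch value is the one from the question's JSON sample), Spark derives a timestamp column from the java.sql.Timestamp field automatically:
import java.sql.Timestamp

case class Event(action: String, time: Timestamp)

import spark.implicits._ // assumes an existing SparkSession named spark

val ds = Seq(Event("Open", new Timestamp(1469501297L * 1000))).toDS()
ds.printSchema()
// root
//  |-- action: string (nullable = true)
//  |-- time: timestamp (nullable = true)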
This question already has answers here:
Encoder error while trying to map dataframe row to updated row
(4 answers)
Why is "Unable to find encoder for type stored in a Dataset" when creating a dataset of custom case class?
(3 answers)
Closed 5 years ago.
I keep getting the following compile time error:
Unable to find encoder for type stored in a Dataset.
Primitive types (Int, String, etc) and Product types (case classes)
are supported by importing spark.implicits._
Support for serializing other types will be added in future releases.
I've just upgraded from Spark v1.6 to v2.0.2, and a whole bunch of code using DataFrames is complaining about this error. The code where it complains looks like the following:
def doSomething(data: DataFrame): Unit = {
  data.flatMap(row => {
      ...
    })
    .reduceByKey(_ + _)
    .sortByKey(ascending = false)
}
Previous SO posts suggest to
pull out the case class (defined in object)
perform implicit imports
However, I don't have any case classes, because I am using DataFrame, which is equal to Dataset[Row]; I've also inlined the two implicit imports as follows, without any help in getting rid of this message:
val sparkSession: SparkSession = ???
val sqlContext: SQLContext = ???
import sparkSession.implicits._
import sqlContext.implicits._
Note that I've looked at the documentation for Dataset and Encoder. The docs say something like the following:
Scala
Encoders are generally created automatically through implicits from a
SparkSession, or can be explicitly created by calling static methods on
Encoders.
import spark.implicits._
val ds = Seq(1, 2, 3).toDS() // implicitly provided (spark.implicits.newIntEncoder)
However, my method doesn't have access to a SparkSession. Also, when I try the line import spark.implicits._, IntelliJ can't even find it. When I say that my DataFrame is a Dataset[Row], I really do mean it.
This question is marked as a possible duplicate, but please let me clarify.
I have no case class or business object associated.
I am using .flatMap while the other question is using .map
implicit imports do not seem to help
passing a RowEncoder produces a compile-time error, e.g. data.flatMap(row => { ... }, RowEncoder(data.schema)) fails with "too many arguments"
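For reference, the "too many arguments" error is consistent with the encoder living in a separate (implicit) parameter list, as noted in an earlier answer above; a hedged sketch with a placeholder flatMap body shows one way a RowEncoder could be supplied:
import org.apache.spark.sql.{DataFrame, Encoder, Row}
import org.apache.spark.sql.catalyst.encoders.RowEncoder

def doSomethingSketch(data: DataFrame): Unit = {
  // RowEncoder(data.schema) builds an Encoder[Row]; making it implicit lets
  // flatMap pick it up from its second (implicit) parameter list.
  implicit val rowEncoder: Encoder[Row] = RowEncoder(data.schema)
  data.flatMap(row => Seq(row)).show() // placeholder body, not the original logic
}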
I'm reading the other posts, and let me add: I guess I don't know how this new Spark 2.0 Dataset/DataFrame API is supposed to work. In the Spark shell, the code below works. Note that I start the Spark shell like this: $SPARK_HOME/bin/spark-shell --packages com.databricks:spark-csv_2.10:1.5.0
val schema = StructType(Array(
  StructField("x1", StringType, true),
  StructField("x2", StringType, true),
  StructField("x3", StringType, true),
  StructField("x4", StringType, true),
  StructField("x5", StringType, true)))

val df = sqlContext.read
  .format("com.databricks.spark.csv")
  .option("header", "true")
  .schema(schema)
  .load("/Users/jwayne/Downloads/mydata.csv")

df.columns.map(col => {
    df.groupBy(col)
      .count()
      .map(_.getString(0))
      .collect()
      .toList
  })
  .toList
However, when I run this as part of a unit test, I get the same "unable to find encoder" error. Why does this work in the shell but not in my unit tests?
In the shell, I typed :imports and :implicits and placed the resulting imports in my Scala source files, but that doesn't help either.