This question already has answers here:
Spark 2.0 Dataset vs DataFrame
What is the advantage of using a case class with a Spark DataFrame? I can define the schema using the "inferSchema" option or define StructType fields.
I referred to https://docs.scala-lang.org/tour/case-classes.html but could not understand what the advantages of using a case class are, apart from generating the schema via reflection.
inferSchema can be an expensive operation and defers error detection unnecessarily. Consider the following pseudocode:
val df = loadDFWithSchemaInference
//doing things that take time
df.map(row => row.getAs[String]("fieldName")) // more stuff
Now in this code you already have the assumption baked in that fieldName is of type String, but it is only expressed and checked late in your processing, leading to unfortunate errors if it wasn't actually a String.
Now if you did this instead:
val df = load.as[CaseClass]
or
val df = load.option("schema", predefinedSchema)
then the fact that fieldName is a String is a precondition, and your code will be more robust and less error-prone.
Schema inference is very handy for exploratory work in the REPL or in e.g. Zeppelin, but it should not be used in operational code.
Edit Addendum:
I personally prefer to use case classes over schemas because I prefer the Dataset API to the Dataframe API (which is Dataset[Row]) for similar robustness reasons.
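For illustration, here is a minimal sketch of the case-class approach; the record type, field names, and input path are made up for the example:

import org.apache.spark.sql.{Encoders, SparkSession}

// Hypothetical record type; the fields are assumptions for this sketch.
case class Record(fieldName: String, count: Long)

val spark = SparkSession.builder().appName("case-class-example").getOrCreate()
import spark.implicits._

val ds = spark.read
  .schema(Encoders.product[Record].schema) // reuse the case class as the schema: no inference pass
  .json("/path/to/input")                  // hypothetical input path
  .as[Record]                              // typed Dataset[Record]

ds.map(_.fieldName)                        // field name and type are checked at compile time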
Related
I am teaching myself Scala (so as to use it with Apache Spark) and wanted to know if there is some way to chain a series of transformations on a Spark DataFrame. E.g. let's assume we have a list of transformations
l: List[(String, String)] = List(("field1", "nonEmpty"), ("field2", "notNull"))
and a Spark DataFrame
df, such that the desired result would be
df.filter(df("field1") =!= "").filter(df("field2").isNotNull).
I was thinking perhaps this could be done using function composition or list folding or something, but I really don't know how. Any help would be greatly appreciated.
Thanks!
Yes, it is perfectly possible, but it depends on what you really want. Spark provides Pipelines, which let you compose your transformations into a pipeline that can be serialized. You can create your own custom transformers (here an example) and include your "filter" stages in custom transformations that you can reuse later, for example in Spark Structured Streaming.
Another option is to use Spark Datasets and the transform API, which feels more functional and elegant.
Scala gives you a lot of ways to build your own API, but take a look at these approaches first.
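For concreteness, a minimal sketch of the transform approach, assuming df is the DataFrame from the question; the helper names are made up:

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col

// Hypothetical reusable transformations, each one a DataFrame => DataFrame.
def nonEmpty(field: String)(df: DataFrame): DataFrame = df.filter(col(field) =!= "")
def notNull(field: String)(df: DataFrame): DataFrame = df.filter(col(field).isNotNull)

// Chained with Dataset.transform, which keeps the pipeline readable and composable.
val result = df
  .transform(nonEmpty("field1"))
  .transform(notNull("field2"))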
Yes, you can fold over an existing DataFrame. You can keep all the predicate columns in a list and not bother with other intermediary types:
import org.apache.spark.sql.functions.col

val df = ???

val columns = List(
  col("1") =!= "",
  col("2").isNotNull,
  col("3") > 10
)

val filtered = columns.foldLeft(df)((df, col) => df.filter(col))
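Equivalently, since the elements of columns are plain Column predicates, they can be combined into a single filter call:

// Same result as the fold above: AND all predicates together and filter once.
df.filter(columns.reduce(_ && _))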
I am using Apache Spark 2.1.2 and I want to use Latent Dirichlet allocation (LDA).
Previously I was using the org.apache.spark.mllib package and could run this without any problems, but now, after switching to spark.ml, I am getting an error.
val lda = new LDA().setK(numTopics).setMaxIter(numIterations)
val docs = spark.createDataset(documents)
val ldaModel = lda.fit(docs)
As you may have noticed, I'm converting the documents RDD to a Dataset object and am not sure whether this is the correct way of doing it.
On the last line, with .fit, I am getting the following error:
java.lang.IllegalArgumentException: Field "features" does not exist.
My docs dataset looks like this:
scala> docs.take(2)
res28: Array[(Long, org.apache.spark.ml.linalg.Vector)] = Array((0,(7336,[1,2,4,5,12,13,19,24,26,42,48,49,57,59,63,73,81,89,99,106,113,114,141,151,157,160,177,181,198,261,266,267,272,297,307,314,315,359,383,385,410,416,422,468,471,527,564,629,717,744,763,837,890,928,932,951,961,1042,1134,1174,1305,1604,1653,1850,2119,2159,2418,2634,2836,3002,3132,3594,4103,4316,4852,5065,5107,5632,5945,6378,6597,6658],[1.0,1.0,1.0.......
My previous documents before converting them to a dataset:
documents: org.apache.spark.rdd.RDD[(Long, org.apache.spark.ml.linalg.Vector)] = MapPartitionsRDD[2520]
How to get rid of the error above?
The main difference between spark.mllib and spark.ml is that spark.ml operates on DataFrames (or Datasets), while mllib operates directly on RDDs with a rigidly defined structure.
You don't need to do much to make your code work with spark.ml, but I'd still suggest going through the documentation to understand the differences, because you will run into more and more of them as you shift towards spark.ml. A good starting page with all the basics is https://spark.apache.org/docs/2.1.0/ml-pipeline.html.
As for your code, all that is needed is to give each column a correct name and it should work just fine. Probably the easiest way to do so is to use the implicit toDF method on the underlying RDD:
import spark.implicits._
val lda = new LDA().setK(numTopics).setMaxIter(numIterations)
val docs = documents.toDF("label", "features")
val ldaModel = lda.fit(docs)
Background
My original question here was Why using DecisionTreeModel.predict inside map function raises an exception? and it is related to How to generate tuples of (original label, predicted label) on Spark with MLlib?
When we use the Scala API, the recommended way of getting predictions for an RDD[LabeledPoint] using a DecisionTreeModel is to simply map over the RDD:
val labelAndPreds = testData.map { point =>
  val prediction = model.predict(point.features)
  (point.label, prediction)
}
Unfortunately, a similar approach in PySpark doesn't work so well:
labelsAndPredictions = testData.map(
    lambda lp: (lp.label, model.predict(lp.features)))
labelsAndPredictions.first()
Exception: It appears that you are attempting to reference SparkContext from a broadcast variable, action, or transformation. SparkContext can only be used on the driver, not in code that it run on workers. For more information, see SPARK-5063.
Instead, the official documentation recommends something like this:
predictions = model.predict(testData.map(lambda x: x.features))
labelsAndPredictions = testData.map(lambda lp: lp.label).zip(predictions)
So what is going on here? There is no broadcast variable here, and the Scala API defines predict as follows:
/**
 * Predict values for a single data point using the model trained.
 *
 * @param features array representing a single data point
 * @return Double prediction from the trained model
 */
def predict(features: Vector): Double = {
  topNode.predict(features)
}

/**
 * Predict values for the given data set using the model trained.
 *
 * @param features RDD representing data points to be predicted
 * @return RDD of predictions for each of the given data points
 */
def predict(features: RDD[Vector]): RDD[Double] = {
  features.map(x => predict(x))
}
So, at least at first glance, calling predict from an action or transformation is not a problem, since prediction seems to be a local operation.
Explanation
After some digging I figured out that the source of the problem is the JavaModelWrapper.call method invoked from DecisionTreeModel.predict. It accesses the SparkContext, which is required to call the Java function:
callJavaFunc(self._sc, getattr(self._java_model, name), *a)
Question
In the case of DecisionTreeModel.predict there is a recommended workaround and all the required code is already part of the Scala API, but is there any elegant way to handle problems like this in general?
The only solutions I can think of right now are rather heavyweight:
pushing everything down to JVM either by extending Spark classes through Implicit Conversions or adding some kind of wrappers
using Py4j gateway directly
Communication using the default Py4J gateway is simply not possible. To understand why, we have to take a look at the architecture diagram in the PySpark Internals document [1].
Since the Py4J gateway runs on the driver, it is not accessible to the Python interpreters, which communicate with the JVM workers through sockets (see for example PythonRDD / rdd.py).
In theory it would be possible to create a separate Py4J gateway for each worker, but in practice it is unlikely to be useful. Ignoring issues like reliability, Py4J is simply not designed to perform data-intensive tasks.
Are there any workarounds?
Using Spark SQL Data Sources API to wrap JVM code.
Pros: Supported, high level, doesn't require access to the internal PySpark API
Cons: Relatively verbose and not very well documented, limited mostly to the input data
Operating on DataFrames using Scala UDFs.
Pros: Easy to implement (see Spark: How to map Python with Scala or Java User Defined Functions?), no data conversion between Python and Scala if the data is already stored in a DataFrame, minimal access to Py4J (a sketch is shown after this list)
Cons: Requires access to Py4J gateway and internal methods, limited to Spark SQL, hard to debug, not supported
Creating a high-level Scala interface in a similar way to how it is done in MLlib.
Pros: Flexible, ability to execute arbitrarily complex code. It can be done either directly on RDDs (see for example MLlib model wrappers) or with DataFrames (see How to use a Scala class inside Pyspark). The latter solution seems much friendlier, since all the ser-de details are already handled by the existing API.
Cons: Low level, requires data conversion, and, like UDFs, requires access to Py4J and the internal API; not supported
Some basic examples can be found in Transforming PySpark RDD with Scala
Using an external workflow management tool to switch between Python and Scala/Java jobs, passing data through a DFS.
Pros: Easy to implement, minimal changes to the code itself
Cons: Cost of reading / writing data (Alluxio?)
Using shared SQLContext (see for example Apache Zeppelin or Livy) to pass data between guest languages using registered temporary tables.
Pros: Well suited for interactive analysis
Cons: Not so much for batch jobs (Zeppelin) or may require additional orchestration (Livy)
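To make option 2 (Scala UDFs) above more concrete, here is a minimal sketch of a Scala-side registration that PySpark code could then reach through Spark SQL; the package, object, function name, and logic are all made up for the example:

package com.example.udfs  // hypothetical package

import org.apache.spark.sql.SparkSession

object RegisterUdfs {
  // Register a plain Scala function under a SQL-visible name so that Python code
  // can call it via spark.sql("SELECT normalize(col) FROM some_table") or expr().
  def register(spark: SparkSession): Unit = {
    spark.udf.register("normalize", (s: String) => if (s == null) null else s.trim.toLowerCase)
  }
}

On the Python side, triggering the registration still goes through the Py4J gateway (e.g. via spark._jvm), which is exactly the "requires access to Py4J gateway and internal methods" caveat listed above.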
[1] Joshua Rosen (2014, August 04). PySpark Internals. Retrieved from https://cwiki.apache.org/confluence/display/SPARK/PySpark+Internals
According to this,
Spark Catalyst is "an implementation-agnostic framework for manipulating trees of relational operators and expressions."
I want to use Spark Catalyst to parse SQL DML and DDL statements and generate custom Scala code for them. However, it is not clear to me from reading the code whether there is any wrapper class around Catalyst that I can use. The ideal wrapper would receive a SQL statement and produce the equivalent Scala code. For my use case it would look like this:
def generate("select substring(s, 1, 3) as from t1") =
{ // custom code
return custom_scala_code_which is executable given s as List[String]
}
This is a simple example, but the idea is that I don't want to write another parser: I need to parse a lot of SQL functionality from a legacy system and write a custom Scala implementation for it.
As a more general question: given the lack of class-level design documentation, how can someone learn the code base and make contributions?
Spark accepts SQL queries through spark.sql. For example, you can just feed the string SELECT * FROM table as an argument, as in spark.sql("SELECT * FROM table"), after having registered your DataFrame as "table". To register your DataFrame as "table" for use in SQL queries, create a temporary view using
DataFrame.createOrReplaceTempView("table")
You can see examples here:
https://spark.apache.org/docs/2.1.0/sql-programming-guide.html#running-sql-queries-programmatically
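A minimal sketch of that pattern, with a made-up input path and view name:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("sql-example").getOrCreate()

// Register a DataFrame under a name that SQL queries can refer to.
val df = spark.read.json("/path/to/input.json")   // hypothetical input
df.createOrReplaceTempView("records")

// The query text is parsed, analysed and optimised by Catalyst before execution.
val result = spark.sql("SELECT * FROM records")
result.show()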
DataFrame code is ultimately executed as RDD operations, and along the way it is optimized; this optimization is done by Catalyst. When a programmer writes DataFrame code, it is optimized internally. For more detail, see
Catalyst optimisation in Spark
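Coming back to the original question of parsing SQL with Catalyst directly: one possible starting point, assuming Spark 2.x and that depending on an internal, unsupported API is acceptable, is the Catalyst parser itself, which turns SQL text into an unresolved logical plan or expression tree that custom code generation could walk. The SQL strings below are made up for the example:

import org.apache.spark.sql.catalyst.parser.CatalystSqlParser

// Parse a whole statement into an (unresolved) logical plan.
val plan = CatalystSqlParser.parsePlan("SELECT substring(s, 1, 3) AS prefix FROM t1")
println(plan.treeString)   // inspect the operator tree

// Or parse just an expression.
val expr = CatalystSqlParser.parseExpression("substring(s, 1, 3)")
println(expr.sql)          // render the expression back as SQL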
This question already has answers here:
Encoder error while trying to map dataframe row to updated row
So I have this code
val expandedDf = io.readInputs().mapPartitions {
  (iter: Iterator[Row]) => {
    iter.map {
      (item: Row) => {
        val myNewColumn = getUdf($"someColumnOriginal")
        Row.fromSeq(item.toSeq :+ (myNewColumn))
      }
    }
  }
}
I am getting an exception: "Unable to find encoder for type stored in a Dataset. Primitive types (Int, String, etc) and Product types (case classes) are supported by importing spark.implicits._ Support for serializing other types will be added in future releases."
My imports are:
import spark.implicits._
import org.apache.spark.sql._
I have to use the UDF because the function is very complex and makes some REST calls. Basically the code tries to add a new column to a Row using a particular column's value and then returns a DataFrame. I have tried using withColumn, but since I am dealing with petabytes of data here it is extremely slow. I am a newbie to Spark and Scala, so I apologise in advance if my question is extremely lame.
First of all, withColumn is the way to go, and if it's slow, it's probably because your job needs tuning, and I think switching to RDDs won't make it any faster.
But anyway...you are not supposed to refer to a DataFrame within the function that is called on every single row of an RDD.
To better understand what's happening: when running a Spark program there is a Driver, which is the master, and there are the Executors, which are the slaves.
The slaves don't know about DataFrames, only the driver does.
There is another important point: when you're writing code that runs on the executors, you must be careful when referencing variables that are in the Driver's scope. If you do, Spark will try to serialize them and send them to the Executors. That's OK if it's what you want, AND if those objects are small, AND if Spark knows how to serialize them.
In this case, Spark is trying to serialize $"someColumnOriginal", which is an object of class Column, but it doesn't know how and it fails.
To make it work, you have to know the position of the field you want. Let's say it's in position 2; then you would write:
Row.fromSeq(item.toSeq :+ item.get(2))
You can get the position by looking at the schema, if it's available (item.schema, rdd.schema), and since it's an Int it can be computed outside the loop, and Spark will have no trouble serializing it.
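A sketch of that approach applied to the code in the question, assuming the column is called someColumnOriginal and going through the underlying RDD[Row] (which also sidesteps the Row encoder requirement); callRestService stands in for the real per-row logic:

import org.apache.spark.sql.Row

val df = io.readInputs()   // as in the question

// Resolve the field position once, on the driver; an Int serializes without trouble.
val idx = df.schema.fieldIndex("someColumnOriginal")

val expandedRdd = df.rdd.mapPartitions { iter =>
  iter.map { row =>
    val myNewColumn = callRestService(row.getString(idx))  // hypothetical helper
    Row.fromSeq(row.toSeq :+ myNewColumn)
  }
}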
You can read this article http://www.cakesolutions.net/teamblogs/demystifying-spark-serialisation-error for more about serialization.