This question already has answers here:
Encoder error while trying to map dataframe row to updated row
So I have this code
val expandedDf = io.readInputs().mapPartitions {
  (iter: Iterator[Row]) => {
    iter.map {
      (item: Row) => {
        val myNewColumn = getUdf($"someColumnOriginal")
        Row.fromSeq(item.toSeq :+ myNewColumn)
      }
    }
  }
}
I am getting an exception: "Unable to find encoder for type stored in a Dataset. Primitive types (Int, String, etc) and Product types (case classes) are supported by importing spark.implicits._ Support for serializing other types will be added in future releases."
My imports are:
import spark.implicits._
import org.apache.spark.sql._
I have to use the UDF, as the function is very complex and makes some REST calls. Basically the code tries to add a new column into a Row using a particular column value and then returns a dataframe. I have tried using withColumn, but since I am dealing with petabytes of data here it is extremely slow. I am a newbie to Spark and Scala, so I apologise in advance if my question is extremely lame.
First of all, withColumn is the way to go, and if it's slow, it's probably because your job needs tuning, and I think switching to RDDs won't make it any faster.
But anyway...you are not supposed to refer to a DataFrame within the function that is called on every single row of an RDD.
To better understand what's happening: when running a Spark program there's a Driver, which is the master, and there are the Executors, which are the slaves.
The slaves don't know about DataFrames; only the Driver does.
There is another important point: when you're writing code that runs on the Executors, you must be careful when referencing variables that are in the Driver's scope. If you do, Spark will try to serialize them and send them to the Executors. That's OK if it's what you want AND those objects are small AND Spark knows how to serialize them.
In this case, Spark is trying to serialize $"someColumnOriginal", which is an object of class Column, but it doesn't know how and it fails.
In this case, to make it work, you have to know what position the field you want is in. Let's say it's in position 2; you would write
Row.fromSeq(item.toSeq :+ item.get(2))
You can get the position by looking at the schema if it's available (item.schema, rdd.schema), and since it's just an Int, it can be computed outside the loops and Spark will have no trouble serializing it.
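Putting that together, a minimal sketch of the pattern (callRestService is a made-up stand-in for the asker's UDF logic, io.readInputs() is taken from the question, and the column is assumed to hold strings):

import org.apache.spark.sql.Row

val inputDf = io.readInputs()

// Resolve the column position on the driver; a plain Int serializes without trouble.
val colIndex: Int = inputDf.schema.fieldIndex("someColumnOriginal")

val expandedRdd = inputDf.rdd.mapPartitions { iter =>
  iter.map { row =>
    // callRestService stands in for the complex per-value logic.
    val myNewColumn = callRestService(row.getString(colIndex))
    Row.fromSeq(row.toSeq :+ myNewColumn)
  }
}
// The RDD[Row] can then be turned back into a dataframe with an extended schema, e.g.
// spark.createDataFrame(expandedRdd, StructType(inputDf.schema :+ StructField("myNewColumn", StringType)))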
You can read this article http://www.cakesolutions.net/teamblogs/demystifying-spark-serialisation-error for more about serialization.
Related
I'm trying to perform an isin filter as optimized as possible. Is there a way to broadcast collList using the Scala API?
Edit: I'm not looking for an alternative, I know about them, but I need isin so my RelationProviders will push down the values.
val collList = collectedDf.map(_.getAs[String]("col1")).sortWith(_ < _)
// collList.size == 200,000
val retTable = df.filter(col("col1").isin(collList: _*))
The list I'm passing to the isin method has up to ~200,000 unique elements.
I know this doesn't look like the best option and a join sounds better, but I need those elements pushed down into the filters; it makes a huge difference when reading (my storage is Kudu, but it also applies to HDFS+Parquet: the base data is too big and queries work on around 1% of it). I already measured everything, and it saved me around 30 minutes of execution time :). Plus, my method already takes care of the case where the isin list grows beyond 200,000.
My problem is that I'm getting some Spark "task too big" warnings (~8 MB per task). Everything works fine, so it's not a big deal, but I'm looking to remove them and also to optimize.
I've tried the following, which does nothing, as I still get the warning (since the broadcast variable gets resolved in Scala and passed to varargs, I guess):
val collList = collectedDf.map(_.getAs[String]("col1")).sortWith(_ < _)
val retTable = df.filter(col("col1").isin(sc.broadcast(collList).value: _*))
And this one which doesn't compile:
val collList = collectedDf.map(_.getAs[String]("col1")).sortWith(_ < _)
val retTable = df.filter(col("col1").isin(sc.broadcast(collList: _*).value))
And this one, which doesn't work ("task too big" still appears):
val broadcastedList = df.sparkSession.sparkContext.broadcast(collList.map(lit(_).expr))
val filterBroadcasted = In(col("col1").expr, broadcastedList.value)
val retTable = df.filter(new Column(filterBroadcasted))
Any ideas on how to broadcast this variable? (Hacks allowed.) Any alternative to isin which allows filter pushdown is also valid. I've seen some people doing it in PySpark, but the API is not the same.
PS: Changes to the storage are not possible. I know partitioning (it's already partitioned, but not by that field) and such could help, but user inputs are totally random and the data is accessed and changed by many clients.
I'd opt for a dataframe broadcast hash join in this case instead of a broadcast variable.
Prepare a dataframe with the collectedDf("col1") collection list you want to filter with isin, and then
use a join between the two dataframes to filter the matching rows.
I think it would be more efficient than isin since you have 200k entries to be filtered. spark.sql.autoBroadcastJoinThreshold is the property you need to set to an appropriate size (10 MB by default). AFAIK you can go up to 200 MB or 300 MB based on your requirements.
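A minimal sketch of that approach, reusing df and collList from the question (it assumes the usual SparkSession named spark):

import spark.implicits._
import org.apache.spark.sql.functions.broadcast

// Put the isin values into a one-column dataframe.
val filterDf = collList.toSeq.toDF("col1")

// A left_semi join keeps only the rows of df whose col1 appears in filterDf;
// broadcast() hints Spark to ship the small side to every executor (BHJ).
val retTable = df.join(broadcast(filterDf), Seq("col1"), "left_semi")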
See this BHJ explanation of how it works.
Further reading: Spark efficiently filtering entries from a big dataframe that exist in a small dataframe
I'll just live with the big tasks, since I only use this twice in my program (but it saves a lot of time) and I can afford it; but if someone else needs it badly... well, this seems to be the path.
The best alternatives I found for pushing down big arrays:
Change your relation provider so it broadcasts big lists when pushing down In filters. This will probably leave some broadcast garbage behind, but as long as your app is not streaming it shouldn't be a problem; alternatively, you can keep the broadcasts in a global list and clean them up after a while.
Add a filter to Spark (I wrote something at https://issues.apache.org/jira/browse/SPARK-31417) which allows broadcast pushdown all the way to your relation provider. You would have to add your custom predicate, then implement your custom "pushdown" (you can do this by adding a new rule), and then rewrite your RDD/relation provider so it can exploit the fact that the variable is broadcast.
Use coalesce(X) after reading to decrease the number of tasks. This can sometimes work, depending on how the RelationProvider/RDD is implemented (see the small sketch below).
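As an illustration of that last option (the read path and the partition count 64 are placeholders; col and collList are the same as above):

import org.apache.spark.sql.functions.col

// Coalesce right after the read: fewer partitions means fewer tasks carrying
// the big isin list; whether this helps depends on the underlying source.
val df = spark.read.parquet("/path/to/base/data").coalesce(64)
val retTable = df.filter(col("col1").isin(collList: _*))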
I'm trying to read a sequence file with custom Writable subclasses for both the K and V of the sequence file input to a Spark job.
The vast majority of rows need to be filtered out by matching a broadcast variable ("candidateSet") against Kclass.getId. Unfortunately, with the standard approach the values V are deserialized for every record no matter what, and according to a profile that is where the majority of the time is being spent.
Here is my code. Note my most recent attempt to read the values generically as "Writable" and cast them back later, which worked functionally but still caused the full deserialization in the iterator.
val rdd = sc.sequenceFile(
  path,
  classOf[MyKeyClassWritable],
  classOf[Writable]
).filter(a => candidateSet.value.contains(a._1.getId))
It turns out Twitter has a library that handles this case pretty well. Specifically, using this class allows you to defer evaluation of the serialized fields to a later step by reading them as DataInputBuffers:
https://github.com/twitter/elephant-bird/blob/master/core/src/main/java/com/twitter/elephantbird/mapreduce/input/RawSequenceFileRecordReader.java
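A rough sketch of how that might be wired up, assuming elephant-bird's RawSequenceFileInputFormat (the input format built around this record reader) is on the classpath; MyValueClassWritable is a made-up name for the value class, and the exact class signatures should be checked against the library:

import org.apache.hadoop.io.DataInputBuffer
import com.twitter.elephantbird.mapreduce.input.RawSequenceFileInputFormat

// Keys and values arrive as raw bytes; nothing is deserialized up front.
val raw = sc.newAPIHadoopFile(
  path,
  classOf[RawSequenceFileInputFormat],
  classOf[DataInputBuffer],
  classOf[DataInputBuffer]
)

val filtered = raw
  .filter { case (rawKey, _) =>
    val key = new MyKeyClassWritable()      // only the key is deserialized for the filter
    key.readFields(rawKey)
    candidateSet.value.contains(key.getId)
  }
  .map { case (_, rawValue) =>
    val value = new MyValueClassWritable()  // made-up name for the custom value Writable
    value.readFields(rawValue)              // pay the value deserialization cost only for matches
    value
  }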
This question already has answers here:
Spark 2.0 Dataset vs DataFrame
What is the advantage of using a case class with a Spark dataframe? I can define the schema using the "inferSchema" option or by defining StructType fields.
I referred to https://docs.scala-lang.org/tour/case-classes.html but could not understand what the advantages of using a case class are, apart from generating the schema using reflection.
inferSchema can be an expensive operation and will defer error behavior unnecessarily. Consider the following pseudocode:
val df = loadDFWithSchemaInference
// doing things that take time
df.map(row => row.getAs[String]("fieldName")) // ...more stuff
Now, in this code you already have the assumption baked in that fieldName is of type String, but it's only expressed and checked late in your processing, leading to unfortunate errors in case it wasn't actually a String.
Now, if you did this instead:
val df = load.as[CaseClass]
or
val df = load.schema(predefinedSchema)
then the fact that fieldName is a String is a precondition, and your code will be more robust and less error-prone.
Schema inference is very handy if you're doing exploratory work in the REPL or e.g. Zeppelin, but it should not be used in operational code.
Edit Addendum:
I personally prefer to use case classes over schemas because I prefer the Dataset API to the DataFrame API (which is Dataset[Row]) for similar robustness reasons.
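To make that concrete, here is a small made-up sketch (Person, the column names, and people.csv are all illustrative; spark is the usual SparkSession):

import spark.implicits._
import org.apache.spark.sql.Encoders

// Illustrative record type: the case class supplies both the read schema
// (no inference pass over the data) and the element type of the Dataset.
case class Person(name: String, age: Int)

val people = spark.read
  .option("header", "true")
  .schema(Encoders.product[Person].schema)
  .csv("people.csv")
  .as[Person]

// Field names and types are checked up front: a typo such as p.nmae is a
// compile error instead of a runtime failure deep inside the job.
val ages = people.map(p => p.age + 1)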
I can read in a Spark dataframe as a custom object like this:
spark.read.csv("path/to/file").as[Gizmo]
But how can I convert a single Spark Row object to its equivalent case class? (If you're worried about why I want to do this, please consult this question.) Clearly Spark knows how to do this, but I don't see any straightforward way of accomplishing it (short of converting the Row into an RDD of length 1 and then converting back).
row.as[Gizmo] // doesn't work. What goes here?
I have several different queries I need to perform on several different parquet files using Spark. Each of the queries is different, and has its own function which applies it. For example:
def query1(sqtx: SQLContext): DataFrame = {
  sqtx.sql("select clients as people, reputation from table1")
}

def query2(sqtx: SQLContext): DataFrame = {
  sqtx.sql("select passengers as people, reputation from table2")
}
and so on. As you can see, while all the queries are different, the schema of all the outputs is identical.
After querying, I want to use unionAll on all the successful outputs. And here comes my question - how? Using ParSeq.map is not possible here, since the mapping will be different for every query, and using Future doesn't really seem to fit this case (I would need to use onComplete on each one, check whether it failed, etc.).
Any ideas how to do this simply?
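For what it's worth, one possible shape for this, sketched under the assumption that dropping failed queries is acceptable and that every query shares the SQLContext => DataFrame signature above (runAll is a made-up name):

import scala.util.{Success, Try}
import org.apache.spark.sql.{DataFrame, SQLContext}

// Treat every query uniformly as a SQLContext => DataFrame, run them all,
// keep the ones that succeed, and fold the survivors with unionAll.
def runAll(sqtx: SQLContext): Option[DataFrame] = {
  val queries: Seq[SQLContext => DataFrame] = Seq(query1, query2 /*, ... */)

  val successful = queries
    .map(q => Try(q(sqtx)))             // analysis errors are captured here;
                                        // execution errors still surface at an action
    .collect { case Success(df) => df }

  successful.reduceOption(_ unionAll _) // union on Spark 2+ behaves the same way
}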