Does Spark collect_list send data to the driver? - pyspark

With this snippet in pyspark:
df.groupBy('id').agg(collect_list('feature'))
I keep running out of memory on the driver, so I'm assuming the collection process takes place on the driver.
If this is correct, is a UDAF implemented in Scala my only option to avoid this OOM?
Thanks

No, the name "collect" is a little misleading here. collect_list is an aggregate function, so like any other aggregation it runs on the executors, not the driver. If the driver is running out of memory, the likely culprit is a later action such as collect() or toPandas(), or one very skewed group, rather than the aggregation itself.
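To make that concrete, here is a rough mental model of how Spark evaluates an aggregation like collect_list (a plain-Python sketch for illustration, not Spark's actual implementation): each partition builds per-key buffers, and after the shuffle the buffers for a given key are merged on the executor that owns that key. The driver never sees the row data.

```python
# Toy model of collect_list evaluation: per-partition partial
# aggregation, then a per-key merge after the shuffle. Illustrative
# only -- this is not Spark's actual code.
from collections import defaultdict

def partial_aggregate(partition):
    # runs independently on each executor, over its own partition
    buffers = defaultdict(list)
    for key, feature in partition:
        buffers[key].append(feature)
    return buffers

def merge(buffers_a, buffers_b):
    # runs on the reducer-side executor that owns each key
    merged = {k: list(v) for k, v in buffers_a.items()}
    for key, values in buffers_b.items():
        merged.setdefault(key, []).extend(values)
    return merged

p1 = [("a", 1), ("b", 2)]   # partition on executor 1
p2 = [("a", 3)]             # partition on executor 2
result = merge(partial_aggregate(p1), partial_aggregate(p2))
print(result)  # {'a': [1, 3], 'b': [2]}
```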

Related

Convert a JavaPairRDD into list without collect() [duplicate]

We know that if we need to convert an RDD to a list, we should use collect(). But this function puts a lot of stress on the driver (it brings all the data from the different executors to the driver), which causes performance degradation or worse (the whole application may fail).
Is there any other way to convert an RDD into any of the java.util collections without using collect() or collectAsMap() etc. that does not degrade performance?
Basically, in scenarios where we deal with huge amounts of data in batch or stream processing, APIs like collect() and collectAsMap() have become useless in a real project with real amounts of data. We can use them in demo code, but that's all. So why have an API that we cannot even use (or am I missing something)?
Is there a better way to achieve the same result through some other method, or can we implement collect() and collectAsMap() more effectively than just calling
List<String> myList = rdd.collect(); (which affects performance)
I searched Google but could not find anything effective. Please help if someone has a better approach.
Is there any other way to convert an RDD into any of the java.util collections without using collect() or collectAsMap() etc. that does not degrade performance?
No, and there can't be. And if there were such a way, collect would be implemented using it in the first place.
Well, technically you could implement the List interface on top of an RDD (or most of it?), but that would be a bad idea and quite pointless.
So why have an API that we cannot even use (or am I missing something)?
collect is intended for cases where large RDDs are only inputs or intermediate results, and the final output is small enough to fit on the driver. If that's not your case, use foreach or other actions instead.
Since you want to collect the data into a Java collection, the data has to be gathered on a single JVM, because Java collections are not distributed. There is no way to get all the data into a collection without actually fetching it; the framing of the problem is wrong.
collect and similar methods are not meant to be used in normal Spark code. They are useful for things like debugging, testing, and in some cases when working with small datasets.
You need to keep your data inside the RDD and use RDD transformations and actions without ever taking the data out. Methods like collect, which pull your data out of Spark and onto your driver, defeat the purpose and undo any advantage Spark might be providing, since you then process all of your data on a single machine anyway.
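The same principle can be illustrated outside Spark with plain Python: materializing a dataset into a list (what collect does) holds everything in one process's memory at once, while foreach-style processing visits each record and drops it. (Toy data, purely for illustration.)

```python
def records():
    # stand-in for a large dataset that would normally be distributed
    yield from range(100_000)

# "collect" style: the whole dataset lands in one process's memory
collected = list(records())

# "foreach" style: each record is visited and discarded; peak memory stays flat
total = 0
for r in records():
    total += r

print(len(collected), total)  # 100000 4999950000
```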

Rank a Spark stream dataset column

I'm using Spark 2.3.1's Structured Streaming API. Is it possible to rank values in a column of a Spark streaming DataFrame? I tried the following code, then realized from the exception message that it is not possible for the streaming context to iterate over an entire window.
.withColumn("rank", row_number().over(Window.orderBy($"transactionTime")))
throws
org.apache.spark.sql.AnalysisException: Non-time-based windows are not supported on streaming DataFrames/Datasets
Can anyone help me with an idea to calculate rank/percentile?
So it seems that non-time-based window operations are not yet supported in the Spark Structured Streaming API.
Look out for upcoming Apache Spark releases.
Unfortunately there is no built-in API to do what you need, although I tried a workaround using groupByKey and mapGroupsWithState, e.g.:
val stream = ...
stream
  .groupByKey(_.id)
  .mapGroupsWithState(GroupStateTimeout.ProcessingTimeTimeout())(<function>)
and <function> will receive an Iterator over your data. You may sort it and implement rank, dense_rank, etc.
However, you have requested a window without partition key information (which will lead to OOM issues for huge data volumes); in that case, you may add the same literal value to all records using withColumn so everything falls into one group.
Note: you don't need to keep any state in GroupState; you just need this API to get at the grouped iterator.
Hope it helps!
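For illustration, here is what the body of that <function> might do for one group, sketched in plain Python (the row shape and names are assumptions; in Spark this logic would run per key on the executors, receiving the group's iterator): sort by timestamp, then assign rank and dense_rank.

```python
# Sketch of per-group ranking logic: sort the group's rows by
# timestamp, then compute rank (gaps on ties) and dense_rank
# (no gaps). Plain Python for illustration, not Spark code.
def rank_group(rows):
    # rows: iterable of (transaction_time, value) pairs for one key
    ordered = sorted(rows, key=lambda r: r[0])
    ranked, dense = [], 0
    prev_time = None
    rank = 0
    for i, (t, v) in enumerate(ordered, start=1):
        if t != prev_time:
            dense += 1      # dense_rank: increments once per distinct time
            rank = i        # rank: jumps to the row position after ties
        ranked.append((t, v, rank, dense))
        prev_time = t
    return ranked

rows = [(3, "c"), (1, "a"), (1, "b"), (2, "d")]
print(rank_group(rows))
# [(1, 'a', 1, 1), (1, 'b', 1, 1), (2, 'd', 3, 2), (3, 'c', 4, 3)]
```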

Using Scala Pickling serialization In APACHE SPARK over KryoSerializer and JavaSerializer

While searching for the best serialization techniques for Apache Spark, I found the link below:
https://github.com/scala/pickling#scalapickling
which states that serialization in Scala will be faster and automatic with this framework.
Scala Pickling also has the following advantages (ref: https://github.com/scala/pickling#what-makes-it-different).
So I wanted to know whether Scala Pickling (PickleSerializer) can be used in Apache Spark instead of KryoSerializer.
If yes, what are the necessary changes to be made? (An example would be helpful.)
If no, why not? (Please explain.)
Thanks in advance, and forgive me if I am wrong.
Note: I am using Scala to write my Apache Spark (version 1.4.1) application.
I visited Databricks for a couple of months in 2014 to try to incorporate a PicklingSerializer into Spark, but couldn't find a way to include the type information needed by scala/pickling without changing Spark's interfaces. At the time, changing Spark's interfaces was a no-go. E.g., RDDs would need to include Pickler[T] type information in their interface for the generation mechanism in scala/pickling to kick in.
All of that changed though with Spark 2.0.0. If you use Datasets or DataFrames, you get so-called Encoders. This is even more specialized than scala/pickling.
Use Datasets in Spark 2.x; they are much more performant on the serialization front than plain RDDs.
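A rough analogy for why Encoders beat generic serializers: an Encoder knows the schema, so it can lay out each record as raw field bytes, while a generic serializer (Java serialization, Kryo, pickling) must emit self-describing metadata for every object. A plain-Python sketch of that difference (not Spark code):

```python
# Generic serialization vs. schema-aware encoding for one record.
# With a known schema of (int, double), the record packs into a
# fixed 12-byte layout; a generic serializer must also describe
# the structure it is writing.
import pickle
import struct

record = (12345, 3.14)  # known schema: (int32, float64)

generic = pickle.dumps(record)              # self-describing, per-object overhead
schema_aware = struct.pack("<id", *record)  # raw fields: 4 + 8 = 12 bytes

print(len(schema_aware), len(generic))      # 12 vs. considerably more
assert struct.unpack("<id", schema_aware) == record  # lossless round-trip
```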

Spark scala coding standards

I am reaching out to the community to understand the impact of coding in certain ways in Scala for Spark. I received some review comments that I feel need discussion. Coming from a traditional Java and OOP background, I am writing my opinions and questions here; I would appreciate it if you could chime in with your wisdom. I am in a Spark 1.3.0 environment.
1. Use of for loops: Is it against the rules to use for loops?
There are distributed data structures like RDDs and DataFrames in Spark. We should not collect them and run for loops over them, as the computation would then happen on the driver node alone. This has adverse effects, especially if the data is large.
But if I have a utility map that stores parameters for the job, it is fine to use a for loop on it if desired. Using a for loop or a map over the iterable is a coding choice. It is important to understand that this map is different from a map over a distributed data structure: it still happens on the driver node alone.
2. Use of var vs val
val is an immutable reference to an object and var is a mutable reference. In the example below
val driverDf = {
  var df = dataLoader.loadDriverInput()
  df = df.sqlContext.createDataFrame(df.rdd, df.schema)
  df.persist(StorageLevel.MEMORY_AND_DISK_SER)
}
Even though we have used var for the df inside the block, driverDf is an immutable reference to the resulting data frame. This kind of use of var is perfectly fine.
Similarly the following is also fine.
var driverDf = dataLoader.loadDriverInput()
driverDf = applyTransformations(driverDf)
def applyTransformations(driverDf: DataFrame) = { ... }
Are there any generic rules that say vars cannot be used in Spark environment?
3. Use of if-else vs case, not throw exceptions
Is it against standard practices to not throw exceptions or not to use if-else?
4. Use of hive context vs sql context
Are there any performance implications of using SQLContext vs HiveContext (I know HiveContext extends SQLContext) for underneath Hive tables?
Is it against standards to create multiple HiveContexts in a program? My job iterates through part of a whole data frame of values each time. The whole data frame is cached in one hive context. Each iteration's data frame is created from the whole data using a new hive context and cached, and this cache is purged at the end of the iteration. This approach gave me performance improvements in Spark 1.3.0. Is it breaking any standards?
I appreciate the responses.
Regarding loops: as you correctly mentioned, you should prefer the RDD map to perform operations in parallel across multiple nodes. For smaller iterables, you can go with a for loop; it comes down to driver memory and the time it takes to iterate.
For smaller sets of around 100 elements, the distributed approach incurs unnecessary network usage rather than giving a performance boost.
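For example, looping over a small driver-side parameter map is perfectly fine; it has nothing to do with Spark's distributed map. (The parameter names below are hypothetical.)

```python
# A small job-parameter map lives on the driver; iterating it with a
# plain for loop involves no distributed computation at all.
job_params = {"input_path": "/data/in", "output_path": "/data/out", "retries": 3}

settings = [f"{key}={value}" for key, value in job_params.items()]
print(settings)  # ['input_path=/data/in', 'output_path=/data/out', 'retries=3']
```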
val vs. var is a choice at the Scala level rather than the Spark level; I have never heard of such a rule. It depends on your requirements.
Not sure exactly what you asked. The only major downside of if-else is that deeply nested if-else blocks become cumbersome; apart from that, it should be fine. An exception can be thrown based on a condition; I see that as one of many ways to handle issues in an otherwise happy-path flow.
As mentioned here, the compiler generates more bytecode for match..case than for a simple if, so it is a trade-off: a simple condition check vs. readability for complex condition checks.
HiveContext gives the ability to write queries using the more complete HiveQL parser, access Hive UDFs, and read data from Hive tables. Note that in Spark 2.0, both HiveContext and SQLContext are replaced by SparkSession.

MapReduce implementation in Scala

I'd like to find a good and robust MapReduce framework to be used from Scala.
To add to the answer on Hadoop: there are at least two Scala wrappers that make working with Hadoop more palatable.
Scala Map Reduce (SMR): http://scala-blogs.org/2008/09/scalable-language-and-scalable.html
SHadoop: http://jonhnny-weslley.blogspot.com/2008/05/shadoop.html
Update (5 Oct 2011): there is also the Scoobi framework, which has awesome expressiveness.
http://hadoop.apache.org/ is language agnostic.
Personally, I've become a big fan of Spark
http://spark-project.org/
You have the ability to do in-memory cluster computing, significantly reducing the overhead you would experience from disk-intensive mapreduce operations.
You may be interested in scouchdb, a Scala interface to using CouchDB.
Another idea is to use GridGain. ScalaDudes have an example of using GridGain with Scala. And here is another example.
A while back, I ran into exactly this problem and ended up writing a little infrastructure to make it easy to use Hadoop from Scala. I used it on my own for a while, but I finally got around to putting it on the web. It's named (very originally) ScalaHadoop.
For a Scala API on top of Hadoop, check out Scoobi; it is still in heavy development but shows a lot of promise. There is also some effort to implement distributed collections on top of Hadoop in the Scala incubator, but that effort is not usable yet.
There is also a new scala wrapper for cascading from Twitter, called Scalding.
After looking very briefly over the documentation for Scalding, it seems that while it makes integration with Cascading smoother, it still does not solve what I see as the main problem with Cascading: type safety. Every operation in Cascading operates on Cascading's tuples (basically a list of field values, with or without a separate schema), which means that type errors, i.e. joining a key as a String with a key as a Long, lead to run-time failures.
To further jshen's point:
Hadoop Streaming simply uses Unix streams: your code (in any language) just has to read from stdin and write tab-delimited records to stdout. Implement a mapper and, if needed, a reducer (and, if relevant, configure it as the combiner).
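To sketch that contract in plain Python, with word count as the classic example (the functions below model the streaming mapper and reducer bodies over line lists, so the logic is visible without a cluster):

```python
# Hadoop Streaming style mapper and reducer: the whole contract is
# "read lines, emit tab-delimited key/value lines". Hadoop sorts the
# mapper output by key before it reaches the reducer.
def mapper(lines):
    # emit "word\t1" for every word seen
    out = []
    for line in lines:
        for word in line.split():
            out.append(f"{word}\t1")
    return out

def reducer(sorted_lines):
    # input arrives sorted by key, so equal keys are adjacent
    counts = {}
    for line in sorted_lines:
        word, n = line.split("\t")
        counts[word] = counts.get(word, 0) + int(n)
    return [f"{w}\t{c}" for w, c in counts.items()]

mapped = mapper(["to be or not", "to be"])
print(reducer(sorted(mapped)))  # ['be\t2', 'not\t1', 'or\t1', 'to\t2']
```

In a real job, mapper and reducer would each be a standalone script reading sys.stdin, wired together with the hadoop-streaming jar.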
I've added a MapReduce implementation using Hadoop, with a few test cases, on GitHub: https://github.com/sauravsahu02/MapReduceUsingScala.
Hope that helps. Note that the application is already tested.