What's the difference between collect and count actions? - scala

I was writing a Scala program against Spark in local mode. The code is similar to RDD.map(x => function()).collect. There was no log output in the console for a long time, so I guessed it was stuck. When I changed the action from collect to count, the whole execution completed quickly. The map phase produces very few records for collect to gather, so the problem can't be caused by network transfer when sending the result back to the driver.
Does anyone know the reason, or has anyone run into a similar problem?

The count method counts the entries of the RDD in each partition on the executors and returns the total to the driver as a single Long, so the data transfer is minimal. The collect method, on the other hand, does what its name says: it brings every entry to the driver as an Array. It therefore transfers far more data, and if the driver doesn't have enough memory you may get an OutOfMemoryError. This is why collect is not recommended unless you are sure the result fits in the driver; there are normally faster alternatives such as take, which also triggers the lazy transformations.
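A minimal sketch of the three actions side by side, assuming a local SparkSession and a hypothetical transform function standing in for the function() in the question (neither name comes from the original post):

```scala
import org.apache.spark.sql.SparkSession

object CountVsCollect {
  // Hypothetical stand-in for the per-record function in the question.
  def transform(x: Int): Int = x * 2

  // Returns (count, first five elements, number of collected elements).
  def run(spark: SparkSession): (Long, Array[Int], Int) = {
    val rdd = spark.sparkContext.parallelize(1 to 1000, numSlices = 4).map(transform)

    val n: Long = rdd.count()               // each partition reports one number to the driver
    val firstFive: Array[Int] = rdd.take(5) // scans only as many partitions as needed
    val all: Array[Int] = rdd.collect()     // every element is serialized and sent to the driver

    (n, firstFive, all.length)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("count-vs-collect").getOrCreate()
    val (n, firstFive, total) = run(spark)
    println(s"count=$n take=${firstFive.mkString(",")} collected=$total")
    spark.stop()
  }
}
```

All three actions force the map to run; they differ only in how much data comes back to the driver.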

Related

How to iteratively bring a Dataframe back to the driver in Spark

Main question: How do you safely (without risking crashing due to OOM) iterate over every row (guaranteed every row) in a dataframe from the driver node in Spark? I need to control how big the data is as it comes back, operate on it, and discard it to retrieve the next batch (say 1000 rows at a time or something)
I am trying to safely and iteratively bring the data in a potentially large Dataframe back to the driver program so that I can use the data to perform HTTP calls. I have been attempting to use someDf.foreachPartition{makeApiCall(_)} and letting the executors handle the calls. It works, but debugging and handling errors has proven to be pretty difficult when launching in prod envs, especially on failed calls.
I know there is someDf.collect() action, which brings ALL the data back to the driver all at once. However, this solution is not suggested, because if you have a very large DF, you risk crashing the driver.
Any suggestions?
If the data does not fit into memory, you could use something like:
df.toLocalIterator().forEachRemaining( row => {makeAPICall(row)})
but toLocalIterator has considerable overhead compared to collect, because it fetches the partitions one after another instead of in parallel.
Or you can collect your dataframe batch-wise, one partition at a time (which does essentially the same as toLocalIterator):
import org.apache.spark.sql.functions.{lit, spark_partition_id}
val partitions = df.rdd.partitions.map(_.index)
partitions.toStream.foreach(i => df.where(spark_partition_id() === lit(i)).collect().foreach(row => makeAPICall(row)))
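To get fixed-size batches (say 1000 rows) on top of toLocalIterator, the Java iterator can be wrapped and grouped. This is only a sketch: processInBatches and handle are hypothetical names, and the import assumes Scala 2.13 (on 2.12, scala.collection.JavaConverters._ plays the same role).

```scala
import org.apache.spark.sql.{DataFrame, Row}
import scala.jdk.CollectionConverters._

object BatchedDriverIteration {
  // Streams rows to the driver one partition at a time and passes them to
  // `handle` in batches of at most `batchSize`. Returns the number of rows seen.
  def processInBatches(df: DataFrame, batchSize: Int)(handle: Seq[Row] => Unit): Long = {
    var processed = 0L
    df.toLocalIterator().asScala.grouped(batchSize).foreach { batch =>
      handle(batch)            // e.g. perform the HTTP calls for this batch
      processed += batch.size  // the batch can be garbage-collected afterwards
    }
    processed
  }
}
```

Usage would look like BatchedDriverIteration.processInBatches(someDf, 1000)(batch => batch.foreach(makeApiCall)). Note that the batch size only bounds what the handler sees at once; the driver still needs enough memory for the largest single partition, so very large partitions may need to be split first.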
It is a bad idea to bring all that data back to the driver, because the driver is just one node and it will become the bottleneck. Scalability is lost. If you have to do this, think twice about whether you really need a big data application. Probably not.
dataframe.collect() is the most direct way to bring data to the driver, and it brings all of it at once. The alternative is toLocalIterator, which only needs to hold one partition at a time, but the largest partition can be big too. So this should be used rarely, and only for small amounts of data.
If you insist, you can write the output to a file or a queue and read it back in a controlled manner. This is only a partially scalable solution, and not one I would prefer.

Spark Scala - processing different child dataframes parallely in bulk

I am working on a fraudulent transaction detection project that uses Spark and primarily takes a rule-based approach to risk scoring incoming transactions. For this rule-based approach, several maps are created from the historical data to represent the various patterns in transactions, and these are then used later while scoring a transaction. Due to a rapid increase in data size, we are now modifying the code to generate these maps at the level of each account.
The earlier code was, for example:
createProfile(clientdata)
but now it becomes
accountList.map(account=>createProfile(clientData.filter(s"""account=${account}""")))
Using this approach, the profiles are generated, but since these operations happen sequentially, it doesn't seem to be feasible.
Also, the createProfile function itself makes use of dataframes and the sparkContext/SparkSession, so these tasks cannot be sent to the worker nodes: as far as I understand, only the driver can access dataframes and the sparkSession/sparkContext. Hence, the following code does not work:
import sparkSession.implicits._
val accountListRdd=accountList.toSeq.toDF("accountNumber")
accountListRdd.rdd.map(accountrow=>createProfile(clientData.filter(s"""account=${accountrow.get(0).toString}""")))
The above code is not working but represents the logic for the desired output behaviour.
Another approach I am looking at is multithreading at the driver level using Scala Future. But even in this scenario, many JVM objects are created in a single createProfile call, so even if this approach works, increasing the number of threads can lead to a lot of JVM objects, which can itself lead to garbage collection and memory overhead issues.
Just to put the timing in perspective: createProfile takes about 10 minutes on average for a single account and we have 3000 accounts, so sequentially it will take many days. With multithreading, even if we achieve a factor of 10, it will still take days. So we need parallelism on the order of 100s.
One thing that could have worked, if it existed, is something like a Spark groupBy within a groupBy: at the first level we could group by "account" and then do the other operations
(currently the issue is that a UDF won't be able to handle the kind of operations we want to perform).
Another solution, if practically possible, is the way Spark Streaming works:
it has a foreachRDD method and also a spark.streaming.concurrentJobs parameter which allows processing of multiple RDDs in parallel. I am not sure how it works, but maybe that kind of implementation would help.
Above is the problem description and my current views on it.
Please let me know if anyone has any ideas about this! Also, I would prefer a logical change rather than a suggestion of a different technology.
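The driver-side Future approach mentioned above can be sketched with a bounded thread pool, so that the number of concurrent createProfile calls (and the driver-side objects they create) stays capped. This is an illustrative sketch only: buildAll and its parameters are hypothetical names, and it relies on Spark accepting jobs submitted from several driver threads at once.

```scala
import java.util.concurrent.Executors
import scala.concurrent.{Await, ExecutionContext, Future}
import scala.concurrent.duration.Duration

object ParallelProfiles {
  // Runs `createProfile` for every account with at most `parallelism` calls in
  // flight at any time. Each call is expected to launch its own Spark job(s).
  def buildAll[P](accounts: Seq[String], parallelism: Int)(createProfile: String => P): Map[String, P] = {
    val pool = Executors.newFixedThreadPool(parallelism)
    implicit val ec: ExecutionContext = ExecutionContext.fromExecutorService(pool)
    try {
      val futures = accounts.map(account => Future(account -> createProfile(account)))
      // Blocks the calling thread until every profile has been built.
      Await.result(Future.sequence(futures), Duration.Inf).toMap
    } finally {
      pool.shutdown()
    }
  }
}
```

With the names from the question, a call might look like ParallelProfiles.buildAll(accountList.toSeq, 16)(account => createProfile(clientData.filter(s"account=$account"))), where 16 is an arbitrary value to tune. The threads only submit and wait for jobs; the heavy work still runs on the executors, so the achievable speed-up depends on spare cluster capacity rather than on the thread count alone.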

Efficient way to check if there are NA's in pyspark

I have a pyspark dataframe named df. I want to know whether its columns contain NAs; I don't care whether it is just one row or all of them. The problem is that my current way of checking for NAs is this one:
from pyspark.sql import functions as F
if (df.where(F.isnull('column_name')).count() >= 1):
    print("There are nulls")
else:
    print("Yey! No nulls")
The issue I see here, is that I need to compute the number of nulls in the whole column, and that is a huge amount of time wasted, because I want the process to stop when it finds the first null.
I thought about this solution but I am not sure it works (because I work in a cluster with a lot of other people so the execution time depends on the multiple jobs other people run in the cluster, so I can't compare the two approaches in even conditions):
(df.where(F.isnull('column_name')).limit(1).count() == 1)
Does adding the limit help? Are there more efficient ways to achieve this?
There is no non-exhaustive search for something that isn't there.
We can probably squeeze a lot more performance out of your query for the case where a record with a null value exists (see below), but what about when it doesn't? If you're planning on running this query multiple times, with the answer changing each time, you should be aware (I don't mean to imply that you aren't) that if the answer is "there are no null values in the entire dataframe", then you will have to scan the entire dataframe to know this, and there isn't a fast way to do that. If you need this kind of information frequently and the answer can frequently be "no", you'll almost certainly want to persist this kind of information somewhere, and update it whenever you insert a record that might have null values by checking just that record.
Don't use count().
count() is probably making things worse.
In the count case Spark used wide transformation and actually applies LocalLimit on each partition and shuffles partial results to perform GlobalLimit.
In the take case Spark used narrow transformation and evaluated LocalLimit only on the first partition.
In other words, .limit(1).count() is likely to select one example from each partition of your dataset, before selecting one example from that list of examples. Your intent is to abort as soon as a single example is found, but unfortunately, count() doesn't seem smart enough to achieve that on its own.
As alluded to by the same example, though, you can use take(), first(), or head() to achieve the use case you want. This will more effectively limit the number of partitions that are examined:
If no shuffle is required (no aggregations, joins, or sorts), these operations will be optimized to inspect enough partitions to satisfy the operation - likely a much smaller subset of the overall partitions of the dataset.
Please note, count() can be more performant in other cases. As the other SO question rightly pointed out,
neither guarantees better performance in general.
There may be more you can do.
Depending on your storage method and schema, you might be able to squeeze more performance out of your query.
Since you aren't even interested in the value of the row that was chosen in this case, you can throw a select(F.lit(True)) between your isnull and your take. This should in theory reduce the amount of information the workers in the cluster need to transfer. This is unlikely to matter if you have only a few columns of simple types, but if you have complex data structures, this can help and is very unlikely to hurt.
If you know how your data is partitioned and you know which partition(s) you're interested in or have a very good guess about which partition(s) (if any) are likely to contain null values, you should definitely filter your dataframe by that partition to speed up your query.
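As a sketch of the take/head-based check with the constant projection described above, here is the same idea in Scala (the question uses PySpark, where the equivalent chain would be df.where(F.isnull('column_name')).select(F.lit(True)).take(1)). The hasNull helper is a hypothetical name, not part of Spark:

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, lit}

object NullCheck {
  // True if at least one row has a null in `columnName`.
  def hasNull(df: DataFrame, columnName: String): Boolean =
    df.where(col(columnName).isNull)
      .select(lit(true)) // drop the row payload, keep only a constant
      .head(1)           // Array[Row] with at most one element
      .nonEmpty
}
```

When a null exists, Spark can stop after the first partitions that yield a row. When none exists, every partition still has to be scanned, as explained above.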

Convert a JavaPairRDD into list without collect() [duplicate]

We know that if we need to convert an RDD to a list, we should use collect(). But this function puts a lot of stress on the driver (as it brings all the data from the different executors to the driver), which causes performance degradation or worse (the whole application may fail).
Is there any other way to convert an RDD into one of the java.util collections without using collect() or collectAsMap() etc., one that does not degrade performance?
Basically, in the current scenario where we deal with huge amounts of data in batch or stream processing, APIs like collect() and collectAsMap() have become completely useless in a real project with real amounts of data. We can use them in demo code, but that is all these APIs are good for. So why have an API which we cannot even use (or am I missing something)?
Is there a better way to achieve the same result through some other method, or can we implement collect() and collectAsMap() more effectively than just calling
List<String> myList = rdd.collect(); (which affects performance)
I searched on Google but could not find anything effective. Please help if someone has a better approach.
Is there any other way to convert RDD into any of the java util collection without using collect() or collectAsMap() etc which does not cause performance degrade?
No, and there can't be. And if there were such a way, collect would be implemented using it in the first place.
Well, technically you could implement List interface on top of RDD (or most of it?), but that would be a bad idea and quite pointless.
So why to have an API which we can not even use (Or am I missing something).
collect is intended to be used for cases where only large RDDs are inputs or intermediate results, and the output is small enough. If that's not your case, use foreach or other actions instead.
Since you want to collect the data into a Java collection, the data has to end up in a single JVM, because Java collections are not distributed. There is no way to get all the data into a collection without fetching the data. The framing of the problem is wrong.
collect and similar are not meant to be used in normal spark code. They are useful for things like debugging, testing, and in some cases when working with small datasets.
You need to keep your data inside the RDD and use RDD transformations and actions without ever taking the data out. Methods like collect, which pull your data out of Spark and onto your driver, defeat the purpose and undo any advantage that Spark might be providing, since you are now processing all of your data on a single machine anyway.
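To illustrate the "large input, small output" case that collect is meant for, here is a minimal Scala sketch (the question uses the Java API, where reduceByKey and collectAsMap have the same names on JavaPairRDD). The wordCounts helper and the word-count task are assumptions chosen for illustration only:

```scala
import org.apache.spark.rdd.RDD

object SmallResultCollect {
  // Aggregates on the executors first; only one (word, count) pair per
  // distinct word is sent to the driver by collectAsMap.
  def wordCounts(lines: RDD[String]): Map[String, Long] =
    lines
      .flatMap(_.split("\\s+"))
      .filter(_.nonEmpty)
      .map(word => (word, 1L))
      .reduceByKey(_ + _)   // distributed reduction, stays in the cluster
      .collectAsMap()       // small result materialized on the driver
      .toMap
}
```

The input RDD can be arbitrarily large; what matters for the driver is only the size of the reduced result.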

Bring data of DataFrame back to local node for further actions (count / show) in spark/scala

I'm using Spark 1.6 in Scala.
I know this touches on some of the basic ideas behind the Spark framework, but I couldn't answer it for myself by reading different tutorials (maybe the wrong ones).
I joined two DataFrames into a new one (nDF). I know that it is not actually computed until I call show, first or count.
But since I want to do exactly this, I want to inspect nDF in different ways:
nDF.show
nDF.count
nDF.filter()
..and so on, each call would take a long time, since the original DataFrames are big. Couldn't I bring/copy the data into this new one, so that these new actions run as quickly as on the original sets? (At first I thought 'collect' was the answer, but it only returns an Array, not a DataFrame.)
This is a classic scenario. When you join two Dataframes, Spark doesn't perform any operation, because it evaluates lazily: work only happens when an action is called on the resulting dataframe. Actions are things like show, count, collect, etc.
Now, when show or count is called on nDF, Spark evaluates the resulting dataframe every time, i.e. once when you call show, then again when count is called, and so on. This means that internally it re-runs the join every time an action is called on the resulting dataframe.
Spark doesn't cache the resulting dataframe in memory unless it is told to do so with df.cache / df.persist.
So when you do
val nDF = a.join(b).persist
and then call count/show, it will evaluate nDF once and store the resulting dataframe in memory. Hence subsequent actions will be faster.
However, the first evaluation might be a little slower, and you will need a little more executor memory.
If the memory available to you is good with respect to the size of your dataset, what you're probably looking for is df.cache(). If the size of your dataset is too much, consider using df.persist() as it allows different levels of persistence.
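A minimal sketch of the persist pattern, assuming a Spark 2.x or later DataFrame API (the question mentions Spark 1.6, where persist behaves the same way). The joinAndPersist helper and the key parameter are hypothetical names for illustration:

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.storage.StorageLevel

object PersistJoin {
  // Joins `a` and `b` on the column `key` and marks the result for caching.
  // MEMORY_AND_DISK spills to disk when the result does not fit in memory.
  def joinAndPersist(a: DataFrame, b: DataFrame, key: String): DataFrame = {
    val nDF = a.join(b, Seq(key)).persist(StorageLevel.MEMORY_AND_DISK)
    nDF.count() // first action runs the join once and fills the cache
    nDF
  }
}
```

Later calls such as nDF.show() or nDF.filter(...).count() then read from the cached data instead of repeating the join. Calling nDF.unpersist() once the inspection is finished releases the executor memory again.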
Hope this is what you're looking for. Cheers