dask dataframe groupby resulting in one partition memory issue - group-by

I am reading 64 compressed CSV files (probably 70-80 GB in total) into one Dask dataframe and then running a groupby with aggregations.
The job never completes because apparently the groupby creates a dataframe with only one partition.
This post and this post already address this issue, but they focus on the computational graph and not on the memory problem you run into when the resulting dataframe is too large.
I tried a workaround with repartitioning, but the job still won't complete.
What am I doing wrong? Will I have to use map_partitions? This is very confusing, as I expect Dask to take care of partitioning everything, even after aggregation operations.
from dask.distributed import Client, progress
import dask
import dask.dataframe as dd
from dask.diagnostics import ProgressBar

client = Client(n_workers=4, threads_per_worker=1, memory_limit='8GB',
                diagnostics_port=5000)
client

dask.config.set(scheduler='processes')

dB3 = dd.read_csv("boden/expansion*.csv",  # read the 64 files in parallel
                  blocksize=None,
                  sep=',',
                  compression='gzip')

aggs = {
    'boden': ['count', 'min']
}

dBSelect = dB3.groupby(['lng', 'lat']).agg(aggs).repartition(npartitions=64)
dBSelect = dBSelect.reset_index()
dBSelect.columns = ['lng', 'lat', 'bodenCount', 'boden']
dBSelect = dBSelect.drop('bodenCount', axis=1)

with ProgressBar(dt=30):
    dBSelect.compute().to_parquet('boden/final/boden_final.parq', compression=None)

Most groupby aggregation outputs are small and fit easily in one partition. Clearly this is not the case in your situation.
To resolve this you should use the split_out= parameter to your groupby aggregation to request a certain number of output partitions.
df.groupby(['x', 'y', 'z']).mean(split_out=10)
Note that using split_out= will significantly increase the size of the task graph (it has to mildly shuffle/sort your data ahead of time) and so may increase scheduling overhead.
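As a rough sketch of how that could look with the aggregation from the question (column names and paths are taken from the code above; split_out=64 is just an example value to tune, and a parquet engine such as pyarrow or fastparquet is assumed to be installed):

import dask.dataframe as dd

dB3 = dd.read_csv("boden/expansion*.csv", blocksize=None, sep=',', compression='gzip')

aggs = {'boden': ['count', 'min']}

# Ask the aggregation itself to produce 64 output partitions instead of one.
dBSelect = dB3.groupby(['lng', 'lat']).agg(aggs, split_out=64).reset_index()
dBSelect.columns = ['lng', 'lat', 'bodenCount', 'boden']

# Write directly from Dask rather than calling .compute() first,
# so the full result never has to fit into memory on one machine.
dBSelect.to_parquet('boden/final/boden_final.parq', compression=None)

Writing with dBSelect.to_parquet(...) instead of dBSelect.compute().to_parquet(...) also keeps the output step partition-by-partition, which matters once the result is too large for a single machine.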

Related

Spark configurations for Out of memory error [duplicate]

Cluster setup -
Driver has 28gb
Workers have 56gb each (8 workers)
Configuration -
spark.memory.offHeap.enabled true
spark.driver.memory 20g
spark.memory.offHeap.size 16gb
spark.executor.memory 40g
My job -
// myFunc just takes a string s and does some transformations on it; the strings
// are very small, but there are about 10 million of them to process.

// Out of memory failure
data.map(s => myFunc(s)).saveAsTextFile(outFile)

// Works fine
data.map(s => myFunc(s))
Also, I de-clustered / removed Spark from my program, and it completed just fine (successfully saved to a file) on a single server with 56 GB of RAM. This shows that it is just a Spark configuration issue. I reviewed https://spark.apache.org/docs/latest/configuration.html#memory-management and the configurations I currently have seem to be all that should need to change for my job to work. What else should I be changing?
Update -
Data -
val fis: FileInputStream = new FileInputStream(new File(inputFile))
val bis: BufferedInputStream = new BufferedInputStream(fis)
val input: CompressorInputStream = new CompressorStreamFactory().createCompressorInputStream(bis)
val br = new BufferedReader(new InputStreamReader(input))
val stringArray = br.lines().toArray()
val data = sc.parallelize(stringArray)
Note: this does not cause any memory issues, even though it is incredibly inefficient. I can't read the file using Spark directly because it throws some EOF errors.
As for myFunc, I can't really post the code because it's complex. But basically the input string is a delimited string; the function does some delimiter replacement, date/time normalization and things like that. The output string will be roughly the same size as the input string.
Also, it works fine for smaller data sizes, and the output is correct and roughly the same size as the input data file, as it should be.
Your current solution does not take advantage of Spark. You are loading the entire file into an array in memory, then using sc.parallelize to distribute it into an RDD. This is hugely wasteful of memory (even without Spark) and will of course cause out-of-memory problems for large files.
Instead, use sc.textFile(filePath) to create your RDD. Then Spark is able to read and process the file in chunks, so only a small portion of it needs to be in memory at a time. You are also able to take advantage of parallelism this way, as Spark will read and process the file in parallel with however many executors and cores you have, instead of needing to read the entire file on a single thread on a single machine.
Assuming that myFunc can look at only a single line at a time, this program should have a very small memory footprint.
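A minimal sketch of that shape, written in PySpark purely for illustration (my_func stands in for the question's myFunc, and the paths are placeholders):

# Let Spark read the file itself, partition by partition, instead of
# materializing the whole thing on the driver first.
rdd = sc.textFile(input_path)                  # lazy; each task reads its own chunk
rdd.map(my_func).saveAsTextFile(output_path)   # only a small slice is in memory at a time

The Scala equivalent is essentially sc.textFile(inputFile).map(s => myFunc(s)).saveAsTextFile(outFile).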
It would help if you put more details of what is going on in your program before and after the map.
The second command (the map alone) does not do anything unless an action is triggered. Your file is probably not partitioned, and the driver is doing the work. The code below should force the data to be spread evenly across the workers and protect against OOM on a single node. It will cause shuffling of the data, though.
Update after looking at your code: it will be better if you do this:
val data = sc.parallelize(stringArray).repartition(8)
data.map(s => myFunc(s)).saveAsTextFile(outFile)

Spark shuffle read takes significant time for small data

We are running the following stage DAG and experiencing long shuffle read time for relatively small shuffle data sizes (about 19MB per task)
One interesting aspect is that waiting tasks within each executor/server have equivalent shuffle read times. Here is an example of what that means: on the following server, one group of tasks waits about 7.7 minutes and another waits about 26 s.
Here is another example from the same stage run. The figure shows 3 executors / servers each having uniform groups of tasks with equal shuffle read time. The blue group represents killed tasks due to speculative execution:
Not all executors are like that. There are some that finish all their tasks within seconds pretty much uniformly, and the size of remote read data for these tasks is the same as for the ones that wait long time on other servers.
Besides, this type of stage runs 2 times within our application runtime. The servers/executors that produce these groups of tasks with large shuffle read time are different in each stage run.
Here is an example of a task stats table for one of the servers / hosts:
It looks like the code responsible for this DAG is the following:
output.write.parquet("output.parquet")
comparison.write.parquet("comparison.parquet")
output.union(comparison).write.parquet("output_comparison.parquet")
val comparison = data.union(output).except(data.intersect(output)).cache()
comparison.filter(_.abc != "M").count()
We would highly appreciate your thoughts on this.
Apparently the problem was JVM garbage collection (GC). The tasks had to wait until GC was done on the remote executors. The equivalent shuffle read times resulted from the fact that several tasks were waiting on a single remote host performing GC. We followed the advice posted here and the problem decreased by an order of magnitude. There is still a small correlation between GC time on remote hosts and local shuffle read time. In the future we plan to try the external shuffle service.
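For reference, a rough sketch (in PySpark, with illustrative values rather than our exact configuration) of the kind of settings this points at: the external shuffle service, so shuffle blocks can be served while an executor is paused in GC, plus a GC-related executor option.

from pyspark import SparkConf, SparkContext

# Illustrative settings only; the external shuffle service also has to be
# running on the cluster nodes for the first option to take effect.
conf = (SparkConf()
        .set("spark.shuffle.service.enabled", "true")              # serve shuffle files outside the executor JVM
        .set("spark.executor.extraJavaOptions", "-XX:+UseG1GC"))   # aim to shorten long full-GC pauses
sc = SparkContext(conf=conf)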
Since google brought me here with the same problem but I needed another solution...
Another possible reason for a small shuffle taking a long time to read could be that the data is split over many partitions. For example (apologies, this is PySpark as it is all I have used):
# my_df_with_many_partitions has, say, 1000 partitions
(my_df_with_many_partitions
    .filter(very_specific_filter)   # only very few rows pass
    .groupby('blah')
    .count())
The shuffle write from the filter above will be very small, so the following stage will have a very small amount to read. But to read it, you need to check a lot of mostly empty partitions. One way to address this would be:
(my_df_with_many_partitions
    .filter(very_specific_filter)
    .repartition(1)                 # collapse the many near-empty partitions
    .groupby('blah')
    .count())

Spark Streaming mapWithState seems to rebuild complete state periodically

I am working on a Scala (2.11) / Spark (1.6.1) streaming project and using mapWithState() to keep track of seen data from previous batches.
The state is distributed across 20 partitions on multiple nodes, created with StateSpec.function(trackStateFunc _).numPartitions(20). In this state we have only a few keys (~100) mapped to Sets with up to ~160,000 entries each, which grow throughout the application. The entire state is up to 3 GB, which can be handled by each node in the cluster. In each batch, some data is added to the state but not deleted until the very end of the process, i.e. after ~15 minutes.
While following the application UI, every 10th batch's processing time is very high compared to the other batches. See images:
The yellow fields represent the high processing time.
A more detailed Job view shows that in these batches the delay occurs at a certain point, exactly when all 20 partitions are "skipped". Or at least this is what the UI says.
My understanding of skipped is that each state partition is one possible task which isn't executed, as it doesn't need to be recomputed. However, I don't understand why the number of skips varies in each Job and why the last Job requires so much processing. The higher processing time occurs regardless of the state's size; it just affects the duration.
Is this a bug in the mapWithState() functionality or is this intended behaviour? Does the underlying data structure require some kind of reshuffling, does the Set in the state need to copy data? Or is it more likely to be a flaw in my application?
Is this a bug in the mapWithState() functionality or is this intended behaviour?
This is intended behaviour. The spikes you're seeing are because your data is getting checkpointed at the end of that given batch. If you look at the times of the longer batches, you'll see that it happens consistently every 100 seconds. That's because the checkpoint interval is constant and is calculated from your batchDuration (how often you talk to your data source to read a batch) multiplied by some constant, unless you explicitly set the DStream.checkpoint interval.
Here is the relevant piece of code from MapWithStateDStream:
override def initialize(time: Time): Unit = {
  if (checkpointDuration == null) {
    checkpointDuration = slideDuration * DEFAULT_CHECKPOINT_DURATION_MULTIPLIER
  }
  super.initialize(time)
}
Where DEFAULT_CHECKPOINT_DURATION_MULTIPLIER is:
private[streaming] object InternalMapWithStateDStream {
  private val DEFAULT_CHECKPOINT_DURATION_MULTIPLIER = 10
}
Which lines up exactly with the behavior you're seeing, since your read batch duration is every 10 seconds => 10 * 10 = 100 seconds.
This is normal, and that is the cost of persisting state with Spark. An optimization on your side could be to think about how you can minimize the size of the state you have to keep in memory, so that this serialization is as quick as possible. Additionally, make sure that the data is spread out over enough executors, so that the state is distributed uniformly between all nodes. Also, I hope you've turned on Kryo serialization instead of the default Java serialization; that can give you a meaningful performance boost.
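On that last point, a minimal sketch of switching the serializer (shown with a PySpark SparkConf for illustration; the Scala SparkConf carries the same setting):

from pyspark import SparkConf, SparkContext

# Kryo is generally faster and more compact than the default Java serialization,
# which matters when gigabytes of state are serialized at every checkpoint.
conf = SparkConf().set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
sc = SparkContext(conf=conf)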
In addition to the accepted answer pointing out the price of serialization related to checkpointing, there's another, less well-known issue which might contribute to the spiky behaviour: eviction of deleted states.
Specifically, 'deleted' or 'timed out' states are not removed from the map immediately, but are marked for deletion and actually removed only during serialization [in Spark 1.6.1, see writeObjectInternal()].
This has two performance implications, both occurring only once every 10 batches:
The traversal and deletion process has its price.
If you process the stream of timed-out / deleted events, e.g. persist it to external storage, the associated cost for all 10 batches will be paid only at this point (and not, as one might have expected, on each RDD).

Spark RDD's - how do they work

I have a small Scala program that runs fine on a single-node. However, I am scaling it out so it runs on multiple nodes. This is my first such attempt. I am just trying to understand how the RDDs work in Spark so this question is based around theory and may not be 100% correct.
Let's say I create an RDD:
val rdd = sc.textFile(file)
Now once I've done that, does that mean that the file at file is now partitioned across the nodes (assuming all nodes have access to the file path)?
Secondly, I want to count the number of objects in the RDD (simple enough), however, I need to use that number in a calculation which needs to be applied to objects in the RDD - a pseudocode example:
rdd.map(x => x / rdd.size)
Let's say there are 100 objects in rdd and there are 10 nodes, giving a count of 10 objects per node (assuming this is how the RDD concept works). Now, when I call the method, is each node going to perform the calculation with rdd.size as 10 or as 100? Because, overall, the RDD is of size 100, but locally on each node it is only 10. Am I required to make a broadcast variable prior to doing the calculation? This question is linked to the question below.
Finally, if I make a transformation to the RDD, e.g. rdd.map(_.split("-")), and then I wanted the new size of the RDD, do I need to perform an action on the RDD, such as count(), so all the information is sent back to the driver node?
val rdd = sc.textFile(file)
Does that mean that the file is now partitioned across the nodes?
The file remains wherever it was. The elements of the resulting RDD[String] are the lines of the file. The RDD is partitioned to match the natural partitioning of the underlying file system. The number of partitions does not depend on the number of nodes you have.
It is important to understand that when this line is executed it does not read the file(s). The RDD is a lazy object and will only do something when it must. This is great because it avoids unnecessary memory usage.
For example, if you write val errors = rdd.filter(line => line.startsWith("error")), still nothing happens. If you then write val errorCount = errors.count now your sequence of operations will need to be executed because the result of count is an integer. What each worker core (executor thread) will do in parallel then, is read a file (or piece of file), iterate through its lines, and count the lines starting with "error". Buffering and GC aside, only a single line per core will be in memory at a time. This makes it possible to work with very large data without using a lot of memory.
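A small PySpark rendering of that sequence, just to make the laziness concrete (the file name and filter predicate are only illustrative):

# Nothing is read here: textFile and filter only build up the lineage.
rdd = sc.textFile("logs.txt")
errors = rdd.filter(lambda line: line.startswith("error"))

# count() is an action, so only now does each task read its piece of the file,
# stream through the lines, and add up the matching ones.
error_count = errors.count()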
I want to count the number of objects in the RDD, however, I need to use that number in a calculation which needs to be applied to objects in the RDD - a pseudocode example:
rdd.map(x => x / rdd.size)
There is no rdd.size method. There is rdd.count, which counts the number of elements in the RDD. rdd.map(x => x / rdd.count) will not work. The code will try to send the rdd variable to all workers and will fail with a NotSerializableException. What you can do is:
val count = rdd.count
val normalized = rdd.map(x => x / count)
This works, because count is an Int and can be serialized.
If I make a transformation to the RDD, e.g. rdd.map(_.split("-")), and then I wanted the new size of the RDD, do I need to perform an action on the RDD, such as count(), so all the information is sent back to the driver node?
map does not change the number of elements. I don't know what you mean by "size". But yes, you need to perform an action, such as count to get anything out of the RDD. You see, no work at all is performed until you perform an action. (When you perform count, only the per-partition count will be sent back to the driver, of course, not "all the information".)
Usually the file (or parts of the file, if it's too big) is replicated to N nodes in the cluster (by default N=3 on HDFS). There is no intention to split every file across all available nodes.
However, for you (i.e. the client), working with the file through Spark should be transparent: you should not see any difference in rdd.size, no matter how many nodes it's split across and/or replicated to. There are methods (at least in Hadoop) to find out which nodes (parts of) the file are currently located on. However, in simple cases you most probably won't need to use this functionality.
UPDATE: an article describing RDD internals: https://cs.stanford.edu/~matei/papers/2012/nsdi_spark.pdf

Spark: increase number of partitions without causing a shuffle?

When decreasing the number of partitions one can use coalesce, which is great because it doesn't cause a shuffle and seems to work instantly (doesn't require an additional job stage).
I would like to do the opposite sometimes, but repartition induces a shuffle. I think a few months ago I actually got this working by using CoalescedRDD with balanceSlack = 1.0 - what would happen is that it would split a partition so that the resulting partitions were all located on the same node (so small net IO).
This kind of functionality is automatic in Hadoop; one just tweaks the split size. It doesn't seem to work this way in Spark unless one is decreasing the number of partitions. I think the solution might be to write a custom partitioner along with a custom RDD where we define getPreferredLocations ... but I thought that is such a simple and common thing to do that surely there must be a straightforward way of doing it?
Things tried:
.set("spark.default.parallelism", partitions) on my SparkConf, and when in the context of reading parquet I've tried sqlContext.sql("set spark.sql.shuffle.partitions= ..., which on 1.0.0 causes an error AND not really want I want, I want partition number to change across all types of job, not just shuffles.
Watch this space
https://issues.apache.org/jira/browse/SPARK-5997
This kind of really simple obvious feature will eventually be implemented - I guess just after they finish all the unnecessary features in Datasets.
I do not exactly understand what your point is. Do you mean you now have 5 partitions, but after the next operation you want the data distributed to 10? Because having 10 but still using 5 does not make much sense... The process of sending data to new partitions has to happen at some point.
When doing coalesce, you can get rid of unused partitions, for example: if you initially had 100, but then after reduceByKey you got 10 (as there were only 10 keys), you can use coalesce.
If you want the process to go the other way, you could just force some kind of partitioning:
[RDD].partitionBy(new HashPartitioner(100))
I'm not sure that's what you're looking for, but hope so.
As you know, PySpark uses lazy evaluation: it only does the computation when an action is triggered (for example a df.count() or a df.show()). So what you can do is change the shuffle partition setting between those actions.
You can write:

from pyspark.sql import functions as F

sparkSession.sql("set spark.sql.shuffle.partitions=100")
# your Spark code here, with some transformations and at least one action
df = df.withColumn("sum", F.sum(df.A).over(your_window_function))
df.count()  # your action
df = df.filter(df.B < 10)
df.count()

sparkSession.sql("set spark.sql.shuffle.partitions=10")
# you reduce the number of partitions because you know you will have a lot
# less data
df = df.withColumn("max", F.max(df.A).over(your_other_window_function))
df.count()  # your action