I need to partition a dataset by 6 columns: region/year/month/day/id/quadkey.
The top level (region) is just a binary state; most of the partitions arise at the bottom levels.
For example: 2 regions / usually 1 year / usually 1 month / 3-4 days / 100-150 ids / 50-200 quadkeys.
When I do this, I get a really unbalanced shuffle, and sometimes executors fail because they exceed their memory limits.
I've also noticed in the History UI that some tasks in that phase are very big (~15 GB) while others are much smaller (~1 GB).
I've tried playing with
sqlContext.setConf("spark.sql.shuffle.partitions", "3000")
I've also tried increasing the number of executors, keeping the same memory settings. These are the errors that I get:
19/04/10 09:47:36 INFO ApplicationMaster: Final app status: FAILED, exitCode: 15, (reason: User class threw exception: org.apache.spark.SparkException: Job aborted.
at org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:196)
at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.run(InsertIntoHadoopFsRelationCommand.scala:159)
at org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult$lzycompute(commands.scala:104)
at org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult(commands.scala:102)
at org.apache.spark.sql.execution.command.DataWritingCommandExec.doExecute(commands.scala:122)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152)
at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127)
at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:80)
at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:80)
at org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:668)
at org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:668)
at org.apache.spark.sql.execution.SQLExecution$$anonfun$withNewExecutionId$1.apply(SQLExecution.scala:78)
at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:125)
at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:73)
at org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:668)
at org.apache.spark.sql.DataFrameWriter.saveToV1Source(DataFrameWriter.scala:276)
at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:270)
at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:228)
...
// stage: DataFrame
val partitionColumns = List("region", "year", "month", "day", "id", "quadkey")
stage.repartition(partitionColumns.map(new org.apache.spark.sql.Column(_)):_*)
  .write.partitionBy(partitionColumns:_*)
  .format("parquet")
  .option("compression", "gzip")
  .mode(SaveMode.Append)
  .save(destUrl)
I expected to have balanced tasks at the save stage. What shuffle settings should I set for this? Or do I need executors with more than 20-25 GB of memory? What should the scaling approach be in such a case?
One approach could be to add more columns to the repartition, including a column with high cardinality (the id of the records or some random values).
If the number of files becomes too large, try setting numPartitions, followed by the partitioning columns.
df.repartition(numPartitions, partition_cols_including_high_cardinality_column:_*).write........
===========================================================================
Edit:
In scenarios where the data is skewed, with some partition combinations having more data than others, repartitioning on the same columns might not be a good idea.
With repartition, all rows matching a given partition key combination are first collected in the same task, and one file is produced per combination if your partitionBy and repartition use the same columns. So in this case a few partition combinations will have files of ~15 GB and others of ~1 GB, which is not ideal for data sources like HDFS.
So what I'm suggesting here is to choose repartition columns that distribute the data evenly across tasks. Consider this: we have repartitioned the data on some column combination E, and it produces, let's say, 400 rows for each task to work on; each task then writes its data according to the partitionBy spec. When you check your final output, each partition directory will contain as many files as there were tasks that received rows for that partitionBy value. The number of tasks is decided by the repartition spec.
What I suggested above is to use a different set of columns for repartition, one that helps distribute the data evenly. If for some reason that is not possible with your data, add a random column (a technique called salting). Adding numPartitions fixes an upper bound on the number of tasks working on the data, and thereby on the number of files written to a partition directory. Setting numPartitions is extremely helpful when your repartition column has high cardinality, as that can otherwise create many files in your output directories.
import org.apache.spark.sql.functions.rand
df.repartition(numPartitions, $"some_col_1", rand)
.write.partitionBy("some_col")
.parquet("partitioned_lake")
Here, by fixing numPartitions, we are sure that the output for every partitionBy value will have at most numPartitions files.
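To make the effect of salting concrete, here is a minimal sketch in plain Python rather than Spark. The data, the `partition_of` helper and the salt range are all hypothetical, and SHA-256 is only a deterministic stand-in for Spark's hash partitioner. It compares the largest partition with and without a random salt added to the key.

```python
import hashlib
import random
from collections import Counter

def partition_of(key, num_partitions):
    """Deterministic stand-in for a hash partitioner (not Spark's own hash)."""
    digest = hashlib.sha256(repr(key).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_partitions

def partition_sizes(keys, num_partitions):
    """Count how many rows land in each partition."""
    counts = Counter(partition_of(k, num_partitions) for k in keys)
    return [counts.get(p, 0) for p in range(num_partitions)]

# Hypothetical skewed data: one "hot" key holds 90% of the rows.
rows = ["hot"] * 9000 + ["cold-%d" % i for i in range(1000)]
num_partitions = 8

# Partitioning on the key alone sends every "hot" row to the same partition.
plain = partition_sizes(rows, num_partitions)

# Salting: pair each key with a random value so the hot key is spread out.
rng = random.Random(0)
salted_keys = [(key, rng.randrange(num_partitions * 4)) for key in rows]
salted = partition_sizes(salted_keys, num_partitions)

print("largest partition without salt:", max(plain))
print("largest partition with salt:   ", max(salted))
```

The price of the salt is more files per output directory, since rows for one partitionBy value now arrive in several tasks. That is what capping numPartitions keeps under control.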
helpful link - http://tantusdata.com/spark-shuffle-case-2-repartitioning-skewed-data/
Hope this helps
I am reading 64 compressed csv files (probably 70-80 GB) into one dask data frame and then running a groupby with aggregations.
The job never completes because apparently the groupby creates a data frame with only one partition.
This post and this post already addressed this issue, but they focus on the computational graph and not on the memory problem you run into when the resulting data frame is too large.
I tried a workaround with repartitioning, but the job still won't complete.
What am I doing wrong? Will I have to use map_partitions? This is very confusing, as I expected Dask to take care of partitioning everything, even after aggregation operations.
import dask
import dask.dataframe as dd
from dask.diagnostics import ProgressBar
from dask.distributed import Client, progress

client = Client(n_workers=4, threads_per_worker=1, memory_limit='8GB', diagnostics_port=5000)
client
dask.config.set(scheduler='processes')
dB3 = dd.read_csv("boden/expansion*.csv",  # read in parallel
                  blocksize=None,  # 64 files
                  sep=',',
                  compression='gzip'
                  )
aggs = {
    'boden': ['count', 'min']
}
dBSelect = dB3.groupby(['lng', 'lat']).agg(aggs).repartition(npartitions=64)
dBSelect = dBSelect.reset_index()
dBSelect.columns = ['lng', 'lat', 'bodenCount', 'boden']
dBSelect = dBSelect.drop('bodenCount', axis=1)
with ProgressBar(dt=30):
    dBSelect.compute().to_parquet('boden/final/boden_final.parq', compression=None)
Most groupby aggregation outputs are small and fit easily in one partition. Clearly this is not the case in your situation.
To resolve this you should use the split_out= parameter to your groupby aggregation to request a certain number of output partitions.
df.groupby(['x', 'y', 'z']).mean(split_out=10)
Note that using split_out= will significantly increase the size of the task graph (it has to mildly shuffle/sort your data ahead of time) and so may increase scheduling overhead.
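As a rough illustration of the idea behind split_out=, here is a plain-Python sketch (not Dask's implementation). It routes each group key to one of several output partitions by hashing, so the aggregated result never has to sit in a single partition. The `bucket` and `grouped_min` helpers and the sample data are assumptions made up for this example.

```python
import hashlib
from collections import defaultdict

def bucket(key, split_out):
    """Pick an output partition for a group key via a deterministic hash."""
    digest = hashlib.sha256(repr(key).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % split_out

def grouped_min(partitions, split_out):
    """Compute min(value) per key, keeping the result in `split_out` pieces."""
    outputs = [defaultdict(lambda: float("inf")) for _ in range(split_out)]
    for part in partitions:
        for key, value in part:
            out = outputs[bucket(key, split_out)]
            out[key] = min(out[key], value)
    return [dict(out) for out in outputs]

# Hypothetical input: two partitions of ((lng, lat), boden) records.
input_partitions = [
    [((1.0, 2.0), 5), ((3.0, 4.0), 7)],
    [((1.0, 2.0), 3), ((5.0, 6.0), 9)],
]
result = grouped_min(input_partitions, split_out=4)

# Each key lives in exactly one output partition, so merging is lossless.
merged = {k: v for part in result for k, v in part.items()}
print(merged)
```

Because every key is assigned to exactly one output partition, each piece can be written out independently instead of being collected into one large frame.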
Let's assume for the following that only one Spark job is running at every point in time.
What I get so far
Here is my understanding of what happens in Spark:
When a SparkContext is created, each worker node starts an executor.
Executors are separate processes (JVMs) that connect back to the driver program. Each executor has the jar of the driver program. Quitting the driver shuts down the executors. Each executor can hold some partitions.
When a job is executed, an execution plan is created according to the lineage graph.
The job is split into stages, where each stage contains as many neighbouring (in the lineage graph) transformations and actions as possible, but no shuffles. Thus stages are separated by shuffles.
I understand that
A task is a command sent from the driver to an executor by serializing the Function object.
The executor deserializes (with the driver jar) the command (task) and executes it on a partition.
but
Question(s)
How is a stage split into those tasks?
Specifically:
Are the tasks determined by the transformations and actions, or can multiple transformations/actions be in one task?
Are the tasks determined by the partitions (e.g. one task per stage per partition)?
Are the tasks determined by the nodes (e.g. one task per stage per node)?
What I think (only partial answer, even if right)
In https://0x0fff.com/spark-architecture-shuffle, the shuffle is explained with the image
and I get the impression that the rule is
each stage is split into #number-of-partitions tasks, with no regard for the number of nodes
For my first image I'd say that I'd have 3 map tasks and 3 reduce tasks.
For the image from 0x0fff, I'd say there are 8 map tasks and 3 reduce tasks (assuming that there are only three orange and three dark green files).
Open questions in any case
Is that correct? Even if it is, my questions above are not all answered, because it is still open whether multiple operations (e.g. multiple maps) run within one task or are separated into one task per operation.
What others say
What is a task in Spark? How does the Spark worker execute the jar file? and How does the Apache Spark scheduler split files into tasks? are similar, but I did not feel that my question was answered clearly there.
You have a pretty nice outline here. To answer your questions
A separate task does need to be launched for each partition of data for each stage. Consider that each partition will likely reside on distinct physical locations - e.g. blocks in HDFS or directories/volumes for a local file system.
Note that the submission of Stages is driven by the DAG Scheduler. This means that stages that are not interdependent may be submitted to the cluster for execution in parallel: this maximizes the parallelization capability on the cluster. So if operations in our dataflow can happen simultaneously we will expect to see multiple stages launched.
We can see that in action in the following toy example in which we do the following types of operations:
load two datasources
perform some map operation on both of the data sources separately
join them
perform some map and filter operations on the result
save the result
So then how many stages will we end up with?
1 stage each for loading the two datasources in parallel = 2 stages
A third stage representing the join that is dependent on the other two stages
Note: all of the follow-on operations working on the joined data may be performed in the same stage because they must happen sequentially. There is no benefit to launching additional stages because they cannot start work until the prior operation has completed.
Here is that toy program
val sfi = sc.textFile("/data/blah/input").map{ x => val xi = x.toInt; (xi,xi*xi) }
val sp = sc.parallelize{ (0 until 1000).map{ x => (x,x * x+1) }}
val spj = sfi.join(sp)
val sm = spj.mapPartitions{ iter => iter.map{ case (k,(v1,v2)) => (k, v1+v2) }}
val sf = sm.filter{ case (k,v) => v % 10 == 0 }
sf.saveAsTextFile("/data/blah/out")
And here is the DAG of the result
Now: how many tasks? The number of tasks should be equal to
Sum over all stages of (#Partitions in the stage)
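As a worked example of that sum, here is a tiny sketch. The partition counts are hypothetical, since the real numbers depend on the input splits and on the configured parallelism.

```python
# Hypothetical partition counts for the three stages of the toy program.
partitions_per_stage = {
    "stage 0: textFile + map": 4,
    "stage 1: parallelize": 2,
    "stage 2: join + mapPartitions + filter + save": 4,
}

# One task per partition per stage, so the total is a plain sum.
total_tasks = sum(partitions_per_stage.values())

for stage, partitions in partitions_per_stage.items():
    print(f"{stage}: {partitions} tasks")
print("total tasks:", total_tasks)
```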
This might help you better understand different pieces:
Stage: a collection of tasks. The same process running against different subsets of data (partitions).
Task: represents a unit of work on a partition of a distributed dataset. So in each stage, number-of-tasks = number-of-partitions, or as you said "one task per stage per partition".
Each executor runs in one YARN container, and each container resides on one node.
Each stage utilizes multiple executors; each executor is allocated multiple vcores.
Each vcore can execute exactly one task at a time.
So at any stage, multiple tasks can be executed in parallel: number-of-tasks running = number-of-vcores being used.
If I understand correctly there are 2 ( related ) things that confuse you:
1) What determines the content of a task?
2) What determines the number of tasks to be executed?
Spark's engine "glues" together simple operations on consecutive rdds, for example:
rdd1 = sc.textFile( ... )
rdd2 = rdd1.filter( ... )
rdd3 = rdd2.map( ... )
rdd3RowCount = rdd3.count
so when rdd3 is (lazily) computed, spark will generate a task per partition of rdd1 and each task will execute both the filter and the map per line to result in rdd3.
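The "gluing" can be pictured with plain Python generators. This is only an analogy under assumed data, not Spark code: the hypothetical `run_task` function stands for one task, and it streams its partition through the filter and the map row by row without materializing the intermediate rdds.

```python
def run_task(partition, predicate, transform):
    """One task: stream a single partition through filter, then map, then count."""
    filtered = (line for line in partition if predicate(line))   # rdd2 step
    mapped = (transform(line) for line in filtered)              # rdd3 step
    return sum(1 for _ in mapped)                                # count on this partition

# Hypothetical rdd1 with three partitions of text lines.
partitions = [["a", "", "bb"], ["ccc", ""], ["dddd"]]

# One task per partition; each task runs both operations.
per_task_counts = [run_task(p, lambda s: s != "", str.upper) for p in partitions]

# The driver combines the per-task results into the final row count.
row_count = sum(per_task_counts)
print(per_task_counts, row_count)
```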
The number of tasks is determined by the number of partitions. Every RDD has a defined number of partitions. For a source RDD that is read from HDFS ( using sc.textFile( ... ) for example ) the number of partitions is the number of splits generated by the input format. Some operations on RDD(s) can result in an RDD with a different number of partitions:
rdd2 = rdd1.repartition( 1000 ) will result in rdd2 having 1000 partitions ( regardless of how many partitions rdd1 had ).
Another example is joins:
rdd3 = rdd1.join( rdd2 , numPartitions = 1000 ) will result in rdd3 having 1000 partitions ( regardless of partitions number of rdd1 and rdd2 ).
( Most ) operations that change the number of partitions involve a shuffle. When we do for example:
rdd2 = rdd1.repartition( 1000 )
what actually happens is that the task on each partition of rdd1 needs to produce an end-output that can be read by the following stage, so that rdd2 ends up with exactly 1000 partitions ( how do they do it? By hashing or sorting ). Tasks on this side are sometimes referred to as "Map ( side ) tasks".
A task that will later run on rdd2 will act on one partition ( of rdd2! ) and would have to figure out how to read/combine the map-side outputs relevant to that partition. Tasks on this side are sometimes referred to as "Reduce ( side ) tasks".
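The two sides of a shuffle can be sketched in plain Python. This is a simplification: the records are assumed to be integers, and a modulo stands in for whatever rule Spark actually uses to assign records to output partitions. The function names are made up for the illustration.

```python
def map_side_task(partition, num_reduce_partitions):
    """Split one input partition into one bucket per output partition."""
    buckets = [[] for _ in range(num_reduce_partitions)]
    for record in partition:
        buckets[record % num_reduce_partitions].append(record)
    return buckets

def reduce_side_task(reduce_index, map_outputs):
    """Build one output partition by reading its bucket from every map output."""
    merged = []
    for buckets in map_outputs:
        merged.extend(buckets[reduce_index])
    return merged

# Hypothetical rdd1 with 3 partitions, shuffled into 4 partitions.
rdd1 = [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9]]
num_out = 4

map_outputs = [map_side_task(p, num_out) for p in rdd1]              # 3 map tasks
rdd2 = [reduce_side_task(i, map_outputs) for i in range(num_out)]    # 4 reduce tasks
print(rdd2)
```

Note that the number of map-side tasks follows the partitions of rdd1, while the number of reduce-side tasks follows the partitions of rdd2.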
The 2 questions are related: the number of tasks in a stage is the number of partitions ( common to the consecutive rdds "glued" together ) and the number of partitions of an rdd can change between stages ( by specifying the number of partitions to some shuffle causing operation for example ).
Once the execution of a stage commences, its tasks can occupy task slots. The number of concurrent task-slots is numExecutors * ExecutorCores. In general, these can be occupied by tasks from different, non-dependent stages.
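The slot arithmetic is easy to check by hand. The sketch below uses made-up cluster numbers and a hypothetical `waves` helper. It computes the number of concurrent task slots and how many rounds ("waves") a stage needs when it has more tasks than slots.

```python
import math

def waves(num_tasks, num_executors, executor_cores):
    """Return (task slots, rounds needed to run all tasks of one stage)."""
    slots = num_executors * executor_cores
    return slots, math.ceil(num_tasks / slots)

# Hypothetical cluster: 10 executors with 4 cores each, a stage with 1000 tasks.
slots, n_waves = waves(num_tasks=1000, num_executors=10, executor_cores=4)
print(f"{slots} slots, {n_waves} waves")
```

This assumes the stage has the cluster to itself. If tasks from other stages occupy some slots, it will need more rounds.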
I am running a pyspark job on an EC2 cluster with 4 workers.
I get this error:
2018-07-05 08:20:44 WARN TaskSetManager:66 - Lost task 1923.0 in stage 18.0 (TID 21385, 10.0.5.97, executor 3): java.io.IOException: No space left on device
at java.io.FileOutputStream.writeBytes(Native Method)
at java.io.FileOutputStream.write(FileOutputStream.java:326)
at org.apache.spark.storage.TimeTrackingOutputStream.write(TimeTrackingOutputStream.java:58)
at java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:82)
at java.io.BufferedOutputStream.flush(BufferedOutputStream.java:140)
at net.jpountz.lz4.LZ4BlockOutputStream.finish(LZ4BlockOutputStream.java:260)
at net.jpountz.lz4.LZ4BlockOutputStream.close(LZ4BlockOutputStream.java:190)
at org.apache.spark.serializer.DummySerializerInstance$1.close(DummySerializerInstance.java:65)
at org.apache.spark.storage.DiskBlockObjectWriter.commitAndGet(DiskBlockObjectWriter.scala:173)
at org.apache.spark.shuffle.sort.ShuffleExternalSorter.writeSortedFile(ShuffleExternalSorter.java:194)
at org.apache.spark.shuffle.sort.ShuffleExternalSorter.closeAndGetSpills(ShuffleExternalSorter.java:416)
at org.apache.spark.shuffle.sort.UnsafeShuffleWriter.closeAndWriteOutput(UnsafeShuffleWriter.java:230)
at org.apache.spark.shuffle.sort.UnsafeShuffleWriter.write(UnsafeShuffleWriter.java:190)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
at org.apache.spark.scheduler.Task.run(Task.scala:109)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:748)
I looked at https://forums.databricks.com/questions/277/how-do-i-avoid-the-no-space-left-on-device-error.html
and tried increasing the number of shuffle partitions, with the same result.
My data looks fairly evenly partitioned across executors.
I want to try the workaround of assigning Null or None to dataframes. The question is whether that will indeed remove the intermediate shuffle files, and whether the lineage will be dropped.
For instance, if my code looks like this:
df1 = sqlContext.read.parquet(...)
df2= df1.filter()
df3 = df2.groupBy(*groupList).agg(....)
and I put
df1 = None
after line 1, will it save shuffle space? Isn't df1 still needed, so that it will be re-computed for df2 and df3?
Second question:
will checkpointing df1 or df2 help by breaking the lineage?
What is a feasible solution when dealing with data larger than my storage (around 400 GB of raw data processed)?
UPDATE
Removing the cache of a dataframe between the 2 phases that need this dataframe helped, and I got no errors.
I wonder how that helps with the intermediate shuffle files.
I faced a similar situation. The reason is that group-by operations and joins shuffle data. Since this shuffle data is temporary data produced while the Spark application runs, it is stored in the directory that spark.local.dir in the spark-defaults.conf file points to, which is normally a tmp directory with little space.
In general, to avoid this error, update spark.local.dir in the spark-defaults.conf file to a location that has more disk space.
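Before changing the setting, it can help to check which candidate directory actually has room. The following is a small standard-library sketch. The candidate list and the `pick_local_dir` helper are placeholders, so substitute the mount points that exist on your workers. Also note that some cluster managers override spark.local.dir with their own local-directory settings.

```python
import shutil
import tempfile

def free_gib(path):
    """Free disk space under `path`, in GiB."""
    return shutil.disk_usage(path).free / 2**30

def pick_local_dir(candidates):
    """Return the candidate directory with the most free space."""
    return max(candidates, key=free_gib)

# Placeholder candidates: the system temp dir and the current directory.
candidates = [tempfile.gettempdir(), "."]
best = pick_local_dir(candidates)
print(f"spark.local.dir candidate: {best} ({free_gib(best):.1f} GiB free)")
```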
When decreasing the number of partitions one can use coalesce, which is great because it doesn't cause a shuffle and seems to work instantly (doesn't require an additional job stage).
I would like to do the opposite sometimes, but repartition induces a shuffle. I think a few months ago I actually got this working by using CoalescedRDD with balanceSlack = 1.0, so what would happen is that it would split a partition such that the resulting partitions' locations were all on the same node (so little network IO).
This kind of functionality is automatic in Hadoop, one just tweaks the split size. It doesn't seem to work this way in Spark unless one is decreasing the number of partitions. I think the solution might be to write a custom partitioner along with a custom RDD where we define getPreferredLocations ... but I thought that is such a simple and common thing to do surely there must be a straight forward way of doing it?
Things tried:
.set("spark.default.parallelism", partitions) on my SparkConf, and in the context of reading parquet I've tried sqlContext.sql("set spark.sql.shuffle.partitions= ..., which on 1.0.0 causes an error AND is not really what I want: I want the partition number to change across all types of job, not just shuffles.
Watch this space
https://issues.apache.org/jira/browse/SPARK-5997
This kind of really simple obvious feature will eventually be implemented - I guess just after they finish all the unnecessary features in Datasets.
I do not exactly understand what your point is. Do you mean that you now have 5 partitions, but after the next operation you want the data distributed across 10? Because having 10 but still using 5 does not make much sense… The process of sending data to the new partitions has to happen at some point.
When doing coalesce, you can get rid of unused partitions. For example: if you initially had 100, but after reduceByKey you got 10 (as there were only 10 keys), you can call coalesce.
If you want the process to go the other way, you could just force some kind of partitioning:
[RDD].partitionBy(new HashPartitioner(100))
I'm not sure that's what you're looking for, but hope so.
As you know, pyspark evaluates lazily: it only does the computation when there is an action to run (for example a "df.count()" or a "df.show()"). So what you can do is set the number of shuffle partitions between those actions.
You can write:
from pyspark.sql.functions import sum, max  # note: these shadow the Python builtins
sparkSession.sql("set spark.sql.shuffle.partitions=100")
# your spark code here with some transformation and at least one action
df = df.withColumn("sum", sum(df.A).over(your_window_function))
df.count() # your action
df = df.filter(df.B < 10)
df.count() # another action; do not reassign df here, count() returns an int
sparkSession.sql("set spark.sql.shuffle.partitions=10")
# you reduce the number of partitions because you know you will have a lot
# less data
df = df.withColumn("max", max(df.A).over(your_other_window_function))
df.count() # your action
I have a text file consisting of a large number of random floating values separated by spaces.
I am loading this file into a RDD in scala.
How does this RDD get partitioned?
Also, is there any method to generate custom partitions such that all partitions have equal number of elements along with an index for each partition?
val dRDD = sc.textFile("hdfs://master:54310/Data/input*")
val keyval = dRDD.map(x => process(x.trim().split(' ').map(_.toDouble), query_norm, m, r))
Here I am loading multiple text files from HDFS, and process is a function I am calling.
Can I have a solution with mapPartitionsWithIndex, and how can I access that index inside the process function? Map shuffles the partitions.
How does an RDD get partitioned?
By default a partition is created for each HDFS block, which by default is 64MB (128MB in newer Hadoop versions). Read more here.
How to balance my data across partitions?
First, take a look at the three ways one can repartition their data:
1) Pass a second parameter, the desired minimum number of partitions for your RDD, into textFile(), but be careful:
In [14]: lines = sc.textFile("data")
In [15]: lines.getNumPartitions()
Out[15]: 1000
In [16]: lines = sc.textFile("data", 500)
In [17]: lines.getNumPartitions()
Out[17]: 1434
In [18]: lines = sc.textFile("data", 5000)
In [19]: lines.getNumPartitions()
Out[19]: 5926
As you can see, [16] doesn't do what one would expect, since the number of partitions the RDD has, is already greater than the minimum number of partitions we request.
2) Use repartition(), like this:
In [22]: lines = lines.repartition(10)
In [23]: lines.getNumPartitions()
Out[23]: 10
Warning: This will invoke a shuffle and should be used when you want to increase the number of partitions your RDD has.
From the docs:
The shuffle is Spark’s mechanism for re-distributing data so that it’s grouped differently across partitions. This typically involves copying data across executors and machines, making the shuffle a complex and costly operation.
3) Use coalesce(), like this:
In [25]: lines = lines.coalesce(2)
In [26]: lines.getNumPartitions()
Out[26]: 2
Here, Spark knows that you will shrink the RDD and takes advantage of it. Read more about repartition() vs coalesce().
But will all this guarantee that your data will be perfectly balanced across your partitions? Not really, as I experienced in How to balance my data across the partitions?
An RDD loaded with textFile() has no partitioner; its partitions follow the input splits. Hash partitioning is the default only for key-based shuffle operations. To specify a custom partitioner, you can use rdd.partitionBy() on a key-value RDD, provided with your own partitioner.
I don't think it's OK to use coalesce() here: per the API docs, coalesce() without a shuffle can only reduce the number of partitions, and we can't specify a custom partitioner with coalesce() either.
You can generate custom partitions using the coalesce function:
coalesce(numPartitions: Int, shuffle: Boolean = false): RDD[T]
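For the "equal number of elements per partition, with an index" part of the question, here is a plain-Python sketch of round-robin assignment, an assumption about what "balanced" should mean here. In Spark, one way to get a similar effect might be to key each element by its position with zipWithIndex, use partitionBy with a partitioner that takes the position modulo the partition count, and read the partition index through mapPartitionsWithIndex.

```python
def balanced_partitions(elements, num_partitions):
    """Assign elements round-robin so partition sizes differ by at most one.

    Returns a list of (partition_index, elements_in_that_partition) pairs.
    """
    parts = [[] for _ in range(num_partitions)]
    for position, element in enumerate(elements):
        parts[position % num_partitions].append(element)
    return list(enumerate(parts))

# Hypothetical input values.
values = [0.5, 1.5, 2.5, 3.5, 4.5, 5.5, 6.5]
indexed = balanced_partitions(values, 3)
sizes = [len(part) for _, part in indexed]

for index, part in indexed:
    print(index, part)
```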