How to speed up downloading a CSV file locally with PySpark (databricks)? - pyspark

We created an ImageClassifier to predict whether certain Instagram images belong to a certain class. Running this model works fine.
# imports for the pipeline (DeepImageFeaturizer comes from the sparkdl package)
from sparkdl import DeepImageFeaturizer
from pyspark.ml.classification import LogisticRegression
from pyspark.ml import Pipeline

# create a deep image featurizer backed by the InceptionV3 model
featurizer = DeepImageFeaturizer(inputCol="image",
                                 outputCol="features",
                                 modelName="InceptionV3")
# use logistic regression for speed and reliability
lr = LogisticRegression(maxIter=5, regParam=0.03,
                        elasticNetParam=0.5, labelCol="label")
# define the pipeline and fit it
sparkdn = Pipeline(stages=[featurizer, lr])
spark_model = sparkdn.fit(df)
We built this separately from our basetable (which runs on a larger cluster). We need to extract the spark_model predictions as a CSV to import them back into the other notebook and merge them with our basetable.
To do this we have tried the following:
image_final_estimation = spark_model.transform(image_final)
display(image_final_estimation)  # since this gives an option in Databricks to download the CSV
AND
image_final_estimation.coalesce(1).write.csv(path='imagesPred2.csv')  # and then we would be able to read it back in with spark.read.csv
The thing is that these operations take very long (probably due to the nature of the task) and they crash our cluster. We are able to show our outcome, but only with .show(), not with the display() method.
Is there any other way to save this csv locally? Or how can we improve the speed of these tasks?
Please note that we use the community edition of Databricks.

When writing DataFrames to files, a good way to parallelize the write is to define an appropriate number of partitions for that DataFrame/RDD.
In the code you showed, you are using the coalesce function, which basically reduces the number of partitions to 1 and therefore removes most of the parallelism.
On Databricks Community Edition, I tried the following test with a CSV dataset provided by Databricks (https://docs.databricks.com/getting-started/databricks-datasets.html). The idea is to measure the time elapsed writing the data to CSV using one partition vs. using many partitions.
carDF = spark.read.option("header", True).csv("dbfs:/databricks-datasets/Rdatasets/data-001/csv/car/*")
print("Total count of Rows {0}".format(carDF.count()))
print("Original Partitions Number: {0}".format(carDF.rdd.getNumPartitions()))
>>Total count of Rows 39005
>>Original Partitions Number: 7
%timeit carDF.write.format("csv").mode("overwrite").save("/tmp/caroriginal")
>>2.79 s ± 180 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
So, writing the dataset using 7 partitions took 2.79 seconds.
newCarDF = carDF.coalesce(1)
print("Total count of Rows {0}".format(newCarDF.count()))
print("New Partitions Number: {0}".format(newCarDF.rdd.getNumPartitions()))
>>Total count of Rows 39005
>>New Partitions Number: 1
%timeit newCarDF.write.format("csv").mode("overwrite").save("/tmp/carmodified")
>>4.13 s ± 172 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
So, for the same DataFrame, writing to CSV with one partition took 4.13 seconds.
In conclusion, in this case the coalesce(1) call is hurting the write performance.
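Building on that, if the goal is just to move the predictions into the other notebook, a single CSV file is not strictly needed: you can write the DataFrame with its existing partitioning to a directory and then point spark.read.csv at that directory. A rough sketch under assumptions not in the original post (a hypothetical id column to join on later, and an example output path); dropping the image and feature-vector columns before writing also helps, since CSV cannot store them and they dominate the work:
# Sketch only: "instagram_id" and the output path are assumptions, not from the post.
(image_final_estimation
    .select("instagram_id", "prediction")   # keep only the small columns you need
    .write
    .mode("overwrite")
    .option("header", True)
    .csv("dbfs:/tmp/imagesPred2"))

# In the other notebook, read the whole output directory back as one DataFrame.
predictions = (spark.read
    .option("header", True)
    .csv("dbfs:/tmp/imagesPred2"))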
Hope this helps

Related

Parallelism in Cassandra read using Scala

I am trying to invoke parallel reading from a Cassandra table using Spark, but I am not able to achieve parallelism, as only one read happens at any given time. What approach should be followed to achieve this?
I'd recommend you go with the approach below, sourced from Russell Spitzer's blog:
Manually dividing our partitions using a Union of partial scans :
Pushing the task to the end user is also a possibility (and the current workaround). Most end users already understand why they have long partitions and know in general the domain their column values fall in. This makes it possible for them to manually divide up a request so that it chops up large partitions.
For example, assuming the user knows that clustering column c spans from 1 to 1000000, they could write code like:
val minRange = 0
val maxRange = 1000000
val numSplits = 10
val subSize = (maxRange - minRange) / numSplits

sc.union(
  (minRange to maxRange by subSize)
    .map(start =>
      sc.cassandraTable("ks", "tab")
        .where(s"c > $start and c < ${start + subSize}"))
)
Each RDD would contain a unique set of tasks drawing only portions of full partitions. The union operation joins all those disparate tasks into a single RDD. The maximum number of rows any single Spark partition would draw from a single Cassandra partition would be limited to maxRange / numSplits. This approach, while requiring user intervention, would preserve locality and would still minimize the jumps between disk sectors.
Also see the read-tuning-parameters section of the connector documentation.
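If you are using the DataFrame API rather than sc.cassandraTable, a related knob is the connector's input split size, which controls how much Cassandra data ends up in each Spark partition. A rough PySpark sketch, assuming the Spark Cassandra Connector is on the classpath; the exact property name (spark.cassandra.input.split.sizeInMB below) varies between connector versions, and the host, keyspace and table names are placeholders:
from pyspark.sql import SparkSession

# Smaller split size => more Spark partitions => more reads running in parallel.
spark = (SparkSession.builder
    .appName("cassandra-parallel-read")
    .config("spark.cassandra.connection.host", "cassandra-host")   # placeholder host
    .config("spark.cassandra.input.split.sizeInMB", "64")          # name differs by connector version
    .getOrCreate())

df = (spark.read
    .format("org.apache.spark.sql.cassandra")
    .options(keyspace="ks", table="tab")
    .load())

print(df.rdd.getNumPartitions())   # verify the scan is split across many partitions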

dask dataframe groupby resulting in one partition memory issue

I am reading 64 compressed CSV files (probably 70-80 GB) into one Dask data frame and then running a groupby with aggregations.
The job never completes because apparently the groupby creates a data frame with only one partition.
This post and this post already addressed this issue, but they focus on the computational graph and not on the memory issue you run into when the resulting data frame is too large.
I tried a workaround with repartitioning, but the job still won't complete.
What am I doing wrong? Will I have to use map_partitions? This is very confusing, as I expected Dask to take care of partitioning everything, even after aggregation operations.
import dask
import dask.dataframe as dd
from dask.distributed import Client, progress
from dask.diagnostics import ProgressBar

client = Client(n_workers=4, threads_per_worker=1, memory_limit='8GB', diagnostics_port=5000)
client
dask.config.set(scheduler='processes')

dB3 = dd.read_csv("boden/expansion*.csv",  # read in parallel
                  blocksize=None,          # 64 files
                  sep=',',
                  compression='gzip')

aggs = {
    'boden': ['count', 'min']
}
dBSelect = dB3.groupby(['lng', 'lat']).agg(aggs).repartition(npartitions=64)
dBSelect = dBSelect.reset_index()
dBSelect.columns = ['lng', 'lat', 'bodenCount', 'boden']
dBSelect = dBSelect.drop('bodenCount', axis=1)

with ProgressBar(dt=30):
    dBSelect.compute().to_parquet('boden/final/boden_final.parq', compression=None)
Most groupby aggregation outputs are small and fit easily in one partition. Clearly this is not the case in your situation.
To resolve this you should use the split_out= parameter to your groupby aggregation to request a certain number of output partitions.
df.groupby(['x', 'y', 'z']).mean(split_out=10)
Note that using split_out= will significantly increase the size of the task graph (it has to mildly shuffle/sort your data ahead of time) and so may increase scheduling overhead.
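Applied to the code above, that could look roughly like this (split_out=16 is an arbitrary example value to tune, not a recommendation):
# Sketch: let the groupby aggregation itself produce several output partitions
# instead of collapsing everything into one. 16 is an arbitrary example value.
dBSelect = (dB3.groupby(['lng', 'lat'])
               .agg(aggs, split_out=16)
               .reset_index())
With the result spread over several partitions, you can also write it with Dask's own to_parquet instead of calling compute() and materialising everything in memory first.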

Spark: stage X contains task of a very large size when running sc.binaryFiles()

I'm trying to load a set of ~1M files stored on S3. When running sc.binaryFiles("s3a://BUCKETNAME/*").count()
I'm getting WARN TaskSetManager: Stage 0 contains a task of very large size (177 KB). The maximum recommended task size is 100 KB. This is followed by failed tasks.
I see that it infers 128 partitions for this stage, which is too low. Note that when running the same command on a 400K-file bucket, the number of partitions is much higher (~2K partitions) and the action succeeds.
Setting a higher minPartitions didn't help;
setting a higher spark.default.parallelism didn't help either.
The only thing that worked was to create multiple smaller RDDs of 1000 files each and run sc.union on them. The problem with this approach is that it's too slow.
How can this issue be mitigated?
UPDATE:
I went on to see how the number of partitions is determined in BinaryFileRDD.getPartitions(), which got me to this piece of code:
def setMinPartitions(sc: SparkContext, context: JobContext, minPartitions: Int) {
  val defaultMaxSplitBytes = sc.getConf.get(config.FILES_MAX_PARTITION_BYTES)
  val openCostInBytes = sc.getConf.get(config.FILES_OPEN_COST_IN_BYTES)
  val defaultParallelism = sc.defaultParallelism
  val files = listStatus(context).asScala
  val totalBytes = files.filterNot(_.isDirectory).map(_.getLen + openCostInBytes).sum
  val bytesPerCore = totalBytes / defaultParallelism
  val maxSplitSize = Math.min(defaultMaxSplitBytes, Math.max(openCostInBytes, bytesPerCore))
  super.setMaxSplitSize(maxSplitSize)
}
I followed the computation and it still didn't make sense; I should be getting a much larger number.
So I tried to reduce the config.FILES_MAX_PARTITION_BYTES config (spark.files.maxPartitionBytes). This did increase the number of partitions and made the job finish, but I'm still getting the original warning (with a somewhat smaller task size), and the number of partitions is still far smaller than when running on a 400K file set.
The problem was rooted in the sizes of the files: to my surprise, the files in S3 were not uploaded properly, and they were 100 times smaller than they should have been. This caused setMinPartitions to calculate splits that contained a large number of small files. Each split is essentially a comma-separated string of file paths; since we had many files per split, we got a very long instruction string that had to be communicated to all workers. This burdened the network and caused the entire flow to fail. Setting spark.files.maxPartitionBytes to a lower value solved the issue.
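For reference, a rough PySpark sketch of lowering that setting; the 32 MB value is only an example, and the config has to be in place before the SparkContext is created:
from pyspark import SparkConf, SparkContext

# Sketch: a smaller max split size makes BinaryFileRDD create more, smaller
# partitions. 32 MB is an arbitrary example value.
conf = (SparkConf()
        .setAppName("binary-files-count")
        .set("spark.files.maxPartitionBytes", str(32 * 1024 * 1024)))

sc = SparkContext(conf=conf)
print(sc.binaryFiles("s3a://BUCKETNAME/*").getNumPartitions())   # should now be higher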

Spark shuffle read takes significant time for small data

We are running the following stage DAG and experiencing long shuffle read time for relatively small shuffle data sizes (about 19MB per task)
One interesting aspect is that waiting tasks within each executor/server have equivalent shuffle read time. Here is an example of what it means: for the following server one group of tasks waits about 7.7 minutes and another one waits about 26 s.
Here is another example from the same stage run. The figure shows 3 executors / servers each having uniform groups of tasks with equal shuffle read time. The blue group represents killed tasks due to speculative execution:
Not all executors are like that. There are some that finish all their tasks within seconds pretty much uniformly, and the size of remote read data for these tasks is the same as for the ones that wait long time on other servers.
Besides, this type of stage runs 2 times within our application runtime. The servers/executors that produce these groups of tasks with large shuffle read time are different in each stage run.
Here is an example of the task stats table for one of the servers / hosts:
It looks like the code responsible for this DAG is the following:
output.write.parquet("output.parquet")
comparison.write.parquet("comparison.parquet")
output.union(comparison).write.parquet("output_comparison.parquet")
val comparison = data.union(output).except(data.intersect(output)).cache()
comparison.filter(_.abc != "M").count()
We would highly appreciate your thoughts on this.
Apparently the problem was JVM garbage collection (GC). The tasks had to wait until GC was done on the remote executors. The equivalent shuffle read times resulted from the fact that several tasks were waiting on a single remote host that was performing GC. We followed the advice posted here and the problem decreased by an order of magnitude. There is still a small correlation between GC time on remote hosts and local shuffle read time. In the future we plan to try the external shuffle service.
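If you suspect the same cause, a hedged starting point is to make executor GC visible in the logs and try the G1 collector; the flags below are illustrative (they assume Java 8), not a tuned recommendation:
from pyspark.sql import SparkSession

# Sketch: log GC activity on the executors and switch to the G1 collector.
spark = (SparkSession.builder
    .appName("shuffle-read-gc-check")
    .config("spark.executor.extraJavaOptions",
            "-XX:+UseG1GC -verbose:gc -XX:+PrintGCDetails")
    .getOrCreate())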
Since google brought me here with the same problem but I needed another solution...
Another possible reason for a small shuffle taking a long time to read could be that the data is split over many partitions. For example (apologies, this is pyspark as it is all I have used):
(my_df_with_many_partitions    # say it has 1000 partitions
    .filter(very_specific_filter)    # only very few rows pass
    .groupby('blah')
    .count())
The shuffle write from the filter above will be very small, so for the stage after we will have a very small amount to read. But to read it you need to check a lot of empty partitions. One way to address this would be:
(my_df_with_many_partitions
    .filter(very_specific_filter)
    .repartition(1)
    .groupby('blah')
    .count())

YARN killing containers for exceeding memory limits

I am running into an issue where YARN is killing my containers for exceeding memory limits:
Container killed by YARN for exceeding memory limits. physical memory used. Consider boosting spark.yarn.executor.memoryOverhead.
I have 20 nodes of type m3.2xlarge, so each has:
cores: 8
memory: 30 GB
storage: 200 GB EBS
The gist of my application is that I have a couple hundred thousand assets for which I have historical data generated for each hour of the last year, with a total dataset size of 2 TB uncompressed. I need to use this historical data to generate a forecast for each asset. My setup is that I first use s3distcp to move the data, stored as indexed LZO files, to HDFS. I then pull the data in and pass it to Spark SQL to handle the JSON:
val files = sc.newAPIHadoopFile("hdfs:///local/*",
  classOf[com.hadoop.mapreduce.LzoTextInputFormat],
  classOf[org.apache.hadoop.io.LongWritable],
  classOf[org.apache.hadoop.io.Text], conf)
val lzoRDD = files.map(_._2.toString)
val data = sqlContext.read.json(lzoRDD)
I then use a groupBy to group the historical data by asset, creating a tuple of (assetId,timestamp,sparkSqlRow). I figured this data structure would allow for better in memory operations when generating the forecasts per asset.
val p = data.map(asset => (asset.getAs[String]("assetId"),asset.getAs[Long]("timestamp"),asset)).groupBy(_._1)
I then use a foreach to iterate over each row, calculate the forecast, and finally write the forecast back out as a json file to s3.
p.foreach { asset =>
  (1 to dateTimeRange.toStandardHours.getHours).foreach { hour =>
    // determine the hour from the previous year
    val hourFromPreviousYear = (currentHour + hour.hour) - timeRange
    // convert to seconds
    val timeToCompare = hourFromPreviousYear.getMillis
    val al = asset._2.toList
    println(s"Working on asset ${asset._1} for hour $hour with time-to-compare: $timeToCompare")
    // calculate the year-over-year average for the asset
    val yoy = calculateYOYforAsset2(al, currentHour, asset._1)
    // get the historical data for the asset from the previous year
    val pa = asset._2.filter(_._2 == timeToCompare)
      .map(row => calculateForecast(yoy, row._3, asset._1, (currentHour + hour.hour).getMillis))
      .foreach(json => writeToS3(json, asset._1, (currentHour + hour.hour).getMillis))
  }
}
Is there a better way to accomplish this so that I don't hit the memory issue with YARN?
Is there a way to chunk the assets so that the foreach only operates on about 10k at a time vs all 200k of the assets?
Any advice/help appreciated!
It's not your code. And don't worry, foreach does not run all those lambdas concurrently. The problem is that Spark's default value for spark.yarn.executor.memoryOverhead (recently renamed in 2.3+ to spark.executor.memoryOverhead) is overly conservative, which causes your executors to be killed when under load.
The solution is (as suggested by the error message) to increase that value. I would start by setting it to 1 GB (i.e. 1024, since the value is in MB), or more if you are requesting lots of memory for each executor. The goal is to get jobs running without any executors being killed.
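As an illustration in PySpark (1024 MB is just the suggested starting point; on Spark 2.3+ the property is spark.executor.memoryOverhead):
from pyspark import SparkConf, SparkContext

# Sketch: give each executor an extra 1 GB of off-heap overhead headroom.
# On Spark 2.3+ use spark.executor.memoryOverhead instead.
conf = (SparkConf()
        .set("spark.executor.memory", "8g")                  # example executor size
        .set("spark.yarn.executor.memoryOverhead", "1024"))  # value in MB

sc = SparkContext(conf=conf)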
Alternatively, if you control the cluster, you can disable YARN memory enforcement by setting yarn.nodemanager.pmem-check-enabled and yarn.nodemanager.vmem-check-enabled to false in yarn-site.xml.