I would like to insert PySpark DataFrame content into Redis in an efficient way. I've tried a couple of methods, but none of them give the expected results.
Converting the DataFrame to JSON takes about 30 seconds. The goal is to SET the JSON payload into a Redis cluster for consumption.
I'm also trying the spark-redis library (https://github.com/RedisLabs/spark-redis/blob/master/doc/python.md) so that the results are inserted into Redis by all the worker nodes, to see if it makes a big difference. Even this approach takes about the same amount of time to get the results into Redis.
I'm looking for expert suggestions on how to clear this bottleneck and see if I can bring it down to less than 5 seconds. Thanks.
I'm using an EMR cluster with 1+4 nodes, each with 16 cores and 64 GB of memory.
js = json.dumps(df.toJSON().collect()) #takes 29 seconds
redis.set(key1, js) #takes 1 second
df.write.format("org.apache.spark.sql.redis").option("table", key1).mode('append').save() #takes 28 seconds
The first two lines of code convert the df into JSON (29 seconds) and set it into Redis (1 second).
or
The last line of code uses the worker nodes to insert the df content directly into Redis, but it also takes about 28 seconds.
My Spark Scala app is getting stuck at the statement below, and it runs for more than 3 hours before getting killed due to the timeout settings. Any pointers on how to understand and interpret the job execution in the YARN UI and debug this issue are appreciated.
dataset
.repartition(100,$"Id")
.write
.mode(SaveMode.Overwrite)
.partitionBy(dateColumn)
.parquet(temppath)
I have a bunch of joins; the largest dataset is ~15 million rows and the smallest is < 100 rows. I tried multiple options like increasing the executor memory and Spark driver memory, but no luck so far. Note that I have cached the datasets I use multiple times, and the final dataset's storage level is set to MEMORY_AND_DISK_SER.
Not sure whether the executors summary below will help or not:
Executors (summary)
Total tasks   Input    Shuffle read   Shuffle write
7749          98 GB    77 GB          106 GB
Appreciate any pointers on how to go about finding and understanding the bottleneck based on the query plan or any other info.
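For reference, a quick skew check over the repartition key and the date column (a sketch using the names from the snippet above, nothing more) looks like:

import org.apache.spark.sql.functions.col
import spark.implicits._  // for the $"..." syntax

// Rows per repartition key and per output partition; a few very large groups
// would explain straggler tasks hanging during the shuffle/write.
dataset.groupBy($"Id").count().orderBy($"count".desc).show(20)
dataset.groupBy(col(dateColumn)).count().orderBy($"count".desc).show(20)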
I have a use case which I am trying to solve using Spark. I have to call an API which expects a batch size and a token, and it gives back a list of JSON objects along with the token for the next page. I have to call this API until all the results are returned and write them all to S3 in Parquet format. The number of returned objects can range from 0 to 100 million.
My approach is to first get, say, a batch of 1 million objects, convert them into a Dataset, and then write to Parquet using
dataSet.repartition(1).write.mode(SaveMode.Append)
.option("mapreduce.fileoutputcommitter.algorithm.version", "2")
.parquet(s"s3a://somepath/")
and then repeat the process until the API says there is no more data, i.e., the token is null.
So the API calls have to run on the driver, sequentially, and once I get a million records I write them to S3.
I have been seeing memory issues on the driver:
Application application_1580165903122_19411 failed 1 times due to AM Container for appattempt_1580165903122_19411_000001 exited with exitCode: -104
Diagnostics: Container [pid=28727,containerID=container_1580165903122_19411_01_000001] is running beyond physical memory limits. Current usage: 6.6 GB of 6.6 GB physical memory used; 16.5 GB of 13.9 GB virtual memory used. Killing container.
Dump of the process-tree for container_1580165903122_19411_01_000001 :
I have seen some weird behavior: sometimes 30 million works fine and sometimes it fails due to this. Even 1 million fails sometimes.
I am wondering if I am making some very silly mistake, or is there a better approach for this?
This design is not scalable and puts a lot of pressure on the driver, so it is expected to crash. Additionally, a lot of data is accumulated in memory before writing to S3.
I recommend using Spark Streaming to read data from the API. That way many executors do the work and the solution is much more scalable. Here is an example - RestAPI service call from Spark Streaming.
In those executors you can accumulate the API responses in a balanced way, say 20,000 records at a time rather than waiting for 5M records, and after every 20,000 write them to S3 in "append" mode. The "append" mode lets multiple processes work in tandem without stepping on each other; a rough sketch follows.
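A minimal sketch of that idea, assuming a hypothetical ApiReceiver and a placeholder fetchPage standing in for the real HTTP call; the names, the 60-second batch interval, and the record format are illustrative, not from the linked example:

import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.receiver.Receiver

// Hypothetical receiver: pages through the REST API and hands each page to Spark.
class ApiReceiver(batchSize: Int) extends Receiver[String](StorageLevel.MEMORY_AND_DISK_2) {
  def onStart(): Unit = new Thread("api-poller") {
    override def run(): Unit = {
      var token: Option[String] = Some("")                    // initial token
      while (!isStopped() && token.isDefined) {
        val (records, next) = fetchPage(token.get, batchSize) // one page of JSON strings
        store(records.iterator)                               // hand the page to Spark
        token = next                                          // None when the API is exhausted
      }
    }
  }.start()

  def onStop(): Unit = ()

  // Placeholder: replace with the real API client call.
  private def fetchPage(token: String, size: Int): (Seq[String], Option[String]) =
    (Seq.empty, None)
}

object ApiToS3 {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("api-to-s3").getOrCreate()
    import spark.implicits._

    val ssc = new StreamingContext(spark.sparkContext, Seconds(60))
    ssc.receiverStream(new ApiReceiver(batchSize = 20000)).foreachRDD { rdd =>
      if (!rdd.isEmpty()) {
        // Each micro-batch is parsed and appended independently, so no single
        // process has to hold millions of records in memory before the write.
        spark.read.json(rdd.toDS())
          .write.mode("append")
          .parquet("s3a://somepath/")
      }
    }
    ssc.start()
    ssc.awaitTermination()
  }
}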
I do a fair amount of ETL using Apache Spark on EMR.
I'm fairly comfortable with most of the tuning necessary to get good performance, but I have one job that I can't seem to figure out.
Basically, I'm taking about 1 TB of parquet data - spread across tens of thousands of files in S3 - and adding a few columns and writing it out partitioned by one of the date attributes of the data - again, parquet formatted in S3.
I run like this:
spark-submit --conf spark.dynamicAllocation.enabled=true --num-executors 1149 --conf spark.driver.memoryOverhead=5120 --conf spark.executor.memoryOverhead=5120 --conf spark.driver.maxResultSize=2g --conf spark.sql.shuffle.partitions=1600 --conf spark.default.parallelism=1600 --executor-memory 19G --driver-memory 19G --executor-cores 3 --driver-cores 3 --class com.my.class path.to.jar <program args>
The size of the cluster is dynamically determined based on the size of the input data set, and the num-executors, spark.sql.shuffle.partitions, and spark.default.parallelism arguments are calculated based on the size of the cluster.
The code roughly does this:
val df = (read from s3 and add a few columns like timestamp and source file name)
val dfPartitioned = df.coalesce(numPartitions)
val sqlDFProdDedup = spark.sql(s"""(query to dedup against prod data)""")
sqlDFProdDedup.repartition($"partition_column")
.write.partitionBy("partition_column")
.mode(SaveMode.Append).parquet(outputPath)
When I look at the ganglia chart, I get a huge resource spike while the de-dup logic runs and some data shuffles, but then the actual writing of the data only uses a tiny fraction of the resources and runs for several hours.
I don't think the primary issue is partition skew, because the data should be fairly distributed across all the partitions.
The partition column is essentially a day of the month, so each job typically only has 5-20 partitions, depending on the span of the input data set. Each partition typically has about 100 GB of data across 10-20 parquet files.
I'm setting spark.sql.files.maxRecordsPerFile to manage the size of those output files.
So, my big question is: how can I improve the performance here?
Simply adding resources doesn't seem to help much.
I've tried making the executors larger (to reduce shuffling) and also to increase the number of CPUs per executor, but that doesn't seem to matter.
Thanks in advance!
Zack, I have a similar use case with 'n' times more files to process on a daily basis. I am going to assume that you are using the code above as is and trying to improve the performance of the overall job. Here are a couple of my observations:
Not sure what the coalesce(numPartitions) number actually is, or why it's being used before the de-duplication process. Your spark-submit shows you are creating 1600 partitions, and that's good enough to start with.
If you are going to repartition before the write, then the coalesce above may not be beneficial at all, as repartition will shuffle the data.
Since you say you are writing 10-20 parquet files, it means you are only using 10-20 cores for the write in the last part of your job, which is the main reason it's slow. Based on the 100 GB estimate, each parquet file ranges from roughly 5 GB to 10 GB, which is really huge; I doubt anyone will be able to open them on a local laptop or an EC2 machine unless they use EMR or similar (with huge executor memory if reading the whole file, or spilling to disk), because the memory requirement will be too high. I recommend creating parquet files of around 1 GB to avoid those issues.
Also, if you create 1 GB parquet files, you will likely speed up the process 5 to 10 times, as you will be using more executors/cores to write them in parallel. You can run an experiment by simply writing the dataframe with default partitions.
Which brings me to the point that you really don't need repartition(), since you already have the write.partitionBy("partition_date") call. Your repartition() call is forcing the dataframe to have at most 30-31 partitions, depending on the number of days in that month, and that is what drives the number of files being written. write.partitionBy("partition_date") writes the data into the S3 partitions anyway, and if your dataframe had, say, 90 partitions it would write roughly 3 times faster (3 * 30). df.repartition() is forcing it to slow down. Do you really need 5 GB or larger files?
Another major point is that Spark's lazy evaluation is sometimes too smart. In your case, it will most likely use only the number of executors implied by repartition(number) for the whole program. Instead, try df.cache() -> df.count() and then df.write(); a rough sketch follows. What this does is force Spark to use all available executor cores. I am assuming you are reading files in parallel; in your current implementation you are likely using only 20-30 cores. One point of caution: as you are using r4/r5 machines, feel free to up your executor memory to 48G with 8 cores. I have found 8 cores to be faster for my tasks than the standard 5-core recommendation.
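A rough sketch of that ordering, reusing the names from the question (the cache/count step is the only addition; everything else mirrors the original write, minus the explicit repartition()):

// Materialize the de-duplicated dataframe across all executor cores first,
// then let partitionBy alone drive the S3 layout.
val deduped = sqlDFProdDedup.cache()
deduped.count()                       // forces the read + dedup to run in parallel

deduped.write
  .partitionBy("partition_column")
  .mode(SaveMode.Append)
  .parquet(outputPath)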
Another pointer is to try ParallelGC instead of G1GC. For a use case like this, where you are reading thousands of files, I have noticed it performs better than, or at least no worse than, G1GC. Please give it a try.
In my workload, I use a coalesce(n)-based approach where 'n' gives me a 1 GB parquet file. I read files in parallel using ALL the cores available on the cluster. Only during the write part are my cores idle, but there's not much you can do to avoid that.
I am not sure how spark.sql.files.maxRecordsPerFile works in conjunction with coalesce() or repartition(), but I have found 1 GB to be acceptable with pandas, Redshift Spectrum, Athena, etc.
Hope it helps.
Charu
Here are some optimizations for faster running.
(1) File committer - this is how Spark will write the part files out to the S3 bucket. Each operation is distinct and will be based upon
spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version 2
Description
This will write the files directly as part files instead of initially writing them to temp files and copying them over to their end-state part files.
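One way to apply it in code (a sketch; the same setting can equally be passed with --conf at spark-submit time):

// Algorithm version 2 commits task output directly, skipping the final
// rename/copy from the temporary location.
spark.sparkContext.hadoopConfiguration
  .set("mapreduce.fileoutputcommitter.algorithm.version", "2")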
(2) For file size, you can derive it from the average number of bytes per record. Below I am figuring out the number of bytes per record in order to estimate the number of records that fit in 1024 MB. I would try 1024 MB per partition first, then move upwards.
import org.apache.spark.util.SizeEstimator

// Estimated in-memory size of the dataframe, in bytes.
val numberBytes : Long = SizeEstimator.estimate(inputDF.rdd)
// Number of ~1024 MB chunks (1024 MB = 1073741824 bytes).
val reduceBytesTo1024MB = numberBytes/1073741824L
val numberRecords = inputDF.count
// Approximate number of records per 1024 MB chunk.
val recordsFor1024MB = (numberRecords/reduceBytesTo1024MB).toInt + 1
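The question already mentions spark.sql.files.maxRecordsPerFile; the value computed above can feed it, for example:

// Cap each output file at roughly 1024 MB worth of records.
spark.conf.set("spark.sql.files.maxRecordsPerFile", recordsFor1024MB.toString)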
(3) [I haven't tried this] EMR committer - if you are using EMR 5.19 or higher and you are outputting Parquet, you can set the Parquet-optimized committer to TRUE.
spark.sql.parquet.fs.optimized.committer.optimization-enabled true
I have setup Spark 2.0 and Cassandra 3.0 on a local machine (8 cores, 16gb ram) for testing purposes and edited spark-defaults.conf as follows:
spark.python.worker.memory 1g
spark.executor.cores 4
spark.executor.instances 4
spark.sql.shuffle.partitions 4
Next I imported 1.5 million rows into Cassandra:
create table test (
tid int,
cid int,
pid int,
ev list<double>,
primary key (tid)
)
test.ev is a list containing numeric values i.e. [2240,2081,159,304,1189,1125,1779,693,2187,1738,546,496,382,1761,680]
Now in the code, to test the whole thing, I just created a SparkSession, connected to Cassandra and ran a simple select count:
cassandra = spark.read.format("org.apache.spark.sql.cassandra")
df = cassandra.load(keyspace="testks",table="test")
df.select().count()
At this point, Spark outputs the count and takes about 28 seconds to finish the Job, distributed in 13 Tasks (in Spark UI, the total Input for the Tasks is 331.6MB)
Questions:
Is that the expected performance? If not, what am I missing?
Theory says the number of partitions of a DataFrame determines the number of tasks Spark will distribute the job into. If I am setting spark.sql.shuffle.partitions to 4, why is it creating 13 tasks? (I also checked the number of partitions by calling rdd.getNumPartitions() on my DataFrame.)
Update
A common operation I would like to test over this data:
Query a large data set, say, from 100,000 ~ N rows grouped by pid
Select ev, a list<double>
Perform an average on each member, assuming by now each list has the same length i.e df.groupBy('pid').agg(avg(df['ev'][1]))
As #zero323 suggested, I deployed an external machine (2 GB RAM, 4 cores, SSD) with Cassandra just for this test, and loaded the same data set. The result of df.select().count() was, as expected, greater latency and overall poorer performance compared with my previous test (it took about 70 seconds to finish the job).
Edit: I misunderstood his suggestion. #zero323 meant to let Cassandra perform the count instead of using Spark SQL, as explained here.
Also I wanted to point out that I am aware of the inherent anti-pattern of setting a list<double> instead a wide row for this type of data, but my concerns at this moment are more the time spent on retrieval of a large dataset rather than the actual average computation time.
Is that the expected performance? If not, what am I missing?
It looks slowish but it is not exactly unexpected. In general count is expressed as
SELECT 1 FROM table
followed by Spark-side summation. So while it is optimized, it is still rather inefficient, because you have to fetch N long integers from the external source just to sum them locally.
As explained in the docs, Cassandra-backed RDDs (not Datasets) provide an optimized cassandraCount method which performs server-side counting.
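A minimal sketch with the connector's RDD API, assuming the Spark Cassandra Connector is on the classpath and using the keyspace/table names from the question:

import com.datastax.spark.connector._

// Pushes the count down to Cassandra instead of fetching a value per row into Spark.
val rowCount = sc.cassandraTable("testks", "test").cassandraCount()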
Theory says the number of partitions of a DataFrame determines the number of tasks Spark will distribute the job in. If I am setting the spark.sql.shuffle.partitions to (...), why is creating (...) Tasks?
Because spark.sql.shuffle.partitions is not used here. This property is used to determine the number of partitions for shuffles (when data is aggregated by some set of keys), not for Dataset creation or for global aggregations like count(*) (which always use 1 partition for the final aggregation).
If you are interested in controlling the number of initial partitions, you should take a look at spark.cassandra.input.split.size_in_mb, which defines:
Approx amount of data to be fetched into a Spark partition. Minimum number of resulting Spark partitions is 1 + 2 * SparkContext.defaultParallelism
As you can see, another factor here is spark.default.parallelism, but it is not exactly a subtle configuration, so depending on it is in general not an optimal choice.
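For example, a sketch of setting it when building the SparkConf (64 is just an illustrative value, not a recommendation):

import org.apache.spark.SparkConf

// Smaller split size -> more, smaller input partitions when scanning Cassandra.
val conf = new SparkConf()
  .set("spark.cassandra.input.split.size_in_mb", "64")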
I see that this is a very old question, but maybe someone needs it now.
When running Spark on a local machine, it is very important to set the master in SparkConf to "local[*]", which, according to the documentation, runs Spark with as many worker threads as there are logical cores on your machine.
It helped me increase the performance of the count() operation by 100% on my local machine compared to master "local".
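For example (a sketch using the SparkSession builder, which is equivalent to setting the master on SparkConf):

import org.apache.spark.sql.SparkSession

// "local[*]" runs one worker thread per logical core on the machine.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("local-count-test")
  .getOrCreate()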
Some followups at the bottom
I have a test installation of Spark and Cassandra where I have 6 nodes with 128GiB of RAM and 16 CPU cores each. Each node runs Spark and Cassandra. I set up my keyspace with the SimpleStrategy and a replication factor of 3 (i.e., fairly standard).
My table is very simple, like this:
create table if not exists mykeyspace.values (channel_id timeuuid, day int, time bigint, value double, primary key ((channel_id, day), time)) with clustering order by (time asc)
time is simply a unix timestamp in nanoseconds (the measuring devices the values come from are that precise and this precision is wanted), day is this timestamp in days (i.e., days since 1970-01-01).
I now inserted about 200 GiB of values for about 400 channels and tested a very simple thing - calculate the 10-minute average of every channel:
sc.
cassandraTable("mykeyspace", "values").
map(r => (r.getLong("time"), r.getUUID("channel_id"), r.getDouble("value"))).
map(t => (t._1 / 600L / 1000000000L, t._2) -> (t._3, 1.0)).
reduceByKey((a, b) => (a._1 + b._1) -> (a._2 + b._2)).
map(t => (t._1._1 * 600L * 1000000000L, t._1._2, t._2._1 / t._2._2))
When I now do this calculation, even without saving the result (i.e., by using a simple count()), it takes a VERY long time and I get very bad read performance.
When I do top on the nodes, Cassandra's java process takes about 800% CPU, which is OK because this is about half the load the node can take; the other half is taken by Spark.
However, I noticed a strange thing:
When I run iotop I expect to see a lot of disk read, but I see a lot of disk WRITE instead, all of which comes from kworker.
When I do iostat -x -t 10, I also see a lot of writes going on.
Swap is disabled.
When I run a similar calculation directly on the CSV files the data came from, which are stored in HDFS and loaded via sc.newAPIHadoopFile with a custom input format, the process finishes much faster (the calculation takes about an hour with Cassandra but about 5 minutes with files from HDFS).
So where can I start troubleshooting and tuning?
Followup 1
With the help of RussS (see comments) I discovered that logging was set to DEBUG. I disabled this, set logging to ERROR, and also disabled GC logging, but this did not change anything at all.
I also tried keyBy as the very same user pointed out, but this also did not change anything.
I also tried doing it locally, once from .NET and once from Scala, and there the database is accessed as expected, i.e., no writes.
Followup 2
I think I got it. For once, I didn't see the forest for the trees: the hour I stated earlier for 200 GiB still amounts to about 56 MiB/s of throughput. Since the hardware I run my installation on is far from optimal (it is a high-performance server which runs Microsoft Hyper-V, which in turn runs the nodes virtually, and the hard disks of this machine are quite slow), this is indeed the throughput I should expect. Since the host is just one machine with one RAID array where the disks of the nodes are virtual HDDs, I can't expect the performance to magically go through the roof.
I also tried to run Spark standalone which improves the performance a bit (I now get about 75 MiB/s), and also the constant writes are gone with this - I only get occasional spikes I expect because of shuffling.
As for the CSV files being much faster: the raw data in CSV is about 50 GiB, my custom FileInputFormat reads it line by line, and it uses a very fast string-to-double parser which only knows the US format but is faster than Java's parseDouble or Scala's toDouble. With this special tweaking I get about 170 MiB/s in YARN mode.
So I suppose I should, for once, improve my CQL queries to limit the data that gets read, and try to tweak some YARN settings.
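A sketch of what such a narrowed read could look like with the connector's select/where pushdown (the channel id and day values are placeholders, not real data):

import java.util.UUID
import com.datastax.spark.connector._

// Read only the needed columns and restrict the scan to a single
// (channel_id, day) partition so Cassandra filters server-side.
val someChannel = UUID.fromString("00000000-0000-1000-8000-000000000000") // placeholder
val someDay     = 17000                                                   // placeholder (days since 1970-01-01)

val slice = sc.cassandraTable("mykeyspace", "values")
  .select("time", "channel_id", "value")
  .where("channel_id = ? and day = ?", someChannel, someDay)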