Efficient way of reading large dataset (10TB) from S3 using EMR Spark, performing filtering operations on partitioned columns, and writing back to S3 - scala

I'd like to read a large (~10 TB per day) log dataset from S3 using EMR Spark, do some filtering based on partitioned and non-partitioned columns, and write the results back to S3. After filtering, there should be less than 1 TB of data left per day.
The dataset is partitioned as day/hour/col1/col2, and each hour holds a roughly similar amount of data for every day. That's why I chunk up the work, read one day at a time, and partition the query by hour. Here is a sample query:
for (date <- 1 to 30) {
  val output_path = "s3a://dest-bucket/logs/day=%d/".format(date)
  spark.read.parquet("s3a://source-bucket/logs/day=%d/".format(date))
    .filter('col1.isin("A", "B", "C"))      // col1 is a partition column in the source S3 layout
    // .filter('col3.isin("X", "Y", "Z"))   // --> Point 4
    .withColumn("hid", hash($"id") % 2000)  // --> Point 5
    .select("col1", "col2", "col3", "col4", "hour") // keep "hour" so the repartition/partitionBy below can see it
    // .repartition(col("hid"))             // --> Point 5
    .repartition(col("hour"))               // --> Point 1
    .write
    // .partitionBy("hid")
    .partitionBy("hour")                    // --> Point 2
    .option("header", "true").mode("overwrite").parquet(output_path)
}
The objective is to run the query as fast as possible and with as few operational issues as possible (OOM, lost/failed executors, etc.). Currently, my query takes 10-15 hours per log day, and it sometimes fails due to too many lost executors. However, copying the entire files (no filtering) with the aws cli takes about the same amount of time on a single node, so I'm hoping to speed the query up by using a large cluster.
To reach this objective, I have several questions that may shed light on the problem:
1. For Spark (v2.4.8) to understand that the source is partitioned on a column, do I need a repartition(col("hour")) statement? What are the pros and cons of repartitioning an already partitioned data source? Even if Spark does partition discovery and figures out that the data is partitioned on multiple columns, is there any benefit to repartitioning based on the knowledge that the data is fairly uniformly distributed over one column (hour)?
2. To speed up the write, I added a partitionBy("hour") statement, hoping to make the write go 24x faster. Otherwise Spark writes one output file at a time (essentially the move out of the _temporary folder is done sequentially), and I don't understand why. How can I make the executors perform the write in parallel?
3. If I have a partitionBy in the write, would I still benefit from a repartition statement?
4. I can forgo the condition on the non-partitioned column col3 if that would significantly improve query performance.
5. My ultimate goal is to hash-partition the users based on their IDs and rewrite the data partitioned by that hash (a new column called hid); see the sketch after this list. However, the reshuffling makes the cluster fail. I have 200 nodes in the cluster, each with 128 GB of memory (m5.8xlarge), so there should be enough memory to hold the data. Is there any way to optimize the reshuffle on a new, unpartitioned column?
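For reference, here is a minimal sketch of what point 5 is aiming for, assuming df is one day's filtered data (as produced by the loop above), output_path is the corresponding destination, and id is the user ID column; pmod is used instead of a bare % so bucket numbers are never negative. This is only an illustration of the intended layout, not a tuned answer.
import org.apache.spark.sql.functions._

val withBucket = df.withColumn("hid", pmod(hash(col("id")), lit(2000))) // 2000 hash buckets of users

withBucket
  .repartition(col("hid"))  // shuffle once, keyed the same way as the write below
  .write
  .partitionBy("hid")       // each task then writes only the hid directories it owns
  .mode("overwrite")
  .parquet(output_path)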
Please see below for the spark configuration I use.
spark-shell \
--master yarn \
--conf "spark.executor.instances=1199" \
--conf "spark.default.parallelism=5995" \
--conf "spark.sql.shuffle.partitions=5995" \
--conf "spark.driver.cores=5" \
--conf "spark.driver.maxResultSize=20g" \
--conf "spark.driver.memory=18g" \
--conf "spark.driver.memoryOverhead=3g" \
--conf "spark.executor.cores=5" \
--conf "spark.executor.memory=18g" \
--conf "spark.executor.memoryOverhead=3g" \
--conf "spark.executor.heartbeatInterval=3600" \
--conf "spark.hadoop.orc.overwrite.output.file=true" \
--conf "spark.hadoop.parquet.enable.summary-metadata=false" \
--conf "spark.sql.execution.arrow.pyspark.enabled=true" \
--conf "spark.sql.execution.arrow.pyspark.fallback.enabled=true" \
--conf "spark.sql.parquet.mergeSchema=false" \
--conf "spark.sql.parquet.int96RebaseModeInRead=CORRECTED" \
--conf "spark.sql.debug.maxToStringFields=100" \
--conf "spark.hadoop.fs.s3.connection.maximum=1000" \
--conf "spark.hadoop.fs.s3.connection.timeout=300000" \
--conf "spark.hadoop.fs.s3.threads.core=250" \
--conf "spark.hadoop.fs.s3a.connection.maximum=1000" \
--conf "spark.hadoop.fs.s3a.connection.timeout=300000" \
--conf "spark.hadoop.fs.s3a.threads.core=250"\
--conf "spark.reducer.maxBlocksInFlightPerAddress=20"
Any advice is appreciated.

Related

Spark to Synapse "truncate" not working as expected

I have a simple requirement: write a dataframe from Spark (Databricks) to a Synapse dedicated pool table and keep refreshing (truncating) it on a daily basis without dropping it.
The documentation suggests using truncate with overwrite mode, but that doesn't seem to work as expected for me, as I keep seeing the table creation date get updated.
I am using:
df.write \
.format("com.databricks.spark.sqldw") \
.option("url", synapse_jdbc) \
.option("tempDir", tempDir) \
.option("useAzureMSI", "true") \
.option("dbTable", table_name) \
.mode("overwrite") \
.option("truncate","true") \
.save()
But there doesn't seem to be any difference whether I use truncate or not. The creation date/time of the table in Synapse gets updated every time I execute the above from Databricks. Can anyone please help with this? What am I missing?
I already have a workaround that works, but it seems more like a workaround than a fix:
.option("preActions", "truncate table "+table_name) \
.mode("append") \
I tried to reproduce your scenario in my environment, and truncate is not working for me with the Synapse connector either.
While researching this issue I found that not all options are supported by the Synapse connector. The official Microsoft documentation provides the list of supported options, such as dbTable, query, user, password, url, encrypt=true, jdbcDriver, tempDir, tempCompression, forwardSparkAzureStorageCredentials, useAzureMSI, enableServicePrincipalAuth, etc.
The truncate option is supported by the jdbc format, not the Synapse connector.
When I change the format from com.databricks.spark.sqldw to jdbc, it works fine.
My Code:
df.write.format("jdbc")
.option("url",synapse_jdbc)
.option("forwardSparkAzureStorageCredentials", "true")
.option("dbTable", table_name)
.option("tempDir", tempdir)
.option("truncate","true")
.mode("overwrite")
.save()
First execution and second execution (screenshots omitted): the table creation time is the same both times.
Conclusion: since the creation time does not change between executions, overwrite is not dropping the table; it is truncating it.

How to read Hive Table with Spark-Sql efficiently

I have a table with 20 GB of data in Hive, and I am reading it using Spark with a Hive context; I can see the data and schema as expected.
However, it is taking around 40 minutes to read the data. Is there an alternative way to read the data from the Hive table more efficiently?
Hive table Sample_Table - 20 GB, no partitions, ORC with Snappy compression (the data explodes to 120 GB while reading from Spark).
val spark = SparkSession.builder().enableHiveSupport().getOrCreate()
val spark_table = spark.sql("select * from Sample_Table")
Environment Details -
Not using any cloud
Nodes - around 850, total memory - 160 TB, 80 vCores per node, up to 300 GB memory per node, 22 disks per node
Spark-Submit command -
/mapr/spark/bin/spark-submit \
--verbose \
--num-executors=30 \
--conf spark.locality.wait=60s \
--conf spark.network.timeout=14080s \
--driver-memory=20G \
--executor-memory=15G \
--conf spark.blacklist.enabled=true \
--conf spark.shuffle.service.enabled=true \
--master yarn \
--name=Sample_xxx \
--conf spark.default.parallelism=25 \
--conf spark.task.cpus=3 \
--conf spark.broadcast.compress=true \
--conf spark.io.compression.codec=lz4 \
--conf spark.shuffle.compress=true \
--conf "spark.executor.cores=3" \
--conf spark.shuffle.spill.compress=true \
--conf spark.rdd.compress=true \
--conf spark.sql.shuffle.partitions=1000 \
--conf spark.yarn.executor.memoryOverhead=3G \
--conf spark.sql.tungsten.enabled=true \
--queue sample1XX \
--class XXX yy.jar
I am reading multiple tables and performing multiple transformations; that's why I have the below configurations in the spark-submit command.
20GB should not take very long, though if it decompresses to 120GB that is a more substantial workload; please describe what kind of hardware you are using.
I guess from the path to spark-submit that you are using the MapR distribution of Hadoop. Does this include a user interface giving additional insight into performance? Look particularly for memory usage / garbage collection.
Are there any other jobs running on your cluster that might be taking resources?
Is the 40 minutes merely to load the data, or does this include your processing (bear in mind that Spark loads data lazily, so timings can sometimes be misleading)?
The main thing I notice is that spark.yarn.executor.memory is not a valid spark setting, so your executors may be short of memory.
Use the --executor-memory setting instead.
Also add the --verbose flag and study the output when you run your job, to ensure that the settings are correct and are being parsed in the way you expect.
Also check the executor logs to ensure that there are no errors, particularly due to running out of memory.
Minor observations:
You are setting spark.shuffle.spill.compress twice (harmless but unnecessary)
You are setting both --num-executors and spark.dynamicAllocation.enabled=true, which are contradictory, and normally result in a warning message saying that dynamic allocation will be disabled.

Hive table created with Spark not visible from HUE / Hive GUI

I am creating a Hive table from Scala using the following code:
val spark = SparkSession
.builder()
.appName("self service")
.enableHiveSupport()
.master("local")
.getOrCreate()
spark.sql("CREATE TABLE default.TEST_TABLE (C1 INT)")
The table must be successfully created, because if I run this code twice I receive an error saying the table already exists.
However, when I try to access this table from the GUI (HUE), I cannot see any tables in Hive, so it seems it is being saved to a different path than the one HUE/Hive uses to get this information.
Do you know what I should do so that the tables I create from my code show up in the HUE/Hive web GUI?
Any help will be very appreciated.
Thank you very much.
It seems to me that you have not added hive-site.xml to the proper path.
hive-site.xml has the properties that Spark needs to connect successfully to Hive, and you should add it to the directory
SPARK_HOME/conf/
You can also add this file by using spark.driver.extraClassPath and pointing it at the directory where the file exists. For example, in a PySpark submit:
/usr/bin/spark2-submit \
--conf spark.driver.extraClassPath=<directory containing hive-site.xml> \
--master yarn --deploy-mode client --driver-memory nG --executor-memory nG \
--executor-cores n myScript.py

How to efficiently query a hive table in spark using hive context?

I have a 1.6 TB Hive table with time series data. I am using Hive 1.2.1 and Spark 1.6.1 with Scala.
Following is the query I have in my code, but I always get a Java out-of-memory error.
val sid_data_df = hiveContext.sql(s"SELECT time, total_field, sid, year, date FROM tablename WHERE sid = '$stationId' ORDER BY time LIMIT 4320000 ")
By iteratively selecting a few records at a time from the Hive table, I am trying to do a sliding window on the resulting dataframe.
I have a cluster of 4 nodes, each with 122 GB of memory and 44 vCores. I am using 425 GB of the 488 GB of available memory, and I pass the following parameters to spark-submit:
--num-executors 16 --driver-memory 4g --executor-memory 22G --executor-cores 10 \
--conf "spark.sql.shuffle.partitions=1800" \
--conf "spark.shuffle.memory.fraction=0.6" \
--conf "spark.storage.memoryFraction=0.4" \
--conf "spark.yarn.executor.memoryOverhead=2600" \
--conf "spark.yarn.nodemanager.resource.memory-mb=123880" \
--conf "spark.yarn.nodemanager.resource.cpu-vcores=43"
Kindly give me suggestions on how to optimize this and successfully fetch the data from the Hive table.
Thanks
The problem is likely here:
LIMIT 4320000
You should avoid using LIMIT to subset a large number of records. In Spark, LIMIT moves all rows to a single partition and is likely to cause serious performance and stability issues.
See for example How to optimize below spark code (scala)?
I am trying to do a sliding window on this resultant dataframe iteratively by selecting a few records at a time.
This doesn't sound right. Sliding window operations can usually be achieved with some combination of window functions and timestamp-based window buckets.
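As an illustration only, a trailing one-hour window over this time series might look roughly like the sketch below. It assumes time is a timestamp column and sid and total_field are the columns from the query above; on Spark 1.6 a window function like this needs a HiveContext-backed DataFrame.
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

// Trailing 1-hour window per station id, ordered by event time cast to epoch seconds.
val w = Window
  .partitionBy("sid")
  .orderBy(col("time").cast("long"))
  .rangeBetween(-3600, 0)

val windowed = sid_data_df.withColumn("rolling_avg", avg(col("total_field")).over(w))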

use an external library in pyspark job in a Spark cluster from google-dataproc

I have a Spark cluster I created via Google Dataproc. I want to be able to use the CSV library from Databricks (see https://github.com/databricks/spark-csv). So I first tested it like this:
I started an SSH session on the master node of my cluster, then I entered:
pyspark --packages com.databricks:spark-csv_2.11:1.2.0
Then it launched a pyspark shell in which I input:
df = sqlContext.read.format('com.databricks.spark.csv').options(header='true', inferschema='true').load('gs:/xxxx/foo.csv')
df.show()
And it worked.
My next step is to launch this job from my main machine using the command:
gcloud beta dataproc jobs submit pyspark --cluster <my-dataproc-cluster> my_job.py
But here it does not work and I get an error. I think it's because I did not give --packages com.databricks:spark-csv_2.11:1.2.0 as an argument, but I tried 10 different ways to pass it and did not manage to.
My questions are:
was the databricks csv library installed after I typed pyspark --packages com.databricks:spark-csv_2.11:1.2.0?
can I write a line in my job.py in order to import it?
or what params should I give to my gcloud command to import it or install it?
Short Answer
There are quirks in the ordering of arguments where --packages isn't accepted by spark-submit if it comes after the my_job.py argument. To work around this, you can do the following when submitting from Dataproc's CLI:
gcloud beta dataproc jobs submit pyspark --cluster <my-dataproc-cluster> \
--properties spark.jars.packages=com.databricks:spark-csv_2.11:1.2.0 my_job.py
Basically, just add --properties spark.jars.packages=com.databricks:spark-csv_2.11:1.2.0 before the .py file in your command.
Long Answer
So, this is actually a different issue than the known lack of support for --jars in gcloud beta dataproc jobs submit pyspark; it appears that, without Dataproc explicitly recognizing --packages as a special spark-submit-level flag, it tries to pass it after the application arguments, so spark-submit lets --packages fall through as an application argument rather than properly parsing it as a submission-level option. Indeed, in an SSH session, the following does not work:
# Doesn't work if job.py depends on that package.
spark-submit job.py --packages com.databricks:spark-csv_2.11:1.2.0
But switching the order of the arguments does work again, even though in the pyspark case, both orderings work:
# Works with dependencies on that package.
spark-submit --packages com.databricks:spark-csv_2.11:1.2.0 job.py
pyspark job.py --packages com.databricks:spark-csv_2.11:1.2.0
pyspark --packages com.databricks:spark-csv_2.11:1.2.0 job.py
So even though spark-submit job.py is supposed to be a drop-in replacement for everything that previously called pyspark job.py, the difference in parse ordering for things like --packages means it's not actually a 100% compatible migration. This might be something to follow up with on the Spark side.
Anyhow, fortunately there's a workaround, since --packages is just another alias for the Spark property spark.jars.packages, and Dataproc's CLI supports properties just fine. So you can just do the following:
gcloud beta dataproc jobs submit pyspark --cluster <my-dataproc-cluster> \
--properties spark.jars.packages=com.databricks:spark-csv_2.11:1.2.0 my_job.py
Note that the --properties must come before the my_job.py, otherwise it gets sent as an application argument rather than as a configuration flag. Hope that works for you! Note that the equivalent in an SSH session would be spark-submit --packages com.databricks:spark-csv_2.11:1.2.0 job.py.
In addition to @Dennis's answer:
Note that if you need to load multiple external packages, you need to specify a custom escape character like so:
--properties ^#^spark.jars.packages=org.elasticsearch:elasticsearch-spark_2.10:2.3.2,com.databricks:spark-avro_2.10:2.0.1
Note the ^#^ right before the package list.
See gcloud topic escaping for more details.