How to read Hive Table with Spark-Sql efficiently - scala

I have a table with 20 GB of data in Hive. I am reading the table using Spark with Hive support, and I can see the data and schema as expected.
However, it is taking around 40 minutes to read the data. Is there a more efficient way to read the data from the Hive table?
Hive Table Sample_Table - 20 GB, no partitions, using ORC Snappy compression (the data expands to about 120 GB when read from Spark).
val spark = SparkSession.builder().enableHiveSupport().getOrCreate()
val spark_table = spark.sql("select * from Sample_Table")
Environment Details -
Not using any cloud
Nodes - around 850, total memory - 160 TB, 80 vcores per node, up to 300 GB memory per node, 22 disks per node
Spark-Submit command -
/mapr/spark/bin/spark-submit \
--verbose \
--num-executors=30 \
--conf spark.locality.wait=60s \
--conf spark.network.timeout=14080s \
--driver-memory=20G \
--executor-memory=15G \
--conf spark.blacklist.enabled=true \
--conf spark.shuffle.service.enabled=true \
--master yarn \
--name=Sample_xxx \
--conf spark.default.parallelism=25 \
--conf spark.task.cpus=3 \
--conf spark.broadcast.compress=true \
--conf spark.io.compression.codec=lz4 \
--conf spark.shuffle.compress=true \
--conf "spark.executor.cores=3" \
--conf spark.shuffle.spill.compress=true \
--conf spark.rdd.compress=true \
--conf spark.sql.shuffle.partitions=1000 \
--conf spark.yarn.executor.memoryOverhead=3G \
--conf spark.sql.tungsten.enabled=true \
--queue sample1XX \
--class XXX yy.jar
I am reading multiple tables and performing multiple transformations; that's why I have the configurations above in the spark-submit command.

20GB should not take very long, though if it decompresses to 120GB that is a more substantial workload; please describe what kind of hardware you are using.
I guess from the path to spark-submit that you are using the MapR distribution of Hadoop. Does this include a user interface giving additional insight into performance? Look particularly for memory usage / garbage collection.
Are there any other jobs running on your cluster that might be taking resources?
Is the 40 minutes merely to load the data, or does this include your processing (bear in mind that Spark loads data lazily, so timings can sometimes be misleading)?
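A quick way to tell (a minimal sketch, reusing spark_table from your code) is to time an action explicitly, since defining the DataFrame itself returns almost instantly:
// Lazy: this returns immediately, no data is read yet.
val spark_table = spark.sql("select * from Sample_Table")
// Forcing an action triggers the actual ORC scan, so this is the part to time.
val t0 = System.nanoTime()
val rows = spark_table.count()
println(s"Scanned $rows rows in ${(System.nanoTime() - t0) / 1e9} seconds")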
The main thing I notice is that spark.yarn.executor.memory is not a valid spark setting, so your executors may be short of memory.
Use the --executor-memory setting instead.
Also add the --verbose flag and study the output when you run your job, to ensure that the settings are correct and are being parsed in the way you expect.
Also check the executor logs to ensure that there are no errors, particularly due to running out of memory.
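As a quick sanity check (a sketch, assuming the flags from the spark-submit command above), you can also print what the running application actually received:
// These keys are only present if they were set at submission time.
println(spark.conf.get("spark.executor.memory")) // expect 15G from --executor-memory
println(spark.conf.get("spark.executor.cores"))  // expect 3 from --conf spark.executor.cores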
Minor observations:
You are setting spark.shuffle.spill.compress twice (harmless but unnecessary)
You are setting both --num-executors and spark.dynamicAllocation.enabled=true, which are contradictory, and normally result in a warning message saying that dynamic allocation will be disabled.

Related

Efficient way of reading large dataset (10TB) from S3 using EMR Spark, performing filtering operations on partitioned columns, and writing back to S3

I'd like to read a large (~10 TB per day) log dataset from S3 using EMR Spark, do some filtering based on partitioned and non-partitioned columns, and write the results back to S3. After filtering, there should be less than 1 TB of data left for each day.
The dataset is partitioned as day/hour/col1/col2, and the amount of data in each hour is roughly similar from day to day. That's why I chunk the data up, reading it day by day and partitioning the query by hour. Here is a sample query:
for (date <- 1 to 30) {
  val output_path = "s3a://dest-bucket/logs/day=%d/".format(date)
  spark.table("s3a://source-bucket/logs/day=%d/".format(date))
    .filter('col1.isin("A", "B", "C"))     // col1 is a partition column in the source S3 layout
    //.filter('col3.isin("X", "Y", "Z"))   // --> Point 4
    .withColumn("hid", hash($"id") % 2000) // --> Point 5
    .select("col1", "col2", "col3", "col4")
    // .repartition(col("hid"))            // --> Point 5
    .repartition(col("hour"))              // --> Point 1
    .write
    // .partitionBy("hid")
    .partitionBy("hour")                   // --> Point 2
    .option("header", "true").mode("overwrite").parquet(output_path)
}
The objective is to perform the query as fast as possible and with the fewest operational issues (OOM, lost/failed executors, etc.). Currently, my query takes 10-15 hours per log day, and it sometimes fails due to too many lost executors. However, copying the entire set of files (with no filtering) using the aws cli takes about the same amount of time on a single node, so I'm hoping to speed the query up by using a large cluster.
In order to reach this objective, I have several questions that can shed light on the problem:
1. For Spark (v2.4.8) to understand that the source is partitioned on a column, do I need a repartition(col("hour")) statement? What are the pros and cons of repartitioning an already-partitioned data source? Even if Spark does partition discovery and figures out that the data is partitioned on multiple columns, is there any benefit to repartitioning based on the knowledge that the data is fairly uniformly distributed over one column (hour)?
2. In order to speed up the write, I have a partitionBy("hour") statement, to make the write go 24x faster. Otherwise, Spark writes one output file at a time (essentially, the move out of the _temporary folder is done sequentially), and I don't understand why. How can I make the executors perform the write in parallel?
3. If I have a partitionBy in the write, do I still benefit from having a repartition statement?
4. I can forgo the condition on the non-partitioned column col3 if that would significantly improve query performance.
5. My ultimate goal is to hash-partition the users based on their IDs and rewrite the data partitioned by that hash (a new column called hid); see the sketch after this list. However, the reshuffle makes the cluster fail. I have 200 nodes in the cluster, each with 128 GB of memory (m5.8xlarge), so there should be enough memory to hold the data. Is there any way to optimize a reshuffle based on a new, unpartitioned column?
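For point 5, here is a minimal sketch of the hash-partitioned rewrite, assembled from the commented-out lines in the code above; the modulo of 2000 and the column names come from the question, while the table name "logs" and the destination prefix are placeholders:
import org.apache.spark.sql.functions.{col, hash}
import spark.implicits._
// Hash the user id into 2000 buckets, shuffle once by that bucket,
// and let the writer create one directory per bucket.
val withHid = spark.table("logs") // placeholder for however the source is resolved
  .filter('col1.isin("A", "B", "C"))
  .withColumn("hid", hash($"id") % 2000)
withHid
  .repartition(col("hid")) // a single shuffle, keyed by the new column
  .write
  .partitionBy("hid")
  .mode("overwrite")
  .parquet("s3a://dest-bucket/logs_by_hid/") // hypothetical destination prefix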
Please see below for the spark configuration I use.
spark-shell \
--master yarn \
--conf "spark.executor.instances=1199" \
--conf "spark.default.parallelism=5995" \
--conf "spark.sql.shuffle.partitions=5995" \
--conf "spark.driver.cores=5" \
--conf "spark.driver.maxResultSize=20g" \
--conf "spark.driver.memory=18g" \
--conf "spark.driver.memoryOverhead=3g" \
--conf "spark.executor.cores=5" \
--conf "spark.executor.memory=18g" \
--conf "spark.executor.memoryOverhead=3g" \
--conf "spark.executor.heartbeatInterval=3600" \
--conf "spark.hadoop.orc.overwrite.output.file=true" \
--conf "spark.hadoop.parquet.enable.summary-metadata=false" \
--conf "spark.sql.execution.arrow.pyspark.enabled=true" \
--conf "spark.sql.execution.arrow.pyspark.fallback.enabled=true" \
--conf "spark.sql.parquet.mergeSchema=false" \
--conf "spark.sql.parquet.int96RebaseModeInRead=CORRECTED" \
--conf "spark.sql.debug.maxToStringFields=100" \
--conf "spark.hadoop.fs.s3.connection.maximum=1000" \
--conf "spark.hadoop.fs.s3.connection.timeout=300000" \
--conf "spark.hadoop.fs.s3.threads.core=250" \
--conf "spark.hadoop.fs.s3a.connection.maximum=1000" \
--conf "spark.hadoop.fs.s3a.connection.timeout=300000" \
--conf "spark.hadoop.fs.s3a.threads.core=250"\
--conf "spark.reducer.maxBlocksInFlightPerAddress=20"
Any advice is appreciated.

Using Spark-Submit to write to S3 in "local" mode using S3A Directory Committer

I'm currently running PySpark via local mode. I want to be able to efficiently output parquet files to S3 via the S3 Directory Committer. This PySpark instance is using the local disk, not HDFS, as it is being submitted via spark-submit --master local[*].
I can successfully write to my S3 Instance without enabling the directory committer. However, this involves writing staging files to S3 and renaming them, which is slow and unreliable. I would like for Spark to write to my local filesystem as a temporary store, and then copy to S3.
I have the following configuration in my PySpark conf:
self.spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", "true")
self.spark.conf.set("spark.hadoop.fs.s3a.committer.name", "directory")
self.spark.conf.set("spark.sql.sources.commitProtocolClass", "org.apache.spark.internal.io.cloud.PathOutputCommitProtocol")
self.spark.conf.set("spark.sql.parquet.output.committer.class", "org.apache.spark.internal.io.cloud.BindingParquetOutputCommitter")
My spark-submit command looks like this:
spark-submit --master local[*] --py-files files.zip --packages com.amazonaws:aws-java-sdk:1.11.375,org.apache.hadoop:hadoop-aws:3.2.0,org.apache.spark.internal.io.cloud.PathOutputCommitProtocol --driver-memory 4G --name clean-raw-recording_data main.py
spark-submit gives me the following error, due to the requisite JAR not being in place:
java.lang.ClassNotFoundException: org.apache.spark.internal.io.cloud.PathOutputCommitProtocol
My questions are:
Which JAR (specifically, the maven coordinates) do I need to include in spark-submit --packages in order to be able to reference PathOutputCommitProtocol?
Once I have (1) working, will I be able to use PySpark's local mode to stage temporary files on the local filesystem? Or is HDFS a strict requirement?
I need this to be running in local mode, not cluster mode.
EDIT:
I got this to work with the following configuration:
Using pyspark version 3.1.2 and the package
org.apache.spark:spark-hadoop-cloud_2.12:3.1.1.3.1.7270.0-253.
I needed to add the cloudera repository using the --repositories option for spark-submit:
spark-submit --repositories https://repository.cloudera.com/artifactory/cloudera-repos/ --packages com.amazonaws:aws-java-sdk:1.11.375,org.apache.hadoop:hadoop-aws:3.2.0,org.apache.spark:spark-hadoop-cloud_2.12:3.1.1.3.1.7270.0-253
You need the spark-hadoop-cloud module for the release of Spark you are using.
The committer is happy using the local FS (that's how the public integration test suites work: https://github.com/hortonworks-spark/cloud-integration). All that's needed is a "real" filesystem shared across all workers and the Spark driver, so the driver gets the manifests of each pending commit.
Print the _SUCCESS file after a job to see what the committer did: a 0-byte file == the old committer, JSON with diagnostics == the new one.
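For example, a quick way to look at that file (a sketch in Scala rather than PySpark, with a placeholder output path):
// No rows shown => a 0-byte _SUCCESS from the classic FileOutputCommitter;
// JSON lines => the new committer wrote its diagnostics.
val success = spark.read.text("s3a://my-bucket/output/_SUCCESS") // placeholder path
success.show(false)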

Hive table created with Spark not visible from HUE / Hive GUI

I am creating a Hive table from Scala using the following code:
val spark = SparkSession
.builder()
.appName("self service")
.enableHiveSupport()
.master("local")
.getOrCreate()
spark.sql("CREATE TABLE default.TEST_TABLE (C1 INT)")
The table must have been created successfully, because if I run this code twice I get an error saying the table already exists.
However, when I try to access this table from the GUI (HUE), I cannot see any tables in Hive, so it seems the table is being saved in a different location than the one HUE/Hive reads from.
Do you know what I should do so that the tables I create from my code are visible in the HUE/Hive web GUI?
Any help will be very appreciated.
Thank you very much.
It seems to me you have not added hive-site.xml to the proper path.
hive-site.xml holds the properties that Spark needs to connect successfully to Hive, and you should add it to the directory
SPARK_HOME/conf/
You can also supply this file using spark.driver.extraClassPath, pointing it at the directory where the file lives. For example, in a PySpark submit:
/usr/bin/spark2-submit \
--conf spark.driver.extraClassPath=/<directory containing hive-site.xml>/ \
--master yarn --deploy-mode client --driver-memory nG --executor-memory nG \
--executor-cores n myScript.py
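Once hive-site.xml is picked up, a quick way to confirm from Spark itself that you are talking to the same metastore HUE uses (a sketch, reusing the table from the question):
// The table should be listed, and its Location should sit under the shared Hive warehouse directory.
spark.sql("SHOW TABLES IN default").show(false)
spark.sql("DESCRIBE FORMATTED default.TEST_TABLE").show(100, false)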

How to efficiently query a hive table in spark using hive context?

I have a 1.6T Hive table with time series data. I am using Hive 1.2.1 and Spark 1.6.1 in Scala.
Following is the query which I have in my code. But I always get Java out of memory error.
val sid_data_df = hiveContext.sql(s"SELECT time, total_field, sid, year, date FROM tablename WHERE sid = '$stationId' ORDER BY time LIMIT 4320000 ")
By iteratively selecting a few records at a time from the Hive table, I am trying to do a sliding window over the resulting dataframe.
I have a cluster of 4 nodes with 122 GB of memory and 44 vCores each. I am using 425 GB of the 488 GB of memory available. I am running spark-submit with the following parameters:
--num-executors 16 --driver-memory 4g --executor-memory 22G --executor-cores 10 \
--conf "spark.sql.shuffle.partitions=1800" \
--conf "spark.shuffle.memory.fraction=0.6" \
--conf "spark.storage.memoryFraction=0.4" \
--conf "spark.yarn.executor.memoryOverhead=2600" \
--conf "spark.yarn.nodemanager.resource.memory-mb=123880" \
--conf "spark.yarn.nodemanager.resource.cpu-vcores=43"
Kindly give me suggestions on how to optimize this and successfully fetch the data from the Hive table.
Thanks
The problem is likely here:
LIMIT 4320000
You should avoid using LIMIT to subset a large number of records. In Spark, LIMIT moves all rows to a single partition and is likely to cause serious performance and stability issues.
See for example How to optimize below spark code (scala)?
I am trying to do a sliding window on this resultant dataframe iteratively by selecting a few records at a time.
This doesn't sound right. Sliding window operations can usually be achieved with some combination of window functions and timestamp-based window buckets.
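As a rough sketch of that alternative (not necessarily the asker's exact logic; the column names and the WHERE clause come from the question, while the one-hour trailing window and the cast of time to a timestamp are assumptions):
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._
// Keep the selective predicate, but compute the sliding aggregate with a
// time-range window instead of LIMIT plus iterating over chunks.
val df = hiveContext.sql(s"SELECT time, total_field, sid, year, date FROM tablename WHERE sid = '$stationId'")
val w = Window
  .partitionBy("sid")
  .orderBy(col("time").cast("timestamp").cast("long")) // epoch seconds
  .rangeBetween(-3600, 0)                               // trailing one-hour window
val sliding = df.withColumn("total_field_1h_avg", avg(col("total_field")).over(w))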

use an external library in pyspark job in a Spark cluster from google-dataproc

I have a Spark cluster that I created via Google Dataproc. I want to be able to use the CSV library from Databricks (see https://github.com/databricks/spark-csv). So I first tested it like this:
I started an SSH session with the master node of my cluster, then I typed:
pyspark --packages com.databricks:spark-csv_2.11:1.2.0
Then it launched a pyspark shell in which I input:
df = sqlContext.read.format('com.databricks.spark.csv').options(header='true', inferschema='true').load('gs:/xxxx/foo.csv')
df.show()
And it worked.
My next step is to launch this job from my main machine using the command:
gcloud beta dataproc jobs submit pyspark --cluster <my-dataproc-cluster> my_job.py
But here it does not work and I get an error. I think it is because I did not pass --packages com.databricks:spark-csv_2.11:1.2.0 as an argument, but I tried 10 different ways to provide it and did not manage.
My questions are:
was the databricks csv library installed after I typed pyspark --packages com.databricks:spark-csv_2.11:1.2.0?
can I write a line in my job.py in order to import it?
or what params should I give to my gcloud command to import it or install it?
Short Answer
There are quirks in the ordering of arguments: --packages isn't accepted by spark-submit if it comes after the my_job.py argument. To work around this, you can do the following when submitting from Dataproc's CLI:
gcloud beta dataproc jobs submit pyspark --cluster <my-dataproc-cluster> \
--properties spark.jars.packages=com.databricks:spark-csv_2.11:1.2.0 my_job.py
Basically, just add --properties spark.jars.packages=com.databricks:spark-csv_2.11:1.2.0 before the .py file in your command.
Long Answer
So, this is actually a different issue than the known lack of support for --jars in gcloud beta dataproc jobs submit pyspark; it appears that without Dataproc explicitly recognizing --packages as a special spark-submit-level flag, it tries to pass it after the application arguments so that spark-submit lets the --packages fall through as an application argument rather than properly parsing it as a submission-level option. Indeed, in an SSH session, the following does not work:
# Doesn't work if job.py depends on that package.
spark-submit job.py --packages com.databricks:spark-csv_2.11:1.2.0
But switching the order of the arguments does work, whereas in the pyspark case both orderings work:
# Works with dependencies on that package.
spark-submit --packages com.databricks:spark-csv_2.11:1.2.0 job.py
pyspark job.py --packages com.databricks:spark-csv_2.11:1.2.0
pyspark --packages com.databricks:spark-csv_2.11:1.2.0 job.py
So even though spark-submit job.py is supposed to be a drop-in replacement for everything that previously called pyspark job.py, the difference in parse ordering for things like --packages means it's not actually a 100% compatible migration. This might be something to follow up with on the Spark side.
Anyhow, fortunately there's a workaround, since --packages is just another alias for the Spark property spark.jars.packages, and Dataproc's CLI supports properties just fine. So you can just do the following:
gcloud beta dataproc jobs submit pyspark --cluster <my-dataproc-cluster> \
--properties spark.jars.packages=com.databricks:spark-csv_2.11:1.2.0 my_job.py
Note that the --properties must come before the my_job.py, otherwise it gets sent as an application argument rather than as a configuration flag. Hope that works for you! Note that the equivalent in an SSH session would be spark-submit --packages com.databricks:spark-csv_2.11:1.2.0 job.py.
In addition to @Dennis's answer:
Note that if you need to load multiple external packages, you need to specify a custom escape character like so:
--properties ^#^spark.jars.packages=org.elasticsearch:elasticsearch-spark_2.10:2.3.2,com.databricks:spark-avro_2.10:2.0.1
Note the ^#^ right before the package list.
See gcloud topic escaping for more details.