How to efficiently query a hive table in spark using hive context? - scala

I have a 1.6T Hive table with time series data. I am using Hive 1.2.1
and Spark 1.6.1 in Scala.
Following is the query I have in my code, but it always fails with a Java out-of-memory error.
val sid_data_df = hiveContext.sql(s"SELECT time, total_field, sid, year, date FROM tablename WHERE sid = '$stationId' ORDER BY time LIMIT 4320000 ")
By iteratively selecting a few records at a time from the Hive table, I am trying to do a sliding window over the resulting dataframe.
I have a cluster of 4 nodes, each with 122 GB of memory and 44 vCores. I am using 425 GB of the 488 GB of memory available. I am running spark-submit with the following parameters:
--num-executors 16 --driver-memory 4g --executor-memory 22G --executor-cores 10 \
--conf "spark.sql.shuffle.partitions=1800" \
--conf "spark.shuffle.memory.fraction=0.6" \
--conf "spark.storage.memoryFraction=0.4" \
--conf "spark.yarn.executor.memoryOverhead=2600" \
--conf "spark.yarn.nodemanager.resource.memory-mb=123880" \
--conf "spark.yarn.nodemanager.resource.cpu-vcores=43"
Kindly give me suggestions on how to optimize this and successfully fetch the data from the Hive table.
Thanks

The problem is likely here:
LIMIT 4320000
You should avoid using LIMIT to subset a large number of records. In Spark, LIMIT moves all rows to a single partition and is likely to cause serious performance and stability issues.
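For instance, a minimal sketch of the alternative, assuming hypothetical windowStart / windowEnd bounds on the time column that you would compute per iteration instead of sorting and limiting:

// Bound the scan with a time-range predicate instead of ORDER BY ... LIMIT.
// windowStart and windowEnd are hypothetical per-iteration bounds, not from the original post.
val sid_data_df = hiveContext.sql(
  s"""SELECT time, total_field, sid, year, date
     |FROM tablename
     |WHERE sid = '$stationId'
     |  AND time >= $windowStart AND time < $windowEnd""".stripMargin)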
See for example How to optimize below spark code (scala)?
I am trying to do a sliding window on this resultant dataframe iteratively by selecting a few records at a time.
This doesn't sound right. Sliding window operations can usually be achieved with some combination of window functions and timestamp-based window buckets.
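As a rough illustration (not the asker's exact logic), a time-based sliding window can be expressed with a range-based window spec; the one-hour (3600-second) look-back and the sum over total_field below are arbitrary assumptions:

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

// Sliding aggregate over the preceding hour for each sid,
// assuming `time` is a numeric epoch-seconds column.
val w = Window
  .partitionBy("sid")
  .orderBy("time")
  .rangeBetween(-3600L, 0L)   // look back 3600 seconds (arbitrary example)

val windowed = sid_data_df.withColumn("total_last_hour", sum("total_field").over(w))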

Related

Efficient way of reading large dataset (10TB) from S3 using EMR Spark, performing filtering operations on partitioned columns, and writing back to S3

I'd like to read a large (~10TB per day) log dataset from S3 using EMR Spark, do some filtering based on partitioned and non-partitioned columns, and write the results back to S3. After filtering, there should be less than 1TB of data left per day.
The dataset is partitioned as day/hour/col1/col2, and each hour holds a relatively similar amount of data for every day. That's why I chunk up the data, read it day by day, and partition the query by hour. Here is a sample query:
for (date <- 1 to 30) {
  val output_path = "s3a://dest-bucket/logs/day=%d/".format(date)
  spark.table("s3a://source-bucket/logs/day=%d/".format(date))
    .filter('col1.isin("A", "B", "C"))      // col1 is partitioned in source S3
    //.filter('col3.isin("X", "Y", "Z"))    // --> Point 4
    .withColumn("hid", hash($"id") % 2000)  // --> Point 5
    .select("col1", "col2", "col3", "col4")
    //.repartition(col("hid"))              // --> Point 5
    .repartition(col("hour"))               // --> Point 1
    .write
    //.partitionBy("hid")
    .partitionBy("hour")                    // --> Point 2
    .option("header", "true").mode("overwrite").parquet(output_path)
}
The objective is to perform the query as fast as possible and with the fewest operational issues (OOM, lost/failed executors, etc.). Currently, my query takes 10-15 hours per log day, and it sometimes fails due to too many lost executors. However, copying the entire set of files (no filtering) using the aws cli takes about the same amount of time on a single node, so I'm hoping to expedite the query using a large cluster.
In order to reach this objective, I have several questions that can shed light on the problem:
1. For Spark (v2.4.8) to understand that the source is partitioned on a column, do I need the repartition(col("hour")) statement? What are the pros and cons of repartitioning an already-partitioned data source? Even if Spark does partition discovery and figures out that the data is partitioned on multiple columns, is there any benefit to repartitioning based on knowing that the data is fairly uniformly distributed over one column (hour)?
2. To expedite the write, I have a partitionBy("hour") statement to make the write go 24x faster. Otherwise, Spark writes one output file at a time (essentially the move out of the _temporary folder is done sequentially). I don't understand the reason. How can I make the executors perform the write in parallel?
3. If I have a partitionBy in the write, would I benefit from also having a repartition statement?
4. I can forgo the condition on the non-partitioned column col3 if that would significantly increase query performance.
5. My ultimate goal is to hash-partition the users based on their IDs and rewrite the data with that partitioning (a new column called hid); see the sketch after this list. However, the reshuffling makes the cluster fail. I have 200 nodes in the cluster, each with 128 GB of memory (m5.8xlarge), so there should be enough memory to hold the data. Is there any way to optimize the reshuffle on a new, unpartitioned column?
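A rough sketch for point 5 only (not a verified fix): derive the bucket with a non-negative hash, shuffle with an explicit partition count on that column, and write with a matching partitionBy. Here `filtered` stands for the already-filtered dataframe built inside the loop above, `output_path` is the destination from the question, and the 2000-bucket count mirrors the question:

import org.apache.spark.sql.functions._
import spark.implicits._

// `filtered` is a placeholder for the filtered dataframe built inside the loop.
val withHid = filtered.withColumn("hid", pmod(hash($"id"), lit(2000)))

withHid
  .repartition(2000, col("hid"))   // explicit shuffle keyed on the hash bucket
  .write
  .partitionBy("hid")              // output layout matches the shuffle key
  .option("header", "true")
  .mode("overwrite")
  .parquet(output_path)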
Please see below for the spark configuration I use.
spark-shell \
--master yarn \
--conf "spark.executor.instances=1199" \
--conf "spark.default.parallelism=5995" \
--conf "spark.sql.shuffle.partitions=5995" \
--conf "spark.driver.cores=5" \
--conf "spark.driver.maxResultSize=20g" \
--conf "spark.driver.memory=18g" \
--conf "spark.driver.memoryOverhead=3g" \
--conf "spark.executor.cores=5" \
--conf "spark.executor.memory=18g" \
--conf "spark.executor.memoryOverhead=3g" \
--conf "spark.executor.heartbeatInterval=3600" \
--conf "spark.hadoop.orc.overwrite.output.file=true" \
--conf "spark.hadoop.parquet.enable.summary-metadata=false" \
--conf "spark.sql.execution.arrow.pyspark.enabled=true" \
--conf "spark.sql.execution.arrow.pyspark.fallback.enabled=true" \
--conf "spark.sql.parquet.mergeSchema=false" \
--conf "spark.sql.parquet.int96RebaseModeInRead=CORRECTED" \
--conf "spark.sql.debug.maxToStringFields=100" \
--conf "spark.hadoop.fs.s3.connection.maximum=1000" \
--conf "spark.hadoop.fs.s3.connection.timeout=300000" \
--conf "spark.hadoop.fs.s3.threads.core=250" \
--conf "spark.hadoop.fs.s3a.connection.maximum=1000" \
--conf "spark.hadoop.fs.s3a.connection.timeout=300000" \
--conf "spark.hadoop.fs.s3a.threads.core=250"\
--conf "spark.reducer.maxBlocksInFlightPerAddress=20"
Any advice is appreciated.

How to speed up spark df.write jdbc to postgres database?

I am new to Spark and am attempting to speed up appending the contents of a dataframe (which can have between 200k and 2M rows) to a Postgres database using df.write:
df.write.format('jdbc').options(
    url=psql_url_spark,
    driver=spark_env['PSQL_DRIVER'],
    dbtable="{schema}.{table}".format(schema=schema, table=table),
    user=spark_env['PSQL_USER'],
    password=spark_env['PSQL_PASS'],
    batchsize=2000000,
    queryTimeout=690
).mode(mode).save()
I tried increasing the batchsize but that didn't help, as completing this task still took ~4 hours. I've also included some snapshots below from AWS EMR showing more details about how the job ran. The task of saving the dataframe to the Postgres table was assigned to only one executor (which I found strange); would speeding this up involve dividing the task between executors?
Also, I have read Spark's performance tuning docs, but increasing the batchsize and queryTimeout has not seemed to improve performance. (I tried calling df.cache() in my script before df.write, but the runtime for the script was still 4 hours.)
Additionally, my AWS EMR hardware setup and spark-submit command are:
Master Node (1): m4.xlarge
Core Nodes (2): m5.xlarge
spark-submit --deploy-mode client --executor-cores 4 --num-executors 4 ...
Spark is a distributed data processing engine, so when you are processing your data or saving it to a file system it uses all of its executors to perform the task.
Spark JDBC is slow because when you establish a JDBC connection, one of the executors establishes the link to the target database, resulting in slow speeds and failure.
To overcome this problem and speed up data writes to the database, you need to use one of the following approaches:
Approach 1:
In this approach you use the Postgres COPY command to speed up the write operation. This requires the psycopg2 library on your EMR cluster.
The documentation for the COPY utility is here
If you want to know the benchmark differences and why COPY is faster, visit here!
Postgres also suggests using the COPY command for bulk inserts. Now, how do you bulk insert a Spark dataframe?
To implement faster writes, first save your Spark dataframe to the EMR file system in CSV format, and repartition the output so that no file contains more than 100k rows.
# Repartition your dataframe based on the number of rows in df
df.repartition(10).write.option("maxRecordsPerFile", 100000).mode("overwrite").csv("path/to/save/data")
Now read the files using Python and execute the COPY command for each file.
import psycopg2

# Iterate over the CSV part files here; you can also build the list using the os module
files = ['path/to/save/data/part-00000_0.csv',
         'path/to/save/data/part-00000_1.csv']

# Define a function that opens the file inside the worker and streams it with COPY
def execute_copy(file_path):
    con = psycopg2.connect(database=dbname, user=user, password=password, host=host, port=port)
    cursor = con.cursor()
    with open(file_path) as f:
        cursor.copy_from(f, 'table_name', sep=",")
    con.commit()
    con.close()
For an additional speed boost, since you are using an EMR cluster, you can leverage Python multiprocessing to copy more than one file at a time.
from multiprocessing import Pool, cpu_count

with Pool(cpu_count()) as p:
    print(p.map(execute_copy, files))
This is the recommended approach, as Spark JDBC can't be tuned for higher write speeds due to connection constraints.
Approach 2:
Since you are already using an AWS EMR cluster, you can always leverage Hadoop capabilities to perform your table writes faster.
Here we will use sqoop export to export the data from EMRFS to the Postgres DB.
#If you are using s3 as your source path
sqoop export --connect jdbc:postgresql://hostname:port/postgresDB --table target_table --export-dir s3://mybucket/myinputfiles/ --driver org.postgresql.Driver --username master --password password --input-null-string '\\N' --input-null-non-string '\\N' --direct -m 16
#If you are using EMRFS as your source path
sqoop export --connect jdbc:postgresql://hostname:port/postgresDB --table target_table --export-dir /path/to/save/data/ --driver org.postgresql.Driver --username master --password password --input-null-string '\\N' --input-null-non-string '\\N' --direct -m 16
Why sqoop?
Because Sqoop opens multiple connections to the database based on the number of mappers specified. So if you specify -m 8, there will be 8 concurrent connection streams writing data to Postgres.
Also, for more information on using sqoop go through this AWS Blog, SQOOP Considerations and SQOOP Documentation.
If you can hack your way around with code, then Approach 1 will definitely give you the performance boost you seek; if you are comfortable with Hadoop components like Sqoop, then go with the second approach.
Hope it helps!
Spark-side tuning => Repartition the dataframe so that multiple executors write to the DB in parallel:
# repartition(10) => number of concurrent connections from Spark to PostgreSQL
df.repartition(10).write.format('jdbc').options(
    url=psql_url_spark,
    driver=spark_env['PSQL_DRIVER'],
    dbtable="{schema}.{table}".format(schema=schema, table=table),
    user=spark_env['PSQL_USER'],
    password=spark_env['PSQL_PASS'],
    batchsize=2000000,
    queryTimeout=690
).mode(mode).save()
PostgreSQL-side tuning =>
You will need to bump up the following parameters on the PostgreSQL side:
max_connections determines the maximum number of concurrent connections to the database server. The default is typically 100 connections.
shared_buffers determines how much memory is dedicated to PostgreSQL for caching data.
To solve the performance issue, you generally need to resolve the below 2 bottlenecks:
Make sure the Spark job is writing the data to the DB in parallel -
To resolve this, make sure you have a partitioned dataframe. Use df.repartition(n) to partition the dataframe so that each partition is written to the DB in parallel.
Note - too many concurrent writers will also lead to slow inserts, so start with 5 partitions and increase the number of partitions by 5 until you get optimal performance.
Make sure the DB has enough compute, memory and storage required for ingesting bulk data.
That repartitioning the dataframe gives better write performance is a well-known answer, but there is an optimal way of repartitioning your dataframe.
Since you are running this process on an EMR cluster, first find out the instance type and the number of cores running on each of your slave instances, and set the number of dataframe partitions accordingly.
In your case you are using m5.xlarge (2 slaves), each with 4 vCPUs, which means 4 threads per instance. So 8 partitions will give you an optimal result when you are dealing with huge data.
Note: The number of partitions should be increased or decreased based on your data size.
Note: Batch size is also something you should consider in your writes; the bigger the batch size, the better the performance.
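For illustration, a rough Scala sketch (the same idea applies in PySpark) that derives the write parallelism from the cluster's default parallelism instead of hard-coding it; df stands for the dataframe being written and the JDBC connection settings are placeholders:

// Roughly one write partition (and JDBC connection) per executor core.
val writeParallelism = spark.sparkContext.defaultParallelism

df.repartition(writeParallelism)
  .write
  .format("jdbc")
  .option("url", "jdbc:postgresql://host:5432/db")   // placeholder connection settings
  .option("driver", "org.postgresql.Driver")
  .option("dbtable", "schema.table")
  .option("batchsize", "10000")                      // arbitrary example batch size
  .mode("append")
  .save()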

How to read Hive Table with Spark-Sql efficiently

I have a table with 20 GB of data in Hive. I am reading the table using Spark with Hive context and I am able to see the data and schema as expected.
However, it takes around 40 minutes to read the data. Is there any alternative way to read the data from the Hive table more efficiently?
Hive table Sample_Table - 20 GB, no partitions, using ORC Snappy compression. (The data explodes to 120 GB when read from Spark.)
val spark = SparkSession.builder().enableHiveSupport().getOrCreate()
val spark_table = spark.sql("select * from Sample_Table")
Environment Details -
Not using any cloud
Nodes - around 850, total memory - 160 TB, 80 vCores per node, up to 300 GB memory per node, 22 disks per node
Spark-Submit command -
/mapr/spark/bin/spark-submit \
--verbose \
--num-executors=30 \
--conf spark.locality.wait=60s \
--conf spark.network.timeout=14080s \
--driver-memory=20G \
--executor-memory=15G \
--conf spark.blacklist.enabled=true \
--conf spark.shuffle.service.enabled=true \
--master yarn \
--name=Sample_xxx \
--conf spark.default.parallelism=25 \
--conf spark.task.cpus=3 \
--conf spark.broadcast.compress=true \
--conf spark.io.compression.codec=lz4 \
--conf spark.shuffle.compress=true \
--conf "spark.executor.cores=3" \
--conf spark.shuffle.spill.compress=true \
--conf spark.rdd.compress=true \
--conf spark.sql.shuffle.partitions=1000 \
--conf spark.yarn.executor.memoryOverhead=3G \
--conf spark.sql.tungsten.enabled=true \
--queue sample1XX \
--class XXX yy.jar
I am reading multiple tables and performing multiple transformations; that's the reason I have the above configurations in the spark-submit command.
20GB should not take very long, though if it decompresses to 120GB that is a more substantial workload; please describe what kind of hardware you are using.
I guess from the path to spark-submit that you are using the MapR distribution of Hadoop. Does this include a user interface giving additional insight into performance? Look particularly for memory usage / garbage collection.
Are there any other jobs running on your cluster that might be taking resources?
Is the 40 minutes merely to load the data, or does this include your processing (bear in mind that Spark loads data lazily, so timings can sometimes be misleading)?
The main thing I notice is that spark.yarn.executor.memory is not a valid spark setting, so your executors may be short of memory.
Use the --executor-memory setting instead.
Also add the --verbose flag and study the output when you run your job, to ensure that the settings are correct and are being parsed in the way you expect.
Also check the executor logs to ensure that there are no errors, particularly due to running out of memory.
Minor observations:
You are setting spark.shuffle.spill.compress twice (harmless but unnecessary)
You are setting both --num-executors and spark.dynamicAllocation.enabled=true, which are contradictory, and normally result in a warning message saying that dynamic allocation will be disabled.

Hive table created with Spark not visible from HUE / Hive GUI

I am creating a Hive table from Scala using the following code:
val spark = SparkSession
  .builder()
  .appName("self service")
  .enableHiveSupport()
  .master("local")
  .getOrCreate()
spark.sql("CREATE TABLE default.TEST_TABLE (C1 INT)")
The table must be getting created successfully, because if I run this code twice I receive an error saying the table already exists.
However, when I try to access this table from the GUI (HUE), I cannot see any table in Hive, so it seems it is being saved to a different path than the one used by Hive in HUE to get this information.
Do you know what I should do to see the tables I create from my code in the HUE/Hive web GUI?
Any help will be very appreciated.
Thank you very much.
It seems to me you have not added hive-site.xml to the proper path.
hive-site.xml has the properties that Spark needs to connect successfully with Hive, and you should add it to the directory
SPARK_HOME/conf/
You can also add this file by using spark.driver.extraClassPath and giving the directory where this file exists. For example, in a pyspark submit:
/usr/bin/spark2-submit \
--conf spark.driver.extraClassPath=/path/to/directory/containing/hive-site.xml/ \
--master yarn --deploy-mode client --driver-memory nG --executor-memory nG \
--executor-cores n myScript.py
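If hive-site.xml is in place, here is a minimal Scala sketch (the thrift URI is a placeholder) to confirm the session is talking to the shared Hive metastore rather than a local one, so the table shows up in HUE:

import org.apache.spark.sql.SparkSession

// Point Spark explicitly at the shared Hive metastore so tables land in the
// same catalog that HUE reads; metastore-host:9083 is a placeholder.
val spark = SparkSession
  .builder()
  .appName("self service")
  .config("hive.metastore.uris", "thrift://metastore-host:9083")
  .enableHiveSupport()
  .getOrCreate()

spark.sql("CREATE TABLE IF NOT EXISTS default.TEST_TABLE (C1 INT)")
spark.sql("SHOW TABLES IN default").show()   // verify the table is visible in the catalog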

Spark cannot find the postgres jdbc driver

EDIT: See the edit at the end
First of all, I am using Spark 1.5.2 on Amazon EMR and using Amazon RDS for my postgres database. Second is that I am a complete newbie in this world of Spark and Hadoop and MapReduce.
Essentially my problem is the same as for this guy:
java.sql.SQLException: No suitable driver found when loading DataFrame into Spark SQL
So the dataframe is loaded, but when I try to evaluate it (by doing df.show(), where df is the dataframe) it gives me the error:
java.sql.SQLException: No suitable driver found for jdbc:postgresql://mypostgres.cvglvlp29krt.eu-west-1.rds.amazonaws.com:5432/mydb
I should note that I start spark like this:
spark-shell --driver-class-path /home/hadoop/postgresql-9.4.1207.jre7.jar
The solutions suggest delivering the jar onto the worker nodes and setting the classpath on them somehow, which I don't really understand how to do. But then they say that apparently the issue was fixed in Spark 1.4, and I'm using 1.5.2 and still having this issue, so what is going on?
EDIT: It looks like I resolved the issue. However, I still don't quite understand why this works and the approach above doesn't, so I guess my question is now: why does doing this:
spark-shell --driver-class-path /home/hadoop/postgresql-9.4.1207.jre7.jar --conf spark.driver.extraClassPath=/home/hadoop/postgresql-9.4.1207.jre7.jar --jars /home/hadoop/postgresql-9.4.1207.jre7.jar
solve the problem? It seems I just added the same path as a parameter to a few more of the flags.
spark-shell --driver-class-path .... --jars ... works because all jar files listed in --jars are automatically distributed over the cluster.
Alternatively you could use
spark-shell --packages org.postgresql:postgresql:9.4.1207.jre7
and specify driver class as an option for DataFrameReader / DataFrameWriter
val df = sqlContext.read.format("jdbc").options(Map(
  "url" -> url, "dbtable" -> table, "driver" -> "org.postgresql.Driver"
)).load()
or even manually copy the required jars to the workers and place them somewhere on the CLASSPATH.