Spark dataframe Join issue - scala

The code snippet below works fine (read a CSV file, read a Parquet file, and join them):
// Reading the CSV file: three columns, number of records: 1
val df1 = spark.read.format("csv").load(filePath)
val df2 = spark.read.parquet(inputFilePath)
// Join with another table: number of records: 30 million, total columns: 15
df2.join(broadcast(df1), col("df2col1") === col("df1col1"), "right")
It's weird that the code snippet below doesn't work (read from HBase, read a Parquet file, and join them). The only difference is that df1 is read from HBase:
// Reading from HBase (this part works): three columns, number of records: 1
df1 = <read from HBase; code not shown>
// It reads from HBase properly and is able to show the one record.
df1.show
val df2 = spark.read.parquet(inputFilePath)
// Join with another table: number of records: 50 million, total columns: 15
df2.join(broadcast(df1), col("df2col1") === col("df1col1"), "right")
Error: Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: Total size of serialized results of 56 tasks (1024.4 MB) is bigger than spark.driver.maxResultSize (1024.0 MB)
I then added spark.driver.maxResultSize=5g, and a different error started occurring: a Java heap space error (run at ThreadPoolExecutor.java). If I observe memory usage in Manager, I see that usage just keeps going up until it reaches ~50 GB, at which point the OOM error occurs. So for whatever reason the amount of RAM being used to perform this operation is ~10x greater than the size of the RDD I'm trying to use.
If I persist df1 to memory and disk and run a count() on it, the program works fine. The code snippet is below:
// Reading from HBase: three columns, number of records: 1
df1 = <read from HBase; code not shown>
// Added: persist df1 and force it to materialize with a count
df1.persist(StorageLevel.MEMORY_AND_DISK)
val cnt = df1.count()
val df2 = spark.read.parquet(inputFilePath)
// Join with another table: number of records: 50 million, total columns: 15
df2.join(broadcast(df1), col("df2col1") === col("df1col1"), "right")
It works with the file but not with HBase, even though both hold the same data. I am running this on a cluster of 100 worker nodes with 125 GB of memory each, so memory is not the problem.
My question: the file and HBase hold the same data, and in both cases the read succeeds and show() displays the data. So why does only the HBase version fail? I am struggling to understand what might be going wrong with this code. Any suggestions will be appreciated.

When the data is extracted from HBase, Spark does not know how many rows will be retrieved, so the strategy it opts for is a sort-merge join.
It therefore tries to sort and shuffle the data across the executors.
To avoid the problem we can use a broadcast join, as in the last statement of your code snippet, so that df2 does not have to be sorted and shuffled on the key column.
However, since df1 is only one row, we can bypass the join altogether and use a CASE expression for the columns to be added.
Example:
// Use === to compare columns; Column.eq in Scala is reference equality and returns a Boolean.
df2.withColumn(
  "newCol",
  when(col("df2col1") === lit(hbaseKey), lit(hbaseValueCol1))
    .otherwise(lit(null)))

I sometimes struggle with this error too. It often occurs when Spark tries to broadcast a large table during a join (which happens when Spark's optimizer underestimates the size of the table, or the statistics are not correct). As there is no hint to force a sort-merge join (see "How to hint for sort merge join or shuffled hash join (and skip broadcast hash join)?"), the only option is to disable automatic broadcast joins by setting spark.sql.autoBroadcastJoinThreshold=-1
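A minimal sketch of that setting, assuming a local SparkSession and two small stand-in DataFrames (the object name, the helper name, and the sample data are illustrative and not from the question). With the threshold at -1 Spark should not choose a broadcast join on its own; an explicit broadcast(...) hint in the query is still honored.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object DisableAutoBroadcast {
  // Hypothetical helper: switches off automatic broadcast joins for this
  // session, then joins two tiny stand-in tables the same way the question does.
  def joinWithoutAutoBroadcast(spark: SparkSession): DataFrame = {
    import spark.implicits._
    spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "-1")
    val small = Seq((1, "a")).toDF("df1col1", "value") // stand-in for df1
    val large = (1 to 1000).toDF("df2col1")            // stand-in for df2
    large.join(small, $"df2col1" === $"df1col1", "right")
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("disable-auto-broadcast")
      .getOrCreate()
    val joined = joinWithoutAutoBroadcast(spark)
    joined.explain() // inspect which join strategy the planner picked
    println(joined.count())
    spark.stop()
  }
}
```

Checking the physical plan with explain() before and after changing the setting is a cheap way to confirm which join strategy is actually in use.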

When I have a memory problem during a join, it usually has one of two causes:
1. You have too few partitions in the DataFrames (the partitions are too big).
2. There are many duplicates in the two DataFrames on the join key, and the join explodes your memory.
Ad 1. I think you should look at the number of partitions you have in each table before the join. When Spark reads a file it does not necessarily keep the same number of partitions as the original table (Parquet, CSV or other). Reading from CSV and reading from HBase might create different numbers of partitions, and that is why you see differences in performance. Partitions that are too large become even larger after the join, and this creates the memory problem. Have a look at the Peak Execution Memory per task in the Spark UI. This will give you some idea of your memory usage per task. I have found it best to keep it below 1 GB.
Solution: repartition your tables before the join.
Ad 2. Maybe not the case here, but worth checking.
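Both checks can be sketched as follows. This is an illustration under assumptions, not the asker's code: the helper names, the partition count of 8, and the sample data are made up, and the right partition count for a real job depends on the data size.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.col

object RepartitionBeforeJoin {
  // Ad 1: repartition both sides on the join key so each task handles a smaller slice.
  def repartitionedJoin(left: DataFrame, right: DataFrame, key: String, numPartitions: Int): DataFrame = {
    val l = left.repartition(numPartitions, col(key))
    val r = right.repartition(numPartitions, col(key))
    l.join(r, Seq(key))
  }

  // Ad 2: list the keys with the most rows, to spot duplicates that multiply in the join.
  def heaviestKeys(df: DataFrame, key: String, n: Int): DataFrame =
    df.groupBy(col(key)).count().orderBy(col("count").desc).limit(n)

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("repartition-before-join")
      .getOrCreate()
    import spark.implicits._
    val left = Seq((1, "a"), (2, "b"), (2, "c")).toDF("id", "l")
    val right = Seq((2, "x"), (3, "y")).toDF("id", "r")
    println(s"left partitions before the join: ${left.rdd.getNumPartitions}")
    repartitionedJoin(left, right, "id", 8).show()
    heaviestKeys(left, "id", 5).show()
    spark.stop()
  }
}
```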

Related

Spark- write 128 MB size parquet files

I have a DataFrame (df) with more than 1 billion rows
df.coalesce(5)
.write
.partitionBy("Country", "Date")
.mode("append")
.parquet(datalake_output_path)
From the above command I understand only 5 worker nodes in my 100 worker node cluster (spark 2.4.5) will be performing all the tasks. Using coalesce(5) takes the process 7 hours to complete.
Should I try repartition instead of coalesce?
Is there a faster, more efficient way to write out 128 MB Parquet files, or do I first need to calculate the size of my DataFrame to determine how many partitions are required?
For example, if the size of my DataFrame is 1 GB and spark.sql.files.maxPartitionBytes = 128MB, should I first calculate the number of partitions required as 1 GB / 128 MB = approx. 8 and then do repartition(8) or coalesce(8)?
The idea is to maximize the size of the Parquet files in the output at write time, and to do so quickly.
You can get the size (dfSizeDiskMB) of your DataFrame df by persisting it and then checking the Storage tab on the Web UI, as in this answer. Armed with this information and an estimate of the expected Parquet compression ratio, you can then estimate the number of partitions you need to achieve your desired output file size, e.g.
val targetOutputPartitionSizeMB = 128
val parquetCompressionRatio = 0.1
// dfSizeDiskMB is the value read off the Storage tab; coalesce needs an Int of at least 1
val numOutputPartitions = math.max(1, math.ceil(dfSizeDiskMB * parquetCompressionRatio / targetOutputPartitionSizeMB).toInt)
df.coalesce(numOutputPartitions).write.parquet(path)
Note that spark.sql.files.maxPartitionBytes is not relevant here, as it is:
The maximum number of bytes to pack into a single partition when reading files.
(Unless df is the direct result of reading an input data source with no intermediate dataframes created. More likely the number of partitions for df is dictated by spark.sql.shuffle.partitions, being the number of partitions for Spark to use for dataframes created from joins and aggregations).
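The arithmetic behind that estimate can be wrapped in a small plain-Scala helper. This is a sketch with hypothetical names (PartitionEstimate, numOutputPartitions); the compression ratio is an input you have to estimate yourself, and nothing here is part of the Spark API.

```scala
object PartitionEstimate {
  // Estimates how many output partitions yield files of roughly targetFileSizeMB.
  // dfSizeMB:         size of the persisted DataFrame, e.g. from the Storage tab
  // compressionRatio: expected (Parquet size / persisted size), an assumption
  def numOutputPartitions(dfSizeMB: Double, compressionRatio: Double, targetFileSizeMB: Double): Int = {
    require(dfSizeMB >= 0 && compressionRatio > 0 && targetFileSizeMB > 0, "sizes must be positive")
    // Round up so files do not exceed the target, and never return fewer than 1.
    math.max(1, math.ceil(dfSizeMB * compressionRatio / targetFileSizeMB).toInt)
  }

  def main(args: Array[String]): Unit = {
    // The question's example: 1 GB of output with a 128 MB target.
    println(numOutputPartitions(dfSizeMB = 1024, compressionRatio = 1.0, targetFileSizeMB = 128))
  }
}
```

The result would then be passed to coalesce or repartition before the write.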
Should I try repartition instead of coalesce?
coalesce is usually better as it can avoid the shuffle associated with repartition, but note the warning in the docs about potentially losing parallelism in the upstream stages depending on your use case.
Coalesce is better if you are going from a higher number of partitions to a lower one. However, if your code does not shuffle before writing the df, the coalesce will be pushed down to the earliest possible point in the DAG, so the upstream work also runs with only that many tasks.
What you can do is process your df in, say, 100 partitions (or whatever number seems appropriate) and then persist it before writing.
Then bring the partitions down to 5 using coalesce and write it. This should probably give you better performance.

How spark shuffle partitions and partition by tag along with each other

I am reading a set of 10,000 parquet files of 10 TB cumulative size from HDFS and writing it back to HDFS in a partitioned manner using the following code:
spark.read.orc("HDFS_LOC").repartition(col("x")).write.partitionBy("x").orc("HDFS_LOC_1")
I am using
spark.sql.shuffle.partitions=8000
I see that Spark has written 5000 different partitions of "x" to HDFS (HDFS_LOC_1). How is the shuffle partitions setting of 8000 being used in this entire process? I see that only 15,000 files were written across all partitions of "x". Does it mean that Spark tried to create 8000 files in every partition of "x", found at write time that there was not enough data to write 8000 files in each partition, and ended up writing fewer files? Can you please help me understand this?
The setting spark.sql.shuffle.partitions=8000 will set the default shuffling partition number of your Spark programs. If you try to execute a join or aggregations just after setting this option, you will see this number taking effect (you can confirm that with df.rdd.getNumPartitions()). Please refer here for more information.
In your case, though, you are using this setting together with repartition(col("x")) and partitionBy("x"). Note that repartition(col("x")) without an explicit partition count is itself a shuffle, so it hash-partitions the data in memory into spark.sql.shuffle.partitions (here 8000) partitions; because all rows with the same value of x land in the same partition, at most cardinality("x") of them can be non-empty. partitionBy("x"), in contrast, controls the layout on disk: it writes one folder per distinct value of x to HDFS, and the number of files is only approximately predictable. Why approximately? Because there are more factors that determine the exact number of output files. Please check the following resources to get a better understanding of this topic:
Difference between df.repartition and DataFrameWriter partitionBy?
pyspark: Efficiently have partitionBy write to same number of total partitions as original table
So the first thing to consider when using repartitioning by column repartition(*cols) or partitionBy(*cols), is the number of unique values (cardinality) that the column (or the combination of columns) has.
That being said, if you want to set the number of in-memory partitions to 8000 explicitly, use repartition(partitionsNum, col("x")), where partitionsNum == 8000 in your case, and then call write.orc("HDFS_LOC_1"); keep in mind that partitions which receive no rows produce no output files. Otherwise, if you want to keep the number of partitions close to the cardinality of x, just call partitionBy("x") on the writer of your original df and then orc("HDFS_LOC_1") to store the data in HDFS. This will create cardinality(x) folders with your partitioned data.
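Why cardinality matters can be illustrated with a toy model of hash partitioning in plain Scala. This is a simplification under stated assumptions: Spark hashes the partitioning expressions with Murmur3, while this sketch uses hashCode, and the names HashPartitionModel, partitionOf and nonEmptyPartitions are hypothetical. The point it shows holds either way: rows with equal keys always land in the same partition, so the number of non-empty partitions cannot exceed the number of distinct keys.

```scala
object HashPartitionModel {
  // Toy stand-in for hash partitioning: maps a key to a partition index in [0, numPartitions).
  def partitionOf(key: Any, numPartitions: Int): Int =
    Math.floorMod(key.hashCode, numPartitions)

  // Counts how many partitions receive at least one key.
  def nonEmptyPartitions(keys: Seq[Any], numPartitions: Int): Int =
    keys.map(partitionOf(_, numPartitions)).distinct.size

  def main(args: Array[String]): Unit = {
    val keys = 0 until 5000 // 5000 distinct values of "x", as in the question
    val used = nonEmptyPartitions(keys, 8000)
    println(s"non-empty partitions out of 8000: $used")
  }
}
```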

Spark dataset write to parquet file takes forever

My Spark Scala app gets stuck at the statement below; it runs for more than 3 hours before being killed by our timeout settings. Any pointers on how to understand and interpret the job execution in the YARN UI and debug this issue are appreciated.
dataset
.repartition(100,$"Id")
.write
.mode(SaveMode.Overwrite)
.partitionBy(dateColumn)
.parquet(temppath)
I have a bunch of joins; the largest dataset is ~15 million rows and the smallest is < 100 rows. I tried multiple options, like increasing the executor memory and the Spark driver memory, but no luck so far. Note that I have cached the datasets I use multiple times, and the final dataset's storage level is set to MEMORY_AND_DISK_SER.
I am not sure whether the executors summary below will help or not:
executors (summary)
Total_tasks  Input  shuffle_read  shuffle_write
7749         98 GB  77 GB         106 GB
I would appreciate any pointers on how to identify the bottleneck based on the query plan or any other info.

Slow count of >1 billion rows from Cassandra via Apache Spark [duplicate]

I have setup Spark 2.0 and Cassandra 3.0 on a local machine (8 cores, 16gb ram) for testing purposes and edited spark-defaults.conf as follows:
spark.python.worker.memory 1g
spark.executor.cores 4
spark.executor.instances 4
spark.sql.shuffle.partitions 4
Next I imported 1.5 million rows in Cassandra:
test(
tid int,
cid int,
pid int,
ev list<double>,
primary key (tid)
)
test.ev is a list containing numeric values i.e. [2240,2081,159,304,1189,1125,1779,693,2187,1738,546,496,382,1761,680]
Now in the code, to test the whole thing, I just created a SparkSession, connected to Cassandra and ran a simple select count:
cassandra = spark.read.format("org.apache.spark.sql.cassandra")
df = cassandra.load(keyspace="testks",table="test")
df.select().count()
At this point, Spark outputs the count and takes about 28 seconds to finish the Job, distributed in 13 Tasks (in Spark UI, the total Input for the Tasks is 331.6MB)
Questions:
Is that the expected performance? If not, what am I missing?
Theory says the number of partitions of a DataFrame determines the number of tasks Spark will distribute the job in. If I am setting spark.sql.shuffle.partitions to 4, why is it creating 13 tasks? (I also checked the number of partitions by calling rdd.getNumPartitions() on my DataFrame.)
Update
A common operation I would like to test over this data:
Query a large data set, say, from 100,000 ~ N rows grouped by pid
Select ev, a list<double>
Perform an average on each member, assuming by now each list has the same length i.e df.groupBy('pid').agg(avg(df['ev'][1]))
As @zero323 suggested, I deployed an external machine (2 GB RAM, 4 cores, SSD) with Cassandra just for this test, and loaded the same data set. As expected, df.select().count() showed greater latency and poorer overall performance than in my previous test (it took about 70 seconds to finish the job).
Edit: I misunderstood his suggestion. @zero323 meant to let Cassandra perform the count instead of using Spark SQL, as explained here
I also wanted to point out that I am aware of the inherent anti-pattern of using a list<double> instead of a wide row for this type of data, but my concern at the moment is the time spent retrieving a large dataset rather than the actual average computation time.
Is that the expected performance? If not, what am I missing?
It looks slowish, but it is not exactly unexpected. In general, count is expressed as
SELECT 1 FROM table
followed by Spark-side summation. So while it is optimized, it is still rather inefficient, because you have to fetch N long integers from the external source just to sum them locally.
As explained by the docs, Cassandra-backed RDDs (not Datasets) provide an optimized cassandraCount method, which performs server-side counting.
Theory says the number of partitions of a DataFrame determines the number of tasks Spark will distribute the job in. If I am setting the spark.sql.shuffle.partitions to (...), why is creating (...) Tasks?
Because spark.sql.shuffle.partitions is not used here. This property is used to determine number of partitions for shuffles (when data is aggregated by some set of keys) not for Dataset creation or global aggregations like count(*) (which always use 1 partition for final aggregation).
If you are interested in controlling the number of initial partitions, you should take a look at spark.cassandra.input.split.size_in_mb, which defines:
Approx amount of data to be fetched into a Spark partition. Minimum number of resulting Spark partitions is 1 + 2 * SparkContext.defaultParallelism
As you can see, another factor here is spark.default.parallelism, but it is not exactly a subtle configuration, so depending on it is in general not an optimal choice.
I see that this is a very old question, but maybe someone needs it now.
When running Spark on a local machine, it is very important to set the master in SparkConf to "local[*]", which according to the documentation runs Spark with as many worker threads as there are logical cores on your machine.
It helped me increase the performance of the count() operation by 100% on a local machine compared to master "local".
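A minimal sketch of that setup in Scala, assuming the session is built in code rather than through spark-defaults.conf (the object name, helper name, and app name are illustrative). Printing defaultParallelism is one way to confirm how many worker threads the local master was given.

```scala
import org.apache.spark.sql.SparkSession

object LocalAllCores {
  // Builds a session whose local master uses one worker thread per logical core.
  def buildSession(): SparkSession =
    SparkSession.builder()
      .master("local[*]")
      .appName("local-all-cores")
      .getOrCreate()

  def main(args: Array[String]): Unit = {
    val spark = buildSession()
    println(s"default parallelism: ${spark.sparkContext.defaultParallelism}")
    // A stand-in count on generated data; the Cassandra table from the question is not used here.
    println(spark.range(0L, 1000000L).count())
    spark.stop()
  }
}
```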

How to iterate over large Cassandra table in small chunks in Spark

In my test environment I have 1 Cassandra node and 3 Spark nodes. I want to iterate over a seemingly large table that has about 200k rows, each taking roughly 20-50 KB.
CREATE TABLE foo (
uid timeuuid,
events blob,
PRIMARY KEY ((uid))
)
Here is the Scala code that is executed on the Spark cluster:
val rdd = sc.cassandraTable("test", "foo")
// This pulls records in memory, taking ~6.3GB
var count = rdd.select("events").count()
// Fails nearly immediately with
// NoHostAvailableException: All host(s) tried for query failed [...]
var events = rdd.select("events").collect()
Cassandra 2.0.9, Spark: 1.2.1, Spark-cassandra-connector-1.2.0-alpha2
I tried to only run collect, without count - in this case it just fails fast with NoHostAvailableException.
Question: what is the correct approach to iterate over large table reading and processing small batch of rows at a time?
There are 2 settings in the Cassandra Spark Connector to adjust the chunk size (put them in the SparkConf object):
spark.cassandra.input.split.size: number of rows per Spark partition (default 100000)
spark.cassandra.input.page.row.size: number of rows per fetched page (ie network roundtrip) (default 1000)
Furthermore, you shouldn't use the collect action in your example, because it fetches all the rows into the memory of the driver application and may raise an out-of-memory exception. Use collect only if you know for sure that it will produce a small number of rows. The count action is different: it produces only an integer. So I advise you to load your data from Cassandra as you did, process it, and store the result (in Cassandra, HDFS, whatever).
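The advice above can be sketched as follows, under explicit assumptions: the two configuration values (10000 and 100) are arbitrary examples, the object and helper names are hypothetical, and an in-memory RDD of byte arrays stands in for the "events" column of the Cassandra table so the sketch runs without a cluster. The processing happens partition by partition on the executors, and only one number per partition travels back to the driver.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object ChunkedProcessing {
  // Connector chunk-size settings from the answer, with example values (assumptions).
  def confWithSmallChunks(): SparkConf = new SparkConf()
    .setMaster("local[*]")
    .setAppName("chunked-processing")
    .set("spark.cassandra.input.split.size", "10000")  // rows per Spark partition
    .set("spark.cassandra.input.page.row.size", "100") // rows per fetched page

  // Sums payload sizes inside each partition, then combines the per-partition totals.
  // Unlike collect(), the rows themselves never reach the driver.
  def totalBytes(rdd: RDD[Array[Byte]]): Long =
    rdd.mapPartitions(rows => Iterator(rows.map(_.length.toLong).sum)).fold(0L)(_ + _)

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(confWithSmallChunks())
    // Stand-in data; with the connector this would come from the Cassandra table instead.
    val events = sc.parallelize(Seq.fill(1000)(Array.fill[Byte](10)(0)), 8)
    println(totalBytes(events))
    sc.stop()
  }
}
```

The same shape works for any per-row processing: do the work inside mapPartitions or foreachPartition and write the results out from the executors, rather than pulling the rows to the driver.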