How to iterate over large Cassandra table in small chunks in Spark - scala

In my test environment I have 1 Cassandra node and 3 Spark nodes. I want to iterate over an apparently large table that has about 200k rows, each roughly 20-50KB in size.
CREATE TABLE foo (
uid timeuuid,
events blob,
PRIMARY KEY ((uid))
)
Here is the Scala code that is executed on the Spark cluster:
val rdd = sc.cassandraTable("test", "foo")
// This pulls records in memory, taking ~6.3GB
var count = rdd.select("events").count()
// Fails nearly immediately with
// NoHostAvailableException: All host(s) tried for query failed [...]
var events = rdd.select("events").collect()
Cassandra 2.0.9, Spark: 1.2.1, Spark-cassandra-connector-1.2.0-alpha2
I tried running only collect, without count; in this case it just fails fast with NoHostAvailableException.
Question: what is the correct approach to iterate over a large table, reading and processing a small batch of rows at a time?

There are 2 settings in the Cassandra Spark Connector to adjust the chunk size (put them in the SparkConf object):
spark.cassandra.input.split.size: number of rows per Spark partition (default 100000)
spark.cassandra.input.page.row.size: number of rows per fetched page (i.e. per network round trip) (default 1000)
Furthermore, you shouldn't use the collect action in your example because it fetches all the rows into the driver application's memory and may raise an out-of-memory exception. You can use the collect action only if you know for sure it will produce a small number of rows. The count action is different: it produces only an integer. So I advise you to load your data from Cassandra like you did, process it, and store the result (in Cassandra, HDFS, whatever).
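For illustration, here is a minimal Scala sketch of both ideas together: the two settings go on the SparkConf, and the rows are processed and stored from the executors instead of being collected to the driver. The app name, the chosen values, the blob-size computation and the output table foo_event_sizes are placeholders, not something from the original question:
import com.datastax.spark.connector._
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("foo-scan")
  .set("spark.cassandra.input.split.size", "10000")   // rows per Spark partition
  .set("spark.cassandra.input.page.row.size", "100")  // rows per fetched page

val sc = new SparkContext(conf)

// Process the table partition by partition on the executors instead of
// collecting it; here the "processing" is just computing each blob's size.
val sizes = sc.cassandraTable("test", "foo")
  .select("uid", "events")
  .map(row => (row.getUUID("uid"), row.getBytes("events").remaining()))

// store the result instead of collecting it to the driver
sizes.saveToCassandra("test", "foo_event_sizes", SomeColumns("uid", "size"))
Any other sink (HDFS, files, another Cassandra table) works the same way; the point is to avoid collect().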

Related

How spark shuffle partitions and partition by tag along with each other

I am reading a set of 10,000 parquet files of 10 TB cumulative size from HDFS and writing them back to HDFS in a partitioned manner using the following code:
spark.read.orc("HDFS_LOC").repartition(col("x")).write.partitionBy("x").orc("HDFS_LOC_1")
I am using
spark.sql.shuffle.partitions=8000
I see that Spark wrote 5000 different partitions of "x" to HDFS (HDFS_LOC_1). How are the 8000 shuffle partitions being used in this entire process? I see that only 15,000 files were written across all partitions of "x". Does it mean that Spark tried to create 8000 files in every partition of "x", found at write time that there was not enough data to write 8000 files per partition, and ended up writing fewer files? Can you please help me understand this?
The setting spark.sql.shuffle.partitions=8000 sets the default number of shuffle partitions for your Spark program. If you execute a join or an aggregation just after setting this option, you will see this number taking effect (you can confirm that with df.rdd.getNumPartitions()). Please refer here for more information.
In your case though, you are using this setting together with repartition(col("x")) and partitionBy("x"). Therefore your program will not be affected by this setting unless a join or an aggregation transformation is used first. The difference between repartition and partitionBy is that the first partitions the data in memory, creating cardinality("x") partitions, while the second writes approximately the same number of partitions to HDFS. Why approximately? Because there are more factors that determine the exact number of output files. Please check the following resources to get a better understanding of this topic:
Difference between df.repartition and DataFrameWriter partitionBy?
pyspark: Efficiently have partitionBy write to same number of total partitions as original table
So the first thing to consider when repartitioning by column, with repartition(*cols) or partitionBy(*cols), is the number of unique values (cardinality) of the column (or the combination of columns).
That being said, if you want to ensure that you will create 8000 partitions, i.e. output files, use repartition(partitionsNum, col("x")) (where partitionsNum == 8000 in your case) and then call write.orc("HDFS_LOC_1"). Otherwise, if you want to keep the number of partitions close to the cardinality of x, just call partitionBy("x") on your original df and then write.orc("HDFS_LOC_1") to store the data to HDFS. This will create cardinality(x) folders with your partitioned data.
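For illustration, a minimal sketch of the two variants described above, reusing the column x and the output path from the question (8000 is just the desired partition count):
import org.apache.spark.sql.functions.col

// Variant 1: force ~8000 output files in total
df.repartition(8000, col("x"))
  .write
  .orc("HDFS_LOC_1")

// Variant 2: one output directory per distinct value of x,
// with the number of files per directory driven by the data layout
df.write
  .partitionBy("x")
  .orc("HDFS_LOC_1")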

Memory Errors with Iterative Spark DataFrame Creation

I am converting raw records that arrive to me in single zlib-compressed files into enriched parquet records for later processing in Spark. I don't control the zlib files, and I need the parquet output consistent with other processing. I'm working in PySpark and Spark 2.3. My approach works except when the zlib file is reasonably large (~300MB). It holds in memory fine, but Spark runs out of memory. If I bump my driver memory up (8g), it works. It feels like a memory leak from using function calls, as shown below.
The enrichment process explodes the data in Spark somewhat, so I'm using iterative dataframe creation and a decent driver memory (4g default). I load the whole decompressed file into memory and then pass a chunk of it to a simple routine that creates a Spark dataframe, adds columns through the enrichment process, and then writes to parquet. The chunks are fixed in size. The enrichment itself can vary, but overall a chunk should not create too large of a dataframe in the driver, I believe. The output chunks, in parquet format, are around 100MB.
from typing import List

import pyspark
from pyspark.sql.types import StructType, StructField, StringType

def process_file(gzfile, spark, chunk_size=2000000):
    # load data and decompress
    data = load_original_data(gzfile)
    if len(data) == 0:
        raise ValueError("No records loaded from file ", gzfile)
    chunks = len(data) // chunk_size + 1
    offset = 0
    for chunk in range(chunks):
        # convert the chunk into a spark dataframe
        df = raw_to_spark(data[offset:offset + chunk_size], spark)
        offset += chunk_size
        # enrich the data while in a spark dataframe w/ more columns
        df = extract_fields_from_raw(df)
        save_to_parquet(df, parquet_output_path)
    return

def raw_to_spark(events: List[str], spark: pyspark.sql.SparkSession) -> pyspark.sql.DataFrame:
    """
    convert the list of raw strings into a spark dataframe so we can do all the processing. this list is large in
    memory so we pop one list while building the new one. then throw the new list into spark.
    """
    schema = StructType([StructField("event", StringType())])
    rows = []  # make the list smaller as we create the row list for the dataframe
    while events:
        event = events.pop()
        if event.count(",") >= 6:  # make sure there are 7 fields at least
            rows.append(pyspark.sql.Row(event))
    rdd = spark.sparkContext.parallelize(rows, numSlices=2000)  # we need to partition in order to pass to workers
    return spark.createDataFrame(rdd, schema=schema)

def extract_fields_from_raw(df: pyspark.sql.DataFrame) -> pyspark.sql.DataFrame:
    """
    this adds columns to the dataframe for enrichment before saving
    """
    ...
So, Spark dataframes of fewer than 2M rows are being created as I loop through the decompressed data. Each of these dataframes should have no problem sitting in the 4g driver space. I get the same error if, for example, I use a chunk size of 1M. The Spark log shows that the out-of-memory error occurs at the same memory consumption each time, e.g., 4.5GB of memory used.
I suspect what is happening is that memory isn't being released after each call to raw_to_spark, but I'm not sure how to show that... and I can't see logically how to push the Spark dataframe conversion into a function otherwise.
Am I missing a best practice? thanks.

Spark dataframe Join issue

The code snippet below works fine (read CSV, read Parquet, and join them):
//Reading csv file -- getting three columns: Number of records: 1
df1=spark.read.format("csv").load(filePath)
df2=spark.read.parquet(inputFilePath)
//Join with another table: Number of records: 30 million, total columns: 15
df2.join(broadcast(df1), col("df2col1") === col("df1col1"), "right")
It's weird that the code snippet below doesn't work (read HBase, read Parquet and join them; the only difference is reading from HBase):
//Reading from HBase -- it reads from HBase properly, getting three columns; Number of records: 1
df1=read from Hbase code
// It read from Hbase properly and able to show one record.
df1.show
df2=spark.read.parquet(inputFilePath)
//Join with another table: Number of records: 50 million, total columns: 15
df2.join(broadcast(df1), col("df2col1") === col("df1col1"), "right")
Error: Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: Total size of serialized results of 56 tasks (1024.4 MB) is bigger than spark.driver.maxResultSize (1024.0 MB)
Then I added spark.driver.maxResultSize=5g, and another error started occurring: a Java heap space error (run at ThreadPoolExecutor.java). If I observe memory usage in the manager, I see that usage just keeps going up until it reaches ~50GB, at which point the OOM error occurs. So for whatever reason the amount of RAM being used to perform this operation is ~10x greater than the size of the RDD I'm trying to use.
If I persist df1 to memory and disk and do a count(), the program works fine. The code snippet is below:
//Reading from HBase -- getting three columns: Number of records: 1
df1=read from Hbase code
df1.persist(StorageLevel.MEMORY_AND_DISK)
val cnt = df1.count()
df2=spark.read.parquet(inputFilePath)
//Join with another table: Number of records: 50 million, total columns: 15
df2.join(broadcast(df1), col("df2col1") === col("df1col1"), "right")
It works with the file even though it has the same data, but not with HBase. I am running this on a 100 worker-node cluster with 125 GB of memory on each node, so memory is not the problem.
My question is: both the file and HBase have the same data, and both can be read and show() the data. Why does only HBase fail? I am struggling to understand what might be going wrong with this code. Any suggestions will be appreciated.
When the data is extracted, Spark is unaware of the number of rows retrieved from HBase, hence the strategy it opts for is a sort merge join.
Thus it tries to sort and shuffle the data across the executors.
To avoid the problem we can use a broadcast join, and at the same time avoid sorting and shuffling the data from df2 on the key column, which is what the broadcast in the last statement of your code snippet is meant to do.
However, to bypass this (since df1 is only one row), we can use a CASE expression for the columns to be padded.
example:
df.withColumn(
  "newCol",
  when(col("df2col1") === lit(hbaseKey), lit(hbaseValueCol1))
    .otherwise(lit(null))
)
I'm sometimes struggling with this error too. Often it occurs when Spark tries to broadcast a large table during a join (that happens when Spark's optimizer underestimates the size of the table, or the statistics are not correct). As there is no hint to force a sort-merge join (How to hint for sort merge join or shuffled hash join (and skip broadcast hash join)?), the only option is to disable broadcast joins by setting spark.sql.autoBroadcastJoinThreshold = -1.
When I have a problem with memory during a join, it usually means one of two things:
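For example, a minimal sketch of disabling automatic broadcasting; the app name is a placeholder, and the same property can also be changed on an existing session:
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("join-without-auto-broadcast")
  // -1 disables the automatic broadcast-hash join selection
  .config("spark.sql.autoBroadcastJoinThreshold", "-1")
  .getOrCreate()

// or, at runtime on an existing session:
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "-1")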
You have too few partitions in dataframes (partitions are too big)
There are many duplicates in the two dataframes on the key on which you join, and the join explodes your memory.
Ad 1. I think you should look at the number of partitions you have in each table before the join. When Spark reads a file, it does not necessarily keep the same number of partitions as the original table (parquet, csv or other). Reading from csv vs reading from HBase might create a different number of partitions, and that is why you see differences in performance. Too-large partitions become even larger after the join, and this creates memory problems. Have a look at the Peak Execution Memory per task in the Spark UI. This will give you some idea about your memory usage per task. I found it best to keep it below 1 GB.
Solution: Repartition your tables before the join (see the sketch at the end of this answer).
Ad 2. Maybe not the case here, but worth checking.
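A minimal sketch of the repartition-before-join suggestion from point 1; dfLeft, dfRight, the key column "key" and the 400 partitions are placeholders, not taken from the question:
import org.apache.spark.sql.functions.col

// Repartition both sides on the join key before joining, so each task
// handles a smaller, co-partitioned slice of the data.
val left  = dfLeft.repartition(400, col("key"))
val right = dfRight.repartition(400, col("key"))

val joined = left.join(right, Seq("key"))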

How to create multiple Spark tasks to query Cassandra partitions

I have an application that is using Spark (with Spark Job Server) that uses a Cassandra store. My current setup is that of a client mode running with master=local[*]. So there is a single Spark executor which is also the driver process that is using all 8 cores of the machine. I have a Cassandra instance running on the same machine.
The Cassandra tables have a primary key of the form ((datasource_id, date), clustering_col_1...clustering_col_n) where date is a single day of the form "2019-02-07" and is part of a composite partition key.
In my Spark application, I am running a query like so:
df.filter(col("date").isin(days: _*))
In the Spark physical plan, I notice that these filters, along with the filter for the "datasource_id" partition key, are pushed down to the Cassandra CQL query.
For our biggest datasources, I know that the partitions are around 30MB in size. So I have the following setting in the Spark Job Server configuration:
spark.cassandra.input.split.size_in_mb = 1
However, I notice that there is no parallelization in the Cassandra loading step. Although there are multiple Cassandra partitions that are >1MB, no additional Spark partitions are created. There is only a single task that does all the querying on a single core, thus taking ~20 secs to load data for a 1-month date range that corresponds to ~1 million rows.
I have tried the alternative approach below:
days.tail.foldLeft(df.filter(col("date").equalTo(days.head)))((acc: DataFrame, day: String) => {
  acc union df.filter(col("date").equalTo(day))
})
This does indeed create a Spark partition (or task) for every "day" partition in Cassandra. However, for smaller datasources where the Cassandra partitions are much smaller, this method proves quite expensive in terms of the excessive number of tasks created and the overhead of coordinating them. For these datasources it would be totally fine to lump many Cassandra partitions into one Spark partition, which is why I thought the spark.cassandra.input.split.size_in_mb configuration would prove useful in dealing with both small and large datasources.
Is my understanding wrong? Is there something else that I'm missing in order for this configuration to take effect?
P.S. I have also read the answers about using joinWithCassandraTable. However, our code relies on using DataFrame. Also, converting from a CassandraRDD to a DataFrame is not very viable for us since our schema is dynamic and cannot be specified using case classes.

Slow count of >1 billion rows from Cassandra via Apache Spark [duplicate]

I have set up Spark 2.0 and Cassandra 3.0 on a local machine (8 cores, 16GB RAM) for testing purposes and edited spark-defaults.conf as follows:
spark.python.worker.memory 1g
spark.executor.cores 4
spark.executor.instances 4
spark.sql.shuffle.partitions 4
Next I imported 1.5 million rows into Cassandra:
CREATE TABLE test (
tid int,
cid int,
pid int,
ev list<double>,
primary key (tid)
)
test.ev is a list containing numeric values i.e. [2240,2081,159,304,1189,1125,1779,693,2187,1738,546,496,382,1761,680]
Now in the code, to test the whole thing I just created a SparkSession, connected to Cassandra, and made a simple select count:
cassandra = spark.read.format("org.apache.spark.sql.cassandra")
df = cassandra.load(keyspace="testks",table="test")
df.select().count()
At this point, Spark outputs the count and takes about 28 seconds to finish the Job, distributed in 13 Tasks (in Spark UI, the total Input for the Tasks is 331.6MB)
Questions:
Is that the expected performance? If not, what am I missing?
Theory says the number of partitions of a DataFrame determines the number of tasks Spark will distribute the job into. If I set spark.sql.shuffle.partitions to 4, why is it creating 13 tasks? (I also checked the number of partitions by calling rdd.getNumPartitions() on my DataFrame.)
Update
A common operation I would like to test over this data:
Query a large data set, say, from 100,000 ~ N rows grouped by pid
Select ev, a list<double>
Perform an average on each member, assuming by now each list has the same length i.e df.groupBy('pid').agg(avg(df['ev'][1]))
As #zero323 suggested, I deployed an external machine (2GB RAM, 4 cores, SSD) with Cassandra just for this test, and loaded the same data set. The result of df.select().count() was, as expected, greater latency and overall poorer performance compared with my previous test (it took about 70 seconds to finish the job).
Edit: I misunderstood his suggestion. #zero323 meant letting Cassandra perform the count instead of using Spark SQL, as explained here.
Also I wanted to point out that I am aware of the inherent anti-pattern of using a list<double> instead of a wide row for this type of data, but my concern at this moment is more the time spent retrieving a large dataset than the actual average computation time.
Is that the expected performance? If not, what am I missing?
It looks slowish, but it is not exactly unexpected. In general, count is expressed as
SELECT 1 FROM table
followed by Spark-side summation. So while it is optimized, it is still rather inefficient because you have to fetch N long integers from the external source just to sum them locally.
As explained in the docs, Cassandra-backed RDDs (not Datasets) provide an optimized cassandraCount method which performs server-side counting.
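For illustration, a minimal Scala sketch of the RDD-based count, reusing the keyspace/table names from the question (the question itself uses PySpark; cassandraCount belongs to the connector's Scala/Java RDD API):
import com.datastax.spark.connector._
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("cassandra-count-rdd").getOrCreate()

// cassandraCount() pushes the counting down to Cassandra,
// so only per-token-range counts travel over the network
val rowCount = spark.sparkContext
  .cassandraTable("testks", "test")
  .cassandraCount()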
Theory says the number of partitions of a DataFrame determines the number of tasks Spark will distribute the job in. If I am setting the spark.sql.shuffle.partitions to (...), why is creating (...) Tasks?
Because spark.sql.shuffle.partitions is not used here. This property determines the number of partitions for shuffles (when data is aggregated by some set of keys), not for Dataset creation or global aggregations like count(*) (which always use 1 partition for the final aggregation).
If you are interested in controlling the number of initial partitions, you should take a look at spark.cassandra.input.split.size_in_mb, which defines:
Approx amount of data to be fetched into a Spark partition. Minimum number of resulting Spark partitions is 1 + 2 * SparkContext.defaultParallelism
As you can see, another factor here is spark.default.parallelism, but it is not exactly a subtle configuration, so depending on it is in general not an optimal choice.
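For illustration, a minimal sketch of setting it through the DataFrame API, reusing the keyspace/table from the question; the value 16 is only an example, not a recommendation:
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("cassandra-count-df")
  // smaller splits -> more (smaller) input partitions per table scan
  .config("spark.cassandra.input.split.size_in_mb", "16")
  .getOrCreate()

val df = spark.read
  .format("org.apache.spark.sql.cassandra")
  .options(Map("keyspace" -> "testks", "table" -> "test"))
  .load()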
I see that this is a very old question, but maybe someone needs it now.
When running Spark on a local machine, it is very important to set the SparkConf master to "local[*]", which according to the documentation runs Spark with as many worker threads as there are logical cores on your machine.
It helped me increase the performance of the count() operation by 100% on a local machine compared to master "local".
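A minimal sketch of that setting, assuming you build the SparkConf and session yourself (the app name is a placeholder):
import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession

val conf = new SparkConf()
  .setMaster("local[*]")  // one worker thread per logical core
  .setAppName("local-count-test")

val spark = SparkSession.builder().config(conf).getOrCreate()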