I am converting raw records that arrive to me in a single zlib-compressed file into enriched parquet records for later processing in Spark. I don't control the zlib file, and I need the parquet to be consistent with other processing. I'm working in PySpark on Spark 2.3. My approach works except when the zlib file is reasonably large (~300 MB). The decompressed data holds in memory fine, but Spark runs out of memory. If I push my driver memory up (8g), it works. It feels like a memory leak from using the function calls shown below.
The enrichment process explodes the data in Spark somewhat, so I'm creating dataframes iteratively and using a decent-sized driver memory (4g default). I load the entire decompressed file into memory and then pass a fixed-size chunk of it to a simple routine that creates a Spark dataframe, adds columns through the enrichment process, and writes to parquet. The enrichment itself can vary, but overall the chunk size should not create too large a dataframe in the driver, I believe. The output chunks, in parquet format, are around 100 MB.
from typing import List

import pyspark
from pyspark.sql.types import StructType, StructField, StringType


def process_file(gzfile, spark, chunk_size=2000000):
    # load data and decompress
    data = load_original_data(gzfile)
    if len(data) == 0:
        raise ValueError(f"No records loaded from file {gzfile}")
    chunks = len(data) // chunk_size + 1
    offset = 0
    for chunk in range(chunks):
        # convert the chunk into a spark dataframe
        df = raw_to_spark(data[offset:offset + chunk_size], spark)
        offset += chunk_size
        # enrich the data while in a spark dataframe w/ more columns
        df = extract_fields_from_raw(df)
        save_to_parquet(df, parquet_output_path)
    return


def raw_to_spark(events: List[str], spark: pyspark.sql.SparkSession) -> pyspark.sql.DataFrame:
    """
    Convert the list of raw strings into a spark dataframe so we can do all the processing.
    The list is large in memory, so we pop from it while building the new row list,
    then hand the new list to Spark.
    """
    schema = StructType([StructField("event", StringType())])
    rows = []  # shrink the source list as we create the row list for the dataframe
    while events:
        event = events.pop()
        if event.count(",") >= 6:  # make sure there are at least 7 fields
            rows.append(pyspark.sql.Row(event))
    rdd = spark.sparkContext.parallelize(rows, numSlices=2000)  # we need to partition in order to pass to workers
    return spark.createDataFrame(rdd, schema=schema)


def extract_fields_from_raw(df: pyspark.sql.DataFrame) -> pyspark.sql.DataFrame:
    """
    This adds columns to the dataframe for enrichment before saving.
    """
So, Spark dataframes of fewer than 2M rows are being created as I loop through the decompressed data. Each of these dataframes should have no problem sitting in the 4g driver space. I get the same error if, for example, I use a chunk size of 1M. The Spark log shows the out-of-memory error occurring at the same memory consumption before failure, e.g., 4.5 GB of memory used.
I suspect that memory isn't being released after each call to raw_to_spark, but I'm not sure how to show that... and I can't see how else to structure the Spark dataframe conversion as a function.
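To test that memory-release theory, here is a minimal sketch (my addition, not part of the original setup) that logs approximate JVM driver heap usage after each chunk. It goes through PySpark's private _jvm gateway, so treat it purely as a debugging aid, and the helper name is made up:

    def log_driver_heap(spark, label):
        # Ask the driver JVM (via the py4j gateway) how much heap it is currently using.
        runtime = spark.sparkContext._jvm.java.lang.Runtime.getRuntime()
        used_mb = (runtime.totalMemory() - runtime.freeMemory()) / (1024 * 1024)
        print("{}: approx JVM driver heap in use: {:.0f} MB".format(label, used_mb))

Called at the end of each loop iteration in process_file, e.g. log_driver_heap(spark, "after chunk %d" % chunk), a steadily climbing number across chunks would support the leak suspicion, while a flat number would point elsewhere.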
Am I missing a best practice? Thanks.
Related
I have 5 steps, which produce df_a, df_b, df_c, df_e and df_f.
Each step generates a dataframe (df_a, for instance) and persists it as parquet files.
The file is used in the subsequent step (df_b, for example).
Processing df_e and df_f took about 30 minutes.
However, when I started a new Spark session, read the df_c parquet back into a dataframe, and then processed df_e and df_f, it took less than a minute.
Was it because reading the parquet file stores the data in a more compact way?
Should I overwrite the dataframe with the Spark read after writing to storage, to improve performance?
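For reference, here is a small PySpark sketch (my addition, not from the original posts) of the write-then-read-back pattern the question describes; the step_* helpers and the path are hypothetical placeholders:

    df_c = step_c(df_b)                               # some expensive transformation
    df_c.write.mode("overwrite").parquet("/tmp/df_c")
    df_c = spark.read.parquet("/tmp/df_c")            # swap in the materialized copy
    df_e = step_e(df_c)                               # downstream steps now start from the
    df_f = step_f(df_e)                               # stored parquet, not the full lineage

Reading the parquet back truncates the logical plan, so df_e and df_f are computed from the stored data rather than by re-evaluating everything that produced df_c, which matches the speed-up described above.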
Using the Parquet file format is definitely best from a processing point of view, but in your case:
Why don't you convert the very first step to parquet (i.e. start by reading a parquet file in the very first step)?
If storing an intermediate dataframe as parquet improves your performance, then maybe you can instead apply a filter to focus on the part of the data you need to read; that is as good as reading from the parquet format, because storing an intermediate dataframe as parquet also takes time to write the data and read it back into Spark.
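As a rough illustration of that filtering idea (my addition; the path and column name are hypothetical), reading parquet directly in the first step lets a filter be pushed down to the scan:

    df_a = (spark.read.parquet("/data/step_a_input")           # hypothetical parquet source
                 .filter("event_date >= '2020-01-01'"))        # hypothetical filter column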
I have a DataFrame (df) with more than 1 billion rows
df.coalesce(5)
.write
.partitionBy("Country", "Date")
.mode("append")
.parquet(datalake_output_path)
From the above command, I understand that only 5 worker nodes in my 100-worker-node cluster (Spark 2.4.5) will be performing all the tasks. Using coalesce(5) takes the process 7 hours to complete.
Should I try repartition instead of coalesce?
Is there a faster / more efficient way to write out 128 MB parquet files, or do I need to first calculate the size of my dataframe to determine how many partitions are required?
For example, if the size of my dataframe is 1 GB and spark.sql.files.maxPartitionBytes = 128 MB, should I first calculate the number of partitions required as 1 GB / 128 MB = approximately 8 and then do repartition(8) or coalesce(8)?
The idea is to maximize the size of the parquet files in the output at write time while keeping the write fast.
You can get the size (dfSizeDiskMB) of your dataframe df by persisting it and then checking the Storage tab on the Web UI as in this answer. Armed with this information and an estimate of the expected Parquet compression ratio, you can then estimate the number of partitions you need to achieve your desired output file size, e.g.
val targetOutputPartitionSizeMB = 128
val parquetCompressionRatio = 0.1
val numOutputPartitions = (dfSizeDiskMB * parquetCompressionRatio / targetOutputPartitionSizeMB).ceil.toInt
df.coalesce(numOutputPartitions).write.parquet(path)
Note that spark.sql.files.maxPartitionBytes is not relevant here, as it is:
The maximum number of bytes to pack into a single partition when reading files.
(Unless df is the direct result of reading an input data source with no intermediate dataframes created. More likely, the number of partitions for df is dictated by spark.sql.shuffle.partitions, the number of partitions Spark uses for dataframes created from joins and aggregations.)
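As a small aside (my addition), that shuffle-partition setting can be inspected and changed at runtime from the session; shown here in PySpark, with 400 as a purely illustrative value:

    print(spark.conf.get("spark.sql.shuffle.partitions"))   # defaults to 200
    spark.conf.set("spark.sql.shuffle.partitions", "400")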
Should I try repartition instead of coalesce?
coalesce is usually better as it can avoid the shuffle associated with repartition, but note the warning in the docs about potentially losing parallelism in the upstream stages depending on your use case.
Coalesce is better if you are going from a higher number of partitions to a lower number. However, if your code isn't doing a shuffle before writing the df, the coalesce will be pushed down to the earliest point possible in the DAG.
What you can do is process your df in, say, 100 partitions or whatever number seems appropriate, and then persist it before writing.
Then bring your partitions down to 5 using coalesce and write it out. This should give you better performance.
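A rough PySpark sketch of that suggestion (my addition), reusing the df and datalake_output_path names from the question and an arbitrary working partition count of 100:

    from pyspark import StorageLevel

    processed = (df.repartition(100)                    # keep parallelism for the heavy work
                   .persist(StorageLevel.MEMORY_AND_DISK))
    processed.count()                                   # materialize before narrowing

    (processed.coalesce(5)
              .write
              .partitionBy("Country", "Date")
              .mode("append")
              .parquet(datalake_output_path))

Because the coalesce is applied to the already-persisted result, it only narrows the final write rather than dragging the whole upstream computation down to 5 tasks.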
Consider this general algorithm:
val first : DataFrame = ...  // about 100 MB
val second : DataFrame = ... // about 5 GB
val third : DataFrame = ...  // about 7 GB
val fourth : DataFrame = ... // about 13 GB
// all dataframes are filtered and have all columns renamed; a new column is added to `third` and `fourth`
val firstAndSecond = first.join(second, first("first_id") === second("second_id"))
val thirdsAndAll = firstAndSecond.join(third, firstAndSecond("some_id") === third("third_id"))
val fourthAndAll = thirdsAndAll.join(fourth, thirdsAndAll("other_id") === fourth("fourth_id"))
fourthAndAll.write.mode(saveMode = SaveMode.Overwrite).parquet("file://C:path")
Notes:
All dataframes are read from and written to an SSD drive.
Read and write operations are performed on parquet files.
The program was run on a Threadripper with 8 cores (16 virtual) and 80 GB RAM (Spark consumed about 25 GB); 99% of the time (except while the last file is being written), all 16 cores are loaded at 100%.
The problem
The output parquet files have very different sizes, from 100 KB to 500 MB, and the bigger files take a very long time to write.
E.g. every file is written by one of the threads, and the threads that write the 500 MB, 450 MB (etc.) files take far too long (for 500 MB it was about 8 hours).
Any thoughts on how to set up Spark so it writes parquet files of more or less equal size, with a more even CPU load?
(I work mostly with RDDs and don't have much experience with the dataframe API, but I guess it works the same way.)
I guess .parquet creates one file per partition of the RDD. If the file sizes are drastically different, then the data is not evenly distributed across the partitions.
That could also decrease the performance of the join operations. The repartition command should help.
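As a small illustration of that suggestion (my addition, shown in PySpark rather than the Scala of the question; 200 is just an example target and output_path is a placeholder), shuffling the joined result into equal-sized partitions right before the write evens out the output file sizes:

    evened = fourthAndAll.repartition(200)                # full shuffle into roughly equal partitions
    evened.write.mode("overwrite").parquet(output_path)   # output_path is a placeholder

Each of the 200 partitions then becomes one output file of roughly similar size, at the cost of an extra shuffle.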
So the question is in the subject. I think I don't correctly understand how repartition works. In my mind, when I say somedataset.repartition(600), I expect all the data to be partitioned into equal-sized pieces across the workers (say 60 workers).
So, for example, I have a big chunk of data to load from unbalanced files, let's say 400 files, where 20% are 2 GB in size and the other 80% are about 1 MB. I have this code to load the data:
val source = sparkSession.read.format("com.databricks.spark.csv")
.option("header", "false")
.option("delimiter","\t")
.load(mypath)
Then I want to convert the raw data to my intermediate object, filter irrelevant records, convert to the final object (with additional attributes), and then partition by some columns and write to parquet. In my mind it seems reasonable to balance the data (40000 partitions) across the workers and then do the work like this:
val ds: Dataset[FinalObject] = source.repartition(600)
.map(parse)
.filter(filter.IsValid(_))
.map(convert)
.persist(StorageLevel.DISK_ONLY)
val count = ds.count
log(count)
val partitionColumns = List("region", "year", "month", "day")
ds.repartition(partitionColumns.map(new org.apache.spark.sql.Column(_)):_*)
.write.partitionBy(partitionColumns:_*)
.format("parquet")
.mode(SaveMode.Append)
.save(destUrl)
But it fails with
ExecutorLostFailure (executor 7 exited caused by one of the running
tasks) Reason: Container killed by YARN for exceeding memory limits.
34.6 GB of 34.3 GB physical memory used. Consider boosting spark.yarn.executor.memoryOverhead.
When I do not do the repartition, everything is fine. What am I not understanding correctly about repartition?
Your logic is correct for repartition as well as partitionBy, but before using repartition you need to keep in mind this point, noted in several sources:
Keep in mind that repartitioning your data is a fairly expensive
operation. Spark also has an optimized version of repartition() called
coalesce() that allows avoiding data movement, but only if you are
decreasing the number of RDD partitions.
If you need the job to complete as-is, then increase the driver and executor memory.
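A hedged PySpark sketch (my addition) of bumping the setting the YARN error message points at; the values are arbitrary examples, not tuned recommendations:

    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .config("spark.executor.memory", "30g")
             .config("spark.yarn.executor.memoryOverhead", "4096")  # MB; the Spark 2.x name for this setting
             .getOrCreate())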
In my test environment I have 1 Cassandra node and 3 Spark nodes. I want to iterate over an apparently large table that has about 200k rows, each roughly 20-50 KB.
CREATE TABLE foo (
    uid timeuuid,
    events blob,
    PRIMARY KEY ((uid))
)
Here is the Scala code that is executed on the Spark cluster:
val rdd = sc.cassandraTable("test", "foo")
// This pulls records in memory, taking ~6.3GB
var count = rdd.select("events").count()
// Fails nearly immediately with
// NoHostAvailableException: All host(s) tried for query failed [...]
var events = rdd.select("events").collect()
Cassandra 2.0.9, Spark: 1.2.1, Spark-cassandra-connector-1.2.0-alpha2
I tried to only run collect, without count - in this case it just fails fast with NoHostAvailableException.
Question: what is the correct approach to iterate over a large table, reading and processing a small batch of rows at a time?
There are 2 settings in the Cassandra Spark Connector to adjust the chunk size (put them in the SparkConf object):
spark.cassandra.input.split.size: number of rows per Spark partition (default 100000)
spark.cassandra.input.page.row.size: number of rows per fetched page (ie network roundtrip) (default 1000)
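A minimal sketch of putting those two settings on the conf (my addition, shown in PySpark for brevity; in the Scala job the same string keys go on the SparkConf the same way, and the values here are examples only):

    from pyspark import SparkConf, SparkContext

    conf = (SparkConf()
            .setAppName("cassandra-scan")                        # hypothetical app name
            .set("spark.cassandra.input.split.size", "10000")    # rows per Spark partition
            .set("spark.cassandra.input.page.row.size", "500"))  # rows per fetched page
    sc = SparkContext(conf=conf)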
Furthermore, you shouldn't use the collect action in your example, because it will fetch all the rows into the driver application's memory and may raise an out-of-memory exception. You can use the collect action only if you know for sure it will produce a small number of rows. The count action is different: it produces only an integer. So I advise you to load your data from Cassandra like you did, process it, and store the result (in Cassandra, HDFS, whatever).