LazyFrame memory usage (polars.scan_csv vs polars.read_csv, single threaded) - python-polars

I have some sample csv files and two programs to read/filter/concat the csvs.
Here is the LazyFrame version of the code:
import os
os.environ["POLARS_MAX_THREADS"] = "1"
import polars as pl
df = pl.concat(
    [
        pl.scan_csv("test.csv").filter(pl.col("x3") > 0),
        pl.scan_csv("test1.csv").filter(pl.col("x3") > 0),
        pl.scan_csv("test2.csv").filter(pl.col("x3") > 0),
    ]
).collect()
The eager version replaces scan_csv with read_csv.
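For reference, the eager equivalent presumably looks like this (read_csv returns a DataFrame directly, so no collect() is needed):
import os
os.environ["POLARS_MAX_THREADS"] = "1"
import polars as pl

df = pl.concat(
    [
        pl.read_csv("test.csv").filter(pl.col("x3") > 0),
        pl.read_csv("test1.csv").filter(pl.col("x3") > 0),
        pl.read_csv("test2.csv").filter(pl.col("x3") > 0),
    ]
)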
Now I would expect the LazyFrame version to perform just as well, but instead it uses more memory. (And more memory still if we increase the number of cores.) I generated the following graph with mprof:
Is this a reliable representation of the memory usage?
Will this ever be improved, or is it necessary for lazy evaluation to work this way?

A lazy concat will parallelize the work over its inputs. This can use a bit more memory than the sequential reads in eager mode, which is why you see memory increase when you use more cores.
The predicates are pushed down to the scan level. If your memory usage does not drop because of that, you probably don't filter out many rows. Because we want memory to stay low when we DO filter out many rows, a lazy reader works on smaller chunks and therefore probably has more heap fragmentation.
Lazy does not optimize for these micro-benchmarks, but looks at a longer query as a whole. When you start selecting subsets of columns/rows (either directly or by grouping), slicing, or filtering, lazy will do a lot less work and use much less memory than eager.
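As an illustration (a minimal sketch, assuming the same test.csv and that it also has a column x1), projection and predicate pushdown mean the lazy scan only materializes the columns and rows the query actually needs:
import polars as pl

# Lazy: the filter and the column selection are pushed down to the CSV scanner,
# so only "x1" and "x3" are parsed, and rows with x3 <= 0 are dropped at the scan.
lazy_result = (
    pl.scan_csv("test.csv")
    .filter(pl.col("x3") > 0)
    .select(["x1", "x3"])
    .collect()
)

# Eager: the whole file is parsed into a DataFrame first, then filtered and selected.
eager_result = (
    pl.read_csv("test.csv")
    .filter(pl.col("x3") > 0)
    .select(["x1", "x3"])
)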

Related

Performance improvement in Scala DataFrame operations

I am using a table that is partitioned by the load_date column and is optimized weekly with the Delta OPTIMIZE command as the source dataset for my use case.
The table schema is as shown below:
+-----------------+--------------------+------------+---------+--------+---------------+
| ID| readout_id|readout_date|load_date|item_txt| item_value_txt|
+-----------------+--------------------+------------+---------+--------+---------------+
Later, this table is pivoted on the columns item_txt and item_value_txt, and many operations are applied using multiple window functions, as shown below:
val windowSpec = Window.partitionBy("id","readout_date")
val windowSpec1 = Window.partitionBy("id","readout_date").orderBy(col("readout_id") desc)
val windowSpec2 = Window.partitionBy("id").orderBy("readout_date")
val windowSpec3 = Window.partitionBy("id").orderBy("readout_date").rowsBetween(Window.unboundedPreceding, Window.currentRow)
val windowSpec4 = Window.partitionBy("id").orderBy("readout_date").rowsBetween(Window.unboundedPreceding, Window.currentRow-1)
These window functions are used to implement multiple pieces of logic on the data. There are also a few joins used to process the data.
The final table is partitioned by readout_date and id, and the performance is very poor: it takes a long time even for 100 ids and 100 readout_dates.
If I do not partition the final table, I get the error below.
Job aborted due to stage failure: Total size of serialized results of 129 tasks (4.0 GiB) is bigger than spark.driver.maxResultSize 4.0 GiB.
The expected count of ids in production is in the billions, and I expect far worse throttling and performance issues when processing the complete data.
The cluster configuration and utilization metrics are provided below.
Please let me know if anything is wrong with the repartitioning, and whether there are any methods to improve cluster utilization and performance.
Any leads appreciated!
spark.driver.maxResultSize is just a setting; you can increase it. BUT it's set at 4 GiB to warn you that you are doing something questionable and should optimize your work. You are doing the correct thing by asking for help to optimize.
The first thing I suggest, if you care about performance, is to get rid of the windows. The first three windows you use could be achieved with a groupBy, and this will perform better. The last two windows are definitely harder to reframe as a group by, but with some reframing of the problem you might be able to do it. The trick could be to use multiple queries instead of one. You might think that would perform worse, but I'm here to tell you that if you can avoid using a window you will get better performance almost every time. Windows aren't bad things; they are a tool to be used, but they do not perform well on unbounded data. (Can you do anything as an intermediate step to reduce the data the window needs to examine? Or can you use aggregate functions to complete the work without having to use a window?) You should explore your options.
Given your other answers, you should be grouping by id, not windowing by id, and likely using aggregates (sum) by week of year or month. This would likely give you really speedy performance at the loss of some granularity, and it would give you enough insight to decide whether to look into something deeper... or not.
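A minimal sketch of that idea (using the PySpark API and a hypothetical numeric signal column sig_val, since the exact pivoted columns aren't shown in the question):
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.read.table("source_table")  # hypothetical source table

# Aggregate per id and week instead of computing per-row window functions.
weekly = (
    df.groupBy("id", F.weekofyear("readout_date").alias("readout_week"))
      .agg(
          F.sum("sig_val").alias("weekly_sig_val"),
          F.count("*").alias("readout_count"),
      )
)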
If you wanted more accuracy, I'd suggest the following:
Convert your nulls to 0s.
val windowSpec1 = Window.partitionBy("id").orderBy(col("readout_date") asc) // asc is important as it flips the relationship so that it groups the previous nulls
Then create a running total on the SIG_XX VAL or whatever signal you want to look into. Call the new column 'null-partitions'.
This will effectively allow you to group the numbers (by null-partitions), and you can then run aggregate functions using group by to complete your calculations. Window and group by can do the same thing; windows are just more expensive in how they move data, which slows things down. Group by uses more of the cluster to do the work and speeds up the process.
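A sketch of those steps (PySpark again, with the same hypothetical sig_val column standing in for the SIG_XX VAL signal):
from pyspark.sql import SparkSession, Window, functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.read.table("source_table").fillna(0, subset=["sig_val"])  # nulls -> 0

# Running total per id, ordered ascending by readout_date: consecutive zero
# (formerly null) rows keep the same running total, so they fall into one group.
w = Window.partitionBy("id").orderBy(F.col("readout_date").asc())
df = df.withColumn("null_partitions", F.sum("sig_val").over(w))

# The per-row window work is now replaced by a plain group by over the new column.
result = (
    df.groupBy("id", "null_partitions")
      .agg(F.max("sig_val").alias("sig_val"), F.count("*").alias("rows_in_group"))
)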

Can I process a DataFrame using Polars without constructing the entire output in memory?

To load a large dataset into Polars efficiently one can use the lazy API and the scan_* functions. This works well when we are performing an aggregation (so we have a big input dataset but a small result). However, if I want to process a big dataset in its entirety (for example, change a value in each row of a column), it seems that there is no way around using collect and loading the whole (result) dataset into memory.
Is it instead possible to write a LazyFrame to disk directly, and have the processing operate on chunks of the dataset sequentially, in order to limit memory usage?
Edit (2023-01-08)
Polars has growing support for streaming/out-of-core processing.
To run a query in streaming mode, collect your LazyFrame with collect(streaming=True).
If the result does not fit into memory, try to sink it to disk with sink_parquet.
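For example (a minimal sketch, assuming a hypothetical big_input.csv and that the whole query is supported by the streaming engine):
import polars as pl

lf = (
    pl.scan_csv("big_input.csv")  # hypothetical input file
    .with_columns((pl.col("x3") * 2).alias("x3_doubled"))
)

# If the result fits in memory, run the query with the streaming engine:
df = lf.collect(streaming=True)

# If it does not, stream the result straight to disk instead:
lf.sink_parquet("big_output.parquet")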
Old answer (not true anymore).
Polars' algorithms are not streaming, so they need all data in memory for operations like join, groupby, aggregations, etc. Writing to disk directly would therefore still keep those intermediate DataFrames in memory.
There are of course things you can do. Depending on the type of query, it may lend itself to embarrassing parallelization. A sum, for instance, could easily be computed in chunks.
You could also process columns in smaller chunks. This allows you to still compute harder aggregations/computations.
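For instance, a chunked sum could look roughly like this (a sketch with a hypothetical big.csv and an arbitrary chunk size; it relies on slice pushdown so only one slice of the column is materialized per iteration):
import polars as pl

CHUNK = 1_000_000  # rows per chunk (arbitrary)
offset = 0
total = 0

while True:
    # Only this slice of the "x3" column is materialized per iteration.
    part = (
        pl.scan_csv("big.csv")
        .select("x3")
        .slice(offset, CHUNK)
        .collect()
    )
    if part.height == 0:
        break
    total += part["x3"].sum()
    offset += CHUNK

print(total)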
Use lazy
If you have many filters in your query and Polars is able to do them at the scan, your memory pressure is reduced to the selectivity ratio.
I just encountered a case where Polars manages memory much better using Lazy. When using the join function I highly recommend using scan_csv/scan_parquet/scan_ipc if memory is an issue.
import polars as pl
# combine datasets
PATH_1 = "/.../big_dataset.feather"
PATH_2 = "/.../other_big_dataset.feather"
big_dataset_1 = pl.scan_ipc(PATH_1)
big_dataset_2 = pl.scan_ipc(PATH_2)
big_dataset_expanded = big_dataset_1.join(
    big_dataset_2, right_on="id_1", left_on="id_2", how="left"
)
big_dataset_expanded = big_dataset_expanded.collect()

How to purge spark driver memory after collect()

I have to do a lot of little collect() operations in my application to send data through HTTP calls.
val payload = sparkSession.sql(s"select * from table where ID = $id").toJSON.collect().mkString("\n")
Is there a way to purge used objects to free some memory space in my driver between operations?
First off, I agree with @Luis Miguel Mejia Suarez here in that collects are generally bad practice and a bad code smell. I'd take a look at why you are doing collects, and determine if you can do this in a different way.
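One common alternative, if the HTTP endpoint can be reached from the executors, is to send each partition directly instead of collecting to the driver (a sketch using the PySpark API, an existing SparkSession named spark, and a hypothetical requests-based sender and endpoint, not the asker's actual setup):
import requests  # hypothetical HTTP client

def send_partition(rows):
    # Each executor sends its own partition; nothing is collected to the driver.
    payload = "\n".join(rows)
    requests.post("https://example.com/ingest", data=payload)  # hypothetical endpoint

spark.sql("select * from table where ID = 42").toJSON().foreachPartition(send_partition)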
As for your actual question, the garbage collector will free any unreferenced memory once memory starts getting tight. The code snippet you showed above should be fine since the output of collect is immediately operated on and then discarded so that output should be removed during the next GC pause, while the mkString output would be kept. So make sure that this applies to the other collect statements you are using.
Additionally, if you are seeing long GC pauses, consider lowering your driver memory size, so that there's less memory to collect. You might also look into tuning your GC parameters. There's lots of documentation on that on the internet, and it is too intricate to describe in detail here.
Finally, you can force the JVM to run garbage collection. You should be able to use System.gc() (https://docs.oracle.com/javase/7/docs/api/java/lang/System.html#gc()). This is a Java function but Scala should be able to call it as well.

Spark memory limit exceeded issue

I have a job that runs on Spark and is written in Scala, using Spark RDDs. Because of the expensive group by operations I get this error:
Container killed by YARN for exceeding memory limits. 22.4 GB of 22 GB physical memory used. Consider boosting spark.yarn.executor.memoryOverhead.
I increased the memory overhead but I get the same error. I use 10 machines of type r4.xlarge. I tried using r4.2xlarge and even r4.4xlarge, but I get the same error. The data I am testing on is 5 GB gzipped (almost 50 GB unzipped, and almost 6 million records).
some configurations:
spark.executor.memory: 20480M
spark.driver.memory: 21295M
spark.yarn.executor.memoryOverhead: 2g
spark.executor.instances: 10
And the code looks like this:
val groupedEntitiesRDD = datasetRDD
  .groupBy(_.entityId)
  .map({ case (key, valueIterator) => key -> valueIterator.toList })
  .persist(StorageLevel.MEMORY_AND_DISK)

val deduplicatedRDD = groupedEntitiesRDD
  .flatMap({ case (_, entities) => deduplication(entities) })

def deduplication(entities: List[StreamObject[JsValue]]): List[StreamObject[JsValue]] = {
  entities
    .groupBy(_.deduplicationKey)
    .values
    .map(duplicates => duplicates.maxBy(_.processingTimestamp.toEpochSecond))
    .toList
}
From my experience, and from what I have read in the release notes of Spark 2.x, one needs to allocate a lot more off-heap memory (spark.yarn.executor.memoryOverhead) than in Spark 1.x.
You have only assigned 2 GB to memoryOverhead and 20 GB to executor memory. I believe you would get better results if you change that to, say, 8 GB memoryOverhead and 14 GB executor memory.
Should you still run into memory issues (like actual OOMs being thrown), you will need to look into data skew. groupBy operations in particular will frequently cause serious data skew.
One final thing: you write that you use RDDs - I hope you mean DataFrames or Datasets? RDDs have very low performance with groupBy (see for instance this blog post for the reason why), so if you are on RDDs you should use reduceByKey instead. But essentially you should use DataFrames (or Datasets) instead, where groupBy is indeed the right way to go.
EDIT!
You asked in a comment how to convert groupBy to reduceByKey. You can do that like this:
datasetRDD
  .map { case (entityID, streamObject) => (entityID, List(streamObject)) }
  .reduceByKey(_ ++ _)
  .flatMap { case (_, entities) => deduplication(entities) }
You haven't specified the data structure of these entities, but it looks like you are looking for some max value and in effect throwing away unwanted data. That should be built into the reduceByKey operation, so that you filter away unnecessary data while reducing.
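A sketch of that idea (shown with the PySpark RDD API and hypothetical field access, since the structure of StreamObject isn't given): reduce on (entityId, deduplicationKey) and keep only the newest record per key, so the full list of duplicates is never built up:
deduplicated = (
    dataset_rdd  # hypothetical RDD of dict-like records
    .map(lambda rec: ((rec["entityId"], rec["deduplicationKey"]), rec))
    .reduceByKey(
        # keep the record with the latest processing timestamp
        lambda a, b: a if a["processingTimestamp"] >= b["processingTimestamp"] else b
    )
    .values()
)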

One billion length List in Scala?

Just as a load test, I was playing with different data structures in Scala. Just wondering what it takes to work with, or even create, a one-billion-element array. 100 million seems to be no problem; of course, there's no real magic about the number 1,000,000,000. I'm just seeing how far you can push it.
I had to bump up memory on most of the tests. export JAVA_OPTS="-Xms4g -Xmx8g"
// insanity begins ...
val buf = (0 to 1000000000 - 1).par.map { i => i }.toList
// java.lang.OutOfMemoryError: GC overhead limit exceeded
However, preallocating an Array[Int] works pretty well. It takes about 9 seconds to iterate and build the object. Interestingly, doing almost anything with ListBuffer seems to automatically take advantage of all cores. However, the code above will not finish (at least with an 8 GB Xmx).
I understand that this is not a common case and I'm just messing around. But if you had to pull some massive thing into memory, is there a more efficient technique? Is a typed Array as efficient as it gets?
The per-element overhead of a List is considerable. Each element is held in a cons cell (case class ::) which means there is one object with two fields for every element. On a 32-bit JVM that's 16 bytes per element (not counting the element value itself). On a 64-bit JVM it's going to be even higher.
List is not a good container type for extremely large contents. Its primary feature is very efficient head / tail decomposition. If that's something you need then you may just have to deal with the memory cost. If it's not, try to choose a more efficient representation.
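A rough back-of-envelope (my own estimate, assuming a 64-bit JVM with compressed oops; exact numbers vary with JVM settings):
CONS_CELL_BYTES = 24   # 12-byte header + 2 x 4-byte refs, padded (compressed oops)
BOXED_INT_BYTES = 16   # 12-byte header + 4-byte int
N = 1_000_000_000

list_bytes = N * (CONS_CELL_BYTES + BOXED_INT_BYTES)  # ~40 GB for a List[Int]
array_bytes = N * 4                                   # ~4 GB for an Array[Int]
print(list_bytes / 1e9, array_bytes / 1e9)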
For what it's worth, memory overhead is one consideration that justifies using Array. There are lots of caveats around using arrays, though, so be careful if you go that way.
Given that the JVM can sensibly arrange an Array of Ints in memory, if you really need to iterate over them it would indeed be the most efficient approach. It would generate much the same code if you did exactly the same thing with Java.