pyspark coalesce gives no space left on device

I am trying to coalesce my dataframe into a single (part) file as below
df.coalesce(1).write.format('avro').save('file:///home/newbie/testdata/')
Initially this worked fine, but as the data grew from hundreds to millions of rows in the dataframe, I started getting the error below:
java.io.IOException: No space left on device
This is because the Spark temp space (default: /tmp/) is running out of disk space. So I changed it to /local/scratch/ and it is working fine now.
But I am not fully convinced by the approach of pointing spark.local.dir at some location with more space; it may still fill up at some point.
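For reference, the spark.local.dir change described above can be set when the session is created; a minimal PySpark sketch (the app name is made up, and in cluster deployments the cluster manager's SPARK_LOCAL_DIRS / LOCAL_DIRS setting may still override this value):
from pyspark.sql import SparkSession

# spark.local.dir must be set before the SparkContext starts
spark = (SparkSession.builder
    .appName('avro-single-file')
    .config('spark.local.dir', '/local/scratch/')
    .getOrCreate())

# df built as in the question, then written as a single part file
df.coalesce(1).write.format('avro').save('file:///home/newbie/testdata/')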
So is there any way that I can write the dataframe into a single file without loading all the data into memory?
In other words, dump the rows to persistent storage (disk) with minimal (heap) memory usage?

Related

Spark: partitioning based on the data size

While working on a Spark job I ran into a data skew problem.
I am working with spatial data that looks like this:
<key - id of area; value - some meta data that also contains a size and link to binary file>.
The main problem is that the data is not well distributed. Since it is spatial data, areas that contain big cities are very large, while areas covering fields and countryside are small.
So within one RDD some Spark partitions are 400 KB while others are > 1 GB, and as a result the last tasks run for a long time.
The key contains only numbers (like 45949539), and at the moment all data is partitioned by a hash partitioner on this key, i.e. effectively at random.
My next transformations are "flatMap" and "reduceByKey". (Yes, the long-running last tasks happen there.)
<id1, data1> -> [<id1, data1>, <id2, data2> ....]
During the flatMap I produce all my internal data from the binary file, so I know the size of each file up front (as I said before, from the metadata).
But I need some specific partitioner that can mix large and small areas within one partition.
Is this possible to do in Spark?
This job is somewhat legacy, so I can only use RDDs.
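One possible direction (a sketch only, not an accepted answer): since the per-area size is already known from the metadata, the partition assignment can be built on the driver with a greedy bin-packing pass and then used instead of the hash partitioner. Shown in PySpark for brevity (the same idea works with a custom Partitioner on a Scala RDD); meta_rdd and meta.size are illustrative names, and this assumes the number of distinct area ids is small enough to collect:
num_partitions = 64                                   # tune for the cluster

# (area_id, size) pairs, sizes known up front from the metadata
sizes = meta_rdd.mapValues(lambda meta: meta.size).collect()

# Greedy bin packing: put the next-largest area into the currently lightest
# partition, so big-city areas and countryside areas end up mixed together.
loads = [0] * num_partitions
assignment = {}
for key, size in sorted(sizes, key=lambda kv: kv[1], reverse=True):
    target = loads.index(min(loads))
    assignment[key] = target
    loads[target] += size

# Partition by the precomputed assignment instead of the key's hash
balanced = meta_rdd.partitionBy(num_partitions, lambda k: assignment[k])
# ... then flatMap / reduceByKey on balanced as before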

How to iteratively bring a Dataframe back to the driver in Spark

Main question: How do you safely (without risking crashing due to OOM) iterate over every row (guaranteed every row) in a dataframe from the driver node in Spark? I need to control how big the data is as it comes back, operate on it, and discard it to retrieve the next batch (say 1000 rows at a time or something)
I am trying to safely and iteratively bring the data in a potentially large Dataframe back to the driver program so that I may use the data to perform HTTP calls. I have been attempting to use someDf.foreachPartition{makeApiCall(_)} and allowing the executors to handle the calls. It works - but debugging and handling errors has proven to be pretty difficult when launching in prod envs, especially on failed calls.
I know there is the someDf.collect() action, which brings ALL the data back to the driver at once. However, this approach is not recommended, because with a very large DF you risk crashing the driver.
Any suggestions?
If the data does not fit into memory, you could use something like:
df.toLocalIterator().forEachRemaining(row => makeAPICall(row))
but toLocalIterator has considerable overhead compared to collect
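In PySpark the same idea can be written with explicit batches, e.g. roughly 1000 rows at a time as asked above (a sketch; makeApiCall stands for the caller's own HTTP call):
from itertools import islice

# toLocalIterator() streams partitions to the driver one at a time, so the
# driver only needs memory for the current batch / largest partition.
rows = df.toLocalIterator()
while True:
    batch = list(islice(rows, 1000))   # pull ~1000 rows per batch
    if not batch:
        break
    for row in batch:
        makeApiCall(row)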
Or you can collect your dataframe batch-wise (which does essentially the same as toLocalIterator):
import org.apache.spark.sql.functions.{lit, spark_partition_id}
// collect one partition's worth of rows at a time; note that df is re-evaluated for each partition id
val partitions = df.rdd.partitions.map(_.index)
partitions.toStream.foreach(i => df.where(spark_partition_id() === lit(i)).collect().foreach(row => makeAPICall(row)))
It is a bad idea to bring all that data back to the driver, because the driver is just one node and will become the bottleneck; the scalability is lost. If you have to do this, think twice about whether you really need a big data application at all - probably not.
dataframe.collect() is the most direct way to bring data to the driver, and it brings all of it at once. The alternative is toLocalIterator, which only needs driver memory for the largest single partition, which can still be big. So both should be used rarely and only for small amounts of data.
If you insist, you can write the output to a file or a queue and read that back in a controlled manner. That is a partially scalable solution, but not one I would prefer.

Spark: OutOfMemory despite MEMORY_AND_DISK_SER

I wrote some code that reads multiple parquet files and caches them for subsequent use. Simplified, my code looks like this:
val data = SparkStartup.sqlContext.read.parquet(...)
data.setName(...).persist(StorageLevel.MEMORY_AND_DISK_SER).collect()
map += data
The parquet file are in total about 11g. I config my application by:
val sparkConfig = new SparkConf().setAppName(...).setMaster("local[128]")
sparkConfig.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
sparkConfig.set("spark.kryoserializer.buffer.max", "512m");
sparkConfig.set("spark.kryoserializer.buffer", "256");
sparkConfig.set("spark.driver.maxResultSize", "0");
sparkConfig.set("spark.driver.memory", "9g");
I thought that by using MEMORY_AND_DISK_SER, Spark would spill to disk if too much memory is used. However, I get java.lang.OutOfMemoryError: Java heap space errors at
at java.util.Arrays.copyOf(Arrays.java:3230)
at java.io.ByteArrayOutputStream.grow(ByteArrayOutputStream.java:113)
at java.io.ByteArrayOutputStream.ensureCapacity(ByteArrayOutputStream.java:93)
at java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:140)
at java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:82)
at java.io.BufferedOutputStream.write(BufferedOutputStream.java:126)
at com.esotericsoftware.kryo.io.Output.flush(Output.java:155)
at com.esotericsoftware.kryo.io.Output.require(Output.java:135)
at com.esotericsoftware.kryo.io.Output.writeAscii_slow(Output.java:446)
at com.esotericsoftware.kryo.io.Output.writeString(Output.java:306)
at com.esotericsoftware.kryo.util.DefaultClassResolver.writeName(DefaultClassResolver.java:105)
at com.esotericsoftware.kryo.util.DefaultClassResolver.writeClass(DefaultClassResolver.java:81)
at com.esotericsoftware.kryo.Kryo.writeClass(Kryo.java:472)
Why is this? I start my application with -Xmx9g -Dspark.executor.memory=9g -Dspark.executor.cores=3. For the files that are read before everything crashes, I can see in the Spark UI that a parquet file takes 9x its size when read into memory.
It is because you are calling collect() in your driver application. This returns an Array of your data items, which would need to fit into memory.
You should instead work with the data RDD and map, reduce, group, etc. your large dataset into some desired result, and then collect() that smaller amount of data.
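For example (a sketch in PySpark rather than the Scala above; the grouping column and aggregate are made-up placeholders), reduce the data on the executors first and only collect the small summary:
from pyspark.sql import functions as F

data = spark.read.parquet('...')                    # same parquet input as above
summary = (data.groupBy('some_key')                 # hypothetical column
               .agg(F.count('*').alias('rows'),
                    F.sum('some_value').alias('total'))   # hypothetical column
               .collect())                          # only the per-group summary reaches the driver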

Why do Redshift COPY queries use (much) more disk space for tables with a sort key

I have a large set of data on S3 in the form of a few hundred CSV files that are ~1.7 TB in total (uncompressed). I am trying to copy it to an empty table on a Redshift cluster.
The cluster is empty (no other tables) and has 10 dw2.large nodes. If I set a sort key on the table, the COPY command uses up all available disk space about 25% of the way through and aborts. If there's no sort key, the copy completes successfully and never uses more than 45% of the available disk space. This behavior is consistent whether or not I also set a distribution key.
I don't really know why this happens, or if it's expected. Has anyone seen this behavior? If so, do you have any suggestions for how to get around it? One idea would be to try importing each file individually, but I'd love to find a way to let Redshift deal with that part itself and do it all in one query.
Got an answer to this from the Redshift team. The cluster needs free space of at least 2.5x the incoming data size to use as temporary space for the sort. You can upsize your cluster, copy the data, and resize it back down.
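For the numbers in the question, if that 2.5x is taken against the ~1.7 TB of uncompressed input, the sort would want on the order of 4 TB of free space, while 10 dw2.large nodes only have about 1.6 TB in total - which is consistent with the copy aborting partway through.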
Each dw2.large node has 0.16 TB of disk space, so with a cluster of 10 nodes the total space available is around 1.6 TB.
You have mentioned that you have around 1.7 TB of raw (uncompressed) data to be loaded into Redshift.
When you load data into Redshift using the COPY command, Redshift automatically compresses your data as it loads it into the table.
Once you have loaded any table, you can see the compression encodings with the query below:
SELECT "column", type, encoding
FROM pg_table_def
WHERE tablename = 'my_table_name';
First load your data into the table with no sort key and see which compression encodings are applied.
I suggest you drop and recreate the table each time you load data for your testing, so that the compression encodings are analysed each time. Once you have loaded your table using COPY, see the link below and run the script to determine the table size:
http://docs.aws.amazon.com/redshift/latest/dg/c_analyzing-table-design.html
When you apply a sort key to your table and load data, the sort key also occupies some disk space, so a table without a sort key needs less disk space than a table with one.
You need to make sure that compression is being applied to the table.
A table with a sort key needs more space to store the data. When you apply a sort key, also check whether you are loading the data in sorted order, so that it is stored in sorted fashion; this avoids having to run a VACUUM to sort the table after the data is loaded.

Postgres determine size of all blobs

Hi, I'm trying to find the size of all blobs. I have always used this:
SELECT sum(pg_column_size(pg_largeobject)) lob_size FROM pg_largeobject
but now that my database has grown to ~40 GB this takes several hours and puts too much load on the CPU.
Is there a more efficient way?
Some of the functions mentioned in Database Object Management Functions give an immediate result for the entire table.
I'd suggest pg_table_size(regclass) which is defined as:
Disk space used by the specified table, excluding indexes (but including TOAST, free space map, and visibility map)
It differs from sum(pg_column_size(tablename)) FROM tablename because it counts entire pages, so that includes the padding between the rows, the dead rows (updated or deleted and not reused), and the row headers.
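For example, applied to the table from the original query (this should return immediately):
SELECT pg_table_size('pg_largeobject');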