I have a 5 MB dataset with 330 columns. I perform SQL queries to replace some characters in the dataset, persist my dataframe at disk level, and then unpersist that dataframe correctly. But some references are still held in JVM memory.
Is there any option to remove those objects from memory?
I am trying to coalesce my dataframe into a single (part) file as below
df.coalesce(1).write.format('avro').save('file:///home/newbie/testdata/')
Initially this worked fine, but as the size of the data grew from hundreds to millions of rows in the dataframe, I started getting the error below:
java.io.IOException: No space left on device
This is because the Spark temp space (default: /tmp/) is running out of space. So I changed it to /local/scratch/ and it is working fine now.
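For reference, this is roughly how the change looks (a minimal Scala sketch; the app name is a placeholder, and spark.local.dir has to be set before the SparkContext/SparkSession is created, since on a cluster it may be overridden by the cluster manager):
import org.apache.spark.sql.SparkSession

// Point Spark's scratch space for shuffle and spill files at the larger volume.
val spark = SparkSession.builder()
  .appName("avro-single-file-writer")           // placeholder name
  .config("spark.local.dir", "/local/scratch")  // larger scratch directory
  .getOrCreate()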
But I am not really convinced by the approach of pointing spark.local.dir at some location with more space; it may still blow up at any time.
So is there any way that I can write the dataframe into a single file without loading all the data into memory?
In other words, can I dump the rows to persistent storage (disk) with minimal usage of (heap) memory?
Is there any performance difference or considerations between the following two pyspark statements:
df5 = df5.drop("Ratings")
and
df6 = df5.drop("Ratings")
This is not specifically about the drop function, but about any operation. I was wondering what happens under the hood when you overwrite a variable compared to creating a new one.
Also, are the behavior and performance considerations the same if this were an RDD and not a dataframe?
No, there won't be any difference in the operation.
In the case of NumPy, there is a flags attribute which shows whether an array owns its data or not:
variable_name.flags
In the case of PySpark, a DataFrame is immutable and every change creates a new DataFrame. How does it do that? A DataFrame is stored in a distributed fashion, so moving data around in memory is costly. Therefore Spark hands the data over from one DataFrame to another, more particularly the metadata describing where the data is stored, rather than copying it.
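As a rough illustration (a Scala sketch, assuming a SparkSession named spark; the column name is just an example), drop never mutates the DataFrame it is called on, whether you reuse the old variable name or introduce a new one:
import org.apache.spark.sql.functions.lit

var df5 = spark.range(5).toDF("id").withColumn("Ratings", lit(1))
val before = df5                       // keep a handle on the original

df5 = df5.drop("Ratings")              // same work as val df6 = df5.drop(...): a brand-new DataFrame
println(before.columns.mkString(","))  // id,Ratings -- the original DataFrame is untouched
println(df5.columns.mkString(","))     // id         -- only the name binding changed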
Also, a DataFrame is way better than an RDD. Here is a good blog post:
Dataframe RDD and dataset
I would like to know what happens when I cache an RDD and then derive a new RDD by modifying a limited number of values.
rdd.cache
val rdd2 = rdd.map(x => if (cond) partitionValue else x)
Is the part of the RDD which hasn't been touched still in cache if I use rdd2?
Moreover, I need to update the partition in which the modified values live, so I do:
val rdd2bis = rdd2.partitionBy(new HashPartitioner(nbPart))
And I would like to iterate this process for each datapoint:
Find which partition a given value should go into.
Modify the value and put it in the right partition using partitionBy.
So my main question is whether partitionBy keeps the output RDD in memory if only a few members have been modified.
I know that partitionBy gives a new RDD as output, but is there any chance that some of the non-modified cached values are still in cache for the generated RDD?
I would like to know what happens when I cache an RDD and then get a new RDD by modifying a limited number of values.
If you literally modify mutable objects in place, you'll end up with a program that is incorrect and nondeterministic.
Is the part of the RDD which hasn't been touched still in cache if I use rdd2?
If you map without modifying existing objects, it won't affect the cached data at all. rdd should remain cached as it was (unless evicted due to memory pressure); rdd2 won't be cached. Nevertheless, the data is not copied, so "unchanged" records in rdd2 reference the same objects as in rdd.
Does partitionBy keep the output RDD in memory if only a few members have been modified?
No. partitionBy requires the standard shuffle mechanism. Once again, it doesn't really affect the cached state of rdd.
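A small sketch of this behaviour (names and numbers are illustrative; assumes a SparkContext named sc):
import org.apache.spark.HashPartitioner
import org.apache.spark.storage.StorageLevel

val rdd = sc.parallelize(1 to 1000).map(x => (x % 10, x))  // illustrative pair RDD
rdd.cache()
rdd.count()                          // materialize the cache

// map/mapValues produce a new, uncached RDD; the parent stays cached and its data is not copied.
val rdd2 = rdd.mapValues(v => if (v < 0) 0 else v)
println(rdd.getStorageLevel)         // shows it is cached in memory
println(rdd2.getStorageLevel)        // shows it is not cached

// partitionBy goes through the shuffle and is likewise not cached automatically;
// persist it explicitly if it will be reused.
val rdd2bis = rdd2.partitionBy(new HashPartitioner(4))
rdd2bis.persist(StorageLevel.MEMORY_ONLY)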
I tested writing with:
df.write.partitionBy("id", "name")
.mode(SaveMode.Append)
.parquet(filePath)
However, if I leave out the partitioning:
df.write
.mode(SaveMode.Append)
.parquet(filePath)
It executes 100x(!) faster.
Is it normal for the same amount of data to take 100x longer to write when partitioning?
There are 10 and 3000 unique id and name column values respectively.
The DataFrame has 10 additional integer columns.
The first code snippet will write a parquet file per partition to the file system (local or HDFS). This means that if you have 10 distinct ids and 3000 distinct names, this code will create 30,000 files. I suspect that the overhead of creating files, writing parquet metadata, etc. is quite large (in addition to the shuffling).
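As a rough sanity check of that number (a sketch using the same df as above):
// One leaf directory (id=.../name=.../) is created per distinct (id, name)
// combination actually present in the data, each holding one or more part files.
val leafDirs = df.select("id", "name").distinct().count()
println(s"partitionBy(id, name) will create about $leafDirs partition directories")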
Spark is not the best database engine; if your dataset fits in memory, I suggest using a relational database instead. It will be faster and easier to work with.
I wrote some code that reads multiple parquet files and caches them for subsequent use. Simplified, my code looks like this:
val data = SparkStartup.sqlContext.read.parquet(...)
data.setName(...).persist(StorageLevel.MEMORY_AND_DISK_SER).collect()
map += data
The parquet files are about 11 GB in total. I configure my application with:
val sparkConfig = new SparkConf().setAppName(...).setMaster("local[128]")
sparkConfig.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
sparkConfig.set("spark.kryoserializer.buffer.max", "512m");
sparkConfig.set("spark.kryoserializer.buffer", "256");
sparkConfig.set("spark.driver.maxResultSize", "0");
sparkConfig.set("spark.driver.memory", "9g");
I thought that by using MEMORY_AND_DISK_SER, Spark would spill to disk if too much memory is used. However, I get `java.lang.OutOfMemoryError: Java heap space` errors at:
at java.util.Arrays.copyOf(Arrays.java:3230)
at java.io.ByteArrayOutputStream.grow(ByteArrayOutputStream.java:113)
at java.io.ByteArrayOutputStream.ensureCapacity(ByteArrayOutputStream.java:93)
at java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:140)
at java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:82)
at java.io.BufferedOutputStream.write(BufferedOutputStream.java:126)
at com.esotericsoftware.kryo.io.Output.flush(Output.java:155)
at com.esotericsoftware.kryo.io.Output.require(Output.java:135)
at com.esotericsoftware.kryo.io.Output.writeAscii_slow(Output.java:446)
at com.esotericsoftware.kryo.io.Output.writeString(Output.java:306)
at com.esotericsoftware.kryo.util.DefaultClassResolver.writeName(DefaultClassResolver.java:105)
at com.esotericsoftware.kryo.util.DefaultClassResolver.writeClass(DefaultClassResolver.java:81)
at com.esotericsoftware.kryo.Kryo.writeClass(Kryo.java:472)
Why is this? I start my application with -Xmx9g -Dspark.executor.memory=9g -Dspark.executor.cores=3. For the files that are read before everything crashes, I can see in the Spark UI that a parquet file takes 9x its size when read into memory.
This is because you are calling collect() in your driver application. collect() returns an Array of all your data items, which has to fit into the driver's memory.
You should instead keep working with the distributed data: map, reduce, group, etc. your large dataset into some desired (smaller) result, and then collect() only that smaller amount of data.
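A sketch of that pattern with the same setup (the path and grouping column are hypothetical placeholders, not taken from the original code):
import org.apache.spark.storage.StorageLevel

val data = SparkStartup.sqlContext.read.parquet("/path/to/parquet")  // placeholder path
data.persist(StorageLevel.MEMORY_AND_DISK_SER)
data.count()                        // materializes the cache on the executors; nothing is pulled to the driver

// Reduce the data to a small result before collecting it.
val summary = data.groupBy("someKeyColumn").count()  // hypothetical grouping column
val perKeyCounts = summary.collect()                 // only the per-key counts reach the driver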