Spark DataFrame save as parquet - out of memory - scala

I am using Spark to read a file from S3, load it into a DataFrame, and then write it to HDFS as Parquet.
The problem is that when the file is big (65 GB) I get an out-of-memory error. I have no idea why I run out of memory, because the data looks like it is partitioned pretty well.
This is a sample of my code:
val records = gzCsvFile.filter { x => x.length == 31 }.map { x =>
  var d: Date = Date.valueOf(x(0))
  //var z = new GregorianCalendar();z.getWeekYear
  var week = (1900 + d.getYear) * 1000 + d.getMonth() * 10 + Math.ceil(d.getDate() / 7.0).toInt
  Row(d, Timestamp.valueOf(x(1)), toLong( x(2)), toLong(x(3)), toLong(x(4)), toLong(x(5)), toLong(x(6)), toLong(x(7)), toLong(x(8)), toLong(x(9)), toLong(x(10)), toLong(x(11)), toLong(x(12)), toLong(x(13)), toLong(x(14)), toLong(x(15)), toLong(x(16)), toLong(x(17)), toLong(x(18)), toLong(x(19)), toLong(x(20)), toLong(x(21)), toLong(x(22)), toLong(x(23)), toLong(x(24)), toLong(x(25)), x(26).trim(), toLong(x(27)), toLong(x(28)), toLong(x(29)), toInt(x(30)), week)
}

var cubeDF = sqlContext.createDataFrame(records, cubeSchema)
cubeDF.write.mode(SaveMode.Overwrite).partitionBy("CREATION_DATE", "COUNTRY_ID", "CHANNEL_ID").parquet(cubeParquetUrl)
Does anyone have any idea what is going on?

You are hitting this bug: https://issues.apache.org/jira/browse/SPARK-8890
Parquet's memory consumption when writing output is substantially larger than we thought. In the soon-to-be-released Spark 1.5 we sort the data before writing out a large number of Parquet partitions, which reduces memory consumption.
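If upgrading is not immediately an option, a rough workaround (a hedged sketch only, not the fix that went into 1.5) is to limit how much each open Parquet writer buffers and how many writers a task keeps open at once, for example by shrinking the Parquet row-group size and writing one coarse partition value at a time:

// Illustrative only, reusing cubeDF/cubeParquetUrl from the question and assuming sc is the
// SparkContext behind sqlContext.
// 1) Shrink the Parquet row-group size so each open partition writer buffers less
//    (the default block size is 128 MB per open writer).
sc.hadoopConfiguration.setInt("parquet.block.size", 32 * 1024 * 1024)

// 2) Write one CREATION_DATE at a time so far fewer partition writers are open per task.
//    (Assumes the target directory is empty or cleaned beforehand, since Append is used.)
val dates = cubeDF.select("CREATION_DATE").distinct().collect().map(_.getDate(0))
dates.foreach { d =>
  cubeDF.filter(cubeDF("CREATION_DATE") === d)
    .write.mode(SaveMode.Append)
    .partitionBy("CREATION_DATE", "COUNTRY_ID", "CHANNEL_ID")
    .parquet(cubeParquetUrl)
}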

Related

Use Apache Spark efficiently to push data to elasticsearch

I have 27 million records in an XML file that I want to push into an Elasticsearch index.
Below is the code snippet, written in Spark/Scala; I'll build a Spark job jar and run it on AWS EMR.
How can I use Spark efficiently to complete this exercise? Please guide.
I have a gzipped XML of 12.5 GB which I am loading into a Spark DataFrame. I am new to Spark. (Should I split this gzip file, or will the Spark executors take care of it?)
class ReadFromXML {

  def createXMLDF(): DataFrame = {
    val spark: SparkSession = SparkUtils.getSparkInstance("Spark Extractor")
    import spark.implicits._

    val m_df: DataFrame = SparkUtils.getDataFrame(spark, "temp.xml.gz").coalesce(5)

    val new_df: DataFrame = m_df.select($"CountryCode"(0).as("countryCode"),
        $"PostalCode"(0).as("postalCode"),
        $"state"(0).as("state"),
        $"county"(0).as("county"),
        $"city"(0).as("city"),
        $"district"(0).as("district"),
        $"Identity.PlaceId".as("placeid"), $"Identity._isDeleted".as("deleted"),
        $"FullStreetName"(0).as("street"),
        functions.explode($"Text").as("name"), $"name".getField("BaseText").getField("_VALUE")(0).as("nameVal"))
      .where($"LocationList.Location._primary" === "true")
      .where("(array_contains(_languageCode, 'en'))")
      .where(functions.array_contains($"name".getField("BaseText").getField("_languageCode"), "en"))

    new_df.drop("name")
  }
}
object PushToES extends App {
  val spark = SparkSession
    .builder()
    .appName("PushToES")
    .master("local[*]")
    .config("spark.es.nodes", "awsurl")
    .config("spark.es.port", "port")
    .config("spark.es.nodes.wan.only", "true")
    .config("spark.es.net.ssl", "true")
    .getOrCreate()

  val extractor = new ReadFromXML()
  val df = extractor.createXMLDF()
  df.saveToEs("myindex/_doc")
}
Update 1:
I have split the file into pieces of 68 MB each, and reading a single piece takes 3.7 minutes.
I tried using Snappy instead of the gzip compression codec, so I converted the gz file to a Snappy file and added the following to the config:
.config("spark.io.compression.codec", "org.apache.spark.io.SnappyCompressionCodec")
But it returns an empty DataFrame: df.printSchema returns just "root".
Update 2:
I have managed to run with the LZO format; it takes much less time to decompress and load into a DataFrame.
Is it a good idea to iterate over each LZO-compressed file of 140 MB and create a DataFrame per file?
or
should I load a set of 10 files into one DataFrame?
or
should I load all 200 LZO-compressed files (140 MB each) into a single DataFrame? If yes, how much memory should be allocated to the master, as I think this will be loaded on the master?
While reading files from an S3 bucket, can the "s3a" URI improve performance, or is the "s3" URI OK for EMR?
Update 3:
To test a small set of 10 LZO files I used the configuration below.
The EMR cluster took 56 minutes overall, of which the step (the Spark application) took 48 minutes to process the 10 files.
1 Master - m5.xlarge
4 vCore, 16 GiB memory, EBS-only storage
EBS storage: 32 GiB
2 Core - m5.xlarge
4 vCore, 16 GiB memory, EBS-only storage
EBS storage: 32 GiB
With the Spark parameters below, tuned following https://idk.dev/best-practices-for-successfully-managing-memory-for-apache-spark-applications-on-amazon-emr/
[
  {
    "Classification": "yarn-site",
    "Properties": {
      "yarn.nodemanager.vmem-check-enabled": "false",
      "yarn.nodemanager.pmem-check-enabled": "false"
    }
  },
  {
    "Classification": "spark",
    "Properties": {
      "maximizeResourceAllocation": "false"
    }
  },
  {
    "Classification": "spark-defaults",
    "Properties": {
      "spark.network.timeout": "800s",
      "spark.executor.heartbeatInterval": "60s",
      "spark.dynamicAllocation.enabled": "false",
      "spark.driver.memory": "10800M",
      "spark.executor.memory": "10800M",
      "spark.executor.cores": "2",
      "spark.executor.memoryOverhead": "1200M",
      "spark.driver.memoryOverhead": "1200M",
      "spark.memory.fraction": "0.80",
      "spark.memory.storageFraction": "0.30",
      "spark.executor.extraJavaOptions": "-XX:+UseG1GC -XX:+UnlockDiagnosticVMOptions -XX:+G1SummarizeConcMark -XX:InitiatingHeapOccupancyPercent=35 -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:OnOutOfMemoryError='kill -9 %p'",
      "spark.driver.extraJavaOptions": "-XX:+UseG1GC -XX:+UnlockDiagnosticVMOptions -XX:+G1SummarizeConcMark -XX:InitiatingHeapOccupancyPercent=35 -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:OnOutOfMemoryError='kill -9 %p'",
      "spark.yarn.scheduler.reporterThread.maxFailures": "5",
      "spark.storage.level": "MEMORY_AND_DISK_SER",
      "spark.rdd.compress": "true",
      "spark.shuffle.compress": "true",
      "spark.shuffle.spill.compress": "true",
      "spark.default.parallelism": "4"
    }
  },
  {
    "Classification": "mapred-site",
    "Properties": {
      "mapreduce.map.output.compress": "true"
    }
  }
]
Here are some tips from my side.
Read the data in Parquet or any other format and repartition it as per your need. Data conversion may consume time, so read it into Spark first and then process it. Try to build the index mapping and format the data before starting the load; this makes debugging much easier when the mapping is complex.
val spark = SparkSession
  .builder()
  .appName("PushToES")
  .enableHiveSupport()
  .getOrCreate()

val batchSizeInMB = 4 // change it as you need
val batchRetryCount = 3
val batchWriteRetryWait = 10
val batchEntries = 10
val enableSSL = true
val wanOnly = true
val enableIdempotentInserts = true
val esNodes = Seq("yourNode1", "yourNode2", "yourNode3")

var esConfig = Map[String, String]()
esConfig = esConfig + ("es.nodes" -> esNodes.mkString(","))
esConfig = esConfig + ("es.port" -> port.toString)
esConfig = esConfig + ("es.batch.size.bytes" -> (batchSizeInMB * 1024 * 1024).toString)
esConfig = esConfig + ("es.batch.size.entries" -> batchEntries.toString)
esConfig = esConfig + ("es.batch.write.retry.count" -> batchRetryCount.toString)
esConfig = esConfig + ("es.batch.write.retry.wait" -> batchWriteRetryWait.toString)
esConfig = esConfig + ("es.batch.write.refresh" -> "false")
if (enableSSL) {
  esConfig = esConfig + ("es.net.ssl" -> "true")
  esConfig = esConfig + ("es.net.ssl.keystore.location" -> "identity.jks")
  esConfig = esConfig + ("es.net.ssl.cert.allow.self.signed" -> "true")
}
if (wanOnly) {
  esConfig = esConfig + ("es.nodes.wan.only" -> "true")
}
// This helps if some task fails, so data won't be duplicated
if (enableIdempotentInserts) {
  esConfig = esConfig + ("es.mapping.id" -> "your_primary_key_column")
}

val df = ??? // suppose you created this DataFrame from Parquet or any other format
Data is actually inserted at the executor level, not at the driver level, so try giving only 2-4 cores to each executor so that not too many connections are open at the same time.
You can vary the batch document size or the number of entries as suits you; please read up on those settings.
Write the data in chunks; this will help you when loading large datasets in the future. Also try to create the index mapping before loading the data, and prefer only lightly nested data, even though ES supports nesting.
Finally, try to keep some primary key in your data (see es.mapping.id above), as in the snippet below.
val dfToInsert = df.withColumn("salt", floor(rand() * 10).cast("int")).persist()
for (i <- 0 until 10) {
  val start = System.currentTimeMillis
  val finalDF = dfToInsert.filter($"salt" === i)
  val counts = finalDF.count()
  println(s"count of records in chunk $i -> $counts")
  finalDF.drop("salt").saveToEs("indexName", esConfig)
  val totalTime = System.currentTimeMillis - start
  println(s"ended loading data for chunk $i. Total time taken in seconds: ${totalTime / 1000}")
}
Try to give an alias to your final index and update it on each run, so that you do not disturb your production server at load time.
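A minimal sketch of the alias idea (the index names and the target of saveToEs here are hypothetical placeholders):

// Hypothetical sketch: load into a versioned index, then flip an alias to it afterwards,
// so production queries that point at the alias are never disturbed during the load.
val targetIndex = s"myindex_v${System.currentTimeMillis / 1000}"
dfToInsert.drop("salt").saveToEs(s"$targetIndex/_doc", esConfig)
// After validating the load, repoint the alias via the ES REST API, e.g.:
//   POST /_aliases
//   { "actions": [ { "remove": { "index": "myindex_v_old", "alias": "myindex" } },
//                  { "add":    { "index": "<targetIndex>",  "alias": "myindex" } } ] }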
Memory
This cannot be generic, but just to give you a kick start: keep 10-40 executors depending on your data size and budget, give each executor 8-16 GB of memory plus about 5 GB of overhead (this can vary, as your documents can be large or small). If needed, set maxResultSize to 8 GB.
The driver can have 5 cores and 30 GB of RAM.
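As a rough Scala sketch of those sizes (the exact numbers are assumptions to be tuned against your own data and budget, not a recommendation):

// A hedged starting point that mirrors the rough numbers above.
val spark = SparkSession.builder()
  .appName("PushToES")
  .config("spark.executor.instances", "20")      // somewhere in the 10-40 range
  .config("spark.executor.memory", "12g")        // 8-16 GB per executor
  .config("spark.executor.memoryOverhead", "5g")
  .config("spark.executor.cores", "3")           // few cores = fewer concurrent ES connections
  .config("spark.driver.memory", "30g")
  .config("spark.driver.cores", "5")
  .config("spark.driver.maxResultSize", "8g")
  .getOrCreate()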
Important things:
- Keep the config in a variable, since you may want to change it per index.
- Insertion happens on the executors, not on the driver, so try to keep fewer connections open while writing; each core opens one connection.
- Document insertion can be bounded by batch entry count or by batch size in bytes; change these based on what you learn over multiple runs.
- Try to make your solution robust; it should be able to handle data of any size.
- Reading and writing can both be tuned, but try to format your data to match the document mapping before starting the load. This makes debugging much easier if the documents are somewhat complex and nested.
- The memory given to spark-submit can also be tuned as you learn from running the jobs; just watch the insertion time while varying memory and batch size.
- The most important thing is design: if you are using ES, create your mapping with the end queries and requirements in mind.
Not a complete answer, but still a bit long for a comment. There are a few tips I would like to suggest.
It's not clear, but I assume your concern here is execution time. As suggested in the comments, you can improve performance by adding more nodes/executors to the cluster. If the gzip file is loaded without partitioning in Spark, then you should split it into pieces of a reasonable size (not too small, which makes processing slow; not too big, or the executors will run OOM).
Parquet is a good file format when working with Spark. If you can, convert your XML to Parquet: it is highly compressed and lightweight.
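For example, a hedged sketch of that idea (the staging path is a placeholder): parse the XML once, persist it as Parquet, and run the ES load from the Parquet copy instead of re-parsing the XML on every run.

import org.elasticsearch.spark.sql._

// One-time conversion: XML -> Parquet (path is hypothetical).
val xmlDF = new ReadFromXML().createXMLDF()
xmlDF.write.mode("overwrite").parquet("s3://your-bucket/staging/xml_as_parquet")

// Subsequent runs read the columnar copy, which is much cheaper than re-parsing the XML.
val parquetDF = spark.read.parquet("s3://your-bucket/staging/xml_as_parquet")
parquetDF.saveToEs("myindex/_doc")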
Regarding your comments: coalesce does not do a full shuffle. The coalesce algorithm changes the number of partitions by moving data from some partitions into existing ones, so it obviously cannot increase the number of partitions. Use repartition instead; the operation is costlier, but it can increase the number of partitions. Check this for more details: https://medium.com/@mrpowers/managing-spark-partitions-with-coalesce-and-repartition-4050c57ad5c4
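A tiny sketch of the difference, reusing the helper from your question (the partition counts are placeholders):

val xmlDF = SparkUtils.getDataFrame(spark, "temp.xml.gz")

// coalesce can only reduce the partition count: it folds data into existing partitions, no full shuffle
val fewer = xmlDF.coalesce(5)

// repartition does a full shuffle and can increase the count, giving downstream stages more parallelism
val more = xmlDF.repartition(64)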

spark streaming - use previous calculated dataframe in next iteration

I have a streaming app that takes a DStream, runs a SQL manipulation over it, and dumps the result to a file:
dstream.foreachRDD { rdd =>
  spark.read.json(rdd)
    .select("col")
    .filter("value = 1")
    .write.csv("s3://..")
}
Now I need to be able to take into account the previous calculation (from an earlier batch) in my calculation, something like the following:
dstream.foreachRDD { rdd =>
  val df = spark.read.json(rdd)
  val prev_df = read_prev_calc()
  df.join(prev_df, "id")
    .select("col")
    .filter(prev_df("value").equalTo(1))
    .write.csv("s3://..")
}
Is there a way to keep the calculation result in memory somehow and use it as an input to the next calculation?
Have you tried using the persist() method on a DStream? It will automatically persist every RDD of that DStream in memory.
By default, all input data and persisted RDDs generated by DStream transformations are automatically cleared.
Also, DStreams generated by window-based operations are automatically persisted in memory.
For more details, see https://spark.apache.org/docs/latest/streaming-programming-guide.html#caching--persistence
https://spark.apache.org/docs/0.7.2/api/streaming/spark/streaming/DStream.html
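A minimal sketch of that suggestion, assuming the dstream and spark session from your question (illustrative only):

import org.apache.spark.storage.StorageLevel

// Persist the DStream so the RDDs it produces are kept in memory for the duration of the
// batch instead of being recomputed (window-based streams do this automatically).
val cachedStream = dstream.persist(StorageLevel.MEMORY_ONLY_SER)

cachedStream.foreachRDD { rdd =>
  val df = spark.read.json(rdd)
  df.select("col").filter("value = 1").write.csv("s3://..")
}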
If you are looking only for one or two previously calculated dataframes, you should look into Spark Streaming windows.
The snippet below is from the Spark documentation:
val windowedStream1 = stream1.window(Seconds(20))
val windowedStream2 = stream2.window(Minutes(1))
val joinedStream = windowedStream1.join(windowedStream2)
Or, even simpler: if we want to do a word count over the last 20 seconds of data, every 10 seconds, we apply the reduceByKey operation on the DStream of (word, 1) pairs over the last 20 seconds of data. This is done using the reduceByKeyAndWindow operation.
// Reduce last 20 seconds of data, every 10 seconds
val windowedWordCounts = pairs.reduceByKeyAndWindow((a:Int,b:Int) => (a + b), Seconds(20), Seconds(10))
More details and examples at:
https://spark.apache.org/docs/latest/streaming-programming-guide.html#window-operations

Reading Mongo data from Spark

I am reading data from MongoDB in a Spark job, using the com.mongodb.spark.sql connector (v2.0.0).
It works fine for most databases, but for one specific database the stage takes a long time and the number of partitions is very high.
My program is set to 128 partitions (2x the number of vCPUs), which works fine after some testing that we did. On this load the number jumps to 2061 partitions and the stage takes several minutes to process, even though I am using a filter and the documentation clearly states that filters are pushed down to the underlying data source (https://docs.mongodb.com/spark-connector/v2.0/scala/datasets-and-sql/).
This is how I read data:
val readConfig: ReadConfig = ReadConfig(
  Map(
    "spark.mongodb.input.uri" -> s"${mongodb.uri}/?${mongodb.uriParams}",
    "spark.mongodb.input.database" -> s"${mongodb.dbNamesConfig.siteInstances}",
    "collection" -> params.collectionName
  ), None)

val df: DataFrame = sparkSession.read.format("com.mongodb.spark.sql")
  .options(readConfig.asOptions)
  .schema(impressionSchema)
  .load()

println("df: " + df.rdd.getNumPartitions) // this is 2061 partitions

val filteredDF = df.coalesce(128).filter(
  $"_timestamp".isNotNull
    .and($"_timestamp".between(new Timestamp(start.getMillis()), new Timestamp(end.getMillis())))
    .and($"component_type" === SITE_INSTANCE_CHART_COMPONENT)
)

println("filteredDF: " + filteredDF.rdd.getNumPartitions) // 128 after using coalesce

filteredDF.select(
  $"iid",
  $"instance_id".as("instanceId"),
  $"_global_visitor_key".as("globalVisitorKey"),
  $"_timestamp".as("timestamp"),
  $"_timestamp".cast(DataTypes.DateType).as("date")
)
The data is not very big (the shuffle write is 20 MB for this stage), and even if I filter down to a single document the run time is the same (only the shuffle write is much smaller).
How can I solve this?
Thanks
Nir

How to Persist an array in spark

I'm comparing two tables (source and destination) to find the differences between them. To do that I load both tables into memory; the comparison works as expected on a machine with 8 GB of memory and 4 cores, but when comparing a large amount of data the system hangs and runs out of memory. So I used persist() with storage level DISK_ONLY.
The machine is capable of holding 100,000 rows in memory at a time, so I store chunks of that size to disk and then do the remaining comparison operations. I'm trying it like below:
var partition = math.ceil(c / 100000.toFloat).toInt
println(partition + " partition")

var a = 1
var data = spark.sparkContext.parallelize(Seq(""))
var offset = 0

for (s <- a to partition) {
  val query = "(select * from destination LIMIT 100000 OFFSET " + offset + ") as src"
  data = data.union(spark.read.jdbc(url, query, connectionProperties).rdd.map(_.mkString(","))).persist(StorageLevel.DISK_ONLY)
  offset += 100000
}

val dest = data.collect.toArray
val s = spark.sparkContext.parallelize(dest, 1).persist(StorageLevel.DISK_ONLY)
Yes, of course I could use partitioned JDBC reads, but the problem is that I would need to supply the lower bound, upper bound and number of partitions dynamically to fetch 100,000 rows at a time. I tried:
val destination = spark.read.options(options).jdbc(options("url"), options("dbtable"), "EMPLOYEE_ID", 1, 22, 21, new java.util.Properties()).rdd.map(_.mkString(","))
It takes too much time and stores the data across many partition files, and since the comparison operation is iterative in nature it reads all the partitions for each and every step.
Coming to the problem
val dest = data.collect.toArray
val s = spark.sparkContext.parallelize(dest, 1).persist(StorageLevel.DISK_ONLY)
The lines above convert all the partitioned RDDs to an array and parallelize it into a single partition, so that I don't have to iterate through all the partitions again and again. But val dest = data.collect.toArray cannot convert such a huge number of rows because of the memory shortage, and it seems Spark won't let you persist() a plain array.
Is there any way I can store the data and parallelize it into one partition on disk?
Sorry for being a noob.
Thank you!

Memory efficient way of union a sequence of RDDs from Files in Apache Spark

I'm currently trying to train a set of Word2Vec Vectors on the UMBC Webbase Corpus (around 30GB of text in 400 files).
I often run into out-of-memory situations, even on machines with 100+ GB of RAM. I run Spark inside the application itself. I tried to tweak things a little, but I am not able to perform this operation on more than 10 GB of textual data. The clear bottleneck of my implementation is the union of the previously computed RDDs; that is where the out-of-memory exception comes from.
Maybe one of you has the experience to come up with a more memory-efficient implementation than this:
object SparkJobs {
  val conf = new SparkConf()
    .setAppName("TestApp")
    .setMaster("local[*]")
    .set("spark.executor.memory", "100g")
    .set("spark.rdd.compress", "true")

  val sc = new SparkContext(conf)

  def trainBasedOnWebBaseFiles(path: String): Unit = {
    val folder: File = new File(path)
    val files: ParSeq[File] = folder.listFiles(new TxtFileFilter).toIndexedSeq.par

    var i = 0
    val props = new Properties()
    props.setProperty("annotators", "tokenize, ssplit")
    props.setProperty("nthreads", "2")
    val pipeline = new StanfordCoreNLP(props)

    // preprocess files in parallel
    val training_data_raw: ParSeq[RDD[Seq[String]]] = files.map(file => {
      // preprocess the lines of the file
      println(file.getName() + "-" + file.getTotalSpace())
      val rdd_lines: Iterator[Option[Seq[String]]] = for (line <- Source.fromFile(file, "utf-8").getLines) yield {
        // performs some preprocessing like tokenization, stop word filtering etc.
        processWebBaseLine(pipeline, line)
      }
      val filtered_rdd_lines = rdd_lines.filter(line => line.isDefined).map(line => line.get).toList
      println(s"File $i done")
      i = i + 1
      sc.parallelize(filtered_rdd_lines).persist(StorageLevel.MEMORY_ONLY_SER)
    })

    val rdd_file = sc.union(training_data_raw.seq)
    val starttime = System.currentTimeMillis()
    println("Start Training")
    val word2vec = new Word2Vec()
    word2vec.setVectorSize(100)
    val model: Word2VecModel = word2vec.fit(rdd_file)
    println("Training time: " + (System.currentTimeMillis() - starttime))
    ModelUtil.storeWord2VecModel(model, Config.WORD2VEC_MODEL_PATH)
  }
}
Like Sarvesh points out in the comments, it is probably too much data for a single machine. Use more machines. We typically see the need for 20–30 GB of memory to work with a file of 1 GB. By this (extremely rough) estimate you'd need 600–800 GB of memory for the 30 GB input. (You can get a more accurate estimate by loading a part of the data.)
As a more general comment, I'd suggest you avoid using rdd.union and sc.parallelize. Instead, use sc.textFile with a wildcard to load all the files into a single RDD.
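A minimal sketch of that, assuming all corpus files sit under one directory (the tokenization is just a placeholder for the real preprocessing):

val lines = sc.textFile(path + "/*.txt")                    // one RDD over all 400 files
val tokenized = lines.map(line => line.split("\\s+").toSeq) // placeholder preprocessing
val model = new Word2Vec().setVectorSize(100).fit(tokenized)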
Have you tried getting word2vec vectors from a smaller corpus? I mention this because I was running the word2vec Spark implementation on a much smaller corpus and ran into problems, because of this issue: http://mail-archives.apache.org/mod_mbox/spark-issues/201412.mbox/%3CJIRA.12761684.1418621192000.36769.1418759475999#Atlassian.JIRA%3E
So for my use case that issue made the word2vec Spark implementation a bit useless. Thus I used Spark for massaging my corpus but not for actually producing the vectors.
As others suggested, stay away from calling rdd.union.
Also, I think .toList will gather every line of the file and collect it on your driver machine (the one used to submit the task), which is probably why you are getting out of memory. You should avoid turning the data into a list at all; see the sketch below.
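As a hedged sketch of keeping the preprocessing inside Spark instead of materializing a List on the driver (it reuses the lines RDD idea from the previous sketch and assumes the poster's processWebBaseLine helper and the CoreNLP classes are serializable/available on the executors):

val processed = lines.mapPartitions { iter =>
  // Build one CoreNLP pipeline per partition (not per line, and not on the driver).
  val props = new Properties()
  props.setProperty("annotators", "tokenize, ssplit")
  val pipeline = new StanfordCoreNLP(props)
  iter.flatMap(line => processWebBaseLine(pipeline, line)) // Option -> zero or one Seq[String]
}
val model = new Word2Vec().setVectorSize(100).fit(processed)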