Why does PySpark run out of memory for a map-only program?

I have a simple PySpark program that reads 2 text files at a time, converts each line into a JSON object, and writes it to a Parquet file like this:
for f in chunk(files, 2):
file_rdd = sc.textFile(f)
df = (file_rdd
.partitionBy("created_year", "created_month")
I run the job with yarn and the configuration is like this:
conf = (SparkConf()
.set("spark.executor.memory", '6g')
.set('spark.executor.instances', '6')
.set('spark.executor.cores', '2')
.set("parquet.enable.summary-metadata", "false")
.set("spark.sql.parquet.compression.codec", 'snappy')
This looks like a map-only program, so why does it run out of memory for a large input file?

Spark has a lot of moving parts. It reads the text data into partitions (which are generally held in memory); you then decode the JSON, which can cause problems if a line is very long (i.e. a large JSON object); and finally you do a partitionBy, which might have too many elements.
I would start by increasing the number of partitions (i.e. use repartition rather than coalesce, which can only reduce the number of partitions). I would also try writing without the partitionBy. If that still fails, find the longest JSON line and analyze it on its own (i.e. map each JSON string to its length and take the longest).
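The "map each JSON string to its length and take the longest" check can be sketched in plain Python, with no Spark dependency. This is only an illustration under assumptions: the function names and the three-line sample are hypothetical, and on an RDD the equivalent measurement would be roughly file_rdd.map(len).max().

```python
import json

def longest_lines(lines, top_n=3):
    """Return the top_n (length, line_number) pairs, longest first."""
    sizes = ((len(line), number) for number, line in enumerate(lines, start=1))
    return sorted(sizes, reverse=True)[:top_n]

def describe_record(line):
    """Parse one JSON line and report its size and number of top-level keys."""
    record = json.loads(line)
    return {"chars": len(line), "top_level_keys": len(record)}

# Hypothetical sample standing in for the lines of one input file.
sample = [
    '{"id": 1, "text": "short"}',
    '{"id": 2, "text": "' + "x" * 500 + '"}',
    '{"id": 3}',
]

# Locate the longest line, then inspect only that record.
worst_length, worst_line_number = longest_lines(sample, top_n=1)[0]
report = describe_record(sample[worst_line_number - 1])
```

If a single record turns out to be far larger than the rest, that record is a likely candidate for the memory pressure and can be analyzed on its own.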


reading batch of json files into dataframe

All -
I have millions of single JSON files, and I want to ingest them all into a Spark dataframe. However, I didn't see an append call that would let me add each JSON file incrementally. Instead, the only way I can make it work is:
for all json files do:
df_tmp = spark.read.json("/path/to/jsonfile", schema=my_schema)
df = df.union(df_tmp)
df is the final aggregated dataframe. This approach works for a few hundred files, but as it approaches thousands it gets slower and slower. I suspect the cost of creating and merging dataframes is significant, and it feels awkward as well. Is there a better approach? TIA
You can just pass the path to the folder instead of an individual file, and it will read all the files in it.
For example, if your files are in a folder called JsonFiles, you can write:
df = spark.read.json("/path/to/JsonFiles/")

What's the best way to write a big file to S3?

I'm using zeppelin and spark, and I'd like to take a 2TB file from S3 and run transformations on it in Spark, and then send it up to S3 so that I can work with the file in Jupyter notebook. The transformations are pretty straightforward.
I'm reading the file as a parquet file. I think it's about 2TB, but I'm not sure how to verify.
It's about 10M rows and 5 columns, so it's pretty big.
I tried to do my_table.write.parquet(s3path) and I tried my_table.write.option("maxRecordsPerFile", 200000).parquet(s3path). How do I come up with the right way to write a big parquet file?
These are the points you could consider...
1) maxRecordsPerFile setting:
Spark writes a single file out per task.
The number of saved files equals the number of partitions of the RDD/DataFrame being saved, so the output can consist of ridiculously large files. (Of course, you can repartition your data before saving, but a repartition shuffles the data across the network.)
To limit the number of records per file:
my_table.write.option("maxRecordsPerFile", numberOfRecordsPerFile).parquet(s3path)
This avoids generating huge files.
2) If you are using AWS EMR (EMRFS), check whether the EMRFS S3-optimized committer applies to your job.
The EMRFS S3-optimized committer is not used in these cases:
When using the S3A file system.
When using an output format other than Parquet, such as ORC or text.
3) Compression, committer algorithm version, and other Spark configuration:
.config("spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version", 2)
.config("spark.hadoop.mapreduce.fileoutputcommitter.cleanup-failures.ignored", true)
.config("spark.hadoop.parquet.enable.summary-metadata", false)
.config("spark.sql.parquet.mergeSchema", false)
.config("spark.sql.parquet.filterPushdown", true) // for reading purpose
.config("mapreduce.fileoutputcommitter.algorithm.version", "2")
.config("spark.sql.parquet.compression.codec", "snappy")
4) Fast upload and other properties, in case you are using s3a:
.config("fs.s3a.connection.ssl.enabled", "true")
The S3A connector will incrementally write blocks, but the (obsolete) version shipping with Spark in hadoop-2.7.x doesn't handle it very well. If you can, update all hadoop-* JARs to 2.8.5 or 2.9.x.
The option "fs.s3a.multipart.size" controls the size of the block. There's a limit of 10K blocks, so the largest file you can upload is that size * 10,000. For very large files, use a bigger number than the default of "64M".
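To make the "size * 10,000" arithmetic concrete, here is a small sketch. It takes the 10,000-block limit and the 64M default from the answer above as given; the helper names are hypothetical, and the limit applies to each uploaded object, not to the whole dataset.

```python
MAX_PARTS = 10_000  # block-count limit quoted in the answer above

MIB = 1024 ** 2
GIB = 1024 ** 3
TIB = 1024 ** 4

def max_upload_bytes(part_size_bytes):
    """Largest single object that fits in MAX_PARTS blocks of this size."""
    return part_size_bytes * MAX_PARTS

def min_part_size_bytes(file_size_bytes):
    """Smallest block size (rounded up) that fits the file in MAX_PARTS blocks."""
    return (file_size_bytes + MAX_PARTS - 1) // MAX_PARTS

# With 64 MiB blocks the ceiling for one object is 625 GiB.
default_limit = max_upload_bytes(64 * MIB)

# A single 2 TiB object would need blocks of roughly 210 MiB or more.
needed_for_2tib = min_part_size_bytes(2 * TIB)
```

In practice Spark writes one object per task, so the limit only matters when an individual output file approaches these sizes.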

apache spark textfile to a string

val test = sc.textFile(logFile, 12).cache() // path first, then the minimum number of partitions
In the above code snippet, I am trying to make Apache Spark parallelize reading a huge text file.
How do I store the contents in a string?
I was earlier doing this to read
val lines = scala.io.Source.fromFile(logFile, "utf-8").getLines.mkString
but now I am trying to make the read faster using the Spark context.
Reading the file into a String through Spark is very unlikely to be faster than reading it directly. To work efficiently in Spark you should keep everything in RDD form and do your processing that way, only reducing down to a (small) value at the end. Reading it in Spark just means you'll read it into memory locally, serialize the chunks and send them out to your cluster nodes, then serialize them again to send them back to your local machine and gather them together. Spark is a powerful tool but it's not magical; it can only parallelize operations that are actually parallel. (Do you even know that reading the file into memory is the bottleneck? Always benchmark before optimizing.)
But to answer your question, you could use
Just don't expect it to be any faster than reading the file locally.
Collect the values, and then iterate them:
var string = ""
test.collect.foreach({i => string += i} )

Using scala to dump result processed by Spark to HDFS

I'm a bit confused to find the right way to save data into HDFS after processing them with spark.
This is what I'm trying to do. I'm calculating min, max and SD of numeric fields. My input files have millions of rows, but output will have only around 15-20 fields. So, the output is a single value(scalar) for each field.
For example: I will load all the rows of FIELD1 into an RDD, and at the end, I will get 3 single values for FIELD1 (MIN, MAX, SD). I concatenate these three values into a temporary string. In the end, I will have 15 to 20 rows, each containing 4 columns, in the following format
This is a snippet of the code:
//create rdd
val data = sc.textFile("hdfs://x.x.x.x/"+args(1)).cache()
//just get the first column
val values = data.map(_.split(",",-1)(1))
val data_double= values.map(x=>if(x==""){0}else{x}.toDouble)
val min_value= data_double.map((_,1)).reduceByKey((_+_)).sortByKey(true).take(1)(0)._1
val max_value= data_double.map((_,1)).reduceByKey((_+_)).sortByKey(false).take(1)(0)._1
val SD = data_double.stdev
So, I have 3 variables, min_value, max_value and SD, that I want to store back to HDFS.
Question 1:
Since the output will be rather small, do I just save it locally on the server, or should I dump it to HDFS? It seems to me that dumping the file locally makes better sense.
Question 2:
In Spark, I can just call the following to save an RDD to a text file
How do I accomplish the same thing for a String variable that is not an RDD in Scala? Should I parallelize my result into an RDD first and then call saveAsTextFile?
To save locally just do
Then save the resulting array, for example with the approach from this question. And yes, if the data set is small and can easily fit in memory, you should collect it and bring it to the driver of the program. Another option, if the data is a little too large to store in memory, is some_RDD.coalesce(numParitionsToStoreOn). Keep in mind that coalesce also takes a boolean shuffle argument; if you are doing calculations on the data before coalescing, you should set it to true to get more parallelism on those calculations. Coalesce will reduce the number of nodes that store data when you call some_RDD.saveAsTextFile("hdfs://namenode/path"). If the file is very small but you need it on HDFS, call repartition(1), which is the same as coalesce(1, true); this ensures that your data is saved on only one node.
So if all you want to do is save three values in HDFS you can do this.
Basically, you are just putting the 3 vars in a tuple, wrapping that in a List, and setting the parallelism to one since the data is very small.
Answer 1: Since you only need a few scalars, I'd suggest storing them in the local file system. You can first do val localValue = rdd.collect(), which collects all the data from the workers to the master, and then use java.io to write it to disk.
Answer 2: You can do sc.parallelize(Seq(yourString)).saveAsTextFile("hdfs://host/yourFile"), wrapping the string in a Seq so that it is written as one record rather than one character per line. This will write the output to part-000* files. If you want everything in one file, hdfs dfs -getmerge is there to help you.

Spark Streaming Iterative Algorithm

I want to create a Spark Streaming application coded in Scala.
I want my application to:
read from a HDFS Text File line by line
analyze every line as String and if needed modify it and:
keep state that is needed for the analysis in some kind of data structures (Hashes probably)
output of everything on text files (any kind)
I've had no problems with the first step:
val lines = ssc.textFileStream("hdfs://localhost:9000/path/")
My analysis consists of searching the Hashes for a match on some fields of the String being analyzed; that's why I need to maintain state and do the processing iteratively.
The data in those Hashes is also extracted from the strings analyzed.
What can I do for next steps?
Since you just have to read one HDFS text file line by line, you probably do not need Spark Streaming for that. You can just use Spark.
val lines = sparkContext.textFile("...")
Then you can use mapPartitions to do distributed processing of the whole partitioned file.
val processedLines = lines.mapPartitions { partitionAsIterator =>
In that function, you can walk through the lines in the partition, store state stuff in a hashmap, etc. and finally return another iterator of output records corresponding to that partition.
Now, if you want to share state across partitions, then you probably have to do some more aggregation, such as groupByKey() or reduceByKey(), on the processedLines dataset.
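The per-partition pattern described above can be illustrated without Spark by treating each partition as a plain Python iterator. This is a sketch under assumptions: the "key,event" line format, the function names, and the two sample partitions are all hypothetical, and merge_by_key only stands in for what reduceByKey would do across partitions.

```python
from collections import Counter
from itertools import chain

def process_partition(lines):
    """Consume one partition's iterator and return an iterator of output records."""
    counts = Counter()  # state visible only inside this partition
    for line in lines:
        key = line.split(",", 1)[0]
        counts[key] += 1
    return iter(counts.items())  # (key, count) pairs for this partition

def merge_by_key(records):
    """Sum values per key, standing in for a reduceByKey step."""
    totals = Counter()
    for key, value in records:
        totals[key] += value
    return dict(totals)

# Two hypothetical partitions of "key,event" lines.
partitions = [
    ["a,login", "b,login", "a,click"],
    ["a,logout", "c,login"],
]

# Each partition is processed independently, as mapPartitions would do.
per_partition = [list(process_partition(iter(p))) for p in partitions]

# State is shared across partitions only through this explicit aggregation.
totals = merge_by_key(chain.from_iterable(per_partition))
```

Note that key "a" shows up in both partitions with separate counts, which is why the final aggregation step is needed whenever the state has to be global.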