Spark is crashing when computing big files - scala

I have a Scala program that reads a CSV file, adds a new column to the DataFrame, and saves the result as a Parquet file. It works perfectly on small files (<5 GB), but when I try bigger files (~80 GB) it always fails at the point where it should write the Parquet file, with this stack trace:
16/10/20 10:03:37 WARN scheduler.TaskSetManager: Lost task 14.0 in stage 4.0 (TID 886, reached end of stream after reading 136445 bytes; 1245184 bytes expected
at org.apache.spark.sql.execution.UnsafeRowSerializerInstance$$anon$3$$anon$
at org.apache.spark.sql.execution.UnsafeRowSerializerInstance$$anon$3$$anon$
at scala.collection.Iterator$$anon$
at scala.collection.Iterator$$anon$
at scala.collection.Iterator$$anon$
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source)
at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(
at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at org.apache.spark.shuffle.sort.UnsafeShuffleWriter.write(
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:79)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:47)
at org.apache.spark.executor.Executor$
at java.util.concurrent.ThreadPoolExecutor.runWorker(
at java.util.concurrent.ThreadPoolExecutor$
If anyone knows what could cause this, that would help me a lot!
System used
Spark 2.0.1
Scala 2.11
Hadoop HDFS 2.7.3
Everything runs in Docker on a 6-machine cluster (each machine has 4 cores and 16 GB of RAM)
Example code
var df = spark.read.option("header", "true").option("inferSchema", "true").option("nullValue", "NULL").csv(hdfsFileURLIn)
df = df.withColumn("ipix", a2p(df.col(deName), df.col(raName)))
df.repartition(nPartitions, $"ipix").write.mode("overwrite").parquet(hdfsFileURLOut)

Here are a few points that might help:
Check the distribution of the values in your ipix column. You may have data skew, where one or a few partitions are much bigger than the others, and a task working on such a fat partition can fail. The skew probably comes from the output of your a2p function. I would first test the job without repartitioning: remove the repartition call and see whether it succeeds. Without that call, Spark uses its default partitioning, which is probably based on the size of the input CSV file.
Also make sure your input CSV is not gzipped. Gzipped data is not splittable, so all of it would end up in one partition.
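To check for the skew described above before running the full job, something like the following could help. It is only a sketch: key_counts is assumed to be a mapping from ipix value to row count (for example, collected from a groupBy on ipix followed by count), and hash(key) % n_partitions is a simplified stand-in for Spark's hash partitioner, which uses a different hash function, so the exact placement will differ. The function names are made up for illustration.

```python
from collections import Counter


def partition_sizes(key_counts, n_partitions):
    """Rows per partition if each key goes to hash(key) % n_partitions.

    key_counts: dict mapping a partitioning key (e.g. an ipix value) to the
    number of rows that carry it. Assumed to be small enough to collect.
    """
    sizes = Counter()
    for key, rows in key_counts.items():
        sizes[hash(key) % n_partitions] += rows
    return [sizes.get(p, 0) for p in range(n_partitions)]


def skew_ratio(sizes):
    """Largest partition divided by the mean partition size (1.0 = even)."""
    mean = sum(sizes) / len(sizes)
    return max(sizes) / mean if mean else 0.0


# Hypothetical counts: one ipix value dominates the data set.
counts = {0: 1000, 1: 10, 2: 10, 3: 10}
sizes = partition_sizes(counts, n_partitions=4)
print(sizes, round(skew_ratio(sizes), 2))
```

A ratio far above 1 means a single task would receive most of the rows, which is the situation the answer warns about.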

Can you provide your code?
Perhaps the code you wrote is running on the driver. How do you process the file?
Spark has dedicated abstractions for handling big data, such as the RDD.
But once you collect an RDD (for example with collect()), you bring its contents into driver memory and no longer use Spark's distributed capabilities.
Code that handles big data should run on the worker nodes, not on the driver.
Please check this: differentiate driver code and work code in Apache Spark

The problem looks like a read that failed while decompressing a stream of shuffled data in YARN mode.
Try the following code and see how it goes.
import org.apache.spark.storage.StorageLevel
var df = spark.read.option("header", "true").option("inferSchema", "true").option("nullValue", "NULL").csv(hdfsFileURLIn)
df = df.withColumn("ipix", a2p(df.col(deName), df.col(raName))).persist(StorageLevel.MEMORY_AND_DISK)
df.repartition(nPartitions, $"ipix").write.mode("overwrite").parquet(hdfsFileURLOut)

Pyspark in Azure - need to configure sparkContext

Using a Spark notebook in Azure Synapse, I'm processing data from Parquet files and writing it out as different Parquet files. I produced a working script and started applying it to different datasets. Everything worked fine until I came across a dataset containing dates older than 1900.
For this issue, I came across this article (which I took to be applicable to my scenario):
Problems when writing parquet with timestamps prior to 1900 in AWS Glue 3.0
The fix is to add this code chunk, which I did, to the top of my notebook:
from pyspark import SparkContext
sc = SparkContext()
# Get current sparkconf which is set by glue
conf = sc.getConf()
# add additional spark configurations
conf.set("spark.sql.legacy.parquet.int96RebaseModeInRead", "CORRECTED")
conf.set("spark.sql.legacy.parquet.int96RebaseModeInWrite", "CORRECTED")
conf.set("spark.sql.legacy.parquet.datetimeRebaseModeInRead", "CORRECTED")
conf.set("spark.sql.legacy.parquet.datetimeRebaseModeInWrite", "CORRECTED")
# Restart spark context
sc = SparkContext.getOrCreate(conf=conf)
# create glue context with the restarted sc
glueContext = GlueContext(sc)
Unfortunately this generated another error:
Py4JJavaError: An error occurred while calling :
java.lang.IllegalStateException: Promise already completed.
at scala.concurrent.Promise.complete(Promise.scala:53)
at scala.concurrent.Promise.complete$(Promise.scala:52)
at scala.concurrent.Promise.success(Promise.scala:86)
at scala.concurrent.Promise.success$(Promise.scala:86)
at org.apache.spark.SparkContext.(SparkContext.scala:683)
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at java.lang.reflect.Constructor.newInstance(
at py4j.reflection.MethodInvoker.invoke(
at py4j.reflection.ReflectionEngine.invoke(
at py4j.Gateway.invoke(
I've tried looking into resolutions, but this is getting outside of my area of expertise. I want my Synapse spark notebook to run, even on date fields where the date is less than 1900. Any ideas?
I was able to solve this problem by changing the overall configuration for my Spark pool (which you will probably want to do as well, unless you want to add config code to every notebook you make). To do this, open Synapse Studio, go to Manage > Apache Spark pools, click the three dots next to your pool (they are hidden until you mouse over them, great design Microsoft), then select Apache Spark configuration.
From there, create a new configuration and add a configuration property. For the property, enter spark.sql.parquet.int96RebaseModeInRead, and for the value, enter CORRECTED. Note that spark.sql.parquet.int96RebaseModeInRead does NOT show up as a suggested property; you have to type it in yourself.
Apply your changes, save everything, and make sure your new configuration is selected. It might take a bit for the new changes to be reflected in your notebooks, but it should work from there. If you notice some funky date issues with older dates, try changing CORRECTED to LEGACY.
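If you would rather set this per notebook than per pool, a sketch like the one below might work, on the assumption that these are runtime SQL settings that can be changed on the session that already exists (so no new SparkContext is created, which is what triggered the "Promise already completed" error). The helper name apply_rebase_mode is made up. conf is assumed to be any object with a set(key, value) method, such as spark.conf in a notebook. The legacy_names switch covers the spark.sql.legacy.parquet.* spellings used in the question, which older Spark 3.x releases expect.

```python
# Rebase-mode properties as named in newer Spark 3.x releases.
REBASE_KEYS = (
    "spark.sql.parquet.int96RebaseModeInRead",
    "spark.sql.parquet.int96RebaseModeInWrite",
    "spark.sql.parquet.datetimeRebaseModeInRead",
    "spark.sql.parquet.datetimeRebaseModeInWrite",
)
VALID_MODES = ("EXCEPTION", "CORRECTED", "LEGACY")


def apply_rebase_mode(conf, mode="CORRECTED", legacy_names=False):
    """Set all four Parquet rebase modes on an existing config object.

    conf: anything exposing set(key, value), e.g. spark.conf (assumption).
    legacy_names: use the spark.sql.legacy.parquet.* property names instead.
    """
    if mode not in VALID_MODES:
        raise ValueError(f"mode must be one of {VALID_MODES}, got {mode!r}")
    for key in REBASE_KEYS:
        if legacy_names:
            key = key.replace("spark.sql.parquet.", "spark.sql.legacy.parquet.")
        conf.set(key, mode)


# In a notebook, with the session the platform already created:
# apply_rebase_mode(spark.conf)
```

As with the pool-level setting, switching mode to "LEGACY" is the thing to try if older dates come out shifted.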

'Unsupported encoding: DELTA_BYTE_ARRAY' while writing parquet data to csv using pyspark

I want to convert Parquet files in binary format to CSV files. I am using the following commands in Spark.
val source = spark.read.parquet("path to parquet file")
source.coalesce(1).write.format("com.databricks.spark.csv").option("header","true").save("path to csv")
This works when I start Spark on the HDFS server and run these commands. When I copy the same Parquet file to my local system, start pyspark, and run these commands, it gives an error.
I am able to set the binary-as-string property to true and read the Parquet files in my local pyspark. But when I execute the command to write to CSV, it gives the following error.
2018-10-01 14:45:11 WARN ZlibFactory:51 - Failed to load/initialize native-zlib library
2018-10-01 14:45:12 ERROR Utils:91 - Aborting task
java.lang.UnsupportedOperationException: Unsupported encoding: DELTA_BYTE_ARRAY
at org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader.initDataReader(
at org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader.readPageV2(
at org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader.access$100(
at org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader$1.visit(
at org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader$1.visit(
at org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader.readPage(
at org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader.readBatch(
at org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.nextBatch(
at org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.nextKeyValue(
at org.apache.spark.sql.execution.datasources.RecordReaderIterator.hasNext(RecordReaderIterator.scala:39)
at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:109)
at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:186)
at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:109)
What should be done to resolve this error on my local machine, given that the same commands work on HDFS? Any idea would be of great help. Thank you.
You can try disabling the vectorized Parquet reader:
spark.conf.set("spark.sql.parquet.enableVectorizedReader", "false")
This is a workaround, not a solution.
Background and consequences of disabling it:
Spark 2.x throws an exception when reading Parquet files in which some columns are DELTA_BYTE_ARRAY encoded:
java.lang.UnsupportedOperationException: Unsupported encoding: DELTA_BYTE_ARRAY
If you turn off the vectorized reader property, reading these files works fine:
spark.conf.set("spark.sql.parquet.enableVectorizedReader", "false")
These files are written with the Parquet V2 writer, since delta byte array encoding is a Parquet v2 feature. The Spark 2.x vectorized reader does not appear to support that format.
An issue has already been filed on Apache's JIRA to address this.
Cons of using this workaround:
Vectorized query execution can bring big performance improvements to SQL engines such as Hive, Drill, and Presto. Instead of processing one row at a time, it processes a batch of rows at a time, so disabling the vectorized reader gives up that speedup. Spark 2.x does not support the vectorized reader for Parquet version 2, so we need to rely on this workaround until a later release.
Adding these 2 flags helped me overcome the error.
parquet.split.files false
spark.sql.parquet.enableVectorizedReader false
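The answer above does not say where the two flags were set. One possibility, sketched below under that assumption, is to pass them on the spark-submit command line. parquet.split.files is a Hadoop-side property, so the sketch forwards it with the spark.hadoop. prefix that Spark copies into the Hadoop configuration. The helper name to_submit_args and the script name my_job.py are made up for illustration.

```python
def to_submit_args(spark_confs, hadoop_confs):
    """Build spark-submit --conf arguments from two property dicts.

    spark_confs: Spark properties, passed through unchanged.
    hadoop_confs: Hadoop properties, forwarded with the spark.hadoop. prefix.
    """
    args = []
    for key, value in spark_confs.items():
        args += ["--conf", f"{key}={value}"]
    for key, value in hadoop_confs.items():
        args += ["--conf", f"spark.hadoop.{key}={value}"]
    return args


args = to_submit_args(
    {"spark.sql.parquet.enableVectorizedReader": "false"},
    {"parquet.split.files": "false"},
)
# my_job.py is a placeholder for the actual application script.
print(" ".join(["spark-submit"] + args + ["my_job.py"]))
```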

PySpark - Reading data from MongoDB using mongo-spark connector results in MongoQueryException for exceeding document size

I'm trying MongoDB's new Spark connector to read data from MongoDB. I supplied the DB and collection details to the Spark conf object when starting the application, and then use the following code to read into a DataFrame:
reader = sqlContext.read.format("com.mongodb.spark.sql.DefaultSource")
df = reader.options(partitioner='MongoSplitVectorPartitioner').load()
Then I write this DataFrame to a Parquet file.
It starts a job with a number of tasks, all of which succeed except the last one, which fails the whole write job. I get an exception about a document size being greater than 16 MB.
org.apache.spark.SparkException: Task failed while writing rows
at org.apache.spark.sql.execution.datasources.DefaultWriterContainer.writeRows(WriterContainer.scala:269)
at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1$$anonfun$apply$mcV$sp$3.apply(InsertIntoHadoopFsRelation.scala:148)
at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1$$anonfun$apply$mcV$sp$3.apply(InsertIntoHadoopFsRelation.scala:148)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
at org.apache.spark.executor.Executor$
at java.util.concurrent.ThreadPoolExecutor.runWorker(
at java.util.concurrent.ThreadPoolExecutor$
Caused by: com.mongodb.MongoQueryException: Query failed with error code 16493 and error message 'Tried to create string longer than 16MB' on server
at com.mongodb.connection.ProtocolHelper.getQueryFailureException(
at com.mongodb.connection.GetMoreProtocol.execute(
at com.mongodb.connection.GetMoreProtocol.execute(
at com.mongodb.connection.DefaultServer$DefaultServerProtocolExecutor.execute(
at com.mongodb.connection.DefaultServerConnection.executeProtocol(
at com.mongodb.connection.DefaultServerConnection.getMore(
at com.mongodb.operation.QueryBatchCursor.getMore(
at com.mongodb.operation.QueryBatchCursor.hasNext(
at com.mongodb.MongoBatchCursorAdapter.hasNext(
at scala.collection.convert.Wrappers$JIteratorWrapper.hasNext(Wrappers.scala:41)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
at org.apache.spark.api.python.SerDeUtil$
at org.apache.spark.api.python.SerDeUtil$
at scala.collection.Iterator$class.foreach(Iterator.scala:727)
at org.apache.spark.api.python.SerDeUtil$AutoBatchedPickler.foreach(SerDeUtil.scala:110)
at org.apache.spark.api.python.PythonRDD$.writeIteratorToStream(PythonRDD.scala:452)
at org.apache.spark.api.python.PythonRunner$WriterThread$$anonfun$run$3.apply(PythonRDD.scala:280)
at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1801)
at org.apache.spark.api.python.PythonRunner$
I do not maintain the Mongo database, so I'm not sure how a document larger than 16 MB exists in the first place. It could be due to the use of GridFS.
Is there a way I can skip processing the bad records?
I tried using a UDF to filter, but it failed with the same error:
import sys
from pyspark.sql.functions import udf, col
from pyspark.sql.types import IntegerType
size_filter_udf = udf(lambda entry: sys.getsizeof(entry), IntegerType())
filtered_df = df.where(size_filter_udf(col("caseNotes")) < 16000000)

Spark write to Postgresql. BatchUpdateException?

I have a simple Spark job that reads large log files, filters them, and writes results to a new table. The simplified Scala driver app code is:
val sourceRdd = sc.textFile(sourcePath)
val parsedRdd = sourceRdd.flatMap(parseRow)
val filteredRdd = parsedRdd.filter(l => filterLogEntry(l, beginDateTime, endDateTime))
val dataFrame = sqlContext.createDataFrame(filteredRdd)
val writer = dataFrame.write
val properties = new Properties()
properties.setProperty("user", "my_user")
properties.setProperty("password", "my_password")
writer.jdbc("jdbc:postgresql://ip_address/database_name", "my_table", properties)
This works perfectly on smaller batches. On a large batch, after two hours of execution, I see about 8 million records in the target table and the Spark job has failed with the following error:
Caused by: java.sql.BatchUpdateException: Batch entry 524 INSERT INTO my_table <snip> was aborted. Call getNextException to see the cause.
at org.postgresql.jdbc.BatchResultHandler.handleError(
at org.postgresql.core.v3.QueryExecutorImpl$ErrorTrackingResultHandler.handleError(
at org.postgresql.core.v3.QueryExecutorImpl.processResults(
at org.postgresql.core.v3.QueryExecutorImpl.flushIfDeadlockRisk(
at org.postgresql.core.v3.QueryExecutorImpl.sendQuery(
at org.postgresql.core.v3.QueryExecutorImpl.execute(
at org.postgresql.jdbc.PgStatement.executeBatch(
at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.savePartition(JdbcUtils.scala:210)
at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$saveTable$1.apply(JdbcUtils.scala:277)
at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$saveTable$1.apply(JdbcUtils.scala:276)
at org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1$$anonfun$apply$33.apply(RDD.scala:920)
at org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1$$anonfun$apply$33.apply(RDD.scala:920)
at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1858)
at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1858)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
at org.apache.spark.executor.Executor$
at java.util.concurrent.ThreadPoolExecutor.runWorker(
at java.util.concurrent.ThreadPoolExecutor$
If I copy paste the given SQL INSERT statement into an SQL console, it works fine. In the Postgresql server log I see:
(this is unmodified/unanonymized log)
2016-04-26 22:38:09 GMT [3769-12] nginxlogs_admin#nginxlogs ERROR: syntax error at or near "was" at character 544
2016-04-26 22:38:09 GMT [3769-13] nginxlogs_admin#nginxlogs STATEMENT: INSERT INTO log_entries2 (client,host,req_t,request,seg,server,timestamp_value) VALUES ('','""','0.000s','"GET /bid?apnx_id=&ip= HTTP/1.1"','samba_info_has_geo','','2015-08-02T20:24:30.482000112') was aborted. Call getNextException to see the cause.
It seems like Spark sent the text "was aborted. Call getNextException..." to Postgresql which triggered this specific error. That seems like a legitimate Spark bug. The second question is why did Spark abort this in the first place?
So, afaik, I can't call getNextException because I'm not using JDBC directly but going through Spark.
FYI, this is with Spark 1.6.1 and Scala 2.11.
If anyone else is searching and hits this: my database server (running in a VM) hit its disk space limit. Spark seemed to get confused by this error. It did not log the real error, caused a different internal error, and logged the results of that one instead. Technically, this is probably an internal Spark bug in how it responds to an uncommon database disk-full error.

Table imported as parquet from sqoop not working in spark

I imported a table from MS SQL Server with Sqoop 1.4.5 in Parquet format. But when I try to load it from the Spark shell, it throws an error like this:
scala> val df1 = sqlContext.load("/home/bipin/Customer2")
scala.collection.parallel.CompositeThrowable: Multiple exceptions thrown during a parallel computation: java.lang.NullPointerException
at scala.collection.parallel.package$$anon$1.alongWith(package.scala:87)
at scala.collection.parallel.Task$class.mergeThrowables(Tasks.scala:86)
at scala.collection.parallel.mutable.ParArray$Map.mergeThrowables(ParArray.scala:650)
at scala.collection.parallel.Task$class.tryMerge(Tasks.scala:72)
at scala.collection.parallel.mutable.ParArray$Map.tryMerge(ParArray.scala:650)
at scala.collection.parallel.AdaptiveWorkStealingTasks$WrappedTask$class.internal(Tasks.scala:190)
at scala.collection.parallel.AdaptiveWorkStealingForkJoinTasks$WrappedTask.internal(Tasks.scala:514)
at scala.collection.parallel.AdaptiveWorkStealingTasks$WrappedTask$class.compute(Tasks.scala:162)
at scala.collection.parallel.AdaptiveWorkStealingForkJoinTasks$WrappedTask.compute(Tasks.scala:514)
at scala.concurrent.forkjoin.RecursiveAction.exec(
at scala.concurrent.forkjoin.ForkJoinTask.doExec(
at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(
at scala.concurrent.forkjoin.ForkJoinPool.runWorker(
I looked at the Sqoop Parquet folder and its structure is different from the one I created with Spark. How can I make the Parquet file work?
Use parquetFile instead of load. load is for data stored as a DataFrame. More examples are in the guide.
val df1 = sqlContext.parquetFile("/home/bipin/Customer2")
The Parquet versions are not compatible. Spark 1.2 uses Parquet 1.6, but your Parquet file may have been written with version 1.7 or higher, and the Parquet 1.6 reader cannot parse a Parquet 1.7 file.
A future version of Spark (maybe 1.5) will use Parquet 1.7, which already appears in the pom.xml of the master branch.
This is a bug in Spark 1.1 which comes from the parquet library, see PARQUET-136.
For a string or binary column, if all values in a single column chunk are null, then so are the min and max values in the column chunk statistics. However, a null check is missing when the statistics are checked for column chunk pruning, which causes the NPE. The corresponding code can be found here.
This fix was brought into Spark 1.2.0, see SPARK-3968.
The easiest fix is probably to upgrade or, if that is not possible, to ensure there are no columns that contain only null values!
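If upgrading is not an option, a first check for columns that contain only nulls could look like the sketch below. It works on plain Python rows (dicts) so it can run on a sample exported from the source table. The function names and the sample rows are made up for illustration. Note the caveat: the bug described above is triggered per column chunk, so a column can pass this whole-table check and still have one chunk that is entirely null.

```python
def count_non_null(rows, columns):
    """Count non-null values per column over an iterable of dict rows."""
    counts = {c: 0 for c in columns}
    for row in rows:
        for c in columns:
            if row.get(c) is not None:
                counts[c] += 1
    return counts


def all_null_columns(non_null_counts):
    """Names of columns whose non-null count is zero, sorted."""
    return sorted(name for name, n in non_null_counts.items() if n == 0)


# Hypothetical sample of the imported table.
sample = [
    {"id": 1, "name": "a", "fax": None},
    {"id": 2, "name": None, "fax": None},
]
print(all_null_columns(count_non_null(sample, ["id", "name", "fax"])))
```

Columns reported here are candidates to drop or fill with a default value before the Sqoop import.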