Spark Dataframe writes part files to _temporary in instead directly creating partFiles in output directory [duplicate] - scala

We are running spark 2.3.0 on AWS EMR. The following DataFrame "df" is non empty and of modest size:
scala> df.count
res0: Long = 4067
The following code works fine for writing df to hdfs:
scala> val hdf = spark.read.parquet("/tmp/topVendors")
hdf: org.apache.spark.sql.DataFrame = [displayName: string, cnt: bigint]
scala> hdf.count
res4: Long = 4067
However using the same code to write to a local parquet or csv file end up with empty results:
df.repartition(1).write.mode("overwrite").parquet("file:///tmp/topVendors")
scala> val locdf = spark.read.parquet("file:///tmp/topVendors")
org.apache.spark.sql.AnalysisException: Unable to infer schema for Parquet. It must be specified manually.;
at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$9.apply(DataSource.scala:207)
at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$9.apply(DataSource.scala:207)
at scala.Option.getOrElse(Option.scala:121)
We can see why it fails:
ls -l /tmp/topVendors
total 0
-rw-r--r-- 1 hadoop hadoop 0 Jul 30 22:38 _SUCCESS
So there is no parquet file being written.
I have tried this maybe twenty times and for both csv and parquet and on two different EMR Servers: this same behavior is exhibited in all cases.
Is this an EMR specific bug? A more general EC2 bug? Something else? This code works on spark on macOS.
In case it matters - here is the versioning info:
Release label:emr-5.13.0
Hadoop distribution:Amazon 2.8.3
Applications:Spark 2.3.0, Hive 2.3.2, Zeppelin 0.7.3

That is not a bug and it is the expected behavior. Spark does not really support writes to non-distributed storage (it will work in local mode, just because you have shared file system).
Local path is not interpreted (only) as a path on the driver (this would require collecting the data) but local path on each executor. Therefore each executor will write its own chunk to its own local file system.
Not only output is no readable back (to load data each executor and the driver should see the same state of the file system), but depending on the commit algorithm, might not be even finalized (move from the temporary directory).

This error usually occurs when you try to read an empty directory as parquet.
You could check
1. if the DataFrame is empty with outcome.rdd.isEmpty() before write it.
2. Check the if the path you are giving is correct
Also in what mode you are running your application? Try running it in client mode if you are running in cluster mode.

Related

Class org.apache-spark.SparkException, java.lang.NoClassDefFoundError: Could not initialize class XXX

Class org.apache-spark.SparkException, java.lang.NoClassDefFoundError: Could not initialize class XXX(class where field validation exists) Exception when I am trying to do field validations on Spark Dataframe. Here is my code
And all classes and object used are serialized. Fails on AWS EMR spark job (works fine in local Machine.)
val newSchema = df.schema.add("errorList", ArrayType(new StructType()
.add("fieldName" , StringType)
.add("value" , StringType)
.add("message" , StringType)))
//Validators is a Sequence of validations on columns in a Row.
// Validator method signature
// def checkForErrors(row: Row): (fieldName, value, message) ={
// logic to validate the field in a row }
val validateRow: Row => Row = (row: Row)=>{
val errorList = validators.map(validator => validator.checkForErrors(row)
Row.merge(row, Row(errorList))
}
val validateDf = df.map(validateRow)(RowEncoder.apply(newSchema))
Versions : Spark 2.4.7 and Scala 2.11.8
Any ideas on why this might happen or if someone had the same issue.
I faced a very similar problem with EMR release 6.8.0 - in particular, the spark.jars configuration was not respected for me on EMR (I pointed it at a location of a JAR in S3), even though it seems to be normally accepted Spark parameter.
For me, the solution was to follow this guide ("How do I resolve the "java.lang.ClassNotFoundException" in Spark on Amazon EMR?"):
https://aws.amazon.com/premiumsupport/knowledge-center/emr-spark-classnotfoundexception/
In CDK (where our EMR cluster definitino is), I set up an EMR step to be executed immediately after cluster creation the rewrite the spark.driver.extraClassPath and spark.executor.extraClassPath to also contain the location of my additional JAR (in my case, the JAR physically comes in a Docker image, but you could also set up a boostrap action to copy it on the cluster from S3), as per their code in the article under "For Amazon EMR release version 6.0.0 and later,". The reason you have to do this "rewriting" is because EMR already populates these spark.*.extraClassPath with a bunch of its own JAR location, e.g. for JARs that contain the S3 drivers, so you effectively have to append your own JAR location, rather than just straight up setting the spark.*.extraClassPath to your location. If you do the latter (I tried it), then you will lose lot of the EMR functionality such as being able to read from S3.
#!/bin/bash
#
# This is an example of script_b.sh for changing /etc/spark/conf/spark-defaults.conf
#
while [ ! -f /etc/spark/conf/spark-defaults.conf ]
do
sleep 1
done
#
# Now the file is available, do your work here
#
sudo sed -i '/spark.*.extraClassPath/s/$/:\/home\/hadoop\/extrajars\/\*/' /etc/spark/conf/spark-defaults.conf
exit 0

IllegalStateException: _spark_metadata/0 doesn't exist while compacting batch 9

We have Streaming Application implemented using Spark Structured Streaming which tries to read data from Kafka topics and write it to HDFS Location.
Sometimes application fails with Exception:
_spark_metadata/0 doesn't exist while compacting batch 9
java.lang.IllegalStateException: history/1523305060336/_spark_metadata/9.compact doesn't exist when compacting batch 19 (compactInterval: 10)
We are not able to resolve this issue.
Only solution I found is to delete checkpoint location files which will make the job read the topic/data from beginning as soon as we run the application again. However, this is not a feasible solution for production application.
Does anyone has a solution for this error without deleting checkpoint such that I can continue from where the last run was failed?
Sample code of application:
val df = spark.readStream
.format("kafka")
.option("kafka.bootstrap.servers", <server list>)
.option("subscribe", <topic>)
.load()
[...] // do some processing
dfProcessed.writeStream
.format("csv")
.option("format", "append")
.option("path",hdfsPath)
.option("checkpointlocation","")
.outputmode(append)
.start
The error message
_spark_metadata/n.compact doesn't exist when compacting batch n+10
can show up when you
process some data into a FileSink with checkpoint enabled, then
stop your streaming job, then
change the output directory of the FileSink while keeping the same checkpointLocation, then
restart the streaming job
Quick Solution (not for production)
Just delete the files in checkpointLocation and restart the application.
Stable Solution
As you do not want to delete your checkpoint files, you could simply copy the missing spark metadata files from the old File Sink output path to the new output Path. See below to understand what are the "missing spark metadata files".
Background
To understand, why this IllegalStateException is being thrown, we need to understand what is happening behind the scene in the provided file output path. Let outPathBefore be the name of this path. When your streaming job is running and processing data the job creates a folder outPathBefore/_spark_metadata. In that folder you will find a file named after micro-batch Identifier containing the list of files (partitioned files) the data has been written to, e.g:
/home/mike/outPathBefore/_spark_metadata$ ls
0 1 2 3 4 5 6 7
In this case we have details for 8 micro batches. The content of one of the files looks like
/home/mike/outPathBefore/_spark_metadata$ cat 0
v1
{"path":"file:///tmp/file/before/part-00000-99bdc705-70a2-410f-92ff-7ca9c369c58b-c000.csv","size":2287,"isDir":false,"modificationTime":1616075186000,"blockReplication":1,"blockSize":33554432,"action":"add"}
By default, on each tenth micro batch, these files are getting compacted, meaning the contents of the files 0, 1, 2, ..., 9 will be stored in a compacted file called 9.compact.
This procedure continuous for the subsequent ten batches, i.e. in the micro batch 19 the job aggregates the last 10 files which are 9.compact, 10, 11, 12, ..., 19.
Now, imagine you had the streaming job running until micro batch 15 which means the job has created the following files:
/home/mike/outPathBefore/_spark_metadata/0
/home/mike/outPathBefore/_spark_metadata/1
...
/home/mike/outPathBefore/_spark_metadata/8
/home/mike/outPathBefore/_spark_metadata/9.compact
/home/mike/outPathBefore/_spark_metadata/10
...
/home/mike/outPathBefore/_spark_metadata/15
After the fifteenth micro batch you stopped the streaming job and changed the output path of the File Sink to, say, outPathAfter. As you keep the same checkpointLocation the streaming job will continue with micro-batch 16. However, it now creates the metadata files in the new out path:
/home/mike/outPathAfter/_spark_metadata/16
/home/mike/outPathAfter/_spark_metadata/17
...
Now, and this is where the Exception is thrown: When reaching micro batch 19, the job tries to compact the tenth latest files from spark metadata folder. However, it can only find the files 16, 17, 18 but it does not find 9.compact, 10 etc. Hence the error message says:
java.lang.IllegalStateException: history/1523305060336/_spark_metadata/9.compact doesn't exist when compacting batch 19 (compactInterval: 10)
Documentation
The Structured Streaming Programming Guide explains on Recovery Semantics after Changes in a Streaming Query:
"Changes to output directory of a file sink are not allowed: sdf.writeStream.format("parquet").option("path", "/somePath") to sdf.writeStream.format("parquet").option("path", "/anotherPath")"
Databricks has also written some details in the article Streaming with File Sink: Problems with recovery if you change checkpoint or output directories
Error caused by checkpointLocation because checkpointLocation stores old or deleted data information. You just need to delete the folder containing checkpointLocation.
Explore more :https://kb.databricks.com/streaming/file-sink-streaming.html
Example :
df.writeStream
.format("parquet")
.outputMode("append")
.option("checkpointLocation", "D:/path/dir/checkpointLocation")
.option("path", "D:/path/dir/output")
.trigger(Trigger.ProcessingTime("5 seconds"))
.start()
.awaitTermination()
You need to do delete directory checkpointLocation.
This article introduces the mechanism and gives a good way to recover from a deleted _spark_metadata folder in Spark Structured Streaming:
https://dev.to/kevinwallimann/how-to-recover-from-a-deleted-sparkmetadata-folder-546j
"Create dummy log files:
If the metadata log files are irrecoverable, we could create dummy log files for the missing micro-batches.
In our example, this could be done like this:
for i in {0..1}; do echo v1 > "/tmp/destination/_spark_metadata/$i"; done
This will create the files
/tmp/destination/_spark_metadata/0
/tmp/destination/_spark_metadata/1
Now, the query can be restarted and should finish without errors."
As my previous output folder was not recoverable anymore. I tried this dummy solution, which could work to get rid of the IllegalStateException: _spark_metadata/... doesn't exist exception.

'Unsupported encoding: DELTA_BYTE_ARRAY' while writing parquet data to csv using pyspark

I want to convert parquet files in binary format to csv files. I am using the following commands in spark.
sqlContext.setConf("spark.sql.parquet.binaryAsString","true")
val source = sqlContext.read.parquet("path to parquet file")
source.coalesce(1).write.format("com.databricks.spark.csv").option("header","true").save("path to csv")
This works when i start spark in HDFS server and run these commands. When I try copying the same parquet file to my local system and start pyspark and run these commands it is giving error.
I am able to set binary as string property to true and able to read parquet files in my local pyspark. But when I execute the command to write to csv, it gives the following error.
2018-10-01 14:45:11 WARN ZlibFactory:51 - Failed to load/initialize
native-zlib library 2018-10-01 14:45:12 ERROR Utils:91 - Aborting task
java.lang.UnsupportedOperationException: Unsupported encoding:
DELTA_BYTE_ARRAY
at org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader.initDataReader(VectorizedColumnReader.java:577)
at org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader.readPageV2(VectorizedColumnReader.java:627)
at org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader.access$100(VectorizedColumnReader.java:47)
at org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader$1.visit(VectorizedColumnReader.java:550)
at org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader$1.visit(VectorizedColumnReader.java:536)
at org.apache.parquet.column.page.DataPageV2.accept(DataPageV2.java:141)
at org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader.readPage(VectorizedColumnReader.java:536)
at org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader.readBatch(VectorizedColumnReader.java:164)
at org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.nextBatch(VectorizedParquetRecordReader.java:263)
at org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.nextKeyValue(VectorizedParquetRecordReader.java:161)
at org.apache.spark.sql.execution.datasources.RecordReaderIterator.hasNext(RecordReaderIterator.scala:39)
at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:109)
at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:186)
at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:109)
What should be done to resolve this error in local machine as the same works in hdfs? Any idea to resolve this would be of great help. Thank you.
You can try disabling the VectorizedReader.
spark.conf.set("spark.sql.parquet.enableVectorizedReader", "false")
This is not a solution but it is a workaround.
Consequences of disabling it will be https://jaceklaskowski.gitbooks.io/mastering-spark-sql/spark-sql-vectorized-parquet-reader.html
Problem:
Getting an exception in Spark 2.x reading parquet files where some columns are DELTA_BYTE_ARRAY encoded.
Exception:
java.lang.UnsupportedOperationException: Unsupported encoding: DELTA_BYTE_ARRAY
Solution:
If turn off the vectorized reader property, reading these files works fine.
spark.conf.set("spark.sql.parquet.enableVectorizedReader", "false")
Explanation:
These files are written with the Parquet V2 writer, as delta byte array encoding is a Parquet v2 feature. The Spark 2.x vectorized reader does not appear to support that format.
Issue already created on apache’s jira. To solve this particular work around.
Cons of using this solution.
Vectorized Query Execution could have big performance improvement for SQL engines like Hive, Drill, and Presto. Instead of processing one row at a time, Vectorized Query Execution could streamline operations by processing a batch of rows at a time. But spark 2.x doesn’t support this feature for parquet version two so we need to rely on this solution until further releases.
Adding these 2 flags helped me overcome the error.
parquet.split.files false
spark.sql.parquet.enableVectorizedReader false

is there a way to optimize spark sql code?

updated:
I`m using spark sql 1.5.2. Trying to read many parquet files and filter and aggregate rows - there are ~35M of rows stored in ~30 files in my hdfs and it takes more than 10 minutes to process
val logins_12 = sqlContext.read.parquet("events/2015/12/*/login")
val l_12 = logins_12.where("event_data.level >= 90").select(
"pid",
"timestamp",
"event_data.level"
).withColumn("event_date", to_date(logins_12("timestamp"))).drop("timestamp").toDF("pid", "level", "event_date").groupBy("pid", "event_date").agg(Map("level"->"max")).toDF("pid", "event_date", "level")
l_12.first()
my spark is running in two node cluster with 8cores and 16Gb ram each, scala output makes me thing the computation runs in just one thread:
scala> x.first()
[Stage 1:=======> (50 + 1) / 368]
when I try count() instead of first() it looks like two threads are doing computations. which is still less than I was expecting as there are ~30 files that can be processed in parallel
scala> l_12.count()
[Stage 4:=====> (34 + 2) / 368]
I'm starting spark console with 14g for executor and 4g for driver in yarn-client mode
./bin/spark-shell -Dspark.executor.memory=14g -Dspark.driver.memory=4g --master yarn-client
my default config for spark:
spark.executor.memory 2g
spark.logConf true
spark.eventLog.dir maprfs:///apps/spark
spark.eventLog.enabled true
spark.sql.hive.metastore.sharedPrefixes com.mysql.jdbc,org.postgresql,com.microsoft.sqlserver,oracle.jdbc,com.mapr.fs.shim.LibraryLoader,com.mapr.security.JNISecurity,com.mapr.fs.jni
spark.executor.extraClassPath
spark.yarn.historyServer.address http://test-01:18080
there are 200 partitions of the rdd
scala> logins_12.rdd.partitions.size
res2: Int = 368
scala> l_12.rdd.partitions.size
res0: Int = 200
is there a way to optimize this code?
thanks
Both behaviors are more or less expected. Spark is rather lazy and it not only doesn't execute transformations unless you trigger an action but can also skip tasks if there are not required for the output. Since first requires only a single element it can compute only one partition. It is most likely the reason why you see only one running thread at some point.
Regarding the second issue it is most likely a matter of configuration. Assuming there is nothing wrong with YARN configuration (I don't use YARN but yarn.nodemanager.resource.cpu-vcores looks like a possible source of the problem) it is most likely a matter of Spark defaults. As you can read in the Configuration guide spark.executor.cores on Yarn is by default set to 1. Two workers gives two running threads.

Table imported as parquet from sqoop not working in spark

I imported a table from mssql server with Sqoop 1.4.5 in parquet format. But when I try to load it from Spark shell, it throws error like :
scala> val df1 = sqlContext.load("/home/bipin/Customer2")
scala.collection.parallel.CompositeThrowable: Multiple exceptions thrown during a parallel computation: java.lang.NullPointerException
parquet.format.converter.ParquetMetadataConverter.fromParquetStatistics(ParquetMetadataConverter.java:249)
parquet.format.converter.ParquetMetadataConverter.fromParquetMetadata(ParquetMetadataConverter.java:543)
parquet.format.converter.ParquetMetadataConverter.readParquetMetadata(ParquetMetadataConverter.java:520)
parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:426)
org.apache.spark.sql.parquet.ParquetRelation2$MetadataCache$$anonfun$refresh$6.apply(newParquet.scala:298)
org.apache.spark.sql.parquet.ParquetRelation2$MetadataCache$$anonfun$refresh$6.apply(newParquet.scala:297)
scala.collection.parallel.mutable.ParArray$Map.leaf(ParArray.scala:658)
scala.collection.parallel.Task$$anonfun$tryLeaf$1.apply$mcV$sp(Tasks.scala:54)
scala.collection.parallel.Task$$anonfun$tryLeaf$1.apply(Tasks.scala:53)
scala.collection.parallel.Task$$anonfun$tryLeaf$1.apply(Tasks.scala:53)
.
.
.
at scala.collection.parallel.package$$anon$1.alongWith(package.scala:87)
at scala.collection.parallel.Task$class.mergeThrowables(Tasks.scala:86)
at scala.collection.parallel.mutable.ParArray$Map.mergeThrowables(ParArray.scala:650)
at scala.collection.parallel.Task$class.tryMerge(Tasks.scala:72)
at scala.collection.parallel.mutable.ParArray$Map.tryMerge(ParArray.scala:650)
at scala.collection.parallel.AdaptiveWorkStealingTasks$WrappedTask$class.internal(Tasks.scala:190)
at scala.collection.parallel.AdaptiveWorkStealingForkJoinTasks$WrappedTask.internal(Tasks.scala:514)
at scala.collection.parallel.AdaptiveWorkStealingTasks$WrappedTask$class.compute(Tasks.scala:162)
at scala.collection.parallel.AdaptiveWorkStealingForkJoinTasks$WrappedTask.compute(Tasks.scala:514)
at scala.concurrent.forkjoin.RecursiveAction.exec(RecursiveAction.java:160)
at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
I looked at the sqoop parquet folder and it's structure is different than the one that I created on Spark. How can I make the parquet file work ?
Use parquetFile instead of load. load is for data stored as DataFrame. More examples in guide.
val df1 = sqlContext.parquetFile("/home/bipin/Customer2")
The Parquet version is not compatible. Spark 1.2 use parquet version 1.6. But your parquet file maybe in the version 1.7 or higher. Parquet1.6 reader cannot parse the parquet 1.7 file.
The next version of spark(maybe 1.5) will use the parquet1.7 in the future that appears in the pom.xml of the master branch.
This is a bug in Spark 1.1 which comes from the parquet library, see PARQUET-136.
For a string or a binary column, if all values in a single column trunk are null, so do the min & max values in the column trunk statistics. However, while checking the statistics for column trunk pruning, a null check is missing, and causes NPE. Corresponding code can be found here.
This fix was brought into Spark 1.2.0, see SPARK-3968.
Easiest fix is probably to upgrade, or if that is not possible ensure there are no columns which have only null values!