Apache Spark- Writing parquet with snappy compression errors - pyspark

**Using
Spark v3.0.2
JAR File - snappy-java-1.1.8.2
HADOOP=3.2.2
JAVA - java-1.8.0-openjdk.x86_64**
Executing : With and without compression key value (default is 'snappy').
df.write.option("compression", "snappy").mode("overwrite").partitionBy(part_labels).parquet(output_path)

I think the pyspark API is slightly different from the Java/Scala API. Try this:
df.write.parquet(output_path, mode="overwrite", partitionBy=part_labels, compression="snappy")

Related

Problem creating KSQL stream - java.lang.NoClassDefFoundError: Could not initialize class org.xerial.snappy.Snappy

I'm trying to create a stream using KSQL but I'm getting an error returned.
The statement I'm running is:
create stream s1 with (kafka_topic = 'T3_NON_END', value_format = 'avro');
I get a NoClassDefFoundError - org.xerial.snappy.Snappy
From what I've read this is because /tmp is set as noexec. It looks from the Confluent website and also from looking at other applications that use Snappy, that a directory path needs to be passed.
https://docs.confluent.io/5.4.2/ksql/docs/troubleshoot-ksql.html
Does anyone know how I can pass the directory path for Snappy when using KSQL?
Sounds like you're asking how to set JVM flags
export KSQL_OPTS='-D...'
ksql-server-start...
this is because /tmp is set as noexec.
Not necessarily. The error is saying snappy is not on the JVM classpath

Apache Bean Spark Runner does not work on streaming mode - java.lang.IllegalAccessException

I have a streaming beam application that runs on Flink.
When I tried to switch it to spark runner with EMR (5.30.1), with both apache bean (2.23.0 and 2.24.0), I am getting following error:
Exception in thread "main" java.lang.IllegalAccessException: Class org.apache.spark.sql.streaming.DataStreamReader can not access a member of class org.apache.beam.runners.spark.structuredstreaming.translation.streaming.DatasetSourceStreaming with modifiers ""
at sun.reflect.Reflection.ensureMemberAccess(Reflection.java:102)
at java.lang.Class.newInstance(Class.java:436)
at org.apache.spark.sql.streaming.DataStreamReader.load(DataStreamReader.scala:161)
at org.apache.beam.runners.spark.structuredstreaming.translation.streaming.ReadSourceTranslatorStreaming.translateTransform(ReadSourceTranslatorStreaming.java:71)
at org.apache.beam.runners.spark.structuredstreaming.translation.PipelineTranslator.applyTransformTranslator(PipelineTranslator.java:159)
at org.apache.beam.runners.spark.structuredstreaming.translation.PipelineTranslator.visitPrimitiveTransform(PipelineTranslator.java:214)
at org.apache.beam.sdk.runners.TransformHierarchy$Node.visit(TransformHierarchy.java:664)
at org.apache.beam.sdk.runners.TransformHierarchy$Node.visit(TransformHierarchy.java:656)
at org.apache.beam.sdk.runners.TransformHierarchy$Node.visit(TransformHierarchy.java:656)
at org.apache.beam.sdk.runners.TransformHierarchy$Node.visit(TransformHierarchy.java:656)
at org.apache.beam.sdk.runners.TransformHierarchy$Node.access$600(TransformHierarchy.java:317)
at org.apache.beam.sdk.runners.TransformHierarchy.visit(TransformHierarchy.java:251)
at org.apache.beam.sdk.Pipeline.traverseTopologically(Pipeline.java:463)
at org.apache.beam.runners.spark.structuredstreaming.translation.PipelineTranslator.translate(PipelineTranslator.java:173)
at org.apache.beam.runners.spark.structuredstreaming.SparkStructuredStreamingRunner.translatePipeline(SparkStructuredStreamingRunner.java:195)
at org.apache.beam.runners.spark.structuredstreaming.SparkStructuredStreamingRunner.run(SparkStructuredStreamingRunner.java:148)
at org.apache.beam.runners.spark.structuredstreaming.SparkStructuredStreamingRunner.run(SparkStructuredStreamingRunner.java:68)
at org.apache.beam.sdk.Pipeline.run(Pipeline.java:317)
at org.apache.beam.sdk.Pipeline.run(Pipeline.java:303)
at com.adp.vantage.sonic.pipelines.KafkaPipelineConsumer.main(KafkaPipelineConsumer.java:136)
The command line is:
spark-submit --class com.adp.vantage.sonic.pipelines.KafkaPipelineConsumer deploy/sonic-payroll-event-tracker-spark-2.10.0.jar --runner=SparkStructuredStreamingRunner --streaming=true --env=dit
The DatasetSourceStreaming is not a public class, so looks to me this will never work, does anybody manage to make it work at all?
Unfortunately, SparkStructuredStreamingRunner still doesn’t support streaming job (please, take a look on “Note" section here)
There is a Jira issue for tracking this - in two words, it’s blocked because of issue on Spark with multiple aggregations for Structured Streaming component.
So far, it’s recommended to use a “Classical” RDD or Portable Spark runners to run streaming Beam pipelines on Spark.

'Unsupported encoding: DELTA_BYTE_ARRAY' while writing parquet data to csv using pyspark

I want to convert parquet files in binary format to csv files. I am using the following commands in spark.
sqlContext.setConf("spark.sql.parquet.binaryAsString","true")
val source = sqlContext.read.parquet("path to parquet file")
source.coalesce(1).write.format("com.databricks.spark.csv").option("header","true").save("path to csv")
This works when i start spark in HDFS server and run these commands. When I try copying the same parquet file to my local system and start pyspark and run these commands it is giving error.
I am able to set binary as string property to true and able to read parquet files in my local pyspark. But when I execute the command to write to csv, it gives the following error.
2018-10-01 14:45:11 WARN ZlibFactory:51 - Failed to load/initialize
native-zlib library 2018-10-01 14:45:12 ERROR Utils:91 - Aborting task
java.lang.UnsupportedOperationException: Unsupported encoding:
DELTA_BYTE_ARRAY
at org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader.initDataReader(VectorizedColumnReader.java:577)
at org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader.readPageV2(VectorizedColumnReader.java:627)
at org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader.access$100(VectorizedColumnReader.java:47)
at org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader$1.visit(VectorizedColumnReader.java:550)
at org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader$1.visit(VectorizedColumnReader.java:536)
at org.apache.parquet.column.page.DataPageV2.accept(DataPageV2.java:141)
at org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader.readPage(VectorizedColumnReader.java:536)
at org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader.readBatch(VectorizedColumnReader.java:164)
at org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.nextBatch(VectorizedParquetRecordReader.java:263)
at org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.nextKeyValue(VectorizedParquetRecordReader.java:161)
at org.apache.spark.sql.execution.datasources.RecordReaderIterator.hasNext(RecordReaderIterator.scala:39)
at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:109)
at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:186)
at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:109)
What should be done to resolve this error in local machine as the same works in hdfs? Any idea to resolve this would be of great help. Thank you.
You can try disabling the VectorizedReader.
spark.conf.set("spark.sql.parquet.enableVectorizedReader", "false")
This is not a solution but it is a workaround.
Consequences of disabling it will be https://jaceklaskowski.gitbooks.io/mastering-spark-sql/spark-sql-vectorized-parquet-reader.html
Problem:
Getting an exception in Spark 2.x reading parquet files where some columns are DELTA_BYTE_ARRAY encoded.
Exception:
java.lang.UnsupportedOperationException: Unsupported encoding: DELTA_BYTE_ARRAY
Solution:
If turn off the vectorized reader property, reading these files works fine.
spark.conf.set("spark.sql.parquet.enableVectorizedReader", "false")
Explanation:
These files are written with the Parquet V2 writer, as delta byte array encoding is a Parquet v2 feature. The Spark 2.x vectorized reader does not appear to support that format.
Issue already created on apache’s jira. To solve this particular work around.
Cons of using this solution.
Vectorized Query Execution could have big performance improvement for SQL engines like Hive, Drill, and Presto. Instead of processing one row at a time, Vectorized Query Execution could streamline operations by processing a batch of rows at a time. But spark 2.x doesn’t support this feature for parquet version two so we need to rely on this solution until further releases.
Adding these 2 flags helped me overcome the error.
parquet.split.files false
spark.sql.parquet.enableVectorizedReader false

Spark Dataframe writes part files to _temporary in instead directly creating partFiles in output directory [duplicate]

We are running spark 2.3.0 on AWS EMR. The following DataFrame "df" is non empty and of modest size:
scala> df.count
res0: Long = 4067
The following code works fine for writing df to hdfs:
scala> val hdf = spark.read.parquet("/tmp/topVendors")
hdf: org.apache.spark.sql.DataFrame = [displayName: string, cnt: bigint]
scala> hdf.count
res4: Long = 4067
However using the same code to write to a local parquet or csv file end up with empty results:
df.repartition(1).write.mode("overwrite").parquet("file:///tmp/topVendors")
scala> val locdf = spark.read.parquet("file:///tmp/topVendors")
org.apache.spark.sql.AnalysisException: Unable to infer schema for Parquet. It must be specified manually.;
at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$9.apply(DataSource.scala:207)
at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$9.apply(DataSource.scala:207)
at scala.Option.getOrElse(Option.scala:121)
We can see why it fails:
ls -l /tmp/topVendors
total 0
-rw-r--r-- 1 hadoop hadoop 0 Jul 30 22:38 _SUCCESS
So there is no parquet file being written.
I have tried this maybe twenty times and for both csv and parquet and on two different EMR Servers: this same behavior is exhibited in all cases.
Is this an EMR specific bug? A more general EC2 bug? Something else? This code works on spark on macOS.
In case it matters - here is the versioning info:
Release label:emr-5.13.0
Hadoop distribution:Amazon 2.8.3
Applications:Spark 2.3.0, Hive 2.3.2, Zeppelin 0.7.3
That is not a bug and it is the expected behavior. Spark does not really support writes to non-distributed storage (it will work in local mode, just because you have shared file system).
Local path is not interpreted (only) as a path on the driver (this would require collecting the data) but local path on each executor. Therefore each executor will write its own chunk to its own local file system.
Not only output is no readable back (to load data each executor and the driver should see the same state of the file system), but depending on the commit algorithm, might not be even finalized (move from the temporary directory).
This error usually occurs when you try to read an empty directory as parquet.
You could check
1. if the DataFrame is empty with outcome.rdd.isEmpty() before write it.
2. Check the if the path you are giving is correct
Also in what mode you are running your application? Try running it in client mode if you are running in cluster mode.

Table imported as parquet from sqoop not working in spark

I imported a table from mssql server with Sqoop 1.4.5 in parquet format. But when I try to load it from Spark shell, it throws error like :
scala> val df1 = sqlContext.load("/home/bipin/Customer2")
scala.collection.parallel.CompositeThrowable: Multiple exceptions thrown during a parallel computation: java.lang.NullPointerException
parquet.format.converter.ParquetMetadataConverter.fromParquetStatistics(ParquetMetadataConverter.java:249)
parquet.format.converter.ParquetMetadataConverter.fromParquetMetadata(ParquetMetadataConverter.java:543)
parquet.format.converter.ParquetMetadataConverter.readParquetMetadata(ParquetMetadataConverter.java:520)
parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:426)
org.apache.spark.sql.parquet.ParquetRelation2$MetadataCache$$anonfun$refresh$6.apply(newParquet.scala:298)
org.apache.spark.sql.parquet.ParquetRelation2$MetadataCache$$anonfun$refresh$6.apply(newParquet.scala:297)
scala.collection.parallel.mutable.ParArray$Map.leaf(ParArray.scala:658)
scala.collection.parallel.Task$$anonfun$tryLeaf$1.apply$mcV$sp(Tasks.scala:54)
scala.collection.parallel.Task$$anonfun$tryLeaf$1.apply(Tasks.scala:53)
scala.collection.parallel.Task$$anonfun$tryLeaf$1.apply(Tasks.scala:53)
.
.
.
at scala.collection.parallel.package$$anon$1.alongWith(package.scala:87)
at scala.collection.parallel.Task$class.mergeThrowables(Tasks.scala:86)
at scala.collection.parallel.mutable.ParArray$Map.mergeThrowables(ParArray.scala:650)
at scala.collection.parallel.Task$class.tryMerge(Tasks.scala:72)
at scala.collection.parallel.mutable.ParArray$Map.tryMerge(ParArray.scala:650)
at scala.collection.parallel.AdaptiveWorkStealingTasks$WrappedTask$class.internal(Tasks.scala:190)
at scala.collection.parallel.AdaptiveWorkStealingForkJoinTasks$WrappedTask.internal(Tasks.scala:514)
at scala.collection.parallel.AdaptiveWorkStealingTasks$WrappedTask$class.compute(Tasks.scala:162)
at scala.collection.parallel.AdaptiveWorkStealingForkJoinTasks$WrappedTask.compute(Tasks.scala:514)
at scala.concurrent.forkjoin.RecursiveAction.exec(RecursiveAction.java:160)
at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
I looked at the sqoop parquet folder and it's structure is different than the one that I created on Spark. How can I make the parquet file work ?
Use parquetFile instead of load. load is for data stored as DataFrame. More examples in guide.
val df1 = sqlContext.parquetFile("/home/bipin/Customer2")
The Parquet version is not compatible. Spark 1.2 use parquet version 1.6. But your parquet file maybe in the version 1.7 or higher. Parquet1.6 reader cannot parse the parquet 1.7 file.
The next version of spark(maybe 1.5) will use the parquet1.7 in the future that appears in the pom.xml of the master branch.
This is a bug in Spark 1.1 which comes from the parquet library, see PARQUET-136.
For a string or a binary column, if all values in a single column trunk are null, so do the min & max values in the column trunk statistics. However, while checking the statistics for column trunk pruning, a null check is missing, and causes NPE. Corresponding code can be found here.
This fix was brought into Spark 1.2.0, see SPARK-3968.
Easiest fix is probably to upgrade, or if that is not possible ensure there are no columns which have only null values!