Table imported as parquet from sqoop not working in spark - scala

I imported a table from mssql server with Sqoop 1.4.5 in parquet format. But when I try to load it from Spark shell, it throws error like :
scala> val df1 = sqlContext.load("/home/bipin/Customer2")
scala.collection.parallel.CompositeThrowable: Multiple exceptions thrown during a parallel computation: java.lang.NullPointerException
parquet.format.converter.ParquetMetadataConverter.fromParquetStatistics(ParquetMetadataConverter.java:249)
parquet.format.converter.ParquetMetadataConverter.fromParquetMetadata(ParquetMetadataConverter.java:543)
parquet.format.converter.ParquetMetadataConverter.readParquetMetadata(ParquetMetadataConverter.java:520)
parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:426)
org.apache.spark.sql.parquet.ParquetRelation2$MetadataCache$$anonfun$refresh$6.apply(newParquet.scala:298)
org.apache.spark.sql.parquet.ParquetRelation2$MetadataCache$$anonfun$refresh$6.apply(newParquet.scala:297)
scala.collection.parallel.mutable.ParArray$Map.leaf(ParArray.scala:658)
scala.collection.parallel.Task$$anonfun$tryLeaf$1.apply$mcV$sp(Tasks.scala:54)
scala.collection.parallel.Task$$anonfun$tryLeaf$1.apply(Tasks.scala:53)
scala.collection.parallel.Task$$anonfun$tryLeaf$1.apply(Tasks.scala:53)
.
.
.
at scala.collection.parallel.package$$anon$1.alongWith(package.scala:87)
at scala.collection.parallel.Task$class.mergeThrowables(Tasks.scala:86)
at scala.collection.parallel.mutable.ParArray$Map.mergeThrowables(ParArray.scala:650)
at scala.collection.parallel.Task$class.tryMerge(Tasks.scala:72)
at scala.collection.parallel.mutable.ParArray$Map.tryMerge(ParArray.scala:650)
at scala.collection.parallel.AdaptiveWorkStealingTasks$WrappedTask$class.internal(Tasks.scala:190)
at scala.collection.parallel.AdaptiveWorkStealingForkJoinTasks$WrappedTask.internal(Tasks.scala:514)
at scala.collection.parallel.AdaptiveWorkStealingTasks$WrappedTask$class.compute(Tasks.scala:162)
at scala.collection.parallel.AdaptiveWorkStealingForkJoinTasks$WrappedTask.compute(Tasks.scala:514)
at scala.concurrent.forkjoin.RecursiveAction.exec(RecursiveAction.java:160)
at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
I looked at the sqoop parquet folder and it's structure is different than the one that I created on Spark. How can I make the parquet file work ?

Use parquetFile instead of load. load is for data stored as DataFrame. More examples in guide.
val df1 = sqlContext.parquetFile("/home/bipin/Customer2")

The Parquet version is not compatible. Spark 1.2 use parquet version 1.6. But your parquet file maybe in the version 1.7 or higher. Parquet1.6 reader cannot parse the parquet 1.7 file.
The next version of spark(maybe 1.5) will use the parquet1.7 in the future that appears in the pom.xml of the master branch.

This is a bug in Spark 1.1 which comes from the parquet library, see PARQUET-136.
For a string or a binary column, if all values in a single column trunk are null, so do the min & max values in the column trunk statistics. However, while checking the statistics for column trunk pruning, a null check is missing, and causes NPE. Corresponding code can be found here.
This fix was brought into Spark 1.2.0, see SPARK-3968.
Easiest fix is probably to upgrade, or if that is not possible ensure there are no columns which have only null values!

Related

Apache Spark- Writing parquet with snappy compression errors

**Using
Spark v3.0.2
JAR File - snappy-java-1.1.8.2
HADOOP=3.2.2
JAVA - java-1.8.0-openjdk.x86_64**
Executing : With and without compression key value (default is 'snappy').
df.write.option("compression", "snappy").mode("overwrite").partitionBy(part_labels).parquet(output_path)
I think the pyspark API is slightly different from the Java/Scala API. Try this:
df.write.parquet(output_path, mode="overwrite", partitionBy=part_labels, compression="snappy")

Unable to load hive table from spark dataframe with more than 25 columns in hdp 3

We were trying to populate hive table from spark shell. Dataframe with 25 columns got successfully added to the hive table using hive warehouse connector. But for more than this limit we got below error:
Caused by: java.lang.IllegalArgumentException: Missing required char ':' at 'struct<_c0:string,_c1:string,_c2:string,_c3:string,_c4:string,_c5:string,_c6:string,_c7:string,_c8:string,_c9:string,_c10:string,_c11:string,_c12:string,_c13:string,_c14:string,_c15:string,_c16:string,_c17:string,_c18:string,_c19:string,_c20:string,_c21:string,_c22:string,_c23:string,...^ 2 more fields>'
at org.apache.orc.TypeDescription.requireChar(TypeDescription.java:293)
Below is the sample input file data (input file is of type csv).
|col1 |col2 |col3 |col4 |col5 |col6 |col7 |col8 |col9 |col10 |col11 |col12 |col13 |col14 |col15 |col16 |col17|col18 |col19 |col20 |col21 |col22 |col23 |col24 |col25|col26 |
|--------------------|-----|-----|-------------------|--------|---------------|-----------|--------|--------|--------|--------|--------|--------|--------|--------|------|-----|---------------------------------------------|--------|-------|---------|---------|---------|------------------------------------|-----|----------|
|11111100000000000000|CID81|DID72|2015-08-31 00:17:00|null_val|919122222222222|1627298243 |null_val|null_val|null_val|null_val|null_val|null_val|Download|null_val|Mobile|NA |x-nid:xyz<-ch-nid->N4444.245881.ABC-119490111|12452524|1586949|sometext |sometext |sometext1|8b8d94af-5407-42fa-9c4f-baaa618377c8|Click|2015-08-31|
|22222200000000000000|CID82|DID73|2015-08-31 00:57:00|null_val|919122222222222|73171145211|null_val|null_val|null_val|null_val|null_val|null_val|Download|null_val|Tablet|NA |x-nid:xyz<-ch-nid->N4444.245881.ABC-119490111|12452530|1586956|88200211 |88200211 |sometext2|9b04580d-1669-4eb3-a5b0-4d9cec422f93|Click|2015-08-31|
|33333300000000000000|CID83|DID74|2015-08-31 00:17:00|null_val|919122222222222|73171145211|null_val|null_val|null_val|null_val|null_val|null_val|Download|null_val|Laptop|NA |x-nid:xyz<-ch-nid->N4444.245881.ABC-119490111|12452533|1586952|sometext2|sometext2|sometext3|3ab8511d-6f85-4e1f-8b11-a1d9b159f22f|Click|2015-08-31|
Spark shell was instantiated using below command:
spark-shell --jars /usr/hdp/current/hive_warehouse_connector/hive-warehouse-connector-assembly-1.0.0.3.0.1.0-187.jar --conf spark.hadoop.metastore.catalog.default=hive --conf spark.sql.hive.hiveserver2.jdbc.url="jdbc:hive2://sandbox-hdp.hortonworks.com:2181/;serviceDiscoveryMode=zooKeeper;zooKeeperNamespace=hiveserver2;user=raj_ops"
Version of HDP is 3.0.1
Hive table was created using below command:
val hive = com.hortonworks.spark.sql.hive.llap.HiveWarehouseBuilder.session(spark).build()
hive.createTable("tablename").ifNotExists().column()...create()
Data was saved using below command:
df.write.format("com.hortonworks.spark.sql.hive.llap.HiveWarehouseConnector").option("table", "tablename").mode("append").save()
Kindly help us on this.
Thank you in advance.
I faced this problem, after thoroughly examining the source code of the following classes:
org.apache.orc.TypeDescription
org.apache.spark.sql.types.StructType
org.apache.spark.util.Utils
I found out that the culprit was the variable DEFAULT_MAX_TO_STRING_FIELDS inside the class org.apache.spark.util.Utils:
/* The performance overhead of creating and logging strings for wide schemas can be large. To limit the impact, we bound the number of fields to include by default. This can be overridden by setting the 'spark.debug.maxToStringFields' conf in SparkEnv. */
val DEFAULT_MAX_TO_STRING_FIELDS = 25
so, after setting this property, by example: conf.set("spark.debug.maxToStringFields", "128"); in my application, the issue has gone.
I hope it can help others.

'Unsupported encoding: DELTA_BYTE_ARRAY' while writing parquet data to csv using pyspark

I want to convert parquet files in binary format to csv files. I am using the following commands in spark.
sqlContext.setConf("spark.sql.parquet.binaryAsString","true")
val source = sqlContext.read.parquet("path to parquet file")
source.coalesce(1).write.format("com.databricks.spark.csv").option("header","true").save("path to csv")
This works when i start spark in HDFS server and run these commands. When I try copying the same parquet file to my local system and start pyspark and run these commands it is giving error.
I am able to set binary as string property to true and able to read parquet files in my local pyspark. But when I execute the command to write to csv, it gives the following error.
2018-10-01 14:45:11 WARN ZlibFactory:51 - Failed to load/initialize
native-zlib library 2018-10-01 14:45:12 ERROR Utils:91 - Aborting task
java.lang.UnsupportedOperationException: Unsupported encoding:
DELTA_BYTE_ARRAY
at org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader.initDataReader(VectorizedColumnReader.java:577)
at org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader.readPageV2(VectorizedColumnReader.java:627)
at org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader.access$100(VectorizedColumnReader.java:47)
at org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader$1.visit(VectorizedColumnReader.java:550)
at org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader$1.visit(VectorizedColumnReader.java:536)
at org.apache.parquet.column.page.DataPageV2.accept(DataPageV2.java:141)
at org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader.readPage(VectorizedColumnReader.java:536)
at org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader.readBatch(VectorizedColumnReader.java:164)
at org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.nextBatch(VectorizedParquetRecordReader.java:263)
at org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.nextKeyValue(VectorizedParquetRecordReader.java:161)
at org.apache.spark.sql.execution.datasources.RecordReaderIterator.hasNext(RecordReaderIterator.scala:39)
at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:109)
at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:186)
at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:109)
What should be done to resolve this error in local machine as the same works in hdfs? Any idea to resolve this would be of great help. Thank you.
You can try disabling the VectorizedReader.
spark.conf.set("spark.sql.parquet.enableVectorizedReader", "false")
This is not a solution but it is a workaround.
Consequences of disabling it will be https://jaceklaskowski.gitbooks.io/mastering-spark-sql/spark-sql-vectorized-parquet-reader.html
Problem:
Getting an exception in Spark 2.x reading parquet files where some columns are DELTA_BYTE_ARRAY encoded.
Exception:
java.lang.UnsupportedOperationException: Unsupported encoding: DELTA_BYTE_ARRAY
Solution:
If turn off the vectorized reader property, reading these files works fine.
spark.conf.set("spark.sql.parquet.enableVectorizedReader", "false")
Explanation:
These files are written with the Parquet V2 writer, as delta byte array encoding is a Parquet v2 feature. The Spark 2.x vectorized reader does not appear to support that format.
Issue already created on apache’s jira. To solve this particular work around.
Cons of using this solution.
Vectorized Query Execution could have big performance improvement for SQL engines like Hive, Drill, and Presto. Instead of processing one row at a time, Vectorized Query Execution could streamline operations by processing a batch of rows at a time. But spark 2.x doesn’t support this feature for parquet version two so we need to rely on this solution until further releases.
Adding these 2 flags helped me overcome the error.
parquet.split.files false
spark.sql.parquet.enableVectorizedReader false

Spark Dataframe writes part files to _temporary in instead directly creating partFiles in output directory [duplicate]

We are running spark 2.3.0 on AWS EMR. The following DataFrame "df" is non empty and of modest size:
scala> df.count
res0: Long = 4067
The following code works fine for writing df to hdfs:
scala> val hdf = spark.read.parquet("/tmp/topVendors")
hdf: org.apache.spark.sql.DataFrame = [displayName: string, cnt: bigint]
scala> hdf.count
res4: Long = 4067
However using the same code to write to a local parquet or csv file end up with empty results:
df.repartition(1).write.mode("overwrite").parquet("file:///tmp/topVendors")
scala> val locdf = spark.read.parquet("file:///tmp/topVendors")
org.apache.spark.sql.AnalysisException: Unable to infer schema for Parquet. It must be specified manually.;
at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$9.apply(DataSource.scala:207)
at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$9.apply(DataSource.scala:207)
at scala.Option.getOrElse(Option.scala:121)
We can see why it fails:
ls -l /tmp/topVendors
total 0
-rw-r--r-- 1 hadoop hadoop 0 Jul 30 22:38 _SUCCESS
So there is no parquet file being written.
I have tried this maybe twenty times and for both csv and parquet and on two different EMR Servers: this same behavior is exhibited in all cases.
Is this an EMR specific bug? A more general EC2 bug? Something else? This code works on spark on macOS.
In case it matters - here is the versioning info:
Release label:emr-5.13.0
Hadoop distribution:Amazon 2.8.3
Applications:Spark 2.3.0, Hive 2.3.2, Zeppelin 0.7.3
That is not a bug and it is the expected behavior. Spark does not really support writes to non-distributed storage (it will work in local mode, just because you have shared file system).
Local path is not interpreted (only) as a path on the driver (this would require collecting the data) but local path on each executor. Therefore each executor will write its own chunk to its own local file system.
Not only output is no readable back (to load data each executor and the driver should see the same state of the file system), but depending on the commit algorithm, might not be even finalized (move from the temporary directory).
This error usually occurs when you try to read an empty directory as parquet.
You could check
1. if the DataFrame is empty with outcome.rdd.isEmpty() before write it.
2. Check the if the path you are giving is correct
Also in what mode you are running your application? Try running it in client mode if you are running in cluster mode.

Spark is crashing when computing big files

I have a program in Scala that read a CSV file, add a new column to the Dataframe and save the result as a parquet file. It works perfectly on small files (<5 Go) but when I try to use bigger files (~80 Go) it always fail when it should write the parquet file with this stacktrace :
16/10/20 10:03:37 WARN scheduler.TaskSetManager: Lost task 14.0 in stage 4.0 (TID 886, 10.0.0.10): java.io.EOFException: reached end of stream after reading 136445 bytes; 1245184 bytes expected
at org.spark_project.guava.io.ByteStreams.readFully(ByteStreams.java:735)
at org.apache.spark.sql.execution.UnsafeRowSerializerInstance$$anon$3$$anon$1.next(UnsafeRowSerializer.scala:127)
at org.apache.spark.sql.execution.UnsafeRowSerializerInstance$$anon$3$$anon$1.next(UnsafeRowSerializer.scala:110)
at scala.collection.Iterator$$anon$12.next(Iterator.scala:444)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
at org.apache.spark.util.CompletionIterator.next(CompletionIterator.scala:30)
at org.apache.spark.InterruptibleIterator.next(InterruptibleIterator.scala:43)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source)
at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at org.apache.spark.shuffle.sort.UnsafeShuffleWriter.write(UnsafeShuffleWriter.java:161)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:79)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:47)
at org.apache.spark.scheduler.Task.run(Task.scala:86)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
If anyone know what could cause this, that would help me a lot !
System used
Spark 2.0.1
Scala 2.11
Hadoop HDFS 2.7.3
All running in Docker in a 6 machine cluster (each 4 cores and 16 Go of RAM)
Example code
var df = spark.read.option("header", "true").option("inferSchema", "true").option("nullValue", "NULL").csv(hdfsFileURLIn)
df = df.withColumn("ipix", a2p(df.col(deName), df.col(raName)))
df.repartition(nPartitions, $"ipix").write.mode("overwrite").parquet(hdfsFileURLOut)
Here are few points that might help you:
I think you should check distribution of your ipix column data, it might happen that you have data skew, so 1 or few partitions might be much bigger than other. Those fat partitions might be such that 1 task that is working on the fat partition might fail. It probably has something to do with output of your function a2p. I'd test first to run this job even without repartitioning(just remove this call and try to see if it succeeds - without repartition call it will use default partitions split probably by size of input csv file)
I also hope that your input csv is not gzip-ed(since gzip-ed data it's not splittable, so all data will be in 1 partition)
Can you provide code?
perhaps the code you wrote are running on driver? how do you process the file?
there is a special Spark functionality of handling big data, for example RDD.
once you do:
someRdd.collect()
You bring the rdd to the driver memory, hence not using the abilities of spark.
Code that handles big data should run on slaves.
please check this : differentiate driver code and work code in Apache Spark
The problem looks like the read failed when decompress a stream of shuffled data in YARN mode.
Try the following code and see how it goes.
var df = spark.read.option("header", "true").option("inferSchema", "true").option("nullValue", "NULL").csv(hdfsFileURLIn)
df = df.withColumn("ipix", a2p(df.col(deName), df.col(raName))).persist(StorageLevel.MEMORY_AND_DISK)
df.repartition(nPartitions, $"ipix").write.mode("overwrite").parquet(hdfsFileURLOut)
There is also a similar issue Spark job failing in YARN mode