PySpark in Azure - need to configure SparkContext

Using a Spark notebook in Azure Synapse, I'm processing some data from parquet files and outputting it as different parquet files. I produced a working script and started applying it to different datasets, all working fine until I came across a dataset containing dates older than 1900.
For this issue, I came across this article (which I took to be applicable to my scenario):
Problems when writing parquet with timestamps prior to 1900 in AWS Glue 3.0
The fix is to add this code chunk to the top of the notebook, which I did:
%%pyspark
from pyspark import SparkContext
from awsglue.context import GlueContext  # Glue-specific import required by this snippet; not available in Synapse

sc = SparkContext()
# Get the current SparkConf which is set by Glue
conf = sc.getConf()
# Add additional Spark configurations
conf.set("spark.sql.legacy.parquet.int96RebaseModeInRead", "CORRECTED")
conf.set("spark.sql.legacy.parquet.int96RebaseModeInWrite", "CORRECTED")
conf.set("spark.sql.legacy.parquet.datetimeRebaseModeInRead", "CORRECTED")
conf.set("spark.sql.legacy.parquet.datetimeRebaseModeInWrite", "CORRECTED")
# Restart the Spark context with the updated conf
sc.stop()
sc = SparkContext.getOrCreate(conf=conf)
# Create a Glue context with the restarted sc
glueContext = GlueContext(sc)
Unfortunately this generated another error:
Py4JJavaError: An error occurred while calling None.org.apache.spark.api.java.JavaSparkContext.
: java.lang.IllegalStateException: Promise already completed.
    at scala.concurrent.Promise.complete(Promise.scala:53)
    at scala.concurrent.Promise.complete$(Promise.scala:52)
    at scala.concurrent.impl.Promise$DefaultPromise.complete(Promise.scala:187)
    at scala.concurrent.Promise.success(Promise.scala:86)
    at scala.concurrent.Promise.success$(Promise.scala:86)
    at scala.concurrent.impl.Promise$DefaultPromise.success(Promise.scala:187)
    at org.apache.spark.deploy.yarn.ApplicationMaster.org$apache$spark$deploy$yarn$ApplicationMaster$$sparkContextInitialized(ApplicationMaster.scala:408)
    at org.apache.spark.deploy.yarn.ApplicationMaster$.sparkContextInitialized(ApplicationMaster.scala:910)
    at org.apache.spark.scheduler.cluster.YarnClusterScheduler.postStartHook(YarnClusterScheduler.scala:32)
    at org.apache.spark.SparkContext.<init>(SparkContext.scala:683)
    at org.apache.spark.api.java.JavaSparkContext.<init>(JavaSparkContext.scala:58)
    at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
    at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
    at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
    at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
    at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:247)
    at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
    at py4j.Gateway.invoke(Gateway.java:238)
    at py4j.commands.ConstructorCommand.invokeConstructor(ConstructorCommand.java:80)
    at py4j.commands.ConstructorCommand.execute(ConstructorCommand.java:69)
    at py4j.GatewayConnection.run(GatewayConnection.java:238)
    at java.lang.Thread.run(Thread.java:748)
I've tried looking into resolutions, but this is getting outside my area of expertise. I want my Synapse Spark notebook to run, even on date fields earlier than 1900. Any ideas?

I was able to solve this problem by changing the overall configuration for my Spark pool (which you will probably want to do as well, unless you want to add config code to every notebook you make). To do this, open Synapse Studio, then go to Manage > Apache Spark pools, click the three dots next to your pool (which are hidden until you mouse over them, great design Microsoft), then select Apache Spark configuration.
From there, create a new configuration and add a configuration property. Enter spark.sql.parquet.int96RebaseModeInRead as the property and CORRECTED as the value. Note that spark.sql.parquet.int96RebaseModeInRead does NOT show up as a suggested property; you have to enter it yourself.
Apply your changes, save everything, and make sure your new configuration is selected. It might take a bit for the new changes to be reflected in your notebooks, but it should work from there. If you notice some funky date issues with older dates, try changing CORRECTED to LEGACY.
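If you only need the fix in one or two notebooks, an alternative is to set the properties on the built-in spark session at the top of the notebook, with no SparkContext restart required. This is a minimal sketch, assuming Spark 3.x where these are runtime SQL configs; the storage paths are placeholders:
%%pyspark
# Sketch: per-notebook alternative to the pool-level configuration.
# Synapse notebooks already provide the `spark` session object.
spark.conf.set("spark.sql.parquet.int96RebaseModeInRead", "CORRECTED")
spark.conf.set("spark.sql.parquet.int96RebaseModeInWrite", "CORRECTED")
# Depending on how the dates are stored you may need the datetime variants:
spark.conf.set("spark.sql.parquet.datetimeRebaseModeInRead", "CORRECTED")
spark.conf.set("spark.sql.parquet.datetimeRebaseModeInWrite", "CORRECTED")

# Placeholder paths, for illustration only
df = spark.read.parquet("abfss://container@account.dfs.core.windows.net/in/")
df.write.mode("overwrite").parquet("abfss://container@account.dfs.core.windows.net/out/")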

Related

scala spark raises an error related to Derby every time when doing toDF() or createDataFrame

I am new to Scala and the Scala-API Spark, and I tried the Scala-API Spark recently on my own computer, meaning I run Spark locally by setting SparkSession.builder().master("local[*]"). At first I succeeded in reading a text file using spark.sparkContext.textFile(). After getting the corresponding RDD, I tried to convert the RDD to a Spark DataFrame, but failed again and again.
To be specific, I tried two methods: 1) toDF() and 2) spark.createDataFrame(). Both failed, giving me similar errors to the one shown below.
2018-10-16 21:14:27 ERROR Schema:125 - Failed initialising database.
Unable to open a test connection to the given database. JDBC url = jdbc:derby:;databaseName=metastore_db;create=true, username = APP. Terminating connection pool (set lazyInit to true if you expect to start your database after your app). Original Exception: ------
java.sql.SQLException: Failed to start database 'metastore_db' with class loader
org.apache.spark.sql.hive.client.IsolatedClientLoader$$anon$1@199549a5, see the next exception for details.
at org.apache.derby.impl.jdbc.SQLExceptionFactory.getSQLException(Unknown Source)
at org.apache.derby.impl.jdbc.SQLExceptionFactory.getSQLException(Unknown Source)
at org.apache.derby.impl.jdbc.Util.seeNextException(Unknown Source)
at org.apache.derby.impl.jdbc.EmbedConnection.bootDatabase(Unknown Source)
at org.apache.derby.impl.jdbc.EmbedConnection.<init>(Unknown Source)
at org.apache.derby.jdbc.InternalDriver$1.run(Unknown Source)
at org.apache.derby.jdbc.InternalDriver$1.run(Unknown Source)
at java.security.AccessController.doPrivileged(Native Method)
at org.apache.derby.jdbc.InternalDriver.getNewEmbedConnection(Unknown Source)
at org.apache.derby.jdbc.InternalDriver.connect(Unknown Source)
at org.apache.derby.jdbc.InternalDriver.connect(Unknown Source)
I examined the error message; it seems the errors are related to Apache Derby and a failed connection to some database. I do not know what JDBC actually is. I am somewhat familiar with pyspark and I have never been asked to configure any JDBC database, so WHY does the Scala-API Spark need it? What should I do to avoid this error? Why does a Scala-API Spark DataFrame need JDBC or any database while a Scala-API Spark RDD doesn't?
For future googlers:
I googled for several hours and still had no idea how to get rid of this error, but the origin of the problem is very clear: my SparkSession enables support for Hive, which then needs a database to back the metastore. To solve the problem I disabled Hive support; since I am running Spark on my own Mac, it is OK to do this.
So I downloaded the Spark source and built it myself using the command below, omitting -Phive and -Phive-thriftserver:
./make-distribution.sh --name hadoop-2.6_scala-2.11 --tgz -Pyarn -Phadoop-2.6 -Dscala-2.11 -DskipTests
I tested the self-built Spark, and the metastore_db folder is never created; so far so good.
For the detail, please refer to this post: Prebuilt Spark 2.1.0 creates metastore_db folder and derby.log when launching spark-shell
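Rebuilding Spark works, but a lighter-weight option may be to create the session yourself with the in-memory catalog instead of the Hive one. This is a sketch, not verified against every Spark build, expressed in PySpark (the Scala builder API is analogous); the file path is a placeholder:
from pyspark.sql import SparkSession

# Sketch: request the in-memory catalog so no Derby-backed metastore
# (metastore_db/derby.log) is created. Must be set before the session exists.
spark = (SparkSession.builder
    .master("local[*]")
    .appName("no-hive")
    .config("spark.sql.catalogImplementation", "in-memory")
    .getOrCreate())

rdd = spark.sparkContext.textFile("some_text_file.txt")  # placeholder path
df = rdd.map(lambda line: (line,)).toDF(["line"])  # toDF works without a metastore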

Spark and Amazon S3 not setting credentials in executors

I'm writing a Spark program that reads from and writes to Amazon S3. My problem is that it works if I execute it in local mode (--master local[6]), but if I execute it on the cluster (on other machines) I get an error with the credentials:
org.apache.spark.SparkException: Job aborted due to stage failure: Task 2 in stage 1.0 failed 4 times, most recent failure: Lost task 2.3 in stage 1.0 (TID 33, mmdev02.stratio.com): com.amazonaws.AmazonClientException: Unable to load AWS credentials from any provider in the chain
at com.amazonaws.auth.AWSCredentialsProviderChain.getCredentials(AWSCredentialsProviderChain.java:117)
at com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:3521)
at com.amazonaws.services.s3.AmazonS3Client.headBucket(AmazonS3Client.java:1031)
at com.amazonaws.services.s3.AmazonS3Client.doesBucketExist(AmazonS3Client.java:994)
at org.apache.hadoop.fs.s3a.S3AFileSystem.initialize(S3AFileSystem.java:297)
at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2596)
at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:91)
at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2630)
at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2612)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:370)
at org.apache.hadoop.fs.Path.getFileSystem(Path.java:296)
at org.apache.parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:384)
at org.apache.parquet.hadoop.ParquetRecordReader.initializeInternalReader(ParquetRecordReader.java:157)
at org.apache.parquet.hadoop.ParquetRecordReader.initialize(ParquetRecordReader.java:140)
at org.apache.spark.rdd.SqlNewHadoopRDD$$anon$1.<init>(SqlNewHadoopRDD.scala:155)
at org.apache.spark.rdd.SqlNewHadoopRDD.compute(SqlNewHadoopRDD.scala:120)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:300)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:300)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:300)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
at org.apache.spark.scheduler.Task.run(Task.scala:88)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: com.amazonaws.AmazonClientException: Unable to load AWS credentials from any provider in the chain
My code is as follows:
val conf = new SparkConf().setAppName("BackupS3")
val sc = SparkContext.getOrCreate(conf)
sc.hadoopConfiguration.set("fs.s3a.access.key", accessKeyId)
sc.hadoopConfiguration.set("fs.s3a.secret.key", secretKey)
sc.hadoopConfiguration.set("fs.s3a.endpoint", "s3-" + region + ".amazonaws.com")
sc.hadoopConfiguration.set("com.amazonaws.services.s3.enableV4", "true")
sc.hadoopConfiguration.set("fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
sc.hadoopConfiguration.set("fs.s3a.buffer.dir", "/var/tmp/spark")
System.setProperty(SDKGlobalConfiguration.ENABLE_S3_SIGV4_SYSTEM_PROPERTY, "true");
System.setProperty("com.amazonaws.services.s3.enableV4", "true")
I can write to Amazon S3 but cannot read! I also had to pass some properties when doing spark-submit, because my region is Frankfurt and I had to enable V4:
--conf spark.executor.extraJavaOptions=-Dcom.amazonaws.services.s3.enableV4=true
I tried passing the credentials this way too. If I put them in hdfs-site.xml on every machine, it works.
My question is: how can I do this from code? Why are the executors not getting the config I pass them from the code?
I'm using Spark 1.5.2, hadoop-aws 2.7.1 and aws-java-sdk 1.7.4.
Thanks
Don't put the secret keys in your code; that leads to leaked secrets.
If you are running in EC2, your secrets will be picked up automatically from the IAM feature; the client asks a magic web server for session secrets.
...which means: it may be that spark's automatic credential propagation is getting in the way. Unset your AWS_ env vars before submitting the work.
If you set these properties explicitly in your code, the values will only be visible to the driver process. The executors will not have a chance to pick up those credentials.
If you set them in an actual config file like core-site.xml, they will propagate.
Your code would work in local mode because all operations are happening in a single process.
Why it works on a cluster with small files but not large ones (*): the code could also work on unpartitioned files, where read operations are performed in the driver and partitions are then broadcast to executors. On partitioned files, where executors read individual partitions, the credentials won't be set on the executors, so it fails.
Best to use standard mechanisms for passing credentials, or better yet, use EC2 roles and IAM policies in your cluster as EricJ's answer suggests. By default, if you do not provide credentials, EMRFS will look up temporary credentials via EC2 instance metadata service.
(*) I am still learning about this myself, and I may need to revise this answer as I learn more
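One way to make the explicit-credentials approach reach the executors (a sketch, not tested against Spark 1.5.2; shown in PySpark, the SparkConf calls are the same in Scala) is to set the Hadoop properties with the spark.hadoop. prefix on the SparkConf before the context is created. Spark copies spark.hadoop.* entries into the Hadoop Configuration that each executor builds, whereas sc.hadoopConfiguration.set(...) only mutates the driver's copy after the fact:
from pyspark import SparkConf, SparkContext

# access_key_id, secret_key and region are placeholders for your own values
conf = (SparkConf()
    .setAppName("BackupS3")
    .set("spark.hadoop.fs.s3a.access.key", access_key_id)
    .set("spark.hadoop.fs.s3a.secret.key", secret_key)
    .set("spark.hadoop.fs.s3a.endpoint", "s3-" + region + ".amazonaws.com"))

sc = SparkContext(conf=conf)  # executors now see the fs.s3a.* settings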

Connecting to AWS Redshift with Zeppelin Spark 2.0 and Pyspark

I need to read Redshift data into dataframes in Zeppelin. For the last several months I've been using Spark 2.0 via Zeppelin on AWS to open csv and json S3 files successfully.
I used to be able to connect to Redshift from Zeppelin on AWS EMR with Spark 1.6.2 (maybe 1.6.1), using this code:
%pyspark
from pyspark.sql import SQLContext, Row
import sys
from pyspark.sql.window import Window
import pyspark.sql.functions as func
#Load the data
aquery = "(SELECT serial_number, min(date_time) min_date_time from schema.table where serial_number in ('abcdefg','1234567') group by serial_number) as minDates"
dfMinDates = sqlContext.read.format('jdbc').options(url='jdbc:postgresql://dadadadaaaredshift.amazonaws.com:5439/idw?tcpKeepAlive=true&ssl=true&sslfactory=org.postgresql.ssl.NonValidatingFactory?user=user&password=password', dbtable=aquery).load()
dfMinDates.show()
and it worked. That was summer of 2016.
I haven't had need of it since then and now AWS has Spark 2.0.
The new syntax is myDF = spark.read.jdbc(...), like this:
%pyspark
aquery = "(SELECT serial_number, min(date_time) min_date_time from schema.table where serial_number in ('abcdefg','1234567') group by serial_number) as minDates"
dfMinDates = spark.read.jdbc("jdbc:postgresql://dadadadaaaredshift.amazonaws.com:5439/idw?tcpKeepAlive=true&ssl=true&sslfactory=org.postgresql.ssl.NonValidatingFactory?user=user&password=password", table=aquery)
dfMinDates.show()
but I get this error:
Py4JJavaError: An error occurred while calling o119.jdbc.
: java.sql.SQLException: No suitable driver
    at java.sql.DriverManager.getDriver(DriverManager.java:315)
    at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$2.apply(JdbcUtils.scala:54)
    at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$2.apply(JdbcUtils.scala:54)
    at scala.Option.getOrElse(Option.scala:121)
    at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.createConnectionFactory(JdbcUtils.scala:53)
    at org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD$.resolveTable(JDBCRDD.scala:123)
    at org.apache.spark.sql.execution.datasources.jdbc.JDBCRelation.<init>(JDBCRelation.scala:117)
    at org.apache.spark.sql.DataFrameReader.jdbc(DataFrameReader.scala:237)
    at org.apache.spark.sql.DataFrameReader.jdbc(DataFrameReader.scala:159)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:237)
    at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
    at py4j.Gateway.invoke(Gateway.java:280)
    at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:128)
    at py4j.commands.CallCommand.execute(CallCommand.java:79)
    at py4j.GatewayConnection.run(GatewayConnection.java:211)
    at java.lang.Thread.run(Thread.java:745)
I researched the Spark 2.0 documentation, and found this:
The JDBC driver class must be visible to the primordial class loader
on the client session and on all executors. This is because Java’s
DriverManager class does a security check that results in it ignoring
all drivers not visible to the primordial class loader when one goes
to open a connection. One convenient way to do this is to modify
compute_classpath.sh on all worker nodes to include your driver JARs.
I didn't know how to implement this, so I did more reading in various posts, blogs, and Stack Overflow answers, and found this:
spark.driver.extraClassPath = org.postgresql.Driver
I did this in the Interpreter settings page of Zeppelin, but still I get the same error.
I tried to add a Postgres interpreter, and I'm not sure I did it right (I wasn't sure whether to tie it to the Spark interpreter or the Python interpreter), so I chose the Spark interpreter. Now the Postgres interpreter has all the same settings as the Spark interpreter, which might not matter, but I still get the same error.
In Spark 1.6, I just don't remember going through all this trouble.
As an experiment, I spun up an EMR cluster with Spark 1.6.2 and tried the old code that used to work, and got the same error as above!
The Zeppelin site has Postgres covered but their information looks like code rather than how to set up the interpreters, so I don't know how to use it.
I'm out of ideas and references.
Any suggestions are much appreciated!
You need to use Amazon's Redshift-specific driver. You can download it here: http://docs.aws.amazon.com/redshift/latest/mgmt/configure-jdbc-connection.html.
However, if you're using EMR it's already in place (at /usr/share/aws/redshift/jdbc/RedshiftJDBC41.jar) and you can just tell Zeppelin where it is.
Here's how to declare it: AWS Redshift driver in Zeppelin
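For completeness, here is roughly what the read looks like once the jar is on the classpath. This is a sketch: the cluster hostname and credentials are placeholders, the driver class name matches the JDBC41 jar mentioned above, and aquery is the subquery from the question:
%pyspark
# Sketch: spark.read.jdbc with the Redshift driver named explicitly.
# Assumes the jar is on the classpath, e.g. via the interpreter setting
# spark.jars=/usr/share/aws/redshift/jdbc/RedshiftJDBC41.jar
dfMinDates = spark.read.jdbc(
    url="jdbc:redshift://examplecluster.abc123.us-east-1.redshift.amazonaws.com:5439/idw?ssl=true",
    table=aquery,
    properties={"user": "user",
                "password": "password",
                "driver": "com.amazon.redshift.jdbc41.Driver"})
dfMinDates.show()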

Spark is crashing when computing big files

I have a program in Scala that reads a CSV file, adds a new column to the DataFrame, and saves the result as a parquet file. It works perfectly on small files (<5 GB), but when I try to use bigger files (~80 GB) it always fails when it should write the parquet file, with this stacktrace:
16/10/20 10:03:37 WARN scheduler.TaskSetManager: Lost task 14.0 in stage 4.0 (TID 886, 10.0.0.10): java.io.EOFException: reached end of stream after reading 136445 bytes; 1245184 bytes expected
at org.spark_project.guava.io.ByteStreams.readFully(ByteStreams.java:735)
at org.apache.spark.sql.execution.UnsafeRowSerializerInstance$$anon$3$$anon$1.next(UnsafeRowSerializer.scala:127)
at org.apache.spark.sql.execution.UnsafeRowSerializerInstance$$anon$3$$anon$1.next(UnsafeRowSerializer.scala:110)
at scala.collection.Iterator$$anon$12.next(Iterator.scala:444)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
at org.apache.spark.util.CompletionIterator.next(CompletionIterator.scala:30)
at org.apache.spark.InterruptibleIterator.next(InterruptibleIterator.scala:43)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source)
at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at org.apache.spark.shuffle.sort.UnsafeShuffleWriter.write(UnsafeShuffleWriter.java:161)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:79)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:47)
at org.apache.spark.scheduler.Task.run(Task.scala:86)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
If anyone knows what could cause this, it would help me a lot!
System used
Spark 2.0.1
Scala 2.11
Hadoop HDFS 2.7.3
All running in Docker in a 6-machine cluster (each with 4 cores and 16 GB of RAM)
Example code
var df = spark.read.option("header", "true").option("inferSchema", "true").option("nullValue", "NULL").csv(hdfsFileURLIn)
df = df.withColumn("ipix", a2p(df.col(deName), df.col(raName)))
df.repartition(nPartitions, $"ipix").write.mode("overwrite").parquet(hdfsFileURLOut)
Here are a few points that might help you:
I think you should check the distribution of your ipix column data; you may have data skew, so one or a few partitions could be much bigger than the others, and a single task working on such a fat partition can fail. It probably has something to do with the output of your function a2p. I'd first try running the job without the repartitioning (just remove that call and see whether it succeeds; without the repartition call it will use the default partitioning, probably split by the size of the input CSV file). A quick way to check for skew is sketched below.
I also hope that your input CSV is not gzipped, since gzipped data is not splittable, so all the data would end up in one partition.
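For the skew check in the first point, something like this (a PySpark sketch; the Scala equivalent is just as short, and df/ipix are the names from the question) shows whether a handful of ipix values dominate:
from pyspark.sql import functions as F

# Count rows per ipix value, largest first; one huge count = data skew
df.groupBy("ipix").count().orderBy(F.desc("count")).show(20)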
Can you provide code?
Perhaps the code you wrote is running on the driver? How do you process the file?
Spark has special functionality for handling big data distributed across the cluster, for example RDDs.
Once you do:
someRdd.collect()
you bring the RDD into the driver's memory, hence not using the abilities of Spark (a small illustration follows below).
Code that handles big data should run on the slaves.
Please check this: differentiate driver code and work code in Apache Spark
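To make the collect() point concrete, a small PySpark illustration (a sketch; df stands for any large DataFrame and the output path is a placeholder):
rows = df.collect()  # anti-pattern on big data: pulls every row into driver memory

df.write.mode("overwrite").parquet("hdfs:///some/output/")  # stays distributed on the workers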
The problem looks like a read failure while decompressing a stream of shuffled data in YARN mode.
Try the following code and see how it goes.
import org.apache.spark.storage.StorageLevel

var df = spark.read.option("header", "true").option("inferSchema", "true").option("nullValue", "NULL").csv(hdfsFileURLIn)
df = df.withColumn("ipix", a2p(df.col(deName), df.col(raName))).persist(StorageLevel.MEMORY_AND_DISK)
df.repartition(nPartitions, $"ipix").write.mode("overwrite").parquet(hdfsFileURLOut)
There is also a similar issue Spark job failing in YARN mode

Table imported as parquet from sqoop not working in spark

I imported a table from an MSSQL server in parquet format with Sqoop 1.4.5. But when I try to load it from the Spark shell, it throws an error like this:
scala> val df1 = sqlContext.load("/home/bipin/Customer2")
scala.collection.parallel.CompositeThrowable: Multiple exceptions thrown during a parallel computation: java.lang.NullPointerException
parquet.format.converter.ParquetMetadataConverter.fromParquetStatistics(ParquetMetadataConverter.java:249)
parquet.format.converter.ParquetMetadataConverter.fromParquetMetadata(ParquetMetadataConverter.java:543)
parquet.format.converter.ParquetMetadataConverter.readParquetMetadata(ParquetMetadataConverter.java:520)
parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:426)
org.apache.spark.sql.parquet.ParquetRelation2$MetadataCache$$anonfun$refresh$6.apply(newParquet.scala:298)
org.apache.spark.sql.parquet.ParquetRelation2$MetadataCache$$anonfun$refresh$6.apply(newParquet.scala:297)
scala.collection.parallel.mutable.ParArray$Map.leaf(ParArray.scala:658)
scala.collection.parallel.Task$$anonfun$tryLeaf$1.apply$mcV$sp(Tasks.scala:54)
scala.collection.parallel.Task$$anonfun$tryLeaf$1.apply(Tasks.scala:53)
scala.collection.parallel.Task$$anonfun$tryLeaf$1.apply(Tasks.scala:53)
...
at scala.collection.parallel.package$$anon$1.alongWith(package.scala:87)
at scala.collection.parallel.Task$class.mergeThrowables(Tasks.scala:86)
at scala.collection.parallel.mutable.ParArray$Map.mergeThrowables(ParArray.scala:650)
at scala.collection.parallel.Task$class.tryMerge(Tasks.scala:72)
at scala.collection.parallel.mutable.ParArray$Map.tryMerge(ParArray.scala:650)
at scala.collection.parallel.AdaptiveWorkStealingTasks$WrappedTask$class.internal(Tasks.scala:190)
at scala.collection.parallel.AdaptiveWorkStealingForkJoinTasks$WrappedTask.internal(Tasks.scala:514)
at scala.collection.parallel.AdaptiveWorkStealingTasks$WrappedTask$class.compute(Tasks.scala:162)
at scala.collection.parallel.AdaptiveWorkStealingForkJoinTasks$WrappedTask.compute(Tasks.scala:514)
at scala.concurrent.forkjoin.RecursiveAction.exec(RecursiveAction.java:160)
at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
I looked at the Sqoop parquet folder and its structure is different from the one that I created in Spark. How can I make the parquet file work?
Use parquetFile instead of load; load is for generic data sources. There are more examples in the guide.
val df1 = sqlContext.parquetFile("/home/bipin/Customer2")
The Parquet versions are not compatible. Spark 1.2 uses Parquet 1.6, but your parquet file may be version 1.7 or higher, and the Parquet 1.6 reader cannot parse a Parquet 1.7 file.
The next version of Spark (maybe 1.5) will use Parquet 1.7; it already appears in the pom.xml of the master branch.
This is a bug in Spark 1.1 that comes from the Parquet library; see PARQUET-136.
For a string or binary column, if all values in a single column chunk are null, then so are the min and max values in the column chunk statistics. However, while checking the statistics for column chunk pruning, a null check is missing, which causes the NPE.
The fix was brought into Spark 1.2.0; see SPARK-3968.
The easiest fix is probably to upgrade, or if that is not possible, ensure there are no columns which contain only null values!
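If you cannot upgrade, a quick way to spot the problem columns (a PySpark sketch, assuming your DataFrame is named df) is to count non-null values per string column; F.count skips nulls, so a zero means the column is entirely null:
from pyspark.sql import functions as F

# F.count(col) counts only non-null values, so 0 == entirely-null column
string_cols = [f.name for f in df.schema.fields if f.dataType.simpleString() == "string"]
if string_cols:
    counts = df.agg(*[F.count(F.col(c)).alias(c) for c in string_cols]).first()
    print("all-null string columns:", [c for c in string_cols if counts[c] == 0])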