Spark Streaming - Parquet file upload to S3 error - scala

I'm completely new to Spark Streaming.
From a streaming application I create Parquet files of about 2.5 MB each and store them on S3 or in a local directory.
The method I'm using is as follows:
where "data" is a DataFrame.
If the destination is a local path, everything works like a charm, but as soon as I write to S3 with a path like "s3n://bucket/directory/filename", I get the following exception:
15/12/17 10:47:06 ERROR SparkUncaughtExceptionHandler: Uncaught exception in thread Thread[Executor task launch worker-3,5,main]
at$Windows.access0(Native Method)
at org.apache.hadoop.fs.FileUtil.canRead(
at org.apache.hadoop.util.DiskChecker.checkAccessByFileMethods(
at org.apache.hadoop.util.DiskChecker.checkDirAccess(
at org.apache.hadoop.util.DiskChecker.checkDir(
at org.apache.hadoop.fs.LocalDirAllocator$AllocatorPerContext.confChanged(
at org.apache.hadoop.fs.LocalDirAllocator$AllocatorPerContext.getLocalPathForWrite(
at org.apache.hadoop.fs.LocalDirAllocator$AllocatorPerContext.createTmpFileForWrite(
at org.apache.hadoop.fs.LocalDirAllocator.createTmpFileForWrite(
at org.apache.hadoop.fs.s3native.NativeS3FileSystem$NativeS3FsOutputStream.newBackupFile(
at org.apache.hadoop.fs.s3native.NativeS3FileSystem$NativeS3FsOutputStream.<init>(
at org.apache.hadoop.fs.s3native.NativeS3FileSystem.create(
at org.apache.hadoop.fs.FileSystem.create(
at org.apache.hadoop.fs.FileSystem.create(
at org.apache.hadoop.fs.FileSystem.create(
at org.apache.parquet.hadoop.ParquetFileWriter.<init>(
at org.apache.parquet.hadoop.ParquetFileWriter.<init>(
at org.apache.parquet.hadoop.ParquetOutputFormat.getRecordWriter(
at org.apache.parquet.hadoop.ParquetOutputFormat.getRecordWriter(
at org.apache.spark.sql.execution.datasources.parquet.ParquetOutputWriter.<init>(ParquetRelation.scala:94)
at org.apache.spark.sql.execution.datasources.parquet.ParquetRelation$$anon$3.newInstance(ParquetRelation.scala:272)
at org.apache.spark.sql.execution.datasources.DefaultWriterContainer.writeRows(WriterContainer.scala:234)
at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1$$anonfun$apply$mcV$sp$3.apply(InsertIntoHadoopFsRelation.scala:150)
at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1$$anonfun$apply$mcV$sp$3.apply(InsertIntoHadoopFsRelation.scala:150)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
at org.apache.spark.executor.Executor$
at java.util.concurrent.ThreadPoolExecutor.runWorker(
at java.util.concurrent.ThreadPoolExecutor$
Reading from the bucket works fine.
Despite the error, something is stored in the bucket: entries like "directory&folder" appear and folders are created for the given path, but in the end there is a "filename&folder" entry instead of the file.
Tech Details:
S3 Browser
Windows 8.1
IntelliJ CE 14.1.5
Spark Streaming Application
Spark 1.5 for Hadoop 2.6.0

The problem was in the Hadoop libraries. I had to rebuild winutils (winutils.exe) and the native library (hadoop.dll) with Windows SDK 7, then move them to %HADOOP_HOME%\bin and add %HADOOP_HOME%\bin to the Path variable. The projects to rebuild can be found under hadoop-2.7.1-src\hadoop-common-project\hadoop-common\target. For winutils I recommend using the Windows-optimized branch.
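To confirm that the files ended up where the answer says they should, a small check can be run before starting the Spark application. This is only a sketch: the object and method names (HadoopNativesCheck, missingFiles) are made up for illustration, and it assumes that HADOOP_HOME is the variable your setup relies on.

```scala
import java.io.File

object HadoopNativesCheck {
  // Files the answer above says must be present under %HADOOP_HOME%\bin.
  val requiredFiles: Seq[String] = Seq("winutils.exe", "hadoop.dll")

  // Returns the names of the required files that are missing from <hadoopHome>/bin.
  def missingFiles(hadoopHome: File): Seq[String] = {
    val bin = new File(hadoopHome, "bin")
    requiredFiles.filterNot(name => new File(bin, name).isFile)
  }

  def main(args: Array[String]): Unit =
    sys.env.get("HADOOP_HOME") match {
      case None =>
        println("HADOOP_HOME is not set")
      case Some(home) =>
        val bin = new File(home, "bin")
        val missing = missingFiles(new File(home))
        val missingList = missing.mkString(", ")
        if (missing.isEmpty) println(s"winutils.exe and hadoop.dll found under $bin")
        else println(s"Missing under $bin: $missingList")
    }
}
```

The check only looks for the files. It cannot tell whether they were built for the right Hadoop version or architecture.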


Scala Spark raises an error related to Derby every time I call toDF() or createDataFrame

I am new to Scala and the Spark Scala API. I recently tried it on my own computer, running Spark locally by setting SparkSession.builder().master("local[*]"). At first I succeeded in reading a text file using spark.sparkContext.textFile(). Once I had the corresponding RDD, I tried to convert it to a Spark DataFrame, but failed again and again.
To be specific, I tried two methods, 1) toDF() and 2) spark.createDataFrame(). Both failed with a similar error, shown below.
2018-10-16 21:14:27 ERROR Schema:125 - Failed initialising database.
Unable to open a test connection to the given database. JDBC url = jdbc:derby:;databaseName=metastore_db;create=true, username = APP. Terminating connection pool (set lazyInit to true if you expect to start your database after your app). Original Exception: ------
java.sql.SQLException: Failed to start database 'metastore_db' with class loader
org.apache.spark.sql.hive.client.IsolatedClientLoader$$anon$1#199549a5, see the next exception for details.
at org.apache.derby.impl.jdbc.SQLExceptionFactory.getSQLException(Unknown Source)
at org.apache.derby.impl.jdbc.SQLExceptionFactory.getSQLException(Unknown Source)
at org.apache.derby.impl.jdbc.Util.seeNextException(Unknown Source)
at org.apache.derby.impl.jdbc.EmbedConnection.bootDatabase(Unknown Source)
at org.apache.derby.impl.jdbc.EmbedConnection.<init>(Unknown Source)
at org.apache.derby.jdbc.InternalDriver$ Source)
at org.apache.derby.jdbc.InternalDriver$ Source)
at Method)
at org.apache.derby.jdbc.InternalDriver.getNewEmbedConnection(Unknown Source)
at org.apache.derby.jdbc.InternalDriver.connect(Unknown Source)
at org.apache.derby.jdbc.InternalDriver.connect(Unknown Source)
I examined the error message. It seems the errors are related to Apache Derby and that a connection to some database failed. I do not know what JDBC actually is. I am somewhat familiar with PySpark and have never been asked to configure a JDBC database there, so why does the Spark Scala API need it? What should I do to avoid this error? Why does a DataFrame in the Scala API need JDBC or a database when an RDD doesn't?
For future googlers:
I googled for several hours and still had no idea how to get rid of this error, but the origin of the problem is clear: my SparkSession enables Hive support, which then requires a database to be specified. To solve the problem, Hive support has to be disabled. Since I am running Spark on my own Mac, that is fine.
So I downloaded the Spark source and built it myself using the command
./ --name hadoop-2.6_scala-2.11 --tgz -Pyarn -Phadoop-2.6 -Dscala-2.11 -DskipTests
which omits -Phive and -Phive-thriftserver.
I tested the self-built Spark: the metastore_db folder has never been created, so far so good.
For details, please refer to this post: Prebuilt Spark 2.1.0 creates metastore_db folder and derby.log when launching spark-shell

Spark: LeaseExpiredException while writing large dataframe to parquet files

I have a large DataFrame which I am writing to Parquet files in HDFS. I am getting the exception below in the logs:
2018-10-15 18:31:32 ERROR Executor:91 - Exception in task 41.0 in stage 0.0 (TID 1321)
org.apache.spark.SparkException: Task failed while writing rows.
at org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:285)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:197)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:196)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
at org.apache.spark.executor.Executor$
at java.util.concurrent.ThreadPoolExecutor.runWorker(
at java.util.concurrent.ThreadPoolExecutor$
Caused by: org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException): No lease on /home/prod_out/20181007/_temporary/0/_temporary/attempt_20181015183108_0000_m_000041_0/part-00041-1185b10b-bcb1-4b7e-b732-dd6f71322b7d-c000.snappy.parquet (inode 33628528083): File does not exist. Holder DFSClient_NONMAPREDUCE_179567941_77 does not have any open files.
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkLease(
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.analyzeFileState(
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getNewBlockTargets(
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(
at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.addBlock(
at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.addBlock(
at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(
at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$
at org.apache.hadoop.ipc.RPC$
at org.apache.hadoop.ipc.Server$Handler$
at org.apache.hadoop.ipc.Server$Handler$
at Method)
at org.apache.hadoop.ipc.Server$
at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(
at com.sun.proxy.$Proxy18.addBlock(Unknown Source)
at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.addBlock(
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(
at sun.reflect.DelegatingMethodAccessorImpl.invoke(
at java.lang.reflect.Method.invoke(
at com.sun.proxy.$Proxy19.addBlock(Unknown Source)
at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.locateFollowingBlock(
at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.nextBlockOutputStream(
at org.apache.hadoop.hdfs.DFSOutputStream$
2018-10-15 18:32:06 INFO CoarseGrainedExecutorBackend:54 - Got assigned task 2189
I googled it but couldn't find any concrete solution. I set speculation to false:
but that still didn't help.
It finishes a few tasks, generates a few part files, and then abruptly stops with this error.
Spark version: 2.3.1 (this was not happening in 1.6.x).
There is only one session running, which rules out the possibility of the same location being accessed by a different session.
Any pointers?
The issue arises because, before Spark writes the data to the specified HDFS location, it first writes the data to a temporary location. This two-stage mechanism is used to ensure consistency of the final data set when working with file systems. If the write succeeds, the data is moved from the temporary location to the final one; if it fails, the data is removed from the temporary location. In your case, a different executor thread might be making changes to the temporary location, so when the original executor thread looks for its file there, the file is no longer available and the HDFS lease exception is thrown.
To avoid this exception:
Make sure you are not using any parallel collections.
Avoid multi-threading where applicable.
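The two-stage idea described above can be illustrated with plain local files. This is not Spark's or Hadoop's actual output committer, only a minimal sketch of "write to a temporary location, then move into place"; the object name TwoStageWrite and the "_temporary-" file prefix are chosen for illustration.

```scala
import java.nio.charset.StandardCharsets
import java.nio.file.{Files, Path, StandardCopyOption}

object TwoStageWrite {
  // Stage 1: write `content` to a temporary file in the same directory as `target`.
  // Stage 2: move the temporary file to `target`.
  // If either stage fails, the temporary file is removed and the error is rethrown.
  def write(target: Path, content: String): Unit = {
    val dir = target.toAbsolutePath.getParent
    val tmp = Files.createTempFile(dir, "_temporary-", ".part")
    try {
      Files.write(tmp, content.getBytes(StandardCharsets.UTF_8))
      Files.move(tmp, target, StandardCopyOption.REPLACE_EXISTING)
    } catch {
      case e: Exception =>
        Files.deleteIfExists(tmp)
        throw e
    }
  }
}
```

If something else deletes or renames the temporary file between the two stages, the move fails because the file is gone. That is the local-file analogue of the situation the answer describes, where the temporary file no longer exists when the writing task returns to it.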
This solution may be useful to you: java.lang.OutOfMemoryError: Unable to acquire 100 bytes of memory, got 0
In my case I couldn't write ORC files. I removed the coalesce option and then it worked!

Spark-Application to Local Directory

The Spark application fails with "Mkdirs failed to create".
I'm using Spark 1.6.3 and am unable to save output to my local directory: Mkdirs failed to create file:/home/zooms/output/sample1/sample1.txt/_temporary/0/_temporary/attempt_201709251225_0005_m_000000_10
(exists=false, cwd=file:/grid/1/hadoop/yarn/local/usercache/zooms/appcache/application_1504506749061_0086/container_e01_1504506749061_0086_01_000003)
Updated logs
17/09/25 13:39:02 WARN TaskSetManager: Lost task 0.0 in stage 5.0 (TID 10, Mkdirs failed to create file:/home/zooms/output/sample1/sample1.txt/_temporary/0/_temporary/attempt_201709251339_0005_m_000000_10 (exists=false, cwd=file:/grid/1/hadoop/yarn/local/usercache/zooms/appcache/application_1504506749061_0099/container_e01_1504506749061_0099_01_000003)
at org.apache.hadoop.fs.ChecksumFileSystem.create(
at org.apache.hadoop.fs.ChecksumFileSystem.create(
at org.apache.hadoop.fs.FileSystem.create(
at org.apache.hadoop.fs.FileSystem.create(
at org.apache.hadoop.mapred.TextOutputFormat.getRecordWriter(
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1$$anonfun$13.apply(PairRDDFunctions.scala:1191)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1$$anonfun$13.apply(PairRDDFunctions.scala:1183)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
at org.apache.spark.executor.Executor$
at java.util.concurrent.ThreadPoolExecutor.runWorker(
at java.util.concurrent.ThreadPoolExecutor$
val output = "file:///home/zooms/output/sample1/sample1.txt"
Make sure that the whole cluster has access to the local or specific directory.
In my case, the cluster (that is, the Spark executors) doesn't have access to the specific directory.
Here's the answer to my question.
Since I'm running in cluster mode or client mode, the workers won't be able to create the directory on each node unless you define it. Use
spark-submit -v --master local ...
Writing files to local system with Spark in Cluster mode
Why does Spark job fails to write output?

Spark is crashing when computing big files

I have a program in Scala that reads a CSV file, adds a new column to the DataFrame, and saves the result as a Parquet file. It works perfectly on small files (<5 GB), but when I try bigger files (~80 GB) it always fails at the point where it should write the Parquet file, with this stack trace:
16/10/20 10:03:37 WARN scheduler.TaskSetManager: Lost task 14.0 in stage 4.0 (TID 886, reached end of stream after reading 136445 bytes; 1245184 bytes expected
at org.apache.spark.sql.execution.UnsafeRowSerializerInstance$$anon$3$$anon$
at org.apache.spark.sql.execution.UnsafeRowSerializerInstance$$anon$3$$anon$
at scala.collection.Iterator$$anon$
at scala.collection.Iterator$$anon$
at scala.collection.Iterator$$anon$
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source)
at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(
at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at org.apache.spark.shuffle.sort.UnsafeShuffleWriter.write(
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:79)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:47)
at org.apache.spark.executor.Executor$
at java.util.concurrent.ThreadPoolExecutor.runWorker(
at java.util.concurrent.ThreadPoolExecutor$
If anyone knows what could cause this, that would help me a lot!
System used
Spark 2.0.1
Scala 2.11
Hadoop HDFS 2.7.3
All running in Docker on a 6-machine cluster (each with 4 cores and 16 GB of RAM)
Example code
// spark: the SparkSession (variable name assumed)
var df = spark.read.option("header", "true").option("inferSchema", "true").option("nullValue", "NULL").csv(hdfsFileURLIn)
df = df.withColumn("ipix", a2p(df.col(deName), df.col(raName)))
df.repartition(nPartitions, $"ipix").write.mode("overwrite").parquet(hdfsFileURLOut)
Here are a few points that might help you:
I think you should check the distribution of your ipix column. You may have data skew, where one or a few partitions are much bigger than the others, and the task working on such a fat partition may fail. It probably has something to do with the output of your function a2p. I'd first try running this job without repartitioning: just remove the repartition call and see if it succeeds (without it, Spark will use the default partition split, probably based on the size of the input CSV file).
I also hope that your input CSV is not gzipped, since gzipped data is not splittable and all of it would end up in one partition.
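What "checking the distribution" means can be shown on a plain Scala collection standing in for the ipix column. This is a sketch with hypothetical names (SkewCheck, rowsPerKey, skewRatio). On the real DataFrame the same per-key counts could be obtained with df.groupBy("ipix").count().

```scala
object SkewCheck {
  // Number of rows per key, largest group first.
  def rowsPerKey[K](keys: Seq[K]): Seq[(K, Int)] =
    keys.groupBy(identity)
      .map { case (key, group) => (key, group.size) }
      .toSeq
      .sortBy { case (_, count) => -count }

  // Size of the largest key group divided by the average group size.
  // Values far above 1.0 suggest that a few partitions would be much bigger than the rest.
  def skewRatio[K](keys: Seq[K]): Double = {
    val counts = rowsPerKey(keys).map { case (_, count) => count }
    if (counts.isEmpty) 0.0
    else counts.max.toDouble / (counts.sum.toDouble / counts.size)
  }
}
```

For example, the keys Seq(1, 1, 1, 1, 2, 3) form groups of sizes 4, 1 and 1, so the largest group is twice the average size.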
Can you provide the code?
Perhaps the code you wrote is running on the driver? How do you process the file?
Spark has special functionality for handling big data, for example RDDs.
Once you do:
you bring the RDD into driver memory and therefore don't use Spark's abilities.
Code that handles big data should run on the workers.
Please check this: differentiate driver code and work code in Apache Spark
The problem looks like a read failure while decompressing a stream of shuffled data in YARN mode.
Try the following code and see how it goes.
import org.apache.spark.storage.StorageLevel
var df = spark.read.option("header", "true").option("inferSchema", "true").option("nullValue", "NULL").csv(hdfsFileURLIn)
df = df.withColumn("ipix", a2p(df.col(deName), df.col(raName))).persist(StorageLevel.MEMORY_AND_DISK)
df.repartition(nPartitions, $"ipix").write.mode("overwrite").parquet(hdfsFileURLOut)
There is also a similar issue: Spark job failing in YARN mode

java.util.concurrent.RejectedExecutionException in Spark although driver/client has precisely same version as Server

A task that works in Spark local mode is not working on a standalone cluster running on the same machine.
The only difference is:
for the master
I am able to run SparkPi against the master at the above address and also use the Spark GUI, so the master address generally works for Spark.
Here is the (normal) Spark init code:
val sconf = new SparkConf().setMaster(master).setAppName("EpisCatalog")
val sc = new SparkContext(sconf)
Here is the stacktrace from running the program:
15/12/03 03:39:04.746 main WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
15/12/03 03:39:07.706 main WARN MetricsSystem: Using default name DAGScheduler for source because is not set.
15/12/03 03:39:27.739 appclient-registration-retry-thread ERROR SparkUncaughtExceptionHandler: Uncaught exception in thread Thread[appclient-registration-retry-thread,5,main]
java.util.concurrent.RejectedExecutionException: Task java.util.concurrent.FutureTask#b649f0b rejected from java.util.concurrent.ThreadPoolExecutor#5ef7a52b[Running, pool size = 1, active threads = 1, queued tasks = 0, completed tasks = 0]
at java.util.concurrent.ThreadPoolExecutor$AbortPolicy.rejectedExecution(
at java.util.concurrent.ThreadPoolExecutor.reject(
at java.util.concurrent.ThreadPoolExecutor.execute(
at java.util.concurrent.AbstractExecutorService.submit(
at org.apache.spark.deploy.client.AppClient$ClientEndpoint$$anonfun$tryRegisterAllMasters$1.apply(AppClient.scala:103)
at org.apache.spark.deploy.client.AppClient$ClientEndpoint$$anonfun$tryRegisterAllMasters$1.apply(AppClient.scala:102)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:245)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:245)
at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186)
at scala.collection.TraversableLike$
at scala.collection.mutable.ArrayOps$
at org.apache.spark.deploy.client.AppClient$ClientEndpoint.tryRegisterAllMasters(AppClient.scala:102)
at org.apache.spark.deploy.client.AppClient$$apache$spark$deploy$client$AppClient$ClientEndpoint$$registerWithMaster(AppClient.scala:128)
at org.apache.spark.deploy.client.AppClient$ClientEndpoint$$anon$2$$anonfun$run$1.apply$mcV$sp(AppClient.scala:139)
at org.apache.spark.util.Utils$.tryOrExit(Utils.scala:1130)
at org.apache.spark.deploy.client.AppClient$ClientEndpoint$$anon$
at java.util.concurrent.Executors$
at java.util.concurrent.FutureTask.runAndReset(
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(
at java.util.concurrent.ScheduledThreadPoolExecutor$
at java.util.concurrent.ThreadPoolExecutor.runWorker(
at java.util.concurrent.ThreadPoolExecutor$
I am running Spark 1.6.0-SNAPSHOT. It has been "installed" to the local Maven repo, and I have verified that the client is using the latest version from the local Maven repo.
I had the same problem. It could be solved by using the full host URL (which can be found on the Master Web UI, port 18080) instead of just the hostname or localhost.
So I had to use instead of mymachine
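A quick way to catch a bare hostname before handing it to setMaster is to check that the string is a complete spark://host:port URL. The sketch below uses only java.net.URI. The object name MasterUrlCheck and the host mymachine.example.com are hypothetical, and 7077 is the default standalone master port, which may differ in your setup.

```scala
import java.net.URI
import scala.util.Try

object MasterUrlCheck {
  // True when `master` is a full standalone master URL of the form spark://host:port.
  def isFullStandaloneUrl(master: String): Boolean =
    Try(new URI(master)).toOption.exists { uri =>
      uri.getScheme == "spark" && uri.getHost != null && uri.getPort > 0
    }

  def main(args: Array[String]): Unit = {
    // Hypothetical values, for illustration only.
    println(isFullStandaloneUrl("spark://mymachine.example.com:7077"))
    println(isFullStandaloneUrl("mymachine"))
  }
}
```

The check validates the shape of the string only. The value still has to match the URL that the Master Web UI reports.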
I had the same problem, and in my case there was a version mismatch: my Spark driver was built against version 1.5.1 while the Spark cluster was set up on 1.6.0.
Maybe you deployed the cluster on the stable version, which at that time was 1.5.1.