Spark job using HBase fails - scala

Any Spark job I run that involves HBase access results in the errors below. My own jobs are in Scala, but the supplied Python examples fail the same way. The cluster is Cloudera, running CDH 5.4.4. The same jobs run fine on a different cluster with CDH 5.3.1.
Any help is greatly appreciated!
...
15/08/15 21:46:30 WARN TableInputFormatBase: initializeTable called multiple times. Overwriting connection and table reference; TableInputFormatBase will not close these old references when done.
...
15/08/15 21:46:32 WARN TaskSetManager: Lost task 0.0 in stage 0.0 (TID 0, some.server.name): java.io.IOException: Cannot create a record reader because of a previous error. Please look at the previous logs lines from the task's full log for more details.
at org.apache.hadoop.hbase.mapreduce.TableInputFormatBase.createRecordReader(TableInputFormatBase.java:163)
...
Caused by: java.lang.IllegalStateException: The input format instance has not been properly initialized. Ensure you call initializeTable either in your constructor or initialize method
at org.apache.hadoop.hbase.mapreduce.TableInputFormatBase.getTable(TableInputFormatBase.java:389)
at org.apache.hadoop.hbase.mapreduce.TableInputFormatBase.createRecordReader(TableInputFormatBase.java:158)
... 14 more

Run spark-shell with these parameters:
--driver-class-path .../cloudera/parcels/CDH/lib/hbase/lib/htrace-core-3.1.0-incubating.jar --driver-java-options "-Dspark.executor.extraClassPath=.../cloudera/parcels/CDH/lib/hbase/lib/htrace-core-3.1.0-incubating.jar"
Why it works is described here.

Related

Why does Apache Spark perform some checks and raise exceptions at job runtime, but never throw them during unit tests?

There was a bug in my Scala code: it formats the timestamp as a date string, which is then concatenated into a non-timestamp String column of the Spark Streaming job:
concat(date_format(col("timestamp"),"yyyy-MM-DD'T'HH:mm:ss.SSS'Z'")
During the tests everything looked fine: the tests that send the messages to Kafka passed, and I was able to see those messages in the Kafka Tool.
Note the "292nd" of October there, caused by DD (day of year) instead of dd (day of month) in the format pattern.
But in the executor an extra check failed and the job crashed:
Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: Task 1 in stage 8.0 failed 1 times, most recent failure: Lost task 1.0 in stage 8.0 (TID 12, kafkadatageneratorjob-driver, executor driver): org.apache.spark.SparkUpgradeException: You may get a different result due to the upgrading of Spark 3.0: Fail to format it to '2021-10-292T14:27:12.577Z' in the new formatter. You can set spark.sql.legacy.timeParserPolicy to LEGACY to restore the behavior before Spark 3.0, or set to CORRECTED and treat it as an invalid datetime string.
How can I enable the same strict check in the unit tests, so that they also fail on it without explicitly asserting on the value, just by forcing the timeParserPolicy check to run in the tests as well?
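To illustrate the DD versus dd difference described above, here is a minimal sketch that uses the JDK's legacy java.text.SimpleDateFormat rather than Spark's date_format. The names DayPatternDemo and formatUtc are hypothetical, and the input instant 2021-10-19T14:27:12.577Z is an assumption inferred from day 292 of 2021 in the error message. It only shows what the pattern letters mean; it does not reproduce Spark's timeParserPolicy check.

```scala
import java.text.SimpleDateFormat
import java.time.Instant
import java.util.{Date, TimeZone}

object DayPatternDemo {
  // Formats an epoch-millisecond instant in UTC with the given legacy pattern.
  def formatUtc(pattern: String, epochMillis: Long): String = {
    val fmt = new SimpleDateFormat(pattern)
    fmt.setTimeZone(TimeZone.getTimeZone("UTC"))
    fmt.format(new Date(epochMillis))
  }

  def main(args: Array[String]): Unit = {
    // Assumed input: 19 October 2021 is day 292 of the year.
    val ts = Instant.parse("2021-10-19T14:27:12.577Z").toEpochMilli

    // DD is "day in year", so the day field becomes 292.
    println(formatUtc("yyyy-MM-DD'T'HH:mm:ss.SSS'Z'", ts)) // 2021-10-292T14:27:12.577Z

    // dd is "day in month", which is what was intended.
    println(formatUtc("yyyy-MM-dd'T'HH:mm:ss.SSS'Z'", ts)) // 2021-10-19T14:27:12.577Z
  }
}
```

A unit test that asserts on the formatted string for a date late in the year would catch this kind of pattern mistake regardless of the Spark policy setting.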

Flink job can't use savepoint in a batch job

Let me start in a generic fashion to see if I somehow missed some concepts: I have a streaming Flink job from which I created a savepoint. A simplified version of this job looks like this.
Pseudo-code:
val flink = StreamExecutionEnvironment.getExecutionEnvironment
val stream = if (batchMode) {
  flink.readFile(path)
}
else {
  flink.addKafkaSource(topicName)
}
stream.keyBy(key)
stream.process(new ProcessorWithKeyedState())
CassandraSink.addSink(stream)
This works fine as long as I run the job without a savepoint. If I start the job from a savepoint, I get an exception which looks like this:
Caused by: java.lang.UnsupportedOperationException: Checkpoints are not supported in a single key state backend
at org.apache.flink.streaming.api.operators.sorted.state.NonCheckpointingStorageAccess.resolveCheckpoint(NonCheckpointingStorageAccess.java:43)
at org.apache.flink.runtime.checkpoint.CheckpointCoordinator.restoreSavepoint(CheckpointCoordinator.java:1623)
at org.apache.flink.runtime.scheduler.SchedulerBase.tryRestoreExecutionGraphFromSavepoint(SchedulerBase.java:362)
at org.apache.flink.runtime.scheduler.SchedulerBase.createAndRestoreExecutionGraph(SchedulerBase.java:292)
at org.apache.flink.runtime.scheduler.SchedulerBase.<init>(SchedulerBase.java:249)
I could work around this if I set the option:
execution.batch-state-backend.enabled: false
but this eventually results in another error:
Caused by: java.lang.IllegalArgumentException: The fraction of memory to allocate should not be 0. Please make sure that all types of managed memory consumers contained in the job are configured with a non-negative weight via `taskmanager.memory.managed.consumer-weights`.
at org.apache.flink.util.Preconditions.checkArgument(Preconditions.java:160)
at org.apache.flink.runtime.memory.MemoryManager.validateFraction(MemoryManager.java:673)
at org.apache.flink.runtime.memory.MemoryManager.computeMemorySize(MemoryManager.java:653)
at org.apache.flink.runtime.memory.MemoryManager.getSharedMemoryResourceForManagedMemory(MemoryManager.java:526)
Of course I tried to set the config key taskmanager.memory.managed.consumer-weights (I used DATAPROC:70,PYTHON:30), but this doesn't seem to have any effect.
So I wonder if I have a conceptual error and can't reuse savepoints from a streaming job in a batch job or if I simply have a problem in my configuration. Any hints?
After a hint from the Flink user group it turned out that it is NOT possible to reuse a savepoint from the streaming job in batch mode (https://ci.apache.org/projects/flink/flink-docs-master/docs/dev/datastream/execution_mode/#state-backends--state). So instead of running the job in batch mode (flink.setRuntimeMode(RuntimeExecutionMode.BATCH)) I just run it in the default execution mode (STREAMING). This has the minor downside that it runs forever and has to be stopped by someone once all data has been processed.
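The mode switch described above could be sketched roughly as follows. This is only an illustration, assuming Flink 1.12 or later and using the Java StreamExecutionEnvironment from Scala; the object RuntimeModeSketch, the helper chooseMode, and the --from-savepoint flag are hypothetical and not part of the original job. The bounded fromSequence source merely stands in for the real file or Kafka source.

```scala
import org.apache.flink.api.common.RuntimeExecutionMode
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment

object RuntimeModeSketch {
  // Hypothetical helper: a job that must restore state from a streaming
  // savepoint stays in STREAMING mode; otherwise BATCH can be used.
  def chooseMode(restoreFromSavepoint: Boolean): RuntimeExecutionMode =
    if (restoreFromSavepoint) RuntimeExecutionMode.STREAMING
    else RuntimeExecutionMode.BATCH

  def main(args: Array[String]): Unit = {
    // Hypothetical command-line flag, used only for this illustration.
    val restoreFromSavepoint = args.contains("--from-savepoint")

    val env = StreamExecutionEnvironment.getExecutionEnvironment
    env.setRuntimeMode(chooseMode(restoreFromSavepoint))

    // Placeholder bounded source standing in for the real input.
    env.fromSequence(1L, 5L).print()
    env.execute("runtime-mode-sketch")
  }
}
```

With an unbounded source such as Kafka, the STREAMING branch does not terminate on its own, which matches the downside mentioned above.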

Spark streaming Redis Read Time Out with Scala

I get the error below while reading a table from Redis.
The code below normally works well.
val readDF= spark.sparkContext.fromRedisKeyPattern(tableName,5).getHash().toDS()
It works for tables with fewer than 2 million rows, but when I read a big table I get this error:
18/10/11 17:08:25 ERROR Executor: Exception in task 37.0 in stage 3.0 (TID 338)
redis.clients.jedis.exceptions.JedisConnectionException: java.net.SocketTimeoutException: Read timed out
at redis.clients.util.RedisInputStream.ensureFill(RedisInputStream.java:202)
at redis.clients.util.RedisInputStream.readByte(RedisInputStream.java:40)
val redis = spark.sparkContext.fromRedisKeyPattern(tableName,100).getHash().toDS()
I also changed some settings on Redis, but I don't think that is the cause.
Do you know how I can solve this problem?

Spark and Amazon S3 not setting credentials in executors

I'm writing a Spark program that reads from and writes to Amazon S3. It works if I execute it in local mode (--master local[6]), but if I execute it on the cluster (on other machines) I get an error with the credentials:
org.apache.spark.SparkException: Job aborted due to stage failure: Task 2 in stage 1.0 failed 4 times, most recent failure: Lost task 2.3 in stage 1.0 (TID 33, mmdev02.stratio.com): com.amazonaws.AmazonClientException: Unable to load AWS credentials from any provider in the chain
at com.amazonaws.auth.AWSCredentialsProviderChain.getCredentials(AWSCredentialsProviderChain.java:117)
at com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:3521)
at com.amazonaws.services.s3.AmazonS3Client.headBucket(AmazonS3Client.java:1031)
at com.amazonaws.services.s3.AmazonS3Client.doesBucketExist(AmazonS3Client.java:994)
at org.apache.hadoop.fs.s3a.S3AFileSystem.initialize(S3AFileSystem.java:297)
at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2596)
at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:91)
at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2630)
at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2612)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:370)
at org.apache.hadoop.fs.Path.getFileSystem(Path.java:296)
at org.apache.parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:384)
at org.apache.parquet.hadoop.ParquetRecordReader.initializeInternalReader(ParquetRecordReader.java:157)
at org.apache.parquet.hadoop.ParquetRecordReader.initialize(ParquetRecordReader.java:140)
at org.apache.spark.rdd.SqlNewHadoopRDD$$anon$1.<init>(SqlNewHadoopRDD.scala:155)
at org.apache.spark.rdd.SqlNewHadoopRDD.compute(SqlNewHadoopRDD.scala:120)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:300)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:300)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:300)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
at org.apache.spark.scheduler.Task.run(Task.scala:88)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: com.amazonaws.AmazonClientException: Unable to load AWS credentials from any provider in the chain
My code is as follows:
val conf = new SparkConf().setAppName("BackupS3")
val sc = SparkContext.getOrCreate(conf)
sc.hadoopConfiguration.set("fs.s3a.access.key", accessKeyId)
sc.hadoopConfiguration.set("fs.s3a.secret.key", secretKey)
sc.hadoopConfiguration.set("fs.s3a.endpoint", "s3-" + region + ".amazonaws.com")
sc.hadoopConfiguration.set("com.amazonaws.services.s3.enableV4", "true")
sc.hadoopConfiguration.set("fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
sc.hadoopConfiguration.set("fs.s3a.buffer.dir", "/var/tmp/spark")
System.setProperty(SDKGlobalConfiguration.ENABLE_S3_SIGV4_SYSTEM_PROPERTY, "true");
System.setProperty("com.amazonaws.services.s3.enableV4", "true")
I can write to Amazon S3 but cannot read! I also had to pass some properties to spark-submit, because my region is Frankfurt and I had to enable V4:
--conf spark.executor.extraJavaOptions=-Dcom.amazonaws.services.s3.enableV4=true
I tried passing the credentials this way too. If I put them in hdfs-site.xml on every machine, it works.
My question is: how can I do it from code? Why are the executors not getting the config I pass them from the code?
I'm using Spark 1.5.2, hadoop-aws 2.7.1 and aws-java-sdk 1.7.4.
Thanks
Don't put secrets in the keys; that leads to loss of secrets.
If you are running in EC2, your secrets will be picked up automatically from the IAM feature; the client asks a magic web server for session secrets.
...which means: it may be that spark's automatic credential propagation is getting in the way. Unset your AWS_ env vars before submitting the work.
If you set these properties explicitly in your code, the values will only be visible to the driver process. The executors will not have a chance to pick up those credentials.
If you set them in an actual config file like core-site.xml, they will propagate.
Your code would work in local mode because all operations are happening in a single process.
Why it works on a cluster on small files but not large ones (*): The code could also work on unpartitioned files, where read operations are performed in the driver and partitions are then broadcast to executors. On partitioned files, where executors read individual partitions, the credentials won't be set on the executors so it fails.
Best to use standard mechanisms for passing credentials, or better yet, use EC2 roles and IAM policies in your cluster as EricJ's answer suggests. By default, if you do not provide credentials, EMRFS will look up temporary credentials via EC2 instance metadata service.
(*) I am still learning about this myself, and I may need to revise this answer as I learn more
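One way to pass these settings as Spark configuration rather than mutating sc.hadoopConfiguration on the driver is the spark.hadoop.* prefix: Spark copies such entries into the Hadoop configuration it builds, and the SparkConf itself is shipped to the executors. The following is a minimal sketch under that assumption; the object S3ConfSketch and the helper buildConf are hypothetical, and whether this resolves the read failure on Spark 1.5.2 with hadoop-aws 2.7.1 has not been verified here.

```scala
import org.apache.spark.SparkConf

object S3ConfSketch {
  // Hypothetical helper: builds a SparkConf in which every Hadoop/S3A
  // setting carries the spark.hadoop. prefix, so it travels with the
  // Spark configuration instead of living only in the driver process.
  def buildConf(accessKeyId: String, secretKey: String, region: String): SparkConf =
    new SparkConf()
      .setAppName("BackupS3")
      .set("spark.hadoop.fs.s3a.access.key", accessKeyId)
      .set("spark.hadoop.fs.s3a.secret.key", secretKey)
      .set("spark.hadoop.fs.s3a.endpoint", "s3-" + region + ".amazonaws.com")
      .set("spark.hadoop.fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
      // The V4 signing switch is a JVM system property, so it is passed
      // to the executors as a Java option, as in the question.
      .set("spark.executor.extraJavaOptions", "-Dcom.amazonaws.services.s3.enableV4=true")
}
```

The resulting conf would be handed to SparkContext.getOrCreate(conf) as in the question. As noted above, IAM roles avoid handling the secrets in code at all.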

Cannot load main class from JAR file

I have a Spark Scala application. I tried to display a simple message, "Hello my App". When I compile it with sbt compile and run it with sbt run, it works: my message is displayed, but an error follows, like this:
Hello my application!
16/11/27 15:17:11 ERROR Utils: uncaught error in thread SparkListenerBus, stopping SparkContext
java.lang.InterruptedException
ERROR ContextCleaner: Error in cleaning thread
java.lang.InterruptedException
at org.apache.spark.ContextCleaner$$anon$1.run(ContextCleaner.scala:67)
16/11/27 15:17:11 INFO SparkUI: Stopped Spark web UI at http://10.0.2.15:4040
[success] Total time: 13 s, completed Nov 27, 2016 3:17:12 PM
16/11/27 15:17:12 INFO DiskBlockManager: Shutdown hook called
I can't tell whether that is fine or not!
Also, when I try to submit my JAR file after the run, it displays an error.
My command line looks like:
spark-submit "appfilms" --master local[4] target/scala-2.11/system-of-recommandation_2.11-1.0.jar
And the error is:
Error: Cannot load main class from JAR file:/root/projectFilms/appfilms
Run with --help for usage help or --verbose for debug output
16/11/27 15:24:11 INFO Utils: Shutdown hook called
Please can you answer me!
The error occurs because the SparkContext is not stopped; stopping it is required from Spark 2.x onward.
To prevent this error, stop the context with SparkContext.stop(), i.e. sc.stop(). This solution is based on my own experience and the following sources: Spark Context, Spark Listener Bus error
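A minimal sketch of an application that always stops its context might look like this. The object name appfilms is taken from the question, while the package name packagename is a placeholder, and the local[4] fallback master is an assumption so that the same code also runs under sbt run.

```scala
package packagename

import org.apache.spark.{SparkConf, SparkContext}

object appfilms {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("appfilms")
      // Only used when no master is supplied, e.g. under sbt run;
      // spark-submit --master takes precedence.
      .setIfMissing("spark.master", "local[4]")
    val sc = new SparkContext(conf)
    try {
      println("Hello my application!")
    } finally {
      // Stop the context explicitly so that its background threads
      // shut down cleanly before the JVM exits.
      sc.stop()
    }
  }
}
```

With this layout the class passed to spark-submit would be packagename.appfilms.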
You forgot to use the --class parameter. Instead of
spark-submit "appfilms" --master local[4] target/scala-2.11/system-of-recommandation_2.11-1.0.jar
use
spark-submit --class "appfilms" --master local[4] target/scala-2.11/system-of-recommandation_2.11-1.0.jar
Please note that if appfilms belongs to a package, don't forget to add the package name, as below:
packagename.appfilms
I believe this will suffice.