Class org.apache.spark.SparkException, java.lang.NoClassDefFoundError: Could not initialize class XXX - scala

I get a Class org.apache.spark.SparkException, java.lang.NoClassDefFoundError: Could not initialize class XXX (the class where the field validation lives) exception when I try to do field validations on a Spark DataFrame. Here is my code.
All classes and objects used are serializable. The job fails on AWS EMR, but works fine on my local machine.
val newSchema = df.schema.add("errorList", ArrayType(new StructType()
.add("fieldName" , StringType)
.add("value" , StringType)
.add("message" , StringType)))
// validators is a sequence of validations on columns in a Row.
// Validator method signature:
// def checkForErrors(row: Row): (String, String, String) = {
//   logic to validate the field in a row
// }
val validateRow: Row => Row = (row: Row)=>{
val errorList = validators.map(validator => validator.checkForErrors(row))
Row.merge(row, Row(errorList))
}
val validateDf = df.map(validateRow)(RowEncoder.apply(newSchema))
Versions: Spark 2.4.7 and Scala 2.11.8.
Any ideas on why this might happen, or has anyone run into the same issue?

I faced a very similar problem with EMR release 6.8.0 - in particular, the spark.jars configuration was not respected for me on EMR (I pointed it at the location of a JAR in S3), even though it is a normally accepted Spark parameter.
For me, the solution was to follow this guide ("How do I resolve the java.lang.ClassNotFoundException in Spark on Amazon EMR?"):
https://aws.amazon.com/premiumsupport/knowledge-center/emr-spark-classnotfoundexception/
In CDK (where our EMR cluster definition lives), I set up an EMR step, executed immediately after cluster creation, that rewrites spark.driver.extraClassPath and spark.executor.extraClassPath to also contain the location of my additional JAR (in my case the JAR physically comes in a Docker image, but you could also set up a bootstrap action to copy it onto the cluster from S3), as per the script in the article under "For Amazon EMR release version 6.0.0 and later".
The reason you have to do this "rewriting" is that EMR already populates spark.*.extraClassPath with a number of its own JAR locations, e.g. for the JARs that contain the S3 drivers, so you effectively have to append your own JAR location rather than simply setting spark.*.extraClassPath to your location. If you do the latter (I tried it), you lose a lot of EMR functionality, such as being able to read from S3.
#!/bin/bash
#
# This is an example of script_b.sh for changing /etc/spark/conf/spark-defaults.conf
#
while [ ! -f /etc/spark/conf/spark-defaults.conf ]
do
sleep 1
done
#
# Now the file is available, do your work here
#
sudo sed -i '/spark.*.extraClassPath/s/$/:\/home\/hadoop\/extrajars\/\*/' /etc/spark/conf/spark-defaults.conf
exit 0
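For illustration, the appended entries in /etc/spark/conf/spark-defaults.conf then look roughly like this (the EMR-provided paths are abbreviated and only illustrative):
spark.driver.extraClassPath /usr/lib/hadoop-lzo/lib/*:/usr/share/aws/emr/emrfs/lib/*:...:/home/hadoop/extrajars/*
spark.executor.extraClassPath /usr/lib/hadoop-lzo/lib/*:/usr/share/aws/emr/emrfs/lib/*:...:/home/hadoop/extrajars/*
Note that the existing EMR entries are kept and /home/hadoop/extrajars/* is appended, rather than the properties being replaced outright.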

Related

Stream data from flink to S3

I am using Flink on Amazon EMR and want to stream the results of my pipeline to an S3 bucket.
I am using Flink version 1.11.2.
This is a snippet of how the code looks right now:
val outputPath = new Path("s3://test/flinkStreamTest/failureLogs/dt=2021-04-15/")
val sink: StreamingFileSink[String] = StreamingFileSink
.forRowFormat(outputPath, new SimpleStringEncoder[String]("UTF-8"))
.withRollingPolicy(
DefaultRollingPolicy.builder()
.withRolloverInterval(TimeUnit.MINUTES.toMillis(15))
.withInactivityInterval(TimeUnit.MINUTES.toMillis(5))
.withMaxPartSize(1024 * 1024 * 1024)
.build()
)
.build()
val enrichedStream = AsyncDataStream
.unorderedWait(
resConsumer,
new AsyncElasticRequest(elasticIndexName, elasticHost, elasticPort),
asyncTimeOut.toInt, TimeUnit.MILLISECONDS,
asyncCapacity.toInt
) // this is my pipeline result. it returns a string
enrichedStream.addSink(sink)
env.execute("run pipeline") // this is just to run the pipeline
And this is the error I am currently getting:
java.lang.UnsupportedOperationException: Recoverable writers on Hadoop are only supported for HDFS
at org.apache.flink.runtime.fs.hdfs.HadoopRecoverableWriter.<init>(HadoopRecoverableWriter.java:61)
at org.apache.flink.runtime.fs.hdfs.HadoopFileSystem.createRecoverableWriter(HadoopFileSystem.java:202)
at org.apache.flink.core.fs.SafetyNetWrapperFileSystem.createRecoverableWriter(SafetyNetWrapperFileSystem.java:69)
at org.apache.flink.streaming.api.functions.sink.filesystem.StreamingFileSink$RowFormatBuilder.createBuckets(StreamingFileSink.java:260)
at org.apache.flink.streaming.api.functions.sink.filesystem.StreamingFileSink.initializeState(StreamingFileSink.java:396)
at org.apache.flink.streaming.util.functions.StreamingFunctionUtils.tryRestoreFunction(StreamingFunctionUtils.java:185)
at org.apache.flink.streaming.util.functions.StreamingFunctionUtils.restoreFunctionState(StreamingFunctionUtils.java:167)
at org.apache.flink.streaming.api.operators.AbstractUdfStreamOperator.initializeState(AbstractUdfStreamOperator.java:96)
at org.apache.flink.streaming.api.operators.StreamOperatorStateHandler.initializeOperatorState(StreamOperatorStateHandler.java:106)
at org.apache.flink.streaming.api.operators.AbstractStreamOperator.initializeState(AbstractStreamOperator.java:258)
at org.apache.flink.streaming.runtime.tasks.OperatorChain.initializeStateAndOpenOperators(OperatorChain.java:290)
at org.apache.flink.streaming.runtime.tasks.StreamTask.lambda$beforeInvoke$0(StreamTask.java:479)
at org.apache.flink.streaming.runtime.tasks.StreamTaskActionExecutor$1.runThrowing(StreamTaskActionExecutor.java:47)
at org.apache.flink.streaming.runtime.tasks.StreamTask.beforeInvoke(StreamTask.java:475)
at org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:528)
at org.apache.flink.runtime.taskmanager.Task.doRun(Task.java:721)
at org.apache.flink.runtime.taskmanager.Task.run(Task.java:546)
at java.lang.Thread.run(Thread.java:748)
I have placed the s3-fs-hadoop jar file in the plugins/s3-fs-hadoop folder.
I also have the same s3-fs-hadoop jar in /usr/lib/flink/lib, just in case Flink looks for it in that folder as well.
Can someone please help? I have searched and searched but can't seem to resolve it.
Thanks
I figured it out. I needed to restart the entire long-running Flink application (not just restart the job).
I also had to remove the s3-fs-hadoop jar I had placed in the /usr/lib/flink/lib directory, but kept a copy of it in the plugins/s3-fs-hadoop folder.
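For reference, the layout that ended up working looks roughly like this (the jar version matches Flink 1.11.2; the paths are illustrative, based on the EMR install location mentioned above):
/usr/lib/flink/
  lib/                              (no flink-s3-fs-hadoop jar here)
  plugins/
    s3-fs-hadoop/
      flink-s3-fs-hadoop-1.11.2.jar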

Spark DataFrame writes part files to _temporary instead of creating part files directly in the output directory [duplicate]

We are running Spark 2.3.0 on AWS EMR. The following DataFrame "df" is non-empty and of modest size:
scala> df.count
res0: Long = 4067
Writing df to HDFS and reading it back works fine:
scala> val hdf = spark.read.parquet("/tmp/topVendors")
hdf: org.apache.spark.sql.DataFrame = [displayName: string, cnt: bigint]
scala> hdf.count
res4: Long = 4067
However, using the same code to write to a local parquet or csv file ends up with empty results:
df.repartition(1).write.mode("overwrite").parquet("file:///tmp/topVendors")
scala> val locdf = spark.read.parquet("file:///tmp/topVendors")
org.apache.spark.sql.AnalysisException: Unable to infer schema for Parquet. It must be specified manually.;
at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$9.apply(DataSource.scala:207)
at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$9.apply(DataSource.scala:207)
at scala.Option.getOrElse(Option.scala:121)
We can see why it fails:
ls -l /tmp/topVendors
total 0
-rw-r--r-- 1 hadoop hadoop 0 Jul 30 22:38 _SUCCESS
So there is no parquet file being written.
I have tried this maybe twenty times, for both csv and parquet and on two different EMR servers: the same behavior is exhibited in all cases.
Is this an EMR-specific bug? A more general EC2 bug? Something else? The same code works in Spark on macOS.
In case it matters - here is the versioning info:
Release label:emr-5.13.0
Hadoop distribution:Amazon 2.8.3
Applications:Spark 2.3.0, Hive 2.3.2, Zeppelin 0.7.3
That is not a bug; it is the expected behavior. Spark does not really support writes to non-distributed storage (it works in local mode only because you then have a shared file system).
A local path is not interpreted (only) as a path on the driver (that would require collecting the data) but as a local path on each executor. Therefore each executor writes its own chunk to its own local file system.
Not only is the output not readable back (to load the data, each executor and the driver would have to see the same state of the file system), but depending on the commit algorithm it might not even be finalized (moved out of the temporary directory).
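A minimal sketch of the usual workarounds, assuming the data is small (paths are illustrative): write to distributed storage and copy the result down afterwards, or collect the rows to the driver and work with them there.
// 1) write to HDFS (or S3) and fetch the result afterwards, e.g. with `hdfs dfs -get /tmp/topVendors /tmp/`
df.repartition(1).write.mode("overwrite").parquet("/tmp/topVendors")
// 2) or bring the rows to the driver first (only sensible if they fit in driver memory)
val rows: Array[org.apache.spark.sql.Row] = df.collect()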
This error usually occurs when you try to read an empty directory as parquet.
You could check:
1. whether the DataFrame is empty with df.rdd.isEmpty() before writing it (see the sketch below);
2. whether the path you are giving is correct.
Also, in what mode are you running your application? Try running it in client mode if you are running in cluster mode.
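A quick way to do check (1), as a minimal sketch reusing the write call from the question:
if (!df.rdd.isEmpty()) {
  df.repartition(1).write.mode("overwrite").parquet("file:///tmp/topVendors")
}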

Spark 2.3 dynamic partitionBy not working on S3 AWS EMR 5.13.0

Dynamic partitioning, introduced in Spark 2.3, doesn't seem to work on AWS EMR 5.13.0 when writing to S3.
When executing, a temporary directory is created in S3, but it disappears once the process completes, without the new data having been written to the final folder structure.
The issue was found when executing a Scala/Spark 2.3 application on EMR 5.13.0.
The configuration is as follows:
val spark = SparkSession
.builder
.appName(MyClass.getClass.getSimpleName)
.getOrCreate()
spark.conf.set("spark.sql.sources.partitionOverwriteMode","DYNAMIC") // also tried "dynamic"
The code that writes to S3:
val myDataset : Dataset[MyType] = ...
val w = myDataset
.coalesce(10)
.write
.option("encoding", "UTF-8")
.option("compression", "snappy")
.mode("overwrite")
.partitionBy("col_1","col_2")
w.parquet(s"$destinationPath/" + Constants.MyTypeTableName)
With destinationPath being an S3 bucket/folder.
Has anyone else experienced this issue?
Upgrading to EMR 5.19 fixes the problem. However, my previous answer is incorrect: using the EMRFS S3-optimized committer has nothing to do with it, since that committer is silently skipped when spark.sql.sources.partitionOverwriteMode is set to dynamic: https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-spark-committer-reqs.html
If you can upgrade to at least EMR 5.19.0, AWS's EMRFS S3-optimized Committer solves these issues.
--conf spark.sql.parquet.fs.optimized.committer.optimization-enabled=true
See: https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-spark-s3-optimized-committer.html

spark join fails with exception "ClassNotFoundException: org.apache.spark.rdd.RDD$" but runs when pasted into spark-shell on the Hadoop cluster

I am trying to filter records from a file (facts) based on values from another file (list) using join.
case class CDR(no:Int,nm:String)
val facts = sc.textFile("/temp_scv/a.csv").map( (line) => { val cols = line.split(",");new CDR(cols(0).toInt,cols(1)); }).keyBy( (cdr:CDR) => cdr.no)
val list = sc.textFile("/temp_scv/b.csv").keyBy( (no) => no.toInt)
val filtered = facts.join(list)
When I package this as a jar and execute it on the Hadoop cluster using spark-submit, it fails with the exception:
ClassNotFoundException: org.apache.spark.rdd.RDD$
However, the same code runs fine when I paste it into spark-shell on the Hadoop cluster.
It was a version mismatch. I am using Spark 1.2.0 on the cluster, and the code was compiled with spark-core version 1.3.0.
Compiling the code with the same spark-core version resolved the issue.
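As a minimal sketch, the relevant build.sbt lines pin spark-core to the cluster's version and mark it "provided" so the cluster's own jars are used at runtime (the Scala version shown is an assumption, matching what Spark 1.2.0 shipped with):
scalaVersion := "2.10.4"
libraryDependencies += "org.apache.spark" %% "spark-core" % "1.2.0" % "provided"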

Sharing data between nodes using Apache Spark

Here is how I launch the Spark job :
./bin/spark-submit \
--class MyDriver\
--master spark://master:7077 \
--executor-memory 845M \
--deploy-mode client \
./bin/SparkJob-0.0.1-SNAPSHOT.jar
The class MyDriver accesses the spark context using :
val sc = new SparkContext(new SparkConf())
val dataFile= sc.textFile("/data/example.txt", 1)
In order to run this within a cluster I copy the file "/data/example.txt" to all nodes within the cluster. Is there a mechanism in Spark to share this data file between nodes without copying it manually? I don't think I can use a broadcast variable in this case?
Update:
An option is to have a dedicated file server which shares the file to be processed : val dataFile= sc.textFile("http://fileserver/data/example.txt", 1)
sc.textFile("/some/file.txt") read a file distributed in hdfs, i.e.:
/some/file.txt is (already) split in multiple parts which are distributed a couple of computers each.
and each worker/task read one parts of the file. This is useful because you don't need to manage which part yourself.
If you have copied the file to each worker node, you can read it in every task:
val myRdd = sc.parallelize(1 to 100) // 100 tasks
val fileReadEveryWhere = myRdd.map( _ => read("/my/file.txt") )
and have the code of read(...) implemented somewhere.
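A minimal sketch of such a read(...) helper, assuming the file is a small text file present at the same path on every worker node (the helper itself is hypothetical, not part of the original code):
import scala.io.Source
def read(path: String): String = {
  // reads from the local file system of whichever machine executes this
  val src = Source.fromFile(path)
  try src.mkString finally src.close()
}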
Otherwise, you can also use a broadcast variable that is sent from the driver to all workers:
val myObject = read("/my/file.txt") // obj instantiated on driver node
val bdObj = sc.broadcast(myObject)
val myRdd = sc.parallelize(1 to 100)
.map{ i =>
// use bdObj in task i, ex:
bdObj.value.process(i)
}
In this case, myObject should be serializable, and it is better if it is not too big.
Also, the method read(...) is then run on the driver machine, so you only need the file on the driver. But if you don't know which machine that is (e.g. if you use spark-submit), then the file should be on all machines :-\ . In that case, it may be better to have access to some DB or an external file system.