Stream data from flink to S3 - scala

I am using Flink on amazon EMR and want to stream results of my pipeline to s3 bucket.
I am using Flink version => 1.11.2
This is a code snippet of how the code looks right now variable :
val outputPath = new Path("s3://test/flinkStreamTest/failureLogs/dt=2021-04-15/")
val sink: StreamingFileSink[String] = StreamingFileSink
.forRowFormat(outputPath, new SimpleStringEncoder[String]("UTF-8"))
.withRollingPolicy(
DefaultRollingPolicy.builder()
.withRolloverInterval(TimeUnit.MINUTES.toMillis(15))
.withInactivityInterval(TimeUnit.MINUTES.toMillis(5))
.withMaxPartSize(1024 * 1024 * 1024)
.build()
)
.build()
val enrichedStream = AsyncDataStream
.unorderedWait(
resConsumer,
new AsyncElasticRequest(elasticIndexName, elasticHost, elasticPort),
asyncTimeOut.toInt, TimeUnit.MILLISECONDS,
asyncCapacity.toInt
) // this is my pipeline result. it returns a string
enrichedStream.addSink(sink)
env.execute("run pipeline") // this is just to run the pipeline
And this is the error i am currently getting;
java.lang.UnsupportedOperationException: Recoverable writers on Hadoop are only supported for HDFS
at org.apache.flink.runtime.fs.hdfs.HadoopRecoverableWriter.<init>(HadoopRecoverableWriter.java:61)
at org.apache.flink.runtime.fs.hdfs.HadoopFileSystem.createRecoverableWriter(HadoopFileSystem.java:202)
at org.apache.flink.core.fs.SafetyNetWrapperFileSystem.createRecoverableWriter(SafetyNetWrapperFileSystem.java:69)
at org.apache.flink.streaming.api.functions.sink.filesystem.StreamingFileSink$RowFormatBuilder.createBuckets(StreamingFileSink.java:260)
at org.apache.flink.streaming.api.functions.sink.filesystem.StreamingFileSink.initializeState(StreamingFileSink.java:396)
at org.apache.flink.streaming.util.functions.StreamingFunctionUtils.tryRestoreFunction(StreamingFunctionUtils.java:185)
at org.apache.flink.streaming.util.functions.StreamingFunctionUtils.restoreFunctionState(StreamingFunctionUtils.java:167)
at org.apache.flink.streaming.api.operators.AbstractUdfStreamOperator.initializeState(AbstractUdfStreamOperator.java:96)
at org.apache.flink.streaming.api.operators.StreamOperatorStateHandler.initializeOperatorState(StreamOperatorStateHandler.java:106)
at org.apache.flink.streaming.api.operators.AbstractStreamOperator.initializeState(AbstractStreamOperator.java:258)
at org.apache.flink.streaming.runtime.tasks.OperatorChain.initializeStateAndOpenOperators(OperatorChain.java:290)
at org.apache.flink.streaming.runtime.tasks.StreamTask.lambda$beforeInvoke$0(StreamTask.java:479)
at org.apache.flink.streaming.runtime.tasks.StreamTaskActionExecutor$1.runThrowing(StreamTaskActionExecutor.java:47)
at org.apache.flink.streaming.runtime.tasks.StreamTask.beforeInvoke(StreamTask.java:475)
at org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:528)
at org.apache.flink.runtime.taskmanager.Task.doRun(Task.java:721)
at org.apache.flink.runtime.taskmanager.Task.run(Task.java:546)
at java.lang.Thread.run(Thread.java:748)
I have placed the s3-fs-hadoop jar file in the plugins/s3-fs-hadoop folder.
I also have the same s3-fs-hadoop jar in /usr/lib/flink/lib just in case flink looks for the s3-fs-hadoop jar in that folder also.
Please can someone help me.
I have searched and searched but cant seem to resolve it.
Thanks

I figured it out. I needed to restart the entire flink long running application (not restart job).
Also had to remove the s3-fs-hadoop jar I placed in /usr/lib/flink/lib directory but kept a copy of the s3-fs-hadoop jar in the plugins/s3-fs-hadoop folder.

Related

Class org.apache-spark.SparkException, java.lang.NoClassDefFoundError: Could not initialize class XXX

Class org.apache-spark.SparkException, java.lang.NoClassDefFoundError: Could not initialize class XXX(class where field validation exists) Exception when I am trying to do field validations on Spark Dataframe. Here is my code
And all classes and object used are serialized. Fails on AWS EMR spark job (works fine in local Machine.)
val newSchema = df.schema.add("errorList", ArrayType(new StructType()
.add("fieldName" , StringType)
.add("value" , StringType)
.add("message" , StringType)))
//Validators is a Sequence of validations on columns in a Row.
// Validator method signature
// def checkForErrors(row: Row): (fieldName, value, message) ={
// logic to validate the field in a row }
val validateRow: Row => Row = (row: Row)=>{
val errorList = validators.map(validator => validator.checkForErrors(row)
Row.merge(row, Row(errorList))
}
val validateDf = df.map(validateRow)(RowEncoder.apply(newSchema))
Versions : Spark 2.4.7 and Scala 2.11.8
Any ideas on why this might happen or if someone had the same issue.
I faced a very similar problem with EMR release 6.8.0 - in particular, the spark.jars configuration was not respected for me on EMR (I pointed it at a location of a JAR in S3), even though it seems to be normally accepted Spark parameter.
For me, the solution was to follow this guide ("How do I resolve the "java.lang.ClassNotFoundException" in Spark on Amazon EMR?"):
https://aws.amazon.com/premiumsupport/knowledge-center/emr-spark-classnotfoundexception/
In CDK (where our EMR cluster definitino is), I set up an EMR step to be executed immediately after cluster creation the rewrite the spark.driver.extraClassPath and spark.executor.extraClassPath to also contain the location of my additional JAR (in my case, the JAR physically comes in a Docker image, but you could also set up a boostrap action to copy it on the cluster from S3), as per their code in the article under "For Amazon EMR release version 6.0.0 and later,". The reason you have to do this "rewriting" is because EMR already populates these spark.*.extraClassPath with a bunch of its own JAR location, e.g. for JARs that contain the S3 drivers, so you effectively have to append your own JAR location, rather than just straight up setting the spark.*.extraClassPath to your location. If you do the latter (I tried it), then you will lose lot of the EMR functionality such as being able to read from S3.
#!/bin/bash
#
# This is an example of script_b.sh for changing /etc/spark/conf/spark-defaults.conf
#
while [ ! -f /etc/spark/conf/spark-defaults.conf ]
do
sleep 1
done
#
# Now the file is available, do your work here
#
sudo sed -i '/spark.*.extraClassPath/s/$/:\/home\/hadoop\/extrajars\/\*/' /etc/spark/conf/spark-defaults.conf
exit 0

IllegalStateException: _spark_metadata/0 doesn't exist while compacting batch 9

We have Streaming Application implemented using Spark Structured Streaming which tries to read data from Kafka topics and write it to HDFS Location.
Sometimes application fails with Exception:
_spark_metadata/0 doesn't exist while compacting batch 9
java.lang.IllegalStateException: history/1523305060336/_spark_metadata/9.compact doesn't exist when compacting batch 19 (compactInterval: 10)
We are not able to resolve this issue.
Only solution I found is to delete checkpoint location files which will make the job read the topic/data from beginning as soon as we run the application again. However, this is not a feasible solution for production application.
Does anyone has a solution for this error without deleting checkpoint such that I can continue from where the last run was failed?
Sample code of application:
val df = spark.readStream
.format("kafka")
.option("kafka.bootstrap.servers", <server list>)
.option("subscribe", <topic>)
.load()
[...] // do some processing
dfProcessed.writeStream
.format("csv")
.option("format", "append")
.option("path",hdfsPath)
.option("checkpointlocation","")
.outputmode(append)
.start
The error message
_spark_metadata/n.compact doesn't exist when compacting batch n+10
can show up when you
process some data into a FileSink with checkpoint enabled, then
stop your streaming job, then
change the output directory of the FileSink while keeping the same checkpointLocation, then
restart the streaming job
Quick Solution (not for production)
Just delete the files in checkpointLocation and restart the application.
Stable Solution
As you do not want to delete your checkpoint files, you could simply copy the missing spark metadata files from the old File Sink output path to the new output Path. See below to understand what are the "missing spark metadata files".
Background
To understand, why this IllegalStateException is being thrown, we need to understand what is happening behind the scene in the provided file output path. Let outPathBefore be the name of this path. When your streaming job is running and processing data the job creates a folder outPathBefore/_spark_metadata. In that folder you will find a file named after micro-batch Identifier containing the list of files (partitioned files) the data has been written to, e.g:
/home/mike/outPathBefore/_spark_metadata$ ls
0 1 2 3 4 5 6 7
In this case we have details for 8 micro batches. The content of one of the files looks like
/home/mike/outPathBefore/_spark_metadata$ cat 0
v1
{"path":"file:///tmp/file/before/part-00000-99bdc705-70a2-410f-92ff-7ca9c369c58b-c000.csv","size":2287,"isDir":false,"modificationTime":1616075186000,"blockReplication":1,"blockSize":33554432,"action":"add"}
By default, on each tenth micro batch, these files are getting compacted, meaning the contents of the files 0, 1, 2, ..., 9 will be stored in a compacted file called 9.compact.
This procedure continuous for the subsequent ten batches, i.e. in the micro batch 19 the job aggregates the last 10 files which are 9.compact, 10, 11, 12, ..., 19.
Now, imagine you had the streaming job running until micro batch 15 which means the job has created the following files:
/home/mike/outPathBefore/_spark_metadata/0
/home/mike/outPathBefore/_spark_metadata/1
...
/home/mike/outPathBefore/_spark_metadata/8
/home/mike/outPathBefore/_spark_metadata/9.compact
/home/mike/outPathBefore/_spark_metadata/10
...
/home/mike/outPathBefore/_spark_metadata/15
After the fifteenth micro batch you stopped the streaming job and changed the output path of the File Sink to, say, outPathAfter. As you keep the same checkpointLocation the streaming job will continue with micro-batch 16. However, it now creates the metadata files in the new out path:
/home/mike/outPathAfter/_spark_metadata/16
/home/mike/outPathAfter/_spark_metadata/17
...
Now, and this is where the Exception is thrown: When reaching micro batch 19, the job tries to compact the tenth latest files from spark metadata folder. However, it can only find the files 16, 17, 18 but it does not find 9.compact, 10 etc. Hence the error message says:
java.lang.IllegalStateException: history/1523305060336/_spark_metadata/9.compact doesn't exist when compacting batch 19 (compactInterval: 10)
Documentation
The Structured Streaming Programming Guide explains on Recovery Semantics after Changes in a Streaming Query:
"Changes to output directory of a file sink are not allowed: sdf.writeStream.format("parquet").option("path", "/somePath") to sdf.writeStream.format("parquet").option("path", "/anotherPath")"
Databricks has also written some details in the article Streaming with File Sink: Problems with recovery if you change checkpoint or output directories
Error caused by checkpointLocation because checkpointLocation stores old or deleted data information. You just need to delete the folder containing checkpointLocation.
Explore more :https://kb.databricks.com/streaming/file-sink-streaming.html
Example :
df.writeStream
.format("parquet")
.outputMode("append")
.option("checkpointLocation", "D:/path/dir/checkpointLocation")
.option("path", "D:/path/dir/output")
.trigger(Trigger.ProcessingTime("5 seconds"))
.start()
.awaitTermination()
You need to do delete directory checkpointLocation.
This article introduces the mechanism and gives a good way to recover from a deleted _spark_metadata folder in Spark Structured Streaming:
https://dev.to/kevinwallimann/how-to-recover-from-a-deleted-sparkmetadata-folder-546j
"Create dummy log files:
If the metadata log files are irrecoverable, we could create dummy log files for the missing micro-batches.
In our example, this could be done like this:
for i in {0..1}; do echo v1 > "/tmp/destination/_spark_metadata/$i"; done
This will create the files
/tmp/destination/_spark_metadata/0
/tmp/destination/_spark_metadata/1
Now, the query can be restarted and should finish without errors."
As my previous output folder was not recoverable anymore. I tried this dummy solution, which could work to get rid of the IllegalStateException: _spark_metadata/... doesn't exist exception.

Spark Dataframe writes part files to _temporary in instead directly creating partFiles in output directory [duplicate]

We are running spark 2.3.0 on AWS EMR. The following DataFrame "df" is non empty and of modest size:
scala> df.count
res0: Long = 4067
The following code works fine for writing df to hdfs:
scala> val hdf = spark.read.parquet("/tmp/topVendors")
hdf: org.apache.spark.sql.DataFrame = [displayName: string, cnt: bigint]
scala> hdf.count
res4: Long = 4067
However using the same code to write to a local parquet or csv file end up with empty results:
df.repartition(1).write.mode("overwrite").parquet("file:///tmp/topVendors")
scala> val locdf = spark.read.parquet("file:///tmp/topVendors")
org.apache.spark.sql.AnalysisException: Unable to infer schema for Parquet. It must be specified manually.;
at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$9.apply(DataSource.scala:207)
at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$9.apply(DataSource.scala:207)
at scala.Option.getOrElse(Option.scala:121)
We can see why it fails:
ls -l /tmp/topVendors
total 0
-rw-r--r-- 1 hadoop hadoop 0 Jul 30 22:38 _SUCCESS
So there is no parquet file being written.
I have tried this maybe twenty times and for both csv and parquet and on two different EMR Servers: this same behavior is exhibited in all cases.
Is this an EMR specific bug? A more general EC2 bug? Something else? This code works on spark on macOS.
In case it matters - here is the versioning info:
Release label:emr-5.13.0
Hadoop distribution:Amazon 2.8.3
Applications:Spark 2.3.0, Hive 2.3.2, Zeppelin 0.7.3
That is not a bug and it is the expected behavior. Spark does not really support writes to non-distributed storage (it will work in local mode, just because you have shared file system).
Local path is not interpreted (only) as a path on the driver (this would require collecting the data) but local path on each executor. Therefore each executor will write its own chunk to its own local file system.
Not only output is no readable back (to load data each executor and the driver should see the same state of the file system), but depending on the commit algorithm, might not be even finalized (move from the temporary directory).
This error usually occurs when you try to read an empty directory as parquet.
You could check
1. if the DataFrame is empty with outcome.rdd.isEmpty() before write it.
2. Check the if the path you are giving is correct
Also in what mode you are running your application? Try running it in client mode if you are running in cluster mode.

Is this a serious Warning while accessing GCS Bucket directly from Dataproc Spark Job?

I am running a Spark 2.2 job on Dataproc and I need to access a bunch of avro files located in a GCP storage bucket. To be specific, I need to access the files DIRECTLY from the bucket (i.e. NOT first have them copy/pasted onto the Master machine, both because they might be very large and also for compliance reasons).
I am using the gs://XXX notation to refer to the Bucket inside the Spark code, based on recommendations in this doc:
https://cloud.google.com/dataproc/docs/concepts/connectors/cloud-storage
Everything seems to work. However, I am seeing the following Warnings repeatedly:
18/08/08 15:42:59 WARN com.google.cloud.hadoop.gcsio.GoogleCloudStorageReadChannel: Channel for 'gs://ff_src_data/trns2.avro' is not open.
18/08/08 15:42:59 WARN com.google.cloud.hadoop.gcsio.GoogleCloudStorageReadChannel: Channel for 'gs://ff_src_data/trns1.avro' is not open.
18/08/08 15:42:59 WARN com.google.cloud.hadoop.gcsio.GoogleCloudStorageReadChannel: Channel for 'gs://ff_src_data/trns3.avro' is not open.
Is this a serious warning? Would it have any material impact on real-life performance (speed), particularly in case of large/many files involved? If so, how should I fix this, or should I just ignore it?
**** UPDATE:
Here's the most basic code to produce this in JAVA:
public static void main(String args[]) throws Exception
{
SparkConf spConf = new SparkConf().setAppName("AVRO-TEST-" + UUID.randomUUID().toString());
Master1 master = new Master1(spConf);
master.readSpark("gs://ff_src_data");
}
class Master1
{
private SparkConf m_spConf;
private JavaSparkContext m_jSPContext;
public Master1(SparkConf spConf)
{
m_spConf = spConf;
m_jSPContext = new JavaSparkContext(m_spConf);
}
public void readSpark(String srcDir)
{
SQLContext sqlContext = SQLContext.getOrCreate(JavaSparkContext.toSparkContext(m_jSPContext));
Dataset<Row> trn = sqlContext.read().format("com.databricks.spark.avro").load(srcDir);
trn.printSchema();
trn.show();
List<Row> rows = trn.collectAsList();
for(Row row : rows)
{
System.out.println("Row content [0]:\t" + row.getDouble(0));
}
}
}
For now, this is just a silly setup to test the ability to load a bunch of Avro files directly from the GCS Bucket.
Also, to clarify: this is Dataproc Image version 1.2 and Spark version 2.2.1
This warning means that code closes GoogleCloudStorageReadChannel after it was already closed. It's harmless warning message, but it could signal that input streams are handled inconsistently in the code when reading files.
May you provide simplified version of your job that reproduces this warning (the more concise it will be the better)? With this repro from you I will be able to check if this is an issue in GCS connector, or maybe in Hadoop/Spark Avro input format.
Update:
This warning message was removed in GCS connector 1.9.10.

Spark AWS emr checkpoint location

I'm running a spark job on EMR but need to create a checkpoint. I tried using s3 but got this error message
17/02/24 14:34:35 ERROR ApplicationMaster: User class threw exception:
java.lang.IllegalArgumentException: Wrong FS: s3://spark-
jobs/checkpoint/31d57e4f-dbd8-4a50-ba60-0ab1d5b7b14d/connected-
components-e3210fd6/2, expected: hdfs://ip-172-18-13-18.ec2.internal:8020
java.lang.IllegalArgumentException: Wrong FS: s3://spark-
jobs/checkpoint/31d57e4f-dbd8-4a50-ba60-0ab1d5b7b14d/connected-
components-e3210fd6/2, expected: hdfs://ip-172-18-13-18.ec2.internal:8020
Here is my sample code
...
val sparkConf = new SparkConf().setAppName("spark-job")
.set("spark.default.parallelism", (CPU * 3).toString)
.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
.registerKryoClasses(Array(classOf[Member], classOf[GraphVertex], classOf[GraphEdge]))
.set("spark.dynamicAllocation.enabled", "true")
implicit val sparkSession = SparkSession.builder().config(sparkConf).getOrCreate()
sparkSession.sparkContext.setCheckpointDir("s3://spark-jobs/checkpoint")
....
How can I checkpoint on AWS EMR?
There's a now fixed bug for Spark which meant you could only checkpoint to the default FS, not any other one (like S3). It's fixed in master, don't know about backports.
if it makes you feel any better, the way checkpointing works: write then rename() is slow enough on the object store you may find yourself off better checkpointing locally then doing the upload to s3 yourself.
There is a fix in the master branch for this to allow checkpoint to s3 too. I was able to build against it and it worked so this should be part of next release.
Try something with AWS authenticaton like:
val hadoopConf: Configuration = new Configuration()
hadoopConf.set("fs.s3.impl", "org.apache.hadoop.fs.s3native.NativeS3FileSystem")
hadoopConf.set("fs.s3n.awsAccessKeyId", "id-1")
hadoopConf.set("fs.s3n.awsSecretAccessKey", "secret-key")
sparkSession.sparkContext.getOrCreate(checkPointDir, () =>
{ createStreamingContext(checkPointDir, config) }, hadoopConf)