Spark streaming from Kafka on Spark Operator (Kubernetes)

I have a Spark structured streaming job in Scala that reads from Kafka and writes to S3 as Hudi tables. Now I am trying to move this job to the Spark Operator on EKS.
I give the packages option in the YAML file:
spark.jars.packages: org.apache.spark:spark-sql-kafka-0-10_2.12:3.1.2,org.apache.hudi:hudi-spark3.1-bundle_2.12:0.11.1
but I still get this error on both the driver and the executors:
java.lang.ClassNotFoundException: org.apache.spark.sql.kafka010.KafkaBatchInputPartition
How do I add the package org.apache.spark:spark-sql-kafka-0-10_2.12:3.1.2 so that it works?
Edit: this seems to be an existing issue that is fixed only in the not-yet-released Spark 3.4. Based on the suggestions in the linked answers, I had to bake all the jars (spark-sql-kafka-0-10_2.12-3.1.2 with its dependencies, plus the Hudi jar) into the Spark image. Then it worked.
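For reference, a build.sbt sketch (not from the original post) that declares the same coordinates mentioned above; resolving them this way is one convenient way to pull the connector jars together with their transitive dependencies before copying them into the Spark image:
// build.sbt (sketch) -- artifact versions taken from the question, Scala version assumed
scalaVersion := "2.12.15"

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-sql-kafka-0-10"      % "3.1.2",
  "org.apache.hudi"  %  "hudi-spark3.1-bundle_2.12" % "0.11.1"
)
Once sbt has resolved these, the jars sit in the local dependency cache and can be copied into the jars directory of the image used by the Spark Operator, which is effectively what the fix above does.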

Related

Upgrade Flink 1.10 to Flink 1.11 (Log4j on Kubernetes deployment)

After upgrading from Flink 1.10 to Flink 1.11, the log4j configuration is no longer working.
My previous configuration used a library with an adapter that requires log4j 1.x, which is no longer compatible with Flink 1.11.
According to the new configuration format, flink-conf.yaml should look like this:
log4j-console.properties: |+
# This affects logging for both user code and Flink
rootLogger.level = INFO
rootLogger.appenderRef.console.ref = ConsoleAppender
rootLogger.appenderRef.rolling.ref = RollingFileAppender
# Uncomment this if you want to _only_ change Flink's logging
#logger.flink.name = org.apache.flink
#logger.flink.level = INFO
My current configuration, using log4j1, looks something like this:
log4j-console.properties: |+
log4j.rootLogger=INFO,myappender,console
log4j.appender.myappender=com.company.log4j.MyAppender
log4j.appender.myappender.endpoints=http://
Is there a way to tell Flink 1.11 to use log4j1 in the flink-conf.yaml file?
As far as I know, flink-conf.yaml does not contain a log4j-console.properties section; that is a separate file. What you have specified is, I suppose, part of the flink-configuration-configmap.yaml cluster resource definition.
According to Flink's Configuring Log4j1 section, in order to use log4j1 you need to:
remove the log4j-core, log4j-slf4j-impl and log4j-1.2-api jars from the lib directory,
add the log4j, slf4j-log4j12 and log4j-to-slf4j jars to the lib directory.
After upgrading from Flink 1.10.2 to Flink 1.11.3 I came across the same issue in both the Kubernetes and DC/OS (Mesos) Flink clusters. To cross-verify, I downloaded the Flink binaries (flink-1.11.3-bin-scala_2.12.tgz) locally, tested the loggers, and found them working without any change.
Flink 1.11 switched from Log4j1 to Log4j2
I then followed the steps from Flink's official documentation to use Flink with Log4j1:
Remove the log4j-core, log4j-slf4j-impl and log4j-1.2-api jars from Flink's lib directory.
Add the log4j, slf4j-log4j12, and log4j-to-slf4j jars to Flink's lib directory.
After restarting the Kubernetes and DC/OS (Mesos) Flink clusters, I verified the loggers and found them working.
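Not part of the original answer, but if the jar swap above needs to be reproducible, the Log4j1-side jars can also be declared as ordinary dependencies and then copied into Flink's lib directory from the resolved classpath; a build.sbt sketch with assumed versions:
// build.sbt (sketch) -- assumed versions for the jars Flink's docs say to place in lib/
libraryDependencies ++= Seq(
  "log4j"                    % "log4j"          % "1.2.17",
  "org.slf4j"                % "slf4j-log4j12"  % "1.7.32",
  "org.apache.logging.log4j" % "log4j-to-slf4j" % "2.17.1"
)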

Is there a config file when installing the Spark dependency with Scala

I installed Spark with sbt as a project dependency. Now I want to change variables of the Spark environment without doing it in my code with .setMaster(). The problem is that I cannot find any config file on my computer.
This is because I get the error org.apache.spark.SparkException: Invalid Spark URL: spark://HeartbeatReceiver#my-mbp.domain_not_set.invalid:50487 even after trying to change my hostname. Thus, I would like to go deeper into the Spark library and try some things.
I tried pretty much everything in this SO post: Invalid Spark URL in local spark session.
Many thanks
What worked for the issue:
export SPARK_LOCAL_HOSTNAME=localhost in the shell profile (e.g. ~/.bash_profile).
sbt was not able to resolve the host even when the export was run just before launching sbt; I had to put it in the profile to get the right context.
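If you would rather keep the setting in code than in a shell profile, here is a sketch (not from the original answer) that pins the driver host on the SparkConf instead; spark.driver.host and spark.driver.bindAddress are standard Spark properties, and the app name is arbitrary:
import org.apache.spark.{SparkConf, SparkContext}

// Sketch: pin the driver hostname from code instead of exporting SPARK_LOCAL_HOSTNAME.
val conf = new SparkConf()
  .setMaster("local[*]")
  .setAppName("local-debug")                     // arbitrary name
  .set("spark.driver.host", "localhost")         // hostname the driver advertises
  .set("spark.driver.bindAddress", "127.0.0.1")  // address the driver binds to
val sc = new SparkContext(conf)
This is not guaranteed to be equivalent to SPARK_LOCAL_HOSTNAME in every setup, but it targets the same "invalid host in the Spark URL" symptom without relying on the environment.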

Unable to create partition on S3 using Spark

I would like to use this new functionality: overwrite a specific partition without deleting all the data in S3.
I used the new flag (spark.sql.sources.partitionOverwriteMode="dynamic") and tested it locally from my IDE, and it worked (I was able to overwrite a specific partition in S3). But when I deployed it to HDP 2.6.5 with Spark 2.3.0, the same code didn't create the S3 folders as expected: the folders weren't created at all, only a temp folder.
My code:
df.write
.mode(SaveMode.Overwrite)
.partitionBy("day","hour")
.option("compression", "gzip")
.parquet(s3Path)
Have you tried Spark version 2.4? I have worked with this version on both EMR and Glue and it has worked well. To use "dynamic" mode in version 2.4, just use this code:
dataset.write.mode("overwrite")
.option("partitionOverwriteMode", "dynamic")
.partitionBy("dt")
.parquet("s3://bucket/output")
The AWS documentation specifies Spark version 2.3.2 for using spark.sql.sources.partitionOverwriteMode="dynamic".
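For Spark 2.3+, the flag from the question can also be set at the session level right before the write; a sketch (not from the original answers) with placeholder data and path:
import org.apache.spark.sql.{SaveMode, SparkSession}

val spark = SparkSession.builder().appName("dynamic-overwrite").getOrCreate()
import spark.implicits._

// Session-level setting: only the partitions present in the written DataFrame are overwritten.
spark.conf.set("spark.sql.sources.partitionOverwriteMode", "dynamic")

val s3Path = "s3://my-bucket/output"   // placeholder path

// Placeholder data with the partition columns from the question.
val df = Seq((1, "2023-01-01", "00"), (2, "2023-01-01", "01")).toDF("value", "day", "hour")

df.write
  .mode(SaveMode.Overwrite)
  .partitionBy("day", "hour")
  .option("compression", "gzip")
  .parquet(s3Path)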

How to save data to InfluxDB using a Spark streaming job

I have tried the library pygmalios/reactiveinflux-spark, but its dependencies are error-prone in combination with our set of libraries. I have also tried com.paulgoldbaum/scala-influxdb-client, which gives a runtime error when reading a file on the cluster.
Is there any reliable library that addresses this?
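Not an answer from the original thread, but one way to avoid a client library entirely is to post InfluxDB line protocol over plain HTTP from each partition. A sketch, assuming an InfluxDB 1.x /write endpoint, a database called metrics, and records already formatted as line-protocol strings (host, database and field names are placeholders):
import java.net.{HttpURLConnection, URL}
import java.nio.charset.StandardCharsets

// Placeholder endpoint for the InfluxDB 1.x HTTP write API ("db" is the target database).
val influxWriteUrl = "http://influx-host:8086/write?db=metrics&precision=ms"

// POST a batch of line-protocol strings, e.g. "cpu,host=a usage=0.42 1620000000000".
def postLineProtocol(lines: Seq[String]): Unit = {
  if (lines.nonEmpty) {
    val conn = new URL(influxWriteUrl).openConnection().asInstanceOf[HttpURLConnection]
    conn.setRequestMethod("POST")
    conn.setDoOutput(true)
    val out = conn.getOutputStream
    try out.write(lines.mkString("\n").getBytes(StandardCharsets.UTF_8)) finally out.close()
    if (conn.getResponseCode >= 300)   // InfluxDB 1.x answers 204 No Content on success
      throw new RuntimeException(s"InfluxDB write failed: HTTP ${conn.getResponseCode}")
    conn.disconnect()
  }
}

// Usage from a structured streaming query over a Dataset[String] of line-protocol records:
// linesDs.writeStream
//   .foreachBatch { (batch: org.apache.spark.sql.Dataset[String], _: Long) =>
//     batch.rdd.foreachPartition(it => postLineProtocol(it.toSeq))
//   }
//   .start()
Batching per partition keeps the number of HTTP requests small, and there is no third-party client on the classpath to conflict with other libraries.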

How to build and run Scala Spark locally

I'm attempting to build Apache Spark locally. The reason is to debug Spark methods like reduce. In particular, I'm interested in how Spark implements and distributes MapReduce under the covers, as I'm experiencing performance issues and I think running these tasks from source is the best way to find out what the issue is.
So I have cloned the latest from the Spark repo:
git clone https://github.com/apache/spark.git
Spark appears to be a Maven project, and when I import it into Eclipse I see a set of sub-projects; some of the top-level folders also have their own pom files.
So should I just be building one of these sub-projects? Are these the correct steps for running Spark against a local code base?
Building Spark locally, the short answer:
git clone git@github.com:apache/spark.git
cd spark
sbt/sbt compile
Going into your question in more detail, what you're actually asking is "How do I debug a Spark application in Eclipse?".
To debug in Eclipse, you don't really need to build Spark itself. All you need to do is create a job with its Spark library dependency and ask Maven to download the sources. That way you can use the Eclipse debugger to step into Spark's code.
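The answer is framed around Maven and Eclipse; if the job is built with sbt instead (as elsewhere on this page), the equivalent is roughly the following sketch, where the versions are assumptions and withSources() asks sbt to fetch the source jars so the debugger can step into them:
// build.sbt (sketch) -- align the Spark version with the one you want to step into
scalaVersion := "2.12.18"

libraryDependencies ++= Seq(
  ("org.apache.spark" %% "spark-core" % "3.1.2").withSources(),
  ("org.apache.spark" %% "spark-sql"  % "3.1.2").withSources()
)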
Then, when creating the SparkContext, set the master to local[1], like this:
val conf = new SparkConf()
.setMaster("local[1]")
.setAppName("SparkDebugExample")
so that all Spark interactions are executed in local mode in one thread and therefore visible to your debugger.
If you are investigating a performance issue, remember that Spark is a distributed system, where network plays an important role. Debugging the system locally will only give you part of the answer. Monitoring the job in the actual cluster will be required in order to have a complete picture of the performance characteristics of your job.