Problem reading AVRO messages from a Kafka topic using Spark Structured Streaming (Spark version 2.3.1.3.0.1.0-187 / Scala version 2.11.8) - scala

I am invoking spark-shell like this
spark-shell --jars kafka-clients-0.10.2.1.jar,spark-sql-kafka-0-10_2.11-2.3.0.cloudera1.jar,spark-streaming-kafka-0-10_2.11-2.3.0.jar,spark-avro_2.11-2.4.0.jar,avro-1.9.1.jar
After that, I read from a Kafka topic using readStream():
val df = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "kafka-1.x:9093,kafka-2.x:9093,kafka-0.x:9093")
  .option("kafka.security.protocol", "SASL_SSL")
  .option("kafka.ssl.protocol", "TLSv1.2")
  .option("kafka.sasl.mechanism", "PLAIN")
  .option("kafka.sasl.jaas.config", "org.apache.kafka.common.security.plain.PlainLoginModule required username=\"token\" password=\"XXXXXXXXXX\";")
  .option("subscribe", "test-topic")
  .option("startingOffsets", "latest")
  .load()
Then I read the AVRO schema file:
val jsonFormatSchema = new String(Files.readAllBytes(Paths.get("/root/avro_schema.json")))
Then I build a DataFrame that matches the AVRO schema:
val DataLineageDF = df.select(from_avro(col("value"),jsonFormatSchema).as("DataLineage")).select("DataLineage.*")
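For completeness, these snippets assume the following imports in the spark-shell session (the from_avro import matches the spark-avro 2.4 package used above; Trigger is used in the writeStream call further down):
import java.nio.file.{Files, Paths}
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.avro.from_avro
import org.apache.spark.sql.streaming.Trigger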
This throws an error:
java.lang.NoSuchMethodError: org.apache.avro.Schema.getLogicalType()Lorg/apache/avro/LogicalType;
I was able to fix this problem by replacing the jar spark-avro_2.11-2.4.0.jar with spark-avro_2.11-2.4.0-palantir.31.jar.
Issue:
DataLineageDF.writeStream.format("console").outputMode("append").trigger(Trigger.ProcessingTime("10 seconds")).start
fails with this error:
Exception in thread "stream execution thread for [id = ad836d19-0f29-499a-adea-57c6d9c630b2, runId = 489b1123-a2b2-48ea-9d24-e6744e0959b0]" java.lang.NoSuchMethodError: org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$.boxedType(Lorg/apache/spark/sql/types/DataType;)Ljava/lang/String;
which seems to be related to incompatible jars. If anyone has any idea what's going wrong, please comment.

Related

"Data source org.apache.phoenix.spark does not support streamed writing" in Structured Streaming

I'm trying to connect to the Phoenix driver using Spark Structured Streaming, and I'm getting the following exception when I try to load the HBase table data via the Phoenix driver. Please help with this.
Jars:
spark.version: 2.4.0
scala.version: 2.12
phoenix.version: 4.11.0-HBase-1.1
hbase.version: 1.4.4
confluent.version: 5.3.0
spark-sql-kafka-0-10_2.12
Code
val tableDF = sqlContext.phoenixTableAsDataFrame("DATA_TABLE", Array("ID","DEPARTMENT"), conf = configuration)
ERROR
Exception in thread "main" java.lang.UnsupportedOperationException: Data source org.apache.phoenix.spark does not support streamed writing
at org.apache.spark.sql.execution.datasources.DataSource.createSink(DataSource.scala:298)
at org.apache.spark.sql.streaming.DataStreamWriter.start(DataStreamWriter.scala:322)
at com.spark.streaming.process.StreamProcess.processDataPackets(StreamProcess.scala:81)
at com.spark.streaming.main.$anonfun$start$1(IAlertJob.scala:55)
at com.spark.streaming.main.$anonfun$start$1$adapted(IAlertJob.scala:27)
at com.spark.streaming.main.SparkStreamingApplication.withSparkStreamingContext(SparkStreamingApplication.scala:38)
at com.spark.streaming.main.SparkStreamingApplication.withSparkStreamingContext$(SparkStreamingApplication.scala:23)

Spark streaming jdbc read the stream as and when data comes - Data source jdbc does not support streamed reading

I am using PostgreSQL as the database. I want to capture one table's data for each batch, convert it to a Parquet file, and store it in S3. I tried to connect using Spark's JDBC options and readStream, like below...
val jdbcDF = spark.readStream
.format("jdbc")
.option("url", "jdbc:postgresql://myserver:5432/mydatabase")
.option("dbtable", "database.schema.table")
.option("user", "xxxxx")
.option("password", "xxxxx")
.load()
but it threw an unsupported-operation exception:
Exception in thread "main" java.lang.UnsupportedOperationException: Data source jdbc does not support streamed reading
at org.apache.spark.sql.execution.datasources.DataSource.sourceSchema(DataSource.scala:234)
at org.apache.spark.sql.execution.datasources.DataSource.sourceInfo$lzycompute(DataSource.scala:87)
at org.apache.spark.sql.execution.datasources.DataSource.sourceInfo(DataSource.scala:87)
at org.apache.spark.sql.execution.streaming.StreamingRelation$.apply(StreamingRelation.scala:30)
at org.apache.spark.sql.streaming.DataStreamReader.load(DataStreamReader.scala:150)
at examples.SparkJDBCStreaming$.delayedEndpoint$examples$SparkJDBCStreaming$1(SparkJDBCStreaming.scala:16)
at examples.SparkJDBCStreaming$delayedInit$body.apply(SparkJDBCStreaming.scala:5)
at scala.Function0$class.apply$mcV$sp(Function0.scala:34)
at scala.runtime.AbstractFunction0.apply$mcV$sp(AbstractFunction0.scala:12)
at scala.App$$anonfun$main$1.apply(App.scala:76)
at scala.App$$anonfun$main$1.apply(App.scala:76)
at scala.collection.immutable.List.foreach(List.scala:392)
at scala.collection.generic.TraversableForwarder$class.foreach(TraversableForwarder.scala:35)
at scala.App$class.main(App.scala:76)
Am I on the right track? Is there really no support for using a database as a data source for Spark Streaming?
AFAIK, the other way of doing this is to write a Kafka producer to publish the data into a Kafka topic and then use Spark Streaming...
Note: I don't want to use Kafka Connect for this since I need to do some auxiliary transformations.
Is this the only way to do it?
What is the right way of doing this? Is there an example of such a thing?
Please assist!
Spark Structured Streaming does not have a standard JDBC source, but you can write a custom one. You should understand that your table must have a unique key by which you can track changes.
For example, you can take my implementation; do not forget to add the necessary JDBC driver to the dependencies.
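For a PostgreSQL table like the one in the question, that driver dependency might look like this in sbt (a sketch; the version number is only an example):
// build.sbt
libraryDependencies += "org.postgresql" % "postgresql" % "42.2.5"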
This library may help: Jdbc2S.
It provides JDBC streaming capabilities and was built on top of the Spark JDBC batch source.
Basically, you use it as you would any other streaming source; the only mandatory configuration is the name of the offset column in the tables you're consuming.
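A minimal Scala sketch of what that usage could look like (the format name and option keys are taken from the PySpark example further down, so treat them as assumptions about the library's API; the table and offset column names are placeholders):
val jdbcStreamDF = spark.readStream
  .format("jdbc-streaming-v1")                        // Jdbc2S streaming source
  .option("url", "jdbc:postgresql://myserver:5432/mydatabase")
  .option("driver", "org.postgresql.Driver")
  .option("dbtable", "schema.table")
  .option("user", "xxxxx")
  .option("password", "xxxxx")
  .option("offsetField", "last_updated")              // mandatory: column used to track the stream offset
  .load()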
Jdbc2S works for me using PySpark, with a few changes in JDBCStreamingSourceV1.scala to fit the Python keyword convention, such as:
object JDBCStreamingSourceV1 {
  val CONFIG_OFFSET_FIELD = "offsetfield"
  val CONFIG_START_OFFSET = "startoffset"
  val CONFIG_OFFSET_FIELD_DATE_FORMAT = "offsetfielddateformat"
}
Then, finally:
def df_readstream(dbtable, offsetfield):
    df = spark.readStream.format("jdbc-streaming-v1") \
        .options(url=url,
                 driver='oracle.jdbc.driver.OracleDriver',
                 dbtable=dbtable,
                 user=user,
                 offsetField=offsetfield,
                 password=password).load()
    return df

Multiple sources found for csv : readStream

I am trying to run the code below to read a file as a DataFrame and write it onto a Kafka topic (for Spark Streaming). It was developed in the Eclipse IDE using Scala, with schemas defined appropriately, and is run as a thin jar on the server via spark-submit (without invoking any additional packages), and I am getting the error below. I tried suggestions from researching similar spark.read.option.schema.csv errors on the internet, with no success.
Has anybody encountered a similar issue with Spark Streaming when using readStream?
Looking forward to hearing your responses!
Error:
Exception in thread "main" java.lang.RuntimeException: Multiple sources found for csv (com.databricks.spark.csv.DefaultSource15, org.apache.spark.sql.execution.datasources.csv.CSVFileFormat), please specify the fully qualified class name.
Code:
val csvdf = spark.readStream.option("sep", ",").schema(userSchema).csv("server_path") //does not resolve error
val csvdf = spark.readStream.option("sep", ",").schema(userSchema).format("com.databricks.spark.csv").csv("server_path") //does not resolve error
val csvdf = spark.readStream.option("sep", ",").schema(userSchema).csv("server_path") //does not resolve error
val csvdf = spark.readStream.option("sep", ",").schema(userSchema).format("org.apache.spark.sql.execution.datasources.csv").csv("server_path") //does not resolve error
val csvdf = spark.readStream.option("sep", ",").schema(userSchema).format("org.apache.spark.sql.execution.datasources.csv.CSVFileFormat").csv("server_path") //does not resolve error
val csvdf = spark.readStream.option("sep", ",").schema(userSchema).format("com.databricks.spark.csv.DefaultSource15").csv("server_path") //does not resolve error
The pom.xml did not reference the spark-csv jar explicitly.
It turns out the server's HDP path containing the jars for Spark 2 had both the spark-csv and spark-sql jars, which was causing the conflict of CSV sources.
On removing the extra spark-csv jar from the path, the issue was resolved.

Getting error in Spark Structured Streaming for using from_json function

I am reading the data from Kafka, and my Spark application contains the following code:
val hiveDf = parsedDf
.select(from_json(col("value"), schema).as("value"))
.selectExpr("value.*")
When I run it from IntelliJ it works, but when I run it as a jar it throws the following error:
Exception in thread "main" java.lang.NoSuchMethodError: org.apache.spark.sql.functions$.from_json(Lorg/apache/spark/sql/Column;Lorg/apache/spark/sql/types/StructType;)Lorg/apache/spark/sql/Column;
my spark-submit command looks like this:
C:\spark>.\bin\spark-submit --jars C:\Users\namaagarwal\Desktop\Spark_FI\spark-sql-kafka-0-10_2.11-2.1.0.cloudera1.jar --class ClickStream C:\Users\namaagarwal\Desktop\Spark_FI\SparkStreamingFI\target\scala-2.11\sparkstreamingfi_2.11-0.1.jar

PySpark: java.sql.SQLException: No suitable driver

I have spark code which connects to Netezza and reads a table.
conf = SparkConf().setAppName("app").setMaster("yarn-client")
sc = SparkContext(conf=conf)
hc = HiveContext(sc)
nz_df = hc.load(source="jdbc", url="<address>;dbname=<dbname>;username=;password=", dbtable="")
I run the code with spark-submit in the following way:
spark-submit -jars nzjdbc.jar filename.py
And I get the following exception:
py4j.protocol.Py4JJavaError: An error occurred while calling o55.load.
: java.sql.SQLException: No suitable driver
Am I doing anything wrong here? Is the jar not suitable, or is Spark not able to recognize the jar? Please let me know the correct way if this is not it, and can anyone provide a link to get the jar for connecting to Netezza from Spark?
I am using the 1.6.0 version of spark.