I am reading log lines from a Kafka topic through Spark Structured Streaming, separating the fields of the log lines, performing some manipulations on the fields, and storing the result in a dataframe with a separate column for every field. I want to write this dataframe to Kafka.
Below are my sample dataframe and the writeStream query for writing it to Kafka:
val dfStructuredWrite = dfProcessedLogs.select(
  dfProcessedLogs("result").getItem("_1").as("col1"),
  dfProcessedLogs("result").getItem("_2").as("col2"),
  dfProcessedLogs("result").getItem("_17").as("col3"))
dfStructuredWrite
.writeStream
.format("kafka")
.option("kafka.bootstrap.servers", "host1:port1,host2:port2")
.option("topic", "topic1")
.start()
The above code gives me the error below:
Required attribute 'value' not found
I believe this is because I don't have my dataframe in key/value format. How can I write my existing dataframe to Kafka in the most efficient way?
The dataframe being written to Kafka should have the following columns in its schema:
key (optional) (type: string or binary)
value (required) (type: string or binary)
topic (optional) (type: string)
In your case there is no value column, so the exception is thrown.
You have to modify your query to add at least a value column, for example:
import org.apache.spark.sql.functions.{concat, lit}
dfStructuredWrite.select(concat($"col1", lit(" "), $"col2", lit(" "), $"col3").alias("value"))
For more details you can check: https://spark.apache.org/docs/latest/structured-streaming-kafka-integration.html#writing-data-to-kafka
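If you want to keep the individual fields rather than concatenating them with spaces, a common alternative (a sketch, not part of the original answer) is to serialize the whole row as JSON with to_json(struct(...)). The checkpoint path below is a hypothetical placeholder; note that a streaming Kafka sink also requires the checkpointLocation option.
import org.apache.spark.sql.functions.{to_json, struct}

// Serialize all columns into a single JSON string exposed as the required "value" column.
val dfKafkaReady = dfStructuredWrite
  .select(to_json(struct($"col1", $"col2", $"col3")).alias("value"))

dfKafkaReady
  .writeStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "host1:port1,host2:port2")
  .option("topic", "topic1")
  .option("checkpointLocation", "/tmp/kafka-sink-checkpoint") // hypothetical path
  .start()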
Related
I want to create a dataframe on top of a Kafka topic and then register that dataframe as a temp table to perform a minus operation on the data. I have written the code below, but while querying the registered table I get the error:
"org.apache.spark.sql.AnalysisException: Queries with streaming sources must be executed with writeStream.start();;"
import org.apache.spark.sql.types._
import org.apache.spark.sql.functions.{col, from_json}
val df = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "SERVER ******")
  .option("subscribe", "TOPIC_NAME")
  .option("startingOffsets", "earliest")
  .load()
df.printSchema()
val personStringDF = df.selectExpr("CAST(value AS STRING)")
val user_schema = StructType(Array(
  StructField("OEM", StringType, true),
  StructField("IMEI", StringType, true),
  StructField("CUSTOMER_ID", StringType, true),
  StructField("REQUEST_SOURCE", StringType, true),
  StructField("REQUESTER", StringType, true),
  StructField("REQUEST_TIMESTAMP", StringType, true),
  StructField("REASON_CODE", StringType, true)))
val personDF = personStringDF.select(from_json(col("value"),user_schema).as("data")).select("data.*")
personDF.registerTempTable("final_df1")
spark.sql("select * from final_df1").show
ERROR: "org.apache.spark.sql.AnalysisException: Queries with streaming sources must be executed with writeStream.start();;"
I have also used the start() method, and then I get the error below.
20/08/11 00:59:30 ERROR streaming.MicroBatchExecution: Query final_df1 [id = 1a3e2ea4-2ec1-42f8-a5eb-8a12ce0fb3f5, runId = 7059f3d2-21ec-43c4-b55a-8c735272bf0f] terminated with error
java.lang.AbstractMethodError
NOTE: My main objective behind writing this script is to run a minus query on this data and compare it with one of the registered tables I have on the cluster. To summarise: if I'm sending 1000 records from an Oracle database into a Kafka topic, I create a dataframe on top of the Oracle table and register it as a temp table, and I do the same with the Kafka topic. Then I want to run a minus query between the source (Oracle) and the target (Kafka topic) to perform 100% data validation between source and target. (Is registering a Kafka topic as a temporary table possible?)
Use a memory sink instead of registerTempTable. Check the code below.
import org.apache.spark.sql.types._
import org.apache.spark.sql.functions.{col, from_json}
val df = spark
.readStream
.format("kafka")
.option("kafka.bootstrap.servers", "SERVER ******")
.option("subscribe", "TOPIC_NAME")
.option("startingOffsets", "earliest")
.load()
df.printSchema()
val personStringDF = df.selectExpr("CAST(value AS STRING)")
val user_schema = StructType(Array(
  StructField("OEM", StringType, true),
  StructField("IMEI", StringType, true),
  StructField("CUSTOMER_ID", StringType, true),
  StructField("REQUEST_SOURCE", StringType, true),
  StructField("REQUESTER", StringType, true),
  StructField("REQUEST_TIMESTAMP", StringType, true),
  StructField("REASON_CODE", StringType, true)))
val personDF = personStringDF.select(from_json(col("value"),user_schema).as("data")).select("data.*")
personDF
.writeStream
.outputMode("append")
.format("memory")
.queryName("final_df1").start()
spark.sql("select * from final_df1").show(10,false)
A streaming DataFrame doesn't support the show() method. When you call start(), it starts a background thread that streams the input data to the sink; with a console sink, for example, the data is written to the console, so you don't need to call show().
Remove the lines below:
personDF.registerTempTable("final_df1")
spark.sql("select * from final_df1").show
and add the following (or equivalent) lines instead:
val query1 = personDF.writeStream.queryName("final_df1").format("memory").outputMode("append").start()
query1.awaitTermination()
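Once the memory table is populated, you can express the minus comparison from the NOTE with EXCEPT. A minimal sketch, assuming the Oracle data is already registered as a view named oracle_source (a hypothetical name); note that awaitTermination() blocks the calling thread, so for an ad-hoc comparison you would typically call processAllAvailable() instead:
// Wait until the stream has processed the currently available Kafka offsets.
query1.processAllAvailable()

// Rows present in Oracle but missing from the Kafka topic (source MINUS target).
spark.sql("SELECT * FROM oracle_source EXCEPT SELECT * FROM final_df1").show(10, false)

// Rows present in the Kafka topic but not in Oracle (target MINUS source).
spark.sql("SELECT * FROM final_df1 EXCEPT SELECT * FROM oracle_source").show(10, false)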
I have a Spark Structured Streaming job which streams data from multiple Kafka topics based on a subscribePattern, and for every Kafka topic I have a Spark schema. When streaming the data from Kafka I want to apply the Spark schema to the Kafka message based on the topic name.
Consider I have two topics: cust & customers.
Streaming data from Kafka based on subscribePattern (Java regex string):
var df = spark.readStream
.format("kafka")
.option("kafka.bootstrap.servers", "host1:port1,host2:port2")
.option("subscribePattern", "cust*")
.option("startingOffsets", "earliest")
.load()
.withColumn("value", $"value".cast("string"))
.filter($"value".isNotNull)
The above streaming query streams data from both topics.
Let's say I have two Spark schemas one for each topic:
var cust: StructType = new StructType()
.add("name", StringType)
.add("age", IntegerType)
var customers: StructType = new StructType()
.add("id", IntegerType)
.add("first_name", StringType)
.add("last_name", StringType)
.add("email", StringType)
.add("address", StringType)
Now, I want to apply the Spark schema based on the topic name. To do that I have written a udf which reads the topic name and returns the schema in DDL format:
val schema = udf((table: String) => (table) match {
case ("cust") => cust.toDDL
case ("customers") => customers.toDDL
case _ => new StructType().toDDL
})
Then I am using the udf (I understand that the udf is evaluated for every row) inside the from_json method like this:
val query = df
.withColumn("topic", $"topic".cast("string"))
.withColumn("data", from_json($"value", schema($"topic")))
.select($"key", $"topic", $"data.*")
.writeStream.outputMode("append")
.format("console")
.start()
.awaitTermination()
This gives me the following exception, which is expected because from_json expects a schema as a string in DDL format or as a StructType.
org.apache.spark.sql.AnalysisException: Schema should be specified in DDL format as a string literal or output of the schema_of_json function instead of UDF(topic);
How can I accomplish this?
Any help will be appreciated!
What you're doing is not possible: your query on df can't produce two different schemas.
I can think of two ways to do it:
1. Split your df by topic, then apply each schema to its own dataframe (cust and customers); a sketch of this option follows below.
2. Merge the two schemas into one schema and apply that to all topics.
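A minimal sketch of the first option, assuming the df, cust and customers definitions from the question: filter the stream by topic and parse each branch with its own schema, each branch with its own sink.
import org.apache.spark.sql.functions.from_json

// One branch per topic, parsed with the matching schema.
val custDF = df
  .filter($"topic" === "cust")
  .withColumn("data", from_json($"value", cust))
  .select($"key", $"topic", $"data.*")

val customersDF = df
  .filter($"topic" === "customers")
  .withColumn("data", from_json($"value", customers))
  .select($"key", $"topic", $"data.*")

// Each branch gets its own streaming query.
custDF.writeStream.format("console").outputMode("append").start()
customersDF.writeStream.format("console").outputMode("append").start()

spark.streams.awaitAnyTermination()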
I am using Structured Streaming with Spark 2.1.1. I need to construct a key from the columns available in the streaming dataframe (Kafka source) and fetch the HBase table data as a static dataframe using the shc Spark-HBase connector, then apply business logic using both dataframes.
I am planning to construct the key by iterating over the records in the streaming dataframe; for each record, after constructing the key, I look up the HBase table, get a dataframe using the shc connector, apply some business logic using both dataframes, and then send the response data to a Kafka topic.
Setup: Structured Streaming with Spark 2.1.1, Kafka data source, shc Spark-HBase connector.
val StreamingDF= spark.readStream
.format("kafka")
.option("kafka.bootstrap.servers", kafkaBroker)
.option("subscribePattern", kafkaReqTopic)
.load()
val responseDF = StreamingDF.mapPartitions(rowIter => rowIter.map {
row =>
import spark.sqlContext.implicits._
def catalog = s"""{
|"table":{"namespace":"default", "name":"abchbasetable"},
|"rowkey":"abchbasetablerowkey",
|"columns":{
|"rowkey":{"cf":"rowkey", "col":"abchbasetablerowkey", "type":"string"},
|"col1":{"cf":"topo", "col":"col1", "type":"string"}
|}
|}""".stripMargin
def withCatalog(cat: String): DataFrame = {
spark.sqlContext
.read
.options(Map(HBaseTableCatalog.tableCatalog -> cat))
.format("org.apache.spark.sql.execution.datasources.hbase")
.load()
}
val staticDF = withCatalog(catalog)
staticDF.show(10, false)
})
val kafkaOutput = responseDF
  .writeStream
  .format("kafka")
  .option("kafka.bootstrap.servers", kafkaBroker)
  .option("topic", topicname)
  .option("checkpointLocation", "/pathtocheckpoint")
  .start()
I planned to get the static dataframe from HBase and append its columns to the streaming dataframe.
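There is no accepted answer in the original post; one common way to express this kind of lookup (a sketch, not a verified solution for Spark 2.1.1) is to load the HBase table once as a static dataframe on the driver, instead of inside mapPartitions, and then join it with the streaming dataframe, since streaming DataFrames can be joined with static ones. The key construction below is a placeholder; replace it with your real key logic, and move the catalog definition out of the map closure so it is available on the driver.
import org.apache.spark.sql.execution.datasources.hbase.HBaseTableCatalog

// Load the HBase table once as a static DataFrame via the shc connector.
val staticDF = spark.read
  .options(Map(HBaseTableCatalog.tableCatalog -> catalog))
  .format("org.apache.spark.sql.execution.datasources.hbase")
  .load()

// Placeholder key construction: derive the rowkey from the Kafka value.
val keyedStreamDF = StreamingDF.selectExpr("CAST(value AS STRING) AS rowkey")

// Stream-static join: each micro-batch is joined against the HBase data,
// and the result can then be written to Kafka as in the write block above.
val responseDF = keyedStreamDF.join(staticDF, Seq("rowkey"))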
This is kind of an offshoot of a post I had a month ago.
I have a Spark Structured Streaming application that reads from Kafka. Here is the basic structure of my code.
I create the Spark session:
val spark = SparkSession
.builder
.appName("app_name")
.getOrCreate()
Then I read from the stream
val data_stream = spark
.readStream
.format("kafka")
.option("kafka.bootstrap.servers", "server_list")
.option("subscribe", "topic")
.load()
From the Kafka record, I cast the "value" column to a string, which converts it from binary to string.
val df = data_stream
.select($"value".cast("string") as "json")
Based on a pre-defined schema, I try to parse the JSON structure into columns. However, if the data is "bad" or in a different format, it doesn't match the defined schema. I need to filter out rows that don't match my schema, whether they are null, numbers, or some random text like "hello". If a row is not valid JSON, it should not proceed to the next dataframe operation.
val df2 = df.select(from_json($"json", schema) as "data")
.select("data.*")
If I pass an empty Kafka message through the console producer, the Spark query crashes with:
java.util.NoSuchElementException: head of empty list
at scala.collection.immutable.Nil$.head(List.scala:420)
at scala.collection.immutable.Nil$.head(List.scala:417)
at org.apache.spark.sql.catalyst.expressions.JsonToStruct.nullSafeEval(jsonExpressions.scala:500)
at org.apache.spark.sql.catalyst.expressions.UnaryExpression.eval(Expression.scala:325)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificPredicate.eval(Unknown
Source)
at org.apache.spark.sql.execution.FilterExec$$anonfun$17$$anonfun$apply$2.apply(basicPhysicalOperators.scala:219)
at org.apache.spark.sql.execution.FilterExec$$anonfun$17$$anonfun$apply$2.apply(basicPhysicalOperators.scala:218)
at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:463)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at org.apache.spark.sql.execution.streaming.ForeachSink$$anonfun$addBatch$1.apply(ForeachSink.scala:52)
at org.apache.spark.sql.execution.streaming.ForeachSink$$anonfun$addBatch$1.apply(ForeachSink.scala:49)
at org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1$$anonfun$apply$29.apply(RDD.scala:925)
at org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1$$anonfun$apply$29.apply(RDD.scala:925)
at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1944)
at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1944)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
at org.apache.spark.scheduler.Task.run(Task.scala:99)
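The post ends at the stack trace; a common mitigation (a sketch, not taken from the thread) is to drop null or empty payloads before calling from_json, and then drop rows whose parse result is null, since from_json returns null for input it cannot parse:
import org.apache.spark.sql.functions.{from_json, length, trim}

// Filter out null/empty payloads before parsing, then discard rows that failed to parse.
val df2 = df
  .filter($"json".isNotNull && length(trim($"json")) > 0)
  .select(from_json($"json", schema) as "data")
  .filter($"data".isNotNull)
  .select("data.*")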
I am trying to use the Structured Streaming approach, based on the DataFrame/Dataset API, to load a stream of data from Kafka.
I use:
Spark 2.10
Kafka 0.10
spark-sql-kafka-0-10
The Spark Kafka data source has the following predefined underlying schema:
|key|value|topic|partition|offset|timestamp|timestampType|
My data comes in JSON format and is stored in the value column. I am looking for a way to extract the underlying schema from the value column and expand the received dataframe into the columns stored in value. I tried the approach below but it does not work:
val columns = Array("column1", "column2") // column names
val rawKafkaDF = sparkSession.sqlContext.readStream
.format("kafka")
.option("kafka.bootstrap.servers","localhost:9092")
.option("subscribe",topic)
.load()
val columnsToSelect = columns.map( x => new Column("value." + x))
val kafkaDF = rawKafkaDF.select(columnsToSelect:_*)
// some analytics using stream dataframe kafkaDF
val query = kafkaDF.writeStream.format("console").start()
query.awaitTermination()
Here I get the exception org.apache.spark.sql.AnalysisException: Can't extract value from value#337; because at the time the stream is created, the values inside are not known...
Do you have any suggestions?
From the Spark perspective, value is just a sequence of bytes. Spark has no knowledge of the serialization format or content. To be able to extract a field you have to parse it first.
If the data is serialized as a JSON string, you have two options. You can cast value to StringType and use from_json with a provided schema:
import org.apache.spark.sql.types._
import org.apache.spark.sql.functions.from_json
val schema: StructType = StructType(Seq(
StructField("column1", ???),
StructField("column2", ???)
))
rawKafkaDF.select(from_json($"value".cast(StringType), schema))
or cast to StringType and extract fields by path using get_json_object:
import org.apache.spark.sql.functions.get_json_object
val columns: Seq[String] = ???
val exprs = columns.map(c => get_json_object($"value", s"$$.$c"))
rawKafkaDF.select(exprs: _*)
and cast later to the desired types.
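As a concrete (hypothetical) illustration of the first option, suppose column1 and column2 both arrive as strings and column2 should end up as an integer:
import org.apache.spark.sql.types._
import org.apache.spark.sql.functions.from_json

val schema: StructType = StructType(Seq(
  StructField("column1", StringType),
  StructField("column2", StringType)
))

val parsed = rawKafkaDF
  .select(from_json($"value".cast(StringType), schema).as("data"))
  .select(
    $"data.column1".as("column1"),
    $"data.column2".cast(IntegerType).as("column2") // cast later to the desired type
  )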