Extracting data from spark streaming dataframe - scala

I'm new to Spark Streaming.
I'm getting an event similar to below from Kafka. I have to extract the path from the dataframe, read the data from the path and write it to a destination.
{"path":["/tmp/file_path/file.parquet"],"format":"parquet","entries":null}
Any idea on how to extract the path and format the spark streaming dataframe?
Here's what I'm trying to achieve,
val df: DataFrame = spark.readStream.format("kafka").
option("kafka.bootstrap.servers", "localhost:9092").
option("subscribe", "kafka-test-event").
option("startingOffsets", "earliest").load()
df.printSchema()
val valDf = df.selectExpr("CAST(value AS STRING)")
val path = valDf.collect()(0).getString(0)
println("path - "+ path)
val newDf = spark.read.parquet(path)
newDf.selectExpr("CAST(value AS STRING)").writeStream
.format("console")
.outputMode("append")
.start()
.awaitTermination()
Error:
Exception in thread "main" org.apache.spark.sql.AnalysisException: Queries with streaming sources must be executed with writeStream.start();;
kafka
at org.apache.spark.sql.catalyst.analysis.UnsupportedOperationChecker$.org$apache$spark$sql$catalyst$analysis$UnsupportedOperationChecker$$throwError(UnsupportedOperationChecker.scala:374)
When I try to do a collect on the dataframe it throws an Unsupported operation exception.

You are getting that error because you are trying to do static operations on a streaming dataframe.
you can try something like below.
After you read the streaming dataframe from Kafka, try the below
Create a schema class for parsing the incoming data
val incomingSchema = new StructType()
.add("path", StringType)
.add("format", StringType)
.add("entries", StringType)
Associate that schema on top of incoming data and you can select the required fields from your data and do transformations on top of it.
val valDf = df.selectExpr("CAST(value AS STRING) as jsonEntry").select(from_json($"jsonEntry",incomingSchema).select("path")

Related

.csv not a SequenceFile error on Select Hive Query

I am quite a newbie to Spark and Scala ;)
Code summary :
Reading data from CSV files --> Creating A simple inner join on 2 Files --> Writing data to Hive table --> Submitting the job on the cluster
Can you please help to identify what went wrong.
The code is not really complex.
The job is executed well on cluster.
Therefore when I try to visualize data written on hive table I am facing issue.
hive> select * from Customers limit 10;
Failed with exception java.io.IOException:java.io.IOException: hdfs://m01.itversity.com:9000/user/itv000666/warehouse/updatedcustomers.db/customers/part-00000-348a54cf-aa0c-45b4-ac49-3a881ae39702_00000.c000 .csv not a SequenceFile
object LapeyreSparkDemo extends App {
//Getting spark ready
val sparkConf = new SparkConf()
sparkConf.set("spark.app.name","Spark for Lapeyre")
//Creating Spark Session
val spark = SparkSession.builder()
.config(sparkConf)
.enableHiveSupport()
.config("spark.sql.warehouse.dir","/user/itv000666/warehouse")
.getOrCreate()
Logger.getLogger(getClass.getName).info("Spark Session Created Successfully")
//Reading
Logger.getLogger(getClass.getName).info("Data loading in DF started")
val ordersSchema = "orderid Int, customerName String, orderDate String, custId Int, orderStatus
String, age String, amount Int"
val orders2019Df = spark.read
.format("csv")
.option("header",true)
.schema(ordersSchema)
.option("path","/user/itv0006666/lapeyrePoc/orders2019.csv")
.load
val newOrder = orders2019Df.withColumnRenamed("custId", "oldCustId")
.withColumnRenamed("customername","oldCustomerName")
val orders2020Df = spark.read
.format("csv")
.option("header",true)
.schema(ordersSchema)
.option("path","/user/itv000666/lapeyrePoc/orders2020.csv")
.load
Logger.getLogger(getClass.getName).info("Data loading in DF complete")
//processing
Logger.getLogger(getClass.getName).info("Processing Started")
val joinCondition = newOrder.col("oldCustId") === orders2020Df.col("custId")
val joinType = "inner"
val joinData = newOrder.join(orders2020Df, joinCondition, joinType)
.select("custId","customername")
//Writing
spark.sql("create database if not exists updatedCustomers")
joinData.write
.format("csv")
.mode(SaveMode.Overwrite)
.bucketBy(4, "custId")
.sortBy("custId")
.saveAsTable("updatedCustomers.Customers")
//Stopping Spark Session
spark.stop()
}
Please let me know in case more information required.
Thanks in advance.
This is the culprit
joinData.write
.format("csv")
Instead used this and it worked.
joinData.write
.format("Hive")
Since I am writing data to hive table (orc format), the format should be "Hive" and not csv.
Also, do not forget to enable hive support while creating spark session.
Also, In spark 2, bucketby & sortby is not supported. Maybe it does in Spark 3.

Reading kafka topic using spark dataframe

I want to create dataframe on top of kafka topic and after that i want to register that dataframe as temp table to perform minus operation on data. I have written below code. But while querying registered table I'm getting error
"org.apache.spark.sql.AnalysisException: Queries with streaming sources must be executed with writeStream.start();;"
org.apache.spark.sql.types.DataType
org.apache.spark.sql.types.StringType
import org.apache.spark.sql.types._
val df = spark.readStream.format("kafka").option("kafka.bootstrap.servers", "SERVER ******").option("subscribe", "TOPIC_NAME").option("startingOffsets", "earliest").load()
df.printSchema()
val personStringDF = df.selectExpr("CAST(value AS STRING)")
val user_schema =StructType(Array(StructField("OEM",StringType,true),StructField("IMEI",StringType,true),StructField("CUSTOMER_ID",StringType,true),StructField("REQUEST_SOURCE",StringType,true),StructField("REQUESTER",StringType,true),StructField("REQUEST_TIMESTAMP",StringType,true),StructField("REASON_CODE",StringType,true)))
val personDF = personStringDF.select(from_json(col("value"),user_schema).as("data")).select("data.*")
personDF.registerTempTable("final_df1")
spark.sql("select * from final_df1").show
ERROR:---------- "org.apache.spark.sql.AnalysisException: Queries with streaming sources must be executed with writeStream.start();;"
Also i have used start() method and I'm getting below error.
20/08/11 00:59:30 ERROR streaming.MicroBatchExecution: Query final_df1 [id = 1a3e2ea4-2ec1-42f8-a5eb-8a12ce0fb3f5, runId = 7059f3d2-21ec-43c4-b55a-8c735272bf0f] terminated with error
java.lang.AbstractMethodError
NOTE: My main objective behind writing this script is i want to write minus query on this data and want to compare it with one of the register table i have on cluster. So , to summarise If I'm sending 1000 records in kafka topic from oracle database, I'm creating dataframe on top of oracle table , registering it as temp table and same I'm doing with kafka topic. Than i want to run minus query between source(oracle) and target(kafka topic). to perform 100% data validation between source and target. (Registering kafka topic as temporary table is possible?)
Use memory sink instead of registerTempTable. Check below code.
org.apache.spark.sql.types.DataType
org.apache.spark.sql.types.StringType
import org.apache.spark.sql.types._
val df = spark
.readStream
.format("kafka")
.option("kafka.bootstrap.servers", "SERVER ******")
.option("subscribe", "TOPIC_NAME")
.option("startingOffsets", "earliest")
.load()
df.printSchema()
val personStringDF = df.selectExpr("CAST(value AS STRING)")
val user_schema =StructType(Array(StructField("OEM",StringType,true),StructField("IMEI",StringType,true),StructField("CUSTOMER_ID",StringType,true),StructField("REQUEST_SOURCE",StringType,true),StructField("REQUESTER",StringType,true),StructField("REQUEST_TIMESTAMP",StringType,true),StructField("REASON_CODE",StringType,true)))
val personDF = personStringDF.select(from_json(col("value"),user_schema).as("data")).select("data.*")
personDF
.writeStream
.outputMode("append")
.format("memory")
.queryName("final_df1").start()
spark.sql("select * from final_df1").show(10,false)
Streaming DataFrame doesn't support the show() method. When you call start() method, it will start a background thread to stream the input data to the sink, and since you are using ConsoleSink, it will output the data to the console. You don't need to call show().
remove the below lines,
personDF.registerTempTable("final_df1")
spark.sql("select * from final_df1").show
and add the below or equivalent lines instead,
val query1 = personDF.writeStream.queryName("final_df1").format("memory").outputMode("append").start()
query1.awaitTermination()

How to apply Spark schema to the query based on Kafka topic name in Spark Structured Streaming?

I have a Spark Structured Streaming job which is streaming data from multiple Kafka topics based on a subscribePattern and for every Kafka topic I have a Spark schema. When streaming the data from Kafka I want to apply the Spark schema to the Kafka message based on the topic name.
Consider I have two topics: cust & customers.
Streaming data from Kafka based on subscribePattern (Java regex string):
var df = spark.readStream
.format("kafka")
.option("kafka.bootstrap.servers", "host1:port1,host2:port2")
.option("subscribePattern", "cust*")
.option("startingOffsets", "earliest")
.load()
.withColumn("value", $"value".cast("string"))
.filter($"value".isNotNull)
the above streaming query streams data from both the topics.
Let's say I have two Spark schemas one for each topic:
var cust: StructType = new StructType()
.add("name", StringType)
.add("age", IntegerType)
var customers: StructType = new StructType()
.add("id", IntegerType)
.add("first_name", StringType)
.add("last_name", StringType)
.add("email", StringType)
.add("address", StringType)
Now, I want to apply the Spark Schema based on topic name and to do that I have written a udf which reads the topic name and returns the schema in DDL format:
val schema = udf((table: String) => (table) match {
case ("cust") => cust.toDDL
case ("customers") => customers.toDDL
case _ => new StructType().toDDL
})
Then I am using the udf (I understand that udf applies on every column) inside the from_json method like this:
val query = df
.withColumn("topic", $"topic".cast("string"))
.withColumn("data", from_json($"value", schema($"topic")))
.select($"key", $"topic", $"data.*")
.writeStream.outputMode("append")
.format("console")
.start()
.awaitTermination()
This gives me the following exception which is correct because from_json expects String schema in DDL format or StructType.
org.apache.spark.sql.AnalysisException: Schema should be specified in DDL format as a string literal or output of the schema_of_json function instead of UDF(topic);
I want to know how to accomplish this?
Any help will be appreciated!
What you're doing is not possible. Your query df can't have 2 different schemas.
I can think of 2 ways to do it:
Split your df by topic, then apply your 2 schemas to 2 dfs (cust and customers)
Merge the 2 schemas into 1 schema and apply that to all topics.

How to display results of intermediate transformations of streaming query?

I am implementing one usecase to try-out Spark Structured Streaming API.
The source data is read from Kafka topic and after applying some transformations, results written to console.
I want to print the intermediate output along with the final results of the structured streaming query.
Here is the code snippet:
val trips = getTaxiTripDataframe() //this function consumes kafka topic and desrialize the byte array to create dataframe with required columns
val filteredTrips = trips.filter(col("taxiCompany").isNotNull && col("pickUpArea").isNotNull)
val output = filteredTrips
.groupBy("taxiCompany","pickupArea")
.agg(Map("pickupArea" -> "count"))
val query = output.writeStream.format("console")
.option("numRows","50")
.option("truncate","false")
.outputMode("update").start()
query.awaitTermination()
I want to print 'filteredTrips' dataframe on console. I tried using .show() method of dataframe, but as it is a dataframe created on streaming data, it is throwing below exception:
org.apache.spark.sql.AnalysisException: Queries with streaming sources must be executed with writeStream.start();;
Is there any other work around?
Yes, you can create two streams (I am using Spark 2.4.3)
val filteredTrips = trips.filter(col("taxiCompany").isNotNull && col("pickUpArea").isNotNull)
val query1 = filteredTrips
.format("console")
.option("numRows","50")
.option("truncate","false")
.outputMode("update").start()
val query2 = filteredTrips
.groupBy("taxiCompany","pickupArea")
.agg(Map("pickupArea" -> "count"))
.writeStream
.format("console")
.option("numRows","50")
.option("truncate","false")
.outputMode("update").start()
query1.awaitTermination()
query2.awaitTermination()

spark structured streaming dataframe/dataset apply map, iterate record and look into hbase table using shc connector

I am using structured streaming with Spark 2.1.1 Needs to construct the key from coulmns available in streaming dataframe(kafka source) and get the hbase table data in dataframe(static) using shc spark hbase connector. Then apply business logic using both dataframes.
Planning to construct key from iterating records in streaming dataframe, and for each record after constructing key look into hbase table, get the dataframe using shc connector, then apply some business logic using both dataframes. then send response data to kafka topic.
structured streaming with Spark 2.1.1, kafka data source, shc spark hbase connector
val StreamingDF= spark.readStream
.format("kafka")
.option("kafka.bootstrap.servers", kafkaBroker)
.option("subscribePattern", kafkaReqTopic)
.load()
val responseDF = StreamingDF.mapPartitions(rowIter => rowIter.map {
row =>
import spark.sqlContext.implicits._
def catalog = s"""{
|"table":{"namespace":"default", "name":"abchbasetable"},
|"rowkey":"abchbasetablerowkey",
|"columns":{
|"rowkey":{"cf":"rowkey", "col":"abchbasetablerowkey", "type":"string"},
|"col1":{"cf":"topo", "col":"col1", "type":"string"}
|}
|}""".stripMargin
def withCatalog(cat: String): DataFrame = {
spark.sqlContext
.read
.options(Map(HBaseTableCatalog.tableCatalog -> cat))
.format("org.apache.spark.sql.execution.datasources.hbase")
.load()
}
val staticDF = withCatalog(catalog)
staticDF.show(10, false)
})
val kafkaOutput = abc.responseDF
.format("kafka")
.option("kafka.bootstrap.servers", kafkaBroker)
.option("topic", topicname)
.option("checkpointLocation", "/pathtocheckpoint")`enter code here`
.start()
planned to get the static dataframe from hbase and append coulmns to streaming dataframe