Bear with me, I'm new to this. I'm trying to read a Kafka stream from a Zeppelin notebook, but it's not returning any data. When I read the topic from the command line, however, it does return data:
C:\kafka_2.13-2.6.0>bin\windows\kafka-console-consumer.bat --topic quickstart-events --from-beginning --bootstrap-server localhost:9092
This is my first event
This is my second event
This is my code:
val sourceTopic = "quickstart-events"
val targetTopic = "sensor-processed"
val kafkaBootstrapServer = "127.0.0.1:9092"
import org.apache.spark.sql.SparkSession
val sparkSession = SparkSession.builder.appName("Simple Application")
  .config("spark.master", "local").getOrCreate()
val rawData = sparkSession.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", kafkaBootstrapServer)
  .option("subscribe", sourceTopic)
  .option("startingOffsets", "earliest")
  .load()
case class SensorData(id: String, ts: Long, value: Double)
import org.apache.spark.sql.Encoders
val schema = Encoders.product[SensorData].schema
import sparkSession.implicits._  // needed for .as[String]
val rawValues = rawData.selectExpr("CAST(value AS STRING)").as[String]
val visualizationQuery = rawValues.writeStream
  .queryName("visualization")
  .outputMode("append")
  .format("memory")
  .start()
val sampleDataset = sparkSession.sql("select * from visualization")
sampleDataset.count
The count returns 0, but there should be two events.
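One thing worth checking (an assumption, not a confirmed diagnosis): the memory sink is populated asynchronously by the streaming query, so counting immediately after start() can return 0 before the first micro-batch finishes. A minimal sketch that blocks until the data available at that moment has been processed:

// Hedged sketch: processAllAvailable() blocks until everything currently
// available on the topic has been processed into the in-memory table.
// It is intended for testing, not for production jobs.
visualizationQuery.processAllAvailable()
val sampleDataset = sparkSession.sql("select * from visualization")
sampleDataset.count  // should now reflect the two events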
My dependencies:
org.apache.spark:spark-sql-kafka-0-10_2.11:2.4.7
org.scala-lang:scala-library:2.11.0
org.apache.spark:spark-core_2.11:2.4.7
org.apache.spark:spark-sql_2.11:2.4.7
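As an aside, these coordinates pair scala-library 2.11.0 with Spark 2.4.7, which was built against Scala 2.11.12. A consistent set (assuming the Zeppelin Spark interpreter is also on 2.4.x) might look like:
org.scala-lang:scala-library:2.11.12
org.apache.spark:spark-core_2.11:2.4.7
org.apache.spark:spark-sql_2.11:2.4.7
org.apache.spark:spark-sql-kafka-0-10_2.11:2.4.7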
Related
I have a function kafkaIngestion which creates a DataFrame from a Kafka topic in the following way:
def kafkaIngestion(spark: SparkSession): DataFrame = {
  val df = spark.read.format("kafka")
    .option("kafka.bootstrap.servers", broker)
    .option("subscribe", topic)
    .option("group.id", grpid)
    .load()
    .selectExpr("cast(value as string) as data")
    .select(from_json($"data", inputSchema).as("data"))
    .select("data.*")
  df
}
I am unable to mock the code to return my expected DataFrame. What's the correct way to mock it?
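One common pattern, rather than mocking the Kafka read itself, is to split the parsing into a function that takes a DataFrame, then feed it an in-memory DataFrame shaped like the Kafka source output (a value column). A minimal sketch; parseKafkaValues, the sample schema, and the JSON literal are illustrative, not from the original code:

import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.from_json
import org.apache.spark.sql.types._

// Hypothetical schema standing in for inputSchema from the question.
val inputSchema = StructType(Seq(
  StructField("id", StringType),
  StructField("ts", LongType),
  StructField("value", DoubleType)))

// The transformation under test, extracted so it no longer depends on Kafka.
def parseKafkaValues(df: DataFrame): DataFrame = {
  val spark = df.sparkSession
  import spark.implicits._
  df.selectExpr("cast(value as string) as data")
    .select(from_json($"data", inputSchema).as("data"))
    .select("data.*")
}

// In the test: build a DataFrame that looks like what the Kafka source
// returns (a value column) and run only the transformation on it.
val spark = SparkSession.builder.master("local[1]").getOrCreate()
import spark.implicits._
val fakeKafkaDf = Seq("""{"id":"a","ts":1,"value":2.0}""").toDF("value")
parseKafkaValues(fakeKafkaDf).show()

The Kafka-specific options (broker, subscribe, group.id) then stay in a thin kafkaIngestion wrapper that the unit test never touches.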
I would like to apply OneHotEncoder to multiple columns in my streaming DataFrame, but I get the following error.
Any suggestions?
Many thanks!
Exception in thread "main" org.apache.spark.sql.AnalysisException: Queries with streaming sources
must be executed with writeStream.start();;
CODE:
// Read CSV as a stream (readStream, since writeStream is called on it below)
val Stream = spark.readStream
  .format("csv")
  .option("header", "true")
  .option("delimiter", ";")
  .schema(DFschema)
  .load("C:/[...]")
// Kafka
val properties = new Properties()
//val topic = "mongotest"
properties.put("bootstrap.servers", "localhost:9092")
properties.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
properties.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")
Stream.selectExpr("CAST(Col AS STRING) AS KEY",
    "to_json(struct(*)) AS value")
  .writeStream
  .format("kafka")
  .option("topic", "predict")
  .option("kafka.bootstrap.servers", "localhost:9092")
  .option("checkpointLocation", "C:[...]")
  .start()
Subscribe to the topic:
val lines = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092")
  .option("subscribe", "predict")
  .load()
val df = lines
  .selectExpr("CAST(value AS STRING)")
val jsons = df.select(from_json($"value", DFschema) as "data").select("data.*")
ETL[...]
Apply the Bucketizer() function to a field:
val Msplits = Array(Double.NegativeInfinity, 7, 14, 21, Double.PositiveInfinity)
val bucketizerM = new Bucketizer()
  .setInputCol("MEASURE")
  .setOutputCol("MEASURE_c")
  .setSplits(Msplits)
val bucketedData1 = bucketizerD.transform(out)  // bucketizerD and out come from the elided ETL step above
val bucketedData2 = bucketizerM.transform(bucketedData1)  // works
Error using OneHotEncoder():
val indexer = new StringIndexer()
  .setInputCol("CODE")
  .setOutputCol("CODE_index")
val encoder = new OneHotEncoder()
  .setInputCol("CODE_index")  // encode the indexed column, not the raw string
  .setOutputCol("CODE_encoded")
val vectorAssembler = new VectorAssembler()
  .setInputCols(Array("A", "B", "CODE_encoded"))
  .setOutputCol("features")
val transformationPipeline = new Pipeline()
  .setStages(Array(indexer, encoder, vectorAssembler))
val fittedPipeline = transformationPipeline.fit(bucketedData2)  // doesn't work
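The error itself comes from fit(): StringIndexer and OneHotEncoder have to scan the data to build their category maps, and a streaming DataFrame cannot be scanned that way; only transform() with an already-fitted model can run on a stream. A hedged sketch of that pattern, reusing the names above and assuming the same CSV can be read once as a static batch:

// Fit the pipeline on a static read of the same data...
val staticSample = spark.read
  .format("csv")
  .option("header", "true")
  .option("delimiter", ";")
  .schema(DFschema)
  .load("C:/[...]")

val fittedPipeline = transformationPipeline.fit(staticSample)

// ...then apply only transform() to the streaming DataFrame.
val encodedStream = fittedPipeline.transform(bucketedData2)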
I'm reading parquet files, converting them into JSON format, and then sending them to Kafka. The problem is that it reads the whole parquet file and sends it to Kafka in one go, but I want to send the JSON data line by line or in batches:
object WriteParquet2Kafka {
  def main(args: Array[String]): Unit = {
    val spark: SparkSession = SparkSession
      .builder
      .master("yarn")
      .appName("Write Parquet to Kafka")
      .getOrCreate()

    import spark.implicits._

    val ds: DataFrame = spark.readStream
      .schema(parquetSchema)        // placeholder for the parquet schema
      .parquet(pathToParquetFile)   // placeholder for the parquet path

    val df: DataFrame = ds
      .select($"vin" as "key", to_json(struct(ds.columns.map(col(_)): _*)) as "value")
      .filter($"key" isNotNull)

    val ddf = df
      .writeStream
      .format("kafka")
      .option("topic", topics)
      .option("kafka.bootstrap.servers", "localhost:9092")
      .option("checkpointLocation", "/tmp/test")
      .trigger(Trigger.ProcessingTime("10 seconds"))
      .start()

    ddf.awaitTermination()
  }
}
Is it possible to do this?
I finally figured out how to solve this: just add an option and set a suitable number for maxFilesPerTrigger:
val df: DataFrame = spark
  .readStream
  .option("maxFilesPerTrigger", 1)
  .schema(parquetSchema)
  .parquet(parquetUri)
Note: maxFilesPerTrigger must be set to 1 so that each parquet file is read in its own micro-batch.
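For completeness, a sketch of how the option slots into the rest of the job above (same placeholder names); the ProcessingTime trigger then paces how often each single-file micro-batch is sent to Kafka:

// Hedged sketch: with maxFilesPerTrigger = 1, each micro-batch holds one
// parquet file, and the trigger emits one batch roughly every 10 seconds.
val query = df
  .select($"vin" as "key", to_json(struct(df.columns.map(col(_)): _*)) as "value")
  .filter($"key" isNotNull)
  .writeStream
  .format("kafka")
  .option("topic", topics)
  .option("kafka.bootstrap.servers", "localhost:9092")
  .option("checkpointLocation", "/tmp/test")
  .trigger(Trigger.ProcessingTime("10 seconds"))
  .start()
query.awaitTermination()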
While displaying sorted results to the console, the results appear in the expected sort order, but when I push those results to a Kafka topic, the sort order is lost.
def main(args: Array[String]) = {
  // Spark config and Kafka config
  // load method
  val Raw_df = readStream(sparkSession, inputtopic)
  // converting the Kafka messages read into JSON format
  val df_messages = Raw_df.selectExpr("CAST(value AS STRING)")
    .withColumn("data", from_json($"value", my_schema))
    .select("data.*")
  val win = window($"date_column", "5 minutes")
  val modified_df = df_messages.withWatermark("date_column", "3 minutes")
    .groupBy(win, $"All_colums", $"date_column")
    .count()
    .orderBy(asc("date_column"), asc("column_5"))
  val finalcol = modified_df.drop("count").drop("window")
  // mapping all columns and converting them to JSON messages
  val finalcolonames = my_schema.fields.map(z => z.name)
  val dataset_Json = finalcol.withColumn("value", to_json(struct(finalcolonames.map(y => col(y)): _*)))
    .select($"value")
  //val query = writeToKafkaStremoutput(dataset_Json, outputtopic, checkpointlocation)
  val query = writeToConsole(dataset_Json)
  query
}
// The method below writes data to a Kafka topic
def writeToKafkaStremoutput(dataFrame: DataFrame, config: Config, topic: String, checkpointlocation: String) = {
  dataFrame
    .selectExpr("CAST(value AS STRING)")
    .writeStream
    .format("kafka")
    .trigger(Trigger.ProcessingTime("1 second"))
    .option("topic", topic)
    .option("kafka.bootstrap.servers", "kafka.bootstrap_servers")
    .option("checkpointLocation", checkpointlocation)
    .outputMode(OutputMode.Complete())
    .start()
}
// console output for testing
// The method below writes data to the console
def writeToConsole(dataFrame: DataFrame) = {
  import org.apache.spark.sql.streaming.Trigger
  val query = dataFrame
    .writeStream
    .format("console")
    .option("numRows", 300)
    //.trigger(Trigger.ProcessingTime("20 second"))
    .outputMode(OutputMode.Complete())
    .option("truncate", false)
    .start()
  query
}
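One thing that often explains this symptom (a general Kafka fact, not a diagnosis of this exact job): Kafka only guarantees ordering within a single partition, so a sorted stream written to a multi-partition topic can come back interleaved no matter what the console shows. A minimal sketch of one workaround, giving every record the same key so all of them land in one partition (only sensible for low-volume topics):

import org.apache.spark.sql.functions.lit

// Constant key => every record routes to the same partition, the only scope
// in which Kafka preserves write order.
val keyedJson = dataset_Json
  .withColumn("key", lit("0"))
  .selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")

Alternatively, create the output topic with a single partition, or keep the sort column in the payload and sort on the consumer side.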
I have the following code to read and process Kafka data using Structured Streaming
object ETLTest {

  case class record(value: String, topic: String)

  def main(args: Array[String]): Unit = {
    run()
  }

  def run(): Unit = {
    val spark = SparkSession
      .builder
      .appName("Test JOB")
      .master("local[*]")
      .getOrCreate()

    val kafkaStreamingDF = spark
      .readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "...")
      .option("subscribe", "...")
      .option("failOnDataLoss", "false")
      .option("startingOffsets", "earliest")
      .load()
      .selectExpr("CAST(value as STRING)", "CAST(timestamp as STRING)", "CAST(topic as STRING)")

    val sdvWriter = new ForeachWriter[record] {
      def open(partitionId: Long, version: Long): Boolean = {
        true
      }
      def process(record: record) = {
        println("record:: " + record)
      }
      def close(errorOrNull: Throwable): Unit = {}
    }

    val sdvDF = kafkaStreamingDF
      .as[record]
      .filter($"value".isNotNull)

    // DOES NOT WORK
    /*val query = sdvDF
      .writeStream
      .format("console")
      .start()
      .awaitTermination()*/

    // WORKS
    /*val query = sdvDF
      .writeStream
      .foreach(sdvWriter)
      .start()
      .awaitTermination()
    */
  }
}
I am running this code from the IntelliJ IDEA IDE. When I use foreach(sdvWriter), I can see the records consumed from Kafka, but when I use .writeStream.format("console") I do not see any records. I assume the console write stream is maintaining some sort of checkpoint and assumes it has processed all the records. Is that the case? Am I missing something obvious here?
I reproduced your code here, and both of the options worked. Actually, in both options it would fail without
import spark.implicits._, so I'm not sure what you are missing. Some dependencies might be configured incorrectly. Can you add the pom.xml?
import org.apache.spark.SparkContext
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.streaming.Trigger

object Check {

  case class record(value: String, topic: String)

  def main(args: Array[String]): Unit = {
    val spark = SparkSession
      .builder().master("local[2]")
      .getOrCreate

    import spark.implicits._

    val kafkaStreamingDF = spark
      .readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "localhost:9092")
      .option("subscribe", "test")
      .option("startingOffsets", "earliest")
      .option("failOnDataLoss", "false")
      .load()
      .selectExpr("CAST(value as STRING)", "CAST(timestamp as STRING)", "CAST(topic as STRING)")

    val sdvDF = kafkaStreamingDF
      .as[record]
      .filter($"value".isNotNull)

    val query = sdvDF.writeStream
      .format("console")
      .option("truncate", "false")
      .start()
      .awaitTermination()
  }
}