Bear with me, I'm new to this. I'm trying to read a Kafka stream from a Zeppelin notebook, but it's not returning any data. When I read the same topic from the command line, it does return data.
C:\kafka_2.13-2.6.0>bin\windows\kafka-console-consumer.bat --topic quickstart-events --from-beginning --bootstrap-server localhost:9092
This is my first event
This is my second event
This is my code:
val sourceTopic = "quickstart-events"
val targetTopic = "sensor-processed"
val kafkaBootstrapServer = ""
import org.apache.spark.sql.SparkSession
val sparkSession = SparkSession.builder.appName("Simple Application")
.config("spark.master", "local").getOrCreate()
val rawData = sparkSession.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", kafkaBootstrapServer)
  .option("subscribe", sourceTopic)
  .option("startingOffsets", "earliest")
  .load()
case class SensorData(id: String, ts: Long, value: Double)
import org.apache.spark.sql.Encoders
val schema = Encoders.product[SensorData].schema
val rawValues = rawData.selectExpr("CAST(value AS STRING)").as[String]
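// In-memory sink; the query name is the table name used in the SQL below
val visualizationQuery = rawValues.writeStream
  .queryName("visualization")
  .format("memory")
  .start()
val sampleDataset = sparkSession.sql("select * from visualization")
sampleDataset.count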
The count returns 0, when there should be two events.
I have a function, kafkaIngestion, that creates a DataFrame from a Kafka topic as follows:
def kafkaIngestion(spark: SparkSession): DataFrame = {
  val df = spark.read.format("kafka")
    .option("kafka.bootstrap.servers", broker)
    .option("subscribe", topic)
    .option("", grpid)
    .load()
    // Kafka values arrive as bytes: cast to string, then parse the JSON
    .selectExpr("cast(value as string) as data")
    .select(from_json($"data", schema = inputSchema))
  df
}
I am unable to mock this code so that it returns my expected DataFrame. What is the correct way to mock the DataFrame?
I would like to apply OneHotEncoder to multiple columns in my streaming DataFrame, but I get the following error.
Any suggestions?
Many thanks!
Exception in thread "main" org.apache.spark.sql.AnalysisException: Queries with streaming sources
must be executed with writeStream.start();;
// Read csv
val Stream = spark.readStream
  .option("header", "true")
  .option("delimiter", ";")
  .csv("[...]")
// Kafka
val properties = new Properties()
//val topic = "mongotest"
properties.put("bootstrap.servers", "localhost:9092")
properties.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
properties.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")
Stream.selectExpr("CAST(Col AS STRING) AS KEY",
    "to_json(struct(*)) AS value")
  .writeStream
  .format("kafka")
  .option("topic", "predict")
  .option("kafka.bootstrap.servers", "localhost:9092")
  .option("checkpointLocation", "C:[...]")
  .start()
// Subscribe to topic
val lines = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092")
  .option("subscribe", "predict")
  .load()
val df = Stream
.selectExpr("CAST(value AS STRING)")
val jsons = df.select(from_json($"value", DFschema) as "data").select("data.*")
// Apply function Bucketizer() to field
val Msplits = Array(Double.NegativeInfinity, 7, 14, 21, Double.PositiveInfinity)
val bucketizerM = new Bucketizer()
val bucketedData1 = bucketizerD.transform(out)
val bucketedData2 = bucketizerM.transform(bucketedData1) // Works
// Error using OneHotEncoder()
val indexer = new StringIndexer()
val encoder = new OneHotEncoder()
val vectorAssembler = new VectorAssembler()
  .setInputCols(Array("A", "B", "CODE_encoded"))
val transformationPipeline = new Pipeline()
  .setStages(Array(indexer, encoder, vectorAssembler))
val fittedPipeline = transformationPipeline.fit(bucketedData2) // Doesn't work
I am reading Parquet files, converting them to JSON, and sending the result to Kafka. The problem is that it reads all of the Parquet data and sends it to Kafka in one go, but I want to send the JSON data line by line or in batches:
object WriteParquet2Kafka {
  def main(args: Array[String]): Unit = {
    val spark: SparkSession = SparkSession
      .builder
      .appName("Write Parquet to Kafka")
      .getOrCreate()
    import spark.implicits._
    // parquetSchema and parquetPath stand in for values not shown here
    val ds: DataFrame = spark.readStream
      .schema(parquetSchema)
      .parquet(parquetPath)
    val df: DataFrame = ds.select($"vin" as "key", to_json(struct(ds.columns.map(col(_)): _*)) as "value")
      .filter($"key".isNotNull)
    val ddf = df
      .writeStream
      .format("kafka")
      .option("topic", topics)
      .option("kafka.bootstrap.servers", "localhost:9092")
      .option("checkpointLocation", "/tmp/test")
      .trigger(Trigger.ProcessingTime("10 seconds"))
      .start()
    ddf.awaitTermination()
  }
}
Is it possible to do this?
I finally figured out how to solve my problem: just add an option and set a suitable value for maxFilesPerTrigger:
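val df: DataFrame = spark.readStream
  .option("maxFilesPerTrigger", 1)
  .schema(parquetSchema) // placeholder: schema of the Parquet files
  .parquet(parquetPath)  // placeholder: directory holding the Parquet files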
Note: with maxFilesPerTrigger set to 1, each trigger reads a single Parquet file, so the files are sent to Kafka one micro-batch at a time.
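As a rough sketch of that fix, the streaming read could be wrapped in a small helper. The names ParquetBatches and readParquetInBatches and the parameters are made up for illustration and are not part of the code above. It assumes the schema of the Parquet files is known up front, since streaming file sources expect an explicit schema by default:

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.types.StructType

object ParquetBatches {
  // Hypothetical helper: streams a directory of Parquet files and admits
  // at most `filesPerTrigger` new files into each micro-batch.
  def readParquetInBatches(
      spark: SparkSession,
      path: String,
      schema: StructType,
      filesPerTrigger: Int): DataFrame =
    spark.readStream
      .schema(schema) // explicit schema for the file source
      .option("maxFilesPerTrigger", filesPerTrigger.toLong)
      .parquet(path)
}
```

Calling ParquetBatches.readParquetInBatches(spark, path, schema, 1) should give a streaming DataFrame that can feed the same select and writeStream chain as in the question; a larger value lets each micro-batch pick up several files.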
When I display the sorted results on the console, they appear in the expected order, but when I push the same results to a Kafka topic, the sort order is lost.
def main(args: Array[String]) = {
  // Spark config and Kafka config
  // load method
  val Raw_df = readStream(sparkSession, inputtopic)
  // convert the Kafka messages that were read into JSON
  val df_messages = Raw_df.selectExpr("CAST(value AS STRING)")
    .withColumn("data", from_json($"value", my_schema))
  val win = window($"date_column", "5 minutes")
  val modified_df = df_messages.withWatermark("date_column", "3 minutes")
    .groupBy(win, $"All_colums", $"date_column")
    .count()
  val finalcol = modified_df.drop("count").drop("window")
  // map all columns and convert them to JSON messages
  val finalcolonames = finalcol.columns
  val dataset_Json = finalcol.withColumn("value", to_json(struct(finalcolonames.map(y => col(y)): _*)))
  //val query = writeToKafkaStremoutput(dataset_Json, outputtopic, checkpointlocation)
  val query = writeToConsole(order)
}
// the method below writes data to a Kafka topic
def writeToKafkaStremoutput(dataFrame: DataFrame, Config: Config, topic: String, checkpointlocation: String) = {
  dataFrame
    .selectExpr("CAST(value AS STRING)")
    .writeStream
    .format("kafka")
    .trigger(Trigger.ProcessingTime("1 second"))
    .option("topic", topic)
    .option("kafka.bootstrap.servers", "kafka.bootstrap_servers")
    .option("checkpointLocation", checkpointlocation)
    .start()
}
// console output for testing
// the method below writes data to the console
def writeToConsole(dataFrame: DataFrame) = {
  import org.apache.spark.sql.streaming.Trigger
  val query = dataFrame
    .writeStream
    .format("console")
    //.trigger(Trigger.ProcessingTime("20 second"))
    .option("truncate", false)
    .start()
  query
}
I have the following code to read and process Kafka data using Structured Streaming
object ETLTest {
  case class record(value: String, topic: String)
  def main(args: Array[String]): Unit = {
    run()
  }
  def run(): Unit = {
    val spark = SparkSession
      .builder
      .appName("Test JOB")
      .getOrCreate()
    val kafkaStreamingDF = spark
      .readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "...")
      .option("subscribe", "...")
      .option("failOnDataLoss", "false")
      .load()
      .selectExpr("CAST(value as STRING)", "CAST(timestamp as STRING)", "CAST(topic as STRING)")
    val sdvWriter = new ForeachWriter[record] {
      def open(partitionId: Long, version: Long): Boolean = true
      def process(record: record) = println("record:: " + record)
      def close(errorOrNull: Throwable): Unit = {}
    }
    val sdvDF = kafkaStreamingDF
      .as[record]
    // Variant 1: foreach sink (records are printed)
    /*val query = sdvDF
      .writeStream
      .foreach(sdvWriter)
      .start()*/
    // Variant 2: console sink (no records are shown)
    /*val query = sdvDF
      .writeStream
      .format("console")
      .start()*/
  }
}
I am running this code from the IntelliJ IDEA IDE. When I use foreach(sdvWriter), I can see the records consumed from Kafka, but when I use .writeStream.format("console") I do not see any records. I assume that the console write stream maintains some sort of checkpoint and believes it has already processed all the records. Is that the case? Am I missing something obvious here?
I reproduced your code, and both options worked. In fact, both options fail without import spark.implicits._, so I'm not sure what you are missing. Some dependencies might be configured incorrectly. Can you add your pom.xml?
import org.apache.spark.SparkContext
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.streaming.Trigger
object Check {
  case class record(value: String, topic: String)
  def main(args: Array[String]): Unit = {
    val spark = SparkSession
      .builder
      .getOrCreate()
    // provides the encoder that .as[record] needs below
    import spark.implicits._
    val kafkaStreamingDF = spark
      .readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "localhost:9092")
      .option("subscribe", "test")
      .option("failOnDataLoss", "false")
      .load()
      .selectExpr("CAST(value as STRING)", "CAST(timestamp as STRING)", "CAST(topic as STRING)")
    val sdvDF = kafkaStreamingDF
      .as[record]
    val query = sdvDF.writeStream
      .format("console")
      .start()
    query.awaitTermination()
  }
}