Bear with me, I'm new to this. I'm trying to read a Kafka stream from a Zeppelin notebook, but it's not returning any data. When I read the topic from the command line, however, it does return data:
C:\kafka_2.13-2.6.0>bin\windows\kafka-console-consumer.bat --topic quickstart-events --from-beginning --bootstrap-server localhost:9092
This is my first event
This is my second event
This is my code:
val sourceTopic = "quickstart-events"
val targetTopic = "sensor-processed"
val kafkaBootstrapServer = "127.0.0.1:9092"
import org.apache.spark.sql.SparkSession
val sparkSession = SparkSession.builder.appName("Simple Application")
  .config("spark.master", "local").getOrCreate()
val rawData = sparkSession.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", kafkaBootstrapServer)
  .option("subscribe", sourceTopic)
  .option("startingOffsets", "earliest")
  .load()
case class SensorData(id: String, ts: Long, value: Double)
import org.apache.spark.sql.Encoders
val schema = Encoders.product[SensorData].schema
import sparkSession.implicits._  // needed for .as[String]
val rawValues = rawData.selectExpr("CAST(value AS STRING)").as[String]
val visualizationQuery = rawValues.writeStream
  .queryName("visualization")
  .outputMode("append")
  .format("memory")
  .start()
val sampleDataset = sparkSession.sql("select * from visualization")
sampleDataset.count
The count returns 0, but there should be two events.
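One thing worth checking (an assumption, not a confirmed diagnosis): the memory sink is populated asynchronously by the streaming query, so counting immediately after start() can return 0 before the first micro-batch finishes. A minimal sketch that blocks until the data available at that moment has been processed:

// Hedged sketch: processAllAvailable() blocks until everything currently
// available on the topic has been processed into the in-memory table.
// It is intended for testing, not for production jobs.
visualizationQuery.processAllAvailable()
val sampleDataset = sparkSession.sql("select * from visualization")
sampleDataset.count  // should now reflect the two events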
My dependencies:
org.apache.spark:spark-sql-kafka-0-10_2.11:2.4.7
org.scala-lang:scala-library:2.11.0
org.apache.spark:spark-core_2.11:2.4.7
org.apache.spark:spark-sql_2.11:2.4.7
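As an aside, these coordinates pair scala-library 2.11.0 with Spark 2.4.7, which was built against Scala 2.11.12. A consistent set (assuming the Zeppelin Spark interpreter is also on 2.4.x) might look like:
org.scala-lang:scala-library:2.11.12
org.apache.spark:spark-core_2.11:2.4.7
org.apache.spark:spark-sql_2.11:2.4.7
org.apache.spark:spark-sql-kafka-0-10_2.11:2.4.7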
Related
I have a function kafkaIngestion which creates a DataFrame from a Kafka topic in the following way:
def kafkaIngestion(spark: SparkSession): DataFrame = {
  val df = spark.read.format("kafka")
    .option("kafka.bootstrap.servers", broker)
    .option("subscribe", topic)
    .option("group.id", grpid)
    .load()
    .selectExpr("cast(value as string) as data")
    .select(from_json($"data", inputSchema).as("data"))
    .select("data.*")
  df
}
I am unable to mock the code to return my expected DataFrame. What's the correct way to mock it?
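One common pattern, rather than mocking the Kafka read itself, is to split the parsing into a function that takes a DataFrame, then feed it an in-memory DataFrame shaped like the Kafka source output (a value column). A minimal sketch; parseKafkaValues, the sample schema, and the JSON literal are illustrative, not from the original code:

import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.from_json
import org.apache.spark.sql.types._

// Hypothetical schema standing in for inputSchema from the question.
val inputSchema = StructType(Seq(
  StructField("id", StringType),
  StructField("ts", LongType),
  StructField("value", DoubleType)))

// The transformation under test, extracted so it no longer depends on Kafka.
def parseKafkaValues(df: DataFrame): DataFrame = {
  val spark = df.sparkSession
  import spark.implicits._
  df.selectExpr("cast(value as string) as data")
    .select(from_json($"data", inputSchema).as("data"))
    .select("data.*")
}

// In the test: build a DataFrame that looks like what the Kafka source
// returns (a value column) and run only the transformation on it.
val spark = SparkSession.builder.master("local[1]").getOrCreate()
import spark.implicits._
val fakeKafkaDf = Seq("""{"id":"a","ts":1,"value":2.0}""").toDF("value")
parseKafkaValues(fakeKafkaDf).show()

The Kafka-specific options (broker, subscribe, group.id) then stay in a thin kafkaIngestion wrapper that the unit test never touches.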
I would like to apply OneHotEncoder to multiple columns in my streaming DataFrame, but I get the following error.
Any suggestions?
Many thanks!
Exception in thread "main" org.apache.spark.sql.AnalysisException: Queries with streaming sources
must be executed with writeStream.start();;
CODE:
// Read CSV as a stream (readStream, since writeStream is called on it below)
val Stream = spark.readStream
  .format("csv")
  .option("header", "true")
  .option("delimiter", ";")
  .schema(DFschema)
  .load("C:/[...]")
// Kafka
val properties = new Properties()
//val topic = "mongotest"
properties.put("bootstrap.servers", "localhost:9092")
properties.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
properties.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")
Stream.selectExpr("CAST(Col AS STRING) AS KEY",
    "to_json(struct(*)) AS value")
  .writeStream
  .format("kafka")
  .option("topic", "predict")
  .option("kafka.bootstrap.servers", "localhost:9092")
  .option("checkpointLocation", "C:[...]")
  .start()
Subscribe to the topic:
val lines = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092")
  .option("subscribe", "predict")
  .load()
val df = lines
  .selectExpr("CAST(value AS STRING)")
val jsons = df.select(from_json($"value", DFschema) as "data").select("data.*")
ETL[...]
Apply the Bucketizer() function to a field:
val Msplits = Array(Double.NegativeInfinity, 7, 14, 21, Double.PositiveInfinity)
val bucketizerM = new Bucketizer()
  .setInputCol("MEASURE")
  .setOutputCol("MEASURE_c")
  .setSplits(Msplits)
val bucketedData1 = bucketizerD.transform(out)  // bucketizerD and out come from the elided ETL step above
val bucketedData2 = bucketizerM.transform(bucketedData1)  // works
Error using OneHotEncoder():
val indexer = new StringIndexer()
  .setInputCol("CODE")
  .setOutputCol("CODE_index")
val encoder = new OneHotEncoder()
  .setInputCol("CODE_index")  // encode the indexed column, not the raw string
  .setOutputCol("CODE_encoded")
val vectorAssembler = new VectorAssembler()
  .setInputCols(Array("A", "B", "CODE_encoded"))
  .setOutputCol("features")
val transformationPipeline = new Pipeline()
  .setStages(Array(indexer, encoder, vectorAssembler))
val fittedPipeline = transformationPipeline.fit(bucketedData2)  // doesn't work
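The error itself comes from fit(): StringIndexer and OneHotEncoder have to scan the data to build their category maps, and a streaming DataFrame cannot be scanned that way; only transform() with an already-fitted model can run on a stream. A hedged sketch of that pattern, reusing the names above and assuming the same CSV can be read once as a static batch:

// Fit the pipeline on a static read of the same data...
val staticSample = spark.read
  .format("csv")
  .option("header", "true")
  .option("delimiter", ";")
  .schema(DFschema)
  .load("C:/[...]")

val fittedPipeline = transformationPipeline.fit(staticSample)

// ...then apply only transform() to the streaming DataFrame.
val encodedStream = fittedPipeline.transform(bucketedData2)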
I'm reading parquet files, converting them into JSON format, and then sending them to Kafka. The problem is that it reads the whole parquet file and sends it to Kafka in one go, but I want to send the JSON data line by line or in batches:
object WriteParquet2Kafka {
  def main(args: Array[String]): Unit = {
    val spark: SparkSession = SparkSession
      .builder
      .master("yarn")
      .appName("Write Parquet to Kafka")
      .getOrCreate()

    import spark.implicits._

    val ds: DataFrame = spark.readStream
      .schema(parquetSchema)        // placeholder for the parquet schema
      .parquet(pathToParquetFile)   // placeholder for the parquet path

    val df: DataFrame = ds
      .select($"vin" as "key", to_json(struct(ds.columns.map(col(_)): _*)) as "value")
      .filter($"key" isNotNull)

    val ddf = df
      .writeStream
      .format("kafka")
      .option("topic", topics)
      .option("kafka.bootstrap.servers", "localhost:9092")
      .option("checkpointLocation", "/tmp/test")
      .trigger(Trigger.ProcessingTime("10 seconds"))
      .start()

    ddf.awaitTermination()
  }
}
Is it possible to do this?
I finally figured out how to solve this: just add an option and set a suitable number for maxFilesPerTrigger:
val df: DataFrame = spark
  .readStream
  .option("maxFilesPerTrigger", 1)
  .schema(parquetSchema)
  .parquet(parquetUri)
Note: maxFilesPerTrigger must be set to 1 so that each parquet file is read in its own micro-batch.
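For completeness, a sketch of how the option slots into the rest of the job above (same placeholder names); the ProcessingTime trigger then paces how often each single-file micro-batch is sent to Kafka:

// Hedged sketch: with maxFilesPerTrigger = 1, each micro-batch holds one
// parquet file, and the trigger emits one batch roughly every 10 seconds.
val query = df
  .select($"vin" as "key", to_json(struct(df.columns.map(col(_)): _*)) as "value")
  .filter($"key" isNotNull)
  .writeStream
  .format("kafka")
  .option("topic", topics)
  .option("kafka.bootstrap.servers", "localhost:9092")
  .option("checkpointLocation", "/tmp/test")
  .trigger(Trigger.ProcessingTime("10 seconds"))
  .start()
query.awaitTermination()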
While displaying sorted results to the console, the results appear in the expected sort order, but when I push those results to a Kafka topic, the sort order is lost.
def main(args: Array[String]) = {
  // Spark config and Kafka config
  // load method
  val Raw_df = readStream(sparkSession, inputtopic)
  // converting the Kafka messages read into JSON format
  val df_messages = Raw_df.selectExpr("CAST(value AS STRING)")
    .withColumn("data", from_json($"value", my_schema))
    .select("data.*")
  val win = window($"date_column", "5 minutes")
  val modified_df = df_messages.withWatermark("date_column", "3 minutes")
    .groupBy(win, $"All_colums", $"date_column")
    .count()
    .orderBy(asc("date_column"), asc("column_5"))
  val finalcol = modified_df.drop("count").drop("window")
  // mapping all columns and converting them to JSON messages
  val finalcolonames = my_schema.fields.map(z => z.name)
  val dataset_Json = finalcol.withColumn("value", to_json(struct(finalcolonames.map(y => col(y)): _*)))
    .select($"value")
  //val query = writeToKafkaStremoutput(dataset_Json, outputtopic, checkpointlocation)
  val query = writeToConsole(dataset_Json)
  query
}
// The method below writes data to a Kafka topic
def writeToKafkaStremoutput(dataFrame: DataFrame, config: Config, topic: String, checkpointlocation: String) = {
  dataFrame
    .selectExpr("CAST(value AS STRING)")
    .writeStream
    .format("kafka")
    .trigger(Trigger.ProcessingTime("1 second"))
    .option("topic", topic)
    .option("kafka.bootstrap.servers", "kafka.bootstrap_servers")
    .option("checkpointLocation", checkpointlocation)
    .outputMode(OutputMode.Complete())
    .start()
}
// console output for testing
// The method below writes data to the console
def writeToConsole(dataFrame: DataFrame) = {
  import org.apache.spark.sql.streaming.Trigger
  val query = dataFrame
    .writeStream
    .format("console")
    .option("numRows", 300)
    //.trigger(Trigger.ProcessingTime("20 second"))
    .outputMode(OutputMode.Complete())
    .option("truncate", false)
    .start()
  query
}
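One thing that often explains this symptom (a general Kafka fact, not a diagnosis of this exact job): Kafka only guarantees ordering within a single partition, so a sorted stream written to a multi-partition topic can come back interleaved no matter what the console shows. A minimal sketch of one workaround, giving every record the same key so all of them land in one partition (only sensible for low-volume topics):

import org.apache.spark.sql.functions.lit

// Constant key => every record routes to the same partition, the only scope
// in which Kafka preserves write order.
val keyedJson = dataset_Json
  .withColumn("key", lit("0"))
  .selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")

Alternatively, create the output topic with a single partition, or keep the sort column in the payload and sort on the consumer side.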
I have the following code to read and process Kafka data using Structured Streaming
object ETLTest {

  case class record(value: String, topic: String)

  def main(args: Array[String]): Unit = {
    run()
  }

  def run(): Unit = {
    val spark = SparkSession
      .builder
      .appName("Test JOB")
      .master("local[*]")
      .getOrCreate()

    val kafkaStreamingDF = spark
      .readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "...")
      .option("subscribe", "...")
      .option("failOnDataLoss", "false")
      .option("startingOffsets", "earliest")
      .load()
      .selectExpr("CAST(value as STRING)", "CAST(timestamp as STRING)", "CAST(topic as STRING)")

    val sdvWriter = new ForeachWriter[record] {
      def open(partitionId: Long, version: Long): Boolean = {
        true
      }
      def process(record: record) = {
        println("record:: " + record)
      }
      def close(errorOrNull: Throwable): Unit = {}
    }

    val sdvDF = kafkaStreamingDF
      .as[record]
      .filter($"value".isNotNull)

    // DOES NOT WORK
    /*val query = sdvDF
      .writeStream
      .format("console")
      .start()
      .awaitTermination()*/

    // WORKS
    /*val query = sdvDF
      .writeStream
      .foreach(sdvWriter)
      .start()
      .awaitTermination()
    */
  }
}
I am running this code from the IntelliJ IDEA IDE. When I use foreach(sdvWriter), I can see the records consumed from Kafka, but when I use .writeStream.format("console") I do not see any records. I assume the console write stream is maintaining some sort of checkpoint and assumes it has processed all the records. Is that the case? Am I missing something obvious here?
I reproduced your code here, and both of the options worked. Actually, in both options it would fail without
import spark.implicits._, so I'm not sure what you are missing. Some dependencies might be configured incorrectly. Can you add the pom.xml?
import org.apache.spark.SparkContext
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.streaming.Trigger

object Check {

  case class record(value: String, topic: String)

  def main(args: Array[String]): Unit = {
    val spark = SparkSession
      .builder().master("local[2]")
      .getOrCreate

    import spark.implicits._

    val kafkaStreamingDF = spark
      .readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "localhost:9092")
      .option("subscribe", "test")
      .option("startingOffsets", "earliest")
      .option("failOnDataLoss", "false")
      .load()
      .selectExpr("CAST(value as STRING)", "CAST(timestamp as STRING)", "CAST(topic as STRING)")

    val sdvDF = kafkaStreamingDF
      .as[record]
      .filter($"value".isNotNull)

    val query = sdvDF.writeStream
      .format("console")
      .option("truncate", "false")
      .start()
      .awaitTermination()
  }
}