I am trying to create a SparkConsumer so I can send messages, in this case a csv file, to Kafka through Spark Streaming. But I get an error that 'path' is not specified.
My code is as follows:
import org.apache.log4j.{Level, Logger}
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.execution.streaming.FileStreamSource.Timestamp
import org.apache.spark.sql.types.{StringType, StructField, StructType}
import org.apache.spark.sql.functions.from_json
import org.apache.spark.sql.streaming.OutputMode
object sparkConsumer extends App {
val conf = new SparkConf().setMaster("local").setAppName("Name")
val sc = new SparkContext(conf)
val rootLogger = Logger.getRootLogger()
rootLogger.setLevel(Level.ERROR)
val spark = SparkSession
.builder()
.appName("Spark-Kafka-Integration")
.master("local")
.getOrCreate()
val schema = StructType(Array(
StructField("InvoiceNo", StringType, nullable = true),
StructField("StockCode", StringType, nullable = true),
StructField("Description", StringType, nullable = true),
StructField("Quantity", StringType, nullable = true)
))
val streamingDataFrame = spark.readStream.schema(schema).csv("C:/Users/me/Desktop/Tasks/Tasks1/test.csv")
streamingDataFrame.selectExpr("CAST(InvoiceNo AS STRING) AS key", "to_json(struct(*)) AS value")
.writeStream
.format("csv")
.option("topic", "topic_test")
.option("kafka.bootstrap.servers", "localhost:9092")
.option("checkpointLocation", "C:/Users/me/IdeaProjects/SparkStreaming/checkpointLocation/")
.start()
import spark.implicits._
val df = spark
.readStream
.format("kafka")
.option("kafka.bootstrap.servers", "localhost:9092")
.option("subscribe", "topic_test")
.load()
val df1 = df.selectExpr("CAST(value AS STRING)", "CAST(timestamp AS TIMESTAMP)").as[(String, Timestamp)]
.select(from_json($"value", schema).as("data"), $"timestamp")
.select("data.*", "timestamp")
df1.writeStream
.format("console")
.option("truncate","false")
.outputMode(OutputMode.Append)
.start()
.awaitTermination()
}
I get the following error:
Exception in thread "main" java.lang.IllegalArgumentException: 'path' is not specified
Does anyone know what I am missing?
The problem seems to be in this part of your code:
streamingDataFrame.selectExpr("CAST(InvoiceNo AS STRING) AS key", "to_json(struct(*)) AS value")
.writeStream
.format("csv")
.option("topic", "topic_test")
.option("kafka.bootstrap.servers", "localhost:9092")
.option("checkpointLocation", "C:/Users/me/IdeaProjects/SparkStreaming/checkpointLocation/")
.start()
This is because you use the "csv" format but don't set the file path it needs. Instead you configure Kafka properties, so a Kafka topic is really meant to be your sink. If you change the format to "kafka", it should work.
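With that change, the sink would look like this (same topic, broker, and checkpoint location as in your code):
streamingDataFrame.selectExpr("CAST(InvoiceNo AS STRING) AS key", "to_json(struct(*)) AS value")
.writeStream
.format("kafka") // write to the Kafka topic instead of csv files
.option("topic", "topic_test")
.option("kafka.bootstrap.servers", "localhost:9092")
.option("checkpointLocation", "C:/Users/me/IdeaProjects/SparkStreaming/checkpointLocation/")
.start()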
Another problem you can run into when using csv as a source is that your path should be a directory, not a file. In your case, if you create a directory and move your csv file into it, it will work.
Just for testing, create a directory named C:/Users/me/Desktop/Tasks/Tasks1/test.csv and create a file named part-0000.csv inside it. Then put your csv content into this new file and restart the process.
Related
I am trying to read data from two Kafka topics, but I am unable to join them and get the final dataframe.
My kafka topics are CSVStreamRetail and OrderItems.
val spark = SparkSession
.builder
.appName("Spark-Stream-Example")
.master("local[*]")
.config("spark.sql.warehouse.dir", "file:///C:/temp")
.getOrCreate()
val ordersSchema = new StructType()
.add("order_id", IntegerType)
.add("order_date", StringType)
.add("order_customer_id", IntegerType)
.add("order_status", StringType)
val orderItemsSchema = new StructType()
.add("order_item_id",IntegerType)
.add("order_item_order_id",IntegerType)
.add("order_item_product_id",IntegerType)
.add("order_item_quantity",IntegerType)
.add("order_item_subtotal",DoubleType)
.add("order_item_product_price", DoubleType)
import spark.implicits._
val df1 = spark
.readStream
.format("kafka")
.option("kafka.bootstrap.servers", "localhost:9092")
.option("subscribe", "CSVStreamRetail")
.load()
val df2 = spark
.readStream
.format("kafka")
.option("kafka.bootstrap.servers", "localhost:9092")
.option("subscribe", "OrderItems")
.load()
val ordersDF = df1.selectExpr("CAST(value AS STRING)", "CAST(timestamp as TIMESTAMP)").as[(String,Timestamp)]
.select(from_json($"value", ordersSchema).as("orders_data"),$"timestamp")
.select("orders_data.*","timestamp")
val orderItemsDF = df2.selectExpr("CAST(value as STRING)", "CAST(timestamp as TIMESTAMP)").as[(String,Timestamp)]
.select(from_json($"value",orderItemsSchema).as("order_items_data"),$"timestamp")
.select("order_items_data.*","timestamp")
val finalDF = orderItemsDF.join(ordersDF, orderItemsDF("order_item_order_id")===ordersDF("order_id"))
finalDF
.writeStream
.format("console")
.option("truncate", "false")
.start()
.awaitTermination()
The output I am receiving is an empty dataframe.
First of all, please check whether you are receiving data in your Kafka topics.
For a stream-stream join you should always provide a watermark on at least one of the streams; I see you want to perform an inner join.
So I have added a 200-second watermark, and now data shows up in the output dataframe.
val spark = SparkSession
.builder
.appName("Spark-Stream-Example")
.master("local[*]")
.config("spark.sql.warehouse.dir", "file:///C:/temp")
.getOrCreate()
val ordersSchema = new StructType()
.add("order_id", IntegerType)
.add("order_date", StringType)
.add("order_customer_id", IntegerType)
.add("order_status", StringType)
val orderItemsSchema = new StructType()
.add("order_item_id",IntegerType)
.add("order_item_order_id",IntegerType)
.add("order_item_product_id",IntegerType)
.add("order_item_quantity",IntegerType)
.add("order_item_subtotal",DoubleType)
.add("order_item_product_price", DoubleType)
import spark.implicits._
val df1 = spark
.readStream
.format("kafka")
.option("kafka.bootstrap.servers", "localhost:9092")
.option("subscribe", "CSVStreamRetail")
.load()
val df2 = spark
.readStream
.format("kafka")
.option("kafka.bootstrap.servers", "localhost:9092")
.option("subscribe", "OrderItems")
.load()
val ordersDF = df1.selectExpr("CAST(value AS STRING)", "CAST(timestamp as TIMESTAMP)").as[(String,Timestamp)]
.select(from_json($"value", ordersSchema).as("orders_data"),$"timestamp")
.select("orders_data.*","timestamp")
.withWatermark("timestamp","200 seconds")
val orderItemsDF = df2.selectExpr("CAST(value as STRING)", "CAST(timestamp as TIMESTAMP)").as[(String,Timestamp)]
.select(from_json($"value",orderItemsSchema).as("order_items_data"),$"timestamp")
.select("order_items_data.*","timestamp")
.withWatermark("timestamp","200 seconds")
val finalDF = orderItemsDF.join(ordersDF, orderItemsDF("order_item_order_id")===ordersDF("order_id"))
finalDF
.writeStream
.format("console")
.option("truncate", "false")
.start()
.awaitTermination()
Use the event timestamp for joining.
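If you also want Spark to be able to drop old join state, you can additionally bound the join condition by the event timestamps. A minimal sketch, assuming the renames below (the orderTime/itemTime names are illustrative, needed only because both streams carry a column called timestamp):
import org.apache.spark.sql.functions.expr
// rename the two event-time columns so they can be told apart in the join condition
val ordersWM = ordersDF.withColumnRenamed("timestamp", "orderTime")
val itemsWM = orderItemsDF.withColumnRenamed("timestamp", "itemTime")
val joinedDF = itemsWM.join(ordersWM, expr(
"order_item_order_id = order_id AND " +
"itemTime >= orderTime - interval 200 seconds AND " +
"itemTime <= orderTime + interval 200 seconds"))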
Let me know if this helps.
I want to save files to multiple destinations using foreachBatch. The code runs fine, but foreachBatch isn't behaving the way I wanted.
Kindly help me with this if you have any clue.
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql._
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}
import org.apache.spark.sql.streaming._
import org.apache.spark.storage.StorageLevel
object multiDestination {
val spark = SparkSession.builder()
.master("local")
.appName("Writing data to multiple destinations")
.getOrCreate()
def main(args: Array[String]): Unit = {
val mySchema = StructType(Array(
StructField("Id", IntegerType),
StructField("Name", StringType)
))
val askDF = spark
.readStream
.format("csv")
.option("header","true")
.schema(mySchema)
.load("/home/amulya/Desktop/csv/")
//println(askDF.show())
println(askDF.isStreaming)
askDF.writeStream.foreachBatch { (batchDF: DataFrame, batchId: Long) =>
batchDF.persist()
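// A hedged sketch of how the body could continue; the two output paths below are hypothetical.
batchDF.write.mode("append").format("parquet").save("/tmp/multiDest/parquet") // hypothetical destination 1
batchDF.write.mode("append").format("csv").option("header", "true").save("/tmp/multiDest/csv") // hypothetical destination 2
batchDF.unpersist() // release the cache once both writes are done
() // foreachBatch expects Unit
}.start().awaitTermination()
}
}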
I'm reading parquet files, converting them to JSON format, and sending them to Kafka. The problem is that it reads the whole parquet file and sends it to Kafka in one go, but I want to send the json data line by line, or in batches:
object WriteParquet2Kafka {
def main(args: Array[String]): Unit = {
val spark: SparkSession = SparkSession
.builder
.master("yarn")
.appName("Write Parquet to Kafka")
.getOrCreate()
import spark.implicits._
import org.apache.spark.sql.functions.{col, struct, to_json}
import org.apache.spark.sql.streaming.Trigger
val ds: DataFrame = spark.readStream
.schema(parquetSchema) // placeholder for the schema of the parquet files
.parquet(pathToParquetFile) // placeholder for the parquet location
val df: DataFrame = ds.select($"vin" as "key", to_json( struct( ds.columns.map(col(_)):_* ) ) as "value" )
.filter($"key" isNotNull)
val ddf = df
.writeStream
.format("kafka")
.option("topic", topics)
.option("kafka.bootstrap.servers", "localhost:9092")
.option("checkpointLocation", "/tmp/test")
.trigger(Trigger.ProcessingTime("10 seconds"))
.start()
ddf.awaitTermination()
}
}
Is it possible to do this?
I finally figured out how to solve my question: just add an option and set a suitable number for maxFilesPerTrigger:
val df: DataFrame = spark
.readStream
.option("maxFilesPerTrigger", 1)
.schema(parquetSchema)
.parquet(parquetUri)
Note: maxFilesPerTrigger must be set to 1 so that every parquet file is read one at a time.
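With maxFilesPerTrigger set to 1 and the 10-second ProcessingTime trigger already on your write side, each micro-batch should pick up at most one new parquet file.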
I'm learning Structured Streaming and I was not able to display the output in my console.
import org.apache.spark.sql._
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types.StructType
import org.apache.spark.sql.types._
import org.apache.spark.sql.streaming.ProcessingTime
object kafka_stream {
def main(args: Array[String]): Unit = {
val spark = SparkSession
.builder()
.appName("kafka-consumer")
.master("local[*]")
.getOrCreate()
import spark.implicits._
spark.sparkContext.setLogLevel("WARN")
// val schema = StructType().add("a", IntegerType()).add("b", StringType())
val schema = StructType(Seq(
StructField("a", IntegerType, true),
StructField("b", StringType, true)
))
val df = spark
.readStream
.format("kafka")
.option("kafka.bootstrap.servers", "172.21.0.187:9093")
.option("subscribe", "test")
.option("startingOffsets", "earliest")
.load()
val values = df.selectExpr("CAST(value AS STRING)").as[String]
values.writeStream
.outputMode("append")
.format("console")
.start()
.awaitTermination()
}
}
My input to Kafka
my name is abc how are you ?
I just want to display the strings from Kafka on the Spark console.
I'm trying to read files with a Spark 2.1.0 Structured Streaming program. The csv files are stored in a directory on my local machine, and I'm trying to use writeStream to write parquet files to another directory on my local machine. But whenever I try, I either get an error in .parquet or end up with empty folders.
This is my code :
case class TDCS_M05A(TimeInterval:String ,GantryFrom:String ,GantryTo:String ,VehicleType:Integer ,SpaceMeanSpeed:Integer ,CarTimes:Integer)
object Streamingcsv {
def main(args: Array[String]) {
val spark = SparkSession
.builder
.appName("Streamingcsv")
.config("spark.master", "local")
.getOrCreate()
import spark.implicits._
import org.apache.spark.sql.types._
val schema = StructType(
StructField("TimeInterval",DateType, false) ::
StructField("GantryFrom", StringType, false) ::
StructField("GantryTo", StringType, false) ::
StructField("VehicleType", IntegerType, false) ::
StructField("SpaceMeanSpeed", IntegerType, false) ::
StructField("CarTimes", IntegerType, false) :: Nil)
import org.apache.spark.sql.Encoders
val usrschema = Encoders.product[TDCS_M05A].schema
val csvDF = spark.readStream
.schema(usrschema) // Specify schema of the csv files
.csv("/home/hduser/IdeaProjects/spark2.1/data/*.csv")
val query = csvDF.select("GantryFrom").where("CarTimes > 0")
query
.writeStream
.outputMode("append")
.format("parquet")
.option("checkpointLocation", "checkpoint")
.start("/home/hduser/IdeaProjects/spark2.1/output/")
//.parquet("/home/hduser/IdeaProjects/spark2.1/output/")
//.start()
query.awaitTermination()
}
}
I referred to the page How to read a file using sparkstreaming and write to a simple file using Scala? and it doesn't work. Please help me, thanks.
You should make sure the checkpoint directory exists (or create it) before you start. You should also assign the streaming query to its own val, separate from your DataFrame.
If you don't need to, don't wrap the code in an object with an empty-args main method; that way you can directly access query methods such as query.lastProgress and query.stop (to name a few).
Change the second half of your code to look like this.
import org.apache.spark.sql.streaming.OutputMode
val csvQueriedDF = csvDF.select("GantryFrom").where("CarTimes > 0")
val query = csvQueriedDF
.writeStream
.outputMode(OutputMode.Append())
.format("parquet")
.option("checkpointLocation", "/home/hduser/IdeaProjects/spark2.1/partialTempOutput")
.option("path", "/home/hduser/IdeaProjects/spark2.1/output/")
.start()
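Once the query is running, you can inspect or control it from the shell, for example:
query.lastProgress // most recent StreamingQueryProgress (null until the first micro-batch completes)
query.stop() // gracefully stop the query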
Best of Luck!