Where to write HDFS data so that it can be read with Hive - Scala

Given that I write to HDFS with Apache Spark like this:
var df = spark.readStream
  .format("kafka")
  //.option("kafka.bootstrap.servers", "kafka1:19092")
  .option("kafka.bootstrap.servers", "localhost:29092")
  .option("subscribe", "my_event")
  .option("includeHeaders", "true")
  .option("startingOffsets", "earliest")
  .load()

df = df.selectExpr("CAST(topic AS STRING)", "CAST(partition AS STRING)", "CAST(offset AS STRING)", "CAST(value AS STRING)")

val emp_schema = new StructType()
  .add("id", StringType, true)
  .add("timestamp", TimestampType, true)

df = df.select(
  functions.col("topic"),
  functions.col("partition"),
  functions.col("offset"),
  functions.from_json(functions.col("value"), emp_schema).alias("data"))
df = df.select("topic", "partition", "offset", "data.*")

val query = df.writeStream
  .format("csv")
  .option("path", "hdfs://172.30.0.5:8020/test")
  .option("checkpointLocation", "checkpoint")
  .start()
query.awaitTermination()
Here hdfs://172.30.0.5:8020 is the namenode. It seems this Spark program is writing data successfully to HDFS via the namenode.
How can I query this data from Hive? Do I have to write the data into a special folder so that Hive can see it? Must I define a database for this folder? And how is this done? Where is the location of test then on the file-system?

Where is the location of test then on the file-system?
It's at /test
Note: if you properly configure fs.defaultFS in the core-site.xml, then you don't need to specify the full namenode address.
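A sketch of the same idea done from the Spark application rather than core-site.xml (the address is the one used above; cluster-level configuration in core-site.xml is usually preferable):
// Set the default filesystem on the Hadoop configuration Spark uses...
spark.sparkContext.hadoopConfiguration.set("fs.defaultFS", "hdfs://172.30.0.5:8020")

// ...after which the sink path no longer needs the namenode address:
// .option("path", "/test")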
Do I have to write the data into a special folder that hive can see it?
You can, and that would be the easiest, but the docs cover both options for Hive tables: "managed" (a dedicated HDFS location) and "external" (any other directory, with other restrictions):
https://spark.apache.org/docs/latest/sql-data-sources-hive-tables.html
How can I query this data from hive?
See above link.
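For the "external" route, a minimal sketch of the DDL issued through Spark (assuming Spark was built with Hive support and shares the Hive metastore; the table name is made up, and the timestamp column is kept as STRING because the CSV sink writes it as text):
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("hive-over-hdfs")
  .enableHiveSupport()            // requires a reachable Hive metastore
  .getOrCreate()

// External table over the directory the stream writes to; columns mirror the
// streaming query above (topic, partition, offset, id, timestamp).
spark.sql("""
  CREATE EXTERNAL TABLE IF NOT EXISTS my_event_csv (
    topic STRING,
    `partition` STRING,
    `offset` STRING,
    id STRING,
    `timestamp` STRING
  )
  ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
  STORED AS TEXTFILE
  LOCATION 'hdfs://172.30.0.5:8020/test'
""")
The same CREATE EXTERNAL TABLE statement can be run directly in beeline or the Hive CLI; after that, the directory is queryable as my_event_csv from both Hive and Spark SQL.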
FWIW, Confluent has a Kafka Connector that can write data to HDFS and create Hive tables

Related

Can we fetch data from Kafka from specific offset in spark structured streaming batch mode

In Kafka I get new topics dynamically and I have to process them using Spark Streaming from a specific offset. Is there a possibility to pass the JSON value from a variable? For example, consider the code below:
val df = spark
.read
.format("kafka")
.option("kafka.bootstrap.servers", "host1:port1,host2:port2")
.option("subscribePattern", "topic.*")
.option("startingOffsets", """{"topic1":{"0":23,"1":-2},"topic2":{"0":-2}}""")
.load()
In this I want to dynamically update the value for startingOffsets. I tried to pass the value as a string variable, but it did not work. If I hard-code the same value in startingOffsets, it works. How can I use a variable in this scenario?
val start_offset= """{"topic1":{"0":23,"1":-2},"topic2":{"0":-2}}"""
val df = spark
.read
.format("kafka")
.option("kafka.bootstrap.servers", "host1:port1,host2:port2")
.option("subscribePattern", "topic.*")
.option("startingOffsets", start_offset)
.load()
java.lang.IllegalArgumentException: Expected e.g. {"topicA":{"0":23,"1":-1},"topicB":{"0":-2}}, got """{"topicA":{"0":23,"1":-1},"topicB":{"0":-2}}"""
import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession

def main(args: Array[String]): Unit = {
  val conf = new SparkConf().setMaster("local[*]").setAppName("ReadSpecificOffsetFromKafka")
  val spark = SparkSession.builder().config(conf).getOrCreate()
  spark.sparkContext.setLogLevel("error")
  import spark.implicits._

  val start_offset = """{"first_topic" : {"0" : 15, "1": -2, "2": 6}}"""

  val fromKafka = spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "localhost:9092, localhost:9093")
    .option("subscribe", "first_topic")
    // .option("startingOffsets", "earliest")
    .option("startingOffsets", start_offset)
    .load()

  val selectedValues = fromKafka.selectExpr("cast(value as string)", "cast(partition as integer)")

  selectedValues.writeStream
    .format("console")
    .outputMode("append")
    // .trigger(Trigger.Continuous("3 seconds"))
    .start()
    .awaitTermination()
}
This is the exact code to fetch messages from a specific offset in Kafka using Spark Structured Streaming and Scala.
It looks like your job is checkpointing the Kafka offsets onto some persistent storage. Try cleaning those checkpoints and re-run your job.
Also try renaming your job and running it again.
Spark can read the stream via readStream, so try an offsets JSON in exactly the format shown in the error message to get rid of the error:
spark
  .readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "host1:port1")
  .option("subscribePattern", "topic.*")
  .option("startingOffsets", """{"topicA":{"0":23,"1":-1},"topicB":{"0":-2}}""")
  .load()

Termination of a Structured Streaming query using Databricks

I would like to understand whether running a cell in a Databricks notebook with the code below and then cancelling it means that the stream reading is over. Or perhaps it does require some explicit closing?
val df = spark
.readStream
.format("kafka")
.option("kafka.bootstrap.servers", kafkaBootstrapServers)
.option("subscribe", "topic1")
.load()
display(df.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
.as[(String, String)])
Non-display Mode
It's best to issue this command in a cell:
streamingQuery.stop()
for this type of approach:
val streamingQuery = streamingDF // Start with our "streaming" DataFrame
.writeStream // Get the DataStreamWriter
.queryName(myStreamName) // Name the query
.trigger(Trigger.ProcessingTime("3 seconds")) // Configure for a 3-second micro-batch
.format("parquet") // Specify the sink type, a Parquet file
.option("checkpointLocation", checkpointPath) // Specify the location of checkpoint files & W-A logs
.outputMode("append") // Write only new data to the "file"
.start(outputPathDir)
Otherwise it continues to run - which is the idea of streaming.
I would not stop the cluster, as that takes down all the streams.
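If the idea in a notebook is to let a stream run only for a bounded time, one sketch (reusing the streamingQuery handle from above) is the timeout overload of awaitTermination:
// Block for up to 60 seconds; returns true if the query terminated on its own.
val finished = streamingQuery.awaitTermination(60 * 1000)
if (!finished) streamingQuery.stop()   // otherwise stop it explicitly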
Databricks display Mode
DataBricks have written a nice set of utilities, but you need to do the course to get them. My folly.
display is a Databricks-specific function. It needs a format like:
display(myDF, streamName = "myQuery")
then proceed as follows in a separate cell:
println("Looking for %s".format(myStreamName))
for (stream <- spark.streams.active) // Loop over all active streams
if (stream.name == myStreamName) // Single out your stream
{val s = spark.streams.get(stream.id)
s.stop()
}
This will stop the display approach, which writes to a memory sink.

How to write a stream to S3 partitioned by the year, month and day when records were received?

I have a simple streams that reads some data from a Kafka topic:
val ds = spark
.readStream
.format("kafka")
.option("kafka.bootstrap.servers", "host1:port1")
.option("subscribe", "topic1")
.option("startingOffsets", "earliest")
.load()
val df = ds.selectExpr("cast (value as string) as json")
.select(from_json($"json", schema).as("data"))
.select("data.*")
I want to store this data in S3 based on the day it's received, so something like:
s3_bucket/year/month/day/data.json
When I want to write the data I do:
df.writeStream
.format("json")
.outputMode("append")
.option("path", s3_path)
.start()
But if I do this, I can only specify one path. Is there a way to change the S3 path dynamically based on the date?
Use the partitionBy clause:
import org.apache.spark.sql.functions._
df.select(
dayofmonth(current_date()) as "day",
month(current_date()) as "month",
year(current_date()) as "year",
$"*")
.writeStream
.partitionBy("year", "month", "day")
... // all other options
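Put together, a minimal end-to-end sketch; s3_path and checkpointPath are placeholders, and the file sink needs a checkpoint location. partitionBy produces directories like year=2023/month=1/day=15 under the output path, and current_date() reflects processing time, which matches "the day it's received":
import org.apache.spark.sql.functions._
import spark.implicits._   // for $"*"

val s3_path = "s3a://s3_bucket/data"            // placeholder
val checkpointPath = "s3a://s3_bucket/_chk"     // placeholder

df.select(
    dayofmonth(current_date()) as "day",
    month(current_date()) as "month",
    year(current_date()) as "year",
    $"*")
  .writeStream
  .format("json")
  .outputMode("append")
  .partitionBy("year", "month", "day")
  .option("path", s3_path)
  .option("checkpointLocation", checkpointPath)
  .start()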

Not able to write Data in Parquet File using Spark Structured Streaming

I have a Spark Structured Streaming:
val df = spark
.readStream
.format("kafka")
.option("kafka.bootstrap.servers", "localhost:9092")
.option("startingOffsets", "earliest")
.option("endingOffsets", "latest")
.option("subscribe", "topic")
.load()
I want to write data to FileSystem using DataStreamWriter,
val query = df
.writeStream
.outputMode("append")
.format("parquet")
.start("data")
But no data files get created in the data folder; only _spark_metadata is created.
However, I can see the data on console when format is console:
val query = df
.writeStream
.outputMode("append")
.format("console")
.start()
+--------------------+------------------+------------------+
| time| col1| col2|
+--------------------+------------------+------------------+
|49368-05-11 20:42...|0.9166470338147503|0.5576946794171861|
+--------------------+------------------+------------------+
I cannot understand the reason behind it.
Spark - 2.1.0
I had a similar problem but for different reasons; posting here in case someone has the same issue. When writing your output stream to file in append mode with watermarking, Structured Streaming has an interesting behavior where it won't actually write any data until a time bucket is older than the watermark time. If you're testing Structured Streaming and have an hour-long watermark, you won't see any output for at least an hour.
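For illustration, a minimal sketch of that situation (the events DataFrame and its eventTime column are made up): with a one-hour watermark and append output mode, a window's rows only land in the Parquet files once the watermark has passed the end of that window.
import org.apache.spark.sql.functions.{col, window}

val counts = events                       // hypothetical streaming DataFrame with an eventTime column
  .withWatermark("eventTime", "1 hour")
  .groupBy(window(col("eventTime"), "10 minutes"))
  .count()

counts.writeStream
  .format("parquet")
  .outputMode("append")                   // rows are emitted only after the watermark passes a window
  .option("path", "data")
  .option("checkpointLocation", "checkpoint")
  .start()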
I resolved this issue. When I tried to run the structured streaming code in spark-shell, it gave an error that endingOffsets are not valid in streaming queries, i.e.:
val df = spark
.readStream
.format("kafka")
.option("kafka.bootstrap.servers", "localhost:9092")
.option("startingOffsets", "earliest")
.option("endingOffsets", "latest")
.option("subscribe", "topic")
.load()
java.lang.IllegalArgumentException: ending offset not valid in streaming queries
at org.apache.spark.sql.kafka010.KafkaSourceProvider$$anonfun$validateStreamOptions$1.apply(KafkaSourceProvider.scala:374)
at org.apache.spark.sql.kafka010.KafkaSourceProvider$$anonfun$validateStreamOptions$1.apply(KafkaSourceProvider.scala:373)
at scala.Option.map(Option.scala:146)
at org.apache.spark.sql.kafka010.KafkaSourceProvider.validateStreamOptions(KafkaSourceProvider.scala:373)
at org.apache.spark.sql.kafka010.KafkaSourceProvider.sourceSchema(KafkaSourceProvider.scala:60)
at org.apache.spark.sql.execution.datasources.DataSource.sourceSchema(DataSource.scala:199)
at org.apache.spark.sql.execution.datasources.DataSource.sourceInfo$lzycompute(DataSource.scala:87)
at org.apache.spark.sql.execution.datasources.DataSource.sourceInfo(DataSource.scala:87)
at org.apache.spark.sql.execution.streaming.StreamingRelation$.apply(StreamingRelation.scala:30)
at org.apache.spark.sql.streaming.DataStreamReader.load(DataStreamReader.scala:124)
... 48 elided
So, I removed endingOffsets from the streaming query.
val df = spark
.readStream
.format("kafka")
.option("kafka.bootstrap.servers", "localhost:9092")
.option("startingOffsets", "earliest")
.option("subscribe", "topic")
.load()
Then I tried to save the streaming query's results in Parquet files, at which point I learned that a checkpoint location must be specified, i.e.:
val query = df
.writeStream
.outputMode("append")
.format("parquet")
.start("data")
org.apache.spark.sql.AnalysisException: checkpointLocation must be specified either through option("checkpointLocation", ...) or SparkSession.conf.set("spark.sql.streaming.checkpointLocation", ...);
at org.apache.spark.sql.streaming.StreamingQueryManager$$anonfun$3.apply(StreamingQueryManager.scala:207)
at org.apache.spark.sql.streaming.StreamingQueryManager$$anonfun$3.apply(StreamingQueryManager.scala:204)
at scala.Option.getOrElse(Option.scala:121)
at org.apache.spark.sql.streaming.StreamingQueryManager.createQuery(StreamingQueryManager.scala:203)
at org.apache.spark.sql.streaming.StreamingQueryManager.startQuery(StreamingQueryManager.scala:269)
at org.apache.spark.sql.streaming.DataStreamWriter.start(DataStreamWriter.scala:262)
at org.apache.spark.sql.streaming.DataStreamWriter.start(DataStreamWriter.scala:206)
... 48 elided
So, I added checkpointLocation:
val query = df
.writeStream
.outputMode("append")
.format("parquet")
.option("checkpointLocation", "checkpoint")
.start("data")
After these modifications, I was able to save the streaming query's results in Parquet files.
But it is strange that when I ran the same code as an sbt application, it didn't throw any errors, while running it via spark-shell threw the errors above. I think Apache Spark should throw these errors when run via an sbt/maven app too. It seems like a bug to me!
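As an aside, the exception above also mentions a session-wide default; a sketch of that alternative (the path is a placeholder):
// Set a default checkpoint location once on the session...
spark.conf.set("spark.sql.streaming.checkpointLocation", "/tmp/stream-checkpoints")

// ...then start() without a per-query checkpointLocation option.
val query = df.writeStream
  .outputMode("append")
  .format("parquet")
  .start("data")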

How to use from_json with Kafka connect 0.10 and Spark Structured Streaming?

I was trying to reproduce the example from [Databricks][1] and apply it to the new Kafka connector and Spark Structured Streaming; however, I cannot parse the JSON correctly using the out-of-the-box methods in Spark...
Note: the topic is written into Kafka in JSON format.
val ds1 = spark
.readStream
.format("kafka")
.option("kafka.bootstrap.servers", IP + ":9092")
.option("zookeeper.connect", IP + ":2181")
.option("subscribe", TOPIC)
.option("startingOffsets", "earliest")
.option("max.poll.records", 10)
.option("failOnDataLoss", false)
.load()
The following code won't work; I believe that's because the column json is a string and does not match the from_json method signature...
val df = ds1.select($"value" cast "string" as "json")
.select(from_json("json") as "data")
.select("data.*")
Any tips?
[UPDATE] Example working:
https://github.com/katsou55/kafka-spark-structured-streaming-example/blob/master/src/main/scala-2.11/Main.scala
First you need to define the schema for your JSON message. For example:
import org.apache.spark.sql.functions.from_json
import org.apache.spark.sql.types.StructType
import spark.implicits._   // for the $"..." column syntax

val schema = new StructType()
  .add($"id".string)
  .add($"name".string)
Now you can use this schema in the from_json method like below:
val df = ds1.select($"value" cast "string" as "json")
  .select(from_json($"json", schema) as "data")
  .select("data.*")