Stream files from HDFS using Apache Spark Streaming - Scala

How can I stream files that are already present in HDFS using Apache Spark?
I have a very specific use case: I have millions of customer records and I want to process them at a customer level using Spark Streaming. Currently I take the entire customer dataset, repartition it on customerId into 100 partitions, and ensure that all records for a given customer end up in a single stream.
All the data is present in the HDFS location
hdfs:///tmp/dataset
Using the above HDFS location, I want to stream the files, read the Parquet files, and get the dataset. I have tried the following approaches, but with no luck.
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

// start stream
val sparkConf = new SparkConf().setAppName("StreamApp")
// Create the context
val ssc = new StreamingContext(sparkConf, Seconds(60))
val dstream = ssc.sparkContext.textFile("hdfs:///tmp/dataset")
println("dstream: " + dstream)
println("dstream count: " + dstream.count())
println("dstream context: " + dstream.context)
ssc.start()
ssc.awaitTermination()
NOTE: This solution doesn't stream data; it just reads data from HDFS.
and
// start stream
val sparkConf = new SparkConf().setAppName("StreamApp")
// Create the context
val ssc = new StreamingContext(sparkConf, Seconds(60))
val dstream = ssc.textFileStream("hdfs:///tmp/dataset")
println("dstream: " + dstream)
println("dstream count: " + dstream.count())
println("dstream context: " + dstream.context)
dstream.print()
ssc.start()
ssc.awaitTermination()
I am always getting 0 results. Is it possible to stream files from HDFS if they are already present there and no new files are being published?

TL;DR This functionality is not supported in Spark as of now. The closest you can get is to move the files into hdfs:///tmp/dataset after starting the streaming context.
textFileStream internally uses FileInputDStream, which has an option newFilesOnly. But even then it does not process all existing files, only the files that were modified within one minute (set by the config value spark.streaming.fileStream.minRememberDuration) before the streaming context was started. As described in the JIRA issue:
when you set newFilesOnly to false, it means this FileInputDStream would not only handle incoming files, but also include files which came in the past 1 minute (not all the old files). The length of that window is defined by FileInputDStream.MIN_REMEMBER_DURATION.
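A minimal sketch of that first route, assuming the files can be treated as text (as in the attempts above): pass newFilesOnly = false explicitly via fileStream and widen the remember window, keeping in mind that files older than that window are still ignored.
import org.apache.hadoop.fs.Path
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val sparkConf = new SparkConf()
  .setAppName("StreamApp")
  // widen the look-back window FileInputDStream uses for "old" files
  .set("spark.streaming.fileStream.minRememberDuration", "600s")
val ssc = new StreamingContext(sparkConf, Seconds(60))

// newFilesOnly = false: also pick up files modified within the remember window
val dstream = ssc.fileStream[LongWritable, Text, TextInputFormat](
  "hdfs:///tmp/dataset",
  (path: Path) => !path.getName.startsWith("."), // skip hidden/temporary files
  newFilesOnly = false
).map(_._2.toString)

dstream.print()
ssc.start()
ssc.awaitTermination()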
Or
Alternatively, you could create a (normal) RDD out of the existing files before you start the streaming context, and use it alongside the streamed RDDs later.
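A minimal sketch of that second route, again treating the files as text; queueStream is used here purely to inject the pre-existing data as a one-off batch and is an illustration, not the only way to combine the two.
import scala.collection.mutable
import org.apache.spark.SparkConf
import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.{Seconds, StreamingContext}

val sparkConf = new SparkConf().setAppName("StreamApp")
val ssc = new StreamingContext(sparkConf, Seconds(60))

// RDD over the files that are already in HDFS
val existing: RDD[String] = ssc.sparkContext.textFile("hdfs:///tmp/dataset")

// Feed the existing data in as a single batch...
val existingStream = ssc.queueStream(mutable.Queue(existing), oneAtATime = true)
// ...and union it with the stream of newly arriving files
val newFiles = ssc.textFileStream("hdfs:///tmp/dataset")
val combined = existingStream.union(newFiles)

combined.foreachRDD { rdd =>
  // process each batch, e.g. repartition on customerId and handle per customer
  println("records in this batch: " + rdd.count())
}

ssc.start()
ssc.awaitTermination()
Note that queueStream is mainly intended for testing and does not support checkpointing.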

Related

Consuming exactly one file per micro batch in spark stream

I am new to Spark Streaming and am doing a POC to migrate one of our ETL applications from a commercial ETL tool to Spark. We get many small files throughout the day as input and perform ETL operations on them.
As part of the POC, we are trying out Spark Streaming. Currently, we are stuck on the logic to generate surrogate key (SK) values for each record. We identified ROW_NUMBER as a way to generate the SK, but it doesn't work for a Spark stream, as it would have to be a windowed operation.
I am trying to use the record number column (the sequence number of each record inside the files) to generate the SK:
Consume exactly one file at each micro batch.
Calculate SK = previous batch's max record number + record number.
In the write stream phase, calculate and store the max record number value using the foreachBatch option.
I have used the maxFilesPerTrigger option to fetch one file, but it doesn't guarantee exactly one file per micro batch.
val smsfDF_stream = spark.readStream
  .option("sep", ",")
  .schema(schema)
  .option("maxFilesPerTrigger", 1)
  .option("cleanSource", "archive")
  .option("sourceArchiveDir", "/edw_ab_data/spark_poc/archive")
  .csv("/edw_ab_data/spark_poc/input")
Also, when I check the max value for each batch DataFrame, I see '9999' as the value:
val fin_query = resinterdf2.writeStream
  .format("csv")
  .foreachBatch { (batchDF: DataFrame, batchId: Long) =>
    batchDF.createOrReplaceTempView("batchdf_tab")
    val file_name = "max_file_" + batchId + ".dat"
    val max_rec_id = batchDF.sparkSession.sql("""select max(rec_number) from batchdf_tab""")
    max_rec_id.write.format("csv").mode("overwrite").save("/edw_ab_data/spark_poc/max_val.dat")
  }
  .option("checkpointLocation", "/edw_ab_data/spark_poc")
  .outputMode("append")
  .option("path", "/edw_ab_data/spark_poc/output")
  .start()
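A minimal sketch of the SK calculation described in the steps above, reusing the resinterdf2 stream from the code. The rec_number column comes from the question, while the max-value path, the schema string, and the running-total handling are assumptions for illustration only. If rec_number is read as a string, max() compares it lexicographically, which may explain the '9999' result, so the sketch casts it to long.
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, lit, max}

// Hypothetical location where the previous batch's maximum record number is kept
val maxValPath = "/edw_ab_data/spark_poc/max_val"

def previousMax(spark: SparkSession): Long =
  try {
    spark.read.schema("max_rec LONG").csv(maxValPath)
      .collect().headOption.map(_.getLong(0)).getOrElse(0L)
  } catch {
    case _: Exception => 0L // first batch: nothing stored yet
  }

val query = resinterdf2.writeStream
  .foreachBatch { (batchDF: DataFrame, batchId: Long) =>
    val spark = batchDF.sparkSession
    val prevMax = previousMax(spark)

    // SK = previous batch's max record number + this record's number
    val withSk = batchDF.withColumn("sk", col("rec_number").cast("long") + lit(prevMax))
    withSk.write.mode("append").csv("/edw_ab_data/spark_poc/output")

    // persist the new running maximum for the next batch
    withSk.agg(max(col("sk"))).write.mode("overwrite").csv(maxValPath)
  }
  .option("checkpointLocation", "/edw_ab_data/spark_poc/checkpoint")
  .start()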

Transform and persist Spark DStream into several separate locations, in parallel?

I have a use case of a DStream that contains data with several levels of nesting, and I have a requirement to persist different elements from that data into separate HDFS locations. I managed to work this out by using Spark SQL, as below:
val context = new StreamingContext(sparkConf, Seconds(duration))
val stream = context.receiverStream(receiver)

stream.foreachRDD { rdd =>
  val spark = SparkSession.builder.config(rdd.sparkContext.getConf).getOrCreate
  import spark.implicits._
  rdd.toDF.drop("childRecords").write.parquet("ParentTable")
}

stream.foreachRDD { rdd =>
  val spark = SparkSession.builder.config(rdd.sparkContext.getConf).getOrCreate
  import spark.implicits._
  rdd.toDF.select(explode(col("childRecords")).as("children"))
    .select("children.*").write.parquet("ChildTable")
}
// repeat as necessary if parent table has more different kinds of child records,
// or if child table itself also has child records too
The code works, but the only issue I have with it is that the persistence runs sequentially: the first stream.foreachRDD has to complete before the second one starts, and so on. Ideally, I'd like the persistence job for ChildTable to start without waiting for ParentTable to finish, as they write to different locations and would not conflict. In reality, I have about 10 different jobs all waiting to complete sequentially, and would probably see a big improvement in execution time if I were able to run them all in parallel.
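A minimal sketch of one possible approach, assuming the same stream and tables as above: run both writes inside a single foreachRDD as futures on a dedicated thread pool (the pool size is an arbitrary choice).
import java.util.concurrent.Executors
import scala.concurrent.{Await, ExecutionContext, Future}
import scala.concurrent.duration.Duration
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, explode}

// small dedicated pool driving the concurrent write jobs
implicit val ec: ExecutionContext =
  ExecutionContext.fromExecutor(Executors.newFixedThreadPool(10))

stream.foreachRDD { rdd =>
  val spark = SparkSession.builder.config(rdd.sparkContext.getConf).getOrCreate
  import spark.implicits._

  val df = rdd.toDF.cache() // reused by both writes

  val parentWrite = Future {
    df.drop("childRecords").write.parquet("ParentTable")
  }
  val childWrite = Future {
    df.select(explode(col("childRecords")).as("children"))
      .select("children.*").write.parquet("ChildTable")
  }

  // block until both jobs finish before the batch is considered complete
  Await.result(Future.sequence(Seq(parentWrite, childWrite)), Duration.Inf)
  df.unpersist()
}
With the default FIFO scheduler the concurrently submitted jobs may still compete for the same resources; setting spark.scheduler.mode to FAIR can help them share the cluster more evenly.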

Unbounded table in Spark Structured Streaming

I'm starting to learn Spark and am having a difficult time understanding the rationale behind Structured Streaming. Structured Streaming treats all arriving data as an unbounded input table, wherein every new item in the data stream is treated as a new row in the table. I have the following piece of code to read incoming files from the csvFolder directory.
val spark = SparkSession.builder.appName("SimpleApp").getOrCreate()
val csvSchema = new StructType().add("street", "string").add("city", "string")
.add("zip", "string").add("state", "string").add("beds", "string")
.add("baths", "string").add("sq__ft", "string").add("type", "string")
.add("sale_date", "string").add("price", "string").add("latitude", "string")
.add("longitude", "string")
val streamingDF = spark.readStream.schema(csvSchema).csv("./csvFolder/")
val query = streamingDF.writeStream
.format("console")
.start()
What happens if I dump a 1 GB file into the folder? As per the specs, the streaming job is triggered every few milliseconds. If Spark encounters such a huge file in the next instant, won't it run out of memory while trying to load the file? Or does it automatically batch it? If yes, is this batching parameter configurable?
See the example from the Structured Streaming programming guide:
The key idea is to treat any data stream as an unbounded table: new records added to the stream are like rows being appended to the table.
This allows us to treat both batch and streaming data as tables. Since tables and DataFrames/Datasets are semantically synonymous, the same batch-like DataFrame/Dataset queries can be applied to both batch and streaming data.
The programming guide also illustrates how the execution of such a query is performed in the Structured Streaming model.
Question: If Spark encounters such a huge file in the next instant, won't it run out of memory while trying to load the file? Or does it automatically batch it? If yes, is this batching parameter configurable?
Answer: There is no risk of OOM at load time, since the underlying RDD (DataFrame/Dataset) is lazily evaluated. Of course, you may need to repartition before processing to ensure an adequate number of partitions and to spread the data uniformly across the executors...
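As for the "is this batching parameter configurable?" part, the file source does expose a per-trigger limit; a minimal sketch reusing the csvFolder example above (the values 10 and 8 are arbitrary illustrations):
// limit how many new files each micro-batch picks up from the folder
val limitedDF = spark.readStream
  .schema(csvSchema)
  .option("maxFilesPerTrigger", 10) // arbitrary illustration value
  .csv("./csvFolder/")

val limitedQuery = limitedDF
  .repartition(8) // optional: spread the data across executors
  .writeStream
  .format("console")
  .start()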

Parse NiFi Data Packet using spark

I'm doing a small university project using Apache NiFi and Apache Spark. I want to create a NiFi workflow that reads TSV files from HDFS, and then use Spark Streaming to process the files and store the information I need in MySQL. I've already created my workflow in NiFi, and the storage part is already working. The problem is that I can't parse the NiFi packets so that I can use them.
The files contain rows like this:
linea1File1 TheReceptionist 653 Entertainment 424 13021 4.34 1305 744 DjdA-5oKYFQ NxTDlnOuybo c-8VuICzXtU
where each space shown above is actually a tab ("\t").
This is my code in Spark using Scala:
val ssc = new StreamingContext(config, Seconds(10))
val packet = ssc.receiverStream(new NiFiReceiver(conf, StorageLevel.MEMORY_ONLY))
val file = packet.map(dataPacket => new String(dataPacket.getContent, StandardCharsets.UTF_8))
Up to this point I can obtain my entire file (7000+ rows) as a single string... unfortunately, I can't split that string into rows. I need to get the file as individual rows, so I can parse each one into an object, apply some operations on it, and store what I want.
Can anyone help me with this?
Each data packet is going to be the content of one flow file from NiFi, so if NiFi picks up one TSV file from HDFS that has a lot of rows, all those rows will be in one data packet.
It is hard to say without seeing your NiFi flow, but you could probably use SplitText with a line count of 1 to split your TSV in NiFi before it gets to Spark Streaming.
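If you prefer to keep the file whole in NiFi and split it on the Spark side instead, a minimal sketch building on the packet DStream above (the chosen field indexes are purely illustrative):
// split each NiFi packet's content into lines, then each line into tab-separated fields
val lines = packet
  .map(dataPacket => new String(dataPacket.getContent, StandardCharsets.UTF_8))
  .flatMap(_.split("\n"))
  .filter(_.trim.nonEmpty)

val fields = lines.map(_.split("\t"))

// e.g. field 1 is "TheReceptionist" and field 2 is "653" for the sample row above
fields.map(f => (f(1), f(2))).print()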

Network Spark Streaming from multiple remote hosts

I wrote a Spark Streaming program in Scala. In my program, I passed a 'remote host' and 'remote port' to socketTextStream.
On the remote machine, I have a Perl script that calls the system command:
echo 'data_str' | nc <remote_host> <9999>
That way, my Spark program is able to get data, but it seems a bit confusing, as I have multiple remote machines that need to send data to the Spark machine.
I wanted to know the right way of doing this. In fact, how will I deal with data coming from multiple hosts?
For reference, my current program:
def main(args: Array[String]): Unit = {
  val conf = new SparkConf().setAppName("HBaseStream")
  val sc = new SparkContext(conf)
  val ssc = new StreamingContext(sc, Seconds(2))
  val inputStream = ssc.socketTextStream(<remote-host>, 9999)
  -------------------
  -------------------
  ssc.start()
  // Wait for the computation to terminate
  ssc.awaitTermination()
}
}
Thanks in advance.
You can find more information in "Level of Parallelism in Data Receiving".
Summary:
Receiving multiple data streams can therefore be achieved by creating
multiple input DStreams and configuring them to receive different
partitions of the data stream from the source(s);
These multiple DStreams can be unioned together to create a single
DStream. Then the transformations that were being applied on a single
input DStream can be applied on the unified stream.
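Applied to the program above, that means one socketTextStream per remote host, unioned into a single DStream; a minimal sketch (the host names are placeholders):
// one receiver per remote host (placeholders)
val hosts = Seq("remote-host-1", "remote-host-2", "remote-host-3")
val streams = hosts.map(host => ssc.socketTextStream(host, 9999))

// union them into a single DStream and apply the usual transformations once
val unified = ssc.union(streams)
unified.print()

ssc.start()
ssc.awaitTermination()
Each socketTextStream creates its own receiver, which occupies a core on the cluster, so make sure there are more cores available than receivers.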