I have a Spark 1.6.2 (on Scala) process which writes a parquet file and, when it finishes, should load it back again as a DataFrame. Is there a Spark method to check whether a DataFrameWriter finished successfully and resume only after that? I've tried using Future with onComplete, but it doesn't work well with Spark (the SparkContext gets shut down).
I assume I could just look for the _SUCCESS file in the folder and loop until it appears, but then if the process gets stuck, I'll end up in an infinite loop.
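For what it's worth, DataFrameWriter calls are blocking: the write either completes the job or throws before returning, so a read placed right after it should already be safe, with no polling for _SUCCESS needed. A minimal sketch in the Spark 1.6.x style; the `sqlContext`, `df`, and path are assumptions:

```scala
// Sketch only: `df` and `sqlContext` are assumed to exist; the path is hypothetical.
val path = "/tmp/out.parquet"

try {
  // Blocking: returns only after the write job has finished.
  df.write.parquet(path)
  // Safe to read back here, since the write either succeeded or threw above.
  val reloaded = sqlContext.read.parquet(path)
} catch {
  case e: Exception =>
    // The write failed; handle or abort instead of looping on _SUCCESS.
    throw e
}
```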
I have an application that creates a few dataframes, writes them to disk, then runs a command using vertica_python to load the data into Vertica. The Spark Vertica connector doesn't work because of an encrypted drive.
What I'd like to do is have the application run the command to load the data, then move on to the next job immediately. What it's doing, however, is waiting for the load to finish in Vertica before moving to the next job. How can I make it do what I want? Thanks.
What's weird about this problem is that the command I'd like to have run in the background is as simple as db_client.cursor.execute(command). This shouldn't be blocking under normal circumstances, so why is it in Spark?
To be very specific, what is happening is that I'm reading in a dataframe, doing transformations, writing to S3, and then I'd like to start the DB loading the files from S3 before taking the transformed dataframe, transforming it again, writing to S3, loading to the DB... multiple times.
I see now what I was doing wrong. Simply putting the DBAPI call in its own thread isn't enough; I have to put the other calls that I want to run concurrently in their own threads as well.
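In Scala, that pattern can be sketched with Futures on a dedicated thread pool, so that the blocking DB call and the next job both run off the main thread. `loadIntoDb` and `runNextJob` below are hypothetical stand-ins for the blocking load command and the next transformation:

```scala
import scala.concurrent.{Await, ExecutionContext, Future}
import scala.concurrent.duration.Duration
import java.util.concurrent.Executors

object ConcurrentJobs {
  // Hypothetical stand-ins for the blocking DB load and the next job.
  def loadIntoDb(path: String): Unit = Thread.sleep(100) // simulate blocking I/O
  def runNextJob(): Int = 42                             // simulate a transformation

  def main(args: Array[String]): Unit = {
    // A dedicated pool so blocking DB calls don't starve other work.
    implicit val ec: ExecutionContext =
      ExecutionContext.fromExecutorService(Executors.newFixedThreadPool(4))

    val dbLoad  = Future(loadIntoDb("s3://bucket/batch1")) // runs in its own thread
    val nextJob = Future(runNextJob())                     // runs concurrently

    // Only block when everything has been kicked off.
    Await.result(dbLoad, Duration.Inf)
    println(Await.result(nextJob, Duration.Inf))
  }
}
```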
This is the exception I am getting whenever I am trying to convert it.
val df_col = df.select("ts.user.friends_count").collect.map(_.toSeq)
org.apache.spark.sql.AnalysisException: Queries with streaming sources must be executed with writeStream.start();;
All I am trying to do is replicate the following sql.dataframe operations in structured streaming.
df.collect().foreach(row => droolsCaseClass(row.getLong(0), row.getString(1)))
which is running fine in Dataframes but not in structured streaming.
collect is a big no-no even in Spark Core's RDD world, because of the amount of data you may transfer back to the driver's single JVM. It marks the boundary of Spark's benefits: after collect you are in a single JVM.
With that said, think about unbounded data, i.e. a data stream, that will never terminate. That's Spark Structured Streaming.
A streaming Dataset is one that is never complete and the data inside varies every time you ask for the content, i.e. the result of executing the structured query over a stream of data.
You simply cannot say "Hey, give me the data that is the content of a streaming Dataset". That does not even make sense.
That's why you cannot collect on a streaming dataset. It is not possible up to Spark 2.2.1 (the latest version at the time of this writing).
If you want to receive the data that is inside a streaming dataset for a period of time (aka batch interval in Spark Streaming or trigger in Spark Structured Streaming) you write the result to a streaming sink, e.g. console.
You can also write a custom streaming sink that does collect.map(_.toSeq) inside addBatch, which is the main and only method of a streaming sink. As a matter of fact, the console sink does exactly that.
All I am trying to do is replicate the following sql.dataframe
operations in structured streaming.
df.collect().foreach(row => droolsCaseClass(row.getLong(0), row.getString(1)))
which is running fine in Dataframes but not in structured streaming.
The very first solution that comes to my mind is to use foreach sink:
The foreach operation allows arbitrary operations to be computed on the output data.
That, of course, does not mean that this is the best solution. Just one that comes to my mind immediately.
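A sketch of what the foreach sink could look like here, assuming Spark 2.x, rows with a Long and a String column as in the question, and the caller's own droolsCaseClass function:

```scala
import org.apache.spark.sql.{ForeachWriter, Row}

// Sketch only: `df` is the streaming Dataset from the question and
// `droolsCaseClass` is the caller's own function.
val query = df.writeStream
  .foreach(new ForeachWriter[Row] {
    def open(partitionId: Long, version: Long): Boolean = true
    def process(row: Row): Unit =
      droolsCaseClass(row.getLong(0), row.getString(1)) // per-row side effect
    def close(errorOrNull: Throwable): Unit = ()
  })
  .start()
```

Note that the ForeachWriter runs on the executors, not the driver, which is exactly why it avoids the collect-to-driver problem.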
I'm starting to learn Spark and am having a difficult time understanding the rationale behind Structured Streaming. Structured Streaming treats all arriving data as an unbounded input table, wherein every new item in the data stream is treated as a new row in the table. I have the following piece of code to read in incoming files from csvFolder:
val spark = SparkSession.builder.appName("SimpleApp").getOrCreate()
val csvSchema = new StructType().add("street", "string").add("city", "string")
.add("zip", "string").add("state", "string").add("beds", "string")
.add("baths", "string").add("sq__ft", "string").add("type", "string")
.add("sale_date", "string").add("price", "string").add("latitude", "string")
.add("longitude", "string")
val streamingDF = spark.readStream.schema(csvSchema).csv("./csvFolder/")
val query = streamingDF.writeStream
.format("console")
.start()
What happens if I dump a 1GB file into the folder? As per the docs, the streaming job is triggered every few milliseconds. If Spark encounters such a huge file in the next instant, won't it run out of memory while trying to load the file? Or does it automatically batch it? If so, is this batching parameter configurable?
See the example in the Structured Streaming Programming Guide:
The key idea is to treat any data stream as an unbounded table: new records added to the stream are like rows being appended to the table.
This allows us to treat both batch and streaming data as tables. Since tables and DataFrames/Datasets are semantically synonymous, the same batch-like DataFrame/Dataset queries can be applied to both batch and streaming data.
In the Structured Streaming model, the query is executed as a series of incremental batch jobs over that unbounded input table, producing an updated result table at each trigger.
Question: If Spark encounters such a huge file in the next instant, won't it run out of memory while trying to load the file? Or does it automatically batch it? If so, is this batching parameter configurable?

Answer: There is no risk of OOM from merely encountering the file, since the RDD (DF/DS) is lazily evaluated and read partition by partition. Of course, you may need to repartition before processing to ensure enough partitions and that data is spread across executors uniformly.
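As for a configurable batching knob: the file source supports a maxFilesPerTrigger option, which caps how many new files are picked up per micro-batch. A sketch reusing the `spark` and `csvSchema` values from the question's snippet:

```scala
// Assumes `spark` and `csvSchema` from the snippet above.
// maxFilesPerTrigger limits how many new files each micro-batch reads.
val streamingDF = spark.readStream
  .schema(csvSchema)
  .option("maxFilesPerTrigger", 1) // at most one new file per trigger
  .csv("./csvFolder/")
```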
As I was hitting the resource limit in my Spark program, I want to divide the processing into iterations, and upload results from each iteration to the HDFS, as shown below.
1. do something using the first RDD
2. upload the output to HDFS
3. do something using the second RDD
4. upload the output to HDFS
But as far as I know, Spark will try to run those two in parallel. Is there a way to wait for the processing of the first rdd, before processing the second rdd?
I think I understand where you're confused. Within a single RDD, the partitions run in parallel with each other. However, two RDDs will run sequentially with respect to each other (unless you explicitly code otherwise, e.g. by submitting jobs from separate threads).
Is there a way to wait for the processing of the first rdd, before processing the second rdd
You have the RDD, so why do you need to wait and read from disk again?
Do some transformations on the RDD, write to disk in the first action, and continue with that same RDD to perform a second action.
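A minimal sketch of that pattern; the paths and transformations are placeholders. Since actions block on the driver, the second save only starts once the first has finished:

```scala
// Sketch only: assumes an existing SparkContext `sc`; paths are hypothetical.
val first = sc.textFile("hdfs:///input")
  .map(_.toUpperCase)                       // first round of processing

first.saveAsTextFile("hdfs:///out/first")   // blocking action: returns when done

val second = first.map(_.reverse)           // continue from the same RDD
second.saveAsTextFile("hdfs:///out/second") // runs only after the first save
```

Caching `first` before the two actions would avoid recomputing it from the source for the second lineage.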
I'm new to Spark and I have one question.
I have a Spark Streaming application which uses Kafka. Is there a way to tell my application to shut down if a new batch is empty (say, batchDuration = 15 min)?
Something along these lines should do it:
dstream.foreachRDD { rdd =>
  if (rdd.isEmpty) {
    streamingContext.stop()
  }
}
But be aware that, depending on your application workflow, the first batch (or some batch in between) may also be empty, and hence your job would stop on the first run. You may need to combine some conditions for a more reliable stop.
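One way to make the stop condition more robust is to require several consecutive empty batches before stopping. A sketch; the threshold of 3 is an assumption, tune it to your batch duration:

```scala
// Sketch only: `dstream` and `streamingContext` come from the surrounding app.
// Stop only after N consecutive empty batches; the counter lives on the driver.
val emptyBatchesToStop = 3 // assumed threshold, tune to taste
var consecutiveEmpty = 0

dstream.foreachRDD { rdd =>
  if (rdd.isEmpty) {
    consecutiveEmpty += 1
    if (consecutiveEmpty >= emptyBatchesToStop) {
      streamingContext.stop(stopSparkContext = true, stopGracefully = true)
    }
  } else {
    consecutiveEmpty = 0 // reset on any non-empty batch
  }
}
```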