Job executed with no data in Spark Streaming

My code:
// messages is JavaPairDStream<K, V>
Fun01(messages)
Fun02(messages)
Fun03(messages)
Fun01, Fun02, and Fun03 each apply transformations followed by an output operation (foreachRDD).
Fun01 and Fun03 both executed as expected, which proves that "messages" is not null or empty.
In the Spark application UI, I found Fun02's output stage under "Spark stages", which proves that it was executed.
The first line of Fun02 is a map function, and I added logging to it. I also added logging to every step in Fun02; all of it shows that the steps ran with no data.
Does anybody know possible reasons? Thanks very much.
@maasg Fun02's logic is:
msg_02 = messages.mapToPair(...)
msg_03 = msg_02.reduceByKeyAndWindow(...)
msg_04 = msg_03.mapValues(...)
msg_05 = msg_04.reduceByKeyAndWindow(...)
msg_06 = msg_05.filter(...)
msg_07 = msg_06.filter(...)
msg_07.cache()
msg_07.foreachRDD(...)
I have tested on Spark 1.1 and Spark 1.2, the versions supported by my company's Spark cluster.

It seems that this is a bug in Spark 1.1 and Spark 1.2 that was fixed in Spark 1.3.
I posted my test results here: http://secfree.github.io/blog/2015/05/08/spark-streaming-reducebykeyandwindow-data-lost.html .
When two reduceByKeyAndWindow operations are chained, data loss may appear depending on the window and slide values.
I could not find the bug in Spark's issue list, so I could not get the patch.

Related

Spark accumulator causing application to silently fail

I have an application that processes records in an rdd and puts them into a cache. I put a couple of Spark Accumulators in my application to keep track of processed and failed records. These stats are sent to statsD before the application closes. Here is some simple sample code:
val sc: SparkContext = new SparkContext(conf)
val jdbcDF: DataFrame = sqlContext.read.format("jdbc").options(Map(...)).load().persist(StorageLevel.MEMORY_AND_DISK)
logger.info("Processing table with " + jdbcDF.count + " rows")
val processedRecords = sc.accumulator(0L, "processed records")
val erroredRecords = sc.accumulator(0L, "errored records")
jdbcDF.rdd.foreachPartition(iterator => {
processedRecords += iterator.length // Problematic line
val cache = getCacheInstanceFromBroadcast()
processPartition(iterator, cache, erroredRecords) // updates cache with iterator documents
})
submitStats(processedRecords, erroredRecords)
I built and ran this in my cluster and it appeared to be functioning correctly, the job was marked as a SUCCESS by Spark. I queried the stats using Grafana and both counts were accurate.
However, when I queried my cache, Couchbase, none of the documents were there. I've combed through both driver and executor logs to see if any errors were being thrown, but I couldn't find anything. My thinking is that this is some memory issue, but would a couple of Long accumulators be enough to cause a problem?
I was able to get this code snippet working by commenting out the line that increments processedRecords - see the line in the snippet noted with Problematic line.
Does anyone know why commenting out that line fixes the issue? Also why is Spark failing silently and not marking the job as FAILURE?
The application isn't "failing" per se. The main problem is, Iterators can only be "iterated" through one time.
Calling iterator.length actually goes through and exhausts the iterator. Thus, when processPartition receives iterator, the iterator is already exhausted and looks empty (so no records will be processed).
The Scala docs confirm this: size (length behaves the same way) is "the number of elements returned by it. Note: it will be at its end after this operation!" You can also view the source code to confirm this.
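To see the effect in isolation, here is a minimal sketch in plain Scala with no Spark involved. The object and method names are made up for illustration. It shows that an iterator reports no more elements once length has been called, and that counting while processing needs only one pass.

```scala
object IteratorExhaustion {
  // Calls length first, then checks whether anything is left to read.
  def lengthThenHasNext(it: Iterator[Int]): (Int, Boolean) = {
    val counted = it.length      // walks the iterator all the way to its end
    val stillHasData = it.hasNext // false: the elements were already consumed
    (counted, stillHasData)
  }

  // Processes every element and counts them in the same single pass.
  def processAndCount(it: Iterator[Int])(handle: Int => Unit): Long = {
    var n = 0L
    it.foreach { x =>
      handle(x)
      n += 1
    }
    n
  }
}
```

Calling IteratorExhaustion.lengthThenHasNext(Iterator(1, 2, 3)) gives (3, false), which is the situation processPartition ends up in when it receives an already counted iterator.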
Workaround
If you rewrite processPartition to return the number of records it processed as a Long, that value can be fed into the accumulator.
Also, sc.accumulator is deprecated in recent versions of Spark.
The workaround could look something like:
val acc = sc.longAccumulator("total processed records")
...
df.rdd.foreachPartition(iterator => {
val cache = getCacheInstanceFromBroadcast()
acc.add(processPartition(iterator, cache, erroredRecords))
})
...
// do something else

Cloud Dataflow GlobalWindow trigger ignored

Using the AfterPane.elementCountAtLeast trigger does not work when run using the Dataflow runner, but works correctly when run locally. When run on Dataflow, it produces only a single pane.
The goal is to extract data from Cloud SQL, transform it, and write it to Cloud Storage. However, there is too much data to keep in memory, so it needs to be split up and written to Cloud Storage in chunks. That's what I hoped this would do.
The complete code is:
val pipe = sc.jdbcSelect(getReadOptions(connOptions, stmt))
.applyTransform(ParDo.of(new Translator()))
.map(row => row.mkString("|"))
// produce one global window with one pane per ~500 records
.withGlobalWindow(WindowOptions(
trigger = Repeatedly.forever(AfterPane.elementCountAtLeast(500)),
accumulationMode = AccumulationMode.DISCARDING_FIRED_PANES
))
val out = TextIO
.write()
.to("gs://test-bucket/staging")
.withSuffix(".txt")
.withNumShards(1)
.withShardNameTemplate("-P-S")
.withWindowedWrites() // gets us one file per window & pane
pipe.saveAsCustomOutput("writer",out)
I think the root of the problem may be that the JdbcIO class is implemented as a PTransform<PBegin,PCollection> and a single call to processElement outputs the entire SQL query result:
public void processElement(ProcessContext context) throws Exception {
try (PreparedStatement statement =
connection.prepareStatement(
query.get(), ResultSet.TYPE_FORWARD_ONLY, ResultSet.CONCUR_READ_ONLY)) {
statement.setFetchSize(fetchSize);
parameterSetter.setParameters(context.element(), statement);
try (ResultSet resultSet = statement.executeQuery()) {
while (resultSet.next()) {
context.output(rowMapper.mapRow(resultSet));
}
}
}
}
In the end, I had two problems to resolve:
1. The process would run out of memory, and 2. the data was written to a single file.
There is no way to work around problem 1 with Beam's JdbcIO and Cloud SQL because of the way it uses the MySQL driver. The driver itself loads the entire result within a single call to executeQuery. There is a way to get the driver to stream results, but I had to implement my own code to do that. Specifically, I implemented a BoundedSource for JDBC.
For the second problem, I used the row number to set the timestamp of each element. That allows me to explicitly control how many rows are in each window using FixedWindows.
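As a rough illustration of the row-number idea, here is a plain Scala sketch with no Beam code. It assumes that row n is given a timestamp of n milliseconds and that the fixed window is rowsPerWindow milliseconds wide. Both choices and all names here are hypothetical, since the actual implementation is not shown.

```scala
object RowWindows {
  // With timestamp = row number (in ms) and a window of rowsPerWindow ms,
  // integer division gives the index of the window a row falls into.
  def windowIndex(rowNumber: Long, rowsPerWindow: Long): Long =
    rowNumber / rowsPerWindow

  // Groups rows by window index, keeping the original order inside each window.
  def assign[A](rows: Seq[A], rowsPerWindow: Long): Map[Long, Seq[A]] =
    rows.zipWithIndex
      .groupBy { case (_, i) => windowIndex(i.toLong, rowsPerWindow) }
      .map { case (w, xs) => w -> xs.map(_._1) }
}
```

With five rows and rowsPerWindow = 2, rows 0 and 1 fall into window 0, rows 2 and 3 into window 1, and row 4 into window 2, so each window (and hence each output file) holds at most two rows.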
elementCountAtLeast is a lower bound, so producing only one pane is a valid choice for a runner.
You have a couple of options when doing this for a batch pipeline:
1. Allow the runner to decide how big the files are and how many shards are written:
val pipe = sc.jdbcSelect(getReadOptions(connOptions, stmt))
.applyTransform(ParDo.of(new Translator()))
.map(row => row.mkString("|"))
val out = TextIO
.write()
.to("gs://test-bucket/staging")
.withSuffix(".txt")
pipe.saveAsCustomOutput("writer",out)
This is typically the fastest option when the TextIO has a GroupByKey or a source that supports splitting that precedes it. To my knowledge JDBC doesn't support splitting so your best option is to add a Reshuffle after the jdbcSelect which will enable parallelization of processing after reading the data from the database.
2. Manually group into batches using the GroupIntoBatches transform:
val pipe = sc.jdbcSelect(getReadOptions(connOptions, stmt))
.applyTransform(ParDo.of(new Translator()))
.map(row => row.mkString("|"))
// Note: GroupIntoBatches operates on KV<K, V> elements, so the rows must be keyed first
.apply(GroupIntoBatches.ofSize(500))
val out = TextIO
.write()
.to("gs://test-bucket/staging")
.withSuffix(".txt")
.withNumShards(1)
pipe.saveAsCustomOutput("writer",out)
In general, this will be slower than option #1, but it does allow you to choose how many records are written per file.
There are a few other ways to do this, each with pros and cons, but the above two are likely the closest to what you want. If you add more details to your question, I may revise this answer further.

Spark: How to structure a series of side effect actions inside mapping transformation to avoid repetition?

I have a spark streaming application that needs to take these steps:
Take a string, apply some map transformations to it
Map again: If this string (now an array) has a specific value in it, immediately send an email (or do something OUTSIDE the spark environment)
collect() and save in a specific directory
apply some other transformation/enrichment
collect() and save in another directory.
As you can see, this implies two lazily triggered computations, which perform the OUTSIDE action twice. I am trying to avoid caching, because at some hundreds of lines per second this would kill my server.
I am also trying to maintain the order of operations, though this is less important. Is there a solution I do not know of?
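The double execution can be reproduced without Spark. The sketch below uses a lazy Scala collection view purely as an analogy for an uncached RDD; it is not Spark code, and all names are invented for illustration. A counter stands in for the outside action (sending the mail).

```scala
object LazySideEffects {
  // Returns how often the side effect ran: (lazy pipeline, pipeline evaluated once).
  def run(): (Int, Int) = {
    val lines = List("a", "b", "c")

    // Lazy view: the map body runs again for every traversal, like an uncached RDD.
    var lazyCalls = 0
    val flaggedLazy = lines.view.map { s => lazyCalls += 1; s.toUpperCase }
    val firstSave = flaggedLazy.toList               // first "action"
    val secondSave = flaggedLazy.map(_ + "!").toList // second "action" re-runs the map

    // Strict list: the map body runs once and the result is reused.
    var strictCalls = 0
    val flaggedOnce = lines.map { s => strictCalls += 1; s.toUpperCase }
    val thirdSave = flaggedOnce
    val fourthSave = flaggedOnce.map(_ + "!")

    (lazyCalls, strictCalls)
  }
}
```

With three input lines, the lazy pipeline runs the side effect six times and the strict one three times. The strict version corresponds to caching the intermediate result, which is exactly the cost the question is trying to avoid.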
EDIT: my program as of now:
kafkaStream;
lines = take the value, discard the topic;
lines.foreachRDD{
splittedRDD = arg.map { split the string };
assRDD = splittedRDD.map { associate to a table };
flaggedRDD = assRDD.map { add a boolean parameter under a if condition + send mail};
externalClass.saveStaticMethod( flaggedRDD.collect() and save in file);
enrichRDD = flaggedRDD.map { enrich with external data };
externalClass.saveStaticMethod( enrichRDD.collect() and save in file);
}
I put the saving part after the email so that if something goes wrong with it at least the mail has been sent.
The final methods I found were these:
1. In the DStream transformation before the side-effecting one, make a copy of the DStream: one copy continues with the transformations, the other gets the .foreachRDD{ outside action }. There is no major downside to this, as it is just one more RDD on a worker node.
2. Extract the {outside action} from the transformation and keep track of the mails already sent: filter if the mail has already been sent. This is almost a superfluous operation, as it will filter out all of the RDD elements.
3. Cache before going on (although I was trying to avoid it, there was not much else to do).
If you are trying to avoid caching, solution 1 is the way to go.

How to suppress printing of variable values in zeppelin

Given the following snippet:
val data = sc.parallelize(0 until 10000)
val local = data.collect
println(s"${local.size}")
Zeppelin prints out the entire value of local to the notebook screen. How may that behavior be changed?
You can also try adding curly brackets around your code.
{val data = sc.parallelize(0 until 10000)
val local = data.collect
println(s"${local.size}")}
Since 0.6.0, Zeppelin provides a boolean flag zeppelin.spark.printREPLOutput in spark's interpreter configuration (accessible via the GUI), which is set to true by default.
If you set its value to false then you get the desired behaviour that only explicit print statements are output.
See also: https://issues.apache.org/jira/browse/ZEPPELIN-688
What I do to avoid this is define a top-level function, and then call it:
def run() : Unit = {
val data = sc.parallelize(0 until 10000)
val local = data.collect
println(local.size)
}
run();
FWIW, this appears to be new behaviour.
Until recently we were using Livy 0.4, which only output the content of the final statement (rather than echoing the output of the whole script).
When we upgraded to Livy 0.5, the behaviour changed to output the entire script.
While splitting the paragraph and hiding the output does work, it seems like an unnecessary overhead to the usability of Zeppelin.
For example, if you need to refresh your output, then you have to remember to run two paragraphs (i.e. the one that sets up your output and the one containing the actual println).
There are, IMHO, other usability issues with this approach that makes, again IMHO, Zeppelin less intuitive to use.
Someone has logged this JIRA ticket to address "the problem", please vote for it:
LIVY-507
Zeppelin, as well as spark-shell REPL, always prints the whole interpreter output.
If you really want only the local.size string printed, the best way to do it is to put the println statement in a separate paragraph.
Then you can hide all output of the previous paragraph using small "book" icon on the top-right.
A simple trick I use is to define
def !() ="_ __ ___ ___________________________________________________"
and use as
$bang
above or close to the code I want to check
and it works
res544: String = _ __ ___ ___________________________________________________
Then I just leave it there, commented out ;)
// hope it helps

MongoDB WriteConcern Java Driver

I have a simple mongo application that happens to be async (using Akka).
I send a message to an actor, which in turn writes 3 records to a database.
I'm using WriteConcern.SAFE because I want to be sure the write happened (also tried WriteConcern.FSYNC_SAFE).
I pause for a second to let the writes happen then do a read--and get nothing.
So my write code might be:
collection.save( myObj, WriteConcern.SAFE )
println("--1--")
collection.save( myObj, WriteConcern.SAFE )
println("--2--")
collection.save( myObj, WriteConcern.SAFE )
println("--3--")
then in my test code (running outside the actor--in another thread) I print out the # of records I find:
println( collection.findAll(...) )
My output looks like this:
--1--
--2--
--3--
(pauses)
0
Indeed if I look in the database I see no records. Sometimes I actually do see data there and the test works. Async code can be tricky and it's possible the test code is being hit before the writes happen, so I also tried printing out timestamps to ensure these are being executed in the order presented--they are. The data should be there. Sample output below w/timestamps:
Saved: brand_1 / dev 1375486024040
Saved: brand_1 / dev2 1375486024156
Saved: brand_1 / dev3 1375486024261
1375486026593 0 found
So the 3 saves clearly happened (and should have written) a full 2 seconds before the read was attempted.
I understand for more liberal WriteConcerns you could get this behavior, but I thought the two safest ones would assure me the write actually happened before proceeding.
Subtle but simple problem. I was using a def to create my connection... which I then proceeded to call twice as if it was a val. So I actually had 2 different writers so that explained the sometimes-difference in my results. Refactored to a val and all was predictable. Agonizing to identify, easy to understand/fix.
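For anyone hitting the same thing, here is a minimal sketch of the def versus val difference in plain Scala. FakeConnection is a hypothetical in-memory stand-in, not the MongoDB driver, and it only illustrates the language-level behaviour: a def body is evaluated on every call, a val only once.

```scala
object DefVsVal {
  // Hypothetical stand-in for a connection; each instance keeps its own state.
  final class FakeConnection {
    private val saved = scala.collection.mutable.ListBuffer.empty[String]
    def save(doc: String): Unit = { saved += doc }
    def count: Int = saved.size
  }

  def connViaDef: FakeConnection = new FakeConnection // new instance on every call
  val connViaVal: FakeConnection = new FakeConnection // created once and reused

  // Saves one document through each style, then reads the count back.
  def demo(): (Int, Int) = {
    connViaDef.save("brand_1/dev")
    val seenViaDef = connViaDef.count // a different instance from the one written to
    connViaVal.save("brand_1/dev")
    val seenViaVal = connViaVal.count // the same instance that was written to
    (seenViaDef, seenViaVal)
  }
}
```

The first call to DefVsVal.demo() returns (0, 1): the write made through the def is not visible on the next call, because that call builds a fresh object.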