Spark Streaming -- (Top K requests) how to maintain a state on a driver? - streaming

I have a stream of logs with URLs users request.
Every minute I want to get top100 pages requested during all the time and save it to HDFS.
I understand how to maintain a number of requests for each url:
val ratingItemsStream : DStream[(String,Long)] = lines
.map(LogEntry(_))
.map(entry => (entry.url, 1L))
.reduceByKey(_ + _)
.updateStateByKey(updateRequestCount)
// this provides a DStream of Tuple of [Url, # of requests]
But what to I do next?
Obviously I need to pass all the updates to host to maintain a priorityqueue, and then take top K of it every 1 minute.
How can I achieve this?
UPD: I've seen spark examples and algebird's MapMonoid used there. But since I do not understand how it works(surpisingly no information was found online), I don't want to use it. There must me some way, right?

You could approach it by taking a x-minute window aggregations of the data and applying sorting to get the ranking.
val window = ratingItemStream.window(Seconds(windowSize), Seconds(windowSize))
window.forEachRDD{rdd =>
val byScore = rdd.map(_.swap).sortByKey(ascending=false).zipWithIndex
val top100 = byScore.collect{case ((score, url), index) if (index<100) => (url, score)}
top100.saveAsTextFile("./path/to/file/")
}
(sample code, not tested!)
Note that rdd.top(x) will give you better performance than sorting/zipping but it returns an array, and therefore, you're on your own to save it to hdfs using the hadoop API (which is an option, I think)

Related

Spark accumulator causing application to silently fail

I have an application that processes records in an rdd and puts them into a cache. I put a couple of Spark Accumulators in my application to keep track of processed and failed records. These stats are sent to statsD before the application closes. Here is some simple sample code:
val sc: SparkContext = new SparkContext(conf)
val jdbcDF: DataFrame = sqlContext.read.format("jdbc").options(Map(...)).load().persist(StorageLevel.MEMORY_AND_DISK)
logger.info("Processing table with " + jdbcDF.count + " rows")
val processedRecords = sc.accumulator(0L, "processed records")
val erroredRecords = sc.accumulator(0L, "errored records")
jdbcDF.rdd.foreachPartition(iterator => {
processedRecords += iterator.length // Problematic line
val cache = getCacheInstanceFromBroadcast()
processPartition(iterator, cache, erroredRecords) // updates cache with iterator documents
}
submitStats(processedRecords, erroredRecords)
I built and ran this in my cluster and it appeared to be functioning correctly, the job was marked as a SUCCESS by Spark. I queried the stats using Grafana and both counts were accurate.
However, when I queried my cache, Couchbase, none of the documents were there. I've combed through both driver and executor logs to see if any errors were being thrown but I couldn't find anything. My thinking is that this is some memory issue, but a couple long accumulators is enough to cause a problem?
I was able to get this code snippet working by commenting out the line that increments processedRecords - see the line in the snippet noted with Problematic line.
Does anyone know why commenting out that line fixes the issue? Also why is Spark failing silently and not marking the job as FAILURE?
The application isn't "failing" per se. The main problem is, Iterators can only be "iterated" through one time.
Calling iterator.length actually goes through and exhausts the iterator. Thus, when processPartition receives iterator, the iterator is already exhausted and looks empty (so no records will be processed).
Reference Scala docs to confirm that size is "the number of elements returned by it. Note: it will be at its end after this operation!" -- you can also view the source code to confirm this.
Workaround
If you rewrite processPartition to return a long value, that can be fed into the accumulator.
Also, sc.accumulator is deprecated in recent versions of Spark.
The workaround could look something like:
val acc = sc.longAccumulator("total processed records")
...
df.rdd.foreachPartition(iterator => {
val cache = getCacheInstanceFromBroadcast()
acc.add(processPartition(iterator, cache, erroredRecords))
})
...
// do something else

Cloud Dataflow GlobalWindow trigger ignored

Using the AfterPane.elementCountAtLeast trigger does not work when run using the Dataflow runner, but works correctly when run locally. When run on Dataflow, it produces only a single pane.
The goals is to extract data from Cloud SQL, transform and write into Cloud Storage. However, there is too much data to keep in memory, so it needs to be split up and written to Cloud Storage in chunks. That's what I hoped this would do.
The complete code is:
val pipe = sc.jdbcSelect(getReadOptions(connOptions, stmt))
.applyTransform(ParDo.of(new Translator()))
.map(row => row.mkString("|"))
// produce one global window with one pane per ~500 records
.withGlobalWindow(WindowOptions(
trigger = Repeatedly.forever(AfterPane.elementCountAtLeast(500)),
accumulationMode = AccumulationMode.DISCARDING_FIRED_PANES
))
val out = TextIO
.write()
.to("gs://test-bucket/staging")
.withSuffix(".txt")
.withNumShards(1)
.withShardNameTemplate("-P-S")
.withWindowedWrites() // gets us one file per window & pane
pipe.saveAsCustomOutput("writer",out)
I think the root of the problem may be that the JdbcIO class is implemented as a PTransform<PBegin,PCollection> and a single call to processElement outputs the entire SQL query result:
public void processElement(ProcessContext context) throws Exception {
try (PreparedStatement statement =
connection.prepareStatement(
query.get(), ResultSet.TYPE_FORWARD_ONLY, ResultSet.CONCUR_READ_ONLY)) {
statement.setFetchSize(fetchSize);
parameterSetter.setParameters(context.element(), statement);
try (ResultSet resultSet = statement.executeQuery()) {
while (resultSet.next()) {
context.output(rowMapper.mapRow(resultSet));
}
}
}
}
In the end, I had two problems to resolve:
1. The process would run out of memory, and 2. the data was written to a single file.
There is no way to work around problem 1 with Beam's JdbcIO and Cloud SQL because of the way it uses the MySQL driver. The driver itself loads the entire result within a single call to executeStatement. There is a way to get the driver to stream results, but I had to implement my own code to do that. Specifically, I implemented a BoundedSource for JDBC.
For the second problem, I used the row number to set the timestamp of each element. That allows me to explicitly control how many rows are in each window using FixedWindows.
elementCountAtLeast is a lower bound so making only one pane is a valid option for a runner to do.
You have a couple of options when doing this for a batch pipeline:
Allow the runner to decide how big the files are and how many shards are written:
val pipe = sc.jdbcSelect(getReadOptions(connOptions, stmt))
.applyTransform(ParDo.of(new Translator()))
.map(row => row.mkString("|"))
val out = TextIO
.write()
.to("gs://test-bucket/staging")
.withSuffix(".txt")
pipe.saveAsCustomOutput("writer",out)
This is typically the fastest option when the TextIO has a GroupByKey or a source that supports splitting that precedes it. To my knowledge JDBC doesn't support splitting so your best option is to add a Reshuffle after the jdbcSelect which will enable parallelization of processing after reading the data from the database.
Manually group into batches using the GroupIntoBatches transform.
val pipe = sc.jdbcSelect(getReadOptions(connOptions, stmt))
.applyTransform(ParDo.of(new Translator()))
.map(row => row.mkString("|"))
.apply(GroupIntoBatches.ofSize(500))
val out = TextIO
.write()
.to("gs://test-bucket/staging")
.withSuffix(".txt")
.withNumShards(1)
pipe.saveAsCustomOutput("writer",out)
In general, this will be slower then option #1 but it does allow you to choose how many records are written per file.
There are a few other ways to do this with their pros and cons but the above two are likely the closest to what you want. If you add more details to your question, I may revise this question further.

Can the subflows of groupBy depend on the keys they were generated from ?

I have a flow with data associated to users. I also have a state for each user, that I can get asynchronously from DB.
I want to separate my flow with one subflow per user, and load the state for each user when materializing the subflow, so that the elements of the subflow can be treated with respect to this state.
If I don't want to merge the subflows downstream, I can do something with groupBy and Sink.lazyInit :
def getState(userId: UserId): Future[UserState] = ...
def getUserId(element: Element): UserId = ...
def treatUser(state: UserState): Sink[Element, _] = ...
val treatByUser: Sink[Element] = Flow[Element].groupBy(
Int.MaxValue,
getUserId
).to(
Sink.lazyInit(
elt => getState(getUserId(elt)).map(treatUser),
??? // this is never called, since the subflow is created when an element comes
)
)
However, this does not work if treatUser becomes a Flow, since there is no equivalent for Sink.lazyInit.
Since subflows of groupBy are materialized only when a new element is pushed, it should be possible to use this element to materialize the subflow, but I wasn't able to adapt the source code for groupBy so that this work consistently. Likewise, Sink.lazyInitdoesn't seem to be easily translatable to the Flow case.
Any idea on how to solve this issue ?
The relevant Akka issue you have to look at is #20129: add Sink.dynamic and Flow.dynamic.
In the associated PR #20579 they actually implemented LazySink stuffs.
They are planning to do LazyFlow next:
Will do next lazyFlow with similar signature.
Unfortunately you have to wait for that functionality to be implemented in Akka or write it yourself (then consider a PR to Akka).

Spark: How to structure a series of side effect actions inside mapping transformation to avoid repetition?

I have a spark streaming application that needs to take these steps:
Take a string, apply some map transformations to it
Map again: If this string (now an array) has a specific value in it, immediately send an email (or do something OUTSIDE the spark environment)
collect() and save in a specific directory
apply some other transformation/enrichment
collect() and save in another directory.
As you can see this implies to lazily activated calculations, which do the OUTSIDE action twice. I am trying to avoid caching, as at some hundreds lines per second this would kill my server.
Also trying to mantaining the order of operation, though this is not as much as important: Is there a solution I do not know of?
EDIT: my program as of now:
kafkaStream;
lines = take the value, discard the topic;
lines.foreachRDD{
splittedRDD = arg.map { split the string };
assRDD = splittedRDD.map { associate to a table };
flaggedRDD = assRDD.map { add a boolean parameter under a if condition + send mail};
externalClass.saveStaticMethod( flaggedRDD.collect() and save in file);
enrichRDD = flaggedRDD.map { enrich with external data };
externalClass.saveStaticMethod( enrichRDD.collect() and save in file);
}
I put the saving part after the email so that if something goes wrong with it at least the mail has been sent.
The final 2 methods I found were these:
In the DStream transformation before the side-effected one, make a copy of the Dstream: one will go on with the transformation, the other will have the .foreachRDD{ outside action }. There are no major downside in this, as it is just one RDD more on a worker node.
Extracting the {outside action} from the transformation and mapping the already sent mails: filter if mail has already been sent. This is a almost a superfluous operation as it will filter out all of the RDD elements.
Caching before going on (although I was trying to avoid it, there was not much to do)
If trying to not caching, solution 1 is the way to go

Apache Spark: multiple outputs in one map task

TL;DR: I have a large file that I iterate over three times to get three different sets of counts out. Is there a way to get three maps out in one pass over the data?
Some more detail:
I'm trying to compute PMI between words and features that are listed in a large file. My pipeline looks something like this:
val wordFeatureCounts = sc.textFile(inputFile).flatMap(line => {
val word = getWordFromLine(line)
val features = getFeaturesFromLine(line)
for (feature <- features) yield ((word, feature), 1)
})
And then I repeat this to get word counts and feature counts separately:
val wordCounts = sc.textFile(inputFile).flatMap(line => {
val word = getWordFromLine(line)
val features = getFeaturesFromLine(line)
for (feature <- features) yield (word, 1)
})
val featureCounts = sc.textFile(inputFile).flatMap(line => {
val word = getWordFromLine(line)
val features = getFeaturesFromLine(line)
for (feature <- features) yield (feature, 1)
})
(I realize I could just iterate over wordFeatureCounts to get the wordCounts and featureCounts, but that doesn't answer my question, and looking at running times in practice I'm not sure it's actually faster to do it that way. Also note that there are some reduceByKey operations and other stuff that I do with this after the counts are computed that aren't shown, as they aren't relevant to the question.)
What I would really like to do is something like this:
val (wordFeatureCounts, wordCounts, featureCounts) = sc.textFile(inputFile).flatMap(line => {
val word = getWordFromLine(line)
val features = getFeaturesFromLine(line)
val wfCounts = for (feature <- features) yield ((word, feature), 1)
val wCounts = for (feature <- features) yield (word, 1)
val fCounts = for (feature <- features) yield (feature, 1)
??.setOutput1(wfCounts)
??.setOutput2(wCounts)
??.setOutput3(fCounts)
})
Is there any way to do this with spark? In looking for how to do this, I've seen questions about multiple outputs when you're saving the results to disk (not helpful), and I've seen a bit about accumulators (which don't look like what I need), but that's it.
Also note that I can't just yield all of these results in one big list, because I need three separate maps out. If there's an efficient way to split a combined RDD after the fact, that could work, but the only way I can think of to do this would end up iterating over the data four times, instead of the three I currently do (once to create the combined map, then three times to filter it into the maps I actually want).
It is not possible to split an RDD into multiple RDDs. This is understandable if you think about how this would work under the hood. Say you split RDD x = sc.textFile("x") into a = x.filter(_.head == 'A') and b = x.filter(_.head == 'B'). Nothing happens so far, because RDDs are lazy. But now you print a.count. So Spark opens the file, and iterates through the lines. If the line starts with A it counts it. But what do we do with lines starting with B? Will there be a call to b.count in the future? Or maybe it will be b.saveAsTextFile("b") and we should be writing these lines out somewhere? We cannot know at this point. Splitting an RDD is just not possible with the Spark API.
But nothing stops you from implementing something if you know what you want. If you want to get both a.count and b.count you can map lines starting with A into (1, 0) and lines with B into (0, 1) and then sum up the tuples elementwise in a reduce. If you want to save lines with B into a file while counting lines with A, you could use an aggregator in a map before filter(_.head == 'B').saveAsTextFile.
The only generic solution is to store the intermediate data somewhere. One option is to just cache the input (x.cache). Another is to write the contents into separate directories in a single pass, then read them back as separate RDDs. (See Write to multiple outputs by key Spark - one Spark job.) We do this in production and it works great.
This is one of the major disadvantages of Spark over traditional map-reduce programming. An RDD/DF/DS can be transformed into another RDD/DF/DS but you cannot map an RDD into multiple outputs. To avoid recomputation you need to cache the results into some intermediate RDD and then run multiple map operations to generate multiple outputs. The caching solution will work if you are dealing with reasonable size data. But if the data is large compared to the memory available the intermediate outputs will be spilled to disk and the advantage of caching will not be that great. Check out the discussion here - https://issues.apache.org/jira/browse/SPARK-1476. This is an old Jira but relevant. Checkout out the comment by Mridul Muralidharan.
Spark needs to provide a solution where a map operation can produce multiple outputs without the need to cache. It may not be elegant from the functional programming perspective but I would argue, it would be a good compromise to achieve better performance.
I was also quite disappointed to see that this is a hard limitation of Spark over classic MapReduce. I ended up working around it by using multiple successive maps in which I filter out the data I need.
Here's a schematic toy example that performs different calculations on the numbers 0 to 49 and writes both to different output files.
from functools import partial
import os
from pyspark import SparkContext
# Generate mock data
def generate_data():
for i in range(50):
yield 'output_square', i * i
yield 'output_cube', i * i * i
# Map function to siphon data to a specific output
def save_partition_to_output(part_index, part, filter_key, output_dir):
# Initialise output file handle lazily to avoid creating empty output files
file = None
try:
for key, data in part:
if key != filter_key:
# Pass through non-matching rows and skip
yield key, data
continue
if file is None:
file = open(os.path.join(output_dir, '{}-part{:05d}.txt'.format(filter_key, part_index)), 'w')
# Consume data
file.write(str(data) + '\n')
yield from []
finally:
if file is not None:
file.close()
def main():
sc = SparkContext()
rdd = sc.parallelize(generate_data())
# Repartition to number of outputs
# (not strictly required, but reduces number of output files).
#
# To split partitions further, use repartition() instead or
# partition by another key (not the output name).
rdd = rdd.partitionBy(numPartitions=2)
# Map and filter to first output.
rdd = rdd.mapPartitionsWithIndex(partial(save_partition_to_output, filter_key='output_square', output_dir='.'))
# Map and filter to second output.
rdd = rdd.mapPartitionsWithIndex(partial(save_partition_to_output, filter_key='output_cube', output_dir='.'))
# Trigger execution.
rdd.count()
if __name__ == '__main__':
main()
This will create two output files output_square-part00000.txt and output_cube-part00000.txt with the desired output splits.