Spark caching strategy - Scala

I have a Spark driver that goes like this:
EDIT: an earlier version of the code was different and didn't work
var totalResult = ... // RDD[(key, value)]
var stageResult = totalResult
do {
  stageResult = stageResult.flatMap(
    // Some code that returns zero or more outputs per input,
    // and updates `acc` to the number of outputs
    ...
  ).reduceByKey((x, y) => x.sum(y))
  totalResult = totalResult.union(stageResult)
} while (stageResult.count() > 0)
I know from properties of my data that this will eventually terminate (I'm essentially aggregating up the nodes in a DAG).
I'm not sure of a reasonable caching strategy here - should I cache stageResult each time through the loop? Am I setting up a horrible tower of recursion, since each totalResult depends on all previous incarnations of itself? Or will Spark figure that out for me? Or should I put each RDD result in an array and take one big union at the end?
Suggestions are welcome, thanks.

I would rewrite this as follows:
do {
  stageResult = stageResult.flatMap(
    // Some code that returns zero or more outputs per input
    ...
  ).reduceByKey(_ + _).cache
  totalResult = totalResult.union(stageResult)
} while (stageResult.count > 0)
I am fairly certain (95%) that the stageResult DAG used in the union will be the correct reference (especially since count should trigger it), but this might need to be double-checked.
Then when you call totalResult.ACTION, it will put all of the cached data together.
ANSWER BASED ON UPDATED QUESTION
As long as you have the memory space, I would indeed cache everything along the way: each stageResult's data gets stored, and the final union just stitches all of those data points together. In fact, each union does not depend on the past incarnations; those are not the semantics of RDD.union, which merely concatenates the RDDs. Thanks to RDD immutability, you could just as easily change your code to use a val.
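For instance, here is a minimal sketch of the cache-as-you-go idea, collecting each stage and taking one big union at the end (nextStage is a hypothetical stand-in for your flatMap(...).reduceByKey(...) step):

import scala.collection.mutable.ArrayBuffer
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

def aggregateAll[K, V](sc: SparkContext,
                       initial: RDD[(K, V)],
                       nextStage: RDD[(K, V)] => RDD[(K, V)]): RDD[(K, V)] = {
  val stages = ArrayBuffer(initial)
  var stage = initial
  do {
    stage = nextStage(stage).cache() // cache so count() and the final union reuse the same data
    stages += stage
  } while (stage.count() > 0)
  sc.union(stages.toSeq) // one big union over every stage
}

Because each cached stage is materialized by its count(), the final union reads the cached blocks rather than replaying the whole lineage.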
As a final note, looking at the DAG visualization in the Spark UI may help you understand why there would be no recursive ramifications.

Related

Pass each item of dataset through list and update

I am working on refactoring code for a Spark job written in Scala. We have a dataset of rolled-up data, "rollups", that we pass through a list of rules. Each rollup has a "flags" value, a list we append rule information to in order to keep track of which rule was triggered by that rollup. Each rule is an object that looks at a rollup's data and decides whether or not to add its identifying information to the rollup's "flags".
So here is where the problem is. Currently each rule takes in the rollup data, adds a flag and then returns it again.
object Rule_1 extends AnomalyRule {
  override def identifyAnomalies(rollupData: RollupData): RollupData = {
    if (/* condition on rollup data is triggered */) {
      rollupData.flags = rollupData.addFlag("rule 1")
    }
    rollupData
  }
}
This allows us to calculate the rules like:
val anomalyInfo = rollups.flatMap(x =>
  AnomalyRules.rules.map(y =>
    y.identifyAnomalies(x)
  ).filter(a => a.flags.nonEmpty)
)
// do later processing on anomalyInfo
The rollups val here is a Dataset of our rollup data and rules is a list of rule objects.
The issue with this method is that it creates a duplicate rollup for each rule. For example, if we have 7 rules, each rollup will be duplicated 7 times, because each rule returns the rollup passed into it. Running dropDuplicates() on the dataset takes care of this issue, but it's ugly and confusing.
This is why I wanted to refactor. But if I set up the rules to instead only append the flag, like this:
object Rule_1 extends AnomalyRule {
  override def identifyAnomalies(rollupData: RollupData): Unit = {
    if (/* condition on rollup data is triggered */) {
      rollupData.flags = rollupData.addFlag("rule 1")
    }
  }
}
We can instead write
rollups.foreach(rollup =>
  AnomalyRules.rules.foreach(rule =>
    rule.identifyAnomalies(rollup)
  )
)
// do later processing on rollups
This seems like the more intuitive approach. However, while running these rules in unit tests works fine, no "flag" information is added when the "rollups" dataset is passed through. I think it is because datasets are not mutable? This method actually does work if I collect the rollups as a list, but the dataset I am testing on is much smaller than what is in production, so we can't do that. This is where I am stuck and cannot think of a cleaner way of writing this code. I feel like I am missing some fundamental programming concept or do not understand mutability well enough.
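For what it's worth, here is a hedged sketch of one single-pass, immutable alternative: each rule returns an optional flag instead of mutating its input, and a single map applies every rule. The name checkAnomaly and the example condition are hypothetical; RollupData is assumed to be a case class with an immutable flags field, and spark.implicits._ is assumed to be in scope for the Dataset encoders.

case class RollupData(id: Long, value: Double, flags: Seq[String] = Nil)

trait AnomalyRule extends Serializable {
  // returns Some(flag) when the rule fires, None otherwise
  def checkAnomaly(rollup: RollupData): Option[String]
}

object Rule_1 extends AnomalyRule {
  override def checkAnomaly(rollup: RollupData): Option[String] =
    if (rollup.value > 100.0) Some("rule 1") else None // hypothetical condition
}

// one pass over the dataset: no duplicates and no mutation
val anomalyInfo = rollups.map { rollup =>
  val flags = AnomalyRules.rules.flatMap(_.checkAnomaly(rollup))
  rollup.copy(flags = flags)
}.filter(_.flags.nonEmpty)

Because each rollup flows through the map exactly once, collecting all of its flags in one go, there is nothing to deduplicate afterwards.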

Count Triangles in Scala - Spark

I am trying to get into data analytics using Spark with Scala. My question is: how do I get the triangles in a graph? I don't mean the triangle count that comes with GraphX, but the actual nodes that make up each triangle.
Suppose we have a graph file. I was able to compute the triangles in plain Scala, but the same technique does not apply in Spark, since I have to use RDD operations.
The data I give to the function is a list consisting of each source and the list of that source's destinations, e.g. Adj(5, List(1,2,3)), Adj(4, List(9,8,7)), ...
My Scala version is this:
(Paths: List[Adj])
Paths.flatMap(i => Paths.map(j => Paths.map(k => {
  if (i.src != j.src && i.src != k.src && j.src != k.src) {
    if (i.dst.contains(j.src) && j.dst.contains(k.src) && k.dst.contains(i.src)) {
      println(i.src, j.src, k.src) // 3 nodes that make a triangle
    } else {
      ()
    }
  }
})))
And the output would be something like:
(1,2,3)
(4,5,6)
(2,5,6)
In conclusion, I want the same output but in a Spark environment. In addition, I am looking for a more efficient way to hold the adjacency information, like mapping by key and then reducing by key or something. As the Spark environment needs a quite different way to approach each problem (big data operations), I would be grateful if you could explain the way of thinking and give me a small briefing about the functions you used.
Thank you.
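A hedged sketch of one join-based way to enumerate triangles with plain RDD operations (Adj is the case class from the question; the intermediate names edges and wedges are mine):

import org.apache.spark.rdd.RDD

case class Adj(src: Int, dst: List[Int])

def triangles(adjs: RDD[Adj]): RDD[(Int, Int, Int)] = {
  // Flatten the adjacency lists into undirected edges, stored once in
  // canonical (small, large) order so every triangle is emitted once.
  val edges: RDD[(Int, Int)] = adjs
    .flatMap(a => a.dst.map(d => if (a.src < d) (a.src, d) else (d, a.src)))
    .distinct()

  // Wedges: pairs of edges (u, v) and (u, w) sharing their smallest vertex u.
  val wedges: RDD[((Int, Int), Int)] = edges.join(edges)
    .filter { case (_, (v, w)) => v < w }    // drop self-pairs and mirror duplicates
    .map { case (u, (v, w)) => ((v, w), u) } // key by the edge that would close the triangle

  // A wedge is a triangle exactly when its closing edge (v, w) exists.
  wedges.join(edges.map(e => (e, ()))).map { case ((v, w), (u, _)) => (u, v, w) }
}

The way of thinking here is that the nested loops become "key by the thing you want to match on, then join": first key the edges by their shared vertex to build wedges, then key the wedges by their missing edge to test for closure.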

Cloud Dataflow GlobalWindow trigger ignored

Using the AfterPane.elementCountAtLeast trigger does not work when run with the Dataflow runner, but works correctly when run locally. When run on Dataflow, it produces only a single pane.
The goal is to extract data from Cloud SQL, transform it, and write it to Cloud Storage. However, there is too much data to keep in memory, so it needs to be split up and written to Cloud Storage in chunks. That's what I hoped this would do.
The complete code is:
val pipe = sc.jdbcSelect(getReadOptions(connOptions, stmt))
  .applyTransform(ParDo.of(new Translator()))
  .map(row => row.mkString("|"))
  // produce one global window with one pane per ~500 records
  .withGlobalWindow(WindowOptions(
    trigger = Repeatedly.forever(AfterPane.elementCountAtLeast(500)),
    accumulationMode = AccumulationMode.DISCARDING_FIRED_PANES
  ))

val out = TextIO
  .write()
  .to("gs://test-bucket/staging")
  .withSuffix(".txt")
  .withNumShards(1)
  .withShardNameTemplate("-P-S")
  .withWindowedWrites() // gets us one file per window & pane

pipe.saveAsCustomOutput("writer", out)
I think the root of the problem may be that the JdbcIO class is implemented as a PTransform<PBegin,PCollection> and a single call to processElement outputs the entire SQL query result:
public void processElement(ProcessContext context) throws Exception {
  try (PreparedStatement statement =
      connection.prepareStatement(
          query.get(), ResultSet.TYPE_FORWARD_ONLY, ResultSet.CONCUR_READ_ONLY)) {
    statement.setFetchSize(fetchSize);
    parameterSetter.setParameters(context.element(), statement);
    try (ResultSet resultSet = statement.executeQuery()) {
      while (resultSet.next()) {
        context.output(rowMapper.mapRow(resultSet));
      }
    }
  }
}
In the end, I had two problems to resolve:
1. The process would run out of memory.
2. The data was written to a single file.
There is no way to work around problem 1 with Beam's JdbcIO and Cloud SQL because of the way it uses the MySQL driver: the driver loads the entire result set within the single call to executeQuery. There is a way to get the driver to stream results, but I had to implement my own code to do that. Specifically, I implemented a BoundedSource for JDBC.
For the second problem, I used the row number to set the timestamp of each element. That allows me to explicitly control how many rows are in each window using FixedWindows.
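A rough sketch of that idea, assuming Scio's zipWithIndex and timestampBy are available on the SCollection (the 500-rows-per-window constant and the variable names are illustrative):

import org.joda.time.{Duration, Instant}

val rowsPerWindow = 500L
val windowed = pipe
  .zipWithIndex // pair each row with its row number
  .timestampBy { case (_, i) => new Instant((i / rowsPerWindow) * 1000L) } // rows 0-499 share second 0, etc.
  .map { case (row, _) => row }
  .withFixedWindows(Duration.standardSeconds(1)) // so each window holds ~500 rows

With withWindowedWrites() on the TextIO sink, each of these fixed windows then lands in its own file.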
elementCountAtLeast is a lower bound, so producing only one pane is a valid option for a runner.
You have a couple of options when doing this for a batch pipeline:
Allow the runner to decide how big the files are and how many shards are written:
val pipe = sc.jdbcSelect(getReadOptions(connOptions, stmt))
.applyTransform(ParDo.of(new Translator()))
.map(row => row.mkString("|"))
val out = TextIO
.write()
.to("gs://test-bucket/staging")
.withSuffix(".txt")
pipe.saveAsCustomOutput("writer",out)
This is typically the fastest option when the TextIO is preceded by a GroupByKey or a source that supports splitting. To my knowledge JDBC doesn't support splitting, so your best option is to add a Reshuffle after the jdbcSelect, which will enable parallel processing after the data has been read from the database.
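For instance (a sketch; Reshuffle.viaRandomKey is Beam's stock transform for breaking fusion with the read):

import org.apache.beam.sdk.transforms.Reshuffle

val pipe = sc.jdbcSelect(getReadOptions(connOptions, stmt))
  .applyTransform(Reshuffle.viaRandomKey()) // redistribute rows so the steps below run in parallel
  .applyTransform(ParDo.of(new Translator()))
  .map(row => row.mkString("|"))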
Manually group into batches using the GroupIntoBatches transform.
val pipe = sc.jdbcSelect(getReadOptions(connOptions, stmt))
  .applyTransform(ParDo.of(new Translator()))
  .map(row => row.mkString("|"))
  // note: GroupIntoBatches operates on key-value pairs, so the rows may
  // first need a key assigned before this step
  .apply(GroupIntoBatches.ofSize(500))

val out = TextIO
  .write()
  .to("gs://test-bucket/staging")
  .withSuffix(".txt")
  .withNumShards(1)

pipe.saveAsCustomOutput("writer", out)
In general, this will be slower than option #1, but it does allow you to choose how many records are written per file.
There are a few other ways to do this, each with its own pros and cons, but the above two are likely the closest to what you want. If you add more details to your question, I may revise this answer further.

Spark: How to structure a series of side effect actions inside mapping transformation to avoid repetition?

I have a Spark Streaming application that needs to take these steps:
Take a string, apply some map transformations to it
Map again: If this string (now an array) has a specific value in it, immediately send an email (or do something OUTSIDE the spark environment)
collect() and save in a specific directory
apply some other transformation/enrichment
collect() and save in another directory.
As you can see, this implies two lazily activated computations, which would perform the OUTSIDE action twice. I am trying to avoid caching, as at some hundreds of lines per second this would kill my server.
Also, I am trying to maintain the order of operations, though this is not as important: is there a solution I do not know of?
EDIT: my program as of now:
kafkaStream;
lines = take the value, discard the topic;
lines.foreachRDD {
  splittedRDD = arg.map { split the string };
  assRDD = splittedRDD.map { associate to a table };
  flaggedRDD = assRDD.map { add a boolean parameter under an if condition + send mail };
  externalClass.saveStaticMethod( flaggedRDD.collect() and save in file );
  enrichRDD = flaggedRDD.map { enrich with external data };
  externalClass.saveStaticMethod( enrichRDD.collect() and save in file );
}
I put the saving part after the email so that if something goes wrong with it at least the mail has been sent.
In the end, these are the approaches I found (a sketch of the first follows below):
1. In the DStream transformation before the side-effecting one, make a copy of the DStream: one copy goes on with the transformation, the other gets the .foreachRDD{ outside action }. There is no major downside to this, as it is just one more RDD on a worker node.
2. Extract the {outside action} from the transformation and filter on the already-sent mails: filter out the elements whose mail has already been sent. This is almost a superfluous operation, as it will filter out all of the RDD elements.
3. Caching before going on (although I was trying to avoid it, there was not much else to do).
If you are trying to avoid caching, solution 1 is the way to go.
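A minimal sketch of solution 1, assuming a DStream[Array[String]] named assStream at the point just before the side effect; the flag condition, sendMail, enrich, and ExternalClass.save are hypothetical stand-ins for the question's own steps:

// flag each record; this is the point where the stream forks
val flaggedStream = assStream.map(arr => (arr, arr.contains("ALERT"))) // hypothetical flag condition

// branch 1: perform only the outside action, on flagged records
flaggedStream.foreachRDD { rdd =>
  rdd.filter { case (_, flagged) => flagged }
    .collect()
    .foreach { case (arr, _) => sendMail(arr) } // mail is sent from the driver
}

// branch 2: continue the normal transformation chain
flaggedStream
  .map { case (arr, flagged) => enrich(arr, flagged) }
  .foreachRDD(rdd => ExternalClass.save(rdd.collect()))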

Play/Scala Template Block Statement HTML Output Syntax with Local Variable

OK, I've been struggling with this one for a while, and have spent a lot of time trying different things to do something that I have done very easily using PHP.
I am trying to iterate over a list while keeping track of a variable locally, emitting HTML to populate a table.
Attempt #1:
@{
  var curDate: Date = null
  for (ind <- indicators) {
    if (curDate == null || !curDate.equals(ind.getFirstFound())) {
      curDate = ind.getFirstFound()
      <tr><th colspan='5' class='day'>@(ind.getFirstFound())</th></tr>
      <tr><th>Document ID</th><th>Value</th><th>Owner</th><th>Document Title / Comment</th></tr>
    }
  }
}
I attempted to use a Scala block statement to allow me to keep curDate as a variable within the created scope. This block correctly maintains curDate's state, but does not let me output anything to the DOM. I did not actually expect this to compile, due to my unescaped, randomly thrown-in HTML, but it does. The loop simply places nothing in the DOM, although the decision structure is correctly executed on the server.
I tried escaping using @Html('...'), but that produced compile errors.
Attempt #2:
A lot of Google searches led me to the "for comprehension":
@for(ind <- indicators; curDate = ind.getFirstFound()) {
  @if(curDate == null || !curDate.equals(ind.getFirstFound())) {
    @(curDate = ind.getFirstFound())
  }
  <tr><th colspan='5' class='day'>@(ind.getFirstFound())</th></tr>
  <tr><th>Document ID</th><th>Value</th><th>Owner</th><th>Document Title / Comment</th></tr>
}
Without the if statement in this block, this is the closest I got to doing what I actually wanted, but apparently I am not allowed to reassign a non-reference type, which is why I was hoping attempt #1's declaration of curDate: Date = null would work. This attempt gets me the HTML on the page (again, if I remove the nested if statement) but doesn't get me the date-change check I need.
My question is: how do I implement this intention? I am very painfully aware of my lack of Scala knowledge, which is being exacerbated by Play's templating syntax. I am not sure what to do.
Thanks in advance!
Play's template language is very geared towards functional programming. It might be possible to achieve what you want to achieve using mutable state, but you'll probably be best going with the flow, and using a functional solution.
If you want to maintain state between iterations of a loop in functional programming, that can be done by doing a fold - you start with some state, and on each iteration, you get the previous state and the next element, and you then return the new state based on those two things.
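For instance, a running sum is a tiny fold (plain Scala, just to illustrate the shape):

// state = the sum so far; each element produces the new state
val total = List(1, 2, 3, 4).foldLeft(0)((acc, n) => acc + n) // 10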
So, looking at your first solution, it looks like what you're trying to do is only print an element out if its date is different from the previous one, is that correct? Another way of putting this is that you want to filter out all the elements whose date is the same as the previous element's date. Expressing that in terms of a fold, we fold the elements into a sequence (our initial state), and if the last element of the folded sequence has a different date to the current one, we add it; otherwise we ignore it.
Our fold looks like this:
indicators.foldLeft(Vector.empty[Indicator]) { (collected, next) =>
  if (collected.lastOption.forall(_.getFirstFound != next.getFirstFound)) {
    collected :+ next
  } else {
    collected
  }
}
Just to explain the above: we fold into a Vector because Vector has constant-time append and last, while on a List they take linear time. The forall returns true if there is no last element in collected; otherwise, it returns true only if the passed-in lambda evaluates to true. And in Scala, == invokes .equals (after doing a null check), so you don't need to use .equals.
So, putting this in a template:
@for(ind <- indicators.foldLeft(Vector.empty[Indicator]) { (collected, next) =>
  if (collected.lastOption.forall(_.getFirstFound != next.getFirstFound)) {
    collected :+ next
  } else {
    collected
  }
}) {
  ...
}