Spark: How to structure a series of side effect actions inside mapping transformation to avoid repetition? - email

I have a spark streaming application that needs to take these steps:
Take a string, apply some map transformations to it
Map again: If this string (now an array) has a specific value in it, immediately send an email (or do something OUTSIDE the spark environment)
collect() and save in a specific directory
apply some other transformation/enrichment
collect() and save in another directory.
As you can see this implies to lazily activated calculations, which do the OUTSIDE action twice. I am trying to avoid caching, as at some hundreds lines per second this would kill my server.
Also trying to mantaining the order of operation, though this is not as much as important: Is there a solution I do not know of?
EDIT: my program as of now:
kafkaStream;
lines = take the value, discard the topic;
lines.foreachRDD{
splittedRDD = arg.map { split the string };
assRDD = splittedRDD.map { associate to a table };
flaggedRDD = assRDD.map { add a boolean parameter under a if condition + send mail};
externalClass.saveStaticMethod( flaggedRDD.collect() and save in file);
enrichRDD = flaggedRDD.map { enrich with external data };
externalClass.saveStaticMethod( enrichRDD.collect() and save in file);
}
I put the saving part after the email so that if something goes wrong with it at least the mail has been sent.

The final 2 methods I found were these:
In the DStream transformation before the side-effected one, make a copy of the Dstream: one will go on with the transformation, the other will have the .foreachRDD{ outside action }. There are no major downside in this, as it is just one RDD more on a worker node.
Extracting the {outside action} from the transformation and mapping the already sent mails: filter if mail has already been sent. This is a almost a superfluous operation as it will filter out all of the RDD elements.
Caching before going on (although I was trying to avoid it, there was not much to do)
If trying to not caching, solution 1 is the way to go

Related

Cloud Dataflow GlobalWindow trigger ignored

Using the AfterPane.elementCountAtLeast trigger does not work when run using the Dataflow runner, but works correctly when run locally. When run on Dataflow, it produces only a single pane.
The goals is to extract data from Cloud SQL, transform and write into Cloud Storage. However, there is too much data to keep in memory, so it needs to be split up and written to Cloud Storage in chunks. That's what I hoped this would do.
The complete code is:
val pipe = sc.jdbcSelect(getReadOptions(connOptions, stmt))
.applyTransform(ParDo.of(new Translator()))
.map(row => row.mkString("|"))
// produce one global window with one pane per ~500 records
.withGlobalWindow(WindowOptions(
trigger = Repeatedly.forever(AfterPane.elementCountAtLeast(500)),
accumulationMode = AccumulationMode.DISCARDING_FIRED_PANES
))
val out = TextIO
.write()
.to("gs://test-bucket/staging")
.withSuffix(".txt")
.withNumShards(1)
.withShardNameTemplate("-P-S")
.withWindowedWrites() // gets us one file per window & pane
pipe.saveAsCustomOutput("writer",out)
I think the root of the problem may be that the JdbcIO class is implemented as a PTransform<PBegin,PCollection> and a single call to processElement outputs the entire SQL query result:
public void processElement(ProcessContext context) throws Exception {
try (PreparedStatement statement =
connection.prepareStatement(
query.get(), ResultSet.TYPE_FORWARD_ONLY, ResultSet.CONCUR_READ_ONLY)) {
statement.setFetchSize(fetchSize);
parameterSetter.setParameters(context.element(), statement);
try (ResultSet resultSet = statement.executeQuery()) {
while (resultSet.next()) {
context.output(rowMapper.mapRow(resultSet));
}
}
}
}
In the end, I had two problems to resolve:
1. The process would run out of memory, and 2. the data was written to a single file.
There is no way to work around problem 1 with Beam's JdbcIO and Cloud SQL because of the way it uses the MySQL driver. The driver itself loads the entire result within a single call to executeStatement. There is a way to get the driver to stream results, but I had to implement my own code to do that. Specifically, I implemented a BoundedSource for JDBC.
For the second problem, I used the row number to set the timestamp of each element. That allows me to explicitly control how many rows are in each window using FixedWindows.
elementCountAtLeast is a lower bound so making only one pane is a valid option for a runner to do.
You have a couple of options when doing this for a batch pipeline:
Allow the runner to decide how big the files are and how many shards are written:
val pipe = sc.jdbcSelect(getReadOptions(connOptions, stmt))
.applyTransform(ParDo.of(new Translator()))
.map(row => row.mkString("|"))
val out = TextIO
.write()
.to("gs://test-bucket/staging")
.withSuffix(".txt")
pipe.saveAsCustomOutput("writer",out)
This is typically the fastest option when the TextIO has a GroupByKey or a source that supports splitting that precedes it. To my knowledge JDBC doesn't support splitting so your best option is to add a Reshuffle after the jdbcSelect which will enable parallelization of processing after reading the data from the database.
Manually group into batches using the GroupIntoBatches transform.
val pipe = sc.jdbcSelect(getReadOptions(connOptions, stmt))
.applyTransform(ParDo.of(new Translator()))
.map(row => row.mkString("|"))
.apply(GroupIntoBatches.ofSize(500))
val out = TextIO
.write()
.to("gs://test-bucket/staging")
.withSuffix(".txt")
.withNumShards(1)
pipe.saveAsCustomOutput("writer",out)
In general, this will be slower then option #1 but it does allow you to choose how many records are written per file.
There are a few other ways to do this with their pros and cons but the above two are likely the closest to what you want. If you add more details to your question, I may revise this question further.

mirth connect Database Reader automatic column mapping

Please could somebody confirm the following..
I am using Mirth Connect 3.5.08232.
My Source Connector is a Database Reader.
Say, I am using a query that returns multiple rows, and return the result (via JavaScript), as documentation suggests, so that Mirth would treat each row as a separate message. I also use a couple of mappers as source transformers, and save the mapped fields in my channel map (which ends up to contain only those fields that I define in transformers)
In the destination, and specifically, in destination response transformer (or destination body, if it is a JavaScript writer), how do I access the source fields?
the only way I found by trial and error is
var rawMsg = connectorMessage.getRawData();
var xmlMsg = new XML(rawMsg);
logger.info(xmlMsg.some_field); // ignore the root element of rawMsg
Is this the right way to do this? I thought that maybe the fields that were nicely automatically detected would be put in some kind of a map, like sourceMap - but that doesn't seem to be the case, right?
Thank you
If you are using Mapper steps in your transformer to extract the data and put it into a variable map (like the channel map), then you can use any of the following methods to retrieve it from a subsequent JavaScript context (including a JavaScript Writer, and your response transformer):
var value = channelMap.get('key');
var value = $c('key');
var value = $('key');
Look at the Variable Maps section of the User Guide for more information.
So to recap, say you're selecting a column "mycolumn" with a Database Reader. The XML sent to the channel will be something like this:
<result>
<mycolumn>value</mycolumn>
</result>
Then you can choose to extract pieces of that message into specific variables for later use. The transformer allows you to easily drag-and-drop pieces of the sample inbound message.
Finally in your JavaScript Writer (or in any subsequent filter, transformer, or response transformer), just drag the value into the field you want:
And the corresponding JavaScript code will automatically be inserted:
One last note, if you are selecting a lot of variables and don't want to make Mapper steps for each one individually, you can use a JavaScript Step to iterate through the message and extract each column into a separate map variable:
for each (child in msg.children()) {
channelMap.put(child.localName(), child.toString());
}
Or, you can just reference the columns directly from within the JavaScript Writer:
var msg = new XML(connectorMessage.getEncodedData());
var column1 = msg.column1.toString();
var column2 = msg.column2.toString();
...

Spark caching strategy

I have a Spark driver that goes like this:
EDIT - earlier version of the code was different & didn't work
var totalResult = ... // RDD[(key, value)]
var stageResult = totalResult
do {
stageResult = stageResult.flatMap(
// Some code that returns zero or more outputs per input,
// and updates `acc` to number of outputs
...
).reduceByKey((x, y) => x.sum(y))
totalResult = totalResult.union(stageResult)
} while(stageResult.count() > 0)
I know from properties of my data that this will eventually terminate (I'm essentially aggregating up the nodes in a DAG).
I'm not sure of a reasonable caching strategy here - should I cache stageResult each time through the loop? Am I setting up a horrible tower of recursion, since each totalResult depends on all previous incarnations of itself? Or will Spark figure that out for me? Or should I put each RDD result in an array and take one big union at the end?
Suggestions will be welcome here, thanks.
I would rewrite this as follows:
do {
stageResult = stageResult.flatMap(
//Some code that returns zero or more outputs per input
).reduceByKey(_+_).cache
totalResult = totalResult.union(stageResult)
} while(stageResult.count > 0)
I am fairly certain(95%) that the stageResult DAG used in the union will be the correct reference (especially since count should trigger it), but this might need to be double checked.
Then when you call totalResult.ACTION, it will put all of the cached data together.
ANSWER BASED ON UPDATED QUESTION
As long as you have the memory space, then I would indeed cache everything along the way as it stores the data of each stageResult, unioning all of those data points at the end. In fact, each union does not rely on the past as that is not the semantics of RDD.union, it merely puts them together at the end. You could just as easily change your code to use a val due to RDD immutability.
As a final note, maybe the DAG visualization will help understand why there would not be recursive ramifications:

Handle errors and control breaks with Spring Batch

I have almost completed my work with Spring Batch, it's working but then I have problems to handle the errors. I'll make a simple example:
I read one flat file, that I (later) map with 3 variables:
ID CODE NAME
AAA3333333Alex
AAA3333333Mark
BBB4444444Paul
I want the reader to read the flat file with a control break (I don't know if it's the right term in english, in italian it's something like "key break"): I read the elements with the same ID and CODE and only when the key changes return them to the reader:
while ((line = (Person) peek()) != null) { //while there are elements to read
if (line.getId().equals(prevElement.getId()) && line.getCode().equals(prevElement.getCode())
//do something
}
This works fine: when the ID or CODE changes I return the elements to the writer.
To make this work I had to set the commit-interval to 1 from the application-context. The thing is that in the worst case, if the elements are different for each line, I commit every single element and it becomes all very very slow.
So I said: let's put an outer control. Instead of returning the elements to the writer each time the key changes, I put them in a list, and then I return the list every 200 key changes (like a...handmade commit-interval):
while (controlBreakCount < 200 &&) {
while (!exit && (line = (Person) peek()) != null) { //while there are elements to read
if (line.getId().equals(prevElement.getId()) && line.getCode().equals(prevElement.getCode())
//do something
else { //if the key changes
//there is a controlBreakCount++; to increase the count
//add the elements to a list
}
}
}
return the list
and this works too (the real code has more controls, but this was to explain in a simple way).
The problem comes here: how to handle the errors with the listener in this case. With the outer while I have put (the one with the controlBreakCount), if even one of the 200 elements has an error all the elements currently in the list go to the listener and so it's very difficult to recognize the element with the error.
I guess my solution is not the best way to handle the "control break", but I can't find really much about this (and I'm not very pro with Spring Batch)...may I have some help?
Thank you very much :)
The 'do something' in your code is suspect... :/ Which operations do you do on the element just read? In reader you should only aggregate items and pass them to processor or writer.
IMHO you have to set a commit-interval greater than 1 to make process faster and try to group your data with same ID+CODE key in your own custom reader and THEN pass to writer in this way:
class PersonList {
String id;
String code;
List<Person> persons = new ArrayList<Person>();
}
when you find a keybreak (ID+CODE differs from previous) you have to create a new PersonList object and until ID+CODE are the same of previous line add to current PersonList.person list.
If your problem is about manage single line error with my solution you you can manage in your reader and, if you want, skip items with same ID+CODE in your reader as well.
Your reader must change its signature to MyPersonItemReader<PersonList> and your writer will write PersonList objects, but you have under your control the object (with ID+CODE).
I hope I had understood correctly your problem.

Manipulating form input values after submission causes multiple instances

I'm building a form with Yii that updates two models at once.
The form takes the inputs for each model as $modelA and $modelB and then handles them separately as described here http://www.yiiframework.com/wiki/19/how-to-use-a-single-form-to-collect-data-for-two-or-more-models/
This is all good. The difference I have to the example is that $modelA (documents) has to be saved and its ID retrieved and then $modelB has to be saved including the ID from $model A as they are related.
There's an additional twist that $modelB has a file which needs to be saved.
My action code is as follows:
if(isset($_POST['Documents'], $_POST['DocumentVersions']))
{
$modelA->attributes=$_POST['Documents'];
$modelB->attributes=$_POST['DocumentVersions'];
$valid=$modelA->validate();
$valid=$modelB->validate() && $valid;
if($valid)
{
$modelA->save(false); // don't validate as we validated above.
$newdoc = $modelA->primaryKey; // get the ID of the document just created
$modelB->document_id = $newdoc; // set the Document_id of the DocumentVersions to be $newdoc
// todo: set the filename to some long hash
$modelB->file=CUploadedFile::getInstance($modelB,'file');
// finish set filename
$modelB->save(false);
if($modelB->save()) {
$modelB->file->saveAs(Yii::getPathOfAlias('webroot').'/uploads/'.$modelB->file);
}
$this->redirect(array('projects/myprojects','id'=>$_POST['project_id']));
}
}
ELSE {
$this->render('create',array(
'modelA'=>$modelA,
'modelB'=>$modelB,
'parent'=>$id,
'userid'=>$userid,
'categories'=>$categoriesList
));
}
You can see that I push the new values for 'file' and 'document_id' into $modelB. What this all works no problem, but... each time I push one of these values into $modelB I seem to get an new instance of $modelA. So the net result, I get 3 new documents, and 1 new version. The new version is all linked up correctly, but the other two documents are just straight duplicates.
I've tested removing the $modelB update steps, and sure enough, for each one removed a copy of $modelA is removed (or at least the resulting database entry).
I've no idea how to prevent this.
UPDATE....
As I put in a comment below, further testing shows the number of instances of $modelA depends on how many times the form has been submitted. Even if other pages/views are accessed in the meantime, if the form is resubmitted within a short period of time, each time I get an extra entry in the database. If this was due to some form of persistence, then I'd expect to get an extra copy of the PREVIOUS model, not multiples of the current one. So I suspect something in the way its saving, like there is some counter that's incrementing, but I've no idea where to look for this, or how to zero it each time.
Some help would be much appreciated.
thanks
JMB
OK, I had Ajax validation set to true. This was calling the create action and inserting entries. I don't fully get this, or how I could use ajax validation if I really wanted to without this effect, but... at least the two model insert with relationship works.
Thanks for the comments.
cheers
JMB