Are Rx Extensions suitable for reading a file and storing it to a database? - entity-framework

I have a really long Excel file which I read using EPPlus. For each line I test whether it meets certain criteria and, if so, I add the line (an object representing the line) to a collection. When the file has been read, I store those objects in the database. Would it be possible to do both things at the same time? My idea is to have a collection of objects that would somehow be consumed by a thread that saves the objects to the DB, while at the same time the Excel reader method populates the collection... Could this be done using Rx, or is there a better method?
Thanks.

An alternate answer, based on comments to my first.
Create a function returning an IEnumerable<Records> from EPPlus/XLS using yield return,
then convert the sequence to an observable on the thread pool, and you've got the Rx way of having a producer/consumer with a BlockingCollection.
IEnumerable<Records> epplusRecords()
{
    while (...)
        yield return nextRecord;
}
var myRecords = epplusRecords()
    .ToObservable(Scheduler.ThreadPool)
    .Where(rec => meetsCriteria(rec))
    .Select(rec => newShape(rec))
    .Do(newRec => writeToDb(newRec))
    .ToArray();

Your case seems to be one of pulling data (IEnumerable) rather than pushing data (IObservable/Rx).
Hence I would suggest LINQ to Objects as something that can be used to model the solution.
Something like the code shown below.
public static IEnumerable<Records> ReadRecords(string excelFile)
{
    // Read from the Excel file and yield values
}
// Use LINQ operators to do the filtering
var filtered = ReadRecords("fileName").Where(r => /* your condition */);
foreach (var r in filtered)
    WriteToDb(r);
NOTE: By using IEnumerable you don't create intermediate collections in this case, and the whole process works like a pipeline.

It doesn't seem like a good fit, as there's no inherent concurrency, timing, or eventing in the use case.
That said, it may be a use case for PLINQ, if EPPlus supports concurrent reads. Something like:
epplusRecords()
    .AsParallel()
    .Where(rec => meetsCriteria(rec))
    .Select(rec => newShape(rec))
    // PLINQ has no Do operator, so the write happens inside a Select
    .Select(newRec => { writeToDb(newRec); return newRec; })
    .ToArray();

Related

Pass each item of dataset through list and update

I am working on refactoring code for a Spark job written in Scala. We have a dataset of rolled-up data, "rollups", that we pass through a list of rules. Each rollup has a "flags" value, a list to which we append rule information in order to keep track of which rule was triggered by that rollup. Each rule is an object that looks at the data of a rollup and decides whether or not to add its identifying information to the rollup's "flags".
So here is where the problem is. Currently each rule takes in the rollup data, adds a flag, and then returns it again.
object Rule_1 extends AnomalyRule {
  override def identifyAnomalies(rollupData: RollupData): RollupData = {
    if (/* condition on rollup data is triggered */) {
      rollupData.flags = rollupData.addFlag("rule 1")
    }
    rollupData
  }
}
This allows us to calculate the rules like:
val anomalyInfo = rollups.flatMap(x =>
  AnomalyRules.rules.map(y =>
    y.identifyAnomalies(x)
  ).filter(a => a.flags.nonEmpty)
)
// do later processing on anomalyInfo
The rollups val here is a Dataset of our rollup data and rules is a list of rule objects.
The issue with this method is that it creates a duplicate rollup for each rule. For example, if we have 7 rules, each rollup will be duplicated 7 times, because each rule returns the rollup passed into it. Running dropDuplicates() on the dataset takes care of this issue, but it's ugly and confusing.
This is why I wanted to refactor. But if I set up the rules to instead only append the flag, like this:
object Rule_1 extends AnomalyRule {
  override def identifyAnomalies(rollupData: RollupData): Unit = {
    if (/* condition on rollup data is triggered */) {
      rollupData.flags = rollupData.addFlag("rule 1")
    }
  }
}
We can instead write
rollups.foreach(rollup =>
  AnomalyRules.rules.foreach(rule =>
    rule.identifyAnomalies(rollup)
  )
)
// do later processing on rollups
This seems like the more intuitive approach. However, while running these rules in unit tests works fine, no "flag" information is added when the "rollups" dataset is passed through. I think it is because Datasets are not mutable? This method actually does work if I collect the rollups as a list, but the dataset I am testing on is much smaller than what is in production, so we can't do that. This is where I am stuck and cannot think of a cleaner way of writing this code. I feel like I am missing some fundamental programming concept or do not understand mutability well enough.
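For illustration only, here is a minimal sketch of what an immutable, map-based accumulation might look like, assuming RollupData can be made a case class with an immutable flags field and that each rule can expose a predicate and a flag name (appliesTo, flag, and the case class shape are hypothetical, not taken from the original code):
// Sketch under the assumptions above; requires spark.implicits._ for the Dataset encoder.
case class RollupData(/* ...other fields..., */ flags: Seq[String] = Seq.empty)

trait AnomalyRule extends Serializable {
  def flag: String
  def appliesTo(rollup: RollupData): Boolean   // hypothetical predicate
}

val flagged = rollups.map { rollup =>
  val hits = AnomalyRules.rules.filter(_.appliesTo(rollup)).map(_.flag)
  rollup.copy(flags = rollup.flags ++ hits)    // one output row per rollup, no duplicates
}.filter(_.flags.nonEmpty)
Because each rollup is mapped to exactly one new value, there is nothing to deduplicate and no mutation is needed on the executors.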

Spark: How to structure a series of side effect actions inside mapping transformation to avoid repetition?

I have a Spark Streaming application that needs to take these steps:
Take a string, apply some map transformations to it
Map again: if this string (now an array) has a specific value in it, immediately send an email (or do something OUTSIDE the Spark environment)
collect() and save in a specific directory
Apply some other transformation/enrichment
collect() and save in another directory.
As you can see, this implies two lazily activated calculations, which perform the OUTSIDE action twice. I am trying to avoid caching, as at some hundreds of lines per second this would kill my server.
I am also trying to maintain the order of operations, though this is not as important: is there a solution I do not know of?
EDIT: my program as of now:
kafkaStream
lines = take the value, discard the topic
lines.foreachRDD { arg =>
  splittedRDD = arg.map { split the string }
  assRDD      = splittedRDD.map { associate to a table }
  flaggedRDD  = assRDD.map { add a boolean parameter under an if condition + send mail }
  externalClass.saveStaticMethod( flaggedRDD.collect() and save in file )
  enrichRDD   = flaggedRDD.map { enrich with external data }
  externalClass.saveStaticMethod( enrichRDD.collect() and save in file )
}
I put the saving part after the email so that if something goes wrong with it at least the mail has been sent.
The methods I eventually found were these:
In the DStream transformation before the side-effecting one, make a copy of the DStream: one copy goes on with the transformation, the other gets the .foreachRDD{ outside action }. There is no major downside to this, as it is just one more RDD on a worker node.
Extract the {outside action} from the transformation and keep track of the mails already sent, filtering out elements whose mail has already been sent. This is almost a superfluous operation, as it will filter out all of the RDD elements.
Cache before going on (although I was trying to avoid it, there was not much else to do); a sketch of this fallback is shown below.
If you are trying to avoid caching, solution 1 is the way to go.
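For reference, a rough sketch of the caching fallback (option 3), reusing the placeholder names and pseudocode map bodies from the question:
lines.foreachRDD { arg =>
  val flaggedRDD = arg
    .map { /* split the string */ }
    .map { /* associate to a table */ }
    .map { /* add the boolean flag + send mail */ }
    .cache()                                   // computed once, so the mail is sent once

  externalClass.saveStaticMethod( flaggedRDD.collect() /* and save in file */ )
  val enrichRDD = flaggedRDD.map { /* enrich with external data */ }
  externalClass.saveStaticMethod( enrichRDD.collect() /* and save in file */ )
  flaggedRDD.unpersist()
}
Caching the flagged RDD means the second collect() reuses the already computed partitions instead of re-running the map that sends the mail.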

Spark caching strategy

I have a Spark driver that goes like this:
EDIT - earlier version of the code was different & didn't work
var totalResult = ... // RDD[(key, value)]
var stageResult = totalResult
do {
  stageResult = stageResult.flatMap(
    // Some code that returns zero or more outputs per input,
    // and updates `acc` to the number of outputs
    ...
  ).reduceByKey((x, y) => x.sum(y))
  totalResult = totalResult.union(stageResult)
} while (stageResult.count() > 0)
I know from properties of my data that this will eventually terminate (I'm essentially aggregating up the nodes in a DAG).
I'm not sure of a reasonable caching strategy here - should I cache stageResult each time through the loop? Am I setting up a horrible tower of recursion, since each totalResult depends on all previous incarnations of itself? Or will Spark figure that out for me? Or should I put each RDD result in an array and take one big union at the end?
Suggestions will be welcome here, thanks.
I would rewrite this as follows:
do {
  stageResult = stageResult.flatMap(
    // Some code that returns zero or more outputs per input
  ).reduceByKey(_ + _).cache()
  totalResult = totalResult.union(stageResult)
} while (stageResult.count() > 0)
I am fairly certain (95%) that the stageResult DAG used in the union will be the correct reference (especially since count should trigger it), but this might need to be double-checked.
Then, when you call an action on totalResult, it will put all of the cached data together.
ANSWER BASED ON UPDATED QUESTION
As long as you have the memory space, I would indeed cache everything along the way, as that stores the data of each stageResult and unions all of those data points at the end. In fact, each union does not rely on the past, as that is not the semantics of RDD.union; it merely puts them together at the end. You could just as easily change your code to use a val, thanks to RDD immutability.
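As a hedged sketch of the "collect each stage and take one big union at the end" variant the question asks about (expand, the key/value types K and V, initialResult, and sc are placeholders, not names from the original code):
import scala.collection.mutable.ArrayBuffer
import org.apache.spark.rdd.RDD

// Sketch only: each stage is cached and then materialised by count().
val stages = ArrayBuffer.empty[RDD[(K, V)]]
var stageResult = initialResult
do {
  stageResult = stageResult.flatMap(expand).reduceByKey(_ + _).cache()
  stages += stageResult
} while (stageResult.count() > 0)
val totalResult = sc.union(initialResult +: stages)   // one union at the end
Since RDDs are immutable, holding a reference to each stage in a collection is safe; the final union simply stitches their lineages together.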
As a final note, looking at the DAG visualization may help in understanding why there would not be recursive ramifications.

How can you filter on a custom value created during dehydration?

During dehydration I create a custom value:
def dehydrate(self, bundle):
    bundle.data['custom_field'] = ["add lots of stuff and return an int"]
    return bundle
that I would like to filter on.
/?format=json&custom_field__gt=0...
however I get an error that the "[custom_field] field has no 'attribute' for searching with."
Maybe I'm misunderstanding custom filters, but in both build_filters and apply_filters I can't seem to get access to my custom field to filter on it. In the examples I've seen, it seems like I'd have to redo all the work done in dehydrate in build_filters, e.g.
for all the items:
    item['custom_field'] = ["add lots of stuff and return an int"]
    filter on item and add to pk_list
orm_filters["pk__in"] = [i.pk for i in pk_list]
which seems wrong, as I'm doing the work twice. What am I missing?
The problem is that dehydration is "per object" by design, while filters are applied per object_list. That's why you will have to filter manually and redo the work from dehydrate.
You can imagine it like this:
# Whole table
[obj1, obj2, obj3, obj4, obj5, obj6, obj7]
# filter operations
[...]
# After filtering
[obj1, obj3, obj6]
# Returning
[dehydrate(obj1), dehydrate(obj3), dehydrate(obj6)]
In addition, imagine that you fetch by filtering and get, let's say, 100 objects. It would be quite inefficient to trigger dehydrate on the whole table of, for instance, 100,000 records.
Maybe creating a new column in the model could be a candidate solution if you plan to use a lot of filters, ordering, etc. I guess this field holds a kind of statistic, so if not a new column, then maybe Django aggregation could ease your pain a little.

How to compare 2 mongodb collections?

I'm trying to 'compare' all documents between 2 collections, which should return true if and only if all documents inside the 2 collections are exactly equal.
I've been searching for methods on the collection, but couldn't find one that can do this.
I experimented with something like this in the mongo shell, but it is not working as I expected:
db.test1 == db.test2
or
db.test1.to_json() == db.test2.to_json()
Please share your thoughts! Thank you.
You can try using MongoDB's eval combined with your custom equals function, something like this.
Your methods don't work because, in the first case, you are comparing object references, which are not the same. In the second case, there is no guarantee that to_json will generate the same string even for objects that are the same.
Instead, try something like this:
var compareCollections = function(){
    db.test1.find().forEach(function(obj1){
        db.test2.find({/* if you know some properties, you can put them here... if you don't, leave this empty */}).forEach(function(obj2){
            var equals = function(o1, o2){
                // here goes some compare code... modified from the SO link you have in the answer.
            };
            if(equals(obj1, obj2)){
                // Do what you want to do
            }
        });
    });
};
db.eval(compareCollections);
With db.eval you ensure that the code will be executed on the database server side, without fetching the collections to the client.