Pass each item of a dataset through a list of rules and update it - Scala

I am refactoring code for a Spark job written in Scala. We have a dataset of rolled-up data, "rollups", that we pass through a list of rules. Each rollup has a "flags" value, a list we append rule information to in order to keep track of which rules were triggered by that rollup. Each rule is an object that looks at a rollup's data and decides whether or not to add its identifying information to the rollup's "flags".
Here is where the problem is: currently each rule takes in the rollup data, adds a flag if its condition is met, and then returns the rollup.
object Rule_1 extends AnomalyRule {
  override def identifyAnomalies(rollupData: RollupData): RollupData = {
    if (/* condition on rollup data is triggered */) {
      rollupData.flags = rollupData.addFlag("rule 1")
    }
    rollupData
  }
}
This allows us to apply the rules like this:
val anomalyInfo = rollups.flatMap(x =>
  AnomalyRules.rules.map(y =>
    y.identifyAnomalies(x)
  ).filter(a => a.flags.nonEmpty)
)
// do later processing on anomalyInfo
The rollups val here is a Dataset of our rollup data and rules is a list of rule objects.
The issue with this method is that it creates a duplicate of each rollup for every rule. For example, if we have 7 rules, each rollup will be duplicated 7 times, because each rule returns the rollup passed into it. Running dropDuplicates() on the dataset takes care of this, but it's ugly and confusing.
This is why I wanted to refactor, but if I instead set up the rules to only append the flag, like this:
object Rule_1 extends AnomalyRule {
  override def identifyAnomalies(rollupData: RollupData): Unit = {
    if (/* condition on rollup data is triggered */) {
      rollupData.flags = rollupData.addFlag("rule 1")
    }
  }
}
We can instead write
rollups.foreach(rollup =>
  AnomalyRules.rules.foreach(rule =>
    rule.identifyAnomalies(rollup)
  )
)
// do later processing on rollups
This seems like the more intuitive approach. However, while running these rules in unit tests works fine, no "flag" information is added when the "rollups" dataset is passed through. I think it is because datasets are not mutable? This method does work if I collect the rollups as a list, but the dataset I am testing on is much smaller than what is in production, so we can't do that. This is where I am stuck: I cannot think of a cleaner way of writing this code, and I feel like I am missing some fundamental programming concept or do not understand mutability well enough.
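One immutable-friendly direction, sketched below under assumptions (this is not the code from the job): make each rule side-effect free and build the flagged rollup in a single map over the dataset. The flagFor method and applyRules helper are hypothetical names, RollupData is assumed to be a case class with an immutable flags: Seq[String], and the Dataset encoder comes from the session implicits. Each rollup is evaluated against every rule exactly once, so there is nothing to deduplicate, and it avoids relying on mutation inside foreach, which cannot work on a Dataset: the closure runs on executors against deserialized copies of each row, so the changes never reach the dataset on the driver.

import org.apache.spark.sql.Dataset

// Hypothetical reshaped rule interface: returns Some("rule 1") when the condition is met.
trait AnomalyRule extends Serializable {
  def flagFor(rollupData: RollupData): Option[String]
}

def applyRules(rollups: Dataset[RollupData]): Dataset[RollupData] = {
  import rollups.sparkSession.implicits._

  rollups
    .map { rollup =>
      // Collect the flags of every rule that fires for this rollup.
      val triggered = AnomalyRules.rules.flatMap(rule => rule.flagFor(rollup))
      rollup.copy(flags = rollup.flags ++ triggered)
    }
    .filter(_.flags.nonEmpty) // each rollup appears at most once, so no dropDuplicates() needed
}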

Can the subflows of groupBy depend on the keys they were generated from?

I have a flow of data associated with users. I also have a state for each user, which I can get asynchronously from a DB.
I want to split my flow into one subflow per user and load the state for each user when materializing the subflow, so that the elements of the subflow can be processed with respect to this state.
If I don't want to merge the subflows downstream, I can do something with groupBy and Sink.lazyInit:
def getState(userId: UserId): Future[UserState] = ...
def getUserId(element: Element): UserId = ...
def treatUser(state: UserState): Sink[Element, _] = ...

val treatByUser: Sink[Element, NotUsed] = Flow[Element].groupBy(
  Int.MaxValue,
  getUserId
).to(
  Sink.lazyInit(
    elt => getState(getUserId(elt)).map(treatUser),
    ??? // this is never called, since the subflow is created when an element comes
  )
)
However, this does not work if treatUser becomes a Flow, since there is no equivalent for Sink.lazyInit.
Since subflows of groupBy are materialized only when a new element is pushed, it should be possible to use this element to materialize the subflow, but I wasn't able to adapt the source code for groupBy so that this works consistently. Likewise, Sink.lazyInit doesn't seem to be easily translatable to the Flow case.
Any idea on how to solve this issue?
The relevant Akka issue you have to look at is #20129: add Sink.dynamic and Flow.dynamic.
In the associated PR #20579 they actually implemented the LazySink pieces.
They are planning to do LazyFlow next:
"Will do next lazyFlow with similar signature."
Unfortunately you have to wait for that functionality to be implemented in Akka or write it yourself (then consider a PR to Akka).
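In the meantime, one possible workaround is sketched below under assumptions (it is not the LazyFlow discussed in the PR): peel off the first element of each substream with prefixAndTail, load the state from that element, and only then build the per-user stage. Here treatUserFlow and Output are hypothetical stand-ins for the per-user processing and its output type; getState and getUserId are the functions from the question.

import akka.NotUsed
import akka.stream.scaladsl.{Flow, Source}

// Hypothetical per-user processing stage built from the loaded state.
def treatUserFlow(state: UserState): Flow[Element, Output, NotUsed] = ???

val byUser: Flow[Element, Output, NotUsed] =
  Flow[Element]
    .groupBy(Int.MaxValue, getUserId)
    .prefixAndTail(1)                             // a substream only starts once its first element arrives
    .flatMapConcat { case (head, tail) =>
      val first = head.head
      Source
        .fromFuture(getState(getUserId(first)))   // load the user's state once per substream
        .flatMapConcat { state =>
          Source.single(first).concat(tail).via(treatUserFlow(state))
        }
    }
    .mergeSubstreams

If the subflows should not be merged, attaching a Sink with .to(...) instead of calling mergeSubstreams keeps the shape of the original snippet.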

ScalikeJDBC, raw SQL failing to map or return a valid result set

I posted this to the scalikejdbc user group, and should also cross post to github issues.
I've seen examples in the docs of running raw queries against a table. I'm trying to get a list of IndexEntry with the following code, and while the query is executed and returns results in the console, I'm not getting anything back from the map(rs => ... portion.
The relevant code is here; when I run this in IntelliJ's debugger the result is a "Vector" of size 0. Thanks for any guidance; I'm doing something wrong, hopefully it's a simple oversight.
package company.shared.database.models

import company.shared.database.MySQLConnector
import scalikejdbc._

case class IndexEntry(model: String, indexName: String, fieldName: String)

object IndexSchemaModel extends App with MySQLConnector {
  implicit val session: DBSession = AutoSession

  def indexMap: List[IndexEntry] = {
    val result = DB readOnly { implicit session =>
      sql"""
        SELECT
          substring_index(t.name,'/',-1) as model,
          i.name AS indexName,
          f.name as tableName
        FROM information_schema.innodb_sys_tables t
        JOIN information_schema.innodb_sys_indexes i USING (table_id)
        JOIN information_schema.innodb_sys_fields f USING (index_id)
        where t.name like "MyDB/%"
      """.map(rs => IndexEntry(
        rs.string(0),
        rs.string(1),
        rs.string(2))).list().apply()
    }
    println(result) // always List()
    List(IndexEntry("foo", "bar", "az")) // to match the return type
  }

  override def main(args: Array[String]): Unit = {
    configureDB
    indexMap
  }
}
I have tried ScalikeJDBC's other variants: withSql { queryDSL } and the entire query as SQL interpolation with full syntax support. The first and the last always execute against the MySQL server, which returns 57 rows (small DB); the middle throws an NPE, and honestly I'm happy to tackle the middle one second. I have an issue somewhere in .map, and I've tried to have it just return the map, but it always results in an empty list.
Thanks and hopefully no syntax errors copying into SO.
Oh, and FWIW, configureDB just sets up a connection pool manually, since the DB names and servers can vary wildly between sbt tests, dev, test, and prod. Currently that is not my issue, or I would see "ConnectionPool('default') not initialized" or similar.
Answered here: https://groups.google.com/forum/#!topic/scalikejdbc-users-group/yRjLjuCzuEo
should also cross post to github issues
Please avoid posting questions on GitHub issues. https://github.com/scalikejdbc/scalikejdbc/blob/master/CONTRIBUTING.md#issues
A bit of embarrassment aside: the user in question did not have the PROCESS privilege and therefore would not get any rows back from these tables. As soon as a GRANT PROCESS ON *.* TO user@host was added, all of this worked fine. Permissions on information_schema are driven by what objects the user in question has access to, meaning items like ROUTINES and, in this case, PROCESS have to be explicitly granted.
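For completeness, here is a small sketch (not from the thread) of how the mapping could look once the privilege is in place; it reuses the IndexEntry case class and scalikejdbc._ import from the question, and the indexEntries name is hypothetical. Mapping by column label also sidesteps a second pitfall in the original snippet: JDBC column indexes are 1-based, so rs.string(0) does not refer to the first column.

// Assumes an initialized connection pool plus the imports and case class above.
def indexEntries(implicit session: DBSession): List[IndexEntry] =
  sql"""
    SELECT
      substring_index(t.name, '/', -1) AS model,
      i.name AS indexName,
      f.name AS fieldName
    FROM information_schema.innodb_sys_tables t
    JOIN information_schema.innodb_sys_indexes i USING (table_id)
    JOIN information_schema.innodb_sys_fields f USING (index_id)
    WHERE t.name LIKE "MyDB/%"
  """.map(rs => IndexEntry(
    rs.string("model"),      // map by label instead of by (1-based) position
    rs.string("indexName"),
    rs.string("fieldName")
  )).list().apply()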

Spark caching strategy

I have a Spark driver that goes like this:
EDIT: an earlier version of the code was different and didn't work.
var totalResult = ... // RDD[(key, value)]
var stageResult = totalResult

do {
  stageResult = stageResult.flatMap(
    // Some code that returns zero or more outputs per input,
    // and updates `acc` to number of outputs
    ...
  ).reduceByKey((x, y) => x.sum(y))
  totalResult = totalResult.union(stageResult)
} while (stageResult.count() > 0)
I know from properties of my data that this will eventually terminate (I'm essentially aggregating up the nodes in a DAG).
I'm not sure of a reasonable caching strategy here - should I cache stageResult each time through the loop? Am I setting up a horrible tower of recursion, since each totalResult depends on all previous incarnations of itself? Or will Spark figure that out for me? Or should I put each RDD result in an array and take one big union at the end?
Suggestions will be welcome here, thanks.
I would rewrite this as follows:
do {
  stageResult = stageResult.flatMap(
    // Some code that returns zero or more outputs per input
  ).reduceByKey(_ + _).cache
  totalResult = totalResult.union(stageResult)
} while (stageResult.count > 0)
I am fairly certain (95%) that the stageResult DAG used in the union will be the correct reference (especially since count should trigger it), but this might need to be double-checked.
Then when you call totalResult.ACTION, it will put all of the cached data together.
ANSWER BASED ON UPDATED QUESTION
As long as you have the memory, I would indeed cache everything along the way: it stores the data of each stageResult and then unions all of those data points at the end. In fact, each union does not rely on the previous ones; that is not the semantics of RDD.union, which merely puts the RDDs together at the end. You could just as easily change your code to use vals, thanks to RDD immutability.
As a final note, the DAG visualization in the Spark UI may help you understand why there are no recursive ramifications.
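If the chain of unions built up inside the loop ever becomes a concern, the "put each RDD in an array and take one big union at the end" option from the question can be sketched as follows, under assumptions: the names are hypothetical, the value type is simplified to Long with _ + _ standing in for x.sum(y), and SparkContext.union builds one wide union instead of a nested chain.

import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD
import scala.collection.mutable.ArrayBuffer

// Hypothetical expansion step: zero or more (key, value) outputs per input.
def expandNode(kv: (String, Long)): Seq[(String, Long)] = ???

def aggregateUp(sc: SparkContext, initial: RDD[(String, Long)]): RDD[(String, Long)] = {
  val stages = ArrayBuffer(initial)
  var stageResult = initial
  do {
    stageResult = stageResult
      .flatMap(expandNode)
      .reduceByKey(_ + _)
      .cache()               // count() below materializes the cache, which the final union reuses
    stages += stageResult
  } while (stageResult.count() > 0)
  sc.union(stages.toSeq)     // one wide union over all stages
}

Whether this beats unioning inside the loop mostly comes down to taste; the cached stages are reused by both the count and the final union either way.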

How can you filter on a custom value created during dehydration?

During dehydration I create a custom value:
def dehydrate(self, bundle):
    bundle.data['custom_field'] = ["add lots of stuff and return an int"]
    return bundle
that I would like to filter on.
/?format=json&custom_field__gt=0...
however I get an error that the "[custom_field] field has no 'attribute' for searching with."
Maybe I'm misunderstanding custom filters, but in both build_filters and apply_filters I can't seem to get access to my custom field to filter on it. In the examples I've seen, it seems like I'd have to redo all the work done in dehydrate in build_filters, e.g.
for all the items:
    item['custom_field'] = ["add lots of stuff and return an int"]
    filter on item and add to pk_list
orm_filters["pk__in"] = [i.pk for i in pk_list]
which seems wrong, as I'm doing the work twice. What am I missing?
The problem is that dehydration is "per object" by design, while filters are applied per object_list. That's why you will have to filter manually and redo the work from dehydrate.
You can imagine it like this:
# Whole table
[obj, obj1, obj2, obj3, obj4, obj5, obj6]
# filter operations
[...]
# After filtering
[obj1, obj3, obj6]
# Returning
[dehydrate(obj1), dehydrate(obj3), dehydrate(obj6)]
In addition, imagine that your filter matches, let's say, 100 objects. It would be quite inefficient to trigger dehydrate on the whole table of, for instance, 100,000 records just to compute the value to filter on.
Also, creating a new column in the model could be a candidate solution if you plan to use a lot of filters, ordering, etc. Since this field seems to hold a kind of statistic, if a new column is not an option, maybe Django aggregation could ease your pain a little.

Are Rx extensions suitable for reading a file and storing to a database

I have a really long Excel file which I read using EPPlus. For each line I test whether it meets certain criteria, and if so I add the line (an object representing the line) to a collection. When the file has been read, I store those objects in the database. Would it be possible to do both things at the same time? My idea is to have a collection of objects that would somehow be consumed by a thread saving the objects to the DB, while at the same time the Excel reader method populates the collection... Could this be done using Rx, or is there a better method?
Thanks.
An alternate answer - based on comments to my first.
Create a function returning an IEnumerable<Records> from EPPlus/Xls - use yield return -
then convert the sequence to an observable on the thread pool, and you've got the Rx way of having a producer/consumer with a BlockingCollection.
IEnumerable<Records> epplusRecords()
{
    while (...)
        yield return nextRecord;
}

var myRecords = epplusRecords()
    .ToObservable(Scheduler.ThreadPool)
    .Where(rec => meetsCriteria(rec))
    .Select(rec => newShape(rec))
    .Do(newRec => writeToDb(newRec))
    .ToArray();
Your case seems to be one of pulling data (IEnumerable) rather than pushing data (IObservable/Rx).
Hence I would suggest LINQ to Objects as a way to model the solution,
something like the code shown below.
public static IEnumerable<Records> ReadRecords(string excelFile)
{
    // Read from excel file and yield values
}

// use linq operators to do the filtering
var filtered = ReadRecords("fileName").Where(r => /* your condition */);

foreach (var r in filtered)
    WriteToDb(r);
NOTE: Using IEnumerable, you don't create intermediate collections in this case, and the whole process works like a pipeline.
It doesn't seem like a good fit, as there's no inherent concurrency, timing, or eventing in the use case.
That said, it may be a use case for PLINQ, if EPPlus supports concurrent reads. Something like:
epplusRecords()
    .AsParallel()
    .Where(rec => meetsCriteria(rec))
    .Select(rec => newShape(rec))
    .ForAll(newRec => writeToDb(newRec)); // ForAll runs the writes in parallel and forces execution