Is there a way in play-json to read a key/value map, remembering the original ordering? - scala

Given:
{
...
"fruits": {
"apple": { ... },
"banana": { ... },
"cherry": { ... },
...
"watermelon": { ... }
}
}
Is there a way in Scala to read this JSON map fruits of String -> object while remembering that the ordering of the keys was originally apple, banana, cherry ... watermelon?
[below added because I was asked why I wanted this and to provide a test example]
Normally I wouldn't care about the ordering (of a Map). I do not control the format of the input; it is a Map, not an Array. The real input is not fruits, it is alerts with alphanumeric keys; I just picked fruit names for simplicity. I am building test files based on the data. Suppose there were ten items in the first file, and I deleted "watermelon" (the 10th object) from the second file. The code that read the first file put the objects into a database. When it processes the objects (alerts), each produces an action. A test result is an EventAction(id: Long, action: String). The id is an auto-increment Long from the database; I do not control that. After processing the first file, it turns out that the alert associated with "watermelon" was created with an id of 2, not 10. When I'm building my test for the processing of the second file (the one without "watermelon"), if I assume the id will be 10, the test will fail not because I predicted the action incorrectly, but because I didn't know the id would be 2 instead of 10.
One way of dealing with Map ("you shouldn't care about ordering, so you won't get any clues as to the original ordering in the JSON file") is to build ad-hoc SQL to find out which database id was created for each key, just for the tests. Before I write ad-hoc SQL (the company normally asks that all DB interactions go through stored procedures written by a DBA), I thought, "Wouldn't it be neat to remember the ordering at the time of reading the JSON, before it is lost?"

Play-json (at least 2.6.9) parses a JS object from top to bottom, collecting all fields into content: ListBuffer[(String, JsValue)]. Once all fields of the object have been parsed, the JsObject is instantiated via the apply method:
def apply(fields: Seq[(String, JsValue)]): JsObject = new JsObject(mutable.LinkedHashMap(fields: _*))
So the answer to your question: because JsObject is backed by a LinkedHashMap, if you use the default Reads for Map from the library the ordering is already kept.
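For example, here is a minimal sketch (using the question's fruit keys) of recovering the key order directly from the parsed JsObject; JsObject.fields exposes the fields in the order they were parsed:
import play.api.libs.json._

val json = Json.parse(
  """{ "fruits": { "apple": {}, "banana": {}, "cherry": {}, "watermelon": {} } }"""
)

// JsObject keeps its fields in parse order, so this reflects the key
// ordering as it appeared in the file.
val orderedKeys = (json \ "fruits").as[JsObject].fields.map(_._1)
// orderedKeys: "apple", "banana", "cherry", "watermelon"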

Related

Pass each item of dataset through list and update

I am working on refactoring code for a Spark job written in Scala. We have a dataset of rolled-up data, "rollups", that we pass through a list of rules. Each of the rollups has a "flags" value, a list to which we append rule information to keep track of which "rule" was triggered by that rollup. Each rule is an object that will look at the data of a rollup and decide whether or not to add its identifying information to the rollup's "flags".
So here is where the problem is. Currently each rule takes in the rollup data, adds a flag and then returns it again.
object Rule_1 extends AnomalyRule {
  override def identifyAnomalies(rollupData: RollupData): RollupData = {
    if (*condition on rollup data is triggered*) {
      rollupData.flags = rollupData.addFlag("rule 1")
    }
    rollupData
  }
}
This allows us to calculate the rules like:
val anomalyInfo = rollups.flatMap(x =>
  AnomalyRules.rules.map(y =>
    y.identifyAnomalies(x)
  ).filter(a => a.flags.nonEmpty)
)
# do later processing on anomalyInfo
The rollups val here is a Dataset of our rollup data and rules is a list of rule objects.
The issue with this method is that it creates a duplicate rollup for each rule. For example, if we have 7 rules, each rollup will be duplicated 7 times because each rule returns the rollup passed into it. Running dropDuplicates() on the dataset takes care of this issue, but it's ugly and confusing.
This is why I wanted to refactor, but if I set up the rules to instead only append the rule like this:
object Rule_1 extends AnomalyRule {
  override def identifyAnomalies(rollupData: RollupData): Unit = {
    if (*condition on rollup data is triggered*) {
      rollupData.flags = rollupData.addFlag("rule 1")
    }
  }
}
We can instead write
rollups.foreach(rollup =>
  AnomalyRules.rules.foreach(rule =>
    rule.identifyAnomalies(rollup)
  )
)
# do later processing on rollups
This seems like the more intuitive approach. However, while running these rules in unit tests works fine, no "flag" information is added when the "rollups" dataset is passed through. I think this is because datasets are not mutable? This method actually does work if I collect the rollups as a list, but the dataset I am testing on is much smaller than what is in production, so we can't do that. This is where I am stuck and cannot think of a cleaner way of writing this code. I feel like I am missing some fundamental programming concept or do not understand mutability well enough.
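For reference, a side-effect-free shape for this kind of rule pipeline could look like the following minimal sketch; RollupData, its fields, and the rule signature here are assumptions standing in for the real types. Each rollup is mapped exactly once, with every rule folded over it, so no duplicate rows are produced and nothing needs to be mutated in place:
import org.apache.spark.sql.Dataset

// Hypothetical stand-ins for the question's types.
case class RollupData(id: Long, value: Double, flags: Seq[String] = Seq.empty)

trait AnomalyRule extends Serializable {
  // Return Some(flag) when the rule's condition is triggered, None otherwise.
  def flagFor(rollup: RollupData): Option[String]
}

def applyRules(rollups: Dataset[RollupData], rules: Seq[AnomalyRule]): Dataset[RollupData] = {
  import rollups.sparkSession.implicits._
  rollups
    // one output row per input row: fold every rule over the rollup
    .map(r => rules.foldLeft(r) { (acc, rule) =>
      rule.flagFor(acc).fold(acc)(f => acc.copy(flags = acc.flags :+ f))
    })
    .filter(_.flags.nonEmpty)
}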

Load more records from Gatling feeder

I would like to inject n rows from my CSV file into the Gatling feeder. The default approach of Gatling is to read and inject one row at a time. However, I cannot find anywhere how to take and inject, e.g., an Array into a template.
I came up with creating a JSON template with Gatling Expressions as some of the fields. The issue is that I have a JSON array with N elements:
[
  {"myKey": ${value}, "mySecondKey": ${value2}, ...},
  {"myKey": ${value}, "mySecondKey": ${value2}, ...},
  {"myKey": ${value}, "mySecondKey": ${value2}, ...},
  {"myKey": ${value}, "mySecondKey": ${value2}, ...}
]
And my csv:
value,value2,...
value,value2,...
value,value2,...
value,value2,...
...
I would like to make it as efficient as possible. My data is in a CSV file, so I would like to use the csv feeder. Also, the file is large, so readRecords is not possible, since I run out of memory.
Is there a way I can put N-records into the request body using Gatling?
From the documentation:
feed(feeder, 2)
Old Gatling versions:
Attribute names will be suffixed. For example, if the columns are named “foo” and “bar” and you’re feeding 2 records at once, you’ll get “foo1”, “bar1”, “foo2” and “bar2” session attributes.
Modern Gatling versions:
values will be arrays containing all the values of the same key.
In this latter case, you can access a value at a given index with Gatling EL: #{foo(0)}, #{foo(1)}, #{bar(0)} and #{bar(1)}
It seems that the documentation on this front might have changed a bit since then:
It’s also possible to feed multiple records at once. In this case, values will be arrays containing all the values of the same key.
I personally wrote this in Java, but it is easy to find the syntax for Scala as well in the documentation.
The solution I used for my CSV file is to add the feeder to the scenario like:
.feed(CoreDsl.csv("pathofyourcsvfile"), NUMBER_OF_RECORDS)
To apply/receive that array data during your .exec you can do something like this:
.post("YourEndpointPath").body(StringBody(session -> yourMethod(session.get(YourStringKey))))
In this case, I am using a POST and a request body, but the concept remains similar for a GET and its corresponding query parameters. So basically, you can use the session lambda in combination with the session.get method.
"yourMethod" can then receive this parameter as an Object[].

How does resource.data.size() work in firestore rules (what is being counted)?

TLDR: What is request.resource.data.size() counting in the firestore rules when writing, say, some booleans and a nested Object to a document? Not sure what the docs mean by "entries in the map" (https://firebase.google.com/docs/reference/rules/rules.firestore.Resource#data, https://firebase.google.com/docs/reference/rules/rules.Map) and my assumptions appear to be wrong when testing in the rules simulator (similar problem with request.resource.data.keys().size()).
Longer version: I'm running into a problem in Firestore rules where I'm not able to update data as expected (despite similar tests working in the rules simulator). I have narrowed the problem down to the point where I can see that it is a rule checking that request.resource.data.size() equals a certain number.
An example of the data being passed to the firestore update function looks like
Object {
  "parentObj": Object {
    "nestedObj": Object {
      "key1": Timestamp {
        "nanoseconds": 998000000,
        "seconds": 1536498767,
      },
    },
  },
  "otherKey": true,
}
where the timestamp is generated via firebase.firestore.Timestamp.now().
This appears to work fine in the rules simulator, but not for the actual data when doing
let obj = {}
obj.otherKey = true
// since want to set object key name dynamically as nestedObj value,
// see https://stackoverflow.com/a/47296152/8236733
obj.parentObj = {} // needed for adding nested dynamic keys
obj.parentObj[nestedObj] = {
  key1: fb.firestore.Timestamp.now()
}
firebase.firestore.collection('mycollection')
  .doc('mydoc')
  .update(obj)
Among some other rules, I use the rule request.resource.data.size() == 2, and this appears to be the rule that causes a permission-denied error (since commenting out this rule gets things working again). I would think that since the object is being passed with 2 (top-level) keys, request.resource.data.size() would be 2, but this is apparently not the case (nor is it the total number of keys in the passed object) (similar problem with request.resource.data.keys().size()). So that's a long example for a short question. It would be very helpful if someone could clarify for me what is going wrong here.
From my last communications with firebase support around a month ago - there were issues with request.resource.data.size() and timestamp based security rules for queries.
I was also told that request.resource.data.size() is the size of the document AFTER a successful write. So if you're writing 2 additional keys to a document with 4 keys, that value you should be checking against is 6, not 2.
Having said all that, I am still having problems with request.resource.data.size() and alternatives such as request.resource.size(), which seems to be used in this documentation:
https://firebase.google.com/docs/firestore/solutions/role-based-access
I also have some places in my security rules where it seems to work. I personally don't know why that is though.
I've been struggling with that for a few hours, and I see now that the doc on Firebase is clear: "the request.resource variable contains the future state of the document". So it counts ALL the fields, not only the ones being sent.
https://firebase.google.com/docs/firestore/security/rules-conditions#data_validation.
But there is actually another way to ONLY count the number of fields being sent: request.writeFields.size(). The writeFields property is a list of all the incoming fields.
Beware: writeFields is deprecated and may stop working anytime, but I have not found any replacement.
EDIT: writeFields apparently does not work in the simulator anymore...

Spark: How to structure a series of side effect actions inside mapping transformation to avoid repetition?

I have a spark streaming application that needs to take these steps:
Take a string, apply some map transformations to it
Map again: If this string (now an array) has a specific value in it, immediately send an email (or do something OUTSIDE the spark environment)
collect() and save in a specific directory
apply some other transformation/enrichment
collect() and save in another directory.
As you can see, this implies two lazily activated computations, which would do the OUTSIDE action twice. I am trying to avoid caching, as at some hundreds of lines per second this would kill my server.
I am also trying to maintain the order of operations, though this is not that important: is there a solution I do not know of?
EDIT: my program as of now:
kafkaStream;
lines = take the value, discard the topic;
lines.foreachRDD {
  splittedRDD = arg.map { split the string };
  assRDD = splittedRDD.map { associate to a table };
  flaggedRDD = assRDD.map { add a boolean parameter under an if condition + send mail };
  externalClass.saveStaticMethod( flaggedRDD.collect() and save in file );
  enrichRDD = flaggedRDD.map { enrich with external data };
  externalClass.saveStaticMethod( enrichRDD.collect() and save in file );
}
I put the saving part after the email so that if something goes wrong with it at least the mail has been sent.
In the end, the methods I found were these:
In the DStream transformation before the side-effecting one, make a copy of the DStream: one copy goes on with the transformations, the other gets the .foreachRDD{ outside action }. There is no major downside to this, as it is just one more RDD on a worker node.
Extracting the {outside action} from the transformation and filtering on the already-sent mails: filter out elements whose mail has already been sent. This is almost a superfluous operation, as it will filter out all of the RDD elements.
Caching before going on (although I was trying to avoid it, there was not much else to do).
If you are trying not to cache, solution 1 is the way to go; a sketch of it follows below.
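A rough illustration of solution 1 only (the record type and the helper functions are hypothetical stand-ins for the question's split / associate / flag / enrich / save steps): the mail side effect gets its own foreachRDD branch, while the rest of the pipeline continues from the same transformed stream.
import org.apache.spark.streaming.dstream.DStream

// Hypothetical record type and helpers; replace with the real ones.
case class Record(fields: Array[String], flagged: Boolean = false)

def parse(line: String): Record         = Record(line.split(","))
def associate(r: Record): Record        = r                           // associate to a table
def addFlag(r: Record): Record          = r.copy(flagged = r.fields.contains("ALERT"))
def sendMail(r: Record): Unit           = ()                          // the OUTSIDE action
def saveToFile(rs: Array[Record]): Unit = ()                          // externalClass.saveStaticMethod stand-in
def enrich(r: Record): Record           = r

def process(lines: DStream[String]): Unit = {
  val flagged = lines.map(parse).map(associate).map(addFlag)

  // Branch 1: only the outside action, on its own copy of the stream.
  flagged.foreachRDD(rdd => rdd.filter(_.flagged).collect().foreach(sendMail))

  // Branch 2: the rest of the pipeline, saving and enriching as before.
  flagged.foreachRDD { rdd =>
    saveToFile(rdd.collect())
    saveToFile(rdd.map(enrich).collect())
  }
}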

Intersystems Cache - Maintaining Object Code to ensure Data is Compliant with Object Definition

I am new to using InterSystems Cache and face an issue where I am querying data stored in Cache, exposed by classes which do not seem to accurately represent the data in the underlying system. The data stored in the globals is almost always larger than what is defined in the object code.
As such I get errors like the one below very frequently.
Msg 7347, Level 16, State 1, Line 2
OLE DB provider 'MSDASQL' for linked server 'cache' returned data that does not match expected data length for column '[cache]..[namespace].[tablename].columname'. The (maximum) expected data length is 5, while the returned data length is 6.
Does anyone have any experience implementing some type of quality process to ensure that the object definitions (SQL mappings) are maintained in such a way that they can accommodate the data which is being persisted in the globals?
Property columname As %String(MAXLEN = 5, TRUNCATE = 1) [ Required, SqlColumnNumber = 2, SqlFieldName = columname ];
In this particular example the system has the column defined with a MAXLEN of 5; however, the data stored in the system is 6 characters long.
How can I proactively monitor and repair such situations?
/*
I did not create these object definitions in cache
*/
It's not completely clear what "monitor and repair" would mean for you, but:
How much control do you have over the database side? Cache runs code for a data type when converting from a global to ODBC, using the LogicalToODBC method of the data-type class. If you change the property types from %String to your own class, AppropriatelyNamedString, then you can override that method to automatically truncate, if that's what you want to do. It is possible to change all the %String property types programmatically using the %Library.CompiledClass class.
It is also possible to run code within Cache to find records with properties that are above the (somewhat theoretical) maximum length. This obviously would require full table scans. It is even possible to expose that code as a stored procedure.
Again, I don't know what exactly you are trying to do, but those are some options. They probably do require getting deeper into the Cache side than you would prefer.
As far as preventing the bad data in the first place, there is no general answer. Cache allows programmers to directly write to the globals, bypassing any object or table definitions. If that is happening, the code doing so must be fixed directly.
Edit: Here is code that might work for detecting bad data. It might not work if you are doing certain funny stuff, but it worked for me. It's kind of ugly because I didn't want to break it up into methods or tags. This is meant to run from a command prompt, so it would probably have to be modified for your purposes.
{
S ClassQuery=##CLASS(%ResultSet).%New("%Dictionary.ClassDefinition:SubclassOf")
I 'ClassQuery.Execute("%Library.Persistent") b q
While ClassQuery.Next(.sc) {
If $$$ISERR(sc) b Quit
S ClassName=ClassQuery.Data("Name")
I $E(ClassName)="%" continue
S OneClassQuery=##CLASS(%ResultSet).%New(ClassName_":Extent")
I '$IsObject(OneClassQuery) continue //may not exist
try {
I 'OneClassQuery.Execute() D OneClassQuery.Close() continue
}
catch
{
D OneClassQuery.Close()
continue
}
S PropertyQuery=##CLASS(%ResultSet).%New("%Dictionary.PropertyDefinition:Summary")
K Properties
s sc=PropertyQuery.Execute(ClassName) I 'sc D PropertyQuery.Close() continue
While PropertyQuery.Next()
{
s PropertyName=$G(PropertyQuery.Data("Name"))
S PropertyDefinition=""
S PropertyDefinition=##CLASS(%Dictionary.PropertyDefinition).%OpenId(ClassName_"||"_PropertyName)
I '$IsObject(PropertyDefinition) continue
I PropertyDefinition.Private continue
I PropertyDefinition.SqlFieldName=""
{
S Properties(PropertyName)=PropertyName
}
else
{
I PropertyName'="" S Properties(PropertyDefinition.SqlFieldName)=PropertyName
}
}
D PropertyQuery.Close()
I '$D(Properties) continue
While OneClassQuery.Next(.sc2) {
B:'sc2
S ID=OneClassQuery.Data("ID")
Set OneRowQuery=##class(%ResultSet).%New("%DynamicQuery:SQL")
S sc=OneRowQuery.Prepare("Select * FROM "_ClassName_" WHERE ID=?") continue:'sc
S sc=OneRowQuery.Execute(ID) continue:'sc
I 'OneRowQuery.Next() D OneRowQuery.Close() continue
S PropertyName=""
F S PropertyName=$O(Properties(PropertyName)) Q:PropertyName="" d
. S PropertyValue=$G(OneRowQuery.Data(PropertyName))
. I PropertyValue'="" D
.. S PropertyIsValid=$ZOBJClassMETHOD(ClassName,Properties(PropertyName)_"IsValid",PropertyValue)
.. I 'PropertyIsValid W !,ClassName,":",ID,":",PropertyName," has invalid value of "_PropertyValue
.. //I PropertyIsValid W !,ClassName,":",ID,":",PropertyName," has VALID value of "_PropertyValue
D OneRowQuery.Close()
}
D OneClassQuery.Close()
}
D ClassQuery.Close()
}
The simplest solution is to increase the MAXLEN parameter to 6 or larger. Caché only enforces MAXLEN and TRUNCATE when saving. Within other Caché code this is usually fine, but unfortunately ODBC clients tend to expect this to be enforced more strictly. The other option is to write your SQL like SELECT LEFT(columnname, 5)...
The simplest solution, which I use for all Integration Services packages for example, is to create a query that casts all nvarchar or char data to the correct length. This way, my data never fails due to truncation.
Optional:
First run a query like: SELECT Max(datalength(mycolumnName)) FROM cachenamespace.tablename
Your new query : SELECT cast(mycolumnname as varchar(6) ) as mycolumnname,
convert(varchar(8000), memo_field) AS memo_field
from cachenamespace.tablename
Your pain of getting the data will be lessened but not eliminated.
If you use any type of OLE DB provider, or if you use an OPENQUERY in SQL Server,
the casts must occur in the query sent to the InterSystems CACHE db, not in the outer query that retrieves data from the inner OPENQUERY.