Imagine a situation where you need to have recursive data structures, like trees, represented in a FIX message. How would you do that?
I could represent a data structure in JSON like this:
[
  {
    "name": "1"
  },
  {
    "name": "3",
    "chains": [
      [
        {
          "name": "a"
        },
        {
          "name": "c",
          "chains": [
            [
              {
                "name": "x"
              }
            ]
          ]
        }
      ],
      [
        {
          "name": "A"
        }
      ]
    ]
  }
]
How could I represent this in FIX?
I'm going to propose a solution here.
Standard FIX tag numbers are ignored.
Tags:
1 = Name
3 = NumberOfNodes
4 = NumberOfChains
Components:
Node: Name (1) tag is required and Chains component is optional
Chains: NumberOfChains (4) is required and at least one Chain is required
Chain: NumberOfNodes (3) is required and at least one Node is required
Lines starting with # are comments and are not part of the actual message.
New lines are tag delimiters.
# start of level 0
3=2
1=1
1=3
# start of level 1
4=2
3=2
1=a
1=c
# start of level 2
4=1
3=1
1=x
# end of level 2
3=1
1=A
# end of level 1
# end of level 0
Please comment on whether this is valid FIX and whether there is a better way to express this in FIX.
There is no good reason to have a recursive segment in a FIX message. Why would any financial info transmission need to go infinitely deep?
You can't find any information about it because there aren't any parties in the traditional FIX userbase who would want that.
I suppose you could customize your FIX data dictionary to make a repeating group contain itself. I suspect that such a DD would crash the code generators of at least one (if not all) of the QuickFIX ports, as they probably aren't checking for such insanity (and thus will keep creating recursive structures in your memory until they blow it).
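For illustration only, a self-referencing dictionary in QuickFIX's XML format might look something like the sketch below (the tag numbers, field names and message type are invented, and the header/trailer sections are trimmed). Whether any given QuickFIX port's code generator actually tolerates this is exactly the open question:
<fix type="FIX" major="4" minor="4">
  <header/>
  <trailer/>
  <messages>
    <message name="TreeData" msgtype="U100" msgcat="app">
      <component name="Chain" required="Y"/>
    </message>
  </messages>
  <components>
    <!-- A Chain is a group of Nodes; a Node optionally contains more Chains. -->
    <component name="Chain">
      <group name="NoNodes" required="Y">
        <field name="NodeName" required="Y"/>
        <group name="NoChains" required="N">
          <component name="Chain" required="Y"/> <!-- the recursive reference -->
        </group>
      </group>
    </component>
  </components>
  <fields>
    <!-- made-up custom tag numbers -->
    <field number="20001" name="NodeName" type="STRING"/>
    <field number="20002" name="NoNodes" type="NUMINGROUP"/>
    <field number="20003" name="NoChains" type="NUMINGROUP"/>
  </fields>
</fix>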
I'm trying to make a custom syntax highlighter for my own markup language. All the examples are complicated, missing steps and are very, very hard to understand.
Is there anything that fully documents how to make a syntax highlighter?
(for VSCode, by the way)
For example, this video (https://www.youtube.com/watch?v=5msZv-nKebI) has an extremely large skip in the middle and doesn't really explain much.
My current code, made with the Yeoman generator, is:
{
  "$schema": "https://raw.githubusercontent.com/martinring/tmlanguage/master/tmlanguage.json",
  "name": "BetterMarkupLanguage",
  "patterns": [
    {
      "include": "#keywords"
    },
    {
      "include": "#strings"
    }
  ],
  "repository": {
    "keywords": {
      "patterns": [{
        "name": "entity.other.bml",
        "match": "\\b({|}|\\\\|//)\\b"
      }]
    },
    "strings": {
      "name": "string.quoted.double.bml",
      "begin": "`",
      "end": "`"
    }
  },
  "scopeName": "source.bml"
}
Synopsis
I'm not sure at what level you're approaching the problem, but there are basically two kinds of syntax-highlighting:
Just identify little nuggets of canonically identifiable tokens (strings, numbers, maybe operators, reserved words, comments) and highlight those, OR
Do the former, and also add in context awareness.
tmLanguage engines basically have two jobs:
Assign scopes.
Maintain a stack of contexts.
Samples
Let's say you make a definition for integers with this pattern:
"integers": {
"patterns": [{
"name": "constant.numeric.integer.bml",
"match": "[+-]\\d+"
}]
},
When the engine matches an integer like that, it will match to the end of the regex, assign the scope from "name", and then continue matching things in this same context.
Compare that to your "strings" definition:
"strings": {
"name": "string.quoted.double.bml", // should be string.quoted.backtick.bml
"begin": "`",
"end": "`"
},
Those "begin" and "end" markers denote a change in the tmLanguage stack. You have pushed into a new context inside of a string.
Right now, there are no matches configured in this context, but you could add some by giving "strings" a "patterns" key with its own "match"es or "include"s. An "include" pulls in another set of matches, like the "integers" you've defined elsewhere, and adding it to the "strings" patterns would match integers inside strings. Matching integers there might be silly, but think about escaped backticks: you want to scope those and stay in the same context within "strings". You don't want them popping you back out prematurely.
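As a minimal sketch (assuming your markup escapes a backtick as \` — adjust the escape to whatever your language actually uses), the "strings" entry could grow its own patterns like this:
"strings": {
  "name": "string.quoted.backtick.bml",
  "begin": "`",
  "end": "`",
  "patterns": [
    {
      "name": "constant.character.escape.bml",
      "match": "\\\\`" // hypothetical escape sequence for your language
    },
    {
      "include": "#integers"
    }
  ]
},
The escape rule matches at the backslash, so it consumes the \` pair before the "end" pattern ever sees that backtick, and the string context doesn't pop early. The "#integers" include is only there to show that any repository entry can be reused inside the string context.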
Order of operations
You'll eventually notice that the first pattern encountered is matched. Remember the integers set? What happens when you have 45.125? It will decide to match the 45 and the 125 as integers and ignore the . entirely. If you have a "floats" pattern, you want to include that before your naïve integer pattern. Both these "numbers" definitions below are equivalent, but one lets you re-use floats and integers independently (if that's useful for your language):
"numbers": {
"patterns": [
{"include": "#floats"},
{"include": "#integers"}
]
},
"integers": {
"patterns": [{
"name": "constant.numeric.integer.bml",
"match": "[+-]\\d+"
}]
},
"floats": {
"patterns": [{
"name": "constant.numeric.float.bml",
"match": "[+-]\\d+\\.\\d*"
}]
},
"numbers": {
"patterns": [{
"name": "constant.numeric.float.bml",
"match": "[+-]\\d+\\.\\d*"
}, {
"name": "constant.numeric.integer.bml",
"match": "[+-]\\d+"
}]
},
Doing it right
The "numbers"/"integers"/"floats" thing was trivial, but well-designed syntax definitions will define utility groups that "include" equivalent things together for re-usability:
A normal programming language will have things like
A "statements" group of all things that can be directly executed. This then may or may not (language-dependent) include...
An "expressions" group of things you can put on the right-hand-side of an assignment, which will definitely include...
An "atoms" group of strings, numbers, chars, etc. that might also be valid statements, but that also depends on your language.
"function-definitions" probably won't be in "expressions" (unless they are lambdas) but probably would be in "statements." Function definitions might push into a context that lets you return and so on.
A markup language like yours might have (see the sketch after this list)
An "inline" group to keep track of all the markup one can have within a block.
A "block" group to hold lists, quotes, paragraphs, headers.
...
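To make that concrete, a markup-language repository might wire its groups together roughly like this ("headers", "lists", "paragraphs", "bold" and "italics" are placeholders for rules you would define yourself, not entries that already exist in your grammar):
"block": {
  "patterns": [
    { "include": "#headers" },    // placeholders: define these the same
    { "include": "#lists" },      // way you defined "strings" and
    { "include": "#paragraphs" }  // "keywords" in your repository
  ]
},
"inline": {
  "patterns": [
    { "include": "#bold" },
    { "include": "#italics" },
    { "include": "#strings" }
  ]
},
The top-level "patterns" array would then include just "#block", block-level rules would include "#inline" wherever nested markup is allowed, and each leaf rule stays independently reusable.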
Though there is more you could learn (capture groups, injections, scope conventions, etc.), this is hopefully a practical overview for getting started.
Conclusion
When you write your syntax highlighting, think to yourself: Does matching this token put me in a place where things like it can be matched again? Or does it put me in a different place where different things (more or fewer) ought to be matched? If the latter, what returns me to the original set of matches?
I have a job that runs on a daily basis. The purpose of this job is to correlate HTTP requests with their corresponding HTTP replies. This can be achieved because all HTTP requests & HTTP replies have a GUID that uniquely binds them.
The job deals with two DataFrames: one containing the requests and one containing the replies. To correlate the requests with their replies, I am obviously doing an inner join based on that GUID.
The problem that I am running into is that a request that was captured on day X at 23:59:59 might see its reply captured on day X+1 at 00:00:01 (or vice-versa) which means that they will never get correlated together, neither on day X nor on day X+1.
Here is example code that illustrates what I mean:
val day1_requests = """[ { "id1": "guid_a", "val" : 1 }, { "id1": "guid_b", "val" : 3 }, { "id1": "guid_c", "val" : 5 }, { "id1": "guid_d", "val" : 7 } ]"""
val day1_replies = """[ { "id2": "guid_a", "val" : 2 }, { "id2": "guid_b", "val" : 4 }, { "id2": "guid_c", "val" : 6 }, { "id2": "guid_e", "val" : 10 } ]"""
val day2_requests = """[ { "id1": "guid_e", "val" : 9 }, { "id1": "guid_f", "val" : 11 }, { "id1": "guid_g", "val" : 13 }, { "id1": "guid_h", "val" : 15 } ]"""
val day2_replies = """[ { "id2": "guid_d", "val" : 8 }, { "id2": "guid_f", "val" : 12 }, { "id2": "guid_g", "val" : 14 }, { "id2": "guid_h", "val" : 16 } ]"""
val day1_df_requests = spark.read.json(spark.sparkContext.makeRDD(day1_requests :: Nil))
val day1_df_replies = spark.read.json(spark.sparkContext.makeRDD(day1_replies :: Nil))
val day2_df_requests = spark.read.json(spark.sparkContext.makeRDD(day2_requests :: Nil))
val day2_df_replies = spark.read.json(spark.sparkContext.makeRDD(day2_replies :: Nil))
day1_df_requests.show()
day1_df_replies.show()
day2_df_requests.show()
day2_df_replies.show()
day1_df_requests.join(day1_df_replies, day1_df_requests("id1") === day1_df_replies("id2")).show()
// guid_d from request stream is left over, as well as guid_e from reply stream.
//
// The following 'join' is done on the following day.
// I would like to carry 'guid_d' into day2_df_requests and 'guid_e' into day2_df_replies.
day2_df_requests.join(day2_df_replies, day2_df_requests("id1") === day2_df_replies("id2")).show()
I can see 2 solutions.
Solution#1 - custom carry over
In this solution, on day X, I would do a "full_outer" join instead of an inner-join, and I would persist into some storage the results that are missing one side or the other. On the next day X+1, I would load this extra data along with my "regular data" when doing my join.
An additional implementation detail is that my custom carry-over would have to discard "old carry-overs", otherwise they could pile up: it is possible that an HTTP request or HTTP reply from 10 days ago never sees its counterpart (maybe the app crashed, for instance, so an HTTP request was emitted but no reply ever came).
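To make Solution #1 concrete, here is roughly what I have in mind, written against the toy schema above in the same spark-shell style; the parquet paths and the captured_at timestamp column are illustrative assumptions:
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions._

// Rough sketch of Solution #1 (illustrative paths and column names).
def correlateWithCarryOver(todaysRequests: DataFrame, todaysReplies: DataFrame): DataFrame = {
  import spark.implicits._

  // Add the previous run's unmatched rows to today's data.
  val requests = todaysRequests.union(spark.read.parquet("/data/carryover/requests"))
  val replies  = todaysReplies.union(spark.read.parquet("/data/carryover/replies"))

  val joined = requests.join(replies, requests("id1") === replies("id2"), "full_outer")

  // Rows missing one side become the next run's carry-over; drop anything older
  // than 10 days (assumes a captured_at timestamp column on both sides).
  // Written to a "_next" path so we don't overwrite what we are still reading;
  // the next run would rotate it into place.
  joined.filter(replies("id2").isNull)
    .select(requests("*"))
    .filter($"captured_at" > date_sub(current_date(), 10))
    .write.mode("overwrite").parquet("/data/carryover/requests_next")

  joined.filter(requests("id1").isNull)
    .select(replies("*"))
    .filter($"captured_at" > date_sub(current_date(), 10))
    .write.mode("overwrite").parquet("/data/carryover/replies_next")

  // Fully correlated pairs are the job's actual output.
  joined.filter(requests("id1").isNotNull && replies("id2").isNotNull)
}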
Solution#2 - guid folding
In this solution, I would make the assumption that my requests and replies are within a certain amount of time of one another (e.g. 5 minutes). Thus on day X+1, I would also load the last 5 minutes of data from day X and include that as part of my join. This way, I don't need any extra storage like in solution #1. The disadvantage, however, is that the target storage must be able to deal with duplicate entries (for instance, if the target storage is a SQL table, the PK would have to be this GUID and the writes would have to be upserts instead of inserts).
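A sketch of Solution #2, again with illustrative paths, a captured_at column, and daily-partitioned storage (all assumptions for the sake of the example):
import spark.implicits._

// Day X+1 plus the last 5 minutes of day X (paths and dates are illustrative).
val requests = spark.read.parquet("/data/requests/day=2021-06-02")
  .union(spark.read.parquet("/data/requests/day=2021-06-01")
    .filter($"captured_at" >= "2021-06-01 23:55:00"))

val replies = spark.read.parquet("/data/replies/day=2021-06-02")
  .union(spark.read.parquet("/data/replies/day=2021-06-01")
    .filter($"captured_at" >= "2021-06-01 23:55:00"))

// Plain inner join; the 5-minute overlap means some GUIDs are emitted twice,
// so the target storage must upsert on the GUID rather than insert.
requests.join(replies, requests("id1") === replies("id2")).show()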
Question
So my question is whether Spark provides functionality to automatically deal with a situation like that, thus not requiring any of my two solutions and by the same fact making things easier and more elegant?
Bonus question
Let's assume that I need to do the same type of correlation but with a stream of data (i.e. instead of a daily batch job that runs on a static set of data, I use Spark Streaming and data is processed on live streams of requests & replies).
In this scenario, a "full_outer" join is obviously inappropriate (https://dzone.com/articles/spark-structured-streaming-joins) and actually unnecessary since Spark Streaming takes care of that for us by having a sliding window for doing the join.
However, I am curious to know what happens if the job is stopped (or crashes) and is then resumed. Similarly to the batch-mode example that I gave above, what if the job was interrupted after a request was consumed (and acknowledged) from the stream/queue but before its related reply was? Does Spark Streaming keep the state of its sliding window, so that resuming the job will be able to correlate as if the stream had never been interrupted?
P.S. backing up your answer with hyperlinks to reputable docs (like Apache's own) would be much appreciated.
I'm making a database on theses/arguments. They are related to other arguments, which I've placed in an object with a dynamic key, which is completely random.
{
  _id : "aeokejXMwGKvWzF5L",
  text : "test",
  relations : {
    cF6iKAkDJg5eQGsgb : {
      type : "interpretation",
      originId : "uFEjssN2RgcrgiTjh",
      ratings: [...]
    }
  }
}
Can I find this document if I only know what the value of type is? That is, I want to do something like this:
db.theses.find({ relations['anything']: { type: "interpretation" } })
This could've been done easily with the positional operator, if relations had been an array. But then I cannot make changes to the objects in ratings, as mongo doesn't support those updates. I'm asking here to see if I can keep from having to change the database structure.
Though you seem to have arrived at this structure because of a problem with updates when using nested arrays, you have really only caused another problem by doing something else that is not well supported: there is no "wildcard" concept for searching unspecified keys with the standard (and efficient) query operators.
The only way you can really search for such data is by using JavaScript code on the server to traverse the keys using $where. This is clearly not a really good idea as it requires brute force evaluation rather than using useful things like an index, but it can be approached as follows:
db.theses.find(function() {
  var relations = this.relations;
  return Object.keys(relations).some(function(rel) {
    return relations[rel].type == "interpretation";
  });
})
While this will return those objects from the collection that contain the required nested value, it must inspect every object in the collection in order to do the evaluation. This is why such evaluation should really only be used when paired with a condition that can use an index directly, i.e. a hard value from the object in the collection.
Still, the better solution is to consider remodelling the data to take advantage of indexes when searching. Where it is necessary to update the "ratings" information, basically "flatten" the structure so that each "rating" element is the only array data instead:
{
  "_id": "aeokejXMwGKvWzF5L",
  "text": "test",
  "relationsRatings": [
    {
      "relationId": "cF6iKAkDJg5eQGsgb",
      "type": "interpretation",
      "originId": "uFEjssN2RgcrgiTjh",
      "ratingId": 1,
      "ratingScore": 5
    },
    {
      "relationId": "cF6iKAkDJg5eQGsgb",
      "type": "interpretation",
      "originId": "uFEjssN2RgcrgiTjh",
      "ratingId": 2,
      "ratingScore": 6
    }
  ]
}
Now searching is of course quite simple:
db.theses.find({ "relationsRatings.type": "interpretation" })
And of course the positional $ operator can now be used with the flatter structure:
db.theses.update(
  { "relationsRatings.ratingId": 1 },
  { "$set": { "relationsRatings.$.ratingScore": 7 } }
)
Of course this means duplicating the "related" data for each "ratings" value, but that is generally the cost of being able to update by matched position, since only a single level of array nesting is supported.
So you can force the logic to match the way you have it structured, but it is not a great idea to do so and will lead to performance problems. If, however, your main need here is to update the "ratings" information rather than just append to the inner list, then a flatter structure will be of greater benefit and of course be a lot faster to search.
I'm trying to make a paginate mechanism for our product documents stored in MongoDB. What makes this tricky, is that each document can have several colors, and I need to paginate by these instead of the document itself. E.g. the example below has two colors, and should then count as 2 in my paginate results.
How would anyone go about doing this in the easiest / most effective way?
Thanks in advance!
{
  "_id": ObjectId("4fdbaf608b446b0477000142"),
  "created_at": new Date("14-10-2011 12:02:55"),
  "modified_at": new Date("15-6-2012 23:55:43"),
  "sku": "A1051g",
  "name": {
    "en": "Earrings - Celebrity"
  },
  "variants": [
    {
      color: {
        en: "Blue"
      }
    },
    {
      color: {
        en: "Yellow"
      }
    }
  ]
}
I like Sammaye's solution but another approach could just be pulling back more results than you need.
So for example, if you need 100 variants per page and each product has at least 1 variant, query with a limit of 100 to try and get 100 products, and therefore, at least 100 variants.
Chances are, you will have more than 100 variants (each product having more than 1), so build a list of products as you iterate over the cursor, keeping track of the number of variants.
When you have 100 variants, take note of how many products you have in the list, out of the 100 you retrieved, and use that as the skip for your next query.
This will eventually get expensive for large skips, as the server still has to seek over all the documents you skip, but it could be a good solution for now.
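A rough sketch of that loop in the mongo shell (the products collection name, the page size, and the starting skip are stand-ins; the variants field follows the document shown in the question):
// Returns one "page" counted in variants rather than in products.
function fetchVariantPage(skip, pageSize) {
  var products = [];
  var variantCount = 0;

  // Over-fetch: at most pageSize products can be needed, since each has >= 1 variant.
  var cursor = db.products.find().skip(skip).limit(pageSize);

  while (cursor.hasNext() && variantCount < pageSize) {
    var product = cursor.next();
    products.push(product);
    variantCount += product.variants.length;
  }

  // products.length is how many documents this page consumed,
  // i.e. the amount to add to 'skip' for the next query.
  return { products: products, nextSkip: skip + products.length };
}

// e.g. the first page of 100 variants:
var page = fetchVariantPage(0, 100);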
Say I have a hash with a list of delivery drivers (for the classic flower shop scenario). Each driver has a rating and an event signal URL (ESL). I want to raise an event only to the top three drivers in that list, sorted by ranking.
With a relational database, I'd run a query like this:
SELECT esl FROM driver ORDER BY ranking LIMIT 3;
Is there a way to do this in KRL? There are two requirements:
A way to sort the hash
A way to limit the number of times a foreach iterates
The second could be solved like this:
rule reset_counter {
  select when rfq delivery_ready
  noop();
  always {
    clear ent:loop_counter;
    raise explicit event loop_drivers;
  }
}

rule loop_on_drivers {
  select when explicit loop_drivers
  foreach app:drivers setting (driver)
  pre {
    esl = driver.pick("$.esl");
  }
  if (ent:loop_counter < 3) then {
    // Signal the driver's ESL
  }
  always {
    ent:loop_counter += 1 from 0;
  }
}
But that's kind of kludgy. Is there a more KRL-ish way to do it? And how should I solve the ordering problem?
EDIT: Here's the format of the app:drivers array, to make the question easier to answer:
[
  {
    "id": "1",
    "rating": "5",
    "esl": "http://example.com/esl"
  },
  {
    "id": "2",
    "rating": "3",
    "esl": "http://example.com/esl2"
  }
]
Without knowing the form of the hash, it's impossible to give you a specific answer, but you can use the sort operator to sort and then use the pick operator to pull what you need out of the hash.
Something like
driver_data.sort(function(){...}).pick("$..something[:2]")
"something" is the name from the hash of the relevant field.