DOM quotes by depth to Delta quotes - anyone an idea?

OK, here we go. This is a tricky one. Financial market.
I have a trading application originally developed for the Rithmic API. For various reasons I am now moving the real-time data feed over to NxCore.
Rithmic has no concept of "DOM levels". I get messages for every quote with volume and price. If a quote turns invalid, it gets 0 volume. Works nicely....
NxCore does work with levels. I get a different message for every level (D1 - D10).
The problem is that I need to move from this to the Rithmic representation without sending too many errors / surplus messages.
Basically, the issue looks like this:
We establish a DOM - just ignore the items here. Interesting is D1 8 # 11417
CBOT/YM.Z10 MM D2: 17 # 11416 vs 10 # 11421
CBOT/YM.Z10 MM D3: 10 # 11415 vs 14 # 11422
CBOT/YM.Z10 MM D5: 13 # 11413 vs 13 # 11424
CBOT/YM.Z10 MM D4: 17 # 11414 vs 12 # 11423
CBOT/YM.Z10 MM D9: 12 # 11409 vs 12 # 11428
CBOT/YM.Z10 MM D10: 14 # 11408 vs 14 # 11429
CBOT/YM.Z10 MM D5: 13 # 11413 vs 16 # 11424
CBOT/YM.Z10 MM D6: 20 # 11412 vs 11 # 11425
CBOT/YM.Z10 MM D1: 8 # 11417 vs 3 # 11420
CBOT/YM.Z10 2010-12-14 05:00:20.275000 BestBid 8 # 11417
Something happens to the BestBid here..... so now it goes to 11418
CBOT/YM.Z10 MM D1: 3 # 11418 vs 3 # 11420
CBOT/YM.Z10 MM D2: 10 # 11417 vs 10 # 11421
CBOT/YM.Z10 MM D3: 17 # 11416 vs 14 # 11422
CBOT/YM.Z10 MM D4: 10 # 11415 vs 12 # 11423
CBOT/YM.Z10 MM D5: 16 # 11414 vs 16 # 11424
CBOT/YM.Z10 MM D6: 13 # 11413 vs 11 # 11425
CBOT/YM.Z10 MM D7: 17 # 11412 vs 6 # 11426
CBOT/YM.Z10 MM D8: 12 # 11411 vs 5 # 11427
CBOT/YM.Z10 MM D9: 18 # 11410 vs 12 # 11428
CBOT/YM.Z10 MM D10: 12 # 11409 vs 14 # 11429
CBOT/YM.Z10 2010-12-14 05:00:30.325000 BestBid 3 # 11418
And that gives us 10 updates just to move the DOM on the bid side. What I would need here is 2....
0 # 11417
18 # 11409
Every line is a separate callback into my code (or event, if you want to use that term). I already know I need to keep some sort of buffer (D1 - D10 on both sides, bid and ask, with price, level and volume). My main problem is all the edge cases. When a volume changes at the same DOM level and the same price, that is easy.
But when the DOM moves like above, I get 10 updates, and most of them need to be thrown away.
Anyone an idea? Also note that D1 - D10 do not necessarily cover the next 10 prices - sometimes (off hours) there are gaps.... which makes things REALLY nasty. I sort of know that if I get a D1 or D10 I can throw out all reported items "further down" the line. But my problem is mostly around gaps...
Oh, and it needs to be fast. Like 200,000 messages per second sometimes.
Anyone a smart idea?

Have a DOM ladder for every symbol that you follow (two separate arrays, one for bid, one for ask).
This has to be dimensioned big enough for any kind of volatility that might be expected.
Usually 2000 levels will surely be enough for one day.
No problem with modern computer memories even if you follow quite a few symbols at the same time.
When a DOM event comes in just
- update the volume at the appropriate price level
- update the index of current inside bid or ask (if this information comes also with the event)
In other words ignore the "DOM level" information, just write the volume into the appropriate place.
It is safe to assume that there are no "holes" in the DOM, the levels are always positioned at one tick distance (this can be seen in your sample).
Hope I outlined this idea clearly enough. If not, feel free to ask.
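
The ladder idea above can be sketched roughly as follows. This is only a minimal illustration, not NxCore or Rithmic code: the class name `SideLadder`, its `update` method and the `base_price` parameter are made up for the example, and it assumes prices arrive as integer ticks (as in the YM sample). It ignores the D1 - D10 level number, stores the volume at its price slot, and reports a delta only when the stored volume actually changes.

```python
class SideLadder:
    """Volumes for one side (bid or ask) of one symbol, indexed by price tick.

    Hypothetical helper for illustration; prices are assumed to be integers.
    """

    def __init__(self, base_price, levels=2000):
        self.base_price = base_price   # lowest price the ladder can hold
        self.volumes = [0] * levels    # volumes[i] is the size at base_price + i

    def update(self, price, volume):
        """Store the volume at its price slot.

        Returns a (price, volume) delta, or None when nothing changed.
        """
        idx = price - self.base_price
        if not 0 <= idx < len(self.volumes):
            raise IndexError("price outside the ladder range")
        if self.volumes[idx] == volume:
            return None                # same price, same size: nothing to forward
        self.volumes[idx] = volume
        return (price, volume)


# Feed every incoming level message through update() and forward only real deltas.
bids = SideLadder(base_price=11000)
messages = [(11417, 8), (11416, 17), (11417, 8), (11418, 3)]
deltas = [d for d in (bids.update(p, v) for p, v in messages) if d is not None]
```

The sketch does not decide what to do with prices that drop out of the ten reported levels; whether those should be zeroed or kept is a separate choice that depends on how the consumer interprets the book.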

If I compute this right:
2000 levels * 8000 symbols * 4 bytes (size) * 2 (bid/ask) = 128 MB
No problem with today's computers.
Probably 2 bytes for size would also do the job, which cuts it down to 64 MB.
(Btw if I get this right the job is to follow the complete US stocks universe via NxCore)

Related

CPU and Memory Friendly Solution to Merge Large Matrix

For the following typical case:
n = 1000000;
r = randi(n,n,2);
(assume there are 0.05% common numbers between all rows; n could be even tens of millions)
I am looking for a CPU and Memory efficient solution to merge rows based on any common items (here integer numbers). A list of sample codes in Python is available here and a quick try to translate one into Matlab can be found here.
In my attempts they take ages (minutes to hours), so I am looking for a faster solution.
For the above example, the typical output should look like (cell):
{
[1 90 34 67 ... 9]
[35 89]
[45000 23 828 130 8999 45326 ... 11]
...
}
Note also that I have tried to compile it as mex, but that failed because Matlab Coder does not support cell arrays.
Edit: A tiny demonstration example
%---------------------------------------
clc
n = 100;
r = randi(n,n,2); % random integers in [1,n], size(n,2)
%---------------------------------------
>> r
r =
82 17 % (1) 82 17
91 13 % (2) 91 13
13 32 % (3) 91 13 32 merged with (2), common 13
82 53 % (4) 82 17 53 merged with (1), common 82
64 17 % (5) 82 17 53 64 merged with (4), common 17
...
94 45
13 31 % (77) 91 13 32 31 merged with (3), common 13
57 51
47 52
2 13 % (80) 91 13 32 31 2 merged with (77), common 13
34 80
%---------------------------------------
c = merge(r); % cpu and memory friendly solution is searched for.
%---------------------------------------
c =
[82 17 53 64]
[91 13 32 31 2]
...

You need an index.
In Python, use a dict. In MATLAB - I'd not use MATLAB, because open-source is the future, and MATLAB is dying out.
But Python is quite slow. You can likely get a 10x speedup by using e.g. Cython to translate and optimize the code in C. Avoid using Python data types such as a list of int, because they are very memory intensive. numpy has memory-efficient arrays of integers.
If you get a new pair (a,b) you can use this dictionary to find existing items to merge. Then update the dict after the merge.
Actually for integers, you should use an array instead of a dict.
The trickiest part is handling the case when both a and b exist, but are large different groups. There are some neat optimizations possible here, if that isn't fast enough yet.
It's not clustering, but connected components.
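
A minimal sketch of the connected-components view, using a union-find (disjoint-set) structure stored in a plain array, as suggested above for integer items. The function name `merge_rows` and its signature are made up for this example, and it assumes every value lies in 1..n.

```python
def merge_rows(pairs, n):
    """Group the values in `pairs` into connected components.

    pairs: iterable of (a, b) integer pairs with 1 <= a, b <= n (assumption).
    Returns a list of groups; values appear in order of first occurrence.
    """
    pairs = list(pairs)
    parent = list(range(n + 1))            # parent[x] == x means x is a root

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving keeps trees shallow
            x = parent[x]
        return x

    # Union step: every row links its two values.
    for a, b in pairs:
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[rb] = ra

    # Collect step: bucket each distinct value under its root.
    groups = {}
    seen = set()
    for row in pairs:
        for v in row:
            if v not in seen:
                seen.add(v)
                groups.setdefault(find(v), []).append(v)
    return list(groups.values())


# The merged rows from the tiny demonstration in the question.
rows = [(82, 17), (91, 13), (13, 32), (82, 53), (64, 17), (13, 31), (2, 13)]
print(merge_rows(rows, 100))
```

For tens of millions of rows the same logic would more likely be run over numpy arrays or compiled code; the sketch only shows the structure.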

Displaying data to a map, creating a choropleth

What I would like to do is create a choropleth map which is darker or lighter based on the number of data points in a particular area.
I have the following data:
RO-B, 9
PL-MZ, 24
SE-C, 3
DE-NI, 5
PL-DS, 14
ES-CM, 11
RO-IS, 2
DE-BY, 51
SE-Z, 18
CH-BE, 10
PL-WP, 1
ES-IB, 1
DE-BW, 21
DE-BE, 24
DE-BB, 1
IE-M, 26
ES-PV, 1
DE-SN, 6
CH-ZH, 31
ES-GA, 1
NL-GE, 2
IE-U, 1
ES-AN, 4
FR-J, 82
DE-HH, 34
PL-PD, 1
PL-LD, 6
GB-WLS, 60
GB-ENG, 8619
RO-BV, 45
CH-VD, 2
PL-SL, 1
DE-HE, 17
SE-I, 1
HU-PE, 4
PL-MA, 4
SE-AB, 3
CH-BS, 20
ES-CT, 31
DE-TH, 25
IE-C, 1
CZ-ST, 1
DE-NW, 29
NL-NH, 3
DE-RP, 9
CZ-PR, 4
IE-L, 134
HU-BU, 10
RO-CJ, 1
GB-NIR, 29
ES-MD, 33
CH-LU, 11
GB-SCT, 172
CH-GE, 3
BE-BRU, 30
BE-VLG, 25
Each row gives the ISO 3166-2 code of a country and subregion, followed by the number of data points associated with that region.
I've seen this project on GitHub which seems to also use the same ISO3166-2 to reference countries.
I'm trying to figure out how I could modify their code to display my data points, such that if the number is higher the area would be darker, if the number is less it would be lighter.
It seems it should be possible, the first thing I was trying to do was modify this jsfiddle code, which seems to be very close to what I need, but I couldn't get it to work.
For instance this line:
osmeRegions.geoJSON('RU-MOW',
Seems to directly reference a ISO3166-2 code, but it's not as simple as just changing that (or maybe it is but I couldn't get that to work properly).
Does anyone know if I could possibly adapt the code from that project to create the map rendering I've described?
Or perhaps there's a different way to achieve the same goal?
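
Whichever map library ends up drawing the regions, the "darker for more data points" part can be computed separately. A rough sketch, assuming a white-to-blue ramp and a logarithmic scale (an assumption made here because GB-ENG at 8619 would otherwise wash out every other region); the function name `shade_for_count` is hypothetical and is not part of the GitHub project or the jsfiddle mentioned above.

```python
import math

def shade_for_count(count, max_count):
    """Return a hex colour from white (0 points) to pure blue (max_count points)."""
    t = math.log1p(count) / math.log1p(max_count)  # 0.0 .. 1.0 on a log scale
    level = round(255 * (1 - t))                   # red/green fade out as t grows
    return "#{0:02x}{0:02x}ff".format(level)


counts = {"RO-B": 9, "PL-MZ": 24, "GB-ENG": 8619, "IE-L": 134}
top = max(counts.values())
colours = {code: shade_for_count(n, top) for code, n in counts.items()}
```

The resulting dictionary maps each ISO 3166-2 code to a fill colour that can then be handed to whatever styling hook the chosen map library offers.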

Rearrange distribution function Matlab

I have the following data representing values over a 12 month period:
1. 0
2. 253
3. 168
4. 323
5. 556
6. 470
7. 225
8. 445
9. 98
10. 114
11. 381
12. 187
How can I smooth this line forward?
What I need is that going through the list sequentially any value that is above the mean (268) be evenly distributed amongst the remaining months- but in such a way that it produces as smooth a line as possible. I need to go through from Jan to Dec in order. Looking forward I want to sweep any excess (peaks) into the months still to come such that the distribution is as even as possible (such that the troughs are filled first). So the issue is to, at each point, determine what the "excess" for that month is and secondly how to distribute that amongst the months still to come.
I have used
p = find(Diff>0);
n = find(Diff<=0);
POS = Diff(p,1);
NEG = Diff(n,1)
to see where shortfalls/excesses against the mean exist, but I am unsure how to write code that will redistribute forward by allocating to the "troughs" of the distribution first. An analogy is that these numbers represent harvest quantities and I must give out the harvest to the population. How do I redistribute the incoming harvest over the year such that I minimise excess supply/under supply? I obviously cannot give out anything I haven't received in a particular month unless I have stored some harvest from previous months.
e.g. I start in Jan, I see that I cannot give anything to the months Feb to Dec so the value for Jan is 0. In Feb I have 253- do I adjust 253 downwards or give it all out? If so by how much? and where do I redistribute the excess I trim between Mar to Dec? And so on and so forth.. How do I do this to give as smooth (even) a distribution as possible?
For any month the new value assigned to that month cannot exceed the current value. The sum for the 12 months must be equal before and after smoothing. As the first position January will always be 0.

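Simple version: it just loops through and, if the next month is lower than the current month, passes value forward to equalise them.
y = [0 253 168 323 556 470 225 445 98 114 381 187];
for n = 1:11
    if y(n) > y(n+1)
        % split the pair evenly, which moves the surplus to the later month
        y(n:n+1) = (y(n) + y(n+1)) / 2;
    end
end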

It's not very clear to me what you're asking... It sounds a bit like a roundabout way of asking how to fit a straight line to data. If that is the case, see below. Otherwise, please clarify a bit more what you want. Provide toy example input data and the expected output data.
y = [ 0 253 168 323 556 470 225 445 98 114 381 187 ].';
x = (0:numel(y)-1).';
A = [ones(size(x)) x];
plot(...
x, y, 'b.',...
x, A*(A\y), 'r')
xlabel('Month'), ylabel('Data')
legend('original data', 'fit')

I don't get exactly what you want either; maybe something simple like this?
year = [0 253 168 323 556 470 225 445 98 114 381 187];
m = mean(year);
total_before = sum(year)
linear_year = linspace(0, m*2, 12);
total_after = sum(linear_year)
This gives you a line; the sum stays the same and the line is perfectly smooth ...

Stata longwise average

I'm using Stata and trying to compute conditional means based on time/date. For each store I want to calculate mean (inventory) per year. If there are missing year gaps, then I want to take the mean from the closest two dates' inventory values.
I've used (below) to get overall means per store, but I need more granularity.
egen mean_inv = mean(inventory), by (store)
I've also tried this loop with similar results:
by id, sort: gen v1'=_n'
forvalues x = 1/'=n'{
by store: sum inventory if v1==`x'
replace mean_inv= r(mean) if v1==`x'
}
Visually, I want mean inventory per store: (store id is not sequential)
5/1/2003 2/3/2006 8/9/2006 3/5/2007 6/9/2007 2/1/2008
13 18 12 15 24 11
[mean1] [mean2] [mean3] [mean4] [mean5]
store date inventory
1 16750 17
1 18234 16
1 15844 13
1 17111 14
1 17870 13
1 16929 13.5
1 17503 13
4 15987 18
4 15896 16
4 18211 16
4 17154 18
4 17931 24
4 16776 23
12 16426 26
12 17681 17
12 16386 17
12 16603 18
12 17034 16
12 17205 16
42 15798 18
42 16022 18
42 17496 16
42 17870 18
42 16204 18
42 16778 14
33 18053 23
33 16086 13
33 16450 21
33 17374 19
33 16814 19
33 15834 16
33 16167 16
56 17686 16
56 17623 18
56 17231 20
56 15978 16
56 16811 15
56 17861 20

It is hard to relate your code to the word description of your problem.
Your egen call calculates means by store, not year.
Your longer fragment does not make complete sense given lack of definitions and at least one typo.
Note that your variable v1 contains identifiers that run 1 up within groups of store, and does not distinguish different values of store, as you (seem to) imply. It strains credibility that it produces results anywhere near those by the egen call.
n is not defined and the code evaluating it is presumably intended to be
`=n'
If you calculate
by store: sum inventory if v1 == `x'
several means will be calculated in turn but only the last to be calculated will be accessible as r(mean).
The sample data are unrelated to the problem. There is no year variable and even if the dates are Stata daily dates, they are all dates within 1960.
Setting all that aside, suppose you have variables store, inventory and year. You can try
collapse inventory, by(store year)
fillin store year
ipolate inventory year, gen(inventory2) by(store)
The collapse produces a reduced dataset of means. The ipolate interpolates across gaps, as you ask. fillin may not be adequate to give all the store and year combinations you want and you may need to add further years manually before the interpolation. If you want to put these results back with the original data, that's a merge.
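
For readers who want to see what those three steps compute, here is a rough plain-Python sketch of the same idea: average within each store and year, then linearly interpolate across missing years. The function names and the sample rows are hypothetical, and a year value is assumed to be already available for each observation.

```python
from collections import defaultdict

def yearly_means(rows):
    """rows: iterable of (store, year, inventory). Returns {store: {year: mean}}."""
    buckets = defaultdict(lambda: defaultdict(list))
    for store, year, inventory in rows:
        buckets[store][year].append(inventory)
    return {s: {y: sum(v) / len(v) for y, v in years.items()}
            for s, years in buckets.items()}

def fill_gaps(means):
    """Linearly interpolate missing years between observed years of one store."""
    years = sorted(means)
    filled = dict(means)
    for lo, hi in zip(years, years[1:]):
        for y in range(lo + 1, hi):
            w = (y - lo) / (hi - lo)           # weight of the later year
            filled[y] = means[lo] * (1 - w) + means[hi] * w
    return filled


# Made-up rows: store 1 has no observation in 2004.
rows = [(1, 2003, 10), (1, 2003, 14), (1, 2005, 16), (4, 2003, 18)]
per_store = {s: fill_gaps(m) for s, m in yearly_means(rows).items()}
```

Like the Stata version, this only interpolates between the first and last observed year of each store; years outside that range stay missing.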
In total, this is a pretty messy question.

How to convert a large number to base 36 using DC or other

I am trying to represent the maximum 64-bit unsigned value in different bases.
For base 2 (binary) it would be 64 1's:
1111111111111111111111111111111111111111111111111111111111111111
For base 16 (hex) it would be 16 F's
FFFFFFFFFFFFFFFF
For base 10 (decimal) it would be:
18446744073709551615
I'm trying to get the representation of this value in base 36 (it uses 0-9 and A-Z). There are many online base converters, but they all fail to produce the correct representation because they are limited by 64-bit math.
Does anyone know how to use DC (an extremely hard-to-use string math processor that can handle numbers of unlimited magnitude) to do this conversion? Either that, or can anyone tell me how I can perform this conversion with a calculator that won't fail due to integer roll-over?

I made a quick test with Ruby:
i = 'FFFFFFFFFFFFFFFF'.to_i(16)
puts i #18446744073709551615
puts i.to_s(36) #3w5e11264sgsf
You may also use larger numbers:
i = 'FFFFFFFFFFFFFFFF'.to_i(16) ** 16
puts i
puts i.to_s(36)
result:
179769313486231590617005494896502488139538923424507473845653439431848569886227202866765261632299351819569917639009010788373365912036255753178371299382143631760131695224907130882552454362167933328609537509415576609030163673758148226168953269623548572115351901405836315903312675793605327103910016259918212890625
1a1e4vngailcqaj6ud31s2kk9s94o3tyofvllrg4rx6mxa0pt2sc06ngjzleciz7lzgdt55aedc9x92w0w2gclhijdmj7le6osfi1w9gvybbfq04b6fm705brjo535po1axacun6f7013c4944wa7j0yyg93uzeknjphiegfat0ojki1g5pt5se1ylx93knpzbedn29
A short explanation what happens with big numbers:
Normal numbers are Fixnums. If you get larger numbers, the number becomes a Bignum:
small = 'FFFFFFF'.to_i(16)
big = 'FFFFFFFFFFFFFFFF'.to_i(16) ** 16
puts "%i is a %s" % [ small, small.class ]
puts "%i\n is a %s" % [ big, big.class ]
puts "%i^2 is a %s" % [ small, (small ** 2).class ]
Result:
268435455 is a Fixnum
179769313486231590617005494896502488139538923424507473845653439431848569886227202866765261632299351819569917639009010788373365912036255753178371299382143631760131695224907130882552454362167933328609537509415576609030163673758148226168953269623548572115351901405836315903312675793605327103910016259918212890625
is a Bignum
268435455^2 is a Bignum
From the documentation of Bignum:
Bignum objects hold integers outside the range of Fixnum. Bignum objects are created automatically when integer calculations would otherwise overflow a Fixnum. When a calculation involving Bignum objects returns a result that will fit in a Fixnum, the result is automatically converted.

It can be done with dc, but the output is not extremely useful.
$ dc
36
o
16
i
FFFFFFFFFFFFFFFF
p
03 32 05 14 01 01 02 06 04 28 16 28 15
Here's the explanation:
Entering a number by itself pushes that number
o pops the stack and sets the output radix.
i pops the stack and sets the input radix.
p prints the top number on the stack, in the current output radix. However, for any output radix higher than 16, dc prints each digit as a decimal number separated by spaces rather than as a single character.
In dc, the commands may be all put on the same line, like so:
$ dc
36o16iFFFFFFFFFFFFFFFFp
03 32 05 14 01 01 02 06 04 28 16 28 15

Get any language that can handle arbitrarily large integers. Ruby, Python, Haskell, you name it.
Implement the basic step: modulo 36 gives you the next digit, division by 36 gives you the number with the last digit cut out.
Map the digits to characters the way you like. For instance, '0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ'[digit] is fine by me. Prepend each digit to the result as you produce it, since the least significant digit comes out first.
???
Return the concatenated string of digits. Profit!
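
The steps above can be sketched in Python, whose integers are arbitrary precision. The name `to_base` is just an illustrative helper, not a library function. Digits come out least significant first, so the list is reversed before joining.

```python
DIGITS = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ"

def to_base(n, base=36):
    """Convert a non-negative integer to a string in the given base (2..36)."""
    if n < 0:
        raise ValueError("only non-negative integers are handled")
    if not 2 <= base <= len(DIGITS):
        raise ValueError("base must be between 2 and 36")
    if n == 0:
        return "0"
    out = []
    while n:
        n, digit = divmod(n, base)   # digit is the current last digit
        out.append(DIGITS[digit])
    return "".join(reversed(out))    # least significant digit was produced first


print(to_base(2**64 - 1))  # the maximum 64-bit unsigned value in base 36
```

The result can be checked by converting back with the built-in `int(text, 36)`.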
