Rx.Net - Get stock price changes and process them - system.reactive

The problem I'm trying to solve
Get stock ticks
Always consider the latest stock price
Every x seconds, take a snapshot of the ticks and send it for processing
So I have an Observable source of stock ticks. It sends only the ticks for stocks I'm interested in. What I need to do is receive these stock prices and, every x seconds (for the sake of example let's say every 3 seconds), send a snapshot of prices for processing. If within 3 seconds I receive 2 ticks for the same stock, I only need the latest tick. This processing is compute-heavy, so if possible I would like to avoid sending the same stock price for processing twice.
To give an example:
Let's say at the beginning of the sequence I receive 2 ticks -> MSFT:$1, GOOG:$2.
In the next 3 seconds I receive nothing, so the MSFT & GOOG ticks should be sent for processing.
Now, the next second, I receive new ticks -> MSFT:$1, GOOG:$3, INTL:$3.
Again, let's assume nothing comes in within the next 3 seconds.
Here, since the MSFT price didn't change (it's still $1), only GOOG & INTL should be sent for processing.
And this repeats throughout the day.
Now I think Rx helps to solve this kind of problem in an easy & elegant way, but I'm having trouble coming up with the proper queries.
This is what I have so far; I'll try to explain what it does and what the issue with it is:
var finalQuery =
    from priceUpdate in stockTicks // stockTicks is the IObservable<StockTick> source of ticks
    group priceUpdate by priceUpdate.Stock into grouped
    from combined in Observable.Interval(TimeSpan.FromSeconds(3))
        .CombineLatest(grouped, (t, pei) => new { PEI = pei, Interval = t })
    group combined by new { combined.Interval } into combined
    select new
    {
        Interval = combined.Key.Interval,
        PEI = combined.Select(c => new StockTick(c.PEI.Stock, c.PEI.Price))
    };
finalQuery
    .SelectMany(combined => combined.PEI)
    .Distinct(pu => new { pu.Stock, pu.Price })
    .Subscribe(priceUpdate =>
    {
        Process(priceUpdate);
    });
public class StockTick
{
    public StockTick(string stock, decimal price)
    {
        Stock = stock;
        Price = price;
    }

    public string Stock { get; set; }
    public decimal Price { get; set; }
}
So this gets the stock prices, groups them by stock, then combines the latest from each grouped sequence with Observable.Interval. This way I'm trying to ensure that only the latest tick for a stock is processed, and that it fires every 3 seconds.
Then it groups again, this time by interval, so as a result I have a group of ticks for each 3-second interval that has passed.
And as a last step, I flatten this into a sequence of stock price updates using SelectMany and apply Distinct to ensure the same price for the same stock is not processed twice.
There are 2 things I don't like about this query. First, I don't really like the double group by - is there any way to avoid it? Second, with this approach I have to process prices one by one, whereas what I'd really like is snapshots - that is, whatever I have within 3 seconds I bundle up and send for processing - but I can't figure out how to do the bundling.
I'll be happy to hear suggestions for solving this problem another way, but I would prefer to stay within Rx, unless there is really something much, much better.

A couple of things:
You'll want to take advantage of the Sample operator.
You probably want DistinctUntilChanged instead of Distinct. If you use Distinct, then if MSFT goes from $1 to $2 and then back to $1, you won't get an event on the third tick.
I imagine your solution will look something like this:
IObservable<StockTick> source;

source
    .GroupBy(st => st.Stock)
    .Select(stockObservable => stockObservable
        .Sample(TimeSpan.FromSeconds(3))
        .DistinctUntilChanged(st => st.Price)
    )
    .Merge()
    .Subscribe(st => Process(st));
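For the snapshot/bundling part of the question (sending whatever changed in a 3-second window as one batch instead of tick by tick), one option is to add a Buffer after the Merge. This is only a rough sketch: the Sample timers and the Buffer window aren't aligned, so a changed price can occasionally slide into the next batch, and ProcessBatch is a hypothetical batch version of Process, not something from the original code:

source
    .GroupBy(st => st.Stock)
    .Select(stockObservable => stockObservable
        .Sample(TimeSpan.FromSeconds(3))
        .DistinctUntilChanged(st => st.Price)
    )
    .Merge()
    .Buffer(TimeSpan.FromSeconds(3))          // bundle the surviving ticks into 3-second batches
    .Where(batch => batch.Count > 0)          // skip intervals where nothing changed
    .Subscribe(batch => ProcessBatch(batch)); // ProcessBatch is assumed, not part of the original code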
EDIT (Distinct performance problems):
Each Distinct operator has to maintain the full distinct history within it. If you have a high-priced stock, for example AMZN, which so far today has ranged from $958-$974, then you could end up with a lot of data. That's ~1,600 possible data points that have to sit in memory until you unsubscribe from the Distinct. It will also eventually degrade performance, as each AMZN tick has to be checked against those ~1,600 stored data points before going through. If this is a long-running process (spanning multiple trading days), then you'll end up with even more data points.
Given N stocks, you have N Distinct operators, each maintaining its own ever-growing history, so the problem only gets worse as the day goes on.

Related

Spring Batch: compare the currently processed record with the rest of the chunk's records

I need to plan a solution for this case. I have a table like this, and I have to reduce the number of records that share Product+Service+Origin to the minimum possible date ranges:
ID  PRODUCT  SERVICE  ORIGIN  STARTDATE   ENDDATE
1   100001   500      1       10/01/2023  15/01/2023
2   100001   500      1       12/01/2023  18/01/2023
I have to read all the records and, during processing, check the date intervals in order to unify them:
RecordA (10/01/2023 - 15/01/2023) and RecordB (12/01/2023 - 18/01/2023): this results in updating the record with ID 1 so that its dates cover the widest range spanned by the two records, 10/01/2023 - 18/01/2023 (extending one of the ranges to the "right" or "left" when necessary).
Another case:
ID  PRODUCT  SERVICE  ORIGIN  STARTDATE   ENDDATE
1   100001   500      1       10/01/2023  15/01/2023
2   100001   500      1       12/01/2023  14/01/2023
In this case, the date range of record 1 is the biggest, so we should delete record 2.
And of course, duplicate date ranges are deleted as well.
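To make the rule in the two examples concrete: within one Product+Service+Origin group it boils down to the classic "merge overlapping date ranges" comparison. A rough sketch of just that comparison, written in C# purely as an illustration - it is not the actual batch job code, and the type and method names are made up:

// Illustration only, not the actual job code: for the records of one
// Product+Service+Origin group ordered by STARTDATE, keep the first record of
// each overlapping cluster (updating it to the widest range) and drop the rest,
// which covers both examples above.
using System;
using System.Collections.Generic;
using System.Linq;

public class DateRange
{
    public DateTime Start;
    public DateTime End;
}

public static class RangeMerger
{
    public static List<DateRange> Merge(IEnumerable<DateRange> ranges)
    {
        var merged = new List<DateRange>();
        foreach (var r in ranges.OrderBy(x => x.Start))
        {
            var last = merged.Count > 0 ? merged[merged.Count - 1] : null;
            if (last != null && r.Start <= last.End)
            {
                // Overlap, containment or duplicate: extend the kept record to the
                // "right" if needed; the current record would be marked for deletion.
                if (r.End > last.End) last.End = r.End;
            }
            else
            {
                merged.Add(r); // no overlap with the previous range: keep this record
            }
        }
        return merged;
    }
}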
Right now we have implemented a chunk step to make this possible:
Reader: reads the data ordered by the common fields (Product-Service-Origin).
Processor: saves all the records into a HashMap<String, List> in the job context while the "Product+Service+Origin" combination stays the same. When it detects a new combination, it takes the current list, performs all the comparisons within it, marks auxiliary properties on the records as "delete" or "update", and sends the full list to the writer (after first creating another list in the map for the new combination of common fields).
Writer: groups the records to delete and update and calls child writers to execute the statements.
Well, this is how the software was written several years ago, but soon we will have to handle a massive number of records for each case, and the "solution" of using a map in the JobContext has to change.
I was wondering whether Spring Batch has some features I could use to process this type of situation.
Anyway, I am thinking about changing the step where we insert all these records so that the date range checks are done one by one in the processor, but I think the commit interval would have to be 1 so that each record can be checked against all the previously processed records (the table is initially empty when we execute this job). Any other commit interval would check against the database but not against the previously processed items in the current chunk, which would make the processing incorrect.
All these cases can have 0-n records sharing Product+Service+Origin.
Sorry for my English; it's difficult to explain this in another language.

Calculated Field to Sum Two Values Tableau

Is there a way to sum up two values from two different sheets? I have one sheet that looks at full-time students as a distinct count of their ID, and on another sheet I made a calculated field that takes the contact hours of part-time students and divides them by 12 (the full-time course load). I want to be able to add up these two numbers, so that SUM(Full Time + (Part Time Contact Hours / 12)) = ### would give an FTE (full-time equivalent enrollment).
Try calculations similar to these:
Full Time
IF [Type]="FullTime" THEN [StudentId] END
Part Time
IF [Type]!="FullTime" THEN [Hours] END
FTEs
COUNTD([Full Time]) + SUM([Part Time])/12
Obviously I don't know your data structure; hopefully that gives you enough to get started.

Count # of records by grouped date?

I'm a novice Tableau user, trying to help my organization to analyze phone traffic. My data source of incoming phone calls is in an Excel spreadsheet, and is listed like this:
TRANSACTION ID DATETIME
151313:179805 1/2/2018 9:57
151340:108017 1/2/2018 17:27
151395:176211 1/3/2018 15:27
Our total calls per day range from 10 to 50.
I'd like to count days with an identical # of calls, and probably make a Histogram sorted by # of calls on the X-Axis, and # of days w/ that many calls on the Y-Axis.
I feel like this would be a simple Calculated Field, but for the life of me, I'm not getting what I'd do here.
Help! :)
One solution is to define an LOD calc, calls_per_day, as
{ FIXED DateTrunc('day', [DATETIME]) : COUNT("*") }
which, in effect, pre-builds a little table in space showing the number of data rows for each day. That works if you have one data row per transaction ID in your input.
If transaction ids are repeated, and instead you want the number of transactions for each day, you can use the following variation.
{ FIXED DateTrunc('day', [DATETIME]) : COUNTD([TRANSACTION ID]) }
COUNTD() can be expensive on large data sets, so it's better to use an alternative when you have the option.
You can use an LOD:
{ FIXED [TRANSACTION ID] : COUNT(DAY([DATETIME])) }
Try this and post the result.

How to Handle Rows that Change over Time in Druid

I'm wondering how we could handle data that changes over time in Druid. I realize that Druid is built for streaming data where we wouldn't expect a particular row to have data elements change. However, I'm working on a project where we want to stream transactional data from a logistics management system, but there's a calculation that happens in that system that can change for a particular transaction based on other transactions. What I mean:
-9th of the month - I post transaction A with a date of today (9th) that results in the stock on hand coming to 0 units
-10th of the month - I post transaction B with a date of the 1st of the month, crediting my stock amount by 10 units. At this time (on the 10th of the month) the stock on hand for transaction A recalculates to 10 units. The same would be true for ALL transactions after the 1st of the month
As I understand it, we would re-extract transaction A, resulting in transaction A2.
The stock on hand dimension is incredibly important to our metrics. Specifically, identifying when stockouts occur (when stock on hand = 0). In the above example, if I have two rows for transaction A, I would be mistakenly identifying a stockout with transaction A1, whereas transaction A2 is the source of truth.
Is there any ability to archive a row and replace it with an updated row, or do we need to add logic to our queries that finds the rows with the freshest timestamp per transaction id?
Thanks
I have two thoughts that I hope will help you. The key documentation for this is "Updating Existing Data": http://druid.io/docs/latest/ingestion/update-existing-data.html, which gives you three options: Lookup Tables, Reindexing, and Delta Ingestion. The last one, Delta Ingestion, is only for adding new rows to old segments, so that's not very useful for you; let's go over the other two.
Reindexing: You can crunch all the numbers that change in your ETL process, identify the segments that would need to be reloaded, and simply have Druid re-index those segments. That will replace the stock-on-hand value for A in your example whenever you want, whenever you do the re-indexing.
Lookups: If you have stock values for multiple products, you can store the product id in the segment and have that be immutable, but lookup the stock-on-hand value in a lookup. So, you would store:
A, 2018-01-01, product-id: 123
And in your lookup, you'd have:
product-id: 123, stock-on-hand: 0
And later, you'd update the lookup and change that to 10. This would update any rows that reference product-id: 123.
I can't be sure but you may be mixing up dimensions and metrics while you're doing this, and you may need to read over that terminology in OLAP descriptions like this: https://en.wikipedia.org/wiki/Online_analytical_processing
Good luck!

Tableau Future and Current References

Tough problem I am working on here.
I have a table of CustomerIDs and CallDates. I want to measure whether there is a 'repeat call' within a certain period of time (up to 30 days).
I plan on creating a parameter called RepeatTime which is a range from 0 - 30 days, so the user can slide a scale to see the number/percentage of total repeats.
In Excel, I have this working. I sort CustomerID in order and then sort CallDate from earliest to latest. I then have formulas like:
=IF(AND(CurrentCustomerID = FutureCustomerID, FutureCallDate - CurrentCallDate <= RepeatTime), 1,0)
CurrentCustomerID = the current row, and the FutureCustomerID = the following row (so it is saying if the customer ID is the same).
FutureCallDate = the following row and the CurrentCallDate = the current row. It is subtracting the future call time from the first call time to measure the time in between.
The goal is to be able to see, dynamically, how many customers called in for a specific reason within maybe 4 hours or 1 day or 5 days, etc. All of the way up until 30 days (this is our actual metric but it is good to see the calls which are repeats within a shorter time frame so we can investigate).
I had a similar problem; see here for the detailed version: Array calculation in Tableau, maxif routine.
Your case is basically the same thing as mine, so you could apply that solution, but I find the one I'm about to give easier to understand. I would do:
1) Create a calculated field called RepeatTime:
DATEDIFF('day', LOOKUP(MAX([CallDates]), -1), MAX([CallDates]))
This will calculate how many days have passed between the previous call and the current one. You can add an IFNULL so you don't get Null values for the first entry.
2) Drag CustomersID, CallDates and RepeatTime to the worksheet (they can be on the Marks card; they don't need to be on Rows or Columns).
3) Configure the table calculation for RepeatTime: Compute Using > Advanced..., partitioning by CustomersID, addressing by CallDates.
Also sort by the field CallDates, Maximum, Ascending.
This will guarantee the table calculation works properly.
4) Now you have a base that you can use for what you need. You can either export it to csv or mdb and connect to it.
The best approach, actually, is to have this RepeatTime field calculated outside Tableau, on your database, so it's already there when you connect to it. But this is a way to use Tableau to do the calculation for you.
Unfortunately there's no way to do this directly with your database.
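If you do go the route of precomputing the repeat flag outside Tableau, the current-row-versus-following-row comparison from the Excel formula above translates fairly directly. A rough, hypothetical sketch in C# (the Call type and the CustomerId/CallDate names are assumptions, not from the original post):

using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical input shape; the property names are assumptions.
public record Call(string CustomerId, DateTime CallDate);
public record FlaggedCall(Call Call, bool IsRepeat);

public static class RepeatCalls
{
    // Flags a call as a repeat if the same customer's next call falls within
    // repeatTime, mirroring the Excel formula (current row vs. following row).
    public static IEnumerable<FlaggedCall> Flag(IEnumerable<Call> calls, TimeSpan repeatTime)
    {
        return calls
            .GroupBy(c => c.CustomerId)
            .SelectMany(g =>
            {
                var ordered = g.OrderBy(c => c.CallDate).ToList();
                return ordered.Select((call, i) => new FlaggedCall(
                    call,
                    i + 1 < ordered.Count
                        && ordered[i + 1].CallDate - call.CallDate <= repeatTime));
            });
    }
}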