Spring Batch: compare the currently processed record with the rest of the chunk's records - spring-batch

I need to design a solution for this case. I have a table like the one below, and for the records that share Product+Service+Origin I have to reduce the number of records to the minimum possible set of date ranges:
ID  PRODUCT  SERVICE  ORIGIN  STARTDATE   ENDDATE
1   100001   500      1       10/01/2023  15/01/2023
2   100001   500      1       12/01/2023  18/01/2023
I have to read all the records and, while processing them, check the date intervals in order to unify them:
RecordA (10/01/2023 - 15/01/2023) and RecordB (12/01/2023 - 18/01/2023): this results in updating the record with ID 1 so that its dates cover the widest range spanned by both records, 10/01/2023 - 18/01/2023 (extending one of the ranges to the "right" or to the "left" when necessary).
Another case:
ID  PRODUCT  SERVICE  ORIGIN  STARTDATE   ENDDATE
1   100001   500      1       10/01/2023  15/01/2023
2   100001   500      1       12/01/2023  14/01/2023
In this case the date range of Record 1 is the larger one, so we should delete Record 2.
Duplicate date ranges are deleted as well, of course.
We have currently implemented a chunk-oriented step to make this possible:
Reader: reads the data ordered by the common fields (Product-Service-Origin).
Processor: stores every record in a HashMap<String, List> in the job context for as long as the combination "Product+Service+Origin" stays the same. When it detects a new combination, it takes the current list, runs a lot of comparisons across it, marks auxiliary properties on the records as "delete" or "update", and sends the full list to the writer (after first starting another list in the map for the new combination of common fields).
Writer: groups the records to delete and to update and calls child writers to execute the statements.
This is how the software was written several years ago, but soon we will have to handle a massive number of records for each case, and the "solution" of using a map in the JobContext has to change.
I was wondering whether Spring Batch has any features for handling this type of situation that I could use.
I am also thinking about changing the step where we insert all these records, so that the date range checks are made one by one in the processor. But I think the commit interval would then have to be 1, so that each record can be checked against all the previously processed records (the table is initially empty when we execute this job). With any other commit interval the check would see what is in the database but not the items already processed in the current chunk, which would make the processing incorrect.
Each of these cases can have 0-n records sharing Product+Service+Origin.
Sorry for my English; it is difficult to explain this in another language.
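To make the merging rule concrete, here is a minimal, language-neutral sketch in Python (not Spring Batch code and not the existing implementation). It assumes the records arrive ordered by Product+Service+Origin and then by start date, and that only overlapping ranges are merged (ranges that merely touch on consecutive days are left alone). Under those assumptions only one "keeper" range per group has to be remembered, instead of the whole group. The function name plan_changes and the dict field names are hypothetical.

```python
from datetime import date
from itertools import groupby


def plan_changes(records):
    """Return (updates, deletes) for records sharing product+service+origin.

    updates maps a surviving record id to its new (start, end) range;
    deletes lists the ids whose range is absorbed by a surviving record.
    """
    def group_key(r):
        return (r["product"], r["service"], r["origin"])

    # In a real job the reader's ORDER BY would provide this ordering;
    # sorted() is used here only to keep the sketch self-contained.
    ordered = sorted(records, key=lambda r: (group_key(r), r["start"], r["end"]))

    updates, deletes = {}, []
    for _, group in groupby(ordered, key=group_key):
        keeper_id = keeper_start = keeper_end = None
        for r in group:
            if keeper_id is not None and r["start"] <= keeper_end:
                # Overlaps the running range: this record is redundant.
                deletes.append(r["id"])
                if r["end"] > keeper_end:
                    # Extend the keeper to the "right".
                    keeper_end = r["end"]
                    updates[keeper_id] = (keeper_start, keeper_end)
            else:
                # No overlap: this record starts a new running range.
                keeper_id, keeper_start, keeper_end = r["id"], r["start"], r["end"]
    return updates, deletes
```

Because the start dates are ascending within a group, a range only ever needs to be extended to the right, and a record that is fully contained in (or identical to) the running range is simply marked for deletion.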

Related

optimal way to stream trades data out of a postgres database

I have a table with a very simple schema:
(
    instrument varchar(20) not null,
    ts timestamp not null,
    price double precision not null,
    quantity double precision not null,
    direction integer not null,
    id serial
        constraint trades_pkey
            primary key
);
It stores a list of trades done on various instruments.
You can have multiple trades on a single timestamp and also the timestamps are not regular; it's possible to have 10 entries on the same millisecond and then nothing for 2 seconds, etc.
When the client starts, I would like to accomplish two things:
Load the last hour of data.
Stream all the new updates.
The client processes the trades one by one, as if they were coming from a queue. They are sorted by instrument and each instrument has its own queue, expecting each trade to be the one following the previous one.
Solution A:
I ran a query to find the id at now - 1 hour, then queried all rows with id >= that start id, and then looped to get all rows with id > the last id received.
This does not work:
The row ids and timestamps do not match: sometimes an older timestamp gets a higher row id, etc. I guess this is due to writes being done on multiple threads. Getting data by id therefore doesn't guarantee that I will get the trades in order, and while I can sort one batch I receive, I can't be sure that the next batch will not contain an older row.
Solution B:
I can make a query loop that takes the last timestamp received, subtracts 1 second and queries again, etc. I can sort the data in the client and, for each instrument, discard all rows older than the last one processed.
Not very efficient, but that will work.
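A minimal sketch of the client-side filtering in Solution B, in Python. It assumes each polled batch is a list of dicts with the keys instrument, ts and id; it also uses the id to skip rows already delivered by an earlier overlapping query, which is an addition to what is described above. The function name filter_batch is hypothetical, and the query itself is left out.

```python
from datetime import datetime


def filter_batch(rows, last_ts, seen_ids):
    """Return the rows of one polled batch that still need processing.

    last_ts maps instrument -> timestamp of the last processed trade.
    seen_ids is the set of row ids already handed to the client.
    Both are updated in place so they can be reused for the next poll.
    """
    fresh = []
    for row in sorted(rows, key=lambda r: (r["ts"], r["id"])):
        if row["id"] in seen_ids:
            continue  # already delivered by an earlier, overlapping query
        if row["ts"] < last_ts.get(row["instrument"], datetime.min):
            continue  # older than the last trade processed for this instrument
        seen_ids.add(row["id"])
        last_ts[row["instrument"]] = row["ts"]
        fresh.append(row)
    return fresh
```

In this sketch seen_ids grows without bound; since each query only reaches back one second, ids older than that window could be dropped.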
Solution C:
I can make a query per instrument (there are 22 of them), ordered by timestamp. Can 22 subqueries be grouped into a single one?
Or, is there another solution?
You could try a bigserial (auto-incrementing) column to ensure each row is numbered in order as it is inserted.
Since this number is handled by Postgres you should be fine to get a guaranteed ordering on your data.
On the client side you just store (maybe in a separate table of meta-data) the latest serial number you have seen and then just query everything larger than that and keep your meta data table up to date.

How to Handle Rows that Change over Time in Druid

I'm wondering how we could handle data that changes over time in Druid. I realize that Druid is built for streaming data where we wouldn't expect a particular row to have data elements change. However, I'm working on a project where we want to stream transactional data from a logistics management system, but there's a calculation that happens in that system that can change for a particular transaction based on other transactions. What I mean:
- 9th of the month: I post transaction A with a date of today (9th) that results in the stock on hand coming to 0 units.
- 10th of the month: I post transaction B with a date of the 1st of the month, crediting my stock amount by 10 units. At this time (on the 10th of the month) the stock on hand for transaction A recalculates to 10 units. The same would be true for ALL transactions after the 1st of the month.
As I understand it, we would re-extract transaction A, resulting in transaction A2.
The stock on hand dimension is incredibly important to our metrics. Specifically, identifying when stockouts occur (when stock on hand = 0). In the above example, if I have two rows for transaction A, I would be mistakenly identifying a stockout with transaction A1, whereas transaction A2 is the source of truth.
Is there any ability to archive a row and replace it with an updated row, or do we need to add logic to our queries that finds the rows with the freshest timestamp per transaction id?
Thanks
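For reference, the "freshest timestamp per transaction id" logic mentioned above amounts to the following, shown as a minimal Python sketch over in-memory rows and not as a Druid query. The field names transaction_id, extracted_at and stock_on_hand are hypothetical.

```python
def latest_per_transaction(rows):
    """Keep only the most recently extracted row for each transaction id."""
    latest = {}
    for row in rows:
        current = latest.get(row["transaction_id"])
        if current is None or row["extracted_at"] > current["extracted_at"]:
            latest[row["transaction_id"]] = row
    return list(latest.values())
```

With rows A1 (stock on hand 0, extracted on the 9th) and A2 (stock on hand 10, extracted on the 10th), only A2 survives, so A no longer counts as a stockout.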
I have two thoughts that I hope will help you. The key documentation for this is "Updating Existing Data": http://druid.io/docs/latest/ingestion/update-existing-data.html, which gives you three options: Lookup Tables, Reindexing, and Delta Ingestion. The last one, Delta Ingestion, is only for adding new rows to old segments, so it's not very useful for you. Let's go over the other two.
Reindexing: You can crunch all the numbers that change in your ETL process, identify the segments that would need to be reloaded, and simply have Druid re-index those segments. That will replace the stock-on-hand value for A in your example whenever you want, whenever you do the re-indexing.
Lookups: If you have stock values for multiple products, you can store the product id in the segment and have that be immutable, but look up the stock-on-hand value in a lookup. So, you would store:
A, 2018-01-01, product-id: 123
And in your lookup, you'd have:
product-id: 123, stock-on-hand: 0
And later, you'd update the lookup and change that to 10. This would update any rows that reference product-id: 123.
I can't be sure but you may be mixing up dimensions and metrics while you're doing this, and you may need to read over that terminology in OLAP descriptions like this: https://en.wikipedia.org/wiki/Online_analytical_processing
Good luck!

long running queries and new data

I'm looking at a Postgres system with tables containing tens or hundreds of millions of rows, fed at a rate of a few rows per second.
I need to do some processing on the rows of these tables, so I plan to run some simple select queries: select * with a where clause based on a range (each row contains a timestamp, and that's what I'll use for the ranges). It may be a "closed range", with a start and an end that I know are contained in the table and into which I know no new data will fall, or an "open range", i.e. one of the range boundaries might not be in the table yet, so rows being fed into the table might fall in that range.
Since the response will itself contain millions of rows, and the processing per row can take some time (tens of ms), I know I'll need to use a cursor and fetch, say, a few thousand rows at a time. My question is:
If I run an "open range" query, will I only get the result as it was when I started the query, or will new rows that are inserted into the table and fall in the range while I run my fetches show up?
(I tend to think that no, I won't see new rows, but I'd like confirmation...)
updated
It should not happen under any isolation level:
https://www.postgresql.org/docs/current/static/transaction-iso.html
but Postgres ensures it only in Serializable isolation
Well, I think that when you run a query, you create a new transaction, and it will not see data written by any other transaction until it commits.
So, basically "you only get the result as it was when you started the query"

How to get all missing days between two dates

I will try to explain the problem on an abstract level first:
I have X amount of data as input, which always has a DATE field. Previously, the dates that came in as input were (after some processing) put in a table as output. Now I am asked to output both the input dates and every date between the minimum date received and one year from that moment. If there was originally no input for some day between these two dates, all fields must come with 0, or an equivalent.
Example: I have two inputs, one with '18/03/2017' and the other with '18/03/2018'. I now need to create output data for all the missing dates between '18/03/2017' and '18/04/2017'. So, output '19/03/2017' with every field set to 0, and the same for the 20th, the 21st and so on.
I know how to do this programmatically, but not in PowerCenter. I've been told to do the following (which I have done, but I would like to know of a better method):
Get the minimum date, day0. Then, with an aggregator, create 365 fields, each holding day0+1, day0+2, and so on, to create an artificial year.
After that we do several transformations, like sorting the dates and a union between them, to get the data ready for a joiner. The idea of the joiner is to do a Full Outer Join between the original data and the data that is going to have all fields set to 0, which we got from the previous aggregator.
Then a router picks, with one of its groups, the data that had actual dates (and fields without nulls), and with another group the rows where all fields are null; those fields are then given a 0 and finally written to a table.
I am wondering how this can be achieved by, for starters, removing the need to add 365 days to a date. If I were to do this same process for 10 years instead of one, the task gets ridiculous really quickly.
I was wondering about an XOR type of operation, or some other function that would cut the number of steps that need to be done for what I (maybe wrongly) feel is a simple task. Currently I need 5 steps just to know which dates are missing between two dates, a minimum and one year from that point.
I have tried to be as clear as possible, but if I failed at any point please let me know!
I'm not sure what the aggregator is supposed to do.
The same goes for the 'full outer' join; a normal join on a constant port is fine :)
Can you calculate the needed number of 'duplicates' before the 'joiner'? In that case a lookup configured to return 'all rows' with a less-than-or-equal predicate can help make the mapping much more readable.
In any case you will need a helper table (or file) with a sequence of numbers between 1 and the number of potential duplicates (or more).
I use the time dimension in our warehouse, which has one row per day from 1753-01-01 through the next 200000 days, and a primary integer column with values from 1 and up ...
You've said you know how to do this programmatically, and to be fair this problem is better suited to that sort of solution, but that doesn't exclude PowerCenter by any means. Just feed the two dates into a Java transformation and apply some code that produces all the dates between them, outputting one record for each. A Java transformation is ideal for record generation.
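The record-generation logic itself is small. Here is a minimal sketch in Python (not PowerCenter Java transformation code), assuming a single numeric field per day; the function name fill_missing_days and its parameters are hypothetical.

```python
from datetime import date, timedelta


def fill_missing_days(values_by_day, span_days):
    """Return one (day, value) pair per day, starting at the earliest input
    date and covering span_days further days; days with no input get 0."""
    first = min(values_by_day)
    days = (first + timedelta(days=i) for i in range(span_days + 1))
    return [(day, values_by_day.get(day, 0)) for day in days]
```

Covering ten years instead of one only changes span_days; no extra fields or joins are needed.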
OK, so you could override your source qualifier to achieve this in the selection query itself (I'm giving an Oracle-based example as it's what I'm used to, and I'm assuming your input data comes from a table). I looked up the CONNECT BY syntax here: "SQL to generate a list of numbers from 1 to 100".
SELECT MIN(tablea.DATEFIELD) + levquery.n - 1 AS Port1
FROM tablea, (SELECT LEVEL n FROM DUAL CONNECT BY LEVEL <= 365) levquery
GROUP BY levquery.n
(Check whether the query works for you; I don't have access to PowerCenter to test it at the minute.)

Volume of an Incident Queue at a Point in Time

I have an incident queue consisting of a record number (string), an open time (datetime), and a close time (datetime). The records go back a year or so. What I am trying to get is a line graph displaying the queue volume as it was at 8PM each day. So if a ticket was opened before 8PM on that day or at any time on a previous day, but not closed as of 8PM, it should be included in the population.
I tried the below, but this won't work because it doesn't really take into account multiple days.
If DATEPART('hour',[CloseTimeActual])>18 AND DATEPART('minute',[CloseTimeActual])>=0 AND DATEPART('hour',[OpenTimeActual])<=18 THEN 1
ELSE 0
END
Has anyone dealt with this problem before? I am using Tableau 8.2, cannot use 9 yet due to company license so please only propose 8.2 solutions. Thanks in advance.
For tracking history of state changes, the easiest approach is to reshape your data so each row represents a change in an incident state. So there would be a row representing the creation of each incident, and a row representing each other state change, say assignment, resolution, cancellation etc. You probably want columns to represent an incident number, date of the state change and type of state change.
Then you can write a calculated field that returns +1, -1 or 0 to express how the state change affects the number of currently open incidents. Then you use a running total to see the total number open at a given time.
You may need to show missing date values or add padding if state changes are rare. For other analytical questions, structuring your data with one record per incident may be more convenient. To avoid duplication, you might want to use database views or custom SQL with UNION ALL clauses to allow both views of the same underlying database tables.
It's always a good idea to be able to fill in the blank for "Each record in my dataset represents exactly one _________"
Tableau 9 has some reshaping capability in the data connection pane, or you can preprocess the data or create a view in the database to reshape it. Alternatively, you can specify a Union in Tableau with some calculated fields (or similarly custom SQL with a UNION ALL clause). Here is a brief illustration:
select open_date as Date,
       'OPEN' as Action,
       1 as Queue_Change,
       <other columns if desired>
from incidents
UNION ALL
select close_date as Date,
       'CLOSE' as Action,
       -1 as Queue_Change,
       <other columns if desired>
from incidents
where close_date is not null
Now you can use a running sum of SUM(Queue_Change) to see the number of open incidents over time. If you have other columns like priority, department, type etc., you can filter and group as usual in Tableau. This data source can be in addition to your previous one. You don't have to have a single view of the data for every worksheet in your workbook. Sometimes you want a few different connections to the same data at different levels of detail or from different perspectives.
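The running-sum idea can be checked outside Tableau with a small sketch. The Python below is an illustration only, not a Tableau calculation: it assumes incidents are given as (open_time, close_time) pairs with close_time set to None while still open, and it counts an event that falls exactly on the cutoff as having already happened. The function name open_at_cutoff is hypothetical.

```python
from datetime import date, datetime, time, timedelta


def open_at_cutoff(incidents, first_day, last_day, cutoff=time(20, 0)):
    """Return {day: number of incidents open at the cutoff time on that day}."""
    # Reshape: one +1 event per open, one -1 event per close.
    events = []
    for opened, closed in incidents:
        events.append((opened, 1))
        if closed is not None:
            events.append((closed, -1))
    events.sort()

    result, running, i = {}, 0, 0
    day = first_day
    while day <= last_day:
        limit = datetime.combine(day, cutoff)
        # Apply every queue change that happened up to this day's cutoff.
        while i < len(events) and events[i][0] <= limit:
            running += events[i][1]
            i += 1
        result[day] = running
        day += timedelta(days=1)
    return result
```

The running total carries over from one day to the next, which is what handles incidents that stay open across multiple days.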