How to stream data in KDB? - real-time

I have access to a realtime KDB server that has tables with new data arriving every millisecond.
Currently, I'm just using a naive method which is basically like:
.z.ts:{
newData: getNewData[]; / get data arriving in the last second
data::data uj newData;
};
\t 100;
to ensure that my data (named data) is constantly updated.
However, the uj is very slow (probably due to constant reallocation of memory) and polling is just plain awkward.
I've heard KDB is meant to be good at handling this kind of streaming tick data, so is there a better way? Perhaps some push-based method without need for uj?

Rather than polling, use kdb+tick, the publish-subscribe architecture for kdb+.
Official Manual: https://github.com/KxSystems/kdb/blob/master/d/tick.htm
kdb+ Tick overview: http://www.timestored.com/kdb-guides/kdb-tick-data-store
Source code: https://github.com/KxSystems/kdb-tick

If there is a realtime, presumably there's a tickerplant feeding it. You can subscribe to the tickerplant:
.u.sub[`;`];
That subscribes to all tables and all symbols. The result of the call is an array where the 0th element is the table name and the 1st element is the current data the tickerplant holds for that table (usually empty or a small number of rows). The tickerplant then caches the handle to your kdb instance and keeps sending it data. However, it assumes there is a upd function on your kdb instance that can handle the request.
upd:{[t;x] t insert x}
OR
upd:insert
(same thing)
The upd function is called with a table symbol name (t) and the data to insert into it (x).
So a good straightforward implementation overall would be:
upd:insert;
t:.u.sub[`;`][0]; / set t to the first result of sub: (table name; initial data)
t[0] set t[1]; / define the table locally from the initial data
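To make the push model concrete, here is a minimal plain-Python simulation of the idea, not kdb+ code. The Tickerplant and Subscriber classes are hypothetical stand-ins: sub plays the role of .u.sub, and upd appends rows in place the way upd:insert does, so no polling and no repeated uj is needed.

```python
from collections import defaultdict


class Tickerplant:
    """Toy stand-in for a tickerplant: remembers subscribers and pushes rows to them."""

    def __init__(self):
        self.subscribers = []

    def sub(self, subscriber):
        # Analogous to .u.sub: register the caller so later updates are pushed to it.
        self.subscribers.append(subscriber)

    def publish(self, table, rows):
        # Push each batch to every subscriber by calling its upd(table, rows).
        for s in self.subscribers:
            s.upd(table, rows)


class Subscriber:
    """Toy stand-in for the subscribing kdb instance."""

    def __init__(self):
        self.tables = defaultdict(list)

    def upd(self, table, rows):
        # Same spirit as upd:insert, appending in place rather than rebuilding the table.
        self.tables[table].extend(rows)


tp = Tickerplant()
rdb = Subscriber()
tp.sub(rdb)
tp.publish("trade", [("AAPL", 101.5), ("MSFT", 55.2)])
tp.publish("trade", [("AAPL", 101.6)])
print(len(rdb.tables["trade"]))
```

The subscriber never asks for data; it only reacts when upd is called, which is the behaviour the tickerplant subscription gives you.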

Efficient row reading with libpq (postgresql)

This is a scalability related question.
We want to read some rows from a table and, after processing some of them, stop the query. The stop criterion is data dependent (we do not know in advance how many or which rows we are interested in).
This is scalability sensitive when the number of rows of the table grows far beyond the number of rows we really are interested in.
If we use the standard PQexec, all rows are returned and we are forced to consume them (we have to call PQgetResult until it returns null). So this does not scale.
We are now trying "row by row" reading.
We first used PQsendQuery and PQsetSingleRowMode. However, we still have to call PQgetResult until it returns null.
Our last approach is PQsendQuery, PQsetSingleRowMode and when we are done we cancel the query as follows
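void CloseRowByRow() {
    /* Ask the server to abandon the in-flight query. */
    PGcancel *c = PQgetCancel(conn);
    char errbuf[256];
    if (c != NULL) {
        PQcancel(c, errbuf, sizeof errbuf);
        PQfreeCancel(c);
    }
    /* Drain and free the remaining results so the connection can be reused. */
    while (res) {
        PQclear(res);
        res = PQgetResult(conn);
    }
}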
This produces some performance benefits but we are wondering if this is the best we can do.
So here comes the question: Is there any other way?
Use DECLARE and FETCH to define and read from a server-side cursor; this is exactly what they are meant for. You would use the standard APIs, and FETCH simply lets you retrieve the results in batches of a controlled size. See the examples in the docs for more details.
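The consumption pattern can be sketched as follows. This is only an illustration: Python's built-in sqlite3 stands in for PostgreSQL so the example is self-contained, and the events table, read_until helper and batch size are made up. With libpq, each fetchmany call would correspond to something like FETCH 50 FROM cur on a cursor created with DECLARE inside a transaction, and closing the cursor to CLOSE cur.

```python
import sqlite3


def read_until(conn, query, batch_size, should_stop):
    """Fetch rows in batches and stop as soon as should_stop(row) is true."""
    cur = conn.cursor()
    cur.execute(query)
    processed = []
    while True:
        batch = cur.fetchmany(batch_size)  # plays the role of FETCH batch_size FROM cur
        if not batch:
            break
        for row in batch:
            processed.append(row)
            if should_stop(row):
                cur.close()  # plays the role of CLOSE cur; remaining rows are never read
                return processed
    cur.close()
    return processed


# Hypothetical table with far more rows than we actually need.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (id INTEGER PRIMARY KEY, payload TEXT)")
conn.executemany(
    "INSERT INTO events (payload) VALUES (?)",
    [("row %d" % i,) for i in range(1000)],
)

rows = read_until(
    conn,
    "SELECT id, payload FROM events ORDER BY id",
    50,
    lambda row: row[0] == 120,  # data-dependent stop criterion
)
print(len(rows))
```

The batch size is the tuning knob: larger batches mean fewer round trips, smaller ones mean fewer wasted rows after the stop criterion is met.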

Handle Range based queries

We came across a case where we want to retrieve data from a time series. Let's say we have time-based data: [“t1-t2” : {data1}, “t2-t3” : {data2}, “t3-t4” : {data3}]
With this kind of data, we want to look up the exact data for a given time. For example, for a given time t1.5 the result should be data1, and for t2.6 it should be data2.
To solve this problem, we are planning to store the data in a sorted map in Aerospike as follows: {“t1”:{data1}, “t2”:{data2}, “t3”:{data3}}
When a client asks for t1.5, we must return data1. To achieve this, we implemented a UDF at the server level that does a binary search for the nearest lower key for the given input (i.e. t1.5), which returns t1's value, i.e. data1.
Is there a better way of achieving the same result, given that this incurs a cost at the server level for every request? Even a UDF doing a binary search requires loading all the data into memory; can we avoid that?
We are planning to use Aerospike for this. Is there a better data store for handling such queries?
Thinking aloud…
Storing t1-t2, t2-t3 is redundant on t2. Just store t1; t2 is inferred from the next key:value pair. { t1:data, t2:data, …}, stored key-sorted (map policy).
You must know the maximum difference between any ‘t1’ and ‘t2’.
Build a secondary index on MAPKEY with type numeric (this essentially does the bulk of the sort work for you upfront, in RAM).
Search for records where t is between t-maxdiff and t+maxdiff. This yields a set of a few records to pass to your UDF.
Invoke the UDF on this small subset of records to return the data. This will be a very simple UDF.
Note: UDFs are limited to 128 concurrent executions at any given time.
I'm not sure I understand the problem. First, you should be inserting into a K-ordered map, where the key is the timestamp (in millisecond or second or another resolution). The value would be a map of the attributes.
To get back any range of time you'd use a get_by_key_interval (for example the Python client's Client.map_get_by_key_range). You can figure out how to build the range but it's simply all between two timestamps.
Don't use a UDF for this; it will not perform or scale as well as the native map/list operations.
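Whichever way the candidate entries are fetched, the "nearest lower key" lookup itself is small. Below is a minimal client-side sketch in plain Python, assuming the interval start times come back sorted (as they would from a key-ordered map); floor_lookup and the sample timestamps are illustrative and not part of any Aerospike API.

```python
from bisect import bisect_right


def floor_lookup(start_times, values, t):
    """Return the value whose interval start is the greatest one <= t, or None."""
    i = bisect_right(start_times, t)  # index of the first start strictly greater than t
    if i == 0:
        return None  # t lies before the first interval
    return values[i - 1]


# Hypothetical interval starts t1, t2, t3 and their data.
start_times = [100, 200, 300]
values = ["data1", "data2", "data3"]

print(floor_lookup(start_times, values, 150))
print(floor_lookup(start_times, values, 260))
```

Because only the start of each interval is stored, a time that falls exactly on a boundary resolves to the interval that begins there.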

Parallel design of program working with Flink and scala

This is the context:
There is an input event stream.
There are some methods to apply to the stream, each of which applies different logic to evaluate an event and classify it as a "good" or "bad" event.
An event is truly "good" only if it passes all the methods; otherwise it is a "bad" event.
There is an output event stream that carries the result for each event along with its eventID.
To solve this problem, I have two ideas:
We can apply each method sequentially to each event. But this is a kind of batch processing and doesn't exploit the advantages of stream processing. It also takes Time(M(ethod)1) + Time(M2) + Time(M3) + ..., which may not be suitable for real-time processing.
We can pass the input stream to each method and run the methods in parallel. Each method saves bad events into permanent storage, and the Main method then queries the permanent storage to get the result for each event. But this raises some problems:
How do we execute methods in parallel in the programming language (e.g. Scala), and what about performance (network, CPUs, memory)?
How do we solve the synchronization problem? The methods need some time to calculate and save the flag into permanent storage, but Main needs less time to query the flag, so a delay issue occurs.
etc.
This is more of a design question than a purely technical one. I would like to ask for your ideas: do you have new ideas, or ideas for solving these problems? Looking forward to your opinions.
Parallel streams, each doing the full set of evaluations sequentially, is the more straightforward solution. But if that introduces too much latency, then you can fan out the evaluations to be done in parallel, and then bring the results back together again to make a decision.
To do the fan-out, look at the split operation on DataStream, or use side outputs. But before doing this n-way fan-out, make sure that each event has a unique ID. If necessary, add a field containing a random number to each event to use as the unique ID. Later we will use this unique ID as a key to gather back together all of the partial results for each event.
Once the event stream is split, each copy of the stream can use a MapFunction to compute one of the evaluation methods.
Gathering all of these separate evaluations of a given event back together is a bit more complex. One reasonable approach here is to union all of the result streams together, and then key the unioned stream by the unique ID described above. This will bring together all of the individual results for each event. Then you can use a RichFlatMapFunction (using Flink's keyed, managed state) to gather the results for the separate evaluations in one place. Once the full set of evaluations for a given event has arrived at this stateful flatmap operator, it can compute and emit the final result.
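The fan-out and gather logic can be illustrated without Flink. The following is a plain-Python simulation under stated assumptions, not Flink API code: the three evaluation methods are made up, the generator fan_out stands in for the split plus MapFunction stage, and the dictionary inside gather stands in for keyed, managed state in the stateful flatmap.

```python
import uuid
from collections import defaultdict


# Hypothetical evaluation methods; each returns True for "good".
def is_positive(event):
    return event["value"] > 0


def is_small(event):
    return event["value"] < 100


def has_name(event):
    return bool(event.get("name"))


METHODS = [is_positive, is_small, has_name]


def fan_out(events):
    """Tag each event with a unique ID and emit one partial result per method."""
    for event in events:
        event_id = uuid.uuid4().hex
        for method in METHODS:
            yield (event_id, event, method(event))


def gather(partial_results, expected):
    """Collect partial results per ID; emit a verdict once all of them have arrived."""
    state = defaultdict(list)  # stands in for keyed state
    for event_id, event, ok in partial_results:
        state[event_id].append(ok)
        if len(state[event_id]) == expected:
            verdicts = state.pop(event_id)  # clear state once the event is decided
            yield (event, "good" if all(verdicts) else "bad")


events = [
    {"name": "a", "value": 5},
    {"name": "", "value": 7},
    {"name": "c", "value": 500},
]
results = list(gather(fan_out(events), expected=len(METHODS)))
for event, verdict in results:
    print(event["name"], verdict)
```

In a real job the partial results for different events arrive interleaved, which is why the gather step is keyed by the unique ID and waits for the full count before emitting.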

Detecting concurrent data modification of document between read and write

I'm interested in a scenario where a document is fetched from the database, some computations are run based on some external conditions, one of the fields of the document gets updated and then the document gets saved, all in a system that might have concurrent threads accessing the DB.
To make it easier to understand, here's a very simplistic example. Suppose I have the following document:
{
...
items_average: 1234,
last_10_items: [10,2187,2133, ...]
...
}
Suppose a new item (X) comes in, five things will need to be done:
read the document from the DB
remove the first (oldest) item in the last_10_items
add X to the end of the array
re-compute the average* and save it in items_average.
write the document to the DB
* NOTE: the average computation was chosen as a very simple example, but the question should take into account more complex operations based on data existing in the document and on new data (i.e. not something solvable with the $inc operator)
This certainly is something easy to implement in a single-threaded system, but in a concurrent system, if 2 threads would like to follow the above steps, inconsistencies might occur since both will update the last_10_items and items_average values without considering and/or overwriting the concurrent changes.
So, my question is: how can such a scenario be handled? Is there a way to check for, or react to, the fact that the underlying document was changed between steps 1 and 5? Is there such a thing as WATCH from redis or a 'Concurrent Modification Error' from relational DBs?
Thanks
A database system uses a memory-inspection and rollback scheme similar to transactional memory.
Briefly, it monitors the shared memory regions you specify and applies something like compare-and-swap, load-link/store-conditional, or test-and-set.
Therefore, if any of that memory content changes during the transaction, the transaction aborts and is retried until no conflicting operation touches the shared memory.
For example, GCC implements the following:
https://gcc.gnu.org/onlinedocs/gcc-4.1.2/gcc/Atomic-Builtins.html
type __sync_lock_test_and_set (type *ptr, type value, ...)
type __sync_val_compare_and_swap (type *ptr, type oldval, type newval, ...)
For more info about transactional memory,
http://en.wikipedia.org/wiki/Software_transactional_memory
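Applied to the document scenario in the question, the same compare-and-swap idea is often expressed as optimistic concurrency with a version field. The sketch below is a plain-Python simulation under that assumption: VersionedStore is a hypothetical in-memory stand-in for the database, and write_if_unchanged plays the role of a conditional update that only succeeds if the version read in step 1 is still current.

```python
import threading


class VersionedStore:
    """Toy in-memory document store with a version counter on the document."""

    def __init__(self, doc):
        self._doc = dict(doc, version=0)
        self._lock = threading.Lock()

    def read(self):
        with self._lock:
            # Return a copy so callers cannot mutate the stored document.
            return dict(self._doc, last_10_items=list(self._doc["last_10_items"]))

    def write_if_unchanged(self, new_doc, expected_version):
        # Compare-and-swap: accept the write only if nobody wrote since our read.
        with self._lock:
            if self._doc["version"] != expected_version:
                return False
            self._doc = dict(new_doc, version=expected_version + 1)
            return True


def add_item(store, x):
    while True:  # retry until no conflicting write is detected
        doc = store.read()
        items = doc["last_10_items"][1:] + [x]  # drop the oldest, append the new item
        doc["last_10_items"] = items
        doc["items_average"] = sum(items) / len(items)
        if store.write_if_unchanged(doc, doc["version"]):
            return


store = VersionedStore({"items_average": 5.5, "last_10_items": list(range(1, 11))})
threads = [threading.Thread(target=add_item, args=(store, 100 + i)) for i in range(20)]
for t in threads:
    t.start()
for t in threads:
    t.join()

final = store.read()
print(final["version"], final["items_average"])
```

A writer that loses the race simply re-reads and recomputes, so the average always matches the stored items regardless of how the threads interleave.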

Applying updates to a KDB table in thread safe manner in C

I need to update a KDB table with new/updated/deleted rows while it is being read by other threads. Since writing to K structures while other threads access them is not thread safe, the only way I can think of is to clone the whole table and apply the new changes to the clone. Even to do that, I need to first clone the table, then find a way to insert/update/delete rows in it.
I'd like to know if there are functions in C to:
1. Clone the whole table
2. Delete existing rows
3. Insert new rows easily
4. Update existing rows
Appreciate suggestions on new approaches to the same problem as well.
Based on the comments...
You need to do a set of operations on the KDB database "atomically"
You don't have "control" of the database, so you can't set functions (though you don't actually need to be an admin to do this, but that's a different story)
You have a separate C process that is connecting to the database to do the operations you require. (Given you said you don't have "access" to the database as admin, you can't get KDB to load your C binary to use within-process anyway).
Firstly, I'm going to assume you know how to connect to KDB+ and issue queries via the C API (found here).
All you need to do then is to concatenate your "atomic" operation into a set of statements that you are going to issue in one call from C. For example say you want to update a table and then delete some entry. This is what your call might look like:
{update name:`me from `table where name=`you; delete from `table where name=`other;}[]
(Caution: this is just a dummy example, I've assumed your table is in-memory so that the delete operation here would work just fine, and not saved to disk, etc. If you need specific help with the actual statements you require in your use case then that's a different question for this forum).
Notice that this is an anonymous function that will get called immediately on issue ([]). There is the assumption that your operations within the function will succeed. Again, if you need actual q query help it's a different question for this forum.
Even if your KDB database is multithreaded (started with -s or negative port number), it will not let you update global variables inside a peach thread. Therefore your operation should work just fine. But just in case there's something else that could interfere with your new anonymous function, you can wrap the function with protected evaluation.