How to construct a two level partition in kdb? - kdb

Is it possible to have a two level partition scheme in kdb+? For example, one level is based on trading dates and the other level is on stock symbols. Thanks!

If you use the standard approach (partitioned by date, with the `p# attribute on sym), then you effectively have a two-level partition. It is very efficient to query/filter by date:
select from table where date=x
and generally still very efficient to query/filter by sym:
select from table where date within(x;y),sym=z
This is particularly true if you run secondary threads and have good disk performance.
So you should get good performance when querying a sym across all history, and also when querying all syms for a given date. This is also the simplest way to write daily data.
I don't believe there's any real way to have a true two-level partition without writing the data twice in two different ways.
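To picture why the date-plus-sorted-sym layout behaves like two levels, here is a toy model in Python (not q, and not how kdb+ is implemented). The class name `DatePartitionedTable` and its methods are made up for illustration. Level one skips whole date partitions; level two uses the sym ordering inside each partition to jump to a contiguous block of rows, which is the property the `p# attribute relies on.

```python
from bisect import bisect_left, bisect_right
from collections import defaultdict


class DatePartitionedTable:
    """Toy model: one partition per date, rows in each partition sorted by sym."""

    def __init__(self):
        # date (ISO string) -> list of (sym, price) tuples
        self.partitions = defaultdict(list)

    def write_day(self, date, rows):
        # Sorting by sym keeps all rows for a sym contiguous within the day.
        self.partitions[date] = sorted(rows, key=lambda r: r[0])

    def select(self, start, end, sym=None):
        out = []
        for date in sorted(self.partitions):
            if not (start <= date <= end):
                continue  # level 1: skip whole partitions by date
            rows = self.partitions[date]
            if sym is None:
                out.extend((date, s, p) for s, p in rows)
                continue
            # level 2: binary search for the contiguous block of this sym
            # (a real system would keep an index rather than rebuild this list)
            syms = [r[0] for r in rows]
            lo, hi = bisect_left(syms, sym), bisect_right(syms, sym)
            out.extend((date, s, p) for s, p in rows[lo:hi])
        return out
```

Writing a day is a single sort and append, which matches the point above that this is also the simplest way to write daily data.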

If your queries span a large range of symbols or involve extensive queries on each one that would benefit from parallel computation, you can segment by first-letter ranges: A-M, N-Z, or finer grained than that.
As long as you maintain `p# within and across segments (so that e.g. IBM only appears in one place), you can throw threads and disks at the problem.

Related

How do I efficiently execute large queries?

Consider the following demo schema
trades:([]symbol:`$();ccy:`$();arrivalTime:`datetime$();tradeDate:`date$(); price:`float$();nominal:`float$());
marketPrices:([]sym:`$();dateTime:`datetime$();price:`float$());
usdRates:([]currency:`$();dateTime:`datetime$();fxRate:`float$());
I want to write a query that gets the price, translated into USD, at the soonest possible time after arrivalTime. My beginner way of doing this has been to create intermediate tables that do some filtering and rename columns to be consistent, and then to use aj and aj0 to join them up.
In this case there would only be 2 intermediate tables. In my actual case there are necessarily 7 intermediate tables, and record counts, while not large by KDB standards, are not small either.
What is considered best practice for queries like this? It seems to me that creating all these intermediate tables is resource hungry. An alternative to the intermediate tables is to have a very complicated looking single query. Would that actually help things? Or is this consumption of resources just the price to pay?
For joining to the next closest time after an event take a look at this question:
KDB reverse asof join (aj) ie on next quote instead of previous one
Assuming that's what you're looking for, you should be able to perform your price calculation either before or after the join (depending on the size of your tables it may be faster to do it after). Ultimately I think you will need two (potentially modified as per above) aj's (rates to marketdata, marketdata to trades).
If that's not what you're looking for then I could give some more specifics although some sample data would be useful.
My thoughts:
The more verbose/readable your code, the better for you when debugging later, and for any future readers/users of your code.
Unless absolutely necessary, I would try and avoid creating 7 copies of the same table. If you are dealing with large tables memory could quickly become a concern. Particularly if the processing takes a long time, you could be creating large memory spikes. I try to keep to updating 1-2 variables at different stages e.g.:
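/ start from the trades table and keep reassigning the same variable
res:select from trades;
/ as-of join the fx rates, renaming columns inline to match trades
res:aj[`ccy`arrivalTime;
  res;
  select ccy:currency,arrivalTime:dateTime,fxRate from usdRates];
/ someFunc is a placeholder for whatever transformation you need
res:update someFunc fxRate from res;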
Sean beat me to it, but an aj for the time after (a reverse aj) is relatively straightforward: switch bin to binr in the k code. See the suggested answer.
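To make the bin versus binr distinction concrete, here is a small Python sketch using the standard `bisect` module as a stand-in (this is not q, and the function names `asof_prev` and `asof_next` are made up for illustration). As I understand it, bin finds the last entry at or before the lookup time, which is what a normal aj uses, while binr finds the first entry at or after it, which is what the reverse join needs.

```python
from bisect import bisect_left, bisect_right


def asof_prev(times, values, t):
    """Value at the last time <= t (what a standard as-of join picks)."""
    i = bisect_right(times, t) - 1
    return values[i] if i >= 0 else None


def asof_next(times, values, t):
    """Value at the first time >= t (the 'reverse' as-of join)."""
    i = bisect_left(times, t)
    return values[i] if i < len(times) else None


# times must be sorted ascending, as the time column is for aj
times = [10, 20, 30]
prices = ["a", "b", "c"]
print(asof_prev(times, prices, 25), asof_next(times, prices, 25))
```

Note that both variants treat an exact time match as a hit; if you need strictly after arrivalTime, that edge case has to be handled separately.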
I'm not sure why you need 7 intermediary tables unless you are possibly calculating cross rates? In this case I would typically join ccy1 and ccy2 with 2 ajs to the same table and take it from there.
Although it may be unavoidable in your case if you have no control over the source data, similar column names / greater consistency across schemas is generally better. e.g. sym vs symbol

How to store aggregated data in kdb+

I've run into an architecture question: which strategy should I choose for storing aggregated data?
I know that in some time series DBs, like RRDtool, it is OK to have several db layers to store 1H, 1W, 1M, 1Y aggregated data.
Is it normal practice to use the same strategy for kdb+, i.e. to have several HDBs with date/month/year/int (for week and other) partitions, with a rule on the gateway for how to choose an appropriate source?
As an alternative, I have in mind storing all data in a single HDB in tables like tablenameagg. But that doesn't look as clean to me as several HDBs.
What points should I take into account for a decision?
It's hard to give a general answer as requirements are different for everyone but I can say in my experience that the normal practice is to have a single date-partitioned HDB as this can accommodate the widest range of historical datasets. In terms of increasing granularity of aggregation:
Full tick data - works best as date-partitioned with `p# on sym
Minutely aggregated data - still works well as date-partitioned with `p# on either sym or minute, `g# on either minute or sym
Hourly aggregated data - could be either date-partitioned or splayed depending on volume. Again you can have some combination of attributes on the sym and/or the aggregated time unit (in this case hour)
Weekly aggregated data - given how much this would compress the data you're now likely looking at a splayed table in this date-partitioned database. Use attributes as above.
Monthly/Yearly aggregated data - certainly could be splayed and possibly even flat given how small these tables would be. Attributes almost unnecessary in the flat case.
Maintaining many different HDBs with different partition styles would seem like overkill to me. But again, it all depends on the situation, the volumes of data involved and the expected usage pattern of the data.

Search Engine Database (Cassandra) & Best Practise

I'm currently storing rankings in MongoDB (+ nodejs as API) . It's now at 10 million records, so it's okay for now but the dataset will be growing drastically in the near future.
At this point I see two options:
MongoDB Sharding
Change Database
The queries performed on the database will not be text searches, but for example:
domain, keyword, language, start date, end date
keyword, language, start date, end date
A rank contains a:
1. domain
2. url
3. keyword
4. keyword language
5. position
6. date (unix)
Requirement is to be able to query and analyze the data without caching. For example get all data for domain x, between dates y, z and analyze the data.
I'm noticing a performance decrease lately and I'm looking into other databases. The one that seems to fit the job best is Cassandra. I did some testing and it looked promising; performance is good. Using Amazon EC2 + Cassandra seems a good solution, since it's easily scalable.
Since I'm no expert on Cassandra I would like to know if Cassandra is the way to go. Secondly, what would be the best practice / database model.
Make a collection for (simplified):
domains (domain_id, name)
keywords (keyword_id, name, language)
rank (domain_id, keyword_id, position, url, unix)
Or put all in one row:
domain, keyword, language, position, url, unix
Any tips, insights would be greatly appreciated.
Cassandra relies heavily on query-driven modelling. It's very restrictive in how you can query, but it is possible to fit an awful lot of requirements within those capabilities. For any large scale database, knowing your queries is important, but for Cassandra, it's almost vital.
Cassandra has the notion of primary keys. Each primary key consists of one or more keys (read columns). The first column (which may be a composite) is referred to as the partition key. Cassandra keeps all "rows" for a partition in the same place (on disk, in mem, etc.), and a partition is the unit of replication, etc.
Additional keys in the primary key are called clustering keys. Data within a partition are ordered according to successive clustering keys. For instance, if your primary key is (a, b, c, d) then data will be partitioned by hashing a, and within a partition, data will be ordered by b, c and d.
For efficient querying, you must hit one (or very few) partitions. So your query must have a partition key. This MUST be exact equality (no starts with, contains, etc.). Then you need to filter down to your targets. This can get interesting too:
Your query can specify exact equality conditions for successive clustering keys, and a range (or equality) for the last key in your query. So, in the previous example, this is allowed:
select * from tbl where a=a1 and b=b1 and c > c1;
This is not:
select * from tbl where a=a1 and b>20 and c=c1;
[You can use ALLOW FILTERING for this, at a performance cost]
or
select * from tbl where a=a1 and c > 20;
Once you understand the data storage model, this makes sense. One of the reasons Cassandra is so fast for queries is that it pinpoints data in a range and splats it out. If it needed to pick and choose, it'd be slower. You can always grab data and filter client side.
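The rules above can be written down as a small checker. This is a sketch in Python of the rule as described in this answer, not of Cassandra's actual query planner, and the function name `is_efficient_query` and its argument shapes are my own invention. It ignores secondary indexes, ALLOW FILTERING and IN clauses.

```python
def is_efficient_query(partition_keys, clustering_keys, restrictions):
    """Check a WHERE clause against the primary-key rules described above.

    restrictions maps column name -> "=" (exact equality) or "range".
    """
    # Every partition key column needs an exact equality.
    if any(restrictions.get(k) != "=" for k in partition_keys):
        return False

    range_seen = False  # a range must be the last restricted clustering key
    gap_seen = False    # no restricted key may follow an unrestricted one
    for key in clustering_keys:
        op = restrictions.get(key)
        if op is None:
            gap_seen = True
            continue
        if gap_seen or range_seen:
            return False
        if op == "range":
            range_seen = True

    # Restrictions on non-key columns are outside this simple rule.
    key_cols = set(partition_keys) | set(clustering_keys)
    return all(col in key_cols for col in restrictions)


# Primary key (a, b, c, d): a is the partition key, b, c, d are clustering keys.
pk, ck = ["a"], ["b", "c", "d"]
print(is_efficient_query(pk, ck, {"a": "=", "b": "=", "c": "range"}))
print(is_efficient_query(pk, ck, {"a": "=", "b": "range", "c": "="}))
print(is_efficient_query(pk, ck, {"a": "=", "c": "range"}))
```

The three calls correspond to the three example queries above: the first is allowed, the other two are not.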
You can also have secondary indexes on columns. These would allow you to filter on exact equality on non-key columns. Be warned, never use a query with a secondary index without specifying a partition key. You'll be doing a cluster query which will time out in real usage. (The exception is if you're using Spark and locality is being honoured, but that's a different thing altogether).
In general, it's good to limit partition sizes to less than 100 MB, or at most a few hundred MB. Any larger and you'll have problems. Usually, a need for larger partitions suggests a bad data model.
Quite often, you'll need to denormalise data into multiple tables to satisfy all your queries in a fast manner. If your model allows you to query for all your needs with the fewest possible tables, that's a really good model. Often that might not be possible though, and denormalisation will be necessary. For your question, the answer to whether or not all of it goes in one row depends on whether you can still query it and keep partition sizes less than 100 meg or not if everything is in one row.
For OLTP, Cassandra will be awesome IF you can build the data model that works the way Cassandra does. Quite often OLAP requirements won't be satisfied by this. The current tool of choice for OLAP with Cassandra data is the DataStax Spark connector + Apache Spark. It's quite simple to use, and is really powerful.
That's quite a brain dump. But it should give you some idea of the things you might need to learn if you intend to use Cassandra for a real world project. I'm not trying to put you off Cassandra or anything. It's an awesome data store. But you have to learn what it's doing to harness its power. It works very differently from Mongo, and you should expect a mind shift when switching. It's most definitely NOT like switching from MySQL to SQL Server.

One big and wide table or many not so big for statistics data

I'm writing a simple analytics system for my company. I have about 100 different event types that should be collected for each of tens of projects. We are not interested in cross-project analytic requests, but event types are similar across all projects. I use PostgreSQL as the primary storage for this system. Now I need to decide which architecture is preferable.
The first architecture is one very big table (in terms of row count) per project that contains data for all event types. It will have about 20 or more columns, many of them nullable. Partitioning may be used to split this table by event type, but the table will still be just as wide.
The second architecture is a lot of tables (fairly big in terms of row count but not so wide), with one table per event type.
I am going to retrieve analytic data from these tables using different join queries (self joins in the case of the first architecture). Which one is preferable, and what are the pitfalls of each?
UPD. All events have about 10 common attributes. The remaining attributes vary from one event type to another.
In the past, I've had similar situations. With postgres you have a bunch of options.
Depending on how your data is input into the system (all at once/ a little at a time) and the volume of your data per project (hundreds of data points vs millions of data points) and the querying pattern (IE, querying after the data is all in, querying nightly, or reports running constantly throughout), there are many options. One other factor will be IF new project types (with new data point types) are likely to crop up.
First, in your "first architecture" the first question that comes up for me is: are all the "data points" the same data type (or at least very similar)? Are some text and others numeric? Are some integers and others floats? If so, you're likely to run into issues with rolling up your data without either building a column or a table for every data type.
If all your data is the same datatype, then the first architecture you mentioned might work really well.
The second architecture you mentioned is OK especially if you don't predict having a bunch of new project types coming down the pike anytime soon, otherwise, you'll be constantly modifying the DB, which I prefer to avoid when unnecessary.
A third architecture that you didn't mention is to have a combination of 1 and 2. Basically have 1 table to hold the 10 common attributes and use either 1 or 2 to hold the additional attributes. This would have an advantage, especially if the additional data wasn't that frequently used, or was non-numeric.
Lastly, you could use one of PostgreSQL's "document store" type datatypes. You could store this information in arrays, hstores, or json. Now, this will be fairly inefficient if you're doing a ton of aggregate functions, as you might be left calculating the aggregates outside of Pgsql or, at a minimum, running an inefficient query. You could store the 10 common fields in normal fields, and the additional ones as hstore or json.
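Here is a rough sketch of that last layout: common fields as normal columns, everything else in one JSON column. To keep it self-contained it uses Python's built-in `sqlite3` with a TEXT column as a stand-in, not PostgreSQL with hstore or json, and only three common columns instead of ten. The table name `events`, the column names and the helper functions are all hypothetical.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE events (
        id         INTEGER PRIMARY KEY,
        project    TEXT NOT NULL,            -- common attribute
        event_type TEXT NOT NULL,            -- common attribute
        created_at TEXT NOT NULL,            -- common attribute
        extras     TEXT NOT NULL DEFAULT '{}'  -- per-type attributes as JSON
    )
    """
)


def log_event(project, event_type, created_at, **extras):
    """Store the common fields in columns and the rest as one JSON document."""
    conn.execute(
        "INSERT INTO events (project, event_type, created_at, extras) "
        "VALUES (?, ?, ?, ?)",
        (project, event_type, created_at, json.dumps(extras)),
    )


def total_extra(project, event_type, field):
    """Sum one JSON field; note the aggregation happens outside the database."""
    rows = conn.execute(
        "SELECT extras FROM events WHERE project = ? AND event_type = ?",
        (project, event_type),
    )
    return sum(json.loads(extras).get(field, 0) for (extras,) in rows)


log_event("shop", "purchase", "2024-01-01", amount=10, currency="USD")
log_event("shop", "purchase", "2024-01-02", amount=5)
print(total_extra("shop", "purchase", "amount"))
```

The `total_extra` helper shows the trade-off mentioned above: filtering on the common columns is cheap, but aggregating over the JSON part means parsing every matching row.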
I didn't ask, but it would help to know whether each event within a project has more than 1 data point (i.e., are you logging changes, or just updating data?). If your overall table has fewer than 100,000 rows, it's likely best to focus on what's easier to maintain and program rather than on performance, as small amounts of data are pretty quick regardless of how they're stored.

Sequential Row IDs in Column Oriented DBs (HBase, Cassandra)?

I've seen two contradictory pieces of advice when it comes to designing row IDs in HBase (specifically, though I think it applies to Cassandra as well):
Group keys that you'll be aggregating together often to take advantage of data locality. (White, Hadoop: The Definitive Guide and I recall seeing it on the HBase site, but can't find it...)
Spread keys around so that work can be distributed across multiple machines (Twitter, Pig, and HBase at Twitter slide 14)
I'm guessing which one is optimal can depend on your use case, but does anyone have any experience with either strategy?
In HBase, a table is partitioned into regions by dividing up the key space, which is sorted lexicographically. Each region of the table belongs to a single region server, so all reads and writes are handled by that server (which allows for a strong consistency guarantee). This means that if all of your reads or writes are concentrated on a small range of your keyspace, you will only be able to scale to what a single region server can handle. For example, if your data is a time series and keyed by the timestamp, then all writes are going to the last region in the table, and you will be constrained to writing at the rate that a single server can handle.
On the other hand, if you can choose your keys such that any given query only needs to scan a small range of rows, but that the overall set of reads and writes are spread across your keyspace, then the total load will be distributed and scale nicely, but you can still enjoy the locality benefits for your query.
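A toy simulation of the two key designs, in Python rather than against a real HBase cluster. Regions are modelled as ranges between sorted split points; the split points, the sensor names and the `sensor|timestamp` key format are all assumptions made up for the example.

```python
from bisect import bisect_right
from collections import Counter


def region_for(key, split_points):
    """Index of the region whose key range contains `key`.

    split_points are the sorted start keys of regions 1..n;
    region 0 holds every key below the first split point.
    """
    return bisect_right(split_points, key)


def write_distribution(keys, split_points):
    """Count how many writes land in each region."""
    return Counter(region_for(k, split_points) for k in keys)


# Design 1: key is the timestamp alone, so new writes all hit the last region.
ts_splits = ["0000000100", "0000000200", "0000000300"]
ts_keys = [f"{t:010d}" for t in range(300, 310)]
print(write_distribution(ts_keys, ts_splits))

# Design 2: key is "<sensor>|<timestamp>", so writes spread across regions,
# while one sensor's rows stay contiguous for range scans.
sensor_splits = ["g", "n", "t"]
sensors = ["alpha", "hotel", "oscar", "victor"]
sensor_keys = [f"{s}|{t:010d}" for s in sensors for t in range(300, 310)]
print(write_distribution(sensor_keys, sensor_splits))
```

In the second design a query for one sensor over a time window is still a short scan inside a single region, which is the locality benefit, while the write load is shared by all regions.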