Why can't a keyed table be splayed in kdb?

Keyed tables are nothing but a dictionary mapping of one table to another, like:
q)kts:([] sym:`GOOG`AMZN`FB)!([] px:3?10.; size:3?100000)
q).Q.dpft[`:/path/db;.z.d;`id;`kts]
'nyi
[0] .Q.dpft[`:/path/db;.z.d;`id;`kts]
Why is there a limitation that keyed tables cannot be splayed or partitioned?

I think the simplest answer has both a technical and a logical side.
Technical: there is no way in the on-disk format to indicate this currently. The .d file indicates the order of columns on disk but not any further metadata. This could be changed at a later date in theory.
The logical answer comes from the size of the data in question. Splayed tables are typically used when you only want to hold a few columns in memory at a time. A decade ago that meant splayed tables were useful for holding up to 100M rows, but with 3.x and modern memory that upper limit can be well north of 250M. I don't think there's a good way to make keyed lookups against data of that size performant in ad-hoc calculations. The grouped attribute (`g#) index needed to make that work is around the same size as the column on disk and would need to be constantly rewritten as data is appended.
I think the use of 'nyi (not yet implemented) in this case, to mean "we probably need to think about this one for a bit", is appropriate.
The obvious solution is to look at explicit row relationships via linking columns, where the lookup calculation is done ahead of time.
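A rough sketch of the workarounds (the paths and the extra ref table below are illustrative, not from the question): drop the key with 0! before splaying, and pre-compute the relationship ahead of time as a link column.
q)kts:([sym:`GOOG`AMZN`FB] px:3?10.; size:3?100000)    / same data, keyed on sym
q)trade:0!kts                                          / 0! drops the key, leaving a plain table
q).Q.dpft[`:/path/db;.z.d;`sym;`trade]                 / splays/partitions fine once unkeyed
q)/ an explicit row relationship, computed ahead of time as a link column:
q)ref:([] sym:`GOOG`AMZN`FB; name:("Google";"Amazon";"Facebook"))
q)kts2:update refLink:`ref!(ref`sym)?sym from 0!kts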

Will one Foreign key to two tables cause confusion?

If I have table A and table B, and I have data that starts off in table A but ends up in table B, and I have a table C which has a foreign key that points to the primary key of A. When the data gets removed from A and ends up in table B, the foreign key should point to B instead (having the same id as A's row did). Will this cause confusion? Here's an example to show what I mean:
A (Pending results): id = 3
B (Completed results): null
C (User): id = 1, results id = 3 (foreign key to both A and B)
After three minutes, the results have been posted:
A (Pending results): null
B (Completed results): id = 3
C (User): id = 1, results id = 3 (foreign key to both A and B)
Is there anything wrong with this implementation, or would it be better to have A and B as one table? The table could grow very large, which is what I am worried about. As separate tables, the reads on table A would be far greater than the reads on table B, and table A would be much smaller, as it only holds pending results. If A and B were combined into one table, it would hold both the pending results and the history of all completed results, so finding the pending ones would take much more time, I assume. All of this is being done in PostgreSQL, if that makes a difference.
So I guess my question is: is this implementation fine for a medium scale, or, given the information above, should I combine tables A and B (even though B will grow indefinitely whereas A only contains current data and is significantly smaller)?
Sounds like you've already found that this does not work. I couldn't follow your example properly because "A", "B", and "C" never work for me. I suspect those kinds of formulaic labels are better than specifics for other people. You just can't win ;-) In any case, it sounds like you're facing a practical concern about table size, and are being tempted to use a design that splits a natural table into two parts. (Hot and old.) As you found, that doesn't really work with the keys in a system. The relational model (etc., etc.) doesn't have a concept for "this thing is a child of this or that." So, you're swimming upstream there. Regardless, this kind of setup is very commonplace in the wild, so much so that it's got a name. Well, several names. "Polymorphic Association", from Bill Karwin's SQL Antipatterns, is common. That's a good book, and short, by the way. Similarly, "promiscuous association" is a term you'll see. Or sometimes you'll see the table itself listed as a "jump table", or a "hub", etc.
I suspect there's a reason this non-relational pattern is so widely used: It makes sense to humans. An area where the relational model is always a tight pinch is when you have things which are kinds of things. Like, people who are staff or student. So many fields in common, several that are distinct to their specific type. One table? Two? Three? Table inheritance in Postgres might help...at least it's trying to. Anyway, polymorphic relations are problematic in an RDBMS because they're not able to be modeled or constrained automatically. You need custom code to figure out that this record is a child of that table...or the other table. You can't bake that into the relations. If you're interested in various solutions to this design problem, Karwin's chapter is quite good, easy to read, and full of alternative designs. If you don't feel like tracking down the book but are a bit interested, check out this article from a few years ago:
https://hashrocket.com/blog/posts/modeling-polymorphic-associations-in-a-relational-database
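For a flavor of the alternatives (this sketch is mine, not taken from the book or the article, and the table names simply stand in for A, B and C from the question): give C two properly constrained, nullable foreign keys plus a CHECK that exactly one of them is set.
CREATE TABLE users (
    id           bigserial PRIMARY KEY,
    pending_id   bigint REFERENCES pending_results (id),
    completed_id bigint REFERENCES completed_results (id),
    -- exactly one of the two references is populated at any time
    CHECK (num_nonnulls(pending_id, completed_id) = 1)
);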
Chances are, your interest right now is more day-to-day. It sounds like you've got a processing pipeline with a few active records and an ever-increasing collection of older records. You don't mention your Postgres version, but you might have less to worry about than you imagine. First up, you could consider partitioning the table. With a partitioned table you get a single logical table that you talk to in your queries, backed by a collection of smaller physical tables under the hood. You can get at the partitions directly, but you don't need to. You just talk to my_big_table and Postgres figures out where to look. So, you could split the data by week, month, etc. so that no one bucket ever gets too big for you. In this case, the individual partitions have their own indexes too, so you'll end up with smaller tables and smaller indexes under the hood. For this, you're best off using PG 11, or maybe PG 10. Partitioning is a big topic, and the Postgres feature set isn't a perfect match for every situation...you have to work within its limits. I'll leave it at that for now as it's likely not what you need first.
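A minimal sketch of what declarative partitioning (PG 10+) looks like; the table and column names here are made up for illustration:
CREATE TABLE results (
    id         bigint      NOT NULL,
    status     text        NOT NULL,
    created_at timestamptz NOT NULL
) PARTITION BY RANGE (created_at);
-- One physical partition per month; queries still just reference "results".
CREATE TABLE results_2024_01 PARTITION OF results
    FOR VALUES FROM ('2024-01-01') TO ('2024-02-01');
CREATE TABLE results_2024_02 PARTITION OF results
    FOR VALUES FROM ('2024-02-01') TO ('2024-03-01');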
Simpler than partitioning is an awesome Postgres feature you may not know about: partial indexes. This isn't unique to Postgres (SQL Server calls the same sort of feature a "filtered" index), but I don't think MySQL has it. Okay, the idea is really simple: build an index that only includes rows that match a condition. Here's an example:
CREATE INDEX tasks_pending
    ON tasks (status)
    WHERE status = 'Pending';
If your table has 100M records, a full B-tree has to catalog all 100M rows. You need that for a uniqueness check on a primary key...but it's big and expensive. Now imagine your 100M records have only 1,000 rows where status = 'Pending'. You've got an index with just those 1,000 rows. Tiny, fast, perfect. The beauty here is that the partial index doesn't necessarily get bigger as your historical data set grows. And, shout out to historical data sets, they're very nice to have when you need to get aggregates, etc. in a simple search. If you split things into multiple tables, you'll need to write longer queries with UNION. (That wouldn't be the case with partitions, where the physical division is masked by the logical partition master table.)
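For instance (column names here are illustrative), a query that repeats the index's predicate can locate its rows through that tiny partial index instead of a full-size one:
SELECT id, created_at
FROM tasks
WHERE status = 'Pending';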
HTH

Optimizing aggregation function and ordering in PostgreSQL

I have the following table 'medicion' with the following fields:
id_variable [int] (PK),
id_departamento [int] (PK),
fecha [date] (PK),
valor [number].
So I want to get the minimum, maximum and average of valor, grouping all that data by id_variable. My query is:
SELECT AVG(valor), MIN(valor), MAX(valor)
FROM medicion
GROUP BY id_variable;
Knowing that by default PostgreSQL builds an index for the primary key
(id_departamento, id_variable, fecha)
how can I optimize this query? Should I create a new index on id_variable only, or is the default index already usable for this query?
Thanks!
Since there is an avg(), and one needs all the values to compute an average, it's going to read the whole table. Unless you use a WHERE, but there is no WHERE, so I presume you want global statistics.
The only things an extra covering index brings are:
1. Not reading the entire table.
This could be beneficial if there were, say, 50 columns, or TEXTs that make the table file huge. In that case, reading the whole table just to average a few ints would mean grinding through tons of useless stuff from disk. Covering indexes are awesome when you want to snipe one or two columns out of a huge table and keep that small column set in cache. But that is not the case here: you only have small columns, so this reason is out.
(And of course there are slightly slower UPDATEs, since the index needs to be maintained. Also, the index needs to be cached, so it's gonna use some RAM, etc.)
2. Getting the rows pre-sorted for convenient aggregation.
This can matter here, mostly if it avoids a huge sort. However, if it only avoids a hash aggregate, which is super fast anyway, it's not so useful.
Now, if you have relatively few distinct values of id_variable... say, enough to fit into a hash aggregate, which can be a sizable amount depending on your work_mem... then it'll be difficult to beat it...
If the table is not updated often, or is insert-only, and you need the statistics often, consider a materialized view (keep min/max/avg for each id_variable in a separate table, and keep them updated on each insert). Updating the mat-view takes time, so this is a tradeoff if you need the stats very often.
You could keep your stats in cache if you don't mind them being stale.
Or, if your table has tons of old data, you could partition it, and keep the min/max/sum/count for the old read-only partition, and only compute the stats on the new stuff.
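To make those suggestions concrete, a sketch using the names from the question (the index and view names are mine; the materialized view is one way to realize the "separate stats table" idea, using a full REFRESH rather than per-insert maintenance):
-- Covering index, if you do want the aggregate fed from an index:
CREATE INDEX medicion_var_valor ON medicion (id_variable, valor);
-- Materialized view holding the precomputed statistics, refreshed when needed:
CREATE MATERIALIZED VIEW medicion_stats AS
SELECT id_variable,
       AVG(valor) AS avg_valor,
       MIN(valor) AS min_valor,
       MAX(valor) AS max_valor
FROM medicion
GROUP BY id_variable;
REFRESH MATERIALIZED VIEW medicion_stats;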

"Big" data in a JSONB column

I have a table with a metadata column (JSONB). Sometimes I run queries on this column. Example:
select * from "a" where metadata->'b'->'c' is not null
This column always holds just small JSON objects (<1KB). But for some records (less than 0.5%), it can be >500KB, because some sub-sub-properties contain a lot of data.
Today I only have ~1000 records and everything works fine. But I think I will have many more records soon, and I don't know whether having some big values (I'm not talking about "Big Data", of course!) will have a global impact on performance. Is 500KB "big" for Postgres, and is it "hard" to parse? Maybe my question is too vague; I can edit it if required. In other words:
Does having some (<0.5%) bigger entries in a JSONB column noticeably affect the global performance of JSON queries?
Side note: assuming the "big" data is in metadata->'c'->'d', I don't run any queries against that particular property. Queries are always done on the "small/common" properties, but the "big" properties still exist.
It is a theoretical question, so I hope a generic answer will satisfy.
If you need numbers, I recommend that you run some performance tests. It shouldn't be too hard to generate some large jsonb objects for testing.
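For example (a hypothetical snippet reusing the table from the question, and assuming its other columns have defaults), something like this generates rows whose metadata is roughly 500KB:
INSERT INTO "a" (metadata)
SELECT jsonb_build_object('b', jsonb_build_object('c', 1),
                          'big', repeat('x', 500000))
FROM generate_series(1, 1000);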
As long as the data are jsonb and not json, operations like metadata->'b'->'c' will be fast. Where you could lose time is when a large jsonb value is loaded and uncompressed from the TOAST table (“detoasted”).
You can avoid that problem by creating an index on the expression. Then an index scan does not have to detoast the jsonb and hence will be fast, no matter how big the jsonb is.
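A sketch of such an expression index for the query from the question (the index name is arbitrary, and whether the planner actually picks it depends on how selective the condition is):
CREATE INDEX a_metadata_b_c ON "a" ((metadata->'b'->'c'));
-- the filter below can be evaluated from the index without detoasting metadata
SELECT * FROM "a" WHERE metadata->'b'->'c' IS NOT NULL;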
So I think that you will not encounter problems with performance as long as you can index for the queries.
You should pay attention to the total size of your database. If the expected size is very large, backups will become a challenge. It helps to design and plan how to discard old and unused data; that is an aspect commonly ignored during application design that tends to cause headaches years later.

postgres many tables vs one huge table

I am using a PostgreSQL db.
My application manages many objects of the same type.
For each object my application performs intense db writing: each object has a row inserted into the db at least once every 30 seconds. I also need to retrieve the data by object id.
My question is how best to design the database: use one huge table for all the objects (slower inserts) or a table per object (more complicated retrievals)?
Tables are meant to hold a huge number of objects of the same type, so your second option, one table per object, doesn't look right. But of course, more information is needed.
My tip: start with one table. If you run into problems - mainly performance - try to split it up. It's not that hard.
Logically, you should use one table.
However, the so-called "write amplification" problem exhibited by PostgreSQL seems to have been one of the main reasons why Uber switched from PostgreSQL to MySQL. Quote:
"For tables with a large number of secondary indexes, these superfluous steps can cause enormous inefficiencies. For instance, if we have a table with a dozen indexes defined on it, an update to a field that is only covered by a single index must be propagated into all 12 indexes to reflect the ctid for the new row."
Whether this is a problem for your workload, only measurement can tell - I'd recommend starting with one table, measuring performance, and then switching to multi-table (or partitioning, or perhaps switching the DBMS altogether) only if the measurements justify it.
A single table is probably the best solution if you are certain that all objects will continue to have the same attributes.
INSERT does not get significantly slower as the table grows – it is the number of indexes that slows down data modification.
I'd rather be worried about data growth. Do you have a design for getting rid of old data? Big DELETEs can be painful; sometimes partitioning helps.
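To make the single-table option concrete, a hypothetical sketch (the table and column names are mine, not from the question):
CREATE TABLE object_samples (
    object_id   bigint      NOT NULL,
    recorded_at timestamptz NOT NULL DEFAULT now(),
    payload     jsonb
);
-- One index covers the "retrieve by object id" access path; keeping the index
-- count low keeps the frequent inserts cheap.
CREATE INDEX object_samples_object_id ON object_samples (object_id, recorded_at);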

historic data storage and retrieval

I am using a standard splayed format for my trade data, where I have a directory for each date and each column as a separate file in there.
I am reading from csv files and storing the data using the code below. I am using the 32-bit trial version on 64-bit Windows 7.
readDat: {[x]
  / load the csv for one date; the column types and order here are illustrative
  tmp: ("STJF"; enlist ",") 0: x;
  tmp: `sym`time`trdId xasc tmp;
  / tmp: update `g#sym from tmp;           / optional grouped attribute on sym
  trade:: tmp;
  .Q.dpft[`:/kdb/ndb; dt; `sym; `trade];   / dt is the partition date for this file
  .Q.gc[];
 };
\t readDat each 50#dtlist
I have tried both with `g#sym and without it. The data typically has 1.5MM rows per date. The select time is from 0.5 to 1 second for a day.
Is there a way to improve times for either of the below queries.
\t select from trade where date=x
\t select from trade where date=x, sym=y
I have read the docs on segmentation, partitioning etc. but not sure if anything would help here.
On second thoughts, will creating a table for each sym speed things up? I am trying that out, but wanted to know if there are memory/space tradeoffs I should be aware of.
Have you done any profiling to see what the actual bottleneck is? If you find that the problem has to do with disk read speed (using something like iostat) you can either get a faster disk (SSD), more memory (for bigger disk cache), or use par.txt to shard your database across multiple disks such that the query happens on multiple disks and cores in parallel.
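For reference, par.txt is just a plain text file in the root directory of the database that lists the segment directories, one per line (the paths below are illustrative):
/mnt/disk1/ndb
/mnt/disk2/ndb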
As you are using .Q.dpft, you are already partitioning your DB. If your use case is always to pass one date in your queries, then segmenting by date will not provide any performance improvements. You could possibly segment by symbol range (see here), although this is never something I've tried.
One basic way to improve performance would be to select a subset of the columns. Do you really need to read all of the fields when querying? Depending on the width of your table this can have a large impact as it now can ignore some files completely.
Another way to improve performance would be to apply `u# to the sym file. This will speed up your second query, as the lookup in the sym file will be faster, although this really depends on the size of your universe. I would imagine the benefit would be marginal in comparison to reducing the number of columns requested.
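Rough illustrations of both suggestions (the paths and column names are illustrative):
/ only name the columns you need; column files you don't touch are never read
select time, price from trade where date=x, sym=y
/ apply the unique attribute to the on-disk sym file
`:/kdb/ndb/sym set `u#get `:/kdb/ndb/sym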
As user1895961 mentioned, selecting only certain columns will be faster.
KDB splayed/partitioned tables are almost exactly just files on the filesystem; the smaller the files and the fewer you have to read, the faster it will be. The balance between the number of folders and the number of files is key. 1.5MM rows per partition is OK, but is on the large side, so perhaps you might want to partition by something else.
You may also want to normalise your data, splitting it into multiple tables and using linked columns to join it back again on the fly. Linked columns, if set up correctly, can be very powerful and can help avoid reading too much data back from disk if filtering is added in.
Also try converting your data to char instead of sym; I found big performance increases from doing so.