Can I set column attributes for a kdb partitioned table?

Is it possible to set column attributes for a partitioned table?
q)h "update `g#ticker from `pmd"
'par
q)h "update `s#ts from `pmd"
'par
q)
Should I set the attributes on the in-memory table before I run the partitioning? Will the attributes be preserved after partitioning?

Take a look at setattrcol in dbmaint.q. This script is very useful when working with partitioned databases.
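A rough sketch of how setattrcol might be applied (the HDB path here is a placeholder; check the exact signature in your copy of dbmaint.q):
q)\l dbmaint.q
q)setattrcol[`:/path/to/hdb;`pmd;`ticker;`g]   / apply g# to ticker in every partition
q)setattrcol[`:/path/to/hdb;`pmd;`ts;`s]       / apply s# to ts (the column must already be sorted)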

In order for the partitions on disk to be sorted, you need to iterate through the partitions and use xasc on each one. For example, assuming you have a quote table partitioned by date and you want to sort by `timestamp, for a single partition:
q)`timestamp xasc `$":./2014.04.20/quote/"
Once you've finished sorting each partition, the s attribute will appear on the timestamp column:
q)meta quote
c        | t f a
---------| -----
date     | d
timestamp| p   s
pair     | s
side     | c
...
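To apply that sort to every date partition rather than one at a time, a rough sketch (assuming the HDB root is the current directory and the database has been loaded, so that date holds the partition values):
q){`timestamp xasc `$":./",string[x],"/quote/"} each date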

Related

Best usage of indexes and primary key on joined and filtered data in PostgreSQL

I have 2 tables with the exact same number of rows and the same non-repeated id. Because the data comes from 2 sources I want to keep it as 2 tables rather than combine them. I assume the best approach would be to leave the unique id as the primary key and join on it?
SELECT * FROM tableA INNER JOIN tableB ON tableA.id = tableB.id;
The data is used by an application that forces the user to select 1 or many values from 5 drop-downs in cascading order:
select 1 or many values from tableA column1.
select 1 or many values from tableA column2 but filtered from the first filter.
select 1 or many values from tableA column3 but filtered from the second filter which in turn is filtered from the first filter.
For example:
pk  | Column 1 | Column 2 | Column 3
----+----------+----------+---------
123 | Doe      | Jane     | 2022-01
234 | Doe      | Jane     | 2021-12
345 | Doe      | John     | 2022-03
456 | Jones    | Mary     | 2022-04
Selecting "Doe" from column1 would limit the second filter to ("Jane","John"). And selecting "Jane" from column2 would filter column3 to ("2022-01","2021-12")
And the last part of the question:
The application has 3 selection options for column3:
picking the exact value (for example "2022-01"), picking the year ("2022"), or picking the quarter that the month falls into ("Q1", which equates to months "01","02","03").
What would be the best usage of indexes AND/OR additional columns for this scenario?
Volume of data would be 20-100 million rows.
Each filter is in the range of 5-25 distinct values.
Which version of Postgres are you running?
The volume you state is rather daunting for a use case of populating drop-down boxes from live data in a PG database.
No kidding, it's possible; Kibana/Elastic even has a filter widget that works exactly this way, for instance.
My guess is you may consider storing the different combinations of the search columns in another table, simply to speed up populating the drop-downs. You can achieve that with triggers on the two main tables, as sketched below. So instead of additional columns/indexes you may end up with an additional table ;)
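A rough sketch of that idea, shown for tableA only (the table, function, and trigger names are made up for illustration; assumes Postgres 11+ for EXECUTE FUNCTION and 9.5+ for ON CONFLICT):
CREATE TABLE filter_options (
    col1 text NOT NULL,
    col2 text NOT NULL,
    col3 text NOT NULL,
    PRIMARY KEY (col1, col2, col3)
);

CREATE FUNCTION record_filter_option() RETURNS trigger AS $$
BEGIN
    -- keep one row per distinct combination of the three filter columns
    INSERT INTO filter_options (col1, col2, col3)
    VALUES (NEW.column1, NEW.column2, NEW.column3)
    ON CONFLICT DO NOTHING;
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER tablea_filter_options
    AFTER INSERT ON tableA
    FOR EACH ROW EXECUTE FUNCTION record_filter_option();
Populating the drop-downs then becomes a cheap query against the small filter_options table instead of a scan over 20-100 million rows (UPDATE/DELETE handling is left out of the sketch).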
Regarding indexing strategy and given the hints you stated (AND/OR), I'd say there's no silver bullet. Index the columns that will be queried the most often.
Index each column individually, because Postgres can combine multiple single-column indexes (bitmap index scans, available since 8.1) to answer conjunctive/disjunctive conditions in WHERE clauses.
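For example (index names are arbitrary):
CREATE INDEX idx_tablea_column1 ON tableA (column1);
CREATE INDEX idx_tablea_column2 ON tableA (column2);
CREATE INDEX idx_tablea_column3 ON tableA (column3);
A query that filters on several of these columns can then be answered by combining the individual indexes with a BitmapAnd/BitmapOr.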
Hope this helps

Is it possible to use postgres/psql COPY into a specific partition of a table?

I am currently looking into an efficient way to allocate data into a partitioned table. Is it possible to use postgres/psql to COPY data into a specific table partition (instead of using INSERT)?
According to the documentation on COPY here:
COPY FROM can be used with plain, foreign, or partitioned tables or with views that have INSTEAD OF INSERT triggers.
And according to the documentation on partitioning here:
Be aware that COPY ignores rules. If you want to use COPY to insert data, you'll need to copy into the correct partition table rather than into the master. COPY does fire triggers, so you can use it normally if you use the trigger approach.
From my understanding of the aforementioned resources, it seems possible to COPY into a partition; however, I can't find any examples or support for that online.
In other words, can I write something like:
COPY some_table_partition_one FROM '/some_dir/some_file'
COPY to a partitioned table was introduced in v11:
Allow INSERT, UPDATE, and COPY on partitioned tables to properly route rows to foreign partitions (Etsuro Fujita, Amit Langote)
But COPY directly to a partition is possible in all releases since v10, where declarative partitioning was introduced.
It seems like we forgot to remove the second quotation from the documentation.
It is possible at least with PG 12.2:
CREATE TABLE measurement (
    city_id int not null,
    logdate date not null,
    peaktemp int,
    unitsales int
) PARTITION BY RANGE (logdate);
CREATE TABLE
CREATE TABLE measurement_y2020m03 PARTITION OF measurement
    FOR VALUES FROM ('2020-03-01') TO ('2020-04-01');
CREATE TABLE
CREATE TABLE measurement_y2020m04 PARTITION OF measurement
    FOR VALUES FROM ('2020-04-01') TO ('2020-05-01');
CREATE TABLE
insert into measurement values (1, current_date, 10,100);
INSERT 0 1
select * from measurement;
 city_id |  logdate   | peaktemp | unitsales
---------+------------+----------+-----------
       1 | 2020-03-27 |       10 |       100
(1 row)
cat /tmp/m.dat
4,2020-04-01,40,400
copy measurement_y2020m04 from '/tmp/m.dat' delimiter ',';
COPY 1
select * from measurement;
 city_id |  logdate   | peaktemp | unitsales
---------+------------+----------+-----------
       1 | 2020-03-27 |       10 |       100
       4 | 2020-04-01 |       40 |       400
(2 rows)
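For completeness, once you are on a version that supports COPY into the partitioned parent (see the release notes quoted above), you can also skip naming the partition and let Postgres route each row to the matching partition:
copy measurement from '/tmp/m.dat' delimiter ',';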

Does PostgreSQL have a way of creating metadata about the data in a particular table?

I'm dealing with a lot of unique data that has the same type of columns, but each group of rows have different attributes about them and I'm trying to see if PostgreSQL has a way of storing metadata about groups of rows in a database or if I would be better off adding custom columns to my current list of columns to track these different attributes. Microsoft Excel for instance has a way you can merge multiple columns into a super-column to group multiple columns into one, but I don't know how this would translate over to a PostgreSQL database. Thoughts anyone?
Right, can't upload files. Hope this turns out well.
Section 1 | Section 2 | Section 3
=================================
Num1|Num2 | Num1|Num2 | Num1|Num2
=================================
132 | 163 | 334 | 1345| 343 | 433
......
......
......
have a "super group" of columns (In SQL in general, not just postgreSQL), the easiest approach is to use multiple tables.
Example:
Person table can have columns of
person_ID, first_name, last_name
employee table can have columns of
person_id, department, manager_person_id, salary
customer table can have columns of
person_id, addr, city, state, zip
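A minimal sketch of those tables (column names as listed above; the data types are assumptions):
CREATE TABLE person (
    person_id  bigint PRIMARY KEY,
    first_name text,
    last_name  text
);

CREATE TABLE employee (
    person_id         bigint PRIMARY KEY REFERENCES person (person_id),
    department        text,
    manager_person_id bigint REFERENCES person (person_id),
    salary            numeric
);

CREATE TABLE customer (
    person_id bigint PRIMARY KEY REFERENCES person (person_id),
    addr      text,
    city      text,
    state     text,
    zip       text
);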
That way, you can join them together to do whatever you like..
Example:
select *
from person p
left outer join employee e on e.person_id = p.person_id
left outer join customer c on c.person_id = p.person_id
Or any variation, separating the data into different types and perhaps saving a little disk space in the process (for example, if most "people" are "customers", they don't need a bunch of employee data floating around or a lot of nullable columns).
That's how I normally handle this type of situation, but without a practical example, it's hard to say what's best in your scenario.

optimizing postgres view for timestamps and aggregation of fields from another table

I've greatly simplified the examples to hopefully produce a clear enough question that can be answered:
Consider a table of events
CREATE TABLE alertable_events
(
    unique_id text NOT NULL DEFAULT ''::text,
    generated_on timestamp without time zone NOT NULL DEFAULT now(),
    message_text text NOT NULL DEFAULT ''::text,
    CONSTRAINT pk_alertable_events PRIMARY KEY (unique_id)
);
with the following data:
COPY alertable_events (unique_id,message_text,generated_on) FROM stdin;
one message one 2014-03-20 06:00:00.000000
two message two 2014-03-21 06:00:00.000000
three message three 2014-03-22 06:00:00.000000
four message four 2014-03-23 06:00:00.000000
five message five 2014-03-24 06:00:00.000000
\.
And for each event, there is a list of fields
CREATE TABLE alertable_event_fields
(
    unique_id text NOT NULL DEFAULT ''::text,
    field_name text NOT NULL,
    field_value text NOT NULL DEFAULT ''::text,
    CONSTRAINT pk_alertable_event_fields PRIMARY KEY (unique_id, field_name),
    CONSTRAINT fk_alertable_event_fields_0 FOREIGN KEY (unique_id)
        REFERENCES alertable_events (unique_id) MATCH SIMPLE
        ON UPDATE CASCADE ON DELETE CASCADE
);
with the following data:
COPY alertable_event_fields (unique_id,field_name,field_value) FROM stdin;
one field1 a
one field2 b
two field1 z
two field2 y
three field1 a
three field2 m
four field1 a
four field2 b
five field1 z
five field2 y
\.
I want to define a view that produces the following:
| unique_id | fields | message_text | generated_on | updated_on | count |
| five | z|y | message five | 2014-03-21 06:00:00.000000 | 2014-03-24 06:00:00.000000 | 2 |
| four | a|b | message four | 2014-03-20 06:00:00.000000 | 2014-03-23 06:00:00.000000 | 2 |
| three | a|m | message three | 2014-03-22 06:00:00.000000 | 2014-03-22 06:00:00.000000 | 1 |
Notably:
fields is a pipe delimited string (or any serialization of) the field values (json encoding of field_name:field_value pairs would be even better ... but I can work with pipe_delim for now)
the output is grouped by matching fields. Update 3/30 12:45am The values are ordered alphabetically by their field_name, therefore a|b would not match b|a
a count is produced of the events that match that field set. Updated 3/30 12:45am There can be a different number of fields per unique_id; a match requires matching all fields, not a subset of the fields.
generated_on is the timestamp of the first event
updated_on is the timestamp of the most recent event
message_text is the message_text of the most recent event
I've produced this view, and it works for small data sets, however, as the alertable_events table grows, it becomes exceptionally slow. I can only assume I'm doing something wrong in the view because I have never dealt with anything quite so ugly.
Update 3/30 12:15PM EDT It looks like I may have server tuning problems causing these high run times, see the added explain for more info. If you see a glaring issue there, I'd be greatly interested in tweaking the server's configuration.
Can anyone piece together a view that handles large datasets well and has a significantly better run time than this? Perhaps using hstore? (I'm running 9.2 preferably, though 9.3 if I can have a nice json encoding of the fields.)
Updated 3/30 11:30AM I'm beginning to think my issue may be server tuning (which means I'll need to talk to the SA). Here's a very simple explain (analyze,buffers) which shows a ridiculous run time for as few as 8k rows in the unduplicated_event_fields:
Update 3/30 7:20PM I bumped work_mem to 5MB using SET work_mem='5MB' (which is plenty for the query below); strangely, even though the planner switched to an in-memory quicksort, it actually took on average 100ms longer!
explain (analyze,buffers)
SELECT a.unique_id,
array_to_string(array_agg(a.field_value order by a.field_name),'|') AS "values"
FROM alertable_event_fields a
GROUP BY a.unique_id;
QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------
GroupAggregate (cost=771.11..892.79 rows=4056 width=80) (actual time=588.679..630.989 rows=4056 loops=1)
Buffers: shared hit=143, temp read=90 written=90
-> Sort (cost=771.11..791.39 rows=8112 width=80) (actual time=588.591..592.622 rows=8112 loops=1)
Sort Key: unique_id
Sort Method: external merge Disk: 712kB
Buffers: shared hit=143, temp read=90 written=90
-> Seq Scan on alertable_event_fields a (cost=0.00..244.40 rows=8112 width=80) (actual time=0.018..5.478 rows=8112 loops=1)
Filter: (message_name = 'LIMIT_STATUS'::text)
Buffers: shared hit=143
Total runtime: 632.323 ms
(10 rows)
Update 3/30 4:10AM EDT I'm still not completely satisfied and would be interested in any further optimization. I have a requirement to support 500msgs/sec steady state, and although most of those should not be "events", I get a little backlogged right now when stress testing.
Update 3/30 12:00PM EDT Here's my most readable iteration yet; unfortunately, for 4000 rows I'm still looking at 600ms runtimes! ... (see above, as it's mostly contained within the innermost query) Any help here would be greatly appreciated
CREATE OR REPLACE VIEW views.unduplicated_events AS
SELECT a.unique_id, a.message_text,
       b."values", b.generated_on, b.updated_on, b.count
FROM alertable_events a
JOIN (
    SELECT b."values",
           min(a.generated_on) AS generated_on,
           max(a.generated_on) AS updated_on,
           count(*) AS count
    FROM alertable_events a
    JOIN (
        SELECT a.unique_id,
               array_to_string(array_agg(a.field_value order by a.field_name),'|') AS "values"
        FROM alertable_event_fields a
        GROUP BY a.unique_id
    ) b USING (unique_id)
    GROUP BY b."values"
) b ON a.generated_on = b.updated_on
ORDER BY updated_on DESC;
Update 3/30 12:00PM EDT removed old stuff as this is getting too long
Some pointers
Invalid query
Your current query is incorrect unless generated_on is unique, which is not declared in the question and probably is not the case:
CREATE OR REPLACE VIEW views.unduplicated_events AS
SELECT ...
FROM alertable_events a
JOIN ( ... ) b ON a.generated_on=b.updated_on -- !! unreliable
Possibly faster
SELECT DISTINCT ON (f.fields)
       unique_id                       -- most recent
     , f.fields
     , e.message_text                  -- most recent
     , min(e.generated_on) OVER (PARTITION BY f.fields) AS generated_on  -- "first"
     , e.generated_on AS updated_on    -- most recent
     , count(*) OVER (PARTITION BY f.fields) AS ct
FROM   alertable_events e
JOIN  (
    SELECT unique_id, array_to_string(array_agg(field_value), '|') AS fields
    FROM  (
        SELECT unique_id, field_value
        FROM   alertable_event_fields
        ORDER  BY 1, field_name  -- a bit of a hack, but much faster
        ) f
    GROUP  BY 1
    ) f USING (unique_id)
ORDER  BY f.fields, e.generated_on DESC;
SQL Fiddle.
The result is currently sorted by fields. If you need a different sort order, you'd need to wrap it in another subquery ...
Major points
The output column name generated_on conflicts with the input column generated_on. You have to table-qualify the column e.generated_on to refer to the input column. I added table-qualification everywhere to make it clear, but it is only actually necessary in the ORDER BY clause. The manual:
If an ORDER BY expression is a simple name that matches both an
output column name and an input column name, ORDER BY will interpret
it as the output column name. This is the opposite of the choice that
GROUP BY will make in the same situation. This inconsistency is made
to be compatible with the SQL standard.
The updated query should also be faster (as intended all along). Run EXPLAIN ANALYZE again.
For the whole query, indexes will hardly be of use. Only if you select specific rows ... One possible exception: a covering index for alertable_event_fields:
CREATE INDEX f_idx1
ON alertable_event_fields (unique_id, field_name, field_value);
Lots of write operations might void the benefit, though.
array_agg(field_value ORDER BY ...) tends to be slower for big sets than pre-sorting in a subquery.
DISTINCT ON is convenient here. Not sure whether it's actually faster, though, since ct and generated_on have to be computed in separate window functions, which requires another sort step.
work_mem: setting it too high can actually harm performance. More in the Postgres Wiki or in "Craig's list".
Generally this is hard to optimize. Indexes fail because the sort order depends on two tables. If you can work with a snapshot, consider a MATERIALIZED VIEW.
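A minimal sketch of that last suggestion (requires Postgres 9.3 or later; the snapshot name is made up, and you could just as well materialize the "possibly faster" query above instead of the view):
CREATE MATERIALIZED VIEW unduplicated_events_snapshot AS
SELECT * FROM views.unduplicated_events;

-- re-run whenever a sufficiently fresh snapshot is needed
REFRESH MATERIALIZED VIEW unduplicated_events_snapshot;
Reads then hit precomputed rows, at the cost of the data only being as fresh as the last REFRESH.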

Cassandra CQL3 select row keys from table with compound primary key

I'm using Cassandra 1.2.7 with the official Java driver that uses CQL3.
Suppose a table created by
CREATE TABLE foo (
row int,
column int,
txt text,
PRIMARY KEY (row, column)
);
Then I'd like to perform the equivalent of SELECT DISTINCT row FROM foo
As far as I understand, it should be possible to execute this query efficiently inside Cassandra's data model (given the way compound primary keys are implemented), as it would just query the 'raw' table.
I searched the CQL documentation but I didn't find any options to do that.
My backup plan is to create a separate table - something like
CREATE TABLE foo_rows (
row int,
PRIMARY KEY (row)
);
But this requires the hassle of keeping the two in sync - writing to foo_rows for any write in foo (also a performance penalty).
So is there any way to query for distinct row(partition) keys?
I'll give you the bad way to do this first. If you insert these rows:
insert into foo (row,column,txt) values (1,1,'First Insert');
insert into foo (row,column,txt) values (1,2,'Second Insert');
insert into foo (row,column,txt) values (2,1,'First Insert');
insert into foo (row,column,txt) values (2,2,'Second Insert');
Doing a
'select row from foo;'
will give you the following:
row
-----
1
1
2
2
Not distinct since it shows all possible combinations of row and column. To query to get one row value, you can add a column value:
select row from foo where column = 1;
But then you will get this warning:
Bad Request: Cannot execute this query as it might involve data filtering and thus may have unpredictable performance. If you want to execute this query despite the performance unpredictability, use ALLOW FILTERING
Ok. Then with this:
select row from foo where column = 1 ALLOW FILTERING;
row
-----
1
2
Great. What I wanted. Let's not ignore that warning though. If you only have a small number of rows, say 10000, then this will work without a huge hit on performance. Now what if I have 1 billion? Depending on the number of nodes and the replication factor, your performance is going to take a serious hit. First, the query has to scan every possible row in the table (read: a full table scan) and then filter the unique values for the result set. In some cases, this query will just time out. Given that, probably not what you were looking for.
You mentioned that you were worried about a performance hit on inserting into multiple tables. Multiple table inserts are a perfectly valid data modeling technique. Cassandra can do an enormous amount of writes. As for it being a pain to keep in sync, I don't know your exact application, but I can give general tips.
If you need a distinct scan, you need to think in terms of partition columns. This is what we call an index or query table. The important thing to consider in any Cassandra data model is the application queries. If I were using IP addresses as the row key, I might create something like this to scan all the IP addresses I have in order.
CREATE TABLE ip_addresses (
first_quad int,
last_quads ascii,
PRIMARY KEY (first_quad, last_quads)
);
Now, to insert some rows in my 192.x.x.x address space:
insert into ip_addresses (first_quad,last_quads) VALUES (192,'000000001');
insert into ip_addresses (first_quad,last_quads) VALUES (192,'000000002');
insert into ip_addresses (first_quad,last_quads) VALUES (192,'000001001');
insert into ip_addresses (first_quad,last_quads) VALUES (192,'000001255');
To get the distinct rows in the 192 space, I do this:
SELECT * FROM ip_addresses WHERE first_quad = 192;
first_quad | last_quads
------------+------------
192 | 000000001
192 | 000000002
192 | 000001001
192 | 000001255
To get every single address, you would just need to iterate over every possible row key from 0-255. In my example, I would expect the application to be asking for specific ranges to keep things performant. Your application may have different needs but hopefully you can see the pattern here.
According to the documentation, from CQL version 3.1.1, Cassandra understands the DISTINCT modifier.
So you can now write
SELECT DISTINCT row FROM foo
#edofic
Partition row keys are used as a unique index to distinguish different rows in the storage engine, so by nature row keys are always distinct. You don't need to put DISTINCT in the SELECT clause.
Example
INSERT INTO foo(row,column,txt) VALUES (1,1,'1-1');
INSERT INTO foo(row,column,txt) VALUES (2,1,'2-1');
INSERT INTO foo(row,column,txt) VALUES (1,2,'1-2');
Then
SELECT row FROM foo
will return 2 values: 1 and 2
Below is how things are persisted in Cassandra
+---------+-----------------+-----------------+
| row key | column1/value   | column2/value   |
+---------+-----------------+-----------------+
|       1 | 1/'1-1'         | 2/'1-2'         |
|       2 | 1/'2-1'         |                 |
+---------+-----------------+-----------------+