kdb+/q optimize union function - kdb

To give you a bit of background. I have a process which does this large complex calculation which takes a while to complete. It runs on a timer. After some investigation I realise that what is causing the slowness isn't the actual calculation but the internal q function, union.
I am trying to union two simple tables, table A and table B. A is approximately 5m rows and B is 500. Both tables have only two columns. First column is a symbol. Table A is actually a compound primary key of a table. (Also, how do you copy directly from the console?)
n:5000000
big:([]n?`4;n?100)
small:([]500?`4;500?100)
\ts big union small
I tried keying both columns and upserting, join and then distinct, "big, small where not small in big" but nothing seems to work :(
Any help will be appreciated!

If you want to upsert the big table it has to be keyed and upsert operator should be used. For example
n:5000000
//big ids are unique numbers from 0 to 499999
//table is keyed with 1! operator
big:1!([]id:(neg n)?n;val:n?100)
//big ids are unique numbers. 250 from 0-4999999 and 250 from 500000-1000000 intervals
small:([]id:(-250?n),(n+-250?n);val:500?100)
If big is global variable it is efficient to upsert it as
`big upsert small
if big is local
big: big upsert small
As the result big will have 500250 elements, because there are 250 common keys (id column) in big and small tables

this may not be relevant, but just a quick thought. If your big table has a column which has type `sym and if this column does not really show up that much throughout your program, why not cast it to string or other value? if you are doing this update process every single day then as the data gets packed in your partitioned hdb, whenever the new data is added, kdb+ process has to reassign/rewrite its sym file and i believe this is the part that actually takes a lot of time, not the union calculation itself..
if above is true, i'd suggest either rewriting your schema for the table which minimises # of rehashing(not sure if this is the right term though!) on your symfile. or, as the above person mentioned, try to assign attribute to your table.. this may reduce the time too.

Related

kdb: getting one row from HDB

For a normal table, we can select one row using select[1] from t. How can I do this for HDB?
I tried select[1] from t where date=2021.02.25 but it gives error
Not yet implemented: it probably makes sense, but it’s not defined nor implemented, and needs more thinking about as the language evolves
select[n] syntax works only if table is already loaded in memory.
The easiest way to get 1st row of HDB table is:
1#select from t where date=2021.02.25
select[n] will work if applied on already loaded data, e.g.
select[1] from select from t where date=2021.02.25
I've done this before for ad-hoc queries by using the virtual index i, which should avoid the cost of pulling all data into memory just to select a couple of rows. If your query needs to map constraints in first before pulling a subset, this is a reasonable solution.
It will however pull N rows for each date partition selected due to the way that q queries work under the covers. So YMMV and this might not be the best solution if it was behind an API for example.
/ 5 rows (i[5] is the 6th row)
select from t where date=2021.02.25, sum=`abcd, price=1234.5, i<i[5]
If your table is date partitioned, you can simply run
select col1,col2 from t where date=2021.02.25,i=0
That will get the first record from 2021.02.25's partition, and avoid loading every record into memory.
Per your first request (which is different to above) select[1] from t, you can achieve that with
.Q.ind[t;enlist 0]

Transpose table with unknown amount of variables

I would like to transpose table a to table b without knowing exactly how many procedures there are. Is there a way to include a loop inside a query?
Thank you in advance!
So far I am just checking what the maximum amount of 'procedures' are, I put all the procedures in an array, and then query all elements from this array. However I would like a query that always works without first defining the maximum amount of procedures.

Data Lake Analytics - Large vertex query

I have a simple query which make a GROUP BY using two fields:
#facturas =
SELECT a.CodFactura,
Convert.ToInt32(a.Fecha.ToString("yyyyMMdd")) AS DateKey,
SUM(a.Consumo) AS Consumo
FROM #table_facturas AS a
GROUP BY a.CodFactura, a.DateKey;
#table_facturas has 4100 rows but query takes several minutes to finish. Seeing the graph explorer I see it uses 2500 vertices because I'm having 2500 CodFactura+DateKey unique rows. I don't know if it normal ADAL behaviour. Is there any way to reduce the vertices number and execute this query faster?
First: I am not sure your query actually will compile. You would need the Convert expression in your GROUP BY or do it in a previous SELECT statement.
Secondly: In order to answer your question, we would need to know how the full query is defined. Where does #table_facturas come from? How was it produced?
Without this information, I can only give some wild speculative guesses:
If #table_facturas is coming from an actual U-SQL Table, your table is over partitioned/fragmented. This could be because:
you inserted a lot of data originally with a distribution on the grouping columns and you either have a predicate that reduces the number of rows per partition and/or you do not have uptodate statistics (run CREATE STATISTICS on the columns).
you did a lot of INSERT statements, each inserting a small number of rows into the table, thus creating a big number of individual files. This will "scale-out" the processing as well. Use ALTER TABLE REBUILD to recompact.
If it is coming from a fileset, you may have too many small files in the input. See if you can merge them into less, larger files.
You can also try to hint a small number of rows in your query that creates #table_facturas if the above does not help by adding OPTION(ROWCOUNT=4000).

100 columns vs Array of length 100

I have a table with 100+ values corresponding to each row, so I'm exploring different ways to store them.
Without any indexes, would I lose anything if I store these 100 values in an integer[] column in postgresql? As compared to storing them in separate columns.
Plus, since we can add indexes to array elemnets,
CREATE INDEX test_index on test ((foo[1]));
Would there be a performance difference queries using such an index as compared to regular index on a column?
As far as I've read, this performance difference would come into picture in arrays with variable length elements; but I'm not sure about fixed length ones.
Don't go for the lazy way.
If you need to store 100 and more values as array, it is ok, if it has sense has array for your application, your data.
If you need to query for a specific element of the array, then this design is not good, regardless of performances, and you must use columns. This will help you in the moment you must delete a "column" in the middle or redesign it.
Anyway, as wrote by Frank in comments, if values are all same type, consider to model them to another table (if also the meaning is the same).

How should a table with two sets of almost duplicate column names be designed?

I have a table that has around 40 columns. The only difference in the columns names is that the last 20 all start with "B" before the column name. This table is used for comparing. In other words, compare the data in the first 20 columns to the data in the last 20 columns.
I know this is very bad design, so how should this table be redesigned, so that there are only 20 columns, yet we can still compare the data?
EDIT: if it helps, we also use this data to find a matched cohort
Also note that performance is of main concern here. By duplicating the columns the getting of data is extremely fast.
Thanks!
Two possible architectures and a query tip.
1) Build your table with a "Type" column, and use that to flag "primary" vs. "alternate". In your case, "A" vs. "B" might be appropriate.
2) Build a vertical partition, two identical tables (for primary and alternate data), that share a common primary key. (If Id = 42 is in one table, it must be in the other--unless "alternate" data is optional, in which case don't populate the second table.) Also optionally, have a third table that tracks all possible primary keys, along with any data that is known to always be common to both tables.
Tip: Read up on SELECT...EXCEPT and SELECT...INTERSECT. They run disturbingly quickly, and are idea for comparing all columns and rows between two datasets for differences (except) and matches (intersect). You can use this fairly easily with either of the two structures, and it would work with your existing code as well (though it might be fussier to write the query).