Update a serialized table on disk in kdb

I have a serialised table on disk which I want to update based on a condition.
One way of doing so is to load the table into memory, update it and then serialise it again to disk.
E.g.:
q)`:file set ([] id:10 20; l1:("Blue hor";"Antop"); l2:("Malad"; "KC"); pcd:("NCD";"FRB") );
q)t:get `:file;
q)update l1:enlist "Chin", l2:enlist "Gor" from `t where id=10;
q)`:file set t;
I tried updating the table directly on disk but received a type error:
q)update l1:enlist "Chin", l2:enlist "Gor" from `:file where id=10
'type
[0] update l1:enlist "Chin", l2:enlist "Gor" from `:file where id=10
Is there a way to update the serialized table directly on disk? (In one case we don't have enough main memory to load the serialized table.)

If you save your table as one flat file, then the whole table has to be loaded in, updated and then written down, requiring enough memory to hold the full table.
To avoid this you can splay your table by adding a trailing / to your filepath, i.e.
`:file/ set ([] id:10 20; l1:("Blue hor";"Antop"); l2:("Malad"; "KC"); pcd:("NCD";"FRB") );
If symbol columns are present they will need to be enumerated using .Q.en.
This will split the table vertically and save your columns as individual files, under a directory called file. Having the columns saved as individual files allows you to only load in the columns required as opposed to the entire table, resulting in smaller memory requirements. You only need to specify the required columns in a select query.
https://code.kx.com/q4m3/14_Introduction_to_Kdb%2B/#142-splayed-tables
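For example, once the table is splayed as above, a query that names only the columns it needs reads only those column files (a minimal sketch using the example table):
select id, l1 from get`:file where id=10    / only the id and l1 column files are read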
You can further split your data horizontally by partitioning, resulting in smaller subsets again if individual columns are too big.
https://code.kx.com/q4m3/14_Introduction_to_Kdb%2B/#143-partitioned-tables
When you run
get`:splayedTable
this memory-maps the table (assuming your columns are mappable) rather than copying it into memory, which you can verify with .Q.w[].
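For example, a rough check with the splayed table written above:
.Q.w[]           / note the used figure before mapping
t:get`:file      / the columns are mapped, not copied into the heap
.Q.w[]           / used barely moves until the column data is actually accessed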
You could do
tab:update l1:enlist "Chin", l2:enlist "Gor" from (select id, l1,l2 from get`:file) where id=10
`:file/l1 set tab`l1
`:file/l2 set tab`l2
If loading only the columns required for your query is still too memory-hungry, you can load them one at a time. First load id and find the required indices (where id=10), delete id from memory, then load in l1 and modify it using those indices,
@[get`:file/l1;indices;:;enlist"Chin"]
write it down and then delete it from memory. Then do the same with l2. This way you hold at most one column in memory at a time. Ideally your table would be appropriately partitioned so that you can hold the data in memory.
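A rough sketch of that column-at-a-time approach, using the splayed file written above:
ind:where 10=get`:file/id                   / load only the id column to find the matching rows
l1new:@[get`:file/l1;ind;:;enlist"Chin"]    / load l1 and amend the matching rows in memory
`:file/l1 set l1new                         / write the column back down
delete l1new from `.                        / free it before repeating the process for l2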
You can also update vectors directly on disk, which avoids having to rewrite the whole file, for example:
ind:exec i from get`:file where id=10
@[`:file/l1;ind;:;enlist"Chin"]
Though there are some restrictions on the file, which are mentioned in the link below:
https://code.kx.com/q/ref/amend/#functional-amend

Related

How to get Talend (tMap) to load lookup data and incoming data at the same time

I have a Talend job that requires a lookup at the target table.
Naturally the target table is large (a fact table), so I don't want to have to wait for the whole thing to load before running lookups.
Is there a way to have the lookup work DURING the pull from the main source?
The aim is to speed up the initial loads so things move fast, and to save on memory. As you can see, the lookup has already passed 3 million rows.
The tLogRow represents the same table as the lookup.
You can achieve what you're looking for by configuring the lookup in your tMap to use the "Reload at each row" lookup model instead of "Load Once". This lookup model re-executes your lookup query for each incoming row instead of loading the whole lookup table at once, which is useful for lookups on large tables.
When you select the "Reload at each row" model, you will have to specify a lookup key in the global map section that appears under the settings. Create a key with a name like "ORDER_ID", and map it to the FromExt.ORDER_ID column. Then modify your lookup query so that it returns a single match for that ORDER_ID, like so:
"SELECT col1, col2... FROM lookup_table WHERE id = '" + (String)globalMap.get("ORDER_ID") + "'"
This is supposing your id column is a string.
What this does is create a global variable called "ORDER_ID" containing the order id for every incoming row from your main connection, then executes the lookup query filtering for that id.
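If your id column were numeric instead (say an Integer in the Talend schema), you would drop the quotes and concatenate the value directly, along these lines (the column names here are placeholders):
"SELECT col1, col2 FROM lookup_table WHERE id = " + (Integer)globalMap.get("ORDER_ID")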

Why does `FOR ALL ENTRIES` lower performance of CDS view on DB6?

I'm reading data from a SAP Core Data Service (CDS view, SAP R/3, ABAP 7.50) using a WHERE clause on its primary (and only) key column. There is a massive performance decrease when using FOR ALL ENTRIES (about a factor of 5):
Reading data using a normal WHERE clause takes about 10 seconds in my case:
SELECT DISTINCT *
FROM ZMY_CDS_VIEW
WHERE prim_key_col eq 'mykey'
INTO TABLE @DATA(lt_table1).
Reading data using FOR ALL ENTRIES with the same WHERE takes about 50 seconds in my case:
"" boilerplate code that creates a table with one entry holding the same key value as above
TYPES: BEGIN OF t_kv,
key_value like ZMY_CDS_VIEW-prim_key_col,
END OF t_kv.
DATA lt_key_values TYPE TABLE OF t_kv.
DATA ls_key_value TYPE t_kv.
ls_key_value-key_value = 'mykey'.
APPEND ls_key_value TO lt_key_values.
SELECT *
FROM ZMY_CDS_VIEW
FOR ALL ENTRIES IN @lt_key_values
WHERE prim_key_col eq @lt_key_values-key_value
INTO TABLE @DATA(lt_table2).
I do not understand why the same selection takes five times as long when utilising FOR ALL ENTRIES. Since the table lt_key_values has only 1 entry, I'd expect the database (sy-dbsys is 'DB6' in my case) to do exactly the same operations plus maybe some small negligible overhead ≪ 40s.
Selecting from the underlying SQL view instead of the CDS (with its Access Control and so on) makes no difference at all, and neither does adding or removing the DISTINCT keyword (because FOR ALL ENTRIES implies DISTINCT).
A colleague guessed that FOR ALL ENTRIES is actually selecting the entire content of the CDS and comparing it with the internal table lt_key_values at runtime. This seems about right.
Using transaction ST05 I recorded an SQL trace that looks like the following in the FOR ALL ENTRIES case:
SELECT
DISTINCT "ZMY_UNDERLYING_SQL_VIEW".*
FROM
"ZMY_UNDERLYING_SQL_VIEW",
TABLE( SAPTOOLS.MEMORY_TABLE( CAST( ? AS BLOB( 2G )) ) CARDINALITY 1 ) AS "t_00" ( "C_0" VARCHAR(30) )
WHERE
"ZMY_UNDERLYING_SQL_VIEW"."MANDT" = ?
AND "ZMY_UNDERLYING_SQL_VIEW"."PRIM_KEY_COL" = "t_00"."C_0"
[...]
Variables
A0(IT,13) = ITAB[1x1(20)]
A1(CH,10) = 'mykey'
A2(CH,3) = '100'
So what actually happens is: ABAP selects the entire CDS content and puts the value from the internal table into something like an additional column. Then it only keeps those rows where the internal table entry and the SQL result match. ==> No optimization at the database level => bad performance.

Redshift COPY command identity column gives alternating values due to number of slices

I am trying to achieve sequential incremental values in the identity column of Redshift while running the COPY command.
Redshift-Identity column SEED-STEP behavior with COPY command is an excellent article I followed to slowly move towards my target, but even after following the last step from the list and using the manifest file, I could only get alternately incrementing ID column values: 1,3,5,7... or 2,4,6,8...
While creating the table, I define that column as:
bucketingid INT IDENTITY(1, 1) sortkey
I understand this behaviour occurs because my dc2.large single-node cluster has 2 slices, hence the issue.
I am trying to upload a single CSV file from S3 to Redshift.
How can I achieve sequential incremental IDs?
The IDENTITY column is not guaranteed to produce consecutive values; it is only guaranteed to assign unique and monotonic values.
You can solve your problem with some SQL once the data is loaded:
CREATE TABLE my_table_with_consecutive_ids AS
SELECT
row_number() over (order by bucketingid) as consecutive_bucketingid,
*
FROM my_table
Some explanation of why the problem occurs:
Since COPY performs distributed loading of your data and each file is loaded by a node slice, loading only one file will be handled by a single slice. To guarantee unique values while loading data in parallel on different slices, each slice uses a space of identities exclusive to itself (with 2 slices, one uses odd and the other even numbers).
Theoretically, you could get consecutive ids after loading the data if you split the file into two (or into however many slices your cluster has) and used both slices for loading (you'd need to use a MANIFEST file), but it's highly impractical and it also bakes assumptions about your cluster size into the load process.
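For completeness, a rough sketch of what that manifest-based load would look like (bucket, file paths, role ARN and column names are placeholders, and the approach still ties the load to the current slice count):
-- manifest.json, listing one part file per slice:
-- {"entries": [
--   {"url": "s3://my-bucket/data/part_0.csv", "mandatory": true},
--   {"url": "s3://my-bucket/data/part_1.csv", "mandatory": true}
-- ]}
COPY my_table (col_a, col_b)   -- identity column omitted so Redshift generates it
FROM 's3://my-bucket/data/manifest.json'
IAM_ROLE 'arn:aws:iam::123456789012:role/my-redshift-role'
CSV
MANIFEST;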
The same explanation appears in the CREATE TABLE manual:
IDENTITY(seed, step)
...
With a COPY operation, the data is loaded in parallel and distributed to the node slices. To be sure that the identity values are unique, Amazon Redshift skips a number of values when creating the identity values. As a result, identity values are unique and sequential, but not consecutive, and the order might not match the order in the source files.

kdb+/q optimize union function

To give you a bit of background: I have a process which does a large, complex calculation that takes a while to complete. It runs on a timer. After some investigation I realised that what is causing the slowness isn't the actual calculation but the internal q function union.
I am trying to union two simple tables, table A and table B. A is approximately 5m rows and B is 500. Both tables have only two columns; the first column is a symbol. Table A is actually the compound primary key of a table. (Also, how do you copy directly from the console?)
n:5000000
big:([]n?`4;n?100)
small:([]500?`4;500?100)
\ts big union small
I tried keying both columns and upserting, join and then distinct, and "big, small where not small in big", but nothing seems to work :(
Any help will be appreciated!
If you want to upsert into the big table it has to be keyed and the upsert operator should be used. For example:
n:5000000
//big ids are unique numbers from 0 to 4999999
//table is keyed with 1! operator
big:1!([]id:(neg n)?n;val:n?100)
//small ids are unique numbers: 250 from the 0-4999999 interval and 250 from the 5000000-9999999 interval
small:([]id:(-250?n),(n+-250?n);val:500?100)
If big is a global variable it is efficient to upsert it in place:
`big upsert small
If big is local:
big: big upsert small
As a result big will have 5,000,250 rows, because 250 of the keys (id column) in small already exist in big.
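As a quick sanity check on the example above (the row counts follow from how big and small are constructed):
`big upsert small    / keyed in-place upsert
count big            / 5000250: 250 new keys appended, 250 existing keys overwritten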
This may not be relevant, but just a quick thought: if your big table has a column of type `sym and that column does not really feature much throughout your program, why not cast it to string or some other type? If you are doing this update process every single day, then as the data gets packed into your partitioned HDB, whenever new data is added the kdb+ process has to reassign/rewrite its sym file, and I believe this is the part that actually takes a lot of time, not the union calculation itself.
If the above is true, I'd suggest either rewriting the schema for the table to minimise the amount of rehashing (not sure if that's the right term!) of your sym file, or, as the person above mentioned, trying to assign an attribute to your table; this may reduce the time too.

Efficient way of insert millions of rows, convert data and deal with it, on PostgreSQL+PostGIS

I have a big collection of data I want to use for user search later.
Currently I have 200 million resources (~50 GB). For each, I have latitude+longitude. The goal is to create a spatial index to be able to run spatial queries on it.
So for that, the plan is to use PostgreSQL + PostGIS.
My data is in a CSV file. I tried to use a custom function to avoid inserting duplicates, but after days of processing I gave up. I found a way to load it fast into the database: with COPY it takes less than 2 hours.
Then I need to convert latitude+longitude to Geometry format. For that I just need to do:
ST_SetSRID(ST_MakePoint(longi::double precision,lat::double precision),4326)
After some checking, I saw that for the 200 million resources I have 50 million points. So I think the best way is to have a table "TABLE_POINTS" that will store all the points, and a table "TABLE_RESOURCES" that will store resources with a point_key.
So I need to fill "TABLE_POINTS" and "TABLE_RESOURCES" from the temporary table "TABLE_TEMP" without keeping duplicates.
For "POINTS" I did:
INSERT INTO TABLE_POINTS (point)
SELECT DISTINCT ST_SetSRID(ST_MakePoint(longi::double precision,lat::double precision),4326)
FROM TABLE_RESOURCES
I don't remember how much time it took, but I think it was matter of hours.
Then, to fill "RESOURCES", I tried:
INSERT INTO TABLE_RESOURCES (...,point_key)
SELECT DISTINCT ...,point_key
FROM TABLE_TEMP, TABLE_POINTS
WHERE ST_SetSRID(ST_MakePoint(longi::double precision,lat::double precision),4326) = point;
but again it takes days, and there is no way to see how far along the query is...
Also, something important: the number of resources will continue to grow, currently around 100K added per day, so storage should be optimized to keep access to the data fast.
So if you have any ideas for the loading or for optimizing the storage, you are welcome to share them.
Look into optimizing Postgres first (i.e. google Postgres unlogged tables, WAL and fsync options). Second, do you really need points to be unique? Maybe just have one table with resources and points combined and not worry about duplicate points, as it seems your duplicate lookup may be what's slow.
For DISTINCT to work efficiently, you'll need a database index on those columns for which you want to eliminate duplicates (e.g. on the latitude/longitude columns, or even on the set of all columns).
So first insert all data into your temp table, then CREATE INDEX (this is usually faster than creating the index beforehand, as maintaining it during insertion is costly), and only afterwards do the INSERT INTO ... SELECT DISTINCT.
An EXPLAIN <your query> can tell you whether the SELECT DISTINCT now uses the index.
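A hedged sketch of that order of operations, reusing the table and column names from the question (TABLE_TEMP, TABLE_POINTS, longi, lat); the index name is arbitrary:
-- 1. Bulk-load the CSV into TABLE_TEMP first (already done with COPY).
-- 2. Index the deduplication expression after loading:
CREATE INDEX idx_temp_point ON TABLE_TEMP
  (ST_SetSRID(ST_MakePoint(longi::double precision, lat::double precision), 4326));
-- 3. Only then deduplicate into the points table:
INSERT INTO TABLE_POINTS (point)
SELECT DISTINCT ST_SetSRID(ST_MakePoint(longi::double precision, lat::double precision), 4326)
FROM TABLE_TEMP;
-- 4. Check whether the DISTINCT now uses the index:
EXPLAIN SELECT DISTINCT ST_SetSRID(ST_MakePoint(longi::double precision, lat::double precision), 4326)
FROM TABLE_TEMP;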