Table cols definition - kdb

There are a bunch of tricks used in kdb+ to work with keyed/splayed/partitioned and simple tables. I see lots of .Q functions which act as a facade over these varieties. One of them is cols. Could you please help me with one of the cases - what does the 11h= case stand for?
cols
k){$[
.Q.qp x:.Q.v x; / If partitioned
.Q.pf,!+x; / add "partitioned field" to table cols
98h=@x; / If simple table
!+x; / just table cols (convert to dict of lists, get keys)
11h=@!x; / ?
!x;
!+0!x / (keys-dict)!(data-dict): remove keys, get table cols
]}

It's for dictionaries:
q)cols`a`b`c!1 2 3
`a`b`c
Where the type (@) of the key (!) is a list of symbols (11h).
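A quick way to check this in a q session (a sketch; any dictionary keyed by symbols behaves the same):
q)d:`a`b`c!1 2 3
q)type key d   / the keys are a symbol list, type 11h
11h
q)cols d
`a`b`c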

Related

Convert jsonb column to a user-defined type

I'm trying to convert each row in a jsonb column to a type that I've defined, and I can't quite seem to get there.
I have an app that scrapes articles from The Guardian Open Platform and dumps the responses (as jsonb) in an ingestion table, into a column called 'body'. Other columns are a sequential ID, and a timestamp extracted from the response payload that helps my app only scrape new data.
I'd like to move the response dump data into a properly-defined table, and as I know the schema of the response, I've defined a type (my_type).
I've been referring to section 9.16, JSON Functions and Operators, in the Postgres docs. I can get a single record as my type:
select * from jsonb_populate_record(null::my_type, (select body from data_ingestion limit 1));
produces
     id     |     type     |     sectionId      | ...
------------+--------------+--------------------+-----
 example_id | example_type | example_section_id | ...
(abbreviated for concision)
If I remove the limit, I get an error, which makes sense: the subquery would be providing multiple rows to jsonb_populate_record which only expects one.
I can get it to do multiple rows, but the result isn't broken into columns:
select jsonb_populate_record(null::my_type, body) from reviews_ingestion limit 3;
produces:
jsonb_populate_record
(example_id_1,example_type_1,example_section_id_1,...)
(example_id_2,example_type_2,example_section_id_2,...)
(example_id_3,example_type_3,example_section_id_3,...)
This is a bit odd: I would have expected to see column names, since that is the whole point of providing the type.
I'm aware I can do this by using Postgres JSON querying functionality, e.g.
select
body -> 'id' as id,
body -> 'type' as type,
body -> 'sectionId' as section_id,
...
from reviews_ingestion;
This works, but it seems quite inelegant, and I lose the data types.
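Keeping the data types this way would mean an explicit cast per extracted field, since ->> returns text (while -> returns jsonb); for example (webPublicationDate is just an illustrative field here, not one from my schema):
select
body ->> 'id' as id,
body ->> 'type' as type,
body ->> 'sectionId' as section_id,
(body ->> 'webPublicationDate')::timestamptz as web_publication_date -- hypothetical field, shown only for the cast
from reviews_ingestion;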
I've also considered aggregating all rows in the body column into a JSON array, so as to be able to supply this to jsonb_populate_recordset but this seems a bit of a silly approach, and unlikely to be performant.
Is there a way to achieve what I want, using Postgres functions?
Maybe you need this - to break my_type record into columns:
select (jsonb_populate_record(null::my_type, body)).*
from reviews_ingestion
limit 3;
-- or whatever other query clauses here
i.e. select all from these my_type records. All column names and types are in place.
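Since the goal is to move the dumps into a properly-defined table, the same expansion can feed an insert directly; a minimal sketch, assuming a target table named articles whose columns line up with my_type:
insert into articles
select (jsonb_populate_record(null::my_type, body)).*
from reviews_ingestion;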
Here is an illustration. My custom type is delmet and the CTE t loosely mimics data_ingestion.
create type delmet as (x integer, y text, z boolean);
with t(i, j, k) as
(
values
(1, '{"x":10, "y":"Nope", "z":true}'::jsonb, 'cats'),
(2, '{"x":11, "y":"Yep", "z":false}', 'dogs'),
(3, '{"x":12, "y":null, "z":true}', 'parrots')
)
select i, (jsonb_populate_record(null::delmet, j)).*, k
from t;
Result:
 i | x  | y    | z     | k
---+----+------+-------+---------
 1 | 10 | Nope | true  | cats
 2 | 11 | Yep  | false | dogs
 3 | 12 |      | true  | parrots

Select from a table with Limit expression works, without - fails

For a table t with a custom column c whose values are dictionaries, I can use select with a limit expression, but a plain select fails:
q)r1: `n`m`k!111b;
q)r2: `n`m`k!000b;
q)t: ([]a:1 2; b:10 20; c:(r1; r2));
q)t
a b c
----------------
1 10 `n`m`k!111b
2 20 `n`m`k!000b
q)select[2] c[`n] from t
x
-
1
0
q)select c[`n] from t
'type
[0] select c[`n] from t
^
Is it a bug, or am I missing something?
Upd:
Why does select [2] c[`n] from t work here?
Since c is a list, it does not support key indexing, which is why it returned a type error.
You need to index into each element instead of trying to index the column.
q)select c[;`n] from t
x
-
1
0
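An equivalent approach is to index each dictionary element with a lambda (a sketch producing the same 10b column as above; the n: merely names the result column):
q)select n:{x`n} each c from t
n
-
1
0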
A list of conforming dictionaries outside of this context is equivalent to a table, so you can index it the way you were trying to:
q)c:(r1;r2)
q)type c
98h
q)c[`n]
10b
I would say that the way complex columns are represented in memory makes this not possible. I suspect that any operation which creates a copy of a subset of the elements will allow column indexing, as the copy will be formatted as a table.
One example here is serialising and deserialising the column (not recommended in practice). In the case of select[2] above, it is selecting a subset of 2 elements, which creates such a copy:
q)type exec c from t
0h
q)type exec -9!-8!c from t
98h
q)exec (-9!-8!c)[`n] from t
10b

Check for a column in all tables and if column is present print the table name - kdb

We have 3 tables with below columns in our current namespace
q)t:([] a:`a`b`c; b:1 2 3);
q)s:([] a:`a`b`c; b:1 2 3);
q)z:([] z:`a`b`c; b:1 2 3);
Now we want to search and print all the tables which have column a.
Expected output: `s`t
I have two ugly solutions which somewhat work:
q){$[`a in cols x;x;]} each tables[];
q){(enlist x) where enlist(`a in cols x)}each tables[];
But I am looking for a better, more optimal solution.
It'd be best to use the each right (/:) adverb with in:
q)tables[] where `a in/: cols each tables[]
`s`t
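Broken down into its intermediate steps (a sketch over the three tables above; tables[] returns them in sorted order `s`t`z):
q)cols each tables[]
`a`b
`a`b
`z`b
q)`a in/: cols each tables[]
110b
q)tables[] where 110b
`s`t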
If this is the sort of thing you would like to parameterise and use again, something like the following projection may be of use to you:
f:{[t;c]c!t@/:where each flip(c,:())in/:cols each t}[tables[]];
This function produces the following output
q)f`a
a| s t
q)f`a`b
a| s t
b| s t z
So it maps a column name to a list of tables containing that column.
This sort of mapping can often be more advantageous than working with simple lists as it allows for quick, legible indexing.
q)coldict:f`a`b
q)coldict[`a]
`s`t
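Indexing with several column names at once works just as easily:
q)coldict`a`b
`s`t
`s`t`z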

Variable substitution in scala

I have two dataframes in Scala, srcdataframe and tgttable, holding data from two different tables but with the same structure. I have to join these two on a composite primary key, select a few columns, and append two columns; the code for this is below:
for(i <- 2 until numCols) {
  srcdataframe.as("A")
    .join(tgttable.as("B"), $"A.INSTANCE_ID" === $"B.INSTANCE_ID" &&
      $"A.CONTRACT_LINE_ID" === $"B.CONTRACT_LINE_ID", "inner")
    .filter($"A." + srcColnm(i) =!= $"B." + srcColnm(i))
    .select($"A.INSTANCE_ID",
      $"A.CONTRACT_LINE_ID",
      "$"+"\""+"A."+srcColnm(i)+"\""+","+"$"+"\""+"B."+srcColnm(i)+"\"")
    .withColumn("MisMatchedCol", lit("\""+srcColnm(i)+"\""))
    .withColumn("LastRunDate", current_timestamp.cast("long"))
    .createOrReplaceTempView("IPF_1M_Mismatch");
  hiveSQLContext.sql("Insert into table xxxx.f2f_Mismatch1 select t.* from (select * from IPF_1M_Mismatch) t");
}
Here are the things I am trying to do:
Inner join of srcdataframe and tgttable based on instance_id and contract_line_id.
Select only instance_id, contract_line_id, mismatched_col_values, hardcode of mismatched_col_nm, timestamp.
srcColnm(i) is an array of strings which contains the non-primary keys to be compared.
However, I am not able to resolve the variables inside the dataframe operations in the for loop. I tried looking for solutions here and here. I gathered that it may be because Spark only substitutes the variables at compile time; in that case I'm not sure how to resolve it.
Instead of creating columns with $, you can simply use strings or the col() function. I would also recommend performing the join outside of the for loop, as it's an expensive operation. Here is the slightly changed code; the main difference, which solves your problem, is in the select:
val df = srcdataframe.as("A")
  .join(tgttable.as("B"), Seq("INSTANCE_ID", "CONTRACT_LINE_ID"), "inner")
for(columnName <- srcColnm) {
  df.filter(col("A." + columnName) =!= col("B." + columnName))
    .select("INSTANCE_ID", "CONTRACT_LINE_ID", "A." + columnName, "B." + columnName)
    .withColumn("MisMatchedCol", lit(columnName))
    .withColumn("LastRunDate", current_timestamp().cast("long"))
    .createOrReplaceTempView("IPF_1M_Mismatch")
  // Hive command
}
Regarding the problem in select:
$ is short for the col() function; it selects a column in the dataframe by name. The problem in the select is that the first two arguments, col("A.INSTANCE_ID") and col("A.CONTRACT_LINE_ID"), are columns ($ replaced by col() for clarity).
However, the next two arguments are strings. It is not possible to mix the two: either all arguments should be columns or all should be strings. Since you used "A."+srcColnm(i) to build up the column name, $ can't be used; however, you could have used col("A."+srcColnm(i)).
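In other words, either of the forms below should work, as long as the argument kinds are not mixed (a sketch reusing the df and columnName from above):
// all arguments as strings
df.select("INSTANCE_ID", "CONTRACT_LINE_ID", "A." + columnName, "B." + columnName)
// or all arguments as Columns built with col()
df.select(col("INSTANCE_ID"), col("CONTRACT_LINE_ID"), col("A." + columnName), col("B." + columnName))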

How to pass dictionary into query constraint?

If this is the dictionary of constraints:
dictName:`region`Code;
dictValue:(`NJ`NY;`EEE213);
dict:dictName!dictValue;
I would like to pass the dict to a function and have the query react according to how many keys there are. If there is one key, region, then I would like to run it as:
select from table where region in dict`region;
The same goes for Code. But if I pass two keys, I would like the query to know that and run it as:
select from table where region in dict`region,Code in dict`Code;
Is there any way to do this?
I came up with this code:
funcForOne:{[constraint]?[`bce;enlist(in;constraint;(`dict;enlist constraint));0b;()]};
funcForAll:{[dict]$[(null dict)~1;select from bce;($[(count key dict)=1;($[`region in (key dict);funcForOne[`region];funcForOne[`Code]]);select from bce where region in dict`region,rxmCode in dict`Code])]};
It works for one and for two constraints, but when I call funcForAll[] it gives a type error. How should I change it? I think it comes from null dict~1.
I tried count too, but that doesn't work well either.
Update
So I did this, but I got an error:
tab:([]code:`B90056`B90057`B90058`B90059;region:`CA`NY`NJ`CA);
dictKey:`region`Code;dictValue:(`NJ`NY;`B90057);
dict:dictKey!dictValue;
?[tab;f dict;0b;()];
and I got a 'NY error. Do you know why? Also, if I pass a null dictionary it doesn't seem to work.
As I said, functional form would be the better approach, but if your requirement is as limited as you say then you can consider another solution, as below.
Note: this assumes all dictionary keys will be in the table's column list.
q) f:{[dict] if[0=count dict;:select from t];
select from t where (#[key dict;t]) in {$[any 0<=type each value x;flip ;enlist ]x}[dict] }
Explanation:
1. Convert dict to a table, depending on the value types: flip if any value is a general list, else enlist.
$[any 0<=type each value dict;flip ;enlist ]dict
2. Get the subset of table t which consists only of the dictionary keys as columns.
#[key dict;t]
3. Get the rows where (2) is in (1).
Basically we are using below form of querying and matching:
q)t1:([]id:1 2;s:`a`b);
q)t2:([]id:1 3 ;s:`a`b);
q)select from t1 where ([]id;s) in t2
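which, with the t1 and t2 above, keeps only the rows of t1 whose (id;s) pairs also appear in t2:
id s
----
1  a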
If you're just using in, you can do something like:
f:{{[x;y](in),'key[y],'(),x}[;x]enlist each value[x]}
So that:
q)d
a| 10 1
b| ,`a
q)f d
in `a 10 1
in `b ,`a
q)t
a b c
------
1 a 10
2 b 20
3 c 30
q)?[t;f d;0b;()]
a b c
------
1 a 10
Note that because of the enlist each the resulting list is enlisted so that singletons work too:
q)d:enlist[`a]!enlist 1
q)d
a| 1
q)?[t;f d;0b;()]
a b c
------
1 a 10
Update to secondary question
This still works with empty dict, i.e. ()!(). I'm passing in the dictionary variable.
In your 2nd question your dictionary is not constructed correctly (also remember q is case sensitive), and your values need to be enlisted. Look up functional select in the reference pages on the kx site; you'll see that you need to enlist the symbol lists to differentiate them from column name declarations:
`region`code!(enlist `NY`NJ;enlist `B90057)