KDB: why am I getting a type error when upserting? - kdb

I specified the columns to be of type String. Why am I getting the following error:
q)test: ([key1:"s"$()] col1:"s"$();col2:"s"$();col3:"s"$())
q)`test upsert(`key1`col1`col2`col3)!(string "999"; string "693"; string "943";
string "249")
'type
[0] `test upsert(`key1`col1`col2`col3)!(string "999"; string "693"; string "943"; string "249")

To do exactly what you are attempting, you can remove the types from the lists you defined in test:
q)test: ([key1:()] col1:();col2:();col3:())
q)test upsert (`key1`col1`col2`col3)!("999";"693";"943";"249")
key1 | col1 col2 col3
-----| -----------------
"999"| "693" "943" "249"
The reason you are getting a type error is that "s" corresponds to a list of symbols, not a list of characters. You can check this using .Q.ty:
q).Q.ty `symbol$()
"s"
q).Q.ty `char$()
"c"
It is (generally) not a great idea to use nested lists of chars as keys; you might find it better to make them integers ("i") or longs ("j"), as in:
test: ([key1:"j"$()] col1:"j"$();col2:"j"$();col3:"j"$())
Having the keys as integers/longs will make the upsert function behave nicely. Also note that a table is a list of dictionaries, so each dictionary can be upserted individually, as well as a whole table:
q)`test upsert (`key1`col1`col2`col3)!(9;4;6;2)
`test
q)test
key1| col1 col2 col3
----| --------------
9 | 4 6 2
q)`test upsert (`key1`col1`col2`col3)!(8;6;2;3)
`test
q)test
key1| col1 col2 col3
----| --------------
9 | 4 6 2
8 | 6 2 3
q)`test upsert (`key1`col1`col2`col3)!(9;1;7;4)
`test
q)test
key1| col1 col2 col3
----| --------------
9 | 1 7 4
8 | 6 2 3
q)`test upsert ([key1: 8 7] col1:2 4; col2:9 3; col3:1 9)
`test
q)test
key1| col1 col2 col3
----| --------------
9 | 1 7 4
8 | 2 9 1
7 | 4 3 9
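The keyed-upsert behaviour shown above (new keys are appended, existing keys have their row overwritten) can be mimicked with a plain dict in Python. This is only an illustrative sketch of the semantics, not q; the `upsert` helper name is made up for the example:

```python
# Illustrative analogue of a q keyed table: a dict keyed on key1.
# Upserting a row inserts it when the key is new and overwrites the
# existing row when the key already exists.
def upsert(table, row):
    key = row["key1"]
    table[key] = {k: v for k, v in row.items() if k != "key1"}
    return table

test = {}
upsert(test, {"key1": 9, "col1": 4, "col2": 6, "col3": 2})
upsert(test, {"key1": 8, "col1": 6, "col2": 2, "col3": 3})
upsert(test, {"key1": 9, "col1": 1, "col2": 7, "col3": 4})  # overwrites key 9

print(test[9])  # {'col1': 1, 'col2': 7, 'col3': 4}
```

This mirrors the q session above: the second upsert of key 9 replaces its row rather than appending a duplicate.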

You have a few issues:
- an array of chars in quotes is already a string, so there is no need to write string "abc"
- string "aaa" applies string to each character, giving you a list of one-character strings
- your initially defined types are symbols ("s"), not strings
This will allow you to insert as symbols:
q)test: ([key1:"s"$()] col1:"s"$();col2:"s"$();col3:"s"$())
q)`test upsert(`key1`col1`col2`col3)!`$("999"; "693"; "943"; "249")
`test
This will keep them as strings:
q)test: ([key1:()] col1:();col2:();col3:())
q)`test upsert(`key1`col1`col2`col3)!("999"; "693"; "943"; "249")
`test
Have a look at the differences in the meta of the two.
HTH,
Sean

Related

split keyvalue pair data into new columns

My table looks something like this:
id | data
1 | A=1000 B=2000
2 | A=200 C=300
In kdb is there a way to normalize the data such that the final table is as follows:
id | data.1 | data.2
1 | A | 1000
1 | B | 2000
2 | A | 200
2 | C | 300
One option would be to make use of 0: and its key-value parsing functionality, documented at https://code.kx.com/q/ref/file-text/#key-value-pairs e.g.
q)ungroup delete data from {x,`data1`data2!"S= "0:x`data}'[t]
id data1 data2
---------------
1 A "1000"
1 B "2000"
2 A "200"
2 C "300"
Assuming you want data2 to be of long datatype (j), you can do:
update "J"$data2 from ungroup delete data from {x,`data1`data2!"S= "0:x`data}'[t]
You could use a combination of vs (vector from scalar), each-both ' and ungroup:
q)t:([]id:1 2;data:("A=1000 B=2000";"A=200 C=300"))
q)t
id data
------------------
1 "A=1000 B=2000"
2 "A=200 C=300"
q)select id, dataA:`$data[;0], dataB:"J"$data[;1] from
ungroup update data: "=" vs '' " " vs ' data from t
id dataA dataB
--------------
1 A 1000
1 B 2000
2 A 200
2 C 300
I wouldn't recommend naming the columns with . e.g. data.1
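For intuition, the parse-then-ungroup pipeline both answers use can be sketched in plain Python; this is only an analogue of what "S= "0: plus ungroup do in q, with the helper name `ungroup_kv` made up for the example:

```python
# Sketch of the parse-and-ungroup idea: each "A=1000 B=2000" string
# becomes one output row per key=value pair, with the id repeated.
def ungroup_kv(rows):
    out = []
    for row in rows:
        for pair in row["data"].split():      # split on whitespace
            k, v = pair.split("=")            # split each pair on "="
            out.append({"id": row["id"], "data1": k, "data2": int(v)})
    return out

t = [{"id": 1, "data": "A=1000 B=2000"},
     {"id": 2, "data": "A=200 C=300"}]
result = ungroup_kv(t)
# result[0] == {'id': 1, 'data1': 'A', 'data2': 1000}
```

The int() cast plays the role of the "J"$ cast in the q answers.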

How to regexp split to table regexed splitted table

I have a table with a combined string and I want to split it into its parts. I got these results from a query using a regexp split to table.
So far I have split this: 1:9,5:4,4:8,6:9,3:9,2:5,7:8,34:8,24:6
to this table:
campaign_skill
----------------
1:9
5:4
4:8
6:9
3:9
2:5
7:8
34:8
24:6
with this expression:
select *
from regexp_split_to_table((select user_skill from users where user_token = 'ded8ab43-efe2-4aea-894d-511ed3505261'), E'[\\s,]+') as campaign_skill
How can I split the actual results into a table like this:
campaign | skill
---------|------
1 | 9
5 | 4
4 | 8
6 | 9
3 | 9
2 | 5
7 | 8
34 | 8
24 | 6
You can use split_part() for that.
select split_part(t.campaign_skill, ':', 1) as campaign,
split_part(t.campaign_skill, ':', 2) as skill
from users u,
regexp_split_to_table(u.user_skill, E'[\\s,]+') as t(campaign_skill)
where u.user_token = 'ded8ab43-efe2-4aea-894d-511ed3505261';
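The same two-step split (first on commas/whitespace like regexp_split_to_table, then on ':' like split_part) can be sketched in Python, using the input string shown in the question:

```python
import re

# First split "1:9,5:4,..." on commas/whitespace into pair strings,
# then split each pair on ':' into (campaign, skill) tuples.
user_skill = "1:9,5:4,4:8,6:9,3:9,2:5,7:8,34:8,24:6"
pairs = re.split(r"[\s,]+", user_skill)          # like regexp_split_to_table
rows = [tuple(p.split(":")) for p in pairs]      # like split_part(..., 1) / (..., 2)
# rows[0] == ('1', '9'); rows[7] == ('34', '8')
```

Note split_part is 1-indexed in SQL, while Python's split gives a 0-indexed list.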

How to find duplicated columns with all values in spark dataframe?

I'm preprocessing my data (2000K+ rows) and want to count the duplicated columns in a Spark dataframe, for example:
id | col1 | col2 | col3 | col4 |
----+--------+-------+-------+-------+
1 | 3 | 999 | 4 | 999 |
2 | 2 | 888 | 5 | 888 |
3 | 1 | 777 | 6 | 777 |
In this case, col2's and col4's values are identical, which is what I'm interested in, so the count should increase by 1.
I tried toPandas(), transposing, and then dropDuplicates() in pyspark, but it's too slow.
Is there any function that could solve this?
Any ideas would be appreciated, thank you.
So you want to count the number of duplicate values based on the columns col2 and col4? This should do the trick below.
val dfWithDupCount = df.withColumn("isDup", when($"col2" === $"col4", 1).otherwise(0))
This will create a new dataframe with a flag column: if col2 equals col4, the value is 1, otherwise 0.
To find the total number of rows, all you need to do is do a group by based on isDup and count.
import org.apache.spark.sql.functions._
val grouped = dfWithDupCount.groupBy("isDup").agg(count("*"))
display(grouped)
Apologies if I misunderstood you. You could probably use the same solution if you were trying to match any of the columns together, but that would require nested when statements.
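For clarity, here is a pure-Python sketch of what "duplicated columns" means in the question: two columns whose values are equal in every row. Each column is held as a list, and column pairs are compared wholesale (the Spark answer above instead builds a per-row flag); this is an illustration of the goal, not Spark code:

```python
from itertools import combinations

# Columns from the example table, each as a full list of values.
data = {
    "col1": [3, 2, 1],
    "col2": [999, 888, 777],
    "col3": [4, 5, 6],
    "col4": [999, 888, 777],
}

# A pair of columns is "duplicated" when their value lists are identical.
dup_pairs = [(a, b) for a, b in combinations(data, 2) if data[a] == data[b]]
# dup_pairs == [('col2', 'col4')], so the duplicate count is 1
```

Comparing all pairs is O(n²) in the number of columns; hashing each column first would scale better, but the brute-force version shows the intent.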

Parameterize select query in unary kdb function

I'd like to be able to select rows in batches from a very large keyed table being stored remotely on disk. As a toy example to test my function I set up the following tables t and nt...
t:([sym:110?`A`aa`Abc`B`bb`Bac];px:110?10f;id:1+til 110)
nt:0#t
I select from the table only records that begin with the character "A", count the number of characters, divide the count by the number of rows I would like to fetch for each function call (10), and round that up to the nearest whole number...
aRec:select from t where sym like "A*"
counter:count aRec
divy:counter%10
divyUP:ceiling divy
Next I set an idx variable to 0 and write an if statement as the parameterized function. This checks if idx equals divyUP. If not, then it should select the first 10 rows of aRec, upsert those to the nt table, increment the function argument, x, by 10, and increment the idx variable by 1. Once the idx variable and divyUP are equal it should exit the function...
idx:0
batches:{[x]if[not idx=divyUP;batch::select[x 10]from aRec;`nt upsert batch;x+:10;idx+::1]}
However when I call the function it returns a type error...
q)batches 0
'type
[1] batches:{[x]if[not idx=divyUP;batch::select[x 10]from aRec;`nt upsert batch;x+:10;idx+::1]}
^
I've tried using it with sublist too, though I get the same result...
batches:{[x]if[not idx=divyUP;batch::x 10 sublist aRec;`nt upsert batch;x+:10;idx+::1]}
q)batches 0
'type
[1] batches:{[x]if[not idx=divyUP;batch::x 10 sublist aRec;`nt upsert batch;x+:10;idx+::1]}
^
However issuing either of those above commands outside of the function both return the expected results...
q)select[0 10] from aRec
sym| px id
---| ------------
A | 4.236121 1
A | 5.932252 3
Abc| 5.473628 5
A | 0.7014928 7
Abc| 3.503483 8
A | 8.254616 9
Abc| 4.328712 10
A | 5.435053 19
A | 1.014108 22
A | 1.492811 25
q)0 10 sublist aRec
sym| px id
---| ------------
A | 4.236121 1
A | 5.932252 3
Abc| 5.473628 5
A | 0.7014928 7
Abc| 3.503483 8
A | 8.254616 9
Abc| 4.328712 10
A | 5.435053 19
A | 1.014108 22
A | 1.492811 25
The issue is that select[] and sublist require a list as input, but your input is not a list. The reason is that when one of the items is a variable, the expression is no longer parsed as a simple list, meaning a blank (space) cannot be used to separate the values. In this case, a semicolon is required.
q) x:2
q) (1;x) / 1 2
Select command: change the input to (x;10) to make it work.
q) t:([]id:1 2 3; v: 3 4 5)
q) {select[(x;2)] from t} 1
id v
----
2 4
3 5
Another alternative is to use the virtual i (index) column:
q) {select from t where i within x + 0 2} 1
Sublist command: convert the left input of the sublist function to a list, (x;10).
q) {(x;2) sublist t}1
You can't use the select[] form with variable input like that; instead you can use a functional select, shown in https://code.kx.com/q4m3/9_Queries_q-sql/#912-functional-forms, where you pass the rows you want as the fifth argument.
Hope this helps!
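The looping logic the question is after (pull 10 rows at a time until the filtered set is exhausted) can be sketched outside q. Below is a minimal Python analogue under stated assumptions: a plain list stands in for aRec, and the running index is a local rather than a global:

```python
# Sketch of the batched fetch: pull rows in slices of batch_size until
# the source is exhausted, appending each slice to the destination.
def fetch_in_batches(source, batch_size=10):
    dest = []
    idx = 0
    while idx < len(source):
        # source[idx:idx+batch_size] plays the role of select[(idx;10)]
        # or (idx;10) sublist in the q version.
        dest.extend(source[idx:idx + batch_size])
        idx += batch_size
    return dest

a_rec = list(range(47))          # stand-in for the filtered aRec rows
nt = fetch_in_batches(a_rec)     # copies everything over in 5 batches
```

The ceiling division the question computes by hand (divyUP) falls out of the loop condition here: the last, shorter slice is handled automatically.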

PostgreSQL XOR - How to check if only 1 column is filled in?

How can I simulate a XOR function in PostgreSQL? Or, at least, I think this is a XOR-kind-of situation.
Lets say the data is as follows:
id | col1 | col2 | col3
---+------+------+------
1 | 1 | | 4
2 | | 5 | 4
3 | | 8 |
4 | 12 | 5 | 4
5 | | | 4
6 | 1 | |
7 | | 12 |
And I want to return one column for those rows where only one of the columns is filled in (ignore col3 for now).
Let's start with this example of 2 columns:
SELECT
id, COALESCE(col1, col2) AS col
FROM
my_table
WHERE
COALESCE(col1, col2) IS NOT NULL -- at least 1 is filled in
AND
(col1 IS NULL OR col2 IS NULL) -- at least 1 is empty
;
This works nicely and should result in:
id | col
---+----
1 | 1
3 | 8
6 | 1
7 | 12
But now, I would like to include col3 in a similar way. Like this:
id | col
---+----
1 | 1
3 | 8
5 | 4
6 | 1
7 | 12
How can this be done in a more generic way? Does Postgres support such a method?
I'm not able to find anything like it.
Rows with exactly 1 column filled in:
select * from my_table where
 (col1 is not null)::integer
+(col2 is not null)::integer
+(col3 is not null)::integer
=1
Rows with 1 or 2 filled in:
select * from my_table where
 (col1 is not null)::integer
+(col2 is not null)::integer
+(col3 is not null)::integer
between 1 and 2
The case expression might be your friend here; the min aggregate function doesn't affect the result.
select id, min(coalesce(col1,col2,col3))
from my_table
group by 1
having sum(case when col1 is null then 0 else 1 end+
case when col2 is null then 0 else 1 end+
case when col3 is null then 0 else 1 end)=1
[Edit]
Well, I found a better answer without using aggregate functions. It's still based on case, but I think it is simpler.
select id, coalesce(col1,col2,col3)
from my_table
where (case when col1 is null then 0 else 1 end+
case when col2 is null then 0 else 1 end+
case when col3 is null then 0 else 1 end)=1
How about
select coalesce(col1, col2, col3)
from my_table
where array_length(array_remove(array[col1, col2, col3], null), 1) = 1
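All of the answers above reduce to the same rule: count the non-NULL values per row and keep rows where that count is exactly 1, returning the single value (what COALESCE picks out). A Python sketch of that rule, with None standing in for NULL and rows taken from the question:

```python
# Keep rows where exactly one of col1..col3 is filled in, and return
# that single value; otherwise return None (the row is excluded).
rows = [
    {"id": 1, "col1": 1,    "col2": None, "col3": 4},
    {"id": 3, "col1": None, "col2": 8,    "col3": None},
    {"id": 5, "col1": None, "col2": None, "col3": 4},
]

def xor_value(row):
    vals = [row[c] for c in ("col1", "col2", "col3") if row[c] is not None]
    return vals[0] if len(vals) == 1 else None

result = {r["id"]: xor_value(r) for r in rows}
# result == {1: None, 3: 8, 5: 4}: id 1 has two values filled, so it
# fails the exactly-one test; ids 3 and 5 pass.
```

This is exactly what the array_remove/array_length answer expresses in SQL: strip the NULLs, check the remaining length is 1.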