KDB: transpose a dual keyed table to a matrix

How can I transpose a dual-keyed table (keyed on x and y)
x y | z
- - | -
1 a | data
2 a | data
3 a | data
4 a | data
5 a | data
1 b | data
2 b | data
3 b | data
4 b | data
5 b | data
1 c | data
2 c | data
3 c | data
4 c | data
5 c | data
to a matrix-style structure:
y\x| 1 2 3 4 5
--------------
a  | data ...
b  | data ...
c  | data ...
I am not sure how to start. I have a rough idea of using flip and group twice.
Can anyone help?

I believe you want a pivot table; see http://code.kx.com/q/cookbook/pivoting-tables/
Nick Psaris has a nice pivot function on his GitHub, from his book Q Tips:
pivot:{[t]
 u:`$string asc distinct last f:flip key t;  / new column names: distinct values of the last key column
 pf:{x#(`$string y)!z};                      / one output row: take columns x from the dict mapping keys to values
 p:?[t;();g!g:-1_ k;(pf;`u;last k:key f;last key flip value t)];  / functional select, grouped by the remaining key columns
 p}
q)t:2!ungroup ([]x:1+til 5;y:5#enlist `a`b`c;z:`data)
q)pivot 2!`y`x`z xcols 0!t
y| 1 2 3 4 5
-| ------------------------
a| data data data data data
b| data data data data data
c| data data data data data

Related

Spark PIVOT performance is very slow on high-volume data

I have a dataframe with 3 columns and 20,000 rows. I need to convert all 20,000 transid values into columns.
Table macro:
prodid | transid | flag
-----------------------
A      | 1       | 1
B      | 2       | 1
C      | 3       | 1
and so on.
Expected output, with up to 20,000 columns:
prodid | 1 | 2 | 3
------------------
A      | 1 | 1 | 1
B      | 1 | 1 | 1
C      | 1 | 1 | 1
I have tried the PIVOT/transpose function, but it takes too long on high-volume data: converting 20,000 rows to columns takes around 10 hours. E.g.:
val array =a1.select("trans_id").distinct.collect.map(x => x.getString(0)).toSeq
val a2=a1.groupBy("prodid").pivot("trans_id",array).sum("flag")
When I used pivot on 200-300 rows it was fast, but as the number of rows grows, PIVOT does not keep up.
Can anyone please help me find a solution? Is there a method that avoids the PIVOT function, since PIVOT seems good only for low-volume conversion? How should I deal with high-volume data?
I need this conversion for matrix multiplication: my input looks like the table below, and the final result will be a matrix product.
|col1|col2|col3|col4|
|----|----|----|----|
|1 | 0 | 1 | 0 |
|0 | 1 | 0 | 0 |
|1 | 1 | 1 | 1 |
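Since the end goal is matrix multiplication, one way to sidestep the pivot entirely is to treat each (prodid, trans_id, flag) row as an entry of a sparse distributed matrix. Below is a minimal PySpark sketch of that idea; it assumes trans_id is already a dense integer index, and the prodid-to-row-index mapping and all other names are illustrative, not the asker's actual code:
# Hypothetical sketch: skip the pivot by building a sparse matrix instead.
from pyspark.sql import SparkSession
from pyspark.mllib.linalg.distributed import CoordinateMatrix, MatrixEntry

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [("A", 1, 1), ("B", 2, 1), ("C", 3, 1)],
    ["prodid", "trans_id", "flag"])

# Map string product ids to dense integer row indices on the driver.
row_ids = {r.prodid: i for i, r in
           enumerate(df.select("prodid").distinct().collect())}

# One MatrixEntry per input row: (row index, column index, value).
entries = df.rdd.map(
    lambda r: MatrixEntry(row_ids[r.prodid], r.trans_id, float(r.flag)))
m = CoordinateMatrix(entries)

# BlockMatrix supports distributed multiplication, so the 20,000-column
# pivoted table is never materialized.
product = m.toBlockMatrix().multiply(m.transpose().toBlockMatrix())
print(product.toLocalMatrix())  # only sensible for small demos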

Split key-value pair data into new columns

My table looks something like this:
id | data
1 | A=1000 B=2000
2 | A=200 C=300
In kdb is there a way to normalize the data such that the final table is as follows:
id | data.1 | data.2
1 | A | 1000
1 | B | 2000
2 | A | 200
2 | C | 300
One option would be to make use of 0: and its key-value parsing functionality, documented at https://code.kx.com/q/ref/file-text/#key-value-pairs, e.g.
q)ungroup delete data from {x,`data1`data2!"S= "0:x`data}'[t]
id data1 data2
---------------
1 A "1000"
1 B "2000"
2 A "200"
2 C "300"
Assuming you want data2 to be a long datatype (j), you can do:
update "J"$data2 from ungroup delete data from {x,`data1`data2!"S= "0:x`data}'[t]
You could use a combination of vs (vector from scalar), each-both ('), and ungroup:
q)t:([]id:1 2;data:("A=1000 B=2000";"A=200 C=300"))
q)t
id data
------------------
1 "A=1000 B=2000"
2 "A=200 C=300"
q)select id, dataA:`$data[;0], dataB:"J"$data[;1] from
    ungroup update data: "=" vs '' " " vs ' data from t
id dataA dataB
--------------
1 A 1000
1 B 2000
2 A 200
2 C 300
As an aside, I wouldn't recommend naming the columns with a dot (e.g. data.1).

Modeling mutual exclusion & correlation in a relational database schema

If A, B & C are attributes with their values as,
A -> {1}
B -> {2,5,9}
C -> {11,12}
A & B are correlated (A cannot exist without B).
When A = 1, B can be 5 or 9, B cannot be 2.
B & C are correlated, when B is 5, C can be 11, C cannot be 12.
Ex: so when C = 11, then B = 5 and A = 1.
How do I model this relationship in a relational schema, or is there a better way to represent it?
What I have so far as the attribute table:
ID | Attribute | value
----------------------
1 | A | 1
2 | B | 2
3 | B | 5
4 | B | 9
5 | C | 11
6 | C | 12
And the correlation table, where ID1 & ID2 are foreign keys to the attribute table and together form a composite primary key:
ID1 | ID2
---------
1 | 3
1 | 4
3   | 5
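To make the design concrete, here is a minimal sketch in Python with SQLite; the table and column names follow the question (note the last correlation row uses id 5, the attribute id of C = 11, since ID2 references the attribute table), while the types and the reachability query are assumptions:
# Hypothetical sketch of the two-table design.
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("PRAGMA foreign_keys = ON")
con.executescript("""
CREATE TABLE attribute (
    id        INTEGER PRIMARY KEY,
    attribute TEXT NOT NULL,
    value     INTEGER NOT NULL
);
CREATE TABLE correlation (
    id1 INTEGER NOT NULL REFERENCES attribute(id),
    id2 INTEGER NOT NULL REFERENCES attribute(id),
    PRIMARY KEY (id1, id2)  -- composite key: one row per allowed pair
);
""")
con.executemany("INSERT INTO attribute VALUES (?, ?, ?)",
                [(1, "A", 1), (2, "B", 2), (3, "B", 5),
                 (4, "B", 9), (5, "C", 11), (6, "C", 12)])
# Allowed pairs: A=1 with B=5 or B=9; B=5 with C=11.
con.executemany("INSERT INTO correlation VALUES (?, ?)",
                [(1, 3), (1, 4), (3, 5)])

# Walk the correlation table: which C values are reachable from A = 1?
print(con.execute("""
    SELECT c.attribute, c.value
    FROM correlation ab
    JOIN correlation bc ON bc.id1 = ab.id2
    JOIN attribute  c   ON c.id   = bc.id2
    WHERE ab.id1 = 1
""").fetchall())  # [('C', 11)]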

Add a key column for every n rows in a PySpark DataFrame

I have a dataframe like the one shown below.
id | run_id
--------------
4 | 12345
6 | 12567
10 | 12890
13 | 12450
I wish to add a new column, say key, that will have the value 1 for the first n rows and 2 for the next n rows. The result will be like:
id | run_id | key
----------------------
4 | 12345 | 1
6 | 12567 | 1
10 | 12890 | 2
13 | 12450 | 2
Is it possible to do this with PySpark? Thanks in advance for the help.
Here is one way to do it using zipWithIndex:
# sample rdd
rdd = sc.parallelize([[4, 12345], [6, 12567], [10, 12890], [13, 12450]])
# group size for key
n = 2
# add a row number, then label in batches of size n
rdd = rdd.zipWithIndex().map(lambda pair: pair[0] + [pair[1] // n + 1])
# convert to dataframe
df = rdd.toDF(schema=['id', 'run_id', 'key'])
df.show(4)
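A DataFrame-only variant is also possible with a window function. This is a sketch under the assumption that the incoming order is the order the keys should follow; a real job needs an explicit ordering column, since DataFrames have no guaranteed row order, and a global window like this pulls all rows into one partition:
# Hypothetical sketch: batch labels via row_number over a global window.
from pyspark.sql import SparkSession, Window, functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [(4, 12345), (6, 12567), (10, 12890), (13, 12450)], ["id", "run_id"])

n = 2
w = Window.orderBy(F.monotonically_increasing_id())
# (row_number - 1) // n + 1 labels rows 1,1,2,2,... in batches of n.
df = df.withColumn("key", ((F.row_number().over(w) - 1) / n).cast("int") + 1)
df.show()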

PostgreSQL: find similar word groups

I have a table1 containing a column A, where ~100,000 strings (varchar) are stored. Each string contains multiple words separated by spaces, and the strings vary in length, e.g. one string may consist of 3 words while another contains 7.
Then I have a column B stored in a second table2, which contains only 100 strings in the same format: multiple words per string, separated by spaces.
The goal is to measure how well each record of column B matches (possibly several) records of column A, based on the words they share. The result should also carry a ranking. I was thinking of using full-text search in a loop, but I don't know how to do this, or whether there is a proper way to achieve it.
I don't know if you can turn the table into a dictionary to use full-text search for ranking here. But you can query it with some primitive ranking quite easily, e.g.:
t=# with a(a) as (values('a b c'),('a c d'),('b e f'),('r b t'),('q w'))
, b(i,b) as (values(1,'a b'), (2,'e'), (3,'b'))
, p as (select unnest(string_to_array(b.b,' ')) arr,i from b)
select a phrases,arr match_words,count(1) over (partition by arr) words_in_matches, count(1) over (partition by i) matches,i from a left join p on a.a like '%'||arr||'%';
phrases | match_words | words_in_matches | matches | i
---------+-------------+------------------+---------+---
r b t | b | 6 | 5 | 1
a b c | b | 6 | 5 | 1
b e f | b | 6 | 5 | 1
a b c | a | 2 | 5 | 1
a c d | a | 2 | 5 | 1
b e f | e | 1 | 1 | 2
r b t | b | 6 | 3 | 3
a b c | b | 6 | 3 | 3
b e f | b | 6 | 3 | 3
q w | | 1 | 1 |
(10 rows)
phrases: rows from your big table.
match_words: tokens from your small table (split on spaces).
words_in_matches: how many matched rows share that token.
matches: how many matching rows that small-table phrase produced.
i: the index of the phrase from the small table.
So you can order by the third or fourth column to get some sort of ranking.
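If you do want genuine full-text ranking rather than LIKE matching, here is a rough sketch from Python with psycopg2; the table1/table2/A/B names come from the question, while the connection string and the OR-ing of tokens are assumptions:
# Hypothetical sketch: rank table1 strings against each table2 string
# with PostgreSQL full-text search.
import psycopg2

conn = psycopg2.connect("dbname=mydb")  # hypothetical connection string
cur = conn.cursor()
# 'simple' config = no stemming, just space-separated tokens; replacing
# spaces with ' | ' ORs the tokens so partial matches still qualify.
cur.execute("""
    SELECT b.B, a.A,
           ts_rank(to_tsvector('simple', a.A),
                   to_tsquery('simple', replace(b.B, ' ', ' | '))) AS rank
    FROM table2 b
    JOIN table1 a
      ON to_tsvector('simple', a.A)
         @@ to_tsquery('simple', replace(b.B, ' ', ' | '))
    ORDER BY b.B, rank DESC
""")
for row in cur.fetchall():
    print(row)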