Find the count of word pairs in kdb+ - kdb

I have a file which contain multiple rows of item codes as follows. There are 1 million rows similar to these
1. 123,134,256,345,789.....
2. 123,256,345,678,789......
.
.
I would like to find the count of all the pair of words/items per row in the file using q in kdb+. i.e. any two pair of words that occur in the same row can be considered a word pair.
e.g:
(123,134),(123,256),(134,256), (123,345) (123,789), (134,789) are some of the word pairs in row 1
(123,256),(123,345),(123,345),(678,789),(345,789) are some of the word pairs in row 2
word/item pair count
`123,134----1
123,256---2
345,789---2`
I am reading the file using read0 and have been able to convert each line into list using vs and using count each group to count the number of words, but now I want to find the count of all the word pairs per row in the file.
Thanks in advance for your help

I'm not 100% I understand your definition of a word-pair. Perhaps you could expand a little if my logic doesn't match what you were looking for.
In the example below, I've created a 5x5 matrice of symbols for testing - selected distinct pairs of values from each row, and then checked how many rows each of these appeared in, in total.
Please double check with your own results.
q)test:5 cut`$string 25?5
q)test
2 0 1 0 0
2 4 4 2 0
1 0 0 3 4
2 1 1 4 4
3 0 3 4 0
q)count each group raze {l[where(count'[l:distinct distinct each asc'[x cross x:distinct x]])>1]} each test
0 2| 2
1 2| 2
0 1| 2
2 4| 2
0 4| 3
1 3| 1
1 4| 2
0 3| 2
3 4| 2

To add some other cases to Matthew's answer above, if what you want is to break the list down into pairs in this way:
l:"a,b,c,d,e,f,g"
becomes
"a,b"
"b,c"
"c,d"
"d,e"
"e,f"
"f,g"
so only taking valid pairs, you could use something like this:
f:{count each group b flip 0 1+\:til 1+count[b:","vs x]-1}
q)f l
,"a" ,"b"| 1
,"b" ,"c"| 1
,"c" ,"d"| 1
,"d" ,"e"| 1
,"e" ,"f"| 1
,"f" ,"g"| 1
where we're splitting the input list on ".", then using indexing to get a list of each element and the element directly to its right, then grouping the resultant list of pairs to count the distinct pairs. If you want to split it so l becomes
"a,b"
"c,d"
"e,f"
then you could use this:
g:{count each group b flip 0 1+\:2*til count[b:","vs x]div 2}
q)g l
,"a" ,"b"| 1
,"c" ,"d"| 1
,"e" ,"f"| 1
Which uses a similar approach, starting with the even-positioned elements and getting those to their right, and repeating as above.
You can easily apply these to the rows read with read0:
r:read0`:file.txt
f each r
will output a dictionary of the counts of each pair for each row, and this can be summed to give the total count of each word pair with each method throughout the file.
Hope this helps - it's still not clear what you mean by pairs, so if neither my answer not Matthew's is of some use, you could edit in a more complete explanation of what you'd like and we can help with that.

If you want to consider all possible combinations of 2 pairs in each row then this may be of help. The following function can be used to give distinct combinations, where x is the size of the list and y is the length of the combination:
q)comb:{$[x=y;enlist til x;1=y;flip enlist til x;.z.s[x;y],.z.s[x;y-1],'x-:1]}
q)comb[3;2]
0 1
0 2
1 2
From here we can index into each list to get the pairs, then raze to give a single list of all pairs, group to get the indices where each pair occurs and then count the number of indices in each group:
q)a
123 134 256 345 789
123 256 345 678 789
q)count each group raze{x comb[count x;2]}'[a]
123 134| 1
123 256| 2
134 256| 1
...
345 789| 2
...

Related

Processing each row in kdb table and appending arbitrary results in a new table

I have a table
t:([]a:`a`b`c;b:1 2 3;c:`x`y`z)
I would like to iterate and process each row.
The thing is that the processing logic for each row may result in arbitrary lines of data, after the full iteration the result maybe as such e.g.
results:([]a:`a1`b1`b2`b3`c1`c2;x:1 2 2 2 3 3)
I have the following idea so far but doesn't seem to work:
uj { // some processing function } each t
But how does one return arbitrary number of data append the results into a new table?
Assuming you are using something from the table entries to indicate your arbitrary value, you can use a dictionary to indicate a number (or a function) which can be used to apply these values.
In this example, I use the c column of the original table to indicate the number of rows to return (and the number from 1 to count to).
As each entry of the table is a dictionary, I can index using the column names to get the values and build a new table.
I also use raze to join each of the results together, as they will each have the same schema.
raze {[x]
d:`x`y`z!1 3 2;
([]a:((),`$string[x[`a]],/:string 1+til d[x[`c]]);x:((),d[x[`c]])#x[`b])
} each t
Not sure if this is what you want, but you can try something like this:
ungroup select a:`${y,/:x}[string b]'[string a],b from t
Or you can use accumulators if you need the result of the previous row calculations like this:
{y[`b]+:last[x]`b;x,y}/[t;t]
If your processing function is outputting tables that conform, just raze should suffice:
raze {y#enlist x}'[t;1 3 2]
a b c
-----
a 1 x
b 2 y
b 2 y
b 2 y
c 3 z
c 3 z
Otherwise use (uj/)
(uj/) {y#enlist x}'[t;1 3 2]
a b c
-----
a 1 x
b 2 y
b 2 y
b 2 y
c 3 z
c 3 z
Your best answer will depend very much on how you want to use the results computed from each row of t. It might suit you to normalise t; it might not. The key point here:
A table cell can be any q data structure.
The minimum you can do in this regard is to store the result of your processing function in a new column.
Below, an arbitrary binary function f returns its result as a dictionary.
q)f:{n:1+rand 3;(`$string[x],/:"123" til n)!n#y}
q)f [`a;2]
a1| 2
a2| 2
q)update d:a f'b from t
a b c d
---------------------
a 1 x `a1`a2`a3!1 1 1
b 2 y (,`b1)!,2
c 3 z `c1`c2!3 3
But its result could be any q data structure.
You were considering a unary processing function:
q)pf:{#[x;`d;:;] f . x`a`b}
q)pf each t
a b c d
---------------------
a 1 x `a1`a2`a3!1 1 1
b 2 y `b1`b2!2 2
c 3 z `c1`c2`c3!3 3 3
You might find other suggestions at KX Community.
If I understand correctly your question you need something like this :
(uj/){}each t
Check this bit :
(uj/)enlist[t],{x:update x:i from?[rand[20]#enlist x;();0b;{x!x}rand[4]#cols[x]];{(x;![x;();0b;(enlist`a)!enlist($;enlist`;((';{raze string(x;y)});`a;`i))])[y~`a]}/[x;cols x]}each t
This part :
x:update x:i from
// functional form of a function that takes random rows/columns
?[rand[20]#enlist x;();0b;{x!x}rand[4]#cols[x]];
// some for of if-else and an update to generate column a (not bullet proof)
{(x;![x;();0b;(enlist`a)!enlist($;enlist`;((';{raze string(x;y)});`a;`i))])[y~`a]}/[x;cols x]
Basically the above gives something like :
q){x:update x:i from?[rand[20]#enlist x;();0b;{x!x}rand[4]#cols[x]];{(x;![x;();0b;(enlist`a)!enlist($;enlist`;((';{raze string(x;y)});`a;`i))])[y~`a]}/[x;cols x]}each t
+`a`b`c`x!(`a0`a1`a2`a3`a4`a5`a6`a7;1 1 1 1 1 1 1 1;`x`x`x`x`x`x`x`x;0 1 2 3 ..
+`a`x!(`a0`a1`a2`a3`a4`a5;0 1 2 3 4 5)
+`a`b`c`x!(`a0`a1`a2;1 1 1;`x`x`x;0 1 2)
+`a`b`c`x!(`a0`a1`a2`a3`a4`a5`a6`a7`a8`a9`a10`a11;1 1 1 1 1 1 1 1 1 1 1 1;`x`..
or taking the first one :
q)first{x:update x:i from?[rand[20]#enlist x;();0b;{x!x}rand[4]#cols[x]];{(x;![x;();0b;(enlist`a)!enlist($;enlist`;((';{raze string(x;y)});`a;`i))])[y~`a]}/[x;cols x]}each t
a b x
--------
a0 1 0
a1 1 1
a2 1 2
a3 1 3
a4 1 4
a5 1 5
a6 1 6
a7 1 7
a8 1 8
a9 1 9
a10 1 10
You can do
(uj/)enist[t],{ // some function }each t
to get what you want. Drop the enlist[t] if you don't want the table you start with in your result
Hope this helps.

iter function over table as input - does order matter and why?

I'm totally new to kdb+/q, and I found this problem below quite confusing to me. Just to simplify, we say we have this one line function f returns an one-row table with preset values, and I want to run this function over a combination of inputs x and y, like dates (list) and metas (table, with columns like orderid, px, size etc).
Now, I listed two ways to do so below. Since the function f doesn't really use any of the input, I would suppose the order of x and y doesn't matter since the difference is just which one is passed to f before another and only when two inputs passed would f starts to operate.
But why I got error in the second way, i.e. table follows the list?
Any idea and explanation is much appreciated.
f: {[x;y]
([] m: enlist `M; n: enlist `N)
};
x: 1 2 3;
y: ([] a: 4 5 6; b: 7 8 9);
raze raze f ' [y] ' [x]; // this one works
raze raze f ' [x] ' [y]; // this one gives ERROR: length Explanation: Arguments do not conform
What you're doing is effectively equivalent to:
f:{y;1};
q)(f'[([]a:1 2 3;b:4 5 3)])#/:1 2 3
1 1 1
1 1 1
1 1 1
(using extra brackets to make it clear the order of operation).
In this situation each one reduces to
q)f'[([]a:1 2 3;b:4 5 3);1]
1 1 1
q)f'[([]a:1 2 3;b:4 5 3);2]
1 1 1
q)f'[([]a:1 2 3;b:4 5 3);3]
1 1 1
The "length" is ok here because the "y" values are atomic and kdb automatically expands those atomic values to match the length of the table. In order words, kdb treats these as:
q)f'[([]a:1 2 3;b:4 5 3);1 1 1]
1 1 1
q)f'[([]a:1 2 3;b:4 5 3);2 2 2]
1 1 1
q)f'[([]a:1 2 3;b:4 5 3);3 3 3]
1 1 1
However, when you change the order it becomes:
(f'[1 2 3])#/:([]a:1 2 3;b:4 5 3)
which is equivalent to:
f'[1 2 3;`a`b!1 4]
f'[1 2 3;`a`b!2 5]
f'[1 2 3;`a`b!3 3]
but now you do have a length problem because the dictionaries in the "y" variable are not atomic, they have length 2. Which doesn't match the length of the list (3).
You don’t say so but it looks like you are studying how to iterate a binary function f over list arguments, which has brought you to projecting f' onto x, which gives you a unary f'[x] that you then iterate over y. If that’s how we got here, what you want might be as simple as x f'y, which iterates f over corresponding items in x and y.
However, you mention combinations of inputs. If you want effectively a Cartesian product based on f, then combine the iterators Each Right and Each Left to get x f:/:\:y.
That returns a matrix. You have razed your result. Depending on your argument types, you might be able to use cross to generate all the argument pair combinations, and Apply Each .' to apply f to each pair:
f .' x cross y

How do I convert a dictionary of dictionaries into a table?

I've got a dictionary of dictionaries:
`1`2!((`a`b`c!(1 2 3));(`a`b`c!(4 5 6)))
| a b c
-| -----
1| 1 2 3
2| 4 5 6
I'm trying to work out how to turn it into a table that looks like:
1 a 1
1 b 2
1 c 3
2 a 4
2 b 5
2 c 6
What's the easiest/'right' way to achieve this in KDB?
Not sure if this is the shortest or best way, but my solution is:
ungroup flip`c1`c2`c3!
{(key x;value key each x;value value each x)}
`1`2!((`a`b`c!(1 2 3));(`a`b`c!(4 5 6)))
Which gives expected table with column names c1, c2, c3
What you're essentially trying to do is to "unpivot" - see the official pivot page here: https://code.kx.com/q/kb/pivoting-tables/
Unfortunately that page doesn't give a function for unpivoting as it isn't trivial and it's hard to have a general solution for it, but if you search the Kx/K4/community archives for "unpivot" you'll find some examples of unpivot functions, for example this one from Aaron Davies:
unpiv:{[t;k;p;v;f] ?[raze?[t;();0b;{x!x}k],'/:(f C){![z;();0b;x!enlist each (),y]}[p]'v xcol't{?[x;();0b;y!y,:()]}/:C:(cols t)except k;enlist(not;(.q.each;.q.all;(null;v)));0b;()]};
Using this, your problem (after a little tweak to the input) becomes:
q)t:([]k:`1`2)!((`a`b`c!(1 2 3));(`a`b`c!(4 5 6)));
q)`k xasc unpiv[t;1#`k;1#`p;`v;::]
k v p
-----
1 1 a
1 2 b
1 3 c
2 4 a
2 5 b
2 6 c
This solution is probably more complicated than it needs to be for your use case as it tries to solve for the general case of unpivoting.
Just an update to this, I solved this problem a different way to the selected answer.
In the end, I:
Converted each row into a table with one row in it and all the columns I needed.
Joined all the tables together.

In MATLAB, how do you compare the difference between two rows

An input is a data file with ID number of multiple occurrences. (e.g ID# 123) Now what I want is to gather all rows with same ID numbers, compare column by column, and see if what column do they have difference.
Now after that I will move on to the next ID number with multiple occurrences (e.g. ID#456) and do the same.
I repeat everything until I finish with the last ID number of multiple occurrence.
So my output will be like this,
(1)The column headers will be the same.
(2)The ID# column will have unique entries. Only the ID numbers which have multiple occurrences will be included in this column.
(3)I will add an extra column whose entry contains the number of occurrences the ID number occurred. Example, if it occurred 5 times, the entry is 5.
(4)For, the other columns, if the column has same entries for all the occurrences of a certain ID number, we write "0", else "1". E.g. if for ID#123, the entries in column "Section" is the same for all the occurrences of ID#123, then for our output table, the column "Section" will contain the value of "0". If there is any difference, the output will be "1"
Your question is not very clear but I think you want to count the number of unique values and the number of times the unique rows occur. The table below might demonstrate this.
+-------+---------+-----+----------+---------------------+
| ID | Column1 | ... | Column n | num of occurrencies |
+-------+---------+-----+----------+---------------------+
This can be done with unique and accumarray
In the example below, A is the original data and output is your desired output. The first n columns of output are your unique data and the last column contains the number of times this row occurred. The row [1 5] occurred twice, [2 3] once etc.
A = [1 5
1 5
2 3
2 4
3 9];
[k,~,idx]= unique(A,'rows');
n = accumarray(idx(:),1);
output = [k n]
output =
1 5 2
2 3 1
2 4 1
3 9 1

Reshape (#) doesn't work with a dynamic argument

To form a matrix consisting of identical rows, one could use
x:1 2 3
2 3#x,x
which produces (1 2 3i;1 2 3i) as expected. However, attempting to generalise this thus:
2 (count x)#x,x
produces a type error although the types are equal:
(type 3) ~ type count x
returns 1b. Why doesn't this work?
The following should work.
q)(2;count x)#x,x
1 2 3
1 2 3
If you look at the parse tree of both your statements you can see that the second is evaluated differently. In the second only the result of count is passed as an argument to #.
q)parse"2 3#x,x"
#
2 3
(,;`x;`x)
q)parse"2 (count x)#x,x"
2
(#;(#:;`x);(,;`x;`x))
If you're looking to build matrices with identical rows you might be better off using
rownum#enlist x
q)x:100000?100
q)\ts do[100;v1:5 100000#x,x]
157 5767696j
q)\ts do[100;v2:5#enlist x]
0 992j
q)v1~v2
1b
I for one find this more natural (and its faster!)