Split key-value pair data into new columns - kdb

My table looks something like this:
id | data
1 | A=1000 B=2000
2 | A=200 C=300
In kdb is there a way to normalize the data such that the final table is as follows:
id | data.1 | data.2
1 | A | 1000
1 | B | 2000
2 | A | 200
2 | C | 300

One option would be to make use of 0: and its key-value parsing functionality, documented here: https://code.kx.com/q/ref/file-text/#key-value-pairs e.g.
q)ungroup delete data from {x,`data1`data2!"S= "0:x`data}'[t]
id data1 data2
---------------
1  A     "1000"
1  B     "2000"
2  A     "200"
2  C     "300"
Assuming you want data2 to be of long datatype (j), you can do:
update "J"$data2 from ungroup delete data from {x,`data1`data2!"S= "0:x`data}'[t]

You could use a combination of vs (vector from scalar), each-both ' and ungroup:
q)t:([]id:1 2;data:("A=1000 B=2000";"A=200 C=300"))
q)t
id data
------------------
1  "A=1000 B=2000"
2  "A=200 C=300"
q)select id, dataA:`$data[;0], dataB:"J"$data[;1] from
ungroup update data: "=" vs '' " " vs ' data from t
id dataA dataB
--------------
1  A     1000
1  B     2000
2  A     200
2  C     300
I wouldn't recommend naming the columns with a dot, e.g. data.1.
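The reason, as a small sketch with a hypothetical table t2: in qSQL the dot is used for namespace and foreign-key resolution, so a dotted column name cannot be referenced directly in a select and has to be addressed through its symbol, for example with a functional select.
q)t2:flip (`id;`$"data.1";`$"data.2")!(1 1;`A`B;1000 2000)
q)/ "select data.1 from t2" would not pick up the column as intended;
q)/ build the name with `$ and use the functional form instead:
q)?[t2;();0b;enlist[`val]!enlist `$"data.1"]
val
---
A
B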


Take new columns as output table - KDB

I have a query, which runs on a frequent basis, that returns a table of results. Each new run's table will contain the results of the old table as well, but I only want to take whatever is new in the most recent run of the new table and send that as an email. I already have the line for the email and trade data, but I just need a way to be able to:
display the results of the new table to be emailed
save the complete results of the new table to be used in the next run of the query
e.g.
Old results: tbl
| idx | name | age |
| 0 | Tom | 30 |
| 1 | Jerry | 25 |
| 2 | Bob | 30 |
| 3 | Ken | 45 |
New results: tbl
| idx | name | age |
| 0 | Tom | 30 |
| 1 | Jerry | 25 |
| 2 | Bob | 30 |
| 3 | Ken | 45 |
| 4 | Sam | 40 |
output required:
| 4 | Sam | 40 |
and then save the New results to be used in the next run
Thanks! :)
If the only change between runs is that records are appended onto the new table, you could just keep a variable denoting the last index seen and then select only those rows where idx is larger than that.
If the indexes are always increasing, this could be achieved using a query like
lastidx:exec last idx from tbl
select from tbl where idx>lastidx
If the idx values don't always increase monotonically, you could keep a count of the number of rows instead and only select rows at or beyond that count:
lasti:count tbl
select from tbl where i>=lasti
This doesn't require saving the whole table in memory for use in the next iteration.
E.g. to start with, the old table had 4 rows, so lasti = 4:
q)tbl
idx name  age
-------------
0   Tom   30
1   Jerry 25
2   Bob   30
3   Ken   45
q)lasti
4
The new table comes in and running the command selects the new row
q)tbl
idx name  age
-------------
0   Tom   30
1   Jerry 25
2   Bob   30
3   Ken   45
4   Sam   40
q)select from tbl where i>=lasti
idx name age
------------
4   Sam  40
lasti can then be updated to reflect the new count
q)lasti:count tbl
q)lasti
5
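If this check runs on a schedule, the bookkeeping can be wrapped up in a small function (a sketch; the name getNew and initialising lasti to 0 before the first run are assumptions, not part of the original setup):
q)lasti:0
q)getNew:{[] new:select from tbl where i>=lasti; lasti::count tbl; new}
q)getNew[]   / each call returns only the rows appended since the previous call
This avoids keeping a copy of the whole table between runs.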
One way you can get this done, assuming idx is the unique key:
q)old:([] idx:0 1 2 3; name:`T`J`B`K; age: 30 25 30 45)
q)new:old,enlist `idx`name`age!(4; `S;40) //new output from your query
q)out:()
q)if[0<count i:new[`idx] except old[`idx] ; out:new i ; old:new]
q)out
idx name age
------------
4   S    40
Another way, if your new records are always appended after the old records:
q)old:([] idx:0 1 2 3; name:`T`J`B`K; age: 30 25 30 45)
q)i:count old
q)new:old,enlist `idx`name`age!(4; `S;40) //new output from your query
q)out:()
q)if[i<c:count new ; out:(i-c)#new ; old:new; i:c]
q)out
idx name age
------------
4   S    40

Add a key element for n rows in PySpark Dataframe

I have a dataframe like the one shown below.
id | run_id
--------------
4 | 12345
6 | 12567
10 | 12890
13 | 12450
I wish to add a new column, say key, that will have the value 1 for the first n rows and 2 for the next n rows. The result will be like:
id | run_id | key
----------------------
4 | 12345 | 1
6 | 12567 | 1
10 | 12890 | 2
13 | 12450 | 2
Is it possible to do the same with PySpark? Thanks in advance for the help.
Here is one way to do it using zipWithIndex:
# sample rdd
rdd=sc.parallelize([[4,12345], [6,12567], [10,12890], [13,12450]])
# group size for key
n=2
# add rownumber and then label in batches of size n
rdd=rdd.zipWithIndex().map(lambda row: row[0]+[int(row[1]/n)+1])
# convert to dataframe
df=rdd.toDF(schema=['id', 'run_id', 'key'])
df.show(4)
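If you would rather stay in the DataFrame API than drop to RDDs, a window row number can be bucketed the same way. This is only a sketch: it assumes ordering by id reflects the row order you want, and df_keyed is just an illustrative name.
from pyspark.sql import SparkSession, functions as F, Window

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(4, 12345), (6, 12567), (10, 12890), (13, 12450)],
                           ['id', 'run_id'])
# group size for key
n = 2
# a window with no partitioning pulls all rows onto one partition;
# fine for small data, worth keeping in mind for large data
w = Window.orderBy('id')
df_keyed = (df.withColumn('rownum', F.row_number().over(w))
              .withColumn('key', ((F.col('rownum') - 1) / n).cast('int') + 1)
              .drop('rownum'))
df_keyed.show()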

postgresql find similar word groups

I have a table1 containing a column A, where ~100,000 strings (varchar) are stored. Unfortunately, each string has multiple words which are separated by spaces. Further, they have different lengths, i.e. one string can consist of 3 words while another string contains 7 words.
Then I have a column B stored in a second table2, which contains only 100 strings in the same manner. Hence, multiple words per string, separated by spaces.
The goal is to determine how likely a record of column B matches (possibly multiple) records of column A based on the words. The result should also have a ranking. I was thinking of using full text search in a loop, but I don't know how to do this, or whether there is a proper way to achieve this.
I don't know if you can turn the table into a dictionary to use full text search for ranking here. But you can query it with some primitive ranking quite easily, e.g.:
t=# with a(a) as (values('a b c'),('a c d'),('b e f'),('r b t'),('q w'))
, b(i,b) as (values(1,'a b'), (2,'e'), (3,'b'))
, p as (select unnest(string_to_array(b.b,' ')) arr,i from b)
select a phrases,arr match_words,count(1) over (partition by arr) words_in_matches, count(1) over (partition by i) matches,i from a left join p on a.a like '%'||arr||'%';
 phrases | match_words | words_in_matches | matches | i
---------+-------------+------------------+---------+---
 r b t   | b           |                6 |       5 | 1
 a b c   | b           |                6 |       5 | 1
 b e f   | b           |                6 |       5 | 1
 a b c   | a           |                2 |       5 | 1
 a c d   | a           |                2 |       5 | 1
 b e f   | e           |                1 |       1 | 2
 r b t   | b           |                6 |       3 | 3
 a b c   | b           |                6 |       3 | 3
 b e f   | b           |                6 |       3 | 3
 q w     |             |                1 |       1 |
(10 rows)
phrases are rows from your big table.
match_words are tokens from your small table (split by spaces).
words_in_matches is the number of matched rows for that token across the big-table phrases.
matches is the total number of matches of that small-table phrase against the big-table phrases.
i is the index of the phrase from the small table.
So you can order by the third or fourth column to get some sort of ranking.
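If you do want the built-in full text machinery rather than LIKE, here is a rough sketch along the same lines. table1.a and table2.b are the columns from the question; the 'simple' configuration, the OR-ed token query, and the aliases small_phrase/big_phrase/score are assumptions, chosen to avoid stemming and to allow partial matches:
-- rank each big-table phrase against each small-table phrase it shares a word with
SELECT t2.b AS small_phrase,
       t1.a AS big_phrase,
       ts_rank(to_tsvector('simple', t1.a),
               to_tsquery('simple', replace(t2.b, ' ', ' | '))) AS score
FROM table2 t2
JOIN table1 t1
  ON to_tsvector('simple', t1.a) @@ to_tsquery('simple', replace(t2.b, ' ', ' | '))
ORDER BY t2.b, score DESC;
An expression index such as CREATE INDEX ON table1 USING gin (to_tsvector('simple', a)); should keep this workable at ~100,000 rows, and score is the ranking you can order by.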

T-SQL. How to create a table with a sequence of values

I have a table with a list of names and indices. For example like this:
ID | Name | Index
1 | Value 1 | 3
2 | Value 2 | 4
...
N | Value N | NN
I need to create a new table where every value from the "Name" field is repeated as many times as the "Index" field specifies. For example like this:
ID | Name_2 | ID_2
1 | Value 1 | 1
2 | Value 1 | 2
3 | Value 1 | 3
4 | Value 2 | 1
5 | Value 2 | 2
6 | Value 2 | 3
7 | Value 2 | 4
...
N | Value N | 1
N+1| Value N | 2
...
I have no idea how to write a loop to get such a result. Please give me some advice.
Here is a solution to repeat the rows based on a column value:
declare @order table (Id int, name varchar(20), indx int)

Insert into @order (Id, name, indx)
VALUES
(1,'Value1',3),
(2,'Value2',4),
(3,'Value3',2)

;WITH cte AS
(
    SELECT * FROM @order
    UNION ALL
    SELECT cte.[Id], cte.name, (cte.indx - 1) indx
    FROM cte INNER JOIN @order t
        ON cte.[Id] = t.[Id]
    WHERE cte.indx > 1
)
SELECT ROW_NUMBER() OVER(ORDER BY name, indx) AS Id,
       name AS [Name_2],
       ROW_NUMBER() OVER(PARTITION BY name ORDER BY indx) AS [Id_2]
FROM cte
ORDER BY 1
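An alternative sketch that avoids recursion is to join @order to a derived numbers table; the TOP (1000) cap is an assumption about the largest indx you expect:
SELECT ROW_NUMBER() OVER (ORDER BY o.name, n.num) AS Id,
       o.name AS [Name_2],
       n.num  AS [Id_2]
FROM @order o
JOIN (SELECT TOP (1000)
             -- sys.all_objects is only used as a convenient source of rows to number
             ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) AS num
      FROM sys.all_objects) n
  ON n.num <= o.indx
ORDER BY Id;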

Traversing Gaps in sequential data

I have a table [ContactCallDetail] which stores call data for each leg of a call from our phone system. The data is stored with a 4 part primary key: ([SessionID], [SessionSeqNum], [NodeID], [ProfileID]). The [NodeID], [ProfileID] , and [SessionID] together make up a call, and the [SessionSeqNum] defines each leg of the call as the caller is transferred from one department/rep to the next.
I need to look at each leg of a call and, if a transfer occurred, find the next leg of the call so I can report on where the transferred call went.
The problems I am facing are: 1) the session sequence does not always start with the same number, 2) there can be gaps in the sequence numbers, and 3) the table has 15,000,000 rows and is added to via a data import every night, so I need a non-cursor-based solution.
Sample data
| sessionid | sessionseqnum | nodeid | profileid |
| 170000459184 | 0 | 1 | 1 |
| 170000459184 | 1 | 1 | 1 |
| 170000459184 | 3 | 1 | 1 |
| 170000229594 | 1 | 1 | 1 |
| 170000229594 | 2 | 1 | 1 |
| 170000229598 | 0 | 1 | 1 |
| 170000229598 | 2 | 1 | 1 |
| 170000229600 | 0 | 1 | 1 |
| 170000229600 | 1 | 1 | 1 |
| 170000229600 | 3 | 1 | 1 |
| 170000229600 | 5 | 1 | 1 |
I think what I need to do is create a lookup table using an identity column or rownum() or the like to get a new sequence number for the call legs that will have no gaps. How would I do this? Or if there is a different, best-practice solution you could point me to, that would be great.
You can use the lead() analytic function to identify the next session sequence number.
SELECT sessionid ,
nodeid ,
profileid ,
sessionseqnum ,
lead(sessionseqnum) OVER ( PARTITION BY sessionid, nodeid, profileid ORDER BY sessionseqnum ) AS next_seq_num
FROM ContactCallDetail
ORDER BY sessionid ,
nodeid ,
profileid ,
sessionseqnum;
sessionid     nodeid  profileid  sessionseqnum  next_seq_num
------------  ------  ---------  -------------  ------------
170000229594  1       1          1              2
170000229594  1       1          2              NULL
170000229598  1       1          0              2
170000229598  1       1          2              NULL
170000229600  1       1          0              1
170000229600  1       1          1              3
170000229600  1       1          3              5
170000229600  1       1          5              NULL
170000459184  1       1          0              1
170000459184  1       1          1              3
170000459184  1       1          3              NULL
The ORDER BY clause isn't strictly necessary; it just makes it easier for humans to read the output.
Now you can join back to the original table to produce rows that show the relevant pairs. There are several different ways to express that in standard SQL. Here, I'm using a common table expression.
WITH next_seq_nums
AS ( SELECT * ,
lead(sessionseqnum) OVER ( PARTITION BY sessionid, nodeid, profileid ORDER BY sessionseqnum ) AS next_seq_num
FROM ContactCallDetail
)
SELECT t1.sessionid ,
t1.nodeid ,
t1.profileid ,
t1.sessionseqnum ,
t2.sessionseqnum next_sessionseqnum ,
t2.nodeid next_nodeid ,
t2.profileid next_profileid
FROM next_seq_nums t1
LEFT JOIN ContactCallDetail t2 ON t1.sessionid = t2.sessionid
AND t1.nodeid = t2.nodeid
AND t1.profileid = t2.profileid
AND t1.next_seq_num = t2.sessionseqnum
ORDER BY t1.sessionid ,
t1.nodeid ,
t1.profileid ,
t1.sessionseqnum;
The LEFT JOIN will leave NULLs in the rows for the last session sequence numbers in each session. That makes sense--on the last row, there isn't a "next leg of the call". But it's easy enough to exclude those rows if you need to.
If your dbms doesn't support the lead() analytic function, you can replace the common table expression above with this one.
WITH next_seq_nums
AS ( SELECT t1.* ,
( SELECT MIN(sessionseqnum)
FROM contactcalldetail
WHERE sessionid = t1.sessionid
AND nodeid = t1.nodeid
AND profileid = t1.profileid
AND sessionseqnum > t1.sessionseqnum
) next_seq_num
FROM contactcalldetail t1
)
...
with cte as
(
    SELECT *,
           rank() OVER (PARTITION BY sessionid, profileid, nodeid
                        ORDER BY sessionseqnum) AS Rank
    FROM dbo.Table_1
)
SELECT cte.sessionid, cte.nodeid, cte.profileid,
       cte.sessionseqnum, cte_1.sessionseqnum
FROM cte
LEFT JOIN cte AS cte_1
       ON cte.sessionid = cte_1.sessionid
      AND cte.profileid = cte_1.profileid
      AND cte.nodeid = cte_1.nodeid
      AND cte.rank = cte_1.rank - 1