Randomly select observations separately for each column in SQL - hiveql

I am interested in generating a completely (damaged) randomized data where observations are selected randomly (with replacement) for each field and then combined. I will need to generate a new dummy id to represent the old id as I don't want to reconstruct the data. My goal is to create a simulated column-wise random dataset.
Here is a sample data:
Id Col1 Col2 Col3
11 A 0.01 David
12 B 0.04 Max
13 C 0.05 Tom
14 E 0.06 West
15 C 0.02 Mike
What I am interested in is something like this:
Id2 Col1 Col2 Col3
1 E 0.04 Mike
2 C 0.06 David
3 B 0.02 West
4 A 0.04 Tom
5 C 0.05 Max
I am looking for an organized way of doing this. Here is what I attempted so far but am not interested in doing many times over since I have a lot of columns in the real data.
proc sql;
create table newtable1 as
select monotonic() as id2, col1 from
(select col1 from Table1 order by ranuni(0));
quit;
Using the above code you generate separate random columns and then combine them using the new monotonic key.

Related

I have Multiple logical records in one db row, how do I split them into separate rows?

I have a table that has data like:
Name
Item_1
Qty_1
Price_1
Item_2
Qty_2
Price_2
...
Item_50
Qty_50
Price_50
Bob
Apples
10
0.50
Pears
5
0.65
...
Lemons
12
0.25
Alice
Cherries
20
1.00
NULL
NULL
NULL
...
NULL
NULL
NULL
I need to process the data per-item, so the ideal form of the data would be:
Name
ItemNo
Item
Qty
Price
Bob
1
Apples
10
0.50
Bob
2
Pears
5
0.65
...
...
...
...
...
Bob
50
Lemons
12
0.25
Alice
1
Cherries
20
1.00
How can I convert between the two forms?
I have looked at the pivot command, but it seems to convert column names into data in a field, not split groups of columns into separate rows. It doesn't look like it will work for this application.
The current code looks something like:
( SELECT t1.Name, 1 AS ItemNo, t1.Item_1 AS Item, t1.Qty_1 AS Qty, t1.Price_1 AS Price FROM table t1
UNION ALL
SELECT t2.Name, 2 AS ItemNo, t2.Item_2 AS Item, t2.Qty_2 AS Qty, t2.Price_2 AS Price FROM table t2
UNION ALL
...
SELECT t50.Name, 50 AS ItemNo, t50.Item_50 AS Item, t50.Qty_50 AS Qty, t50.Price_50 AS Price FROM table t50
)
It works, but it seems hard to maintain. Is there a better way?
Hopefully the reason you want to do this is to fix your design. If not, then make the reason you're asking is to fix your design.
Anyway, one method is to use a VALUES table construct to unpivot the data:
SELECT YT.Name,
V.ItemNo,
V.Item,
V.Qty,
V.Price
FROM dbo.YourTable YT
CROSS APPLY (VALUES(1,YT.Item_1, YT.Qty_1, YT.Price1),
(2,YT.Item_2, YT.Qty_2, YT.Price2),
(3,YT.Item_3, YT.Qty_3, YT.Price3),
... --You get the idea
(49,YT.Item_49, YT.Qty_49, YT.Price49),
(50,YT.Item_50, YT.Qty_50, YT.Price50))V(ItemNo,Item,Qty,Price)
WHERE V.Item IS NOT NULL;

Pivot table with multiple value columns in KDB+

I would like to transform the following two row table generated by:
tb: ([] time: 2010.01.01 2010.01.01; side:`Buy`Sell; price:100 101; size:30 50)
time side price size
--------------------------------
2010.01.01 Buy 100 30
2010.01.01 Sell 101 50
To the table below with single row:
tb1: ([] enlist time: 2010.01.01; enlist price_buy:100; enlist price_sell:101; enlist size_buy:30; enlist size_sell:50)
time price_buy price_sell size_buy size_sell
-----------------------------------------------------
2010.01.01 100 101 30 50
What is the most efficient way to achieve this?
(select price_buy:price, size_buy:size by time from tb where side = `Buy) lj select price_sell:price, size_sell:size by time from tb where side = `Sell
time | price_buy size_buy price_sell size_sell
----------| ---------------------------------------
2010.01.01| 100 30 101 50
If you wanted to avoid 2 select statements:
raze each select `price_buy`price_sell!(side!price)#/:`Buy`Sell, `size_buy`size_sell!(side!size)#/:`Buy`Sell by time from tb
As an additional note, having a date column labeled time can be misleading. Typical financial tables in kdb have the format date time sym etc
Edit: Functional form for dynamic column generation:
{x[0] lj x[1]}[{?[`tb;enlist (=;`side;enlist `$x);(enlist `time)!enlist `time;(`$("price",x;"size",x))!(`price;`size)]} each ("Sell";"Buy")]
time | priceSell sizeSell priceBuy sizeBuy
----------| -----------------------------------
2010.01.01| 101 50 100 30
The general pivot function on the Kx website can do this, see https://code.kx.com/q/kb/pivoting-tables/
q)piv[tb;(),`time;(),`side;`price`size;{[v;P]`$raze each string raze P[;0],'/:v,/:\:P[;1]};{x,z}]
time | Buyprice Sellprice Buysize Sellsize
----------| -----------------------------------
2010.01.01| 100 101 30 50
I have a pivot function in github . But it doesn't support multiple columns
.math.st.pivot: {[t;rc;cf;ff]
P: asc distinct t cf;
Pcol: `$string[P] cross "_",/:string key ff;
t: ?[t;();rc!rc;key[ff]!{({[x;y;z] z each y#group x}[;;z];x;y)}[cf]'[key ff;value ff]];
t: ![t;();0b; Pcol! raze {((';#);x;$[-11h=type y;enlist;::] y)}'[key ff]'[P] ];
![t;();0b;key ff]
};
But you can left join to achieve expected result:
.math.st.pivot[tb;enlist`time;`side;enlist[`price]!enlist first]
lj .math.st.pivot[tb;enlist`time;`side;enlist[`size]!enlist first]
Looks like adding support for multiple columns is a good idea.

calculating median to output similar results to over partition

i have a large table, here is a snippet of how it looks like
name class brand rating
12 d 1 3.8
22 d 1 3.9
33 a 2 1.1
12 d 1 2.3
12 a 3 3.4
44 b 1 9.8
22 c 2 3.0
i calculated for the average of the rating doing the below
select avg(rating) over(partition by name,class,brand) as avg_rating from df
i'm aware that postgres doesn't have a median function but i would like to calculate for that column and have the output in a similar structure to that of my window function for average
in case of even number of rows, i would like the average number between the middle two numbers
To get the median, you should use percentile_cont
SELECT percentile_cont(0.5) WITHIN GROUP (ORDER BY rating) FROM df;

Replace infinity with nulls throughout entire table KDB

Example table:
table:([]col1:20 40 30 0w;col2:4?4;col3: 100 200 0w 300)
My solution:
{.[table;(where 0w=table[x];x);:;0n]}'[exec c from meta table where t="f"]
There is a way I am not seeing I'm sure. This just returns a list of for each change which I don't want. I just want the original table returned with nulls replaced.
Thanks in advance!
It would be good to flesh out your question a bit more. Are you always expecting it to be float columns? Will the table have many columns? Will there be string/sym columns mixed in that might complicate things?
If your table has a small number of columns you could just do an update
q)show t
col1 col2 col3
--------------
20 1 100
40 2 200
30 2 0w
0w 1 300
q)inftonull:{(x where x=0w):0n;x}
q)update inftonull col1, inftonull col3 from t
col1 col2 col3
--------------
20 2 100
40 1 200
30 0
3 300
If you think the column names might change or have a very large number of columns you could try a functional update (where you can pass the column names as parameters)
q){![t;();0b;x!inftonull,/:x,:()]}`col1`col3
col1 col2 col3
--------------
20 1 100
40 2 200
30 2
1 300
If your table is comprised of only numeric data something like
q)flip{(x where x=.Q.t[type x]$0w):x 0N;x}each flip t
col1 col2 col3
--------------
20 2 100
40 1 200
30 0
3 300
Might work, which tries to account for the fact the numeric data has different types.
If your data is going to contain string/sym columns the last example won't work

Kdb q query data from one table based on the data from another table without join

I'm new in kdb/q. And the following is my question. Really hope someone who experts in kdb can help me out.
I have two tables. Table t1 has two attributes: tp_time and id, which looks like:
tp_time id
------------------------------
2018.06.25T00:07:15.822 1
2018.06.25T00:07:45.823 3
2018.06.25T00:09:01.963 8
...
...
Table t2 has three attributes: tp_time, id, and price.
For each id, it has lots of price at different tp_time. So the table t2 is really large, which looks like the following:
tp_time id price
----------------------------------------
2018.06.25T00:05:99.999 1 10.87
2018.06.25T00:06:05.823 1 10.88
2018.06.25T00:06:18.999 1 10.88
...
...
2018.06.25T17:39:20.999 1 10.99
2018.06.25T17:39:23.999 1 10.99
2018.06.25T17:39:24.999 1 10.99
...
...
2018.06.25T01:39:39.999 2 10.99
2018.06.25T01:39:41.999 2 10.99
2018.06.25T01:39:45.999 2 10.99
...
...
What I try to do is for each row in Table t1, find its price at the nearest time and its price at approximately 5 seconds later. For example, for the first row in table t1:
2018.06.25T00:07:15.822 1
The price at nearest time is 10.87 and the price at around 5 seconds later is 10.88. And my expected output table looks like the following:
tp_time id price_1 price_2
----------------------------------------------------
2018.06.25T00:07:15.822 1 10.87 10.88
2018.06.25T00:07:45.823 3 SOME_PRICE SOME_PRICE
2018.06.25T00:09:01.963 8 SOME_PRICE SOME_PRICE
...
...
The thing is I cannot join t1 and t2 because table t2 is so large and I will kill the server. I've try something like ...where tp_time within(time1, time2). But I'm not sure how to deal with the time1 and time2 varibles.
Could someone gives me some helps on this questions? Thanks so much!
I'll recommend organizing the table t1 by applying the proper attributes so that when you join the tables, it will generate the results quickly.
Since you are looking for the prevailing price and price after 5 seconds, You will need wj for this.
the general syntax is :
wj[w;c;t;(q;(f0;c0);(f1;c1))]
w - begin and end time
t & q - unkeyed tables; q should be sorted by `id`time with `p# on id
c- names of the columns to be joined
f0,f1 - aggregation functions
In your case t2 should be sorted by `id`time with `p# on id
q)t2:update `g#id from `id`tp_time xasc ([] tp_time:`time$10:20:30 + asc -10?10 ; id:10?3 ;price:10?10.)
q)t1:([] tp_time:`time$10:20:30 + asc -3?5 ; id:1 1 1 )
q)select from t2 where id=1
tp_time id price
10:20:31.000 1 4.410662
10:20:32.000 1 5.473385
10:20:38.000 1 1.247049
q)wj[(`second$0 5)+\:t1.tp_time;`id`tp_time;t1;(t2;(first;`price);(last;`price))]
tp_time id price price
10:20:30.000 1 4.410662 5.473385
10:20:31.000 1 4.410662 5.473385
10:20:34.000 1 5.473385 1.247049 //price at 32nd second & 38th second