I have a table which looks like this:
Entry number   Timestamp          Value1   Value2   Value3   Value4
5758           28-06-2018 16:30   34       63       34.2     60.9
5759           28-06-2018 17:00   33.5     58       34.9     58.4
5760           28-06-2018 17:30   33       53       35.2     58.5
5761           28-06-2018 18:00   33       63       35       57.9
5762           28-06-2018 18:30   33       61       34.6     58.9
5763           28-06-2018 19:00   33       59       34.1     59.4
5764           28-06-2018 19:30   28       89       33.5     64.2
5765           28-06-2018 20:00   28       89       33       66.1
5766           28-06-2018 20:30   28       83       32.5     67
5767           28-06-2018 21:00   29       89       32.2     68.4
Where '28-06-2018 16:30' is all under one column, so I have 6 columns:
Entry number, Timestamp, Value1, Value2, Value3, Value4
I want to extract all rows that belong to '28-06-2018', i.e. all data pertaining to that day. My table is too large to show more data here, but the Timestamp entries span a couple of months.
You can use contains on the string Timestamp column to pick out a single day. For example, with a small sample table:
t = table([5758;5759],["28-06-2018 16:30";"29-06-2018 16:30"],[34;33.5],'VariableNames',{'Entry number','Timestamp','Value1'})
t =
2×3 table
Entry number Timestamp Value1
____________ __________________ ______
5758 "28-06-2018 16:30" 34
5759 "29-06-2018 16:30" 33.5
t(contains(t.('Timestamp'),"28-06"),:)
ans =
1×3 table
Entry number Timestamp Value1
____________ __________________ ______
5758 "28-06-2018 16:30" 34
Hello Fellow Kdb Mortals :D
Stuck on a pretty weird problem here. I have a table like this (the time column is xbar-ed to 5-minute buckets):
time code name count
--------------------------------
00:00 SPY S&P.. 15
00:00 QQQ ... 88
00:00 IWM ... 100
00:00 XLE ... 80
00:05 QQQ ... 20
00:05 SPY ... 75
00:10 QQQ ... 22
00:10 XLE ... 10
00:15 SPY ... 23
.....
.....
23:40 XLE ... 11
23:50 SPY ... 16
23:55 IWM ... 100
23:55 QQQ ... 10
What I want to be returned is a table like (from asc time)
code name stime etime cumcount
------------------------------------------------
SPY S&P... 00:00 00:15 113 <-- 15+75+23
QQQ ... 00:00 00:05 108 <-- 88+20
IWM ... 00:00 00:00 100 <-- 100
XLE ... 00:00 23:40 101 <-- 80+10+11
Notice the condition on the time bucket: for each (code,name) pair I want to stop at the first bucket where the cumulative sum of count is greater than or equal to 100.
I can also generate another table from bottoms up (desc time)
code name stime etime cumcount
------------------------------------------------
SPY ... 23:50 20:10 103
QQQ ... 23:55 21:45 118
IWM ... 23:55 23:55 100
XLE ... 23:40 00:00 101 <-- 11+10+80
I have been at this for a couple of hours, but can't get this working. Basic select and sums don't get me anywhere. I could use loops but thought I should check in here first before I go down that lane.
Any help is appreciated :D
Assuming you have a table sorted ascending on time i.e.:
`time xasc `t
Something like this could work
q)t1:update cumcount:sums cnt,stime:first time by code,name from t
q)select code,name,stime,etime:time, cumcount from t1 where cumcount>=100,i=(first;i) fby ([]code;name)
Notice that I have relabelled count as cnt to prevent a clash with the count function that already exists in the q language.
So first you calculate your cumulative count in the update statement.
Then select from the resulting table in two steps: first pull out only those records where the cumulative count is >= 100, then use fby to filter this down again, keeping the first such record for each distinct (code;name) pair.
In this example stime is the time of the first entry for each (code;name) pair and etime is the time at which the cumulative count first reaches 100.
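Not part of the q answer, but for comparison the same first-crossing logic can be sketched in pandas, using a small frame that mirrors the sample above (count is again renamed cnt):
import pandas as pd

# 5-minute buckets from the question, one row per (time, code)
t = pd.DataFrame({
    "time": ["00:00","00:00","00:00","00:00","00:05","00:05","00:10","00:10","00:15","23:40","23:50","23:55","23:55"],
    "code": ["SPY","QQQ","IWM","XLE","QQQ","SPY","QQQ","XLE","SPY","XLE","SPY","IWM","QQQ"],
    "cnt":  [15, 88, 100, 80, 20, 75, 22, 10, 23, 11, 16, 100, 10],
})

t = t.sort_values("time", kind="stable")
t["cumcount"] = t.groupby("code")["cnt"].cumsum()         # running total per code
t["stime"] = t.groupby("code")["time"].transform("first")  # first bucket per code

# first row per code where the running total reaches 100
hits = t[t["cumcount"] >= 100]
result = (hits.groupby("code", as_index=False).first()
              .rename(columns={"time": "etime"})
              [["code", "stime", "etime", "cumcount"]])
print(result)  # SPY 00:00 00:15 113, QQQ 00:00 00:05 108, IWM 00:00 00:00 100, XLE 00:00 23:40 101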
I prefer Sean's solution, but for the sake of an alternative:
q)t:update name:string lower code from([]time:"u"$0 0 0 0 5 5 10 10 15 1420 1430 1435 1435;code:`SPY`QQQ`IWM`XLE 0 1 2 3 1 0 1 3 0 3 0 2 1;cnt:15 88 100 80 20 75 22 10 23 11 16 100 10);
q)exec{x x[`cumcnt]binr 100}[([]stime:first time;etime:time;cumcnt:sums cnt)]by code,name from t
code name | stime etime cumcnt
----------| ------------------
IWM "iwm"| 00:00 00:00 100
QQQ "qqq"| 00:00 00:05 108
SPY "spy"| 00:00 00:15 113
XLE "xle"| 00:00 23:40 101
Summing from the bottom would be:
q)exec{x x[`cumcnt]binr 100}[([]stime:last time;etime:reverse time;cumcnt:sums reverse cnt)]by code,name from t
code name | stime etime cumcnt
----------| ------------------
IWM "iwm"| 23:55 23:55 100
QQQ "qqq"| 23:55 00:00 140
SPY "spy"| 23:50 00:05 114
XLE "xle"| 23:40 00:00 101
I am a beginner in kdb. As I was practicing window join on test data from NYSE, I came across issues with window join across different dates.
Basically, my table looks like:
t:([] sym:10#`AAPL;date:2021.03.21 2021.03.21 2021.03.21 2021.03.21 2021.03.21 2021.03.22 2021.03.22 2021.03.22 2021.03.22 2021.03.22;price:100 101 105 110 120 130 140 150 160 170;time:10:01 10:04 10:07 10:10 10:13 10:01 10:04 10:07 10:10 10:13)
I am trying to create a sliding window for every 3 minutes on each date and calculate the sum of price in that window. However, I am not sure how to do window join on different groups.
I tried:
w3:-3 0+\:t[`minute];
newdata:wj1[w3;`minute;t;(t;(sum;`price))]
but this does not give me the correct result. Could someone please help with this? Thank you!
To do a wj across dates you need a timestamp column which you can create from date and time:
t:update timeStamp:"P"$"D" sv/: flip string (date;time) from t
t
sym date price time timeStamp
---------------------------------------------------------
AAPL 2021.03.21 100 10:01 2021.03.21D10:01:00.000000000
AAPL 2021.03.21 101 10:04 2021.03.21D10:04:00.000000000
AAPL 2021.03.21 105 10:07 2021.03.21D10:07:00.000000000
AAPL 2021.03.21 110 10:10 2021.03.21D10:10:00.000000000
AAPL 2021.03.21 120 10:13 2021.03.21D10:13:00.000000000
You can then use the timeStamp column like so:
w3:-00:03 00:00 +\:t[`timeStamp]
wj1[w3;`timeStamp;t;(t;(sum;`price))]
sym date price time timeStamp
---------------------------------------------------------
AAPL 2021.03.21 100 10:01 2021.03.21D10:01:00.000000000
AAPL 2021.03.21 201 10:04 2021.03.21D10:04:00.000000000
AAPL 2021.03.21 206 10:07 2021.03.21D10:07:00.000000000
AAPL 2021.03.21 215 10:10 2021.03.21D10:10:00.000000000
AAPL 2021.03.21 230 10:13 2021.03.21D10:13:00.000000000
AAPL 2021.03.22 130 10:01 2021.03.22D10:01:00.000000000
AAPL 2021.03.22 270 10:04 2021.03.22D10:04:00.000000000
AAPL 2021.03.22 290 10:07 2021.03.22D10:07:00.000000000
AAPL 2021.03.22 310 10:10 2021.03.22D10:10:00.000000000
AAPL 2021.03.22 330 10:13 2021.03.22D10:13:00.000000000
If you have more than 1 sym in the table you should apply the parted attribute:
t:update `p#sym from `sym`timeStamp xasc t
Then add sym before timeStamp in the 2nd argument of wj:
q)wj[w3;`sym`timeStamp;select sym, timeStamp from t;(t;(sum;`price))]
sym timeStamp price
----------------------------------------
AAPL 2021.03.21D10:01:00.000000000 100
AAPL 2021.03.21D10:04:00.000000000 201
AAPL 2021.03.21D10:07:00.000000000 206
AAPL 2021.03.21D10:10:00.000000000 215
AAPL 2021.03.21D10:13:00.000000000 230
AAPL 2021.03.22D10:01:00.000000000 250
AAPL 2021.03.22D10:04:00.000000000 270
AAPL 2021.03.22D10:07:00.000000000 290
AAPL 2021.03.22D10:10:00.000000000 310
AAPL 2021.03.22D10:13:00.000000000 330
MSFT 2021.03.21D10:01:00.000000000 468
MSFT 2021.03.21D10:04:00.000000000 915
MSFT 2021.03.21D10:07:00.000000000 668
MSFT 2021.03.21D10:10:00.000000000 403
MSFT 2021.03.21D10:13:00.000000000 604
MSFT 2021.03.22D10:01:00.000000000 775
MSFT 2021.03.22D10:04:00.000000000 697
MSFT 2021.03.22D10:07:00.000000000 829
MSFT 2021.03.22D10:10:00.000000000 799
MSFT 2021.03.22D10:13:00.000000000 382
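Not part of the original answer, but for comparison, a rough pandas analogue of the trailing 3-minute window sum (a sketch assuming a frame with the sym, timeStamp and price columns shown above; with multiple syms the frame should be sorted by sym and timeStamp first):
import pandas as pd

# one day of the sample data from the question
t = pd.DataFrame({
    "sym": ["AAPL"] * 5,
    "timeStamp": pd.to_datetime([
        "2021-03-21 10:01", "2021-03-21 10:04", "2021-03-21 10:07",
        "2021-03-21 10:10", "2021-03-21 10:13",
    ]),
    "price": [100, 101, 105, 110, 120],
})

t = t.sort_values(["sym", "timeStamp"])

# trailing 3-minute window per sym, inclusive at both ends like the
# -00:03 00:00 window used with wj1 above
t["win_sum"] = (
    t.set_index("timeStamp")
     .groupby("sym")["price"]
     .rolling("3min", closed="both")
     .sum()
     .to_numpy()
)
print(t)  # win_sum: 100, 201, 206, 215, 230 (matches the wj1 output)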
I have the following table in PostgreSQL 11.0:
drug_id synonym score
96165807064 chembl490421 0.667
96165807064 querciformolide a 1.0
96165807064 querciformolide b 1.0
96165807066 chembl196832 1.0
96165807066 cpiylcsbeicghy-uhfffaoysa-n 0.875
96165807066 schembl1752046 0.938
96165807066 stk694847 0.75
96165807066 molport-006-827-808 0.812
96165807066 akos016348681 0.625
96165807066 akos004112738 0.688
96165807066 mcule-5237395512 0.562
I would like to add a 'rank' column, grouped by drug_id and based on the score column (highest score ranked first).
Following is the expected output:
drug_id synonym score rank
96165807064 querciformolide a 1.0 1
96165807064 querciformolide b 1.0 1
96165807064 chembl490421 0.667 2
96165807066 chembl196832 1.0 1
96165807066 schembl1752046 0.938 2
96165807066 cpiylcsbeicghy-uhfffaoysa-n 0.875 3
96165807066 molport-006-827-808 0.812 4
96165807066 stk694847 0.75 5
96165807066 akos004112738 0.688 6
96165807066 akos016348681 0.625 7
96165807066 mcule-5237395512 0.562 8
I am using the following query:
SELECT distinct
drug_id,
synonym,
score,
dense_RANK () OVER (
PARTITION BY drug_id
ORDER BY score
) rank_number
FROM
tbl
order by drug_id, score desc
;
I am not getting the expected output using the above query:
drug_id synonym score rank_number
96165807064 querciformolide a 1.0 2
96165807064 querciformolide b 1.0 2
96165807064 chembl490421 0.667 1
96165807066 chembl196832 1.0 15
96165807066 schembl1752046 0.938 14
96165807066 cpiylcsbeicghy-uhfffaoysa-n 0.875 13
96165807066 molport-006-827-808 0.812 12
96165807066 stk694847 0.75 11
96165807066 akos004112738 0.688 10
96165807066 akos016348681 0.625 9
96165807066 mcule-5237395512 0.562 8
The problem is the ORDER BY inside the window function: by default it sorts ascending, so the lowest score gets rank 1. Order by score descending within each drug_id partition instead. You can use the following query:
SELECT
t.drug_id,
t.synonym,
t.score,
DENSE_RANK() OVER (
PARTITION BY t.drug_id
ORDER BY t.drug_id, t.score desc
) rank
FROM
test t;
I created a sql fiddle to show the query working.
https://www.db-fiddle.com/f/p9ANUghi8TxLgXrhUHsUaY/3
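Not part of the original answer, but for comparison, the same dense ranking can be expressed in pandas (a sketch assuming a DataFrame df with the drug_id, synonym and score columns from the question):
import pandas as pd

# small illustrative frame with the question's columns
df = pd.DataFrame({
    "drug_id": [96165807064, 96165807064, 96165807064],
    "synonym": ["chembl490421", "querciformolide a", "querciformolide b"],
    "score":   [0.667, 1.0, 1.0],
})

# dense rank within each drug_id, highest score first (ties share a rank)
df["rank"] = (df.groupby("drug_id")["score"]
                .rank(method="dense", ascending=False)
                .astype(int))
print(df.sort_values(["drug_id", "score"], ascending=[True, False]))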
When I merge two CSV files of the format (date, someValue), I see some duplicate records.
If I reduce the number of records by half, the problem goes away; if I double the size of both files, it gets worse. Appreciate any help!
My code:
i = pd.DataFrame.from_csv('i.csv')
i = i.reset_index()
e = pd.DataFrame.from_csv('e.csv')
e = e.reset_index()
total_df = pd.merge(i, e, right_index=False, left_index=False,
right_on=['date'], left_on=['date'], how='left')
total_df = total_df.sort(column='date')
(Note the duplicate records for 11/15, 11/16, 12/17, and 12/18.)
In [7]: total_df
Out[7]:
date Cost netCost
25 2012-11-15 00:00:00 1 2
26 2012-11-15 00:00:00 1 2
31 2012-11-16 00:00:00 1 2
32 2012-11-16 00:00:00 1 2
37 2012-11-17 00:00:00 1 2
2 2012-11-18 00:00:00 1 2
5 2012-11-19 00:00:00 1 2
8 2012-11-20 00:00:00 1 2
11 2012-11-21 00:00:00 1 2
14 2012-11-22 00:00:00 1 2
17 2012-11-23 00:00:00 1 2
20 2012-11-24 00:00:00 1 2
23 2012-11-25 00:00:00 1 2
29 2012-11-26 00:00:00 1 2
35 2012-11-27 00:00:00 1 2
0 2012-11-28 00:00:00 1 2
3 2012-11-29 00:00:00 1 2
6 2012-11-30 00:00:00 1 2
9 2012-12-01 00:00:00 1 2
12 2012-12-02 00:00:00 1 2
15 2012-12-03 00:00:00 1 2
18 2012-12-04 00:00:00 1 2
21 2012-12-05 00:00:00 1 2
24 2012-12-06 00:00:00 1 2
30 2012-12-07 00:00:00 1 2
36 2012-12-08 00:00:00 1 2
1 2012-12-09 00:00:00 2 2
4 2012-12-10 00:00:00 2 2
7 2012-12-11 00:00:00 2 2
10 2012-12-12 00:00:00 2 2
13 2012-12-13 00:00:00 1 2
16 2012-12-14 00:00:00 2 2
19 2012-12-15 00:00:00 2 2
22 2012-12-16 00:00:00 2 2
27 2012-12-17 00:00:00 1 2
28 2012-12-17 00:00:00 1 2
33 2012-12-18 00:00:00 1 2
34 2012-12-18 00:00:00 1 2
i.csv
date,Cost
2012-11-15 00:00:00,1
2012-11-16 00:00:00,1
2012-11-17 00:00:00,1
2012-11-18 00:00:00,1
2012-11-19 00:00:00,1
2012-11-20 00:00:00,1
2012-11-21 00:00:00,1
2012-11-22 00:00:00,1
2012-11-23 00:00:00,1
2012-11-24 00:00:00,1
2012-11-25 00:00:00,1
2012-11-26 00:00:00,1
2012-11-27 00:00:00,1
2012-11-28 00:00:00,1
2012-11-29 00:00:00,1
2012-11-30 00:00:00,1
2012-12-01 00:00:00,1
2012-12-02 00:00:00,1
2012-12-03 00:00:00,1
2012-12-04 00:00:00,1
2012-12-05 00:00:00,1
2012-12-06 00:00:00,1
2012-12-07 00:00:00,1
2012-12-08 00:00:00,1
2012-12-09 00:00:00,2
2012-12-10 00:00:00,2
2012-12-11 00:00:00,2
2012-12-12 00:00:00,2
2012-12-13 00:00:00,1
2012-12-14 00:00:00,2
2012-12-15 00:00:00,2
2012-12-16 00:00:00,2
2012-12-17 00:00:00,1
2012-12-18 00:00:00,1
e.csv
date,netCost
2012-11-15 00:00:00,2
2012-11-16 00:00:00,2
2012-11-17 00:00:00,2
2012-11-18 00:00:00,2
2012-11-19 00:00:00,2
2012-11-20 00:00:00,2
2012-11-21 00:00:00,2
2012-11-22 00:00:00,2
2012-11-23 00:00:00,2
2012-11-24 00:00:00,2
2012-11-25 00:00:00,2
2012-11-26 00:00:00,2
2012-11-27 00:00:00,2
2012-11-28 00:00:00,2
2012-11-29 00:00:00,2
2012-11-30 00:00:00,2
2012-12-01 00:00:00,2
2012-12-02 00:00:00,2
2012-12-03 00:00:00,2
2012-12-04 00:00:00,2
2012-12-05 00:00:00,2
2012-12-06 00:00:00,2
2012-12-07 00:00:00,2
2012-12-08 00:00:00,2
2012-12-09 00:00:00,2
2012-12-10 00:00:00,2
2012-12-11 00:00:00,2
2012-12-12 00:00:00,2
2012-12-13 00:00:00,2
2012-12-14 00:00:00,2
2012-12-15 00:00:00,2
2012-12-16 00:00:00,2
2012-12-17 00:00:00,2
2012-12-18 00:00:00,2
This does seem like a bug with pandas 0.7.3 or numpy 1.6. It only happens when the column being merged on is a date (internally converted to numpy.datetime64). My solution was to convert the date into a string:
from datetime import datetime

def _DatetimeToString(datetime64):
    # numpy.datetime64 holds nanoseconds since the epoch; convert to a date string
    timestamp = datetime64.astype(long)/1000000000
    return datetime.fromtimestamp(timestamp).strftime('%Y-%m-%d')
i = pd.DataFrame.from_csv('i.csv')
i = i.reset_index()
i['date'] = i['date'].map(_DatetimeToString)
e = pd.DataFrame.from_csv('e.csv')
e = e.reset_index()
e['date'] = e['date'].map(_DatetimeToString)
total_df = pd.merge(i, e, right_index=False, left_index=False,
right_on=['date'], left_on=['date'], how='left')
total_df = total_df.sort(column='date')
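For anyone reading this with a recent pandas release: DataFrame.from_csv and DataFrame.sort have since been removed, and the merge bug described here dates from pandas 0.7.x. A minimal sketch of the equivalent flow with the current API (same i.csv and e.csv as above):
import pandas as pd

# parse the date column directly instead of going through the index
i = pd.read_csv('i.csv', parse_dates=['date'])
e = pd.read_csv('e.csv', parse_dates=['date'])

# left-join on the date column, then sort by date
total_df = i.merge(e, on='date', how='left').sort_values('date')
print(total_df.head())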
This issue/bug came up for me as well. I was not merging on a datetime series; however, I did have a datetime series in the left dataframe. My solution was to de-dupe:
len(pophist)
2347
pop_merged = pd.merge(left=pophist, right=df_labels, how='left',
left_on ='candidate', right_on ='Slug', indicator = True)
pop_merged.shape
3303
pop_merged2 = pop_merged.drop_duplicates() #note dedupping is required due to issue in how pandas handles datetime dtypes on merge.
len(pop_merged2)
2347
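Not from either answer, but a quick way to check whether the extra rows come from duplicated join keys in the right-hand frame, which is the usual cause of row growth in a left merge. A minimal sketch with hypothetical frames left_df and right_df joined on a hypothetical key column:
import pandas as pd

def check_merge_keys(left_df: pd.DataFrame, right_df: pd.DataFrame, key: str) -> None:
    """Report duplicated join keys that would multiply rows in a left merge."""
    print("duplicated keys on the right:", right_df[key].duplicated().sum())
    print("duplicated keys on the left: ", left_df[key].duplicated().sum())

# If the right side has duplicated keys, a left merge emits one output row per
# matching right-hand row, so len(merged) ends up larger than len(left_df).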