How can I extract the first n rows from each group? For example, given the table
bb: ([]sym:(4#`a),(5#`b);val: til 9)
sym val
-------------
a 0
a 1
a 2
a 3
b 4
b 5
b 6
b 7
b 8
How can I select the first 2 rows of each group by sym?
Thanks
You can use fby:
q)select from bb where ({x in 2#x};i) fby sym
sym val
-------
a 0
a 1
b 4
b 5
You can try this:
q)select from bb where i in raze exec 2#i by sym from bb
sym val
-------
a 0
a 1
b 4
b 5
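For readers coming from pandas, the same selection can be sketched as below. This is an analogy, not q code: the DataFrame only mirrors bb, and the name first_two is made up for illustration. groupby(...).head(n) keeps the first n rows of each group in their original order.

```python
import pandas as pd

# Mirror of the q table bb: four rows of sym a, five rows of sym b.
bb = pd.DataFrame({"sym": ["a"] * 4 + ["b"] * 5, "val": range(9)})

# Keep the first 2 rows of every sym group, preserving the original row order.
first_two = bb.groupby("sym").head(2)
print(first_two)
```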
I have a PySpark DataFrame for which I want to calculate summary statistics (the count of each unique category in a column) and a cross-tabulation against one fixed column, for all string columns.
For example, my df looks like this:
col1  col2  col3
Cat1  XYZ   A
Cat1  XYZ   C
Cat1  ABC   B
Cat2  ABC   A
Cat2  XYZ   B
Cat2  MNO   A
I want something like this:
VarNAME  Category  Count  A  B  C
col1     Cat1      3      1  1  1
col1     Cat2      3      2  0  1
col2     XYZ       3      1  1  1
col2     ABC       2      1  1  0
col2     MNO       1      1  0  0
col3     A         3      3  0  0
col3     B         2      0  2  0
col3     C         1      0  0  1
So, basically, I want a cross-tabulation of every individual column against col3, plus the total count.
I can do this in plain Python with a loop, but looping works somewhat differently in PySpark.
Here are my 2 cents.
First, create a sample DataFrame:
df = spark.createDataFrame(
[("Cat1","XYZ","A"),
("Cat1","XYZ","C"),
("Cat1","ABC","B"),
("Cat2","ABC","A"),
("Cat2","XYZ","B"),
("Cat2","MNO","A")
],schema = ['col1','col2','col3'])
The crosstab function computes the counts for each col3 value. I then add the total row count, add a constant column holding the source column's name, and rename the category column.
Finally, I union the resulting DataFrames:
import pyspark.sql.functions as fx
# For each column: crosstab against col3, add the row total, tag the source column, rename the category column.
df_union = \
df.crosstab('col1','col3').withColumn('count',fx.expr("A+B+C")).withColumn('VarName',fx.lit('col1')).withColumnRenamed('col1_col3','Category').union(
df.crosstab('col2','col3').withColumn('count',fx.expr("A+B+C")).withColumn('VarName',fx.lit('col2')).withColumnRenamed('col2_col3','Category')).union(
df.crosstab('col3','col3').withColumn('count',fx.expr("A+B+C")).withColumn('VarName',fx.lit('col3')).withColumnRenamed('col3_col3','Category'))
Print the DataFrame with the columns in the desired order:
df_union.select('VarName','Category','count','A','B','C').show()
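As a cross-check on small data, the same summary can be computed without Spark. The following is a minimal plain-Python sketch, not part of the PySpark solution; the function name crosstab_summary and the variables rows, columns and target are made up for illustration.

```python
from collections import Counter

# The six sample rows as (col1, col2, col3) tuples.
rows = [
    ("Cat1", "XYZ", "A"),
    ("Cat1", "XYZ", "C"),
    ("Cat1", "ABC", "B"),
    ("Cat2", "ABC", "A"),
    ("Cat2", "XYZ", "B"),
    ("Cat2", "MNO", "A"),
]
columns = ["col1", "col2", "col3"]
target = "col3"  # the fixed column every column is cross-tabulated against


def crosstab_summary(rows, columns, target):
    """Return tuples of (VarName, Category, count, one count per target value)."""
    t = columns.index(target)
    levels = sorted({r[t] for r in rows})  # distinct target values, here A, B, C
    out = []
    for c, name in enumerate(columns):
        # Count how often each (category, target value) pair occurs.
        pair_counts = Counter((r[c], r[t]) for r in rows)
        for cat in sorted({r[c] for r in rows}):
            per_level = [pair_counts[(cat, lv)] for lv in levels]
            out.append((name, cat, sum(per_level), *per_level))
    return out


summary = crosstab_summary(rows, columns, target)
for line in summary:
    print(line)
```

On the sample rows this gives A=2, B=1, C=0 for col1/Cat2, which is worth comparing against the hand-written table in the question.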
To get the row count per group, I tried a naive approach: the count 1 construct. It works in a simple case:
q)t:([]sym:`a`a`b`b);
q)select cnt: count 1 by sym from t
sym| cnt
---| ---
a | 2
b | 2
But when I added other fields, I got the wrong result:
q)select cnt: count 1, sym by sym from t
sym| cnt sym
---| -------
a | 1 a a
b | 1 b b
Why does count 1 work (or at least seem to) in the one-column case but fail with multiple columns?
Update: I expected to get something like this:
sym| cnt sym
---| -------
a | 2 a a
b | 2 b b
I don't think count 1 will produce the result you're looking for, nor even a consistent one.
I think you want count i instead. i is the virtual column of row indices, so with by sym, count i gives the number of rows in each sym group.
q)t:([]sym:`a`a`b`b)
q)select cnt:count i,sym by sym from t
sym| cnt sym
---| -------
a | 2 a a
b | 2 b b
q).z.K
3.6
Note, however, that this query does not work on kdb+ 4.0, which rejects a result that has sym both as a key column and as a value column:
q)t:([]sym:`a`a`b`b)
q)select cnt:count i,sym by sym from t
'dup names for cols/groups sym
[0] select cnt:count i,sym by sym from t
^
q).z.K
4f
I'd like to create a nested list in one of my table's columns, but I'm unsure of the syntax to use. If, for instance, I had the following table...
q)t:([]submitter:`A`B`C; code:3?100; status:110b)
q)t
submitter code status
---------------------
A 2 1
B 39 1
C 64 0
I want to do something similar to the query below. However, this adds an extra column x to the table and places the value there, instead of creating a compound list in the code column...
q)update code,:77 from t where status<>1b
submitter code status x
------------------------
A 2 1
B 39 1
C 64 0 77
If it were a dictionary with a single value I would do the following...
q)d:`submitter`code`status!(`A;1?100;1)
q)d
submitter| `A
code | ,88
status | 1
q)d[`code],:99
q)d
submitter| `A
code | 88 99
status | 1
How do I perform the same operation on a table with multiple rows?
My desired output would look like...
q)t
submitter code status
----------------------
A 2 1
B 39 1
C 64 77 0
This also works, and doesn't require you to change the column type in advance:
q)update code:(code,'(77;())status) from t
submitter code status
---------------------
A ,12 1
B ,10 1
C 1 77 0
You can't change the type of the code column on the fly the way you intend to.
Instead, first convert the code column from long to a list of longs. Currently it is a plain long (type j):
q)meta t
c | t f a
---------| -----
submitter| s
code | j
status | b
Update the type:
t: update enlist each code from t
Now the type of code is "J", which is a list of longs:
q)meta t
c | t f a
---------| -----
submitter| s
code | J
status | b
And then you can append an element to the code like this:
t:update code:{x,77} each code from t where status<>1b
q)t
submitter code status
----------------------
A ,2 1
B ,39 1
C 64 77 0
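The two steps above (turn every cell into a one-element list, then append only where the condition holds) may be easier to see in a plain-Python sketch. This is only an analogy to the q updates; the list-of-dicts layout and the sample code values are assumptions taken from the question's table.

```python
# Each dict is one row of the table t from the question.
t = [
    {"submitter": "A", "code": 2, "status": True},
    {"submitter": "B", "code": 39, "status": True},
    {"submitter": "C", "code": 64, "status": False},
]

# Step 1: wrap every code in a list, like `update enlist each code from t`.
for row in t:
    row["code"] = [row["code"]]

# Step 2: append 77 only where status is false, like the `where status<>1b` update.
for row in t:
    if not row["status"]:
        row["code"].append(77)

for row in t:
    print(row)
```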
I have a table bb:
bb:([]key1: 0 1 2 1 7; col1: 1 2 3 4 5; col2: 5 4 3 2 1; col3:("11";"22" ;"33" ;"44"; "55"))
How do I do a relational comparison on strings? Say I want to get the records with col3 less than or equal to "33":
select from bb where col3 <= "33"
Expected result:
key1 col1 col2 col3
0 1 5 11
1 2 4 22
2 3 3 33
If you want col3 to remain a string column, you can cast it temporarily within the qSQL query:
q)select from bb where ("J"$col3) <= 33
key1 col1 col2 col3
-------------------
0 1 5 "11"
1 2 4 "22"
2 3 3 "33"
If you are looking for classical string comparison, regardless of whether the string is a number, I would propose the following approach:
a. Create functions that behave like Java comparators: they return 0 when the strings are equal, -1 when the first string is less than the second, and 1 when the first is greater than the second.
.utils.compare: {$[x~y;0;$[x~first asc (x;y);-1;1]]};
.utils.less: {-1=.utils.compare[x;y]};
.utils.lessOrEq: {0>=.utils.compare[x;y]};
.utils.greater: {1=.utils.compare[x;y]};
.utils.greaterOrEq: {0<=.utils.compare[x;y]};
b. Use them in where clause
bb:([]key1: 0 1 2 1 7;
col1: 1 2 3 4 5;
col2: 5 4 3 2 1;
col3:("11";"22" ;"33" ;"44"; "55"));
select from bb where .utils.greaterOrEq["33"]'[col3]
c. As you see below, this works for arbitrary strings
cc:([]key1: 0 1 2 1 7;
col1: 1 2 3 4 5;
col2: 5 4 3 2 1;
col3:("abc" ;"def" ;"tyu"; "55poi"; "gab"));
select from cc where .utils.greaterOrEq["ffff"]'[col3]
.utils.compare could also be written in vector form, though I'm not sure whether it would be more time- or memory-efficient:
.utils.compareVector: {
?[x~'y;0;?[x~'first each asc each(enlist each x),'enlist each y;-1;1]]
};
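The comparator idea can be sketched in Python to show what lexicographic comparison does to the sample data. This is an illustrative analogy, not a translation of the q code; the names compare, less_or_eq and greater_or_eq are made up here. Python compares strings by code point, which for ASCII text puts digits before letters.

```python
def compare(x: str, y: str) -> int:
    """Three-way comparison: 0 if equal, -1 if x sorts before y, 1 otherwise."""
    if x == y:
        return 0
    # x is "less" when it comes first after sorting the pair.
    return -1 if sorted([x, y])[0] == x else 1


def less_or_eq(x: str, y: str) -> bool:
    return compare(x, y) <= 0


def greater_or_eq(x: str, y: str) -> bool:
    return compare(x, y) >= 0


# Keep the strings s with "33" >= s, i.e. s <= "33".
col3 = ["11", "22", "33", "44", "55"]
kept_numeric_like = [s for s in col3 if greater_or_eq("33", s)]

# Arbitrary strings: digits sort before letters, so "55poi" is below "ffff".
words = ["abc", "def", "tyu", "55poi", "gab"]
kept_words = [s for s in words if greater_or_eq("ffff", s)]

print(kept_numeric_like, kept_words)
```

One caveat with any lexicographic comparison: it agrees with numeric order only when the strings have the same number of digits. Here compare("9", "10") returns 1, because "9" sorts after "1".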
One way would be to evaluate the strings before comparison:
q)bb:([]key1: 0 1 2 1 7; col1: 1 2 3 4 5; col2: 5 4 3 2 1; col3:("11";"22" ;"33" ;"44"; "55"))
q)bb
key1 col1 col2 col3
-------------------
0 1 5 "11"
1 2 4 "22"
2 3 3 "33"
1 4 2 "44"
7 5 1 "55"
q)select from bb where 33>=value each col3
key1 col1 col2 col3
-------------------
0 1 5 "11"
1 2 4 "22"
2 3 3 "33"
In this case, value each evaluates each string to a number, and the comparison is then performed numerically.
I am trying to merge two pandas dataframes without index:
In [127]: df1
Out[127]:
value1 date id value2 group
0 -0.2284 2012-04-01 a -0.067469 group d
1 -0.4875 2012-04-01 b -0.021274 group d
2 0.1139 2012-04-01 c -0.015978 group d
3 0.3191 2012-04-01 d 0.022634 group d
4 -0.0077 2012-04-01 e 0.000000 group d
In [128]: df2
Out[128]:
date id value2 group
23044 2012-04-01 a -0.06701001 group c
23045 2012-04-01 b -0.02128 group c
23046 2012-04-01 c 0 group c
23047 2012-04-01 d 0 group c
23048 2012-04-01 e 0 group c
In [129]: pd.merge(df1, df2, how = 'outer', on = ['date', 'id', 'value2', 'group'])
Out[129]:
value1 date id value2 group
0 -0.2284 2012-04-01 a -0.067469 group d
1 -0.4875 2012-04-01 b -0.021274 group d
2 0.1139 2012-04-01 c -0.015978 group d
3 0.3191 2012-04-01 d 0.022634 group d
4 -0.0077 2012-04-01 e 0.000000 group d
5 NaN 2012-04-01 a -0.067010 group c
6 NaN 2012-04-01 b -0.021280 group c
7 NaN 2012-04-01 c 0.000000 group c
8 NaN 2012-04-01 d 0.000000 group c
9 NaN 2012-04-01 e 0.000000 group c
This is almost the desired output, except I would like the NaNs of value1 for group c to be filled by the value1 from group d according to date and id. What is the correct way to achieve that?
I think this is unavoidably a two-step process.
To "fill in" value1, you're relating any and all rows with the same (date, id), regardless of group or value.
In [5]: df3 = df2.set_index(['date', 'id']).join(
....: df1.set_index(['date', 'id'])['value1']).reset_index()
To get the final result, you're distinguishing rows by all attributes, no longer lumping together groups and values.
In [6]: pd.merge(df1, df3, how = 'outer',
....: on = ['date', 'id', 'value1', 'value2', 'group'])
Out[6]:
value1 date id value2 group
0 -0.2284 2012-04-01 a -0.067469 group_d
1 -0.4875 2012-04-01 b -0.021274 group_d
2 0.1139 2012-04-01 c -0.015978 group_d
3 0.3191 2012-04-01 d 0.022634 group_d
4 -0.0077 2012-04-01 e 0.000000 group_d
5 -0.2284 2012-04-01 a -0.067010 group_c
6 -0.4875 2012-04-01 b -0.021280 group_c
7 0.1139 2012-04-01 c 0.000000 group_c
8 0.3191 2012-04-01 d 0.000000 group_c
9 -0.0077 2012-04-01 e 0.000000 group_c
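For reference, here is a self-contained sketch of the two steps, with the frames rebuilt by hand from the question's printed output (dates are kept as plain strings, which is an assumption). The row order of the merged result may differ between pandas versions, so the sketch does not rely on it.

```python
import pandas as pd

# Frames reconstructed from the question's printed output.
df1 = pd.DataFrame({
    "value1": [-0.2284, -0.4875, 0.1139, 0.3191, -0.0077],
    "date": ["2012-04-01"] * 5,
    "id": list("abcde"),
    "value2": [-0.067469, -0.021274, -0.015978, 0.022634, 0.0],
    "group": ["group d"] * 5,
})
df2 = pd.DataFrame({
    "date": ["2012-04-01"] * 5,
    "id": list("abcde"),
    "value2": [-0.06701001, -0.02128, 0.0, 0.0, 0.0],
    "group": ["group c"] * 5,
})

# Step 1: look up value1 by (date, id) and attach it to the group c rows.
keys = ["date", "id"]
df3 = df2.set_index(keys).join(df1.set_index(keys)["value1"]).reset_index()

# Step 2: outer-merge on every column so the rows of both groups are kept.
merged = pd.merge(df1, df3, how="outer",
                  on=["date", "id", "value1", "value2", "group"])
print(merged)
```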