Confusion regarding merge in pandas

I am trying to merge two pandas dataframes without using the index:
In [127]: df1
Out[127]:
   value1        date id    value2    group
0 -0.2284  2012-04-01  a -0.067469  group d
1 -0.4875  2012-04-01  b -0.021274  group d
2  0.1139  2012-04-01  c -0.015978  group d
3  0.3191  2012-04-01  d  0.022634  group d
4 -0.0077  2012-04-01  e  0.000000  group d
In [128]: df2
Out[128]:
             date id       value2    group
23044  2012-04-01  a  -0.06701001  group c
23045  2012-04-01  b     -0.02128  group c
23046  2012-04-01  c            0  group c
23047  2012-04-01  d            0  group c
23048  2012-04-01  e            0  group c
In [129]: pd.merge(df1, df2, how = 'outer', on = ['date', 'id', 'value2', 'group'])
Out[129]:
   value1        date id    value2    group
0 -0.2284  2012-04-01  a -0.067469  group d
1 -0.4875  2012-04-01  b -0.021274  group d
2  0.1139  2012-04-01  c -0.015978  group d
3  0.3191  2012-04-01  d  0.022634  group d
4 -0.0077  2012-04-01  e  0.000000  group d
5     NaN  2012-04-01  a -0.067010  group c
6     NaN  2012-04-01  b -0.021280  group c
7     NaN  2012-04-01  c  0.000000  group c
8     NaN  2012-04-01  d  0.000000  group c
9     NaN  2012-04-01  e  0.000000  group c
This is almost the desired output, except that I would like the NaNs in value1 for group c to be filled with the value1 from group d, matched by date and id. What is the correct way to achieve that?

I think this is unavoidably a two-step process.
To "fill in" value1, you're relating any and all rows with the same (date, id), regardless of group or value.
In [5]: df3 = df2.set_index(['date', 'id']).join(
....: df1.set_index(['date', 'id'])['value1']).reset_index()
To get the final result, you're distinguishing rows by all attributes, no longer lumping together groups and values.
In [6]: pd.merge(df1, df3, how = 'outer',
....: on = ['date', 'id', 'value1', 'value2', 'group'])
Out[6]:
   value1        date id    value2    group
0 -0.2284  2012-04-01  a -0.067469  group d
1 -0.4875  2012-04-01  b -0.021274  group d
2  0.1139  2012-04-01  c -0.015978  group d
3  0.3191  2012-04-01  d  0.022634  group d
4 -0.0077  2012-04-01  e  0.000000  group d
5 -0.2284  2012-04-01  a -0.067010  group c
6 -0.4875  2012-04-01  b -0.021280  group c
7  0.1139  2012-04-01  c  0.000000  group c
8  0.3191  2012-04-01  d  0.000000  group c
9 -0.0077  2012-04-01  e  0.000000  group c
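For completeness, a minimal sketch of an alternative, assuming the df1 and df2 above: attach value1 to df2 with a left merge on (date, id), then stack the two frames. The names df2_filled and result are mine, and the column order may differ from the merge output.
In [7]: df2_filled = df2.merge(df1[['date', 'id', 'value1']],
   ....:                       on=['date', 'id'], how='left')
In [8]: result = pd.concat([df1, df2_filled], ignore_index=True)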

Related

Summary and crosstabulation in PySpark (Databricks)

I have a PySpark dataframe for which I want to calculate summary statistics (the count of all unique categories in each column) and a cross-tabulation with one fixed column, for all string columns.
For example, my df is like this:
col1 col2 col3
Cat1 XYZ  A
Cat1 XYZ  C
Cat1 ABC  B
Cat2 ABC  A
Cat2 XYZ  B
Cat2 MNO  A
I want something like this:
VarName Category Count A B C
col1    Cat1     3     1 1 1
col1    Cat2     3     2 0 1
col2    XYZ      3     1 1 1
col2    ABC      2     1 1 0
col2    MNO      1     1 0 0
col3    A        3     3 0 0
col3    B        2     0 2 0
col3    C        1     0 0 1
So, basically, I want a cross-tabulation of each individual column against col3, plus the total count.
I can do this with a loop in plain Python, but loops work somewhat differently in PySpark.
Here are my 2 cents.
First, create a sample dataframe:
df = spark.createDataFrame(
    [("Cat1", "XYZ", "A"),
     ("Cat1", "XYZ", "C"),
     ("Cat1", "ABC", "B"),
     ("Cat2", "ABC", "A"),
     ("Cat2", "XYZ", "B"),
     ("Cat2", "MNO", "A")],
    schema=['col1', 'col2', 'col3'])
Then use the crosstab function, which counts the occurrences of each col3 value per category; compute the total row count in a new column, add a constant column holding the source column name, and rename the crosstab key column to Category.
Finally, union all of these dataframes:
import pyspark.sql.functions as fx

df_union = (
    df.crosstab('col1', 'col3')
      .withColumn('count', fx.expr('A + B + C'))
      .withColumn('VarName', fx.lit('col1'))
      .withColumnRenamed('col1_col3', 'Category')
      .union(df.crosstab('col2', 'col3')
               .withColumn('count', fx.expr('A + B + C'))
               .withColumn('VarName', fx.lit('col2'))
               .withColumnRenamed('col2_col3', 'Category'))
      .union(df.crosstab('col3', 'col3')
               .withColumn('count', fx.expr('A + B + C'))
               .withColumn('VarName', fx.lit('col3'))
               .withColumnRenamed('col3_col3', 'Category')))
Finally, print the dataframe with the columns in the desired order:
df_union.select('VarName','Category','count','A','B','C').show()
The output matches the requested layout.
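Since the question mentions looping over columns, here is a minimal sketch of the same union built with a loop. It assumes the fixed column is col3 and that its categories are exactly A, B and C, as in the sample data; the function name crosstab_summary is mine:
from functools import reduce
import pyspark.sql.functions as fx

def crosstab_summary(df, cols, against='col3'):
    # Build one crosstab per column, normalized to a common schema.
    parts = []
    for c in cols:
        ct = (df.crosstab(c, against)
                .withColumn('count', fx.expr('A + B + C'))  # assumes categories A, B, C
                .withColumn('VarName', fx.lit(c))
                .withColumnRenamed(c + '_' + against, 'Category')
                .select('VarName', 'Category', 'count', 'A', 'B', 'C'))
        parts.append(ct)
    # Positional union of the per-column summaries.
    return reduce(lambda a, b: a.union(b), parts)

crosstab_summary(df, ['col1', 'col2', 'col3']).show()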

KDB: String comparison with a table

I have a table bb:
bb:([]key1: 0 1 2 1 7; col1: 1 2 3 4 5; col2: 5 4 3 2 1; col3:("11";"22" ;"33" ;"44"; "55"))
How do I do a relational comparison of strings? Say I want to get records with col3 less than or equal to "33":
select from bb where col3 <= "33"
Expected result:
key1 col1 col2 col3
0    1    5    11
1    2    4    22
2    3    3    33
If you want col3 to remain of string type, you can just cast temporarily within the qsql query:
q)select from bb where ("J"$col3) <= 33
key1 col1 col2 col3
-------------------
0 1 5 "11"
1 2 4 "22"
2 3 3 "33"
If you are looking for classical string comparison, regardless of whether the strings hold numbers or not, I would propose the following approach:
a. Create functions that behave like common Java Comparators: they return 0 when the strings are equal, -1 when the first string is less than the second, and 1 when the first is greater than the second.
.utils.compare: {$[x~y;0;$[x~first asc (x;y);-1;1]]};
.utils.less: {-1=.utils.compare[x;y]};
.utils.lessOrEq: {0>=.utils.compare[x;y]};
.utils.greater: {1=.utils.compare[x;y]};
.utils.greaterOrEq: {0<=.utils.compare[x;y]};
b. Use them in the where clause:
bb:([]key1: 0 1 2 1 7;
    col1: 1 2 3 4 5;
    col2: 5 4 3 2 1;
    col3: ("11"; "22"; "33"; "44"; "55"));
select from bb where .utils.greaterOrEq["33"]'[col3]
c. As you can see below, this works for arbitrary strings:
cc:([]key1: 0 1 2 1 7;
    col1: 1 2 3 4 5;
    col2: 5 4 3 2 1;
    col3: ("abc"; "def"; "tyu"; "55poi"; "gab"));
select from cc where .utils.greaterOrEq["ffff"]'[col3]
.utils.compare could also be written in vector form, though I'm not sure whether it will be more time/memory efficient:
.utils.compareVector: {
    ?[x~'y; 0; ?[x~'first each asc each (enlist each x),'enlist each y; -1; 1]]
};
One way would be to evaluate the strings before comparison:
q)bb:([]key1: 0 1 2 1 7; col1: 1 2 3 4 5; col2: 5 4 3 2 1; col3:("11";"22" ;"33" ;"44"; "55"))
q)bb
key1 col1 col2 col3
-------------------
0 1 5 "11"
1 2 4 "22"
2 3 3 "33"
1 4 2 "44"
7 5 1 "55"
q)select from bb where 33>=value each col3
key1 col1 col2 col3
-------------------
0 1 5 "11"
1 2 4 "22"
2 3 3 "33"
In this case, value each evaluates the strings to their integer values, and the comparison is then performed on those integers.

KDB: select first n rows from each group

How can I extract the first n rows from each group? For example, for the table
bb: ([]sym:(4#`a),(5#`b);val: til 9)
sym val
-------
a   0
a   1
a   2
a   3
b   4
b   5
b   6
b   7
b   8
How can I select the first 2 rows of each group by sym?
Thanks
You can use fby: the lambda {x in 2#x} is applied to the row indices i within each sym group, keeping the rows whose index is among the first two of that group:
q)select from bb where ({x in 2#x};i) fby sym
sym val
-------
a 0
a 1
b 4
b 5
You can try this:
q)select from bb where i in raze exec 2#i by sym from bb
sym val
-------
a 0
a 1
b 4
b 5

How to produce a grouped column from many rows in PostgreSQL

I can produce this result from a many-to-many relationship with this kind of query:
SELECT x1.id AS id1, x3.id AS id3
FROM humans x1
LEFT JOIN memberships x2
ON x1.id = x2.human_id
LEFT JOIN groups x3
ON x2.group_id = x3.id
WHERE x1.id IN ( 1,2,3,4,5 )
ORDER BY 1,2
id1 | id3
----+----
1 | A
1 | B
1 | C
2 | D
2 | E
3 | F
4 | (null)
5 | G
5 | Z
How do I convert it into this kind of table?
id1 | id3s
----+--------
1 | A, B, C
2 | D, E
3 | F
4 | (null)
5 | G, Z
Use string_agg and a group by:
SELECT x1.id AS id1, string_agg(x3.id, ', ' order by x3.id asc) AS id3s
FROM humans x1
LEFT JOIN memberships x2
ON x1.id = x2.human_id
LEFT JOIN groups x3
ON x2.group_id = x3.id
WHERE x1.id IN ( 1,2,3,4,5 )
GROUP BY x1.id
ORDER BY 1

Query in PostgreSQL

I have two tables, t1 and t2, as follows:
t1
A B C D E
1 2 c d e
3 1 d e f
4 2 f g h
t2
A B
1 2
8 6
4 2
Here A, B, C, D, E are the columns of t1 and A, B are the columns of t2, where A and B are the common columns.
What I have done so far
I have written the following query
WITH temp as (
select *
from t2
)
select tab1.*
from t1 tab1, temp tab2
where (tab1.A!=tab2.A OR tab1.B!=tab2.B)
I wanted this output
A B C D E
3 1 d e f
But I am getting this output
A B C D E
1 2 c d e
1 2 c d e
3 1 d e f
3 1 d e f
3 1 d e f
4 2 f g h
4 2 f g h
What query should I use?
Your query cross-joins t1 with t2 and keeps every pairing where the rows differ, which is why you get duplicates. If I understand you correctly, you'd like those rows from t1 that don't have corresponding rows in t2. The easiest way in my opinion is a LEFT OUTER JOIN:
psql=> select * from t1;
a | b | c | d | e
---+---+---+---+---
1 | 2 | c | d | e
3 | 1 | d | e | f
4 | 2 | f | g | h
(3 rows)
psql=> select * from t2;
a | b
---+---
1 | 2
8 | 6
4 | 2
(3 rows)
psql=> select t1.a, t1.b, t1.c, t1.d, t1.e from t1 left outer join t2 on (t1.a = t2.a and t1.b = t2.b) where t2.a is null;
a | b | c | d | e
---+---+---+---+---
3 | 1 | d | e | f
(1 row)
Edit: Here's the select without the where clause, with the rows from t2 added (otherwise it'd be just like a select * from t1). As you can see, the first row contains NULLs for t2_a and t2_b:
psql=> select t1.a, t1.b, t1.c, t1.d, t1.e, t2.a as t2_a, t2.b as t2_b from t1 left outer join t2 on (t1.a = t2.a and t1.b = t2.b);
a | b | c | d | e | t2_a | t2_b
---+---+---+---+---+------+------
3 | 1 | d | e | f | |
1 | 2 | c | d | e | 1 | 2
4 | 2 | f | g | h | 4 | 2
(3 rows)
How about:
SELECT * FROM t1 WHERE (a,b) NOT IN (SELECT a,b FROM t2);
One caveat: NOT IN matches no rows at all if the subquery returns any NULLs, so NOT EXISTS is the safer pattern when t2.a or t2.b can be NULL.
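For reference, a minimal sketch of the NULL-safe NOT EXISTS form of the same anti-join, against the t1 and t2 above:
SELECT *
FROM t1
WHERE NOT EXISTS (
  SELECT 1
  FROM t2
  WHERE t2.a = t1.a
    AND t2.b = t1.b
);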