merge two columns in a dataframe

I have a df in which I want two columns combined or merged (I am not sure of the correct term), grouped by another column.
Here is my df
> print(BC_data)
Treatment Day LB PCA
Day2F1 Untreated 2 4400000 10900000
Day2F2 Untreated 2 5800000 5200000
Day2F3 Untreated 2 5700000 5900000
Day2F4 Metro 2 13100000 11500000
Day2F5 Metro 2 9600000 9100000
Day2F6 Metro 2 6900000 9700000
Day2F7 Pen 2 11400000 5100000
Day2F8 Pen 2 8000000 7300000
Day2F9 Pen 2 6300000 9300000
Day2F10 Rif 2 600000 4600000
Day2F11 Rif 2 400000 25000000
I would like to have the columns LB and PCA put together in one column, grouped by day, to become something like this:
Treatment Day LB-PCA
Day2F1 Untreated 2 4400000
Day2F2 Untreated 2 5800000
Day2F3 Untreated 2 5700000
Day2F1 Untreated 2 10900000
Day2F2 Untreated 2 5200000
Day2F3 Untreated 2 5900000
......
Can anyone help?
Thanks in advance

You can just concatenate the records in two steps. First, select all records in the original df, renaming column LB to LB-PCA. Next, concatenate all rows of the original df again, this time using PCA as LB-PCA. Finally, sort if needed.
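Here is a minimal pandas sketch of that two-step idea, assuming the data is in a pandas DataFrame (only a few rows of BC_data are reproduced for illustration; if you are working in R, rbind after renaming the columns does the same thing):

import pandas as pd

# hypothetical subset of BC_data, just for illustration
BC_data = pd.DataFrame(
    {"Treatment": ["Untreated", "Untreated", "Metro"],
     "Day": [2, 2, 2],
     "LB": [4400000, 5800000, 13100000],
     "PCA": [10900000, 5200000, 11500000]},
    index=["Day2F1", "Day2F2", "Day2F4"],
)

# step 1: keep LB, renamed to LB-PCA
lb = BC_data[["Treatment", "Day", "LB"]].rename(columns={"LB": "LB-PCA"})

# step 2: keep PCA, also renamed to LB-PCA
pca = BC_data[["Treatment", "Day", "PCA"]].rename(columns={"PCA": "LB-PCA"})

# step 3: stack the two pieces and sort by Day (then Treatment) if needed
combined = pd.concat([lb, pca]).sort_values(["Day", "Treatment"])
print(combined)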


How do I convert a dictionary of dictionaries into a table?

I've got a dictionary of dictionaries:
`1`2!((`a`b`c!(1 2 3));(`a`b`c!(4 5 6)))
| a b c
-| -----
1| 1 2 3
2| 4 5 6
I'm trying to work out how to turn it into a table that looks like:
1 a 1
1 b 2
1 c 3
2 a 4
2 b 5
2 c 6
What's the easiest/'right' way to achieve this in KDB?
Not sure if this is the shortest or best way, but my solution is:
ungroup flip `c1`c2`c3!{(key x;value key each x;value value each x)}`1`2!((`a`b`c!(1 2 3));(`a`b`c!(4 5 6)))
This gives the expected table with column names c1, c2 and c3.
What you're essentially trying to do is to "unpivot" - see the official pivot page here: https://code.kx.com/q/kb/pivoting-tables/
Unfortunately that page doesn't give a function for unpivoting, as it isn't trivial and it's hard to write a fully general solution. However, if you search the Kx/K4/community archives for "unpivot" you'll find some examples of unpivot functions, for example this one from Aaron Davies:
unpiv:{[t;k;p;v;f] ?[raze?[t;();0b;{x!x}k],'/:(f C){![z;();0b;x!enlist each (),y]}[p]'v xcol't{?[x;();0b;y!y,:()]}/:C:(cols t)except k;enlist(not;(.q.each;.q.all;(null;v)));0b;()]};
Using this, your problem (after a little tweak to the input) becomes:
q)t:([]k:`1`2)!((`a`b`c!(1 2 3));(`a`b`c!(4 5 6)));
q)`k xasc unpiv[t;1#`k;1#`p;`v;::]
k v p
-----
1 1 a
1 2 b
1 3 c
2 4 a
2 5 b
2 6 c
This solution is probably more complicated than it needs to be for your use case as it tries to solve for the general case of unpivoting.
Just an update to this: I solved the problem in a different way from the selected answer.
In the end, I:
Converted each row into a table with one row in it and all the columns I needed.
Joined all the tables together.

Spark window functions that depend on themselves

Say I have a column of sorted timestamps in a DataFrame. I want to write a function that adds a column to this DataFrame that cuts the timestamps into sequential time slices according to the following rules:
start at the first row and keep iterating down to the end
for each row, if you've walked n number of rows in the current group OR you have walked more than time interval t in the current group, make a cut
return a new column with the group assignment for each row, which should be an increasing integer
In English: each group should be no more than n rows, and should not span more than t time
For example: (Using integers for timestamps to simplify)
INPUT
time
---------
1
2
3
5
10
100
2000
2001
2002
2003
OUTPUT (after slice function with n = 3 and t = 5)
time | group
----------|------
1 | 1
2 | 1
3 | 1
5 | 2 // cut because there were no cuts in the last 3 rows
10 | 2
100 | 3 // cut because 100 - 5 > 5
2000 | 4 // cut because 2000 - 100 > 5
2001 | 4
2002 | 4
2003 | 5 // cut because there were no cuts in the last 3 rows
I have a feeling this can be done with window functions in Spark. After all, window functions were created to help developers compute moving averages: you basically calculate an aggregate (in this case the average) of a column (the stock price) per window of n rows.
The same should be achievable here. For each row, if the last n rows contain no cut, or the timespan between the last cut and the current timestamp is greater than t, then cut = true, otherwise cut = false. But what I can't seem to figure out is how to make the window function aware of itself. That would be like a moving average where each row is aware of the previous moving average.
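To pin down the expected behaviour, here is a minimal plain-Python sketch of the slicing rules above (a hypothetical helper, not a Spark or window-function solution):

def slice_groups(times, n, t):
    # walk the sorted timestamps, cutting when the current group
    # reaches n rows or would span more than t time
    groups = []
    group, rows_in_group, group_start = 1, 0, None
    for ts in times:
        if rows_in_group == 0:
            group_start = ts
        elif rows_in_group >= n or ts - group_start > t:
            group += 1
            rows_in_group = 0
            group_start = ts
        rows_in_group += 1
        groups.append(group)
    return groups

print(slice_groups([1, 2, 3, 5, 10, 100, 2000, 2001, 2002, 2003], n=3, t=5))
# [1, 1, 1, 2, 2, 3, 4, 4, 4, 5]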

How to stack two columns for grouping?

I have the following DataFrame df that represents a graph with nodes A, B, C and D. Each node belongs to a group 1 or 2:
src dst group_src group_dst
A B 1 1
A B 1 1
B A 1 1
A C 1 2
C D 2 2
D C 2 2
I need to calculate the number of distinct nodes and the number of edges per group. The result should be the following:
group nodes_count edges_count
1 2 3
2 2 2
The edge A->C is not considered because the nodes belong to different groups.
I do not know how to stack the columns group_src and group_dst into a single group column to group by. I also do not know how to count the edges inside each group. Here is my attempt so far:
df
.groupBy("group_src","group_dst")
.agg(countDistinct("src","dst").as("nodes_count"))
I think it may be necessary to use two steps:
val edges = df.filter($"group_src" === $"group_dst")
  .groupBy($"group_src".as("group"))
  .agg(count("*").as("edges_count"))
val nodes = df.select($"src".as("id"), $"group_src".as("group"))
  .union(df.select($"dst".as("id"), $"group_dst".as("group")))
  .groupBy("group").agg(countDistinct($"id").as("nodes_count"))
nodes.join(edges, "group")
You can accomplish "stacking" of columns by using .union() after selecting specific columns.
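For readers on the Python API, here is a sketch of the same two-step approach in PySpark, using the sample data from the question (the local SparkSession is created only to make the example self-contained):

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.master("local[*]").appName("stack-columns").getOrCreate()

df = spark.createDataFrame(
    [("A", "B", 1, 1), ("A", "B", 1, 1), ("B", "A", 1, 1),
     ("A", "C", 1, 2), ("C", "D", 2, 2), ("D", "C", 2, 2)],
    ["src", "dst", "group_src", "group_dst"])

# edges whose endpoints share a group, counted per group
edges = (df.filter(F.col("group_src") == F.col("group_dst"))
           .groupBy(F.col("group_src").alias("group"))
           .agg(F.count("*").alias("edges_count")))

# stack src and dst into a single id column, then count distinct ids per group
nodes = (df.select(F.col("src").alias("id"), F.col("group_src").alias("group"))
           .union(df.select(F.col("dst").alias("id"), F.col("group_dst").alias("group")))
           .groupBy("group")
           .agg(F.countDistinct("id").alias("nodes_count")))

nodes.join(edges, "group").orderBy("group").show()
# group 1: nodes_count 2, edges_count 3; group 2: nodes_count 2, edges_count 2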

Create a new binary column based on a join in Spark

My situation is that I have two Spark DataFrames, dfPopulation and dfSubpopulation.
dfSubpopulation is just that, a subpopulation of dfPopulation.
I would like a clean way to create a new column in dfPopulation that is a binary indicator of whether the dfPopulation row appears in dfSubpopulation. E.g. what I want is to create the new DataFrame dfPopulationNew:
dfPopulation = X Y key
1 2 A
2 2 A
3 2 B
4 2 C
5 3 C
dfSubpopulation = X Y key
1 2 A
3 2 B
4 2 C
dfPopulationNew = X Y key inSubpopulation
1 2 A 1
2 2 A 0
3 2 B 1
4 2 C 1
5 3 C 0
I know this could be done fairly simply with a SQL statement, however, given that a lot of Spark's optimization now revolves around the DataFrame construct, I would like to utilize that.
Using Spark SQL compared to DataFrame operations should make no difference from a performance perspective; the execution plan is the same. That said, here is one way to do it using a join:
val dfPopulationNew = dfPopulation.join(
  dfSubpopulation.withColumn("inSubpopulation", lit(1)),
  Seq("X", "Y", "key"),
  "left_outer")
  .na.fill(0, Seq("inSubpopulation"))

select rows by comparing columns using HDFStore

How can I select some rows by comparing two columns from an HDF5 file using pandas? The HDF5 file is too big to load into memory. For example, I want to select rows where column A and column B are equal. The dataframe is saved in the file 'mydata.hdf5'. Thanks.
import pandas as pd
store = pd.HDFStore('mydata.hdf5')
df = store.select('mydf',where='A=B')
This doesn't work. I know that store.select('mydf', where='A==12') will work, but I want to compare columns A and B. The example data looks like this:
A B C
1 1 3
1 2 4
. . .
2 2 5
1 3 3
You cannot directly do this, but the following will work:
In [23]: df = pd.DataFrame({'A' : [1,2,3], 'B' : [2,2,2]})
In [24]: store = pd.HDFStore('test.h5',mode='w')
In [26]: store.append('df',df,data_columns=True)
In [27]: store.select('df')
Out[27]:
A B
0 1 2
1 2 2
2 3 2
In [28]: store.select_column('df','A') == store.select_column('df','B')
Out[28]:
0 False
1 True
2 False
dtype: bool
This should be pretty efficient.
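If you then want to pull only the matching rows rather than just the boolean mask, one option (a sketch, assuming the table was appended with data_columns=True as above) is to turn the mask into integer row coordinates and pass them back to select:

# build the mask out-of-core, then fetch only the rows where it is True
mask = store.select_column('df', 'A') == store.select_column('df', 'B')
result = store.select('df', where=mask[mask].index)  # row coordinates of the True entries
print(result)
#    A  B
# 1  2  2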