I have the following DataFrame df that represents a graph with nodes A, B, C and D. Each node belongs to group 1 or group 2:
src dst group_src group_dst
A   B   1         1
A   B   1         1
B   A   1         1
A   C   1         2
C   D   2         2
D   C   2         2
I need to calculate the number of distinct nodes and the number of edges per group. The result should be the following:
group nodes_count edges_count
1     2           3
2     2           2
The edge A->C is not considered because the nodes belong to different groups.
I do not know how to stack the columns group_src and group_dst so that I can group by a single group column, and I also do not know how to count the edges inside each group.
df
  .groupBy("group_src", "group_dst")
  .agg(countDistinct("src", "dst").as("nodes_count"))
I think it may be necessary to use two steps:
import org.apache.spark.sql.functions.{count, countDistinct}

val edges = df.filter($"group_src" === $"group_dst")
  .groupBy($"group_src".as("group"))
  .agg(count("*").as("edges_count"))

val nodes = df.select($"src".as("id"), $"group_src".as("group"))
  .union(df.select($"dst".as("id"), $"group_dst".as("group")))
  .groupBy("group")
  .agg(countDistinct($"id").as("nodes_count"))

nodes.join(edges, "group")
You can accomplish "stacking" of columns by using .union() after selecting specific columns.
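To sanity-check, here is a small sketch that builds the example data; it assumes a SparkSession named spark with its implicits imported, which is not shown in the question. With df defined this way first, the edges/nodes/join steps above give the expected result:

import spark.implicits._

// Example data from the question, built as a DataFrame (assumes `spark` is in scope).
val df = Seq(
  ("A", "B", 1, 1), ("A", "B", 1, 1), ("B", "A", 1, 1),
  ("A", "C", 1, 2), ("C", "D", 2, 2), ("D", "C", 2, 2)
).toDF("src", "dst", "group_src", "group_dst")

nodes.join(edges, "group").show()
// +-----+-----------+-----------+
// |group|nodes_count|edges_count|
// +-----+-----------+-----------+
// |    1|          2|          3|
// |    2|          2|          2|
// +-----+-----------+-----------+
// (row order may vary)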
I have a table
t:([]a:`a`b`c;b:1 2 3;c:`x`y`z)
I would like to iterate and process each row.
The thing is that the processing logic for each row may produce an arbitrary number of rows, so after the full iteration the result may look like this, e.g.:
results:([]a:`a1`b1`b2`b3`c1`c2;x:1 2 2 2 3 3)
I have the following idea so far, but it doesn't seem to work:
uj { // some processing function } each t
But how does one return an arbitrary number of rows and append the results into a new table?
Assuming you are using something from the table entries to determine your arbitrary value, you can use a dictionary to look up a number (or a function) which can be used to apply these values.
In this example, I use the c column of the original table to indicate the number of rows to return (and the number from 1 to count to).
As each entry of the table is a dictionary, I can index using the column names to get the values and build a new table.
I also use raze to join each of the results together, as they will each have the same schema.
raze {[x]
d:`x`y`z!1 3 2;
([]a:((),`$string[x[`a]],/:string 1+til d[x[`c]]);x:((),d[x[`c]])#x[`b])
} each t
Not sure if this is what you want, but you can try something like this:
ungroup select a:`${y,/:x}[string b]'[string a],b from t
Or you can use accumulators if you need the results of the previous row's calculations, like this:
{y[`b]+:last[x]`b;x,y}/[t;t]
If your processing function is outputting tables that conform, just raze should suffice:
raze {y#enlist x}'[t;1 3 2]
a b c
-----
a 1 x
b 2 y
b 2 y
b 2 y
c 3 z
c 3 z
Otherwise use (uj/):
(uj/) {y#enlist x}'[t;1 3 2]
a b c
-----
a 1 x
b 2 y
b 2 y
b 2 y
c 3 z
c 3 z
Your best answer will depend very much on how you want to use the results computed from each row of t. It might suit you to normalise t; it might not. The key point here:
A table cell can be any q data structure.
The minimum you can do in this regard is to store the result of your processing function in a new column.
Below, an arbitrary binary function f returns its result as a dictionary.
q)f:{n:1+rand 3;(`$string[x],/:"123" til n)!n#y}
q)f [`a;2]
a1| 2
a2| 2
q)update d:a f'b from t
a b c d
---------------------
a 1 x `a1`a2`a3!1 1 1
b 2 y (,`b1)!,2
c 3 z `c1`c2!3 3
But its result could be any q data structure.
You were considering a unary processing function:
q)pf:{#[x;`d;:;] f . x`a`b}
q)pf each t
a b c d
---------------------
a 1 x `a1`a2`a3!1 1 1
b 2 y `b1`b2!2 2
c 3 z `c1`c2`c3!3 3 3
You might find other suggestions at KX Community.
If I understand your question correctly, you need something like this:
(uj/){}each t
Check this bit:
(uj/)enlist[t],{x:update x:i from?[rand[20]#enlist x;();0b;{x!x}rand[4]#cols[x]];{(x;![x;();0b;(enlist`a)!enlist($;enlist`;((';{raze string(x;y)});`a;`i))])[y~`a]}/[x;cols x]}each t
This part:
x:update x:i from
/ functional form of a select that takes random rows/columns
?[rand[20]#enlist x;();0b;{x!x}rand[4]#cols[x]];
/ some form of if-else and an update to generate column a (not bulletproof)
{(x;![x;();0b;(enlist`a)!enlist($;enlist`;((';{raze string(x;y)});`a;`i))])[y~`a]}/[x;cols x]
Basically the above gives something like:
q){x:update x:i from?[rand[20]#enlist x;();0b;{x!x}rand[4]#cols[x]];{(x;![x;();0b;(enlist`a)!enlist($;enlist`;((';{raze string(x;y)});`a;`i))])[y~`a]}/[x;cols x]}each t
+`a`b`c`x!(`a0`a1`a2`a3`a4`a5`a6`a7;1 1 1 1 1 1 1 1;`x`x`x`x`x`x`x`x;0 1 2 3 ..
+`a`x!(`a0`a1`a2`a3`a4`a5;0 1 2 3 4 5)
+`a`b`c`x!(`a0`a1`a2;1 1 1;`x`x`x;0 1 2)
+`a`b`c`x!(`a0`a1`a2`a3`a4`a5`a6`a7`a8`a9`a10`a11;1 1 1 1 1 1 1 1 1 1 1 1;`x`..
Or, taking the first one:
q)first{x:update x:i from?[rand[20]#enlist x;();0b;{x!x}rand[4]#cols[x]];{(x;![x;();0b;(enlist`a)!enlist($;enlist`;((';{raze string(x;y)});`a;`i))])[y~`a]}/[x;cols x]}each t
a   b x
--------
a0  1 0
a1  1 1
a2  1 2
a3  1 3
a4  1 4
a5  1 5
a6  1 6
a7  1 7
a8  1 8
a9  1 9
a10 1 10
You can do
(uj/)enlist[t],{ // some function }each t
to get what you want. Drop the enlist[t] if you don't want the table you start with in your result.
Hope this helps.
My situation is I have two spark data frames, dfPopulation and dfSubpopulation.
dfSubpopulation is just that: a subpopulation of dfPopulation.
I would like a clean way to create a new column in dfPopulation that is a binary indicator of whether each row appears in dfSubpopulation. E.g. what I want is to create the new DataFrame dfPopulationNew:
dfPopulation =
X Y key
1 2 A
2 2 A
3 2 B
4 2 C
5 3 C

dfSubpopulation =
X Y key
1 2 A
3 2 B
4 2 C

dfPopulationNew =
X Y key inSubpopulation
1 2 A   1
2 2 A   0
3 2 B   1
4 2 C   1
5 3 C   0
I know this could be done fairly simply with a SQL statement, but given that a lot of Spark's optimization now works through the DataFrame construct, I would like to utilize that.
Using Spark SQL compared to DataFrame operations should make no difference from a performance perspective; the execution plan is the same. That said, here is one way to do it using a join:
import org.apache.spark.sql.functions.lit

val dfPopulationNew = dfPopulation.join(
    dfSubpopulation.withColumn("inSubpopulation", lit(1)),
    Seq("X", "Y", "key"),
    "left_outer")
  .na.fill(0, Seq("inSubpopulation"))
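One caveat worth noting: if dfSubpopulation could ever contain duplicate (X, Y, key) rows, the left join above would duplicate the matching dfPopulation rows. A minimal variant (a sketch, using the same column names as above) that guards against this by de-duplicating the join keys first:

import org.apache.spark.sql.functions.lit

val subKeys = dfSubpopulation
  .select("X", "Y", "key")
  .dropDuplicates()                     // one row per distinct key combination
  .withColumn("inSubpopulation", lit(1))

val dfPopulationNew = dfPopulation
  .join(subKeys, Seq("X", "Y", "key"), "left_outer")
  .na.fill(0, Seq("inSubpopulation"))   // rows with no match get 0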
I have a df in which I want two columns combined or merged (I am not sure of the correct term), grouped by another column.
Here is my df
> print(BC_data)
        Treatment Day       LB      PCA
Day2F1  Untreated   2  4400000 10900000
Day2F2  Untreated   2  5800000  5200000
Day2F3  Untreated   2  5700000  5900000
Day2F4      Metro   2 13100000 11500000
Day2F5      Metro   2  9600000  9100000
Day2F6      Metro   2  6900000  9700000
Day2F7        Pen   2 11400000  5100000
Day2F8        Pen   2  8000000  7300000
Day2F9        Pen   2  6300000  9300000
Day2F10       Rif   2   600000  4600000
Day2F11       Rif   2   400000 25000000
I would like to have the columns LB and PCA put together in one column, grouped by day, to become something like this:
        Treatment Day   LB-PCA
Day2F1  Untreated   2  4400000
Day2F2  Untreated   2  5800000
Day2F3  Untreated   2  5700000
Day2F1  Untreated   2 10900000
Day2F2  Untreated   2  5200000
Day2F3  Untreated   2  5900000
......
Can anyone help?
Thanks in advance
You can just concatenate the records in two steps. First, select all records from the original df, renaming column LB as LB-PCA. Next, append all rows from the original df again, this time using PCA as LB-PCA. Finally, sort if needed.
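As an illustration of that select-rename-append pattern, here is a sketch written against a Spark DataFrame, since that is what the other examples in this document use; the BC_data shown above looks like an R data frame, so treat the Spark setting and the names below as assumptions rather than a drop-in answer:

import org.apache.spark.sql.functions.col

// Step 1: Treatment and Day, with LB exposed under the combined name LB-PCA.
val lbRows = BC_data.select(col("Treatment"), col("Day"), col("LB").as("LB-PCA"))

// Step 2: the same columns again, this time exposing PCA as LB-PCA.
val pcaRows = BC_data.select(col("Treatment"), col("Day"), col("PCA").as("LB-PCA"))

// Step 3: stack the two halves, then sort if needed.
val combined = lbRows.union(pcaRows).orderBy("Day", "Treatment")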
I have two tables
table 1 (orders) columns: (date,symbol,qty)
table 2 (marketData) columns: (date,symbol,close price)
I want to add the close for T+0 to T+5 to table 1.
{[nday]
value "temp0::update date",string[nday],":mdDates[DateInd+",string[nday],"] from orders";
value "temp::temp0 lj 2! select date",string[nday],":date,sym,close",string[nday],":close from marketData";
table1::temp
} each (1+til 5)
I'm sure there is a better way to do this, but I get a 'loop error when I try to run this function. Any suggestions?
See here for common errors. Your 'loop error is because you're setting views with value, not globals. Inside a function, value evaluates as if it's outside the function, so you don't need the ::.
That said, there's lots of room for improvement; here are a few pointers.
You don't need the value at all in your case. E.g. the first line can be reduced to this (I'm assuming mdDates is some kind of function you're just dropping in to work out the date from an integer, and DateInd some kind of global):
{[nday]
temp0:update date:mdDates[nday;DateInd] from orders;
....
} each (1+til 5)
In this bit it just looks like you're trying to append something to the column name:
select date",string[nday],":date
Remember that tables are flipped dictionaries... you can mess with their column names via the keys, as illustrated (very noddily) below:
q)t:flip `a`b!(1 2; 3 4)
q)t
a b
---
1 3
2 4
q)flip ((`$"a","1"),`b)!(t`a;t`b)
a1 b
----
1 3
2 4
You can also use functional select, which is much neater IMO:
q)?[t;();0b;((`$"a","1"),`b)!(`a`b)]
a1 b
----
1 3
2 4
It seems like you wanted to have p0 to p5 columns with prices corresponding to the dates date+0 to date+5.
Using the adverb over to iterate over 0 to 5 days:
q)orders:([] date:(2018.01.01+til 5); sym:5?`A`G; qty:5?10)
q)data:([] date:20#(2018.01.01+til 10); sym:raze 10#'`A`G; price:20?10+10.)
q)delete d from {c:`$"p",string[y]; (update d:date+y from x) lj 2!(`d`sym,c )xcol 0!data}/[ orders;0 1 2 3 4]
date       sym qty p0       p1       p2       p3       p4
---------------------------------------------------------------
2018.01.01 A   0   10.08094 6.027448 6.045174 18.11676 1.919615
2018.01.02 G   3   13.1917  8.515314 19.018   19.18736 6.64622
2018.01.03 A   2   6.045174 18.11676 1.919615 14.27323 2.255483
2018.01.04 A   7   18.11676 1.919615 14.27323 2.255483 2.352626
2018.01.05 G   0   19.18736 6.64622  11.16619 2.437314 4.698096
I have a dataset like this:
ID PersonID ClassID Attended Converted
1  1        1       1        0
2  1        1       1        1
3  1        1       1        1
4  2        1       1        1
5  3        2       0        0
6  3        2       1        1
7  4        2       1        0
I'm building a report that groups by ClassID (actually I'm using a parameter that allows grouping on a few different columns, but for simplicity here I'm just using ClassID). I need to do a calculation in each group footer. In order to do the calculation, I need to count records with distinct PersonIDs within that group. The catch is that, in one case, these records also need to match a criterion. E.g.:
X = [Count of records where Converted = 1 with distinct PersonID]
Y = [Count of records where Attended = 1]
Then I need to display the quotient as a percentage:
(X/Y)*100
So the final report would look something like this:
ID PersonID Attended Converted

CLASS 1 GROUP
1  1        1        0
2  1        1        1
3  1        1        1
4  2        1        1
Percent = 2/4 = 50%

CLASS 2 GROUP
5  3        0        0
6  3        1        1
7  4        1        0
Percent = 1/2 = 50%
Notice in Class 1 Group, there are 3 records with Converted = 1 but 'X' (the numerator) is equal to 2 because of the duplicate PersonID. How can I calculate this in Crystal Reports?
I had to create a few different formulas to make this work, with the help of this site.
First I created a formula called fNull, as suggested by that site, which is just blank. (I was wondering whether just typing null in its place would do the job, but I didn't get around to testing it.) Next I created formulas to evaluate whether a row was attended and whether a row was converted.
fTrialAttended:
//Allows row to be counted if AttendedTrial is true
if {ConversionData.AttendedTrial} = true
then CStr({ConversionData.PersonID})
else {#fNull}
fTrialsConverted:
//Allows row to be counted if Converted is true
if {ConversionData.Converted} = true
then CStr({ConversionData.PersonID})
else {#fNull}
Note that I'm returning the PersonID if attended or converted is true. This lets me do the distinct count in the next formula (X from the original question):
fX:
DistinctCount({#fTrialsConverted}, {ConversionData.ClassID})
This is placed in the group footer. Again, remember #fTrialsConverted returns the PersonID of converted trials (or fNull, which won't be counted). One thing I don't understand is why I had to explicitly include the group-by field (ClassID) if the formula is in the group footer, but I did, or it would count the total across all groups. Next, Y was just a straight-up count.
fY:
//Counts the number of trials attended in the group
Count({#fTrialsAttended}, {ConversionData.ClassID})
And finally a formula to calculate the percentage:
if {#fY} = 0 then 0
else ({#fX}/{#fY})*100
The last thing I'll share is that I also wanted to calculate the total across all groups in the report footer. Counting total Y was easy; it's the same as the fY formula except you leave out the group-by parameter. Counting total X was trickier because I needed the sum of the X from each group, and Crystal can't sum another sum. So I updated my X formula to also keep a running total in a global variable:
fX:
//Counts the number of converted trials in the group, distinct to a personID
whileprintingrecords;
Local NumberVar numConverted := DistinctCount({#fTrialsConverted}, {#fGroupBy});
global NumberVar rtConverted := rtConverted + numConverted; //Add to global running total
numConverted; //Return this value
Now I can use rtConverted in the footer for the calculation. This led to just one other bewildering thing that took me a couple of hours to figure out: rtConverted was not being treated as a global variable until I explicitly added the global keyword, despite all the documentation I've seen saying global is the default. Once I figured that out, it all worked great.