I have data like this in a Spark DataFrame A:
Name Action_id
Ben 1
Ben 2
Ben 4
Ben 3
Jerry 2
Jerry 2
Jane 3
Jane 5
James 1
James 1
Action_id values range from 1 to 5.
We want to get a Spark DataFrame B like this:
Name Action_id=1 Action_id=2 Action_id=3 Action_id=4 Action_id=5
Ben 1 1 1 1 0
Jane 0 0 1 0 1
Jerry 0 2 0 0 0
James 2 0 0 0 0
For example, the '1' in (Ben, Action_id=1) means that in DataFrame A, Ben took action 1 one time.
How can I transform DataFrame A into DataFrame B?
You are looking for a pivot with a count aggregation.
In Scala:
import org.apache.spark.sql.{functions => F}
val df = Seq(("Ben", 1),
             ("Ben", 2),
             ("Ben", 4),
             ("Ben", 3),
             ("Jerry", 2),
             ("Jerry", 2),
             ("Jane", 3),
             ("Jane", 5),
             ("James", 1),
             ("James", 1)).toDF("Name", "Action_id")
df.groupBy("Name").pivot("Action_id").agg(F.count("Action_id")).na.fill(0).show
I do not have access to a PySpark shell right now, but it should look something like this:
import pyspark.sql.functions as F
(df
.groupby(df.Name)
.pivot("Action_id")
.agg(F.count("Action_id"))
.na.fill(0)
.show())
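Since the question states that Action_id only takes values 1 to 5, you can also pass the pivot values explicitly; Spark then skips the extra pass it would otherwise need to discover the distinct values. A sketch of the same query (the value list simply comes from the stated range):
(df
 .groupby(df.Name)
 .pivot("Action_id", [1, 2, 3, 4, 5])  # explicit values: no extra job to collect them
 .agg(F.count("Action_id"))
 .na.fill(0)
 .show())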
I have a date-level promotion data frame that looks something like this:
ID  Date    Promotions  Converted to customer
1   2-Jan   2           0
1   10-Jan  3           1
1   14-Jan  3           0
2   10-Jan  19          1
2   10-Jan  8           0
2   10-Jan  12          0
Now I want to see how many promotions it took to convert someone into a customer.
For example, it took (2+3) promotions to convert ID 1 into a customer and (19) to convert ID 2.
For example:
ID  Promotions
1   5
2   19
I am unable to think of an idea to solve it. Can you please help me?
@Corralien and @mozway have helped with the solution in Python, but I am unable to implement it in PySpark because of the huge DataFrame size (>1 TB).
You can use:
prom = (df.groupby('ID')['Promotions'].cumsum()
          .where(df['Converted to customer'].eq(1))
          .dropna().astype(int))
out = df.loc[prom.index, ['ID', 'Date']].assign(Promotion=prom)
print(out)
# Output
ID Date Promotion
1 1 10-Jan 5
3 2 10-Jan 19
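Since the question asks for PySpark (the frame is >1 TB), a window-based translation of this cumulative-sum idea could look like the sketch below. It assumes the Spark DataFrame is named sdf and has an explicit ordering column, here a hypothetical row_order, because a running sum needs a deterministic row order within each ID:
from pyspark.sql import functions as F, Window

# running total of Promotions per ID, ordered by the hypothetical row_order column
w = (Window.partitionBy("ID").orderBy("row_order")
     .rowsBetween(Window.unboundedPreceding, Window.currentRow))

out = (sdf.withColumn("Promotion", F.sum("Promotions").over(w))
       .filter(F.col("Converted to customer") == 1)  # keep only the converting rows
       .select("ID", "Date", "Promotion"))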
Use one groupby to generate a mask to hide the rows, then one groupby.sum for the sum:
mask = (df.groupby('ID', group_keys=False)['Converted to customer']
.apply(lambda s: s.eq(1).shift(fill_value=False).cummax())
)
out = df[~mask].groupby('ID')['Promotions'].sum()
Output:
ID
1 5
2 19
Name: Promotions, dtype: int64
Alternative output:
df[~mask].groupby('ID', as_index=False).agg(**{'Number': ('Promotions', 'sum')})
Output:
ID Number
0 1 5
1 2 19
If you potentially have groups without any conversion to customer, you might want to also aggregate the "Converted to customer" column as an indicator:
mask = (df.groupby('ID', group_keys=False)['Converted to customer']
.apply(lambda s: s.eq(1).shift(fill_value=False).cummax())
)
out = (df[~mask]
.groupby('ID', as_index=False)
.agg(**{'Number': ('Promotions', 'sum'),
'Converted': ('Converted to customer', 'max')
})
)
Output:
ID Number Converted
0 1 5 1
1 2 19 1
2 3 39 0
Alternative input:
ID Date Promotions Converted to customer
0 1 2-Jan 2 0
1 1 10-Jan 3 1
2 1 14-Jan 3 0
3 2 10-Jan 19 1
4 2 10-Jan 8 0
5 2 10-Jan 12 0
6 3 10-Jan 19 0 # this group has
7 3 10-Jan 8 0 # no conversion
8 3 10-Jan 12 0 # to customer
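For the PySpark case mentioned in the question, the same mask idea (drop every row that comes after a group's first conversion, then aggregate) can be sketched with window functions. As above, sdf and the ordering column row_order are assumptions:
from pyspark.sql import functions as F, Window

# 1 if any earlier row in the same ID already converted, else 0 (null on the first row)
w = (Window.partitionBy("ID").orderBy("row_order")
     .rowsBetween(Window.unboundedPreceding, -1))

out = (sdf.withColumn("after_conv",
                      F.coalesce(F.max("Converted to customer").over(w), F.lit(0)))
       .filter(F.col("after_conv") == 0)  # hide rows after the first conversion
       .groupBy("ID")
       .agg(F.sum("Promotions").alias("Number"),
            F.max("Converted to customer").alias("Converted")))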
You want to compute something per ID, so a groupby on ID seems appropriate, e.g.
data.groupby("ID").apply(agg_fct)
Now write a separate function agg_fct that computes the result for a dataframe consisting of only one ID.
Assuming data are ordered by Date, I guess that
def agg_fct(df):
    # position of the first conversion; +1 so the converting row itself is counted
    index_of_conv = df["Converted to customer"].argmax()
    return df.iloc[0:index_of_conv + 1, df.columns.get_loc("Promotions")].sum()
would be fine. You might want to make some adjustments in case of a customer who has never been converted.
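A self-contained sketch with the question's sample data, confirming the expected 5 and 19:
import pandas as pd

data = pd.DataFrame({
    "ID": [1, 1, 1, 2, 2, 2],
    "Date": ["2-Jan", "10-Jan", "14-Jan", "10-Jan", "10-Jan", "10-Jan"],
    "Promotions": [2, 3, 3, 19, 8, 12],
    "Converted to customer": [0, 1, 0, 1, 0, 0],
})

def agg_fct(df):
    # sum Promotions up to and including the first converting row
    index_of_conv = df["Converted to customer"].argmax()
    return df.iloc[0:index_of_conv + 1, df.columns.get_loc("Promotions")].sum()

print(data.groupby("ID").apply(agg_fct))
# ID
# 1     5
# 2    19
# dtype: int64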
I have these two tables:
tab:([]col1:`abc`def`ghe`abc;val_00:`a`b`c`e;val_01:`d`e`f`t;val_02:`g`h`e`g;val_03:`r`t`y`o)
tab2:([]col1:`abc`abc`abc`abc`def`def`def`ghe`ghe`ghe;col2:0 1 2 3 4 5 6 7 8 9;col3:`Ashley`Peter`John`Molly`Apple`Orange`Banana`Robin`Tony`Bob)
and this is the result I am looking for:
tabResult:([]col1:`abc`def`ghe`abc;val_00:`Ashley`b`c`Ashley;val_01:`Peter`e`f`Peter;val_02:`John`h`e`John;val_03:`Molly`t`y`Molly)
col1 val_00 val_01 val_02 val_03
abc Ashley Peter John Molly
def b e h t
ghe c f e y
abc Ashley Peter John Molly
I would like to update tab depending on tab2: if col1=`abc and col2=1 in tab2, I would like to update val_01 to `Peter in tab; if col1=`abc and col2=2, update val_02 to `John, and so on.
This is what I have so far:
{![tab;enlist(=;`col1;enlist x);0b;(enlist y)!enlist z]} . (`abc;`val_01;)
The function above works if the field is numerical and I use a number as the last argument. However, I am not sure how to update symbols, or how to generalise this function for all tables.
If I'm understanding your request correctly, you're trying to update a field of long type with values of symbol type. This will fail with a 'type error, as column values are expected to be uniform in type. What you can do instead is create new columns for the symbol entries, and afterwards select the columns you want.
Is something like this what you had in mind? I've assumed that the column name is determined by the col2 value in tab2. Also, it looks like you have two val_01 columns in your tab input; I assumed one of these was supposed to be val_02.
q)(uj/){![tab;enlist(=;`col1;enlist x);0b;(enlist`$"val_0",string[y],"_sym")!enlist enlist z]}.'flip tab2`col1`col2`col3
col1 val_00 val_01 val_02 val_03 val_01_sym val_02_sym val_03_sym val_04_sym val_05_sym val_06_sym val_07_sym val_08_sym val_09_sym
-----------------------------------------------------------------------------------------------------------------------------------
abc 1 2 2 3 Peter
def 2 2 3 2
ghe 3 3 1 1
abc 1 2 2 3 John
def 2 2 3 2
ghe 3 3 1 1
abc 1 2 2 3 Molly
def 2 2 3 2
ghe 3 3 1 1
abc 1 2 2 3
def 2 2 3 2 Apple
ghe 3 3 1 1
abc 1 2 2 3
def 2 2 3 2 Orange
ghe 3 3 1 1
abc 1 2 2 3
def 2 2 3 2 Banana
ghe 3 3 1 1
abc 1 2 2 3
def 2 2 3 2
ghe 3 3 1 1 Robin
abc 1 2 2 3
def 2 2 3 2
ghe 3 3 1 1 Tony
abc 1 2 2 3
def 2 2 3 2
ghe 3 3 1 1 Bob
EDIT:
Based on your comments, I've amended my solution:
q)cols[tab]#{![x;enlist(=;`col1;enlist y`col1);0b;(enlist`$"val_0",string y`col2)!enlist enlist y`col3]}/[tab;tab2]
col1 val_00 val_01 val_02 val_03
--------------------------------
abc Ashley Peter John Molly
def b e h t
ghe c f e y
abc Ashley Peter John Molly
I am working on a dataset that contains the predicted label (predicted) vs. the true label (label) for each id, plus a column indicating whether the predicted label equals the true label (match). I want to show the percentage of correct predictions for each label versus the total number of observations belonging to that label.
As an example, given the following data:
id <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
label <- c(6, 5, 1, 5, 4, 2, 3, 1, 6, 1)
predicted <- c(6, 5, 1, 3, 2, 2, 3, 1, 4, 4)
match <- c(1, 1, 1, 0, 0, 1, 1, 1, 0, 0)
dt <- data.frame(id, label, predicted, match)
head(dt)
id label predicted match
1 1 6 6 1
2 2 5 5 1
3 3 1 1 1
4 4 5 3 0
5 5 4 2 0
6 6 2 2 1
If I group_by(label), count(label, predicted), and then mutate(percent = sum(match == 1)/sum(n)), I expect to obtain a new grouped data frame like this:
library(plyr)
library(dplyr)
dt %>% group_by(label) %>% dplyr::count(label, predicted) %>% mutate(percent = sum(match == 1)/sum(n))
dt
id label predicted match percent
1 3 1 1 1 0.67
2 8 1 1 1 0.67
3 10 1 4 0 0.67
4 6 2 2 1 1.00
5 7 3 3 1 1.00
6 5 4 2 0 0.00
7 4 5 3 0 0.50
8 2 5 5 1 0.50
9 9 6 4 0 0.50
10 1 6 6 1 0.50
However, my code gives me the following output instead:
dt
# A tibble: 6 x 4
# Groups: label [5]
label predicted n percent
<dbl> <dbl> <int> <dbl>
1 1.00 1.00 2 0.600
2 1.00 4.00 1 0.600
3 2.00 2.00 1 0.600
4 3.00 3.00 1 0.600
5 4.00 2.00 1 0.600
6 5.00 3.00 1 0.600
It calculated the percentage of correct predictions across all labels (hence every row equals 0.600) instead of doing so for each label. How should I modify my code to achieve my desired output?
I wasn't able to reproduce your output with the code that you shared. I think the following will accomplish what you are seeking, though (I used total as the variable name rather than n):
dt %>%
arrange(label) %>%
group_by(label) %>%
mutate(total = n(),
percent = sum(match == 1) / total)
# A tibble: 10 x 6
# Groups: label [6]
id label predicted match total percent
<dbl> <dbl> <dbl> <dbl> <int> <dbl>
1 3 1 1 1 3 0.667
2 8 1 1 1 3 0.667
3 10 1 4 0 3 0.667
4 6 2 2 1 1 1
5 7 3 3 1 1 1
6 5 4 2 0 1 0
7 2 5 5 1 2 0.5
8 4 5 3 0 2 0.5
9 1 6 6 1 2 0.5
10 9 6 4 0 2 0.5
I have the following data frame :
df1
uid text frequency
1 a 1
1 b 0
1 c 2
2 a 0
2 b 0
2 c 1
I need to flatten it on the basis of uid to:
df2
uid a b c
1 1 0 2
2 0 0 1
I've worked along similar lines in R but haven't been able to translate it into SQL or Scala.
Any suggestions on how to approach this?
You can group by uid, use text as the pivot column, and sum the frequencies:
df1
.groupBy("uid")
.pivot("text")
.sum("frequency").show()
Consider the following table:
myTable:
a b
-------
1
2
3 10
4 50
5 30
How do I replace the empty cells of b with a zero? So the result would be:
a b
-------
1 0
2 0
3 10
4 50
5 30
Right now I'm doing:
myTable: update b:{$[x~0Ni;0;x]}'b from myTable
But I am wondering whether there is a better/easier solution for doing this.
Using the fill operator (^):
Example Table:
q)tbl:flip`a`b!(2;0N)#10?0N 0N 0N,til 3
a b
---
0 2
1 1
1 1
  1
1
Fill nulls in all columns with 0:
q)0^tbl
a b
---
0 2
1 1
1 1
0 1
1 0
Fill nulls with 0 only in selected columns:
q)update 0^b from tbl
a b
---
0 2
1 1
1 1
  1
1 0
When you want to fill null values with the previous non-null value, you can use the fills function.
q)tbl:flip`a`b!(2;0N)#10?0N 0N 0N,til 3
a b
---
2 1
  1

1 2
0
q)update fills a, fills b from tbl
a b
---
2 1
2 1
2 1
1 2
0 2
Using fills with grouping:
q)tbl:update s:5?`g`a from flip`a`b!(2;0N)#10?0N 0N 0N,til 3
a b s
-----
1   a
  2 a
0 0 g
2 2 g
0   a
q)update fills a, fills b by s from tbl
a b s
-----
1   a
1 2 a
0 0 g
2 2 g
0 2 a