Is there a simple way in PySpark to find out the number of promotions it took to convert someone into a customer?

I have a date-level promotion data frame that looks something like this:
ID  Date    Promotions  Converted to customer
1   2-Jan   2           0
1   10-Jan  3           1
1   14-Jan  3           0
2   10-Jan  19          1
2   10-Jan  8           0
2   10-Jan  12          0
Now I want to see how many promotions it took to convert someone into a customer.
For example, it took (2+3) = 5 promotions to convert ID 1 into a customer and 19 to convert ID 2.
E.g.:
ID  Promotions
1   5
2   19
I am unable to think of a way to solve this. Can you please help me?
Corralien and mozway have helped with the solution in Python (pandas), but I am unable to implement it in PySpark because of the huge dataframe size (>1 TB).

You can use:
prom = (df.groupby('ID')['Promotions'].cumsum()
          .where(df['Converted to customer'].eq(1))
          .dropna().astype(int))
out = df.loc[prom.index, ['ID', 'Date']].assign(Promotion=prom)
print(out)

# Output
   ID    Date  Promotion
1   1  10-Jan          5
3   2  10-Jan         19
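Since the question ultimately asks for PySpark, here is a minimal window-function sketch of this same cumulative-sum idea. It assumes a sortable ordering column (DateIdx below stands in for Date; real day-month strings like "2-Jan" would first need parsing, e.g. with F.to_date), and the column names simply mirror the sample data:

from pyspark.sql import SparkSession, Window
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

sdf = spark.createDataFrame(
    [(1, 1, 2, 0), (1, 2, 3, 1), (1, 3, 3, 0),
     (2, 1, 19, 1), (2, 2, 8, 0), (2, 3, 12, 0)],
    ["ID", "DateIdx", "Promotions", "Converted"],
)

# Running total of promotions per ID, in date order
w = (Window.partitionBy("ID").orderBy("DateIdx")
           .rowsBetween(Window.unboundedPreceding, 0))

out = (sdf.withColumn("cum_prom", F.sum("Promotions").over(w))
          .filter(F.col("Converted") == 1)          # keep the conversion row
          .select("ID", F.col("cum_prom").alias("Promotions")))
out.show()
# +---+----------+
# | ID|Promotions|
# +---+----------+
# |  1|         5|
# |  2|        19|
# +---+----------+

Window aggregations keep the computation distributed, which matters at the stated >1 TB scale.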

Use one groupby to generate a mask that hides the rows after conversion, then one groupby.sum for the sum:
mask = (df.groupby('ID', group_keys=False)['Converted to customer']
          .apply(lambda s: s.eq(1).shift(fill_value=False).cummax())
        )
out = df[~mask].groupby('ID')['Promotions'].sum()
Output:
ID
1     5
2    19
Name: Promotions, dtype: int64
Alternative output:
df[~mask].groupby('ID', as_index=False).agg(**{'Number': ('Promotions', 'sum')})
Output:
   ID  Number
0   1       5
1   2      19
If you potentially have groups without a conversion to customer, you might want to also aggregate the "Converted to customer" column as an indicator:
mask = (df.groupby('ID', group_keys=False)['Converted to customer']
          .apply(lambda s: s.eq(1).shift(fill_value=False).cummax())
        )
out = (df[~mask]
       .groupby('ID', as_index=False)
       .agg(**{'Number': ('Promotions', 'sum'),
               'Converted': ('Converted to customer', 'max')})
      )
Output:
   ID  Number  Converted
0   1       5          1
1   2      19          1
2   3      39          0
Alternative input:
ID Date Promotions Converted to customer
0 1 2-Jan 2 0
1 1 10-Jan 3 1
2 1 14-Jan 3 0
3 2 10-Jan 19 1
4 2 10-Jan 8 0
5 2 10-Jan 12 0
6 3 10-Jan 19 0 # this group has
7 3 10-Jan 8 0 # no conversion
8 3 10-Jan 12 0 # to customer
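And since the question ultimately needs PySpark, here is a hedged translation of this mask idea using window functions (it reuses the sdf frame and the assumed DateIdx ordering column from the sketch above). Like the pandas version, it keeps never-converted groups, which come out with Converted = 0:

from pyspark.sql import Window
from pyspark.sql import functions as F

w = Window.partitionBy("ID").orderBy("DateIdx")
w_cum = w.rowsBetween(Window.unboundedPreceding, 0)

# Flag rows strictly after the first conversion (shift-then-cummax),
# then drop them before summing
masked = (sdf
          .withColumn("prev_conv", F.lag("Converted", 1, 0).over(w))
          .withColumn("after_conv", F.max("prev_conv").over(w_cum)))

out = (masked.filter(F.col("after_conv") == 0)
             .groupBy("ID")
             .agg(F.sum("Promotions").alias("Number"),
                  F.max("Converted").alias("Converted")))
out.show()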

You want to compute something by ID, so a groupby on ID seems appropriate, e.g.
data.groupby("ID").apply(agg_fct)
Now write a separate function agg_fct which computes the result for a dataframe consisting of only one ID.
Assuming the data are ordered by Date, I guess that
def agg_fct(df):
    # positional index of the first row with Converted to customer == 1
    index_of_conv = df["Converted to customer"].argmax()
    # sum promotions up to and including the conversion row
    return df.iloc[0:index_of_conv + 1, df.columns.get_loc("Promotions")].sum()
would be fine. You might want to make some adjustments in case of a customer who has never been converted.
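One possible adjustment for customers who never converted, sketched here with the (arbitrary) choice of returning NaN for such groups, since argmax would otherwise misleadingly return 0:

import numpy as np

def agg_fct(df):
    conv = df["Converted to customer"]
    if conv.max() == 0:
        # group never converts; argmax() would wrongly point at row 0
        return np.nan
    index_of_conv = conv.argmax()
    # sum up to and including the conversion row
    return df["Promotions"].iloc[:index_of_conv + 1].sum()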

Related

Counting presence or absence of tallies by rows

Hi, I am not sure how to explain what I need, but I'll try. I need a query (if there is one) for counting whether a tally was present in a point (column), so all species with more than one tally in a point will count only as one.
This is how the data looks:
Sp  Site  Pnt1  Pnt2  Pnt3  Total
A   1     1     1     1     3
A   2     1     1           2
A   3     1                 1
B   1     1     1     1     3
B   2     1                 1
C   1           1     1     2
C   2                       0
I want to count if the sites have a tally or not, and if they are repeated by points (for the same species) I want to count them as one. I would like the resulting table to look like this:
Sp  Pnt1  Pnt2  Pnt3  Total
A   1     1     1     3
B   1     1     1     3
C   0     1     1     2
Thanks for all the help you can provide.
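The question does not name a tool, but assuming the data can be loaded into a pandas dataframe with blanks as NaN, a minimal sketch: taking the max of each point column per species collapses repeated tallies to one, and summing across points gives the total.

import numpy as np
import pandas as pd

# Toy frame mirroring the question's data (blank cells as NaN)
df = pd.DataFrame({
    'Sp':   ['A', 'A', 'A', 'B', 'B', 'C', 'C'],
    'Site': [1, 2, 3, 1, 2, 1, 2],
    'Pnt1': [1, 1, 1, 1, 1, np.nan, np.nan],
    'Pnt2': [1, 1, np.nan, 1, np.nan, 1, np.nan],
    'Pnt3': [1, np.nan, np.nan, 1, np.nan, 1, np.nan],
})

# Presence/absence per species and point: max collapses repeats to 1
out = (df.groupby('Sp', as_index=False)[['Pnt1', 'Pnt2', 'Pnt3']]
         .max()
         .fillna(0)
         .astype({'Pnt1': int, 'Pnt2': int, 'Pnt3': int}))
out['Total'] = out[['Pnt1', 'Pnt2', 'Pnt3']].sum(axis=1)
print(out)
#   Sp  Pnt1  Pnt2  Pnt3  Total
# 0  A     1     1     1      3
# 1  B     1     1     1      3
# 2  C     0     1     1      2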

Calculated field in Tableau

I have a very simple problem, but I am totally new to Tableau, so I need some help solving it.
My data set contains
Year_Track_4, Year_Track_5, Year_Track_6, Year_Track_7, ..., N
Each Year_Track contains 1/0 values: 1 means graduated and 0 means did not graduate (failed).
y4  y5  N
1       8
0       5
1       6
0       1
1       2
1       5
1       7
1       8
1       5
0       7
1       5
1       8
1       6
1       1
So, I want to create a placeholder in Tableau, or a calculated field or parameter, to select one year and count the number who graduated or did not graduate.
I need to create the same for OverAll_0 and OverAll_1 as one calculated field containing the values 1 and 0, so that I can use SUM(N) to calculate it.
I used an IIF statement to solve this problem:
IIF(Year_Track_4 = 1, 'graduated in 4 years', 'did not graduate')
.......
......
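For clarity, here is what such a calculated field should compute, expressed as a hedged pandas sketch (the column names y4 and N mirror the sample above; this is not Tableau syntax):

import pandas as pd

# Toy data: y4 = graduated flag for year 4, N = number of students
df = pd.DataFrame({'y4': [1, 0, 1, 0, 1, 1],
                   'N':  [8, 5, 6, 1, 2, 5]})

# Total N per graduated (1) / did-not-graduate (0) bucket
print(df.groupby('y4')['N'].sum())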

Pandas: Convert datetime series to an int by month

I have a dataframe that contains a series of dates, e.g.:
0 2014-06-17
1 2014-05-05
2 2014-01-07
3 2014-06-29
4 2014-03-15
5 2014-06-06
7 2014-01-29
What I would like to do is convert these dates to integers by the month, since all the values are within the same year. So I would like to get something like this:
0 6
1 5
2 1
3 6
4 3
5 6
7 1
Is there a quick way to do this with Pandas?
EDIT: Answered by jezrael. Thanks so much!
Use Series.dt.month:
print (df)
Dates
0 2014-06-17
1 2014-05-05
2 2014-01-07
3 2014-06-29
4 2014-03-15
5 2014-06-06
6 2014-01-29
print (df.Dates.dt.month)
0 6
1 5
2 1
3 6
4 3
5 6
6 1
Name: Dates, dtype: int64
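If the Dates column is stored as strings rather than datetime64, parse it first with pd.to_datetime; a minimal sketch:

import pandas as pd

df = pd.DataFrame({'Dates': ['2014-06-17', '2014-05-05', '2014-01-07']})
df['Dates'] = pd.to_datetime(df['Dates'])   # the .dt accessor needs datetime64
print(df['Dates'].dt.month.tolist())        # [6, 5, 1]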

Delete adjacent repeated terms

I have the following vector a:
a=[8,8,9,9,1,1,2,2,3,3,4,4,5,5,6,6,7,7,8,8]
From a I want to delete all "adjacent" repetitions to obtain:
b=[8,9,1,2,3,4,5,6,7,8]
However, when I do:
unique(a,'stable')
ans =
8 9 1 2 3 4 5 6 7
You see, unique only really gets the unique elements of a, whereas what I want is to delete the "duplicates"... How do I do this?
It looks like a run-length-encoding problem (check here). You can modify Mohsen's solution to get the desired output. (i.e. I claim no credit for this code, yet the question is not a duplicate in my opinion).
Here is the code:
a = [8,8,9,9,1,1,2,2,3,3,4,4,5,5,6,6,7,7,8,8];
F = find(diff([a(1)-1, a]));  % indices where the value changes
Since diff(a) returns an array of length length(a)-1, we prepend a value (here a(1)-1) so the result is the same size as a. We subtract 1 so that the first element of the diff is guaranteed non-zero; as mentioned by @surgical_tubing, find then picks it up, because find looks for non-zero elements.
Hence diff([a(1)-1, a]) looks like this:
Columns 1 through 8
1 0 1 0 -8 0 1 0
Columns 9 through 16
1 0 1 0 1 0 1 0
Columns 17 through 20
1 0 1 0
Having found the positions where the value changes, we index back into a with the indices returned by find:
newa = a(F)
and output:
newa =
Columns 1 through 8
8 9 1 2 3 4 5 6
Columns 9 through 10
7 8
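As a hedged aside for Python readers (the question itself is MATLAB), the same diff-and-keep idea in NumPy:

import numpy as np

a = np.array([8, 8, 9, 9, 1, 1, 2, 2, 3, 3, 4, 4, 5, 5, 6, 6, 7, 7, 8, 8])
# Keep the first element, then every element that differs from its predecessor
b = a[np.insert(np.diff(a) != 0, 0, True)]
print(b)  # [8 9 1 2 3 4 5 6 7 8]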

Simulink - From File block - undefined time steps in .mat file (with examples)

I have a file with a lot of data. For example, here is a file with this data:
time: 1 2 3 4 5 6 7 8 9 10
data: 1 0 1 1 1 1 1 0 1 0
but in my file I have dropped the repeated consecutive values, like this:
time: 1 2 3 8 9 10
data: 1 0 1 0 1 0
If I run this data, the result is as shown in the picture.
My question is how to achieve the result shown in the picture with the red arrows.
Put simply: how do I repeat the value over the undefined time steps (4, 5, 6, 7 in the example above)?
You can achieve this by not dropping the final sample of the run (time 7 in this example), like this:
time: 1 2 3 7 8 9 10
data: 1 0 1 1 0 1 0
This way Simulink will interpolate ones there.
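If the .mat file is generated outside MATLAB, here is a hedged sketch using SciPy; it assumes the From File block's legacy matrix layout (time in the first row, signal values in the rows below), and the variable name signal is an arbitrary choice:

import numpy as np
from scipy.io import savemat

time = np.array([1, 2, 3, 7, 8, 9, 10], dtype=float)
data = np.array([1, 0, 1, 1, 0, 1, 0], dtype=float)

# Row 1 = time, row 2 = signal, per the legacy From File format
savemat('signal.mat', {'signal': np.vstack([time, data])})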