PySpark - redistribute percentages

I have a table like the following:
city | center | qty_out | qty_out %
----------------------------------------
A | 1 | 10 | .286
A | 2 | 2 | .057
A | 3 | 23 | .657
B | 1 | 40 | .8
B | 2 | 10 | .2
city-center is unique/the primary key.
If any center within a city has a qty_out % of less than 10% (.10), I want to ignore it and redistribute its % among the other centers of the city. So the result above would become
city | center | qty_out_%
----------------------------------------
A | 1 | .3145
A | 3 | .6855
B | 1 | .8
B | 2 | .2
How can I go about this? I was thinking of using a window function to partition, but I can't think of a window function to use with this:
column_list = ["city","center"]
w = Window.partitionBy([col(x) for x in column_list]).orderBy('qty_out_%')

I am not a statistician, so I cannot comment on the equation; however, if I write the Spark code as literally as you described, it will look like this:
from pyspark.sql import functions as F, Window

w = Window.partitionBy('city')
# flags a row's qty_out % only when it is below the 10% threshold
redist_cond = F.when(F.col('qty_out %') < 0.1, F.col('qty_out %'))
df = (df
      # total of the small centers' %, split evenly across the remaining centers of the city
      .withColumn('redist', F.sum(redist_cond).over(w)
                  / (F.count('*').over(w) - F.count(redist_cond).over(w)))
      .fillna(0, subset=['redist'])        # cities with nothing to redistribute
      .filter(F.col('qty_out %') >= 0.1)   # drop the small centers
      .withColumn('qty_out %', redist_cond.otherwise(F.col('qty_out %') + F.col('redist')))
      .drop('redist'))
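For anyone who wants to sanity-check the above, here is a minimal, self-contained sketch that rebuilds the sample data from the question and applies the same transformation (the local SparkSession setup is only for illustration):

from pyspark.sql import SparkSession, Window
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [("A", 1, 10, .286), ("A", 2, 2, .057), ("A", 3, 23, .657),
     ("B", 1, 40, .8), ("B", 2, 10, .2)],
    ["city", "center", "qty_out", "qty_out %"])

w = Window.partitionBy("city")
redist_cond = F.when(F.col("qty_out %") < 0.1, F.col("qty_out %"))
result = (df
          .withColumn("redist", F.sum(redist_cond).over(w)
                      / (F.count("*").over(w) - F.count(redist_cond).over(w)))
          .fillna(0, subset=["redist"])
          .filter(F.col("qty_out %") >= 0.1)
          .withColumn("qty_out %", redist_cond.otherwise(F.col("qty_out %") + F.col("redist")))
          .drop("redist"))
result.show()
# city A keeps centers 1 and 3 with roughly .3145 and .6855; city B is unchanged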

Related

pyspark - preprocessing with a kind of "product-join"

I have 2 datasets that I can represent as:
The first dataframe is my raw data. It contains millions of rows and around 6000 areas.
+--------+------+------+-----+-----+
| user | area | time | foo | bar |
+--------+------+------+-----+-----+
| Alice | A | 5 | ... | ... |
| Alice | B | 12 | ... | ... |
| Bob | A | 2 | ... | ... |
| Charly | C | 8 | ... | ... |
+--------+------+------+-----+-----+
This second dataframe is a mapping table. It has around 200 areas (not the ~6000 of the raw data) for 150 places. Each area can have 1-N places (and a place can have 1-N areas too). It can be represented unpivoted this way:
+------+--------+-------+
| area | place | value |
+------+--------+-------+
| A | placeZ | 0.1 |
| B | placeB | 0.6 |
| B | placeC | 0.4 |
| C | placeA | 0.1 |
| C | placeB | 0.04 |
| D | placeA | 0.4 |
| D | placeC | 0.6 |
| ... | ... | ... |
+------+--------+-------+
or pivoted
+------+--------+--------+--------+-----+
| area | placeA | placeB | placeC | ... |
+------+--------+--------+--------+-----+
| A | 0 | 0 | 0 | ... |
| B | 0 | 0.6 | 0.4 | ... |
| C | 0.1 | 0.04 | 0 | ... |
| D | 0.4 | 0 | 0.6 | ... |
+------+--------+--------+--------+-----+
I would like to create a kind of product-join to have something like:
+--------+--------+--------+--------+-----+--------+
| user | placeA | placeB | placeC | ... | placeZ |
+--------+--------+--------+--------+-----+--------+
| Alice | 0 | 7.2 | 4.8 | 0 | 0.5 | <- 7.2 & 4.8 come from area B and 0.5 from area A
| Bob | 0 | 0 | 0 | 0 | 0.2 |
| Charly | 0.8 | 0.32 | 0 | 0 | 0 |
+--------+--------+--------+--------+-----+--------+
I see 2 options so far:
Option 1:
- Perform a left join between the main table and the pivoted one
- Multiply each place column by the time (around 150 columns)
- Group by user with a sum
Option 2:
- Perform an outer join between the main table and the unpivoted one
- Multiply the time by the value
- Pivot place
- Group by user with a sum
I don't like the first option because of the number of multiplications involved (the mapping dataframe is quite sparse).
I prefer the second option, but I see two problems:
If someday the dataset does not have a place represented, that column will not exist and the output will have a different shape (hence failing).
Some other features like foo and bar will be duplicated by the outer join, and I'll have to handle them case by case at the grouping stage (sum or average).
I would like to know if there is something more ready-to-use for this kind of product-join in Spark. I have seen the OneHotEncoder, but it only provides a "1" in each column (so it is even worse than the first solution).
Thanks in advance,
Nicolas
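For what it's worth, the second option can be sketched roughly like this in PySpark; raw and mapping are hypothetical names for the two dataframes described above, and passing an explicit list of places to pivot is one way to keep the output schema stable when a place is missing from the data:

from pyspark.sql import functions as F

expected_places = ["placeA", "placeB", "placeC", "placeZ"]    # the full list of ~150 places in practice

scores = (raw.join(mapping, on="area", how="left")            # left join keeps users whose area maps to nothing
          .withColumn("score", F.col("time") * F.col("value"))
          .groupBy("user")
          .pivot("place", expected_places)                    # fixed column list -> stable shape
          .sum("score")
          .fillna(0))
scores.show()

The duplicated foo/bar columns would still need their own aggregation (for example first or avg) in the same groupBy, as noted in the question.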

calculate aggregation and percentage simultaneously after groupBy in Scala/Spark Dataset/Dataframe

I am learning to work with Scala and Spark. It's my first time using them. I have a structured Scala Dataset (org.apache.spark.sql.Dataset) in the following format.
Region | Id | RecId | Widget | Views | Clicks | CTR
1 | 1 | 101 | A | 5 | 1 | 0.2
1 | 1 | 101 | B | 10 | 4 | 0.4
1 | 1 | 101 | C | 5 | 1 | 0.2
1 | 2 | 401 | A | 5 | 1 | 0.2
1 | 2 | 401 | D | 10 | 2 | 0.1
NOTE: CTR = Clicks/Views
I want to aggregate the rows regardless of Widget (i.e. grouping by Region, Id, RecId).
The Expected Output I want is like following:
Region | Id | RecId | Views | Clicks | CTR
1 | 1 | 101 | 20 | 6 | 0.3
1 | 2 | 401 | 15 | 3 | 0.2
What I am getting is like below:
>>> ds.groupBy("Region","Id","RecId").sum().show()
Region | Id | RecId | sum(Views) | sum(Clicks) | sum(CTR)
1 | 1 | 101 | 20 | 6 | 0.8
1 | 2 | 401 | 15 | 3 | 0.3
I understand that it is summing up all the CTR values from the original data, but I want to group as explained and still get the expected CTR value. I also don't want to change the column names, which is what happens with my approach.
Is there any possible way of calculating it in such a manner? I also have #Purchases and ConversionRate (#Purchases/Views) and I want to do the same thing with those fields. Any leads will be appreciated.
You can calculate the CTR after the aggregation. Try the code below.
ds.groupBy("Region","Id","RecId")
.agg(sum(col("Views")).as("Views"), sum(col("Clicks")).as("Clicks"))
.withColumn("CTR" , col("Views") / col("Clicks"))
.show()
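Since the rest of this page is PySpark-oriented, here is a rough PySpark rendering of the same idea (aggregate first, then derive the ratios); the Purchases/ConversionRate part is only a sketch of the extension the question mentions, with an assumed column name:

from pyspark.sql import functions as F

result = (ds.groupBy("Region", "Id", "RecId")
          .agg(F.sum("Views").alias("Views"),
               F.sum("Clicks").alias("Clicks"),
               F.sum("Purchases").alias("Purchases"))              # assumed name for the #Purchases column
          .withColumn("CTR", F.col("Clicks") / F.col("Views"))
          .withColumn("ConversionRate", F.col("Purchases") / F.col("Views")))
result.show()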

How to find the lowest and biggest value in a maximum distance between points (SQL)

Currently, I have a PostgreSQL database (and a SQL Server database with almost the same structure) with some data, like the example below:
+----+---------+-----+
| ID | Name | Val |
+----+---------+-----+
| 01 | Point A | 0 |
| 02 | Point B | 050 |
| 03 | Point C | 075 |
| 04 | Point D | 100 |
| 05 | Point E | 200 |
| 06 | Point F | 220 |
| 07 | Point G | 310 |
| 08 | Point H | 350 |
| 09 | Point I | 420 |
| 10 | Point J | 550 |
+----+---------+-----+
ID = PK (auto increment);
Name = unique;
Val = unique;
Now, suppose I have only Point F (220), and I want to find the lowest value and the biggest value, with a maximum distance of less than 100 between consecutive values.
So, my result must return:
Lowest: Point E (200)
Biggest: Point I (420)
Step-by-step explanation (because English is not my primary language):
Looking for the lowest value:
Initial value = Point F (220);
Look for the lower closest value of Point F (220): Point E (200);
200(E) < 220(F) = True; 220(F) - 200(E) < 100 = True;
Lowest value until now = Point E (200)
Repeat
Look for the lower closest value of Point E (200): Point D (100);
100(D) < 200(E) = True; 200(E) - 100(D) < 100 = False;
Lowest value = Point E (200); Break;
Looking for the biggest value:
Initial value = Point F (220);
Look for the biggest closest value of Point F (220): Point G (310);
310(G) > 220(F) = True; 310(G) - 220(F) < 100 = True;
Biggest value until now = Point G (310)
Repeat
Look for the biggest closest value of Point G (310): Point H (350);
350(H) > 310(G) = True; 350(H) - 310(G) < 100 = True;
Biggest value until now = Point H (350)
Repeat
Look for the biggest closest value of Point H (350): Point I (420);
420(I) > 350(H) = True; 420(I) - 350(H) < 100 = True;
Biggest value until now = Point I (420)
Repeat
Look for the biggest closest value of Point I (420): Point J (550);
550(J) > 420(I) = True; 550(J) - 420(I) < 100 = False;
Biggest value Point I (420); Break;
This can be done using window functions and a bit of extra work.
Step by step, you would start by having one table (let's call it point_and_prev_next) defined by this select:
SELECT
    id, name, val,
    lag(val) OVER (ORDER BY id) AS prev_val,
    lead(val) OVER (ORDER BY id) AS next_val
FROM
    points
which produces:
| id | name | val | prev_val | next_val |
|----|---------|-----|----------|----------|
| 1 | Point A | 0 | (null) | 50 |
| 2 | Point B | 50 | 0 | 75 |
| 3 | Point C | 75 | 50 | 100 |
| 4 | Point D | 100 | 75 | 200 |
| 5 | Point E | 200 | 100 | 220 |
| 6 | Point F | 220 | 200 | 310 |
| 7 | Point G | 310 | 220 | 350 |
| 8 | Point H | 350 | 310 | 420 |
| 9 | Point I | 420 | 350 | 550 |
| 10 | Point J | 550 | 420 | (null) |
The lag and lead window functions serve to get the previous and next values from the table (sorted by id, and not partitioned by anything).
Next, we make a second table point_and_dist_prev_next, which uses val, prev_val and next_val to compute the distance to the previous point and the distance to the next point. This would be computed with the following SELECT:
SELECT
    id, name, val,
    (val - prev_val) AS dist_to_prev,
    (next_val - val) AS dist_to_next
FROM
    point_and_prev_next
This is what you get after executing it:
| id | name | val | dist_to_prev | dist_to_next |
|----|---------|-----|--------------|--------------|
| 1 | Point A | 0 | (null) | 50 |
| 2 | Point B | 50 | 50 | 25 |
| 3 | Point C | 75 | 25 | 25 |
| 4 | Point D | 100 | 25 | 100 |
| 5 | Point E | 200 | 100 | 20 |
| 6 | Point F | 220 | 20 | 90 |
| 7 | Point G | 310 | 90 | 40 |
| 8 | Point H | 350 | 40 | 70 |
| 9 | Point I | 420 | 70 | 130 |
| 10 | Point J | 550 | 130 | (null) |
And at this point (starting from point "F"), we can get the first "wrong point up" (the first one that fails the "distance to previous < 100" check) by means of the following query:
SELECT
    max(id) AS first_wrong_up
FROM
    point_and_dist_prev_next
WHERE
    dist_to_prev >= 100
    AND id <= 6 -- 6 = Point F
This just looks for the point closest to our reference one ("F") which FAILS to have a distance with the previous one < 100.
The result is:
| first_wrong_up |
|----------------|
| 5 |
The first "wrong point" going down is computed in an equivalent manner.
All these queries can be put together using Common Table Expressions, also called WITH queries, and you get:
WITH point_and_dist_prev_next AS
(
    SELECT
        id, name, val,
        val - lag(val) OVER (ORDER BY id) AS dist_to_prev,
        lead(val) OVER (ORDER BY id) - val AS dist_to_next
    FROM
        points
),
first_wrong_up AS
(
    SELECT
        max(id) AS first_wrong_up
    FROM
        point_and_dist_prev_next
    WHERE
        dist_to_prev >= 100
        AND id <= 6 -- 6 = Point F
),
first_wrong_down AS
(
    SELECT
        min(id) AS first_wrong_down
    FROM
        point_and_dist_prev_next
    WHERE
        dist_to_next >= 100
        AND id >= 6 -- 6 = Point F
)
SELECT
    (SELECT name AS "lowest value"
     FROM first_wrong_up
     JOIN points ON id = first_wrong_up),
    (SELECT name AS "biggest value"
     FROM first_wrong_down
     JOIN points ON id = first_wrong_down);
Which provides the following result:
| lowest value | biggest value |
|--------------|---------------|
| Point E | Point I |
You can check it at SQLFiddle.
NOTE: It is assumed that the id column is always increasing. If it were not, the val column would have to be used instead (assuming, obviously, that it always keeps growing).
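For readers coming to this page from the PySpark questions above, the same lag/lead idea translates fairly directly; a minimal sketch, assuming a points dataframe with the id, name and val columns from this question and the same hard-coded threshold and reference id:

from pyspark.sql import functions as F, Window

w = Window.orderBy("id")          # ordering over the whole table, no partitioning (as in the SQL)
dist = (points
        .withColumn("dist_to_prev", F.col("val") - F.lag("val").over(w))
        .withColumn("dist_to_next", F.lead("val").over(w) - F.col("val")))

# closest points around id 6 (Point F) that break the "< 100" rule
first_wrong_up = dist.filter((F.col("dist_to_prev") >= 100) & (F.col("id") <= 6)).agg(F.max("id")).first()[0]
first_wrong_down = dist.filter((F.col("dist_to_next") >= 100) & (F.col("id") >= 6)).agg(F.min("id")).first()[0]

points.filter(F.col("id").isin(first_wrong_up, first_wrong_down)).show()   # Point E and Point I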

How to set sequence number of sub-elements in T-SQL using same element as parent?

I need to set a sequence in T-SQL where the first column contains a sequence marker (which repeats) and another column is used for ordering.
It is hard to explain, so I will try with an example.
This is what I need:
|------------|-------------|----------------|
| Group Col | Order Col | Desired Result |
|------------|-------------|----------------|
| D | 1 | NULL |
| A | 2 | 1 |
| C | 3 | 1 |
| E | 4 | 1 |
| A | 5 | 2 |
| B | 6 | 2 |
| C | 7 | 2 |
| A | 8 | 3 |
| F | 9 | 3 |
| T | 10 | 3 |
| A | 11 | 4 |
| Y | 12 | 4 |
|------------|-------------|----------------|
So my marker is A (each time I meet A, I must start a new group in my result). All rows before the first A must be set to NULL.
I know that I can achieve that with a loop, but it would be a slow solution and I need to update a lot of rows (sometimes several thousand).
Is there a way to achieve this without a loop?
You can use the window version of COUNT to get the desired result:
SELECT [Group Col], [Order Col],
       COUNT(CASE WHEN [Group Col] = 'A' THEN 1 END)
           OVER (ORDER BY [Order Col]) AS [Desired Result]
FROM mytable
If you need all rows before the first A set to NULL, then use SUM instead of COUNT (a window SUM over a frame that contains only NULLs returns NULL, whereas COUNT returns 0).
Demo here
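A rough PySpark rendering of the same running-count trick, for readers following the Spark questions on this page (df, Group Col and Order Col are the names from the question; SUM is used so that rows before the first marker stay NULL):

from pyspark.sql import functions as F, Window

w = Window.orderBy("Order Col").rowsBetween(Window.unboundedPreceding, Window.currentRow)
marker = F.when(F.col("Group Col") == "A", 1)                  # NULL for non-marker rows
result = df.withColumn("Desired Result", F.sum(marker).over(w))
result.show()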

PostgreSQL Query?

DB
| ID| VALUE | Parent | Position | lft | rgt |
|---|:------:|:-------:|:--------------:|:--------:|:--------:|
| 1 | A | | | 1 | 12 |
| 2 | B | 1 | L | 2 | 9 |
| 3 | C | 1 | R | 10 | 11 |
| 4 | D | 2 | L | 3 | 6 |
| 5 | F | 2 | R | 7 | 8 |
| 6 | G | 4 | L | 4 | 5 |
Get all nodes under the current node, on the left side:
SELECT "categories".* FROM "categories" WHERE ("categories"."position" = 'L') AND ("categories"."lft" >= 1 AND "categories"."lft" < 12) ORDER BY "categories"."lft"
output { B, D, G } -- incorrect!
Question: how do I get the nodes under the current node, on the left and on the right side?
output-lft {B,D,F,G}
output-rgt {C}
It sounds like you're after something analogous to Oracle's CONNECT BY clause, which is used to traverse hierarchical data stored in a flat table.
It just so happens there's a way to do this with Postgres, using a recursive CTE. Here is the statement I came up with:
WITH RECURSIVE sub_categories AS
(
    -- non-recursive term
    SELECT * FROM categories WHERE position IS NOT NULL
    UNION ALL
    -- recursive term
    SELECT c.*
    FROM
        categories AS c
    JOIN
        sub_categories AS sc
    ON (c.parent = sc.id)
)
SELECT DISTINCT categories.value
FROM categories,
     sub_categories
WHERE ( categories.parent = sub_categories.id
        AND sub_categories.position = 'L' )
   OR ( categories.parent = 1
        AND categories.position = 'L' )
Here is a SQL Fiddle with a working example.
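Spark has no recursive CTE, so if you needed the same traversal in PySpark (the theme of the rest of this page), it would have to be an iterative self-join; a hedged sketch, assuming a categories dataframe with the id, value, parent and position columns from the question:

from pyspark.sql import functions as F

# start from the direct children of the root (id 1) on the chosen side
frontier = categories.filter((F.col("parent") == 1) & (F.col("position") == "L"))
result = frontier
while frontier.count() > 0:                         # loop until no new descendants are found
    frontier = (categories.alias("c")
                .join(frontier.alias("f"), F.col("c.parent") == F.col("f.id"))
                .select("c.*"))
    result = result.unionByName(frontier)
result.select("value").distinct().show()            # expected: B, D, F, G for the left side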