I have 2 datasets that I can represent as follows.
The first dataframe is my raw data. It contains millions of rows and around 6,000 areas.
+--------+------+------+-----+-----+
| user | area | time | foo | bar |
+--------+------+------+-----+-----+
| Alice | A | 5 | ... | ... |
| Alice | B | 12 | ... | ... |
| Bob | A | 2 | ... | ... |
| Charly | C | 8 | ... | ... |
+--------+------+------+-----+-----+
The second dataframe is a mapping table. It has around 200 areas (not the ~6,000 of the raw data) for 150 places. Each area can map to 1-N places (and a place can belong to 1-N areas). It can be represented unpivoted this way:
+------+--------+-------+
| area | place | value |
+------+--------+-------+
| A | placeZ | 0.1 |
| B | placeB | 0.6 |
| B | placeC | 0.4 |
| C | placeA | 0.1 |
| C | placeB | 0.04 |
| D | placeA | 0.4 |
| D | placeC | 0.6 |
| ... | ... | ... |
+------+--------+-------+
or pivoted
+------+--------+--------+--------+-----+
| area | placeA | placeB | placeC | ... |
+------+--------+--------+--------+-----+
| A | 0 | 0 | 0 | ... |
| B | 0 | 0.6 | 0.4 | ... |
| C | 0.1 | 0.04 | 0 | ... |
| D | 0.4 | 0 | 0.6 | ... |
+------+--------+--------+--------+-----+
I would like to create a kind of product-join to have something like:
+--------+--------+--------+--------+-----+--------+
| user | placeA | placeB | placeC | ... | placeZ |
+--------+--------+--------+--------+-----+--------+
| Alice | 0 | 7.2 | 4.8 | 0 | 0.5 | <- 7.2 and 4.8 come from area B, 0.5 from area A
| Bob | 0 | 0 | 0 | 0 | 0.2 |
| Charly | 0.8 | 0.32 | 0 | 0 | 0 |
+--------+--------+--------+--------+-----+--------+
I see 2 options so far.

Option 1:
- perform a left join between the main table and the pivoted one
- multiply each column by the time (around 150 columns)
- group by user with a sum

Option 2:
- perform an outer join between the main table and the unpivoted one
- multiply the time by the value
- pivot on place
- group by user with a sum
I don't like the first option because of the number of multiplications involved (the mapping dataframe is quite sparse).
I prefer the second option, but I see two problems:
- If someday a place is not represented in the dataset, its column will not exist and the output will have a different shape (hence failing downstream).
- Other features like foo and bar will be duplicated by the outer join, and I'll have to handle them case by case at the grouping stage (sum or average).
I would like to know if there is something more ready-to-use for this kind of product-join in Spark. I have seen OneHotEncoder, but it only puts a "1" in each column (so it is even worse than the first solution).
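To make option 2 concrete, here is a rough, untested sketch of what I have in mind (raw and mapping are placeholder names for the two DataFrames above):

import org.apache.spark.sql.functions._

// raw:     user, area, time, foo, bar, ...
// mapping: area, place, value  (unpivoted form)
val result = raw
  .join(mapping, Seq("area"), "outer")                 // join on area
  .withColumn("weighted", col("time") * col("value"))  // multiply time by value
  .groupBy("user")
  .pivot("place")                                      // one column per place
  .agg(sum("weighted"))                                // sum per user
  .na.fill(0.0)                                        // missing place/user combinations become 0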
Thanks in advance,
Nicolas
Related
I have a table like the following one:
+---------+-------+-------+-------------+
| Section | Group | Level | Fulfillment |
+---------+-------+-------+-------------+
| A | Y | 1 | 82.2 |
| A | Y | 2 | 23.2 |
| A | M | 1 | 81.1 |
| A | M | 2 | 28.2 |
| B | Y | 1 | 89.1 |
| B | Y | 2 | 58.2 |
| B | M | 1 | 32.5 |
| B | M | 2 | 21.4 |
+---------+-------+-------+-------------+
And this would be my desired output:
+---------+-------+--------------------+--------------------+
| Section | Group | Level1_Fulfillment | Level2_Fulfillment |
+---------+-------+--------------------+--------------------+
| A | Y | 82.2 | 23.2 |
| A | M | 81.1 | 28.2 |
| B | Y | 89.1 | 58.2 |
| B | M | 32.5 | 21.4 |
+---------+-------+--------------------+--------------------+
Thus, for each section and group I'd like to obtain the percentage of fulfillment for level 1 and level 2. To achieve this I've tried crosstab(), but that function returns an error ("The provided SQL must return 3 columns: rowid, category, and values.") because I'm using more than three columns (I need to keep section and group as identifiers for each row). Is it possible to use crosstab in this case?
Regards.
I find crosstab() unnecessarily complicated to use and prefer conditional aggregation:
select section,
       "group",
       max(fulfillment) filter (where level = 1) as level_1,
       max(fulfillment) filter (where level = 2) as level_2
from the_table
group by section, "group"
order by section;
I am learning to work with Scala and Spark; this is my first time using them. I have a structured Scala Dataset (org.apache.spark.sql.Dataset) in the following format:
Region | Id | RecId | Widget | Views | Clicks | CTR
1 | 1 | 101 | A | 5 | 1 | 0.2
1 | 1 | 101 | B | 10 | 4 | 0.4
1 | 1 | 101 | C | 5 | 1 | 0.2
1 | 2 | 401 | A | 5 | 1 | 0.2
1 | 2 | 401 | D | 10 | 2 | 0.1
NOTE: CTR = Clicks/Views
I want to merge the rows regardless of Widget (i.e. grouping by Region, Id, RecId).
The expected output I want is like the following:
Region | Id | RecId | Views | Clicks | CTR
1 | 1 | 101 | 20 | 6 | 0.3
1 | 2 | 401 | 15 | 3 | 0.2
What I am getting is like below:
>>> ds.groupBy("Region","Id","RecId").sum().show()
Region | Id | RecId | sum(Views) | sum(Clicks) | sum(CTR)
1 | 1 | 101 | 20 | 6 | 0.8
1 | 2 | 401 | 15 | 3 | 0.3
I understand that it is summing up the CTR values from the original rows, but I want to group as described and still get the expected CTR value. I also don't want the column names to change, as they do with my approach.
Is there any way of calculating it in this manner? I also have #Purchases and ConversionRate (#Purchases/Views), and I want to do the same thing with those fields. Any leads will be appreciated.
You can calculate the CTR after the aggregation. Try the code below.

import org.apache.spark.sql.functions.{col, sum}

ds.groupBy("Region", "Id", "RecId")
  .agg(sum(col("Views")).as("Views"), sum(col("Clicks")).as("Clicks"))
  // CTR = Clicks / Views, recomputed from the aggregated totals
  .withColumn("CTR", col("Clicks") / col("Views"))
  .show()
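If the dataset also has the #Purchases column mentioned in the question (assuming it is literally named Purchases), ConversionRate can be recomputed the same way from the aggregated totals:

ds.groupBy("Region", "Id", "RecId")
  .agg(
    sum(col("Views")).as("Views"),
    sum(col("Clicks")).as("Clicks"),
    sum(col("Purchases")).as("Purchases"))             // assumed column name
  .withColumn("CTR", col("Clicks") / col("Views"))
  .withColumn("ConversionRate", col("Purchases") / col("Views"))
  .show()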
Suppose such a spreadsheet in an Org table:
|------------+-------+------------+--------+--------+------------|
| Date | Items | Unit Price | Amount | Amount | Categories |
|------------+-------+------------+--------+--------+------------|
| 2019/09/17 | A | 2.64 | 1 | 2.64 | materials |
| | B | 52.67 | 2 | 105.34 | diagnosis |
| | C | 3.08 | 1 | 3.08 | materials |
| | D | 3.85 | 2 | 7.7 | materials |
| | E | 33.66 | 2 | 67.32 | materials |
| | F | 40 | 1 | 40 | treatments |
| | G | 16.5 | 1 | 16.5 | materials |
| | H | 4 | 3 | 12 | treatments |
| | I | 40 | 1 | 40 | bed |
| | M | 6 | 13 | 78 | treatments |
|------------+-------+------------+--------+--------+------------|
#+TBLFM: $5=$3*$4
How could I copy the date 2019/09/17 down to the bottom of the Date column?
The link that @manandearth posted in the comments describes how to duplicate (perhaps with slight modifications) the entries in a column. Briefly, pressing S-RET in a cell duplicates its contents from the cell above (if it is not empty); if the current cell is full and the cell below is empty, it duplicates the full cell into the empty one. If the contents are numeric, then the "duplication" involves a slight modification: it increases the value by 1. The same happens with a date: it advances the date to the next day (but the date has to be in a format that Org mode recognizes: either an active date <YYYY-MM-DD> or an inactive date [YYYY-MM-DD]). The increment is 1 by default in these cases, but it can be set to something else by setting the variable org-table-copy-increment to a different value. That's the "interactive" case I mention in my comment.
The other way to fill a column in a table is by using a formula. For example here's a formula to fill the first column with a copy of the first entry in the column:
#+TBLFM: @3$1..@>$1 = @2$1
This says: Set all rows from row 3 (@3) to the last row (@>) of column 1 ($1) to the value of the cell in row 2 (@2), column 1 ($1). Note that row 1 is the header. Press C-c C-c on the table formula line above and ... wait, what happened?
|------------+-------+------------+--------+--------+------------|
| Date | Items | Unit Price | Amount | Amount | Categories |
|------------+-------+------------+--------+--------+------------|
| 2019/09/17 | A | 2.64 | 1 | 2.64 | materials |
| 13.196078 | B | 52.67 | 2 | 105.34 | diagnosis |
| 13.196078 | C | 3.08 | 1 | 3.08 | materials |
| 13.196078 | D | 3.85 | 2 | 7.7 | materials |
| 13.196078 | E | 33.66 | 2 | 67.32 | materials |
| 13.196078 | F | 40 | 1 | 40 | treatments |
| 13.196078 | G | 16.5 | 1 | 16.5 | materials |
| 13.196078 | H | 4 | 3 | 12 | treatments |
| 13.196078 | I | 40 | 1 | 40 | bed |
| 13.196078 | M | 6 | 13 | 78 | treatments |
|------------+-------+------------+--------+--------+------------|
#+TBLFM: @3$1..@>$1 = @2$1
It does not quite work in this case for a technical reason: Org mode uses Calc in table formula calculations and Calc looks at 2019/09/17 and says: "Aha, I have to divide 2019 by 9 and then divide the result by 17", and fills the rest of the column with the result of the divisions: 13.196078. You may have meant 2019/09/17 to be a date, but Org mode does not know that: it gives it to Calc which interprets it as an arithmetic expression. The solution here is the same as in the linked answer: make Org mode aware that it's a date by making it either an active date: <2019-09-17> or an inactive date: [2019-09-17]:
|------------------+-------+------------+--------+--------+------------|
| Date | Items | Unit Price | Amount | Amount | Categories |
|------------------+-------+------------+--------+--------+------------|
| [2019-09-17] | A | 2.64 | 1 | 2.64 | materials |
| [2019-09-17 Tue] | B | 52.67 | 2 | 105.34 | diagnosis |
| [2019-09-17 Tue] | C | 3.08 | 1 | 3.08 | materials |
| [2019-09-17 Tue] | D | 3.85 | 2 | 7.7 | materials |
| [2019-09-17 Tue] | E | 33.66 | 2 | 67.32 | materials |
| [2019-09-17 Tue] | F | 40 | 1 | 40 | treatments |
| [2019-09-17 Tue] | G | 16.5 | 1 | 16.5 | materials |
| [2019-09-17 Tue] | H | 4 | 3 | 12 | treatments |
| [2019-09-17 Tue] | I | 40 | 1 | 40 | bed |
| [2019-09-17 Tue] | M | 6 | 13 | 78 | treatments |
|------------------+-------+------------+--------+--------+------------|
#+TBLFM: @3$1..@>$1 = @2$1
This does not do automatic incrementation but if that's what you want, it's easy to accomplish: Calc can do calculations on dates, so we can increment daily by adding to the date in each row the row number minus 2 (e.g. row 3 would get an increment of 3 - 2 = 1, row 4 would get 4 - 2 = 2, etc). To accomplish this, you have to get the row number of the current row: the idiom is @#. Then the formula becomes:
#+TBLFM: @3$1..@>$1 = @2$1 + @# - 2
and the table becomes:
|------------------+-------+------------+--------+--------+------------|
| Date | Items | Unit Price | Amount | Amount | Categories |
|------------------+-------+------------+--------+--------+------------|
| [2019-09-17] | A | 2.64 | 1 | 2.64 | materials |
| [2019-09-18 Wed] | B | 52.67 | 2 | 105.34 | diagnosis |
| [2019-09-19 Thu] | C | 3.08 | 1 | 3.08 | materials |
| [2019-09-20 Fri] | D | 3.85 | 2 | 7.7 | materials |
| [2019-09-21 Sat] | E | 33.66 | 2 | 67.32 | materials |
| [2019-09-22 Sun] | F | 40 | 1 | 40 | treatments |
| [2019-09-23 Mon] | G | 16.5 | 1 | 16.5 | materials |
| [2019-09-24 Tue] | H | 4 | 3 | 12 | treatments |
| [2019-09-25 Wed] | I | 40 | 1 | 40 | bed |
| [2019-09-26 Thu] | M | 6 | 13 | 78 | treatments |
|------------------+-------+------------+--------+--------+------------|
#+TBLFM: @3$1..@>$1 = @2$1 + @# - 2
The various anomalies of the display of dates (do we include the day of the week? do we include the time?) might be worked around using org-time-stamp-custom-formats but that gets us into waters that I have not explored.
I have a table with records that look like this:
| id | coord-x | coord-y | time |
---------------------------------
| 1 | 0 | 0 | 123 |
| 1 | 0 | 1 | 124 |
| 1 | 0 | 3 | 125 |
The time column represents a time in milliseconds. What I want to do is find all coord-x, coord-y as a set of points for a given timeframe for a given id. For any given id there is a unique coord-x, coord-y, and time.
What I need to do, however, is group these points as long as they are no more than n milliseconds apart. So if I have this:
| id | coord-x | coord-y | time |
---------------------------------
| 1 | 0 | 0 | 123 |
| 1 | 0 | 1 | 124 |
| 1 | 0 | 3 | 125 |
| 1 | 0 | 6 | 140 |
| 1 | 0 | 7 | 141 |
I would want a result similar to this:
| id | points | start-time | end-time |
| 1 | (0,0), (0,1), (0,3) | 123 | 125 |
| 1 | (0,6), (0,7) | 140 | 141 |
I do have PostGIS installed on my database. The times I posted above are not representative; I kept them small just as a sample. The time is just a millisecond timestamp.
The tricky part is picking the expression inside your GROUP BY. If n = 5, you can do something like time / 5. To match the example exactly, the query below uses (time - 3) / 5. Once you group it, you can aggregate them into an array with array_agg.
SELECT
array_agg(("coord-x", "coord-y")) as points,
min(time) AS time_start,
max(time) AS time_end
FROM "<your_table>"
WHERE id = 1
GROUP BY (time - 3) / 5
Here is the output
+---------------------------+--------------+------------+
| points | time_start | time_end |
|---------------------------+--------------+------------|
| {"(0,0)","(0,1)","(0,3)"} | 123 | 125 |
| {"(0,6)","(0,7)"} | 140 | 141 |
+---------------------------+--------------+------------+
Given this table:
| a | b | c |
|---+---+----|
| 3 | 4 | |
| 1 | 2 | |
| 1 | 3 | |
| 2 | 2 | |
I want to get the dot product of the two columns a and b; the result should be equal to (3*4)+(1*2)+(1*3)+(2*2), which is 21.
I don't want to use a clumsy formula like (B1*B2+C1*C2+D1*D2+E1*E2), because I actually have a large table to calculate.
I know Emacs's Calc tool has a "vprod" function which can do this sort of thing, but I don't know how to turn a full column into a vector.
Can anybody tell me how to achieve this task? I'd appreciate it!
In Emacs Calc, the simple product of two vectors gives the dot product.
This works (I put the result in @6$3; also the parentheses can be omitted):
| a | b | c |
|---+---+----|
| 3 | 4 | |
| 1 | 2 | |
| 1 | 3 | |
| 2 | 2 | |
|---+---+----|
| | | 21 |
#+TBLFM: @6$3=(@I$1..@II$1)*(@I$2..@II$2)
@I and @II span from the 1st hline to the second.
This can be solved using babel and R in org-mode:
#+name: mytable
| a | b | c |
|---+---+----|
| 3 | 4 | |
| 1 | 2 | |
| 1 | 3 | |
| 3 | 2 | |
#+begin_src R :var mytable=mytable
sum(mytable$a * mytable$b)
#+end_src
#+RESULTS:
: 23