How to do OUTER JOIN in scala - scala

I have two DataFrames, df1 and df2.
df1
| id | value |
|----|-------|
|  1 |    23 |
|  2 |    23 |
|  3 |    23 |
|  2 |    25 |
|  5 |    25 |
df2
| idValue | count |
|---------|-------|
|       1 |    33 |
|       2 |    23 |
|       3 |    34 |
|      13 |    34 |
|      23 |    34 |
How do I get this?
| id | value | count |
|----|-------|-------|
|  1 |    23 |    33 |
|  2 |    23 |    23 |
|  3 |    23 |    34 |
|  2 |    25 |    23 |
|  5 |    25 |  null |
I am doing:
val groupedData = df1.join(df2, $"id" === $"idValue", "outer")
But I don't see the last column in groupedData. Is this the correct way of doing it, or am I doing something wrong?

From your expected output, you need a LEFT OUTER JOIN. Also note that df1's second column is value, not count:
val groupedData = df1.join(df2, $"id" === $"idValue", "left_outer")
  .select(df1("id"), df1("value"), df2("count"))

groupedData.take(10).foreach(println)
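For a self-contained illustration, here is a minimal sketch using the sample data from the question (the local SparkSession setup and the app name are assumptions):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("join-demo").getOrCreate()
import spark.implicits._

val df1 = Seq((1, 23), (2, 23), (3, 23), (2, 25), (5, 25)).toDF("id", "value")
val df2 = Seq((1, 33), (2, 23), (3, 34), (13, 34), (23, 34)).toDF("idValue", "count")

// left_outer keeps every row of df1; ids without a match in df2 (here 5)
// get a null count, which matches the expected output.
df1.join(df2, $"id" === $"idValue", "left_outer")
  .select(df1("id"), df1("value"), df2("count"))
  .show()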

Related

Sorting column values in pyspark

I have this DataFrame below:
Ref ° | Indice_1 | Indice_2 | 1  | 2  | indice_from     | indice_from  | indice_to       | indice_to
------+----------+----------+----+----+-----------------+--------------+-----------------+------------
1     | 19       | 37.1     | 32 | 62 | ["20031,10031"] | ["13,11/12"] | ["40062,30062"] | ["14A,14"]
2     | 19       | 37.1     | 44 | 12 | ["40062,30062"] | ["13,11/12"] | ["40062,30062"] | ["14A,14"]
3     | 19       | 37.1     | 22 | 64 | ["20031,10031"] | ["13,11/12"] | ["20031,10031"] | ["13,11/12"]
4     | 19       | 37.1     | 32 | 98 | ["20032,10032"] | ["13,11/12"] | ["40062,30062"] | ["13,11/12"]
I want to sort the values of the columns indice_from, indice_from, indice_to, and indice_to in ascending order, without touching the rest of the columns of my DataFrame.
Note that the indice_from and indice_to columns sometimes contain a number plus a letter, like ["14,14A"]. In that case the same structure should always hold: if the first value is 15, the second should be 15 plus a letter, and 15 sorts before 15 + letter; if the first value is 9, the second should be 9 plus a letter, and 9 sorts before 9 + letter.
New DataFrame:
Ref ° | Indice_1 | Indice_2 | 1  | 2  | indice_from     | indice_from  | indice_to       | indice_to
------+----------+----------+----+----+-----------------+--------------+-----------------+------------
1     | 19       | 37.1     | 32 | 62 | ["10031,20031"] | ["11/12,13"] | ["30062,40062"] | ["14,14A"]
2     | 19       | 37.1     | 44 | 12 | ["30062,40062"] | ["11/12,13"] | ["30062,40062"] | ["14,14A"]
3     | 19       | 37.1     | 22 | 64 | ["10031,20031"] | ["11/12,13"] | ["10031,20031"] | ["11/12,13"]
4     | 19       | 37.1     | 32 | 98 | ["10031,20031"] | ["11/12,13"] | ["30062,40062"] | ["11/12,13"]
Can someone please help me sort the values of the columns indice_from, indice_from, indice_to, and indice_to to obtain a new DataFrame like the second one above?
Thank you
If I understand it correctly, then
from pyspark.sql import functions as F

columns_to_sort = ['indice_from', 'indice_from', 'indice_to', 'indice_to']

for c in columns_to_sort:
    df = df.withColumn(c, F.sort_array(c))
will do the trick. Let me know if it doesn't.
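As a hedged, runnable illustration of the same idea (the one-row dataset is made up, and the column names are de-duplicated here, since withColumn cannot target one of two identically named columns):

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.master("local[*]").getOrCreate()

# Hypothetical single-row example with distinct column names.
df = spark.createDataFrame(
    [(1, ["20031", "10031"], ["13", "11/12"])],
    ["ref", "indice_from", "indice_to"],
)

for c in ["indice_from", "indice_to"]:
    df = df.withColumn(c, F.sort_array(c))  # ascending by default

df.show(truncate=False)  # [10031, 20031] and [11/12, 13]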

Sort using auxiliary fields, start and end

In PostgreSQL, what is the best way to sort records using start and end fields in a generic way, without having to include the first record (where start_id = 3) in the query?
Example table:
+-------+----------+--------+--------+
| FK_ID | START_ID | END_ID | STRING |
+-------+----------+--------+--------+
|    77 |        1 |      9 | E      |
|    82 |        5 |      2 | A      |
|    77 |        7 |      1 | I      |
|    77 |        3 |      7 | W      |
|    82 |        9 |      5 | Q      |
|    77 |        9 |      5 | X      |
|    82 |        2 |      7 | G      |
+-------+----------+--------+--------+
Sorted where FK_ID = 77:
+----+---+---+---+
| 77 | 3 | 7 | W |
| 77 | 7 | 1 | I |
| 77 | 1 | 9 | E |
| 77 | 9 | 5 | X |
+----+---+---+---+
Sorted where FK_ID = 82:
+----+---+---+---+
| 82 | 9 | 5 | Q |
| 82 | 5 | 2 | A |
| 82 | 2 | 7 | G |
+----+---+---+---+
Desired query result, one sequence per group:
+-------+----------+
| FK_ID | SEQUENCE |
+-------+----------+
| 82 | QAG |
| 77 | WIEX |
+-------+----------+
I do not think this is the most efficient way, but you can try a recursive CTE (with a depth counter so the aggregated string follows the chain order):
WITH RECURSIVE path AS (
    SELECT t1.*, 1 AS depth
    FROM myTable AS t1
    WHERE NOT EXISTS (
        SELECT 1 FROM myTable AS t2
        WHERE t2.fk_id = t1.fk_id AND t2.end_id = t1.start_id
    )
    UNION ALL
    SELECT myTable.*, path.depth + 1
    FROM myTable
    JOIN path ON path.fk_id = myTable.fk_id AND path.end_id = myTable.start_id
)
SELECT fk_id, array_to_string(array_agg(string ORDER BY depth), '') AS sequence
FROM path
GROUP BY fk_id;
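To try it out, the question's sample data can be loaded like this (the table and column names are assumptions carried over from the query above):

CREATE TABLE myTable (fk_id int, start_id int, end_id int, string text);
INSERT INTO myTable VALUES
  (77, 1, 9, 'E'), (82, 5, 2, 'A'), (77, 7, 1, 'I'), (77, 3, 7, 'W'),
  (82, 9, 5, 'Q'), (77, 9, 5, 'X'), (82, 2, 7, 'G');

-- The anchor picks the head of each chain (the start_id that is no row's
-- end_id within the same fk_id: 3 for 77, 9 for 82); the recursive step
-- then follows end_id -> start_id, yielding WIEX and QAG.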

SQL query help: Calculate max of previous rows in the same query

I want to find, for each row (where B = C = D = 1), the max of A among its previous rows (where B = C = D = 1), excluding the row itself, after the rows are ordered chronologically.
The data in the table looks like this:
+-------+-----+-----+-----+------+------+
|Grp id | B | C | D | A | time |
+-------+-----+-----+-----+------+------+
| 111 | 1 | 0 | 0 | 52 | t |
| 111 | 1 | 1 | 1 | 33 | t+1 |
| 111 | 0 | 1 | 0 | 34 | t+2 |
| 111 | 1 | 1 | 1 | 22 | t+3 |
| 111 | 0 | 0 | 0 | 12 | t+4 |
| 222 | 1 | 1 | 1 | 16 | t |
| 222 | 1 | 0 | 0 | 18 | t2+1 |
| 222 | 1 | 1 | 0 | 13 | t2+2 |
| 222 | 1 | 1 | 1 | 12 | t2+3 |
| 222 | 1 | 1 | 1 | 09 | t2+4 |
| 222 | 1 | 1 | 1 | 22 | t2+5 |
| 222 | 1 | 1 | 1 | 19 | t2+6 |
+-------+-----+-----+-----+------+------+
The table above is the result of the query below. It is obtained with the left joins shown; the joins are necessary per my project requirements.
SELECT "Grp id", B, C, D, A, time, xxx
FROM "DCR" dcr
LEFT JOIN "DCM" dcm ON dcr."Id" = dcm."DCRID"
LEFT JOIN "DC" dc ON dc."Id" = dcm."DCID"
ORDER BY dcr."time"
The Result column needs to be evaluated with the formula I described above. It has to be calculated in the same pass, since only the previous rows may be considered. The xxx above needs to be replaced by a subquery or expression that produces the result.
And the result table should look like this:
+-------+-----+-----+-----+------+------+------+
|Grp id | B | C | D | A | time |Result|
+-------+-----+-----+-----+------+------+------+
| 111 | 1 | 0 | 0 | 52 | t | - |
| 111 | 1 | 1 | 1 | 33 | t+1 | - |
| 111 | 1 | 1 | 1 | 34 | t+2 | 33 |
| 111 | 1 | 1 | 1 | 22 | t+3 | 34 |
| 111 | 0 | 0 | 0 | 12 | t+4 | - |
| 222 | 1 | 1 | 1 | 16 | t | - |
| 222 | 1 | 0 | 0 | 18 | t2+1 | - |
| 222 | 1 | 1 | 0 | 13 | t2+2 | - |
| 222 | 1 | 1 | 1 | 12 | t2+3 | 16 |
| 222 | 1 | 1 | 1 | 09 | t2+4 | 16 |
| 222 | 1 | 1 | 1 | 22 | t2+5 | 16 |
| 222 | 1 | 1 | 1 | 19 | t2+6 | 22 |
+-------+-----+-----+-----+------+------+------+
The column could be computed with a window function:
CASE WHEN b = 1 AND c = 1 AND d = 1
     THEN max(a) FILTER (WHERE b = 1 AND c = 1 AND d = 1)
          OVER (PARTITION BY "Grp id"
                ORDER BY "time"
                ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING)
     ELSE NULL
END
I didn't test it.
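Folded into the question's query, it might look like the sketch below (also untested; the quoting and table qualifiers are assumptions based on the question):

SELECT "Grp id", b, c, d, a, "time",
       CASE WHEN b = 1 AND c = 1 AND d = 1
            THEN max(a) FILTER (WHERE b = 1 AND c = 1 AND d = 1)
                 OVER (PARTITION BY "Grp id"
                       ORDER BY "time"
                       ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING)
       END AS result          -- CASE without ELSE yields NULL, shown as "-"
FROM "DCR" dcr
LEFT JOIN "DCM" dcm ON dcr."Id" = dcm."DCRID"
LEFT JOIN "DC" dc ON dc."Id" = dcm."DCID"
ORDER BY dcr."time";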

Tibco Spotfire - Calculate average only if there are minimum 3 values in a column - see desc

I want to calculate an average in Spotfire only when there are at least 3 values. If there are no values, or only 2, the average should be blank.
Raw data:
+---------+-----+---------+
| Product | Age | Average |
+---------+-----+---------+
| 1       |     |         |
| 2       |     |         |
| 3       | 10  |         |
| 4       | 12  |         |
| 5       | 13  | 11      |
| 6       |     |         |
| 7       | 18  |         |
| 8       | 19  |         |
| 9       | 20  | 19      |
| 10      | 21  | 20      |
+---------+-----+---------+
The only way I could really do this is with 3 calculated columns. Insert these calculated columns in this order:
If(Min(If([Age] IS NULL,0,[Age])) over (LastPeriods(3,[Product]))<>0,1) as [BitFlag]
Avg([Age]) over (LastPeriods(3,[Product])) as [TempAvg]
If([BitFlag]=1,[TempAvg]) as [Average]
This will give you the following results ([BitFlag] marks the rows where none of the last three [Age] values is missing, and [Average] keeps [TempAvg] only for those rows). You can ignore / hide the two columns you don't care about.
RESULTS
+---------+-----+---------+------------------+------------------+
| Product | Age | BitFlag | TempAvg | Average |
+---------+-----+---------+------------------+------------------+
| 1 | | | | |
| 2 | | | | |
| 3 | 10 | | 10 | |
| 4 | 12 | | 11 | |
| 5 | 13 | 1 | 11.6666666666667 | 11.6666666666667 |
| 6 | | | 12.5 | |
| 7 | 18 | | 15.5 | |
| 8 | 19 | | 18.5 | |
| 9 | 20 | 1 | 19 | 19 |
| 10 | 21 | 1 | 20 | 20 |
| 11 | | | 20.5 | |
| 12 | 22 | | 21.5 | |
| 13 | 36 | | 29 | |
| 14 | | | 29 | |
| 15 | 11 | | 23.5 | |
| 16 | 23 | | 17 | |
| 17 | 14 | 1 | 16 | 16 |
+---------+-----+---------+------------------+------------------+

org mode - simplify table sum formula of multiple column

Now I need 5 formulas, one for the sum of each column. It works fine, but I wish it could be simplified to one formula. Is that possible?
|----+----+----+----+----|
| a | b | c | d | e |
|----+----+----+----+----|
| 1 | 2 | 3 | 4 | 5 |
| 6 | 7 | 8 | 9 | 10 |
| 11 | 12 | 13 | 14 | 15 |
| 16 | 17 | 18 | 19 | 20 |
|----+----+----+----+----|
| 34 | 38 | 42 | 46 | 50 |
|----+----+----+----+----|
#+TBLFM: @>$5=vsum(@2$5..@-1$5)::@>$4=vsum(@2$4..@-1$4)::@>$3=vsum(@2$3..@-1$3)::@>$2=vsum(@2$2..@-1$2)::@>$1=vsum(@2$1..@-1$1)
This should work:
|----+----+----+----+----|
| a | b | c | d | e |
|----+----+----+----+----|
| 1 | 2 | 3 | 4 | 5 |
| 6 | 7 | 8 | 9 | 10 |
| 11 | 12 | 13 | 14 | 15 |
| 16 | 17 | 18 | 19 | 20 |
|----+----+----+----+----|
| 34 | 38 | 42 | 46 | 50 |
|----+----+----+----+----|
#+TBLFM: @>$1..@>$5=vsum(@2$0..@-1$0)
$0 on the RHS refers to the current column, so the single range formula covers all five columns named on the LHS. Press C-c C-c with point on the #+TBLFM line to recalculate the table.