How to compare each value with every other value in pyspark?


I have a DataFrame in Spark as shown below:
a b
( 21 , 23 )
( 23 , 21 )
( 22 , 21 )
( 21 , 22 )
I want a DataFrame that looks like this:
( 21 , 22 )
( 21 , 23 )
( 22 , 21 )
( 22 , 23 )
( 23 , 21 )
( 23 , 22 )
So it should contain all possible combinations of the values in both columns. How can this be achieved?
I tried a Cartesian join, but it takes too much time even for a very small dataset. Are there any other alternatives?
Thanks.

try
zip(*pairs_rdd).flatten.deduplicate.foreach(n => (n,n-1)).cache()

It is hard to say why your join is "taking too much time" without seeing your code. I find the following method works reasonably fast for me:
from pyspark.sql import Row
df = sqlContext.createDataFrame(
[
Row(a=21, b=22),
Row(a=22, b=23),
]
)
# rename to avoid identical column names in the result
df_copy = df.alias('df_copy')
df_copy = df_copy.withColumnRenamed('a', 'a_copy')
df_copy = df_copy.withColumnRenamed('b', 'b_copy')
df.join(df_copy, how='outer').select(df.a, df_copy.b_copy).collect()
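
For reference, the output asked for in the question is every ordered pair of distinct values drawn from both columns. The following is a minimal plain-Python sketch of that logic, not Spark code; the `rows` list and the `all_pairs` helper are names made up for illustration. On a DataFrame the same idea would presumably be to collect the distinct values first and cross-join that much smaller set with itself.

```python
from itertools import permutations

# Sample rows from the question: columns a and b.
rows = [(21, 23), (23, 21), (22, 21), (21, 22)]


def all_pairs(rows):
    """Return every ordered pair (x, y), x != y, of the distinct values in both columns."""
    values = sorted({v for row in rows for v in row})
    return list(permutations(values, 2))


pairs = all_pairs(rows)
for pair in pairs:
    print(pair)
```

With the question's sample data this yields the six pairs listed in the desired output.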

Related

I need to show my table data with group by

I have data like this in a table:
ROLL_NO SHOE_NO SHOCKS CAP SHIRT TROUSER JACKET
21 7 12 32 28 22 32
22 7 15 30 22 12 22
23 8 16 31 21 14 20
24 9 17 33 28 19 26
25 7 16 30 22 12 22
26 7 15 31 22 12 22
27 8 15 30 22 12 22
28 7 17 30 22 12 22
29 8 15 30 NULL 12 22
31 8 15 30 22 12 NULL
Now I need some grouping tricks to show the data like this:
ITEM SIZE COUNT
SHOE 7 5
SHOE 8 4
SHOE 9 1
SHOCKS 12 1
SHOCKS 15 5
SHOCKS 16 2
SHOCKS 17 2
CAP 32 1
CAP 30 6
CAP 31 2
CAP 33 1
SHIRT 28 2
SHIRT 22 6
SHIRT 21 1
SHIRT NULL 1
TROUSER 22 1
TROUSER 12 7
TROUSER 14 1
TROUSER 19 1
JACKET 32 1
JACKET 22 6
JACKET 20 1
JACKET 26 1
JACKET NULL 1
The create table script and insert statements are as follows:
create table uniform_size (
ROLL_NO NUMBER UNIQUE,
SHOE_NO NUMBER,
SHOCKS NUMBER,
CAP NUMBER,
SHIRT NUMBER,
TROUSER number,
JACKET NUMBER
);
INSERT INTO UNIFORM_SIZE VALUES ( 21,7,12,32,28,22,32);
INSERT INTO UNIFORM_SIZE VALUES ( 22,7,15,30,22,12,22);
INSERT INTO UNIFORM_SIZE VALUES ( 23,8,16,31,21,14,20);
INSERT INTO UNIFORM_SIZE VALUES ( 24,9,17,33,28,19,26);
INSERT INTO UNIFORM_SIZE VALUES ( 25,7,16,30,22,12,22);
INSERT INTO UNIFORM_SIZE VALUES ( 26,7,15,31,22,12,22);
INSERT INTO UNIFORM_SIZE VALUES ( 27,8,15,30,22,12,22);
INSERT INTO UNIFORM_SIZE VALUES ( 28,7,17,30,22,12,22);
INSERT INTO UNIFORM_SIZE VALUES ( 29,8,15,30,NULL,12,22);
INSERT INTO UNIFORM_SIZE VALUES ( 31,8,15,30,22,12,NULL);
INSERT INTO UNIFORM_SIZE VALUES ( 32,NULL,15,30,22,12,23);
INSERT INTO UNIFORM_SIZE VALUES ( 33,NULL,15,31,22,12,23);
INSERT INTO UNIFORM_SIZE VALUES ( 34,9,NULL,30,22,12,23);
INSERT INTO UNIFORM_SIZE VALUES ( 35,9,18,31,22,12,23);
INSERT INTO UNIFORM_SIZE VALUES ( 36,9,NULL,30,28,12,23);
INSERT INTO UNIFORM_SIZE VALUES ( 37,9,18,30,22,12,24);
INSERT INTO UNIFORM_SIZE VALUES ( 38,10,19,30,22,12,24);
INSERT INTO UNIFORM_SIZE VALUES ( 39,10,19,30,22,14,24);
INSERT INTO UNIFORM_SIZE VALUES ( 40,NULL,NULL,NULL,NULL,NULL,NULL);
I have tried some grouping tricks but didn't get the desired result.
Thank you and regards.

It seems like you are trying to get unique values from each column and get the count of each value. Not sure if this is what you are looking for, but it seems helpful to me.
select 'SHOE' article, SHOE_NO size, count(SHOE_NO) cnt from uniform_size group by SHOE_NO
union all
select 'SHOCKS' article, SHOCKS size, count(SHOCKS) cnt from uniform_size group by SHOCKS
union all
select 'CAP' article, CAP size, count(CAP) cnt from uniform_size group by CAP
union all
select 'SHIRT' article, SHIRT size, count(SHIRT) cnt from uniform_size group by SHIRT
union all
select 'TROUSER' article, TROUSER size, count(TROUSER) cnt from uniform_size group by TROUSER
union all
select 'JACKET' article, JACKET size, count(JACKET) cnt from uniform_size group by JACKET;
result
SHOE||0
SHOE|7|5
SHOE|8|4
SHOE|9|5
SHOE|10|2
SHOCKS||0
SHOCKS|12|1
SHOCKS|15|7
SHOCKS|16|2
SHOCKS|17|2
SHOCKS|18|2
SHOCKS|19|2
CAP||0
CAP|30|12
CAP|31|4
CAP|32|1
CAP|33|1
SHIRT||0
SHIRT|21|1
SHIRT|22|13
SHIRT|28|3
TROUSER||0
TROUSER|12|14
TROUSER|14|2
TROUSER|19|1
TROUSER|22|1
JACKET||0
JACKET|20|1
JACKET|22|6
JACKET|23|5
JACKET|24|3
JACKET|26|1
JACKET|32|1
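
The zero counts on the rows with an empty size come from count(column), which skips NULLs; count(*) counts the rows in the NULL group instead, which is what the desired output in the question shows. Below is a small sketch using Python's built-in sqlite3 module. It uses a hypothetical cut-down table with only two item columns and five made-up rows, not the full uniform_size data, and the item_size alias is my own choice.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical cut-down table: only two of the item columns, five rows.
conn.execute("CREATE TABLE uniform_size (roll_no INTEGER, shoe_no INTEGER, cap INTEGER)")
conn.executemany(
    "INSERT INTO uniform_size VALUES (?, ?, ?)",
    [(21, 7, 32), (22, 7, 30), (23, 8, 31), (24, None, 30), (25, None, None)],
)

# One SELECT per item column, stacked with UNION ALL.
# COUNT(*) counts rows, so the NULL group gets its real row count.
query = """
SELECT 'SHOE' AS item, shoe_no AS item_size, COUNT(*) AS cnt
FROM uniform_size GROUP BY shoe_no
UNION ALL
SELECT 'CAP', cap, COUNT(*)
FROM uniform_size GROUP BY cap
"""
counts = {(item, size): cnt for item, size, cnt in conn.execute(query)}
for key, cnt in counts.items():
    print(key, cnt)
```

A NULL size shows up as None in the Python result, with the number of rows that have no value for that item.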

Usage of DISTINCT in reversed int pairs duplicates elimination

I have the following question:
create table memorization_word_translation
(
id serial not null,
from_word_id integer not null,
to_word_id integer not null
);
This table stores pairs of integers, that are often in reverse order, for example:
35 36
35 37
36 35
37 35
37 39
39 37
Question is - if I make a query, for example:
select * from memorization_word_translation
where from_word_id = 35 or to_word_id = 35
I would get
35 36
35 37
36 35 - duplicate of 35 36
37 35 - duplicate of 35 37
How can I use DISTINCT in this example to filter out all duplicates, even if they are reversed?
I want to keep it only like this:
35 36
35 37

You can do it with ROW_NUMBER() window function:
select from_word_id, to_word_id
from (
select *,
row_number() over (
partition by least(from_word_id, to_word_id),
greatest(from_word_id, to_word_id)
order by (from_word_id > to_word_id)::int
) rn
from memorization_word_translation
where 35 in (from_word_id, to_word_id)
) t
where rn = 1
See the demo.

demo:db<>fiddle
You could try it with a small sorting step (here a single comparison) in combination with DISTINCT ON.
The DISTINCT ON clause works on arbitrary columns or expressions, e.g. on a tuple. This CASE expression puts the two columns into a tuple in a fixed order, so a pair and its reverse produce the same tuple and only one of them is kept. The source columns can still be returned in your SELECT statement:
select distinct on (
CASE
WHEN (from_word_id >= to_word_id) THEN (from_word_id, to_word_id)
ELSE (to_word_id, from_word_id)
END
)
*
from memorization_word_translation
where from_word_id = 35 or to_word_id = 35
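
Both answers rest on the same idea: map each pair to an order-independent key (smaller id first) and keep one row per key. Here is a plain-Python sketch of that idea, not PostgreSQL; the `dedupe_reversed` helper is a hypothetical name, and preferring the row whose from-id is smaller mirrors the ORDER BY in the ROW_NUMBER() query.

```python
# Pairs from the question, in table order: (from_word_id, to_word_id).
pairs = [(35, 36), (35, 37), (36, 35), (37, 35), (37, 39), (39, 37)]


def dedupe_reversed(pairs, word_id):
    """Keep one row per unordered pair that involves word_id."""
    seen = {}
    for a, b in pairs:
        if word_id not in (a, b):
            continue
        key = (min(a, b), max(a, b))  # same key for (a, b) and (b, a)
        # Prefer the row whose from-id is the smaller of the two.
        if key not in seen or a < b:
            seen[key] = (a, b)
    return sorted(seen.values())


result = dedupe_reversed(pairs, 35)
print(result)
```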

Postgres: Nested records in a Recursive query in depth first manner

I am working on a simple comment system where a user can comment on other comments, thus creating a hierarchy. To get the comments in a hierarchical order I am using Common Table Expression in Postgres.
Below are the fields and the query used:
id
user_id
parent_comment_id
message
WITH RECURSIVE CommentCTE AS (
SELECT id, parent_comment_id, user_id
FROM comment
WHERE parent_comment_id is NULL
UNION ALL
SELECT child.id, child.parent_comment_id, child.user_id
FROM comment child
JOIN CommentCTE
ON child.parent_comment_id = CommentCTE.id
)
SELECT * FROM CommentCTE
The above query returns records in a breadth first manner:
id parent_comment_id user_id
10 null 30
9 null 30
11 9 30
14 10 31
15 10 31
12 11 30
13 12 31
But can it be modified to achieve something like below where records are returned together for that comment set, in a depth first manner? The point is to get the data in this way to make rendering on the Front-end smoother.
id parent_comment_id user_id
9 null 30
11 9 30
12 11 30
13 12 31
10 null 30
14 10 31
15 10 31

Generally I solve this problem by synthesising a "Path" column which can be sorted lexically, e.g. 0001:0003:0006:0009 is a child of 0001:0003:0006. Each child entry can be created by concatenating the path element to the parent's path. You don't have to return this column to the client, just use it for sorting.
id parent_comment_id user_id sort_key
9 null 30 0009
11 9 30 0009:0011
12 11 30 0009:0011:0012
13 12 31 0009:0011:0012:0013
10 null 30 0010
14 10 31 0010:0014
15 10 31 0010:0015
The path element doesn't have to be anything in particular provided it sorts lexically in the order you want children at that level to sort, and is unique at that level. Basing it on an auto-incrementing ID is fine.
Using a fixed length path element is not strictly speaking necessary but makes it easier to reason about.
WITH RECURSIVE CommentCTE AS (
SELECT id, parent_comment_id, user_id,
lpad(id::text, 4, '0') sort_key
FROM comment
WHERE parent_comment_id is NULL
UNION ALL
SELECT child.id, child.parent_comment_id, child.user_id,
concat(CommentCTE.sort_key, ':', lpad(child.id::text, 4, '0'))
FROM comment child
JOIN CommentCTE
ON child.parent_comment_id = CommentCTE.id
)
SELECT * FROM CommentCTE order by sort_key
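
To check the ordering end to end, here is a sketch that runs the same kind of recursive query through Python's built-in sqlite3 module. It uses SQLite rather than Postgres, so printf('%04d', ...) stands in for lpad and || for concat; the table contents are the sample rows from the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE comment (id INTEGER PRIMARY KEY, parent_comment_id INTEGER, user_id INTEGER)"
)
conn.executemany(
    "INSERT INTO comment VALUES (?, ?, ?)",
    [(9, None, 30), (10, None, 30), (11, 9, 30), (12, 11, 30),
     (13, 12, 31), (14, 10, 31), (15, 10, 31)],
)

# Each row's sort_key is its parent's key plus its own zero-padded id,
# so ordering by sort_key lists every comment right after its ancestors.
rows = conn.execute("""
WITH RECURSIVE CommentCTE(id, parent_comment_id, user_id, sort_key) AS (
    SELECT id, parent_comment_id, user_id, printf('%04d', id)
    FROM comment
    WHERE parent_comment_id IS NULL
    UNION ALL
    SELECT child.id, child.parent_comment_id, child.user_id,
           CommentCTE.sort_key || ':' || printf('%04d', child.id)
    FROM comment child
    JOIN CommentCTE ON child.parent_comment_id = CommentCTE.id
)
SELECT id, sort_key FROM CommentCTE ORDER BY sort_key
""").fetchall()

depth_first_ids = [row[0] for row in rows]
print(depth_first_ids)
```

Note that four digits of padding only sort correctly while ids stay below 10000; a wider pad would be needed beyond that.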

While loop to add data for pivot

Currently I have a requirement which needs a table to look like this:
Instrument Long Short 2013 2014 2015 2016 2017 2018 2019 2020 2021 2022 ....
Fixed 41 41 35 35 35 35 35 35 35 53 25 25
Index 16 16 22 22 22 32 12 12 12 12 12 12
Credits 29 29 41 16 16 16 16 16 16 16 16 16
Short term 12 12 5 5 5 5 5 5 5 5 5 17
My worktable looks like the following:
Instrument Long Short Annual Coupon Maturity Date Instrument ID
Fixed 10 10 10 01/01/2025 1
Index 5 5 10 10/05/2016 2
Credits 15 15 16 25/06/2020 3
Short term 12 12 5 31/10/2022 4
Fixed 13 13 15 31/03/2030 5
Fixed 18 18 10 31/01/2019 6
Credits 14 14 11 31/12/2013 7
Index 11 11 12 31/10/2040 8
..... etc
So basically the long and the short in the pivot should be the sums over all instrument IDs of each instrument. Then, for each year, I need the sum of the Annual Coupon values up to the maturity date year, where the long and the coupon rate are added together.
My thinking was that I would have to create a while loop that populates a table with one record per year for each instrument until its maturity date, so that I could then pivot it using an SQL pivot somehow. Does this seem feasible? Are there any other ideas on the best way of doing this? In particular I might need help with the while loop.

The following solution uses a numbers table to unfold ranges in your table, performs some special processing on some of the data columns in the unfolded set, and finally pivots the results:
WITH unfolded AS (
SELECT
t.Instrument,
Long = SUM(z.Long ) OVER (PARTITION BY Instrument),
Short = SUM(z.Short) OVER (PARTITION BY Instrument),
Year = y.Number,
YearValue = t.AnnualCoupon + z.Long + z.Short
FROM YourTable t
CROSS APPLY (SELECT YEAR(t.MaturityDate)) x (Year)
INNER JOIN numbers y ON y.Number BETWEEN YEAR(GETDATE()) AND x.Year
CROSS APPLY (
SELECT
Long = CASE y.Number WHEN x.Year THEN t.Long ELSE 0 END,
Short = CASE y.Number WHEN x.Year THEN t.Short ELSE 0 END
) z (Long, Short)
),
pivoted AS (
SELECT *
FROM unfolded
PIVOT (
SUM(YearValue) FOR Year IN ([2013], [2014], [2015], [2016], [2017], [2018], [2019], [2020],
[2021], [2022], [2023], [2024], [2025], [2026], [2027], [2028], [2029], [2030],
[2031], [2032], [2033], [2034], [2035], [2036], [2037], [2038], [2039], [2040])
) p
)
SELECT *
FROM pivoted
;
It returns results for a static range of years. To use it for a dynamically calculated year range, you'll first need to prepare the list of years as a CSV string, something like this:
SET @columnlist = STUFF(
(
SELECT ', [' + CAST(Number AS varchar(4)) + ']'
FROM numbers
WHERE Number BETWEEN YEAR(GETDATE())
AND (SELECT YEAR(MAX(MaturityDate)) FROM YourTable)
ORDER BY Number
FOR XML PATH ('')
),
1, 2, ''
);
then put it into the dynamic SQL version of the query:
SET @sql = N'
WITH unfolded AS (
...
PIVOT (
SUM(YearValue) FOR Year IN (' + @columnlist + ')
) p
)
SELECT *
FROM pivoted;
';
and execute the result:
EXECUTE(@sql);
You can try this solution at SQL Fiddle.
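
The unfold-then-pivot logic can also be followed outside the database. Below is a plain-Python sketch of it, under stated assumptions: it applies the same rule as the unfolded CTE (the coupon every year, plus Long and Short in the maturity year), the `unfold_and_pivot` helper is a made-up name, and the start year is passed in explicitly instead of being taken from the current date.

```python
from collections import defaultdict

# (instrument, long, short, annual_coupon, maturity_year) from the question's worktable.
rows = [
    ("Fixed", 10, 10, 10, 2025),
    ("Index", 5, 5, 10, 2016),
    ("Credits", 15, 15, 16, 2020),
    ("Short term", 12, 12, 5, 2022),
    ("Fixed", 13, 13, 15, 2030),
    ("Fixed", 18, 18, 10, 2019),
    ("Credits", 14, 14, 11, 2013),
    ("Index", 11, 11, 12, 2040),
]


def unfold_and_pivot(rows, start_year):
    """Return {instrument: {year: value}}, one entry per year up to each maturity."""
    pivot = defaultdict(lambda: defaultdict(int))
    for instrument, long_, short, coupon, maturity in rows:
        # "Unfold": one value per year from start_year through the maturity year.
        for year in range(start_year, maturity + 1):
            value = coupon
            if year == maturity:
                value += long_ + short  # principal is added only in the final year
            pivot[instrument][year] += value
    return pivot


pivot = unfold_and_pivot(rows, start_year=2013)  # start_year is an assumption
print(dict(pivot["Fixed"]))
```

The inner loop plays the part of the join against the numbers table, and the nested dictionary plays the part of PIVOT.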

Extract Unique Time Slices in Oracle

I use Oracle 10g and I have a table that stores a snapshot of data on a person for a given day. Every night an outside process adds new rows to the table for any person who has had any changes to their core data (stored elsewhere). This allows a query to be written using a date to find out what a person 'looked' like on some past day. A new row is added to the table even if only a single aspect of the person has changed, which means that many columns have duplicate values from slice to slice, since not every detail changes in each snapshot.
Below is a data sample:
SliceID PersonID StartDt Detail1 Detail2 Detail3 Detail4 ...
1 101 08/20/09 Red Vanilla N 23
2 101 08/31/09 Orange Chocolate N 23
3 101 09/15/09 Yellow Chocolate Y 24
4 101 09/16/09 Green Chocolate N 24
5 102 01/10/09 Blue Lemon N 36
6 102 01/11/09 Indigo Lemon N 36
7 102 02/02/09 Violet Lemon Y 36
8 103 07/07/09 Red Orange N 12
9 104 01/31/09 Orange Orange N 12
10 104 10/20/09 Yellow Orange N 13
I need to write a query that pulls out time slices records where some pertinent bits, not the whole record, have changed. So, referring to the above, if I only want to know the slices in which Detail3 has changed from its previous value, then I would expect to only get rows having SliceID 1, 3 and 4 for PersonID 101 and SliceID 5 and 7 for PersonID 102 and SliceID 8 for PersonID 103 and SliceID 9 for PersonID 104.
I'm thinking I should be able to use some sort of Oracle Hierarchical Query (using CONNECT BY [PRIOR]) to get what I want, but I have not figured out how to write it yet. Perhaps YOU can help.
Thank you for your time and consideration.

Here is my take on the LAG() solution, which is basically the same as that of egorius, but I show my workings ;)
SQL> select * from
2 (
3 select sliceid
4 , personid
5 , startdt
6 , detail3 as new_detail3
7 , lag(detail3) over (partition by personid
8 order by startdt) prev_detail3
9 from some_table
10 )
11 where prev_detail3 is null
12 or ( prev_detail3 != new_detail3 )
13 /
SLICEID PERSONID STARTDT N P
---------- ---------- --------- - -
1 101 20-AUG-09 N
3 101 15-SEP-09 Y N
4 101 16-SEP-09 N Y
5 102 10-JAN-09 N
7 102 02-FEB-09 Y N
8 103 07-JUL-09 N
9 104 31-JAN-09 N
7 rows selected.
SQL>
The point about this solution is that it hauls in results for 103 and 104, who don't have slice records where detail3 has changed. If that is a problem we can apply an additional filtration, to return only rows with changes:
SQL> with subq as (
2 select t.*
3 , row_number () over (partition by personid
4 order by sliceid ) rn
5 from
6 (
7 select sliceid
8 , personid
9 , startdt
10 , detail3 as new_detail3
11 , lag(detail3) over (partition by personid
12 order by startdt) prev_detail3
13 from some_table
14 ) t
15 where t.prev_detail3 is null
16 or ( t.prev_detail3 != t.new_detail3 )
17 )
18 select sliceid
19 , personid
20 , startdt
21 , new_detail3
22 , prev_detail3
23 from subq sq
24 where exists ( select null from subq x
25 where x.personid = sq.personid
26 and x.rn > 1 )
27 order by sliceid
28 /
SLICEID PERSONID STARTDT N P
---------- ---------- --------- - -
1 101 20-AUG-09 N
3 101 15-SEP-09 Y N
4 101 16-SEP-09 N Y
5 102 10-JAN-09 N
7 102 02-FEB-09 Y N
SQL>
edit
As egorius points out in the comments, the OP does want hits for all users, even if they haven't changed, so the first version of the query is the correct solution.
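
To make the LAG() logic concrete, here is a plain-Python sketch of the first query's behaviour: sort by person and start date, then keep a slice when there is no previous slice for that person or when Detail3 differs from the previous one. The data is the question's sample; the `changed_slices` helper is a hypothetical name.

```python
from datetime import date
from itertools import groupby

# (slice_id, person_id, start_dt, detail3) from the question's sample data.
slices = [
    (1, 101, date(2009, 8, 20), "N"),
    (2, 101, date(2009, 8, 31), "N"),
    (3, 101, date(2009, 9, 15), "Y"),
    (4, 101, date(2009, 9, 16), "N"),
    (5, 102, date(2009, 1, 10), "N"),
    (6, 102, date(2009, 1, 11), "N"),
    (7, 102, date(2009, 2, 2), "Y"),
    (8, 103, date(2009, 7, 7), "N"),
    (9, 104, date(2009, 1, 31), "N"),
    (10, 104, date(2009, 10, 20), "N"),
]


def changed_slices(slices):
    """Return slice ids whose detail3 differs from the person's previous slice."""
    result = []
    ordered = sorted(slices, key=lambda s: (s[1], s[2]))  # person, then start date
    for _, person_rows in groupby(ordered, key=lambda s: s[1]):
        prev = None  # plays the role of LAG(): nothing precedes the first slice
        for slice_id, _, _, detail3 in person_rows:
            if prev is None or prev != detail3:
                result.append(slice_id)
            prev = detail3
    return result


print(changed_slices(slices))
```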

In addition to OMG Ponies' answer: if you need to query slices for all persons, you'll need partition by:
SELECT s.sliceid
, s.personid
FROM (SELECT t.sliceid,
t.personid,
t.detail3,
LAG(t.detail3) OVER (
PARTITION BY t.personid ORDER BY t.startdt
) prev_val
FROM t) s
WHERE (s.prev_val IS NULL OR s.prev_val != s.detail3)

I think you'll have better luck with the LAG function:
SELECT s.sliceid
FROM (SELECT t.sliceid,
t.personid,
t.detail3,
LAG(t.detail3) OVER (PARTITION BY t.personid ORDER BY t.startdt) prev_val
FROM TABLE t) s
WHERE s.personid = 101
AND (s.prev_val IS NULL OR s.prev_val != s.detail3)
Subquery Factoring alternative:
WITH slices AS (
SELECT t.sliceid,
t.personid,
t.detail3,
LAG(t.detail3) OVER (PARTITION BY t.personid ORDER BY t.startdt) prev_val
FROM TABLE t)
SELECT s.sliceid
FROM slices s
WHERE s.personid = 101
AND (s.prev_val IS NULL OR s.prev_val != s.detail3)
