Why isn't PySpark's partitionBy working properly?

I have a table like this:
COL1  COL2  COL3
----  ----  ----------
COMP  0005  2008-08-04
COMP  0009  2002-01-01
COMP  01.0  2002-01-01
COMP  0005  2008-01-01
COMP  0005  2001-10-20
CTEC  0009  2001-10-20
COMP  0005  2009-10-01
COMP  01.0  2003-07-01
COMP  02.0  2004-01-01
CTEC  0009  2021-09-24
First I want to partition the table on COL1, then partition again on COL2 within it, then sort COL3 in descending order. Then I'm trying to add a row number.
I write:
windowSpec = Window.partitionBy(col("COL1")).partitionBy(col("COl2")).orderBy(desc("COL3"))
TBL = TBL.withColumn(f"RANK", F.row_number().over(windowSpec))
My expected output is this:
COL1  COL2  COL3        RANK
----  ----  ----------  ----
COMP  0005  2009-10-01  1
COMP  0005  2008-08-04  2
COMP  0005  2008-01-01  3
COMP  0005  2001-10-20  4
COMP  0009  2002-01-01  1
COMP  01.0  2003-07-01  1
COMP  01.0  2002-01-01  2
COMP  02.0  2004-01-01  1
CTEC  0009  2021-09-24  1
CTEC  0009  2001-10-20  2
But the output I'm getting is like this:
COL1  COL2  COL3        RANK
----  ----  ----------  ----
COMP  0005  2009-10-01  1
COMP  0005  2008-08-04  2
COMP  0005  2008-01-01  3
COMP  0005  2001-10-20  4
COMP  0009  2002-01-01  2
COMP  01.0  2003-07-01  1
COMP  01.0  2002-01-01  2
COMP  02.0  2004-01-01  1
CTEC  0009  2021-09-24  1
CTEC  0009  2001-10-20  3
Can anyone please help me figure out where I'm making a mistake?
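For reference: chaining partitionBy calls does not stack partitions. Each call on a WindowSpec replaces the previous partitioning, so the spec above effectively partitions by COL2 alone, which is exactly what the observed ranks show (COMP/0009 and CTEC/0009 are numbered together). Passing both columns to a single partitionBy should give the expected output; a minimal sketch against the question's TBL:

from pyspark.sql import Window
from pyspark.sql import functions as F

# Partition by both columns in one call; a second .partitionBy would
# replace this partitioning rather than extend it.
windowSpec = Window.partitionBy("COL1", "COL2").orderBy(F.desc("COL3"))
TBL = TBL.withColumn("RANK", F.row_number().over(windowSpec))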

Related

Unpivot data in PostgreSQL

I have a table in PostgreSQL with the values below:
empid  hyderabad  bangalore  mumbai  chennai
    1         20         30      40       50
    2         10         20      30       40
And my output should be like below
empid  city       nos
    1  hyderabad   20
    1  bangalore   30
    1  mumbai      40
    1  chennai     50
    2  hyderabad   10
    2  bangalore   20
    2  mumbai      30
    2  chennai     40
How can I do this unpivot in PostgreSQL?
You can use a lateral join:
select t.empid, x.city, x.nos
from the_table t
cross join lateral (
    values
        (1, 'hyderabad', t.hyderabad),
        (2, 'bangalore', t.bangalore),
        (3, 'mumbai',    t.mumbai),
        (4, 'chennai',   t.chennai)
) as x(ord, city, nos)
order by t.empid, x.ord;  -- order by the ordinal, not the city name, to keep the original column order
Or this one, which is simpler to read and plain SQL:
WITH
input(empid, hyderabad, bangalore, mumbai, chennai) AS (
    SELECT 1, 20, 30, 40, 50
    UNION ALL SELECT 2, 10, 20, 30, 40
)
,
i(i) AS (
    SELECT 1
    UNION ALL SELECT 2
    UNION ALL SELECT 3
    UNION ALL SELECT 4
)
SELECT
    empid
,   CASE i
        WHEN 1 THEN 'hyderabad'
        WHEN 2 THEN 'bangalore'
        WHEN 3 THEN 'mumbai'
        WHEN 4 THEN 'chennai'
        ELSE 'unknown'
    END AS city
,   CASE i
        WHEN 1 THEN hyderabad
        WHEN 2 THEN bangalore
        WHEN 3 THEN mumbai
        WHEN 4 THEN chennai
        ELSE NULL::INT
    END AS nos  -- was aliased "city" a second time; this column holds the value
FROM input CROSS JOIN i
ORDER BY empid, i;
-- out  empid |   city    | nos
-- out -------+-----------+-----
-- out      1 | hyderabad |  20
-- out      1 | bangalore |  30
-- out      1 | mumbai    |  40
-- out      1 | chennai   |  50
-- out      2 | hyderabad |  10
-- out      2 | bangalore |  20
-- out      2 | mumbai    |  30
-- out      2 | chennai   |  40
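On a reasonably recent PostgreSQL (9.4 or later) you could also unpivot with a multi-array unnest, which avoids listing the city names twice; a sketch against the same table:

select t.empid, u.city, u.nos
from the_table t
cross join lateral unnest(
    array['hyderabad', 'bangalore', 'mumbai', 'chennai'],  -- city labels
    array[t.hyderabad, t.bangalore, t.mumbai, t.chennai]   -- matching values
) with ordinality as u(city, nos, ord)
order by t.empid, u.ord;  -- ord preserves the original column order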

Show complete date range with NULL in PostgreSQL

I'm trying to write a query that returns every date in a range, with NULLs for the dates that do not exist in the table.
For example this is my tbl_example
Original data:
id | userid(str) | comment(str) | mydate(date)
---+-------------+--------------+--------------------------
 1 | 0001        | sample1      | 2019-06-20T16:00:00.000Z
 2 | 0002        | sample2      | 2019-06-21T16:00:00.000Z
 3 | 0003        | sample3      | 2019-06-24T16:00:00.000Z
 4 | 0004        | sample4      | 2019-06-25T16:00:00.000Z
 5 | 0005        | sample5      | 2019-06-26T16:00:00.000Z
Then:
select * from tbl_example
where mydate between '2019-06-20' and date '2019-06-20' + interval '5 day';
How do I output all the dates in the range, with NULLs where no row exists, like this?
Expected output:
  id | userid(str) | comment(str) | mydate(date)
-----+-------------+--------------+--------------------------
   1 | 0001        | sample1      | 2019-06-20T16:00:00.000Z
   2 | 0002        | sample2      | 2019-06-21T16:00:00.000Z
null | null        | null         | 2019-06-22T16:00:00.000Z
null | null        | null         | 2019-06-23T16:00:00.000Z
   4 | 0003        | sample3      | 2019-06-24T16:00:00.000Z
   5 | 0004        | sample4      | 2019-06-25T16:00:00.000Z
This is my sample test environment: http://www.sqlfiddle.com/#!17/f5285/2
Try the SQL below:
with all_dates as (
    select generate_series(min(mydate), max(mydate), '1 day'::interval) as dates
    from tbl_example
)
, null_dates as (
    select a.dates
    from all_dates a
    left join tbl_example t on a.dates = t.mydate
    where t.mydate is null
)
select null as id, null as userid, null as comment, dates as mydate from null_dates
union
select * from tbl_example
order by mydate;
 id | userid | comment | mydate
----+--------+---------+---------------------
  1 | 0001   | sample1 | 2019-06-20 16:00:00
  2 | 0002   | sample2 | 2019-06-21 16:00:00
    |        |         | 2019-06-22 16:00:00
    |        |         | 2019-06-23 16:00:00
  3 | 0003   | sample3 | 2019-06-24 16:00:00
  4 | 0004   | sample4 | 2019-06-25 16:00:00
  5 | 0005   | sample5 | 2019-06-26 16:00:00
(7 rows)
Or, in the generate_series call, you can simply write the date arguments you want, as below:
select generate_series(
    '2019-06-20 16:00:00',
    '2019-06-20 16:00:00'::timestamp + '5 days'::interval,
    '1 day'::interval
) as dates
SELECT id, userid, "comment", d.mydate
FROM generate_series('2019-06-20'::date, '2019-06-25'::date, INTERVAL '1 day') d(mydate)
LEFT JOIN tbl_example ON d.mydate = tbl_example.mydate;

Check condition in date interval between now and next month

I have a table in PostgreSQL 10. The table has the following structure
| date | entity | col1 | col2 |
|------+--------+------+------|
Every row represents an event that happens to an entity in a given date. The event has attributes represented by col1 and col2.
I want to add a new column indicating whether, relative to the current row, there are events within a given interval (say one month) whose col2 fulfills a given condition (in the following example, col2 > 20).
| date | entity | col1 | col2 | fulfill |
|------+--------+------+------+---------|
| t1 | A | a1 | 10 | F |
| t1 | B | b | 9 | F |
| t2 | A | a2 | 10 | T |
| t3 | A | a3 | 25 | F |
| t3 | B | b2 | 8 | F |
t3 is a date inside t2 + interval 1 month.
What is the most efficient way to accomplish this?
I am not sure if I understood your problem correctly. My case is: 'T if there is a value >= 10 between now and the next month.'
I have the following data:
val  event_date
---  ----------
 22  2016-12-31  -- should be T because val >= 10
  8  2017-03-20  -- should be F because in [event_date, event_date + 1 month] no val >= 10
  6  2017-03-22  -- F
 42  2017-12-31  -- T because there are 2 values >= 10 in the next month
 25  2018-01-24  -- T, val >= 10
  9  2018-02-11  -- F
  1  2018-03-01  -- T because within a month there is 1 val >= 10
  2  2018-03-10  -- T, same
 20  2018-04-01  -- T
  7  2018-04-01  -- T because on the same day a val >= 10
  1  2018-07-24  -- F
 22  2019-01-01  -- T
  4  2020-10-22  -- T
123  2020-11-04  -- T
The query:
SELECT DISTINCT
    e1.val,
    e1.event_date,
    CASE
        WHEN MAX(e2.val) OVER (PARTITION BY e1.event_date) >= 10
        THEN 'T'
        ELSE 'F'
    END AS fulfilled
FROM testdata.events e1
JOIN testdata.events e2
    ON  e1.event_date <= e2.event_date
    AND e2.event_date <= (e1.event_date + interval '1 month')::DATE
ORDER BY e1.event_date
The result:
val event_date fulfilled
--- ---------- ---------
22 2016-12-31 T
8 2017-03-20 F
6 2017-03-22 F
42 2017-12-31 T
25 2018-01-24 T
9 2018-02-11 F
1 2018-03-01 T
2 2018-03-10 T
20 2018-04-01 T
7 2018-04-01 T
1 2018-07-24 F
22 2019-01-01 T
4 2020-10-22 T
123 2020-11-04 T
So far I have not found a solution that avoids joining the table to itself, which does not seem very elegant to me.
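An aside: the question targets PostgreSQL 10, but PostgreSQL 11 added RANGE frames with offsets, which can express the same T/F logic without a self-join; a sketch:

SELECT val,
       event_date,
       CASE
           WHEN max(val) OVER (
                    ORDER BY event_date
                    RANGE BETWEEN CURRENT ROW AND interval '1 month' FOLLOWING
                ) >= 10
           THEN 'T' ELSE 'F'
       END AS fulfilled
FROM testdata.events
ORDER BY event_date;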

TSQL: Need to Count Multiple Columns and Group by their Contents

I have the following dataset:
StartDate EnterDate Order#
---------- ---------- ------
2018-01-01 2018-01-01 1
2018-01-01 2018-01-01 2
2018-01-01 2018-01-02 3
2018-01-02 2018-01-02 4
2018-01-02 2018-01-03 5
2018-01-02 2018-01-03 6
2018-01-03 2018-01-04 7
2018-01-03 2018-01-04 8
2018-01-03 2018-01-04 9
2018-01-03 2018-01-05 10
I need to COUNT how many times each date appears in each column.
Example output:
Date StartDate EnterDate
---------- --------- ---------
01-01-2018 3 2
01-02-2018 3 2
01-03-2018 4 2
01-04-2018 0 3
01-05-2018 0 1
NULL can be substituted for 0.
You can use a full join to achieve that:
select
    Date      = isnull(t.StartDate, q.EnterDate),
    StartDate = isnull(t.cnt, 0),
    EnterDate = isnull(q.cnt, 0)
from (
    select StartDate, count(*) cnt
    from myTable
    group by StartDate
) t
full join (
    select EnterDate, count(*) cnt
    from myTable
    group by EnterDate
) q on t.StartDate = q.EnterDate
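Another option, a sketch assuming the same myTable: unpivot both date columns with CROSS APPLY and aggregate once, which also makes it easy to add further date columns later.

select
    d.Date,
    count(case when d.src = 'StartDate' then 1 end) as StartDate,
    count(case when d.src = 'EnterDate' then 1 end) as EnterDate
from myTable
cross apply (values
    ('StartDate', StartDate),
    ('EnterDate', EnterDate)
) d (src, Date)
group by d.Date
order by d.Date;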

How to query with lead() values not in current range

I'm having problems when lead() values are not within the range of the current row: rows at the range's edge return null lead() values.
Let’s say I have a simple table to keep track of continuous counters
create table anytable (
    wseller  integer NOT NULL,
    wday     date NOT NULL,
    wshift   smallint NOT NULL,
    wcounter numeric(9,1)
);
with the following values
wseller  wday        wshift  wcounter
      1  2016-11-30       1     100.5
      1  2017-01-03       1     102.5
      1  2017-01-25       2     103.2
      1  2017-02-05       2     106.1
      2  2015-05-05       2      81.1
      2  2017-01-01       1      92.1
      2  2017-01-01       2      93.1
      3  2016-12-01       1      45.2
      3  2017-01-05       1      50.1
and I want the net units for the current year:
wseller  wday        wshift  units
      1  2017-01-03       1    2
      1  2017-01-25       2    0.7
      1  2017-02-05       2    2.9
      2  2017-01-01       1   11
      2  2017-01-01       2    1
      3  2017-01-05       1    4.9
If I use
select wseller, wday, wshift,
       wcounter - lead(wcounter) over (partition by wseller order by wseller, wday desc, wshift desc)
from anytable
where wday >= '2017-01-01'
it gives me nulls at the edge of each wseller partition. I'm using this query within a large CTE.
What am I doing wrong?
Window functions are evaluated after the WHERE clause, so the filter removes the rows that lead() needs to look at. Move the condition to an outer query:
select *
from (
    select
        wseller, wday, wshift,
        wcounter - lead(wcounter) over (partition by wseller order by wday desc, wshift desc) as units
    from anytable
) s
where wday >= '2017-01-01'
order by wseller, wday, wshift
 wseller |    wday    | wshift | units
---------+------------+--------+-------
       1 | 2017-01-03 |      1 |   2.0
       1 | 2017-01-25 |      2 |   0.7
       1 | 2017-02-05 |      2 |   2.9
       2 | 2017-01-01 |      1 |  11.0
       2 | 2017-01-01 |      2 |   1.0
       3 | 2017-01-05 |      1 |   4.9
(6 rows)
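Since the query is meant to sit inside a larger CTE anyway, the same fix can be phrased as its own CTE step instead of a derived table; a sketch of the identical logic:

with counters as (
    select
        wseller, wday, wshift,
        wcounter - lead(wcounter) over (
            partition by wseller
            order by wday desc, wshift desc
        ) as units
    from anytable
)
select wseller, wday, wshift, units
from counters
where wday >= '2017-01-01'  -- filter only after lead() has seen all rows
order by wseller, wday, wshift;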