There are three tables in Hive: A, B, and C.
Table A has the following columns and is partitioned by Day. We need to extract data from 1st Jan 2016 till 31st Dec 2016. I've only shown a sample, but these records run into the millions for the year.
ID Day Name Description
1 2016-09-01 Sam Retail
2 2016-01-28 Chris Retail
3 2016-02-06 ChrisTY Retail
4 2016-02-26 Christa Retail
3 2016-12-06 ChrisTu Retail
4 2016-12-31 Christi Retail
Table B
ID SkEY
1 1.1
2 1.2
3 1.3
Table C
Start_Date End_Date Month_No
2016-01-01 2016-01-31 1
2016-02-01 2016-02-28 2
2016-03-01 2016-03-31 3
2016-04-01 2016-04-30 4
2016-05-01 2016-05-31 5
2016-06-01 2016-06-30 6
2016-07-01 2016-07-31 7
2016-08-01 2016-08-31 8
2016-09-01 2016-09-30 9
2016-10-01 2016-10-31 10
2016-11-01 2016-11-30 11
2016-12-01 2016-12-31 12
I've tried to write the code in Spark, but it didn't work: it resulted in a Cartesian product on the join, and performance was also very bad.
Df_A = spark.sql("select * from A join C where a.day >= c.start_date \
    and a.day <= c.end_date and c.month_no = {0}".format(i))
The answer should be PySpark code in which every month is processed: the value of i should automatically be incremented from 1 to 12 along with the month dates, A should join C on the date ranges as shown above and join B on ID, and performance should be good.
from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

def process_day(row):
    # Log progress, then join one day's slice of A to B on ID.
    print(row[0])
    abc2 = spark.sql("select * from A where day = '{0}'".format(row[0]))
    joined = abc2.join(tab2, abc2.ID == tab2.ID) \
        .select(tab2.SkEY, abc2.Day, abc2.Name, abc2.Description)
    # insertInto appends the result to the existing Hive table.
    joined.write.mode("append").insertInto("Table")

abc = spark.sql("select distinct day from A "
                "where day >= '2016-01-01' and day <= '2016-12-31'")
tab2 = spark.sql("select * from B where ID is not null")
for row in abc.collect():
    process_day(row)
The code above works, but it takes a long time because the number of columns is around 60 (I've only used 3 in the sample). I also didn't join Table C, as I wasn't sure how to join it without creating a Cartesian join. Performance isn't good, and I'm not sure how to optimise the query.
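One way to avoid the day-by-day loop entirely is a single pass that range-joins A to the 12-row calendar table C and joins B on ID. This is only a minimal sketch under the schema shown above; the output path is a placeholder, and broadcasting C is what keeps the range join from degenerating into a shuffled Cartesian product:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

a = spark.table("A").filter(F.col("Day").between("2016-01-01", "2016-12-31"))
b = spark.table("B")
c = spark.table("C")

# C has only 12 rows, so broadcast it: each row of A is matched against
# the month ranges locally instead of via a shuffled cross join.
result = (a.join(F.broadcast(c),
                 (a["Day"] >= c["Start_Date"]) & (a["Day"] <= c["End_Date"]))
           .join(b, "ID")
           .select("SkEY", "Day", "Name", "Description", "Month_No"))

# Partitioning the output by month gives the per-month result without a loop;
# the path below is hypothetical.
result.write.mode("overwrite").partitionBy("Month_No").parquet("/tmp/a_b_c_2016")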
I have a complex situation in PostgreSQL 11 where I need to generate a numbering based on a single figure that I get from a CTE. Below is the CTE:
WITH pending_orders_to_be_processed_details
AS
(
SELECT ROW_NUMBER() OVER(ORDER BY so.create_date ) as queue_no
, name, so.create_date::TIMESTAMP
FROM picking sp
LEFT JOIN order so ON so.name=sp.origin
WHERE sp.state IN('assigned','confirmed')
)
,orders_which_can_be_processed_today AS
(
-- This CTE will give me a count of orders and its hourly average;
-- let's say the count is 400 and the hourly average is 3
)
Now I need to number the details according to the hourly average: the first 3 orders should be ranked 1, the next 3 ranked 2, and so on, so that I can identify which orders can be processed based on this ranking.
The input will be
name queue_no create_date
so1 1 2021-03-11 12:00:00
so2 2 2021-03-11 13:00:00
so3 3 2021-03-11 14:00:00
so4 4 2021-03-11 15:00:00
so5 5 2021-03-11 16:00:00
so6 6 2021-03-11 17:00:00
so7 7 2021-03-11 18:00:00
so8 8 2021-03-11 19:00:00
so9 9 2021-03-11 20:00:00
The expected output will be
name rank
so1 1
so2 1
so3 1
so4 2
so5 2
so6 2
so7 3
so8 3
so9 3
Any help/suggestions?
Edit: I recently learned about a function, which fits well here:
demo:db<>fiddle
You can use the ntile() window function for that:
SELECT
*,
ntile(3) OVER (ORDER BY create_date)
FROM mytable
demo:db<>fiddle
Since you already created a cumulative row count, you can use this to create your expected rank:
SELECT
*,
floor((queue_no - 1) / 3) + 1 as rank
FROM my_cte
queue_no - 1 shifts 1..3 down to 0..2 (and 4..6 to 3..5, and so on)
Dividing by 3 turns 0..2 into 0.x and 3..5 into 1.x, ...
floor() rounds these results down to 0, 1, 2, ...
If you want to start with 1 instead of 0, add 1
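To drive the group size from the second CTE instead of hard-coding 3, the two CTEs can be combined. This is a minimal sketch assuming orders_which_can_be_processed_today exposes the hourly average in a column called hourly_avg (a hypothetical name; substitute whatever that CTE actually returns):

SELECT p.name,
       floor((p.queue_no - 1) / o.hourly_avg) + 1 AS rank
FROM pending_orders_to_be_processed_details p
CROSS JOIN orders_which_can_be_processed_today o  -- one row, so no row multiplication
ORDER BY p.queue_no;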
I have table data as below. Within each code, I need to fetch the rows where (Value2 - Value1) * 2 of one row >= (Value2 - Value1) of the row with the consecutive date. (All dates are uniform within all codes.)
---------------------------------------
code Date Value1 Value2
---------------------------------------
1 1-1-2018 13 14
1 2-1-2018 14 16
1 4-1-2018 15 18
2 1-1-2019 1 3
2 2-1-2018 2 3
2 4-1-2018 3 7
For example, the output needs to be
1 1-1-2018 13 14
As a beginner to SQL coding I've tried my best, but I cannot work out how to compare only consecutive dates.
Use a self join.
You can specify all the conditions you've listed in the ON clause:
SELECT T0.code, T0.Date, T0.Value1, T0.Value2
FROM Table AS T0
JOIN Table AS T1
  ON T0.code = T1.code
  AND T1.Date = DateAdd(Day, 1, T0.Date)
  AND (T0.Value2 - T0.Value1) * 2 >= T1.Value2 - T1.Value1
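If "consecutive" should mean the next available date for the code rather than exactly the next calendar day (the sample skips 3-1-2018), a window-function sketch with LEAD is an alternative, reusing the placeholder table name from above. Note that, unlike the day+1 join, this would also pair 2-1-2018 with 4-1-2018:

SELECT code, Date, Value1, Value2
FROM (
    SELECT code, Date, Value1, Value2,
           -- difference of the next row (by date) within the same code
           LEAD(Value2 - Value1) OVER (PARTITION BY code ORDER BY Date) AS next_diff
    FROM Table
) AS T
WHERE (Value2 - Value1) * 2 >= next_diff;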
I'm getting errors in PostgreSQL when writing a GROUP BY query. I'm sure someone will tell me to put all the columns I've selected into the GROUP BY, but that will not give me the correct results. I'm writing a query that selects all the vehicles in the database and groups the results by vehicle, giving me the total quantity and cost for a given period. Here is how I'm doing the query:
SELECT i.vehicle AS vehicle,
i.costcenter AS costCenter,
i.department AS department,
SUM(i.quantity) AS liters,
SUM(i.totalcost) AS Totalcost,
v.model AS model,
v.vtype AS vtype
FROM fuelissuances AS i
LEFT JOIN vehicles AS v ON i.vehicle = v.id
WHERE i.dates::text LIKE '%2019-03%' AND i.deleted_at IS NULL
GROUP BY i.vehicle;
If I put all the columns from the SELECT into the GROUP BY, the results will not be correct. How do I go about this without putting all the columns in the GROUP BY or creating sub-queries?
The fuel table looks like:
vehicle dates department quantity totalcost
1 2019-01-01 102 12 1200
1 2019-01-05 102 15 1500
1 2019-01-13 102 18 1800
1 2019-01-22 102 10 1000
2 2019-01-01 102 12 1260
2 2019-01-05 102 19 1995
2 2019-01-13 102 28 2940
Vehicle Table
id model vtype
1 1 2
2 4 6
2 5 7
These are the results I expect from the query:
vehicle dates department quantity totalcost model vtype
1 2019-01-01 102 12 1200 1 2
1 2019-01-05 102 15 1500 1 2
1 2019-01-13 102 18 1800 1 2
1 2019-01-22 102 10 1000 1 2
1 2019-01-18 102 10 1000 1 2
1 65 6500
2 2019-01-01 102 12 1260 5 7
2 2019-01-05 102 19 1995 5 7
2 2019-01-13 102 28 2940 5 7
2 59 6195
Your query doesn't really make sense. Apparently there can be multiple departments and costcenters per vehicle in the fuelissuances table - which of those should be returned?
One way to deal with that, is to return all of them, e.g. as an array:
SELECT i.vehicle,
array_agg(i.costcenter) as costcenters,
array_agg(i.department) as departments,
SUM(i.quantity) AS liters,
SUM(i.totalcost) AS Totalcost,
v.model,
v.vtype
FROM fuelissuances AS i
LEFT JOIN vehicles AS v ON i.vehicle = v.id
WHERE i.dates >= date '2019-03-01'
and i.dates < date '2019-04-01'
AND i.deleted_at IS NULL
group by i.vehicle, v.model, v.vtype;
Instead of an array, you could also return a comma-separated list of those values, e.g. string_agg(i.costcenter, ',') as costcenters.
Adding the columns v.model and v.vtype won't (shouldn't) change anything as the group by i.vehicle will only return a single vehicle anyway and thus the model and vtype won't change for that in the group.
Note that I removed the useless aliases and replaced the condition on the date with a proper range condition that can make use of an index on the dates column.
Edit
Based on your new sample data, you want a running total rather than a "regular" aggregation. This can easily be done using window functions:
SELECT i.vehicle,
i.costcenter,
i.department,
SUM(i.quantity) over (w) AS liters,
SUM(i.totalcost) over (w) AS Totalcost,
v.model,
v.vtype
FROM fuelissuances AS i
LEFT JOIN vehicles AS v ON i.vehicle = v.id
WHERE i.dates >= date '2019-01-01'
and i.dates < date '2019-02-01'
AND i.deleted_at IS NULL
window w as (partition by i.vehicle order by i.dates)
order by i.vehicle, i.dates;
I would not create those "total" rows using SQL, but rather in the front end that displays the data.
Online example: https://rextester.com/CRJZ27446
You need to use nested queries to get those SUMs inside that query.
SELECT i.vehicle AS vehicle,
i.costcenter AS costCenter,
i.department AS department,
(SELECT SUM(i.quantity) FROM TABLES WHERE CONDITIONS GROUP BY vehicle) AS liters,
(SELECT SUM(i.totalcost) FROM TABLES WHERE CONDITIONS GROUP BY vehicle) AS Totalcost,
v.model AS model,
v.vtype AS vtype
FROM fuelissuances AS i
LEFT JOIN vehicles AS v ON i.vehicle = v.id
WHERE i.dates::text LIKE '%2019-03%' AND i.deleted_at IS NULL;
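A concrete version of that idea, sketched with the table and the March 2019 filter from the question (the correlated sub-queries restrict each SUM to the outer row's vehicle, so no GROUP BY is needed):

SELECT i.vehicle,
       i.costcenter,
       i.department,
       (SELECT SUM(i2.quantity)
        FROM fuelissuances AS i2
        WHERE i2.vehicle = i.vehicle
          AND i2.dates >= date '2019-03-01'
          AND i2.dates < date '2019-04-01'
          AND i2.deleted_at IS NULL) AS liters,
       (SELECT SUM(i2.totalcost)
        FROM fuelissuances AS i2
        WHERE i2.vehicle = i.vehicle
          AND i2.dates >= date '2019-03-01'
          AND i2.dates < date '2019-04-01'
          AND i2.deleted_at IS NULL) AS totalcost,
       v.model,
       v.vtype
FROM fuelissuances AS i
LEFT JOIN vehicles AS v ON i.vehicle = v.id
WHERE i.dates >= date '2019-03-01'
  AND i.dates < date '2019-04-01'
  AND i.deleted_at IS NULL;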
I am trying to group data by the day of the year it falls on. I have been able to achieve this with the code below. The issue is that I lose the information as to which day (i.e. Jan 1st, Jan 2nd, etc.) each grouping represents; I am simply left with a number (e.g. 1, 2, etc.) representing the day of the year. Is there any way to convert this number back into a more descriptive date? Thanks a lot.
CREATE TABLE tmp2 AS
SELECT extract(doy from trd_exctn_dt) as day_of_year
,sum(dollar_vol) AS dollar_vol
FROM tmp
GROUP BY extract(doy from trd_exctn_dt);
Current Output:
day_of_year | dollar_vol
------------|------------
          1 |         10
          2 |         15
          3 |          7
Desired output (N.B. the exact format of the first column doesn't matter too much; I would be happy with DD/MM, MM/DD or any other clear output):
day_of_year | dollar_vol
------------|------------
Jan 1 | 10
Jan 2 | 15
Jan 3 | 7
Using the to_char function:
SELECT to_char(trd_exctn_dt,'MM/DD') as day_of_year ,sum(dollar_vol) AS dollar_vol
FROM tmp
GROUP BY day_of_year;
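If only the day-of-year number is available (as in the tmp2 table created above), it can also be turned back into a label after the fact. This is a minimal sketch assuming all the data comes from one known year such as 2018 (adjust the anchor date; leap years shift the numbers after Feb 28):

SELECT to_char(date '2018-01-01' + (day_of_year::int - 1), 'Mon DD') AS day_label,
       dollar_vol
FROM tmp2;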
Assume you have a table named etl_change_fact with the following records.
issue_id | ingest_date | verb, status
10 2015-01-24 00:00:00 1,1
10 2015-01-25 00:00:00 2,2
10 2015-01-26 00:00:00 2,3
10 2015-01-27 00:00:00 3,4
11 2015-01-10 00:00:00 1,3
11 2015-01-11 00:00:00 2,4
I need the following results
10 2015-01-26 00:00:00 2,3
11 2015-01-11 00:00:00 2,4
I am trying out this query
select *
from etl_change_fact
where ingest_date = (select max(ingest_date)
from etl_change_fact);
However, this gives me only this record:
10 2015-01-26 00:00:00 2,3
But I want all unique records (per issue_id) with
(a) the max(ingest_date) AND
(b) the verb column prioritised as (2 - first preferred, 1 - second preferred, 3 - last preferred).
Hence, I need the following results
10 2015-01-26 00:00:00 2,3
11 2015-01-11 00:00:00 2,4
Please help me to query it efficiently.
P.S.: I am not going to index ingest_date because I am going to set it as the "distribution key" in a distributed computing setup. I am a newbie to data warehousing and querying, so please suggest an optimized way to query my TB-sized DB.
This is a typical "greatest-n-per-group" problem. If you search for that tag here, you'll get plenty of solutions, including MySQL ones.
For Postgres, the quickest way to do it is using distinct on (a Postgres proprietary extension to the SQL language):
select distinct on (issue_id) issue_id, ingest_date, verb, status
from etl_change_fact
order by issue_id,
case verb
when 2 then 1
when 1 then 2
else 3
end, ingest_date desc;
You can also enhance your original query with a correlated sub-query; note that this only picks the latest ingest_date and ignores the verb priority:
select f1.*
from etl_change_fact f1
where f1.ingest_date = (select max(f2.ingest_date)
from etl_change_fact f2
where f1.issue_id = f2.issue_id);
Edit
For an outdated and unsupported Postgres version, you can probably get away with using something like this:
select f1.*
from etl_change_fact f1
where f1.ingest_date = (select f2.ingest_date
from etl_change_fact f2
where f1.issue_id = f2.issue_id
order by case verb
when 2 then 1
when 1 then 2
else 3
end, ingest_date desc
limit 1);
SQLFiddle example: http://sqlfiddle.com/#!15/3bb05/1
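For completeness, a portable sketch of the same prioritisation using the standard row_number() window function (available in Postgres 8.4+ and in modern MySQL), reusing the CASE ordering from above:

select issue_id, ingest_date, verb, status
from (
    select f.*,
           -- rank rows within each issue: preferred verb first, then latest date
           row_number() over (partition by issue_id
                              order by case verb
                                           when 2 then 1
                                           when 1 then 2
                                           else 3
                                       end,
                                       ingest_date desc) as rn
    from etl_change_fact f
) t
where rn = 1;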