How to add partition filter in subquery used as table in AWS Glue PySpark? - pyspark

I have created a job using AWS Glue 4.0 with PySpark 3.3 to perform read and write operations on an Oracle DB using JDBC. I am using G.1X with 10 worker nodes.
A SQL subquery is used as dbtable. By default Spark loads the entire table into one partition, but I want the read and write to happen in parallel. For this I used the numPartitions, partitionColumn, lowerBound and upperBound options.
Sample script
def insert_in_db():
    # dbtable can be a subquery wrapped in parentheses
    sql = "(select a, b from tbl1 join tbl2 on a.id = b.id)"
    df1 = spark.read.format("jdbc") \
        .options(driver=driver, url=jdbc_url, dbtable=sql, user=usr, password=pwd,
                 numPartitions=10, partitionColumn="column_name") \
        .option("oracle.jdbc.mapDateToTimestamp", "false") \
        .option("lowerBound", "2022-11-13") \
        .option("upperBound", "2023-01-11") \
        .option("sessionInitStatement", "ALTER SESSION SET NLS_DATE_FORMAT = 'YYYY-MM-DD'") \
        .load()
On checking the logs I found the partition filters were created as below:
jdbc.JDBCRelation (Logging.scala:logInfo(61)): Number of partitions: 10, WHERE clauses of these partitions: "DATEVAL" < '2022-11-23' or "DATEVAL" is null, "DATEVAL" >= '2022-11-23' AND "DATEVAL" < '2022-11-28', "DATEVAL" >= '2022-11-28' AND "DATEVAL" < '2022-12-03', "DATEVAL" >= '2022-12-03' AND "DATEVAL" < '2022-12-08', "DATEVAL" >= '2022-12-08' AND "DATEVAL" < '2022-12-13', "DATEVAL" >= '2022-12-13' AND "DATEVAL" < '2022-12-18', "DATEVAL" >= '2022-12-18' AND "DATEVAL" < '2022-12-23', "DATEVAL" >= '2022-12-23' AND "DATEVAL" < '2022-12-28', "DATEVAL" >= '2022-12-28' AND "DATEVAL" < '2023-01-02', "DATEVAL" >= '2023-01-02'
My concern is that the table is a subquery, so when JDBC sends this query to the database for execution, the subquery will run for every partition and the partition filter created by PySpark will be added on top of it. So, for every parallel run, Oracle will run the complete query, which will be memory intensive on the database side.
Is there any solution where the partition filter gets applied as part of the subquery? I.e. instead of running a query like
select * from (select a,b from tbl1 join tbl2 on a.id=b.id) where "DATEVAL" >= '2022-11-23' AND "DATEVAL" < '2022-11-28'
the query runs like
select a,b from tbl1 join tbl2 on a.id=b.id where "DATEVAL" >= '2022-11-23' AND "DATEVAL" < '2022-11-28'
I used the numPartitions, partitionColumn, lowerBound and upperBound options for parallel read/write.
I thought this would add the partition filter inside the subquery, but it is actually added around the subquery.
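A minimal sketch of one possible workaround, assuming the date ranges can be precomputed up front (the ranges below are arbitrary examples, and driver, jdbc_url, usr and pwd are the same variables as in the script above): embed the date filter inside each subquery yourself, read each range as its own JDBC DataFrame, and union the results. This is not the built-in partitioned read, just a manual split.

from functools import reduce
from pyspark.sql import DataFrame

# Illustrative split of the overall bounds into sub-ranges; pick whatever
# granularity gives the desired degree of parallelism.
ranges = [("2022-11-13", "2022-11-28"),
          ("2022-11-28", "2022-12-13"),
          ("2022-12-13", "2023-01-11")]

def read_range(start, end):
    # The date filter is part of the subquery itself, so Oracle sees it
    # inside the inline view rather than wrapped around it.
    sql = ("(select a, b from tbl1 join tbl2 on a.id = b.id "
           "where DATEVAL >= DATE '{0}' and DATEVAL < DATE '{1}')".format(start, end))
    return (spark.read.format("jdbc")
            .options(driver=driver, url=jdbc_url, dbtable=sql, user=usr, password=pwd)
            .load())

# Each read produces a single partition; the union yields one partition per
# range, which Spark can read and process in parallel.
df1 = reduce(DataFrame.unionByName, [read_range(s, e) for s, e in ranges])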

Related

Strange PostgreSQL index usage while using LIMIT..OFFSET

PostgreSQL 9.6.3 on x86_64-pc-linux-gnu, compiled by gcc (Debian 4.9.2-10) 4.9.2, 64-bit
Table and indices:
create table if not exists orders
(
id bigserial not null constraint orders_pkey primary key,
partner_id integer,
order_id varchar,
date_created date,
state_code integer,
state_date timestamp,
recipient varchar,
phone varchar
);
create index if not exists orders_partner_id_index on orders (partner_id);
create index if not exists orders_order_id_index on orders (order_id);
create index if not exists orders_partner_id_date_created_index on orders (partner_id, date_created);
The task is to page, sort and filter the data.
The query for the first page:
select order_id, date_created, recipient, phone, state_code, state_date
from orders
where partner_id=1 and date_created between '2019-04-01' and '2019-04-30'
order by order_id asc limit 10 offset 0;
The query plan:
QUERY PLAN
"Limit (cost=19495.48..38990.41 rows=10 width=91)"
" -> Index Scan using orders_order_id_index on orders (cost=0.56..41186925.66 rows=21127 width=91)"
" Filter: ((date_created >= '2019-04-01'::date) AND (date_created <= '2019-04-30'::date) AND (partner_id = 1))"
Index orders_partner_id_date_created_index is not used, so the cost is extremely high!
But starting from some offset values (the exact value differs from time to time, looks like it depends on total row count) the index starts to be used:
select order_id, date_created, recipient, phone, state_code, state_date
from orders
where partner_id=1 and date_created between '2019-04-01' and '2019-04-30'
order by order_id asc limit 10 offset 40;
Plan:
QUERY PLAN
"Limit (cost=81449.76..81449.79 rows=10 width=91)"
" -> Sort (cost=81449.66..81502.48 rows=21127 width=91)"
" Sort Key: order_id"
" -> Bitmap Heap Scan on orders (cost=4241.93..80747.84 rows=21127 width=91)"
" Recheck Cond: ((partner_id = 1) AND (date_created >= '2019-04-01'::date) AND (date_created <= '2019-04-30'::date))"
" -> Bitmap Index Scan on orders_partner_id_date_created_index (cost=0.00..4236.65 rows=21127 width=0)"
" Index Cond: ((partner_id = 1) AND (date_created >= '2019-04-01'::date) AND (date_created <= '2019-04-30'::date))"
What's happening? Is this a way to force the server to use the index?
General answer:
Postgres stores some information about your tables.
Before executing the query, the planner prepares an execution plan based on that information.
In your case, the planner thinks that for a certain offset value this sub-optimal plan will be better. Note that your desired plan requires sorting all selected rows by order_id, while this "worse" plan does not. I'd guess that Postgres bets there will be quite many such rows for various orders and just tests one order after another, starting from the lowest.
I can think of two solutions:
A) provide more data to the planner by running
ANALYZE orders;
(https://www.postgresql.org/docs/9.6/sql-analyze.html)
or by changing the gathered statistics:
ALTER TABLE orders ALTER COLUMN ... SET STATISTICS ...;
(https://www.postgresql.org/docs/9.6/planner-stats.html)
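For example, under the assumption that partner_id and date_created are the columns whose estimates are off (the target value 1000 is only an illustrative choice):
ALTER TABLE orders ALTER COLUMN partner_id SET STATISTICS 1000;
ALTER TABLE orders ALTER COLUMN date_created SET STATISTICS 1000;
ANALYZE orders;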
B) Rewrite the query in a way that hints at the desired index usage, like this:
WITH
partner_date (partner_id, date_created) AS (
SELECT 1,
generate_series('2019-04-01'::date, '2019-04-30'::date, '1 day'::interval)::date
)
SELECT o.order_id, o.date_created, o.recipient, o.phone, o.state_code, o.state_date
FROM orders o
JOIN partner_date pd
ON (o.partner_id, o.date_created) = (pd.partner_id, pd.date_created)
ORDER BY order_id ASC LIMIT 10 OFFSET 0;
Or maybe even better:
WITH
partner_date (partner_id, date_created) AS (
SELECT 1,
generate_series('2019-04-01'::date, '2019-04-30'::date, '1 day'::interval)::date
),
all_data AS (
SELECT o.order_id, o.date_created, o.recipient, o.phone, o.state_code, o.state_date
FROM orders o
JOIN partner_date pd
ON (o.partner_id, o.date_created) = (pd.partner_id, pd.date_created)
)
SELECT *
FROM all_data
ORDER BY order_id ASC LIMIT 10 OFFSET 0;
Disclaimer - I can't explain why the first query should be interpreted in another way by the Postgres planner, I just think it could. On the other hand, the second query separates offsets/limits from joins, and I'd be very surprised if Postgres still did it the "bad" (according to your benchmarks) way.

PostgreSQL: aggregate expression on a subquery

A total beginner's question: I wanted to run a sub-query with a GROUP BY statement, and then find the row with the maximum value in the result. I built the expression below:
SELECT agg.facid, agg.Slots
FROM
(SELECT facid AS facid, SUM(slots) AS Slots FROM cd.bookings
GROUP BY facid
ORDER BY SUM(slots) DESC) AS agg
WHERE agg.Slots = (SELECT MAX(Slots) FROM agg);
In my mind, this should first create a 2-column table with facid and SUM(slots) values, and then by addressing these columns as agg.facid and agg.Slots I should get only the row with max value in "Slots". However, instead I am getting this error:
ERROR: relation "agg" does not exist
LINE 6: WHERE agg.Slots = (SELECT MAX(Slots) FROM agg);
This is probably something very simple, so I am sorry in advance for a silly problem ;)
I am working on PostgreSQL 10, with pgAdmin 4.
Use a Common Table Expression:
WITH agg AS (
SELECT facid AS facid, SUM(slots) AS Slots
FROM cd.bookings
GROUP BY facid
)
SELECT agg.facid, agg.Slots
FROM agg
WHERE agg.Slots = (SELECT MAX(Slots) FROM agg);
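If ties for the top total don't matter, a shorter variant (a sketch added here, not part of the original answer) is to sort the aggregates and keep only the first row:
SELECT facid, SUM(slots) AS Slots
FROM cd.bookings
GROUP BY facid
ORDER BY SUM(slots) DESC
LIMIT 1;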
So, after a bit more research, I figured out a solution which might be clean enough to my liking, using a Common Table Expression:
WITH sum AS (SELECT facid, SUM(slots) AS Slots FROM cd.bookings GROUP BY facid)
SELECT facid, Slots
FROM sum
WHERE Slots = (SELECT MAX(Slots) FROM sum);
The first line declares a CTE, which can later be referenced by the sub-query calculating the max value in the aggregated slots column.
Hope it helps anyone interested.
Does this do what you are looking for?
SELECT
facid,
SUM(slots)
FROM cd.bookings
GROUP BY
facid
HAVING SUM(slots) = MAX(slots)

Select non-equal value in SQL grouping

I have a dataset of food eaten:
create table test
(group_id integer,
food varchar,
item_type varchar);
insert into test values
(764, 'apple', 'new_food'),
(123, 'berry', 'new_food'),
(123, 'apple', 'others'),
(123, 'berry', 'others'),
(86, 'carrot', 'others'),
(86, 'carrot', 'new_food'),
(86, 'banana', 'others');
In each group, the new food eaten is of item_type new_food. The previous food that was being eaten is whatever else in the group doesn't equal the new_food's value.
The dataset I would like from this would be:
| group | previous_food | new_food |
|-------|---------------|----------|
| 764   | null          | apple    |
| 123   | apple         | berry    |
| 86    | banana        | carrot   |
However, I can't get the group selections correct. My attempt is currently:
select
group_id,
max(case when item_type != 'new_food' then food else null end) as previous_food,
max(case when item_type = 'new_food' then food else null end) as new_food
from test
group by group_id
However, we can't rely on the max() function to pick the correct previous food since they are not necessarily alphabetically ordered.
I just need whichever other food in the grouping != the new_food. How can I get this?
Can I avoid using a subquery or is that inevitable? The database says I can't nest aggregate functions and it is frustrating.
Here is my sqlfiddle so far: http://sqlfiddle.com/#!17/a2a46/1
EDIT: I've solved this with a subquery here: http://sqlfiddle.com/#!17/dd8b9/12 but can we do better? Surely there must be a way of doing this comparison easily within the grouping no?
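For what it's worth, here is a sketch (my own illustration, not from the thread) that avoids relying on max()'s alphabetical ordering: join each group back to its new_food row and exclude that value while aggregating the rest. It still needs a derived table, which is probably unavoidable here.
select t.group_id,
       max(t.food) filter (where t.food <> nf.food) as previous_food,
       nf.food as new_food
from test t
join (select group_id, food
      from test
      where item_type = 'new_food') nf
  on nf.group_id = t.group_id
group by t.group_id, nf.food;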

How to divide counts on a single table?

This is Postgres 8.x, specifically Redshift
I have a table that I'm querying to return a single value, which is the result of a simple division operation. The table's grain is along the lines of user_id | campaign_title.
The division operation is the count of rows where campaign_title is ilike '%completion%' divided by the count of distinct user_ids.
So I have the numerator and denominator queries all written out, but I'm honestly confused about how to combine them.
Numerator:
select count(*) as num_completed
from public.reward
where campaign_title ilike '%completion%'
;
Denominator:
select count(distinct(user_id))
from public.reward
The straightforward solution, just divide one by the other:
select (select count(*) as num_completed
from public.reward
where campaign_title ilike '%completion%')
/
(select count(distinct user_id) from public.reward);
The slightly more complicated but faster solution:
select count(case when campaign_title ilike '%completion%' then 1 end)
/
count(distinct user_id)
from public.reward;
The expression count(case when campaign_title ilike '%completion%' then 1 end) will only count rows that meet the condition specified in the when clause.
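One caveat worth adding (not part of the original answer): both counts are integers, so the division above truncates toward zero in Postgres/Redshift. Casting one operand avoids that, for example:
select count(case when campaign_title ilike '%completion%' then 1 end)::float
       / count(distinct user_id) as completion_ratio
from public.reward;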
Unrelated but:
distinct is not a function. Writing distinct(user_id) is useless. And - in the case of Postgres - it can actually get you into trouble if you keep thinking of distinct as a function, because the expression (column_one, column_2) is something different in Postgres than the list of columns: column_one, column_2

Alternative LAG function in SQL Server 2005

Maybe somebody could help me. I am using SQL Server 2005 and I can't use the lag function.
I have a table:
2014-02-03 07:42:00.000
2014-02-03 18:49:00.000
2014-02-06 14:54:00.000
2014-02-07 17:58:00.000
2014-02-20 13:39:00.000
How can I get this result:
2014-02-03 07:42:00.000 NULL
2014-02-03 18:49:00.000 2014-02-03 07:42:00.000
2014-02-06 14:54:00.000 2014-02-03 18:49:00.000
2014-02-07 17:58:00.000 2014-02-06 14:54:00.000
2014-02-20 13:39:00.000 2014-02-07 17:58:00.000
suppose your table is called dt and the column x, then:
select x,(select max(x) from dt d2 where d2.x<d1.x) from dt d1
this will work even in SQL Server 7!
fiddle
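A minimal sketch to try this out, using the question's sample values and the answer's dt/x naming (separate INSERT statements because multi-row VALUES is not available before SQL Server 2008):
create table dt (x datetime);
insert into dt values ('2014-02-03 07:42:00');
insert into dt values ('2014-02-03 18:49:00');
insert into dt values ('2014-02-06 14:54:00');
insert into dt values ('2014-02-07 17:58:00');
insert into dt values ('2014-02-20 13:39:00');

select x,
       (select max(x) from dt d2 where d2.x < d1.x) as prev_x
from dt d1
order by x;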
select min(date),null from datetable
union
select t1.date, max(t2.date)
from dateTable t1
join dateTable t2 on t2.date < t1.date
group by t1.date
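For clarity, the NULL in the first branch can be cast explicitly so the union's second column has an unambiguous type (an illustrative tweak, not the answer as posted); UNION ALL is enough here because the two branches never overlap:
select min(date), cast(null as datetime) as prev_date
from dateTable
union all
select t1.date, max(t2.date)
from dateTable t1
join dateTable t2 on t2.date < t1.date
group by t1.date
order by 1;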