How can I reduce my query execution time using PySpark?
I am using a Postgres database, and Spark is installed on my local machine, which has 10 GB of RAM.
Query execution time in pgAdmin: 10 sec
Query execution time in PySpark: 10 sec
Below is my PySpark code:
from pyspark.sql import DataFrameReader
url = "jdbc:postgresql://168.23.233.4:5432/MyDatabase"
properties = {
"driver": "org.postgresql.Driver",
"user": "postgres",
"password": "123"
}
df = sqlContext.read.jdbc(url=url,table="(select.. very big query limit 10) AS t", properties=properties)
df.show()
The query joins more than 13 tables, and each table has about 1 million rows.
Please help me make this query faster using Spark.
I tried this approach based on a blog post.
Below is the query that runs inside the PySpark code:
select '2019-02-27' as "Attendance_date",e.id as e_id,concat(e.first_name::text, e.last_name::text) as "Employee_name",e.emp_id as "Employee_id",
e.user_id as "User_id",e.customer_id,att.id attendance_id, al.id as Attendance_logs_id,aa1.id as attendance_approval_id,
e.client_emp_id as "Client_employee_id", e.contact_no as "Contact_no",
att.imei as ImeiNumber,e.email_id as "Email_id",
concat(man.first_name::text, man.last_name::text) as "Manager_name", man.id as "Manager_id",
att.Uniform,att.Samsung_Logo,att.Blue_Color_Check,att.Blue_Color_Percentage,
al.Face_Detection_Flag,rl.role_name as "Role_name",b.branch_name as "Branch_name",
b.branch_code as "Branch_code",cty.city_name as "City",sm.state as "State",
gsv1.name as "Geo_Country",gsv2.name as "Geo_State"
,sh.shift_name as "Shift_name",sh.id as shift_id
,((to_timestamp(EXTRACT(EPOCH FROM al.check_in_time::TIME) + ((tz.operator||''||tz.difference)::INTEGER))::TIME AT TIME ZONE 'utc')::TIME)
as "Check_in_time"
,al.check_in_lat as "Check_in_latitude", al.check_in_long as "Check_in_longitude",
(select string_agg(value, ', ') from json_each_text(al.check_in_address::json))as "Check_in_address",
att.check_in_late as "Check_in_late_remarks",al.check_in_distance_variation as "Check_in_distance",al.check_in_selfie as "Check_in_selfie",
case when #aa1.approval_flag = 2 then ch_in.attendance_reason end as "Check_in_rejection_remarks",qc_ch_in.attendance_reason
as "Check_in_qc_review",
case when #att.regularize_flag = 1 or #att.approval_flag = 1 THEN 'Approved'
when #aa2.approval_flag = 1 THEN 'Approved' when #aa2.approval_flag = 2 THEN 'Rejected' when #aa2.approval_flag = 0
THEN 'Pending' else null END as "Check_out_status",
case when att.attendance_type='P' and #att.approval_flag = 1 or att.attendance_type='L' and el.approval_flag=1 or
att.attendance_type='H' and eh.approval_flag=1 or att.attendance_type='M' and em.approval_flag=1 or
att.attendance_type='W' and ew.approval_flag=1 then 'Approved'
when att.attendance_type='P' and #att.approval_flag = 0 or att.attendance_type='L' and el.approval_flag=0 or
att.attendance_type='H' and eh.approval_flag=0 or att.attendance_type='M' and em.approval_flag=0 or
att.attendance_type='W' and ew.approval_flag=0 then 'Waiting for Approval'
when att.attendance_type='P' and el.approval_flag=2 or att.attendance_type='H' and eh.approval_flag=2 or
att.attendance_type='M' and em.approval_flag=2 or att.attendance_type='W' and ew.approval_flag=2 then 'Rejected'
when att.attendance_type='P' and #att.approval_flag is null then '' else 'Waiting for Approval' end as "TL approval status",
case when att.attendance_type='P' then 'Marked'
when att.attendance_type='L' then 'Marked' when att.attendance_type='HL' or att.attendance_type='HP' then 'Marked'
when att.attendance_type='W' or ewo.weekoff_id is not null then 'Marked' when (e.customer_id is null and cehv.id is not null)
or (e.customer_id is not null and ehv.id is not null) then 'Holiday'when el.employee_id is not null then 'Marked'
when eh.employee_id is not null then 'Marked' when ew.employee_id is not null then 'Marked'
when em.employee_id is not null then 'Marked' else 'Not Marked' end "Status",
case when att.attendance_type='P'
then check_in.attendance_reason when att.attendance_type='L' then lt.leave_type_name when el.employee_id is not null
then lt.leave_type_name when att.attendance_type='HL' or att.attendance_type='HP' then 'Half Day'
when att.attendance_type='W' or ewo.weekoff_id is not null then 'Week off' when (e.customer_id is null and cehv.id is not null
) or (e.customer_id is not null and ehv.id is not null) then 'Holiday' when eh.employee_id is not null then 'Holiday' when
ew.employee_id is not null then 'Week off' when em.employee_id is not null then 'Marketoff' else 'Absent'
end as "Attendance_reason",
case when att.on_behalf_attendance is not null then concat(man_behalf.first_name::text,
man_behalf.last_name::text) else null end as "Onbehalf_name",att.Check_Out_Qc_Review,att.Check_Out_Distance,
al.Check_Out_Address
from employees e
left join employee_applied_holidays eh on eh.employee_id=e.id and date('2019-02-27') between eh.from_date and eh.to_date
left join employee_applied_weekoffs ew on ew.employee_id=e.id and date('2019-02-27') between ew.from_date and ew.to_date
left join employee_applied_marketoffs em on em.employee_id=e.id and date('2019-02-27') between em.from_date and em.to_date
inner join users u on u.ref_id = e.id and u.customer_id=200
inner join user_role_groups urg on u.id = urg.user_id and urg.active_flag = 1
inner join attendance_setups ass on ass.role_group_id = urg.role_group_id
left join attendances att on att.employee_id = e.id and att.start_date = '2019-02-27' and att.delete_flag = 0
left join employee_leaves el ON el.id=(select id from employee_leaves el2 where el2.employee_id=e.id and
el2.active_flag=1 and date('2019-02-27') between el2.from_date and el2.to_date order by id desc limit 1)
left join leave_types lt ON lt.id=(select leave_type from employee_leaves el where el.employee_id=e.id and
el.active_flag=1 and date('2019-02-27') between el.from_date and el.to_date order by id desc limit 1)
left join attendance_logs al on al.attendance_id = att.id and al.attendance_flag = 1
left join attendance_approvals aa1 on al.id = aa1.attendance_log_id and aa1.action = 1 and aa1.active_flag = 1
left join attendance_approvals aa2 on al.id = aa2.attendance_log_id and aa2.action = 2 and aa2.active_flag = 1
inner join branches b on b.id = e.branch_id left join employees man on man.id = e.manager_id
left join employees man_behalf on man_behalf.id = att.on_behalf_attendance
left join employee_weekoff ewo on e.id = ewo.emp_id and date_part('dow','2019-02-27'::TIMESTAMP)+1 = ewo.weekoff_id and
ewo.active_flag =1 left join employee_holidays_view ehv on e.id = ehv.id and ehv.holiday_date = '2019-02-27'
left join company_employee_holidays_view cehv on e.id = cehv.id and ehv.holiday_date = '2019-02-27'
inner join roles rl on rl.id = e.role_id inner join cities cty on cty.id = b.city_id
inner join states on states.id = b.state_id inner join state_master sm on sm.id = states.state_id
inner join countries on countries.id = b.country_id inner join country_master cm on cm.country_id = countries.country_id
left join shifts sh on sh.id = att.shift_id left join attendance_reasons ch_in on ch_in.id = aa1.reason_id
left join sessions se on sh.id=se.shift_id left join attendance_reasons ch_out on ch_out.id = aa2.reason_id
left join attendance_reasons qc_ch_in on qc_ch_in.id = att.check_in_qc_review
left join attendance_reasons qc_ch_out on qc_ch_out.id = att.check_out_qc_review
left join attendance_reasons check_in on check_in.id = al.reason_id
left join time_zones tz on b.timezone = tz.time_zone inner join geo_outlet_mapping gom
on b.id = gom.outlet_id left join geo_structure_values gsv1 on gsv1.id = gom.level1 left join
geo_structure_values gsv2 on gsv2.id = gom.level2 left join geo_structure_values gsv3 on gsv3.id = gom.level3
where e.customer_id=200
group by concat(e.first_name::text, e.last_name::text) ,e.emp_id ,e.user_id ,e.client_emp_id , e.contact_no , e.email_id,e.profile_picture,(select string_agg(role_group_name, ', ') from role_group where role_group_id = any((select array_agg(role_group_id) from user_role_groups where user_id = u.id and active_flag = 1)::int[])),concat(man.first_name::text, man.last_name::text), rl.role_name,b.branch_name,b.branch_code,cty.city_name,sm.state,cm.country,gsv1.name,gsv2.name,case when #ass.reference_point = 1 THEN b.latitude else e.latitude END,case when #ass.reference_point = 1 THEN b.longitude else e.longitude END,sh.shift_name,sh.start_time, sh.end_time,((to_timestamp(EXTRACT(EPOCH FROM al.check_in_time::TIME) + ((tz.operator||''||tz.difference)::INTEGER))::TIME AT TIME ZONE 'utc')::TIME),case
when current_date='2019-02-27' and sh.end_time<cast(current_time as time without time zone) then null else (case when se.check_out_flag=1 then cast(att.total_hours as interval) when se.check_out_flag=0 then sh.end_time-((to_timestamp(EXTRACT(EPOCH FROM al.check_in_time::TIME) + ((tz.operator||''||tz.difference)::INTEGER))::TIME AT TIME ZONE 'utc')::TIME) end) end,al.check_in_lat, al.check_in_long,(select string_agg(value, ', ') from json_each_text(al.check_in_address::json)),att.check_in_late,al.check_in_distance_variation,al.check_in_selfie,case when #att.regularize_flag = 1 or #att.approval_flag = 1 THEN 'Approved' when #aa1.approval_flag = 1 THEN 'Approved' when #aa1.approval_flag = 2 THEN 'Rejected' when #aa1.approval_flag = 0 THEN 'Pending' else null END,case when #aa1.approval_flag = 2 then ch_in.attendance_reason end,qc_ch_in.attendance_reason,case when #att.regularize_flag = 1 or #att.approval_flag = 1 THEN 'Approved' when #aa2.approval_flag = 1 THEN 'Approved' when #aa2.approval_flag = 2 THEN 'Rejected' when #aa2.approval_flag = 0 THEN 'Pending' else null END,
case when att.attendance_type='P' and #att.approval_flag = 1 or att.attendance_type='L' and el.approval_flag=1 or
att.attendance_type='H' and eh.approval_flag=1 or att.attendance_type='M' and em.approval_flag=1 or
att.attendance_type='W' and ew.approval_flag=1 then 'Approved'
when att.attendance_type='P' and #att.approval_flag = 0 or att.attendance_type='L' and el.approval_flag=0 or
att.attendance_type='H' and eh.approval_flag=0 or att.attendance_type='M' and em.approval_flag=0 or
att.attendance_type='W' and ew.approval_flag=0 then 'Waiting for Approval'
when att.attendance_type='P' and el.approval_flag=2 or att.attendance_type='H' and eh.approval_flag=2 or
att.attendance_type='M' and em.approval_flag=2 or att.attendance_type='W' and ew.approval_flag=2 then 'Rejected'
when att.attendance_type='P' and #att.approval_flag is null then '' else 'Waiting for Approval' end,
case when att.attendance_type='P' then 'Marked' when att.attendance_type='L' then 'Marked'
when att.attendance_type='HL' or att.attendance_type='HP' then 'Marked' when att.attendance_type='W'
or ewo.weekoff_id is not null then 'Marked' when (e.customer_id is null and cehv.id is not null) or (e.customer_id is not null and ehv.id is not null) then 'Holiday' when el.employee_id is not null then 'Marked' when eh.employee_id is not null then 'Marked' when ew.employee_id is not null then 'Marked' when em.employee_id is not null then 'Marked' else 'Not Marked' end,case when att.attendance_type='P' then check_in.attendance_reason when att.attendance_type='L' then lt.leave_type_name when el.employee_id is not null then lt.leave_type_name when att.attendance_type='HL' or att.attendance_type='HP' then 'Half Day' when att.attendance_type='W' or ewo.weekoff_id is not null then 'Week off' when (e.customer_id is null and cehv.id is not null) or (e.customer_id is not null and ehv.id is not null) then 'Holiday' when eh.employee_id is not null then 'Holiday' when ew.employee_id is not null then 'Week off' when em.employee_id is not null then 'Marketoff' else 'Absent' end ,case when att.on_behalf_attendance is not null then concat(man_behalf.first_name::text,man_behalf.last_name::text) else null end
,att.id,e.customer_id,al.id,aa1.id,att.imei,man.id,att.Uniform,att.Samsung_Logo,att.Blue_Color_Check,att.Blue_Color_Percentage,
al.Face_Detection_Flag,att.Check_Out_Qc_Review,att.Check_Out_Distance,sh.id,att.id,e.id;
As far as I know, the query will take about the same time in PySpark and in pgAdmin in this case, because in both cases it is executed by the Postgres database itself.
At this point you have not yet used Spark's distributed computing or storage. You have only created a DataFrame from the output of a SQL query that ran in Postgres. Only the operations you perform on that DataFrame afterwards can show a difference in speed.
So the optimization has to happen on the Postgres side. The following points should help:
Optimize your SQL so that it runs faster.
Read the tables in chunks (with simple SQL statements) into DataFrames, and consider doing the joins and other transformations in PySpark instead of in one SQL statement with complex joins.
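As a rough sketch of the second point, Spark's JDBC source can read one table in parallel chunks when it is given the `partitionColumn`, `lowerBound`, `upperBound` and `numPartitions` options. The helper names below (`partitioned_jdbc_options`, `read_table`) and the id range are assumptions for illustration, not part of the original code, and whether this beats the single SQL statement on one local machine would have to be measured.

```python
def partitioned_jdbc_options(url, table, user, password,
                             column, lower, upper, partitions):
    """Build options for a chunked JDBC read (hypothetical helper).

    lowerBound/upperBound only decide how the range of `column` is split
    into partitions; they do not filter out any rows.
    """
    if partitions < 1:
        raise ValueError("partitions must be at least 1")
    if lower >= upper:
        raise ValueError("lower must be smaller than upper")
    return {
        "url": url,
        "dbtable": table,
        "user": user,
        "password": password,
        "driver": "org.postgresql.Driver",
        "partitionColumn": column,   # must be a numeric, date or timestamp column
        "lowerBound": str(lower),
        "upperBound": str(upper),
        "numPartitions": str(partitions),
    }


def read_table(spark, options):
    """Load one table as a DataFrame; `spark` is an existing SparkSession."""
    return spark.read.format("jdbc").options(**options).load()


# Assumed values: the id range 1..1,000,000 and 8 partitions are guesses.
employee_options = partitioned_jdbc_options(
    url="jdbc:postgresql://168.23.233.4:5432/MyDatabase",
    table="employees",
    user="postgres",
    password="123",
    column="id",
    lower=1,
    upper=1_000_000,
    partitions=8,
)
```

With a running session, `employees = read_table(spark, employee_options)` would give a DataFrame that can then be joined to the other tables with `DataFrame.join`.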
As part of a project, I changed the field names and lengths of two columns. Since I moved these database tables into the production environment, the data in those two columns has been lost. How can I load the data back into those two columns? I have a copy of the data in a file (*.tsv). It is quite large, around 200 MB, and the only tool I have is SQL Developer.
What is the best way to import these two columns in the production environment?
Here is the updated table structure.
The two fields I updated are RPTC_PAY_NBR and RPTC_PAYROLL_NBR.
desc PS_RPTC_TAX_WAGE
Name                Null      Type
PAYGROUP            NOT NULL  VARCHAR2(3)
FILE_NBR            NOT NULL  VARCHAR2(6)
CHECK_DT                      DATE
WEEK_NBR            NOT NULL  VARCHAR2(2)
RPTC_PAYROLL_NBR    NOT NULL  VARCHAR2(2)
CHECK_NBR           NOT NULL  NUMBER(10)
PAY_END_DT                    DATE
EMPLID              NOT NULL  VARCHAR2(11)
EMPL_RCD_NBR        NOT NULL  NUMBER(4)
VOID_IND            NOT NULL  VARCHAR2(1)
RPTC_PAY_NBR        NOT NULL  VARCHAR2(2)
RPTC_CUR_FEDTAX     NOT NULL  NUMBER(14,2)
RPTC_CUR_FEDWGS     NOT NULL  NUMBER(14,2)
RPTC_CUR_FUTAWGS    NOT NULL  NUMBER(14,2)
RPTC_FUTA_ER_LIMIT  NOT NULL  NUMBER(14,2)
RPTC_CUR_EE_MCWG    NOT NULL  NUMBER(14,2)
RPTC_CUR_ER_MCWG    NOT NULL  NUMBER(14,2)
RPTC_CUR_MC_SUR     NOT NULL  NUMBER(14,2)
RPTC_CUR_MC_SURWG   NOT NULL  NUMBER(14,2)
RPTC_CUR_EE_SSWG    NOT NULL  NUMBER(14,2)
RPTC_CUR_MEDTAX     NOT NULL  NUMBER(14,2)
RPTC_CUR_ER_SSWG    NOT NULL  NUMBER(14,2)
RPTC_SS_RATE        NOT NULL  NUMBER(16,4)
RPTC_SS_TAX_LMT     NOT NULL  NUMBER(14,2)
RPTC_SS_TXBL_LMT    NOT NULL  NUMBER(14,2)
RPTC_CUR_ER_SSMEDW  NOT NULL  NUMBER(14,2)
RPTC_CUR_SSTAX      NOT NULL  NUMBER(14,2)
RPTC_CUR_SSMED      NOT NULL  NUMBER(14,2)
RPTC_YTD_MED_UNC    NOT NULL  NUMBER(14,2)
RPTC_YTD_SS_UNC     NOT NULL  NUMBER(14,2)
Please advise.
I have a query that returns this result:
DATE                 VALUE1   VALUE2
2016-09-20 11:15:00  5653856  37580
2016-09-20 11:16:00  NULL     NULL
2016-09-20 11:18:00  NULL     NULL
2016-09-20 11:20:00  NULL     NULL
2016-09-20 11:30:00  5653860  37580
2016-09-20 11:32:00  NULL     NULL
2016-09-20 11:34:00  NULL     NULL
In this table, only the records at xx:00, xx:15, xx:30 and xx:45 have values; the other records are NULL.
How can I add a condition to my query so that it returns only the records at minutes 00, 15, 30 and 45 and does not show the others?
This is the query:
SELECT t.date,
MAX(CASE WHEN t.id= '924' THEN t.value END) - MAX(CASE WHEN t.id= '925' THEN t.value END) as IMA_71,
MAX(CASE WHEN t.id= '930' THEN t.value END) as IMA_73
FROM records t
where office=10
and date between '2016-09-20 11:15:00' and '2016-10-21 11:15:00'
GROUP BY t.office,t.date order by t.date asc;
You could use extract to get the minute and filter on it. Since your query already has a WHERE clause, add it as an extra condition:
and extract('minute' from t.date) in (0, 15, 30, 45)
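To check the logic of this filter without touching the real database, here is a small sketch in Python using an in-memory SQLite table loaded with the sample rows from the question. SQLite has no `extract`, so `strftime('%M', ...)` stands in for it here; the table name `result`, the column name `ts` and the function name are assumptions made for the example.

```python
import sqlite3

# Sample rows copied from the question; None stands for NULL.
SAMPLE_ROWS = [
    ("2016-09-20 11:15:00", 5653856, 37580),
    ("2016-09-20 11:16:00", None, None),
    ("2016-09-20 11:18:00", None, None),
    ("2016-09-20 11:20:00", None, None),
    ("2016-09-20 11:30:00", 5653860, 37580),
    ("2016-09-20 11:32:00", None, None),
    ("2016-09-20 11:34:00", None, None),
]


def quarter_hour_rows(samples):
    """Keep only the rows whose minute is 0, 15, 30 or 45."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE result (ts TEXT, value1 INTEGER, value2 INTEGER)")
    conn.executemany("INSERT INTO result VALUES (?, ?, ?)", samples)
    kept = conn.execute(
        "SELECT ts, value1, value2 FROM result "
        # strftime('%M', ts) is SQLite's counterpart of extract(minute from ts)
        "WHERE CAST(strftime('%M', ts) AS INTEGER) IN (0, 15, 30, 45) "
        "ORDER BY ts"
    ).fetchall()
    conn.close()
    return kept


kept_rows = quarter_hour_rows(SAMPLE_ROWS)
```

Only the 11:15 and 11:30 rows remain in `kept_rows`; the rows at 11:16, 11:18, 11:20, 11:32 and 11:34 are filtered out.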
My query is:
SELECT BRAND,BRAND_GROUP, SUB_BRAND ,SUM(INCOME) AS TOTAL_INCOME FROM
"tema".MMT WHERE BRAND_GROUP IS NULL AND SUB_BRAND IS NULL GROUP BY
BRAND,BRAND_GROUP,SUB_BRAND
UNION
SELECT BRAND,BRAND_GROUP, SUB_BRAND ,SUM(INCOME) AS TOTAL_INCOME FROM
"tema".BGT WHERE BRAND_GROUP IS NULL AND SUB_BRAND IS NULL GROUP BY
BRAND,BRAND_GROUP,SUB_BRAND;
and my output is:
BRAND BRAND_GROUP SUB_BRAND TOTAL_INCOME
----- ----------- --------- ------------
GBS NULL NULL 10000
SWG NULL NULL 10000
GBS NULL NULL 20000
STG NULL NULL 20000
GTS NULL NULL 30000
The problem is that the same BRAND (GBS) appears in two rows, and I want just one row per BRAND, like this:
BRAND BRAND_GROUP SUB_BRAND TOTAL_INCOME
----- ----------- --------- ------------
GBS   -           -                30000
STG   -           -                20000
GTS   -           -                30000
SWG   -           -                10000
Can someone help me with an idea?
I think you want to push your UNION query down into a subquery and then do the sum over its results, as below.
SELECT
    BRAND
    ,BRAND_GROUP
    ,SUB_BRAND
    ,SUM(INCOME) AS TOTAL_INCOME
FROM (
    -- per-brand total from the first table
    SELECT
        BRAND
        ,BRAND_GROUP
        ,SUB_BRAND
        ,SUM(INCOME) AS INCOME
    FROM "tema".MMT
    WHERE BRAND_GROUP IS NULL
        AND SUB_BRAND IS NULL
    GROUP BY
        BRAND
        ,BRAND_GROUP
        ,SUB_BRAND
    UNION ALL
    -- per-brand total from the second table
    SELECT
        BRAND
        ,BRAND_GROUP
        ,SUB_BRAND
        ,SUM(INCOME) AS INCOME
    FROM "tema".BGT
    WHERE BRAND_GROUP IS NULL
        AND SUB_BRAND IS NULL
    GROUP BY
        BRAND
        ,BRAND_GROUP
        ,SUB_BRAND
) tbl
GROUP BY
    BRAND
    ,BRAND_GROUP
    ,SUB_BRAND;
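Two comments:
I changed your query to use UNION ALL instead of UNION, because UNION eliminates duplicate rows. If both tables returned the same brand with the same income, one of the two rows would be dropped before the outer SUM and the total would be too low.
Do you need to select BRAND_GROUP and SUB_BRAND if you are only getting the rows where those columns are NULL? It seems somewhat redundant to me.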
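The subquery approach can be tried out with a small sketch in Python against in-memory SQLite tables. The lowercase table names, the function name and the split of the sample rows between the two tables are assumptions (the question does not say which row came from which table), and the redundant BRAND_GROUP and SUB_BRAND columns are left out of the select list.

```python
import sqlite3

QUERY = """
    SELECT brand, SUM(income) AS total_income
    FROM (
        SELECT brand, SUM(income) AS income FROM mmt
        WHERE brand_group IS NULL AND sub_brand IS NULL
        GROUP BY brand
        UNION ALL
        SELECT brand, SUM(income) AS income FROM bgt
        WHERE brand_group IS NULL AND sub_brand IS NULL
        GROUP BY brand
    ) AS tbl
    GROUP BY brand
"""


def total_income_by_brand(mmt_rows, bgt_rows):
    """Return {brand: total income} summed across both tables."""
    conn = sqlite3.connect(":memory:")
    for table, rows in (("mmt", mmt_rows), ("bgt", bgt_rows)):
        conn.execute(
            f"CREATE TABLE {table} "
            "(brand TEXT, brand_group TEXT, sub_brand TEXT, income INTEGER)"
        )
        # brand_group and sub_brand are stored as NULL, as in the question
        conn.executemany(f"INSERT INTO {table} VALUES (?, NULL, NULL, ?)", rows)
    totals = dict(conn.execute(QUERY).fetchall())
    conn.close()
    return totals


# Assumed split of the question's five output rows between the two tables.
MMT_ROWS = [("GBS", 10000), ("SWG", 10000)]
BGT_ROWS = [("GBS", 20000), ("STG", 20000), ("GTS", 30000)]
totals = total_income_by_brand(MMT_ROWS, BGT_ROWS)
```

Calling `total_income_by_brand([("GBS", 10000)], [("GBS", 10000)])` gives 20000 for GBS. With UNION in place of UNION ALL, the two identical rows would collapse into one and the total would come out as 10000.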
I am new to SQL Server 2012 and would like to find out how to create the following output using T-SQL. There are many club numbers, so I assume there has to be a loop or a cursor. Please help!
Table:
club_number  name                number
-----------  ------------------  ------
355292       NULL                NULL
NULL         Giviton Mbunge      355308
NULL         Etero Aaron         355317
NULL         Evason Banda        355326
NULL         Kachibobo Batoni    355335
NULL         Kashamba Nkhani     355344
355353       NULL                NULL
NULL         Daniel Banda        355362
NULL         James Aaron         355371
NULL         Amson Kamanga       355380
NULL         Gostino George      355399
355405       NULL                NULL
NULL         Yohane Zimba        355414
NULL         Haward M.Chilembwe  355423
NULL         Zikiele Blangete    355432
355441       NULL                NULL
Result: I would like the table above to look like the one below. Which query can do this?
club_number  name                number
-----------  ------------------  ------
355292       NULL                NULL
355292       Giviton Mbunge      355308
355292       Etero Aaron         355317
355292       Evason Banda        355326
355292       Kachibobo Batoni    355335
355292       Kashamba Nkhani     355344
355353       NULL                NULL
355353       Daniel Banda        355362
355353       James Aaron         355371
355353       Amson Kamanga       355380
355353       Gostino George      355399
355405       NULL                NULL
355405       Yohane Zimba        355414
355405       Haward M.Chilembwe  355423
355405       Zikiele Blangete    355432
355441       NULL                NULL
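-- No loop or cursor is needed: a running MAX carries the last club_number forward.
-- This relies on each club_number being lower than its members' numbers and
-- higher than every number in the previous club, as in the sample data.
SELECT club_number = MAX(club_number) OVER
       (
           ORDER BY COALESCE(club_number, number)
           ROWS UNBOUNDED PRECEDING
       ),
       name, number
FROM dbo.your_table
ORDER BY club_number, number;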
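To see why the running MAX works, here is a minimal Python sketch of the same idea, assuming (as in the sample data) that every row has either a club_number or a number, and that club numbers and member numbers increase together. The function name and the shortened sample list are made up for the illustration.

```python
def fill_club_numbers(rows):
    """Forward-fill club_number using a running maximum.

    rows: list of (club_number, name, number) tuples, with None for NULL.
    Assumes every row has at least one of club_number / number set.
    """
    def sort_key(row):
        club_number, _name, number = row
        # Same ordering as COALESCE(club_number, number) in the query
        return club_number if club_number is not None else number

    filled = []
    running_max = None
    for club_number, name, number in sorted(rows, key=sort_key):
        # Equivalent of MAX(club_number) OVER (... ROWS UNBOUNDED PRECEDING)
        if club_number is not None and (running_max is None or club_number > running_max):
            running_max = club_number
        filled.append((running_max, name, number))
    return filled


# A few rows taken from the question's sample table.
SAMPLE_ROWS = [
    (355292, None, None),
    (None, "Giviton Mbunge", 355308),
    (None, "Etero Aaron", 355317),
    (355353, None, None),
    (None, "Daniel Banda", 355362),
]
filled_rows = fill_club_numbers(SAMPLE_ROWS)
```

Each member row picks up the most recent club number that precedes it in the ordering, which is what the window function does in a single pass over the table.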