Loading data from an Oracle table using Spark JDBC is extremely slow (pyspark)

I am trying to read 500 million records from a table using Spark JDBC and then perform a join on that table.
When I execute the SQL from SQL Developer it takes 25 minutes, but when I load the table using Spark JDBC it runs forever; the last run went for 18 hours before I cancelled it.
I am using AWS Glue for this.
This is how I read the table with Spark JDBC:
df = (glueContext.read.format("jdbc")
      .option("url", "jdbc:oracle:thin://abcd:1521/abcd.com")
      .option("user", "USER_PROD")
      .option("password", "ffg#Prod")
      .option("numPartitions", 15)
      .option("partitionColumn", "OUTSTANDING_ACTIONS")
      .option("lowerBound", 0)
      .option("upperBound", 1000)
      .option("dbtable", "FSP.CUSTOMER_CASE")
      .option("driver", "oracle.jdbc.OracleDriver")
      .load())
df.createOrReplaceTempView("customer_caseOnpremView")
I have used OUTSTANDING_ACTIONS as the partitionColumn, and here is the data distribution.
Column 1 is the partition column value and column 2 is its row count:
1 8988894
0 4227894
5 2264259
9 2263534
8 2262628
2 2261704
3 2260580
4 2260335
7 2259747
6 2257970
This is my join, where loading the customer_caseOnpremView table takes more than 18 hours while the other two tables take about a minute:
ThirdQueryResuletOnprem = spark.sql("""
    SELECT CP.CLIENT_ID, COUNT(1) NoofCases
    FROM customer_caseOnpremView CC
    JOIN groupViewOnpremView FG
      ON FG.ID = CC.OWNER_ID
    JOIN client_platformViewOnpremView CP
      ON CP.CLIENT_ID = SUBSTR(FG.PATH, 2, INSTR(FG.PATH, '/') + INSTR(SUBSTR(FG.PATH, 1 + INSTR(FG.PATH, '/')), '/') - 2)
    WHERE FG.STATUS = 'ACTIVE'
      AND FG.TYPE = 'CLIENT'
    GROUP BY CP.CLIENT_ID
""")
Please suggest how to make this faster.
I have tried between 10 and 40 workers, and I have gone from the Standard executor type up to GP2, the biggest one, but it had no impact on the job.

As your query has a lot of filters, you don't even need to bring in the whole dataset and then filter it in Spark. Instead, you can push the query down to the database engine, which will filter the data and return only the result to the Glue job.
This can be done as explained in https://stackoverflow.com/a/54375010/4326922, and below is an example for MySQL, which can be applied to Oracle too with a few changes.
query= "(select ab.id,ab.name,ab.date1,bb.tStartDate from test.test12 ab join test.test34 bb on ab.id=bb.id where ab.date1>'" + args['start_date'] + "') as testresult"
datasource0 = spark.read.format("jdbc").option("url", "jdbc:mysql://host.test.us-east-2.rds.amazonaws.com:3306/test").option("driver", "com.mysql.jdbc.Driver").option("dbtable", query).option("user", "test").option("password", "Password1234").load()
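Applied to the Oracle query in the question, the pushed-down statement passed as the dbtable option could look roughly like the sketch below. This is only an illustration: FSP.GROUP_TABLE and FSP.CLIENT_PLATFORM are hypothetical stand-ins for whatever Oracle tables actually back groupViewOnpremView and client_platformViewOnpremView, and the remaining read options (url, user, driver, and so on) stay as in the question.
-- Hypothetical pushdown for the question's query: the join and aggregation
-- run inside Oracle, so only the small aggregated result is shipped to Glue.
-- Spark requires the statement to be wrapped in parentheses and aliased
-- before it can be passed as the "dbtable" option.
(SELECT CP.CLIENT_ID,
        COUNT(1) AS NOOFCASES
 FROM FSP.CUSTOMER_CASE CC
 JOIN FSP.GROUP_TABLE FG          -- assumed name of the source group table
   ON FG.ID = CC.OWNER_ID
 JOIN FSP.CLIENT_PLATFORM CP      -- assumed name of the client platform table
   ON CP.CLIENT_ID = SUBSTR(FG.PATH, 2,
        INSTR(FG.PATH, '/') + INSTR(SUBSTR(FG.PATH, 1 + INSTR(FG.PATH, '/')), '/') - 2)
 WHERE FG.STATUS = 'ACTIVE'
   AND FG.TYPE = 'CLIENT'
 GROUP BY CP.CLIENT_ID) query_result
With the aggregation done inside Oracle, only a handful of rows come back, so the partitionColumn/lowerBound/upperBound options should no longer be needed for this read.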

Related

Large Queries Hanging with the MongoDB BI Connector (and Tableau)

I’m hoping that someone with experience using the MongoDB BI Connector and Tableau will be able to help me out. I am totally lost and don’t have any idea how to debug this issue I’m having with the BI Connector.
Currently I am running both MongoDB and Tableau inside a Kubernetes cluster. Inside the cluster I also have a pod running the BI Connector. Tableau successfully connects to MongoDB via the BI Connector, and I am able to create workbooks and visualizations with multiple collections from MongoDB inside Tableau.
Where my problem starts is that some large queries simply hang and never complete or give any errors. I’ve noticed this in Tableau as well as via a MySQL CLI client I have connected to the BI Connector. In both cases the BI Connector never completes the request. I know that the query has valid SQL syntax, so I am totally stumped here.
Is there perhaps some kind of limitation that I am facing because I am not using MongoDB’s Atlas product?
I will include some queries that definitely work and the query that does not work. Any help, or any insight into what may cause the long hang, would be greatly appreciated.
Below are two example queries that work fine:
SELECT
    market,
    CONVERT(date, date) as date,
    clicks,
    conversions,
    cost,
    impressions
FROM ad_metrics

SELECT
    go.customer_id,
    go.id as lead_id,
    CONVERT(DATE_SUB(go.created_at, INTERVAL 7 HOUR), date) as date,
    go.market as market,
    go.make,
    go.model,
    go.no_way,
    go.repair_location,
    go.source as website_source,
    go.utm_content,
    go.utm_source,
    go.post_tax_amount_requested
FROM god_objects go
WHERE (go.is_a_test = 0 OR go.is_a_test IS NULL)
  AND (go.carparts = 0 OR go.carparts IS NULL)
  AND (go.no_way = 0 OR go.no_way IS NULL)
and below is the query that hangs:
WITH tableau_ads as (
    SELECT
        market,
        CONVERT(date, date) as date,
        clicks,
        conversions,
        cost,
        impressions
    FROM ad_metrics
), tableau_leads as (
    SELECT
        go.customer_id,
        go.id as lead_id,
        CONVERT(DATE_SUB(go.created_at, INTERVAL 7 HOUR), date) as date,
        go.market as market,
        go.make,
        go.model,
        go.no_way,
        go.repair_location,
        go.source as website_source,
        go.utm_content,
        go.utm_source,
        go.post_tax_amount_requested
    FROM god_objects go
    WHERE (go.is_a_test = 0 OR go.is_a_test IS NULL)
      AND (go.carparts = 0 OR go.carparts IS NULL)
      AND (go.no_way = 0 OR go.no_way IS NULL)
), tableau_sales as (
    SELECT
        q.id as quote_id,
        go.id as lead_id,
        j.id as job_id,
        go.market,
        CONVERT(DATE_SUB(go.sold_at, INTERVAL 7 HOUR), date) as date,
        CONVERT(DATE_SUB(go.initial_job_date, INTERVAL 7 HOUR), date) as initial_job_date,
        go.post_tax_amount_requested,
        go.amount_collected,
        go.customer_id,
        go.make,
        go.model,
        go.repair_location,
        go.source as website_source,
        go.utm_content,
        go.utm_source,
        q.balance_amount_due,
        q.assigned_technician_id,
        q.payment_status,
        q.quote_grand_total,
        q.total_transaction_amount,
        j.is_active,
        j.technician_id
    FROM quotes q
    LEFT JOIN god_objects go ON go.id = q.lead_id
    LEFT JOIN jobs j ON go.id = j.lead_id
    WHERE (go.is_a_test = 0 OR go.is_a_test IS NULL)
      AND (go.carparts = 0 OR go.carparts IS NULL)
      AND go.initial_job_date IS NOT NULL
      AND go.post_tax_amount_requested >= 200.0
)
SELECT
    tableau_leads.market,
    tableau_leads.date,
    sum(tableau_ads.clicks) as ad_clicks,
    sum(tableau_ads.conversions) as ad_conversions,
    sum(tableau_ads.cost) as ad_cost,
    sum(tableau_ads.impressions) as ad_impressions,
    COUNT(tableau_leads.lead_id) as lead_count,
    COUNT(tableau_sales.quote_id) as sale_count,
    SUM(tableau_leads.post_tax_amount_requested) as lead_amount_requested,
    SUM(tableau_sales.post_tax_amount_requested) as sale_amount_requested
FROM tableau_leads
LEFT JOIN tableau_ads ON (tableau_leads.market = tableau_ads.market AND tableau_leads.date = tableau_ads.date)
LEFT JOIN tableau_sales ON (tableau_leads.market = tableau_sales.market AND tableau_leads.date = tableau_sales.date)
GROUP BY market, date
Is there something wrong with my SQL? Is there any other kind of issue? Again, any help would be greatly appreciated!

PostgreSQL 14.2: out of memory - Failed on request of size 24576 in memory context "TupleSort main"

I have recently installed PostgreSQL 14.1 in parallel to my old 12.9 on my Red Hat server. Both instances are running their default configurations. The server itself has 48 CPUs and 188 GB RAM, which seemed to be more than sufficient for 12.9.
Everything worked as expected, but I keep receiving this error message:
out of memory - Failed on request of size 24576 in memory context "TupleSort main"
SQL state: 53200
SQL tables: pos has 18 584 522 rows // orderedposteps has 18 rows // posteps has 18 rows
CREATE TEMP TABLE actualpos ON COMMIT DROP AS
SELECT DISTINCT lsa.id
FROM pos sa
JOIN orderedposteps osas ON osas.stepid = sa.stepid
JOIN posteps sas ON sas.id = osas.stepid
JOIN LATERAL
(
    SELECT innersa.*
    FROM pos innersa
    JOIN orderedposteps innerosas ON innerosas.stepid = innersa.stepid
    WHERE (innersa.id = sa.id) AND
          (innersa.iscached IS FALSE) AND
          (innersa.isobsolete IS FALSE)
    ORDER BY innersa.createdtimestamp DESC, innerosas.stepindex DESC
    LIMIT 1
) lsa ON TRUE
LEFT JOIN LATERAL
(
    SELECT innersa.*
    FROM pos innersa
    JOIN orderedposteps innerosas ON innerosas.stepid = innersa.stepid
    WHERE (innersa.id = sa.id) AND
          (innersa.iscached IS TRUE) AND
          (innersa.isobsolete IS FALSE)
    ORDER BY innersa.createdtimestamp DESC, innerosas.stepindex DESC
    LIMIT 1
) sacheck ON TRUE
LEFT JOIN orderedposteps osascheck ON osascheck.stepid = sacheck.stepid
WHERE ((sacheck IS NULL) OR (sacheck.createdtimestamp < sa.createdtimestamp) OR (osascheck.stepindex < osas.stepindex))
  AND (((osas.stepindex < v_laststepindex) AND (sa.isfailure != sas.isvalidsum) AND (sa.iscached IS FALSE))
       OR ((osas.stepindex = v_laststepindex) AND (sa.iscached IS FALSE)))
ORDER BY lsa.createdtimestamp DESC LIMIT 50000
The only difference I can see is the RAM utilization shown by htop:
while 12.9 consumes at most about 10 GB of RAM, 14.1 grows to roughly 62 GB and crashes once it reaches that.
I have already tried to increase work_mem via
ALTER SYSTEM SET work_mem = '4MB';
I used pgtune as well to change some other values, but nothing had a significant effect.
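Note that an ALTER SYSTEM setting only takes effect after the configuration is reloaded; a minimal sequence looks like this (the 64MB here is just an illustrative value, not a recommendation):
ALTER SYSTEM SET work_mem = '64MB';  -- written to postgresql.auto.conf
SELECT pg_reload_conf();             -- reload so sessions pick up the new value
SHOW work_mem;                       -- verify the active setting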
I am pretty sure the SQL can be simplified and tuned, which I could do, but I want to understand where the difference between 12.9 and 14.1 lies, or what to change configuration-wise, instead of refactoring one function to work with the latest version.

Copy snowflake task results into stage and download to csv

Basically I need to automate all of the below in a Snowflake TASK:
Create/replace a CSV file format and stage in Snowflake
Run the task query (which runs every few days to pull some stats)
Unload the query results into the stage CSV each time it runs
Download the contents of the stage CSV to a local file on my machine
What I can't get right is the COPY INTO stage: how do I unload the results of the task into the stage each time it runs?
I don't know what to put in the FROM clause. TITANLOADSUCCESSVSFAIL is not recognized, but that is the name of the TASK:
COPY INTO @TitanLoadStage/unload/ FROM TITANLOADSUCCESSVSFAIL FILE_FORMAT = TitanLoadSevenDays
This is my first time using a stage and downloading locally with Snowflake, so I'd appreciate any advice on how to get this up and running!
Thanks,
Nick
Full Code:
-- create a csv file format
CREATE OR REPLACE FILE FORMAT TitanLoadSevenDays
type = 'CSV'
field_delimiter = '|';
--create a snowflake staging table using the csv
CREATE OR REPLACE STAGE TitanLoadStage
file_format = TitanLoadSevenDays;
CREATE TASK IF NOT EXISTS TitanLoadSuccessVsFail
WAREHOUSE = ITSM_LWH
SCHEDULE = 'USING CRON 1 * * * * Australia/Canberra' --every minute for testing purposes
COMMENT = 'Last 7 days of Titan game success vs fail load %'
AS
WITH SUCCESSCTE AS (
SELECT CLIENTNAME
, COUNT(EVENTTYPE) AS SuccessLoad --count success load events for that game
FROM vw_fact_gameload60
WHERE EVENTTYPE = 103 --success load events
AND USERTYPE = 1 --real users
AND APPID = 2 --titan games
AND EVENTARRIVALDATE >= DATEADD(DAY, -7, CAST(GETDATE() AS DATE)) --only looking at the last week
GROUP BY CLIENTNAME
),
FAILCTE AS ( --same as above but for failed loads
SELECT CLIENTNAME
, COUNT(EVENTTYPE) AS FailedLoads -- count failed load events for that game
FROM vw_fact_gameload60
WHERE EVENTTYPE = 106 -- failed load events
AND USERTYPE = 1 -- real users
AND APPID = 2 -- Titan games
AND EVENTARRIVALDATE >= DATEADD(DAY, -7, CAST(GETDATE() AS DATE)) -- last 7 days
--AND FACTEVENTARRIVALDATE BETWEEN DATEADD(DAY, -7, GETDATE())AND GETDATE() -- last 7 days
GROUP BY CLIENTNAME
)
SELECT COALESCE(s.CLIENTNAME, f.CLIENTNAME) AS ClientName
, ZEROIFNULL(s.SuccessLoad) + ZEROIFNULL(f.FailedLoads) AS TotalLoads --sum the success and failed loads found for 103, 106 events only, calculated in CTEs
, ZEROIFNULL(s.SuccessLoad) AS Cnt_SuccessLoad --count from success cte
, ZEROIFNULL(f.FailedLoads) AS Cnt_FailedLoads --count from fail cte
, CONCAT(ZEROIFNULL(ROUND(s.SuccessLoad * 100.0 / TotalLoads,2)) , '%') As Pct_Success --percentage of SuccessLoads against total
, CONCAT(ZEROIFNULL(ROUND(f.FailedLoads * 100.0 / TotalLoads,2)), '%') AS Pct_Fail---percentage of failedLoads against total
FROM SUCCESSCTE s
FULL OUTER JOIN FAILCTE f -- outer join on the fail CTE by game name; outer is required because some Titan games' success or fail events are NULL
ON s.CLIENTNAME = f.Clientname
ORDER BY CLIENTNAME ASC
--copy the results from the query to the snowflake staging table created above
COPY INTO @TitanLoadStage/unload/ FROM TITANLOADSUCCESSVSFAIL FILE_FORMAT = TitanLoadSevenDays
-- export the stage data to csv located in common folder
GET @TitanLoadStage/unload/data_0_0_0.csv.gz file:\\itsm\group\ITS%20Management\Common\All%20Staff\SMD\Games\Snowflake%20and%20GamesDNA\Snowflake\SnowflakeCSV\TitanLoad.csv
-- start the task
ALTER TASK IF EXISTS TitanLoadSuccessVsFail RESUME
If you want to get the results of a query run through a task, you need to materialize the results of that query to a table.
What you have now:
CREATE TASK mytask_minute
WAREHOUSE = mywh
SCHEDULE = '5 MINUTE'
AS
SELECT 1 x;
COPY INTO @TitanLoadStage/unload/
FROM mytask_minute;
(mytask_minute is not a table, so you can't select from it)
What you should do instead:
CREATE TASK mytask_minute
WAREHOUSE = mywh
SCHEDULE = '5 MINUTE'
AS
CREATE OR REPLACE TABLE task_results_table
AS
SELECT 1 x;
COPY INTO @TitanLoadStage/unload/
FROM (SELECT * FROM task_results_table);
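If the unload itself also has to run on the schedule, one option is to chain a second task after the first. This is only a sketch reusing the names above, with TitanLoadSevenDays as the file format from the question:
-- Hypothetical child task: runs after mytask_minute completes and unloads
-- the materialized results to the stage, replacing the previous files.
CREATE TASK mytask_unload
  WAREHOUSE = mywh
  AFTER mytask_minute
AS
  COPY INTO @TitanLoadStage/unload/
  FROM (SELECT * FROM task_results_table)
  FILE_FORMAT = (FORMAT_NAME = TitanLoadSevenDays)
  OVERWRITE = TRUE;
Both tasks need to be resumed before the schedule runs. Note that the final GET step cannot be put in a task at all: GET writes to the client's local file system, so it has to be run from a client such as SnowSQL.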

How to load 533 columns of data into a Snowflake table?

We have a table with 533 columns, including a lot of LOB columns, that has to be moved to Snowflake. Since our source transformation system has an issue managing 533 columns in one job, we have split the columns into two jobs: the first job inserts 283 columns, and the second job updates the remaining columns.
We are using one COPY command and one upsert (MERGE) command, respectively, for these two jobs.
COPY command:
copy into "ADIUATPERF"."APLHSTRO"."FNMA1004_APPR_DEMO" (283 columns) from #"ADIUATPERF"."APLHSTRO".fnma_talend_poc/jos/outformatted.csv
--file_format = (format_name = '"ADIUATPERF"."APLHSTRO".CSV_DQ_HDR0_NO_ESC_CARET');
FILE_FORMAT = (DATE_FORMAT='dd-mm-yyyy', TIMESTAMP_FORMAT='dd-mm-yyyy',TYPE=CSV, ESCAPE_UNENCLOSED_FIELD = NONE,
SKIP_HEADER=1, field_delimiter ='|', RECORD_DELIMITER = '\\n', FIELD_OPTIONALLY_ENCLOSED_BY = '"',
NULL_IF = ('')) PATTERN='' on_error = 'CONTINUE',FORCE=true;
Upsert (MERGE) command:
MERGE INTO db.schema._table as target
USING
(SELECT t.$1
from #"ADIUATPERF"."APLHSTRO".fnma_talend_poc/jos/fnma1004_appr.csv
--file_format = (format_name = '"ADIUATPERF"."APLHSTRO".CSV_DQ_HDR0_NO_ESC_CARET');
(FILE_FORMAT =>'CSV', ESCAPE_UNENCLOSED_FIELD => NONE,
SKIP_HEADER=>1, field_delimiter =>'|', RECORD_DELIMITER => '\\n', FIELD_OPTIONALLY_ENCLOSED_BY => '"',
NULL_IF => (''))
) source ON target.document_id = source.document_id
WHEN MATCHED THEN
--update lst_updated
UPDATE SET <columns>=<values>;
I would like to know if we have any other options.
I would recommend that you run the COPY INTO for both of your split files into temp/transient tables first, and then execute a single CTAS statement that joins those two tables on document_id. Don't MERGE from a flat file. You could, optionally, then run a MERGE of the second temp table into the first (non-temp) table if you wished, but I think a straight CTAS from the two "half" tables might be faster for you.
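A rough sketch of that approach (the landing-table names are made up, the column lists are abbreviated, and the COPY options are assumed to be the same as in the question):
-- 1) Land each half-file in its own transient table (full column lists omitted).
CREATE OR REPLACE TRANSIENT TABLE appr_part1 (document_id VARCHAR /* , ...283 columns */);
CREATE OR REPLACE TRANSIENT TABLE appr_part2 (document_id VARCHAR /* , ...remaining columns */);

COPY INTO appr_part1 FROM @"ADIUATPERF"."APLHSTRO".fnma_talend_poc/jos/outformatted.csv
  FILE_FORMAT = (TYPE=CSV, SKIP_HEADER=1, FIELD_DELIMITER='|', FIELD_OPTIONALLY_ENCLOSED_BY='"');
COPY INTO appr_part2 FROM @"ADIUATPERF"."APLHSTRO".fnma_talend_poc/jos/fnma1004_appr.csv
  FILE_FORMAT = (TYPE=CSV, SKIP_HEADER=1, FIELD_DELIMITER='|', FIELD_OPTIONALLY_ENCLOSED_BY='"');

-- 2) Build the final 533-column table with a single CTAS joining on document_id.
CREATE OR REPLACE TABLE "ADIUATPERF"."APLHSTRO"."FNMA1004_APPR_DEMO" AS
SELECT p1.*,
       p2.* EXCLUDE (document_id)  -- EXCLUDE needs a recent Snowflake release; otherwise list p2's columns explicitly
FROM appr_part1 p1
JOIN appr_part2 p2 ON p1.document_id = p2.document_id;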

Update rows returned by a complex SQL query with data from query result

I have a multi-table join and want to update a table based on the result of that join. The join table produces both the scope of the update (only those rows whose efforts.id appears in the result should be updated) and the data for the update (a new column should be set to the value of a calculated column).
I've made progress but can't quite make it work. Here's my statement:
UPDATE efforts
SET dropped_int = jt.split
FROM (
    SELECT
        ef.id,
        s.id split,
        s.kind,
        s.distance_from_start,
        s.sub_order,
        max(s.distance_from_start + s.sub_order)
            OVER (PARTITION BY ef.id) AS max_dist
    FROM split_times st
    LEFT JOIN splits s ON s.id = st.split_id
    LEFT JOIN efforts ef ON ef.id = st.effort_id
) jt
WHERE ((jt.distance_from_start + jt.sub_order) = max_dist)
  AND kind <> 1;
The SELECT produces the correct join table:
id split kind dfs sub max_dist dropped dropped_int
403 33 2 152404 1 152405 TRUE 33
404 33 2 152404 1 152405 TRUE 33
405 31 2 143392 1 143393 TRUE 33
406 31 2 143392 1 143393 TRUE 33
407 29 2 132127 1 132128 TRUE 33
408 29 2 132127 1 132128 TRUE 33
409 29 2 132127 1 132128 TRUE 33
and the UPDATE does run, but there are two problems: first, it updates all efforts, not just those produced by the query, and second, it sets each row's dropped_int to the split value of the first row in the query result, whereas I need each effort set to its associated split value.
If this were non-SQL, it might look something like:
jt_rows.each do |jt_row|
efforts[jt_row].dropped_int = jt[jt_row].split
end
But I don't know how to do that in SQL. It seems like this should be a fairly common problem, but after a couple of hours of searching I'm coming up short.
How should I modify my statement to produce the described result? If it matters, this is Postgres 9.5. Thanks in advance for any suggestions.
EDIT:
I did not get a workable answer but ended up solving this with a mixture of SQL and native code (Ruby/Rails):
dropped_splits = SplitTime.joins(:split).joins(:effort)
.select('DISTINCT ON (efforts.id) split_times.effort_id, split_times.split_id')
.where(efforts: {dropped: true})
.order('efforts.id, splits.distance_from_start DESC, splits.sub_order DESC')
update_hash = Hash[dropped_splits.map { |x| [x.effort_id, {dropped_split_id: x.split_id, updated_at: Time.now}] }]
Effort.update(update_hash.keys, update_hash.values)
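For reference, the same workaround can likely be expressed as a single SQL statement using DISTINCT ON, mirroring the Ruby above (assuming, as that code does, that efforts has dropped and dropped_split_id columns):
-- Pick, per dropped effort, the split_time with the furthest split,
-- then write that split id back onto the effort row.
UPDATE efforts
SET dropped_split_id = jt.split_id,
    updated_at = now()
FROM (
    SELECT DISTINCT ON (efforts.id)
           efforts.id AS effort_id,
           split_times.split_id
    FROM split_times
    JOIN splits ON splits.id = split_times.split_id
    JOIN efforts ON efforts.id = split_times.effort_id
    WHERE efforts.dropped IS TRUE
    ORDER BY efforts.id, splits.distance_from_start DESC, splits.sub_order DESC
) jt
WHERE efforts.id = jt.effort_id;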
Use a condition in the WHERE clause that relates the efforts table to the subquery:
efforts.id = jt.id
that is:
WHERE ((jt.distance_from_start + jt.sub_order) = max_dist)
  AND kind <> 1
  AND efforts.id = jt.id
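Putting it together, the full statement would look like this (a sketch of the original query with the correlating predicate added and the jt qualifiers made explicit):
UPDATE efforts
SET dropped_int = jt.split
FROM (
    SELECT
        ef.id,
        s.id split,
        s.kind,
        s.distance_from_start,
        s.sub_order,
        max(s.distance_from_start + s.sub_order)
            OVER (PARTITION BY ef.id) AS max_dist
    FROM split_times st
    LEFT JOIN splits s ON s.id = st.split_id
    LEFT JOIN efforts ef ON ef.id = st.effort_id
) jt
WHERE ((jt.distance_from_start + jt.sub_order) = jt.max_dist)
  AND jt.kind <> 1
  AND efforts.id = jt.id;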