Can we get the number of rows inserted through the COPY command? Some records might fail, so how many records were successfully inserted?
I have a file with JSON objects in Amazon S3 and am trying to load the data into Redshift using the COPY command. How do I know how many records were successfully inserted and how many failed?
Loading some example data:
db=# copy test from 's3://bucket/data' credentials '' maxerror 5;
INFO: Load into table 'test' completed, 4 record(s) loaded successfully.
COPY
db=# copy test from 's3://bucket/err_data' credentials '' maxerror 5;
INFO: Load into table 'test' completed, 1 record(s) loaded successfully.
INFO: Load into table 'test' completed, 2 record(s) could not be loaded. Check 'stl_load_errors' system table for details.
COPY
Then the following query:
with _successful_loads as (
select
stl_load_commits.query
, listagg(trim(filename), ', ') within group(order by trim(filename)) as filenames
from stl_load_commits
left join stl_query using(query)
left join stl_utilitytext using(xid)
where rtrim("text") = 'COMMIT'
group by query
),
_unsuccessful_loads as (
select
query
, count(1) as errors
from stl_load_errors
group by query
)
select
query
, filenames
, sum(stl_insert.rows) as rows_loaded
, max(_unsuccessful_loads.errors) as rows_not_loaded
from stl_insert
inner join _successful_loads using(query)
left join _unsuccessful_loads using(query)
group by query, filenames
order by query, filenames
;
Giving:
query | filenames | rows_loaded | rows_not_loaded
-------+------------------------------------------------+-------------+-----------------
45597 | s3://bucket/err_data.json | 1 | 2
45611 | s3://bucket/data1.json, s3://bucket/data2.json | 4 |
(2 rows)
I have log_min_duration_statement=0 in config.
When I check log file, sql statement and duration are saved into different rows.
(I'm not sure what I have configured wrong, but the statement and duration are not saved together as this answer suggests.)
As I understand it, within the same session the session_line_num of a duration record always equals the session_line_num of the corresponding statement plus 1.
Is this correct? Is the query below reliable for getting each statement and its duration in one row?
(csv log imported into postgres_log table):
WITH
sql_cte AS(
SELECT session_id, session_line_num, message AS sql_statement
FROM postgres_log
WHERE
message LIKE 'statement%'
)
,durat_cte AS (
SELECT session_id, session_line_num, message AS duration
FROM postgres_log
WHERE
message LIKE 'duration%'
)
SELECT
t1.session_id,
t1.session_line_num,
t1.sql_statement,
t2.duration
FROM sql_cte t1
LEFT JOIN durat_cte t2
ON t1.session_id = t2.session_id AND t1.session_line_num + 1 = t2.session_line_num;
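To see what the join condition does, here is a minimal sketch that runs the same pairing against a few made-up log rows. It uses Python's built-in sqlite3 as a stand-in for PostgreSQL, and the sample messages are hypothetical. It only shows the mechanics of the session_line_num + 1 match; it does not prove that the duration line always directly follows the statement line in a real log.

```python
import sqlite3

# Hypothetical sample rows standing in for the imported csv log.
rows = [
    ("s1", 1, "statement: SELECT 1"),
    ("s1", 2, "duration: 0.123 ms"),
    ("s1", 3, "statement: SELECT 2"),
    ("s1", 4, "duration: 0.456 ms"),
    ("s2", 1, "statement: SELECT 3"),
    ("s2", 2, "connection lost"),  # next line is not a duration line
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE postgres_log "
    "(session_id TEXT, session_line_num INTEGER, message TEXT)"
)
conn.executemany("INSERT INTO postgres_log VALUES (?, ?, ?)", rows)

# Same idea as the two CTEs: statement rows on the left,
# duration rows on the right, matched on session and line number + 1.
paired = conn.execute("""
    SELECT s.session_id, s.session_line_num, s.message, d.message
    FROM postgres_log AS s
    LEFT JOIN postgres_log AS d
      ON d.session_id = s.session_id
     AND d.session_line_num = s.session_line_num + 1
     AND d.message LIKE 'duration%'
    WHERE s.message LIKE 'statement%'
    ORDER BY s.session_id, s.session_line_num
""").fetchall()

for row in paired:
    print(row)
```

A statement whose next line is not a duration line comes back with an empty duration, so counting those rows is one way to check how often the assumption fails on your own log.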
I have the following table
id,link_id,url,type,download,filename
44,11,https://google.com,extra,1,126cd08a-b963-48e5-878e-96dea057d57e.jpg
45,11,https://google.com,extra,0,53bfa01d-91d0-4b84-9389-b06e5e4ef618.jpg
46,11,https://google.com,extra,0,364cfdc2-c0b6-43fc-8936-33e49896014a.jpg
47,12,https://google.com,extra,0,9d26efbd-e6e0-42df-bde0-04c05babffe4.jpg
48,13,https://yahoo.com,extra,0,2d58b9f7-1860-40d8-88f0-9fc08cd7275f.jpg
49,13,https://yahoo.com,extra,0,574b1646-6316-4a4b-8e28-56c38c0999b9.jpg
...
I want to write a query that finds the rows with download=1 and updates all rows with the same url and download=0, setting the same filename and download=1,
so the output would be
44,11,https://google.com,extra,1,126cd08a-b963-48e5-878e-96dea057d57e.jpg
45,11,https://google.com,extra,1,126cd08a-b963-48e5-878e-96dea057d57e.jpg
46,11,https://google.com,extra,1,126cd08a-b963-48e5-878e-96dea057d57e.jpg
47,12,https://google.com,extra,1,126cd08a-b963-48e5-878e-96dea057d57e.jpg
48,13,https://yahoo.com,extra,0,2d58b9f7-1860-40d8-88f0-9fc08cd7275f.jpg
49,13,https://yahoo.com,extra,0,574b1646-6316-4a4b-8e28-56c38c0999b9.jpg
demo:db<>fiddle
UPDATE mytable t
SET download = s.download, filename = s.filename -- 3
FROM (
SELECT url, download, filename -- 1
FROM mytable
WHERE download = 1
) s
WHERE t.url = s.url -- 2
1. Find the expected record with download = 1
2. Find all related url records
3. Update the columns with the data fetched in step 1
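The steps above can be tried out on the sample rows from the question. This is only a sketch: it uses Python's built-in sqlite3 instead of PostgreSQL, and it swaps the UPDATE ... FROM form for a correlated subquery so that it also runs on older SQLite versions. The table name mytable is taken from the answer; picking the lowest id when several rows have download = 1 is my own assumption.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE mytable (
        id INTEGER PRIMARY KEY, link_id INTEGER, url TEXT,
        type TEXT, download INTEGER, filename TEXT
    )
""")
conn.executemany(
    "INSERT INTO mytable VALUES (?, ?, ?, ?, ?, ?)",
    [
        (44, 11, "https://google.com", "extra", 1, "126cd08a-b963-48e5-878e-96dea057d57e.jpg"),
        (45, 11, "https://google.com", "extra", 0, "53bfa01d-91d0-4b84-9389-b06e5e4ef618.jpg"),
        (46, 11, "https://google.com", "extra", 0, "364cfdc2-c0b6-43fc-8936-33e49896014a.jpg"),
        (47, 12, "https://google.com", "extra", 0, "9d26efbd-e6e0-42df-bde0-04c05babffe4.jpg"),
        (48, 13, "https://yahoo.com", "extra", 0, "2d58b9f7-1860-40d8-88f0-9fc08cd7275f.jpg"),
        (49, 13, "https://yahoo.com", "extra", 0, "574b1646-6316-4a4b-8e28-56c38c0999b9.jpg"),
    ],
)

# Steps 1 and 2: for each row with download = 0, look up a row with the
# same url and download = 1. Step 3: copy its filename and set download = 1.
conn.execute("""
    UPDATE mytable
    SET download = 1,
        filename = (SELECT s.filename FROM mytable AS s
                    WHERE s.url = mytable.url AND s.download = 1
                    ORDER BY s.id LIMIT 1)
    WHERE download = 0
      AND EXISTS (SELECT 1 FROM mytable AS s
                  WHERE s.url = mytable.url AND s.download = 1)
""")

result = conn.execute(
    "SELECT id, url, download, filename FROM mytable ORDER BY id"
).fetchall()
for row in result:
    print(row)
```

The yahoo.com rows stay untouched because no row with that url has download = 1.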
Basically I need to automate all of the below in a snowflake TASK
Create/replace a csv file format and stage in Snowflake
Run task query (which runs every few days to pull some stats)
Unload the query results each time it runs into the Stage csv
Download the contents of the stage csv to a local file on my machine
What I can't get right is the COPY INTO stage: how do I unload the results of the task into the stage each time it runs?
I don't know what to put in the FROM clause. TITANLOADSUCCESSVSFAIL is not recognized, but this is the name of the TASK
COPY INTO @TitanLoadStage/unload/ FROM TITANLOADSUCCESSVSFAIL FILE_FORMAT = TitanLoadSevenDays
First time using stage, and downloading locally with SF so appreciate any advice on how to get this up and running!
Thanks,
Nick
Full Code:
-- create a csv file format
CREATE OR REPLACE FILE FORMAT TitanLoadSevenDays
type = 'CSV'
field_delimiter = '|';
--create a snowflake staging table using the csv
CREATE OR REPLACE STAGE TitanLoadStage
file_format = TitanLoadSevenDays;
CREATE TASK IF NOT EXISTS TitanLoadSuccessVsFail
WAREHOUSE = ITSM_LWH
SCHEDULE = 'USING CRON 1 * * * * Australia/Canberra' --for testing purposes; note this runs at minute 1 of every hour, '* * * * *' would run every minute
COMMENT = 'Last 7 days of Titan game success vs fail load %'
AS
WITH SUCCESSCTE AS (
SELECT CLIENTNAME
, COUNT(EVENTTYPE) AS SuccessLoad --count success load events for that game
FROM vw_fact_gameload60
WHERE EVENTTYPE = 103 --success load events
AND USERTYPE = 1 --real users
AND APPID = 2 --titan games
AND EVENTARRIVALDATE >= DATEADD(DAY, -7, CAST(GETDATE() AS DATE)) --only looking at the last week
GROUP BY CLIENTNAME
),
FAILCTE AS ( --same as above but for failed loads
SELECT CLIENTNAME
, COUNT(EVENTTYPE) AS FailedLoads -- count failed load events for that game
FROM vw_fact_gameload60
WHERE EVENTTYPE = 106 -- failed load events
AND USERTYPE = 1 -- real users
AND APPID = 2 -- Titan games
AND EVENTARRIVALDATE >= DATEADD(DAY, -7, CAST(GETDATE() AS DATE)) -- last 7 days
--AND FACTEVENTARRIVALDATE BETWEEN DATEADD(DAY, -7, GETDATE())AND GETDATE() -- last 7 days
GROUP BY CLIENTNAME
)
SELECT COALESCE(s.CLIENTNAME, f.CLIENTNAME) AS ClientName
, ZEROIFNULL(s.SuccessLoad) + ZEROIFNULL(f.FailedLoads) AS TotalLoads --sum the success and failed loads found for 103, 106 events only, calculated in CTEs
, ZEROIFNULL(s.SuccessLoad) AS Cnt_SuccessLoad --count from success cte
, ZEROIFNULL(f.FailedLoads) AS Cnt_FailedLoads --count from fail cte
, CONCAT(ZEROIFNULL(ROUND(s.SuccessLoad * 100.0 / TotalLoads,2)) , '%') As Pct_Success --percentage of SuccessLoads against total
, CONCAT(ZEROIFNULL(ROUND(f.FailedLoads * 100.0 / TotalLoads,2)), '%') AS Pct_Fail---percentage of failedLoads against total
FROM SUCCESSCTE s
FULL OUTER JOIN FAILCTE f -- outer join in the fail CTE by game name, outer required because some titan games success or fail events are NULL
ON s.CLIENTNAME = f.Clientname
ORDER BY CLIENTNAME ASC
--copy the results from the query to the snowflake staging table created above
COPY INTO @TitanLoadStage/unload/ FROM TITANLOADSUCCESSVSFAIL FILE_FORMAT = TitanLoadSevenDays
-- export the stage data to csv located in common folder
GET @TitanLoadStage/unload/data_0_0_0.csv.gz file:\\itsm\group\ITS%20Management\Common\All%20Staff\SMD\Games\Snowflake%20and%20GamesDNA\Snowflake\SnowflakeCSV\TitanLoad.csv
-- start the task
ALTER TASK IF EXISTS TitanLoadSuccessVsFail RESUME
If you want to get the results of a query run through a task, you need to materialize the results of that query into a table.
What you have now:
CREATE TASK mytask_minute
WAREHOUSE = mywh
SCHEDULE = '5 MINUTE'
AS
SELECT 1 x;
COPY INTO @TitanLoadStage/unload/
FROM mytask_minute;
(mytask_minute is not a table, so you can't select from it)
What you should do instead:
CREATE TASK mytask_minute
WAREHOUSE = mywh
SCHEDULE = '5 MINUTE'
AS
CREATE OR REPLACE TABLE task_results_table
AS
SELECT 1 x;
COPY INTO @TitanLoadStage/unload/
FROM (
SELECT *
FROM task_results_table
);
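The pattern itself (persist the query result in a table first, then export from that table) can be sketched outside Snowflake. The example below is not Snowflake code: it uses Python's built-in sqlite3 and csv modules, the events table and its rows are made up, and run_task and unload are hypothetical helper names standing in for the task body and the COPY INTO step.

```python
import csv
import io
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (clientname TEXT, eventtype INTEGER)")
conn.executemany(
    "INSERT INTO events VALUES (?, ?)",
    [("game_a", 103), ("game_a", 103), ("game_a", 106), ("game_b", 106)],
)


def run_task(conn):
    """Stand-in for the scheduled task: store the query result in a table."""
    conn.execute("DROP TABLE IF EXISTS task_results_table")
    conn.execute("""
        CREATE TABLE task_results_table AS
        SELECT clientname,
               SUM(eventtype = 103) AS success_loads,
               SUM(eventtype = 106) AS failed_loads
        FROM events
        GROUP BY clientname
    """)


def unload(conn):
    """Stand-in for the unload step: read the table, write pipe-delimited text."""
    out = io.StringIO()
    writer = csv.writer(out, delimiter="|", lineterminator="\n")
    writer.writerows(
        conn.execute("SELECT * FROM task_results_table ORDER BY clientname")
    )
    return out.getvalue()


run_task(conn)
unloaded = unload(conn)
print(unloaded)
```

The export step never refers to the task, only to the table the task fills, which is the point of the answer.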
I would like to do a left join query in sails.js. I think I should use populate.
I have three models
caracteristique{
id,
name,
races:{
collection: 'race',
via: 'idcaracteristique',
through: 'racecaracteristique'
},
}
race{
id,
name,
caracteristiques:{
collection: 'caracteristique',
via: 'idrace',
through: 'racecaracteristique'
}
}
RaceCaracteristique{
idrace: {
model:'race'
},
idcaracteristique: {
model: 'caracteristique'
},
bonusracial:{
type: 'number',
}
}
My data are:
Table Caracteristiques
id | name
1 | strength
2 | dex
3 | Charisme
Table Race
id | name
1 | human
2 | Org
Table RaceCaracteristique
idrace | idcaracteristique | bonusracial
1 | 2 | +2
This SQL query gives me, for human, all caracteristiques and, where one exists, the bonusracial:
'SELECT caracteristique.id, caracteristique.name, bonusracial
FROM caracteristique
LEFT OUTER JOIN (select idcaracteristique, bonusracial
from racecaracteristique
where idrace=$1 ) as q
ON q.idcaracteristique = caracteristique.id';
I have this result:
caracteristique.id | caracteristique.name | bonusracial
1 | strength | null
2 | dex | 2
3 | Charisme | null
How can I use populate to do this?
When using a SQL-database adapter (MySQL, PostgreSQL etc.) you can utilise a method for performing actual, handwritten SQL statements. When all else fails, this might be your best bet to find an acceptable solution within the framework.
The .sendNativeQuery() method sends your parameterized SQL statement to the native driver, and responds with a raw, non-ORM-mangled result. Actual database-schema specific tables and columns appear in the result, so you need to be careful with changes to models etc. as they might change the schema in the backend database.
The method takes two parameters, the parameterized query, and the array of values to be inserted. The array is optional and can be omitted if you have no parameters to replace in the SQL statement.
Using your already parameterized query from above, I'm sending the query to fetch the data for an "org" (orc perhaps?) in the example below. See the docs linked at the bottom.
Code time:
let query = `
SELECT caracteristique.id, caracteristique.name, bonusracial
FROM caracteristique
LEFT OUTER JOIN (select idcaracteristique, bonusracial
from racecaracteristique
where idrace=$1 ) as q
ON q.idcaracteristique = caracteristique.id`;
var rawResult = await sails.sendNativeQuery(query, [ 2 ]);
console.log(rawResult);
Docs: .sendNativeQuery()
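If you want to check the parameterized query on its own before wiring it into Sails, here is a small sketch that runs it against the sample data from the question. It uses Python's built-in sqlite3 rather than Sails or PostgreSQL, so the placeholder is ? instead of $1, and caracteristiques_for is just a hypothetical helper name.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE caracteristique (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE racecaracteristique (
        idrace INTEGER, idcaracteristique INTEGER, bonusracial INTEGER
    );
    INSERT INTO caracteristique VALUES (1, 'strength'), (2, 'dex'), (3, 'Charisme');
    INSERT INTO racecaracteristique VALUES (1, 2, 2);
""")

# Same left join as in the question, with the race id as the only parameter.
QUERY = """
    SELECT caracteristique.id, caracteristique.name, q.bonusracial
    FROM caracteristique
    LEFT OUTER JOIN (SELECT idcaracteristique, bonusracial
                     FROM racecaracteristique
                     WHERE idrace = ?) AS q
      ON q.idcaracteristique = caracteristique.id
    ORDER BY caracteristique.id
"""


def caracteristiques_for(race_id):
    """Return every caracteristique with the race's bonus, or None if absent."""
    return conn.execute(QUERY, (race_id,)).fetchall()


human = caracteristiques_for(1)
org = caracteristiques_for(2)
print(human)
print(org)
```

For race 1 this gives the same three rows as the expected result in the question; for race 2 every bonus is empty because the sample data has no row for it.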
With PostgreSQL 9.5, I would like to track the total amount of bytes written (since DB cluster start) to:
1. WAL
2. temp files
3. temp tables
For 1.:
select
pg_size_pretty(archived_count * 16*1024*1024) wal_bytes,
(now() - stats_reset)::text uptime
from pg_stat_archiver;
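The arithmetic behind that query, spelled out as a small sketch. It assumes the default 16 MB WAL segment size that the query hard-codes, and it only counts segments the archiver has reported as archived, so it is an estimate rather than an exact byte count. The function name is made up for illustration.

```python
# Default WAL segment size assumed by the query (16 MB).
WAL_SEGMENT_BYTES = 16 * 1024 * 1024


def archived_wal_bytes(archived_count, segment_bytes=WAL_SEGMENT_BYTES):
    """Estimate bytes of WAL archived: number of segments times segment size."""
    return archived_count * segment_bytes


# Example: 64 archived segments of 16 MB each add up to 1 GB.
estimate = archived_wal_bytes(64)
print(estimate, "bytes =", estimate / 1024**3, "GB")
```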
For 2.:
select
(now() - stats_reset)::text uptime,
pg_size_pretty(temp_bytes) temp_bytes
from pg_stat_database where datname = 'mydb';
How do I get 3.?
In response to a comment below, I did some tests to check where temp tables are actually written.
First, the DB parameter temp_buffers is at 8GB on this cluster:
select pg_size_pretty(setting::bigint*8192) from pg_settings
where name = 'temp_buffers';
-- "8192 MB"
Let's create a temp table:
drop table if exists foo;
create temp table foo as
select random() from generate_series(1, 1000000000);
-- Query returned successfully: 1000000000 rows affected, 10:22 minutes execution time.
Check the PostgreSQL backend pid and OID of the created temp table:
select pg_backend_pid(), 'pg_temp.foo'::regclass::oid;
-- 46573;398695055
Check the RSS size of the backend process
~$ grep VmRSS /proc/46573/status
VmRSS: 9246276 kB
As can be seen, this is only slightly above the 8GB set with temp_buffers.
The data inserted into the temp table is however immediately written, and it is written to the normal tablespace directories, not temp files:
select * from pg_relation_filepath('pg_temp.foo')
-- "base/16416/t3_398695055"
Here is the number of files and amount written:
with temp_table_files as
(
select * from pg_ls_dir('base/16416/') fn
where fn like 't3_398695055%'
)
select
count(*) as cnt,
pg_size_pretty(sum((pg_stat_file('base/16416/' || fn)).size)) as size
from temp_table_files;
-- 34;"34 GB"
And finally verify that the set of temp files owned by this backend PID is indeed empty:
with temp_files_per_pid as
(
with temp_files as
(
select
temp_file,
(regexp_replace(temp_file, $r$^pgsql_tmp(\d+)\..*$$r$, $rr$\1$rr$, 'g'))::int as pid,
(pg_stat_file('base/pgsql_tmp/' || temp_file)).size as size
from pg_ls_dir('base/pgsql_tmp') temp_file
)
select pid, pg_size_pretty(sum(size)) from temp_files group by pid order by pid
)
select * from temp_files_per_pid where pid = 46573;
Returns nothing.
What is also "interesting", after dropping the temp table
DROP TABLE foo;
the RSS of the backend process does not reduce:
~$ grep VmRSS /proc/46573/status
VmRSS: 9254544 kB
Doing the following will also not free the RSS again:
RESET ALL;
DEALLOCATE ALL;
DISCARD TEMP;
As far as I know, there is no special metric for temp tables. Temp tables use session (process) memory up to the temp_buffers size (8MB by default). When these temp buffers are full, temporary files are generated.