Pull the data on a daily basis - amazon-redshift

I have data in my Redshift cluster. What is the best way to pull the data on a daily basis from Redshift and create a new table YY in Redshift based on a few SQL queries?
For example, we have a table XX in Redshift and I want to create a table in Redshift by pulling the top 10 rows from table XX:
Create table YY as Select top 10 * from XX

Using AWS Glue you can schedule the job and then write the script to do the specific work. An AWS Glue job can be triggered by the following three types of events; in your case I think the first one is applicable:
A trigger that is based on a cron schedule.
A trigger that is event-based; for example, the successful completion of another job can start an AWS Glue job.
A trigger that starts a job on demand.
For your case a scheduled trigger should fit best; a sketch of the SQL such a job could run is below. I hope this gives you some pointers.
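Not from the original answer, but for illustration: assuming the scheduled job connects to Redshift and YY is rebuilt each day, the SQL it runs could look roughly like this (table names XX and YY as in the question):
-- Rebuild YY from the current contents of XX (a sketch only).
DROP TABLE IF EXISTS YY;
CREATE TABLE YY AS
SELECT TOP 10 *
FROM XX;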

Is there a way to describe an external/spectrum table via redshift?

In AWS Athena you can write
SHOW CREATE TABLE my_table_name;
and see a SQL-like query that describes how to build the table's schema. It works for tables whose schemas are defined in AWS Glue. This is very useful for creating tables in a regular RDBMS, for loading and exploring data views.
Interacting with Athena in this way is manual, and I would like to automate the process of creating regular RDBMS tables that have the same schema as those in Redshift Spectrum.
How can I do this through a query that can be run via psql? Or is there another way to get this via the aws-cli?
Redshift Spectrum does not support SHOW CREATE TABLE syntax, but there are system tables that can deliver the same information. I have to say, it's not as useful as the ready-to-use SQL returned by Athena, though.
The tables are
svv_external_schemas - gives you information about the Glue database mapping and the IAM roles bound to it
svv_external_tables - gives you the location information, as well as the data format and the serde used
svv_external_columns - gives you the column names, types, and order information.
Using that data, you could reconstruct the table's DDL.
For example, to get the list of columns and their types in the CREATE TABLE format, one can do:
select distinct
  listagg(columnname || ' ' || external_type, ',\n')
    within group (order by columnnum) over ()
from svv_external_columns
where tablename = '<YOUR_TABLE_NAME>'
  and schemaname = '<YOUR_SCHEMA_NAME>'
The query gives you output similar to:
col1 int,
col2 string,
...
*) I am using the listagg window function and not the aggregate function, as apparently the listagg aggregate function can only be used with user-defined tables. Bummer.
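To complement that (not part of the original answer): the location and format details needed for the rest of the CREATE EXTERNAL TABLE statement can be pulled from svv_external_tables in a similar way; a rough sketch using the documented column names:
-- Location, file formats and serde information for one external table.
select location, input_format, output_format, serialization_lib, serde_parameters
from svv_external_tables
where tablename = '<YOUR_TABLE_NAME>'
  and schemaname = '<YOUR_SCHEMA_NAME>';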
I had been doing something similar to #botchniaque's answer in the past, but recently stumbled across a solution in the AWS-Labs' amazon-redshift-utils code package that seems to be more reliable than my hand-spun queries:
amazon-redshift-utils: v_generate_external_tbl_ddl
If you don't have the ability to create a view backed by the DDL listed in that package, you can run it manually by removing the CREATE statement from the start of the query. Assuming you can create it as a view, usage would be:
SELECT ddl
FROM admin.v_generate_external_tbl_ddl
WHERE schemaname = '<external_schema_name>'
-- Optionally include specific table references:
-- AND tablename IN ('<table_name_1>', '<table_name_2>', ..., '<table_name_n>')
ORDER BY tablename, seq
;
Redshift has since added SHOW EXTERNAL TABLE:
SHOW EXTERNAL TABLE external_schema.table_name [ PARTITION ]
SHOW EXTERNAL TABLE my_schema.my_table;
https://docs.aws.amazon.com/redshift/latest/dg/r_SHOW_EXTERNAL_TABLE.html

Datastage multiple parametric (conditionned) query execution

I would like to create a job that, based on some values in table A, executes a SELECT query on table B where the WHERE condition must be parametric.
For example: I have 10 columns in A with 100 rows filled. 9 of my columns are nullable, so I have to build a query that checks whether a value is null; if it is null, it must NOT be used as a search criterion in the SELECT statement.
I thought about using a SPARSE lookup where I'd pass a string built by concatenating the non-null search parameters, but the job fails because you need to map the columns.
I even created a file with the queries as strings and then looped over the file, passing each string as a variable to the DB2 connector stage. It works... but with more than 10000 rows that means 10000 queries, which is not that fast.
Thanks for your help.
PS: I'm new to this stuff :D
What you can do is use the Before SQL option on your source/target stage. Your job will have at least two stages: one source DB2 connector stage and one Copy, Sequential File, or Peek stage as the target (or a Row Generator and a target DB2 connector).
In your input DB2 connector you can pass your SQL script as a parameter into Before SQL, provided it is generated in advance and passed as a value to the Before SQL of the DB2 connector. Your actual SQL statement will use a "dummy" script such as "select current date from sysibm.sysdummy1" to complete the execution.
Hope it makes sense.
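As an aside, not part of the answer above: if generating one query per row is too slow, a single parameterised DB2 statement can skip null criteria with a pattern like this sketch (table B, columns col1..col3 and the CAST types are hypothetical; each parameter appears twice):
-- A criterion is applied only when its parameter is not null.
SELECT *
FROM B
WHERE (CAST(? AS VARCHAR(50)) IS NULL OR col1 = ?)
  AND (CAST(? AS VARCHAR(50)) IS NULL OR col2 = ?)
  AND (CAST(? AS INTEGER) IS NULL OR col3 = ?)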

Talend - Insert/Update versus table-file-table

I have been using insert/update to update or insert a table in MySQL from SQL Server. The job is set up as a cron job and runs every 8 hours. The number of records in the source table is around 400000, and every 8 hours around 100 records might get updated or inserted.
I run the job in such a way that, at the source level, I only take the rows modified between the last run and the current run.
I have observed that just to update/insert 100 rows the time taken is 30 minutes.
Another way would be to dump all of the 400000 rows into a file, truncate the destination table, and insert all of those records again at every job run.
So, may I know why insert/update takes so much time?
Thanks
Rathi
As you said, you run the job in such a way that, at the source level, you only take the rows modified between the last run and the current run.
So just insert all these modified rows into a temp table.
Take the minimum modified date from the temp table (or use the same criteria you use to extract only modified rows from the source) and delete the matching rows from the destination table.
Then you can insert all the rows from the temp table into the end table, as sketched below.
Let me know if you have any question.
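A rough sketch of that pattern in MySQL, with hypothetical names (staging table stg_changes, destination table dest, key column id), assuming the job has already landed the ~100 modified rows in stg_changes:
-- Remove the old versions of the changed rows from the destination
-- (alternatively, delete by the minimum modified date as described above).
DELETE d
FROM dest AS d
JOIN stg_changes AS s ON s.id = d.id;

-- Insert the fresh versions.
INSERT INTO dest
SELECT * FROM stg_changes;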
Without knowing how your database is configured, it's hard to tell the exact reason, but I'd say the updates are slow because you don't have an index on your target table.
Try adding an index on your insert/update key column; it will speed things up (see the sketch below).
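For illustration, with hypothetical names (target table dest, insert/update key column id):
-- Index the key column used by the insert/update lookups.
CREATE INDEX idx_dest_id ON dest (id);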
Also, are you doing a commit after each insert? If so, disable auto-commit and only commit on success, like this: tMysqlOutput -- OnComponentOk -- tMysqlCommit.

SELECT vs CREATE TABLE AS SELECT execution time

My function should return a TABLE which is created by lots of joins and is relatively "big".
If inside of my function I put return query select <complex query goes here>; then it takes ages (more like 10-15 mins) to run.
However, if instead of returning a TABLE I return VOID and simply create a table within the function body, it finishes in under 1 minute.
The same goes for running this "complex query" as select <complex query goes here> vs. create table <table name> as select <complex query goes here> and then select * from <table_name>.
Why is there such a difference in execution time?
P.S. The SELECT clause of the query has around 35 columns with some logic inside.
P.P.S. The query returns only about 90K rows, so I doubt it is the time taken to send the data over the network.
answer
select differs from create table as select in where the data ends up: the first sends the data to the client, while the latter saves it to disk on the server side.
why
Possible reasons could be a slow link and a "feature" of the client. Given that a local psql running \copy (select * from ...) to 'local_file' took 3 seconds, yet PgAdmin took ages to display the same data, I assume your version of PgAdmin (or any version at all) is not meant to display your amount of data (as you say, 36 MB). So it was not the link, but the client.
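For illustration, a sketch using psql meta-commands (<complex query goes here> stands for the original query); comparing server-side execution with streaming the rows to a file helps separate query time from client and link overhead:
\timing on

-- Server-side only: no rows are sent to the client.
create temp table t_result as
select <complex query goes here>;

-- Same query, but the rows are written to a local file instead of being
-- rendered by the client.
\copy (select <complex query goes here>) to '/tmp/result.csv' csv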

Creating a connection from Microsoft SQL server to an AS/400

I'm trying to connect from Microsoft SQL Server to an AS/400 so I can pull data from the AS/400 and then flag the data as having been pulled.
I've successfully created an OLE DB "IBMDASQL" connection and am able to pull some data, but I'm running into an issue when I try to pull data from a very large table.
This runs fine, and returns a count of 170 million:
select count(*)
from transactions
This query executed for 15 hours before I gave up on it. (It should return zero, since I haven't flagged anything as 'In process' yet.)
select count(*)
from transactions
where processed = 'In process'
I'm a Microsoft guy, but my AS/400 guy says that there is an index on the 'processed' column and that, locally, the query runs instantaneously.
Any thoughts on what I might be doing wrong? I found a table with only 68 records in it, and was able to run this query in about a second:
select count(*)
from smallTable
where RandomColumn = 'randomValue'
So I know that the AS/400 is at least able to understand that type of query.
I have had to fight this battle many times.
There are two ways of approaching this.
1) Stage your data from the AS400 into SQL Server, where you can optimize your indexes (see the sketch after this answer).
2) Ask the AS400 folks to create logical files, which speed up data retrieval. Your AS400 programmer is correct that the index will help; the terms used on that platform are "physical" vs. "logical" files, roughly analogous to a SQL Server table vs. view. Logical is what you want.
Thirdly, 170 million is a lot of records, even for a relational database like SQL Server. Have you considered running an SSIS package nightly that stages your data into your own SQL table, to see if it improves performance?
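Not from the original answer, but to illustrate option 1: assuming the linked server is named AS400 (as in the OPENQUERY example further down) and a local staging table dbo.stage_transactions with matching columns already exists, a nightly staging step could look roughly like this:
-- Run the query on the AS/400 and land the rows in a local staging table;
-- in practice you would restrict the columns and rows you actually need.
INSERT INTO dbo.stage_transactions
SELECT *
FROM OPENQUERY(AS400, 'select * from transactions');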
I would suggest this approach for good performance. I suppose you have at least SQL 2005; I haven't tested it yet, but this is a tip.
Let the AS400 perform the select natively by creating stored procedures on the AS400:
open an AS400 session
launch STRSQL
create an AS400 stored procedure in this way to get the recordset
CREATE PROCEDURE MYSELECT (IN PARAM CHAR(10))
LANGUAGE SQL
DYNAMIC RESULT SETS 1
BEGIN
  -- WITH RETURN makes the open cursor's result set available to the caller.
  DECLARE C1 CURSOR WITH RETURN FOR
    SELECT * FROM MYLIB.MYFILE WHERE MYFIELD = PARAM;
  OPEN C1;
  RETURN;
END
create an AS400 stored procedure to update the recordset
CREATE PROCEDURE MYUPDATE (IN PARAM CHAR(10))
LANGUAGE SQL
RESULT SETS 0
BEGIN
  UPDATE MYLIB.MYFILE SET MYFIELD = 'newvalue' WHERE MYFIELD = PARAM;
END
Call those AS400 stored procedures from SQL Server:
declare @myParam char(10)
set @myParam = 'In process'
-- get the recordset
EXEC ('CALL NAME_AS400.MYLIB.MYSELECT(?)', @myParam) AT AS400  -- AS400 = name of the linked server
-- update
EXEC ('CALL NAME_AS400.MYLIB.MYUPDATE(?)', @myParam) AT AS400
Hope it helps
I recommend following the suggestions in the IBM Redbook SQL Performance Diagnosis on IBM DB2 Universal Database for iSeries to determine what's really happening.
IBM technical support can also be extremely helpful in diagnosing issues such as these. Don't be afraid to get in touch with them as the software support is generally included as part of the maintenance contract and there is no charge to talk to them.
I've seen OLE DB connections eat up 100% CPU for hours, while the same query run through Visual Explain (query analyzer) is estimated to take mere seconds to execute.
We found that running the query like this performed as expected:
SELECT *
FROM OpenQuery( LinkedServer,
'select count(*)
from transactions
where processed = ''In process''')
GO
Could this be a collation problem? Your WHERE clause is testing a text field, and if the collations of the two servers don't match, this clause will be applied client-side rather than server-side, so you are first pulling all 170 million records down to the client and then applying the WHERE clause there.
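For illustration, and not from the original answer (AS400 as the linked server name is an assumption): if the remote character data really is compatible, marking the linked server as collation compatible lets SQL Server push such string comparisons to the remote side instead of filtering client-side.
-- Only set this if character data on both servers compares and sorts identically.
EXEC sp_serveroption @server = 'AS400',
    @optname = 'collation compatible',
    @optvalue = 'true';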
Based on past interactions I have had, the query should take about the same amount of time no matter how you access the data. Another thought would be to create a view on the table to get the data you need, or to use a stored procedure.