We are trying to use redshift serverless. It shows the query history is stored in the table sys_query_history.
But looking at the table, the query_text field has a 4000 characters limit. They'd truncate the query. Is there another way to get the full executed queries in Redshift Serverless?
Related
I am looking to find the data source of couple of Tables in Redshift. I have gone through all the stored procedures in Redshift instance. I couldn't find any stored procedure which populates these tables in Redshift. I have also checked the Data Migration Service and didn't see these tables are being migrated from RDS instance. However, the tables are updated regularly each day.
What would be the way to find how data is populated in those 2 tables? Is there any logs or system tables I can look in to?
One place I'd look is svl_statementtext. That will pull any queries and utility queries that may be inserting or running copy jobs against that table. Just use a WHERE text LIKE %yourtablenamehere% and see what comes back.
https://docs.aws.amazon.com/redshift/latest/dg/r_SVL_STATEMENTTEXT.html
Also check scheduled queries in the Redshift UI console.
currently scraping data and dumping them on a cloudSQL postgres database .. this data tends to grow exponentially and I need an efficient way to execute queries .. database grows by ~3GB/day and I'm looking to keep data for at least 3 months .. therefore, I've connected my CloudSQL to BigQuery .. the following is an example of a query that I'm running on BigQuery but I'm skeptical .. not sure if the query is being executed in Postgres or BigQuery ..
SELECT * FROM EXTERNAL_QUERY("project.us-cloudsql-instance", "SELECT date_trunc('day', created_at) d, variable1, AVG(variable2) FROM my_table GROUP BY 1,2 ORDER BY d;");
seems like the query is being executed in postgreSQL though, not BigQuery .. is this true? if it is, is there a way for me to load data from postgresql to bigquery in realtime and execute queries directly in bigquery ?
I think you are using federated queries. These queries are intended to collect data from BigQuery and from a CloudSQLInstance:
BigQuery Cloud SQL federation enables BigQuery to query data residing in Cloud SQL in real-time, without copying or moving data. It supports both MySQL (2nd generation) and PostgreSQL instances in Cloud SQL.
The query is being executed in CloudSQL and this could lead into a lower performance than if you run in BigQuery.
EXTERNAL_QUERY executes the query in Cloud SQL and returns results as a temporary table. The result would be a BigQuery table.
Now, the current ways to load data into BigQuery are from: GCS, other Google Ad Manager and Google Ads, a readtable data source, By inserting individual records using streaming inserts, DML statements and BigQuery I/O transform in a Dataflow pipeline.
This solution is well worth to take a look which is pretty similar to what you need:
The MySQL to GCS operator executes a SELECT query against a MySQL table. The SELECT pulls all data greater than (or equal to) the last high watermark. The high watermark is either the primary key of the table (if the table is append-only), or a modification timestamp column (if the table receives updates). Again, the SELECT statement also goes back a bit in time (or rows) to catch potentially dropped rows from the last query (due to the issues mentioned above).
With Airflow they manage to keep BigQuery synchronized to their MySQL database every 15 minutes.
Although technically, it is possible to rewrite the query as
SELECT date_trunc('day', created_at) d, variable1, AVG(variable2)
FROM EXTERNAL_QUERY("project.us-cloudsql-instance",
"SELECT created_at, variable1, variable2 FROM my_table")
GROUP BY 1,2 ORDER BY d;
It is not recommended though. Better do aggregation and filtering on CloudSQL as much as possible to reduce the amount of data that has to be transfered from CloudSQL to BigQuery.
I have a Dblink query Amazon RDS (Postgres) that execute an INSERT with rows from an Amazon Redshift cluster.
The query terminates after 15/20 minutes, if not more, but I can see that all rows are being inserted after only few minutes.
I'm running these queries via JetBrains' DataGrip.
Some other similar dblink on the same connection, terminate as expected.
The only difference I see being the size of the table, which is bigger in the first case.
All these queries are simply copying the whole table. Pretty much like this:
insert into rds_table(
select *
from db_link('foreign_server',
$REDSHIFT$
select *
from redshift_table
$REDSHIFT$) as table_n(...)
);
Where "foreign server" is my connection to Redshift.
I know that the query is completed because rds_table has the same number of rows as redshift_table.
DataGrip shows the query as still running:
and won't let me run other queries until I manually stop the query.
If I do so, the inserted rows remain in the database, meaning that the transaction has already committed.
Why is this happening? Is it a problem with DataGrip or with Postgres?
How can I fix it?
Is there any other better alternative to migrate data from Redshift to RDS?
If a concurrent transaction can already see the inserted data, that means that the inserting transaction and consequently the INSERT statement must already be finished.
If DataGrip shows the statement as still running, it is lying to you.
So this must be a DataGrip bug.
I am trying to migrate a huge table from postgres into Redshift.
The size of the table is about 5,697,213,832
tool: pentaho Kettle Table input(from postgres) -> Table output(Redshift)
Connecting with Redshift JDBC4
By observation I found the inserting into Redshift is the bottleneck. only about 500 rows/second.
Is there any ways to accelerate the insertion into Redshift in single machine mode ? like using JDBC parameter?
Have you consider using S3 as mid-layer?
Dump your data to csv files and apply gzip compression. Upload files to the S3 and then use copy command to load the data.
http://docs.aws.amazon.com/redshift/latest/dg/r_COPY.html
The main reason for bottleneck of redshift performance, which i considered is that Redshift treats each and every hit to the cluster as one single query. It executes each query on its cluster and then proceeds to the next stage. Now when i am sending across multiple rows (in this case 10), each row of data is treated a separate query. Redshift executes each query one by one and loading of the data is completed once all the queries are executed. It means if you are having 100 million rows, there would be 100 million queries running on your Redshift cluster. Well the performance goes to dump !!!
Using S3 File Output step in PDI will load your data to S3 Bucket and then apply the COPY command on the redshift cluster to read the same data from S3 to Redshift. This will solve your problem of performance.
You may also read the below blog links :
Loading data to AWS S3 using PDI
Reading Data from S3 to Redshift
Hope this helps :)
Better to export data to S3, then use COPY command to import data into Redshift. In this way, the import process is fast while you don't need to vacuum it.
Export your data to S3 bucket and use the COPY command in Redshift . COPY command is the fastest way to insert data in Redshift .
Does VACUUM; with no other arguments run per database or per current schema on amazon redshift?
The reason I am asking this is because when VACUUM completes on one schema and I change the default schema, and run it again, it takes a whole hour to complete.
VACUUM with no arguments runs on the entire database. See the Amazon Redshift VACUUM doc: "Reclaims space and resorts rows in either a specified table or all tables in the current database.".