I'm currently scraping data and dumping it into a Cloud SQL Postgres database. This data grows quickly; the database grows by ~3GB/day and I'm looking to keep data for at least 3 months, so I need an efficient way to execute queries. Therefore, I've connected my Cloud SQL instance to BigQuery. The following is an example of a query that I'm running on BigQuery, but I'm skeptical; I'm not sure whether the query is being executed in Postgres or in BigQuery:
SELECT * FROM EXTERNAL_QUERY("project.us-cloudsql-instance", "SELECT date_trunc('day', created_at) d, variable1, AVG(variable2) FROM my_table GROUP BY 1,2 ORDER BY d;");
It seems like the query is being executed in PostgreSQL, though, not in BigQuery. Is this true? If it is, is there a way for me to load data from PostgreSQL into BigQuery in real time and execute queries directly in BigQuery?
I think you are using federated queries. These queries are intended to collect data from BigQuery and from a Cloud SQL instance:
BigQuery Cloud SQL federation enables BigQuery to query data residing in Cloud SQL in real-time, without copying or moving data. It supports both MySQL (2nd generation) and PostgreSQL instances in Cloud SQL.
The query inside EXTERNAL_QUERY is executed in Cloud SQL, and this can lead to lower performance than running it in BigQuery.
EXTERNAL_QUERY executes the query in Cloud SQL and returns the results to BigQuery as a temporary table, which you can then use like any other BigQuery table.
Now, the current ways to load data into BigQuery are: from GCS; from other Google services such as Google Ad Manager and Google Ads; from a readable data source; by inserting individual records using streaming inserts; with DML statements; and with the BigQuery I/O transform in a Dataflow pipeline.
This solution is well worth a look, as it is pretty similar to what you need:
The MySQL to GCS operator executes a SELECT query against a MySQL table. The SELECT pulls all data greater than (or equal to) the last high watermark. The high watermark is either the primary key of the table (if the table is append-only), or a modification timestamp column (if the table receives updates). Again, the SELECT statement also goes back a bit in time (or rows) to catch potentially dropped rows from the last query (due to the issues mentioned above).
With Airflow they manage to keep BigQuery synchronized to their MySQL database every 15 minutes.
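As an illustration only (my_table and updated_at are assumptions, and {last_high_watermark} is a placeholder the orchestrator would substitute), the incremental SELECT described above might look roughly like this:

-- Hypothetical incremental extract, run on a schedule (e.g. every 15 minutes):
-- pull everything at or after the last high watermark, minus a small lookback
-- window to catch late-arriving rows.
SELECT *
FROM my_table
WHERE updated_at >= TIMESTAMP '{last_high_watermark}' - INTERVAL '10 minutes'
ORDER BY updated_at;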
Although technically it is possible to rewrite the query so the aggregation runs in BigQuery (note that the outer query has to use BigQuery syntax, e.g. TIMESTAMP_TRUNC rather than Postgres's date_trunc, assuming created_at is a timestamp):
SELECT TIMESTAMP_TRUNC(created_at, DAY) AS d, variable1, AVG(variable2)
FROM EXTERNAL_QUERY("project.us-cloudsql-instance",
    "SELECT created_at, variable1, variable2 FROM my_table")
GROUP BY 1, 2
ORDER BY d;
It is not recommended, though. It is better to do as much aggregation and filtering as possible in Cloud SQL to reduce the amount of data that has to be transferred from Cloud SQL to BigQuery.
Related
We are trying to use Redshift Serverless. It shows that the query history is stored in the table sys_query_history.
But looking at the table, the query_text field has a 4000-character limit, so longer queries are truncated. Is there another way to get the full text of executed queries in Redshift Serverless?
I want to write a Spark pipeline to perform aggregation on my production DB data and then write the results back to the DB. My goal in writing the pipeline is to perform the aggregation without impacting the production DB while it runs, meaning I don't want users experiencing lag or the DB getting hit with heavy IOPS while the aggregation is performed. For example, an equivalent aggregation query run directly as SQL would take a long time and also use up the RDS IOPS, which results in users not being able to get data; I'm trying to avoid this. A few questions:
How is data loaded into Spark (AWS Glue) in general? Is there query load on prod DB?
Is there a difference in using a custom SQL query vs custom Spark code to filter items initially (initial loading of data, e.g. load 30 days of sales data)? For example, does using a custom SQL query end up performing a query on the prod DB, resulting in a large load on the prod DB?
When writing data back to DB, does that incur load on DB as well?
I'm using a PostgreSQL database in case this matters.
How is data loaded into Spark (AWS Glue) in general? Is there query load on prod DB?
By default, Glue reads the whole table into a single partition. But you can do parallel reads by splitting the table on a partition column; make sure to choose a column that will not affect DB performance.
Is there a difference in using a custom SQL query vs custom Spark code to filter items initially (initial loading of data, e.g. load 30 days of sales data)?
Yes. When you pass a query instead of a table name, you only read the result of that query from the DB, which reduces the network and I/O transfer. In other words, you are delegating the calculation of the result to the DB engine, roughly as sketched below.
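For illustration, this is roughly the difference in what the production DB ends up executing (a sketch only; the sales table and sale_date column are hypothetical):

-- Passing a query to the reader: the 30-day filter runs in the database
-- and only the matching rows are transferred to Spark.
SELECT order_id, product_id, amount, sale_date
FROM sales
WHERE sale_date >= CURRENT_DATE - INTERVAL '30 days';

-- Passing only the table name: effectively the whole table is read,
-- and the filtering happens later in Spark.
SELECT * FROM sales;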
For example, does using a custom SQL query end up performing a query on the prod DB, resulting in a large load on the prod DB?
Yes. Depending on the table size and query complexity, this might affect DB performance; if you have a read replica, you can simply run the load against that instead.
When writing data back to DB, does that incur load on DB as well?
Yes, and it depends on how you write the result back to the DB. Writing from a moderate number of partitions is best, i.e., not too many and not too few.
I have a procedure in Oracle PL/SQL which fetches transactional data based on certain conditions and then performs some logical calculations. I used a cursor to hold the SQL, then FETCH (cursor) BULK COLLECT INTO (table type variable) LIMIT 10000, iterated over this table variable to perform the calculations, and ultimately stored the values in a DB table. Once 10000 rows have been processed, the query is executed to fetch the next set of records.
This helped me limit the number of times the SQL is executed via the cursor and limit the number of records loaded into memory.
I am trying to migrate this code to PL/pgSQL. How can I achieve this functionality in PL/pgSQL?
You cannot achieve this functionality in PostgreSQL.
I wrote an extension, https://github.com/okbob/dbms_sql . It can be used to reduce the work needed when migrating from Oracle to Postgres.
But you don't need this feature in Postgres. Although PL/pgSQL is similar to PL/SQL, the architecture is very different - and bulk collect operations are not necessary.
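As a minimal sketch of the PL/pgSQL style (the transactions and calc_results tables and their columns are purely hypothetical):

-- A plain FOR loop over a query streams rows through an implicit cursor,
-- so there is no need to batch with BULK COLLECT ... LIMIT.
DO $$
DECLARE
    rec record;
BEGIN
    FOR rec IN SELECT id, amount FROM transactions WHERE processed = false
    LOOP
        -- per-row calculation, then store the result
        INSERT INTO calc_results (txn_id, calculated_value)
        VALUES (rec.id, rec.amount * 2);
    END LOOP;
END
$$;

Behind the scenes the loop uses a cursor and fetches rows in batches, so the whole result set is never held in memory at once.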
I manage a health-care database which is hosted in AWS RDS. The system info is as below:
PostgreSQL 9.6
8 v-cores and 16GB RAM
DB size now is 35GB
The problem is that I want to join a few thousand users in the accounts table with other health-metric tables (up to 10 tables, with a few million records per table) to make a custom data report (using Google Data Studio).
Here is what I did:
Join all the needed tables as one materialized view.
Feed Google Data Studio from this materialized view.
But I have waited 10 hours and it still runs without end; I think it will never finish. Does anyone have experience with huge data reports? Just give me the keywords.
Here is my materialized view definition:
CREATE MATERIALIZED VIEW report_20210122 AS
SELECT /* long, but simple list */
FROM accounts
INNER JOIN user_weartime ON accounts.id = user_weartime.user_id
INNER JOIN admin_exchanges ON accounts.id = admin_exchanges.user_id
INNER JOIN user_health_source_stress_history ON accounts.id = user_health_source_stress_history.user_id
INNER JOIN user_health_source_step_history ON accounts.id = user_health_source_step_history.user_id
INNER JOIN user_health_source_nutri_history ON accounts.id = user_health_source_nutri_history.user_id
INNER JOIN user_health_source_heart_history ON accounts.id = user_health_source_heart_history.user_id
INNER JOIN user_health_source_energy_history ON accounts.id = user_health_source_energy_history.user_id
INNER JOIN user_health_source_bmi_history ON accounts.id = user_health_source_bmi_history.user_id
where accounts.id in (/* 438 numbers */);
Creating a materialized view for a huge join is probably not going to help you.
You didn't show us the query for the report, but I expect that it contains some aggregate functions, and you don't want to report a list of millions of rows of raw data.
First, make sure that you have all the appropriate indexes in place. Which indexes you need depends on the query. For the one you are showing, you would want an index on accounts(id), and (if you want a nested loop join) on admin_exchanges(user_id), and similarly for the other tables.
But to find out the correct indexes for your eventual query, you'd have to look at its execution plan.
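For concreteness, the indexes suggested above would look roughly like this (verify against the actual report query's execution plan before adding them):

-- accounts.id is normally already covered by its primary key index
CREATE INDEX ON admin_exchanges (user_id);
CREATE INDEX ON user_weartime (user_id);
CREATE INDEX ON user_health_source_stress_history (user_id);
-- ...and likewise a user_id index on each remaining *_history table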
Sometimes a materialized view can help considerably, but typically by pre-aggregating some data.
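For example, instead of materializing the raw join, you could pre-aggregate each metric per user and day, something like this (recorded_at and steps are hypothetical column names):

CREATE MATERIALIZED VIEW step_history_daily AS
SELECT user_id,
       date_trunc('day', recorded_at) AS day,
       sum(steps) AS total_steps
FROM user_health_source_step_history
GROUP BY user_id, date_trunc('day', recorded_at);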
If you join more than 8 tables, increasing join_collapse_limit can give you a better plan.
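Your view joins 9 tables (accounts plus 8 history tables), so raising the limit to at least 9 in the session that runs the query lets the planner consider every join order:

SET join_collapse_limit = 9;  -- the default is 8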
I changed my approach and now know how to do it: FULL JOIN the metric tables ON start_date AND user_id, so that each health metric becomes a column in one wide view. My report now has more than 500k rows and 40 columns, but the view creation is still very fast, and so are queries on the view.
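A rough sketch of that reshaping, assuming pre-aggregated per-metric daily tables (step_daily and heart_daily are hypothetical names; in practice you would chain one FULL JOIN per metric):

CREATE MATERIALIZED VIEW report_wide AS
SELECT coalesce(s.user_id, h.user_id)       AS user_id,
       coalesce(s.start_date, h.start_date) AS start_date,
       s.total_steps,
       h.avg_heart_rate
FROM step_daily s
FULL JOIN heart_daily h
  ON h.user_id = s.user_id
 AND h.start_date = s.start_date;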
I would ask why you are using a direct connection to PostgreSQL to display data in Data Studio. Although this is supported, it only makes sense if you don't want to invest time in developing a good data flow (that is, your data is small) or if you want to display real-time data.
But since your data is huge and you're using a materialized view, I guess neither of these is the case.
I suggest you move to BigQuery. Data Studio and BigQuery play really nicely together, and BigQuery is made to process huge amounts of data very fast. I bet your query would run in seconds in BigQuery and cost cents.
Sadly, BigQuery only supports Cloud SQL external connectors and can't connect directly to your AWS RDS service. You'll need to write an ETL job somewhere, or move your database to Cloud SQL for PostgreSQL (which I recommend, if it is possible).
Check out these answers if you're interested in transferring data from AWS RDS to BigQuery:
how to load data from AWS RDS to Google BigQuery in streaming mode?
Synchronize Amazon RDS with Google BigQuery
I'm currently struggling to query a big chunk of data that is stored in a partitioned table (one partition per date).
The data looks like this:
date, product_id, orders
2019-11-01, 1, 100
2019-11-01, 2, 200
2019-11-02, 1, 300
I have hundreds of date-partitions and millions of rows per date.
Now, if I want to query, for instance, the total orders for product IDs 1 and 2 over a period of 2 weeks, grouped by date (to show in a graph per date), the DB has to go through 2 weeks of partitions and fetch the data from them.
That process can take a long time when the number of products is big or the required time frame is long.
I have read that AWS Redshift is suitable for this kind of task. I'm considering shifting my partitioned tables (aggregated analytics per date) to that technology, but I wonder whether that's really what I should do to make those queries run much faster.
Thanks!
For your use case, Redshift is really a good choice.
To get the best performance out of Redshift, it is very important to set a proper distribution key and sort key. In your case, the "date" column should be the distribution key and "product_id" should be the sort key. Another important note: do not apply compression encoding to the "date" and "product_id" columns.
You should get better performance.
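A sketch of a table definition following this advice (column names from the question; the types are assumptions, so adjust them to your data):

-- ENCODE RAW leaves the distribution and sort key columns uncompressed,
-- as recommended above
CREATE TABLE daily_product_orders (
    "date"     DATE   ENCODE RAW,
    product_id BIGINT ENCODE RAW,
    orders     BIGINT
)
DISTKEY ("date")
SORTKEY (product_id);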
If you are struggling with traditional SQL databases, then Amazon Redshift is certainly an option. It can handle tables with billions of rows.
This would involve loading the data from Amazon S3 into Redshift. This will allow Redshift to optimize the way that the data is stored, making it much faster for querying.
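For example, a load from S3 might look like this (the table name, bucket path, and IAM role ARN are placeholders):

COPY daily_orders
FROM 's3://my-bucket/orders/'
IAM_ROLE 'arn:aws:iam::123456789012:role/my-redshift-load-role'
FORMAT AS CSV
IGNOREHEADER 1;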
Alternatively, you could consider using Amazon Athena, which can query the data directly from Amazon S3. It understands data that is partitioned into separate directories (e.g. based on date).
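A minimal Athena sketch over date-partitioned CSV files (the table name, bucket path, and file layout are assumptions):

CREATE EXTERNAL TABLE orders_by_date (
    product_id BIGINT,
    orders     BIGINT
)
PARTITIONED BY (dt DATE)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION 's3://my-bucket/orders/';

-- register the existing dt=YYYY-MM-DD/ folders as partitions
MSCK REPAIR TABLE orders_by_date;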
Which version of PostgreSQL are you using?
Are you using native partitioning, or trigger-based inheritance partitioning?
The latest versions of PostgreSQL have improved partitioning management.
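For reference, native (declarative) range partitioning in PostgreSQL 10+ looks roughly like this (table and partition names are illustrative):

CREATE TABLE orders (
    date       date   NOT NULL,
    product_id bigint NOT NULL,
    orders     bigint
) PARTITION BY RANGE (date);

CREATE TABLE orders_2019_11_01 PARTITION OF orders
    FOR VALUES FROM ('2019-11-01') TO ('2019-11-02');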
Considering your case, Amazon Redshift can be a good choice, and so can Amazon Athena. But it is also important to consider your application framework: are you moving to Amazon only for the database, or do you have other Amazon services on the list too?
Also, before making the decision, please check the cost of Redshift.