I manage a health-care database hosted on AWS RDS. The system info is as follows:
PostgreSQL 9.6
8 vCPUs and 16 GB RAM
Current DB size: 35 GB
The problem: I want to join a few thousand users from the accounts table with other health-metric tables (up to 10 of them, with a few million records per table) to build a custom data report (using Google Data Studio).
Here is what I did:
Join all the needed tables into one materialized view.
Feed Google Data Studio from this materialized view.
But I have waited 10 hours and it is still running with no end in sight; I doubt it will ever finish. Does anyone have experience with huge data reports? Just give me the keywords.
Here is my materialized view definition:
CREATE MATERIALIZED VIEW report_20210122 AS
SELECT /* long, but simple list */
FROM accounts
INNER JOIN user_weartime ON accounts.id = user_weartime.user_id
INNER JOIN admin_exchanges ON accounts.id = admin_exchanges.user_id
INNER JOIN user_health_source_stress_history ON accounts.id = user_health_source_stress_history.user_id
INNER JOIN user_health_source_step_history ON accounts.id = user_health_source_step_history.user_id
INNER JOIN user_health_source_nutri_history ON accounts.id = user_health_source_nutri_history.user_id
INNER JOIN user_health_source_heart_history ON accounts.id = user_health_source_heart_history.user_id
INNER JOIN user_health_source_energy_history ON accounts.id = user_health_source_energy_history.user_id
INNER JOIN user_health_source_bmi_history ON accounts.id = user_health_source_bmi_history.user_id
where accounts.id in (/* 438 numbers */);
Creating a materialized view for a huge join is probably not going to help you.
You didn't show us the query for the report, but I expect that it contains some aggregate functions, and you don't want a report listing millions of raw rows.
First, make sure that you have all the appropriate indexes in place. Which indexes you need depends on the query. For the one you are showing, you would want an index on accounts(id), and (if you want a nested loop join) on admin_exchanges(user_id), and similarly for the other tables.
But to find out the correct indexes for your eventual query, you'd have to look at its execution plan.
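For example, a minimal sketch of the indexes meant here; the index names are made up, and accounts.id is normally already covered by its primary key:
CREATE INDEX IF NOT EXISTS admin_exchanges_user_id_idx ON admin_exchanges (user_id);
CREATE INDEX IF NOT EXISTS user_weartime_user_id_idx ON user_weartime (user_id);
-- ...and likewise one index per user_health_source_*_history table on (user_id)
-- Plain EXPLAIN (without ANALYZE) shows the chosen plan without running the query:
EXPLAIN
SELECT /* same column list as the view */
FROM accounts
INNER JOIN user_weartime ON accounts.id = user_weartime.user_id
INNER JOIN admin_exchanges ON accounts.id = admin_exchanges.user_id
/* ...the remaining joins as in the view definition... */
WHERE accounts.id IN (/* 438 numbers */);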
Sometimes a materialized view can help considerably, but typically by pre-aggregating some data.
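For instance, a sketch of a pre-aggregating materialized view; recorded_at and steps are assumed column names:
CREATE MATERIALIZED VIEW report_steps_daily AS
SELECT user_id,
       date_trunc('day', recorded_at) AS day,  -- recorded_at is an assumed column
       sum(steps) AS total_steps               -- steps is an assumed column
FROM user_health_source_step_history
GROUP BY user_id, date_trunc('day', recorded_at);
-- Refresh it whenever the underlying data changes:
REFRESH MATERIALIZED VIEW report_steps_daily;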
If you join more than 8 tables, increasing join_collapse_limit can give you a better plan.
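A sketch, run in the same session right before the CREATE MATERIALIZED VIEW statement (the value 12 is only an example, not a recommendation):
-- The default for both settings is 8; raising them lets the planner
-- consider reordering all nine relations in this query.
SET join_collapse_limit = 12;
SET from_collapse_limit = 12;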
I changed my approach and now know how to do it: FULL JOIN the tables ON start_date AND user_id, so that each health metric becomes its own set of columns in one wide view. My report now has more than 500k rows and 40 columns, but creating the view is still very fast, and so is querying it.
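For reference, a sketch of that restructured view under assumed column names (start_date, steps, heart_rate); the remaining metric tables would be FULL JOINed the same way:
CREATE MATERIALIZED VIEW report_wide AS
SELECT COALESCE(s.user_id, h.user_id) AS user_id,
       COALESCE(s.start_date, h.start_date) AS start_date,
       s.steps,       -- assumed metric column
       h.heart_rate   -- assumed metric column
FROM user_health_source_step_history s
FULL JOIN user_health_source_heart_history h
       ON h.user_id = s.user_id AND h.start_date = s.start_date
/* ...FULL JOIN the other metric tables on the same keys... */
WHERE COALESCE(s.user_id, h.user_id) IN (/* 438 numbers */);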
I would ask why you are using a direct connection to PostgreSQL to display data in Data Studio. Although this is supported, it only makes sense if you don't want to invest time in building a proper data flow (that is, your data is small) or if you need to display real-time data.
But since your data is huge and you're using a materialized view, I guess neither is the case.
I suggest you move to BigQuery. Data Studio and BigQuery play really nicely together, and BigQuery is built to process huge amounts of data very fast. I bet your query would run in seconds on BigQuery and cost cents.
Sadly, BigQuery only supports Cloud SQL external connectors and cannot connect directly to your AWS RDS instance. You'll need to write an ETL job somewhere, or move your database to Cloud SQL for PostgreSQL (which I recommend, if possible).
Check out these answers if you're interested in transferring data from AWS RDS to BigQuery:
how to load data from AWS RDS to Google BigQuery in streaming mode?
Synchronize Amazon RDS with Google BigQuery
Related
I am using Custom SQL in Amazon QuickSight to join several tables from Redshift. I wonder where the join happens: does QuickSight send the query to the Redshift cluster and get the results back, or does the join happen in QuickSight? I thought of creating a view in Redshift and selecting data from the view to make sure the join happens in Redshift; however, I read in a few articles that using views in Redshift is not a good idea.
Quicksight pushes SQL down to the underlying database e.g. Redshift.
Using custom SQL is the same as using a view inside Redshift from a performance point of view.
In my opinion it is easier to manage as a Redshift view as you can:
Use Quicksight wizards more effectively
Drop and recreate the view as needed to add new columns
Have visibility into your SQL source code by storing it in a code repo, e.g. git.
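For illustration, a hypothetical Redshift view wrapping the custom SQL; the schema, table, and column names are made up:
CREATE OR REPLACE VIEW reporting.quicksight_sales AS
SELECT o.order_id,
       o.created_at,
       c.customer_name,
       o.total_amount
FROM sales.orders o
JOIN sales.customers c ON c.customer_id = o.customer_id;
QuickSight then just selects from reporting.quicksight_sales, and the join itself still runs inside Redshift.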
I am currently scraping data and dumping it into a Cloud SQL Postgres database. This data grows quickly (by ~3 GB/day) and I'm looking to keep data for at least 3 months, so I need an efficient way to run queries. I have therefore connected my Cloud SQL instance to BigQuery. The following is an example of a query I'm running on BigQuery, but I'm skeptical: I'm not sure whether the query is executed in Postgres or in BigQuery.
SELECT * FROM EXTERNAL_QUERY("project.us-cloudsql-instance", "SELECT date_trunc('day', created_at) d, variable1, AVG(variable2) FROM my_table GROUP BY 1,2 ORDER BY d;");
It seems like the query is being executed in PostgreSQL though, not BigQuery. Is this true? If so, is there a way for me to load data from PostgreSQL to BigQuery in real time and run queries directly in BigQuery?
I think you are using federated queries. These queries let BigQuery collect data that resides in a Cloud SQL instance:
BigQuery Cloud SQL federation enables BigQuery to query data residing in Cloud SQL in real-time, without copying or moving data. It supports both MySQL (2nd generation) and PostgreSQL instances in Cloud SQL.
The query is executed in Cloud SQL, which can mean lower performance than running it in BigQuery.
EXTERNAL_QUERY executes the query in Cloud SQL and returns the results as a temporary table; the result behaves like a BigQuery table in the rest of the query.
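For example, the result of EXTERNAL_QUERY can be joined with a native BigQuery table; the dataset and table names below are hypothetical:
SELECT e.user_id,
       e.plan_name,
       b.event_count
FROM EXTERNAL_QUERY(
       "project.us-cloudsql-instance",
       "SELECT id AS user_id, plan_name FROM users") AS e
JOIN `project.my_dataset.daily_event_counts` AS b
  ON b.user_id = e.user_id;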
Now, the current ways to load data into BigQuery are: from GCS, from other Google services such as Google Ad Manager and Google Ads, from a readable data source, by inserting individual records using streaming inserts, with DML statements, and with a BigQuery I/O transform in a Dataflow pipeline.
This solution is well worth a look, as it is pretty similar to what you need:
The MySQL to GCS operator executes a SELECT query against a MySQL table. The SELECT pulls all data greater than (or equal to) the last high watermark. The high watermark is either the primary key of the table (if the table is append-only), or a modification timestamp column (if the table receives updates). Again, the SELECT statement also goes back a bit in time (or rows) to catch potentially dropped rows from the last query (due to the issues mentioned above).
With Airflow they manage to keep BigQuery synchronized to their MySQL database every 15 minutes.
Although technically, it is possible to rewrite the query as
SELECT date_trunc('day', created_at) d, variable1, AVG(variable2)
FROM EXTERNAL_QUERY("project.us-cloudsql-instance",
"SELECT created_at, variable1, variable2 FROM my_table")
GROUP BY 1,2 ORDER BY d;
It is not recommended, though. It is better to do as much aggregation and filtering as possible in Cloud SQL, to reduce the amount of data that has to be transferred from Cloud SQL to BigQuery.
I have a database with billions of rows, so when I load the data into Power BI it takes a lot of time to visualize.
So if I use a slicer to limit the range of data, would it just load the data in that range?
When you use Power BI in direct query mode, it writes queries against your database, getting the database to pre-aggregate the data for you.
So, if you're connecting to the StackOverflow database and your visualisation is showing count of posts per day, then the SQL query which PowerBI will send to the database is something like:
SELECT CreationDate = CONVERT(date, CreationDate), PostCount = COUNT(*)
FROM Posts
GROUP BY CONVERT(date, CreationDate);
So it may behave differently from what you expect, given your question about whether "it would just load data in this range."
So, in direct query mode, the performance of the visualisation is related to the performance of your database much more than the rate at which data can be streamed over the network, because not many rows will need to be sent across the network.
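So, to the original question: in direct query mode a date slicer is normally pushed down into the WHERE clause, so the database receives something along these lines (the dates are only illustrative, and the SQL Power BI actually generates is more verbose):
SELECT CreationDate = CONVERT(date, CreationDate), PostCount = COUNT(*)
FROM Posts
WHERE CreationDate >= '2020-01-01' AND CreationDate < '2021-01-01'
GROUP BY CONVERT(date, CreationDate);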
Is there a general preference or best practice for whether data should be aggregated in memory on an ETL worker (with pandas groupby or pd.pivot_table, for example), versus doing a groupby query at the database level?
At the visualization layer, I connect to the last 30 days of detailed interaction-level data, and then the last few years of aggregated data (daily level).
I suppose that if I plan on materializing the aggregated table, it would be best to just do it during the ETL phase since that can be done remotely and not waste the resources of the database server. Is that correct?
If your concern is to put as little load on the source database server as possible, it is best to pull the tables from the source database to a staging area and do joins and aggregations there. But take care that the ETL tool does not perform a nested loop join on the source database tables, that is to pull in one of the tables and then run thousands of queries against the other table to find matching rows.
If your target is to perform joins and aggregations as fast and efficient as possible, by all means push them down to the source database. This may put more load on the source database though. I say “may” because if all you need is an aggregation on a single table, it can be cheaper to perform this in the source database than to pull the whole table.
If you aggregate by day, what if your boss wants it aggregated by hour or week?
The general rule is: Your fact table granularity should be as granular as possible. Then you can drill-down.
You can create pre-aggregated tables too, for example by hour, day, week, month, etc. Space is cheap these days.
Tools like Pentaho Aggregation Designer can automate this for you.
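As a sketch, a pre-aggregated daily table might be built like this (Postgres-style syntax; the table and column names are assumptions):
CREATE TABLE fact_interactions_daily AS
SELECT date_trunc('day', interaction_ts) AS interaction_date,
       user_id,
       count(*) AS interaction_count
FROM fact_interactions
GROUP BY 1, 2;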
I am new to Tableau, and having performance issues and need some help. I have a query that joins several large tables. I am using a live data connection to a MySQL db.
The issue I am having is that it does not apply the filter criteria before asking MySQL for the data. So it is essentially doing a SELECT * for my query, without applying the filter criteria to the WHERE clause. It pulls all the data from the MySQL db back to Tableau, then throws away the unneeded data based on my filter criteria. My two main filter criteria are an account_id and a date range.
I can cleanly get a list of accounts just by selecting from my account table to populate the filter list; I then need to know how to apply that selection when Tableau pulls the data for the main query from MySQL.
To apply a filter at the data source first, try using context filters.
Performance can also be improved by using extracts.
I would personally use an extract: go into your MySQL back end and run the query as a CREATE TABLE extract1 AS statement, or whatever you want to call your data table.
When you import this table into Tableau, it will already contain the result of your query, so the workbook's SELECT * hits pre-computed data. From here your query efficiency will increase tenfold.
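A sketch of that table, with the account and date filters from the question pushed into its definition (the source table and its column names are made up):
CREATE TABLE extract1 AS
SELECT t.*
FROM interactions t
WHERE t.account_id IN (SELECT id FROM account)
  AND t.created_at >= CURRENT_DATE - INTERVAL 90 DAY;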
Unfortunately, Tableau processing time plus the MySQL back-end query time still add up, so processing your data will take a while.
Try the extracts...
I've been struggling with the very same thing. I have found that Tableau extracts aren't any faster than pulling directly from a SQL table. What I have done is create tables within SQL that already contain the filtered data, so the SELECT * returns only the needed data. The downside is that this takes up more space on the server, but that isn't a problem in my case.
For large data sets, Tableau recommends using an extract.
An extract creates a snapshot of the data you are connected to, and processing this data will be faster than over a live connection.
All the charts and visualizations load faster, saving you time each time you open the dashboard.
The filters you use on the data set will also work faster with an extract connection. But to get the latest data you have to refresh the extract, or schedule a refresh on the server (if you are publishing the report to a server).
There are multiple types of filters available in Tableau, and which to use depends on your application; context filters and global filters can be used to filter the whole data set.