I use Grafana to view metrics stored in TimescaleDB.
For large-scale metrics I create a view that aggregates them into a smaller dataset. In Grafana I configure a SQL query whose table name is fixed, but I want the table name to change according to the selected time range: for a time range of less than 6 hours, query the detail table; for a time range of more than 24 hours, query the aggregate view.
So I am looking for a proxy or PostgreSQL plugin that can be used to modify the SQL before it is executed.
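One way to get this behaviour without a proxy, sketched here purely as an illustration, is a single Grafana query that UNIONs both sources and lets a predicate on the selected range pick the branch. It assumes the Grafana PostgreSQL data source macros $__timeFilter(), $__timeFrom() and $__timeTo(); metrics_detail, metrics_hourly and the column names are made-up placeholders, and a single 6-hour cutoff is used instead of the 6/24-hour split:
-- Branch 1: raw detail rows when the dashboard range is shorter than 6 hours
SELECT "time", value
FROM metrics_detail
WHERE $__timeFilter("time")
  AND ($__timeTo()::timestamptz - $__timeFrom()::timestamptz) < interval '6 hours'
UNION ALL
-- Branch 2: pre-aggregated rows for longer ranges
SELECT bucket AS "time", avg_value AS value
FROM metrics_hourly
WHERE $__timeFilter(bucket)
  AND ($__timeTo()::timestamptz - $__timeFrom()::timestamptz) >= interval '6 hours'
ORDER BY 1;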
AFAIK there is no PostgreSQL extension to modify SQL queries, but there is a proxy that says it can rewrite and filter SQL queries: https://github.com/wgliang/pgproxy.
You might alternatively look at TimescaleDB's real-time aggregates, which were released in 1.7.
Basically, it transparently takes the "union" of the pre-calculated aggregates (data older than 6 hours) with the "raw" data (the most recent 6 hours).
Not quite what you are asking for, but it might get you to the same place, and it works transparently with Grafana.
https://blog.timescale.com/blog/achieving-the-best-of-both-worlds-ensuring-up-to-date-results-with-real-time-aggregation/
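For reference, a minimal sketch of such a continuous aggregate with real-time aggregation enabled, using the TimescaleDB 2.x syntax (in the 1.x series the statement was CREATE VIEW ... WITH (timescaledb.continuous)); the hypertable and column names are made up:
CREATE MATERIALIZED VIEW metrics_hourly
WITH (timescaledb.continuous, timescaledb.materialized_only = false) AS
SELECT time_bucket('1 hour', "time") AS bucket,
       device_id,
       avg(value) AS avg_value
FROM metrics_detail
GROUP BY bucket, device_id;
-- materialized_only = false turns on real-time aggregation: queries against
-- metrics_hourly union the materialized buckets with buckets computed on the
-- fly from the not-yet-materialized recent raw data.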
I would suggest taking a look at Gallium Data, it's a free database proxy that allows you to change database requests before they hit the database, and database responses before they reach the clients.
Disclosure: I'm the founder of Gallium Data.
I am planning to use AWS RDS Postgres version 10.4 or above for storing data in a single table comprising ~15 columns.
My use case is to serve:
1. Periodically (every hour) store/update rows in this table.
2. Periodically (every hour) fetch data from the table, say 500 rows at a time.
3. Frequently fetch small amounts of data (10 rows) from the table (hundreds of queries in parallel).
Does AWS RDS Postgres support serving all of the above use cases?
I am aware of read-replica support, but is there any built-in load balancer to serve the queries that come in parallel?
How many read queries can Postgres process concurrently?
Thanks in advance
Your use cases seem to be a normal fit for any relational database system, so I would say: yes.
The question is how fast the DB can handle the hundreds of parallel queries from (3).
In general, the PostgreSQL documentation is one of the best I have ever read, so give it a try:
https://www.postgresql.org/docs/10/parallel-query.html
But also take into consideration how big your data is!
That said, try w/o read replicas first! You might not need them.
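If you want to see what the instance will actually allow, a few settings are worth checking; the values you get depend on your RDS instance class and parameter group:
-- Upper bound on simultaneous client connections (RDS derives the default
-- from the instance class); use a connection pooler if you approach it.
SHOW max_connections;
-- Settings that control the parallel query features described in the
-- documentation page linked above (PostgreSQL 10+).
SHOW max_worker_processes;
SHOW max_parallel_workers_per_gather;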
I would like to query several different DBs using Grafana, and in order to keep a metrics history I would like to store it in InfluxDB.
I know that I can write my own little process that runs the queries and sends the results to InfluxDB, but I wonder whether this is possible with Grafana alone?
You won't be able to use Grafana to do that. Grafana isn't really an appropriate tool for transforming/writing data, and either way its query engine generally works with a single datasource/database at a time, rather than the multiple sources you'd need here.
Is there a general preference or best practice when it comes to whether data should be aggregated in memory on an ETL worker (with pandas groupby or pd.pivot_table, for example) versus with a GROUP BY query at the database level?
At the visualization layer, I connect to the last 30 days of detailed interaction-level data, and then the last few years of aggregated data (daily level).
I suppose that if I plan on materializing the aggregated table, it would be best to just do it during the ETL phase since that can be done remotely and not waste the resources of the database server. Is that correct?
If your concern is to put as little load on the source database server as possible, it is best to pull the tables from the source database into a staging area and do the joins and aggregations there. But take care that the ETL tool does not perform a nested loop join against the source database tables, that is, pull in one of the tables and then run thousands of queries against the other table to find matching rows.
If your goal is to perform joins and aggregations as fast and efficiently as possible, by all means push them down to the source database. This may put more load on the source database, though. I say "may" because if all you need is an aggregation of a single table, it can be cheaper to perform it in the source database than to pull the whole table.
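As an illustration of the trade-off, a single-table aggregation pushed down to the source database ships only the aggregated rows over the network; the table and column names below are made up:
-- Daily rollup computed inside the source database; roughly one row per
-- (day, page) leaves the server instead of every raw interaction row.
SELECT date_trunc('day', event_time) AS day,
       page_id,
       count(*) AS interactions
FROM interactions
WHERE event_time >= now() - interval '30 days'
GROUP BY 1, 2;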
If you aggregate by day, what happens when your boss wants it aggregated by hour or by week?
The general rule is: your fact table should be as granular as possible. Then you can drill down.
You can create pre-aggregated tables too, for example by hour, day, week, month, etc. Space is cheap these days.
Tools like Pentaho Aggregation Designer can automate this for you.
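A sketch of what such a pre-aggregated rollup on top of a granular fact table might look like (fact_interactions and its columns are illustrative names; repeat per grain, e.g. hourly, weekly, monthly, as needed):
-- Keep the fact table at the finest grain, then materialize rollups from it.
CREATE TABLE agg_interactions_daily AS
SELECT date_trunc('day', event_time) AS day,
       page_id,
       count(*) AS interactions
FROM fact_interactions
GROUP BY 1, 2;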
Need to be able to report on Unique Visitors, but would like to avoid pre-computing every possible permutation of keys and creating multiple tables.
As a simplistic example, let's say I need to report monthly uniques in a table that has the following columns:
date (Month/Year)
page_id
country_id
device_type_id
monthly_uniques
In Druid and Redis, the HyperLogLog data type takes care of this (assuming a small margin of error is acceptable): I can run a query on any combination of the dimensions and receive a viable estimate of the uniques.
The closest thing I was able to find in the PostgreSQL world is the postgresql-hll plugin, but it seems to be for PostgreSQL 9.0+ only.
Is there a way to represent this in Redshift without either pre-computing every combination or storing visitor IDs (which would greatly inflate the table size, but would allow using Redshift's "approximate count" HLL implementation)?
Note: Redshift is the preferred platform. I already know that other self-hosted PostgreSQL forks, such as CitusDB, can support this; I am looking for ways to do this with Redshift.
Redshift recently announced support for HyperLogLog Sketches:
https://aws.amazon.com/about-aws/whats-new/2020/10/amazon-redshift-announces-support-hyperloglog-sketches/
https://docs.aws.amazon.com/redshift/latest/dg/hyperloglog-overview.html
UPDATE: blog post on HLL usage https://aws.amazon.com/blogs/big-data/use-hyperloglog-for-trend-analysis-with-amazon-redshift/
Redshift announced new HLL capabilities in October 2020. If your Redshift release version is 1.0.19097 or later, you can use all of the available HLL functions; see the AWS Redshift documentation for details.
You can do something like
SELECT hll(column_name) AS unique_count FROM YOURTABLE;
or create HLL sketches directly.
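A sketch of the sketch-based workflow (hll_create_sketch, hll_combine, and hll_cardinality are the documented Redshift functions; the table and column names here are made up):
-- One sketch per dimension combination, built once from the raw visits.
CREATE TABLE monthly_uniques AS
SELECT date_trunc('month', visit_time) AS month,
       page_id, country_id, device_type_id,
       hll_create_sketch(visitor_id) AS visitors_sketch
FROM visits
GROUP BY 1, 2, 3, 4;
-- Roll up across any subset of dimensions without rescanning the raw data,
-- e.g. monthly uniques per country:
SELECT month, country_id,
       hll_cardinality(hll_combine(visitors_sketch)) AS monthly_uniques
FROM monthly_uniques
GROUP BY 1, 2;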
Redshift, while technically derived from PostgreSQL, was forked over ten years ago. It still speaks the same wire protocol as Postgres, but its code has diverged a great deal. Among other incompatibilities, it no longer allows custom data types, which means the kind of plugin you're looking to use is not going to be feasible.
However, as you pointed out, if you're able to get all the raw data in, you can use the built-in approximation capability.
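If keeping the raw visitor IDs is acceptable, the built-in approximation is straightforward; visits, visit_time, and visitor_id below are illustrative names:
-- Redshift's APPROXIMATE COUNT (DISTINCT ...) uses HyperLogLog internally.
SELECT date_trunc('month', visit_time) AS month,
       page_id,
       APPROXIMATE COUNT(DISTINCT visitor_id) AS monthly_uniques
FROM visits
GROUP BY 1, 2;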
I am new to Tableau, am having performance issues, and need some help. I have a query that joins several large tables, and I am using a live data connection to a MySQL DB.
The issue I am having is that Tableau is not applying the filter criteria before asking MySQL for the data. It is essentially doing a SELECT * of my query without applying the filter criteria to the WHERE clause: it pulls all the data from the MySQL DB back to Tableau, then throws away the unneeded rows based on my filter criteria. My two main filter criteria are an account_id and a date range.
I can cleanly get a list of the accounts just by doing a SELECT from my account table to populate the filter list; I then need to know how to apply that selection when Tableau pulls the data for the main query from MySQL.
To apply a filter at the data source first, try using context filters.
Performance can also be improved by using extracts.
I would personally use an extract: go into your MySQL back end and run the query as a CREATE TABLE extract1 AS statement (or whatever you want to call your extract table).
When you import this table into Tableau, it will already contain the result of the SELECT * over your aggregated data, and from there your query efficiency will increase tenfold.
Unfortunately, it is still going to take a while, since the total time is Tableau processing time plus the MySQL back-end query time.
Try the extracts...
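A minimal sketch of that CREATE TABLE ... AS approach in MySQL; the join, filter, and column names are made up, and the point is simply that the filtering happens once, server-side:
-- Materialize the already-filtered/joined result so Tableau's connection
-- (live or extract) only ever sees the rows it needs.
CREATE TABLE extract1 AS
SELECT t.account_id,
       t.txn_date,
       t.amount,
       a.account_name
FROM transactions AS t
JOIN accounts AS a ON a.account_id = t.account_id
WHERE t.txn_date >= CURDATE() - INTERVAL 90 DAY;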
I've been struggling with the very same thing. I have found that Tableau extracts aren't any faster than pulling directly from a SQL table. What I have done is create tables within SQL that already contain the filtered data, so the SELECT * returns only the needed rows. The downside is that this takes up more space on the server, but that isn't a problem in my case.
For large data sets, Tableau recommends using an extract.
An extract creates a snapshot of the data you are connected to, and processing this data is faster than over a live connection.
All the charts and visualizations will load faster, saving you time each time you open the dashboard.
The filters that you use on the data set will also work faster with an extract connection, but to get the latest data you have to refresh the extract or schedule a refresh on the server (if you are uploading the report to the server).
There are multiple types of filters available in Tableau, and which to use depends on your application; context filters and global filters can be used to filter the whole data set.