I'm using TimescaleDB with PostgreSQL to manage time-series data.
When optimizing the table, is it recommended to rely purely on the TimescaleDB hypertable, or should I also add indexes independently, the same way I would when not using a hypertable?
What is critical in that scenario is the performance of retrieving the data.
TimescaleDB creates an index on the time dimension by default. If your queries often filter on values from other columns, it can help to create indexes on those columns, as you would with normal tables. However, in TimescaleDB these indexes should be compound and include the time dimension column. You might drop the automatically created index on the time dimension after you have created the new indexes.
As usual, when creating new indexes, take into consideration that indexes occupy additional space and require more processing resources to maintain.
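A minimal sketch of the compound-index idea, assuming a hypothetical `readings` table with `device_id` and `ts` columns. SQLite is used as a stand-in so the example runs without a server; it is not TimescaleDB, and its planner differs from PostgreSQL's. It only shows that a filter on another column plus a time range can be served by one index that contains both.

```python
import sqlite3

# SQLite stands in for PostgreSQL/TimescaleDB so the sketch runs anywhere.
# The table and column names (readings, device_id, ts) are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE readings (ts INTEGER, device_id INTEGER, value REAL)")
conn.executemany(
    "INSERT INTO readings VALUES (?, ?, ?)",
    [(t, t % 10, float(t)) for t in range(1000)],
)

# Compound index: the filter column first, the time column second.
conn.execute("CREATE INDEX idx_device_ts ON readings (device_id, ts)")

query = "SELECT value FROM readings WHERE device_id = ? AND ts >= ?"

# Ask the planner how it would run the query; the last field of each
# plan row is the human-readable description.
plan = conn.execute("EXPLAIN QUERY PLAN " + query, (3, 900)).fetchall()
plan_text = " ".join(row[-1] for row in plan)
print(plan_text)

rows = conn.execute(query, (3, 900)).fetchall()
print(len(rows))  # 10 matching rows: ts = 903, 913, ..., 993
```

On a TimescaleDB hypertable the analogous statement would look something like `CREATE INDEX ON readings (device_id, time DESC);`, with the names adjusted to your schema.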
Timescale has a blog post with advice on adding indexes.
Today I read on Hacker News about BRIN indexing with PostgreSQL.
We are working with TimescaleDB for large and simple sensor data series.
Does BRIN indexing in TimescaleDB give additional value or do TimescaleDB features make BRIN indexing obsolete?
TimescaleDB is just a thin layer that speeds up inserting into partitions; to the best of my knowledge, it doesn't boost your queries.
BRIN indexes are only useful in the case when the logical ordering of the rows according to a column is the exact same (or the exact opposite) of the physical ordering of the rows.
In practice, that means that rows have to be inserted in the same order as the column in question (e.g., earlier timestamps get inserted before later ones), and there is never an UPDATE or DELETE.
If this is the case, you can use a BRIN index on that column, which will take almost no space.
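To illustrate why physical ordering matters, here is a toy model of a block range index in plain Python. It is a simplification, not PostgreSQL's implementation: each "block range" is a fixed run of adjacent rows summarised by its minimum and maximum, and a query has to visit every range whose summary overlaps the searched interval. The function names and sizes are made up for the example.

```python
import random

def build_block_ranges(values, rows_per_range):
    """Summarise each run of physically adjacent rows by its (min, max)."""
    chunks = (
        values[i:i + rows_per_range]
        for i in range(0, len(values), rows_per_range)
    )
    return [(min(chunk), max(chunk)) for chunk in chunks]

def ranges_to_scan(summaries, low, high):
    """Count block ranges whose [min, max] overlaps the interval [low, high]."""
    return sum(1 for lo, hi in summaries if hi >= low and lo <= high)

# Rows inserted in timestamp order: physical order equals logical order.
timestamps = list(range(10_000))

# The same rows, but stored in an order unrelated to the timestamp.
shuffled = timestamps[:]
random.Random(0).shuffle(shuffled)

ordered_hits = ranges_to_scan(build_block_ranges(timestamps, 100), 2_000, 2_099)
shuffled_hits = ranges_to_scan(build_block_ranges(shuffled, 100), 2_000, 2_099)

# With ordered data a single range matches; with shuffled data almost
# every range has to be visited, so the tiny index saves nothing.
print(ordered_hits, shuffled_hits)
```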
I'm a newbie on database design hoping to learn by experimenting and implementing, and I'm sure some version of this question has been asked within database design in general, but this is specific to Tableau.
I have a few dashboards that draw from a PostgreSQL database table containing several million rows of data. Rerendering views is quite slow (i.e., if I select a different parameter, Tableau's Executing SQL query popup appears and often takes several minutes to complete).
I debugged this by using the performance recording option within Tableau, exporting the SQL queries that Tableau is using to a text file, and then using EXPLAIN ANALYZE to figure out what exactly the bottlenecks were. Unfortunately, I can't post the SQL queries themselves, but I've created a contrived case below to be as helpful as possible. Here is what my table looks like currently. The actual values being rendered by Tableau are in green, and the columns that I have foreign key references on are in yellow:
I see within the query plan that there are a lot of costly bitmap heap scans implementing the filters I am using on the Tableau frontend on neighborhood_id, view, animal, date_updated, animal_name.
I've attempted to place a multicolumn index on these fields, but upon rerunning the queries, it does not look like the PG query planner is electing to use these indices.
So my proposed solution is to create foreign key references for each of these fields (neighborhood_id, view, animal, date_updated, animal_name). Again, yellow represents a FK reference:
My hope is that these FK references will force the query planner to use an index scan instead of a sequential scan or bitmap heap scan. However, my questions are:
1. Before, all the data was more or less stored in this one table, with two joins to the shelter and age_of_animal tables. Now, this table will be joined to 8 smaller subtables. Will these joins drastically reduce performance? The subtables are quite small (i.e., the animal table will have only 40 entries).
2. I know the question is difficult to answer without seeing the actual query and query plan, but what are some high-level reasons why the query planner would elect not to use an index? I've read through some articles like "Why Postgres Won't Always Use An Index", but mostly they refer to cases where it's a small table and a simple query, where the cost of the index lookup is greater than simply traversing the rows. I don't think that applies to my case though: I have millions of rows and a complex filter on 5+ columns.
3. Is the PG query planner any more likely to use multicolumn indices on a collection of foreign key columns versus regular columns? I know that PG does not automatically add indices on foreign keys, so I imagine I'll still need to add indices after creating the foreign key references.
Of course, the answers to my questions could be "Why don't you just try it and see?", but in this case refactoring such a large table is quite costly and I'd like some intuition on whether it's worth the cost prior to undertaking it.
I have 200+ million records in a PostgreSQL 9.5 table. Almost all queries are analytical. To optimize query performance I have so far tried indexing, and it seems that is not sufficient. What other options do I need to look into?
Depending on the conditions in your WHERE clause, create a partitioned table (https://www.postgresql.org/docs/10/static/ddl-partitioning.html).
This can reduce query cost drastically. Also, if the WHERE clause contains a certain fixed value, create a partial index on the partitioned table.
An important point: check which columns the WHERE clause filters on and order the columns of a multicolumn index to match, with the most commonly filtered columns first.
You should upgrade to PostgreSQL v10 so that you can use parallel query.
That enables you to run sequential and index scans with several background workers in parallel, which can speed up these operations on large tables.
A good database layout, good indexing, lots of RAM and fast storage are also important factors for good performance of analytical queries.
If the analysis involves a lot of aggregation, consider materialized views to store the aggregates. Materialized views do take up space and they need to be refreshed too. But they are very useful for data aggregation.
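A rough sketch of the trade-off, using Python's built-in `sqlite3` module. SQLite has no materialized views, so a plain summary table stands in for PostgreSQL's `CREATE MATERIALIZED VIEW`, and rebuilding it stands in for `REFRESH MATERIALIZED VIEW`. The table names (`sales`, `sales_by_region`) are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount INTEGER)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("north", 10), ("north", 20), ("south", 5)],
)

def refresh_summary(conn):
    """Rebuild the stored aggregate (stand-in for REFRESH MATERIALIZED VIEW)."""
    conn.execute("DROP TABLE IF EXISTS sales_by_region")
    conn.execute(
        "CREATE TABLE sales_by_region AS "
        "SELECT region, SUM(amount) AS total FROM sales GROUP BY region"
    )

def read_summary(conn):
    """Read the precomputed totals without touching the base table."""
    return dict(conn.execute("SELECT region, total FROM sales_by_region"))

refresh_summary(conn)
before = read_summary(conn)   # {'north': 30, 'south': 5}

# New base rows are invisible to the stored aggregate until it is refreshed.
conn.execute("INSERT INTO sales VALUES ('south', 7)")
stale = read_summary(conn)    # still {'north': 30, 'south': 5}

refresh_summary(conn)
after = read_summary(conn)    # {'north': 30, 'south': 12}
print(before, stale, after)
```

Reads against the summary are cheap because the aggregation has already been done, but the result is only as fresh as the last refresh.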
Many posts, like this Stack Overflow link, claim that there is no concept of a clustered index in PostgreSQL. However, the PostgreSQL documentation contains something similar. A few people claim it is similar to a clustered index in SQL Server.
Do you know what the exact difference between these two is, if there is any?
A clustered index or index organized table is a data structure where all the table data are organized in index order, typically by organizing the table in a B-tree structure.
Once a table is organized like this, the order is automatically maintained by all future data modifications.
PostgreSQL does not have such clustering indexes. What the CLUSTER command does is rewrite the table in the order of the index, but the table remains a fundamentally unordered heap of data, so future data modifications will not maintain that index order.
You have to CLUSTER a PostgreSQL table regularly if you want to maintain an approximate index order in the face of data modifications to the table.
Clustering in PostgreSQL can improve performance, because tuples found during an index scan will be close together in the heap table, which can turn random access to the heap into faster sequential access.
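The difference can be modelled in a few lines of Python. This is only an analogy under simplifying assumptions: a list that is appended to plays the role of the PostgreSQL heap, a one-off `sort()` plays the role of `CLUSTER`, and a list kept ordered with `bisect.insort` plays the role of an index-organized table.

```python
import bisect

def is_sorted(rows):
    """True if every row is <= the row physically stored after it."""
    return all(a <= b for a, b in zip(rows, rows[1:]))

# Heap table: rows sit wherever they were written.
heap = [42, 7, 19, 3, 88]

heap.sort()                          # one-off rewrite in index order, like CLUSTER
sorted_after_cluster = is_sorted(heap)

heap.append(10)                      # a later INSERT does not respect that order
sorted_after_insert = is_sorted(heap)

# Index-organized table: every insert is placed at its ordered position.
organized = sorted([42, 7, 19, 3, 88])
bisect.insort(organized, 10)
organized_still_sorted = is_sorted(organized)

print(sorted_after_cluster, sorted_after_insert, organized_still_sorted)
# True False True
```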
I am new to Redshift and am migrating from Oracle to Redshift.
One of the Oracle tables has around 60 indexes. AWS recommends, as good practice, having around 6 compound sort keys.
How would these 60 Oracle indexes translate to Redshift sort keys? I understand there is no automated conversion and that I can't have all 60 of them as compound sort keys. How is this conversion usually approached?
In Oracle we can keep adding indexes to the same table, and the queries and reports can use them. But in Redshift, changing sort keys means recreating the table. How do we make all queries, which use different filter columns and join columns on the same table, perform at their best?
Thanks
Redshift is a columnar database, and it doesn't have indexes in the Oracle sense at all.
You can think of Redshift's compound sort key (not interleaved) as IOT in Oracle (index organized table), with all the data sorted physically by this compound key.
If you create an interleaved sort key on x columns, it will act, in some manner, as a separate index on each of the x columns.
In any case, being a columnar database, Redshift can outperform Oracle in many aggregation queries due to its compression and data structure. The main factors that affect performance in Redshift are distribution style and key, sort key, and column encoding.
If you can't fit all your queries with one table structure, you can duplicate the table with a different structure but the same data. This approach is widely used in big data columnar databases (for example, projections in Vertica) and buys performance at the cost of storage.
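A toy model of why a second copy with a different sort order can help. It assumes a simplified picture of block pruning: rows are stored in sorted order, split into fixed-size blocks, and each block keeps the minimum and maximum of every column so that blocks which cannot contain a value are skipped. The column names (`region`, `day`) and the sizes are invented for the example, and real Redshift storage is considerably more involved.

```python
import itertools

def zone_maps(rows, block_size):
    """For each block of rows, record (min, max) of every column."""
    maps = []
    for start in range(0, len(rows), block_size):
        block = rows[start:start + block_size]
        maps.append([(min(col), max(col)) for col in zip(*block)])
    return maps

def blocks_read(maps, column, value):
    """Count blocks whose min/max range for `column` could contain `value`."""
    return sum(1 for m in maps if m[column][0] <= value <= m[column][1])

REGION, DAY = 0, 1
# Hypothetical table: 10 regions x 100 days = 1000 rows of (region, day).
rows = list(itertools.product(range(10), range(100)))

# Copy 1: compound sort key (region, day).
by_region = zone_maps(sorted(rows), 100)
# Copy 2: same data, compound sort key (day, region).
by_day = zone_maps(sorted(rows, key=lambda r: (r[DAY], r[REGION])), 100)

region_on_region_copy = blocks_read(by_region, REGION, 3)  # 1 of 10 blocks
day_on_region_copy = blocks_read(by_region, DAY, 50)       # all 10 blocks
day_on_day_copy = blocks_read(by_day, DAY, 50)             # 1 of 10 blocks
print(region_on_region_copy, day_on_region_copy, day_on_day_copy)
```

A filter on the leading sort column prunes well, while a filter on the trailing column alone does not, which is why a query pattern that filters on a different column may be better served by a copy sorted on that column.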
Please review this page with several useful tips about Redshift performance:
https://aws.amazon.com/blogs/big-data/top-10-performance-tuning-techniques-for-amazon-redshift/
First a few key points
Redshift <> Oracle
Redshift does not have indexes, Redshift Sort Keys <> Oracle Indexes.
Hopefully, you are not expecting Redshift to replace Oracle for your OLTP workload. Most of those 60 indexes are likely there to optimise OLTP-type workloads.
Max Redshift sortkeys per table = 1
You cannot sort your Redshift data in more than one way! The sort key orders your table data. It is not an index.
You can specify an interleaved or a compound sort key.
Query Tuning
Hopefully, you will be using Redshift for analytical queries. You should define sort and distribution keys based on your expected queries. You should follow the best practice here and the tutorial here
Tuning Redshift is partly an art; you will need to use trial and error!
If you want specific guidance on this, please can you edit your question to be specific on what you are doing?