Should I create an index on a column with repetitive values and where in lookup - postgresql

I have a materialized view (which is very much a table) where I need to make where in kind of queries.
The column I want to query (say view_id) definitely has repetitions (15-20).
The where in queries would also be very large i.e - it would contain a lot of view_id to query.
Should I go ahead and create an index on this column?
Will it give me some performance improvements?
I have another column which would help help me have a multi column index(unique). Should this be a better option?

With questions such as these on performance, there is no substitute for testing it with your exact case. There's little harm in trying it out (even on a production system, but utilize a test system if you can!), other than perhaps slowing performance until you undo what you did. Postgres makes this kind of tinkering safe.
#tim-biegeleisen's first comment is spot on: with your setup, your cardinality is reduced, but that doesn't mean it's not a win.
In short, try it and see. There is no better answer you will get than what your own dataset and access patterns will give you.

Related

Why does paginating with offset using PSQL make sense?

I've been looking into pagination (paginate by timestamp) with a PSQL dbms. My approach currently is to build a b+ index to greatly reduce the cost of finding the start of the next chunk. But everywhere I look in tutorials and on NPM modules like express-paginate (https://www.npmjs.com/package/express-paginate), people seem to get chunks using offset one way or the other or fetching all the data anyways but simply sending them in chunks which to me doesn't seem to be a complete optimization that pagination is for.
I can see that they're still making an optimization by lazy loading and streaming the chunks (thus saving bandwidth and any download/processing time on the client-side), but since offset on psql still requires scanning previous rows. In the worst case where a user wants to view all the data, doesn't this approach have a very high server cost since if you have per say n chunks, you're accessing the first chunk n times, the second chunk n-1 times, the third chunk n-2 times, etc. I understand that this is really in terms of IOs so it's not that expensive but it still bothers me?
Am I missing something very obvious here? I feel like I am because there seems to be a lot more established and experienced engineers who seem to be using this approach. I'm guessing there is some part of the equation or mechanism that I'm just missing from my understanding.
No, you understand this quite well.
The reason why so many people and tools still advocate pagination with OFFSET and LIMIT (or FETCH FIRST n ROWS ONLY, to use the standard's language) is that they don't know a lot about databases. It is easy to understand LIMIT and OFFSET even if you the word “index” to you has no other meaning than ”the last pages in a book”.
There is another reason: to implement key set pagination, you must have an ORDER BY clause in your query, that ORDER BY clause has to contain a unique column, and you have to create an index that supports that ordering.
Moreover, your database has to be able to handle conditions like
... WHERE (name, id) > ('last_found', 42)
and support a multi-column index scan for them.
Since many tools strive to support several database systems, they are likely to go for the simple but inefficient method that works with every query on most database systems.

postgres many tables vs one huge table

I am using postgresql db.
my application manages many objects of the same type.
for each object my application performs intense db writing - each object has a line inserted to db at least once every 30 seconds. I also need to retrieve the data by object id.
my question is how it's best to design the database? use one huge table for all the objects (slower inserts) or use table for each object (more complicated retrievals)?
Tables are meant to hold a huge number of objects of the same type. So, your second option, that is one table per object, doesn't seem to look right. But of course, more information is needed.
My tip: start with one table. If you run into problems - mainly performance - try to split it up. It's not that hard.
Logically, you should use one table.
However, so called "write amplification" problem exhibited by PostgreSQL seems to have been one of the main reasons why Uber switeched from PostgreSQL to MySQL. Quote:
"For tables with a large number of secondary indexes, these
superfluous steps can cause enormous inefficiencies. For instance, if
we have a table with a dozen indexes defined on it, an update to a
field that is only covered by a single index must be propagated into
all 12 indexes to reflect the ctid for the new row."
Whether this is a problem for your workload, only measurement can tell - I'd recommend starting with one table, measuring performance, and then switching to multi-table (or partitioning, or perhaps switching the DBMS altogether) only if the measurements justify it.
A single table is probably the best solution if you are certain that all objects will continue to have the same attributes.
INSERT does not get significantly slower as the table grows – it is the number of indexes that slows down data modification.
I'd rather be worried about data growth. Do you have a design for getting rid of old data? Big DELETEs can be painful; sometimes partitioning helps.

Analyse Database Table and Usage

I just got into a new company and my task is to optimize the Database performance. One possible (and suggested) way would be to use multiple servers instead of one. As there are many possible ways to do that, i need to analyse the DB first. Is there a tool with which i can measure how many Inserts/Updates and Deletes are performed for each table?
I agree with Surfer513 that the DMV is going to be much better than CDC. Adding CDC is fairly complex and will add a load to the system. (See my article here for statistics.)
I suggest first setting up a SQL Server Trace to see which commands are long-running.
If your system makes heavy use of stored procedures (which hopefully it does), also check out sys.dm_exec_procedure_stats. That will help you to concentrate on the procedures/tables/views that are being used most-often. Look at execution_count and total_worker_time.
The point is that you want to determine which parts of your system are slow (using Trace) so that you know where to spend your time.
One way would be to utilize Change Data Capture (CDC) or Change Tracking. Not sure how in depth you are looking for with this, but there are other simpler ways to get a rough estimate (doesn't look like you want exacts, just ballpark figures..?).
Assuming that there are indexes on your tables, you can query sys.dm_db_index_operational_stats to get data on inserts/updates/deletes that affect the indexes. Again, this is a rough estimate but it'll give you a decent idea.

A question about indexes regarding to the gain of inserts & updates in database

I’m having a question about the fine line between the gain of an index to a table there is growing steadily in size every month and the gain of queries with an index.
The situation is, that I’ve two tables, Table1 and Table2. Each table grows slowly but regularly each month (with about 100 new rows for Table1 and a couple of rows for Table2).
My concrete question is whether to have an index or to drop it. I’ve made some measurement that an covering index on Table2 improve my SELECT queries and some rather much but again, I’ve to consider the pros and cons but having a really hard time to decide.
For Table1 it might not be necessary to have an index because the SELECT queries there is not that common.
I would appreciate any suggestion, tips or just good advice to what is a good solution.
By the way, I’m using IBM DB2 version 9.7 as my Database system
Sincerely
Mestika
Any additional index will make your inserts slower and your queries faster.
To take a smart decision, you will have to measure exactly by how much, with the amount of data that you expect to see. If you have multiple clients accessing the database at the same time, it may make sense to write a small multithreaded application that simulates the maximum load, both for inserts and for queries.
Your results will depend on the nature of your data and on the hardware that you are running. If you want to know the best answer for your usecase, there is no way around testin accurately yourself with your data and your hardware.
Then you will have to ask yourself:
Which query performance do I need?
If the query performance is good enough without the index anyway, easy: Don't add the index!
Which insert performance do I need?
Can it drop below the needed limit with the additional index? If not, easy: Add the index!
If you discover that you absolutely need the index for query performance and you can't get the required insert performance with the index, you may need to buy better hardware. Solid state discs can do wonders for database servers and they are getting affordable.
If your system is running fine for everyone anyway, worry less, let it run as is.

Database Optimization techniques for amateurs

Can we get a list of basic optimization techniques going (anything from modeling to querying, creating indexes, views to query optimization). It would be nice to have a list of these, one technique per answer. As a hobbyist I would find this to be very useful, thanks.
And for the sake of not being too vague, let's say we are using a maintstream DB such as MySQL or Oracle, and that the DB will contain 500,000-1m or so records across ~10 tables, some with foreign key contraints, all using the most typical storage engines (eg: InnoDB for MySQL). And of course, the basics such as PKs are defined as well as FK contraints.
Learn about indexes, and use them properly. Generally speaking*, follow these guidelines:
Every table should have a clustered index
Fields used for filters and sorts are good candidates for indexing
More selective fields are better candidates for indexing
For best performance on crucial queries, design "covering indexes" for those queries
Make sure your indexes are actually being used, and remove those that aren't
If your table has 15 fields, and you make 15 indexes, each with only a single field, you're doing it wrong :)
*There are some exceptions to these rules if you know what you're doing. My experience is Microsoft SQL Server, but I would presume most of this advice would still apply to a different RDMS.
IMO, by far the best optimization is to have the data model fit the problem domain for which it was built. When it does not, the resulting symptom is difficult-to-write or convoluted queries in order to get the information desired and that typically rears itself when reports are built against the database. Thus, in designing a database it helps to have an idea as to the types and nature of the information, such as reports, that the users will want from the system.
When talking database design, check out the database normalization, e.g. the wikipedia article: Normal forms.
If you have a good design and still you need to optimize for performance, try Denormalisation.
If you have specific needs which are not covered by relational model efficiently, look at other models covered by the term NoSQL.
Some query/schema optimizations:
Be mindful when using DISTINCT or GROUP BY. I find that many new developers will use DISTINCT in places where it really is not needed or could be rewritten more efficiently using an Exists statement or a derived query.
Be mindful of Left Joins. All too often I find new SQL developers will ignore the schema in place and use Left Joins where they really are not necessary. For example:
Select
From Orders
Left Join Customers
On Customers.Id = Orders.CustomerId
If Orders.CustomerId is a required column, then it is not necessary to use a left join.
Be a student of new features. Currently, MySQL does not support common-table expressions which means that some types of queries are cumbersome and probably slower to write than they would be if CTEs were supported. However, that will not be true forever. Keep up on new syntax features in MySQL which might be used to make existing queries more efficient.
You do not have to use surrogate keys everywhere. There might be tables better suited to an intelligent key (e.g. US State abbreviations, Currency Codes etc) which would enable developers to avoid additional joins in many cases.
If possible, find ways of archiving data to an OLAP or reporting server. The smaller you can make the production data, the faster it will run.
A design that concisely models your problem is always a good start. Overgeneralizing the data model can lead to performance problems. For example, I've heard reports of projects striving for uber-flexibility that use the RDBMS as a dumb "name/value" store - and resulting performance was appalling.
Once a good design is in place, then use the tools provided by the RDBMS to help it achieve good performance. Single field PKs (no composites), but composite business keys as an index with unique constraint, use of appropriate data types, e.g. using appropriate numeric types for numeric values rather than char or similar. Physical attributes of the hardware the RDBMS is running on should also be considered, since the bulk of query time is often disk I/O - but of course don't take this for granted - use a profiler to find out where the time is going.
Depending upon the update/query ratio, materialized views/indexed views can be useful in improving performance for slow running queries. A poor-man's alternative is to use triggers to invoke a procedure that populates the table with a result of a slow-running, infrequently-changed view.
Query optimization is a bit of a black art since it is often database-dependent, but some rules of thumb are given here - Optimizing SQL.
Finally, although possibly outside the intended scope of your question, use a good data access layer in your application, and avoid the temptation to roll your own - there are surely tested and performant implementations available for all major languages. Use of caching at the data access layer, middle tier and application layer can help improve performance considerably.
Do use less query whenever possible. Use "JOIN", and group your tables so that a single query gives your results.
A good example is the Modified Preorder Tree Transversal (MPTT) to get all of a tree node parents, ordered, in a single query.
Take a holistic approach to optimization.
Consider the impact of slow disks, network latency, lack of memory, and server load.