Many posts, like this Stack Overflow link, claim that PostgreSQL has no concept of a clustered index. However, the PostgreSQL documentation describes something similar, and a few people claim it is comparable to a clustered index in SQL Server.
Do you know what the exact difference between these two is, if there is any?
A clustered index or index organized table is a data structure where all the table data are organized in index order, typically by organizing the table in a B-tree structure.
Once a table is organized like this, the order is automatically maintained by all future data modifications.
PostgreSQL does not have such clustered indexes. What the CLUSTER command does is rewrite the table in the order of the index, but the table remains a fundamentally unordered heap of data, so future data modifications will not maintain that index order.
You have to CLUSTER a PostgreSQL table regularly if you want to maintain an approximate index order in the face of data modifications to the table.
Clustering in PostgreSQL can improve performance, because tuples found during an index scan will be close together in the heap table, which can turn random access to the heap into faster sequential access.
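The difference described above can be sketched with a toy model. This is only an illustration under simplifying assumptions, not how either database is implemented: the class names are hypothetical, the "heap" always appends new rows at the end (a real heap may also reuse free space elsewhere), and the index-organized table is modeled as a sorted list rather than a B-tree.

```python
import bisect


class HeapTable:
    """Toy unordered heap: new rows simply go at the end."""

    def __init__(self):
        self.rows = []

    def insert(self, key):
        self.rows.append(key)

    def cluster(self):
        # One-off physical rewrite in key order, in the spirit of CLUSTER.
        self.rows.sort()


class IndexOrganizedTable:
    """Toy index-organized table: every insert keeps the rows in key order."""

    def __init__(self):
        self.rows = []

    def insert(self, key):
        bisect.insort(self.rows, key)


def is_sorted(rows):
    return all(a <= b for a, b in zip(rows, rows[1:]))


heap = HeapTable()
iot = IndexOrganizedTable()
for key in [42, 7, 19, 3]:
    heap.insert(key)
    iot.insert(key)

heap.cluster()
sorted_after_cluster = is_sorted(heap.rows)

# A later modification: the heap no longer follows key order,
# while the index-organized table still does.
heap.insert(5)
iot.insert(5)
heap_sorted_after_insert = is_sorted(heap.rows)
iot_sorted_after_insert = is_sorted(iot.rows)
```

After the last insert the toy heap holds 3, 7, 19, 42, 5, so the order only comes back by running cluster() again, which is the "CLUSTER regularly" point made above.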
I'm building several very large data tables on Amazon Redshift that should hold data covering several frequently queried properties with the relevant metrics.
We're using an even distribution style ("diststyle even") to have all the nodes participate in query calculations, but I am not certain about the length of the sortkey.
It definitely should be compound, since every query will first filter on date and network, but after that level I have about 7 additional relevant factors that can be queried on.
All the examples I've seen use a compound sort key of 2-3 fields, 4 at most.
My question is: why not use a sortkey that includes all the key fields in the table? What are the downsides of having a long sortkey?
VACUUM will also take longer if you have several sort keys.
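A rough way to see why trailing columns of a compound sort key tend to add little is to model per-block min/max metadata (zone maps), which Redshift uses to skip blocks. The sketch below is a toy model only: the three integer columns (standing in for date, network and one further factor), the block size and the helper names are all assumptions made for illustration.

```python
from itertools import product

BLOCK_SIZE = 100  # rows per block; hypothetical, real blocks hold far more

# Hypothetical columns: (date, network, campaign), each a small integer.
# Sorting the tuples gives the same order as a compound sort key on all three.
rows = sorted(product(range(10), range(10), range(10)))

blocks = [rows[i:i + BLOCK_SIZE] for i in range(0, len(rows), BLOCK_SIZE)]


def zone_map(block, col):
    """Min and max of one column within one block."""
    values = [row[col] for row in block]
    return min(values), max(values)


def blocks_to_read(col, wanted):
    """Count blocks whose min/max range for `col` could contain `wanted`."""
    count = 0
    for block in blocks:
        low, high = zone_map(block, col)
        if low <= wanted <= high:
            count += 1
    return count


leading = blocks_to_read(0, 3)   # filter on the first sort key column only
trailing = blocks_to_read(2, 3)  # filter on the last sort key column only
```

In this toy data a filter on the leading column touches 1 of 10 blocks, while a filter on only the last column cannot skip any block, because every block spans that column's full range. Extra columns at the end of the key still have to be sorted and maintained, which fits the VACUUM remark above.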
Based on your experience, is there any practical limit on the number of indexes per table in PostgreSQL? In theory there is not, as the documentation states: "Maximum Indexes per Table Unlimited". But:
Is it true that the more indexes you have, the slower the queries? Does it make a difference if I have tens vs. hundreds or even thousands of indexes? I am asking after reading the documentation on PostgreSQL's partial indexes, which makes me think of some very creative solutions that, however, require a lot of indexes.
There is overhead in having a high number of indexes in a few different ways:
Space consumption, although this would be lower with partial indexes of course.
Query optimisation, through making the choice of optimiser plan potentially more complex.
Table modification time, through the additional work in modifying indexes when a new row is inserted, or an existing row is deleted or modified.
I tend by default to go heavy on indexing as:
Space is generally pretty cheap
Queries with bound variables only need to be optimised once
Rows generally have to be found much more often than they are modified, so it's generally more important to design the system for efficiently finding rows than it is for reducing overhead in making modifications to them.
The impact of missing a required index can be very high, even if the index is only required occasionally.
I've worked on an Oracle system with denormalised reporting tables having over 200 columns with 100 of them indexed, and it was not a problem. Partial indexes would have been nice, but Oracle does not support them directly (you use a rather inconvenient CASE hack).
So I'd go ahead and get creative, as long as you're aware of the pros and cons, and preferably you would also measure the impact that you're having on the system.
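For readers who have not used partial indexes, here is a minimal sketch of the idea. It uses SQLite through Python's standard library purely because it runs anywhere; PostgreSQL accepts the same CREATE INDEX ... WHERE shape. The table, column and index names are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, status TEXT)"
)

# 1000 rows, of which only every 50th is still 'open'.
conn.executemany(
    "INSERT INTO orders (customer_id, status) VALUES (?, ?)",
    [(i % 20, "open" if i % 50 == 0 else "closed") for i in range(1000)],
)

# Partial index: only rows matching the WHERE predicate get an index entry,
# so the index covers 20 rows here instead of 1000.
conn.execute(
    "CREATE INDEX idx_open_orders ON orders (customer_id) WHERE status = 'open'"
)

# The query repeats the index predicate as a literal so the planner can
# recognise that the partial index applies.
query = "SELECT id FROM orders WHERE status = 'open' AND customer_id = 10"
open_ids = sorted(row[0] for row in conn.execute(query))

# Ask the planner how it would run the query. The exact wording differs
# between SQLite versions, so it is printed rather than parsed.
for row in conn.execute("EXPLAIN QUERY PLAN " + query):
    print(row[-1])
```

Printing the plan is also a cheap first step towards the "measure the impact" advice above: it shows whether a given index is considered at all before you pay its maintenance cost.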
We know that there is the concept of a primary key in traditional RDBMSs. This primary key is basically used to index records in the table on this particular key for faster retrieval. I know that there are NoSQL stores like Cassandra which offer secondary key indexing, but is there a way, or an existing DB, which follows exactly the same schema as a traditional RDBMS (i.e. a DB split into various tables to hold different kinds of data) but provides indexing on 2 or more keys?
An example of a use case for the same is:
There is a one-to-one mapping between 10 different people's names and their ages. Now if I keep this information in a table with the name of the person as the primary key, then retrieving the age given the name of a person is faster than retrieving the name given the age of the person. If I could index both columns, then the second case would also be faster.
An alternative to doing this with a traditional RDBMS would be to have 2 tables with the same data, the only difference being that the primary key in one of them is the name and in the other it is the age, but that would waste a large amount of space when there are many records.
It is sad to see no response on this question for a very long time. In all this time of doing some research on the same, I found FastBit Index as one of the plausible solutions for indexing virtually every column of the records in a table. It also provides SQL-like semantics for querying data and delivers performance of the order of a few milliseconds when queried on millions of rows of data (of the order of GBs).
Please suggest if there are any other NoSQL or SQL DBs which can deliver similar functionality with a good performance level.
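For the name/age use case itself, a minimal sketch: relational databases generally let you add a secondary index on a non-key column, so a second copy of the table is not needed. SQLite is used here via Python's standard library only because it is easy to run; the table and index names are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE people (name TEXT PRIMARY KEY, age INTEGER NOT NULL)")
conn.executemany(
    "INSERT INTO people (name, age) VALUES (?, ?)",
    [("Ann", 31), ("Bob", 45), ("Cho", 27), ("Dev", 52)],
)

# Secondary index: stores age values with a reference back to each row,
# not a second full copy of the table.
conn.execute("CREATE INDEX idx_people_age ON people (age)")

# Lookup by primary key.
age_of_bob = conn.execute(
    "SELECT age FROM people WHERE name = ?", ("Bob",)
).fetchone()[0]

# Lookup by the secondary index.
name_for_27 = conn.execute(
    "SELECT name FROM people WHERE age = ?", (27,)
).fetchone()[0]

# The last column of each EXPLAIN QUERY PLAN row describes the access path.
plan = " ".join(
    row[-1]
    for row in conn.execute(
        "EXPLAIN QUERY PLAN SELECT name FROM people WHERE age = 27"
    )
)
```

The plan text names idx_people_age, which indicates that the lookup by age goes through the index rather than scanning the whole table.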
Greetings!
I have the following problem. I have a table with a huge number of rows which I need to search, and then group the search results by many parameters. Let's say the table is
id, big_text, price, country, field1, field2, ..., fieldX
And we run a request like this
SELECT .... WHERE
[use FULLTEXT index to MATCH() big_text] AND
[use some random clauses that anyway render indexes useless,
like: country IN (1,2,65,69) and price<100]
This will be displayed as search results, and then we need to take these search results and group them by a number of fields to generate search filters
(results) GROUP BY field1
(results) GROUP BY field2
(results) GROUP BY field3
(results) GROUP BY field4
This is a simplified case of what I need; the actual task at hand is even more problematic. For example, sometimes the first results query also does its own GROUP BY. An example of such functionality would be this site
http://www.indeed.com/q-sales-jobs.html
(search results plus filters on the left)
I've done, and am still doing, deep research on how MySQL functions, and at this point I don't see how this is possible in MySQL at all. Roughly speaking, a MySQL table is just a heap of rows lying on the HDD, and indexes are tiny versions of these tables sorted by the index field(s) and pointing to the actual rows. That's a super oversimplification of course, but the point is I don't see how it is possible to fix this at all, i.e. how to use more than one index and still do fast GROUP BYs (by the time the query reaches GROUP BY, the index is completely useless because of range searches and other things). I know that MySQL (or similar databases) has various helpful things such as index merges, loose index scans and so on, but this is simply not adequate: the queries above will still take forever to execute.
I was told that the problem can be solved by NoSQL, which makes use of some radically new ways of storing and dealing with data, including aggregation tasks. What I want to know is some quick schematic explanation of how it does this. I just want a quick glimpse so that I can see that it really does that, because at the moment I can't understand how it is possible at all. Data is still data and has to be placed in memory, and indexes are still indexes with all their limitations. If this is indeed possible, I'll then start studying NoSQL in detail.
PS. Please don't tell me to go and read a big book on NoSQL. I've already done this for MySQL only to find out that it is not usable in my case :) So I wanted to have some preliminary understanding of the technology before getting a big book.
Thanks!
There are essentially 4 types of "NoSQL", but three of the four are actually similar enough that an SQL syntax could be written on top of them (including MongoDB and its crazy query syntax [and I say that even though JavaScript is one of my favorite languages]).
Key-Value Storage
These are simple NoSQL systems like Redis that are basically a really fancy hash table. You have a value you want to get later, so you assign it a key and stuff it into the database. You can only query a single object at a time and only by a single key.
You definitely don't want this.
Document Storage
This is one step up above Key-Value Storage and is what most people talk about when they say NoSQL (such as MongoDB).
Basically, these are objects with a hierarchical structure (like XML files, JSON files, and any other sort of tree structure in computer science), but the values of different nodes on the tree can be indexed. They have a higher "speed" relative to traditional row-based SQL databases on lookup because they sacrifice performance on joining.
If you're looking up data in your MySQL database from a single table with tons of columns (assuming it's not a view/virtual table), and assuming you have it indexed properly for your query (that may be your real problem here), Document Databases like MongoDB won't give you any Big-O benefit over MySQL, so you probably don't want to migrate over for just this reason.
Columnar Storage
These are the most like SQL databases. In fact, some (like Sybase) implement an SQL syntax while others (Cassandra) do not. They store the data in columns rather than rows, so adding and updating are expensive, but most queries are cheap because each column is essentially implicitly indexed.
But, if your query can't use an index, you're in no better shape with a Columnar Store than a regular SQL database.
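The row-versus-column trade-off described above can be sketched in a few lines. This is a toy model with hypothetical sample data; real column stores add compression, block metadata and much more.

```python
# Hypothetical sample data: the same three records in two layouts.
row_store = [
    {"id": 1, "country": 65, "price": 80},
    {"id": 2, "country": 2, "price": 120},
    {"id": 3, "country": 65, "price": 95},
]

# Column layout: one contiguous list per column.
column_store = {
    "id": [1, 2, 3],
    "country": [65, 2, 65],
    "price": [80, 120, 95],
}

# Row layout: every whole record is visited just to read one field.
total_from_rows = sum(record["price"] for record in row_store)

# Column layout: only the "price" list is read.
total_from_columns = sum(column_store["price"])


def append_record(store, record):
    """Adding one record touches every column list, which is why writes cost more."""
    for column, value in record.items():
        store[column].append(value)


append_record(column_store, {"id": 4, "country": 1, "price": 60})
```

Both totals are the same; the difference is how much data has to be touched to compute them, and how many separate places a single insert has to update.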
Graph Storage
Graph Databases expand beyond SQL. Anything that can be represented by graph theory, including Key-Value, Document, and SQL databases, can be represented by a Graph Database like Neo4j.
To do this, Graph Databases make joins as cheap as possible (as opposed to Document Databases), but they have to, because even a simple "row" query would require many joins to retrieve.
A table-scan type query would probably be slower than a standard SQL database because of all of the extra joins to retrieve the data (which is stored in a disjointed fashion).
So what's the solution?
You've probably noticed that I haven't answered your question, exactly. I'm not saying "you're finished," but the real problem is how the query is being performed.
Are you absolutely sure you can't index your data better? There are things such as multiple-column keys that could improve the performance of your particular query. Microsoft's SQL Server has a full-text index type that would be applicable to the example you provided, and PostgreSQL can emulate it.
The real advantage most NoSQL databases have over SQL databases is Map-Reduce -- specifically, the integration of a full Turing-complete language that runs at high speed and that query constraints can be written in. The querying function can be written to quickly "fail out" of non-matching records or quickly return with a success on records that meet "priority" requirements, while doing the same in SQL is a bit more cumbersome.
Finally, however, the exact problem you're trying to solve, text search with optional filtering parameters, is more generally known as a search engine problem, and there are very specialized engines to handle it. I'd recommend Apache Solr to perform these queries.
Basically, dump the text field, the "filter" fields, and the primary key of the table into Solr, let it index the text field, run the queries through it, and if you need the full record after that, query your SQL database for the specific primary key you got from Solr. It uses some more memory and requires a second process, but will probably best suit your needs here.
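To give a rough picture of what "search results plus filters" amounts to, here is a conceptual sketch in plain Python. It is not Solr's API or implementation: the sample rows and the helper names search and facet_counts are hypothetical, and a substring test stands in for real full-text matching. The point is only that the filter counts for every field can come from one pass over the matched results instead of one GROUP BY query per field.

```python
from collections import Counter

# Hypothetical rows; field names echo the example table in the question.
rows = [
    {"id": 1, "big_text": "senior sales manager", "price": 90, "country": 1,
     "field1": "full-time", "field2": "NY"},
    {"id": 2, "big_text": "sales assistant", "price": 50, "country": 2,
     "field1": "part-time", "field2": "NY"},
    {"id": 3, "big_text": "software engineer", "price": 70, "country": 1,
     "field1": "full-time", "field2": "SF"},
    {"id": 4, "big_text": "regional sales lead", "price": 150, "country": 65,
     "field1": "full-time", "field2": "SF"},
    {"id": 5, "big_text": "sales intern", "price": 20, "country": 69,
     "field1": "part-time", "field2": "LA"},
    {"id": 6, "big_text": "inside sales rep", "price": 60, "country": 3,
     "field1": "full-time", "field2": "LA"},
]


def search(rows, term, countries, max_price):
    """Stand-in for the full-text match plus the extra filter clauses."""
    return [
        r for r in rows
        if term in r["big_text"]
        and r["country"] in countries
        and r["price"] < max_price
    ]


def facet_counts(results, fields):
    """One pass over the result set, keeping one counter per facet field."""
    counts = {field: Counter() for field in fields}
    for r in results:
        for field in fields:
            counts[field][r[field]] += 1
    return counts


results = search(rows, "sales", {1, 2, 65, 69}, 100)
facets = facet_counts(results, ["field1", "field2"])
```

Here three rows match (ids 1, 2 and 5), and a single loop over them yields the counts for both filter fields at once.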
Why all of this text to get to this answer?
Because the title of your question doesn't really have anything to do with the content of your question, so I answered both. :)