Creating the optimum index for my database - postgresql

I have a table in postgresql with the following information:
rawData (fileID integer references otherTable, lineNum integer, data1 double, ...)
When I am searching this table, I do so with the following query:
SELECT lineNum, data1, ...other data FROM rawData WHERE
fileID = ? AND data1 < ? ORDER BY lineNum;
In general, the data in this table is a number of entries for each fileID, and each fileID has lineNum from 0 to x, with lineNum never repeating for each fileID (but it does repeat for different fileID's). Then data1 is effectively a random number that may or may not overlap.
In order to speed up the reading of this data, I am trying to create an index on it, but am having trouble figuring out the best way to index it. Currently I am looking at one of the following two index methods, and am wondering which would be better for my search, or if there is another option that I haven't thought of that would be better than either of them.
index idea 1:
CREATE INDEX searchIndex ON rawData (fileID, data1, lineNum);
index idea 2:
CREATE INDEX searchIndex ON rawData (fileID, lineNum, data1);
Note that at this time, this and a search not constrained by data1 are the only searches that I run on this table, so I'm not too concerned about this index slowing down other searches.
Lastly, would I have to change my search query to use the index, or would it automatically use that index when I search the table?

You should look at using this instead:
CREATE INDEX searchIndex ON rawData (fileID, lineNum);
A few things:
In particular, as per docs, Indexes with more than three columns are unlikely to be helpful unless the usage of the table is extremely stylized.
Since your second search query requires filtering without the data1 column, keeping the second column lineNum should be sufficient (since you mention it would be quasi-random), and in the rare occurrence that there are repeats, table fetches should ensure correctness. But what this would mean is that the Index would be 1/3rd smaller in size, which is a big win (Think index small-enough to be in memory / index-only-scans etc.)

Either index can be used. Which is faster will depend on many things, like how many rows are in the table, how many lineNum there are per fileID, how selective the data1 < ? clause is, what your hardware is, what our config settings are, which version of PostreSQL you are using, what physical order the table rows lie in, etc.
The only way to know for sure is to try it with your own data on your own system and see.
I'd just build an index on (fileID, lineNum, data1), or even just (fileID, lineNum), because that seems more natural, and then forget about it. Most likely it will be fast enough. Once there is a demonstrable performance problem, than you will have the test case at hand which is needed to come to a real conclusion.

Related

PostgreSQL - Compare ts_vector fields

I have two tables in which I have data coming from two different sources. One of the field of each table contains the title of a movie, but for some reason out of my control, the titles are not always exactly the same.
So I use the ts_vector to get rid of all the minor differences (stop words, plurals and so on).
See an example here: http://sqlfiddle.com/#!17/5ccbc/3
My problem is how to compare the two ts_vector without taking into account the numberic values, but just the text content. If I compare directly the two fields, I only get the exact match between values, including position of each word. The only solution I have found is using the strip() function, that remove positions and weights from tsvector, leaving only the text content.
I was wondering if there is a fastest way to compare ts_vectors.
You could create in index on the stripped vector:
create index on tbl1 (strip(ts_title));
create index on tbl2 (strip(ts_title));
But given that your query has to fetch every row of each table, it is unlikely this would serve much of a point. Doing a merge join between the precomputed stripped vectors could be faster, but probably not once you include the overhead of building and maintaining the indexes. If the real WHERE clause is more restrictive (selecting only a few rows from one or the other of the tables) then please share the real query.

How to force postgresql to use particular index when all fields match

Consider I have a table T with fields a,b,c,d with two indices: first on a,b,c fields and second on a,b,d fields. Types of a,b,c and d are integer. Both indices are almost the same (on production they both have about 2Gb size, they have the same creation time and the same statistics of usage, table overall have about 60 millions rows).
I make two queries:
select * from T where a=... and b=... and c=...;
select * from T where a=... and b=... and d=...;
I expect that for the first query index on a,b,c fields is used and for the second index on a,b,d fields is used. However it's not the case and for both queries first index is used, but in second case with "filter"(I used expect analyze to gain this knowledge). For me such behavior is unacceptable, because in some circumstances number of entries in filter grows very fast and autovacuum/analyze (which actually helps the planner to use the right index) works too slow to prevent the unexpected latencies and downtime.
So my question is: how can I force postgresql not to use wrong index with filtering, but rather use the right index when all fields in query's 'where' and in that index match?
Finally I found the solution. It's not perfect and I will not mark it as the best one, however it works and could help someone.
What I have actually done is I have changed the indices, instead of a,b,c and a,b,d indices now I'm having c,a,b and d,a,b.
One problem appeared: I needed an index on 'a', because some queries rely on it. However when I add index solely on 'a', the problem from the first post appears again (the index on 'a' is used when planner thinks that its cost is lower than cost of c,a,b or d,a,b). So I decided to add new field to the table which is a copy (has the same data) of a field 'a', let's call it 'a1', and I added the index on this field. Now when I need to filter contents of 'a' somehow instead I'm filtering on 'a1' field. It's weird, but I couldn't find another solution.

100 columns vs Array of length 100

I have a table with 100+ values corresponding to each row, so I'm exploring different ways to store them.
Without any indexes, would I lose anything if I store these 100 values in an integer[] column in postgresql? As compared to storing them in separate columns.
Plus, since we can add indexes to array elemnets,
CREATE INDEX test_index on test ((foo[1]));
Would there be a performance difference queries using such an index as compared to regular index on a column?
As far as I've read, this performance difference would come into picture in arrays with variable length elements; but I'm not sure about fixed length ones.
Don't go for the lazy way.
If you need to store 100 and more values as array, it is ok, if it has sense has array for your application, your data.
If you need to query for a specific element of the array, then this design is not good, regardless of performances, and you must use columns. This will help you in the moment you must delete a "column" in the middle or redesign it.
Anyway, as wrote by Frank in comments, if values are all same type, consider to model them to another table (if also the meaning is the same).

How to formulate index_name in SQL?

I am trying to create an index on one of my tables with an accurate label. Here is how I am trying it...expecting "sysname" to resolve to the column or table name. But after I run this command and view it in the Object Explorer, it is listed as
"[<Name of Missing Index, sysname + '_prod',>]".
How do u define index_names in a better descriptive fashion? (I am trying to add the extension "_prod" to the index_name, since INDEX of index_name already exists).
USE [AMDMetrics]
GO
CREATE NONCLUSTERED INDEX
[<Name of Missing Index, sysname + '_prod',>]
ON [SMARTSOLVE].[V_CXP_CUSTOMER_PXP] ([QXP_UDF_STRING_8], [QXP_REPORT_DATE],
[QXP_XRS_DESCRIPTION])
INCLUDE ([QXP_ID], [QXP_EXCEPTION_NO], [QXP_BASE_EXCEPTION], [QXP_CATEGORY],
[QXP_OCCURENCE_DATE], [QXP_COORD_ID], [QXP_SHORT_DESC], [QXP_ROOT_CAUSE],
[QXP_DESCRIPTION], [QXP_QEI_ID], [PXP_LOT_NUMBER], [CXP_ID], [CXP_AWARE_DATE],
[QXP_XSV_CODE], [QXP_COORD_NAME], [PXP_PRODUCT_CODE], [PXP_PRODUCT_NAME],
[QXP_ORU_NAME], [QXP_RESOLUTION_DESC], [QXP_CLOSED_DATE], [CXP_CLIENT_CODE],
[CXP_CLIENT_NAME])
I'm not 100% sure what you are trying to do, but it seems like you are trying to find a way to properly name your index (or find a good naming convention). Conventions are best when they are easy to follow, and make sense to people without having to explain it to them. A lot of different conventions fit this MO, but the one that is most common is this:
Index Type Prefix Complete Index name
-------------------------------------------------------------------
Index (not unique, non clustered) IDX_ IDX_<name>_<column>
Index (unique, non clustered) UDX_ UDX_<name>_<column>
Index (not unique, clustered) CIX_ CIX_<name>_<column>
Index (unique, clustered) CUX_ CUX_<name>_<column>
Although on a different note, I have to question why you have so many columns in your INCLUDE list....without knowing the size of those columns, there are some drawbacks to adding so many columns:
Avoid adding unnecessary columns. Adding too many index columns,
key or nonkey, can have the following performance implications:
- Fewer index rows will fit on a page. This could create I/O increases
and reduced cache efficiency.
- More disk space will be required to store the index. In particular,
adding varchar(max), nvarchar(max), varbinary(max), or xml data types
as nonkey index columns may significantly increase disk space requirements.
This is because the column values are copied into the index leaf level.
Therefore, they reside in both the index and the base table.
- Index maintenance may increase the time that it takes to perform modifications,
inserts, updates, or deletes, to the underlying table or indexed view.
You will have to determine whether the gains in query performance outweigh
the affect to performance during data modification and in additional disk
space requirements.
From here: http://msdn.microsoft.com/en-us/library/ms190806.aspx

Better performance for SQLite Select Statement

I'm developing an Iphone App where the user types in any string into a searchbar and presses the search button. After that a result list should appear.
In my SQLite I have four columns a, b, c, d. Let's say they have the following Values:
Dataset 1:
a: code1
b: report1
c: description1_1
d: description1_2
Dataset 2:
a: code2
b: report2
c: description2_1
d: description2_2
So if the user enters a value of: "1_1" then the first dataset will be selected because of clumn c.
If the user enters a value of: "report" then the first and second dataset will be selected.
As I'm using a database with nearly 60.000 Datasets searching for a part-string is really killing the performance.
Setting an index at all 4 columns will make the size of the SQLite database much too huge.
So I didn't use an index at all.
My Select Statement looks like this:
NSString *sql = [NSString stringWithFormat:#"SELECT * FROM scode WHERE a LIKE '%#%#%#' OR c LIKE '%#%#%#' OR d LIKE '%#%#%#'", wildcard, searchBar.text, wildcard, wildcard, searchBar.text, wildcard, wildcard, searchBar.text, wildcard, wildcard, searchBar.text, wildcard];
Is there any good way to enhance the performance of searching for a part-string in all columns?
Thank you and kind regards,
Daniel
You're after Full Text Searching, which SQLite doesn't natively support. I don't have any experience with 3rd party support, but based on results there are a few options.
You answered your own question: Do the index on all four columns. And measure the size difference. Considering the storage capacity of the iPhone, you're probably out of balance trying to reduce storage.
The rule of thumb with SQLite performance is not to doa query that isn't indexed.
You can see what SQLite is actually doing by creating your database on the Mac using the same schema and EXPLAIN QUERY PLAN. (There's also EXPLAIN, which is more detailed but less obvious.)
You can create a separate table, with two columns: a pattern string and a key value (which is used to refer to your data tables). Lets call this table "search_index".
Then, on any change to your data table entries, you update the "search_index" table:
remove rows with keys of changed data table rows
for each column in data table, use the first X characters of the data, and add them to search_index with the key
You can work out the details yourself, but in this way, you just build your own (partial) search index.
When querying, you can use up to X characters to search in the search_index table alone. If the user types more than X characters you at least have a limited set of data table rows to search in. So you can search those 60k rows easily.
Find a good value for X to balance storage requirements and usability and performance.
EDIT: Looks like you do not want to search only the beginning of the words? Well, then you should not just use the "first X characters", but you should split the data into single words, and use the full words in search_index. Though in practice you will still have around a fourth of the index storage requirements compared to giving all columns an index. So, its still a good thing to build your own "search_index".