How to formulate index_name in SQL? - tsql

I am trying to create an index on one of my tables with an accurate label. Here is how I am trying it, expecting "sysname" to resolve to the column or table name. But after I run this command and view it in Object Explorer, the index is listed as
"[<Name of Missing Index, sysname + '_prod',>]".
How do you define index names in a more descriptive fashion? (I am trying to add the suffix "_prod" to the index name, since an index with the original name already exists.)
USE [AMDMetrics]
GO
CREATE NONCLUSTERED INDEX
[<Name of Missing Index, sysname + '_prod',>]
ON [SMARTSOLVE].[V_CXP_CUSTOMER_PXP] ([QXP_UDF_STRING_8], [QXP_REPORT_DATE],
[QXP_XRS_DESCRIPTION])
INCLUDE ([QXP_ID], [QXP_EXCEPTION_NO], [QXP_BASE_EXCEPTION], [QXP_CATEGORY],
[QXP_OCCURENCE_DATE], [QXP_COORD_ID], [QXP_SHORT_DESC], [QXP_ROOT_CAUSE],
[QXP_DESCRIPTION], [QXP_QEI_ID], [PXP_LOT_NUMBER], [CXP_ID], [CXP_AWARE_DATE],
[QXP_XSV_CODE], [QXP_COORD_NAME], [PXP_PRODUCT_CODE], [PXP_PRODUCT_NAME],
[QXP_ORU_NAME], [QXP_RESOLUTION_DESC], [QXP_CLOSED_DATE], [CXP_CLIENT_CODE],
[CXP_CLIENT_NAME])

I'm not 100% sure what you are trying to do, but note that <Name of Missing Index, sysname + '_prod',> is just a template placeholder (sysname is the data type of the template parameter, not something SQL Server fills in for you), so you have to replace the whole bracketed token with a literal index name yourself. Beyond that, it seems like you are trying to find a way to properly name your index (or find a good naming convention). Conventions are best when they are easy to follow and make sense to people without needing to be explained. A lot of different conventions fit this description, but the one that is most common is this:
Index Type                         Prefix   Complete Index name
----------------------------------------------------------------
Index (not unique, non clustered)  IDX_     IDX_<name>_<column>
Index (unique, non clustered)      UDX_     UDX_<name>_<column>
Index (not unique, clustered)      CIX_     CIX_<name>_<column>
Index (unique, clustered)          CUX_     CUX_<name>_<column>
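For example, applying the IDX_ prefix to your statement could look like this (the index name below is just an illustration, keeping the "_prod" suffix from your question):
USE [AMDMetrics]
GO
CREATE NONCLUSTERED INDEX [IDX_V_CXP_CUSTOMER_PXP_QXP_UDF_STRING_8_prod]
ON [SMARTSOLVE].[V_CXP_CUSTOMER_PXP] ([QXP_UDF_STRING_8], [QXP_REPORT_DATE],
[QXP_XRS_DESCRIPTION])
INCLUDE ([QXP_ID], [QXP_EXCEPTION_NO], [QXP_BASE_EXCEPTION], [QXP_CATEGORY],
[QXP_OCCURENCE_DATE], [QXP_COORD_ID], [QXP_SHORT_DESC], [QXP_ROOT_CAUSE],
[QXP_DESCRIPTION], [QXP_QEI_ID], [PXP_LOT_NUMBER], [CXP_ID], [CXP_AWARE_DATE],
[QXP_XSV_CODE], [QXP_COORD_NAME], [PXP_PRODUCT_CODE], [PXP_PRODUCT_NAME],
[QXP_ORU_NAME], [QXP_RESOLUTION_DESC], [QXP_CLOSED_DATE], [CXP_CLIENT_CODE],
[CXP_CLIENT_NAME])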
On a different note, I have to question why you have so many columns in your INCLUDE list. Without knowing the size of those columns, there are still some general drawbacks to adding so many:
Avoid adding unnecessary columns. Adding too many index columns,
key or nonkey, can have the following performance implications:
- Fewer index rows will fit on a page. This could create I/O increases
and reduced cache efficiency.
- More disk space will be required to store the index. In particular,
adding varchar(max), nvarchar(max), varbinary(max), or xml data types
as nonkey index columns may significantly increase disk space requirements.
This is because the column values are copied into the index leaf level.
Therefore, they reside in both the index and the base table.
- Index maintenance may increase the time that it takes to perform modifications,
inserts, updates, or deletes, to the underlying table or indexed view.
You will have to determine whether the gains in query performance outweigh
the effect on performance during data modification and in additional disk
space requirements.
From here: http://msdn.microsoft.com/en-us/library/ms190806.aspx

Related

PostgreSQL - Compare ts_vector fields

I have two tables in which I have data coming from two different sources. One of the fields in each table contains the title of a movie, but for some reason out of my control, the titles are not always exactly the same.
So I use tsvector to get rid of all the minor differences (stop words, plurals and so on).
See an example here: http://sqlfiddle.com/#!17/5ccbc/3
My problem is how to compare the two tsvectors without taking into account the numeric position values, only the text content. If I compare the two fields directly, I only get exact matches between values, including the position of each word. The only solution I have found is the strip() function, which removes positions and weights from a tsvector, leaving only the text content.
I was wondering if there is a faster way to compare tsvectors.
You could create an index on the stripped vector:
create index on tbl1 (strip(ts_title));
create index on tbl2 (strip(ts_title));
But given that your query has to fetch every row of each table, it is unlikely this would serve much of a point. Doing a merge join between the precomputed stripped vectors could be faster, but probably not once you include the overhead of building and maintaining the indexes. If the real WHERE clause is more restrictive (selecting only a few rows from one or the other of the tables) then please share the real query.
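For reference, the comparison itself would then be a join on the stripped vectors, along these lines (tbl1, tbl2 and ts_title are the names used above; adjust to your real schema):
SELECT t1.*, t2.*
FROM tbl1 t1
JOIN tbl2 t2 ON strip(t1.ts_title) = strip(t2.ts_title);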

Optimal approach for large bulk updates to indexed bit sets

The environment for this question is PostgreSQL 9.6.5 on AWS RDS.
The question is about an optimal schema design and batch update strategy for a table with 300 million rows containing the following logical data model:
id: primary key, string up to 40 characters long
code: integer 1-999
year: integer year
flags: variable number (1000+) each associated with a name, new flags added over time. Ideally, a flag should be thought of as having three values: absent (null), on (true/1) and off (false/0). It is possible, at the cost of additional updates (see below), to treat a flag as a simple bit (on or off, no absent). "On" values are typically very sparse: < 1/1000.
Queries typically involve boolean expressions on the presence or absence of one or more flags (by name) with code and year occasionally involved also.
The data is updated in batch via Apache Spark, i.e., updates can be represented as flat file(s), e.g., in COPY format, or as SQL operations. Only one update is active at any one time. Updates to code and year are very infrequent. Updates to flags affect 1-5% of rows per update (3-15 million rows). It is possible for the update rows to include all flags and their values, just the "on" flags to be updated or just the flags whose values have changed. In the former case, Spark would need to query the data to get the current values of flags.
There will be a small read load during updates.
The question is about an optimal schema and associated update strategy to support the query & updates as described.
Some comments from research so far:
Using 1,000+ boolean columns would create a very efficient row representation but, in addition to some DDL complexity, would require 1,000+ indexes.
Bit strings would be great if there was a way to index individual bits. Also, they do not offer a good way to represent absent flags. Using this approach would require maintaining a lookup table between flag names and bit IDs. Merging updates, if needed, works with ||, though, given PostgreSQL's MVCC there doesn't seem to be much benefit to updating just flags as opposed to replacing an entire row.
JSONB fields offer indexing. They also offer null representation but that comes at a cost: all flags that are "off" would need to be explicitly set, which would make the fields quite large. If we ignore null representation, JSONB fields would be relatively small. To further shrink them, we could use short 1-3 character field names with a lookup table. Same comments re: merging as with bit strings.
tsvector/tsquery: have no experience with this data type but, in theory, seems to be an exact representation of a set of "on" flags by name. Must use a lookup table mapping flag names to tokens with the additional requirement to ensure there are no collisions due to stemming.
Don't store the flags in the main table.
Assuming that the main table is called data, define something like the following:
CREATE TABLE flag_names (
    id   smallint PRIMARY KEY,
    name text NOT NULL
);

CREATE TABLE flag (
    flagname_id smallint NOT NULL REFERENCES flag_names(id),
    data_id     text NOT NULL REFERENCES data(id),
    value       boolean NOT NULL,
    PRIMARY KEY (flagname_id, data_id)
);
If a new flag is created, insert a new row in flag_names.
If a flag is set to TRUE or FALSE, insert or update a row in the flag table.
Join flag with data to test if a certain flag is set.
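A rough sketch of what the last two steps could look like in SQL, using the tables defined above (the flag name 'my_flag' and the id 'abc123' are placeholders):
-- set flag 'my_flag' to TRUE for data row 'abc123' (upsert, PostgreSQL 9.5+)
INSERT INTO flag (flagname_id, data_id, value)
SELECT fn.id, 'abc123', TRUE
FROM flag_names fn
WHERE fn.name = 'my_flag'
ON CONFLICT (flagname_id, data_id) DO UPDATE SET value = EXCLUDED.value;

-- fetch data rows where 'my_flag' is on
SELECT d.*
FROM data d
JOIN flag f ON f.data_id = d.id AND f.value
JOIN flag_names fn ON fn.id = f.flagname_id AND fn.name = 'my_flag';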

Creating the optimum index for my database

I have a table in postgresql with the following information:
rawData (fileID integer references otherTable, lineNum integer, data1 double, ...)
When I am searching this table, I do so with the following query:
SELECT lineNum, data1, ...other data FROM rawData WHERE
fileID = ? AND data1 < ? ORDER BY lineNum;
In general, the data in this table is a number of entries for each fileID, and each fileID has lineNum from 0 to x, with lineNum never repeating for each fileID (but it does repeat for different fileID's). Then data1 is effectively a random number that may or may not overlap.
In order to speed up the reading of this data, I am trying to create an index on it, but am having trouble figuring out the best way to index it. Currently I am looking at one of the following two index methods, and am wondering which would be better for my search, or if there is another option that I haven't thought of that would be better than either of them.
index idea 1:
CREATE INDEX searchIndex ON rawData (fileID, data1, lineNum);
index idea 2:
CREATE INDEX searchIndex ON rawData (fileID, lineNum, data1);
Note that at this time, this and a search not constrained by data1 are the only searches that I run on this table, so I'm not too concerned about this index slowing down other searches.
Lastly, would I have to change my search query to use the index, or would it automatically use that index when I search the table?
You should look at using this instead:
CREATE INDEX searchIndex ON rawData (fileID, lineNum);
A few things:
In particular, as per the docs, "indexes with more than three columns are unlikely to be helpful unless the usage of the table is extremely stylized."
Since your second search query filters without the data1 column, keeping just the second column lineNum should be sufficient (you mention data1 is effectively random anyway), and in the rare occurrence that there are repeats, table fetches should ensure correctness. What this would mean is that the index would be a third smaller in size, which is a big win (think an index small enough to fit in memory, index-only scans, etc.).
Either index can be used. Which is faster will depend on many things, like how many rows are in the table, how many lineNums there are per fileID, how selective the data1 < ? clause is, what your hardware is, what your config settings are, which version of PostgreSQL you are using, what physical order the table rows lie in, etc.
The only way to know for sure is to try it with your own data on your own system and see.
I'd just build an index on (fileID, lineNum, data1), or even just (fileID, lineNum), because that seems more natural, and then forget about it. Most likely it will be fast enough. Once there is a demonstrable performance problem, then you will have the test case at hand that is needed to come to a real conclusion.
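If you want to see which plan the planner actually picks and how each index performs, one way to "try it and see" is EXPLAIN ANALYZE with representative values (the literals below are placeholders):
EXPLAIN (ANALYZE, BUFFERS)
SELECT lineNum, data1
FROM rawData
WHERE fileID = 42 AND data1 < 100.0
ORDER BY lineNum;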

Use case for hstore against multiple columns

I'm having some troubles deciding on which approach to use.
I have several entity "types", let's call them A, B and C, which share a certain number of attributes (about 10-15). I created a table called ENTITIES with a column for each of the common attributes.
A, B and C also have some (mostly) unique attributes (all boolean, roughly 10 to 30 each).
I'm unsure which approach is best for modelling the tables:
1. Create a column in the ENTITIES table for each attribute, meaning that entity types that don't share that attribute will just have a null value.
2. Use separate tables for the unique attributes of each entity type, which is a bit harder to manage.
3. Use an hstore column: each entity stores its unique flags in this column.
4. ???
I'm inclined to use 3, but I'd like to know if there's a better solution.
(4) Inheritance
The cleanest style from a database-design point-of-view would probably be inheritance, like #yieldsfalsehood suggested in his comment. Here is an example with more information, code and links:
Select (retrieve) all records from multiple schemas using Postgres
The current implementation of inheritance in Postgres has a number of limitations, though. Among other things, you cannot define a common foreign key constraint for all inheriting tables. Read the last chapter about caveats carefully.
(3) hstore, json (pg 9.2+) / jsonb (pg 9.4+)
A good alternative for lots of different attributes or a changing set of attributes, especially since you can even have functional indexes on attributes inside the column (see the sketch after the links below):
unique index or constraint on hstore key
Index for finding an element in a JSON array
jsonb indexing in Postgres 9.4
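A minimal sketch of option 3, assuming the unique flags live in an hstore column named flags on your ENTITIES table (the column and flag names are illustrative):
CREATE EXTENSION IF NOT EXISTS hstore;

ALTER TABLE entities ADD COLUMN flags hstore;

-- GIN index to support the hstore containment (@>) and key-existence (?) operators
CREATE INDEX entities_flags_gin_idx ON entities USING gin (flags);

-- find entities where the flag 'is_foo' is set to true
SELECT * FROM entities WHERE flags @> 'is_foo=>true';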
(2) Separate table (EAV)
EAV-type storage has its own set of advantages and disadvantages. This question on dba.SE provides a very good overview.
(1) One table with lots of columns
It's the simple, kind of brute-force alternative. Judging from your description, you would end up with around 100 columns, most of them boolean and most of them NULL most of the time. Add a column entity_id to mark the type. Enforcing constraints per type is a bit awkward with lots of columns. I wouldn't bother with too many constraints that might not be needed.
The maximum number of columns allowed is 1600, and that upper limit still applies even when most of the columns are NULL. As long as you keep it down to 100 - 200 columns, I wouldn't worry. NULL storage is very cheap in Postgres (basically 1 bit per column, though it's a bit more complex than that), so that's only something like 10 - 20 bytes extra per row. Contrary to what one might assume (!), this is most probably much smaller on disk than the hstore solution.
While such a table looks monstrous to the human eye, it is no problem for Postgres to handle. RDBMSes specialize in brute force. You might define a set of views (for each type of entity) on top of the base table with just the columns of interest and work with those where applicable. That's like the reverse approach of inheritance. But this way you can have common indexes and foreign keys etc. Not that bad. I might do that.
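A rough sketch of that view-per-type idea (the column names are invented; only the entity_id type marker is taken from above):
CREATE TABLE entities (
    id            serial PRIMARY KEY,
    entity_id     text NOT NULL,   -- marks the type: 'A', 'B' or 'C'
    common_attr_1 text,
    common_attr_2 integer,
    a_flag_1      boolean,
    a_flag_2      boolean,
    b_flag_1      boolean
    -- ... more per-type flag columns ...
);

CREATE VIEW entities_a AS
SELECT id, common_attr_1, common_attr_2, a_flag_1, a_flag_2
FROM   entities
WHERE  entity_id = 'A';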
All that said, the decision is still yours. It all depends on the details of your requirements.
In my line of work, we have rapidly changing requirements, and we rarely get downtime for proper schema upgrades. Having done both the big record with lots of nulls and the highly normalized (name, value) approach, I've been thinking that it might be nice to have all the common attributes in proper columns and put the different/less common ones in an hstore or jsonb bucket.
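For what it's worth, a sketch of that hybrid layout (all names invented):
CREATE TABLE entity (
    id          serial PRIMARY KEY,
    entity_type text NOT NULL,
    name        text,                        -- common attributes as regular columns
    created_at  timestamptz DEFAULT now(),
    extra       jsonb                        -- rare / per-type flags go in the bucket
);

-- GIN index keeps containment queries on the jsonb bucket fast
CREATE INDEX entity_extra_gin_idx ON entity USING gin (extra);

-- rows where the flag "is_foo" is true inside the bucket
SELECT * FROM entity WHERE extra @> '{"is_foo": true}';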

I have a massive table that I need to optimize. I think I need to use indexes, but I was hoping for some more information about them

So I have a large table that I query (select only) quite frequently. The table is around 12,000 rows long. Since the advent of iOS, the time that it is taking to run these select queries has gone up 4-5x.
I was told that I need to add an index to my table. The query that I am using looks like this:
SELECT * FROM book_content WHERE book_id = ? AND chapter = ? ORDER BY verse ASC
How can I create an index for this table? Is it a command I just run once? What exactly is the index going to do? I didn't learn about these in school so they still seem like some sort of magic to me at this point, so I was hoping to get a little instruction.
Thanks!
You want an index on book_id and chapter. Without an index, a server would do a table scan and essentially load the entire table into memory to do its search. Do a quick search on the CREATE INDEX command for the RDBMS that you are using. You create the index once and every time you do an INSERT or DELETE or UPDATE, the server will update the index automatically. An index can be UNIQUE and it can be on multiple fields (in your case, book_id and chapter). If you make it UNIQUE, the database will not allow you to insert a second row with the same key (in this case, book_id and chapter). On most servers, having one index on two fields is different from having two individual indexes on single fields each.
A MySQL example would be:
CREATE INDEX id_chapter_idx ON book_content (book_id,chapter);
If you want only one record for each book_id, chapter combination, use this command:
CREATE UNIQUE INDEX id_chapter_idx ON book_content (book_id,chapter);
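Once the index exists, you don't have to change your query; MySQL will consider it automatically. You can confirm it is being used with EXPLAIN (the literal values are just examples):
EXPLAIN SELECT * FROM book_content
WHERE book_id = 1 AND chapter = 2
ORDER BY verse ASC;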
A PRIMARY KEY is a special index that is UNIQUE and NOT NULL. Each table can have only one primary key, and in fact each table should have one to ensure table integrity, especially during joins.
You don't have to think of indexes as "magic".
An index on an SQL table is much like the index in a printed book - it lets you find what you're looking for without reading the entire book cover-to-cover.
For example, say you have a cookbook, and you're looking for recipes that involve chicken. The index in the back of the book might say something like:
chicken: 30,34,72,84
letting you know that you will find chicken recipes on those 4 pages. It's much faster to find this information in the index than by reading through the whole book, because the index is shorter, and (more importantly) it's in alphabetical order, so you can quickly find the right place in the index.
So, in general you want to create indexes on columns that you will regularly need to query (book_id and chapter, in your example).
Declaring a column as a primary key automatically generates an index on that column. In your case, since you mostly run SELECTs, indexes are ideal: they improve the speed of selection queries at the cost of slowing down insertions. So you can create the indexes you think you need without worrying too much about performance.
Indexes are a very sensitive subject. If you consider using them, you need to be careful about how many you create. The primary key, or id, of each table should have a clustered index. For all the rest, it depends on how you plan to use them. I'm fuzzy on the subject of indexes and have never really worked with them, but from a seminar I watched just yesterday: you don't want too many indexes, because they can actually slow things down when you don't need them.
Let's say you put an index on 5 out of 8 fields in a table, and each index is there for one particular query somewhere in your software. When one of those queries runs, it uses its one index and doesn't need the other 4, so those 4 are unneeded weight for that query (and every write still has to maintain them). If you add an index, make sure it is one that can be useful in many places, not just one.