indexes needed for table inheritance in postgres? - postgresql

This is a fairly simple question, but it's one I can't find a firm answer on.
I have a parent table in PostgreSQL, and then several child tables which have been defined. A trigger has been established, and the children tables only have data inserted if a field, say field x, meets a certain criteria.
When I query the parent table with a field based upon x, PostgreSQL knows to immediately go to the child table that is related to that particular value of x.
That all being said, I don't need to specify a particular index on the column x do I? PostgreSQL already knows how to sort on it, and by adding an index to the parent x, PostgreSQL is therefore generating unique indexes on x for each of the new child tables.
Creating that index is a bit redundant, right?

Creating an index on the child table for x, if x only has one value (or a very, very small number of values) if probably a loss, yes. The planner would scan the whole table anyway.
If x is a timestamp and you're specifying a timeframe that may not be a whole partition, or if x is another range or set of values, an index would be a win most likely.
Edit: When I say one value or range of values, I mean, per child table.

Related

How can I create an arbitrary ranking of records in Postgres?

The Problem
I'm looking to create a user defined ranking of records in Postgres.
That is, the order in which the records are ranked is not defined by some underlying score but rather via the choices of a collection of users.
These choices are subject to frequent changes and the ranking will be constantly changing with both new records being added and existing records being moved to new positions.
For the sake of argument, assume that these operations occur with high frequency.
Furthermore, we need to be able to determine when given an arbitrary subset of all records, how they should be ordered according to the ranking.
A Naive Solution
A very naive solution would be to track the ranking as an integer directly on the model and 'push' all the higher ranked records up by one when inserting a new record. This is obviously not ideal as we would need to modify potentially the entire table at once.
A Better Solution
Maintain a 'score' on each record in the interval [0, 1]. This can be indexed using a BTREE and used to rank the records. The first two records would have the scores 0 and 1. When inserting a new record some intermediate value would be chosen (e.g. 0.5) and the record inserted. This choice could be optimised in order to minimise the number of digits in the score.
A Question
The above seems like a complex solution to a common problem. Furthermore, the problem is actually being solved by the underlying BTREE index with the score something of a hack to create the index.
Is there a neater way to solve the problem?

Sphinx centralize multiple tables into a single index

I do have multiple tables (MySQL) and I want to have a single index for them.
Each table has the primary key of int autoincrement type.
The structure of collected data is the same for each table (so no conflict), but as the IDs collide so it seems that I have to query each index separately (unless you can give me a hint of how to avoid ID collision)
Question is: If I query each index separately does it means that the weight of returned results are comparable between indexes?
unless you can give me a hint of how to avoid ID collision
See for example
http://sphinxsearch.com/forum/view.html?id=13078
You can just arrange for the ids to be offset differently. The 'sphinx document id' doesnt have to match the real primary key, but having a simple mapping makes the application simpler.
You have a choice between one-index, one-source (using a single sql query to union all the tables together. one-index, many-source. (a source per table, all making one index) or many-indexes (one index per table, each with own source). Which ever way will give the same query results.
If I query each index separately does it means that the weight of returned results are comparable between indexes?
Pretty much. The difference should be negiblibe that doesnt matter whic way round you do it.

100 columns vs Array of length 100

I have a table with 100+ values corresponding to each row, so I'm exploring different ways to store them.
Without any indexes, would I lose anything if I store these 100 values in an integer[] column in postgresql? As compared to storing them in separate columns.
Plus, since we can add indexes to array elemnets,
CREATE INDEX test_index on test ((foo[1]));
Would there be a performance difference queries using such an index as compared to regular index on a column?
As far as I've read, this performance difference would come into picture in arrays with variable length elements; but I'm not sure about fixed length ones.
Don't go for the lazy way.
If you need to store 100 and more values as array, it is ok, if it has sense has array for your application, your data.
If you need to query for a specific element of the array, then this design is not good, regardless of performances, and you must use columns. This will help you in the moment you must delete a "column" in the middle or redesign it.
Anyway, as wrote by Frank in comments, if values are all same type, consider to model them to another table (if also the meaning is the same).

ETL Process when and how to add in Foreign Keys T-SQL SSIS

I am in the early stages of creating a Data Warehouse based loosely on the Kimball methodology.
I am currently investigating my source data. I understand by the adding of a Primary key (not a natural key) this will then allow me to make the connections between the facts and dimensions.
Sounds like a silly question but how exactly is this done? Are there any good articles that run through this process?
I would imagine we bring in all of the Dimensions first. And when the fact data is brought over a lookup is performed that "pushes" the Foreign key into the Fact table? At what point is this done? Within SSIS whats is the "best practice" method? Is this all done in one package for example?
Is that roughly how it happens?
In this case do we have to be particularly careful in what order we load our data, or we could be loading facts for which there is no corresponding dimension?
I would imagine we bring in all of the Dimensions first. And when the
fact data is brought over a lookup is performed that "pushes" the
Foreign key into the Fact table? At what point is this done? Within
SSIS whats is the "best practice" method? Is this all done in one
package for example?
It would depend on your schema and table design.
Assuming it's star schema and the FK is based on the data value itself:
DIM1 <- FACT1 -> DIM2
^
|
FACT2 -> DIM3
you'll first fill DIM1 and DIM2 before inserting into FACT1 as you would need the FK.
Assuming it's snowflake schema:
DIM1_1
^
|
DIM1 <- FACT1 -> DIM2
you'll first fill DIM1_1 then DIM1 and DIM2 before inserting into FACT1.
Assuming the FK relation is based on something else (mostly a number) instead of the data value itself (kinda an optimization when dealing with huge amount of data and/or strings as dimension values), you won't need to wait until you insert the data into DIM table. I'm sure it's very confusing :), so I'll try to explain in short. The steps involved would be something like (assume a simple star schema with 2 tables, FACT1 and DIMENSION1):
Extract FACT and DIMENSION values from the data set you are processing.
Generate a unique number based on the DIMENSION's value (which say is a string), using a reproducible algorithm (e.g. SHA1, given same string, it always gives same number).
Insert into FACT1 table, the number and FACT values.
Insert into DIMENSION1 table, the number and DIMENSION values.
Steps 3 & 4 can be done in parallel. as long as there is NO constraint in place. A join on a numeric column would be more efficient than one of a string.
And there is no need to store the mapping for #2 because it's reproducible (just ensure you pick the right algo).
Obviously this can be extended for snowflake schema and/or multiple dimensions.
HTH

I have a massive table that I need to optimize. I think I need to use indexes, but I was hoping for some more information about them

So I have a large table that I query (select only) quite frequently. The table is around 12,000 rows long. Since the advent of iOS, the time that it is taking to run these select queries has gone up 4-5x.
I was told that I need to add an index to my table. The query that I am using looks like this:
SELECT * FROM book_content WHERE book_id = ? AND chapter = ? ORDER BY verse ASC
How can I create an index for this table? Is it a command I just run once? What exactly is the index going to do? I didn't learn about these in school so they still seem like some sort of magic to me at this point, so I was hoping to get a little instruction.
Thanks!
You want an index on book_id and chapter. Without an index, a server would do a table scan and essentially load the entire table into memory to do its search. Do a quick search on the CREATE INDEX command for the RDBMS that you are using. You create the index once and every time you do an INSERT or DELETE or UPDATE, the server will update the index automatically. An index can be UNIQUE and it can be on multiple fields (in your case, book_id and chapter). If you make it UNIQUE, the database will not allow you to insert a second row with the same key (in this case, book_id and chapter). On most servers, having one index on two fields is different from having two individual indexes on single fields each.
A Mysql example would be:
CREATE INDEX id_chapter_idx ON book_content (book_id,chapter);
If you want only one record for each book_id, chapter combination, use this command:
CREATE UNIQUE INDEX id_chapter_idx ON book_content (book_id,chapter);
A PRIMARY INDEX is a special index that is UNIQUE and NOT NULL. Each table can only have one primary index. In fact, each table should have one primary index to ensure table integrity, especially during joins.
You don't have to think of indexes as "magic".
An index on an SQL table is much like the index in a printed book - it lets you find what you're looking for without reading the entire book cover-to-cover.
For example, say you have a cookbook, and you're looking for recipes that involve chicken. The index in the back of the book might say something like:
chicken: 30,34,72,84
letting you know that you will find chicken recipes on those 4 pages. It's much faster to find this information in the index than by reading through the whole book, because the index is shorter, and (more importantly) it's in alphabetical order, so you can quickly find the right place in the index.
So, in general you want to create indexes on columns that you will regularly need to query (book_id and chapter, in your example).
When you declare a column as primary key automatically generates an index on that column. In your case for using more often select an index is ideal, because they improve time of selection queries and degrade the time of insertion. So you can create the indexes you think you need without worrying about the performance
Indexes are a very sensitive subject. If you consider using them, you need to be very careful how many you make. The primary key, or id, of each table should have a clustered index. All the rest, it depends on how you plan to use them. I'm very fuzzy in the subject of indexes, and have actually never worked with them, but from a seminar I just watched actually yesterday, you don't want too many indexes - because they can actually slow things down when you don't need to use them.
Let's say you put an index on 5 out of 8 fields on a table. Each index is designated for a particular query somewhere in your software. Well, when 1 query is run, it uses that 1 index, and doesn't need the other 4. So that's unneeded weight on this 1 query. If you need an index, be sure that this is an index which could be useful in many places, not just 1 place.