Sphinx: centralize multiple tables into a single index

I have multiple tables (MySQL) and I want a single index for them.
Each table has a primary key of type int with auto-increment.
The structure of the collected data is the same for each table (so no conflict there), but because the IDs collide it seems I have to query each index separately (unless you can give me a hint on how to avoid the ID collision).
The question is: if I query each index separately, does that mean the weights of the returned results are comparable between indexes?

unless you can give me a hint on how to avoid the ID collision
See for example
http://sphinxsearch.com/forum/view.html?id=13078
You can just arrange for the IDs to be offset differently. The Sphinx document ID doesn't have to match the real primary key, but having a simple mapping keeps the application simpler.
You have a choice between one-index, one-source (a single SQL query that UNIONs all the tables together), one-index, many-source (a source per table, all building one index), or many-indexes (one index per table, each with its own source). Whichever way you choose will give the same query results.
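A minimal sketch of the one-index, one-source option, assuming two MySQL tables named articles and comments (illustrative names, not from the original post). The Sphinx document ID is derived from the real primary key plus a per-table offset so the IDs cannot collide; in sphinx.conf this would become the sql_query of a single source:
SELECT id AS doc_id, title, body FROM articles
UNION ALL
SELECT id + 1000000000 AS doc_id, title, body FROM comments
This assumes the first table never reaches a billion rows; an alternative mapping such as id * 2 + table_number avoids that assumption and lets the application recover the source table from the document ID.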
If I query each index separately, does that mean the weights of the returned results are comparable between indexes?
Pretty much. The difference should be negligible; it doesn't matter which way round you do it.

Related

100 columns vs Array of length 100

I have a table with 100+ values corresponding to each row, so I'm exploring different ways to store them.
Without any indexes, would I lose anything if I stored these 100 values in an integer[] column in PostgreSQL, as compared to storing them in separate columns?
Plus, since we can add indexes to array elements,
CREATE INDEX test_index on test ((foo[1]));
Would there be a performance difference for queries using such an index, compared to a regular index on a column?
As far as I've read, this performance difference comes into the picture for arrays with variable-length elements, but I'm not sure about fixed-length ones.
Don't go for the lazy way.
If you need to store 100 or more values as an array, that is fine, provided an array makes sense for your application and your data.
If you need to query for a specific element of the array, then this design is not good, regardless of performance, and you should use columns. That will also help you the moment you have to delete a "column" in the middle or redesign the table.
Anyway, as Frank wrote in the comments, if the values are all of the same type, consider modelling them in another table (if the meaning is also the same).
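If the 100 values really are repetitions of the same kind of measurement, that separate-table design might look like this; a rough sketch with hypothetical names, assuming a parent table test with an integer primary key id:
CREATE TABLE test_value
(
test_id INTEGER REFERENCES test (id),
position INTEGER CHECK (position BETWEEN 1 AND 100),
foo INTEGER,
PRIMARY KEY (test_id, position)
);
The primary key already acts as an index, so fetching "element 1" of a row is simply SELECT foo FROM test_value WHERE test_id = 42 AND position = 1.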

H2: Insert is slow because of index on column

I am using the h2 database to store data.
Each record has to be unique in the database (unique in the sense that the combination of timestamp, name, message, ... doesn't appear twice in the table). Therefore one column in the table is the hash of the data in the record. To speed up checking whether a record already exists, I created an index on the hash column. Indeed, searching for a record with a given hash is very fast.
But here is the problem: while inserting the first 10k records is fast enough (it takes about a second), it gets awfully slow once there are already a million records in the database (it takes a minute). This is probably because the new hashes need to be integrated into the existing index B-tree.
Is there any way to speed this up or is there a better way to ensure uniqueness of data records in the table?
Edit: To be more concrete:
Let's say my records are transactions which have the following fields:
timestamp, type, sender, recipient, amount, message
A transaction should appear only once in the table, so before inserting a new transaction I have to check whether it is already there. Since the SHA-256 hash of all fields is unique, my idea was to add a column 'hash' to the table that stores the hash of the fields. Before inserting a new record, I calculate the hash of the fields and query the table for it.
An index has its own overhead. If you have a table with a lot of insertions, I would suggest avoiding an index on it, as it adds the overhead of maintaining the hash.
May I know what you mean by "one column in the table is the hash of the data in the record"?
You can create a unique key constraint (here it would be a composite key of those three mentioned columns). Let me know the requirements; maybe we can suggest a better, simpler way of doing it. :)
Danyal
Man, querying all the records, checking them for duplicates and then inserting the new row is probably not a good approach. :) The overhead will only increase as the number of records grows.
Create a unique key constraint (see http://www.h2database.com/html/grammar.html ) on the combination of these fields; you don't need to compute the hash yourself, the database will handle that. Just try to add the duplicate record: you will get an exception, catch it, and show an error message about the duplicate insertion.
Once you create the unique index, it won't allow you to insert any duplicate records. It is pretty secure and safe.
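For example, with the transaction fields from the question (a sketch; the table and column names are only illustrative), the constraint in H2 could be created like this:
ALTER TABLE transactions
ADD CONSTRAINT uq_transaction UNIQUE (ts, type, sender, recipient, amount, message);
Any INSERT that repeats an existing combination then fails with a unique-constraint violation that the application can catch.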
Indexing randomly distributed data is bad for performance. Once there are more entries in the index than fit in the cache, updating the index gets very slow, especially when using a hard disk, because seeks on a hard disk are very slow. This, in combination with the random distribution of the data, leads to very bad performance. With solid-state disks it's a bit better, because random-access reads are faster there.

What is the best order for the fields of a composite key?

Say I have a table with the following fields:
LeagueID
MatchID
SomeData
A League will host many Matches. Usually each League uses its own local database, so the LeagueID field will be the same in all records of that local database. Once a year the League uploads its data to the national authority, and then the LeagueID is needed to discriminate between Matches that have the same MatchID.
What is the best way to implement the composite primary key (using the EF Fluent API)?
modelBuilder.Entity<Match>().HasKey(match => new { match.LeagueID, match.MatchID });
OR
modelBuilder.Entity<Match>().HasKey(match => new { match.MatchID, match.LeagueID });
To the human eye the order League - Match is logical, as it will hold the Matches of a particular League together. But I understood that when composing a composite key it is important for performance reasons to use the most discriminating field first.
I think you can have your cake and eat it too.
The Database
When implementing a key in the database generally a narrower key with more selective field(s) will yield better performance. This holds true for single & composite keys. I said generally, because a more selective index that doesn't really match your query pattern can be pretty useless. For example, in your composite key, if MatchID is first (the more selective), but you query more frequently by LeagueID (less selective), the selectivity will work against you.
The real issue, I think, is not whether index A or B is more selective, but "do you have appropriate indexes for the ways you query?" (and for enforcing data integrity, but that's a different discussion). So you need to figure out how you query this table. If you query by:
LeagueID most of the time -- index LeagueID, MatchID
MatchID most of the time -- index MatchID, LeagueID
the composite of LeagueID & MatchID the majority of the time -- index MatchID, LeagueID
a mixed bag -- you may want two indexes, one for each order (see the sketch below), but you'll have to figure out whether the extra overhead of maintaining two indexes is worth the hit on insert/update/delete.
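A sketch of that mixed-bag case, assuming SQL Server and a Matches table (the constraint and index names are illustrative):
ALTER TABLE Matches ADD CONSTRAINT PK_Matches PRIMARY KEY (LeagueID, MatchID);
CREATE NONCLUSTERED INDEX IX_Matches_Match_League ON Matches (MatchID, LeagueID);
The first statement makes (LeagueID, MatchID) the clustered primary key; the second adds a nonclustered index in the opposite order for queries that filter on MatchID alone.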
EF & The Query
For the most part, the order of the columns in your query (or the way you build the match in EF) won't make a difference. Meaning where a=#a and b=#b will yield the same query plan and performance as where b=#b and a=#a.
Assuming you're using SQL Server, the order in which you write a WHERE clause matters very little. Books Online explains the issue succinctly, stating:
The order of evaluation of logical operators can vary depending on choices made by the query optimizer.
You can choose the order of the fields in the database by using HasColumnOrder.

Postgres hstore for time series

I am new to Postgres and am experimenting with the hstore extension, looking for some guidance. I need to support basic reporting on time-series data for the various products that we sell. I have a large amount of data in the format "Timestamp, Value" for each product; this data is available in a CSV file per product.
I am thinking of using hstore to store this data in key/value format, assuming that all the time-series data for a single product can be stored in a single hstore object. I need to be able to query this data for specific times, say: what was the value of a product at a given time? I also need to run simple queries such as retrieving the times when the product cost more than $100.
I'm planning to have a table with a product id column and an hstore column. But I am not very clear on how to make this work:
The hstore column needs to be loaded from the thousands of timestamp/value records in a CSV, and appended to whenever we get a new CSV.
The table needs to store the productId and corresponding Timeseries data.
Can you please advise whether using hstore would be helpful, and if so, how I can load the data from CSV as explained above? Also, if there is any impact on the performance of inserts/updates in the hstore as the data grows, please share your experiences.
I do think you should start with a simple, normalised schema first, especially since you are new to PostgreSQL. Something like:
CREATE TABLE product_data
(
product TEXT, -- I'm making an assumption about the types of your columns
time TIMESTAMP,
value DOUBLE PRECISION,
PRIMARY KEY (product, time)
);
I would definitely keep hstore and similar options in mind, if and when your data becomes large enough that efficiency is more important than simplicity. But note that all options have an efficiency tradeoff.
Do you know how much data you're going to support? Number of products, number of distinct timestamps for each product?
What other queries do you want to run? A query for the times where a single product cost more than $100 would benefit from an index on (product, value), if the product has many distinct timestamps.
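For example, with the product_data table above (the product value 'widget-a' is just a placeholder):
CREATE INDEX product_data_value_idx ON product_data (product, value);
SELECT time FROM product_data WHERE product = 'widget-a' AND value > 100;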
Other options
hstore is most useful if you want to store a set of arbitrary key/value pairs in a row. You could use it here, with a row for each product, and each distinct timestamp for that product being a key in that row's hstore. The downsides are that keys and values in hstore are text, whereas your keys are timestamps and your values are numbers of some kind, so there will be some loss of type checking and some extra type-casting cost. Another possible downside is that some queries on the hstore might not use indexes very efficiently. The table above can use simple B-tree indexes for range queries (say you want to pull out the values between two dates for a product), but hstore indexes are much more limited; you can use a GiST or GIN index on an hstore column to find all the rows that feature a certain key.
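For comparison, a minimal sketch of the hstore layout (one row per product; the names are illustrative), showing the casting it forces on you:
CREATE EXTENSION IF NOT EXISTS hstore;
CREATE TABLE product_series
(
product TEXT PRIMARY KEY,
series HSTORE -- keys are timestamps as text, values are prices as text
);
-- Value of a product at a given time; the lookup key and the result are both text,
-- so the result has to be cast back to a number:
SELECT (series -> '2013-01-01 00:00:00')::double precision
FROM product_series
WHERE product = 'widget-a';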
Another option (which I've played with and use experimentally in some of my databases) is arrays. Basically, each product has an array of values, and each timestamp maps to an index in the array. This is easy if the timestamps are perfectly regular. For example, if all your products had a value for every hour of every day, you could use a table like this:
CREATE TABLE product_data
(
product TEXT,
day DATE,
"values" DOUBLE PRECISION[], -- one value per hour, 0 to 23 (quoted because VALUES is a reserved word)
PRIMARY KEY (product, day)
);
You can construct views and indexes to make querying this table moderately easy. (I wrote a blog post on this technique at http://ejrh.wordpress.com/2011/03/20/vector-denormalisation-in-postgresql/.)
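A sketch of such a view, which flattens the hourly arrays back into (product, time, value) rows so ordinary queries keep working (assuming the 24-element layout above):
CREATE VIEW product_data_flat AS
SELECT product,
day + (hour * INTERVAL '1 hour') AS time,
"values"[hour + 1] AS value
FROM product_data
CROSS JOIN generate_series(0, 23) AS hour;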
But my advice is still: start with a simple table, then explore ways to improve efficiency when you know you're going to need them.

I have a massive table that I need to optimize. I think I need to use indexes, but I was hoping for some more information about them

So I have a large table that I query (select only) quite frequently. The table is around 12,000 rows long. Since the advent of iOS, the time that it is taking to run these select queries has gone up 4-5x.
I was told that I need to add an index to my table. The query that I am using looks like this:
SELECT * FROM book_content WHERE book_id = ? AND chapter = ? ORDER BY verse ASC
How can I create an index for this table? Is it a command I just run once? What exactly is the index going to do? I didn't learn about these in school so they still seem like some sort of magic to me at this point, so I was hoping to get a little instruction.
Thanks!
You want an index on book_id and chapter. Without an index, a server would do a table scan and essentially load the entire table into memory to do its search. Do a quick search on the CREATE INDEX command for the RDBMS that you are using. You create the index once and every time you do an INSERT or DELETE or UPDATE, the server will update the index automatically. An index can be UNIQUE and it can be on multiple fields (in your case, book_id and chapter). If you make it UNIQUE, the database will not allow you to insert a second row with the same key (in this case, book_id and chapter). On most servers, having one index on two fields is different from having two individual indexes on single fields each.
A MySQL example would be:
CREATE INDEX id_chapter_idx ON book_content (book_id,chapter);
If you want only one record for each book_id, chapter combination, use this command:
CREATE UNIQUE INDEX id_chapter_idx ON book_content (book_id,chapter);
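Since the query also sorts by verse, you could optionally extend the index to cover that as well (a sketch, using the columns from the question's query):
CREATE INDEX id_chapter_verse_idx ON book_content (book_id, chapter, verse);
With all three columns in the index, the rows for a given book_id/chapter pair come back already in verse order, so the ORDER BY needs no extra sort.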
A primary index is a special index that is UNIQUE and NOT NULL. Each table can have only one primary index; in fact, each table should have one to ensure table integrity, especially during joins.
You don't have to think of indexes as "magic".
An index on an SQL table is much like the index in a printed book - it lets you find what you're looking for without reading the entire book cover-to-cover.
For example, say you have a cookbook, and you're looking for recipes that involve chicken. The index in the back of the book might say something like:
chicken: 30,34,72,84
letting you know that you will find chicken recipes on those 4 pages. It's much faster to find this information in the index than by reading through the whole book, because the index is shorter, and (more importantly) it's in alphabetical order, so you can quickly find the right place in the index.
So, in general you want to create indexes on columns that you will regularly need to query (book_id and chapter, in your example).
Declaring a column as a primary key automatically generates an index on that column. In your case, since you mostly run SELECT queries, an index is ideal: indexes speed up selection queries and slow down insertions. So you can create the indexes you think you need without worrying too much about performance.
Indexes are a very sensitive subject. If you consider using them, you need to be very careful about how many you create. The primary key, or id, of each table should have a clustered index. All the rest depends on how you plan to use them. I'm fuzzy on the subject of indexes and have never really worked with them, but from a seminar I watched just yesterday: you don't want too many indexes, because they can actually slow things down when you don't need them.
Let's say you put an index on 5 out of 8 fields of a table, each index designated for a particular query somewhere in your software. When one query runs, it uses that one index and doesn't need the other four, so those are unneeded weight for that query. If you need an index, make sure it is one that will be useful in many places, not just one.