Is there a way to create an index as a real-time and a disk index at the same time? - sphinx

Is there a way to create an index as a real-time index and a disk index at the same time,
so I can either index manually or index while inserting or updating?

Well, you can run the ATTACH INDEX command, which converts a disk index into an RT index.
http://sphinxsearch.com/docs/current.html#sphinxql-attach-index
... in this way you can 'bulk' create a disk index first using indexer, attach it to an RT index, and then continue to update it 'in real time'.
(Of course, this kinda requires that the disk index and the RT index have the same settings/fields/attributes etc.)
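For example, a minimal sketch (the index names are made up; it assumes the disk index and the RT index were configured with matching fields/attributes):
ATTACH INDEX my_disk_index TO RTINDEX my_rt_index; /* hypothetical index names */
After the attach, the disk index's data is moved into the RT index, and all further writes go to the RT index over SphinxQL.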

No, there's no way to do this. A possible workaround is to implement a simple script that populates your RT index using INSERTs, and to run it whenever you would otherwise run 'indexer'.
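A rough sketch of what such a script could send over a SphinxQL connection (the index and field names are assumptions):
TRUNCATE RTINDEX my_rt_index; /* optional: clear the RT index before re-populating */
INSERT INTO my_rt_index (id, title, content) VALUES (1, 'First document', 'document body ...');
INSERT INTO my_rt_index (id, title, content) VALUES (2, 'Second document', 'another body ...');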

Related

Sphinx / Manticore - base one plain index off another?

I have a plain index that sucks data from MySQL and inserts it into Manticore in the format I need (e.g. converting datetime strings to timestamps, CONCATing some fields, etc.).
I then want to create a second plain index based off this data to group it further. This will save me having to re-run the normalisation that's done for the first index, and will make it easier for me to query in the future.
For example, my first index is a list of all phone calls that have been made / received (telephone number, duration, agent). The second index should group by Year-Month-Date in such a way that I can see how many calls each agent made on that day. This means I end up with idx_phone_calls and idx_phone_calls_by_date.
Currently, I generate the first index from MySQL, then get Manticore to query itself (by setting the MySQL host to localhost). It works, but it feels as though I should be able to query Manticore directly from within the index definition. However, I'm struggling to find out whether that's possible.
Is there a better way to do it?
Well, Sphinx/Manticore has its own GROUP BY function, so maybe you can just run the final query against the original index anyway and avoid the need for the second index.
Sphinx's aggregation is (in some ways) more powerful than MySQL's, and it can do some 'super aggregation' functions (like WITHIN GROUP ORDER BY).
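For example, a minimal sketch against the original index (the attribute names agent and call_time are assumptions; call_time is assumed to be the timestamp attribute produced by the first index's normalisation):
SELECT agent, YEARMONTHDAY(call_time) AS day, COUNT(*) AS calls
FROM idx_phone_calls
GROUP BY day, agent
ORDER BY day DESC; /* hypothetical attribute names; one row per agent per day */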
But otherwise there is no direct way to create one index off another (e.g. there is no CREATE TABLE idx_phone_calls_by_date SELECT ... FROM idx_phone_calls ...).
Your 'solution' of directing indexer to query the data from searchd is good. In general this should be pretty efficient; particularly on localhost, there is little overhead. It also maintains the logical separation of searchd being for queries and indexer being for, well, building indexes.

Postgres query shows under the 'Most Time Consuming' section of Heroku

I am using the query below for a search:
I guess the current time is not bad, but I am still looking for further optimizations. I also saw in the analyze report that the nested loop and nested loop joins show up in red; it would be great to get an idea of how to reduce that. I was thinking of adding an index for the search key. Any more suggestions to improve this would be great. I have added the EXPLAIN ANALYZE result from 3 executions, which ran in production.
You could try adding ingredients.name or ingredients.code to an existing index, or creating a new index, so that more rows are filtered during the ingredients index scan.
You should also avoid using functions on the column, such as LOWER(ingredients.name), to make sure the right index is used.
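A minimal sketch of both options (the index names are made up):
-- index on the column used in the search filter
CREATE INDEX idx_ingredients_name ON ingredients (name);
-- or, if the query has to keep filtering on LOWER(ingredients.name), an expression index that matches it
CREATE INDEX idx_ingredients_name_lower ON ingredients (LOWER(name));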

PostgreSQL slow update with index

A very simple update to reset one column in a table with approx 5 mil rows:
UPDATE t_Daily
SET Price = NULL;
Price is not part of any of the indexes on that table.
Running this without indexes takes 45s.
Running this with one or more indexes takes at least 20 mins (I keep having to stop it).
I fully understand why maintaining indexes affects the performance of insert and update statements, but this update makes no changes to any indexed column, so why does it have this terrible effect on performance?
Any ideas much appreciated.
That is normal and expected: updating an index can be about ten times as expensive as updating the table itself. The table itself has no ordering, so a new row version can go anywhere there is room, but each index entry has to be placed at its correct position in the index.
If Price is not indexed, you can benefit from HOT updates, which avoid updating the indexes. To make use of that, the table has to be defined with a fillfactor below 100 so that updated rows can find room in the same block as the original rows.
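A minimal sketch of that setup (the fillfactor value of 70 is only an illustration):
-- leave free space in each block so updated row versions can stay on the same page (HOT)
ALTER TABLE t_Daily SET (fillfactor = 70);
-- existing rows only benefit after the table is rewritten, e.g. with VACUUM FULL
VACUUM FULL t_Daily;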
Found some further info (thanks to Laurenz Albe for the HOT tip).
This link https://malisper.me/postgres-heap-only-tuples/ states that:
Due to MVCC, an update in Postgres consists of finding the row being updated and inserting a new version of the row back into the database. The main downside to doing this is the need to re-add the row to every index.
So it is re-writing every index despite the update only touching a column that is not in any index.
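If the fillfactor/HOT route is taken, the effect can be verified in the statistics views (the unquoted table name is stored in lower case):
-- n_tup_hot_upd counts updates that did not have to touch the indexes
SELECT n_tup_upd, n_tup_hot_upd
FROM pg_stat_user_tables
WHERE relname = 't_daily';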

Postgres - gin index doesn't work

It seems that my server won't use a GIN index.
I've created a new database with one table.
I've inserted one row as an example.
I've loaded the trigram extension and created a GIN index using trigrams.
But when I check whether the index is used, I can see it isn't.
Any ideas?
SQL: http://pastebin.com/1yDQQA1Z
P.S. A day ago I followed a tutorial about trigrams. Basically it was the same as my example above. The table had 2 columns, numeric(5, 0) and character varying (the latter with the GIN trgm index). The query used the LIKE operator with "%", and the index was being used (I could see a bitmap index scan in the query plan), so I know my server can use the index (and it's properly installed).
Thanks in advance.
Don't test on one row, it is meaningless.
Here's an excerpt of the documentation explaining why, in Examining Index Usage:
Use real data for experimentation. Using test data for setting up indexes will tell you what indexes you need for the test data, but that is all.

It is especially fatal to use very small test data sets. While selecting 1000 out of 100000 rows could be a candidate for an index, selecting 1 out of 100 rows will hardly be, because the 100 rows probably fit within a single disk page, and there is no plan that can beat sequentially fetching 1 disk page.
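Following that advice, a minimal sketch of a more realistic test (the table and index names are assumptions; it presumes the pg_trgm extension is already loaded, as in the question):
-- hypothetical table mirroring the question's setup
CREATE TABLE items (id serial PRIMARY KEY, name varchar);
CREATE INDEX items_name_trgm_idx ON items USING gin (name gin_trgm_ops);
-- generate enough rows that an index scan can beat a sequential scan
INSERT INTO items (name)
SELECT md5(random()::text)
FROM generate_series(1, 100000);
ANALYZE items;
-- with this volume the plan should show a Bitmap Index Scan on items_name_trgm_idx
EXPLAIN ANALYZE
SELECT * FROM items WHERE name LIKE '%abc%';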

Efficient way of insert millions of rows, convert data and deal with it, on PostgreSQL+PostGIS

I have a big collection of data I want to use for user search later.
Currently I have 200 million resources (~50GB). For each, I have a latitude + longitude. The goal is to create a spatial index to be able to do spatial queries on it.
So for that, the plan is to use PostgreSQL + PostGIS.
My data is in a CSV file. I tried using a custom function to avoid inserting duplicates, but after days of processing I gave up. I then found a way to load it fast into the database: with COPY it takes less than 2 hours.
Then, I need to convert latitude + longitude to geometry format. For that I just need to do:
ST_SetSRID(ST_MakePoint(longi::double precision, lat::double precision), 4326)
After some checking, I saw that for the 200 million resources, I have 50 million distinct points. So I think the best way is to have a table "TABLE_POINTS" that will store all the points, and a table "TABLE_RESOURCES" that will store resources with a point_key.
So I need to fill "TABLE_POINTS" and "TABLE_RESOURCES" from the temporary table "TABLE_TEMP" without keeping duplicates.
For "POINTS" I did:
INSERT INTO TABLE_POINTS (point)
SELECT DISTINCT ST_SetSRID(ST_MakePoint(longi::double precision, lat::double precision), 4326)
FROM TABLE_RESOURCES
I don't remember how much time it took, but I think it was a matter of hours.
Then, to fill "RESOURCES", I tried:
INSERT INTO TABLE_RESOURCES (...,point_key)
SELECT DISTINCT ...,point_key
FROM TABLE_TEMP, TABLE_POINTS
WHERE ST_SetSRID(ST_MakePoint(longi::double precision,lat::double precision),4326) = point;
but again it takes days, and there is no way to see how far along the query is ...
Also, something important: the number of resources will continue to grow (currently about 100K are added per day), so storage should be optimized to keep fast access to the data.
So if you have any ideas for the loading or for optimizing the storage, you are welcome to share them.
Look into optimizing Postgres first (i.e. google the Postgres unlogged, WAL and fsync options). Second, do you really need the points to be unique? Maybe just have one table with resources and points combined and don't worry about duplicate points, as it seems your duplicate lookup may be what's slow.
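A rough sketch of that combined-table idea (table and column names are assumptions; UNLOGGED skips WAL during the bulk load but the table is not crash-safe until it is set back to LOGGED):
-- single table holding each resource together with its point; points are not deduplicated
CREATE UNLOGGED TABLE table_resources (
    id   bigserial PRIMARY KEY,
    name text,                      -- placeholder for the other resource columns
    geom geometry(Point, 4326)
);
-- bulk load with COPY, then build the spatial index and switch WAL logging back on
CREATE INDEX resources_geom_gist_idx ON table_resources USING gist (geom);
ALTER TABLE table_resources SET LOGGED;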
For DISTINCT to work efficiently, you'll need a database index on those columns for which you want to eliminate duplicates (e.g. on the latitude/longitude columns, or even on the set of all columns).
So first insert all data into your temp table, then CREATE INDEX (this is usually faster than creating the index beforehand, as maintaining it during insertion is costly), and only afterwards do the INSERT INTO ... SELECT DISTINCT.
An EXPLAIN <your query> can tell you whether the SELECT DISTINCT now uses the index.
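A minimal sketch of that order of operations (the column list, index name and CSV path are placeholders):
-- 1. bulk load first (hypothetical column list and file path)
COPY table_temp (longi, lat) FROM '/path/to/data.csv' WITH (FORMAT csv);
-- 2. build the index only after loading
CREATE INDEX table_temp_lat_long_idx ON table_temp (lat, longi);
-- 3. check whether the planner now uses the index for the deduplication
EXPLAIN
INSERT INTO table_points (point)
SELECT DISTINCT ST_SetSRID(ST_MakePoint(longi::double precision, lat::double precision), 4326)
FROM table_temp;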