How to debug indexed data with Thinking Sphinx - sphinx

I'm new to Sphinx and I'd like to see what's in technology_core, because my query is not returning what it should. Actually, it isn't even querying the DB for those two results.
Processing by TechnologiesController#index as JS
Parameters: {"utf8"=>"✓", "filter"=>{"tech_type"=>["1"], "area_of_interest"=>["4"]}, "query"=>"", "_"=>"1417734808773"}
Sphinx Query (1.4ms) SELECT *, groupby() AS sphinx_internal_group, id AS sphinx_document_id, count(DISTINCT sphinx_document_id) AS sphinx_internal_count FROM `technology_core` WHERE `tag_id` IN (1, 4) AND `sphinx_deleted` = 0 GROUP BY `technology_id` HAVING COUNT(*)=2 LIMIT 0, 20
Sphinx Found 2 results
Rendered collection (0.0ms)
Rendered technologies/index.js.erb (2.7ms)
Completed 200 OK in 15ms (Views: 12.7ms | ActiveRecord: 0.0ms)
I thought the data was stored in a new DB table, but there's no technology_core table created. Then I remembered there's a db/sphinx folder, so I guess the indexed data must be stored in there.
Is there a way to query that technology_core table?

You can connect to Sphinx via the mysql command-line tool:
mysql --host 127.0.0.1 --port 9306
And then run SphinxQL queries (SphinxQL is similar to, but not the same as, SQL - the Sphinx Query line in your log is a SphinxQL query). The Sphinx docs cover the syntax pretty well.
It's worth keeping in mind that SphinxQL queries will return all attribute values (well, unless you only request some attributes in your SELECT clause), but they will not return fields.
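For example, once connected you can poke at the index directly. The index and attribute names below (technology_core, tag_id, technology_id, sphinx_deleted) are taken from the log above; the second query just reproduces the filter Thinking Sphinx generated:
SELECT * FROM technology_core LIMIT 20;
SELECT * FROM technology_core WHERE tag_id IN (1, 4) AND sphinx_deleted = 0 GROUP BY technology_id;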

Related

Cloudant : query in http navigator

I'm using Cloudant with no auth and CORS enabled.
It works very well; limit and skip are working fine.
But I can't find how to search for something.
I'm trying to find a document where cp is 24000, for example with this query:
https://1c54473b-be6e-42d6-b914-d0ecae937981-bluemix.cloudant.com/etablissements/_all_docs?skip=0&limit=10&include_docs=true&q=cp:24000
But, the query doesn't return the right document.
I've also tried
https://1c54473b-be6e-42d6-b914-d0ecae937981-bluemix.cloudant.com/etablissements/_all_docs?skip=0&limit=10&include_docs=true&_search({'cp':24000})
with no luck.
Oh, and by the way, do you know if the jquery.couch.js lib has been discontinued? I can't even find it on GitHub, nor on my hard disk while I'm using foxant, and it is not in the directory either.
The /db/_all_docs endpoint hits the primary index of the database where all of the documents in the database can be found in _id order.
If you wish to query the database to get a subset of the data, you have three options:
Cloudant Query - hit the POST /db/_find endpoint, passing in a JSON object containing the selector which defines the query you wish to perform (like the WHERE clause of a SQL query), e.g. {selector: {cp: 24000}} - see the sketch after this list.
MapReduce - create a Map function in a design document that filters the documents you are interested in. It creates a materialized view that can be queried and filtered later, e.g. function(doc){ emit(doc.cp, null); } queried with ?key=24000.
Cloudant Search - this uses the Apache Lucene library to generate an index on the fields you specify. You can then query the index with q=cp:24000, which looks similar to the query you are trying to perform.
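For the Cloudant Query option, a request might look roughly like this (the URL is the one from your question; note that you may first need to create an index on cp via the POST /db/_index endpoint):
curl -X POST 'https://1c54473b-be6e-42d6-b914-d0ecae937981-bluemix.cloudant.com/etablissements/_find' \
  -H 'Content-Type: application/json' \
  -d '{"selector": {"cp": 24000}, "limit": 10}'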

Are Postgres WHERE clauses run sequentially?

I'm looking at using Postgres as a database to let our clients segment their customers.
The idea is that they can select a bunch of conditions in our front-end admin, and these conditions will get mapped to a SQL query. Right now, I'm thinking the best structure could be something like this:
SELECT DISTINCT id FROM users
WHERE id IN (
-- condition 1
)
AND id IN (
-- condition 2
)
AND id IN (
-- etc
)
Efficiency and query speed are super important to us, and I'm wondering if this is the best way of structuring things. When going through each of the WHERE clauses, will Postgres pass the id values from one to the next?
The ideal scenario would be, for a group of 1m users:
Query 1 filters down to 100k
Query 2 filters down from 100k to 10k
Query 3 filters down from 10k to 5k
As opposed to:
Query 1 filters from 1m to 100k
Query 2 filters down from 1m to 50k
Query 3 filters down from 1m to 80k
The results of all queries are then intersected, down to 5k
Maybe I'm misunderstanding something here, I'd love to get your thoughts!
Thanks!
Postgres uses a query planner to figure out how to most efficiently apply your query. It may reorder things or change how certain query operations (such as joins) are implemented, based on statistical information periodically collected in the background.
To determine how the query planner will structure a given query, stick EXPLAIN in front of it:
EXPLAIN SELECT DISTINCT id FROM users ...;
This will output the query plan for that query. Note that an empty table may get a totally different query plan from a table with (say) 10,000 rows, so be sure to test on real(istic) data.
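If you also want actual row counts and timings rather than just the plan, EXPLAIN ANALYZE executes the query and reports both. A sketch, with placeholder subqueries standing in for your generated conditions:
EXPLAIN ANALYZE
SELECT DISTINCT id FROM users
WHERE id IN (SELECT user_id FROM purchases WHERE total > 100)
  AND id IN (SELECT user_id FROM visits WHERE visited_at > now() - interval '30 days');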
Database engines are much more sophisticated than that.
The specific order of the conditions should not matter. The engine will take your query as a whole and try to figure out the best way to get the data according to all the conditions you specified, the indexes each table has, the number of records each condition will filter out, etc.
If you want to get an idea of how your query will actually be solved you can ask the engine to "explain" it for you: http://www.postgresql.org/docs/current/static/sql-explain.html
However, please note that understanding what that explanation means requires a fair amount of technical background on how DB engines actually work.

Sphinx search engine, realtime index, how to set sort_mode to SPH_SORT_TIME_SEGMENTS

Going through all the documentation and results from Google, I can't find or figure out how to set sort_mode when using realtime indexes.
I am simply using SphinxQL with pdo_mysql to connect to sphinx and running queries like:
SELECT item_id, item_type FROM my_index WHERE MATCH (:search_string) OPTION ...
Can I set a sort_mode? How?
The sort modes are not directly exposed in SphinxQL, but you can do everything they can do in other ways.
For time-segments see
http://sphinxsearch.com/blog/2010/06/27/doing-time-segments-geodistance-searches-and-overrides-in-sphinxql/
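The gist is that you compute the time segment as an expression and sort on it, then by relevance. A rough SphinxQL sketch, assuming the index has a timestamp attribute named created_at (the segment boundaries are just illustrative):
SELECT item_id, item_type,
  INTERVAL(created_at, NOW() - 30*86400, NOW() - 7*86400, NOW() - 86400, NOW() - 3600) AS time_seg
FROM my_index
WHERE MATCH(:search_string)
ORDER BY time_seg DESC, WEIGHT() DESC;
Depending on your Sphinx version, the relevance value is WEIGHT() or @weight.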

How to handle joins in MongoDB?

I have two tables in PostgreSQL:
urls (table with indexed pages, host is an indexed column, 30 million rows)
hosts (table with information about hosts, host is an indexed column, 1 million rows)
One of the most frequent SELECTs in my application is:
SELECT urls.*
FROM urls
JOIN hosts ON urls.host = hosts.host
WHERE urls.projects_id = ?
AND hosts.is_spam IS NULL
ORDER BY urls.id DESC LIMIT ?
In projects which have more than 100,000 rows in the urls table, the query executes very slowly.
As the tables have grown, the query has been executing slower and slower. I've read a lot about NoSQL databases (like MongoDB) which are designed to handle such big tables, and I'm considering moving my data to MongoDB. Everything would be easy if I didn't have to check the hosts table while selecting data from the urls table. I've heard that MongoDB doesn't support joins, so my question is how to solve the above problem. I could put the host information in the urls collection, but the hosts.is_spam field can be updated by a user, and I would then have to update the whole urls collection. I don't know if that is the right solution.
I would be grateful for any advice.
If you don't use joins, then relational DBs can also work pretty fast. I think this is a case where you need to denormalize for the sake of performance.
Option 1
Copy the is_spam column to the urls table. When this value changes for a host, update all related urls. This is okay if you don't do it too often.
Option 2
I don't know your app, but I assume that the number of spam hosts is relatively small. In this case, you could put their ids into an in-memory store (memcached, redis, ...), query all the urls and filter out spam urls in the app. This way your pagination gets a little broken, but sometimes it's a viable option.
You are correct in that the problem is the join, but my guess is that it's just the wrong kind of join. As Frank H. mentioned, PostgreSQL should be able to process this type of query rather handily depending on the frequency of hosts.is_spam. You probably want to cluster the urls table on id to optimize the order by-limit phase. Since you only care about urls.* you can minimize disk io by creating a partial index on hosts.host where is_spam is not null to make it easy to get just the short list of hosts to avoid.
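For reference, the supporting DDL might look roughly like this (the index name is made up, and CLUSTER assumes the primary-key index on urls.id is called urls_pkey):
CREATE INDEX hosts_spam_host_idx ON hosts (host) WHERE is_spam IS NOT NULL;
CLUSTER urls USING urls_pkey;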
Try this:
select urls.*
from urls
left join hosts
on urls.host = hosts.host
and hosts.is_spam is not null
where urls.projects_id = ?
and hosts.host is null
Or this:
select *
from urls
where urls.projects_id = ?
and not exists (
select 1
from hosts
where hosts.host = urls.host
and hosts.is_spam is not null
)
This will allow PostgreSQL to use an anti-join to pull only urls which are not mapped to a known spammy host. The results may be different from your query if there are urls with a null or invalid host.
It is true that MongoDB doesn't support joins. In a case like this, I would structure my urls collection like this
urls : {
name,
some_other_property,
host
}
You can then fetch the host for a particular URL, and check the is_spam field for it in the hosts collection. Note that this will need to be done by the client querying the DB and cannot be done at the DB itself like you would with a JOIN.
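A minimal sketch of that client-side check in the mongo shell (the name value is just a placeholder):
var url = db.urls.findOne({name: 'some url'});
var host = db.hosts.findOne({host: url.host});
var isSpamHost = (host !== null) && host.is_spam;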
Similar to the answer by @xbones, but with specific examples.
Putting a host_id field in your urls documents is one way to go. It requires that you first pull a result of url documents, then pull a result of spam hosts, and then filter locally in your client code.
Roughly:
var urls = db.urls.find({projects_id:'ID'}, {_id: 1, host_id: 1});
var hosts = db.hosts.find({is_spam: 1}, {_id: 1});
// keep only the ids of urls whose host is not in the spam list, then fetch the full documents
var spamIds = hosts.map(function(h) { return String(h._id); });
var ids_array = urls.toArray()
    .filter(function(u) { return spamIds.indexOf(String(u.host_id)) === -1; })
    .map(function(u) { return u._id; });
urls = db.urls.find({_id: {$in: ids_array}});
Or:
var urls = db.urls.find({projects_id:'ID'});
var hosts = db.hosts.find({is_spam: 1}, {_id: 1});
// filter out urls whose host appears in the spam host list
var spamIds = hosts.map(function(h) { return String(h._id); });
urls = urls.toArray().filter(function(u) { return spamIds.indexOf(String(u.host_id)) === -1; });
The first example assumes the result of the projects_id query could be huge (and your url documents are bigger), so you grab only the smallest amount of data possible, filter locally, and then batch-get the full final url documents.
The second example just gets the full url documents to start, and filters them down locally.

Sphinx deleted documents

I've had this problem for a long time and can't find a solution.
I guess this might be something everybody's faced using Sphinx, but I cannot get any useful information.
I have one index, and a delta.
I query both indexes in a PHP module, and then show the results.
For each ID in the result, I create an object for the model, and display the main data for that model.
I delete one document from the database, physically.
When I query the index, the ID for this deleted document is still there (in the Sphinx result set).
Maybe I can detect this in code and avoid showing it, but the result set Sphinx gives me is wrong: xxx total_found, when really it is xxx-1.
For example, Sphinx gives me the first 20 results, but one of these 20 results doesn't exist anymore, so I have to show only 19 results.
I re-index the main index once per day, and the delta index every 5 minutes.
Is there a solution for this??
Thanks in advance!!
What I've done in my Ruby Sphinx adapter, Thinking Sphinx, is to track when records are deleted, and update a boolean attribute for the records in the main index (I call it sphinx_deleted). Then, whenever I search, I filter on values where sphinx_deleted is 0. In the sql_query configuration, I have the explicit attribute as follows:
SELECT fields, more_fields, 0 as sphinx_deleted FROM table
And of course there's the attribute definition as well.
sql_attr_bool = sphinx_deleted
Keep in mind that these updates to attributes (using the Sphinx API) are only stored in memory - the underlying index files aren't changed, so if you restart Sphinx, you'll lose this knowledge, unless you do a full index as well.
This is a bit of work, but it will ensure your result count and pagination will work neatly.
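For reference, the filtering side of this in SphinxQL looks roughly like the following (index name, search terms and document id are placeholders; the attribute update can be issued over SphinxQL if your version supports updating bool attributes, otherwise use the UpdateAttributes API call as described above):
UPDATE my_index SET sphinx_deleted = 1 WHERE id = 123;
SELECT * FROM my_index WHERE MATCH('search terms') AND sphinx_deleted = 0;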
Maybe this fits my needs better, but it involves changing the database.
http://sphinxsearch.com/docs/current.html#conf-sql-query-killlist
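A sketch of what that looks like in the config, assuming you record deleted ids in a deleted_documents table (that table is the database change mentioned above):
source delta : main
{
    # everything else is inherited from the main source
    sql_query_killlist = SELECT id FROM deleted_documents
}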
I suppose you could ask for maybe 25 results from Sphinx, and then when you get the full data from your DB, just have a LIMIT 20 on the query.