I have had this problem for a long time and can't find a solution. I guess this might be something everybody faces when using Sphinx, but I cannot find any useful information.
I have one main index and a delta index.
In a PHP module I query both indexes and then show the results.
For each ID in the result set, I create an object for the model and display the main data for that model.
I physically delete one document from the database.
When I query the index, the ID of this deleted document is still there (in the Sphinx result set).
Maybe I can detect this in code and avoid showing it, but the result set Sphinx gives me is still wrong: total_found says xxx when it is really xxx-1.
For example, Sphinx gives me the first 20 results, but one of those 20 results doesn't exist anymore, so I can only show 19 results.
I re-index the main index once per day and the delta index every 5 minutes.
Is there a solution for this?
Thanks in advance!
What I've done in my Ruby Sphinx adapter, Thinking Sphinx, is to track when records are deleted, and update a boolean attribute for the records in the main index (I call it sphinx_deleted). Then, whenever I search, I filter on values where sphinx_deleted is 0. In the sql_query configuration, I have the explicit attribute as follows:
SELECT fields, more_fields, 0 as sphinx_deleted FROM table
And of course there's the attribute definition as well.
sql_attr_bool = sphinx_deleted
Keep in mind that these updates to attributes (using the Sphinx API) are only stored in memory - the underlying index files aren't changed, so if you restart Sphinx, you'll lose this knowledge, unless you do a full index as well.
This is a bit of work, but it will ensure your result count and pagination will work neatly.
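If you are driving Sphinx from PHP directly (rather than through Thinking Sphinx), the same idea would look roughly like this with the official SphinxClient API. This is only a sketch under assumptions: the index names ('main', 'delta') and the $deletedId variable are placeholders, not anything from your setup.

require_once 'sphinxapi.php';

$cl = new SphinxClient();
$cl->SetServer('localhost', 9312);

// When a row is deleted from the database, flag it in the index (in memory only):
$cl->UpdateAttributes('main', array('sphinx_deleted'), array($deletedId => array(1)));
// Repeat for the delta index if the document may live there as well.

// At search time, hide everything flagged as deleted:
$cl->SetFilter('sphinx_deleted', array(0));
$result = $cl->Query('search terms', 'main delta');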
Maybe this fits my needs better, but it involves changing the database:
http://sphinxsearch.com/docs/current.html#conf-sql-query-killlist
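The idea behind that option is that the delta index carries a kill-list of document IDs to suppress from the main index at search time. A minimal sketch, assuming you also add a small table (here called sph_deleted, a made-up name) where you record the IDs of deleted rows:

source delta : main
{
    sql_query          = SELECT id, title, content FROM documents WHERE ...
    # IDs returned here are hidden from the earlier (main) index at search time
    sql_query_killlist = SELECT doc_id FROM sph_deleted
}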
I suppose you could ask Sphinx for, say, 25 results, and then when you fetch the full data from your DB just put a LIMIT 20 on the query.
Related
I'm working on my app and I just ran into a dilemma regarding the best way to handle indexes for Firestore.
I have a query that searches for publications in a specific community that contain at least one of the given tags and fall within a geohash range. The index for that query looks like this:
community Ascending, tag Ascending, location.geohash Ascending
Now, if my user doesn't need to filter by tag, I run the query without the arrayContains(tag) clause, which prompts me to create another index:
community Ascending, location.geohash Ascending
My question is: is it better to create that second index, or to just use the first one and specify all possible tags in arrayContains when the user doesn't want to filter by tag?
Neither is inherently better; it's a typical space vs. time tradeoff.
Adding the extra tags to the query adds some overhead there, but it saves you the (storage) cost of the additional index. So you're trading a small amount of runtime performance for a small amount of space/cost savings.
One thing to check is whether the query with tags can actually run on just the second index, as Firestore may be able to do a zigzag merge join. In that case you could keep only the second, smaller index, save the overhead of adding the extra clauses when no tag filter is wanted, and accept a (similarly small) performance difference on the query where you do specify one or more tags.
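To make the two query shapes concrete, here is a sketch with the Firestore web SDK (v9 modular). The collection and field names (publications, community, tag, location.geohash), the db handle, the geohash bounds and the tag variables are all assumptions based on the question, not confirmed details.

import { collection, query, where, orderBy, startAt, endAt, getDocs } from 'firebase/firestore';

// Shared constraints: community filter plus the geohash range (range via cursors on the orderBy field)
const base = [
  where('community', '==', communityId),
  orderBy('location.geohash'),
  startAt(hashStart),
  endAt(hashEnd),
];

// With a tag filter -> needs the community + tag + location.geohash index
const withTag = query(collection(db, 'publications'),
  where('tag', 'array-contains', selectedTag), ...base);

// Without a tag filter -> prompts Firestore to ask for the smaller community + location.geohash index
const withoutTag = query(collection(db, 'publications'), ...base);

const snap = await getDocs(withTag);

// Note: matching "any of several tags" would need 'array-contains-any', which caps how many
// values you can pass, so listing every possible tag may not scale.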
I'm having trouble using the text search and the autocomplete because I have a piece type with 87k+ documents, some of them big (~3.4 MB of text).
I already:
Removed every field from the text index except title, searchBoost and seoDescription; these are the only fields copied to highSearchText, and the field lowSearchText is always set to an empty string.
Modified the standard text index, adding the fields type, published and trash at the beginning of it (see the index sketch after this list). I also modified the queries to use equality conditions on these fields. The result returned by the command db.aposDocs.stats() shows:
type_1_published_1_trash_1_highSearchText_text_lowSearchText_text_title_text_searchBoost_text: 12201984 (~11 MB, fits nicely in memory)
Verified that this index is being used, both in the 'toDistinct' query and in the final 'toArray' query.
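Based on the index name reported by db.aposDocs.stats(), the modified text index would be created roughly like this (a sketch; language and weight options are omitted and may differ in the real Apostrophe setup):

db.aposDocs.createIndex({
  type: 1,
  published: 1,
  trash: 1,
  highSearchText: 'text',
  lowSearchText: 'text',
  title: 'text',
  searchBoost: 'text'
})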
What I think is the biggest problem
The documents have many repeated words in the title, so if the user types a word present in 5k document titles, the server suffers.
Idea I'm testing
The MongoDB docs say that, to improve performance, the entire collection must fit in RAM (https://docs.mongodb.com/manual/core/index-text/#storage-requirements-and-performance-costs, last bullet).
So I created a separate collection named "search" with just the fields highSearchText (string, indexed as text) and highSearchWords (array, also indexed), which results in a total size of ~19 MB.
By doing the same operations as the standard Apostrophe autocomplete on this collection, I achieved much faster, but similar, results.
I had to write events to automatically update the search collection when a piece changes, but it seems to work so far.
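For clarity, the side collection and its indexes would look roughly like this (a sketch; the _id mirroring and the example values are assumptions based on the description above):

// One small document per piece, kept in sync by the update events
db.search.insertOne({
  _id: pieceId,                       // same _id as the piece in aposDocs (assumed)
  highSearchText: 'plain text used for full-text search',
  highSearchWords: ['plain', 'text', 'used', 'for', 'full', 'search']
})

db.search.createIndex({ highSearchText: 'text' })   // text search
db.search.createIndex({ highSearchWords: 1 })       // prefix matches for autocomplete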
Issues
I'm testing this search collection with the autocomplete. For the simple text search, I'm just limiting the sorted response to 50 results. Maybe I'll have to use the search collection there as well, because the search could still break.
Is there an easier approach I'm missing? Any ideas are welcome.
Is there a simple or elegant method (or a query I can write) to retrieve the timestamp of the last updated document in a collection? I can write a query like this to find the last inserted document:
db.collection.find().limit(1).sort({$natural:-1})
but I need information about the last updated document (it could be an insert or an update).
I know that one way is to query the oplog for the last record touching this collection, but that seems like an expensive operation given that the oplog can be very large (and it is not trustworthy either, since it is a capped collection). Is there a better way to do this?
Thanks!
You could get the last insert time the same way you mentioned in the question:
db.collection.find().sort({'_id': -1}).limit(1)
But there isn't any good way to see the last update/delete time. If you are using replica sets, you could get that from the oplog.
Or you could add a new field, such as 'lastModified', to each document.
You can also check out collection-hooks. I hope this helps.
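If you do go the oplog route on a replica set, the lookup would look something like this (a sketch; 'mydb.mycollection' is a placeholder namespace):

db.getSiblingDB('local').oplog.rs.find(
  { ns: 'mydb.mycollection', op: { $in: ['i', 'u', 'd'] } }  // inserts, updates, deletes
).sort({ $natural: -1 }).limit(1)
// The 'ts' field of the returned entry is the time of the last write.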
One way to go about it is to have a field that holds the time of last update. You can name it updatedAt. Every time you make an update to the document, you'll just update the value to the current time. If you use the ISO format to store the time, you'll be able to sort without issues (that's what I use).
The other way is the _id field.
Method 1
db.collection.find().limit(1).sort({updatedAt: -1})
Method 2
db.collection.find().limit(1).sort({_id: -1})
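For Method 1 to work, every write has to refresh the field. A minimal sketch (the filter and the field being set are examples only):

// $currentDate sets updatedAt to the server's current time on every update
db.collection.updateOne(
  { _id: someId },
  { $set: { title: 'new title' }, $currentDate: { updatedAt: true } }
)
// New documents should likewise be inserted with updatedAt: new Date()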
You can try:
db.collection.find().sort({ $natural: -1 }).limit(1);
I am testing a small example for a sharded setup and I notice that updating an embedded field is slower when the search fields are indexed.
I know that indexes are updated during inserts, but are the indexes used by the update's query also updated?
The query for the update and the fields that are updated are not related in any manner.
e.g. (tested with toy data):
{
  id: ...,                    // sharded on the id
  embedded: [
    {
      a: ..., b: ..., c: ..., // indexed on a, b, c
      data: ...               // data is what gets updated
    },
    ...
  ]
}
In the example above, the query for the update is on a, b, c, and the values being updated affect only data.
The only reason I can think of is that indexes are updated even if the updates do not touch the indexed fields. The search part of the update does seem to use the indexes, judging from an equivalent "find" query run with explain.
Could there be another reason?
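For concreteness, the kind of update being described presumably looks something like this (a hedged sketch reusing the toy schema above; the collection name and values are made up):

// Match one embedded element by its indexed keys, change only its non-indexed data field
db.coll.updateOne(
  { embedded: { $elemMatch: { a: 1, b: 2, c: 3 } } },
  { $set: { 'embedded.$.data': 'new value' } }
)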
I think wdberkeley, in the comments, gives the best explanation.
The document moves because it grows larger, and the indexes have to be updated every time that happens.
As he also notes, updating multiple keys is "bad"... I think I will avoid this design for now.
I have a collection of documents (named posts) that each contain a field named category.
Each category is part of a categories collection. There is a fixed number of them (say 15).
How do I fetch the last 10 tldrs from each category?
Another solution would be to set a "flag" in each post which is actually part of the result, like:
topTen: true
Defining a sparse index on that flag would give the fastest query (as sketched below) - at the price, of course, of the maintenance of that flag:
set the flag at insertion time (impact: one more index to update)
if it is tolerable that for a certain period the query returns 11 posts instead of 10, then trigger a background process that deletes (unsets) the 11th flag for that category
if it is not tolerable, find and unset the 11th flag at insert time
if the category of an existing post is altered, make sure the flags get set all right (for the old and the new category)
if a post gets removed that has the flag set: find and set the flag for the new 10th post
maybe you'd want to provide a periodically run process that makes sure the flags are all set as they should be
For more information on sparse indexes, see http://docs.mongodb.org/manual/core/indexes/#index-type-sparse
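Concretely, the index and the query might look like this (a sketch; the posts collection, the topTen flag name and the category variable are assumptions):

// Sparse: only the few documents that actually carry the flag end up in the index
db.posts.createIndex({ topTen: 1 }, { sparse: true })

// Fetch the current top ten of one category via the flag
db.posts.find({ topTen: true, category: someCategoryId }).sort({ _id: -1 })  // or sort on whatever field defines "last"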
It will probably be easier to just first get the list of all categories and then, for each of them, fetch its 10 latest posts with a separate query.
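In the shell, that per-category approach could look like this (a sketch; it assumes posts reference the category by its _id and that "latest" means insertion order):

db.categories.find().forEach(function (cat) {
  var latest = db.posts.find({ category: cat._id })
    .sort({ _id: -1 })   // newest first by insertion order
    .limit(10)
    .toArray();
  // ...do something with `latest` for this category
});

With only ~15 categories this stays a handful of cheap queries, especially with an index on { category: 1, _id: -1 } or similar.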