Index single capital letters with Sphinx?

It appears Sphinx is not indexing single letters, especially 'a'. However, I need it to at the very least index a capital 'A', e.g. 'Class A'. My min_word_len = 1, but that still makes no difference. Is there a way to force it to index 'A'?

Related

Get Redis values while scanning

I've created a Redis key/value index this way:
set 7:12:321 '{"some":"JSON"}'
The key is delimited by a colon separator, each part of the key represents a hierarchic index.
get 7:12:321 means that I know the exact hierarchy and want only one single item
scan 7:12:* means that I want every item under id 7 in the first level of hierarchy and id 12 in the second layer of hierarchy.
Problem is: if I want the JSON values, I have to first scan (~50000 entries in a few ms) and then get every key returned by scan one by one (800ms).
This is not very efficient. And this is the only answer I found on stackoverflow searching "scanning Redis values".
1/ Is there another way of scanning Redis to get values or key/value pairs, and not only keys? I tried hscan as follows:
hset myindex 7:12:321 '{"some":"JSON"}'
hscan myindex MATCH 7:12:*
 
But it destroys the performance (almost 4s for the 50000 entries).
2/ Is there another data structure in Redis I could use in the same way, but which could "scan for values" (hset?)
3/ Should I go with another data storage solution (PostgreSQL ltree, for instance?) that suits my use case with good performance?
I must be missing something really obvious, because this sounds like a common use case.
Thanks for your answers.
Optimization for your current solution
Instead of getting every key returned by scan one by one, you should use mget to batch-get the values, or use a pipeline to reduce RTT.
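With the redis-py client, a minimal sketch of the mget variant could look like this (scan_iter and mget are standard redis-py calls; the match pattern is the one from the question, and a pipeline of GETs would work similarly):
import redis

r = redis.Redis(decode_responses=True)

# Collect the matching keys with SCAN, then fetch every value in a
# single round trip with MGET instead of one GET per key.
keys = list(r.scan_iter(match="7:12:*", count=1000))
if keys:
    values = r.mget(keys)
    pairs = dict(zip(keys, values))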
Efficiency problem of your current solution
The scan command iterates over all keys in the database, even if the number of keys that match the pattern is small. The performance decreases as the number of keys increases.
Another solution
Since the hierarchic index is an integer, you can encode the hierarchic indexes into a number, and use that number as the score of a sorted set. In this way, instead of searching by pattern, you can search by score range, which is very fast with a sorted set. Take the following as an example.
Say, the first (right-most) hierarchic index is less than 1000, the second index is less than 100, then you can encode the index (e.g. 7:12:321) into a score (321 + 12 * 1000 + 7 * 100 * 1000 = 712321). Then set the score and the value into a sorted set: zadd myindex 712321 '{"some:"JSON"}'.
When you want to search keys that match 7:12:*, just use zrangebyscore command to get data with a score between 712000 and 712999: zrangebyscore myindex 712000 712999 withscores.
In this way, you can get key (decoded with the returned score) and value together. Also it should be faster than the scan solution.
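A rough redis-py sketch of this idea (the encode helper is illustrative, and the limits of 1000 and 100 are the assumptions stated in the example above):
import redis

r = redis.Redis(decode_responses=True)

def encode(a, b, c):
    # 7:12:321 -> 7 * 100 * 1000 + 12 * 1000 + 321 = 712321
    return a * 100 * 1000 + b * 1000 + c

r.zadd("myindex", {'{"some":"JSON"}': encode(7, 12, 321)})

# Everything under 7:12:* has a score in [712000, 712999].
for value, score in r.zrangebyscore("myindex", 712000, 712999, withscores=True):
    print(int(score), value)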
UPDATE
The solution has a little problem: members of a sorted set must be unique, so you cannot have 2 keys with the same value (i.e. the same JSON string).
// insert OK
zadd myindex 712321 '{"the_same":"JSON"}'
// not added as a new member (members must be unique); ZADD only updates the existing member's score
zadd myindex 712322 '{"the_same":"JSON"}'
In order to solve this problem, you can combine the key with the json string to make it unique:
zadd myindex 712321 '7:12:321-{"the_same":"JSON"}'
zadd myindex 712322 '7:12:322-{"the_same":"JSON"}'
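Reading these back is then just a matter of splitting each member on the first '-' (same sketch assumptions as above; the keys themselves contain no '-', so the first one is always the separator):
# Each member is '<key>-<json>'; split it back apart after the range query.
for member, score in r.zrangebyscore("myindex", 712000, 712999, withscores=True):
    key, _, json_str = member.partition("-")
    print(key, json_str)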
You could consider using a Sorted Set and lexicographical ranges as long as you only need to perform prefix searches. For more information about this and indexing in general refer to http://redis.io/topics/indexes
Updated with an example:
Consider the following -
$ redis-cli
127.0.0.1:6379> ZADD anotherindex 0 '7:12:321:{"some":"JSON"}'
(integer) 1
127.0.0.1:6379> ZRANGEBYLEX anotherindex [7:12: [7:12:\xff
1) "7:12:321:{\"some:\"JSON\"}"
Now go and read about this so you 1) understand what it does and 2) know how to avoid possible pitfalls :)
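For completeness, the same lexicographical query from redis-py might look like this (byte strings are used for the range bounds so the \xff sentinel survives untouched):
import redis

r = redis.Redis()

r.zadd("anotherindex", {'7:12:321:{"some":"JSON"}': 0})

# All members whose string starts with the prefix "7:12:"
for member in r.zrangebylex("anotherindex", b"[7:12:", b"[7:12:\xff"):
    print(member.decode())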

What does the digit "1" mean when creating indexes in mongodb

I am new to mongodb and want to create indexes for a specific collection. I have seen people use the digit "1" next to the field name when they want to create an index, for example:
db.users.ensureIndex({user_name: 1})
Now I want to know: what does this digit mean, and is it necessary to use it?
It's the type of index. MongoDB supports different kinds of indexes. However, only the first two kinds below can be combined into a compound index.
1: Ascending B-tree index.
-1: Descending B-tree index. Very similar to the ascending index, but the difference can matter for the behavior of compound indexes.
"hashed": A hashtable index. Very fast for lookup by exact value, especially in very large collections. But not usable for inexact queries ($gt, $regex or similar).
"text": A text index designed for searching for words in strings with natural language.
"2d": A geospatial index on a flat plane
"2dsphere": A geospatial index on a sphere
For more information, see the documentation of index types.
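As a brief pymongo illustration of the same values (ensureIndex is the legacy shell name; pymongo calls it create_index, and the collection and field names here are just placeholders):
import pymongo

client = pymongo.MongoClient()
db = client.test

db.users.create_index([("user_name", pymongo.ASCENDING)])    # same as 1
db.users.create_index([("created_at", pymongo.DESCENDING)])  # same as -1
# Compound index built from the first two kinds:
db.users.create_index([("user_name", 1), ("created_at", -1)])
db.users.create_index([("email", pymongo.HASHED)])           # "hashed"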
It defines the index type on that specific field. For example, the value 1 creates an index in ascending order, while the value -1 creates the index in descending order.
For more information, see the Manual

Checking position of an entry in an index MongoDB

I have a query using pymongo that is outputting some values based on the following:
cursor = db.collect.find({"index_field": {"$regex": r'\s'}})
for document in cursor:
    print document["_id"]
Now this query has been running for a long time (it's over 500 million documents), as I expected. I was wondering, though, if there is a way to check where the query is in its execution, perhaps by finding out where the last printed "_id" is in the indexed field. Is the last printed _id halfway through the btree index? Is it near the end?
I want to know this just to see whether I should cancel the query and reoptimize, or let it finish, but I have no way of knowing where the _id sits in the index.
Also, if anyone has a way to optimize my whitespace query, that would be helpful too. Based on the docs, it seems that if I had used ignorecase it would have been faster, although that doesn't make sense for whitespace checking.
Thanks so much,
J
Query optimization
Your query cannot be optimized, because it's an inefficient $regex search that looks for a whitespace character \s anywhere in the field. What you can do is anchor the $regex so it matches \s as a prefix, e.g.
db.collect.find({"index_field": {"$regex": '^\\s'}})
Check out the notes in the link
Indexing problem
$regex can only use an index efficiently when the regular expression has an anchor for the beginning (i.e. ^) of a string and is a case-sensitive match. Additionally, while /^a/, /^a.*/, and /^a.*$/ match equivalent strings, they have different performance characteristics. All of these expressions use an index if an appropriate index exists; however, /^a.*/ and /^a.*$/ are slower. /^a/ can stop scanning after matching the prefix.
DB op's info
Use db.currentOp() to get info on all of your running ops.
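From pymongo you can run the underlying admin command on reasonably recent servers; something along these lines should show how long your find has been running (currentOp is a real server command, and the field names below come from its standard output):
import pymongo

client = pymongo.MongoClient()

# Every entry in "inprog" describes one running operation, including
# how many seconds it has been running and the namespace it targets.
for op in client.admin.command("currentOp")["inprog"]:
    print(op.get("opid"), op.get("secs_running"), op.get("ns"))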

Is there a way to index in postgres for fast substring searches

I have a database and want to be able to look up in a table a search that's something like:
select * from table where column like "abc%def%ghi"
or
select * from table where column like "%def%ghi"
Is there a way to index the column so that this isn't too slow?
Edit:
I should also clarify that the database is read-only and won't be updated often.
Options for text search and indexing include:
full-text indexing with dictionary-based search, including support for prefix search, e.g. to_tsvector(mycol) @@ to_tsquery('search:*')
text_pattern_ops indexes to support prefix string matches, e.g. LIKE 'abc%', but not infix searches like '%blah%'. A reverse()d index may be used for suffix searching.
pg_trgm trigram indexes on newer versions, as demonstrated in this recent dba.stackexchange.com post.
An external search and indexing tool like Apache Solr.
From the minimal information given above, I'd say that only a trigram index will be able to help you, since you're doing infix searches on a string and not looking for dictionary words. Unfortunately, trigram indexes are huge and rather inefficient; don't expect some kind of magical performance boost, and keep in mind that they take a lot of work for the database engine to build and keep up to date.
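A sketch of the trigram route, wrapped in psycopg2 just to keep it runnable (mytable and col are placeholder names, and the pg_trgm extension must be available on the server):
import psycopg2

conn = psycopg2.connect("dbname=test")
cur = conn.cursor()

# One-time setup: enable the extension and build a GIN trigram index.
cur.execute("CREATE EXTENSION IF NOT EXISTS pg_trgm")
cur.execute("CREATE INDEX mytable_col_trgm ON mytable USING gin (col gin_trgm_ops)")
conn.commit()

# Infix LIKE patterns such as '%def%ghi' can now use the index.
cur.execute("SELECT * FROM mytable WHERE col LIKE %s", ("%def%ghi",))
rows = cur.fetchall()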
If you just need to, for instance, match a fixed-position substring across an entire table, you can create a substring index:
CREATE INDEX i_test_sbstr ON tablename (substring(columnname, 5, 3));
-- start at position 5, go for 3 characters
It is important that the substring() parameters in the index definition are the same as the ones you use in your query.
ref: http://www.postgresql.org/message-id/BANLkTinjUhGMc985QhDHKunHadM0MsGhjg@mail.gmail.com
For the like operator, use one of the operator classes varchar_pattern_ops or text_pattern_ops:
create index test_index on test_table (col varchar_pattern_ops);
That will only work if the pattern does not start with a %, in which case another strategy is required.

Thinking Sphinx fuzzy search?

I am implementing Sphinx search in my Rails application.
I want to search with fuzzy matching on. It should handle spelling mistakes, e.g. if I enter the search query 'charactaristics', it should search for 'characteristics'.
How should I implement this?
Sphinx doesn't naturally allow for spelling mistakes - it doesn't care whether the words are spelled correctly or not; it just indexes them and matches them.
There are two options around this: either use thinking-sphinx-raspell to catch spelling errors made by users when they search, and offer them the choice to search again with an improved query (much like Google does); or use the soundex or metaphone morphologies so words are indexed in a way that accounts for how they sound. Search on this page for stemming and you'll find the relevant section. Also have a read of Sphinx's documentation on the matter as well.
I've no idea how reliable either option would be - personally, I'd opt for #1.
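If you go the morphology route, the setting would live in your Thinking Sphinx configuration; something along these lines (assuming a config/thinking_sphinx.yml per environment as in the enable_star example below, with soundex as the chosen morphology):
development:
  morphology: soundex
# ... or metaphone; repeat for other environments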
By default, Sphinx does not pay any attention to wildcard searching using an asterisk character. You can turn it on, though:
development:
  enable_star: true
# ... repeat for other environments
See the Wildcard/Star Syntax section of http://pat.github.io/thinking-sphinx/advanced_config.html.
Yes, Sphinx generally uses the extended match modes.
There are the following matching modes available:
SPH_MATCH_ALL, matches all query words (default mode);
SPH_MATCH_ANY, matches any of the query words;
SPH_MATCH_PHRASE, matches query as a phrase, requiring perfect match;
SPH_MATCH_BOOLEAN, matches query as a boolean expression (see Section 5.2, “Boolean query syntax”);
SPH_MATCH_EXTENDED, matches query as an expression in Sphinx internal query language (see Section 5.3, “Extended query syntax”);
SPH_MATCH_EXTENDED2, an alias for SPH_MATCH_EXTENDED;
SPH_MATCH_FULLSCAN, matches query, forcibly using the "full scan" mode as below. NB, any query terms will be ignored, such that filters, filter-ranges and grouping will still be applied, but no text-matching.
SPH_MATCH_EXTENDED2 was used during 0.9.8 and 0.9.9 development cycle, when the internal matching engine was being rewritten (for the sake of additional functionality and better performance). By 0.9.9-release, the older version was removed, and SPH_MATCH_EXTENDED and SPH_MATCH_EXTENDED2 are now just aliases.
enable_star
Enables star-syntax (or wildcard syntax) when searching through prefix/infix indexes. Optional, default is 0 (do not use wildcard syntax), for compatibility with 0.9.7. Known values are 0 and 1.
For example, assume that the index was built with infixes and that enable_star is 1. Searching should work as follows:
"abcdef" query will match only those documents that contain the exact "abcdef" word in them.
"abc*" query will match those documents that contain any words starting with "abc" (including the documents which contain the exact "abc" word only);
"*cde*" query will match those documents that contain any words which have "cde" characters in any part of the word (including the documents which contain the exact "cde" word only).
"*def" query will match those documents that contain any words ending with "def" (including the documents that contain the exact "def" word only).
Example:
enable_star = 1