New to Sphinx - text analysis and search result relevance

Being new to Sphinx, please excuse any mistakes or misused terms.
Sphinx is being used in a web-based app with a database of millions of records to provide full-text search functionality.
For English content stored in the database, the search results are accurate and relevant to the search keywords. The same does not happen with non-Latin characters. I had a look at the morphology configuration setting, but Greek is not available as an option. As a result, for Greek keywords the search results are not always as relevant as they are for English keywords.
Does Sphinx perform the same text analysis and indexing on Greek content as it does on English content?
Any information (links, comments, answers) would be helpful.
thanks,

This is most likely affected by the charset_type and charset_table config options.
http://sphinxsearch.com/docs/current.html#conf-charset-type
http://sphinxsearch.com/docs/current.html#conf-charset-table
Out of the box, Sphinx is only really set up for English and Russian (the languages the primary Sphinx developer happens to speak :)
So you will need to enable UTF-8 mode and add the required Greek characters to the charset_table.
The Sphinx wiki
http://sphinxsearch.com/wiki/doku.php?id=charset_tables
has a set of Greek config options you can copy/paste.
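For reference, a minimal sphinx.conf sketch of what that could look like (the index name is made up, and the table below covers only the unaccented Greek letters; the wiki page has the full set, including the accented forms):

    index my_greek_index
    {
        # ... source, path, etc. as usual ...
        charset_type  = utf-8
        # Digits, Latin letters folded to lowercase, and Greek letters folded
        # to lowercase. U+03A2 is an unassigned code point, which is why the
        # uppercase letters are folded in two separate ranges.
        charset_table = 0..9, A..Z->a..z, _, a..z, \
            U+391..U+3A1->U+3B1..U+3C1, U+3A3..U+3A9->U+3C3..U+3C9, \
            U+3B1..U+3C9
    }

Depending on your Sphinx version, charset_type may already default to (or only support) UTF-8; the charset_table is the part that actually tells Sphinx which Greek characters count as word characters.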

Related

Word database for iOS custom keyboard

I am writing a custom keyboard for an Asian language and I have a word database with over half a million words. I use Realm for now and use it to provide word suggestions: when users type the first few letters, the keyboard searches the DB and suggests words based on priority values given to each word. But this seems inefficient compared to other keyboards in the App Store, and I can't find any concrete approach to this issue. Can anyone point me in a direction to increase the efficiency of word searching with a custom iOS keyboard?
I haven't tried Core Data, but Realm is generally considered faster than Core Data.
First, the type of storage: plist files or plain text files wouldn't be a bad choice,
and saving the words pre-sorted (in ASCII/byte order) would be great.
Second, you need an algorithm that can jump to the right group of words quickly; you can do this by exploiting that sorted order with a binary search.
Here is an example of a binary search algorithm:
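(A sketch in Python for illustration; a Swift version would have the same shape, and the names here are made up.) The idea is to binary-search the sorted word list for the typed prefix, take the run of words that share that prefix, and only then re-rank those few candidates by their priority values:

    import bisect

    def words_with_prefix(sorted_words, prefix, limit=10):
        # Binary-search for the first word >= prefix, then walk forward while
        # the prefix still matches: O(log n + k) instead of scanning everything.
        start = bisect.bisect_left(sorted_words, prefix)
        matches = []
        i = start
        while i < len(sorted_words) and sorted_words[i].startswith(prefix):
            matches.append(sorted_words[i])
            if len(matches) == limit:
                break
            i += 1
        return matches

    # Example: suggestions for the prefix "ca"; in the real keyboard these
    # candidates would then be re-ranked by each word's priority value.
    dictionary = sorted(["cab", "cable", "car", "care", "cat", "dog", "door"])
    print(words_with_prefix(dictionary, "ca"))   # ['cab', 'cable', 'car', 'care', 'cat']
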
Please also search around for different algorithms.

Autocorrect a document corpus

I have an approximately 6 GB document corpus of mostly user-generated content from mobile platforms. Due to the nature of its origin, this corpus is rife with misspelled, abbreviated and truncated words. Is there a way I could autocorrect these words to the nearest English-language word?
This might be fun to look at, seeing that you tagged your question with machine learning:
http://norvig.com/spell-correct.html
It's a fascinating read. On the other hand, if you are not looking to tinker, a better option might be Enchant; have a look at
https://pypi.org/project/pyenchant/
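As a rough sketch of the Enchant route with pyenchant (assuming the corpus has already been split into tokens; the helper below is mine, not part of the library):

    import enchant

    def autocorrect_tokens(tokens, lang="en_US"):
        # Replace tokens the dictionary doesn't know with its top suggestion.
        d = enchant.Dict(lang)
        corrected = []
        for tok in tokens:
            if not tok.isalpha() or d.check(tok):
                corrected.append(tok)          # numbers/punctuation or already valid
                continue
            suggestions = d.suggest(tok)       # ranked candidates, closest first
            corrected.append(suggestions[0] if suggestions else tok)
        return corrected

    print(autocorrect_tokens("thiss is a smal test".split()))
    # e.g. ['this', 'is', 'a', 'small', 'test']; the exact suggestions depend
    # on the installed dictionary backend

Note that a plain spell checker won't expand abbreviations or domain slang, so you may still need a custom replacement list for those.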

Postgresql Misspelling in Full Text Search

I'm using PostgreSQL to perform full-text search and I am finding that users will not receive results if there are misspellings.
What is the best way to handle misspelt words in Postgres full text search?
Take a look at the pg_similarity extension, which adds a lot of similarity operators and functions to PostgreSQL. It will allow you to add some forgiveness into your queries easily enough.
Typing "spelling correction postgresql fts" into Google, the top result is a page that covers exactly this topic.
It suggests keeping a separate table of all the valid words in your database and running search terms against that table to suggest corrections. Trigram matching lets you measure how "similar" the real words in your table are to the supplied search terms.
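As a sketch of that approach using the built-in pg_trgm extension (the documents.body column, the search_words table and the connection string are made-up names; psycopg2 is used only to show the SQL in context):

    import psycopg2

    conn = psycopg2.connect("dbname=mydb")          # illustrative DSN
    cur = conn.cursor()

    # One-time setup: collect every distinct word from the indexed documents
    # into a side table and index it for trigram lookups.
    cur.execute("CREATE EXTENSION IF NOT EXISTS pg_trgm")
    cur.execute("""
        CREATE TABLE search_words AS
        SELECT word FROM ts_stat('SELECT to_tsvector(''simple'', body) FROM documents')
    """)
    cur.execute("CREATE INDEX search_words_trgm_idx "
                "ON search_words USING gin (word gin_trgm_ops)")
    conn.commit()

    def suggest(cur, term, limit=5):
        # %% escapes pg_trgm's % (similarity) operator for psycopg2
        cur.execute("""
            SELECT word, similarity(word, %s) AS sml
            FROM search_words
            WHERE word %% %s
            ORDER BY sml DESC
            LIMIT %s
        """, (term, term, limit))
        return cur.fetchall()

    print(suggest(cur, "postgre"))   # closest real words to the misspelt term

The suggestions can then be offered to the user ("did you mean ...?") or fed straight back into the normal full-text query.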

How to solve this Arabic language problem in Sybase PowerBuilder 6 and 7?

How can I view Arabic characters correctly in Sybase PowerBuilder 6 or 7? I set Arial (Arabic), or any Arabic language, in the properties of the table and the database, but it shows the characters as strange, meaningless symbols like ÓíÇÑÉ ÕÛíÑÉ.
I'm no expert in dealing with Arabic language characters, so there may be a workaround with ANSI code pages, but I'd expect your best solution to be Unicode. There was a distinct version of PB6 supporting Unicode (i.e. a separate product), but it was discontinued in PB6 and there was no Unicode support until it was integrated into the primary product in PB10. However, unless you have the PB6/Unicode product on hand, or you need Win9x support or some other old platform support, I'd recommend moving to something more current, like the just-released PB 12.5. Not only will you get Unicode, but also a lot of features that will help your application look more up to date and integrate better with modern services. (See http://www.techno-kitten.com/Changes_to_PowerBuilder/changes_to_powerbuilder.html for a list that at the moment is a little out of date, but will get the majority of what you're after.)
Good luck,
Terry.
This problem is called Mojibake and it's due to the PowerBuilder client and the database using different character encodings. This problem is frequently encountered on the web, and also in email. As Terry suggested, you would get the best results using Unicode in the database and PowerBuilder. If that's not possible, you have to use the same code page on the PowerBuilder client as in the database. A complicating issue is that it sounds like you have existing data. If you want to switch encoding you would have to convert the existing data to the new encoding.
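For example, the strange symbols in the question are what you get when Arabic text stored with the Arabic ANSI code page (Windows-1256) is displayed with the Western one (Windows-1252). A quick Python illustration:

    # Arabic for "small car", stored as Windows-1256 but rendered as Windows-1252:
    original = "سيارة صغيرة"
    garbled = original.encode("cp1256").decode("cp1252")
    print(garbled)                                    # ÓíÇÑÉ ÕÛíÑÉ  (as in the question)
    # Reversing the two steps recovers the original text:
    print(garbled.encode("cp1252").decode("cp1256"))  # سيارة صغيرة

That round-trip is why the display is unreadable but the data itself is usually not lost; fixing the encoding mismatch (or moving everything to Unicode) fixes the display.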

How do I implement full text search in Chinese on PostgreSQL?

This question has been asked before:
Postgresql full text search in postgresql - japanese, chinese, arabic
but there are no answers for Chinese as far as I can see. I took a look at the OpenOffice wiki, and it doesn't have a dictionary for Chinese.
Edit: As we are already successfully using PG's internal FTS engine for English documents, we don't want to move to an external indexing engine. Basically, what I'm looking for is a Chinese FTS configuration, including parser and dictionaries for Simplified Chinese (Mandarin).
I know it's an old question but there's a Postgres extension for Chinese: https://github.com/amutu/zhparser/
I've just implemented a Chinese FTS solution in PostgreSQL. I did it by creating n-gram tokens from the Chinese input and building the necessary tsvectors with an embedded function (in my case I used plpythonu). It works very well (massively preferable to moving to SQL Server!).
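The embedded function itself isn't shown above, but the n-gram idea is simple; a minimal sketch in plain Python (names are illustrative):

    def cjk_bigrams(text):
        # Overlapping two-character tokens; a single leftover character is kept
        # as-is so one-character words remain searchable.
        chars = [c for c in text if not c.isspace()]
        if len(chars) < 2:
            return ["".join(chars)] if chars else []
        return [chars[i] + chars[i + 1] for i in range(len(chars) - 1)]

    print(cjk_bigrams("全文检索"))        # ['全文', '文检', '检索']

    # Inside PostgreSQL the same logic can sit in a PL/Python function; its
    # output, joined with spaces, is passed to the 'simple' configuration:
    #   SELECT to_tsvector('simple', '全文 文检 检索');
    # Queries are tokenised the same way, so matching becomes an ordinary
    # tsquery lookup over the bigrams.
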
Index your data with Solr; it's an open-source enterprise search server built on top of Lucene.
You can find more info on Solr here:
http://lucene.apache.org/solr/
A good how-to book (with immediate PDF download) is available here:
https://www.packtpub.com/solr-1-4-enterprise-search-server/book
And be sure to use a Chinese tokenizer, such as solr.ChineseTokenizerFactory, because Chinese is not whitespace-delimited.
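For example, in a Solr 1.4-era schema.xml a Chinese field type could be declared roughly like this (the field type name is made up, and later Solr releases replaced this tokenizer):

    <fieldType name="text_zh" class="solr.TextField">
      <analyzer>
        <tokenizer class="solr.ChineseTokenizerFactory"/>
      </analyzer>
    </fieldType>

Fields that hold Chinese text are then declared with type="text_zh" so both indexing and queries go through the same tokenizer.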