How do I implement full text search in Chinese on PostgreSQL?

This question has been asked before:
Postgresql full text search in postgresql - japanese, chinese, arabic
but there are no answers for Chinese as far as I can see. I took a look at the OpenOffice wiki, and it doesn't have a dictionary for Chinese.
Edit: As we are already successfully using PG's internal FTS engine for English documents, we don't want to move to an external indexing engine. Basically, what I'm looking for is a Chinese FTS configuration, including parser and dictionaries for Simplified Chinese (Mandarin).

I know it's an old question but there's a Postgres extension for Chinese: https://github.com/amutu/zhparser/
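For reference, the setup is roughly the following (a sketch only, assuming zhparser and its SCWS dependency are already installed on the PostgreSQL server; the configuration name chinese_zh, the connection string and the sample text are just illustrative). Driven from Python with psycopg2:

    # Sketch only: sets up zhparser from Python via psycopg2, assuming the
    # zhparser extension (and its SCWS dependency) is already installed on
    # the PostgreSQL server. The configuration name 'chinese_zh', the
    # connection string and the sample text are illustrative.
    import psycopg2

    conn = psycopg2.connect("dbname=mydb")   # hypothetical connection string
    cur = conn.cursor()

    cur.execute("CREATE EXTENSION IF NOT EXISTS zhparser")
    cur.execute("CREATE TEXT SEARCH CONFIGURATION chinese_zh (PARSER = zhparser)")
    # Map the token types suggested by the zhparser README (nouns, verbs,
    # adjectives, idioms, exclamations, temporary words) to the simple dictionary.
    cur.execute("ALTER TEXT SEARCH CONFIGURATION chinese_zh "
                "ADD MAPPING FOR n,v,a,i,e,l WITH simple")
    conn.commit()

    # Query with the standard FTS operators once the configuration exists.
    cur.execute(
        "SELECT to_tsvector('chinese_zh', %s) @@ plainto_tsquery('chinese_zh', %s)",
        (u"保障房资金压力", u"资金"))
    print(cur.fetchone()[0])   # True if the segmented query matches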

I've just implemented a Chinese FTS solution in PostgreSQL. I did it by creating n-gram tokens from the Chinese input and building the necessary tsvectors with an embedded function (in my case I used plpythonu). It works very well (massively preferable to moving to SQL Server!!!).
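The n-gram idea itself is simple; below is a minimal sketch of the kind of Python logic one might wrap in a plpythonu function (illustrative only, not the poster's actual function, and the SQL names in the comments are hypothetical):

    # -*- coding: utf-8 -*-
    # Minimal sketch of bigram (2-gram) tokenization for Chinese text -- the
    # kind of logic one might wrap in a plpythonu function and feed to
    # to_tsvector('simple', ...). Illustrative only, not the poster's code.

    def cjk_bigrams(text):
        """Return overlapping two-character tokens for a Unicode string."""
        text = u"".join(text.split())               # drop whitespace
        if len(text) < 2:
            return [text] if text else []
        return [text[i:i + 2] for i in range(len(text) - 1)]

    tokens = cjk_bigrams(u"全文搜索")
    print(u" ".join(tokens))                        # 全文 文搜 搜索

    # Index and query strings are both bigrammed, so any two-character
    # substring of a document can be matched, e.g. (hypothetical names):
    #   to_tsvector('simple', ngram_tokens(body))
    #       @@ plainto_tsquery('simple', ngram_tokens('搜索'))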

Index your data with Solr; it's an open-source enterprise search server built on top of Lucene.
You can find more info on Solr here:
http://lucene.apache.org/solr/
A good how-to book (with immediate PDF download) is here:
https://www.packtpub.com/solr-1-4-enterprise-search-server/book
And be sure to use a Chinese tokenizer, such as solr.ChineseTokenizerFactory, because Chinese is not whitespace-delimited.

Related

new to sphinx - text analysis and search results 'relativity'

Being new to Sphinx, please excuse any mistakes or misused terms.
Sphinx is being used in a web based app with a database of millions of records in order to provide full-text search functionality.
For English content stored in the database, the search results are 'accurate' and relevant to the search keywords. The same thing does not happen with non-Latin characters. I had a look at the morphology configuration setting, but the Greek language is not available as an option. Thus, for Greek keywords the search results are not always as relevant as they are for English keywords.
Does Sphinx perform the same text analysis and indexing on Greek content as it does on English content?
Any information (links, comments, answers) would be helpful.
thanks,
This is most likely affected by the charset_type and charset_table config options.
http://sphinxsearch.com/docs/current.html#conf-charset-type
http://sphinxsearch.com/docs/current.html#conf-charset-table
Out of the box, Sphinx is only really set up for English and Russian (the languages the primary Sphinx developer happens to speak :)
So you will need to enable UTF-8 mode and add the required Greek characters to the charset_table.
The sphinx wiki
http://sphinxsearch.com/wiki/doku.php?id=charset_tables
has a set of Greek config options you can copy and paste.

To use Unicode or not in a web development project using Flask and SQLAlchemy

I am working on a web development project using Flask and the SQLAlchemy ORM. My post is about the use of Unicode in developing this app.
What I have understood so far about Unicode:
1. If I want my web app to handle data in languages other than English, I need to use the unicode data type for my variables, because plain byte strings can't handle Unicode data.
2. I should use a database that stores Unicode data, or that takes responsibility for converting Unicode to raw bytes when saving and back when retrieving. SQLAlchemy gives me the option of automatic conversion both ways, so I don't have to worry about it.
3. I am using Python 2.7, so I have to be careful to process Unicode data properly; normal string operations on Unicode data may be buggy.
Correct me if any of the above assumptions are wrong.
Now my questions:
1. If I don't use Unicode now, will I have problems if I (or the Flask developers) decide to port to Python 3?
2. I don't want to worry about my web app catering to different languages right now; I just want to concentrate on creating the app first. Can I add that later, without using Unicode right now?
3. If I do use Unicode now, how does it affect my code? Do I replace every string input and output with unicode, or what?
4. Can the Unicode conversion when saving to the database be a source of performance problems?
Basically, I am asking whether or not to use Unicode, given the needs and requirements of the project explained above.
1. No, but make sure you separate binary data from text data. That makes it easier to port.
2. It's easier to use Unicode from the start, but of course you can postpone it. It's really not very difficult, though.
3. You replace everything that should be text data with Unicode, yes (a minimal sketch of the pattern follows below).
4. Only if you make loads of conversions of really massive amounts of text.
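A minimal sketch of the pattern described above ("everything that should be text data is Unicode; decode bytes only at the boundaries"), assuming Python 2.7 and SQLAlchemy's declarative extension; the model, column and variable names are invented:

    # -*- coding: utf-8 -*-
    # Sketch of "unicode inside, bytes only at the edges" under Python 2.7
    # with SQLAlchemy. The model, column and variable names are invented.
    from sqlalchemy import create_engine, Column, Integer, Unicode, UnicodeText
    from sqlalchemy.ext.declarative import declarative_base
    from sqlalchemy.orm import sessionmaker

    Base = declarative_base()

    class Post(Base):
        __tablename__ = "posts"
        id = Column(Integer, primary_key=True)
        title = Column(Unicode(200))    # Unicode/UnicodeText columns expect
        body = Column(UnicodeText)      # `unicode` objects, not byte strings

    engine = create_engine("sqlite://")  # any backend; in-memory SQLite for the demo
    Base.metadata.create_all(engine)
    session = sessionmaker(bind=engine)()

    raw_bytes = "Ελληνικά κείμενα"       # e.g. UTF-8 bytes arriving from a form
    text = raw_bytes.decode("utf-8")     # decode once, at the boundary

    session.add(Post(title=u"Γειά σου", body=text))
    session.commit()

    post = session.query(Post).first()
    print(post.title.encode("utf-8"))    # encode only when writing bytes back out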

How to solve this Arabic language problem in Sybase PowerBuilder 6 and 7?

How can I view Arabic characters correctly in Sybase PowerBuilder 6 or 7? I use Arial (Arabic), or any Arabic font, in the properties of the table and the database, but the characters show up as strange, meaningless symbols like ÓíÇÑÉ ÕÛíÑÉ.
I'm no expert in dealing with Arabic characters, so there may be a workaround with ANSI code pages, but I'd expect your best solution is Unicode. There was a distinct version of PB6 supporting Unicode (i.e. a separate product), but it was discontinued after PB6 and there was no Unicode support until it was integrated into the primary product in PB10. However, unless you have the PB6/Unicode product on hand, or you need Win9x support or some other old-platform support, I'd recommend moving to something more current, like the just-released PB 12.5. Not only will you get Unicode, but also a lot of features that will help your application look more up to date and integrate better with modern services. (See http://www.techno-kitten.com/Changes_to_PowerBuilder/changes_to_powerbuilder.html for a list that is a little out of date at the moment, but covers the majority of what you're after.)
Good luck,
Terry.
This problem is called Mojibake and it's due to the PowerBuilder client and the database using different character encodings. This problem is frequently encountered on the web, and also in email. As Terry suggested, you would get the best results using Unicode in the database and PowerBuilder. If that's not possible, you have to use the same code page on the PowerBuilder client as in the database. A complicating issue is that it sounds like you have existing data. If you want to switch encoding you would have to convert the existing data to the new encoding.
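For what it's worth, the exact symbols in the question can be reproduced by encoding Arabic text with the Windows Arabic code page (cp1256) and displaying it as if it were Western European (cp1252). The snippet below is just a demonstration of that mismatch, not PowerBuilder code:

    # -*- coding: utf-8 -*-
    # Demonstrates the encoding mismatch (mojibake): Arabic text stored as
    # Windows-1256 bytes but rendered by a client using Windows-1252.
    arabic = u"سيارة صغيرة"              # "small car"

    stored = arabic.encode("cp1256")     # bytes as the database stores them
    garbled = stored.decode("cp1252")    # how a cp1252 client displays them
    print(garbled)                       # ÓíÇÑÉ ÕÛíÑÉ -- the symbols in the question

    # The fix: use the same encoding on both sides, or Unicode end to end.
    print(stored.decode("cp1256"))       # سيارة صغيرة again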

Normalizing Unicode data for indexing (for Multi-byte languages): What products do this? Does Lucene/Hadoop/Solr?

I have several (1 million+) documents, email messages, etc., that I need to index and search through. Each document potentially has a different encoding.
What products (or configuration for the products) do I need to learn and understand to do this properly?
My first guess is something Lucene-based, but this is something I'm just learning as I go. My main desire is to start the time-consuming encoding process ASAP so that we can concurrently build the search front end. This may require some sort of normalisation of double-byte characters.
Any help is appreciated.
Convert everything to UTF-8 and run it through Normalization Form D, too. That will help with your searches.
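To illustrate what "UTF-8 plus Normalization Form D" looks like in practice (a sketch using Python's standard library; the input bytes and the Latin-1 guess are made up):

    # -*- coding: utf-8 -*-
    # Illustrates "convert to UTF-8 and apply Normalization Form D": decode
    # from whatever encoding a document arrived in, decompose accented
    # characters (NFD), then index the UTF-8 bytes.
    import unicodedata

    raw = b"r\xe9sum\xe9"                     # example bytes in Latin-1
    text = raw.decode("latin-1")              # -> u"résumé"

    normalized = unicodedata.normalize("NFD", text)
    # In NFD, é becomes 'e' plus a combining acute accent, which makes accent
    # stripping and consistent comparison easy at index time.
    print([unicodedata.name(c) for c in normalized[:3]])

    utf8_bytes = normalized.encode("utf-8")   # what you would hand to the indexer
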
You could try Tika.
Are you implying you need to transform the documents themselves? This sounds like a bad idea, especially on a large, heterogeneous collection.
A good search engine will have robust encoding detection. Lucene does, and Solr uses it (Hadoop isn't a search engine). And I don't think it's possible to have a search engine that doesn't use a normalised encoding in its internal index format. So normalisation won't be a selection criterion, though trying out the encoding detection would be.
I suggest you use Solr. The ExtractingRequestHandler handles encodings and document formats. It is relatively easy to get a working prototype using Solr. DataImportHandler enables importing a document repository into Solr.
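Something along these lines gets a first document into Solr through the ExtractingRequestHandler (a sketch only: the core name, URL, id value and file name are assumptions, and /update/extract must be enabled in solrconfig.xml):

    # Sketch: posting one document to Solr's ExtractingRequestHandler with the
    # requests library. The core name, URL, id value and file name are made up,
    # and /update/extract must be enabled in solrconfig.xml.
    import requests

    solr_extract_url = "http://localhost:8983/solr/mycore/update/extract"

    with open("message-0001.eml", "rb") as f:         # hypothetical input file
        resp = requests.post(
            solr_extract_url,
            params={
                "literal.id": "message-0001",         # unique key for the document
                "commit": "true",                      # commit after this add
            },
            files={"file": f},                         # Tika detects format and encoding
        )
    resp.raise_for_status()
    print(resp.status_code)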

PostgreSQL Full Text Search case study?

Is there any case study available of any project that uses PostgreSQL 8.3+ Full Text Search on a large amount of data?
Not sure what your definition of 'large' is.
There's some data about search.postgresql.org, which uses it, available here.