Tell Sphinx (or Thinking Sphinx) to ignore periods when indexing

I have a strange issue with Sphinx, I am trying to be able to match things like:
L.A. Confidential
So people can search for "LA Confidential" and still get that title. Similarly for "P.M."
to be able to match "PM", etc.
I tried putting the period (full stop character, U+002E) in the ignore_chars list. This
didn't make any difference.
So then I tried setting index_sp = 1. This did not solve the issue either.
According to my understanding of the documentation, either of these should have solved
the issue, correct?
I wonder if it has something to do with our match mode, which is set to extended2, using
Sphinx 2.0.3.
Any help would be greatly appreciated.
Edit: here is my thinking_sphinx.yml config:
Note that the period character (U+002E) is not used anywhere else in my config except in the ignore_chars line.
production:
  mem_limit: 512M
  morphology: stem_en
  wordforms: "db/sphinx/wordforms.txt"
  stopwords: "db/sphinx/stopwords.txt"
  ngram_chars: "U+4E00..U+9FBB, U+3400..U+4DB5, U+20000..U+2A6D6, U+FA0E, U+FA0F, U+FA11, \
    U+FA13, U+FA14, U+FA1F, U+FA21, U+FA23, U+FA24, U+FA27, U+FA28, U+FA29, U+3105..U+312C, \
    U+31A0..U+31B7, U+3041, U+3043, U+3045, U+3047, U+3049, U+304B, U+304D, U+304F, U+3051, \
    U+3053, U+3055, U+3057, U+3059, U+305B, U+305D, U+305F, U+3061, U+3063, U+3066, U+3068, \
    U+306A..U+306F, U+3072, U+3075, U+3078, U+307B, U+307E..U+3083, U+3085, U+3087, \
    U+3089..U+308E, U+3090..U+3093, U+30A1, U+30A3, U+30A5, U+30A7, U+30A9, U+30AD, \
    U+30AF, U+30B3, U+30B5, U+30BB, U+30BD, U+30BF, U+30C1, U+30C3, U+30C4, U+30C6, \
    U+30CA, U+30CB, U+30CD, U+30CE, U+30DE, U+30DF, U+30E1, U+30E2, U+30E3, U+30E5, \
    U+30E7, U+30EE, U+30F0..U+30F3, U+30F5, U+30F6, U+31F0, U+31F1, U+31F2, U+31F3, \
    U+31F4, U+31F5, U+31F6, U+31F7, U+31F8, U+31F9, U+31FA, U+31FB, U+31FC, U+31FD, \
    U+31FE, U+31FF, U+AC00..U+D7A3, U+1100..U+1159, U+1161..U+11A2, U+11A8..U+11F9, \
    U+A000..U+A48C, U+A492..U+A4C6"
  ngram_len: 1
  ignore_chars: "U+0027, U+2013, U+2014, U+0026, U+002E, ., &"
(huge charset_table entry here for different languages, omitted.)

I ran a test locally with the following in my thinking_sphinx.yml for Thinking Sphinx v3.0.4 and it worked:
development:
  ignore_chars: U+002E
The same in sphinx.yml for Thinking Sphinx v2.0.14 worked too. I am using Sphinx 2.0.8, but I'll be a little surprised if that's the problem. It's certainly unrelated to your match mode.
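For reference, a minimal sketch of how this can be verified (Thinking Sphinx v3 syntax; the Film model and its title field are hypothetical, and you need to run rake ts:rebuild after changing the config so the new settings are picked up):
ThinkingSphinx::Index.define :film, :with => :active_record do
  indexes title
end

# after rebuilding the index, both of these should return the "L.A. Confidential" record
Film.search "LA Confidential"
Film.search "L.A. Confidential"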

bitbake failed with ExpansionError

Context:
I'm following the NXP i.MX7 Reference to build a Linux image for the i.MX 7 SABRE board. This process went smoothly, and I was successful in building and loading the krogoth image on the board. The problem arose when I tried to add the openembedded-core layer to my image; I immediately got the error below. I included my bblayers.conf for reference. Any help would be appreciated. I don't even need sqlite, so if there's a way to bypass it, that would be fine.
Error:
ERROR: ExpansionError during parsing /fsl-community-bsp-platform/sources/openembedded-core/meta/recipes-support/sqlite/sqlite3_3.16.2.bb: Failure expanding variable SQLITE_PV, expression was ${#sqlite_download_version(d)} which triggered exception TypeError: getVar() takes at least 3 arguments (2 given)
bblayers.conf
POKY_BBLAYERS_CONF_VERSION = "2"
BBPATH = "${TOPDIR}"
BSPDIR := "${#os.path.abspath(os.path.dirname(d.getVar('FILE', True)) + '/../..')}"
BBFILES ?= ""
BBLAYERS = " \
${BSPDIR}/sources/poky/meta \
${BSPDIR}/sources/poky/meta-poky \
\
${BSPDIR}/sources/openembedded-core/meta \
\
${BSPDIR}/sources/meta-openembedded/meta-oe \
${BSPDIR}/sources/meta-openembedded/meta-multimedia \
\
${BSPDIR}/sources/meta-fsl-arm \
${BSPDIR}/sources/meta-fsl-arm-extra \
${BSPDIR}/sources/meta-fsl-demos \
"
The only difference between a successful build and a failing build is the line: ${BSPDIR}/sources/openembedded-core/meta.
Don't add openembedded-core/meta to your bblayers.conf!
In your list, BBLAYERS =, the two entries
${BSPDIR}/sources/poky/meta \
${BSPDIR}/sources/openembedded-core/meta \
are both the same layer. meta in Poky is taken directly from OpenEmbedded. The Poky repository is combined from multiple upstream repositories using a script, combo-layer. (Which in my opinion is unfortunate, though I can see why it's being done.)
If you want e.g. a newer version of meta, you need to update poky, or remove poky completely and download openembedded-core and bitbake separately.
In my experience building BSPs with Yocto, specifically with the NXP i.MX7, I have run into ExpansionError very frequently. Most of the time I found that there was a redundant package, layer, or recipe somewhere; once you remove it from the build, things work smoothly.
In your case, just remove the following from the build and it should be fine.
${BSPDIR}/sources/openembedded-core/meta \
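After removing it, the layer list from your own bblayers.conf would look roughly like this:
BBLAYERS = " \
${BSPDIR}/sources/poky/meta \
${BSPDIR}/sources/poky/meta-poky \
\
${BSPDIR}/sources/meta-openembedded/meta-oe \
${BSPDIR}/sources/meta-openembedded/meta-multimedia \
\
${BSPDIR}/sources/meta-fsl-arm \
${BSPDIR}/sources/meta-fsl-arm-extra \
${BSPDIR}/sources/meta-fsl-demos \
"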

Tokenizer in moses-SMT system stuck even with 10 sentences

I was trying to make a baseline MT system. Just to check how it works, I made source (S) and target (T) language corpora of just 2000 sentences. The very first step is to prepare the data for the Machine Translation (MT) system. In this step we first have to perform tokenization, as mentioned in the Baseline SMT guide. I used this code:
~/mosesdecoder/scripts/tokenizer/tokenizer.perl -l en \
< ~/corpus/training/news-commentary-v8.fr-en.en \
> ~/corpus/news-commentary-v8.fr-en.tok.en
~/mosesdecoder/scripts/tokenizer/tokenizer.perl -l fr \
< ~/corpus/training/news-commentary-v8.fr-en.fr \
> ~/corpus/news-commentary-v8.fr-en.tok.fr
(say S = French & T = English)
I checked after 2 hours and it was still running, which I had not expected. Curious, I then tried with just ten sentences. To my surprise, it has been 30 minutes and it is still running.
Did I do anything wrong?
PS: OS = Ubuntu 14.04.5 LTS
Sony ultrabook
No dual boot.
Please follow the steps below:
git clone https://github.com/moses-smt/mosesdecoder.git
cd mosesdecoder
git clone https://github.com/moses-smt/giza-pp.git
cd giza-pp
make
# back to the mosesdecoder directory so the paths below resolve
cd ..
mkdir tools
cp giza-pp/GIZA++-v2/GIZA++ giza-pp/GIZA++-v2/snt2cooc.out giza-pp/mkcls-v2/mkcls tools
scripts/tokenizer/tokenizer.perl -l fr < ~/corpus/training/news-commentary-v8.fr-en.fr > ~/corpus/news-commentary-v8.fr-en.tok.fr
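As a quick sanity check (the sample sentence is only an illustration), piping a single line through the tokenizer should return almost immediately; if even this hangs, the problem is with the environment (e.g. a missing Perl module or locale) rather than the corpus size:
echo "C'est un test." | scripts/tokenizer/tokenizer.perl -l fr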

How can I change the installation path of an autotools-based Bitbake recipe?

I have an autotools-based BitBake recipe which I would like to have binaries installed in /usr/local/bin and libraries installed in /usr/local/lib (instead of /usr/bin and /usr/lib, which are the default target directories).
Here's a part of the autotools.bbclass file which I found important.
CONFIGUREOPTS = " --build=${BUILD_SYS} \
--host=${HOST_SYS} \
--target=${TARGET_SYS} \
--prefix=${prefix} \
--exec_prefix=${exec_prefix} \
--bindir=${bindir} \
--sbindir=${sbindir} \
--libexecdir=${libexecdir} \
--datadir=${datadir} \
--sysconfdir=${sysconfdir} \
--sharedstatedir=${sharedstatedir} \
--localstatedir=${localstatedir} \
--libdir=${libdir} \
...
I thought that the easiest way to accomplish what I wanted to do would be to simply change ${bindir} and ${libdir}, or perhaps change ${prefix} to /usr/local, but I haven't had any success in this area. Is there a way to change these installation variables, or am I thinking about this in the wrong way?
Update:
Strategy 1
As per Ross Burton's suggestion, I've tried adding the following to my recipe:
prefix="/usr/local"
exec_prefix="/usr/local"
but this causes the build to fail during that recipe's do_configure() task, and returns the following:
| checking for GLIB... no
| configure: error: Package requirements (glib-2.0 >= 2.12.3) were not met:
|
| No package 'glib-2.0' found
This package can be found during a normal build without these modified variables. I thought that adding the following line might allow the system to find the package metadata for glib:
PKG_CONFIG_PATH = " ${STAGING_DIR_HOST}/usr/lib/pkgconfig "
but this seems to have made no difference.
Strategy 2
I've also tried Ross Burton's other suggestion to add these variable assignments to my distribution's configuration file, but this causes the build to fail during meta/recipes-extended/tzdata's do_install() task, reporting that DEFAULT_TIMEZONE is set to an invalid value. Here's the source of the error from tzdata_2015g.bb:
# Install default timezone
if [ -e ${D}${datadir}/zoneinfo/${DEFAULT_TIMEZONE} ]; then
install -d ${D}${sysconfdir}
echo ${DEFAULT_TIMEZONE} > ${D}${sysconfdir}/timezone
ln -s ${datadir}/zoneinfo/${DEFAULT_TIMEZONE} ${D}${sysconfdir}/localtime
else
bberror "DEFAULT_TIMEZONE is set to an invalid value."
exit 1
fi
I'm assuming that I've got a problem with ${datadir}, which references ${prefix}.
Do you want to change paths for everything or just one recipe? Not sure why you'd want to change just one recipe to /usr/local, but whatever.
If you want to change all of them, then the simple way is to set prefix in your local.conf or distro configuration (prefix = "/usr/local").
If you want to do it in a particular recipe, then just assigning prefix="/usr/local" and exec_prefix="/usr/local" in the recipe will work.
These variables are defined in meta/conf/bitbake.conf, where you can see that bindir is $exec_prefix/bin, which is probably why assigning prefix didn't work for you.
Your first strategy was on the right track, but by changing only "prefix" you were clobbering more than you wanted. If you look in sources/poky/meta/conf/bitbake.conf you'll find everything that gets clobbered when you set "prefix" to something other than "/usr" (as it was in my case), because datadir, sharedstatedir and friends are all defined relative to it. In order to modify only the install path (what would manually be the "--prefix" option to configure), I needed to set all of the following variables in that recipe:
prefix="/your/install/path/here"
datadir="/usr/share"
sharedstatedir="/usr/com"
exec_prefix="/usr"
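To confirm the effect without doing a full rebuild, you can dump the expanded variables for the recipe (the recipe name here is a placeholder):
bitbake -e your-recipe-name | grep -E '^(prefix|exec_prefix|bindir|libdir|datadir)='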

hadoop-streaming example failed to run - Type mismatch in key from map

I was running:
$HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/hadoop-streaming.jar \
-D stream.map.output.field.separator=. \
-D stream.num.map.output.key.fields=4 \
-input myInputDirs \
-output myOutputDir \
-mapper org.apache.hadoop.mapred.lib.IdentityMapper \
-reducer org.apache.hadoop.mapred.lib.IdentityReducer
What should be the input file when IdentityMapper is the mapper?
I was hoping to see that it can sort on certain selected keys rather than on the entire key. My input file is simple:
"aa bb"
"cc dd"
Not sure what I missed. I always get this error:
java.lang.Exception: java.io.IOException: Type mismatch in key from map: expected org.apache.hadoop.io.Text, recieved org.apache.hadoop.io.LongWritable
at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:371)
Caused by: java.io.IOException: Type mismatch in key from map: expected org.apache.hadoop.io.Text, recieved org.apache.hadoop.io.LongWritable
This is a known bug and here is the JIRA. The bug was identified in Hadoop 0.21.0, but I don't think the fix is in any released Hadoop version. If you are really interested in fixing this, you can:
download the source code for Hadoop (for the release you are working with)
download the patch from JIRA and apply it
build and test Hadoop
Here are the instructions on how to apply a patch.
Or, instead of using the IdentityMapper and the IdentityReducer, use a Python/Perl script which reads the k/v pairs from STDIN and then writes the same k/v pairs to STDOUT without any processing. It's like creating your own IdentityMapper and IdentityReducer without using Java.
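A minimal sketch of such a pass-through script (the file name identity.py is just a placeholder) might look like this; you would ship it with -file identity.py and point both -mapper and -reducer at it:
#!/usr/bin/env python
# identity.py: pass every input line through unchanged, so the job behaves
# like an identity mapper/reducer written in Python instead of Java
import sys

for line in sys.stdin:
    sys.stdout.write(line)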
I was trying my hand at Hadoop with my own example, but got the same error. I used KeyValueTextInputFormat to resolve the issue. You can have a look at the following blog for the same:
http://sanketraut.blogspot.in/2012/06/hadoop-example-setting-up-hadoop-on.html
Hope it helps you.
Peace.
Sanket Raut

Full text search in PostgreSQL - Japanese, Chinese, Arabic

I'm designing a full-text search function in PostgreSQL for my current project.
It works OK with ispell/myspell dictionaries so far.
Now I need to add support for Chinese, Japanese and Arabic search.
Where do I start?
There are no templates or dictionaries available for those languages
as far as I can see.
Will it work with the pg_catalog.simple configuration?
Just a hint from the manual: A large list of dictionaries is available on the OpenOffice Wiki.
Dictionaries won't help you much with Chinese - you'll need to look into n-gram tokenising...
A similar question on stackoverflow.com is How do I implement full text search in Chinese on PostgreSQL?.
That said, I'll provide a detailed solution below, based on my experience and on a solution found on the Internet. I use the two tools SCWS and zhparser together for Chinese full-text search in Postgres.
20160131 Update:
You must check whether you have installed postgresql-server-devel-{version number}, because we will use PGXS from it to build the extension for PostgreSQL.
Step1: install SCWS.
Note that --prefix=/usr/local/scws follows ./configure in the fourth line below; don't run ./configure on its own.
wget http://www.xunsearch.com/scws/down/scws-1.2.2.tar.bz2
tar xvjf scws-1.2.2.tar.bz2
cd scws-1.2.2
./configure --prefix=/usr/local/scws
make
make install
To check whether it installed successfully, run the command below:
ls -al /usr/local/scws/lib/libscws.la
Step2: Install zhparser
git clone https://github.com/amutu/zhparser.git
cd zhparser
SCWS_HOME=/usr/local/scws/include make && make install
20160131 Update:
If you use Mac OS X Yosemite, the value of SCWS_HOME above stays the same. But if you use Ubuntu 14.04 LTS, change the value of SCWS_HOME to /usr/local/scws.
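That is, on Ubuntu 14.04 LTS the install line from Step2 becomes:
SCWS_HOME=/usr/local/scws make && make install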
Step3: Configure a new extension using zhparser in Postgres
Step3.1: Log in to your Postgres database through the terminal/command line
psql yourdatabasename
Step3.2: Create the extension in Postgres. You can specify whatever dictionary name you want.
CREATE EXTENSION zhparser;
CREATE TEXT SEARCH CONFIGURATION dictionarynameyouwant (PARSER = zhparser);
ALTER TEXT SEARCH CONFIGURATION dictionarynameyouwant ADD MAPPING FOR n,v,a,i,e,l WITH simple;
If you follow the above steps, you can use Postgres full-text search on Chinese/Mandarin words.
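As a quick check (using the configuration name created in Step3.2; the sample text and query term are only illustrations), a query along these lines exercises the new configuration and should return t when the query term comes out of the segmenter as a token:
SELECT to_tsvector('dictionarynameyouwant', '我爱北京天安门') @@ to_tsquery('dictionarynameyouwant', '北京');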
Extra step (not necessary), in Rails, for using the pg_search gem: Step4. Configure the dictionary name in the :dictionary attribute of :tsearch in app/models/yourmodel.rb
class YourOwnClass < ActiveRecord::Base
  ...
  include PgSearch
  pg_search_scope :functionnameyoulike,
                  :against => [columnsyoulike1, columnsyoulike2, ..., etc],
                  :using => { :tsearch => { :dictionary => "dictionary name you just specified in creating the extension in postgres", ..., etc } }
end
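With that in place, the generated scope can be called like any other class method (names are the placeholders from above):
YourOwnClass.functionnameyoulike("北京")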
Reference:
1. SCWS install tutorial
2. zhparser on GitHub (https://github.com/amutu/zhparser)
3. Francs' Post - Postgres full-text search in Chinese with zhparser and SCWS
4. Rails365.net's Post - Postgres full-text search in Chinese with pg_search gem with zhparser
5. My Post at xuite.net - Make Postgres support full text search in Mandarin/Chinese