PostgreSQL full-text search for Czech language (no default language config)

I am trying to set up full-text search for the Czech language. I am a little bit confused, because I see cs_cz.affix and cs_cz.dict files inside the tsearch_data folder, but there is no Czech language configuration (it's probably not shipped with Postgres).
So should I create one? Which dictionaries do I have to create and configure? Is there any support for the Czech language at all?
Should I use all possible dictionaries (Synonym Dictionary, Thesaurus Dictionary, Ispell Dictionary, Snowball Dictionary)?
I am able to create a Czech configuration for the ispell dictionary and it works fine, but I am not sure whether an ispell configuration alone is enough.
I tried to read https://www.postgresql.org/docs/9.5/static/textsearch.html but I am still a little bit confused. Thanks a lot.

I have never tried it, but you should be able to create a Czech Snowball stemmer as long as you are ready to compile PostgreSQL from source.
There is an explanation in src/backend/snowball/README:
The files under src/backend/snowball/libstemmer/ and
src/include/snowball/libstemmer/ are taken directly from their libstemmer_c
distribution, with only some minor adjustments of file inclusions. Note
that most of these files are in fact derived files, not master source.
The master sources are in the Snowball language, and are available along
with the Snowball-to-C compiler from the Snowball project. We choose to
include the derived files in the PostgreSQL distribution because most
installations will not have the Snowball compiler available.
To update the PostgreSQL sources from a new Snowball libstemmer_c
distribution:
1. Copy the *.c files in libstemmer_c/src_c/ to src/backend/snowball/libstemmer
with replacement of "../runtime/header.h" by "header.h", for example
for f in libstemmer_c/src_c/*.c
do
    sed 's|\.\./runtime/header\.h|header.h|' $f >libstemmer/`basename $f`
done
(Alternatively, if you rebuild the stemmer files from the master Snowball
sources, just omit "-r ../runtime" from the Snowball compiler switches.)
2. Copy the *.c files in libstemmer_c/runtime/ to
src/backend/snowball/libstemmer, and edit them to remove direct inclusions
of system headers such as <stdio.h> -- they should only include "header.h".
(This removal avoids portability problems on some platforms where <stdio.h>
is sensitive to largefile compilation options.)
3. Copy the *.h files in libstemmer_c/src_c/ and libstemmer_c/runtime/
to src/include/snowball/libstemmer. At this writing the header files
do not require any changes.
4. Check whether any stemmer modules have been added or removed. If so, edit
the OBJS list in Makefile, the list of #include's in dict_snowball.c, and the
stemmer_modules[] table in dict_snowball.c.
5. The various stopword files in stopwords/ must be downloaded
individually from pages on the snowball.tartarus.org website.
Be careful that these files must be stored in UTF-8 encoding.
There is a Czech Snowball stemmer available that was contributed to the Snowball project. There is no stop word list for it, but I am sure you can either find one or create one yourself.
The real work would be to install Snowball and use the Snowball-to-C compiler to create the C and header files to add to the PostgreSQL source.
These files should then remain stable, so it shouldn't be difficult to upgrade to a new PostgreSQL version.
If you are willing to do the work, but don't want to patch PostgreSQL and build it from source every time, you could also consider submitting a patch to PostgreSQL. As long as the stemmer works fine, I don't expect that you will meet much resistance there (but the patch submission process is still tedious).
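For the ispell route mentioned in the question, a configuration can be sketched without compiling anything. The small script below only builds the SQL statements and prints them, so they can be reviewed and piped to psql. The names czech_ispell and czech are hypothetical, and the sketch assumes the cs_cz.dict and cs_cz.affix files from the question are present in tsearch_data. An ispell dictionary returns nothing for words it does not know, so the sketch lists the built-in simple dictionary after it as a fallback; this is why ispell alone is usually not enough.

```python
def czech_fts_statements(dict_name="czech_ispell", config_name="czech", base_file="cs_cz"):
    """Return the DDL statements as a list of strings; nothing is executed here.

    dict_name and config_name are hypothetical names chosen for illustration.
    base_file is the shared base name of the .dict and .affix files.
    """
    # Token types that carry words; other types keep the mapping copied from "simple".
    token_types = "asciiword, asciihword, hword_asciipart, word, hword, hword_part"
    return [
        f"CREATE TEXT SEARCH DICTIONARY {dict_name} "
        f"(TEMPLATE = ispell, DictFile = {base_file}, AffFile = {base_file});",
        f"CREATE TEXT SEARCH CONFIGURATION {config_name} (COPY = simple);",
        # Dictionaries are consulted left to right: ispell first, simple as fallback.
        f"ALTER TEXT SEARCH CONFIGURATION {config_name} "
        f"ALTER MAPPING FOR {token_types} WITH {dict_name}, simple;",
    ]


if __name__ == "__main__":
    print("\n".join(czech_fts_statements()))
```

Once the statements have been run, something like SELECT to_tsvector('czech', '...') should show whether words are reduced to their base forms.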

SugarCRM Language Translation Files

Is there a way to limit SugarCRM to just one language (en_us)? Right now everything we do generates 40+ language files which we'll never use. It makes finding things in the folders very difficult.
After removing all the languages except en_us my /sugarcrm/config_override.php contains the following:
<?php
/***CONFIGURATOR***/
$sugar_config['disabled_languages'] = 'bg_BG,cs_CZ,da_DK,de_DE,el_EL,es_ES,fr_FR,he_IL,hu_HU,hr_HR,it_it,lt_LT,ja_JP,ko_KR,lv_LV,nb_NO,nl_NL,pl_PL,pt_PT,ro_RO,ru_RU,sv_SE,th_TH,tr_TR,zh_TW,zh_CN,pt_BR,ca_ES,en_UK,sr_RS,sk_SK,sq_AL,et_EE,es_LA,fi_FI,ar_SA,uk_UA';
/***CONFIGURATOR***/
I then tested with a new package, named Dan, which has one module named Pets. When I look in version control I still have a file for each available language in sugarcrm/custom/modulebuilder/packages/Dan/modules/Pets/languages.
It seems you can accomplish that by modifying the language array in the Sugar config.
Make sure to make a backup of your config.php, so that you have the original language array if you need it back. This is important even though our change will be in another file, because Sugar might recreate config.php automatically from the resulting array, losing the original one.
In your config_override.php add this line:
$sugar_config['languages'] = array('en_us' => 'English (US)');
Be aware that the above line will make 'en_us' the only available language on that instance, and Studio etc. should now only create en_us files. If that is not the solution you're looking for, please let me know.
EDIT:
The above steps only seem to stop the file creation spam for the Dropdown Editor.
If you also want to make Module Builder not create any non-en_us language files, I found this - quite invasive - method of accomplishing just that:
1. Create a backup of the instance, then remove all *.lang.php files from the include/ and modules/ directories, except for the en_us.* files. On Linux you can do this with: find include modules -name '*.lang.php' -not -name 'en_us.*' -print -delete
2. Delete the contents of the cache/ folder.
3. In Sugar, run Administration -> Repair -> Quick Repair and Rebuild.
This made my Module Builder only create en_us language files.
Note: If anybody should ever consider doing this for a language other than en_us, make sure to keep not only your language of choice but also the en_us files! Sugar expects those files to exist, e.g. as the fallback for strings missing in any other language. Deleting the en_us files may lead to unexpected side effects!
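Because the deletion step is destructive, it may help to list the affected files first. The following is a minimal dry-run sketch of the same selection the find command makes; the function name removable_language_files and its parameters are made up for illustration, and it deletes nothing.

```python
from pathlib import Path


def removable_language_files(root, keep_prefix="en_us."):
    """List *.lang.php files under include/ and modules/ whose name does not start with keep_prefix.

    This only reports candidates; it never deletes anything.
    """
    root = Path(root)
    found = []
    for top in ("include", "modules"):
        base = root / top
        if base.is_dir():
            found.extend(
                p for p in base.rglob("*.lang.php")
                if not p.name.startswith(keep_prefix)
            )
    return sorted(found)


if __name__ == "__main__":
    # Run from the Sugar root directory and review the list before deleting anything.
    for path in removable_language_files("."):
        print(path)
```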

current scctext replacement for textual representation of vfp binary files

What are people using in vfp 9 for a replacement for the built-in scctext.prg that translates binary files in vfp to a textual representation?
We’re moving an existing project that’s in vfp 9 sp1 into tfs source control, but we need a way to make sure that the non-textual files get the benefits of comparison that only text files allow. We plan to check both the textual representation and the binary file into source control (the binary is more for the “just in case” scenario).
According to the document at
http://www.ita-software.com/papers/Borup_Mercurial_Published.pdf
there are at least three options for converting .scx, .frx, .lbx, .prj and other non-prg dbf files in visual foxpro (vfp) to a textual representation. Only some of them allow for converting the textual information back to binary - not sure how often we’d really use that or not.
ALTERNATE SCCTEXT
This one seems older with latest version in 2009 - not sure if it’s still the preferred tool - and it seems to have no way to take the textual representation and convert it back to a binary file.
http://vfpx.codeplex.com/releases/view/12955
TWOFOX
This one seems similar to foxbin2prg except that it creates xml files. It seems only one dev is working on it, unlike the others, which are open to contributions, so I'm not sure how current it is or how much other developers use it. It does have two-way conversion like foxbin2prg has.
http://www.foxpert.com/downloads.htm
FOXBIN2PRG
This one is fairly recent, but I'm not sure if it’s production ready enough to use for production coding work. It does have two-way conversion.
http://vfpx.codeplex.com/releases/view/116407
TRIGGER INVOKE ONE OF THE ABOVE ON CHANGE OF BINARY FILES IN VFP IDE
What are people using to invoke these textual representation options?
I’ve seen this class that was created to run one of the programs listed above for all files in the project. Apparently it does it when the date time of the last generate is older than the date time on the textual version of the file. One drawback I’ve read about is that it also generates output for foundation classes and other things that a dev is not really working on (code that is referenced by but not included in your project).
http://codepaste.net/9yy1gm
Thanks for any advice from those that are using vfp 9 with source control out there!
You should check out the scX library written by Paul McNett which is published on Ed Leafe's web site. I haven't used it in a mission-critical software project yet, but I have tested it out. It seemed to catch all the potential problems I've encountered with other scctext replacements.
I haven't used it in a big project for a couple of reasons.
1. It is a breaking change for source control history. So, comparing source code in your current SCA or VCA files with the new files generated by scX isn't going to be simple.
2. It isn't a drop-in replacement for scctext. Instead of checking files into and out of source control directly from the IDE, you'll have an intermediary folder.
   You'll check your files out of source control into one folder, convert them to FoxPro format, and then edit them in the FoxPro IDE.
   Then, you'll save your changes in the FoxPro IDE, convert them to scX format, and then check them into source control.
I'm sure much of #2 can be automated; but combined with #1, making the change to scX wasn't worth it for me.
FoxBin2Prg is production ready and, AFAIK, it's the only tool that allows diff and merge of the generated text (tx2) files and can regenerate the binaries from them.
The generated files are PRG style, so developers can read them as if they were modifying a PRG (with PROC/ENDPROC structures and such), but they aren't meant to be compiled. The primary use is with SCM tools, but it can be used separately.
I'm actually using it on production code with a 10-member team making concurrent modifications to forms and classes.
Some documentation is available on VFPx in English and Spanish. Internal messages are available in both languages, and from version v1.19.24 a German translation is available too.
More info on the VFPx site.
Best regards!

iOS Localization - Updating Localizable.strings with just new strings

I have searched Google and StackOverflow and still have no clear answer on an easy and automated way of doing this but here is the scenario:
I have an app with 1000 strings localized into en, fr, de, es, it.
I build a new feature that makes 10 distinctly new NSLocalizedString() keys.
I just want those 10 new strings appended onto the ends of the files:
en.lproj/Localizable.strings
fr.lproj/Localizable.strings
es.lproj/Localizable.strings
de.lproj/Localizable.strings
it.lproj/Localizable.strings
genstrings will retrieve all 1010 distinct strings. This is a pain since I'll need to "needle in a haystack" find those 10 strings every time I do an update.
UPDATE 19-SEP-2014 -- XCode 6 - Apple has finally released support for XLIFF export and import of your .strings files
Whats new in XCode 6? Localisation
Linguan (v1.1.3) whilst it is a lovely tool most of the time, it is starting to be a tool in the other sense. It merges the changes but some strings aren't matched correctly when it merges, so every time it does a Scan Sources it creates 100 new duplicate keys as well as the 10 strings I am after, so it is making more work.
FileMerge As suggested below, try doing a diff between old and new versions of the genstrings output files. The genstrings output has the strings sorted alphabetically, so 10 strings scattered throughout 1000 means that there are 200 differences to review. It keeps matching the /*...*/ and the "..." = "..." and saying that the ... has been updated. It hasn't been updated, just shifted to a new location in the file. More and more it is looking like I am going to have to write a custom tool.
MacHG + FileMerge on a side note, for some strange reason doesn't like doing diffs out of the repository with the working copy of Localizable.strings. Both the left and right panes appear empty.
UPDATE: Turns out variations in some changesets being saved as UTF-16 and some as UTF-8 are screwing with it being able to do a proper diff.
Bash Script + FileMerge I have written the following script to help maintain my English reference file each time I add new NSLocalizedString entries:
#!/bin/bash
#LOCALISATION UPDATE SCRIPT
#
#This will create a temporary copy of the current 'en' reference file then generate the
#latest reference file using the 'genstrings' tool. Finally forcing FileMerge to launch
#and diff the changes.
#
#Last Updated: 2014-JAN-06
#Author(s): Josh Wilson
clear
#assuming this script is run from $SRCROOT
#Backup Existing 'en' reference
cp "en.lproj/Localizable.strings" "en.lproj/Localizable-src.strings"
#Scan source files for 'NSLocalizedString' macros
genstrings -q -u -o en.lproj Classes/*.{m,mm}
genstrings -q -u -a -o en.lproj Classes/iPad/*.{m,mm}
genstrings -q -u -a -o en.lproj Classes/iPhone/*.{m,mm}
#Force FileMerge to launch and diff the update (NOTE: piping to cat forces GUI to open)
opendiff "en.lproj/Localizable-src.strings" "en.lproj/Localizable.strings" | cat
#Cleanup up temporary file
rm "en.lproj/Localizable-src.strings"
But this only updates the EN file and I am lacking a way of having the other language files updated with the new keys. This one has been good for instances where I don't have an English word as the key and genstrings overwrites my
"welcome_message" = "Welcome!" with "welcome_message" = "welcome_message"
POEditor http://poeditor.com/. This is an online tool and subscription based after 1000 strings. Seems to work well but it would be good if there was a non subscription based tool.
Traducto Pro Seems to do an alright job of integrating with XCode and extracting the strings and merging things together. But it is impossible to get anything back out of it until it is fully translated so you are coerced into using their translation services.
Surely this functionality has been implemented before. How does Apple keep their Apps localised?
Script junkies, I call upon thee! iOS development has been going on for some time now and localisation is kind of common, surely there is a mature solution to this by now?
Python Script update_strings.py: Stackoverflow finally recommended a related question and the python script in this answer Best practice using NSLocalizedString looks promising...
Tested it, and in its current form (31-MAY-2013) it doesn't handle multiline comments if you have duplicate comment entries (it expects single-line comments).
Might just need to tweak the regexes a bit.
Check out BartyCrouch, it solves this problem. It is open source, actively maintained and can be easily installed and integrated within your project.
Install BartyCrouch via Homebrew:
brew install bartycrouch
Alternatively, install it via Mint:
mint install Flinesoft/BartyCrouch
Incrementally update your Localizable.strings files:
$ bartycrouch update
This will do exactly what you were looking for.
In order to keep your Storyboards/XIBs Strings files updated over time I highly recommend adding a build script (instructions on how to add a build script here):
if which bartycrouch > /dev/null; then
    bartycrouch update -x
    bartycrouch lint -x
else
    echo "warning: BartyCrouch not installed, download it from https://github.com/Flinesoft/BartyCrouch"
fi
In addition to incrementally updating your Storyboards/XIBs Strings files this will also make sure your Localizable.strings files stay updated with newly added keys in code using NSLocalizedString and show warnings for duplicate keys or empty values.
Make sure to check out BartyCrouch on GitHub for additional information.
If you have the genstrings output for the previous version, just a "diff" between new and old could do the trick.
EDIT: best use vimdiff to deal with UTF-16 files.
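A plain line diff reports shifted entries as changes, which is the problem described in the question. Comparing by key avoids that. Below is a minimal sketch, not a finished tool: the function names are made up for illustration, comments are not carried over, and it assumes the files have already been read into strings with the right encoding (genstrings output is often UTF-16).

```python
import re

# Matches one `"key" = "value";` entry; escaped quotes inside the strings are allowed.
ENTRY = re.compile(r'"((?:[^"\\]|\\.)*)"\s*=\s*"((?:[^"\\]|\\.)*)"\s*;')


def parse_strings(text):
    """Return {key: value} for every entry in the text of a .strings file."""
    return {m.group(1): m.group(2) for m in ENTRY.finditer(text)}


def append_new_keys(reference_text, localized_text):
    """Return localized_text with entries for the keys it lacks appended at the end.

    The appended entries carry the reference (e.g. English) value as a placeholder
    for the translator; existing entries are left untouched.
    """
    reference = parse_strings(reference_text)
    existing = parse_strings(localized_text)
    missing = [key for key in reference if key not in existing]
    if not missing:
        return localized_text
    lines = [localized_text.rstrip("\n")] if localized_text.strip() else []
    lines += [f'"{key}" = "{reference[key]}";' for key in missing]
    return "\n".join(lines) + "\n"
```

Running append_new_keys with the fresh genstrings output as the reference and each xx.lproj/Localizable.strings as the localized text would leave the new keys grouped at the end of every language file.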
You can check out this Xcode Plugin I built for OneSky, it aims to improve the localization work flow for iOS/Mac OSX developers.
The string generation feature of the plugin runs genstrings and ibtool --export-strings-file on the selected source/IB files. New files are added to the project and target automatically, and new strings are merged into existing files with comments.
It will only generate/update strings for the base language, but you can make use of other features of the plugin to automate translation export and import with OneSky platform, which is free for crowdsource projects.
You may want to check out my solution here: SwiftyLocalization
With few steps to setup, you will have a very flexible localization in Google Spreadsheet (comment, custom color, highlight, font, multiple sheets, and more).
In short, steps are: Google Spreadsheet --> CSV files --> Localizable.strings
Moreover, it also generates Localizables.swift, a struct that acts as an interface for key retrieval and decoding (you have to manually specify a way to decode a String from a key, though).
Why is this great?
You no longer need to have keys as plain strings all over the place.
Wrong keys are detected at compile time.
Xcode can do autocomplete, so you can do something like this:
// It's defined as computed static var, so it's up-to-date every time you call.
// You can also have your custom retrieval method there.
button.setTitle(Localizables.login.button_title_login, forState: .Normal)
The project uses:
Google App Script to convert Sheets --> CSV
Python script to convert CSV files --> Localizable.strings
You can have a quick look at this example sheet to know what's possible.

how to annotate files - when long filenames are not enough

I work with many files doing general data analysis.
Things I want to know about my files include:
what data is contained in the file (in long and very long descriptive English text)?
is the file downloaded from somewhere (where? when?) or generated by a program (which one?)
why I made this file, a verbal description of what I want to do with it, and where it belongs in my data analysis workflow (additional English text description, which can get very long as well)
For this, long filenames are simply not the solution! Even long filenames are too short for the full descriptions, and when actually working with the files (perl, awk, R) the long filenames get in the way.
What I do right now is make a readme in each dir with the filename, a tab separator, and the long description. However, this solution is very cumbersome, as you can imagine, because the descriptions are completely separated from the filesystem, and the readme has to be maintained and updated separately.
Is there any tool one can use for really verbose, systematic descriptions of filenames? Maybe even integrated into the filesystem?
Operating system used: Windows 7 and Cygwin, various flavours of linux/unix through SSH and importing X
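Until there is a better tool, the readme convention described above (filename, tab, description) can at least be checked mechanically. The sketch below is one possible approach, not an existing tool: the function name check_readme and the readme file name README.tsv are assumptions. It reports files that have no description and descriptions whose file no longer exists.

```python
from pathlib import Path


def check_readme(directory, readme_name="README.tsv"):
    """Compare a tab-separated readme (filename<TAB>description) with the directory contents.

    Returns (undocumented, stale): files without an entry, and entries without a file.
    """
    directory = Path(directory)
    described = {}
    for line in (directory / readme_name).read_text(encoding="utf-8").splitlines():
        if "\t" in line:
            name, description = line.split("\t", 1)
            described[name.strip()] = description.strip()
    present = {
        p.name for p in directory.iterdir()
        if p.is_file() and p.name != readme_name
    }
    undocumented = sorted(present - set(described))
    stale = sorted(set(described) - present)
    return undocumented, stale
```

It uses only the standard library, so it should behave the same under Cygwin and on the Linux hosts.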

Can someone break down how localization file ( .mo, .po ) generation works?

I'm trying to grok gettext.
Here's how I think it works -
First you use some sort of po editor and tell it to scan your application's directory and create these ".po" files. The application makes a po file for each scanned file that contains a string in a programming language. Then you compile them to binary mo files, which gettext parses. You call a method using a high-level API such as Zend_Translate and specify that you want to use gettext; it can be set up to cache translations and it just returns those.
The part I'm really unclear about is how the editing of po files is actually done. It's manual, right? Then, once the compilation is done, the application of course relies on the binary mo files.
And if someone could provide useful linux applications for editing .po files I'd be grateful.
The tutorial on NLS using GNU gettext should help you understand the process.
As for editing .po files, there are at least two applications (apart from vi :-): gtranslator and poedit.
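To make the compile and lookup steps concrete, here is a toy sketch in Python, whose standard gettext module reads the same binary .mo format. The compile_mo function is a hypothetical stand-in for what msgfmt does with a translated .po file (real projects should use msgfmt or an editor such as poedit), and the two example translations are made up. It packs msgid/msgstr pairs into the .mo layout in memory, then looks strings up the way an application would at runtime.

```python
import gettext
import io
import struct


def compile_mo(messages):
    """Pack a {msgid: msgstr} dict into the binary GNU .mo layout (little-endian, no hash table)."""
    # The empty msgid carries catalog metadata; the charset tells the loader how to decode.
    catalog = {"": "Content-Type: text/plain; charset=UTF-8\n"}
    catalog.update(messages)
    ids = sorted(catalog, key=lambda s: s.encode("utf-8"))  # .mo files keep msgids sorted
    id_blob = b""
    str_blob = b""
    spans = []
    for msgid in ids:
        raw_id = msgid.encode("utf-8")
        raw_str = catalog[msgid].encode("utf-8")
        spans.append((len(raw_id), len(id_blob), len(raw_str), len(str_blob)))
        id_blob += raw_id + b"\0"
        str_blob += raw_str + b"\0"
    id_table = 7 * 4                       # the header is seven 32-bit fields
    str_table = id_table + 8 * len(ids)    # each table entry is (length, offset)
    id_start = str_table + 8 * len(ids)
    str_start = id_start + len(id_blob)
    out = struct.pack("<7I", 0x950412DE, 0, len(ids), id_table, str_table, 0, 0)
    for id_len, id_off, _, _ in spans:
        out += struct.pack("<2I", id_len, id_start + id_off)
    for _, _, str_len, str_off in spans:
        out += struct.pack("<2I", str_len, str_start + str_off)
    return out + id_blob + str_blob


# "Compile" two translations, then load them the way an application would load a .mo file.
mo_bytes = compile_mo({"Hello": "Ahoj", "Save": "Uložit"})
translations = gettext.GNUTranslations(io.BytesIO(mo_bytes))
_ = translations.gettext

print(_("Hello"))   # translated string from the catalog
print(_("Cancel"))  # not in the catalog, so the msgid itself comes back
```

The editing step in between is indeed manual: a translator fills in the msgstr for each msgid in the .po file, and only the compiled .mo is read at runtime.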