I have some Unicode strings represented in hexadecimal form, written in an ini file like the following:
Text to Convert=#$a5e#$a5a#$a5b
I would like to convert these to a wide string in the Unicode version of Inno Setup, but I couldn't find a way to do so.
Inno Setup allows you to call functions from third-party DLLs, so you can write a simple DLL that implements this conversion in your favorite language and call it from Inno Setup.
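For illustration, here is the core of the decoding logic such a DLL (or helper) would have to implement, sketched in Java only because the DLL side can be written in any language: split the value on the #$ markers and map each hexadecimal token to its code point (the value below is just the example from the question):
String value = "#$a5e#$a5a#$a5b";                       // raw value read from the ini file
StringBuilder decoded = new StringBuilder();
for (String token : value.split("#\\$")) {              // each token is a hexadecimal code point
    if (!token.isEmpty()) {
        decoded.appendCodePoint(Integer.parseInt(token, 16));
    }
}
String wide = decoded.toString();                       // U+0A5E U+0A5A U+0A5B as one string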
I need to make changes to an old legacy project in Delphi 7.
I need to save a TStringList to a file with Unicode encoding. All resources I have found describe a second parameter for specifying an encoding in the SaveToFile()/LoadFromFile() methods, but there is no such parameter in Delphi 7. It was probably added in later versions.
How can I save UTF-8 text to a file (.csv) in Delphi 7?
The parameter you are looking for was introduced in Delphi 2009, when Delphi's String type migrated from AnsiString to UnicodeString.
Prior to Delphi 2009, you will have to encode the TStringList entries to UTF-8 yourself. You can put UTF-8 data in an AnsiString (UTF8String in those versions was just an alias for AnsiString), and TStringList will save the data as-is.
You may be tempted to use the RTL's UTF8Encode() function, but be aware that prior to Delphi 2009 it does not support Unicode codepoints above U+FFFF (it has no handling for UTF-16 surrogate pairs). If you need to handle codepoints that high, use the Win32 WideCharToMultiByte() function with the CP_UTF8 codepage instead.
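To make the U+FFFF limit concrete (a quick sketch in Java, only because it is convenient here; the point itself is language-independent): a code point above U+FFFF occupies two UTF-16 code units (a surrogate pair) and four UTF-8 bytes, which is exactly the case a UCS-2-only encoder mishandles:
import java.nio.charset.StandardCharsets;

String s = new String(Character.toChars(0x1F600));              // one code point above U+FFFF
System.out.println(s.length());                                 // 2: a UTF-16 surrogate pair
System.out.println(s.getBytes(StandardCharsets.UTF_8).length);  // 4: four UTF-8 bytes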
I know how to read a file in Haxe using sys.io.File.read (see Reading lines from a file in Haxe; I'm also aware that the sys module is not available on every target). However, how can I tell sys.io.File.read that my text file is encoded with a certain encoding (e.g. UTF-16, UTF-8, ISO-8859-1, ...)?
There is no way to do this at the File level, but you can encode / decode the String after reading the file. For instance, Utf8.encode() will convert an ISO-8859-1 string to a UTF-8 string:
var isoString = sys.io.File.getContent("iso_file.txt");
var utf8String = haxe.Utf8.encode(isoString);
sys.io.File.saveContent("utf8_file.txt", utf8String);
The standard library currently doesn't support UTF-16, but it's coming in Haxe 4. In the meantime, you can use libraries such as unifill for that.
By the way, if you don't need to read the file line by line, File.getContent() is much more convenient than the File.read() approach you linked to.
I'm using org.apache.tika.Tika.parseToString() to convert documents into plain text (i.e., unformatted text) files. My application potentially needs to convert documents that don't use a Unicode character set. For instance, some documents may be encoded in the Chinese GB2312 character set. It would be great if Tika re-coded the output into UTF-8. This would require Tika to reference a mapping between many different character sets and Unicode in order to convert the characters.
Does Tika convert the non-Unicode character set text into Unicode as the output of parseToString()? There are a lot of character sets out there so I would be impressed if Tika did this for more than a few character sets.
Update: I was able to create a couple of different files with non-Latin charsets (GB2312 (Chinese) and KOI8-R (Russian)). Tika.parseToString() couldn't even detect the charset or encoding. I opened an issue on the Tika bug tracker here: https://issues.apache.org/jira/browse/TIKA-1262
When talking about character sets in Apache Tika, you need to consider two kinds of files differently: those that are basically just plain text, and the more complex types (including binary ones).
With the more complex files, Tika mostly uses third party libraries, and these libraries are responsible for returning Java Strings. The exact way of doing that depends on the file format in question - sometimes the file format will include encoding information, other times it'll be fixed in what it supports. Either way, Tika gets Java Strings, and returns a Java String to you. How you choose to encode that for output is up to you. (For Windows users especially, check the encoding of your terminal and the font used. There have been lots of "Tika Encoding Problems" which were actually people failing to correctly set the default Java encoding on output, or failing to have a Unicode-capable terminal!)
With plain text files, there's no encoding information in the file; all we have is a bunch of bytes. Here, Apache Tika uses one of a number of EncodingDetector instances to do the detection. These use hints, n-grams, language detection, etc. to try to work out the most likely encoding of the file based on the information given, the pattern of bytes in the file, and so on.
The definition of EncodingDetector is held in the Tika-Core jar, but most of the implementations are held in the Tika-Parsers jar (and loaded by the service loader method, just like Detectors and Parsers). The main ones are here in SVN. If you check there, you'll see the main list of encodings that Tika can detect.
One final thing - encoding detection is only performed on files that are plain text; it isn't done on the binary-type files. Depending on how you call Tika, you might need to tweak that and/or provide a hint that it's a text file, so that the EncodingDetector logic gets triggered.
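If you want to run the detection yourself, the detectors can also be called directly. A minimal sketch, assuming the ICU4J-based detector from the Tika-Parsers jar (class and package names as in Tika 1.x) and a stream that supports mark/reset; detect() may return null if nothing plausible is found:
import java.io.BufferedInputStream;
import java.io.FileInputStream;
import java.io.InputStream;
import java.nio.charset.Charset;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.txt.Icu4jEncodingDetector;

try (InputStream in = new BufferedInputStream(new FileInputStream("some_file.txt"))) {
    Charset charset = new Icu4jEncodingDetector().detect(in, new Metadata());
    System.out.println("Detected charset: " + charset);
}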
This answer actually comes from a JIRA user on the Tika project. https://issues.apache.org/jira/browse/TIKA-1262
It turns out that if you tell Tika that the file extension is '.txt', it will treat the file as plain text, attempt to detect the encoding, and decode it to Unicode.
An easy way to do this is to pass an empty Metadata object to TikaInputStream.get(). This will fill out the resourceName field of the Metadata object. Then pass this object to parseToString(). With the resourceName field set to a file name that ends with .txt, the parser knows to treat the file as plain text and will do encoding detection to try to discover how to decode it. The string returned from parseToString() is a Java UTF-16 String object. When written to a file you can see that it is Unicode and uses the UCS charset.
import java.io.File;
import org.apache.tika.Tika;
import org.apache.tika.io.TikaInputStream;
import org.apache.tika.metadata.Metadata;
Tika tika = new Tika();
Metadata metadata = new Metadata();  // resourceName gets filled in by TikaInputStream.get()
TikaInputStream reader = TikaInputStream.get(new File(filepath), metadata);
String contents = tika.parseToString(reader, metadata);  // charset is detected for plain text
So far this has worked for text files using GB2312/GB18030 and KOI8-R. This is the expected behavior and it's perfect! I don't know what other charsets/encodings it can handle.
I work on an app that gets distributed via a single installer containing multiple localizations. The build process includes a script that updates the .ism string table with translations for each supported language.
This works fine for languages like French and German. But when testing the installer in, e.g., Japanese, the text shows up as a series of squares. It's unlikely to be a font problem, since the InstallShield-supplied strings show up fine; only the string table entries are mangled. So the problem seems to be that the strings are in the wrong encoding.
The .ism is in XML format, with UTF-8 declared as its encoding, so I assumed the strings needed to be UTF-8 encoded as well. Do they actually need to use the encoding of the target platform? Is there any concern, then, about targets having different encodings, e.g. Chinese systems using one GB encoding versus another? What is the right thing to do here?
Edit: Using InstallShield 2009, since there is apparently a difference between that and 2010.
In InstallShield 2009 and earlier, the encoding is a base-64 encoding of the binary string in the ANSI encoding specific to the language in question (e.g. CP932 for Japanese). In InstallShield 2010 and later, it will still accept that or use UTF-8, depending on other columns in that table.
Thanks (and an up-vote) go to Michael Urman for pointing us in the right direction, but here is the actual working algorithm (for InstallShield 2009), reverse-engineered by a co-worker:
Start with a Unicode string
Write out its length as the encoded-length field in the .ism file
Encode the string as UTF-16 little-endian
Base-64 encode the bytes using the uuencode dictionary, except with ` (back-tick) instead of spaces
Write the result to the .ism file, escaping XML entities
Be aware that base-64 encoding with the uuencode dictionary is not the same as using the uuencode algorithm. Standard uuencode produces a set of newline-separated lines, including a header, a footer and one or more data lines, each of which begins with a length character. If you're implementing this using a uuencode codec, you'll need to strip all of that off.
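Here is a sketch of those steps in Java; the handling of a trailing partial group (padding) is an assumption, and the separate encoded-length field and the XML escaping still have to be handled when the value is written into the .ism file, so verify the output against strings your project already contains:
import java.nio.charset.StandardCharsets;

public class IsmStringEncoder {
    // Map a 6-bit value to the uuencode character set, with '`' substituted for space.
    private static char uuChar(int v) {
        return v == 0 ? '`' : (char) (32 + v);
    }

    // Produces only the encoded payload for one string-table value.
    static String encode(String text) {
        byte[] bytes = text.getBytes(StandardCharsets.UTF_16LE);   // UTF-16 little-endian bytes
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < bytes.length; i += 3) {
            int n = Math.min(3, bytes.length - i);                 // bytes in this group (1..3)
            int group = 0;
            for (int j = 0; j < 3; j++) {
                group = (group << 8) | (j < n ? bytes[i + j] & 0xFF : 0);
            }
            for (int k = 0; k <= n; k++) {                         // 3 bytes -> 4 chars, 2 -> 3 (assumed)
                out.append(uuChar((group >> (18 - 6 * k)) & 0x3F));
            }
        }
        return out.toString();
    }
}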
I'm also trying to figure this out...
I've inherited some InstallShield 12 (which is pre-2009) projects with string table entries containing characters outside the range of base64 'target' characters.
For example, one of the Japanese strings is:
4P!H&$9!O'<4!R&\=!E&,=``#$(80!C&L=0!P"00!G`&4`;#!T`)(PI##S,+DPR##\,.LP5S!^,%DP`C
After much searching I happened upon Base85 encoding, which looks much closer to being plausible, but I have not yet verified that it is the solution.
Is there a way to convert simplified Chinese characters to traditional characters in Cocoa/Objective-C? On the .NET platform you can include a VB DLL in your project that gives you access to a function for an easy conversion. Is there anything I can use in Cocoa/Objective-C that will let me do the same? I want to go in both directions, simplified to traditional and vice versa. Thank you!
As far as I know, Apple does not provide a public API that lets you convert Chinese characters by simply calling a function, but you can do the conversion character by character yourself.
The OpenVanilla project, an open source input method project, maintains Chinese character conversion tables. They are used by the input method software, but I think they could also be used for other purposes. They are available at:
http://github.com/lukhnos/openvanilla-oranje/blob/master/Modules/OVOFHanConvert/VXHCSC2TCTable.c
http://github.com/lukhnos/openvanilla-oranje/blob/master/Modules/OVOFHanConvert/VXHCTC2SCTable.c
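Just to illustrate the table-driven, character-by-character approach (sketched in Java for brevity; in Cocoa you would build the same kind of lookup from the tables linked above, and the two mappings below are only hand-picked examples, not taken from those files):
import java.util.Map;

// Tiny illustrative mapping; a real table covers thousands of characters.
Map<Character, Character> simplifiedToTraditional = Map.of(
    '汉', '漢',   // hàn, as in hanzi
    '国', '國'    // guó, country
);

String simplified = "汉国";
StringBuilder traditional = new StringBuilder();
for (char c : simplified.toCharArray()) {
    traditional.append(simplifiedToTraditional.getOrDefault(c, c)); // pass unmapped characters through
}
System.out.println(traditional);  // 漢國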