Using gperf on UTF-16 encoded input? - unicode

When moving code that uses a gperf-generated hashing function to use UTF-16 for its strings, how would you adapt/call the hashing function? The options I can see are:
Convert UTF-16 to UTF-8 for the hashing.
This should work out-of-the-box, but involves a conversion step I hope to be able to avoid.
Use the -c option to make gperf use strncmp and encode the input file accordingly, writing \000h\000e\000l\000l\000o for hello.
I didn't actually test this and would prefer to keep the input file readable and grep-able. But I guess the transformation step could be done with a preprocessing script from the actual source file.
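The preprocessing step mentioned for the second option could look roughly like the Python sketch below. It assumes big-endian UTF-16, matching the \000h\000e\000l\000l\000o example, and the helper name utf16be_escape is made up for illustration. Whether gperf accepts keywords with embedded NUL bytes written this way is not verified here, just as in the question.

```python
def utf16be_escape(word: str) -> str:
    """Render a keyword as UTF-16BE bytes, escaping non-alphanumerics in octal.

    Hypothetical helper: ASCII letters and digits are kept literal so the
    output stays somewhat readable; every other byte becomes a three-digit
    octal escape such as \\000.
    """
    pieces = []
    for byte in word.encode("utf-16-be"):
        char = chr(byte)
        if char.isascii() and char.isalnum():
            pieces.append(char)
        else:
            # Always emit three octal digits so a following digit
            # cannot be absorbed into the escape sequence.
            pieces.append(f"\\{byte:03o}")
    return "".join(pieces)


def convert_keywords(lines):
    """Convert readable keywords (one per line) to their escaped form."""
    return [utf16be_escape(line.strip()) for line in lines if line.strip()]
```

The readable keyword list stays the source of truth and remains grep-able; only the generated file handed to gperf contains the escapes.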

How to detect file encoding in Octave?

I am working with many XML files and some of them are UTF-8 while most are ANSI.
In the UTF-8 files, the XML header states:
<?xml version="1.0" encoding="ISO8859-1" ?>
However that information is wrong.
The problem this generates is that I use unicode2native to generate correct XLS files, which generates bad output when the file is UTF-8 encoded.
How can I detect which is the real encoding of each file programmatically?
Locating them manually with a text editor is not feasible: there are hundreds of files, and my solution must also work with files I don't have access to.
There's no easy way to do this generally: because a given file might be a valid sequence in multiple encodings, detecting the character encoding requires using heuristics that are aware of natural language features, such as character frequencies, common words, and so on.
Octave doesn't have direct support for this. So you'll need to use an external program or library. Options include ICU4C, compact_enc_det, chardet, juniversalchardet, and others. chardet would probably be the easiest for you to use, since you can just install it and call it as an external command, instead of building a custom program or oct-file using a library. Or juniversalchardet, since if you have a Java-enabled Octave build, it's easy to pull in and use Java libraries from Octave code.
If it's really true that your input files are all either ANSI (Windows 1252/ISO 8859-1) or UTF-8, and no other encodings, you might be able to get away with just checking each file's contents to see if it's a valid UTF-8 string, and assume that any that are not valid UTF-8 are ANSI. Only certain byte sequences are valid UTF-8 encodings, so there's a good chance that the ANSI-encoded files are not valid UTF-8. I think you can check whether a file is valid UTF-8 in pure Octave by doing utf8_bytes = unicode2native(file_contents, 'UTF-8') on it, and seeing if the utf8_bytes output is identical to just casting file_contents directly to uint8. If that doesn't work, you can fall back to using Java's character encoding support (and that you can do with Java Standard Library stuff on any Java-enabled Octave build, without having to load an external JAR file).
And if all your input files are either UTF-8 or strictly 7-bit ASCII, then you can just treat them all as UTF-8, because 7-bit ASCII is a valid subset of UTF-8.
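The "valid UTF-8, otherwise ANSI" check described above can be sketched in a few lines of Python. This is only an illustration of the heuristic, not Octave code: the function name guess_encoding is made up, and it assumes the only two candidates are UTF-8 and Windows-1252.

```python
def guess_encoding(raw: bytes) -> str:
    """Return "utf-8" if the bytes decode strictly as UTF-8, else "cp1252".

    Assumption: the file is known to be one of these two encodings.
    Pure 7-bit ASCII content is reported as "utf-8", which is harmless
    because ASCII is a subset of UTF-8.
    """
    try:
        raw.decode("utf-8")  # strict decoding raises on invalid sequences
    except UnicodeDecodeError:
        return "cp1252"
    return "utf-8"


def guess_file_encoding(path: str) -> str:
    """Read a file as raw bytes and apply the same heuristic."""
    with open(path, "rb") as handle:
        return guess_encoding(handle.read())
```

The heuristic can misclassify a Windows-1252 file whose non-ASCII bytes happen to form valid UTF-8 sequences, but that is rare in natural-language text.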
Palliative solution that I found for Windows 10, while I can't find a proper way to do this in pure Octave:
[~, output] = system(['file --mime-encoding "', fileAddress, '"']);
encoding = strsplit(output)(columns(strsplit(output, ' '))){1};
if strcmp('utf-8', encoding)
sheet(1, 1) = {strcat('', unicode2native(myText, 'ISO-8859-1'))};
else
sheet(1, 1) = {myText};
endif

Base58 Encoder function in PostgreSQL for TEXT

Can anyone help me to implement Base58 encoding stored procedure in PostgreSQL.
I've found answer for numbers but I'm looking for similar stored procedure that can accept TEXT or VARCHAR value.
On this very rare occasion I'm going to suggest you don't do this. It will be computationally possible but highly inadvisable.
https://en.wikipedia.org/wiki/Base58
In contrast to Base64, the digits of the encoding don't line up well
with byte boundaries of the original data. For this reason, the method
is well-suited to encode large integers, but not designed to encode
longer portions of binary data.
To put this another way, Base58 is not designed to encode strings / text. Your main alternatives are:
Base64, which is safe to copy and paste but error-prone when a human copies it by hand
Hexadecimal which is easily copied by humans but significantly longer than Base64
If you feel you really need Base58 and not Base64 then it may be worth editing your requirements into your question. This may help someone give an answer more specific to your requirements:
What are these strings you need to convert (examples are preferable)?
Why do they need to be Base58 and not Base64 (what other system are you passing these to)?
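To illustrate why Base58 suits integers better than text, here is a minimal Python sketch (not PostgreSQL) of the usual approach. It assumes the Bitcoin alphabet and the Bitcoin convention of writing each leading zero byte as "1"; the function name base58_encode is made up. The whole input is treated as one big integer and divided by 58 repeatedly, so the cost grows quickly with input length and no output digit corresponds to a fixed group of input bytes.

```python
# Bitcoin-style Base58 alphabet (no 0, O, I, or l); an assumption,
# since other Base58 variants order the characters differently.
ALPHABET = "123456789ABCDEFGHJKLMNPQRSTUVWXYZabcdefghijkmnopqrstuvwxyz"


def base58_encode(data: bytes) -> str:
    """Encode bytes by interpreting them as one big-endian integer."""
    number = int.from_bytes(data, "big")
    digits = []
    while number > 0:
        number, remainder = divmod(number, 58)
        digits.append(ALPHABET[remainder])
    # Leading zero bytes vanish in the integer conversion, so they are
    # re-added as "1" characters (Bitcoin convention).
    leading_zeros = len(data) - len(data.lstrip(b"\x00"))
    return "1" * leading_zeros + "".join(reversed(digits))
```

For text input the string would first have to be converted to bytes in a chosen encoding, for example "text".encode("utf-8").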

(Tcl) what character encoding set should I use?

So I'm trying to open and parse some old Visual Studio compilation log files with Tcl; my only problem is the files are in a strange encoding. Upon examining them with Notepad++ it seems they are in the 'UCS-2 Little Endian' encoding. Two questions:
Is there any command in Tcl that allows me to look at the character encoding of a file? I know there is encoding system which tells me the system encoding.
Using encoding names Tcl tells me the available encoding names are the following list:
cp860 cp861 cp862 cp863 tis-620 cp864 cp865 cp866 gb12345 gb2312-raw cp949 cp950 cp869 dingbats ksc5601 macCentEuro cp874 macUkraine jis0201 gb2312 euc-cn euc-jp macThai iso8859-10 jis0208 iso2022-jp macIceland iso2022 iso8859-13 jis0212 iso8859-14 iso8859-15 cp737 iso8859-16 big5 euc-kr macRomania macTurkish gb1988 iso2022-kr macGreek ascii cp437 macRoman iso8859-1 iso8859-2 iso8859-3 macCroatian koi8-r iso8859-4 ebcdic iso8859-5 cp1250 macCyrillic iso8859-6 cp1251 macDingbats koi8-u iso8859-7 cp1252 iso8859-8 cp1253 iso8859-9 cp1254 cp1255 cp850 cp1256 cp932 identity cp1257 cp852 macJapan cp1258 shiftjis utf-8 cp855 cp936 symbol cp775 unicode cp857
Given this, what would be the appropriate name to use in the fconfigure -encoding command to read these UCS-2 Little Endian encoded files and convert them to UTF-8 for use? If I understand the fconfigure command correctly, I need to specify the encoding type of the source file rather than what I want it to be; I just don't know which of the options in the above list corresponds to UCS-2 Little Endian. After reading a little bit, I see that UCS-2 is a predecessor of the UTF-16 character encoding, but that option isn't here either.
Thanks!
I'm afraid, currently there's no way to do it just by using fconfigure -encoding ?something?: the unicode encoding has rather moot meaning, and there's a feature request to create explicit support for UTF-16 variants.
What could you do about it?
Since unicode in Tcl running on Windows should mean UTF-16 with native endianness (little-endian on Wintel; see footnote 1), a quick and dirty solution is to try -encoding unicode and see if that helps.
If you're aiming for a more bullet-proof, future-proof, or cross-platform solution, I'd switch the channel to binary mode, read the contents in chunks of two bytes at a time, and then use
binary scan $twoBytes s n
to scan the sequence of two bytes in $twoBytes as a 16-bit little-endian integer into a variable named "n", followed by something like
set c [format %c $n]
to produce a unicode character out of the number in $n, and assign it to a variable.
This approach takes a bit more care to get right:
You might check the very first character obtained from the stream to see if it's a byte-order mark, and drop it if it is.
If you need to process the stream in a line-wise manner, you'd have to implement a little state machine that handles the CR+LF sequences correctly.
When doing read $channelId 2 to get the next character, check for a result of length 1 as well as 0 or 2: a single leftover byte means the file is corrupted or truncated, and you should handle that case.
The UCS-2 encoding differs from UTF-16 in that the latter might contain so-called surrogate pairs, and hence is not a fixed-length encoding. Handling a UTF-16 stream properly therefore also implies detecting those surrogate pairs. On the other hand, I hardly believe a compilation log produced by MSVS would contain them, so I'd just assume it's encoded in UCS-2LE.
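The manual two-bytes-at-a-time approach and its caveats can be sketched in Python as follows. This is an illustration of the same steps, not Tcl code: the function name decode_ucs2le is made up, and the input is assumed to be UCS-2LE with no surrogate pairs.

```python
import struct


def decode_ucs2le(data: bytes) -> str:
    """Decode UCS-2 little-endian bytes one 16-bit code unit at a time.

    Assumptions: no surrogate pairs (plain UCS-2), and an odd trailing
    byte is treated as corruption rather than silently dropped.
    """
    if len(data) % 2 != 0:
        raise ValueError("truncated input: odd number of bytes")
    chars = []
    for offset in range(0, len(data), 2):
        # "<H" reads an unsigned 16-bit little-endian integer.
        (code_unit,) = struct.unpack_from("<H", data, offset)
        if offset == 0 and code_unit == 0xFEFF:
            continue  # drop a leading byte-order mark
        chars.append(chr(code_unit))
    return "".join(chars)


def split_lines(text: str):
    """Split decoded text into lines, treating CR+LF as one line ending."""
    return text.replace("\r\n", "\n").split("\n")
```

Reading the code unit as unsigned matters here: a signed 16-bit read would turn values above 0x7FFF into negative numbers, which cannot be converted to characters.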
1 The true story is that the only thing Tcl guarantees about textual strings it handles (that is, those obtained by manipulating text, not via binary format or encoding convertto or reading a stream in binary mode) is that they're Unicode (or, rather, the "BMP" part of it).
But technically, the interpreter might switch the internal representation of any string between the UTF-8 encoding it uses by default and some fixed-length encoding, which is what the name "unicode" refers to. The "problem" is that no part of the Tcl documentation specifies that internal fixed-length encoding, because you're required to explicitly convert any text you output or read to or from some specific encoding, either by configuring the stream, by using encoding convertfrom and encoding convertto, or by using binary format and binary scan. The interpreter will do the right thing no matter which precise encoding it's currently using for your source string value; it's all transparent. Moreover, the next release of the "standard" Tcl interpreter might decide to drop this internal feature completely, or, say, use 32-bit or 64-bit integers for that internal fixed-length encoding. Whatever "non-standard" interpreters (like Jacl etc) do is also up to them. In other words, this feature is internal and is not a part of the documented contract about the interpreter's behaviour. And by the way, the "standard" encoding for Tcl strings (UTF-8) is not specified as such either; it's just an implementation detail.
In Tcl v8.6.8 I could solve the same issue with fconfigure channelId -encoding unicode.

How do I use the StackExchange API from Matlab?

How do I access data from the StackExchange API using Matlab?
The naive
sitedata = urlread('http://api.stackoverflow.com/1.1/questions?tagged=matlab')
fails since the data is compressed. However, when I write this to file (using fprintf(fileID,'%s',sitedata)), I get a zip-file that cannot be uncompressed.
Try urlwrite() instead:
urlwrite('http://api.stackoverflow.com/1.1/questions?tagged=matlab',...
'tempfile.zip')
gunzip('tempfile.zip')
fid = fopen('tempfile');
str = textscan(fid,'%s','Delimiter','\n');
fclose(fid);
A better version of this snippet would use tempname to dynamically generate temporary filenames.
Matlab's urlread assumes you're getting text data back, not binary. The gzip binary data is getting mangled either when urlread is decoding the character data to Unicode values to stick in Matlab chars, or when the formatted-output fprintf function is writing them out, encoding them to UTF-8 or whatever default character encoding you're using for fileID and changing the byte sequence, or maybe both.
IIRC, urlread will default to using ISO-8859-1 encoding, which means the bytes will be turned into the Unicode code points with the same numeric values - effectively just a widening. So you can get the byte data back by doing sitebytes = uint8(sitedata). (That's a regular uint8() conversion, not a typecast().) (If this isn't the case, you can probably fiddle with urlread's CharSet option.)
If you can't get the right bytes out from urlread by fiddling with the encoding and casts, then you can drop down and make calls against the Java HttpAgent like urlread does and bypass the character set decoding step, or fiddle with its options. See the urlread source for how to do it.
Once you have the right bytes in memory, you can write them out to a file using the lower-level fwrite() function, which won't mangle them by doing character set encoding. Then you'll have a valid gzip file of the site's original response. (I think it'll work if you also just use fwrite(fileID, sitedata, 'uint8') directly on the char string, but it's uglier IMHO.)
You can also unzip it in memory using Java classes and save a trip to the filesystem. Do jsitebytes = typecast(sitebytes, 'int8') to get them as Java-friendly signed bytes and then stick it into a ByteArrayInputStream and read it out through a GZIPInputStream. You'll need to build a little Java helper class because Matlab doesn't play well with passing byte[] buffers by reference like java.io wants, but it may be worthwhile if you do a lot of in-memory munging like this.
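The byte-mangling argument can be checked with a small Python sketch (not Matlab). It simulates a gzip response that was decoded as ISO-8859-1 text, then shows that converting back with the same encoding recovers the bytes, while writing the text out as UTF-8 does not. The payload and function names are made up for illustration.

```python
import gzip


def recover_bytes(text: str) -> bytes:
    """Undo an ISO-8859-1 decode: each code point maps back to one byte."""
    return text.encode("iso-8859-1")


def gunzip_in_memory(compressed: bytes) -> bytes:
    """Decompress gzip data without touching the filesystem."""
    return gzip.decompress(compressed)


# Simulated response body; the JSON content is a made-up placeholder.
original = b'{"questions": []}'
payload = gzip.compress(original)

# What a text-oriented reader would hand back if it assumed ISO-8859-1.
as_text = payload.decode("iso-8859-1")

# Writing that text out as UTF-8 changes every byte above 0x7F,
# including the second gzip magic byte 0x8B.
mangled = as_text.encode("utf-8")
```

Here recover_bytes(as_text) equals payload and decompresses cleanly, whereas mangled is longer than payload and is no longer a valid gzip stream.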
When working with web services or fancier data downloads (e.g. sites that need sessions or certificates), I've often ended up dropping down and coding directly against the HttpAgent and java.io classes from within Matlab.

Command-line arguments as bytes instead of strings in python3

I'm writing a python3 program, that gets the names of files to process from command-line arguments. I'm confused regarding what is the proper way to handle different encodings.
I think I'd rather consider filenames as bytes and not strings, since that avoids the danger of using an incorrect encoding. Indeed, some of my file names use an incorrect encoding (latin1 when my system locale uses utf-8), but that doesn't prevent tools like ls from working. I'd like my tool to be resilient to that as well.
I have two problems: the command-line arguments are given to me as strings (I use argparse), and I want to report errors to the user as strings.
I've successfully adapted my code to use bytes, and my tool can handle files whose names are invalid in the current default encoding, as long as it finds them by recursing through the filesystem, because I convert the arguments to bytes early and use bytes when calling fs functions. When I receive a filename argument which is invalid, however, it is handed to me as a unicode string with strange characters like \udce8. I do not know what these are, and trying to encode it always fails, be it with utf8 or with the corresponding (wrong) encoding (latin1 here).
The other problem is for reporting errors. I expect users of my tool to parse my stdout (hence wanting to preserve filenames), but when reporting errors on stderr I'd rather encode it in utf-8, replacing invalid sequences with appropriate "invalid/question mark" characters.
So,
1) Is there a better, completely different way to do it? (Yes, fixing the filenames is planned, but I'd still like my tool to be robust.)
2) How do I get the command line arguments in their original binary form (not pre-decoded for me), knowing that for invalid sequences re-encoding the decoded argument will fail, and
3) How do I tell the utf-8 codec to replace invalid, undecodable sequences with some invalid mark rather than dying on me?
When I receive a filename argument
which is invalid, however, it is
handed to me as a unicode string with
strange characters like \udce8.
Those are surrogate characters. The low 8 bits are the original invalid byte.
See PEP 383: Non-decodable Bytes in System Character Interfaces.
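A short sketch of how the PEP 383 mechanism can be used for questions 2 and 3, assuming a POSIX system. os.fsencode and os.fsdecode are standard library functions that apply the filesystem encoding with the surrogateescape error handler; the helper names arg_to_bytes and printable_name are made up for illustration.

```python
import os


def arg_to_bytes(arg: str) -> bytes:
    """Recover the original bytes of a command-line argument.

    os.fsencode reverses the surrogateescape decoding that Python
    applied to sys.argv, so lone surrogates such as \\udce8 turn back
    into the original undecodable byte.
    """
    return os.fsencode(arg)


def printable_name(name: bytes) -> str:
    """Decode a filename for error messages, replacing invalid
    sequences with U+FFFD instead of raising."""
    return name.decode("utf-8", errors="replace")
```

Typical use would be raw = arg_to_bytes(sys.argv[1]) for filesystem calls and printable_name(raw) when writing to stderr. Encoding the argument with a plain "utf-8" or "latin1" codec fails precisely because those codecs refuse lone surrogates unless the surrogateescape handler is named.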
Don't go against the grain: filenames are strings, not bytes.
You shouldn't use a bytes when you should use a string. A bytes is a tuple of integers. A string is a tuple of characters. They are different concepts. What you're doing is like using an integer when you should use a boolean.
(Aside: Python stores all strings in-memory under Unicode; all strings are stored the same way. Encoding specifies how Python converts the on-file bytes into this in-memory format.)
Your operating system stores filenames as strings under a specific encoding. I'm surprised you say that some filenames have different encodings; as far as I know, the filename encoding is system-wide. Functions like open default to the default system filename encoding, for example.