How to identify UTF-8 encoded strings - unicode

What's the best way to identify if a string (is or) might be UTF-8 encoded? The Win32 API IsTextUnicode isn't of much help here. Also, the string will not have a UTF-8 BOM, so that cannot be checked for. And, yes, I know that only characters above the ASCII range are encoded with more than 1 byte.

chardet is a character set detection library developed by Mozilla and used in Firefox; its source code is available.
jchardet is a Java port of Mozilla's automatic charset detection algorithm.
NCharDet is a .NET (C#) port of a Java port of the C++ code used in the Mozilla and Firefox browsers.
There is a Code Project C# sample that uses Microsoft's MLang for character encoding detection.
UTRAC is a command-line tool and library written in C++ for detecting string encodings.
cpdetector is a Java project for encoding detection.
chsdet is a Delphi project; it is a stand-alone module for automatic charset/encoding detection of a given text or file.
Another useful post that points to a lot of libraries to help you determine character encoding: http://fredeaker.blogspot.com/2007/01/character-encoding-detection.html
You could also take a look at the related question How Can I Best Guess the Encoding when the BOM (Byte Order Mark) is Missing?, it has some useful content.

There is no truly reliable way, but basically, a random sequence of bytes (e.g. a string in a standard 8-bit encoding) is very unlikely to be a valid UTF-8 string: if the most significant bit of a byte is set, there are very specific rules as to what kind of bytes can follow it in UTF-8. So you can try decoding the string as UTF-8 and consider it to be UTF-8 if there are no decoding errors.
Determining whether there were decoding errors is another problem altogether; many Unicode libraries simply replace invalid characters with a question mark without indicating whether an error occurred. So you need an explicit way of determining whether an error occurred while decoding.
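To make that concrete, here is a minimal C++ sketch of such a validity check (the function name is mine, and the strictness level, rejecting overlong forms, surrogates and code points above U+10FFFF, is a choice made for this illustration):

#include <cstddef>
#include <cstdint>
#include <string>

// Hypothetical helper: returns true if `s` is structurally valid UTF-8.
bool looks_like_utf8(const std::string& s)
{
    std::size_t i = 0;
    while (i < s.size()) {
        const unsigned char b = static_cast<unsigned char>(s[i]);
        std::size_t extra;      // continuation bytes expected after the lead byte
        std::uint32_t cp, min;  // decoded code point and its minimum legal value
        if (b < 0x80)      { ++i; continue; }                   // plain ASCII
        else if (b < 0xC2) { return false; }                    // stray continuation byte or overlong lead
        else if (b < 0xE0) { extra = 1; cp = b & 0x1F; min = 0x80; }
        else if (b < 0xF0) { extra = 2; cp = b & 0x0F; min = 0x800; }
        else if (b < 0xF5) { extra = 3; cp = b & 0x07; min = 0x10000; }
        else               { return false; }                    // 0xF5..0xFF never occur in UTF-8
        if (i + extra >= s.size()) return false;                // truncated sequence
        for (std::size_t k = 1; k <= extra; ++k) {
            const unsigned char c = static_cast<unsigned char>(s[i + k]);
            if ((c & 0xC0) != 0x80) return false;               // not a 10xxxxxx continuation byte
            cp = (cp << 6) | (c & 0x3F);
        }
        if (cp < min) return false;                             // overlong encoding
        if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)) return false; // out of range or surrogate
        i += extra + 1;
    }
    return true;
}

If the check passes, the input is at least decodable as UTF-8 (and, for non-trivial text, very probably is UTF-8); if it fails, it definitely is not UTF-8.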

This W3C page has a Perl regular expression for validating UTF-8.

You didn't specify a language, but in PHP you can use mb_check_encoding:
if (mb_check_encoding($yourString, 'UTF-8')) {
    // the string is valid UTF-8
} else {
    // the string is not valid UTF-8
}

For Win32, you can use the MLang API; it is part of Windows and supported from Windows XP onwards. A nice thing about it is that it gives you statistics on how likely the input is to be in a particular encoding:
CComPtr<IMultiLanguage2> lang;
HRESULT hr = lang.CoCreateInstance(CLSID_CMultiLanguage, NULL, CLSCTX_INPROC_SERVER);
char str[] = "\xEF\xBB\xBF" "abc"; // bytes: EF BB BF 61 62 63 (UTF-8 BOM followed by "abc")
int size = 6;                      // number of bytes to analyze
DetectEncodingInfo encodings[100]; // receives one entry per candidate encoding
int encodingsCount = 100;          // in: array capacity, out: number of results
hr = lang->DetectInputCodepage(MLDETECTCP_NONE, 0, str, &size, encodings, &encodingsCount);

To do character detection in Ruby, install the 'chardet' gem:
sudo gem install chardet
Here's a little Ruby script to run chardet over the standard input stream:
require "rubygems"
require 'UniversalDetector' # provided by the chardet gem
infile = $stdin.read
p UniversalDetector::chardet(infile)
Chardet outputs a guess at the character set encoding and also a confidence level (0-1) from its statistical analysis.
See also this snippet.

A standalone C/C++ library based on Mozilla's character set detector:
https://github.com/batterseapower/libcharsetdetect
Universal Character Set Detector (UCSD)
A library exposing a dependency-free C interface to the Mozilla C++ UCSD library. It provides a highly accurate set of heuristics that attempt to determine the character set used to encode some input text. This is extremely useful when your program has to handle an input file that is supplied without any encoding metadata.

On Windows, you can use MultiByteToWideChar() with the CP_UTF8 codepage and the MB_ERR_INVALID_CHARS flag. If the function fails, the string is not valid UTF-8.
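A minimal sketch of that check (the wrapper function name is mine; the API call itself is standard Win32):

#include <windows.h>
#include <string>

// Returns true if `bytes` decodes cleanly as UTF-8.
// MB_ERR_INVALID_CHARS makes MultiByteToWideChar fail on malformed input
// instead of silently substituting characters.
bool IsValidUtf8(const std::string& bytes)
{
    if (bytes.empty())
        return true;  // an empty string is trivially valid
    const int chars = MultiByteToWideChar(CP_UTF8, MB_ERR_INVALID_CHARS,
                                          bytes.data(), static_cast<int>(bytes.size()),
                                          NULL, 0);  // size query only, no output buffer
    return chars != 0;
}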

As an add-on to the previous answer about the Win32 MLang DetectInputCodepage() API, here's how to call it in C:
#include <Mlang.h>
#include <objbase.h>
#pragma comment(lib, "ole32.lib")
HRESULT hr;
IMultiLanguage2 *pML;
char *pszBuffer;               // points to the text to analyze (not shown: fill it in)
int iSize;                     // in: number of bytes in pszBuffer, out: bytes actually analyzed
DetectEncodingInfo lpInfo[10]; // receives the candidate encodings and their confidences
int iCount = sizeof(lpInfo) / sizeof(DetectEncodingInfo); // in: capacity, out: results returned
hr = CoInitialize(NULL);
hr = CoCreateInstance(&CLSID_CMultiLanguage, NULL, CLSCTX_INPROC_SERVER, &IID_IMultiLanguage2, (LPVOID *)&pML);
hr = pML->lpVtbl->DetectInputCodepage(pML, 0, 0, pszBuffer, &iSize, lpInfo, &iCount);
pML->lpVtbl->Release(pML);     // release the COM object before uninitializing
CoUninitialize();
But the test results are very disappointing:
It can't distinguish between French texts in CP 437 and CP 1252, even though the text is completely unreadable if opened in the wrong code page.
It can detect text encoded in CP 65001 (UTF-8), but not text in UTF-16, which is wrongly reported as CP 1252 with good confidence!

Related

Looking for a C++ equivalent of _wfindfirst for char16_t

I have filenames using char16_t characters:
char16_t Text[2560] = u"ThisIsTheFileName.txt";
char16_t const* Filename = Text;
How can I check if the file exists already? I know that I can do so for wchar_t using _wfindfirst(). But I need char16_t here.
Is there an equivalent function to _wfindfirst() for char16_t?
Background for this is that I need to work with Unicode characters and want my code to work on Linux (32-bit) as well as on other platforms (16-bit).
_findfirst() is the counterpart to _wfindfirst().
However, both _findfirst() and _wfindfirst() are specific to Windows. _findfirst() accepts ANSI (outdated legacy stuff). _wfindfirst() accepts UTF-16 in the form of wchar_t (which is not exactly the same thing as char16_t).
ANSI and UTF-16 are generally not used on Linux, and _findfirst()/_wfindfirst() are not available with gcc there.
Linux uses UTF-8 for its Unicode format. You can use access() to check whether a file exists (or to check its permissions), or use opendir()/readdir()/closedir() as the equivalent of _findfirst().
If you have a UTF-16 filename from Windows, you can convert the name to UTF-8 and use the UTF-8 name on Linux. See How to convert UTF-8 std::string to UTF-16 std::wstring?
Consider using std::filesystem in C++17 or higher; a sketch follows below.
Note that a Windows or Linux executable is 32-bit or 64-bit; that has nothing to do with the character set. Some very old systems are 16-bit, but you probably won't come across them.
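To illustrate the std::filesystem suggestion, here is a small sketch (the filename is the one from the question); std::filesystem::path can be constructed from a char16_t string and converts it to the platform's native form, so the same code works on Windows and Linux:

#include <filesystem>
#include <iostream>
#include <string>

int main()
{
    char16_t const* Filename = u"ThisIsTheFileName.txt";

    // path accepts a std::u16string and converts it internally to the
    // native representation (UTF-16 on Windows, UTF-8 bytes on Linux).
    std::filesystem::path p{std::u16string{Filename}};

    std::error_code ec;  // use the non-throwing overload of exists()
    if (std::filesystem::exists(p, ec))
        std::cout << "file exists\n";
    else
        std::cout << "file not found\n";
}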

How to detect file encoding in Octave?

I am working with many XML files and some of them are UTF-8 while most are ANSI.
In the UTF-8 files, the XML header states:
<?xml version="1.0" encoding="ISO8859-1" ?>
However that information is wrong.
The problem this generates is that I use unicode2native to produce correct XLS files, and it generates bad output when the file is actually UTF-8 encoded.
How can I detect which is the real encoding of each file programmatically?
Manually locating them with the help of a text editor is not a feasible option, as there are hundreds of files, and my solution must also work with files to which I don't have access.
There's no easy way to do this generally: because a given file might be a valid sequence in multiple encodings, detecting the character encoding requires using heuristics that are aware of natural language features, such as character frequencies, common words, and so on.
Octave doesn't have direct support for this. So you'll need to use an external program or library. Options include ICU4C, compact_enc_det, chardet, juniversalchardet, and others. chardet would probably be the easiest for you to use, since you can just install it and call it as an external command, instead of building a custom program or oct-file using a library. Or juniversalchardet, since if you have a Java-enabled Octave build, it's easy to pull in and use Java libraries from Octave code.
If it's really true that your input files are all either ANSI (Windows 1252/ISO 8859-1) or UTF-8, and no other encodings, you might be able to get away with just checking each file's contents to see if it's a valid UTF-8 string, and assume that any that are not valid UTF-8 are ANSI. Only certain byte sequences are valid UTF-8 encodings, so there's a good chance that the ANSI-encoded files are not valid UTF-8. I think you can check whether a file is valid UTF-8 in pure Octave by doing utf8_bytes = unicode2native(file_contents, 'UTF-8') on it, and seeing if the utf8_bytes output is identical to just casting file_contents directly to uint8. If that doesn't work, you can fall back to using Java's character encoding support (and that you can do with Java Standard Library stuff on any Java-enabled Octave build, without having to load an external JAR file).
And if all your input files are either UTF-8 or strictly 7-bit ASCII, then you can just treat them all as UTF-8, because 7-bit ASCII is a valid subset of UTF-8.
A palliative solution that I found for Windows 10, while I can't find a proper way to do this in pure Octave:
[~, output] = system(['file --mime-encoding "', fileAddress, '"']);
% extract the encoding name reported by `file` (the last field of its output)
encoding = strsplit(output)(columns(strsplit(output, ' '))){1};
if strcmp('utf-8', encoding)
  sheet(1, 1) = {strcat('', unicode2native(myText, 'ISO-8859-1'))};
else
  sheet(1, 1) = {myText};
endif

Weird Normalization on .NET

I am trying to normalize a string (using .NET Standard 2.0) using Form D, and it works perfectly when running on a Windows machine.
[TestMethod]
public void TestChars()
{
    var original = "é";
    var normalized = original.Normalize(NormalizationForm.FormD);
    var originalBytesCsv = string.Join(',', Encoding.Unicode.GetBytes(original));
    Assert.AreEqual("233,0", originalBytesCsv);
    var normalizedBytesCsv = string.Join(',', Encoding.Unicode.GetBytes(normalized));
    Assert.AreEqual("101,0,1,3", normalizedBytesCsv);
}
When I run this on Linux, it returns "253,255" for both strings, before and after normalization. Those two bytes form the 16-bit value 65533 (U+FFFD), the Unicode replacement character, which is used when something goes wrong with encoding. That's the part where I am lost.
What am I missing here? Can someone point me in the right direction?
It might be related to the encoding of the source file. I'm not sure which encodings .NET on Linux supports, but to be on the safe side, you should use plain ASCII source files and Unicode escapes for non-ASCII characters:
var original = "\u00e9";
There is no text but encoded text.
When communicating text to person or program, both the bytes and the character encoding are essential.
The C# compiler (like all programs that process text, except in special cases like JSON) must know which character encoding the input files use. You must inform it accurately. The default is UTF-8 and that is a fine choice, especially for C# files, which are, lexically, sequences of Unicode codepoints.
If you used your editor or IDE or file transfer without full mindfulness of these requirements, you might have used an unintended character encoding.
For example, "é" when saved as Windows-1252 (0xE9) but read as UTF-8 (leading code unit that should be followed by two continuation code units), would give � to indicate this mishandling to the readers.
To be on the safe side, use UTF-8 everywhere but do it mindfully.

What is the protocol / relationship between encodings and programming languages?

As a test I created a file called Hello.java and the contents are as follows:
public class Hello {
    public static void main(String[] args) {
        System.out.println("Hello world!");
    }
}
I saved this file with UTF-8 encoding.
Anyway, compiling and running the program was no problem. This file was 103 bytes long.
I then saved the file with UTF-16 BE encoding. This time the file was 206 bytes long, since well UTF-16 (usually) needs more space, so no surprise here.
I tried compiling the file from my terminal and got all these errors:
Hello.java:4: error: illegal character: '\u0000'
}
^
So does javac work only with UTF-8 encoded source files? Is that like a standard?
javac -version
javac 1.8.0_45
Also, I only know Java, but let's say you are running Python code or any interpreted programming language. (Sorry if I am mistaken in thinking Python is interpreted, if it is not.) Would the encoding be a problem? If not, would it have any effect on performance?
OK, so the word "true" is a reserved keyword (for a given programming language), but in what encoding is it reserved? ASCII/UTF-8 only?
How "true" is stored in the hard drive or in memory depends on the encoding the file is saved in, so must a programming language expect always to work with a particular encoding for source files?
Regarding javac, you can set the encoding with the -encoding parameter. Internally, Java handles strings in UTF-16, so the compiler will convert everything to that.
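For example, assuming the UTF-16 BE source file from the question ("UTF-16BE" is a standard Java charset name), you could tell the compiler the encoding explicitly:

javac -encoding UTF-16BE Hello.java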
The compiler must know the encoding so it can process the source code. It doesn't matter what compiler, interpreter or language it is, just as people can't take text in a random language and assume it's German.
Keywords aren't reserved in any specific encoding. They are keywords: you can't have two ways of writing a single word, no matter what encoding you use. The words are the same.
The programming language doesn't care about the encoding; the compiler/interpreter does.

(Tcl) What character encoding should I use?

So I'm trying to open and parse some old Visual Studio compilation log files with Tcl; my only problem is the files are in a strange encoding. Upon examining them with Notepad++ it seems they are in the 'UCS-2 Little Endian' encoding. Two questions:
Is there any command in Tcl that allows me to look at the character encoding of a file? I know there is the encoding system command, which tells me the system encoding.
Using encoding names, Tcl tells me the available encoding names are the following:
cp860 cp861 cp862 cp863 tis-620 cp864 cp865 cp866 gb12345 gb2312-raw cp949 cp950 cp869 dingbats ksc5601 macCentEuro cp874 macUkraine jis0201 gb2312 euc-cn euc-jp macThai iso8859-10 jis0208 iso2022-jp macIceland iso2022 iso8859-13 jis0212 iso8859-14 iso8859-15 cp737 iso8859-16 big5 euc-kr macRomania macTurkish gb1988 iso2022-kr macGreek ascii cp437 macRoman iso8859-1 iso8859-2 iso8859-3 macCroatian koi8-r iso8859-4 ebcdic iso8859-5 cp1250 macCyrillic iso8859-6 cp1251 macDingbats koi8-u iso8859-7 cp1252 iso8859-8 cp1253 iso8859-9 cp1254 cp1255 cp850 cp1256 cp932 identity cp1257 cp852 macJapan cp1258 shiftjis utf-8 cp855 cp936 symbol cp775 unicode cp857
Given this, what would be the appropriate name to use in the fconfigure -encoding command to read these UCS-2 Little Endian encoded files and convert them to UTF-8 for use? If I understand the fconfigure command correctly, I need to specify the encoding of the source file rather than the one I want it to become; I just don't know which of the options in the above list corresponds to UCS-2 Little Endian. After reading a little, I see that UCS-2 is a predecessor of the UTF-16 character encoding, but that option isn't here either.
Thanks!
I'm afraid that currently there's no way to do it just by using fconfigure -encoding ?something?: the unicode encoding has a rather vague meaning, and there's a feature request to create explicit support for UTF-16 variants.
What could you do about it?
Since unicode in Tcl running on Windows should mean UTF-16 with native endianness [1] (little-endian on Wintel), if your solution is supposed to be a quick and dirty one, just try using -encoding unicode and see if that helps.
If you're aiming at a more bullet-proof, future-proof or cross-platform solution, I'd switch the channel to binary mode, read the contents in chunks of two bytes at a time, and then use
binary scan $twoBytes s n
to scan the sequence of two bytes in $twoBytes as a 16-bit little-endian integer into a variable named "n", followed by something like
set c [format %c $n]
to produce a Unicode character out of the number in $n and assign it to a variable.
This approach admittedly requires a bit more trickery to get right:
You might check the very first character obtained from the stream to see if it's a byte-order-mark, and drop it if it is.
If you need to process the stream in a line-wise manner, you'd have to implement a little state machine that handles the CR+LF sequences correctly.
When doing your read $channelId 2 to get the next character, you should check that it returned not just 0 or 2 bytes, but possibly also 1 (in case the file happens to be truncated or corrupted), and handle this.
The UCS-2 encoding differs from UTF-16 in that the latter might contain so-called surrogate pairs, and hence it is not a fixed-length encoding. Handling a UTF-16 stream properly therefore also implies detecting those surrogate pairs. On the other hand, I hardly believe a compilation log produced by MSVS would contain them, so I'd just assume it's encoded in UCS-2LE.
[1] The true story is that the only thing Tcl guarantees about the textual strings it handles (that is, those obtained by manipulating text, not via binary format or encoding convertto or reading a stream in binary mode) is that they're Unicode (or, rather, the "BMP" part of it).
But technically, the interpreter might switch the internal representation of any string between the UTF-8 encoding it uses by default and some fixed-length encoding, which is what is referred to by the name "unicode". The "problem" is that no part of the Tcl documentation specifies that internal fixed-length encoding, because you're required to explicitly convert any text you output or read to/from some specific encoding (either via configuring the stream, or using encoding convertfrom and encoding convertto, or using binary format and binary scan), and the interpreter will do the right thing no matter which precise encoding it's currently using for your source string value; it's all transparent. Moreover, the next release of the "standard" Tcl interpreter might decide to drop this internal feature completely, or, say, use 32-bit or 64-bit integers for that internal fixed-length encoding. Whatever "non-standard" interpreters (like Jacl etc.) do is also up to them. In other words, this feature is internal and is not part of the documented contract about the interpreter's behaviour. And by the way, the "standard" encoding for Tcl strings (UTF-8) is not specified as such either; it's just an implementation detail.
In Tcl v8.6.8 I could solve the same issue with fconfigure channelId -encoding unicode.