How to detect file encoding in Octave? - encoding

I am working with many XML files and some of them are UTF-8 while most are ANSI.
In the UTF-8 files, the XML header states:
<?xml version="1.0" encoding="ISO8859-1" ?>
However, that information is wrong.
The problem this causes is that I use unicode2native to generate correct XLS files, and it produces bad output when the file is actually UTF-8 encoded.
How can I detect which is the real encoding of each file programmatically?
Manually identifying them with the help of a text editor is not a feasible option, as there are hundreds of files, and my solution must also work with further files to which I don't have access.

There's no easy way to do this generally: because a given file might be a valid sequence in multiple encodings, detecting the character encoding requires using heuristics that are aware of natural language features, such as character frequencies, common words, and so on.
Octave doesn't have direct support for this. So you'll need to use an external program or library. Options include ICU4C, compact_enc_det, chardet, juniversalchardet, and others. chardet would probably be the easiest for you to use, since you can just install it and call it as an external command, instead of building a custom program or oct-file using a library. Or juniversalchardet, since if you have a Java-enabled Octave build, it's easy to pull in and use Java libraries from Octave code.
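For example, if you go the chardet route, a rough, hypothetical sketch of shelling out to its chardetect command from Octave could look like this; it assumes chardet is installed, chardetect is on the PATH, and that it prints output in the form shown in the comment:
% Hypothetical sketch: ask the chardetect command-line tool for its guess.
[status, output] = system(['chardetect "', fileAddress, '"']);
if status == 0
  % chardetect prints e.g.: myfile.xml: utf-8 with confidence 0.99
  % (this simple split assumes the file name itself contains no spaces)
  tokens = strsplit(strtrim(output), ' ');
  detected = tokens{2};
else
  error('chardetect failed: %s', output);
endif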
If it's really true that your input files are all either ANSI (Windows 1252/ISO 8859-1) or UTF-8, and no other encodings, you might be able to get away with just checking each file's contents to see if it's a valid UTF-8 string, and assume that any that are not valid UTF-8 are ANSI. Only certain byte sequences are valid UTF-8 encodings, so there's a good chance that the ANSI-encoded files are not valid UTF-8. I think you can check whether a file is valid UTF-8 in pure Octave by doing utf8_bytes = unicode2native(file_contents, 'UTF-8') on it, and seeing if the utf8_bytes output is identical to just casting file_contents directly to uint8. If that doesn't work, you can fall back to using Java's character encoding support (and that you can do with Java Standard Library stuff on any Java-enabled Octave build, without having to load an external JAR file).
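A sketch of that pure-Octave check (untested; the function name is mine, and it relies on unicode2native either raising an error or altering the bytes when the input is not valid UTF-8) might look like:
function tf = looks_like_utf8(fname)
  % Read the raw bytes of the file.
  fid = fopen(fname, 'rb');
  raw = fread(fid, Inf, 'uint8=>uint8')';
  fclose(fid);
  file_contents = char(raw);
  try
    % For valid UTF-8 input the round trip should reproduce the bytes unchanged.
    utf8_bytes = unicode2native(file_contents, 'UTF-8');
    tf = isequal(utf8_bytes, uint8(file_contents));
  catch
    % Conversion failed, so treat the file as not UTF-8 (i.e. ANSI).
    tf = false;
  end_try_catch
endfunction
Files for which this returns false would then be treated as ANSI.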
And if all your input files are either UTF-8 or strictly 7-bit ASCII, then you can just treat them all as UTF-8, because 7-bit ASCII is a valid subset of UTF-8.
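Checking for strictly 7-bit ASCII content is easy in pure Octave, by the way, since every byte of such a file is below 128:
% file_contents read as above; true if the file is pure 7-bit ASCII
is_ascii = all(uint8(file_contents) < 128);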

A palliative solution that I found for Windows 10, to use while I can't find a proper way to do this in pure Octave:
% Ask the "file" utility for the file's MIME encoding (needs "file" available on the system).
[~, output] = system(['file --mime-encoding "', fileAddress, '"']);
% "file" prints "<name>: <encoding>", so the encoding is the last whitespace-separated token.
tokens = strsplit(strtrim(output), ' ');
encoding = tokens{end};
if strcmp('utf-8', encoding)
  % The text really is UTF-8: re-encode it as ISO-8859-1 bytes before writing it to the sheet
  % (strcat('', ...) coerces the byte vector back into a char row vector).
  sheet(1, 1) = {strcat('', unicode2native(myText, 'ISO-8859-1'))};
else
  sheet(1, 1) = {myText};
endif

Related

(Tcl) what character encoding set should I use?

So I'm trying to open and parse some old Visual Studio compilation log files with Tcl; my only problem is the files are in a strange encoding. Upon examining them with Notepad++ it seems they are in the 'UCS-2 Little Endian' encoding. Two questions:
Is there any command in Tcl that allows me to look at the character encoding of a file? I know there is encoding system which tells me the system encoding.
Using encoding names Tcl tells me the available encoding names are the following list:
cp860 cp861 cp862 cp863 tis-620 cp864 cp865 cp866 gb12345 gb2312-raw cp949 cp950 cp869 dingbats ksc5601 macCentEuro cp874 macUkraine jis0201 gb2312 euc-cn euc-jp macThai iso8859-10 jis0208 iso2022-jp macIceland iso2022 iso8859-13 jis0212 iso8859-14 iso8859-15 cp737 iso8859-16 big5 euc-kr macRomania macTurkish gb1988 iso2022-kr macGreek ascii cp437 macRoman iso8859-1 iso8859-2 iso8859-3 macCroatian koi8-r iso8859-4 ebcdic iso8859-5 cp1250 macCyrillic iso8859-6 cp1251 macDingbats koi8-u iso8859-7 cp1252 iso8859-8 cp1253 iso8859-9 cp1254 cp1255 cp850 cp1256 cp932 identity cp1257 cp852 macJapan cp1258 shiftjis utf-8 cp855 cp936 symbol cp775 unicode cp857
Given this, what would be the appropriate name to use in the fconfigure -encoding command to read these UCS-2 Little Endian encoded files and convert them to UTF-8 for use? If I understand the fconfigure command correctly, I need to specify the encoding of the source file rather than what I want it to be; I just don't know which of the options in the above list corresponds to UCS-2 Little Endian. After reading a little bit, I see that UCS-2 is a predecessor of the UTF-16 character encoding, but that option isn't in the list either.
Thanks!
I'm afraid that currently there's no way to do it just by using fconfigure -encoding ?something?: the unicode encoding has a rather moot meaning, and there's a feature request to create explicit support for UTF-16 variants.
So what could you do about it?
Since unicode in Tcl running on Windows should mean UTF-16 with native endianness¹ (little-endian on Wintel), if your solution is supposed to be a quick and dirty one, just try using -encoding unicode and see if that helps.
If you're aiming at a more bullet-proof or future-proof cross-platform solution, I'd switch the channel to binary mode, read the contents in chunks of two bytes at a time, and then use
binary scan $twoBytes s n
to scan the sequence of two bytes in $twoBytes as a 16-bit little-endian integer into a variable named "n", followed by something like
set c [format %c $n]
to produce a unicode character out of the number in $n, and assign it to a variable.
This way supposedly requires a bit more trickery to get right:
You might check the very first character obtained from the stream to see if it's a byte-order mark, and drop it if it is.
If you need to process the stream in a line-wise manner, you'd have to implement a little state machine that handles the CR+LF sequences correctly.
When doing your read $channelId 2 to get the next character, you should check whether it returned 0, 1 or 2 bytes; getting just a single byte means the file is truncated or corrupted, and you should handle that case.
The UCS-2 encoding differs from UTF-16 in that the latter might contain so-called surrogate pairs, and hence it is not a fixed-length encoding. Handling a UTF-16 stream properly therefore also implies detecting those surrogate pairs. On the other hand, I hardly believe a compilation log produced by MSVS would contain them, so I'd just assume it's encoded in UCS-2LE.
¹ The true story is that the only thing Tcl guarantees about textual strings it handles (that is, those obtained by manipulating text, not via binary format or encoding convertto or reading a stream in binary mode) is that they're Unicode (or, rather, the "BMP" part of it).
But technically, the interpreter might switch the internal representation of any string between the UTF-8 encoding it uses by default and some fixed-length encoding, which is what the name "unicode" refers to. The "problem" is that no part of the Tcl documentation specifies that internal fixed-length encoding, because you're required to explicitly convert any text you output or read to/from some specific encoding — either via configuring the stream or using encoding convertfrom and encoding convertto or using binary format and binary scan, and the interpreter will do the right thing no matter which precise encoding it's currently using for your source string value — it's all transparent. Moreover, the next release of the "standard" Tcl interpreter might decide to drop this internal feature completely, or, say, use 32-bit or 64-bit integers for that internal fixed-length encoding. Whatever "non-standard" interpreters (like Jacl etc.) do is also up to them. In other words, this feature is internal and is not a part of the documented contract about the interpreter's behaviour. And by the way, the "standard" encoding for Tcl strings (UTF-8) is not specified as such either — it's just an implementation detail.
In Tcl v8.6.8 I could solve the same issue with fconfigure channelId -encoding unicode.

UTF-8 source files are not supported in avisynth

I use avisynth to demux video from audio.
When I use
x = "m.mkv"
ffvideosource(x)
It works correctly, but when I change my video filename to a UTF-8 one and my script to:
x = "م.mkv"
ffvideosource(x)
I got the following error:
failed to open for hashing avisynth
I found a link (UTF-8 source files are not supported) which says UTF-8 file names do not work in AviSynth; to correct the problem, it said:
specify the parameter utf8=true when calling ffvideosource, save the script as UTF-8 without BOM and then see if that works.
But I couldn't solve the problem. When I open the script in Notepad and save it in UTF-8 format, I get the following error:
UTF-8 Source files are not supported, re-save script with ANSI encoding
How can I solve the problem and run my script with a UTF-8 filename?
“Without BOM” is important. You need to save the file as raw UTF-8 without the Microsoft-style faux-BOM. Notepad can't do this; it always saves UTF-8 files with that generally undesirable 0xEF 0xBB 0xBF header. Most other text editors (e.g. Notepad++) can do it properly.
AviSynth isn't really Unicode-aware, so it doesn't want you using UTF-8 and will give that error message to try to stop you making mistakes. ffvideosource's workaround of hiding UTF-8 bytes in what AviSynth sees as ‘ANSI’ characters only works as long as AviSynth sees the file as ANSI. AviSynth doesn't have very sophisticated encoding-guessing, so removing the faux-BOM is enough to convince it that it is dealing with ANSI.
Very common problem when using UTF-8 in AviSynth.
Follow these steps:
Check the plugins folder. There should be these three files: ffms2.dll, ffmsindex.exe, and FFMS2.avsi. If you did not have a problem with ANSI, I guess that you don't have FFMS2.avsi in your plugins folder; in that situation, download the latest version from here.
After that, make an AVS file with Notepad++. For example, I do this:
x = "C:/Users/Nemat/Desktop/StackOverFlow/نعمت.mkv"
ffmpegsource2(x,utf8=true)
Please note that here I used ffmpegsource2().
In the Encoding menu of Notepad++, select Encode in UTF-8 without BOM.
Save your file.
Check the video file exists in the addressed directory.
Double click on your AVS file.
Enjoy it!

ASCII / UTF8 set random?

I have tried a program called UTFCast Professional. It checks the file encoding.
When I write code I use Sublime Text.
Random encoding
What I get is that some files are UTF8 and some files are ASCII/UTF8. It appears to be random. All of them are set to "BOM: No".
Why are some files UTF8 and some ASCII/UTF8?
Is it possible that in some cases it does not know if it's ASCII or UTF8?
Should I be worried about future encoding problems? I have not had any so far.
(I prefer UTF8)
A plain text file does not store what encoding it's in anywhere. Any program that purportedly tells you what encoding a file is in is by definition only giving you its best guess based on the content of the file. Now, since a file which contains only characters present in ASCII and is saved as UTF-8 is indistinguishable from a pure ASCII file, either answer is valid. Even Latin-1 and a large number of other answers would be valid.
So the reason that program seemingly randomly outputs one or the other is that its detection algorithm triggers on some characteristic of the file content. Only the program author can tell you exactly why. Your files are encoded as UTF-8 without a BOM; what any application tells you it thinks the encoding is, is entirely up to that application.
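As a quick illustration (in Octave here, purely for demonstration; the same holds in any language), an ASCII-only string produces exactly the same bytes whether you treat it as ASCII, Latin-1 or UTF-8, so no detector can tell those apart:
s = 'Hello, world';                                   % ASCII-only text
isequal(uint8(s), unicode2native(s, 'UTF-8'))         % 1: same bytes
isequal(uint8(s), unicode2native(s, 'ISO-8859-1'))    % 1: same bytes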

write utf8 with perl bug?

My problem is simple. I want to output UTF-8 with my Perl script.
This code is not working.
use utf8;
open(TROIS,">utf8.out.2.txt");
binmode(TROIS, ":utf8");
print TROIS "Hello\n";
The output file is not in UTF-8. (My script file is itself encoded in UTF-8.)
But if I insert an accented character in my print, then it works and my output file is in UTF-8. Example:
print TROIS "é\n";
I use ActivePerl 5.10 under Windows. What might be the problem?
You're writing nothing but ASCII characters with Hello\n. Fortunately ASCII is still perfectly valid UTF-8. However, auto detection by editors will most likely not show UTF-8 as the encoding because they don't have anything to judge the file content's encoding by. I guess you simply don't know how file encodings work.
A file's encoding is a property that in general is not stored in a file or externally alongside a file. A lot of editors simply assume a certain encoding based on the operating system they run on or the environment settings (system language), or they include some kind of semi-intelligent auto-detection (which may still fail because file encodings cannot be auto-detected unambiguously). That's why you have to tell Perl that a file is encoded in UTF-8 when you read it with binmode or the corresponding I/O layer.
Now, there is one way of marking a text file's encoding if said encoding is one of the UTF family (UTF-8, UTF-16 LE and BE, UTF-32 LE and BE). That way is called the BOM (byte order mark). However, producing files with a BOM comes from a time when UTF-8 had not yet spread as widely as it has today. It usually poses more and different problems than it solves, especially because many editors and applications do not support BOMs at all. Therefore BOMs should probably be avoided nowadays.
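If you do need to check for a BOM, that at least is easy. A small illustrative sketch (written here in Octave; the byte patterns are the standard BOM sequences) could look like this:
% Illustration only: sniff the leading bytes of a file for a BOM.
fid = fopen(fname, 'rb');
head = fread(fid, 4, 'uint8=>uint8')';
fclose(fid);
if numel(head) >= 3 && isequal(head(1:3), uint8([239 187 191]))
  bom = 'UTF-8';       % EF BB BF
elseif numel(head) >= 2 && isequal(head(1:2), uint8([255 254]))
  bom = 'UTF-16LE';    % FF FE (also the start of a UTF-32LE BOM)
elseif numel(head) >= 2 && isequal(head(1:2), uint8([254 255]))
  bom = 'UTF-16BE';    % FE FF
else
  bom = 'none';
endif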
There are exceptions, of course, in which the file format contains certain instructions that tell the file's encoding. XML comes to mind with its DOCTYPE declaration. However, even for such files you will have to recognize if a file is encoded in a multi-byte encoding that always uses at least two bytes per character (UTF-16/UTF-32) or not in order to parse the DOCTYPE declaration in the first place. It's simply not simple ;)

Is there a standard encoding for NEEDED entries in ELF?

I'm trying to make some of my code a bit more friendly to non-pure-ASCII systems, and I was wondering: is there a particular character encoding used for NEEDED entries in ELF binaries, or is it rather unstandardized and based on the creating system's filesystem encoding (or even just directly the bytes that were passed to whatever created the binary)? If so, is there any place in the binary that specifies the encoding (assuming the current system's encoding wouldn't work very well for my usage)? Or are non-ASCII names pretty much banned, or is it something else?
The ELF format specifies NEEDED fields as "null-terminated strings" and does not say more about the encoding, which pretty much implies 8-bit ASCII strings.
I personally don't see any point in complicating an executable file format specification in a way that provides no additional value for the final product or the development process: the user won't see the library names, so they wouldn't care about localization thereof. You may try to use UTF-8, but the actual file system encoding is not guaranteed to be UTF-8. To be sure, you need to know how your target linker handles those strings.
As far as I know, the standard Unix way of dealing with non-ASCII characters is to encode them as UTF-8.