Encoding conversion tool

I need a file encoding conversion tool to convert some of my source files. I need to do this as a batch, so the program must detect each source file's encoding (Unicode, codepage 1200) and save it in the proper encoding (UTF-8), because the project files are saved in different encodings.
Can someone suggest a good free tool?

iconv
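For example, a batch conversion from UTF-16 (codepage 1200 is UTF-16LE) to UTF-8 could look like this; a sketch assuming GNU iconv and a POSIX shell, with a hypothetical *.txt glob you should adjust to match your source files:
# Hypothetical batch loop; replace *.txt with your actual source file pattern.
for f in *.txt; do
    # Convert via a temporary file; redirecting to the input file would truncate it.
    iconv -f UTF-16LE -t UTF-8 "$f" > "$f.tmp" && mv "$f.tmp" "$f"
done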

Related

How to detect file encoding in Octave?

I am working with many XML files and some of them are UTF-8 while most are ANSI.
In the UTF-8 files, the XML header states:
<?xml version="1.0" encoding="ISO8859-1" ?>
However, that information is wrong.
The problem this creates is that I use unicode2native to generate correct XLS files, and it produces bad output when the file is actually UTF-8 encoded.
How can I detect which is the real encoding of each file programmatically?
Manually locating them with the help of a text editor is not a feasible option, as there are hundreds of files, and my solution must also work with further files that I don't yet have access to.
There's no easy way to do this generally: because a given file might be a valid sequence in multiple encodings, detecting the character encoding requires using heuristics that are aware of natural language features, such as character frequencies, common words, and so on.
Octave doesn't have direct support for this, so you'll need to use an external program or library. Options include ICU4C, compact_enc_det, chardet, juniversalchardet, and others. chardet would probably be the easiest for you to use, since you can just install it and call it as an external command, instead of building a custom program or oct-file using a library. Or juniversalchardet: if you have a Java-enabled Octave build, it's easy to pull in and use Java libraries from Octave code.
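For illustration, calling chardet's command-line tool from Octave might look like this (a sketch; it assumes the Python chardet package is installed, which provides a chardetect command on the PATH, and that the file name contains no spaces):
% Let the chardetect CLI guess the encoding.
[~, out] = system(['chardetect "', fileAddress, '"']);
% Typical output: "file.xml: utf-8 with confidence 0.99"
parts = strsplit(strtrim(out));
encodingGuess = parts{2};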
If it's really true that your input files are all either ANSI (Windows 1252/ISO 8859-1) or UTF-8, and no other encodings, you might be able to get away with just checking each file's contents to see if it's a valid UTF-8 string, and assume that any that are not valid UTF-8 are ANSI. Only certain byte sequences are valid UTF-8 encodings, so there's a good chance that the ANSI-encoded files are not valid UTF-8. I think you can check whether a file is valid UTF-8 in pure Octave by doing utf8_bytes = unicode2native(file_contents, 'UTF-8') on it, and seeing if the utf8_bytes output is identical to just casting file_contents directly to uint8. If that doesn't work, you can fall back to using Java's character encoding support (which you can do with Java standard library classes on any Java-enabled Octave build, without having to load an external JAR file).
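A minimal sketch of that idea, phrased as a decode/re-encode round trip (my variation; untested, and it assumes the whole file fits in memory):
% Read the raw bytes, decode them as UTF-8, re-encode, and compare.
% Byte sequences that are not valid UTF-8 will not survive the round trip unchanged.
fid = fopen(fileAddress, 'rb');
raw = fread(fid, Inf, 'uint8=>uint8')';
fclose(fid);
roundtrip = unicode2native(native2unicode(raw, 'UTF-8'), 'UTF-8');
is_valid_utf8 = isequal(roundtrip, raw);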
And if all your input files are either UTF-8 or strictly 7-bit ASCII, then you can just treat them all as UTF-8, because 7-bit ASCII is a valid subset of UTF-8.
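Checking that 7-bit case is a one-liner, reusing the raw bytes from the sketch above:
is_ascii = all(raw < 128);  % every byte below 0x80 means plain ASCII, hence valid UTF-8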
A palliative solution that I found for Windows 10, since I can't find a proper way to do this in pure Octave:
[~, output] = system(['file --mime-encoding "', fileAddress, '"']);
% The output looks like "filename: encoding"; take the last whitespace-separated token.
parts = strsplit(strtrim(output));
encoding = parts{end};
if strcmp('utf-8', encoding)
  % Palliative step: re-encode the text as ISO-8859-1 bytes before storing it.
  sheet(1, 1) = {char(unicode2native(myText, 'ISO-8859-1'))};
else
  sheet(1, 1) = {myText};
endif

Specifying file encoding while reading a file with sys.io.File.read in Haxe

I know how to read a file with Haxe by using sys.io.File.read (compare Reading lines from a file in Haxe; I also know that the sys module is not available for every target). However, how can I tell sys.io.File.read that my text file is encoded with a certain encoding (e.g. UTF-16, UTF-8, ISO-8859-1, ...)?
There is no way to do this at the File level, but you can encode / decode the String after reading the file. For instance, Utf8.encode() will convert an ISO-8859-1 string to a UTF-8 string:
var isoString = sys.io.File.getContent("iso_file.txt");
var utf8String = haxe.Utf8.encode(isoString);
sys.io.File.saveContent("utf8_file.txt", utf8String);
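The reverse direction works the same way with Utf8.decode(), which converts a UTF-8 string back to ISO-8859-1 (a sketch with the same hypothetical file names):
// Decode a UTF-8 file back to an ISO-8859-1 string and save it.
var utf8String = sys.io.File.getContent("utf8_file.txt");
var isoString = haxe.Utf8.decode(utf8String);
sys.io.File.saveContent("iso_file_again.txt", isoString);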
The standard library currently doesn't support UTF-16, but it's coming in Haxe 4. In the meantime, you can use libraries such as unifill for that.
By the way, if you don't need to read a file line by line, File.getContent() is much more convenient than the File.read() approach you linked.

UTF-8 source files are not supported in AviSynth

I use AviSynth to demux video from audio.
When I use
x = "m.mkv"
ffvideosource(x)
It works correctly, but when I change my video filename to a UTF-8 one and my script to:
x = "م.mkv"
ffvideosource(x)
I got the following error:
failed to open for hashing avisynth
I found a link (UTF-8 source files are not supported) which says UTF-8 file names don't work in AviSynth; to correct the problem, it said:
specify the parameter utf8=true when calling ffvideosource, save the script as UTF-8 without BOM and then see if that works.
But I couldn't solve the problem. When I open the script in Notepad and save it in UTF-8 format, I get the following error:
UTF-8 Source files are not supported, re-save script with ANSI encoding
How can I solve the problem? How can I run my script with a UTF-8 filename?
“Without BOM” is important. You need to save the file as raw UTF-8 without the Microsoft-style faux-BOM. Notepad can't do this: it always saves UTF-8 files with that generally-undesirable 0xEF 0xBB 0xBF header. Most other text editors (e.g. Notepad++) can do it properly.
AviSynth isn't really Unicode-aware, so it doesn't want you using UTF-8 and gives that error message to try to stop you making mistakes. ffvideosource's workaround of hiding UTF-8 bytes in what AviSynth sees as ‘ANSI’ characters only works as long as AviSynth sees the file as ANSI. AviSynth doesn't have very sophisticated encoding-guessing, so removing the faux-BOM is enough to convince it that it's dealing with ANSI.
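Putting both parts together, the script from the question becomes the following, with the .avs file itself saved as UTF-8 without BOM:
x = "م.mkv"
ffvideosource(x, utf8=true)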
Very common problem when using UTF-8 in AviSynth.
Follow these steps:
Check the plugins folder. These three files should exist: ffms2.dll, ffmsindex.exe, and FFMS2.avsi. If you did not have a problem with ANSI, I guess that you don't have FFMS2.avsi in your plugins folder; in that case, download the latest version from here.
After that, make an AVS file with Notepad++. For example, I do this:
x = "C:/Users/Nemat/Desktop/StackOverFlow/نعمت.mkv"
ffmpegsource2(x,utf8=true)
Please note that here I used ffmpegsource2().
In the Encoding menu of Notepad++, select Encode in UTF-8 without BOM.
Save your file.
Check that the video file exists in the addressed directory.
Double click on your AVS file.
Enjoy it!

Why does the ANSI view differ between Notepad and Notepad++?

I am writing some data to an XML file with ISO-8859 encoding. If I open the file in Notepad++, I can see the 'Â' character that is present in the file. But if I open the file in Notepad, the 'Â' character is gone. I am very new to encodings and don't know why. Please suggest a reason for this.
The file also opens in a browser with the 'Â' character visible.
Thanks in advance.
Windows Notepad is a very basic editor and has quite a number of limitations, one of which is its limited support for encoding formats other than ANSI, Unicode and UTF-8. When editing files in other formats, it can give unreliable/unexpected results.
If you are handling files in different encoding formats, you are better off avoiding notepad altogether and using an editor (such as Notepad++) which has better support for multiple encoding formats.
For more information on how Windows Notepad "guesses" at the correct format to use (with varying levels of success), see here.
Bear in mind that other editors often use similar techniques to "guess" the format of a file, so it is often a good idea to check/set the encoding for a file manually (where possible) for less common encoding formats to ensure you get the correct results every time.

How to convert windows 1252 source files to utf-8 easily (in windows)

I have an IntelliJ/Maven/Java project I am taking over where the previous developer wrote everything to be encoded in windows-1252. Does somebody know of a simple way to convert a set of files (i.e. all source, properties, XML, and text files) from windows-1252 to utf-8?
Maybe this is what you are searching for:
http://sourceforge.net/projects/cp-converter/
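If you'd rather script the conversion than use a separate tool, a few lines of Java can do the same job. A minimal sketch, assuming Java 11+, files small enough to read whole, a hypothetical src root directory, and that every file under it really is windows-1252:
import java.io.IOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.*;
import java.util.List;
import java.util.stream.Collectors;

public class Cp1252ToUtf8 {
    public static void main(String[] args) throws IOException {
        // Collect the files first so we are not rewriting while still walking the tree.
        List<Path> files;
        try (var stream = Files.walk(Paths.get("src"))) {
            files = stream.filter(Files::isRegularFile).collect(Collectors.toList());
        }
        for (Path p : files) {
            // Decode each file as windows-1252, then write it back as UTF-8.
            String text = Files.readString(p, Charset.forName("windows-1252"));
            Files.writeString(p, text, StandardCharsets.UTF_8);
        }
    }
}
Run it once from the project root (after a backup or a clean VCS checkout), then set the project encoding in IntelliJ and Maven to UTF-8.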