UTF-8 source files are not supported in avisynth - unicode

I use AviSynth to demux video from audio.
When I use
x = "m.mkv"
ffvideosource(x)
It works correctly, but when I change my video filename to a UTF-8 one and my script to:
x = "م.mkv"
ffvideosource(x)
I get the following error:
failed to open for hashing avisynth
I found a link (UTF-8 source files are not supported) which says that UTF-8 filenames do not work in AviSynth and that, to correct the problem, you should:
specify the parameter utf8=true when calling ffvideosource, save the script as UTF-8 without BOM and then see if that works.
But I couldn't solve the problem. When I open the script in Notepad and save it in UTF-8 format, I get the following error:
UTF-8 Source files are not supported, re-save script with ANSI encoding
How can I solve the problem? How can I run my script with a UTF-8 filename?

“Without BOM” is important. You need to save the file as raw UTF-8 without the Microsoft-style faux-BOM. Notepad can't do this: it always saves UTF-8 files with that generally undesirable 0xEF 0xBB 0xBF header. Most other text editors (e.g. Notepad++) can do it properly.
AviSynth isn't really Unicode-aware, so it doesn't want you using UTF-8 and gives that error message to try to stop you making mistakes. ffvideosource's workaround of hiding UTF-8 bytes in what AviSynth sees as ‘ANSI’ characters only works as long as AviSynth sees the file as ANSI. AviSynth doesn't have very sophisticated encoding-guessing, so removing the faux-BOM is enough to convince it that it is dealing with ANSI.
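If Notepad has already saved a script with the header, the three bytes can also be removed programmatically. Here is a minimal Python sketch, assuming the script is small enough to read into memory; the function name strip_utf8_bom is made up for this illustration and is not part of AviSynth or FFMS2:

```python
import codecs
from pathlib import Path


def strip_utf8_bom(path):
    """Remove a leading UTF-8 BOM (EF BB BF) from the file, if present.

    Returns True if a BOM was found and removed, False otherwise.
    (Hypothetical helper, for illustration only.)
    """
    p = Path(path)
    data = p.read_bytes()
    if data.startswith(codecs.BOM_UTF8):
        # Rewrite the file with everything after the three BOM bytes.
        p.write_bytes(data[len(codecs.BOM_UTF8):])
        return True
    return False
```

Calling strip_utf8_bom on a script path leaves a file without the header untouched, so it should be safe to run more than once.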

This is a very common problem when using UTF-8 in AviSynth.
Follow these steps:
Check the plugins folder. These three files should be there: ffms2.dll, ffmsindex.exe, and FFMS2.avsi. If you had no problem with ANSI filenames, I guess that FFMS2.avsi is missing from your plugins folder; in that case, download the latest version from here.
After that, create an AVS file with Notepad++. For example, I do this:
x = "C:/Users/Nemat/Desktop/StackOverFlow/نعمت.mkv"
ffmpegsource2(x,utf8=true)
Please note that here I used ffmpegsource2().
In the Encoding menu of Notepad++, select Encode in UTF-8 without BOM.
Save your file.
Check that the video file exists in the referenced directory.
Double-click your AVS file.
Enjoy it!
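If you would rather not depend on an editor's encoding menu, the script file from the steps above can also be generated by a small program. This is only a sketch in Python: write_avs_script is a hypothetical helper name, and it assumes the ffmpegsource2(x, utf8=true) call shown above is what you want in the script.

```python
from pathlib import Path


def write_avs_script(script_path, video_path):
    """Write a two-line AviSynth script as UTF-8 without a BOM.

    (Hypothetical helper, for illustration only.)
    """
    text = f'x = "{video_path}"\nffmpegsource2(x, utf8=true)\n'
    # Python's "utf-8" codec never emits a BOM; "utf-8-sig" is the one that does.
    Path(script_path).write_bytes(text.encode("utf-8"))
```

For example, write_avs_script("test.avs", "C:/Users/Nemat/Desktop/StackOverFlow/نعمت.mkv") would produce the same two lines as the Notepad++ steps.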

Related

How to detect file encoding in Octave?

I am working with many XML files and some of them are UTF-8 while most are ANSI.
In the UTF-8 files, the XML header states:
<?xml version="1.0" encoding="ISO8859-1" ?>
However that information is wrong.
The problem this generates is that I use unicode2native to generate correct XLS files, which generates bad output when the file is UTF-8 encoded.
How can I detect which is the real encoding of each file programmatically?
Locating them manually with a text editor is not a feasible option, as there are hundreds of files, and my solution must also work with files to which I don't have access.
There's no easy way to do this generally: because a given file might be a valid sequence in multiple encodings, detecting the character encoding requires using heuristics that are aware of natural language features, such as character frequencies, common words, and so on.
Octave doesn't have direct support for this. So you'll need to use an external program or library. Options include ICU4C, compact_enc_det, chardet, juniversalchardet, and others. chardet would probably be the easiest for you to use, since you can just install it and call it as an external command, instead of building a custom program or oct-file using a library. Or juniversalchardet, since if you have a Java-enabled Octave build, it's easy to pull in and use Java libraries from Octave code.
If it's really true that your input files are all either ANSI (Windows 1252/ISO 8859-1) or UTF-8, and no other encodings, you might be able to get away with just checking each file's contents to see if it's a valid UTF-8 string, and assume that any that are not valid UTF-8 are ANSI. Only certain byte sequences are valid UTF-8 encodings, so there's a good chance that the ANSI-encoded files are not valid UTF-8. I think you can check whether a file is valid UTF-8 in pure Octave by doing utf8_bytes = unicode2native(file_contents, 'UTF-8') on it, and seeing if the utf8_bytes output is identical to just casting file_contents directly to uint8. If that doesn't work, you can fall back to using Java's character encoding support (and that you can do with Java Standard Library stuff on any Java-enabled Octave build, without having to load an external JAR file).
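The "valid UTF-8, else ANSI" idea is easy to express outside Octave too. The following Python sketch is not the Octave check described above, just the same heuristic in another language; guess_encoding is a made-up name, and the fallback to cp1252 (Windows-1252) is an assumption that only holds if your files really are limited to those two encodings:

```python
def guess_encoding(data: bytes) -> str:
    """Return "utf-8" if the bytes decode cleanly as UTF-8, else "cp1252".

    Assumes the input is either UTF-8 or Windows-1252 and nothing else.
    (Hypothetical helper, for illustration only.)
    """
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        # Not a valid UTF-8 byte sequence, so fall back to the ANSI assumption.
        return "cp1252"
    return "utf-8"


def read_text_guessing(path: str) -> str:
    """Read a file and decode it with the guessed encoding."""
    with open(path, "rb") as fh:
        data = fh.read()
    return data.decode(guess_encoding(data))
```

Note that a pure 7-bit ASCII file is reported as "utf-8" here, which is harmless because it decodes to the same text either way.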
And if all your input files are either UTF-8 or strictly 7-bit ASCII, then you can just treat them all as UTF-8, because 7-bit ASCII is a valid subset of UTF-8.
A workaround that I found for Windows 10, until I can find a proper way to do this in pure Octave:
% Ask the external `file` utility for its encoding guess; -b prints only the result
[~, output] = system(['file -b --mime-encoding "', fileAddress, '"']);
encoding = strtrim(output);
if strcmp('utf-8', encoding)
  % UTF-8 content: convert the text to ISO-8859-1 bytes before writing the cell
  sheet(1, 1) = {strcat('', unicode2native(myText, 'ISO-8859-1'))};
else
  % Already ANSI: use the text unchanged
  sheet(1, 1) = {myText};
endif

Windows Converting a Folder of Files From RTF to UTF-8

I am trying to analyze a corpus of 620 Korean-language newspaper articles using the konlpy module in Python. The files are in RTF format. However, konlpy only supports files encoded in UTF-8. In Windows, how can I convert a folder containing 620 RTF articles to UTF-8 articles such that, upon opening the files in Notepad, the Korean characters are still intact?
Some things I have tried (but to no avail)
Used a freeware converter program (http://www.emreakkas.com/localization-tools/convert-rtf-to-txt) that converted the files into UNICODE and then tried to use a Cygwin iconv batch file to convert the files using the same script as this individual did:
cygwin syntax error near unexpected token `done'
When I do this, all of the files are there, but they are 0 KB and blank. (Let me know if you need more info about this method, as I needed to do another step to get this to even loop over my files.)
Used another freeware program (memory a little hazy on this one) that converted the RTF files, but all the characters were just scrambled Latin characters.
I'm thinking that there has to be an easy way to do this, but everything I tried is really complicated and does not work. Another funny thing is that whenever I simply manually take the original rtf file or the file converted into UNICODE and "Save As" and choose UTF-8, it works fine. I would love it if I did not have to "Save As" for 620 articles.
Thanks!
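Since "Save As" with UTF-8 works on the files the converter produced, the remaining step is a plain re-encoding that can be batched. The sketch below is in Python and rests on an assumption: that the converter's "UNICODE" output is UTF-16 text with a BOM, which is what Windows tools usually mean by that label. The function name convert_folder_to_utf8 and the *.txt pattern are made up for this illustration. It does not parse RTF itself.

```python
from pathlib import Path


def convert_folder_to_utf8(src_dir, dst_dir, source_encoding="utf-16"):
    """Re-encode every *.txt file in src_dir as UTF-8 (no BOM) into dst_dir.

    Assumes the inputs are plain text in source_encoding (UTF-16 by default).
    Returns the list of converted file names.
    (Hypothetical helper, for illustration only.)
    """
    src, dst = Path(src_dir), Path(dst_dir)
    dst.mkdir(parents=True, exist_ok=True)
    converted = []
    for txt in sorted(src.glob("*.txt")):
        # The "utf-16" codec uses the BOM to pick the byte order when reading.
        text = txt.read_text(encoding=source_encoding)
        (dst / txt.name).write_text(text, encoding="utf-8")
        converted.append(txt.name)
    return converted
```

Writing into a separate output folder keeps the originals untouched in case the encoding assumption turns out to be wrong.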

ASCII / UTF8 set random?

I have tried a program called UTFCast Professional. It checks the file encoding.
When I write code I use Sublime Text.
Random encoding
What I get is that some files are UTF8 and some files are ASCII/UTF8. It appears to be set randomly. All of them are set to "BOM: No".
Why are some files UTF8 and some ASCII/UTF8?
Is it possible that in some cases it does not know if it's ASCII or UTF8?
Should I be worried about future encoding problems? I have not had any so far.
(I prefer UTF8)
A plain text file does not in any way save what encoding it's in anywhere. Any program that purportedly tells you what encoding a file is in is by definition only giving you its best guess based on the content of the file. Now, since a file which contains only characters which are present in ASCII and is saved as UTF-8 is indistinguishable from a pure ASCII file, either answer is valid. Even Latin-1 and a large number of other answers would be valid.
So the answer why that program randomly outputs one or the other is because its detection algorithm triggers one or the other based on some characteristics of the file content. Only the program author can tell you exactly why. The file is encoded as UTF-8 without BOM. Whatever any application tells you it thinks it is is entirely up to that application.
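A short Python illustration of why no detector can tell these cases apart (the sample strings are arbitrary): text made only of ASCII characters produces exactly the same bytes in ASCII, UTF-8 and Latin-1, and the encodings only diverge once a non-ASCII character appears.

```python
# ASCII-only text: the encoded bytes are identical in all three encodings.
text = "plain ASCII only"
as_ascii = text.encode("ascii")
as_utf8 = text.encode("utf-8")
as_latin1 = text.encode("latin-1")
same_bytes = as_ascii == as_utf8 == as_latin1  # True

# One non-ASCII character is enough to make the encodings differ.
accented = "café"
utf8_bytes = accented.encode("utf-8")      # b'caf\xc3\xa9'
latin1_bytes = accented.encode("latin-1")  # b'caf\xe9'
differs = utf8_bytes != latin1_bytes       # True
```

So a file reported as "ASCII/UTF8" is most likely one that happens to contain no non-ASCII characters yet.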

What's ï»¿ sign at the beginning of my source file?

I have a PHP source file where ï»¿ characters automatically got added in! I don't know where they came from. I'm not getting any parse errors, but it results in weird behavior when the file executes. E.g. the header Location functionality sometimes does not work! I'm curious how these kinds of symbols get auto-generated. I'm using UTF-8 encoding, and the ï»¿ sign does not show in Notepad++ or Windows Notepad, only in the NetBeans IDE.
E.g. code:
<?php
echo "no errors!";
header("Location: http://stackoverflow.com");
exit;
?>
What is this? How can I prevent it?
You probably save the files as UTF-8 with BOM. You should save them as UTF-8 without BOM.
It's called a Byte Order Mark, and it doesn't always have to be "ï»¿". http://en.wikipedia.org/wiki/Byte_order_mark
Some Windows applications add a BOM by default. In Notepad++ you can use options in the Encoding menu such as Encode in UTF-8 without BOM or Convert to UTF-8 without BOM.
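To see where those three characters come from, and why the mark looks different for other encodings, here is a small Python sketch using the BOM constants from the standard codecs module. Decoding as Latin-1 is only a stand-in for "an editor that shows the raw bytes as single-byte characters", and has_utf8_bom is a made-up helper name:

```python
import codecs

# The UTF-8 BOM is the byte sequence EF BB BF ...
utf8_bom = codecs.BOM_UTF8                       # b'\xef\xbb\xbf'
# ... which shows up as three characters when read as a single-byte encoding.
utf8_bom_as_latin1 = utf8_bom.decode("latin-1")  # 'ï»¿'

# UTF-16 uses different BOM bytes, so the visible garbage differs too.
utf16_le_as_latin1 = codecs.BOM_UTF16_LE.decode("latin-1")  # 'ÿþ'
utf16_be_as_latin1 = codecs.BOM_UTF16_BE.decode("latin-1")  # 'þÿ'


def has_utf8_bom(data: bytes) -> bool:
    """Return True if the bytes start with the UTF-8 BOM (illustrative helper)."""
    return data.startswith(codecs.BOM_UTF8)
```

In a PHP file these bytes sit before the opening <?php tag, so they are sent as output before any header() call runs.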
I believe that whether you save it UTF-8 with or without BOM it still happens. I don't think it makes a difference.
Try it, see if it helps.
From a tool like vi or vim, you can modify and save the file without a BOM with the following two commands:
:setlocal nobomb
and then
:w

ANSI view differs between Notepad and Notepad++. Why?

I am writing some data to an XML file with ISO-8859 encoding. If I open the file in Notepad++, I can see the 'Â' character that is already present in the file. But if I open the file in Notepad, the 'Â' character is gone. I am very new to encoding and don't know why. Please suggest a reason for this.
The file also opens in a browser with the 'Â' character.
Thanks in advance.
Windows Notepad is a very basic editor and has quite a number of limitations, one of which is its limited support for encodings other than ANSI, Unicode and UTF-8. When editing files in other formats, it can give unreliable or unexpected results.
If you are handling files in different encoding formats, you are better off avoiding notepad altogether and using an editor (such as Notepad++) which has better support for multiple encoding formats.
For more information on how Windows notepad "guesses" at the correct format to use (with varying levels of success) see here
Bear in mind that other editors often use similar techniques to "guess" the format of a file, so it is often a good idea to check/set the encoding for a file manually (where possible) for less common encoding formats to ensure you get the correct results every time.
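To illustrate why the guess matters, the Python sketch below decodes the same two bytes in two ways. It is only one common way a stray 'Â' can appear (a UTF-8 non-breaking space read as a single-byte encoding); the question does not establish that this is what happened in that particular file.

```python
# A non-breaking space (U+00A0) encoded as UTF-8 is the two bytes C2 A0.
raw = "\u00a0".encode("utf-8")     # b'\xc2\xa0'

# An editor that guesses UTF-8 shows one (invisible) character ...
as_utf8 = raw.decode("utf-8")      # '\xa0'
# ... while one that guesses Latin-1 shows 'Â' followed by that space.
as_latin1 = raw.decode("latin-1")  # 'Â\xa0'
```

The bytes on disk are identical in both cases; only the encoding chosen by the viewer changes what is displayed.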