PyScripter changing my non-ASCII characters after saving - unicode

I have this line: symbols = {"oe":"ö", "SS":"ß", "ae":"ä", "ue":"ü"}
After saving/exiting and opening again, it now reads symbols = {"oe":"Г¶", "SS":"Гџ", "ae":"Г¤", "ue":"Гј"}
How do I prevent this from happening?

https://groups.google.com/forum/#!topic/pyscripter/vfvpy8vxKoM
gives the following answer:
" Edition > File format > UTF-8 (no BOM) (or UTF-8)"
However, even after going to Tools > Options > IDE options and changing "encode new files in" from ASCII to UTF-8, this doesn't seem to have any effect:
- each time I reopen a saved file, non-ASCII characters are replaced by garbage;
- Edition > File format is back to Ansi.
Dom (from France)
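What is happening here is classic mojibake: the UTF-8 bytes of each character are being re-read under a Cyrillic code page (Windows-1251). A quick Python check, purely for diagnosis, reproduces the garbage exactly:

# "ö" is 0xC3 0xB6 in UTF-8; read back as Windows-1251, those bytes display as "Г¶"
print("ö".encode("utf-8").decode("cp1251"))  # Г¶
print("ß".encode("utf-8").decode("cp1251"))  # Гџ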

Related

Auto-replace Unicode \uxxxx characters in a text file with their UTF-8 equivalents

I've been given a huge properties file created in eclipse with ISO-8859-1 encoding, and all Greek characters in it are in Unicode format (i.e.: \u03bc\u03af\u03b1\u0020\u03bc\u03ad\u03c1\u03b1). It works fine, but I want the actual file to be human readable.
I converted the file to utf-8, but the characters remained as they were. Is there a way to automatically convert the contents of the file to utf-8 either from inside eclipse, or via an external tool?
This can be done with the AnyEdit Tools plugin:
Once installed, in the editor hit Ctrl+A (to select all the text containing ASCII and Unicode-formatted characters), then right-click and choose Convert > From Unicode Notation.
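As a quick sanity check of what the conversion should produce, Python's unicode_escape codec decodes the example string from the question (a minimal sketch; it prints μία μέρα, Greek for "one day"):

# decode the \uXXXX escapes from the question into readable Greek
s = r"\u03bc\u03af\u03b1\u0020\u03bc\u03ad\u03c1\u03b1"
print(s.encode("ascii").decode("unicode_escape"))  # μία μέρα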

UTF-8 source files are not supported in AviSynth

I use avisynth to demux video from audio.
When I use
x = "m.mkv"
ffvideosource(x)
It works correctly, but when I change my video filename to a UTF-8 one and my script to:
x = "م.mkv"
ffvideosource(x)
I get the following error:
failed to open for hashing avisynth
I found a link (UTF-8 source files are not supported) which says UTF-8 file names don't work in AviSynth; to correct the problem, it suggests:
specify the parameter utf8=true when calling ffvideosource, save the script as UTF-8 without BOM and then see if that works.
But I couldn't solve the problem. When I open the script in Notepad and save it in UTF-8 format, I get the following error:
UTF-8 Source files are not supported, re-save script with ANSI encoding
How can I solve the problem? How can I run my script with a UTF-8 filename?
“Without BOM” is important. You need to save the file as raw UTF-8, without the Microsoft-style faux-BOM. Notepad can't do this; it always saves UTF-8 files with that generally-undesirable 0xEF 0xBB 0xBF header. Most other text editors (e.g. Notepad++) can do it properly.
AviSynth isn't really Unicode-aware, so it doesn't want you using UTF-8 and gives that error message to try to stop you from making mistakes. ffvideosource's workaround of hiding UTF-8 bytes in what AviSynth sees as ‘ANSI’ characters only works as long as AviSynth sees the file as ANSI. AviSynth doesn't have very sophisticated encoding-guessing, so removing the faux-BOM is enough to convince it that it is dealing with ANSI.
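If your editor can't strip the faux-BOM, removing the three bytes yourself is easy; here is a minimal Python sketch (the script file name is just an example):

from pathlib import Path

path = Path("script.avs")  # hypothetical script name
data = path.read_bytes()
# drop the UTF-8 faux-BOM (0xEF 0xBB 0xBF) if present
if data.startswith(b"\xef\xbb\xbf"):
    path.write_bytes(data[3:])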
This is a very common problem when using UTF-8 in AviSynth.
Follow these steps:
Check the plugins folder. These three files should exist: ffms2.dll, ffmsindex.exe, and FFMS2.avsi. If you did not have a problem with ANSI, I guess you don't have FFMS2.avsi in your plugins folder; in that case, download the latest version from here.
After that, make an AVS file with Notepad++. For example, I do this:
x = "C:/Users/Nemat/Desktop/StackOverFlow/نعمت.mkv"
ffmpegsource2(x,utf8=true)
Please note that here I used ffmpegsource2().
In the Encoding menu of Notepad++, select Encode in UTF-8 without BOM.
Save your file.
Check that the video file exists in the specified directory.
Double click on your AVS file.
Enjoy it!
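A side note if you ever generate the AVS script from code: Python's plain utf-8 codec writes no BOM (it is utf-8-sig that adds one), so a sketch like this (file name assumed for illustration) produces exactly the encoding AviSynth wants:

# "utf-8" writes no BOM; "utf-8-sig" would add the unwanted 0xEF 0xBB 0xBF header
script = 'x = "C:/Users/Nemat/Desktop/StackOverFlow/نعمت.mkv"\nffmpegsource2(x,utf8=true)\n'
with open("demo.avs", "w", encoding="utf-8") as f:
    f.write(script)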

How to convert escaped Unicode (e.g. \u0432\u0441\u0435) to UTF-8 chars (все) in Notepad++

I have .properties files with a bunch of Unicode-escaped characters. I want to convert them so the correct characters are displayed.
E.g.:
Currently: \u0432\u0441\u0435 \u0433\u043e\u0442\u043e\u0432\u043e\u005c
Desired result: все готово
Notepad++ is already set to encode in UTF-8 without BOM. Opening the document and 'converting' (from the Encoding drop-down menu) doesn't do anything.
How do I achieve this with Notepad++?
If not in Notepad++, is there any other way to do this for many files, perhaps by using some script?
You need a plugin named HTML Tag.
Once the plugin is installed, select your text and invoke the command Plugins > HTML Tag > Decode JS (Ctrl+Shift+J).
I'm not aware of a way to do it natively in Notepad++, but as requested you can script it with Python:
import codecs

# open the file and decode the \uXXXX escapes into a true Unicode string
with codecs.open("escaped-unicode.txt", "rb", "unicode_escape") as my_input:
    contents = my_input.read()

# write the decoded string out with UTF-8 encoding
with codecs.open("utf8-out.txt", "wb", "utf8") as my_output:
    my_output.write(contents)
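Since the question also asks about doing this for many files, here is a minimal batch variant of the same idea (the props/ directory and the output naming are assumptions for illustration):

import codecs
from pathlib import Path

# decode every .properties file and write a UTF-8 copy alongside it
for src in Path("props").glob("*.properties"):
    with codecs.open(src, "rb", "unicode_escape") as f:
        text = f.read()
    with codecs.open(str(src) + ".utf8", "wb", "utf8") as f:
        f.write(text)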

What's the "" sign at the beginning of my source file?

I have a PHP source file where "" characters automatically got added in! I don't know where they came from. I'm not getting any parse errors, but it results in weird behavior in the execution of the file, e.g. the header location functionality sometimes doesn't work! I'm curious how these kinds of symbols get auto-generated. I'm using UTF-8 encoding, and the "" sign does not show in Notepad++ or Windows Notepad, but it does in the NetBeans IDE.
Eg. Code:
<?php
echo "no errors!";
header("Location: http://stackoverflow.com");
exit;
?>
What is this? How can I prevent it?
You probably saved the files as UTF-8 with BOM. You should save them as UTF-8 without BOM.
It's called a Byte Order Mark, and it doesn't always have to be "". http://en.wikipedia.org/wiki/Byte_order_mark
Some Windows applications add a BOM by default. In Notepad++ you can use the Encoding menu options Encode in UTF-8 without BOM or Convert to UTF-8 without BOM.
I believe it still happens whether you save it as UTF-8 with or without BOM; I don't think it makes a difference.
Try it, see if it helps.
From a tool like vi or vim, you can modify and save the file without a BOM with the following two commands:
:setlocal nobomb
and then
:w
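Incidentally, the BOM is also why header() fails: the three BOM bytes are output to the browser before header() runs, so the headers have already been sent. If many files are affected, a small script can find and strip the BOM in bulk (a sketch; the htdocs directory name is an assumption):

from pathlib import Path

BOM = b"\xef\xbb\xbf"
# rewrite any PHP file in the tree that starts with a UTF-8 BOM
for php_file in Path("htdocs").rglob("*.php"):
    data = php_file.read_bytes()
    if data.startswith(BOM):
        php_file.write_bytes(data[len(BOM):])
        print("stripped BOM from", php_file)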

Emacs displays Chinese characters if I open an XML file

I have an XML file. When I open it with Emacs it displays Chinese characters (see attachment). This happens on my Windows 7 PC with Emacs and Notepad, and also on my Windows XP PC (see figure A). Figure B is the hexl-mode view of A.
If I use the Windows XP PC of a colleague and open the file with Notepad, there are no Chinese characters, but a strange character appears instead. I saved it as a txt file and sent it by email to my Windows 7 PC (see figure C). The strange character was replaced with "?". (Due to restrictions I could not use my colleague's PC to reproduce the Notepad file with the strange character.)
My questions: it seems that there are characters in the XML file which create problems. I don't know how to cope with that. Does anybody have an idea how I can manage this problem? Does it have something to do with encoding? Thanks for any hints.
By figure B, it looks like this file is encoded with a mixture of big-endian and little-endian UTF-16. It starts with fe ff, which is the byte order mark for big-endian UTF-16, and the XML declaration (<?xml version=...) is also big-endian, but the part starting with <report is little-endian. You can tell because the letters appear on even positions in the first part of the hexl display, but on odd positions further down.
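A quick way to see this even/odd pattern for yourself (a minimal Python illustration, using a tag from the question):

# ASCII letters land on even byte offsets in UTF-16-BE but odd offsets in UTF-16-LE,
# which is exactly the pattern the hexl-mode view exposes
print("<report".encode("utf-16-be").hex(" "))  # 00 3c 00 72 00 65 ...
print("<report".encode("utf-16-le").hex(" "))  # 3c 00 72 00 65 00 ...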
Also, there is a null character (encoded as two bytes, 00 00) right before <report. Null characters are not allowed in XML documents.
However, since some of the XML elements appear correctly in figure A, it seems that the confusion continues throughout the file. The file is corrupt, and this probably needs to be resolved manually.
If there are no non-ASCII characters in the file, I would try to open the file in Emacs as binary (M-x revert-buffer-with-coding-system and specify binary), remove all null bytes (M-% C-q C-# RET RET), save the file and hope for the best.
Another possible solution is to mark each region appearing with Chinese characters and recode it with M-x recode-region, giving "Text was really in" as utf-16-le and "But was interpreted as" as utf-16-be.
For some reason, Emacs treats "UTF-16" in an XML file's encoding attribute as big-endian, while Windows treats "UTF-16" as little-endian (as when exporting from Task Scheduler). Emacs will unknowingly convert LE to BE automatically if you edit and save an XML file. You can mouse over the "U" at the lower left to see the current encoding. encoding="UTF-16LE" or encoding="UTF-16BE" will ruin the file after saving (no BOM). I believe the latest version has this fixed.
<?xml version="1.0" encoding="UTF-16"?>
<hi />
legoscia's solution, using Emacs's ability to change the encoding within a file, solved my problem. Another possibility is:
- cut the part to convert;
- paste it into a new file and save it;
- open it with an editor that can convert encodings;
- convert the file and save it;
- copy the converted string and paste it back into the original file where you cut it.
In my case it worked with Atom, but not with Notepad++.
PS: The reason I used this approach is that Emacs could no longer open this kind of corrupted file. I don't know why, but that is another issue.
Edit 1: Since copy, paste and merge is cumbersome, I found out how to open corrupted files with Emacs: emacs -q xmlfile.xml. Using Emacs as legoscia suggested is the best way to repair such files.