Visual Studio Code - opening files with accented text changes the contents

When I try to open an external .txt file that contains accented characters, those characters are changed into other, seemingly random characters. I can type accented characters inside the editor and they are processed fine; the corruption only occurs when I open the file.
Example:
original text: "São Paulo"
text in VSC: "Sгo Paulo"
I have already turned on the auto-detect encoding setting, but that didn't seem to work. Do I have any other options?
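The substitution in the example (ã turning into the Cyrillic г) is exactly what happens when Windows-1252 bytes are decoded as Windows-1251. A minimal Python sketch of the mismatch:

text = "São Paulo"
raw = text.encode("cp1252")    # the bytes a Windows-1252 editor would write
print(raw.decode("cp1251"))    # 'Sгo Paulo' - decoded with the wrong code page
print(raw.decode("cp1252"))    # 'São Paulo' - decoded with the right one

So the bytes on disk are most likely fine; if auto-detection keeps guessing wrong, use "Reopen with Encoding" from the status bar and pick the correct code page explicitly.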

Related

Show special characters in VS Code

I have a problem with my VS Code. When trying to modify a file that contains special characters like "á", "ñ", "ó" etc., the special characters are replaced with a question mark.
This can be fixed from the bar at the bottom of Visual Studio Code by changing the encoding to "Windows 1252", and at first that worked for me. But now, even if I change it to that encoding, the question marks are still there.
The files that you opened before you changed the encoding setting have been saved with the replacement character in place of the original accented characters. The original bytes were overwritten, so they cannot be recovered from those files.
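A short Python sketch of why those files cannot be repaired after the fact: once the replacement character is on disk, the file really does contain a plain '?', with no trace of the original character (the file name here is hypothetical):

s = "mañana"
lossy = s.encode("ascii", errors="replace").decode("ascii")
print(lossy)                                        # 'ma?ana' - the 'ñ' is now a literal '?'
with open("demo.txt", "w", encoding="ascii") as f:  # hypothetical file
    f.write(lossy)
print(open("demo.txt", encoding="ascii").read())    # 'ma?ana' - nothing left to recover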

Retrieving Unicode text from Notepad which was saved as an ANSI text file

Yesterday I wrote some text in a Notepad file which was full of Unicode characters and saved the file as ANSI. Notepad gave me some warning, which I clicked OK on without reading it fully, and closed Notepad.
Today when I opened the same file in Notepad again, it is full of ??? signs. I now understand that this happened because I saved Unicode data as ANSI text. Is there a way to retrieve this text? Maybe using a hex editor or similar?
No. Certain characters cannot be encoded in certain encodings. "風" cannot be encoded at all in ISO-8859 or any other single-byte encoding, for example. Each ANSI encoding can likewise only encode a certain subset of all possible characters. It is simply not possible to store characters in an ANSI encoding that are not defined in it; they do not exist there.
So they're gone. You had better pull out a backup.
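To see this concretely, here is a minimal Python check: encoding "風" into a typical single-byte "ANSI" code page simply fails, and saving with replacement substitutes '?', which is what Notepad's warning was about:

ch = "風"                                      # U+98A8
try:
    ch.encode("cp1252")                        # a typical Windows "ANSI" code page
except UnicodeEncodeError as e:
    print("not representable:", e.reason)      # 'character maps to <undefined>'
print(ch.encode("cp1252", errors="replace"))   # b'?' - what actually gets saved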

Extra characters in .doc file when opened with TextPad

When I open a document in TextPad, an extra null character appears between every two characters.
For example, my document contains the following text:
बॉम्बे testing for webmail.
When I open it in TextPad, it comes out as:
I....M....I t.e.s.t.i.n.g. f.o.r. w.e.b.m.a.i.l.
Can anybody help me with this?
This file is in UTF-16 or UCS-2 format. When opening it, you must specify which encoding it should be opened with; your text editor does not recognize this encoding automatically.
If your text editor does not allow setting the encoding when opening a file, try using Notepad++ or TextPad.
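The dots are the zero bytes of UTF-16 being shown literally by an editor that assumed a one-byte-per-character encoding. A small Python sketch of the effect:

raw = "testing".encode("utf-16-le")   # two bytes per character for Latin text
print(raw)                            # b't\x00e\x00s\x00t\x00i\x00n\x00g\x00'
print(raw.decode("utf-16-le"))        # 'testing' - decoded with the right encoding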

Emacs displays Chinese characters if I open an XML file

I have an XML file. When I open it with Emacs it displays Chinese characters (see attachment). This happens on my Windows 7 PC with Emacs and Notepad, and also on my Windows XP PC (see figure A). Figure B is the hexl-mode view of A.
If I use my colleague's Windows XP PC and open the file with Notepad, there are no Chinese characters, but a strange character instead. I saved it as a txt file and sent it by email to my Windows 7 PC (see figure C). The strange character was replaced with "?". (Due to restrictions I could not use my colleague's PC to reproduce the Notepad file with the strange character.)
My question: it seems that there are characters in the XML file which create problems. I don't know how to cope with that. Does anybody have an idea how I can manage this problem? Does it have something to do with encoding? Thanks for any hints.
Judging by figure B, it looks like this file is encoded in a mixture of big-endian and little-endian UTF-16. It starts with fe ff, which is the byte order mark for big-endian UTF-16, and the XML declaration (<?xml version=...) is also big-endian, but the part starting with <report is little-endian. You can tell because the letters appear at even positions in the first part of the hexl display, but at odd positions further down.
Also, there is a null character (encoded as two bytes, 00 00) right before <report. Null characters are not allowed in XML documents.
However, since some of the XML elements appear correctly in figure A, it seems that the mix-up continues throughout the file. The file is corrupt, and this probably needs to be resolved manually.
If there are no non-ASCII characters in the file, I would try to open the file in Emacs as binary (M-x revert-buffer-with-coding-system and specify binary), remove all null bytes (M-% C-q C-@ RET RET), save the file and hope for the best.
Another possible solution is to mark each region appearing with Chinese characters and recode it with M-x recode-region, giving "Text was really in" as utf-16-le and "But was interpreted as" as utf-16-be.
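If Emacs is not available, the same repair can be sketched in Python. This is a hypothetical sketch that assumes, as figure B suggests, a single switch from big-endian to little-endian right at <report, with the two halves cleanly aligned; the file name and element name are assumptions:

raw = open("report.xml", "rb").read()      # hypothetical file name
marker = "<report".encode("utf-16-le")     # '<report' as little-endian bytes
split = raw.find(marker)
assert split > 0, "little-endian '<report' not found"
head = raw[2:split].decode("utf-16-be")    # skip the fe ff byte order mark
tail = raw[split:].decode("utf-16-le")
fixed = (head + tail).replace("\x00", "")  # drop the stray null characters
open("report-fixed.xml", "w", encoding="utf-8").write(fixed)
# remember to change encoding="UTF-16" in the XML declaration afterwards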
For some reason, Emacs takes "UTF-16" in an XML file's encoding attribute as big-endian, while Windows takes "UTF-16" as little-endian (for example when exporting from Task Scheduler). Emacs will unknowingly convert LE to BE automatically if you edit and save such an XML file. You can mouse over the "U" at the lower left to see the current encoding. encoding="UTF-16LE" or encoding="UTF-16BE" will ruin the file after saving (no BOM is written). I believe the latest version has this fixed.
<?xml version="1.0" encoding="UTF-16"?>
<hi />
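The two byte orders, and the byte order marks that distinguish them, are easy to see in Python:

import codecs
print(codecs.BOM_UTF16_BE)        # b'\xfe\xff' - what this file starts with
print(codecs.BOM_UTF16_LE)        # b'\xff\xfe' - what Windows usually writes
print("<hi".encode("utf-16-be"))  # b'\x00<\x00h\x00i' - zero byte first
print("<hi".encode("utf-16-le"))  # b'<\x00h\x00i\x00' - zero byte second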
The solution of legoscia, using Emacs's ability to change the encoding within a file, solved my problem. Another possibility is:
cut the part to convert
paste in a new file and save it
open it with an editor which can convert encodings
convert the file and save it
copy the converted string and paste it back into the original file, in place of the part you cut
In my case it worked with Atom, but not with Notepad++.
PS: The reason I used this approach is that Emacs could no longer open this kind of corrupted file. I don't know why, but that is another issue.
Edit 1: Since cut, paste and merge is cumbersome, I found out how to open corrupted files with Emacs: emacs -q xmlfile.xml. Using Emacs as legoscia suggested is the best way to repair such files.

AppleScript: Save Word documents as plain text while retaining accents

I'm trying to save Word documents as plain text docs. Currently, the accents sometimes turn into other symbols (usually the same ones, for example: é turns into a theta). Other times it works fine. How do I prevent this?
Currently using the line:
save as active document file name FullDocPath file format format Unicode text
When I encounter this error, I can save the document using the dialog (selecting Western Mac OS Roman encoding); that fixes the problem.
The AppleScript Word dictionary mentions:
[text encoding unsigned integer] : Text encoding to use when saving out as text file
I have no idea if this is the piece I'm missing or how to utilize it (is there a set integer that designates Western Mac OS Roman encoding?)
Anyone have any ideas?
Try:
set wordDoc to choose file
do shell script "textutil -convert txt " & quoted form of POSIX path of (wordDoc as text)
Check out StefanK's solution using textutil
This is in response to your comment beginning "Thanks Stefan and bibadiak"
The problem with .txt file formats is that there is no universally used way to specify the encoding of a file inside the file itself, so either the application has to guess, or you have to know the encoding and the application has to let you tell it.
AFAIK, if you do not specify an output encoding when you use textutil to convert from .doc or .docx format to text, you get UTF-8. But Mac Word just does not seem to recognise that when you try to open the file, either programmatically or in the UI.
So I think you need to do some mix of the following:
a. save in, and work with, a format that uses a 16-bit Unicode encoding. Word should recognise that, certainly if the BOM is preserved.
b. save to UTF-8 and work with UTF-8 elsewhere, but use textutil to do the conversion back to (say) .docx before you re-open the document in Mac Word.
c. if all your characters can be encoded using Mac OS Roman, use e.g.
textutil -convert txt -encoding 30
to save, ensure you work only with that character set, and re-open with Word. (30 is the value of the Apple NSString constant NSMacOSRomanStringEncoding.) I think textutil will fail to convert documents that contain characters outside the Mac OS Roman set; a quick way to check a document in advance is sketched below.
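As a sketch of that pre-flight check, Python's mac_roman codec can tell you whether a document's text will survive option c (the file name is hypothetical):

text = open("doc.txt", encoding="utf-8").read()   # hypothetical exported text
try:
    text.encode("mac_roman")
    print("safe for textutil -encoding 30")
except UnicodeEncodeError as e:
    print("not in Mac OS Roman:", repr(e.object[e.start]))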