Read Stata 14 unicode do file in Stata 13 - unicode

I edit and save a do-file in Stata 14 with unicode characters such as ö, Ö, ä, Ä, etc.
When someone else tries to open this do-file in Stata 13, the characters are not displayed correctly (since Stata 13 does not support Unicode).
After hours of searching the entire internet for a solution, this old 2015 post is the only one discussing the issue, but it does not offer a solution: http://www.statalist.org/forums/forum/general-stata-discussion/general/1300263-translate-stata-14-utf8-do-file-into-stata-13-or-earlier-do-file
How can I edit and save a do-file in Stata 14 with unicode characters and make this do-file readable in Stata 13?
PS:
I do not have Stata 13 on that computer, so copying and pasting the do-file into a Stata 13 do-file editor and saving it there won't work.
If I edit the do-file with an external editor, it can't be read by Stata 14.
unicode translate only converts from extended ASCII to Unicode (UTF-8), not in the other direction.
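One workaround, offered as a minimal sketch rather than an official Stata feature: since the characters involved (ä, ö, ü, etc.) all exist in Latin-1, you can down-convert the UTF-8 do-file with a few lines of Python before handing it to the Stata 13 user (the file names below are made up):

# Down-convert a Stata 14 UTF-8 do-file to Latin-1 for Stata 13.
# Characters with no Latin-1 equivalent are replaced by '?'.
with open("analysis14.do", encoding="utf-8") as f:
    src = f.read()
with open("analysis13.do", "wb") as f:
    f.write(src.encode("latin-1", errors="replace"))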

Related

Save file with ANSI encoding in VS Code

I have a text file that needs to be in ANSI mode. It specifies only that: ANSI. Notepad++ has an option to convert to ANSI and that does the trick.
In VS Code I don't find this encoding option. So I read up on it and it looks like ANSI doesn't really exist and should actually be called Windows-1252.
However, there's a difference between Notepad++'s ANSI encoding and VS Code's Windows-1252: the two pick different code points for characters such as the accented uppercase E (É), as is evident from the document.
When I let VS Code guess the encoding of the document converted to ANSI by Notepad++, however, it still guesses Windows-1252.
So the questions are:
Is there something like pure ANSI?
How can VS Code convert to it?
Check out the upcoming VSCode 1.48 (July 2020) browser support.
It also solves issue 33720, where previously you had to force an encoding for the whole project with, for instance:
"files.encoding": "utf8",
"files.autoGuessEncoding": false
Now you can set the encoding file by file:
Text file encoding support
All of the text file encodings of the desktop version of VSCode are now supported for web as well.
As you can see, the encoding list includes ISO 8859-1, which is the closest standard to what "ANSI encoding" is usually taken to mean.
Windows code page 1252 was created based on ISO 8859-1 but is not identical to it:
It differs from the IANA's ISO-8859-1 by using displayable characters rather than control characters in the 80 to 9F (hex) range
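You can see the difference directly; a quick Python check (mine, not part of the answer above) shows a character that exists in Windows-1252 but maps to a control character in ISO 8859-1:

# The euro sign sits at byte 0x80 in cp1252; in latin-1 that byte is a C1 control.
print("€".encode("cp1252"))             # b'\x80'
print(repr(b"\x80".decode("latin-1")))  # '\x80', a control character
# "€".encode("latin-1") would raise UnicodeEncodeError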
From Wikipedia:
Historically, the phrase "ANSI Code Page" was used in Windows to refer to non-DOS encodings; the intention was that most of these would be ANSI standards such as ISO-8859-1.
Even though Windows-1252 was the first and by far most popular code page named so in Microsoft Windows parlance, the code page has never been an ANSI standard.
Microsoft explains, "The term ANSI as used to signify Windows code pages is a historical reference, but is nowadays a misnomer that continues to persist in the Windows community."

Read Turkish characters from txt file

I am trying to read string data from a txt file which has special Turkish characters in it.
I want to store the content in a string. I tried some methods like textscan and fileread but, instead of the special Turkish characters ş, ç, ı, ö, ğ, I get some weird symbols. Is there any way to do that?
I created a file called turkish.txt with the characters you mentioned (ş,ç,ı,ö,ğ). Trying to read it gave me the following:
fid = fopen('turkish.txt','r','n','UTF-8');  % open for reading, native byte order, UTF-8 encoding
str = fread(fid);                            % read the raw bytes
native2unicode(str')                         % interpret the bytes as characters
ans =
ÿþ_, ç , 1, ö ,
As you can see, ş, ı and ğ are not rendered correctly. If you type
help slCharacterEncoding
you can see a list of the encodings most commonly supported across platforms. I played with the encodings a little; some of the ones I checked were:
ISO-8859-1
US-ASCII
Windows-1252
Shift_JIS
The last one is for Japanese characters. It covers some of the Turkish characters, such as ç and ö, which were rendered correctly, but not all of them.
If you skim through the docs, it says:
If you want to use a different character encoding, you need to start MATLAB with the appropriate locale settings for your operating system. Consult your operating system manual to change the locale setting.
The instructions for setting the locale on Windows platforms, which I haven't tried, can be found here.
Hope it helps.
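As a cross-check outside MATLAB: the leading ÿþ in the output above is the 0xFF 0xFE UTF-16LE byte order mark, which suggests the file was actually saved as UTF-16 rather than UTF-8. A minimal Python sketch under that assumption:

# Decode the file as UTF-16; the codec honours and strips the BOM.
with open("turkish.txt", "rb") as f:
    raw = f.read()
print(raw.decode("utf-16"))   # should print ş, ç, ı, ö, ğ intact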

Stata 13: Encoding of German Characters in Windows 8 and Mac OS X

For a current project, I use a number of csv files that are saved in UTF-8. The motivation for this encoding is that they contain information in German with the special characters ä, ö, ü, ß. My team is working with Stata 13 on Mac OS X and Windows 7 (the software is frequently updated).
When we import a csv file into Stata (choosing Latin-1 on import), the special characters are correctly displayed on both operating systems. However, when we export the dataset to another csv file on Mac OS X - which we need to do quite often in our setup - the special characters are replaced, e.g. ä -> Š, ü -> Ÿ etc. On Windows, exporting works like a charm and the special characters are not replaced.
Troubleshooting: Stata 13 cannot interpret Unicode. I have tried converting the UTF-8 files to Windows-1252 and Latin-1 (ISO 8859-1) encoding (since, after all, all they contain are German characters) using Sublime Text 2 prior to importing them into Stata. However, the same problem remains on Mac OS X.
Yesterday, Stata 14 was announced, which apparently can deal with Unicode. If the lack of Unicode support is indeed the cause, upgrading would probably help with my problem; however, we will not be able to upgrade soon. Apart from that, I am wondering why the problem arises on Mac but not on Windows. Can anyone help? Thank you.
[EDIT Start] When I re-import the exported csv file using a "Mac Roman" text encoding (Stata allows you to specify that in the import dialogue), my German special characters appear again. By the looks of this thread, I am apparently not the only one encountering this problem. However, because I need to work with the exported csv files, I still need a solution. [EDIT End]
[EDIT2 Start] One example is the word "Bösdorf" that is changed to "Bšsdorf". In the original file the hex code is 42c3 b673 646f 7266, whereas the hex code in the exported file is 42c5 a173 646f 7266. [EDIT2 End]
Until the bug gets fixed, you can work around this with
iconv -f utf-8 -t cp1252 <oldfile.csv | iconv -f mac -t utf-8 >newfile.csv
This undoes an incorrect transcoding which apparently the export function in Stata performs internally.
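In fact, the corruption can be reproduced in one line of Python (offered as an illustration of the suspected MacRoman/Windows-1252 mix-up, not as anything Stata documents):

print("Bösdorf".encode("mac_roman").decode("cp1252"))   # prints Bšsdorf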
Based on your examples, cp1252 seems like a good guess, but it could also be cp1254. More examples could help settle the issue if you can't figure it out (common German characters to test with include ä, the other umlauts and their uppercase forms, and the German sharp s ß).
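If iconv isn't available, a short Python sketch (file names are placeholders) performs the same repair:

# Undo the double transcoding: re-encode the garbled text to cp1252 bytes,
# then reinterpret those bytes as MacRoman.
with open("oldfile.csv", encoding="utf-8") as f:
    garbled = f.read()
with open("newfile.csv", "w", encoding="utf-8") as f:
    f.write(garbled.encode("cp1252").decode("mac_roman"))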
Stata 13 and below use a deprecated locale on Mac OS X, MacRoman (Mac OS X itself is Unicode). I generally used StatTransfer to convert, for example, from Excel (Unicode) to Stata (Western, MacRoman; Options -> Encoding options) for Spanish-language data. It was the only way to get á, é, etc. Furthermore, Stata 14 imports Unicode without problems but insists on exporting with es_ES (Spanish Spain) as the default locale, so I have to add locale UTF-8 at the end of the export command to get a readable Excel file.

translate data file with odd Hebrew encoding

I have a binary data file, in a format used by a relatively ancient program, which I am trying to convert into something sane. With the help of a Hex editor I have basically worked out the file format except that it contains Hebrew characters with an odd encoding.
All characters are 8 bits. The "standard" 27 consonants (including "final" consonants) go from hex 80 to 9A. Then there are vowels that seem to start around hex 9B or so (I'm guessing right after the standard consonants end). Then there are "dotted" consonants that seem to start at hex E0.
If I remember correctly, I think this is some sort of DOS encoding. What encoding is this and what encoding should I translate it to so that a user in Israel will be able to most easily open it in, say, Microsoft Word? Are there any tools that I could use to do the translation?
80 to 9A seem to match the code points in CP862, but I could not find any match for the vowel code points. I think what you should do is make a custom mapping to Unicode and produce the output as a UTF-8 or UTF-16LE plain-text file. If you add a BOM (byte order mark), Notepad and/or Word should be able to read it without issues. I would probably write a small Python script, but it shouldn't be hard in any other language.
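A minimal sketch of that approach (the consonant range 0x80-0x9A lines up with the 27 Hebrew letters U+05D0-U+05EA, as in CP862; the vowel-point entries would have to be filled in from the hex dump, and the file names are invented):

# Map bytes 0x80-0x9A to the 27 Hebrew letters; pass other bytes through.
HEBREW = {0x80 + i: chr(0x05D0 + i) for i in range(27)}

with open("ancient.dat", "rb") as f:
    text = "".join(HEBREW.get(b, chr(b)) for b in f.read())
with open("output.txt", "w", encoding="utf-8-sig") as f:  # utf-8-sig writes a BOM
    f.write(text)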

Displaying Unicode characters above U+FFFF on Windows

The application I'm developing with EVC++ 4 runs on Windows CE 5 and should support Unicode (AFAIK wchar_t uses UTF-16 on Windows, so I'm using that), so I want to be able to test it with "more exotic" characters, especially characters that take 4 bytes in UTF-16 rather than just 2. Therefore I'm trying to display such characters in a text editor (at the moment on my desktop PC with Windows XP, not on the embedded device).
But I haven't managed to do so yet. As an example I've chosen this character.
As mentioned here, "MPH 2B Damase" should support this character. So I downloaded the font and put it into Windows\Fonts. I created a text file using a hex editor (just to be sure) with the following content:
FFFE D802 DC00
When I open it with Notepad (which should be Unicode-capable, right?) and use the downloaded font, it doesn't display one character, as intended, but these two:
˘Ü
What am I doing wrong? :)
Thanks!
Edit:
Flipping the BOM, as suggested, doesn't work. Notepad (and all the other editors I tried) displays two squares in that case. Interestingly, if I copy the two squares here (with Firefox) I see the right character:
I've also tried it with Komodo Edit, with the same result.
Using UTF-8 doesn't help Notepad either.
What happens if you put the byte order mark the other way around?
FEFF D802 DC00
(At the moment the byte sequence is being interpreted as the two characters U+02D8 U+00DC, so hopefully flipping the BOM will cause the bytes to be read in the intended order.)
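Also note how the original file mixed byte orders: FF FE announces little-endian, but D8 02 DC 00 are big-endian code units. A hypothetical Python snippet (assuming the intended character is U+10800, which the surrogate pair D802 DC00 encodes) that writes a consistent little-endian test file:

# UTF-16LE with BOM: bytes FF FE 02 D8 00 DC - note the swapped byte order.
ch = "\U00010800"
with open("test.txt", "wb") as f:
    f.write(("\ufeff" + ch).encode("utf-16-le"))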
You probably forgot to read the _wfopen() documentation, where the encoding parameter is specified. BTW, I assume you are already using Unicode (wchar_t).
I would recommend using UTF-8 in files, with or without a BOM, while forcing your fopen to use the UTF-8 flag. It looks like _wfopen(L"newfile.txt", L"r, ccs=UTF-8") will work with UTF-8 with or without a BOM, and also with UTF-16. Do not make the mistake of using ccs=UNICODE; it is common to have UTF-8 files without a BOM.
You should really read a little about Unicode before starting to work. Think of this as a very good investment: it will save you time if you understand how Unicode works.
Here is a start http://blog.i18n.ro/newbie-guide-to-unicode/ and do not forget to read the links from the end of the article.
If you really need a simple text editor that allows you to play with Unicode encodings, use Notepad++ and forget about Notepad.
Your text editor might not like UTF-16. It probably assumes ANSI or UTF-8.
Try typing in the UTF-8 equivalent instead:
0xF0 0x90 0xA0 0x80
This won't help your testing, but will make sure your font isn't at fault. A text editor that does support UTF-16 is Komodo Edit.
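A quick sanity check in Python confirms those four bytes are a single code point (U+10800, matching the D802 DC00 surrogate pair above):

assert bytes([0xF0, 0x90, 0xA0, 0x80]).decode("utf-8") == "\U00010800"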