Tool to convert code source from a codepage to UTF-8? - unicode

I'm working on an open source project. The original project contains comments in Russian and uses codepage 1251. My machine uses codepage 1252, so the Russian comments aren't displayed correctly in Visual Studio Express 2008; not nice, but then again I can't read Russian anyway. Someone using codepage 950 (Traditional Chinese) tried to compile the project and was unable to, because of the code page! Now it is really annoying.
I think that using Unicode (more exactly, UTF-8 with signature, i.e. with a BOM) as the file format for the source code is the way to go.
Problem: how to convert the whole source code easily?
I have already thought about:
Letting Visual Studio save the source code as UTF-8. But: my machine uses codepage 1252 and I found no way to tell VS that the original source uses codepage 1251, so the conversion wouldn't be correct.
Edit: As pointed out by "LicenseQ", there is a way to open a single file in VS with another encoding: click the arrow next to the Open button in the Open dialog, choose "Open With", and then choose "Code Editor (with encoding)".
Changing my machine's codepage for the duration of the conversion. But it's a global Windows setting and requires a reboot, so I'm looking for a friendlier solution.
I've found a tool called CodePageConverter which does exactly what I need, but it can't run as a batch job.
Does anyone know another tool (a command line tool would be perfect) to convert from a codepage to UTF-8?
Edit: As suggested by tkotitan, iconv seems to be the solution I was looking for. There is a Windows version of iconv. And now that I know the tool's name, I was able to find other posts on Stack Overflow dealing with analogous issues.

In the Unix world the utility is called iconv.
Not sure if there is a Windows equivalent.
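For the conversion described in the question, a minimal sketch with iconv (GNU iconv accepts CP1251 as the source codepage name; the file names here are made up for the demo):

```shell
# Demo input: CP1251-encoded Russian text ("Привет") standing in for a source file
printf '\xcf\xf0\xe8\xe2\xe5\xf2' > comments_ru.txt

# Convert one file from codepage 1251 to UTF-8
iconv -f CP1251 -t UTF-8 comments_ru.txt > comments_ru.utf8.txt

# Batch-convert every .cpp/.h file under the current directory in place
# (this overwrites files, so work on a copy or a committed tree)
find . -name '*.cpp' -o -name '*.h' | while read -r f; do
  iconv -f CP1251 -t UTF-8 "$f" > "$f.new" && mv "$f.new" "$f"
done
```

iconv fails with a nonzero exit status on bytes that are invalid in the source encoding, which is why the `&&` guards the `mv`: a failed conversion leaves the original file untouched.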

You can ask VS 2008 to open a file with a specific encoding (click the arrow next to the Open button in the Open dialog).
Or you can change your regional settings to add Russian as the default ;)

Related

How to make a GitHub README.md render

I have a README.md here but it is not showing up as rendered Markdown, it just shows the raw text. Does anyone know what I'm doing wrong here?
https://github.com/slothdude/soundcloud-groupme-bot/blob/master/README.md
There's no way to reliably detect a file's encoding. At the end of the day, it's a guessing game.
That particular file is stored in some strange encoding. Some editors (e.g. Emacs) seem to mostly open it successfully (though there are a few strange characters that might be whitespace), but they don't know what it is. When I ask Emacs what encoding it's using, I get no-conversion, which isn't very helpful.
Others, like Gedit, show what looks like a mixture of kanji and rectangular symbols suggesting unknown values.
Tools like file and enca seem to have no idea what it is:
$ file README.md
README.md: data
$ enca README.md
enca: Cannot determine (or understand) your language preferences.
Please use `-L language', or `-L none' if your language is not supported
(only a few multibyte encodings can be recognized then).
Run `enca --list languages' to get a list of supported languages.
Open it in a decent text editor (ideally the one you used to author it) and save it as UTF-8, then commit that change. I suspect that this will fix its rendering on GitHub.
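If you prefer the command line to re-saving in an editor, the re-encode-and-verify step can be sketched like this, assuming (and it is only an assumption, since detection is guesswork) that the file is Windows-1252:

```shell
# Demo: a README with a Windows-1252 encoded é, which GitHub cannot decode as UTF-8
printf 'caf\xe9 menu\n' > README.md

# Heuristic check only; "file" guesses, it does not know
file -bi README.md

# Convert to UTF-8 and replace the original, then re-check
iconv -f WINDOWS-1252 -t UTF-8 README.md > README.md.new && mv README.md.new README.md
file -bi README.md
```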

How to setup Visual Studio Code to detect and set the correct encoding on file open

I recently started to use Visual Studio Code on server systems where I don't have the Visual Studio IDE installed. I like it very much, but I'm running into a problem.
Notepad++, which I used before, detects the encoding when I open a file and sets it for me. I have many files on Windows servers that are still in windows-1252, but VS Code just uses UTF-8 by default.
I know I can reopen the file with encoding Western (Windows 1252), but I often forget, and I have sometimes destroyed content while saving.
I haven't found a setting for this yet; is there a way to make VS Code detect the encoding and set it automatically when I open a file?
To allow Visual Studio Code to automatically detect the encoding of a file, you can set "files.autoGuessEncoding":true (in the settings.json configuration file).
https://github.com/Microsoft/vscode/pull/21416
This obviously requires a newer version of the application than when the question was originally asked.
Go to File-> Preferences -> User Settings
Add (or update) the entry "files.encoding": "windows1252" in the right-hand editor pane and save.
Now VSCode opens all text files using windows-1252 when there is no proper encoding information set.
EDIT:
In the June 2017 release, the files.autoGuessEncoding setting was introduced. When enabled, it guesses the file's encoding as well as possible. Its default value is false.
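A sketch of a settings.json combining the two settings mentioned above (key names as documented by VS Code; "windows1252" is the identifier VS Code uses for that codepage):

```json
{
    // Fall back to Windows-1252 when a file carries no encoding information
    "files.encoding": "windows1252",
    // Let VS Code guess the encoding from file contents where it can
    "files.autoGuessEncoding": true
}
```

Note that VS Code's settings.json is JSONC, so the comments above are allowed.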
A step-by-step guide:
File >> Preferences >> Settings
Enter autoGuessEncoding and make sure the checkbox is checked
Beware: auto-guessing in VS Code still does not work as expected. The guessing is VERY inaccurate, and the file is still opened with the guessed encoding even when the guessing library reports a low confidence score; VS Code uses jschardet (https://www.npmjs.com/package/jschardet).
If the confidence of the guess is not close to 100%, the file should simply be opened with "files.encoding" instead of the guessed encoding, but that does not happen; the VS Code authors should make better use of the confidence score jschardet returns.
I open either UTF-8 files, which are guessed fine, or windows-1250 files, which in 99% of cases are wrongly detected as some other windows-* encoding or even iso-8859-* and such. I cry daily at the computer having to experience this.
Tuning the confidence check and falling back to the default encoding would do it; it needs someone skilled to review their source and offer a fix.
From Peminator's answer:
Beware: auto guessing in VSCode still does not work as expected; the guessing is VERY inaccurate.
This should improve slightly with VSCode 1.67 (Apr. 2022), released in Insiders:
Allow files.encoding to be set as a language-specific setting for files on startup.
We now detect changes to the editor language; there is almost always a transition from plaintext in the beginning until languages are resolved to a target language. If we detect that the new language has a configured files.encoding override, and that encoding is different from the current encoding, and the editor is not dirty or in save-conflict resolution, we reopen the file with the configured encoding.
Unfortunately I cannot distinguish this from the user changing the language via the editor status bar.
So if the user changes the language mode from bat to ps1, and ps1 has a configured encoding, the encoding will change.
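Based on the release-note quote above, a language-specific override might look like this in settings.json (a sketch; the "[powershell]" section and the "utf8bom" value are assumptions about one plausible setup, not part of the original answer):

```json
{
    "[powershell]": {
        // Reopen PowerShell files with a UTF-8 BOM encoding once the language resolves
        "files.encoding": "utf8bom"
    }
}
```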

Issue with coding Windows-1250 in Perl

I have a text file encoded in Windows-1250. I'm using Windows 7 EN.
I would like to iterate through this file line by line in Perl and print each line. In the console I cannot see the diacritics.
Could you give me any solution?
It depends on what you are going to do with the text, but in many cases it's possible to write code independently of the encoding. In any case, if you redirect the output to a file and the result is OK (read: it can be displayed when opened by a text editor in Windows-1250 mode with a proper font), your code is not the problem.
The other thing is that you want to see CE (Central European) characters in your console.
For that to work you need to:
set your console window to use a font capable of displaying them (you may need to install such a font; I don't remember The Right Way in Windows 7)
set your console to Windows-1250 mode using the command chcp 1250
Note that this is basically the same thing you would need to do in your viewer or editor to see the characters, except that while many editors are able to detect the encoding themselves (sometimes even correctly) and pick the right font, consoles typically need help from you.
Your problem might be similar to the one solved here. I also recommend reading the other post I reference there.

Eclipse turns Japanese into garbage during refactoring

I have several Java files that contain Japanese strings and are encoded in UTF-8. I use Eclipse. However, whenever Eclipse touches them in any automated way, it turns the Japanese into garbage. A good example of this is JAWJAW, the Java Japanese WordNet interface. You can see the code on the website with Japanese characters in it. If you load the project into Eclipse, though, everything will fail because the characters are garbled (mojibake).
Does anyone know how to fix this?
What is the default encoding for your project?
Future versions of Eclipse (like e4) could be set to default to UTF-8, which would avoid any automatic conversion into "garbage".
See bug 108668 for more on that discussion:
No solution will be perfect. However in the long term I think the current platform specific approach is clearly inferior to a platform-independent UTF-8 default.
+1 UTF-8 should be the obvious default character set for all text files. I had a problem with Eclipse when I was using an English Windows XP system and trying to open a file with Chinese characters in Eclipse; as you can imagine, the display was completely messed up and Eclipse didn't tell me what I needed to do. I had to spend time googling for answers. I had to put -Dfile.encoding=UTF-8 in eclipse.ini so that it behaves correctly.
Making UTF-8 the default is not the right solution for the problem you were
having.
+1 for embedding encoding in the character stream wherever we can (like XML, HTTP, some kinds of file systems).
Encoding is meta-info for the data and belongs to the data, not to a separate user-changeable setup.
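For reference, the eclipse.ini workaround quoted in that discussion places the JVM property after the -vmargs line; the relevant fragment looks like this:

```ini
-vmargs
-Dfile.encoding=UTF-8
```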
The primary cause is that a font with Unicode support is missing from the system fonts. Do the following:
Download the Arial Unicode MS font and put it in the Windows Fonts directory.
Change the default text encoding in Eclipse to UTF-8 by navigating to Window -> Preferences -> General -> Workspace -> Text file encoding -> Other -> UTF-8.
Set Arial Unicode MS as the Text Font attribute by navigating to Window -> Preferences -> General -> Appearance -> Colors and Fonts -> Basic -> Text Font (select it) -> Edit.

R character encodings across windows, mac and linux

I use OS X and am currently cooperating with a Windows user while deploying the scripts on a Linux server. We use git for version control, and I keep getting R scripts from his end in which the character encoding mixes latin1 and UTF-8. So I have a couple of questions.
Is there a simple-to-use editor for Windows that handles UTF-8 with more grace than WinEdt, which my co-author currently uses? I use Emacs, but I am having a hard time selling him on switching.
How do I set up R on Windows so that it defaults to reading and writing UTF-8?
This is driving me crazy. Has anyone found a solution (be it in the workflow or in the software used) and cares to share?
Take a look at ?Encoding to set the encoding for specific objects.
You might have luck with options(encoding = ); see ?options. (Disclaimer: I don't have a Windows machine.)
As for editors, I haven't heard complaints about encoding issues with Crimson Editor, which lists UTF-8 support as a feature.
TextPad is a well-featured editor supporting R syntax that allows you to specify the target platform for files (Win/UNIX/Mac/keep current encoding) when you save them. The only problem with it is that some of the keyboard shortcuts are nonstandard (e.g. 'Find' is F5, not F3).
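One workflow-level fix for the mixed-encoding scripts described in the question, independent of which editor either side uses, is to normalize each incoming file to UTF-8 before committing; a sketch with iconv (the file name is an assumption for the demo):

```shell
# Demo: an R script whose comment contains a Latin-1 encoded é
printf '# r\xe9sum\xe9 analysis\nx <- 1\n' > analysis.R

# Normalize from Latin-1 to UTF-8 before committing
iconv -f LATIN1 -t UTF-8 analysis.R > analysis.R.tmp && mv analysis.R.tmp analysis.R
```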