Opening a file containing Unicode characters in Notepad++ appears corrupted - unicode

I'm using the latest version of Notepad++ (6.3.1) on Windows. When I try to open a file containing Unicode characters, the text appears corrupted even after changing the encoding to UTF-8: it is displayed like "[][][][]". Am I missing something in the settings? Kindly help.
Thanks

This is a font issue. You need a font containing the Japanese characters installed on your computer, and you also need Notepad++ set to use such a font for the kind of text being viewed. Notepad++ does seem capable of using fallback fonts when needed (e.g., when the selected font does not contain all the characters appearing in the text), so the problem is probably that no font on your system contains the characters. See, e.g., the list of East Asian Unicode fonts for Windows computers.

Not a font issue, unfortunately. Try it with these characters, using UTF-16 encoding:
🔊, 🎥, 📕 (> U+FFFF)
Conclusion: Notepad++ doesn't have full Unicode support (unlike Windows Notepad or AkelPad)
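One way to confirm that these symbols lie outside the Basic Multilingual Plane is to dump their code points; anything above 0000ffff cannot fit in a single 16-bit unit. A minimal, locale-independent sketch for a POSIX shell, assuming the iconv and od utilities are available (the helper name codepoint is just for illustration):

```shell
# Print the code point of a single character as 8 hex digits, by decoding
# its UTF-8 bytes to UTF-32 big-endian and hex-dumping the result.
codepoint() {
  printf '%s' "$1" | iconv -f UTF-8 -t UTF-32BE | od -An -tx1 | tr -d ' \n'
  echo
}

codepoint '🔊'   # 0001f50a - above U+FFFF, needs a UTF-16 surrogate pair
codepoint '📕'   # 0001f4d5 - likewise outside the BMP
codepoint '⇔'    # 000021d4 - inside the BMP, a single 16-bit unit suffices
```

Decoding explicitly from UTF-8 keeps the check independent of the terminal locale, which matters precisely when you are debugging encoding problems.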

Notepad++ is also inconsistent. In a UTF-8 document, using the Lucida Console font, create the line
⇐⇑⇒⇓⇔⇕⇖⇗⇘⇙
and insert a newline in the middle: the second line becomes 5 blocks. Then delete the newline, and all 10 characters display properly again.
With the MS Gothic font, this test always displays the proper characters.
Notepad++ v7.5.1 (64-bit)
Build time : Aug 29 2017 - 02:38:44
Path : C:\Program Files\Notepad++\notepad++.exe
Admin mode : OFF
Local Conf mode : OFF
OS : Windows 10 (64-bit)
Plugins : mimeTools.dll NppConverter.dll

Related

How to fix PowerShell 7 fonts not showing correctly | oh-my-posh

I've already installed Windows Terminal and set it up with oh-my-posh, and everything works as intended.
However, whenever I launch PowerShell 7 outside the terminal, the font is messy, as you can see in the image below.
I have already tried changing the font to the same one I use in the terminal's .json, but some parts still do not render correctly, and I cannot use it that way with VSCode.
The problem is that the Windows Console doesn't fully support Unicode:
Windows Console was created way back in the early days of Windows,
back before Unicode itself existed! Back then, a decision was made to
represent each text character as a fixed-length 16-bit value (UCS-2).
Thus, the Console’s text buffer contains 2-byte wchar_t values per
grid cell, x columns by y rows in size.
...
One problem, for example, is that because UCS-2 is a fixed-width
16-bit encoding, it is unable to represent all Unicode codepoints.
This means the Windows Console has "partial" support for Unicode characters (i.e. it works as long as the character can be represented in UCS-2), but it cannot support all potential (32-bit) Unicode code points.
When you see boxes, it means the character being used lies outside the UCS-2 range. You can also tell because you get 2 boxes (i.e. 2 x 16-bit values). That is why you can't have happy faces 😀 in your Windows Console (which makes me sad ☹️).
For it to work in all locations, you will have to modify your oh-my-posh themes to use a different character, one that can be represented as a single UCS-2 value.
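Whether a given theme glyph is safe can be checked mechanically: decode it and compare its code point against U+FFFF. A small sketch for a POSIX shell, assuming iconv and od are available; the function name fits_ucs2 is just for illustration:

```shell
# Succeed if the character's code point is <= U+FFFF, i.e. it fits in a
# single UCS-2 unit and is safe for the legacy Windows Console.
fits_ucs2() {
  hex=$(printf '%s' "$1" | iconv -f UTF-8 -t UTF-32BE | od -An -tx1 | tr -d ' \n')
  [ $(( 0x$hex )) -le 65535 ]
}

fits_ucs2 '└'  && echo 'U+2514: safe to use in a theme'
fits_ucs2 '😀' || echo 'U+1F600: will show as boxes in the legacy console'
```

Running candidate glyphs through a check like this before putting them in a theme avoids the trial-and-error of spotting boxes in the prompt.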
For Version 2 of Oh My Posh, you make the font changes by editing the $ThemeSettings variable. Follow the instructions on GitHub for configuring Theme Settings, e.g.:
$ThemeSettings.GitSymbols.BranchSymbol = [char]::ConvertFromUtf32(0x2514)
For Version 3+ of Oh My Posh, you have to edit the JSON configuration file to make the changes, e.g.:
...
{
  "type": "git",
  "style": "powerline",
  "powerline_symbol": "\u2514",
  ....

Where are the unicode characters on the disk and what's the mapping process?

There are several Unicode-related questions that have been confusing me for some time.
For the following reasons, I think the Unicode characters exist somewhere on disk.
Executing echo "\u6211" in a terminal prints the glyph corresponding to the Unicode code point U+6211.
There's the UCD (Unicode Character Database), and we can download its latest version. UCD latest
Some newer Unicode characters, like the latest emojis, cannot be displayed on my Mac until I upgrade the macOS version.
So if the Unicode characters do exist on disk, then:
Where is it?
How can I upgrade it?
What's the process of mapping a Unicode code point to a glyph?
If I use a specific font, what's the process of mapping the code point to a glyph?
If not, what's the process?
I would very much appreciate it if someone could shed light on these problems.
Executing echo "\u6211" in a terminal prints the glyph corresponding to the Unicode code point U+6211.
That's echo -e in bash.
› echo "\u6211"
\u6211
› echo -e "\u6211"
我
Where is it?
In the font file.
Some newer Unicode characters, like the latest emojis, cannot be displayed on my Mac until I upgrade the macOS version.
How can I upgrade it?
Installing or upgrading a suitable font containing the emojis should be enough. I don't have macOS, so I cannot verify this.
I use "Noto Color Emoji" version 2.011/20180424, and it works fine.
What's the process of mapping a Unicode code point to a glyph?
The application (e.g. a text editor) provides the font rendering subsystem (Quartz? on macOS) with Unicode text and a font name. The font renderer analyses the code points of the text and decides whether it is simple text (e.g. Latin, Chinese, stand-alone emojis) or complex text (e.g. Latin with many combining marks, Thai, Arabic, emojis with zero-width joiners). The renderer finds the corresponding outlines in the font file. If the file does not have the required glyph, the renderer may use a similar font, or fall back to a configured substitute glyph (white box, black question mark, etc.). The outlines then undergo shaping (to compose complex glyphs) and line breaking. Finally, the font renderer hands off the result to the display system.
Apart from the shaping, very little of this has to do with Unicode or encoding. Font rendering already worked this way before Unicode existed; of course, font files and rendering were much simpler 30 years ago. Encoding only matters when someone wants to load or save text from an application.
Summary: investigate
TrueType/OpenType font editing software, so you can see what's contained in the files
font renderers; on Linux, look at the libraries Pango and FreeType.
Generally speaking, operating system components that deal with text use the Unicode character set. In particular, font files use the Unicode character set, but not all font files support all Unicode codepoints.
When a codepoint is not supported by one font, the system might fall back to another that does. This is particularly true of web browsers. But ultimately, if the codepoint is not supported, an unfilled rectangle is rendered. (There is no character for that box, because it is not a character; in fact, if you were able to copy and paste it as text, you would get the original character that couldn't be rendered.)
In web development, the web page can either supply or give the location of fonts that should work for the codepoints it uses.
Other programs typically use the operating system's rendering facilities and therefore the fonts available through it. How to install a font in an operating system is not a programming question (unless you are including a font in an installer for your program). For more information on that, you could see if the question fits with the Ask Different (Apple) Stack Exchange site.

Recursive directory listing of unicoded file names

If I use dir /s /b > list.txt, all Unicode characters in file names, like äöüß, are broken or missing: instead of ä I get '', ü just disappears, and so on...
Yes, I know Unicode characters aren't a good way to name files, but they weren't named by me.
Is there a method to get the file names listed intact?
The default console code page usually only supports a small subset of Unicode. US Windows defaults to code page 437 and supports only 256 characters.
If you open a Unicode command prompt (cmd /u), when you redirect to a file the file will be encoded in UTF-16LE, which supports all Unicode characters. Notepad should display the content as long as its font supports the glyphs used.
Changing to an encoding such as UTF-8 (chcp 65001), which supports the full Unicode code point set, before redirecting to a file will likewise use that encoding and work as well.
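For reference, the UTF-16LE output that cmd /u produces stores each of these BMP characters as one little-endian 16-bit unit. You can see what the bytes for äöüß look like from any POSIX shell, assuming iconv and od are available:

```shell
# 'äöüß' encoded as UTF-16LE: each of these BMP characters becomes
# two bytes, least significant byte first (ä=U+00E4, ö=U+00F6, ...).
printf 'äöüß' | iconv -f UTF-8 -t UTF-16LE | od -An -tx1
#  e4 00 f6 00 fc 00 df 00
```

This is why Notepad (which auto-detects UTF-16LE) displays the redirected file correctly, while tools expecting the OEM code page would not.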

Stata 13: Encoding of German Characters in Windows 8 and Mac OS X

For a current project, I use a number of CSV files saved in UTF-8. The motivation for this encoding is that they contain information in German with the special characters ä, ö, ü, ß. My team works with Stata 13 on Mac OS X and Windows 7 (the software is frequently updated).
When we import a CSV file into Stata (choosing Latin-1 on import), the special characters are correctly displayed on both operating systems. However, when we export the dataset to another CSV file on Mac OS X, which we need to do quite often in our setup, the special characters are replaced, e.g. ä -> Š, ü -> Ÿ, etc. On Windows, exporting works like a charm and the special characters are not replaced.
Troubleshooting: Stata 13 cannot interpret Unicode. I have tried converting the UTF-8 files to windows-1252 and Latin-1 (ISO 8859-1) encoding (since, after all, all they contain are German characters) using Sublime Text 2 prior to importing them into Stata. However, the same problem remains on Mac OS X.
Yesterday, Stata 14 was announced, which apparently can deal with Unicode. If that is the cause, it would probably help with my problem; however, we will not be able to upgrade soon. Apart from that, I am wondering why the problem arises on Mac but not on Windows. Can anyone help? Thank you.
[EDIT Start] When I import the exported CSV file again using a "Mac Roman" text encoding (Stata allows you to specify that in the import dialogue), my German special characters appear again. Apparently I am not the only one encountering this problem, by the looks of this thread. However, because I need to work with the exported CSV files, I still need a solution to this problem. [EDIT End]
[EDIT2 Start] One example is the word "Bösdorf", which is changed to "Bšsdorf". In the original file the hex bytes are 42 c3 b6 73 64 6f 72 66, whereas in the exported file they are 42 c5 a1 73 64 6f 72 66. [EDIT2 End]
Until the bug gets fixed, you can work around this with
iconv -f utf-8 -t cp1252 <oldfile.csv | iconv -f mac -t utf-8 >newfile.csv
This undoes an incorrect transcoding which apparently the export function in Stata performs internally.
Based on the examples you give, cp1252 seems like a good guess, but it could also be cp1254. More examples would help settle the issue if you can't figure it out (common German characters to test with include ä, the uppercase umlauts, the German sharp s ß, etc.).
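The whole round trip can be reproduced with iconv alone, which also shows why the workaround works: the corruption is UTF-8 text encoded to Mac Roman bytes and then reinterpreted as cp1252, and the workaround applies the exact inverse. A sketch, assuming GNU iconv (the charset name MACINTOSH may be spelled mac or MACROMAN on other systems):

```shell
# Start with correct UTF-8 text.
printf 'Bösdorf\n' > original.txt

# Simulate the buggy export: encode to Mac Roman bytes (ö -> 0x9a),
# then misinterpret those bytes as cp1252 (0x9a -> š), re-encoded as UTF-8.
iconv -f UTF-8 -t MACINTOSH original.txt | iconv -f CP1252 -t UTF-8 > exported.txt
cat exported.txt    # Bšsdorf

# Undo it with the inverse transcoding, as in the workaround above.
iconv -f UTF-8 -t CP1252 exported.txt | iconv -f MACINTOSH -t UTF-8 > fixed.txt
cat fixed.txt       # Bösdorf
```

The same pair of commands explains the other substitutions reported above: ä is 0x8a in Mac Roman, which cp1252 reads as Š, and ü is 0x9f, which cp1252 reads as Ÿ.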
Stata 13 and below use a deprecated encoding on Mac OS X, macroman (Mac OS X itself is Unicode). I generally used StatTransfer to convert, for example, from Excel (Unicode) to Stata (Western, macroman; Options -> Encoding options) for Spanish-language data. It was the only way to get á, é, etc. Furthermore, Stata 14 imports Unicode without problems but insists on exporting with es_ES (Spanish Spain) as the default locale, so you have to add locale UTF-8 at the end of the export command to get a readable Excel file.

How can I get Mocha's Unicode output to display properly in a Windows console?

When I run Mocha, it tries to show a check mark or an X for a passing or a failing test run, respectively. I've seen great-looking screenshots of Mocha's output. But those screenshots were all taken on Macs or Linux. In a console window on Windows, these characters both show up as a nondescript empty-box character, the classic "huh?" glyph:
If I highlight the text in the console window and copy it to the clipboard, I do see actual Unicode characters; I can paste the fancy characters into a textbox in a Web browser and they render just fine (✔, ✖). So the Unicode output is getting to the console window just fine; the problem is that the console window isn't displaying those characters properly.
How can I fix this so that all of Mocha's output (including the ✔ and ✖) displays properly in a Windows console?
By pasting the characters into LinqPad, I was able to figure out that they were 'HEAVY CHECK MARK' (U+2714) and 'HEAVY MULTIPLICATION X' (U+2716). It looks like neither character is supported in any of the console fonts (Consolas, Lucida Console, or Raster Fonts) that are available in Windows 7. In fact, out of all the fonts that ship with Windows 7, only a handful support these characters (Meiryo, Meiryo UI, MS Gothic, MS Mincho, MS PGothic, MS PMincho, MS UI Gothic, and Segoe UI Symbol). The ones starting with "MS" are all fixed-width (monospace) fonts, but they all look awful at the font sizes typical of a console. And the others are out, since the console requires fixed-width fonts.
So you'll need to download a font. I like DejaVu Sans Mono -- it's free, it looks good at console sizes, it's easy to tell the 0 from the O and the 1 from the I from the l, and it's got all kinds of fancy Unicode symbols, including the check and X that Mocha uses.
Unfortunately, it's a bit of a pain to install a new console font, but it's doable. (Steps adapted from this post by Scott Hanselman, but extended to include the non-obvious subtleties of 000.)
Steps:
Download the DejaVu fonts. Unzip the files. Go into the "ttf" directory you just unzipped, select all the files, right-click and "Install".
Run Regedit, and go to HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Windows NT\CurrentVersion\Console\TrueTypeFont.
Add a new string value. Give it a name that's a string of zeroes one longer than the longest string of zeroes that's already there. For example, on my Windows 7 install, there's already a value named 0 and one named 00, so I had to name the new one 000.
Double-click on your new value, and set its value to DejaVu Sans Mono.
Reboot. (Yes, this step is necessary, at least on OSes up to and including Windows 7.)
Now you can open a console window, open the window menu, go to Defaults > Font tab, and "DejaVu Sans Mono" should be available in the Font list box. Select it and OK.
Now Mocha's output will display in all its glory.
Update: this issue has now been fixed. Starting from Mocha 1.7.0, fallbacks are used for symbols that don't exist in default console fonts (√ instead of ✔, × instead of ✖, etc.). It's not as pretty as it could be, but it surely beats empty-box placeholder symbols.
For details, see the related pull request: https://github.com/visionmedia/mocha/pull/641