Sample text for common scripts - unicode

The Windows font chooser dialog displays different text according to the selected 'Script' (corresponding to legacy Windows code pages, I think). I want to preview fonts that support scripts not listed there, however, and I was wondering if there is a resource of short (~8-12 characters) strings useful for this purpose.
Here's what I've got so far, based on the preview text from the Windows font chooser:
Latin: AaBbYyZz
Greek: AaBbΑαΒβ
Cyrillic: AaBbБбФф
Hebrew: AaBbנסשת
Arabic: AaBbابجدهوز
Thai: AaBbอักษรไทย
Korean: 가나다AaBbYyZz
Japanese: Aaあぁアァ亜宇
Chinese: 中文字型範例
Devanagari: माता
Gurmukhi: ਮਾਤਾ
Gujarati: માતા
Tamil: அம்மா
Telugu: అమ్మ
Kannada: ಅಮ್ಮ
The Chinese text is from the Traditional Chinese preview text in the font chooser dialog - I'd like text that's good for simplified and traditional fonts, so I'm not sure about this one.
The six South Asian scripts at the end of the list all use the word 'mother', which seems a bit strange, but I'll use the same pattern for Malayalam:
Malayalam: അമ്മ
In Bengali, however, 'mother' is apparently only two characters long (মা), so I'd prefer to find something else for that language.
I'm missing sample text for the following languages/scripts:
Armenian: ?
Bengali: ?
Oriya: ?
Lao: ?
Tibetan: ?
Georgian: ?
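One rough way to fill the gaps is to take the first few letters of each script's Unicode block. The sketch below does that with Python's standard unicodedata module. The starting code points and the sample_text helper are assumptions chosen for illustration, and the output is alphabet-order letters rather than a real word, so it will not exercise vowel signs or conjuncts.

```python
import unicodedata

# Assumed starting points: the first letter of each script's main block.
SCRIPT_STARTS = {
    "Armenian": 0x0531,  # ARMENIAN CAPITAL LETTER AYB
    "Bengali": 0x0985,   # BENGALI LETTER A
    "Oriya": 0x0B05,     # ORIYA LETTER A
    "Lao": 0x0E81,       # LAO LETTER KO
    "Tibetan": 0x0F40,   # TIBETAN LETTER KA
    "Georgian": 0x10D0,  # GEORGIAN LETTER AN
}

def sample_text(start, length=8, search_limit=128):
    """Collect the first `length` letters at or after code point `start`."""
    letters = []
    for code_point in range(start, start + search_limit):
        char = chr(code_point)
        # General categories beginning with "L" are letters; this skips
        # unassigned code points, digits, and combining marks.
        if unicodedata.category(char).startswith("L"):
            letters.append(char)
            if len(letters) == length:
                break
    return "".join(letters)

if __name__ == "__main__":
    for script, start in SCRIPT_STARTS.items():
        print(f"{script}: {sample_text(start)}")
```

For scripts with upper and lower case (Armenian, for example) the same helper can be called twice with different starting points to mimic the AaBb pattern used above.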

Related

Where are the unicode characters on the disk and what's the mapping process?

Several Unicode-related questions have been confusing me for some time.
For the following reasons, I think Unicode characters are stored on disk:
Executing echo "\u6211" in a terminal prints the glyph corresponding to the Unicode code point U+6211.
There is the UCD (Unicode Character Database), and its latest version can be downloaded.
Some newer Unicode characters, such as the latest emojis, cannot be displayed on my Mac until I upgrade macOS.
So if Unicode characters do exist on disk, then:
Where are they?
How can I upgrade them?
What is the process of mapping a Unicode code point to a glyph?
If I use a specific font, what is the process of mapping a Unicode code point to a glyph?
If not, what is the process of mapping a Unicode code point to a glyph?
I would very much appreciate it if someone could shed light on these questions.
Executing echo "\u6211" in a terminal prints the glyph corresponding to the Unicode code point U+6211.
That's echo -e in bash.
› echo "\u6211"
\u6211
› echo -e "\u6211"
我
Where are they?
In the font file.
Some newer Unicode characters, such as the latest emojis, cannot be displayed on my Mac until I upgrade macOS.
How can I upgrade them?
Installing/upgrading a suitable font with the emojis should be enough. I don't have macOS, so I cannot verify this.
I use "Noto Color Emoji" version 2.011/20180424, it works fine.
What is the process of mapping a Unicode code point to a glyph?
The application (e.g. text editor) provides the font rendering subsystem (Quartz? on macOS) with Unicode text and a font name. The font renderer analyses the codepoints of the text and decides whether this is simple text (e.g. Latin, Chinese, stand-alone emojis) or complex text (e.g. Latin with many marks, Thai, Arabic, emojis with zero-width joiners). The renderer finds the corresponding outlines in the font file. If the file does not have the required glyph, the renderer may use a similar font, or use a configured fallback font for a poor substitute (white box, black question mark etc.). Then the outlines undergo shaping to compose a complex glyph and line-breaking. Finally, the font renderer hands off the result to the display system.
Apart from the shaping, very little of this has to do with Unicode or encoding. Font rendering already worked that way before Unicode existed, although font files and rendering were much simpler 30 years ago. Encoding only matters when someone wants to load or save text from an application.
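To make the distinction between a code point and its encoding concrete, here is a minimal Python sketch using only the standard library. It shows that "\u6211" names an abstract code point, and that bytes only appear once an encoding is chosen for loading or saving. Nothing here touches a font; the glyph is supplied later by whatever renders the text.

```python
import unicodedata

char = "\u6211"  # the escape names code point U+6211, not bytes on disk

# The code point and its character name are the same in every encoding.
print(f"U+{ord(char):04X}")      # U+6211
print(unicodedata.name(char))    # CJK UNIFIED IDEOGRAPH-6211

# Bytes only exist once an encoding is picked for storage or transfer.
print(char.encode("utf-8"))      # b'\xe6\x88\x91'
print(char.encode("utf-16-be"))  # b'b\x11'
```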
Summary: investigate
Truetype/Opentype font editing software so you can see what's contained in the files
font renderers, on Linux look at the libraries pango and freetype.
Generally speaking, operating system components that use text use the Unicode character set. In particular, font files use the Unicode character set. However, not all font files support all Unicode code points.
When a code point is not supported by one font, the system might fall back to another that does support it. This is particularly true of web browsers. But ultimately, if the code point is not supported by any available font, an unfilled rectangle is rendered. (There is no character for that because it's not a character. In fact, if you were able to copy and paste it as text, it would be the original character that couldn't be rendered.)
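The fallback behaviour described above can be pictured with a toy model. In this sketch each "font" is just a Python dict from code point to glyph name; real fonts keep that mapping in a cmap table, and the dict contents, the glyphs_for helper, and the glyph names other than .notdef are made up for illustration. The text itself is never changed; only the glyph chosen for display differs.

```python
# Toy stand-ins for fonts: code point -> glyph name (hypothetical data).
PRIMARY_FONT = {0x41: "A", 0x61: "a"}
FALLBACK_FONT = {0x6211: "uni6211"}
MISSING_GLYPH = ".notdef"  # the glyph a font draws for unsupported characters

def glyphs_for(text, fonts):
    """Pair each character with a glyph from the first font that covers it."""
    result = []
    for char in text:
        glyph = MISSING_GLYPH
        for font in fonts:
            if ord(char) in font:
                glyph = font[ord(char)]
                break
        # The original character is kept alongside the glyph, which is why
        # copying an "empty box" from the screen still yields the real text.
        result.append((char, glyph))
    return result

print(glyphs_for("A\u6211\u0baa", [PRIMARY_FONT, FALLBACK_FONT]))
```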
In web development, the web page can either supply or give the location of fonts that should work for the codepoints it uses.
Other programs typically use the operating system's rendering facilities and therefore the fonts available through it. How to install a font in an operating system is not a programming question (unless you are including a font in an installer for your program). For more information on that, you could see if the question fits with the Ask Different (Apple) Stack Exchange site.

How to convert unicode font to ansi

I am trying to use the Bengali writing tool AVRO, but since it uses Unicode by default, it does not work at all with Adobe products such as Photoshop, PageMaker, etc. When I change the mode to ANSI, AVRO works, but I also have to change the font to an ANSI-encoded font (Unicode fonts don't work). I would be grateful if someone could point me to any one of the following:
1) A collection of Bengali ANSI-encoded fonts.
or
2) A method to convert these Unicode fonts to ANSI fonts (if that is possible; I don't actually know).
or
3) A workaround that lets me keep using the Unicode fonts in Adobe products.
There are numerous programs, such as Indica, Ramdhenu, Easy DTP, etc., for typing in PageMaker, Photoshop, CorelDRAW, etc. I use and prefer Indica.

Using unicode / utf-8 in programmers editors

There are a lot of programmers editors that claim to support unicode / utf-8. I've tried a number of them (UltraEdit, jedit, emedit) but none of them tell you how to actually enter unicode characters into a file. Some of them tell you how to change the default file encoding to utf-8 or how to select a font that has good support for utf-8, but not how to enter utf-8 into a file using their editor.
The Go language (and some others) support utf-8 and I like the idea of using the actual utf-8 symbols for variables instead of variables with names like omega. I haven't found a programmers editor yet that actually allows you to do this, though.
The only editor / word processor that I've found that lets you enter Unicode is Microsoft Word. Type the hex code and press Alt+X, and Word converts it. To get the Greek letter omega, type "03c9" followed by Alt+X. UltraEdit will let you paste UTF-8 text copied from a web page, but their docs don't say how to actually enter UTF-8 in a file, and their tech support people don't know either.
This should be simple, but seems to be completely undocumented. Is there some key combination convention that lets you enter Unicode into these editors that supposedly support it, the way that Ctrl-F is widely used for search?
Thanks.
The standard programmer’s editor vim(1) supports limited Unicode input even if your operating system should be too broken to do so (are there any such, still?).
Just enter ^VuXXXX, where XXXX represents exactly four hex digits.
That will allow you to enter the ~6% of Unicode allocated to the Basic Multilingual Plane. For code points beyond it, Vim also accepts ^VUXXXXXXXX, with a capital U and up to eight hex digits.
Otherwise, just use your mouse.
A few techniques I use if an editor is lacking:
Use the Windows charmap.exe utility to select characters and paste into a document.
Install an input method editor (IME) to write in a particular language.
Windows ALT keycodes.
Better to set your keyboard to generate Unicode characters across all Windows applications than to rely on a single application's custom input feature IMO.
Use the EnableHexNumpad feature and you can type any character in the Basic Multilingual Plane using Alt+numpad-plus,hexcode. (May not be of much use on a laptop without a numpad though.)
Or if there are particular characters you want to type a lot, find a keyboard layout that allows you to type them directly. For example eurokb might cover it, or you can make your own with MSKLC.
Old question, but you can type a lot of unicode in GNU Emacs or Vim
GNU Emacs: M-x set-input-method RET tex (or C-x RET C-\ tex) will let you type \omega to generate ω
Vim: Vim digraphs can generate unicode; C-k w * in insert mode gives you ω.
deceze hit the nail on the head. (S)he just didn't elaborate. bobince gave a bit more.
And I'm hazarding a guess that you're a developer or tester working on L10N or I18N. I'm also guessing you need to do more than just a few characters here or there, or you'd be satisfied with pasting from another app. So, I'll share some advice. (Note: here, "you" refers to the next person to look here. I'm sure the original poster doesn't care anymore by now. :-))
If you're on Windows 10, install an appropriate keyboard driver that lets you input the characters you want into any application. I'm sure Linux has support for the same sort of thing.
E.g. I'm teaching myself Hindi (हिंदी), so I installed Windows' Hindi (Devanagari) support. I typed "Hindi" in Hindi using that support, then I switched back to US English to do the rest of this post. If all you need are accented characters from Western European languages, you can install the INTL English support and type directly in español or français or whatever.
Don't look at entering Unicode characters as entering some sort of special data amidst your English text. It's just someone else's language. Use their keyboard. Type their language.
I'm writing a flashcard app to help my learning. I'm using the Hindi keyboard support to type characters into Word, WordPad, Excel, and the Visual Studio editor. And that Hindi keyboard support works exactly the same way in all of those apps, as I'd expect it to work in just about any text editor that supports Unicode. And as you saw above, it also works in a simple text edit control in Chrome. No copy and paste. No remembering special codes. It's as ubiquitous as ctrl-F.
It looks like the unicode support in programmers editors (except for some Microsoft products) is mostly read-only. They can open a file with unicode and display the characters, but typing unicode into a file is a different story. If you want to enter unicode in a programmers editor you can copy it from somewhere else (a web page or Microsoft Word or Notepad) and paste it into the editor, but the editors make typing unicode difficult or impossible.
UltraEdit tech support referred me to this web page which explains a lot. Unfortunately none of the solutions worked with UltraEdit.
Microsoft Word and Notepad support unicode entry. Type the unicode value followed by Alt+X and it converts the hexadecimal and displays it. You can then copy and paste it into UltraEdit or one of the other programmers editors. As others have mentioned unicode support depends on support within the operating system as well as the editor.
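The Alt+X conversion amounts to reading the typed text as a hexadecimal code point. A minimal Python sketch of the same idea follows; the from_hex helper is a name made up for illustration, not part of any editor.

```python
import unicodedata

def from_hex(code):
    """Convert a hex code point string such as "03c9" to its character."""
    return chr(int(code, 16))

omega = from_hex("03c9")
print(f"U+{ord(omega):04X}", unicodedata.name(omega))  # U+03C9 GREEK SMALL LETTER OMEGA

# Code points beyond the Basic Multilingual Plane work the same way.
print(ascii(from_hex("1F600")))  # '\U0001f600'
```

A small script like this can generate characters for pasting into an editor that has no input method of its own.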
What got me interested in using unicode in source code files is Mark Summerfield's book Programming in Go. He includes an example .go file that uses unicode. It would be great to use unicode Greek characters for variable names instead of variables named "omega" or "theta".
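For comparison, Python 3 also accepts non-ASCII letters in identifiers, so the idea can be tried without Go. This is a small sketch with arbitrary example values; it assumes the source file is saved as UTF-8, which is Python 3's default source encoding.

```python
import math

# Greek letters used as ordinary variable names.
f = 50.0             # frequency in hertz (arbitrary example value)
ω = 2 * math.pi * f  # angular frequency in radians per second
θ = ω * 0.01         # phase in radians after 10 ms

print(round(ω, 2), round(θ, 2))  # 314.16 3.14
```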
Using unicode in source code is a bad idea, however. Support for unicode in programmers editors is lousy, and developers would have to save or convert their source code files to utf-8 instead of ASCII. Developer's tools are just not ready to write code in unicode no matter how neat the idea sounds.

Convert non english characters into Unicode (UTF-8)

I copied a large amount of text from another system to my PC. When I viewed the text on my PC, it looked weird, so I copied all the fonts from the other PC and installed them on mine too. Now the text looks okay, but it seems that it is not actually in Unicode. For example, if I copy the text and paste it into another UTF-8-capable editor such as Notepad++, I get only English characters ("bgah;").
How can I convert this whole text into Unicode text like the one below, so that I can copy the text and paste it anywhere else?
பெயர்
The above text was manually obtained using http://www.google.com/transliterate/indic/Tamil
I need this conversion to be done, so I can copy them into database tables.
'Ja-01' is a font with a custom 'visual encoding'.
That is to say, the sequence of characters really is "bgah;" and it only looks like Tamil to you because the font's shapes for the Latin characters bg look like பெ.
This is always to be avoided, because by storing the content as "bgah;" you lose the ability to search and process it as real Tamil, but this approach was common in the pre-Unicode days especially for less-widespread scripts without mature encoding standards. This application probably predates widespread use of TSCII.
Because it is a custom encoding not shared by any other font, it is very unlikely you will be able to find a tool to convert content in this encoding to proper Unicode characters. It does not appear to be any standard character ordering, so you will have to look at the font (eg in charmap.exe) and note down every character, find the matching character in Unicode and map between them.
For example here's a trivial Python script to replace characters in a file:
# Map each character of the font's custom encoding to its Unicode equivalent.
mapping = {
    u'a': u'\u0BAF',  # Tamil letter Ya
    u'b': u'\u0BAA',  # Tamil letter Pa
    u'g': u'\u0BC6',  # Tamil vowel sign E (combining)
    u'h': u'\u0BB0',  # Tamil letter Ra
    u';': u'\u0BCD',  # Tamil sign virama (combining)
    # fill in the rest of the mapping information here!
}

# Read the source text, substitute every mapped character, write UTF-8 output.
with open('ja01data.txt', 'rb') as fp:
    data = fp.read().decode('utf-8')
for char in mapping:
    data = data.replace(char, mapping[char])
with open('utf8data.txt', 'wb') as fp:
    fp.write(data.encode('utf-8'))
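As a quick check of the idea, the same five mappings can be applied to the sample string in memory. This sketch swaps the replace loop for str.translate, which converts every character in a single pass; the table is still only the partial mapping from the script above, and the to_unicode helper is a name chosen for illustration.

```python
# Partial mapping copied from the script above; a real table needs an
# entry for every character the font defines.
MAPPING = {
    "a": "\u0BAF",  # Tamil letter Ya
    "b": "\u0BAA",  # Tamil letter Pa
    "g": "\u0BC6",  # Tamil vowel sign E
    "h": "\u0BB0",  # Tamil letter Ra
    ";": "\u0BCD",  # Tamil sign virama
}
TABLE = str.maketrans(MAPPING)

def to_unicode(text):
    """Replace each mapped character in one pass; others pass through unchanged."""
    return text.translate(TABLE)

converted = to_unicode("bgah;")
print([f"U+{ord(c):04X}" for c in converted])
# ['U+0BAA', 'U+0BC6', 'U+0BAF', 'U+0BB0', 'U+0BCD']
```

A single pass also avoids accidental chained replacements if an output character ever happened to be a key in the table as well.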
The font you found is getting you into trouble. The actual cell text is "bgah;", it gets rendered to பெயர் because you found a font that can work with 8-bit non-Unicode characters. So reading it or pasting it into Notepad++ is going to produce "bgah;" since that's the real text. It can only ever be rendered properly again by forcing the program that displays the string to use that same font.
Ditch the font and enter Unicode so it looks like this:
"bgah" looks like a Baamini based system, which is pre-unicode. It was popular in Canada (and the SL Tamil diaspora in general) in the 90s.
As the others mentioned, it looks like a custom visual encoding that mimics the appearance of a foreign script while keeping ASCII as the underlying encoding.
Google "Baamini to unicode convertor". The University of Colombo seems to have put one up: http://www.ucsc.cmb.ac.lk/ltrl/services/feconverter/?maps=t_b-u.xml
Let me know if this works. If not, I can ask around and get something for you.
You could first check whether the encoding is TSCII, as this sounds most probable. It is an 8-bit encoding, and the fonts you copied are probably based on that encoding. Check out whether the TSCII to UTF-8 converter at SourceForge is suitable. The project there is called “Any Tamil Encoding to Unicode” but they say that only TSCII is supported for now.

Fast, Unicode-capable, cross-platform programmer's text editor that shows invisibles like ZWSP?

Our publishing workflow includes Windows and Linux machines (there are some Macs too, but not in the critical-path workflow). Many texts include both English and Khmer and are marked-up in XML.
XML Copy Editor is the best cross-platform open-source XML editor I've discovered. It utilizes the Scintilla editing component, which is generally good with Unicode but which does not enable non-printing or invisible characters like U+200B (zero-width space) and U+200C (zero-width non-joiner) to be displayed. Khmer does not separate words with a space character as Western languages do, so ZWSP is used in electronic texts to enable applications to break lines easily.
Ideally I'd edit the markup and the content in a single editor, but XML awareness is less important at times than being able to display invisibles. (OpenOffice.org Writer and Microsoft Word are the only two apps I know that will display ZWSP. They are not suitable for the markup and text manipulations that need to be done to prepare manuscripts for publication, unfortunately, although I guess they're fine for authoring.)
I tried out a promising editor last week, but a search-and-replace regex operation that took under a second in TextPad 4.7.3 lasted over twenty seconds. So I want to mention that speed and the ability to handle large (up to 150 MB) files are also concerns.
Is there a good, fast, free or not too expensive text editor, with versions on Windows and Linux and maybe Mac too, that is Unicode-aware and capable of displaying invisibles like ZWSP? One that has syntax highlighting, can handle large files, and is customizable enough that I won't tear my hair out in frustration?
I don't know about ZWSP in particular, but EditPadPro is good, fast, not expensive, has a very good regex engine and is Unicode-aware (and well-suited to editing XML, too). The developer (Jan Goyvaerts) lives in Thailand and knows about requirements for Eastern scripts and languages, so chances are good that it will be able to handle these texts.
EditPad Pro does not (yet) have the ability to visualize non-printable characters other than the ASCII space and tab. Version 6 does recognize ZWSP as a word boundary when doing word wrapping and selecting words by double-clicking or Ctrl+Shift+Left/Right.
What you can do is search for the regular expression \u200B. Though this doesn't make the zero-width space visible, it will select it and put the cursor after it. You could use the regex \u200B\X and turn on the Highlight button on the search panel to highlight each grapheme after U+200B. You could even use the syntax coloring scheme editor to edit the provided XML scheme so that the regex always highlights each grapheme after U+200B.
EditPad Pro easily handles 150 MB files and has a powerful regex engine (same as used in RegexBuddy and PowerGREP). Maximum file size is 2 GB. Windows only.
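Where no editor will display the characters, a small script can at least expose them for inspection. The Python sketch below replaces zero-width characters with bracketed labels, in the same spirit as searching for \u200B; the labels and the function names are made up for illustration.

```python
import re

# Invisible characters to expose, with hypothetical bracketed labels.
INVISIBLES = {
    "\u200b": "[ZWSP]",  # zero-width space
    "\u200c": "[ZWNJ]",  # zero-width non-joiner
    "\u200d": "[ZWJ]",   # zero-width joiner
}
PATTERN = re.compile("[\u200b\u200c\u200d]")

def show_invisibles(text):
    """Return a copy of text with each invisible character replaced by its label."""
    return PATTERN.sub(lambda match: INVISIBLES[match.group()], text)

def count_invisibles(text):
    """Count how often each invisible character occurs in text."""
    return {label: text.count(char) for char, label in INVISIBLES.items()}

sample = "ab\u200bcd\u200cef"
print(show_invisibles(sample))   # ab[ZWSP]cd[ZWNJ]ef
print(count_invisibles(sample))  # {'[ZWSP]': 1, '[ZWNJ]': 1, '[ZWJ]': 0}
```

The output is for viewing only; writing it back over the source file would destroy the real zero-width characters.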
I'm using CKEditor; it's cross-platform and completely supports Unicode.
Take a look at it.