i18n / Markdown - Does Markdown support internationalization? - unicode

I'm building a CMS which needs to manage content in English, Chinese, and Spanish at a minimum. Do most Markdown implementations handle UTF-8 encoded text? Is the Markdown language designed to be used with non-English languages? I'm currently using Markdown Extra by Michel Fortin.

Markdown is language-neutral, so you can use it easily: it does not care about the encoding as long as ASCII is a subset of it.
However, I would not recommend using SmartyPants, as it is designed for English, so you may get incorrect quotes in your language.
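The "ASCII is a subset" condition can be checked directly for UTF-8. The following Python sketch (the sample string and the helper name are made up for illustration) shows that Markdown's punctuation encodes to the same bytes in ASCII and UTF-8, and that every byte of a multi-byte character is 0x80 or above, so it cannot be mistaken for Markdown syntax.

```python
# Markdown syntax characters are plain ASCII punctuation.
MARKDOWN_PUNCTUATION = "#*_`[]()>!-"

def non_ascii_bytes(text):
    """Return the UTF-8 bytes used by the non-ASCII characters in text."""
    return [b for ch in text if ord(ch) > 0x7F for b in ch.encode("utf-8")]

# ASCII text encodes to identical bytes in ASCII and in UTF-8.
same_bytes = MARKDOWN_PUNCTUATION.encode("utf-8") == MARKDOWN_PUNCTUATION.encode("ascii")

# Hypothetical sample mixing Markdown markup with Spanish and Chinese text.
sample = "# Título: *中文* and `código`"

# Bytes belonging to non-ASCII characters never fall in the ASCII range.
high_bytes_only = all(b >= 0x80 for b in non_ascii_bytes(sample))

print(same_bytes, high_bytes_only)
```

This only says that a byte-oriented parser will not misread UTF-8 text; it says nothing about typographic extras such as English-style quote conversion.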

Michel Fortin is a French speaker, and I use his library (the Extra one) with special characters like é and à without any problem.

Related

How can I create an iTextSharp PDF in a Hindi font?

I am trying to build a desktop application for Hindi PDFs in C#, but the Unicode encoding is not well supported. Any idea how to fix this?
using System.IO;            // Path
using iTextSharp.text.pdf;  // BaseFont
// Load Arial Unicode MS from the system fonts folder and embed it with Identity-H encoding.
string ARIALUNI_TTF = Path.Combine(Environment.GetFolderPath(Environment.SpecialFolder.Fonts), "ARIALUNI.TTF");
BaseFont bf = BaseFont.CreateFont(ARIALUNI_TTF, BaseFont.IDENTITY_H, BaseFont.EMBEDDED);
iTextSharp.text.Font font = new iTextSharp.text.Font(bf, 8, iTextSharp.text.Font.NORMAL);
Will IDENTITY_H give support for Hindi encoding?
Hindi is not supported yet. A font like mangal.ttf, which supports the Devanagari script, will render the glyphs in iTextSharp but not the ligatures. Work is being done on the Indic front, not only for Hindi support but also for Telugu, Gujarati and others.
You basically require support for Asian characters. A similar thread can be found on Stack Overflow. The implementation revolves around the use of BaseFont (its createFont method), which specifies the font and the appropriate encoding. You can find an example on the official iText site. Note that the example is in Java, but the same implementation is available in .NET as well.

How will search engines react to different unicode?

I am developing a website in the Georgian language. The Georgian alphabet has its own Unicode range, but there are also special fonts which have Georgian glyphs in place of English characters, a bit like the "Symbol" and "Dingbats" fonts.
For example the string "saqarTvelo" will be rendered as "საქართველო" with these fonts. So now I have two options and don't know what to do:
Using Georgian Unicode for my website, but the problem is that all fonts are created for English Unicode, and don't work with Georgian Unicode.
Using Georgian fonts with English Unicode. But I don't know how search engines will react.
Please tell me what to do, I am stuck!
The short answer is that using the approach you mean in option 2, search engines will see the word “საქართველო” in your text as “saqarTvelo”, so normal searches will fail.
The question seems to refer to two different ways of using Georgian letters on web pages:
Using Unicode encoding, so that characters will be rendered using a Unicode-encoded font (which is what most fonts are, but most fonts don’t contain Georgian letters).
Using a nonstandard, “private” encoding, usually one that maps 256 different code positions (8-bit combinations) to whatever characters are needed for some purposes. This presumes that the text is rendered using a font encoded the same way.
Method 2 can be characterized as a wrong approach, but it has been used on the web since the early days (even when CSS was not available and one had to resort to <font face=...> for setting font), and especially in the early days. It really does not work unless the user’s computer has the specific, “privately” encoded font (or some font encoded exactly the same way). Since search engines are font-agnostic, they only see the 8-bit codes and try to interpret them in the encoding declared or implied for the page, not in the “private” encoding (which cannot be declared since it has no published definition and no standard name, or any name for that matter).
Method 1 has the problem that for it to work, the user’s computer needs to have some Unicode-encoded font that supports the characters used. Nowadays, this can be reasonably well solved using a downloadable font (web font) via @font-face. Fonts that support Georgian letters include some useful free fonts like DejaVu fonts, GNU FreeFont fonts, and Quivira. For more info on this approach, see my Guide to using special characters in HTML.
Using method 1, search engines will see the Georgian letters correctly, provided that the document’s encoding (normally UTF-8) has been properly declared or can be inferred by the search engine.
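The difference between the two methods is visible in the code points themselves, which is all a search engine gets to see. Here is a minimal Python sketch; the mapping table is an assumption inferred only from the single “saqarTvelo” example in the question, and a real legacy font would need its full table, which has no published definition.

```python
import unicodedata

# Partial, hypothetical mapping inferred from the "saqarTvelo" example only.
LEGACY_TO_UNICODE = {
    "s": "\u10e1",  # GEORGIAN LETTER SAN
    "a": "\u10d0",  # GEORGIAN LETTER AN
    "q": "\u10e5",  # GEORGIAN LETTER KHAR
    "r": "\u10e0",  # GEORGIAN LETTER RAE
    "T": "\u10d7",  # GEORGIAN LETTER TAN
    "v": "\u10d5",  # GEORGIAN LETTER VIN
    "e": "\u10d4",  # GEORGIAN LETTER EN
    "l": "\u10da",  # GEORGIAN LETTER LAS
    "o": "\u10dd",  # GEORGIAN LETTER ON
}

def legacy_to_unicode(text):
    """Map font-encoded Latin letters to Georgian code points; leave others alone."""
    return "".join(LEGACY_TO_UNICODE.get(ch, ch) for ch in text)

def in_georgian_block(text):
    """True if every character lies in the Georgian block U+10A0..U+10FF."""
    return all(0x10A0 <= ord(ch) <= 0x10FF for ch in text)

legacy = "saqarTvelo"
converted = legacy_to_unicode(legacy)

print(in_georgian_block(legacy))       # what a font-agnostic reader sees: Latin letters
print(in_georgian_block(converted))    # real Georgian letters
print(unicodedata.name(converted[0]))  # Unicode name of the first converted letter
```

A conversion of this kind is one way to migrate existing method 2 content to method 1, provided the complete table for the specific font is known.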

PDFTable Unicode support

I'm using PDFTable from http://www.vanxuan.net/tool/pdftable/ which is based on FPDF class. I managed to export HTML table to pdf using PDFTable. However, I'm facing one issue. The non-English characters are all displayed in gibberish. It doesn't seem that it supports unicode. The language I'm trying to display is Arabic and Russian.
I could, theoretically, create a similar class to PDFTable, which is inherited from FPDF, and develop it from scratch to add unicode support. But it's a lot of work. Has anyone done something like that and perhaps could share? Thank you!
For Unicode support, the best way is to use tFPDF from http://www.fpdf.org/en/script/script92.php. It's a fork of FPDF made specifically to support Unicode. The class is based on the latest FPDF version, 1.7.

Any standard for Unicode font support expected of all browsers?

Is there a standard governing Unicode font support expected of all browsers?
The latest version of Unicode contains a repertoire of more than 110,000 characters covering 100 scripts. I don't expect the browsers to support all of them, but there should be minimum support for some characters such as letters from the Latin script, common punctuation, and mathematical, currency, and other symbols.
I am currently having a problem displaying U+060B AFGHANI SIGN (؋) and U+202F NARROW NO-BREAK SPACE in the Android browser. I wonder if there is a list of universally recognized Unicode characters so that developers can use them confidently without having to worry about browser display issues.
There is no standard on Unicode support in browsers. Besides, the ability to display a character mostly depends on fonts, though browsers differ in their abilities in scanning through fonts. Normally what you can do is to specify a suitable font-family list of fonts that each support all the characters you need. For generalities on this, see my Guide to using special characters in HTML.
On Android, the problem is that there is a very limited set of fonts. If you need any characters beyond what is supported by them, you need to use a downloadable font, via @font-face.
The currency symbol “؋” U+060B AFGHANI SIGN is present in about a dozen fonts, but the only free font among them (if we don’t count the bitmap font GNU Unifont) appears to be Scheherazade.
For U+202F NARROW NO-BREAK SPACE, font support is wider. But in general, it is often better to use other methods than such characters. Many fonts contain this character as almost as wide as a normal space, and its description in the Unicode standard as regards its width is vague: “a narrow form of a no-break space, typically the width of a thin space or a mid space”. “Thin space” is described as “a fifth of an em (or sometimes a sixth)” in the Unicode standard, and in reality its width varies. And “mid space” is really an undefined concept.
For example, if the text is in a language that uses spaces as thousands separators, you could in principle write a number like 100 000 as 100&#x202F;000, but it’s better to write, say,
<span class="gr">100 000</span>
with CSS code like .gr { word-spacing: -0.15em }.
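Both variants can be generated programmatically. The Python sketch below is only an illustration: the function names are made up, and the default class name "gr" is simply the one used in the markup above.

```python
import unicodedata

NNBSP = "\u202f"  # U+202F NARROW NO-BREAK SPACE

def group_thousands(n, sep=NNBSP):
    """Format an integer with `sep` between groups of three digits."""
    return f"{n:,}".replace(",", sep)

def as_html_span(n, css_class="gr"):
    """The alternative from the answer: ordinary spaces inside a styled span."""
    return f'<span class="{css_class}">{group_thousands(n, " ")}</span>'

print(unicodedata.name(NNBSP))
print(repr(group_thousands(100000)))  # relies on the font's U+202F glyph
print(as_html_span(100000))           # relies on CSS word-spacing instead
```

The second form leaves the spacing to the stylesheet, so it does not depend on how a particular font draws U+202F.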
AFAIK, all browsers support @font-face for loading webfonts and can support any character within those fonts. As such, you should be able to display any character in any browser if you make sure you provide access to a webfont with support for those characters.
To avoid using giant fonts just to support a few special characters, you can create your own fonts with tools like the Icomoon App.
I used the Icomoon App to create the Emoji emoticon font as well as for creating custom icon fonts on a per project basis.
For more info on the use or creation of icon fonts (or other webfonts), see Create webfont with Unicode Supplementary Multilingual Plane symbols

How to display emoji char in HTML

I saved the "face savouring delicious food" emoji to a database and read it back in PHP, where json_encode shows it as "\uD83D\uDE0B". Usually we use an <img /> tag to replace it.
However, I usually only find the format '\uE056', not "\uD83D\uDE0B", to replace with the picture E056.png.
I don't know how to get the picture according to '\uD83D\uDE0B'. Does anyone know?
What is the relation between '\uD83D\uDE0B' and '\uE056'? Do they both represent the emoji "face savouring delicious food"?
The Unicode character U+1F60B FACE SAVOURING DELICIOUS FOOD is a so-called Plane 1 character, which means that its UTF-16 encoded form consists of two 16-bit code units, namely 0xD83D 0xDE0B. Generally, Plane 1 characters cause considerable problems because many programs are not prepared to deal with them, and few fonts contain them.
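The relation between the code point and its two code units is simple arithmetic. The following Python sketch (the function names are illustrative) shows the standard UTF-16 surrogate calculation in both directions, and that a JSON parser performs the same recombination on the escaped pair.

```python
import json

def to_surrogate_pair(code_point):
    """Split a supplementary-plane code point into UTF-16 high/low surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only code points above U+FFFF need a surrogate pair")
    offset = code_point - 0x10000
    return 0xD800 + (offset >> 10), 0xDC00 + (offset & 0x3FF)

def from_surrogate_pair(high, low):
    """Recombine a high/low surrogate pair into a single code point."""
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)

high, low = to_surrogate_pair(0x1F60B)
print(f"{high:04X} {low:04X}")

# A JSON parser does the same recombination when it reads the escaped pair.
decoded = json.loads('"\\uD83D\\uDE0B"')
print(f"U+{ord(decoded):X}")
```

Once the pair is recombined, the single code point can serve as a lookup key, for instance for an image file name, if a replacement picture is wanted.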
According to http://www.fileformat.info/info/unicode/char/1f60b/fontsupport.htm this particular character only exists in DejaVu fonts and in Symbola, but the versions of DejaVu I’m using don’t contain it.
Instead of dealing with the problems of encodings (which are not that difficult, but require extra information), you can use the character reference &#x1F60B; in HTML. But this does not solve the font problem. I don’t know about iPhone fonts, but in general in web browsing, the odds of a computer having any font capable of rendering the character are probably less than 1%. So you may need to use downloadable fonts. Using an image is obviously much simpler and mostly more reliable.
U+E056 is a Private Use codepoint, which means that anybody can make an agreement about its meaning with his brother or with himself, without asking anyone else. A font designer may assign any glyph to it.
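Whether a character is Private Use or a standardized one can be checked from the Unicode character database, and the HTML character reference can be built from the code point. A minimal Python sketch follows; the helper names are made up for illustration.

```python
import html
import unicodedata

def is_private_use(ch):
    """True if the character's general category is Co (private use)."""
    return unicodedata.category(ch) == "Co"

def char_reference(ch):
    """Return the hexadecimal HTML character reference for one character."""
    return f"&#x{ord(ch):X};"

private_use_char = "\ue056"  # the private-use code point from the question
standard_char = "\U0001F60B"  # U+1F60B FACE SAVOURING DELICIOUS FOOD

print(is_private_use(private_use_char), is_private_use(standard_char))
print(char_reference(standard_char))

# The reference decodes back to the same single character.
print(html.unescape(char_reference(standard_char)) == standard_char)
```

Because U+E056 has no standard meaning, there is no general formula relating it to U+1F60B; any conversion between the two needs a lookup table for the specific private assignment in use.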
IMPORTANT: As of this posting, the only browser that doesn't automatically support emoji is Chrome.
FOR CHROME:
Depending on what server-side language you are using, you should be able to find a library that converts emoji for you. I recently needed to solve this issue with PHP and used this library:
https://github.com/iamcal/php-emoji
The creator essentially created a sprite and adjusts the CSS according to the Unicode code point of the emoji. It isn't pretty, but luckily he/she did all the grunt work for you. If you're using a different language you should be able to find something similar.
how do I put those little boxes into a php file?
Same way as any other Unicode character. Just paste them and make sure you're saving the PHP file and serving the PHP page as UTF-8.
When I put it into a php file, it turns into question marks and what not
Then you have an encoding problem. Work it out with Unicode characters you can actually see properly first, for example ąαд™日本, before worrying about the emoji.
Your PHP file should be saved as UTF-8; the page it produces should be served as Content-Type: text/html;charset=UTF-8 (or with a similar meta tag); the MySQL database should be using a UTF-8 collation to store data and PHP should be talking to MySQL using UTF-8.
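What goes wrong when one link in that chain uses a different encoding can be reproduced in a few lines. This is a Python sketch rather than PHP, purely to illustrate the two typical symptoms (question marks and garbled characters) with the test string suggested above.

```python
text = "ąαд™日本"

# Correct round trip: encode and decode with the same encoding.
utf8_bytes = text.encode("utf-8")
round_trip = utf8_bytes.decode("utf-8")

# The target encoding cannot represent the characters: they turn into question marks.
question_marks = text.encode("ascii", errors="replace").decode("ascii")

# UTF-8 bytes misread as Latin-1: garbled text instead of the original characters.
mojibake = utf8_bytes.decode("latin-1")

# The header the page should be served with (note the "=" after charset).
content_type = "Content-Type: text/html; charset=UTF-8"

print(round_trip == text)
print(question_marks)
print(mojibake != text)
print(content_type)
```

Question marks usually mean a lossy conversion has already destroyed the data, whereas garbled characters usually mean the bytes are intact but are being interpreted in the wrong encoding.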
However, even when everything is handled correctly like this, PCs will still not show the emoji. That's because:
they don't have fonts that include shapes for those characters, and
emoji are still completely unstandardised. Those characters you posted are in the Unicode Private Use Area, which means they don't have any official meaning at all.
Each network in Japan uses different character codes for their emoji, mapped to different areas in the PUA. So even on another mobile phone, it probably won't display the correct character, unless you spend ages manually converting emoji codes for different networks. I'm guessing the ones you posted above are from SoftBank (iPhone?).
There is an ongoing proposal led by Google and Apple to collate the different networks' emoji and give them a proper standardised place in Unicode. Until then, getting emoji to display consistently across networks is an exercise in unhappiness. See the character overview from the standardisation work to see how much converting you would have to do.
God, I hate emoji. All that pain for such a load of useless twee rubbish.