How are encodings and charset tables organized in a CFF font file?

Chapters 11 to 13 of the CFF specification give a rough description of how encoding and charset data are organized in a file. Here are some questions.
Considering the possible existence of multi-font files, and the fact that charstrings are accessed on a per-font basis, the corresponding index should also be meaningful only within each font. However, is there at most one encoding and one charset table in the file? If so, how do the glyph indices correspond to those of the charstrings? If not, do the tables appear multiple times, referenced from each font's TopDict?
(Resolved. See answer below.)
It seems that charsets give a name to each glyph. What about encodings? What is the Card8 datum stored in each array element? Given its limit of 256, isn't the encoding very constrained? Why, in the supplement format, does the data come via SID? What is the intended way to access glyphs through encodings (in a hybrid string/code fashion)? And why do these data become strings again when it comes to predefined encodings?
Thanks

Here is an answer to question 1:
It is a mistake to think that there is only one Top DICT in a single font file. The Top DICT data is an INDEX structure, which may contain one top-level DICT for each font in the FontSet. The definitions of encoding and charset are therefore naturally per-font. It is a little confusing that in the specification's Data Layout, Name and Top DICT are not marked "per-font". See Section 8:
This contains the top-level DICTs of all the fonts in the FontSet
stored in an INDEX structure. Objects contained within this INDEX
correspond to those in the Name INDEX in both order and number. Each
object is a DICT structure that corresponds to the top-level
dictionary of a PostScript font. A font is identified by an entry in
the Name INDEX and its data is accessed via the corresponding Top
DICT.
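The per-font layout follows from the INDEX structure itself: a Card16 count, a Card8 offset size, count+1 one-based offsets, then the concatenated object data. Here is a minimal sketch of parsing one INDEX, written from the spec's description (the function name and the toy input are my own, for illustration only):

```python
import struct

def read_index(data: bytes, pos: int):
    """Parse one CFF INDEX at `pos`; return (objects, position after the INDEX).
    Sketch based on the CFF spec's INDEX layout, not production code."""
    (count,) = struct.unpack_from(">H", data, pos)  # Card16 object count
    pos += 2
    if count == 0:                # an empty INDEX is just the 2-byte count
        return [], pos
    off_size = data[pos]          # Card8: bytes per offset (1..4)
    pos += 1
    offsets = []
    for _ in range(count + 1):    # count+1 offsets, 1-based, big-endian
        offsets.append(int.from_bytes(data[pos:pos + off_size], "big"))
        pos += off_size
    base = pos - 1                # offsets are relative to the byte before the data
    objects = [data[base + offsets[i]:base + offsets[i + 1]]
               for i in range(count)]
    return objects, base + offsets[count]

# Toy INDEX: count=2, offSize=1, offsets 1,3,6, data "ab"+"cde"
objs, _ = read_index(b"\x00\x02\x01\x01\x03\x06abcde", 0)
print(objs)  # [b'ab', b'cde']
```

Since the Name INDEX and the Top DICT INDEX correspond in order and number, reading both with a routine like this pairs each font name with its own Top DICT, and hence its own charset/encoding offsets.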

I can respond from a practical point-of-view:
About #1, the question appears to be about multiple CFF FontSets, which is a non-starter in the context of OpenType/CFF fonts: only one CFF FontSet is allowed/recognized.
About #2, the encoding issue is also somewhat of a non-starter in the context of an OpenType/CFF font in that the encoding is imposed by the 'cmap' table.
In summary, stand-alone CFFs are effectively worthless, which nullifies any actual or perceived benefits of multiple CFF FontSets and built-in CFF encodings.

Related

How to save the text on a web page in a particular encoding?

I read the following passage at the link:
Content authors need to find out how to declare the character encoding
used for the document format they are working with.
Note that just declaring a different encoding in your page won't
change the bytes; you need to save the text in that encoding too.
To my knowledge, the characters of the text are stored in the computer as one or more bytes each, irrespective of the 'character encoding' declared in the web page.
I understand the quoted text above, except for the sentence in bold:
you need to save the text in that encoding too
What does this sentence mean?
Is it saying that the content author/developer has to manually save the same text (which is already stored on the computer as one or more bytes) in the encoding he or she declared? If yes, how is that done, and why is it needed? If no, what does this sentence actually mean?
When you make a web page publicly available, in the most basic sense you make a text file (located on hardware you own) public: when a certain address is requested, you return this file. The file may be saved on your local hardware or generated on the fly (dynamic content). Either way, the user accessing your web page is handed a file.
Once the user has possession of the file, he should be able to read it, and that is where the encoding comes into play. Given a raw binary file, you can only guess what it contains and what encoding it is in, so most web pages state the encoding they deliver the file in alongside the file itself. This is where the bold text relates to my answer: if you declare one encoding alongside the file but deliver the file in another (say, declare UTF-8 but deliver Windows-1252), the user may see only parts of the text, or none of it correctly. And if you serve a static file, it should actually be saved in the correct encoding (the one you said the file would be in).
As for the question of how to store it: that is highly specific to the way you provide the file. Most text editors offer a way to save a file in a specific encoding, and most tools that produce page content provide convenient ways to emit the file in a form that is easy for the user's browser to decode.
It is just a note, probably added because some users were confused.
The text tells us that one should declare, in some form, the encoding of the file. This is straightforward: a web server usually cannot know the encoding of a file. (If pages are delivered from, e.g., a database, the encoding could be implicit, but the web treats files as first-class citizens, so we still need to declare an encoding.)
The note just clarifies that changing the declared encoding does not cause the page to be transcoded by the browser. The page remains byte-for-byte the same; clients (browsers) will simply misinterpret the content. So if you want to change the encoding, you must declare the new encoding, but also save (or convert) the file to that encoding. No magic is done (usually) by web servers.
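The point that a declaration alone cannot change the stored bytes is easy to demonstrate. The snippet below (my own illustration, not from the question) encodes the same text two ways and then misreads one as the other:

```python
# The same characters produce different bytes under different encodings,
# so a declaration alone cannot change what is stored on disk.
text = "café"
utf8_bytes = text.encode("utf-8")
latin1_bytes = text.encode("latin-1")
print(utf8_bytes)    # b'caf\xc3\xa9'  (é is two bytes in UTF-8)
print(latin1_bytes)  # b'caf\xe9'      (é is one byte in Latin-1)

# Reading UTF-8 bytes as Latin-1 raises no error; it silently
# produces mojibake, which is exactly what a browser does when
# the declared encoding does not match the saved bytes:
print(utf8_bytes.decode("latin-1"))  # 'cafÃ©'
```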
There is no text but encoded text.
The fundamental rule of character encodings is that the reader must use the same encoding as the writer. That requires communication, conventions, specifications or standards to establish an agreement.
"Is it saying that the content author/developer has to manually save the same text(which is already stored in the computer as one or more bytes) in the encoding specified by him/her? If yes, how to do it and why it is needed to do?"
Yes, it is always the case that a character encoding is chosen for every text file. Obviously, if the file already exists, it is probably best not to change the encoding. You do it via an editor option (try the Save As… dialog or equivalent) or via some library property or configuration.
"save the text in that encoding too"
Actually, it's usually the other way around. You decide on the encoding you want or need to use, and the HTML editor or library updates the contents with a matching declaration and any newly necessary character references (e.g., does 🚲 need to be written as &#x1F6B2;? Does ¡ need to be written as &iexcl;?) as it writes or streams the document. (If your editor doesn't do that, get a real HTML editor.)
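This substitution is mechanical enough that encoders do it for you. Python's codecs expose it as the `xmlcharrefreplace` error handler: characters that do not fit the chosen encoding are emitted as (decimal) character references automatically.

```python
# Encode to ASCII; anything non-ASCII becomes a numeric character
# reference, which is how an HTML/XML writer keeps the document valid
# in a narrow encoding.  (Python emits decimal, not hex, references.)
text = "¡Hola! 🚲"
ascii_bytes = text.encode("ascii", errors="xmlcharrefreplace")
print(ascii_bytes)  # b'&#161;Hola! &#128690;'
```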

How do I create a character set like ASCII?

I'm curious about the way it was implemented in the past, and I want information about how I can implement a character set of my own.
ASCII (American Standard Code for Information Interchange) was the "original" character set and remains the basis for most text data. ASCII is actually a 7-bit code (numeric values range from 0 to 127); in the 8-bit extensions that followed, the most significant bit of a byte indicated whether the rest of the byte was plain ASCII (if zero) or a character from the current codepage.
Extra (non-ASCII) characters were added via these codepages, and the user's computer would load a specific codepage to use. Unfortunately, this meant you needed to load the correct codepage before viewing a file, or the wrong characters would appear.
We have since moved on, and most systems use Unicode, whose encodings (such as UTF-8) use a variable number of bytes per character rather than the single-byte characters used previously. Unicode contains many thousands of characters, allowing a single encoding to cover what would have been multiple codepages under the old ASCII+codepage method.
That's the brief history. As to how to create your own character set, I'm not sure what you are trying to achieve. You can create your own fonts, but if you're talking about an actual character set (i.e., characters that do not already exist), then you'll have to get it added to a standard such as Unicode so that other computers can make use of your new characters, which would be a considerable amount of work (and I have no idea how you'd even go about it). It's worth considering, however, that almost every character in existence is already in Unicode, so you may want to review what's already been done before taking on a mammoth undertaking such as creating an entirely new character set.
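The codepage ambiguity described above is easy to reproduce today, since modern systems still ship the old tables. One byte above 127 means different things depending on which codepage the reader loads (the byte value here is my own pick for illustration):

```python
# The single byte 0x82 under two historical codepages:
raw = bytes([0x82])
print(raw.decode("cp437"))   # 'é'  (IBM PC / original DOS codepage)
print(raw.decode("cp1252"))  # '‚'  (Windows Western codepage)
# Same bytes, different codepage loaded => different character displayed,
# which is exactly the problem Unicode was designed to eliminate.
```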

Character Sets, Locales, Fonts and Code Pages?

I can't figure out the relation between those terms. I actually need a brief explanation for each and eventually a relation between them.
Moreover, where do all these stuff reside? Where are they implemented? Is it the job of the operating system to manage the aforementioned terms? If not, then who is responsible for this job?
Character Sets describe the relationship between character codes and characters.
As an example, all (extended) ASCII character sets assign 0x41 (65 decimal) to 'A'.
Common character sets are ASCII, Unicode (with its encodings UTF-8 and UTF-16), Latin-1, and Windows-1252.
Code Pages are a representation of character sets, or rather a mechanism for selecting which character set is in use: there are the old DOS/computer-manufacturer codepages and the legacy support for them in Windows, the ANSI codepage (ANSI is not to blame for the name) and the OEM codepage.
If you have a choice, avoid them like the plague and go for Unicode, preferably UTF-8, though UTF-16 is an acceptable choice for the OS-facing part of a Windows program.
Locales are collections of all the information needed to conform to local conventions for displaying information. On systems which neither use code pages nor have a system-defined universal character set, they also determine the character set used (e.g., Unix-like systems).
Fonts are the graphics and ancillary information necessary for displaying text in a known encoding. Examples are "Times New Roman", "Verdana", "Arial" and "WingDings".
Not all fonts have glyphs for all characters present in any given character set.
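The distinction between a character set (abstract code points) and an encoding (concrete bytes) can be sketched in a few lines; the examples are my own:

```python
# One abstract character, one code point, several byte representations.
assert ord("A") == 0x41                 # ASCII, Latin-1 and Unicode agree on 'A'
print("A".encode("ascii"))      # b'A'
print("A".encode("utf-16-le"))  # b'A\x00'  (two bytes per BMP character)

# The euro sign shows codepage vs. Unicode encoding differences:
print("€".encode("cp1252"))     # b'\x80'        (Windows-1252 codepage slot)
print("€".encode("utf-8"))      # b'\xe2\x82\xac' (three UTF-8 bytes)
```

A locale then decides how such characters are sorted and formatted, and a font decides what glyph is drawn for the code point; neither changes the bytes.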

Encoding special chars in XSLT output

I have built a set of scripts, part of which transform XML documents from one vocabulary to a subset of the document in another vocabulary.
For reasons that are opaque to me, but apparently non-negotiable, the target platform (Java-based) requires the output document to have 'encoding="UTF-8"' in the XML declaration, but some special characters within text nodes must be encoded with their hex Unicode value - e.g. '”' must be replaced with '&#x201d;' and so forth. I have not been able to acquire a definitive list of which chars must be encoded, but it does not appear to be as simple as "all non-ASCII".
Currently, I have a horrid mess of VBScript using ADODB to directly check each line of the output file after processing, and replace characters where necessary. This is painfully slow, and unsurprisingly some characters get missed (and are consequently nuked by the target platform).
While I could waste time "refining" the VBScript, the long-term aim is to get rid of that entirely, and I'm sure there must be a faster and more accurate way of achieving this, ideally within the XSLT stage itself.
Can anyone suggest any fruitful avenues of investigation?
(edit: I'm not convinced that character maps are the answer - I've looked at them before, and unless I'm mistaken, since my input could conceivably contain any unicode character, I would need to have a map containing all of them except the ones I don't want encoded...)
<xsl:output encoding="us-ascii"/>
Tells the serialiser that it has to produce ASCII-compatible output. That should force it to produce character references for all non-ASCII characters in text content and attribute values. (Should there be non-ASCII in other places like tag or attribute names, serialisation will fail.)
Well, with XSLT 2.0 (which you have tagged your post with) you can use a character map; see http://www.w3.org/TR/xslt20/#character-maps.

How to discover what codepage to use when converting RTF hex literals to Unicode

I'm parsing RTF 1.5+ files generated by Word 2003+ that may have content from other languages. This content is usually encoded as hex literals (\'xx). I would like to convert these literals to unicode values.
I know my document's code page by looking for ansicpg (\ansi\ansicpg1252).
When I use the ansicpg codepage to decode to Unicode, many languages (like French) seem to convert to the Unicode char values that I expect.
However, when I see Russian text (like below), codepage 1252 decodes the content to gibberish.
\f277\lang1049\langfe1033\langnp1049\insrsid5989826\charrsid6817286
\'d1\'f2\'f0\'e0\'ed\'e8\'f6\'fb \'e1\'e5\'e7 \'ed\'e0\'e7\'e2\'e0\'ed\'e8\'ff. \'dd\'f2
\'e0 \'f1\'f2\'f0\'e0\'ed\'e8\'f6\'e0 \'ed\'e5 \'e4\'ee\'eb\'e6\'ed\'e0
\'ee\'f2\'ee\'e1\'f0\'e0\'e6\'e0\'f2\'fc\'f1\'ff \'e2 \'f2\'e0\'e1\'eb\'e8\'f6\'e5
\'e2 \'f1\'ee\'e4\'e5\'f0\'e6\'e0\'ed\'e8\'e8.
I assume that lang1049, langfe1033, langnp1049 should provide me clues so I can programmatically choose a different (non-default) code page for the text that they reference? If so, where can I find information that explains how to map a lang* code to a codepage? Or should I be looking for some other RTF command/directive to provide me with the information I'm looking for? (Or must I use \f277 as a font reference and see if it has an associated codepage?)
\lang really only marks up particular stretches of the text as being in a particular language, and shouldn't impact what code page is to be used for the old non-Unicode \' escapes.
Putting an \ansicpg token in the header should perhaps do it, but it seems to be ignored by Word (for both raw bytes and \' escapes).
Or must I use \f277 as a font reference and see if it has an associated codepage?
It looks that way. Changing the \fcharset of the font assigned to a particular stretch of text is the only way I can get Word to change how it treats the bytes, anyway. The codes in this token (see eg here for list) are, aggravatingly, different again from either the language ID or the code page number.
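As a sanity check that the sample text really is Windows-1251 (the Russian ANSI codepage, which \fcharset 204 maps to), the \'xx literals can be collected and decoded directly. The helper below is my own illustration, not part of any RTF library:

```python
import re

def decode_rtf_hex(rtf_fragment: str, codepage: str) -> str:
    """Collect the \\'xx hex literals in an RTF fragment and decode
    them with the given codepage (illustrative helper; it ignores
    literal spaces and other control words in the fragment)."""
    raw = bytes(int(h, 16)
                for h in re.findall(r"\\'([0-9a-fA-F]{2})", rtf_fragment))
    return raw.decode(codepage)

fragment = r"\'d1\'f2\'f0\'e0\'ed\'e8\'f6\'fb"   # first word of the sample
print(decode_rtf_hex(fragment, "cp1252"))  # 'Ñòðàíèöû'  (the gibberish seen)
print(decode_rtf_hex(fragment, "cp1251"))  # 'Страницы'  ('Pages' in Russian)
```

So the practical recipe is: resolve \f277 to a font table entry, read its \fcharset, map that charset number to a codepage (204 → 1251), and decode the \' bytes with it.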
It is not well documented, but according to MSDN you can use the RichEdit control to convert the RTF to UTF-8 format:
http://msdn.microsoft.com/en-us/library/windows/desktop/bb774304(v=vs.85).aspx
Take a look at SF_USECODEPAGE for the EM_STREAMOUT message.