Is there any way to use Unicode strings (most probably in UTF-8, but it could be any encoding) in PostScript?
So far, I've been using this function to transform fonts to Latin-1 encoding:
/latinize {                 % /NewName /FontName latinize -
  findfont                  % look up the original font dictionary
  dup length dict begin     % create a writable copy of the same size
    % copy every entry except /FID (the font identifier)
    { 1 index /FID ne {def} {pop pop} ifelse } forall
    /Encoding ISOLatin1Encoding def   % swap in the Latin-1 encoding vector
    currentdict
  end
  definefont pop            % register the re-encoded font under /NewName
} bind def
/HelveLat /Helvetica latinize
/HelveLatbold /Helvetica-Bold latinize
but I really don't like it.
Not really, or at least not in any simple "out of the box" way. See this FAQ entry for details.
This may or may not fit your bill, but the interpreter that I wrote (xpost) uses Cairo for all its graphics and font functions, including show. So whatever support Cairo has to offer, xpost doesn't get in the way. But before you get too excited, it's a one-man project and doesn't quite offer full Level-1 PostScript yet.
Edit: The newest version does not support this. Here is the last version that did (listing).
Here's my C code for the show operator itself.
OPFN_ void show(state *st, object s) {
    char str[s.u.c.n + 1];              /* VLA sized to the string plus NUL */
    memcpy(str, STR(s), s.u.c.n);
    str[s.u.c.n] = '\0';                /* NUL-terminate for Cairo */
    //printf("showing (%s)\n", str);
    if (st->cr) {
        cairo_show_text(st->cr, str);   /* Cairo interprets str as UTF-8 */
        cairo_surface_flush(st->surface);
        XFlush(st->dis);                /* push the drawing to the X display */
    }
}
And from the Cairo docs:
cairo_show_text ()

void
cairo_show_text (cairo_t *cr,
                 const char *utf8);
A drawing operator that generates the shape from a string of UTF-8 characters, rendered according to the current font_face, font_size (font_matrix), and font_options.
This function first computes a set of glyphs for the string of text. The first glyph is placed so that its origin is at the current point. The origin of each subsequent glyph is offset from that of the previous glyph by the advance values of the previous glyph.
After this call the current point is moved to the origin of where the next glyph would be placed in this same progression. That is, the current point will be at the origin of the final glyph offset by its advance values. This allows for easy display of a single logical string with multiple calls to cairo_show_text().
Note: The cairo_show_text() function call is part of what the cairo designers call the "toy" text API. It is convenient for short demos and simple programs, but it is not expected to be adequate for serious text-using applications. See cairo_show_glyphs() for the "real" text display API in cairo.
http://www.cairographics.org/manual/cairo-text.html#cairo-show-text
So it's UTF-8 in PostScript, as near as I can figure! :)
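To see what that buys you end to end, here's a minimal standalone sketch using the pycairo bindings (not part of xpost; the filename and font family are arbitrary choices), showing that Cairo's toy text API happily takes UTF-8:

import cairo

# Draw a short UTF-8 string with Cairo's "toy" text API.
surface = cairo.ImageSurface(cairo.FORMAT_ARGB32, 320, 80)
ctx = cairo.Context(surface)
ctx.select_font_face("sans-serif", cairo.FONT_SLANT_NORMAL, cairo.FONT_WEIGHT_NORMAL)
ctx.set_font_size(24)
ctx.move_to(10, 50)
ctx.show_text("Même âge… œuvre")  # a plain Python str, passed to Cairo as UTF-8
surface.write_to_png("unicode_text.png")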
If I look in UAX #15 Section 9, there is sample code to check for normalization. That code uses the NFC_QC property and checks CCC ordering, as expected. It looks great, except this line puzzles me: if (Character.isSupplementaryCodePoint(ch)) ++i;. It seems to be saying that if a character is supplementary (i.e. >= 0x10000), then I can just assume the next character passes the quick check, without bothering to check the NFC_QC property or CCC ordering on it.
Theoretically, I could have, say, a starting code point, followed by a supplementary code point with CCC > 0, followed by a third code point whose CCC is > 0 but lower than that of the second code point, or whose NFC_QC == NO, and it would STILL pass the NFC quick check, even though it would seem not to be in NFC form. There are a bunch of supplementary code points with CCC = 7, 9, 216, 220, or 230, so there seem to be a lot of possibilities to hit this case. I guess this can work if we can assume it will always be the case, throughout future versions of Unicode, that all supplementary characters with CCC > 0 also have NFC_QC == No.
Is this sample code correct? If so, why is this supplemental check valid? Are there cases that would produce incorrect results if that supplemental check were removed?
Here is the code snippet copied directly from that link.
public int quickCheck(String source) {
    short lastCanonicalClass = 0;
    int result = YES;
    for (int i = 0; i < source.length(); ++i) {
        int ch = source.codepointAt(i);
        if (Character.isSupplementaryCodePoint(ch)) ++i;
        short canonicalClass = getCanonicalClass(ch);
        if (lastCanonicalClass > canonicalClass && canonicalClass != 0) {
            return NO;
        }
        int check = isAllowed(ch);
        if (check == NO) return NO;
        if (check == MAYBE) result = MAYBE;
        lastCanonicalClass = canonicalClass;
    }
    return result;
}
The sample code is correct,1 but the part that concerns you has little to do with Unicode normalization. No characters in the string are actually skipped, it’s just that Java makes iterating over a string’s characters somewhat awkward.
The extra increment is a workaround for a historical wart in Java (which it happens to share with JavaScript and Windows, two other early adopters of Unicode): Java Strings are arrays of Java chars, but Java chars are not Unicode (abstract) characters or (concrete numeric) code points, they are 16-bit UTF-16 code units. This means that every character with code point C < 1 0000h takes up one position in a Java String, containing C, but every character with code point C ≥ 1 0000h takes two, as specified by UTF-16: the high or leading surrogate D800h + (C − 1 0000h) div 400h and the low or trailing surrogate DC00h + (C − 1 0000h) mod 400h (no Unicode characters are or ever will be assigned code points in the range [D800h, DFFFh], so the two cases are unambiguously distinguishable).
Because Unicode normalization operates in terms of a sequence of Unicode characters and cares little for the particulars of UTF-16, the sample code calls String.codePointAt(i) to decode the code point that occupies either position i or the two positions i and i+1 in the provided string, processes it, and uses Character.isSupplementaryCodePoint to figure out whether it should advance one or two positions. The way the loop is written treats the “supplementary” two-unit case like an unwanted stepchild, but that’s the accepted Java way of treating them.
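If it helps to see those two surrogate formulas in running code, here's a small Python sketch (the function name is mine, not from the sample):

def utf16_units(cp):
    # Encode one code point as UTF-16 code units, per the formulas above.
    if cp < 0x10000:
        return (cp,)                      # BMP character: one 16-bit unit
    offset = cp - 0x10000
    high = 0xD800 + (offset >> 10)        # D800h + (C - 1 0000h) div 400h
    low = 0xDC00 + (offset & 0x3FF)       # DC00h + (C - 1 0000h) mod 400h
    return (high, low)

print([hex(u) for u in utf16_units(0x1D12B)])  # ['0xd834', '0xdd2b']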
1 Well, correct up to a small spelling error: codepointAt should be codePointAt.
This is a question about iTextSharp version 5.5.13.1. I am using a custom LocationTextExtractionStrategy implementation to extract sensible words from a PDF document. I am calling the TextRenderInfo method GetSingleSpaceWidth to determine when to join two adjacent blocks of characters into a single word, as per this SO link:
itext java pdf to text creation
This approach has generally worked well. However, if you look at the attached document, the words "Credit" and "Extended" are giving me some problems.
Why are all the characters shown encircled in the screen capture returning a zero value for GetSingleSpaceWidth? This causes a problem: instead of two separate words, my logic returns one word, "CreditExtended".
I understand that iTextSharp 5 is not supported any more. Any suggestions would be highly appreciated.
Sample document
https://drive.google.com/open?id=1pPyNRXvnUyIA2CeRrv05-H9q0sTUN97d
As already conjectured in a comment, the cause is that the font in question does not contain a regular space glyph, or even more exactly, does not map any of its glyphs to the Unicode value U+0020 in its ToUnicode map.
If a font has a ToUnicode map, iText uses only the information from that map. Thus, iText does not identify a space glyph in that font, so it cannot provide the actual SingleSpaceWidth value and returns 0 instead.
The font in question is named F5 and has this ToUnicode map:
/CIDInit /ProcSet findresource begin
14 dict begin
begincmap
/CIDSystemInfo
<< /Registry (Adobe)
/Ordering (UCS)
/Supplement 0
>> def
/CMapName /Adobe-Identity-UCS def
/CMapType 2 def
1 begincodespacerange
<0000> <FFFF>
endcodespacerange
4 beginbfchar
<0004> <0041>
<0012> <0043>
<001C> <0045>
<002F> <0049>
endbfchar
1 beginbfrange
<0044> <0045> <004D>
endbfrange
13 beginbfchar
<0102> <0061>
<0110> <0063>
<011A> <0064>
<011E> <0065>
<0150> <0067>
<015D> <0069>
<016F> <006C>
<0176> <006E>
<017D> <006F>
<0189> <0070>
<018C> <0072>
<0190> <0073>
<019A> <0074>
endbfchar
5 beginbfrange
<01C0> <01C1> <0076>
<01C6> <01C7> <0078>
<0359> <0359> [<2026>]
<035A> <035B> <2018>
<035E> <035F> <201C>
endbfrange
1 beginbfchar
<0374> <2013>
endbfchar
endcmap
CMapName currentdict /CMap defineresource pop
end
end
As you can see, there is no mapping to <0020>.
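To make the bfchar/bfrange notation concrete, here is a small Python sketch of (an excerpt of) the mapping this CMap defines; a bfrange maps consecutive codes to consecutive Unicode values:

# Glyph code -> Unicode, expanded from the CMap above (first entries only).
# The bfrange <0044> <0045> <004D> means 0x0044 -> U+004D and 0x0045 -> U+004E.
to_unicode = {
    0x0004: "A", 0x0012: "C", 0x001C: "E", 0x002F: "I",  # bfchar entries
    0x0044: "M", 0x0045: "N",                            # expanded bfrange
}
print("\u0020" in to_unicode.values())  # False: no glyph maps to a space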
The use of fonts in this PDF page is quite funny, by the way:
Its body is (mostly) drawn using Calibri, but the page uses two distinct PDF font objects for this: F4, which uses WinAnsiEncoding for character codes 32 through 122, i.e. including the space glyph, and F5, which uses Identity-H and provides the ToUnicode map quoted above, without a space glyph. Each maximal sequence of glyphs without a gap is drawn separately; if the whole sequence can be drawn using F4, that font is used, otherwise F5 is used.
Thus, CMI, (Credit, and the sub-indexes are drawn using F4, while I’ve, “Credit, and Extended” are drawn using F5.
In your problem string “Credit Extended”, therefore, we have two consecutive sequences drawn using F5. Thus, you'll get a 0 SingleSpaceWidth both for the “Credit chunk and for the Extended” chunk.
At first glance these are the only two consecutive sequences using F5, so you have that issue only there.
As a consequence, you should implement a fallback strategy for the case of two consecutive chunks both coming with a 0 SingleSpaceWidth, e.g. using something like a third of the font size.
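A minimal sketch of that fallback in Python (the function and parameter names are mine, not iTextSharp API; the one-third factor is the heuristic suggested above):

def effective_single_space_width(reported_width, font_size):
    # Fall back to a third of the font size when the font has no space
    # glyph and the extraction library therefore reports a width of 0.
    return reported_width if reported_width > 0 else font_size / 3.0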
I'm making a virtual computer with a custom font and programming environment (Mini Micro), all Unicode based. I have need for a few custom glyphs in my environment. I know about the Private Use Areas, but I'm wondering about the "control" code points at U+0080 through U+009F. I can't find any documentation on what these points are for beyond "control".
Would it be a gross abuse of Unicode to tuck a few of my custom glyphs in there? What would be a proper use of them?
Wikipedia lists their meanings. Two of them are explicitly reserved for uses like yours: U+0091 PRIVATE USE ONE and U+0092 PRIVATE USE TWO.
The 0x80 - 0x9F range you refer to is generally called the C1 control characters. Like other control codes, the C1s are for code extension; by their very nature, some are generally left open for further expansion and thus have only vague standardization.
The original and most comprehensive reference is probably ECMA-48 - up to the Fifth Edition in June 1991. (The link takes you to a free download in PDF format.)
For additional glyphs, C1 codes would not be appropriate. In effect, the whole idea of control codes is that they are the special case of non-graphical codes.
Unicode has continued to evolve, with emoji blocks that have a lot of "characters" you might not expect. Let's try one: 💎, officially called GEM STONE (U+1F48E). I used this copy/paste website to insert it; you might look to see if something you can use has already been standardized in an emoji block.
One of the interesting things about the emoji characters is that they are double-wide, even in a fixed-width font.
Microsoft uses them for smart quotes, the Euro sign, and a few other symbols in its Latin-1 extension cp1252. As this character encoding is frequently reported as Latin-1, using these code points for other purposes can cause problems, especially as Latin-1 proper is supposed to be code-point equivalent to Unicode. This Wikipedia page gives some history and the meanings of these control characters.
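You can see the difference directly in Python: the same bytes decode to printable symbols under cp1252 but to invisible C1 controls under real Latin-1:

data = b"\x80\x91\x92\x93\x94"
print(data.decode("cp1252"))   # €‘’“” - Euro sign and smart quotes
print(data.decode("latin-1"))  # C1 control characters U+0080, U+0091, U+0092, ...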
Note: Non-BMP characters can be displayed in IDLE as of Python 3.8 (so it's possible Tkinter might display them now too, since they both use Tcl), which was released some time after I posted this question. I plan to edit this after I try out Python 3.9 (after I install an updated version of Xubuntu). I also read that editing these characters in IDLE might not be as straightforward as with other characters; see the last comment here.
So, today I was making shortcuts for entering certain Unicode characters. All was going well. Then, when I decided to do these characters (in my Tkinter program; they wouldn't even try to go in IDLE), 𝄫 and 𝄪, I got a strange unexpected error and my program started deleting just about everything I had written in the text box. That's not acceptable.
Here's the error:
_tkinter.TclError: character U+1d12b is above the range (U+0000-U+FFFF) allowed by Tcl
I realize most of the Unicode characters I had been using only had four hex digits in their codes. For some reason, it doesn't like five.
So, is there any way to print these characters in a ScrolledText widget (let alone without messing everything else up)?
UTF-8 is my encoding. I'm using Python 3.4 (so UTF-8 is the default).
I can print these characters just fine with the print statement.
Entering the character without just using ScrolledText.insert (e.g. with Ctrl-Shift-U, or by putting b'\xf0\x9d\x84\xab' in the code) does actually enter it without that error, but it still starts deleting stuff crazily, or adding extra spaces (including deleting the character itself, although it reappears randomly at times).
There is currently no way to display those characters as they are supposed to look in Tkinter in Python 3.4 (although someone mentioned how using surrogate pairs may work [in Python 2.x]). However, you can implement methods to convert the characters into displayable codes and back, and just call them whenever necessary. You have to call them when you print to Text widgets, copy/paste, in file dialogs*, in the tab bar, in the status bar, and other stuff.
*The default Tkinter file dialogs do not allow for much internal engineering of the dialogs. I made my own file dialogs, partly to help with this issue. Let me know if you're interested. Hopefully I'll post the code for them here in the future.
These methods convert out-of-range characters into codes and vice versa. The codes are formatted with ordinal numbers, like this: {119083ū}. The brackets and the ū are just to distinguish this as a code. {119083ū} represents 𝄫. As you can see, I haven’t yet bothered with a way to escape codes, although I did purposefully try to make the codes very unlikely to occur. The same is true for the ᗍ119083ūᗍ used while converting. Anyway, I'm meaning to add escape sequences eventually. These methods are taken from my class (hence the self). (And yes, I know you don’t have to use semi-colons in Python. I just like them and consider that they make the code more readable in some situations.)
import re;

def convert65536(self, s):
    #Converts a string with out-of-range characters in it into a string with codes in it.
    l=list(s);
    i=0;
    while i<len(l):
        o=ord(l[i]);
        if o>65535:
            l[i]="{"+str(o)+"ū}";
        i+=1;
    return "".join(l);

def parse65536(self, match):
    #This is a regular expression method used for substitutions in convert65536back()
    text=int(match.group()[1:-2]);
    if text>65535:
        return chr(text);
    else:
        return "ᗍ"+str(text)+"ūᗍ";

def convert65536back(self, s):
    #Converts a string with codes in it into a string with out-of-range characters in it
    while re.search(r"{\d\d\d\d\d+ū}", s)!=None:
        s=re.sub(r"{\d\d\d\d\d+ū}", self.parse65536, s);
    s=re.sub(r"ᗍ(\d\d\d\d\d+)ūᗍ", r"{\1ū}", s);
    return s;
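For example, a round trip through these methods (assuming an instance app of the class they live on) looks like this:

safe = app.convert65536("B\U0001D12Bb")  # -> 'B{119083ū}b', safe for Tkinter
text = app.convert65536back(safe)        # -> 'B𝄫b', the original string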
My answer is based on Shule's answer, but provides more Pythonic and easier-to-read code. It also provides a real use case.
This is a method that populates items into a tkinter.Listbox. There is no back conversion; this solution only takes care of displaying strings containing characters Tcl does not allow.
from tkinter import Listbox, END, TclError

class MyListbox (Listbox):
    # ...
    def populate(self):
        """Refill the listbox from mydata_list, converting items Tcl rejects."""
        def _convert65536(to_convert):
            """Converts a string with out-of-range characters in it into a
            string with codes in it.
            Based on <https://stackoverflow.com/a/28076205/4865723>.
            This is a workaround because Tkinter (Tcl) doesn't allow unicode
            characters outside of a specific range. This could be emoticons
            for example.
            """
            for character in to_convert[:]:
                if ord(character) > 65535:
                    convert_with = '{' + str(ord(character)) + 'ū}'
                    to_convert = to_convert.replace(character, convert_with)
            return to_convert

        # delete all listbox items
        self.delete(0, END)
        # add items to listbox (mydata_list and _log are defined elsewhere)
        for item in mydata_list:
            try:
                self.insert(END, item)
            except TclError as err:
                _log.warning('{} It will be converted.'.format(err))
                self.insert(END, _convert65536(item))
I am working on an RTF file made by someone else on an unknown platform, and everything is interpreted correctly except some characters, whichever character set I open it with in OpenOffice. Here is the plain text, after interpretation:
"Même taille que la Terre, même masse, même âgec Vénus a souvent été qualifiée de sœur de la Terre. "
and here is the original ANSI paragraph:
"M\u234\'3fme taille que la Terre, m\u234\'3fme masse, m\u234\'3fme \u226\'3fge\uc2 \u61825\'ff\'81\uc1 c V\u233\'3fnus a souvent \u233\'3ft\u233\'3f qualifi\u233\'3fe de s\u339\'3fur de la Terre."
To zoom in:
"âgec Vénus" becomes "\u226\'3fge\uc2 \u61825\'ff\'81\uc1 c V\u233\'3fnus"
and finally, what we come up with:
"\uc2 \u61825\'ff\'81\uc1 c"
here \uc2 and \uc1 are to say we are going back and forth between 4-byte and 2-byte Unicode encoding.
\u61825 is an unknown Unicode character. Indeed, according to the RTF specification, any Unicode character greater than 2^15 should be written in negative form; a negative form should make the "-" (minus) sign visible in Notepad, am I right? So here already I have something I don't understand: how could the RTF writer used by the person who made the RTF file have produced this? Maybe I missed something in the specification (specific versions, character sets, I don't know). If taken as is, 61825 would correspond to F181, which is in a Private Use Area of the Unicode table.
And then, the \'ff\'81 would be some use of the ANSI-equivalent field of the whole "specific character" group (whose structure is usually \uN\'XX) to encode something that would be 4 bytes long. And here again, I could not find:
what code page (Windows-1252, ISO-8859-1, other?) is being referred to (in all the other places in the file where a \uN\'XX sequence appears, XX is always 3F, the Windows-1252 code for "?", so that did not give me much information);
what the \'FF (which looks like some control character inside an escape sequence!) stands for, and then why \'81... Actually, the translation of \u61825 to hex is F181, not FF81... I am lost here!
Finally, what the translated text (in French) would lead us to expect is a ":" (colon): "Same size as Earth, same mass, same age: Venus has often been qualified as Earth's sister". It would make sense. But what RTF writer could imagine such a complicated code for a colon?
So again, after an hour of searching, I open the question to you: does someone recognize this, and can you tell me what control-word encoding is used? Is there a big-endian/little-endian/two's-complement mess here with the 61825, and likewise with the \'ff\'81, which would assemble as FF81 instead of F181, which itself doesn't mean anything as is? My question here is only whether there is a way to recover the complete original text from this bizarre RTF encoding.
what the translated text (in French) would lead us to expect is a ":" (colon)
Nearly: it should be the ellipsis. You can see the source text, e.g., here.
The ellipsis should normally be written simply as three periods, but there has traditionally been a separate character representing ellipsis in order better to control their spacing, back before complex text layout algorithms existed that could do automatic glyph replacement. Consequently there exists a Unicode compatibility character U+2026 HORIZONTAL ELLIPSIS to allow round-tripping to legacy encodings such as Windows code page 1252, where it is byte 133.
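A quick Python check of that round trip:

# U+2026 maps to byte 133 (0x85) in Windows code page 1252 and back.
assert "\u2026".encode("cp1252") == b"\x85"
assert b"\x85".decode("cp1252") == "\u2026"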
That, however, is not what has been encoded in your RTF document. That would be too easy.
61825 is an unknown Unicode character.
It's a Private Use Area character, which means it could represent absolutely anything. Word has exported certain common symbol fonts as PUA characters - see this post for the background.
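A sketch of that manual replacement in Python (the sample string is reconstructed from the question, not taken from the actual file):

# Treat the Private Use Area character U+F181 (decimal 61825; an RTF writer
# following the negative-form rule would have emitted \u-3711, since
# 61825 - 65536 = -3711) as the ellipsis it appears to stand for.
raw = "M\u00eame \u00e2ge\uf181 V\u00e9nus"
fixed = raw.replace("\uf181", "\u2026")
print(fixed)  # Même âge… Vénus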
So someone at some point may have used a symbol font where code unit 129 (the 0x81 in U+F181, 61825) maps to something that looks like an ellipsis. Quite what that font is, I have no idea! It doesn't seem to be one of the usual suspects (Symbol, Wingdings, Webdings). You might just have to manually replace U+F181 with U+2026 for now unless you can find out more about the source.