Truncate bidirectional text - unicode

For a project, I am currently working on my own text rendering engine. I am stuck on how to truncate bidirectional text properly when it does not fit within the available text box.
I imagine the following behavior should apply:
(Text is always "Hello Matthew" in different languages)
All Left-to-right text:
Hello Matthew
→
Hello Mat...
All Right-to-left text (I just copied this from Google translate so please excuse any misspelling):
مرحبا ماثيو
→
مرحبا م...‏
Mixed RTL and LTR text:
Matthew مرحبا‎‏
→
Mat... مرحبا‎‏
Is there any standard that describes how the truncation should be done and how to do it? (Currently I imagine always truncating the end of the string in memory order and appending the ellipsis at the end in memory order. But I think a left-to-right / right-to-left mark is also sometimes necessary after the ellipsis, based on ...?)

Related

Why does the TextBox character ordering change with FlowDirection RightToLeft

In my UWP app I have a textbox.
I want the user to be able to type Farsi / Persian text (right to left) into the textbox, so I set the FlowDirection property to RightToLeft.
The text can be entered and is displayed correctly:
When I save the text and inspect the property during debugging, I see the same character order as on screen:
The same character order applies for the stored value when viewed with mssql management studio:
When I add a '.' or a '!' at the end of the text, the WPF textbox still displays what I expect,
but the text I get back from the text property puts the exclamation mark at the right side of the string.
It is also stored this way in the sql database:
When loading the database value (with the exclamation point on the right) into the textbox it shows the exclamation point correctly on the left side. There must be some magic happening here that I am not aware of, or maybe the problem is that the debug preview / mssql preview does not support displaying RTL values.
My problem is that this magic does not work in other situations.
When I load the database value and put it in a Microsoft Word document, it seems to do no conversion and places the text in the document exactly as it is in the database, so the exclamation point is shown on the 'wrong' side.
I would like to understand the 'magic' that takes place in displaying / storing these strings, so I can output them correctly in MS Word. And yes, I have set the paragraph where I output the values in Word to RTL.
In Unicode, all characters have directional properties that get used in the Unicode Bidirectional Algorithm for determining how characters are ordered visually. Most characters have a "strong" directional property, but not all. In particular, most punctuation characters are considered directionally neutral.
The visual ordering of neutral characters is determined by the characters that surround them. For example, the exclamation mark ! is neutral; if it occurs between two left-to-right characters, it will be treated as though it also is a left-to-right character. But if it occurs between two right-to-left characters, it will be treated as though it is a right-to-left character.
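The directional classes described above can be inspected directly. The following is a minimal sketch using Python's standard `unicodedata` module; the helper name `bidi_class` and the sample characters are only illustrative choices, not part of the original question. In the output, `L` and `AL` are strong classes, while `ON` (other neutral) and `WS` (whitespace) are neutral.

```python
import unicodedata

def bidi_class(ch: str) -> str:
    """Return the Bidi_Class of a single character, e.g. 'L', 'R', 'AL', 'ON'."""
    return unicodedata.bidirectional(ch)

# Sample characters (illustrative): a Latin letter, an Arabic letter,
# an exclamation mark, and a space.
samples = {
    "A": "Latin letter",
    "\u0645": "Arabic letter MEEM",
    "!": "exclamation mark",
    " ": "space",
}

for ch, label in samples.items():
    print(f"U+{ord(ch):04X} {label}: {bidi_class(ch)}")
```

The exclamation mark reports a neutral class, which is why its visual position depends on its neighbours and on the paragraph base direction rather than on the character itself.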
In your example, though, the exclamation mark occurs at the end of the string. So, it has a strong-direction character on one side, but nothing on the other side. In this case, another factor comes into play, which is that the paragraph as a whole has a base direction.
The Unicode Bidi Algorithm allows two ways that apps can handle the paragraph base direction:
the app can set the base direction explicitly, regardless of the string content in the paragraph; or
the app can let the base direction be derived implicitly from the string: the base direction is determined by the first strong-directional character in the string.
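The second option, deriving the base direction from the first strong character, can be sketched roughly as follows. This is a simplified illustration of rules P2 and P3 of UAX #9, not a full implementation: it ignores the special handling of directional isolates and assumes the input is a single paragraph. The function name `implicit_base_direction` and the `default` parameter are hypothetical.

```python
import unicodedata

def implicit_base_direction(text: str, default: str = "ltr") -> str:
    """Guess the paragraph base direction from the first strong character.

    Simplified: skips the isolate handling that the full algorithm requires.
    """
    for ch in text:
        cls = unicodedata.bidirectional(ch)
        if cls == "L":              # strong left-to-right
            return "ltr"
        if cls in ("R", "AL"):      # strong right-to-left (Hebrew, Arabic, ...)
            return "rtl"
    # No strong character found (e.g. only digits or punctuation).
    return default

# An Arabic word followed by '!' is detected as RTL, so a renderer using
# this base direction would place the trailing '!' at the left end.
print(implicit_base_direction("\u0645\u0631\u062d\u0628\u0627!"))
print(implicit_base_direction("Hello!"))
```

Digits and punctuation are not strong, so a string such as "123!" falls through to the default in this sketch.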
In your UWP app, when you set the flow direction to RTL, then the paragraph base direction (for purposes of the Bidi Algorithm) is RTL. With an Arabic-script string that ends with the exclamation mark, the directionality of the exclamation is set to RTL because of the paragraph base direction, and so it appears at the left end of the string. But when you view the control property value in an IDE, the IDE is presenting that property string in a control that has LTR base direction. That is causing the exclamation at the logical end of the string to appear visually at the right end.
Note that apps will often conflate base direction and alignment, though these are really distinct things. In Word, you can set the paragraph base direction in the Paragraph settings dialog, and when you do it will set the alignment to match by default:
But you can override the paragraph alignment to have a RTL base direction with left alignment:
Note that the visual order of the exclamation mark is affected by the paragraph base direction but not by the alignment. The Unicode Bidi Algorithm doesn't pay attention to the alignment.
This article gives a good overview of how the Bidi Algorithm works: https://www.w3.org/International/articles/inline-bidi-markup/uba-basics.
If you want to explore how the Bidi Algorithm works in more detail, you can read the spec, Unicode Standard Annex #9, Unicode Bidirectional Algorithm; and check out this Unicode utility that explains how the rules of the algorithm apply to sample strings you can provide.

VSCode custom font ligatures, display two symbols as one?

I would like to display list.emptyqm() as list.empty?() in function names for specific language. So, two symbols qm if they are at the end of the function name should be displayed as ? (possibly some unicode symbol looking similar to question mark).
Is that possible in VSCode?
The VSCode already knows that piece of text is string, or function-name/keyword/variable-name (as it highlights it properly), so the ligature should be displayed only if qm are the last
characters of function-name/keyword/variable-name. It shouldn't be displayed in the middle of the function name, like aqma() shouldn't be displayed as a?a().
You seem to misunderstand what a ligature is. A ligature describes how two individual letters can be combined to form a visually pleasing appearance. A ligature never changes the syntax of a text. Hence, converting qm to ? is a completely different thing.
Replacing text in vscode is of course possible, for instance as part of the format command. You can register your own formatter and determine the text edit actions that you want to be applied, including the transformation of these character sequences.

Unicode for Contextual forms of ټ,ګ,ځ,څ,ڼ,ښ,ډ,ۍ,ړ,ې in Pashto language

I am developing a program that gives the correct format of text. For example, if I write سلام, it gives FEB3, FEE0, FE8E and FEE2, which are the Unicode code points of سـ, ـلـ,ﺎ,ـم. But if I write ټول, there is a code point for the character ټ, which is 067C, but there is no code point for the character ټـ, which is its initial contextual form.
I found the code points for the isolated forms of ټ,ګ,ځ,څ,ڼ,ښ,ډ,ۍ,ړ,ې on Wikipedia, but I can't find the code points for their contextual forms.
For example Unicode of ټـ ,ـټـ,ـټ.
I would appreciate a response from anyone who knows a solution to this problem.
thanks...
A Unicode character is intended to be abstract in the sense that it doesn't have a particular presentation form. The preferred way to display cursive scripts like Arabic is to store the standard, non-contextual forms, and convert them to their cursive forms at display time - that is, as one of the final stages of a text display system in an operating system or word processor.
The cursive forms are usually provided as glyphs in the font, and are chosen using information in tables in the font file embodying the contextual rules.
Unicode stores quite a large number of Arabic contextual forms, but only for compatibility with older encodings, and with traditional metal type, for which only a finite number of physical glyphs can be supplied. Unfortunately for your purposes, these contextual forms don't cover all the extended characters used in languages other than Arabic, such as the example you give, which is U+067C ARABIC LETTER TEH WITH RING, used in Pashto.
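The compatibility status of these contextual forms can be checked with Python's standard `unicodedata` module. The sketch below is only an illustration: it takes the four presentation-form code points from the question (FEB3, FEE0, FE8E, FEE2) and shows that compatibility normalization (NFKC) maps them back to the ordinary abstract letters, while U+067C has no decomposition at all. The helper name `describe` is hypothetical.

```python
import unicodedata

def describe(ch: str) -> tuple[str, str, str]:
    """Return (code point, character name, decomposition mapping) for ch."""
    return (f"U+{ord(ch):04X}", unicodedata.name(ch), unicodedata.decomposition(ch))

# The presentation forms listed in the question.
presentation = "\uFEB3\uFEE0\uFE8E\uFEE2"

# NFKC folds compatibility characters into their standard equivalents.
abstract = unicodedata.normalize("NFKC", presentation)

for ch in presentation:
    print(describe(ch))
print([f"U+{ord(ch):04X}" for ch in abstract])

# The Pashto letter has only its single abstract code point.
print(describe("\u067C"))
```

Each presentation form carries a tag such as `<initial>`, `<medial>` or `<final>` in its decomposition, pointing at the abstract letter that should normally be stored in text.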
It's very unlikely that further contextual Arabic forms will be added, in my opinion. Therefore your proposed program cannot be made to work, at least according to its current design.
Earlier Unicode versions included separate codes for the different forms of most, but not all, Arabic letters. Arabic letters are used to write Pashto, Farsi, Urdu, and a few other languages. The letters used in Arabic, Farsi, and maybe a couple more languages were assigned a separate code for each of their forms. However, the letters used only by less commonly taught languages like Pashto, which you are asking about, were assigned codes for the isolated forms only. In later versions of Unicode, it was decided to assign only a single code to each letter, leaving the Pashto-only letters with codes for the isolated forms only.
Actually, there was no need for a separate code for each form; that was a bad decision made in the earlier Unicode versions. A rendering engine (in editors and other programs that deal with plain text) should account for the different forms of each letter and display the correct form according to its position.

What does it mean for a CTLine to have "string access"?

I'm trying to solve a hairy problem with UILabel, and I've gotten most of it figured out, except for one thing: I'm having a challenge understanding what it means for a CTLine to have "string access".
The method that I'd like to use is CTLineGetOffsetForStringIndex. Here's a link to the documentation for the method.
Here's the part of the documentation that I don't understand (emphasis is mine):
The primary offset along the baseline for charIndex, or 0.0 if the
line does not support string access.
When I'm running this method, I'm getting 0.0 back, so I guess that means the line doesn't support string access - but what does that mean, exactly?
The statement "the line does not support string access" may be inferred as meaning that the line of text may not be treated as a sequence of characters that may be accessed by the index of each character.
This may open up a large discussion about visual characters versus non-visual characters, and glyphs versus characters. But to simplify the discussion, assume that a line of text may have one of the following states:
more than zero characters (characters which translate to either glyphs or whitespace within the same line) are present in the line of text in question
there are no characters in the line of text which occupy any "space"
Now to provide some rationale for this inference.
Apple's documentation provides a description of Text Kit, upon which UILabel is built:
The UIKit framework includes several classes whose purpose is to display text in an app’s user interface: UITextView, UITextField, and UILabel, as described in Displaying Text Content in iOS. Text views, created from the UITextView class, are meant to display large amounts of text. Underlying UITextView is a powerful layout engine called Text Kit. If you need to customize the layout process or you need to intervene in that behavior, you can use Text Kit. For smaller amounts of text and special needs requiring custom solutions, you can use alternative, lower-level technologies, as described in Lower Level Text-Handling Technologies.
Text Kit is a set of classes and protocols in the UIKit framework providing high-quality typographical services that enable apps to store, lay out, and display text with all the characteristics of fine typesetting, such as kerning, ligatures, line breaking, and justification. Text Kit is built on top of Core Text, so it provides the same speed and power. UITextView is fully integrated with Text Kit; it provides editing and display capabilities that enable users to input text, specify formatting attributes, and view the results. The other Text Kit classes provide text storage and layout capabilities. Figure 8-1 shows the position of Text Kit among other iOS text and graphics frameworks.
Figure 8-1 Text Kit Framework Position
Text Kit gives you complete control over text rendering in user interface elements. In addition to UITextView, UITextField and UILabel are built on top of Text Kit, and it seamlessly integrates with animations, UICollectionView and UITableView. Text Kit is designed with a fully extensible object-oriented architecture that supports subclassing, delegation, and a thorough set of notifications enabling deep customization.
The answer to the related question mentions several classes such as NSTextStorage, NSLayoutManager, and NSTextContainer.
Consider that the UILabel uses all the above classes to provide the end result of displaying text in the parent UIView, which the end user sees on the screen. A layout manager (an instance of NSLayoutManager) coordinates data flow between the text view, the text container, and the text storage, resulting in the display of characters in the view. The layout manager maps the characters to glyphs, and figures out which lines to use to lay out the glyphs. The layout manager also figures out how to display things like underline and strikethrough, which are not part of the glyphs.
Important to this discussion is the fact that the Layout Manager lays out lines of text. If that line of text is selectable, the user may select visible characters in the line. In this particular case, there is "string access" for the line.
A similar concept is the method posted in the solution to related question:
func boundingRect(forGlyphRange glyphRange: NSRange, in container: NSTextContainer) -> CGRect
Returns a single bounding rectangle (in container coordinates) enclosing all glyphs and other marks drawn in the given text container for the given glyph range, including glyphs that draw outside their line fragment rectangles and text attributes such as underlining.
Finally, the reference discussion for the function CTLineGetOffsetForStringIndex speaks about graphical offsets which are suitable for drawing custom carets. The carets may be used to show insertion points or text selection. The primary and secondary offsets may be thought of as beginning and end indices for a string -- a sequence of characters. If there is no sequence of characters for a given line, there can be no selected characters, no carets, no range of glyphs. Therefore no "string access".

Fast, Unicode-capable, cross-platform programmer's text editor that shows invisibles like ZWSP?

Our publishing workflow includes Windows and Linux machines (there are some Macs too, but not in the critical-path workflow). Many texts include both English and Khmer and are marked-up in XML.
XML Copy Editor is the best cross-platform open-source XML editor I've discovered. It utilizes the Scintilla editing component, which is generally good with Unicode but which does not enable non-printing or invisible characters like U+200B (zero-width space) and U+200C (zero-width non-joiner) to be displayed. Khmer does not separate words with a space character as Western languages do, so ZWSP is used in electronic texts to enable applications to break lines easily.
Ideally I'd edit the markup and the content in a single editor, but XML awareness is less important at times than being able to display invisibles. (OpenOffice.org Writer and Microsoft Word are the only two apps I know that will display ZWSP. They are not suitable for the markup and text manipulations that need to be done to prepare manuscripts for publication, unfortunately, although I guess they're fine for authoring.)
I tried out a promising editor last week, but a search-and-replace regex operation that took under a second in TextPad 4.7.3 lasted over twenty seconds. So I want to mention that speed and the ability to handle large (up to 150 MB) files are also concerns.
Is there a good, fast, free or not too expensive text editor, with versions on Windows and Linux and maybe mac too, Unicode-aware and capable of displaying invisibles like ZWSP? That has syntax highlighting, can handle large files and is customizable enough that I won't tear my hair out in frustration?
I don't know about ZWSP in particular, but EditPadPro is good, fast, not expensive, has a very good regex engine and is Unicode-aware (and well-suited to editing XML, too). The developer (Jan Goyvaerts) lives in Thailand and knows about requirements for Eastern scripts and languages, so chances are good that it will be able to handle these texts.
EditPad Pro does not (yet) have the ability to visualize non-printable characters other than the ASCII space and tab. Version 6 does recognize ZWSP as a word boundary when doing word wrapping and selecting words by double-clicking or Ctrl+Shift+Left/Right.
What you can do is search for the regular expression \u200B. Though this doesn't make the zero-width space visible, it will select it and put the cursor after it. You could use the regex \u200B\X and turn on the Highlight button on the search panel to highlight each grapheme after U+200B. You could even use the syntax coloring scheme editor to edit the provided XML scheme so that it uses that regex to always highlight each grapheme after U+200B.
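Outside of any particular editor, the same search idea can be sketched as a small script. This is a hedged illustration rather than an EditPad Pro feature: the names `INVISIBLES`, `find_invisibles` and `reveal_invisibles` are hypothetical, and the `<ZWSP>` style placeholders are an arbitrary choice. Python's built-in `re` module does not support `\X`, so this version matches only the invisible characters themselves.

```python
import re

# Invisible characters to look for, mapped to a short visible label.
INVISIBLES = {
    "\u200B": "ZWSP",  # zero-width space
    "\u200C": "ZWNJ",  # zero-width non-joiner
}

# A character class matching any of the characters above.
_PATTERN = re.compile("[" + "".join(INVISIBLES) + "]")

def find_invisibles(text: str) -> list[tuple[int, str]]:
    """Return (index, label) for every invisible character in text."""
    return [(m.start(), INVISIBLES[m.group()]) for m in _PATTERN.finditer(text)]

def reveal_invisibles(text: str) -> str:
    """Replace each invisible character with a visible placeholder like <ZWSP>."""
    return _PATTERN.sub(lambda m: f"<{INVISIBLES[m.group()]}>", text)

sample = "ab\u200Bcd\u200Ce"
print(find_invisibles(sample))
print(reveal_invisibles(sample))
```

The revealed text is meant for inspection only; writing it back to the file would replace the real ZWSP characters with the placeholders.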
EditPad Pro easily handles 150 MB files and has a powerful regex engine (same as used in RegexBuddy and PowerGREP). Maximum file size is 2 GB. Windows only.
I'm using CKEditor. It's cross-platform and completely supports Unicode.
Take a look at it.