How should I handle digits from different sets of Unicode digits in the same string? - perl

I am writing a function that transliterates Unicode digits into ASCII digits, and I am a bit stumped about what to do if the string contains digits from different Unicode digit sets. For example, suppose I have the string "\x{2463}\x{24F6}" ("④⓶"). Should my function:
1. return 42?
2. croak that the string contains mixed sets?
3. carp that the string contains mixed sets and return 42?
4. give the user an additional argument to specify one of the three above behaviours?
5. do something else?

Your current function appears to do #1.
I suggest that you also write another function to do #4, but only when the requirement appears, and not before.
I'm sure Joel wrote about "premature implementation" in a blog article sometime recently, but I can't find it.
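For what it's worth, here is a minimal sketch of behaviours #1 and #4 in Python (the question is about Perl, but the logic carries over directly); the function name, the on_mixed parameter, and the set-detection heuristic are all mine, purely for illustration:

import unicodedata

def unicode_digits_to_ascii(s, on_mixed="allow"):
    # Transliterate any Unicode digit to its ASCII value.
    # on_mixed: "allow" (option 1), "die" (option 2), or "warn" (option 3);
    # exposing the choice as a parameter is option 4.
    out, sets_seen = [], set()
    for ch in s:
        d = unicodedata.digit(ch, None)          # None if ch has no digit value
        if d is None:
            out.append(ch)
            continue
        # Heuristic: digits of one set share the same implied "zero" position
        sets_seen.add(ord(ch) - d)
        out.append(str(d))
    if len(sets_seen) > 1 and on_mixed != "allow":
        msg = "string contains digits from mixed Unicode sets"
        if on_mixed == "die":
            raise ValueError(msg)                # croak
        print("warning:", msg)                   # carp
    return "".join(out)

print(unicode_digits_to_ascii("\u2463\u24F6"))   # prints "42"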

I'm not sure I see a problem.
You support numeric conversion from a range of scripts, which is to say, you are aware of the Unicode codepoints for their numeric characters.
If you find an unknown codepoint in your input data, it is an error.
It is up to you what you do in the event of an error; you may insert a space or underscore, or you may abort conversion. What you would do will depend on the environment in which your function executes; it is not something we can tell you.

My initial thought was #4, strictly because I like options. However, I changed my mind when I viewed your function.
The purpose of the function seems to be, simply, to get the resulting digits 0..9. Users may find it useful to send in mixed sets (a feature :). I'll use it.

If you ever have to handle input in bases greater than 10, you may end up having to treat many variants of the first six letters of the Latin alphabet ('ABCDEF') as digits, in all their forms.
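For instance, the fullwidth forms used in CJK text have compatibility decompositions to their ASCII counterparts, so one (hypothetical) approach in Python is to fold them with NFKC normalization before parsing:

import unicodedata

s = "\uFF21\uFF26"                            # "ＡＦ", fullwidth Latin letters
ascii_hex = unicodedata.normalize("NFKC", s)  # folds to plain "AF"
print(int(ascii_hex, 16))                     # prints 175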


Crystal Reports: attempting to link two tables by matching strings, with no luck

As stated in the title, I have two tables I'm attempting to link. Both strings appear to be a match, but Crystal Reports is not picking it up. The only thing I can think of is that the length of the field is different, even though the strings are the same. Could that cause a discrepancy? If so, how can I correct for it? Thank you
A difference in string length will prevent a match. If you are using the Trim(string) function, note that it only removes spaces found at the beginning or end of your string, so the two strings could still be of different lengths after using it. You will need another function to capture a substring of the original string: use the Left(string, length) function to ensure both strings are the same length.
If they still do not match, you may have non-printable characters in one or both of your strings. Carriage Return and Line Feed tend to be the most common. A Carriage Return is represented as Chr(13), while a Line Feed is represented as Chr(10). These are Built In Constants similar to those found in VBA and Visual Basic.
You can use a find-and-replace to remove them with the following formula. It's not a bad idea to include the Trim and Left functions in this as well, to ensure you get the best match possible.
Replace(Replace(Left(Trim({YourStringField}), 10),Chr(10), ""),Chr(13), "")
There are a few additional Built In Constants you may need to check for if this doesn't work; a Tab, for example, is represented as Chr(9). It's very rare for strings to contain the other Built In Constants, though. In most cases Carriage Return and Line Feed are the only ones typically found in plain text; Tabs and the other constants should only appear in rich text and are very rare in string data.
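Outside of Crystal, the underlying problem is easy to reproduce. Here is a small Python illustration of the same cleanup the formula above performs (the sample strings are made up):

a = "ACME Corp"
b = "ACME Corp\r\n "          # same text plus CR/LF and a trailing space
print(a == b)                 # False: the hidden characters break the match

# Mirrors Replace(Replace(Left(Trim(...), 10), ...), ...)
clean = b.strip().replace("\r", "").replace("\n", "")[:10]
print(a[:10] == clean)        # True after trimming, stripping CR/LF, truncating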

Ogg metadata - Vorbis Comment end

I want to implement a class to read Vorbis comments. I know that a field starts with a field name, followed by an equals sign and the value. But how does it end? The documentation makes me think that a semicolon ends the field, but I checked an Ogg file with a hex editor and I cannot see any.
This is how I think it should look in a file:
TITLE=MY SUPER TITLE;
The field name is TITLE, followed by the equals sign, and then the value is MY SUPER TITLE. Finally, the semicolon ends the field.
But instead, inside my file, the fields look like this:
TITLE=MY SUPER TITLE....
It's almost as above, but there is no semicolon. The dots are characters that cannot be displayed. I thought the dots might represent a value that says "this is the end of the field!", but they are almost always different. I noticed that there are always exactly 4 dots. The first dot always has a different value; the other three usually have a value of 0. But not always...
My question now: how does a field end? How do I read this comment?
Also, yes, I know that there are libraries and that I should use them instead of reinventing the wheel over and over again. I will use libraries later, but first I want to know how to do it myself. Educational purposes only.
Each field is preceded by a little-endian 32-bit integer that indicates the number of bytes to read. You then convert the bytes to a string via UTF-8.
See NVorbis' implementation (LoadComments(...)) for details.
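To make the layout concrete, here is a rough Python sketch (the names are mine, and it assumes buf already starts at the vendor-length field of the comment header, i.e. after the Ogg framing and the packet-type prefix have been skipped):

import struct

def read_vorbis_comments(buf):
    offset = 0
    (vendor_len,) = struct.unpack_from("<I", buf, offset)  # little-endian uint32
    offset += 4
    vendor = buf[offset:offset + vendor_len].decode("utf-8")
    offset += vendor_len
    (count,) = struct.unpack_from("<I", buf, offset)       # number of fields
    offset += 4
    fields = []
    for _ in range(count):
        (length,) = struct.unpack_from("<I", buf, offset)  # the four "dots"
        offset += 4
        fields.append(buf[offset:offset + length].decode("utf-8"))
        offset += length
    return vendor, fields

This also explains the four undisplayable bytes you saw: a little-endian length under 256 is one nonzero byte followed by three zero bytes, and it belongs to the next field rather than terminating the current one.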

How do I use length() for Unicode characters?

When working in the Moovweb SDK, length("çãêá") is expected to return 4, but instead returns 8. How can I ensure that the length function works correctly when using Unicode characters?
This is a common issue with Unicode characters and the length() function using the wrong character set. To fix it, you need to set the charset_determined variable to make sure the correct character set is used before calling length(), like so in your Tritium code:
$charset_determined = "utf-8"
# your call to length() here
In Unicode, there is no such thing as "the" length of a string or "the" number of characters; that notion comes from ASCII thinking.
You can choose from one of the following, depending what you exactly need:
For cursor movement, text selection and alike, grapheme clusters shall be used.
For limiting the length of a string in input fields, file formats, protocols, or databases, the length is measured in code units of some predetermined encoding. The reason is that any length limit is derived from the fixed amount of memory allocated for the string at a lower level, be it in memory, disk or in a particular data structure.
The size of the string as it appears on the screen is unrelated to the number of code points in the string; one has to communicate with the rendering engine for this. Even in monospace fonts and terminals, a code point does not necessarily occupy exactly one column; POSIX takes this into account (wcswidth()).
There is more info at http://utf8everywhere.org
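To see how the counts differ by level, here is a quick Python illustration (Python's len() counts code points, and the 8 that length() returned is most likely the UTF-8 byte count):

import unicodedata

s = "\u00e7\u00e3\u00ea\u00e1"        # "çãêá" with precomposed characters
print(len(s))                          # 4 code points
print(len(s.encode("utf-8")))          # 8 UTF-8 code units (bytes)

d = unicodedata.normalize("NFD", s)    # decomposed: base letters + combining marks
print(len(d))                          # 8 code points, yet still 4 grapheme clusters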

Displaying Unicode Characters

I already searched for answers to this sort of question here, and have found plenty of them -- but I still have this nagging doubt about the apparent triviality of the matter.
I have read this very interesting and helpful article on the subject: http://www.joelonsoftware.com/articles/Unicode.html, but it left me wondering how one would go about identifying individual glyphs given a buffer of Unicode data.
My questions are:
How would I go about parsing a Unicode string, say UTF-8?
Assuming I know the byte order, what happens when I encounter the beginning of a glyph that is supposed to be represented by 6 bytes?
That is, if I interpreted the method of storage correctly.
This is all related to a text display system I am designing to work with OpenGL.
I am storing glyph data in display lists and I need to translate the contents of a string to a sequence of glyph indexes, which are then mapped to display list indices (since, obviously, storing the entire glyph set in graphics memory is not always practical).
Having to represent every string as an array of shorts would require a significant amount of storage, considering everything I need to display.
Additionally, it seems to me that 2 bytes per character simply isn't enough to represent every possible Unicode element.
How would I go about parsing a Unicode string, say UTF-8?
I'm assuming that by "parsing", you mean converting to code points.
Often, you don't have to do that. For example, you can search for a UTF-8 string within another UTF-8 string without needing to care about what characters those bytes represent.
If you do need to convert to code points (UTF-32), then:
Check the first byte to see how many bytes are in the character.
Look at the trailing bytes of the character to ensure that they're in the range 80-BF. If not, report an error.
Use bit masking and shifting to convert the bytes to the code point.
Report an error if the byte sequence you got was longer than the minimum needed to represent the character.
Increment your pointer by the sequence length and repeat for the next character.
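A literal Python translation of those steps might look like this (a sketch only; a production decoder must also reject surrogates and code points above U+10FFFF):

def decode_codepoint(data, i):
    # Returns (code_point, next_index) for the UTF-8 sequence at data[i].
    b0 = data[i]
    if b0 < 0x80:                           # 1-byte (ASCII) sequence
        return b0, i + 1
    if 0xC0 <= b0 < 0xE0:                   # 2-byte sequence
        n, cp = 2, b0 & 0x1F
    elif 0xE0 <= b0 < 0xF0:                 # 3-byte sequence
        n, cp = 3, b0 & 0x0F
    elif 0xF0 <= b0 < 0xF8:                 # 4-byte sequence
        n, cp = 4, b0 & 0x07
    else:
        raise ValueError("invalid leading byte")
    if i + n > len(data):
        raise ValueError("truncated sequence")
    for b in data[i + 1:i + n]:
        if not 0x80 <= b <= 0xBF:           # trailing bytes must be in 80-BF
            raise ValueError("invalid continuation byte")
        cp = (cp << 6) | (b & 0x3F)
    if cp < (0x80, 0x800, 0x10000)[n - 2]:  # longer than the minimum needed
        raise ValueError("overlong encoding")
    return cp, i + n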
Additionally, it seems to me that 2 bytes per character simply isn't enough to represent every possible Unicode element.
It's not. Unicode was originally intended to be a fixed-width 16-bit encoding. It was later decided that 65,536 characters weren't enough, so UTF-16 was created, and Unicode was redefined to use code points between 0 and 1,114,111.
If you want a fixed-width encoding, you need 21 bits. But there aren't many languages that have a 21-bit integer type, so in practice you need 32 bits (UTF-32).
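For example, here is how UTF-16 squeezes a code point above U+FFFF into two 16-bit code units (a surrogate pair), sketched in Python:

cp = 0x1F600                     # a code point beyond the 16-bit range
v = cp - 0x10000                 # leaves 20 bits to split in half
hi = 0xD800 + (v >> 10)          # high (lead) surrogate
lo = 0xDC00 + (v & 0x3FF)        # low (trail) surrogate
print(hex(hi), hex(lo))          # 0xd83d 0xde00
print("\U0001F600".encode("utf-16-be").hex())  # d83dde00, the same pair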
Well, I think this answers it:
http://en.wikipedia.org/wiki/UTF-8
Why it didn't show up the first time I went searching, I have no idea.

Japanese COBOL Code: rules for G literals and identifiers?

We are processing IBM Enterprise Japanese COBOL source code. The rules that describe exactly what is allowed in G type literals, and what is allowed for identifiers, are unclear. The IBM manual indicates that a G'....' literal must have a SHIFT-OUT as the first character inside the quotes, and a SHIFT-IN as the last character before the closing quote. Our COBOL lexer "knows" this, but objects to G literals found in real code. Conclusion: the IBM manual is wrong, or we are misreading it. The customer won't let us see the code, so it is pretty difficult to diagnose the problem.
EDIT: Revised/extended the text below for clarity:
Does anyone know the exact rules of G literal formation, and how they (don't) match what the IBM reference manuals say? The ideal answer would be a regular expression for the G literal. This is what we are using now (coded by another author, sigh):
#token non_numeric_literal_quote_g [STRING]
"<G><squote><ShiftOut> (
(<NotLineOrParagraphSeparatorNorShiftInNorShiftOut>|<squote><squote>|<ShiftOut>)
(<NotLineOrParagraphSeparator>|<squote><squote>)
| <ShiftIn> ( <NotLineOrParagraphSeparatorNorApostropheNorShiftInNorShiftOut>|
<ShiftIn>|<ShiftOut>)
| <squote><squote>
)* <ShiftIn><squote>"
where <name> is a macro that expands to another regular expression. Presumably they are named well enough that you can guess what they contain.
Here is the IBM Enterprise COBOL Reference; chapter 3 "Character Strings", subheading "DBCS literals", page 32 is the relevant reading. I'm hoping that by providing the exact reference, an experienced IBMer can tell us how we misread it :-{ I'm particularly unclear on what the phrase "DBCS-characters" means when it says "one or more characters in the range X'00...X'FF for either byte". How can DBCS-characters be anything but pairs of 8-bit character codes?
The existing RE matches 3 types of pairs of characters if you examine it. One answer below suggests that the <squote><squote> pairing is wrong. OK, I might believe that, but that means the RE would only reject literal strings containing single <squote>s. I don't believe that's the problem we are having, as we seem to trip over every instance of a G literal.
Similarly, COBOL identifiers can apparently be composed with DBCS characters. What is allowed for an identifier, exactly? Again, a regular expression would be ideal.
EDIT2: I'm beginning to think the problem might not be the RE. We are reading Shift-JIS encoded text, and our reader converts that text to Unicode as it goes. But DBCS characters are really not Shift-JIS; rather, they are binary-coded data. Likely what is happening is that the DBCS data is getting translated as if it were Shift-JIS, and that mucks up the ability to recognize "two bytes" as a DBCS element. For instance, if a DBCS character pair were :81 :1F, a Shift-JIS reader would convert this pair into a single Unicode character, and its two-byte nature is then lost. If you can't count pairs, you can't find the end quote. If you can't find the end quote, you can't recognize the literal. So the problem would appear to be that we need to switch input-encoding modes in the middle of the lexing process. Yuk.
Try adding a single quote to your rule to see if it passes, by making this change:
<squote><squote> => <squote>{1,2}
If I remember correctly, one difference between N and G literals is that G allows a single quote. Your regular expression doesn't allow that.
EDIT: I thought you had all other DBCS literals working and were just having issues with the G-string, so I just pointed out the difference between N and G. Now I have taken a closer look at your RE, and it has problems. In the COBOL I used, you can mix ASCII with Japanese, for example:
G"ABC<ヲァィ>" (where < and > stand for shift-out/shift-in)
Your RE assumes DBCS only. I would loosen that restriction and try again.
I don't think it's possible to handle G literals entirely with a regular expression. There is no way to keep track of matching quotes and SO/SI with a finite state machine alone. Your RE is so complicated because it's trying to do the impossible. I would simplify it and take care of mismatched tokens manually (see the sketch at the end of this answer).
You could also face encoding issues. The code could be in EBCDIC (Katakana) or UTF-16; treating it as ASCII will not work. SO/SI are sometimes converted to 0x1E/0x1F on Windows.
I am just trying to help you shoot in the dark without seeing the actual code :)
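To make the "handle it manually" suggestion concrete, here is a rough Python sketch that scans a G literal over raw bytes (before any Shift-JIS-to-Unicode conversion), tracking SO/SI by hand. The byte values and the exact grammar are assumptions on my part, not taken from the IBM manual:

SO, SI = 0x0E, 0x0F   # shift-out / shift-in (0x1E/0x1F after some Windows conversions)

def scan_g_literal(data, i):
    # Expects: G, a quote, SO, DBCS byte pairs, SI, closing quote.
    if data[i] not in (ord("G"), ord("g")) or data[i + 1] not in (0x27, 0x22):
        return None
    quote = data[i + 1]
    j = i + 2
    if data[j] != SO:              # manual: SO must be first inside the quotes
        return None
    j += 1
    while j + 1 < len(data):
        if data[j] == SI and data[j + 1] == quote:
            return data[i:j + 2]   # SI + closing quote ends the literal
        j += 2                     # consume DBCS bytes strictly in pairs
    return None                    # unterminated literal

The key point is the j += 2: once the bytes have been run through a Shift-JIS decoder, the pairs can no longer be counted, which is exactly the failure mode described in EDIT2 above.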
Does <NotLineOrParagraphSeparatorNorApostropheNorShiftInNorShiftOut> also include single and double quotation marks, or just apostrophes? That would be a problem, as it would consume the literal closing character sequence >' ...
I would check the definition of all other macros to make sure. The only obvious problem that I can see is the <squote><squote> that you already seem to be aware of.