How do character expansions work under the hood?

I am reading CLR via C# by Jeffrey Richter. While explaining string comparison, he notes:
When the Compare method is not performing an ordinal comparison, it
performs character expansions. A character expansion is when a
character is expanded to multiple characters regardless of culture.
String s1 = "Strasse";
String s2 = "Straße";
Boolean eq;
CultureInfo ci = new CultureInfo("de-DE");
eq = String.Compare(s1, s2, true, ci) == 0; // returns true
For the above case, he notes:
...the German Eszet character ‘ß’ is always expanded to
‘ss’. So in the code example, the call to Compare will always
return 0 regardless of which culture I actually pass in to it.
I want to know where the runtime gets the rule that ß is equal to ss, or how it computes it.
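To make the quoted description concrete, here is a toy sketch in Python of what an expansion step before comparison could look like. It is only an illustration of the idea: the table `EXPANSIONS` and the functions `expand` and `compare_expanded` are hypothetical names, and the real runtime relies on its own linguistic sort data, which this sketch does not reproduce.

```python
# Hypothetical expansion table: one character maps to several characters.
# Only the ß example from the quoted text is included here.
EXPANSIONS = {"ß": "ss"}


def expand(text: str) -> str:
    """Replace every character that has an expansion with its expanded form."""
    return "".join(EXPANSIONS.get(ch, ch) for ch in text)


def compare_expanded(a: str, b: str, ignore_case: bool = False) -> int:
    """Return -1, 0, or 1 after expanding both strings (toy comparison)."""
    key_a, key_b = expand(a), expand(b)
    if ignore_case:
        key_a, key_b = key_a.lower(), key_b.lower()
    return (key_a > key_b) - (key_a < key_b)


print(compare_expanded("Strasse", "Straße", ignore_case=True))  # prints 0
```

An ordinal comparison, by contrast, would compare the code points directly and report the two strings as different.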

Related

Strange results when deleting all special characters from a string in Progress / OpenEdge

I have the code snippet below (as suggested in this previous Stack Overflow answer ... Deleting all special characters from a string in progress 4GL) which is attempting to remove all extended characters from a string so that I may transmit it to a customer's system which will not accept any extended characters.
do v-int = 128 to 255:
  assign v-string = replace(v-string,chr(v-int),"").
end.
It is working perfectly with one exception (which makes me fear there may be others I have not caught). When it gets to 255, it will replace all 'y's in the string.
If I do the following ...
display chr(255) = chr(121). /* 121 is asc code of y */
I get true as the result.
And therefore, if I do the following ...
display replace("This is really strange",chr(255),"").
I get the following result:
This is reall strange
I have verified that 'y' is the only character affected by running the following:
def var v-string as char init "abcdefghijklmnopqrstuvwxyz".
def var v-int as int.
do v-int = 128 to 255:
  assign v-string = replace(v-string,chr(v-int),"").
end.
display v-string.
Which results in the following:
abcdefghijklmnopqrstuvwxz
I know I can fix this by removing 255 from the range but I would like to understand why this is happening.
Is this a character collation set issue or am I missing something simpler?
Thanks for any help!

This is a bug. Here's a Progress Knowledge Base article about it:
http://knowledgebase.progress.com/articles/Article/000046181
The workaround is to specify the codepage in the CHR() statement, like this:
CHR(255, "UTF-8", "1252")
Here it is in your example:
def var v-string as char init "abcdefghijklmnopqrstuvwxyz".
def var v-int as int.
do v-int = 128 to 255:
  assign v-string = replace(v-string, chr(v-int, "UTF-8", "1252"), "").
end.
display v-string.
You should now see the 'y' in the output.

This seems to be a bug!
The REPLACE() function returns an unexpected result when replacing character CHR(255) (ÿ) in a String.
The REPLACE() function modifies the value of the target character, but additionally it changes any occurrence of characters 'Y' and 'y' present in the String.
This behavior seems to affect only the character ÿ. Other characters are correctly changed by REPLACE().
This occurs when using the default codepage, ISO-8859-1.
Link to knowledgebase

Matching Unicode punctuation using LPeg

I am trying to create an LPeg pattern that would match any Unicode punctuation inside UTF-8 encoded input. I came up with the following marriage of Selene Unicode and LPeg:
local unicode = require("unicode")
local lpeg = require("lpeg")
local any = lpeg.P(1) -- matches any single byte
local punctuation = lpeg.Cmt(lpeg.Cs(any * any^-3), function(s, i, a)
  local match = unicode.utf8.match(a, "^%p")
  if match == nil then
    return false
  else
    return i + #match
  end
end)
This appears to work, but it has three problems. First, it will miss punctuation characters that are a combination of several Unicode code points (if such characters exist), because I only read up to 4 bytes ahead. Second, it probably hurts the performance of the parser. Third, it is undefined what the library's match function will do when I feed it a string that contains a truncated UTF-8 character (although it appears to work for now).
I would like to know whether this is a correct approach or if there is a better way to achieve what I am trying to achieve.

The correct way to match UTF-8 characters is shown in an example on the LPeg homepage. The first byte of a UTF-8 character determines how many more bytes are part of it:
local cont = lpeg.R("\128\191") -- continuation byte
local utf8 = lpeg.R("\0\127")
           + lpeg.R("\194\223") * cont
           + lpeg.R("\224\239") * cont * cont
           + lpeg.R("\240\244") * cont * cont * cont
Building on this utf8 pattern, we can use lpeg.Cmt and the Selene Unicode match function, much as you proposed:
local punctuation = lpeg.Cmt(lpeg.C(utf8), function (s, i, c)
  if unicode.utf8.match(c, "%p") then
    return i
  end
end)
Note that we return i. This is in accordance with what Cmt expects:
The given function gets as arguments the entire subject, the current position (after the match of patt), plus any capture values produced by patt. The first value returned by function defines how the match happens. If the call returns a number, the match succeeds and the returned number becomes the new current position.
This means we should return the same number the function receives, that is, the position immediately after the UTF-8 character.
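The same idea, that the lead byte decides how many bytes belong to the character, can be sketched outside LPeg. The following Python version is only an illustration under stated assumptions: `utf8_length` and `match_punctuation` are hypothetical names, and "punctuation" is taken to mean any Unicode general category starting with `P`, which may not coincide exactly with what `%p` matches in Selene Unicode.

```python
import unicodedata


def utf8_length(lead: int) -> int:
    """Return the sequence length implied by a lead byte, or 0 if it cannot start a character."""
    if lead <= 0x7F:
        return 1
    if 0xC2 <= lead <= 0xDF:
        return 2
    if 0xE0 <= lead <= 0xEF:
        return 3
    if 0xF0 <= lead <= 0xF4:
        return 4
    return 0  # continuation byte or invalid lead byte


def match_punctuation(data: bytes, pos: int = 0):
    """Return the position just after a punctuation character at pos, else None."""
    if pos >= len(data):
        return None
    length = utf8_length(data[pos])
    if length == 0:
        return None
    try:
        char = data[pos:pos + length].decode("utf-8")
    except UnicodeDecodeError:
        return None  # truncated or malformed sequence
    # Assumption: "punctuation" means general categories Pc, Pd, Ps, Pe, Pi, Pf, Po.
    if len(char) == 1 and unicodedata.category(char).startswith("P"):
        return pos + length
    return None
```

As in the Cmt callback, a successful match reports the position immediately after the character, and a truncated sequence is rejected instead of being passed on to the classifier.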

String to Integer (atoi) [Leetcode] gave wrong answer?

String to Integer (atoi)
The problem is to implement atoi, which converts a string to an integer.
When the test input is " +0 123",
my code returns 123.
But why is the expected answer 0?
======================
And when the test input is " +0123",
my code returns 123,
and the expected answer is also 123.
So is the expected answer for the first case wrong?

I think this is the expected result, given the stated requirements for atoi:
The function first discards as many whitespace characters as necessary until the first non-whitespace character is found. Then, starting from this character, takes an optional initial plus or minus sign followed by as many numerical digits as possible, and interprets them as a numerical value.
Your first test case has a space between two separate groups of digits, and atoi only considers the first group, which is '0', and converts it to an integer.
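The quoted requirements can be sketched in a few lines of Python. This is a minimal illustration rather than a full solution to the LeetCode problem: the name `my_atoi` is hypothetical, and any range clamping the problem may require is deliberately left out because the quoted text does not mention it.

```python
def my_atoi(s: str) -> int:
    """Parse leading whitespace, an optional sign, then as many digits as possible."""
    i, n = 0, len(s)

    # Discard leading whitespace.
    while i < n and s[i].isspace():
        i += 1

    # Take an optional plus or minus sign.
    sign = 1
    if i < n and s[i] in ("+", "-"):
        sign = -1 if s[i] == "-" else 1
        i += 1

    # Take digits until the first non-digit, which ends the number.
    value = 0
    while i < n and "0" <= s[i] <= "9":
        value = value * 10 + (ord(s[i]) - ord("0"))
        i += 1

    return sign * value


print(my_atoi(" +0 123"))  # prints 0: the space after "0" ends the number
print(my_atoi(" +0123"))   # prints 123
```

The digit loop stops at the first character that is not a digit, so anything after the space in the first input is never read.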

How can we check whether a string contains any letters in Scala?

My application takes in a string like this:
k0qVsfpz7_cG9n75OjZCCA
P700058213111115432196
1700058213111115432196
I need to check in a Scala script whether the string contains any alphabetic character.

Consider the exists method on a string, which tests each character against a given predicate and returns true if any character satisfies it. For instance, Char.isLetter is true only if the given character is a letter. Hence
"P700058213111115432196".exists(_.isLetter)
Boolean = true
and
"700058213111115432196".exists(_.isLetter)
Boolean = false
Similarly, with forall we can verify that every character in a string satisfies a predicate, for instance
"P700058213111115432196".forall(_.isDigit)
Boolean = false
and
"700058213111115432196".forall(_.isDigit)
Boolean = true
Note that both exists and forall iterate over a collection. Here we iterate over a Scala string, which is treated as a sequence of Char.
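For comparison, the same two checks can be sketched in Python, where `any` plays the role of exists and `all` plays the role of forall. The function names `contains_letter` and `all_digits` are hypothetical and chosen only for this illustration.

```python
def contains_letter(s: str) -> bool:
    """True if at least one character is a letter (like exists(_.isLetter))."""
    return any(ch.isalpha() for ch in s)


def all_digits(s: str) -> bool:
    """True if every character is a digit (like forall(_.isDigit))."""
    return all(ch.isdigit() for ch in s)


print(contains_letter("P700058213111115432196"))  # prints True
print(contains_letter("700058213111115432196"))   # prints False
print(all_digits("P700058213111115432196"))       # prints False
print(all_digits("700058213111115432196"))        # prints True
```

As with forall in Scala, `all` over an empty string is true, so an empty input may need a separate check.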

Need code for removing all unicode characters in vb6

I need code for removing all Unicode characters from a VB6 string.

If this is UTF-16 text (as normal VB6 String values all are) and you can ignore the issue of surrogate pairs, then this is fairly quick and reasonably concise:
Private Sub DeleteNonAscii(ByRef Text As String)
    Dim I As Long
    Dim J As Long
    Dim Char As String
    I = 1
    For J = 1 To Len(Text)
        Char = Mid$(Text, J, 1)
        If (AscW(Char) And &HFFFF&) <= &H7F& Then
            Mid$(Text, I, 1) = Char
            I = I + 1
        End If
    Next
    Text = Left$(Text, I - 1)
End Sub
This includes the workaround for the unfortunate choice VB6 made in returning a signed 16-bit integer from the AscW() function. It should have been a Long for symmetry with ChrW$(), but it is what it is.
It should beat the pants off any regular expression library in clarity, maintainability, and performance. If better performance is required for truly massive amounts of text then SAFEARRAY or CopyMemory stunts could be used.
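The masking step is easier to see in isolation. The Python sketch below is only an illustration of the reasoning, not a translation of the VB6 routine: `to_unsigned16` and `delete_non_ascii` are hypothetical names, and because Python strings are sequences of code points, the surrogate-pair caveat mentioned above does not arise here.

```python
def to_unsigned16(value: int) -> int:
    """Reinterpret a signed 16-bit value as unsigned, like `And &HFFFF&` in VB6."""
    return value & 0xFFFF


def delete_non_ascii(text: str) -> str:
    """Keep only characters whose code point is at most 0x7F."""
    return "".join(ch for ch in text if ord(ch) <= 0x7F)


# A code unit such as 0x8000 would come back from a signed 16-bit API as -32768.
print(to_unsigned16(-32768))       # prints 32768
print(delete_non_ascii("Straße"))  # prints Strae
```

Without the mask, a negative value would pass the `<= &H7F&` test and a non-ASCII character would be kept by mistake.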

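' VB.NET version; requires Imports System.Text.RegularExpressions
Public Shared Function StripUnicodeCharactersFromString(ByVal inputValue As String) As String
    ' Remove every character outside the ASCII range U+0000 to U+007F
    Return Regex.Replace(inputValue, "[^\u0000-\u007F]", String.Empty)
End Function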
For VB6, I am not sure whether the following will work:
sRTF = "\u" & CStr(AscW(char))
You could do this for all char values above 127.

StrConv is the command for converting strings.
StrConv Function
Returns a Variant (String) converted as specified.
Syntax
StrConv(string, conversion, LCID)
The StrConv function syntax has these named arguments:
Part        Description
string      Required. String expression to be converted.
conversion  Required. Integer. The sum of values specifying the type of conversion to perform. `128` is Unicode to local code page (or whatever the optional LCID is)
LCID        Optional. The LocaleID, if different than the system LocaleID. (The system LocaleID is the default.)
