Strange results when deleting all special characters from a string in Progress / OpenEdge - progress-4gl

I have the code snippet below (as suggested in this previous Stack Overflow answer ... Deleting all special characters from a string in progress 4GL) which is attempting to remove all extended characters from a string so that I may transmit it to a customer's system which will not accept any extended characters.
do v-int = 128 to 255:
assign v-string = replace(v-string,chr(v-int),"").
end.
It is working perfectly with one exception (which makes me fear there may be others I have not caught). When it gets to 255, it will replace all 'y's in the string.
If I do the following ...
display chr(255) = chr(121). /* 121 is asc code of y */
I get true as the result.
And therefore, if I do the following ...
display replace("This is really strange",chr(255),"").
I get the following result:
This is reall strange
I have verified that 'y' is the only character affected by running the following:
def var v-string as char init "abcdefghijklmnopqrstuvwxyz".
def var v-int as int.
do v-int = 128 to 255:
assign v-string = replace(v-string,chr(v-int),"").
end.
display v-string.
Which results in the following:
abcdefghijklmnopqrstuvwxz
I know I can fix this by removing 255 from the range but I would like to understand why this is happening.
Is this a character collation set issue or am I missing something simpler?
Thanks for any help!

This is a bug. Here's a Progress Knowledge Base article about it:
http://knowledgebase.progress.com/articles/Article/000046181
The workaround is to specify the codepage in the CHR() statement, like this:
CHR(255, "UTF-8", "1252")
Here it is in your example:
def var v-string as char init "abcdefghijklmnopqrstuvwxyz". def var v-int as int.
do v-int = 128 to 255:
assign v-string = replace(v-string, chr(v-int, "UTF-8", "1252"), "").
end.
display v-string.
You should now see the 'y' in the output.

This seems to be a bug!
The REPLACE() function returns an unexpected result when replacing character CHR(255) (ÿ) in a String.
The REPLACE() function modifies the value of the target character, but additionally it changes any occurrence of characters 'Y' and 'y' present in the String.
This behavior seems to affect only the character ÿ. Other characters are correctly changed by REPLACE().
Using default codepage ISO-8859-1
Link to knowledgebase

Related

How do I parse out a number from this returned XML string in python?

I have the following string:
{\"Id\":\"135\",\"Type\":0}
The number in the Id field will vary, but will always be an integer with no comma separator. I'm not sure how to get just that value from that string given that it's string data type and not real "XML". I was toying with the replace() function, but the special characters are making it more complex than it seems it needs to be.
is there a way to convert that to XML or something that I can reference the Id value directly?
Maybe use a regular expression, e.g.
import re
txt = "{\"Id\":\"135\",\"Type\":0}"
x = re.search('"Id":"([0-9]+)"', txt)
if x:
print(x.group(1))
gives
135
It is assumed here that the ids are numeric and consist of at least one digit.
Non-regex answer as you asked
\" is an escape sequence in python.
So if {\"Id\":\"135\",\"Type\":0} is a raw string and if you put it into a python variable like
a = '{\"Id\":\"135\",\"Type\":0}'
gives
>>> a
'{"Id":"135","Type":0}'
OR
If the above string is python string which has \" which is already escaped, then do a.replace("\\","") which will give you the string without \.
Now just load this string into a dict and access element Id like below.
import json
d = json.loads(a)
d['Id']
Output :
135

Matching Unicode punctuation using LPeg

I am trying to create an LPeg pattern that would match any Unicode punctuation inside UTF-8 encoded input. I came up with the following marriage of Selene Unicode and LPeg:
local unicode = require("unicode")
local lpeg = require("lpeg")
local punctuation = lpeg.Cmt(lpeg.Cs(any * any^-3), function(s,i,a)
local match = unicode.utf8.match(a, "^%p")
if match == nil
return false
else
return i+#match
end
end)
This appears to work, but it will miss punctuation characters that are a combination of several Unicode codepoints (if such characters exist), as I am reading only 4 bytes ahead, it probably kills the performance of the parser, and it is undefined what the library match function will do, when I feed it a string that contains a runt UTF-8 character (although it appears to work now).
I would like to know whether this is a correct approach or if there is a better way to achieve what I am trying to achieve.
The correct way to match UTF-8 characters is shown in an example in the LPeg homepage. The first byte of a UTF-8 character determines how many more bytes are a part of it:
local cont = lpeg.R("\128\191") -- continuation byte
local utf8 = lpeg.R("\0\127")
+ lpeg.R("\194\223") * cont
+ lpeg.R("\224\239") * cont * cont
+ lpeg.R("\240\244") * cont * cont * cont
Building on this utf8 pattern we can use lpeg.Cmt and the Selene Unicode match function kind of like you proposed:
local punctuation = lpeg.Cmt(lpeg.C(utf8), function (s, i, c)
if unicode.utf8.match(c, "%p") then
return i
end
end)
Note that we return i, this is in accordance with what Cmt expects:
The given function gets as arguments the entire subject, the current position (after the match of patt), plus any capture values produced by patt. The first value returned by function defines how the match happens. If the call returns a number, the match succeeds and the returned number becomes the new current position.
This means we should return the same number the function receives, that is the position immediately after the UTF-8 character.

Need code for removing all unicode characters in vb6

I need code for removing all unicode characters in a vb6 string.
If this is UTF-16 text (as normal VB6 String values all are) and you can ignore the issue of surrogate pairs, then this is fairly quick and reasonably concise:
Private Sub DeleteNonAscii(ByRef Text As String)
Dim I As Long
Dim J As Long
Dim Char As String
I = 1
For J = 1 To Len(Text)
Char = Mid$(Text, J, 1)
If (AscW(Char) And &HFFFF&) <= &H7F& Then
Mid$(Text, I, 1) = Char
I = I + 1
End If
Next
Text = Left$(Text, I - 1)
End Sub
This has the workaround for the unfortunate choice VB6 had to make in returning a signed 16-bit integer from the AscW() function. It should have been a Long for symmatry with ChrW$() but it is what it is.
It should beat the pants off any regular expression library in clarity, maintainability, and performance. If better performance is required for truly massive amounts of text then SAFEARRAY or CopyMemory stunts could be used.
Public Shared Function StripUnicodeCharactersFromString(ByVal inputValue As String) As String
Return Regex.Replace(inputValue, "[^\u0000-\u007F]", String.Empty)
End Function
Vb6 - not sure will
sRTF = "\u" & CStr(AscW(char))
work? - You could do this for all char values above 127
StrConv is the command for converting strings.
StrConv Function
Returns a Variant (String) converted as specified.
Syntax
StrConv(string, conversion, LCID)
The StrConv function syntax has these named arguments:
Part Description
string Required. String expression to be converted.
conversion Required. Integer. The sum of values specifying the type of conversion to perform. `128` is Unicode to local code page (or whatever the optional LCID is)
LCID Optional. The LocaleID, if different than the system LocaleID. (The system LocaleID is the default.)

String conversion in matlab doesn't work with int values

I'm parsing longstrings in matlab and whenever I use str2num with an int it doesn't work, it outputs a weird Chinese or Greek symbol instead.
satrec.satnum = str2num(longstr1(3:7));
I checked by outputting it as a string, it works properly but I won't be able to use it in my calculations later on if I don't manage to change it to an int. The characters 3 to 7 of my string are ints (ex : 8188). As it appears to work if my strings are doubles, I tried this :
satrec.satnum = longstr1(3:7);
satrec.satnum = strcat(satrec.satnum,'.0');
satrec.satnum = str2num(satrec.satnum);
fprintf('satellite number : %s\n',satrec.satnum);
But it outputs the same weird symbol. Does anyone know what I can do ?
This looks like NORAD 2-line element data. In that case the file encoding is US-ASCII or effectively UTF-8 since no non-ASCII characters should be present.
Your problem appears to be in this line:
fprintf('satellite number : %s\n',satrec.satnum);
satrec.satnum is an integer, but you are printing it with a %s character in the format string, so Matlab is interpreting it as a string. Replace this with
fprintf('satellite number : %d\n',satrec.satnum);
and you get the correct result.
Edited to add
Matlab has in fact converted the string to an int correctly!
I tried running the the code you provided along with your example, and am unable to reproduce the problem you described:
longstr1='1 28895U 05043F 14195.24580016 .00000503 00000-0 10925-3 0 8188';
satrec.satnum = str2num(longstr1(3:7))
satrec =
satnum: 28895
In any case, I'd suggest using something like textscan or dlmread:
Data = textscan(longstr1,'%u8 %u16 %c %u16 %c %f %f %u16-%u8 %u16-%u8 %u8 %u16', 'delimiter', '')
Data =
Columns 1 through 9
[1] [28895] 'U' [5043] 'F' [1.4195e+04] [5.0300e-06] [0] [0]
Columns 10 through 13
[10925] [3] [0] [8188]
In the above example I guessed some of the data-types so you should update them for your use.
As you can see, this code works on a string. If however you provide it with fileID it will read all the lines in the file (see documentation for textscan) using this template.
On a side note: I noticed that char(28895) outputs a Chinese character.

How to convert Unicode characters to escape codes

So, I have a bunch of strings like this: {\b\cf12 よろてそ } . I'm thinking I could iterate over each character and replace any unicode (Edit: Anything where AscW(char) > 127 or < 0) with a unicode escape code (\u###). However, I'm not sure how to programmatically do so. Any suggestions?
Clarification:
I have a string like {\b\cf12 よろてそ } and I want a string like {\b\cf12 [STUFF]}, where [STUFF] will display as よろてそ when I view the rtf text.
You can simply use the AscW() function to get the correct value:-
sRTF = "\u" & CStr(AscW(char))
Note unlike other escapes for unicode, RTF uses the decimal signed short int (2 bytes) representation for a unicode character. Which makes the conversion in VB6 really quite easy.
Edit
As MarkJ points out in a comment you would only do this for characters outside of 0-127 but then you would also need to give some other characters inside the 0-127 range special handling as well.
Another more roundabout way, would be to add the MSScript.OCX to the project and interface with VBScript's Escape function. For example
Sub main()
Dim s As String
s = ChrW$(&H3088) & ChrW$(&H308D) & ChrW$(&H3066) & ChrW$(&H305D)
Debug.Print MyEscape(s)
End Sub
Function MyEscape(s As String) As String
Dim scr As Object
Set scr = CreateObject("MSScriptControl.ScriptControl")
scr.Language = "VBScript"
scr.Reset
MyEscape = scr.eval("escape(" & dq(s) & ")")
End Function
Function dq(s)
dq = Chr$(34) & s & Chr$(34)
End Function
The Main routine passes in the original Japanese characters and the debug output says:
%u3088%u308D%u3066%u305D
HTH