How do I convert strings to title case in OpenEdge ABL / Progress 4GL?

How do I convert a string to title case in OpenEdge ABL (aka Progress 4GL)?
I know I can get upper case with CAPS(), and lower case with LC(), but I can't find the title case (sometimes called proper case) function.
Examples:
Input Output
------------ ------------
hello world! Hello World!
HELLO WORLD! Hello World!

function titleWord returns character ( input inString as character ):
return caps( substring( inString, 1, 1 )) + lc( substring( inString, 2 )).
end.
function titleCase returns character ( input inString as character ):
define variable i as integer no-undo.
define variable n as integer no-undo.
define variable outString as character no-undo.
n = num-entries( inString, " " ).
do i = 1 to n:
outString =
outString +
( if i > 1 and i <= n then " " else "" ) +
titleWord( entry( i, inString, " " ))
.
end.
return outString.
end.
display
titleCase( "the quick brown fox JUMPED over the lazy dog!" ) format "x(60)"
.

I think the order of one of those statements above is incorrect -
You'll be adding an extra " " at the beginning of the string! Also need to change the <= to < or you'll be tacking an extra " " into your return string.
It should be:
n = num-entries( inString, " " ).
do i = 1 to n:
outString =
outString +
titleWord( entry( i, inString, " " )) +
( if i < n then " " else "" )
.
end.
At least that's what I -think- it should be...
-Me

I was playing around with this a while back, and besides a solution similar to Tom's, I came up with two variations.
One of the problems I had was that not all words are separated by space, such as Run-Time and Read/Write, so I wrote this version to use any non-alphabetic characters as separators.
I also wanted to count diacritics and accented characters as alphabetic, so it became a little complicated. To solve the problem I create two versions of the title, one upper and one lower case. Where the two strings are the same, it's a non-alphabetic character, where they are different, it's alphabetical. Titles are usually very short, so this method is not as inefficient as might seem at first.
FUNCTION TitleCase2 RETURNS CHARACTER
( pcText AS CHARACTER ) :
/*------------------------------------------------------------------------------
Purpose: Converts a string to Title Case.
Notes: This version takes all non-alphabetic characters as word separators
at the expense of a little speed. This affects things like
D'Arby vs D'arby or Week-End vs Week-end.
------------------------------------------------------------------------------*/
DEFINE VARIABLE cUText AS CHARACTER NO-UNDO CASE-SENSITIVE.
DEFINE VARIABLE cLText AS CHARACTER NO-UNDO CASE-SENSITIVE.
DEFINE VARIABLE i AS INTEGER NO-UNDO.
DEFINE VARIABLE lFound AS LOGICAL NO-UNDO INITIAL TRUE.
cUText = CAPS(pcText).
cLText = LC(pcText).
DO i = 1 TO LENGTH(pcText):
IF (SUBSTRING(cUText, i, 1)) <> (SUBSTRING(cLText, i, 1)) THEN
DO:
IF lFound THEN
DO:
SUBSTRING(cLText, i, 1) = (SUBSTRING(cUText, i, 1)).
lFound = FALSE.
END.
END.
ELSE lFound = TRUE.
END.
RETURN cLText.
END FUNCTION.
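For readers who don't have an ABL session handy, here is a rough Python sketch of the same upper/lower comparison trick. The function name is made up for illustration, and it compares character by character rather than building two full strings, but the idea is the same: a character whose upper and lower forms differ is a letter, anything else is a separator.

```python
def title_case_any_separator(text: str) -> str:
    """Capitalize the first letter that follows any non-alphabetic character."""
    out = []
    at_word_start = True  # plays the role of lFound in TitleCase2
    for ch in text:
        upper, lower = ch.upper(), ch.lower()
        if upper != lower:
            # A cased letter, accented ones included.
            out.append(upper if at_word_start else lower)
            at_word_start = False
        else:
            # Digit, space or punctuation: treat it as a word separator.
            out.append(ch)
            at_word_start = True
    return "".join(out)


print(title_case_any_separator("run-time READ/write d'arby \u00e9lan"))
# Run-Time Read/Write D'Arby Élan
```

Note that digits count as separators too, so "3rd" comes out as "3Rd".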
Another issue is that title case is supposed to be language specific, i.e. verbs and nouns are treated differently to prepositions and conjunctions. These are some possible rules for title case:
First and last word always get capitalized
Capitalize all nouns, verbs (including "is" and other forms of "to
be"), adverbs (including "than" and "when"), adjectives (including
"this" and "that"), and pronouns (including "its").
Capitalize prepositions that are part of a verb phrase.
Lowercase articles (a, an, the).
Lowercase coordinate conjunctions (and, but, for, nor, or).
Lowercase prepositions of four or fewer letters.
Lowercase "to" in an infinitive phrase.
Capitalize the second word in compound words if it is a noun or
proper adjective or the words have equal weight (Cross-Reference,
Pre-Microsoft Software, Read/Write Access, Run-Time). Lowercase the
second word if it is another part of speech or a participle
modifying the first word (How-to, Take-off).
I could of course not code all this without teaching the computer English, so I created this version as a simple if crude compromise; it works in most cases, but there are exceptions.
FUNCTION TitleCaseE RETURNS CHARACTER
( pcText AS CHARACTER ) :
/*------------------------------------------------------------------------------
Purpose: Converts an English string to Title Case.
Notes:
------------------------------------------------------------------------------*/
DEFINE VARIABLE i AS INTEGER NO-UNDO.
DEFINE VARIABLE cWord AS CHARACTER NO-UNDO.
DEFINE VARIABLE lFound AS LOGICAL NO-UNDO INITIAL TRUE.
DEFINE VARIABLE iLast AS INTEGER NO-UNDO.
DEFINE VARIABLE cSmallWords AS CHARACTER NO-UNDO
INITIAL "and,but,or,for,nor,the,a,an,to,amid,anti,as,at,but,by,down,from,in" +
",into,like,near,of,off,on,onto,over,per,than,to,up,upon,via,with".
pcText = REPLACE(REPLACE(LC(pcText),"-"," - "),"/"," / ").
iLast = NUM-ENTRIES(pcText, " ").
DO i = 1 TO iLast:
cWord = ENTRY(i, pcText, " ").
IF LENGTH(cWord) > 0 THEN
IF i = 1 OR i = iLast OR LOOKUP(cWord, cSmallWords) = 0 THEN
ENTRY(i, pcText, " ") = CAPS(SUBSTRING(cWord, 1, 1)) + LC(SUBSTRING(cWord, 2)).
END.
RETURN REPLACE(REPLACE(pcText," - ","-")," / ","/").
END FUNCTION.
I have to mention that Tom's solution is very much faster than both of mine. Depending on what you need, you may find that the speed is not that important, since you're unlikely to use this in large data crunching processes or with long strings, but I wouldn't ignore it. Make sure that your needs justify the performance loss.

Related

Syntax for Returning One Character of String by Index

I am attempting to compare one character of a string to see if it is my delimiter character. However, when I execute the following code, the value placed in the variable valstring is the byte's numeric value converted to a string, not the character itself. For example, the value may be the string '58'.
Through my testing in CoDeSys using the debugging features I know that the string sReadLine contains a valid string of characters. I'm just not sure of the syntax to single only one of them out; the sReadLine[valPos + i] part is what I don't understand.
sReadLine : STRING;
valstring : STRING;
i : INT;
valPos : INT;
FOR i := 0 TO 20 DO
IF BYTE_TO_STRING(sReadLine[valPos + i]) = '"' THEN
EXIT;
END_IF
valstring := CONCAT(STR1 := valstring, STR2 := BYTE_TO_STRING(sReadLine[valPos + i]));
END_FOR
I think you have multiple choices.
1) Use built-in string functions instead. You can use the MID function to get part of a string. So in your case, something like "get one character at position valPos + i from sReadLine":
FOR i := 0 TO 20 DO
IF MID(sReadLine, 1, valPos + i) = '"' THEN
EXIT;
END_IF
valstring := CONCAT(STR1 := valstring, STR2 := MID(sReadLine, 1, valPos + i));
END_FOR
2) Convert the ASCII byte to a string. In TwinCAT systems, there is a function F_ToCHR. It takes an ASCII byte and returns the character as a string. I can't find anything like that for Codesys, but I'm sure there is a solution in some library. So please note that this won't work in Codesys without modifications:
FOR i := 0 TO 20 DO
IF F_ToCHR(sReadLine[valPos + i]) = '"' THEN
EXIT;
END_IF
valstring := CONCAT(STR1 := valstring, STR2 := F_ToCHR(sReadLine[valPos + i]));
END_FOR
3) The OSCAT library seems to have a CHR_TO_STRING function. You could use this instead of F_ToCHR in step 2.
4) You can use pointers to copy the ASCII byte to a string array (MemCpy) and add a string end character. This needs some knowledge of pointers etc. See Codesys forum for some example.
5) You can write a helper function similar to step 2 yourself. Check the example from the Codesys forums. That example doesn't include all characters, so it needs to be updated. It's not very elegant.
When you convert a byte to a string with BYTE_TO_STRING, what gets converted is the numeric value of the byte.
The byte is not interpreted as an ASCII character; you get its decimal value as text (the ASCII decimal value of : is 58, so you get '58').
So if you want to concatenate characters instead of their ASCII decimal representation, you need another function:
valstring := CONCAT(STR1 := valstring, STR2 := F_ToCHR(sReadLine[valPos + i]));
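The difference between the two conversions is easy to see outside the PLC. This is only a Python illustration of the idea, not Codesys code: `str` behaves like BYTE_TO_STRING here, and `chr` behaves like the F_ToCHR-style function you are after.

```python
byte_value = ord(":")               # the byte stored in the string, 58

as_number_text = str(byte_value)    # decimal digits of the byte value
as_character = chr(byte_value)      # the character the byte stands for

print(as_number_text)  # 58
print(as_character)    # :
```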
EDIT:
Like Quirzo, I couldn't find a similar F_ToCHR function for Codesys, but you could easily build one yourself.
For example:
Declaration Part:
FUNCTION F_ASCII_TO_STRING : STRING
VAR_INPUT
input : BYTE;
END_VAR
VAR
ascii : ARRAY[0..255] OF STRING(1):=
[
33(' '),'!','"','#',
'$$' ,'%' ,'&' ,'$'',
'(' ,')' ,'*' ,'+' ,
',' ,'-' ,'.' ,'/' ,
'0' ,'1' ,'2' ,'3' ,
'4' ,'5' ,'6' ,'7' ,
'8' ,'9' ,':' ,';' ,
'<' ,'=' ,'>' ,'?' ,
'@' ,'A' ,'B' ,'C' ,
'D' ,'E' ,'F' ,'G' ,
'H' ,'I' ,'J' ,'K' ,
'L' ,'M' ,'N' ,'O' ,
'P' ,'Q' ,'R' ,'S' ,
'T' ,'U' ,'V' ,'W' ,
'X' ,'Y' ,'Z' ,'[' ,
'\' ,']' ,'^' ,'_' ,
'`' ,'a' ,'b' ,'c' ,
'd' ,'e' ,'f' ,'g' ,
'h' ,'i' ,'j' ,'k' ,
'l' ,'m' ,'n' ,'o' ,
'p' ,'q' ,'r' ,'s' ,
't' ,'u' ,'v' ,'w' ,
'x' ,'y' ,'z' ,'{' ,
'|' ,'}' ,'~'
];
END_VAR
Implementation part:
F_ASCII_TO_STRING := ascii[input];
As Sergey said, this might not be an optimal solution to your problem. It seems like you want to extract the longest substring not containing any character " from initial input sReadLine to valstring, starting from position valPos.
In your implementation, for each valid input character, CONCAT() needs to search for the end of valstring, before appending only 1 character to it.
You should rather decompose your problem and use two standard functions to be optimal:
FIND() --> to get the position of the next character " (or to know if there is none),
MID() --> to create a string from initial position up to before the first character " (or the end of the input string).
That way, there remain only two loops, each one hidden inside these functions.
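The find-then-slice idea looks roughly like this in Python. Treat it as a sketch of the logic only: the function name is hypothetical, and Python's `find` accepts a start offset, which the IEC FIND function may not, so in Structured Text you would probably have to cut off the part before valPos first.

```python
def read_until_quote(line: str, start: int) -> str:
    """Return the text from 1-based position `start` up to the next '"'."""
    begin = start - 1               # PLC string positions are 1-based
    end = line.find('"', begin)     # -1 when there is no quote left
    if end == -1:
        end = len(line)             # no delimiter: take the rest of the line
    return line[begin:end]


print(read_until_quote('key="value" more', 6))  # value
print(read_until_quote("abc", 2))               # bc
```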

Matching Unicode punctuation using LPeg

I am trying to create an LPeg pattern that would match any Unicode punctuation inside UTF-8 encoded input. I came up with the following marriage of Selene Unicode and LPeg:
local unicode = require("unicode")
local lpeg = require("lpeg")
local any = lpeg.P(1) -- matches any single byte
local punctuation = lpeg.Cmt(lpeg.Cs(any * any^-3), function(s,i,a)
local match = unicode.utf8.match(a, "^%p")
if match == nil then
return false
else
return i+#match
end
end)
This appears to work, but I see three problems. It will miss punctuation characters that are a combination of several Unicode code points (if such characters exist), because I am reading only 4 bytes ahead. It probably kills the performance of the parser. And it is undefined what the library match function will do when I feed it a string that contains a runt UTF-8 character (although it appears to work now).
I would like to know whether this is a correct approach or if there is a better way to achieve what I am trying to achieve.
The correct way to match UTF-8 characters is shown in an example on the LPeg homepage. The first byte of a UTF-8 character determines how many more bytes are part of it:
local cont = lpeg.R("\128\191") -- continuation byte
local utf8 = lpeg.R("\0\127")
+ lpeg.R("\194\223") * cont
+ lpeg.R("\224\239") * cont * cont
+ lpeg.R("\240\244") * cont * cont * cont
Building on this utf8 pattern we can use lpeg.Cmt and the Selene Unicode match function kind of like you proposed:
local punctuation = lpeg.Cmt(lpeg.C(utf8), function (s, i, c)
if unicode.utf8.match(c, "%p") then
return i
end
end)
Note that we return i; this is in accordance with what Cmt expects:
The given function gets as arguments the entire subject, the current position (after the match of patt), plus any capture values produced by patt. The first value returned by function defines how the match happens. If the call returns a number, the match succeeds and the returned number becomes the new current position.
This means we should return the same number the function receives, that is the position immediately after the UTF-8 character.
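If it helps to see the lead-byte rule outside LPeg, here is a small Python sketch that uses the same byte ranges as the utf8 pattern. The function names are made up, and `unicodedata` general category P* is only a stand-in for `%p`; I would not assume the two classify every character identically.

```python
import unicodedata


def utf8_sequence_length(lead: int) -> int:
    """Number of bytes in the UTF-8 sequence that starts with `lead`."""
    if lead <= 127:
        return 1
    if 194 <= lead <= 223:
        return 2
    if 224 <= lead <= 239:
        return 3
    if 240 <= lead <= 244:
        return 4
    raise ValueError(f"{lead} cannot start a UTF-8 sequence")


def split_utf8(data: bytes) -> list[str]:
    """Cut UTF-8 bytes into characters using only the lead byte."""
    chars, pos = [], 0
    while pos < len(data):
        size = utf8_sequence_length(data[pos])
        chars.append(data[pos:pos + size].decode("utf-8"))
        pos += size
    return chars


def punctuation_in(data: bytes) -> list[str]:
    """Keep the characters whose general category starts with P."""
    return [c for c in split_utf8(data)
            if unicodedata.category(c).startswith("P")]


sample = "a\u00bf\u20ac!".encode("utf-8")   # a, inverted question mark, euro, !
print(split_utf8(sample))
print(punctuation_in(sample))
```

The euro sign is a currency symbol, not punctuation, so only the inverted question mark and the exclamation mark survive the filter.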

Deleting all special characters from a string in progress 4GL

How can I delete all special characters from a string in Progress 4GL?
I guess this depends on your definition of special characters.
You can remove ANY character with REPLACE. Simply set the to-string part of replace to blank ("").
Syntax:
REPLACE ( source-string , from-string , to-string )
Example:
DEFINE VARIABLE cOldString AS CHARACTER NO-UNDO.
DEFINE VARIABLE cNewString AS CHARACTER NO-UNDO.
cOldString = "ABC123AACCC".
cNewString = REPLACE(cOldString, "A", "").
DISPLAY cNewString FORMAT "x(10)".
You can use REPLACE to remove a complete matching string. For example:
REPLACE("This is a text with HTML entity &amp;", "&amp;", "").
Handling "special characters" can be done in a number of ways. If you mean special "ASCII" characters like linefeed, bell and so on you can use REPLACE together with the CHR function.
Basic syntax (you could add some information about code pages as well but that's rarely needed) :
CHR( expression )
expression: An expression that yields an integer value that you want to convert to a character value (the numeric character code).
So if you want to remove every Swedish letter Ö (code 214 in ISO 8859-1; it is outside 7-bit ASCII) from a text you could do:
REPLACE("ABCDEFGHIJKLMNOPQRSTUVWXYZÅÄÖ", "Ö", "").
or
REPLACE("ABCDEFGHIJKLMNOPQRSTUVWXYZÅÄÖ", CHR(214), "").
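One caveat on the numeric form: 214 is not a 7-bit ASCII value, so which character it maps to depends on the code page. A quick Python check, purely as an illustration (I am assuming an ISO 8859-1 style code page on the ABL side):

```python
code = 214

latin1_char = bytes([code]).decode("latin-1")   # the letter at 214 in ISO 8859-1
utf8_bytes = latin1_char.encode("utf-8")        # the same letter takes two bytes in UTF-8

try:
    bytes([code]).decode("ascii")
    is_ascii = True
except UnicodeDecodeError:
    is_ascii = False                            # 214 is above the ASCII range

print(latin1_char, len(utf8_bytes), is_ascii)
```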
Putting this together you could build an array of unwanted characters and remove all those in the string. For example:
FUNCTION cleanString RETURNS CHARACTER (INPUT pcString AS CHARACTER):
DEFINE VARIABLE iUnwanted AS INTEGER NO-UNDO EXTENT 3.
DEFINE VARIABLE i AS INTEGER NO-UNDO.
/* Remove all capital Swedish letters ÅÄÖ */
iUnwanted[1] = 197.
iUnwanted[2] = 196.
iUnwanted[3] = 214.
DO i = 1 TO EXTENT(iUnwanted):
IF iUnwanted[i] <> 0 THEN DO:
pcString = REPLACE(pcString, CHR(iUnwanted[i]), "").
END.
END.
RETURN pcString.
END.
DEFINE VARIABLE cString AS CHARACTER NO-UNDO INIT "AANÅÅÖÖBBCVCÄÄ".
DISPLAY cleanString(cString) FORMAT "x(10)".
Other functions that could be useful to look into:
SUBSTRING: Returns a part of a string. Can be used to modify it as well.
ASC: Like CHR but the other way around (returns the numeric code of a character).
INDEX: Returns the position of a character in a string.
R-INDEX: Like INDEX but searches right to left.
STRING: Converts a value of any data type into a character value.
This function will replace chars according to the current collation.
function Dia2Plain returns character (input icTxt as character):
define variable ocTxt as character no-undo.
define variable i as integer no-undo.
define variable iAsc as integer no-undo.
define variable cDia as character no-undo.
define variable cPlain as character no-undo.
assign ocTxt = icTxt.
repeat i = 1 to length(ocTxt):
assign cDia = substring(ocTxt,i,1)
cPlain = "".
if asc(cDia) > 127
then do:
repeat iAsc = 65 to 90: /* A..Z */
if compare(cDia, "eq" , chr(iAsc), "case-sensitive")
then assign cPlain = chr(iAsc).
end.
repeat iAsc = 97 to 122: /* a..z */
if compare(cDia, "eq" , chr(iAsc), "case-sensitive")
then assign cPlain = chr(iAsc).
end.
if cPlain <> ""
then assign substring(ocTxt,i,1) = cPlain.
end.
end.
return ocTxt.
end.
/* testing */
def var c as char init "ÄëÉÖìÇ".
disp c Dia2Plain(c).
def var i as int.
def var d as char.
repeat i = 128 to 256:
assign c = chr(i) d = Dia2Plain(chr(i)).
if asc(c) <> asc(d) then disp i c d.
end.
This function will remove anything that is not a letter or number (adapt it as you wish).
/* remove any characters that are not numbers or letters */
FUNCTION alphanumeric RETURNS CHARACTER
(lch_string AS CHARACTER).
DEFINE VARIABLE lch_newstring AS CHARACTER NO-UNDO.
DEFINE VARIABLE i AS INTEGER NO-UNDO.
DO i = 1 TO LENGTH(lch_string):
/* check to see if this is a number or letter */
IF (ASC(SUBSTRING(lch_string,i,1)) GE ASC("0")
AND ASC(SUBSTRING(lch_string,i,1)) LE ASC("9"))
OR (ASC(SUBSTRING(lch_string,i,1)) GE ASC("A")
AND ASC(SUBSTRING(lch_string,i,1)) LE ASC("Z"))
OR (ASC(SUBSTRING(lch_string,i,1)) GE ASC("a")
AND ASC(SUBSTRING(lch_string,i,1)) LE ASC("z"))
THEN
/* only keep it if it is a number or letter */
lch_newstring = lch_newstring + SUBSTRING(lch_string,i,1).
END.
RETURN lch_newstring.
END FUNCTION.
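Or you can simply use a regular expression through the .NET Regex class (this assumes your OpenEdge client can call .NET classes):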
System.Text.RegularExpressions.Regex:Replace("Say,Hi!", "[^a-zA-Z0-9]","")
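To see what that character class actually keeps, here is the same pattern run through Python's `re` module. This only illustrates the pattern; it is not ABL.

```python
import re

# Same character class as above: drop everything except a-z, A-Z and 0-9.
cleaned = re.sub(r"[^a-zA-Z0-9]", "", "Say,Hi!")
print(cleaned)  # SayHi

# Accented letters are outside a-z / A-Z, so they are stripped as well.
stripped = re.sub(r"[^a-zA-Z0-9]", "", "\u00c5sa 42")
print(stripped)  # sa42
```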

Unicode character transformation in SPSS

I have a string variable. I need to convert all non-digit characters to spaces (" "). I have a problem with Unicode characters. Unicode characters (the characters outside the basic charset) are converted to some invalid characters. See the code for an example.
Is there any other way to achieve the same result with a procedure that does not choke on special Unicode characters?
new file.
set unicode = yes.
show unicode.
data list free
/T (a10).
begin data
1234
5678
absd
12as
12(a
12(vi
12(vī
12āčž
end data.
string Z (a10).
comp Z = T.
loop #k = 1 to char.len(Z).
if ~range(char.sub(Z, #k, 1), "0", "9") sub(Z, #k, 1) = " ".
end loop.
comp Z = normalize(Z).
comp len = char.len(Z).
list var = all.
exe.
The result:
T Z len
1234 1234 4
5678 5678 4
absd 0
12as 12 2
12(a 12 2
12(vi 12 2
12(vī 12 � 6
>Warning # 649
>The first argument to the CHAR.SUBSTR function contains invalid characters.
>Command line: 1939 Current case: 8 Current splitfile group: 1
12āčž 12 �ž 7
Number of cases read: 8 Number of cases listed: 8
The substr function should not be used on the left hand side of an expression in Unicode mode, because the replacement character may not be the same number of bytes as the character(s) being replaced. Instead, use the replace function on the right hand side.
The corrupted characters you are seeing are due to this size mismatch.
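The size mismatch can be reproduced outside SPSS. This Python sketch is an illustration of the byte-level effect, not of SPSS internals: it overwrites one byte of a two-byte character and then does the replacement per character instead.

```python
text = "12(v\u012b"                      # ends with i-macron, two bytes in UTF-8
raw = bytearray(text.encode("utf-8"))    # 6 bytes for 5 characters

raw[4:5] = b" "                          # overwrite only the first byte of the last character
broken = raw.decode("utf-8", errors="replace")

# Working per character keeps the string valid.
safe = "".join(c if c in "0123456789" else " " for c in text)

print(len(raw), repr(broken), repr(safe))
```

The orphaned second byte is what shows up as the replacement character in the listing above.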
How about, instead of replacing non-numeric characters, you cycle through and pull out the numeric characters and rebuild Z? (Note that my version here predates the CHAR. string functions.)
data list free
/T (a10).
begin data
1234
5678
absd
12as
12(a
12(vi
12(vī
12āčž
12as23
end data.
STRING Z (a10).
STRING #temp (A1).
COMPUTE #len = LENGTH(RTRIM(T)).
LOOP #i = 1 to #len.
COMPUTE #temp = SUBSTR(T,#i,1).
DO IF INDEX('0123456789',#temp) > 0.
COMPUTE Z = CONCAT(SUBSTR(Z,1,#i-1),#temp).
ELSE.
COMPUTE Z = CONCAT(SUBSTR(Z,1,#i-1)," ").
END IF.
END LOOP.
EXECUTE.

Output sanitization within Progress ABL / 4GL

Is there an analogous procedure to PHP's mysql_real_escape_string (http://php.net/manual/en/function.mysql-real-escape-string.php) for Progress 4GL / ABL, or a best practice within the Progress community that is followed for writing sanitized text to external and untrusted entities (web sites, MySQL servers and APIs)?
The QUOTE or QUERY-PREPARE functions will not work as they sanitize text for dynamic queries for Progress and not for external entities.
The closest analogue to your cited example would be to write a function that does this:
DEFINE VARIABLE ch-escape-chars AS CHARACTER NO-UNDO.
DEFINE VARIABLE ch-string AS CHARACTER NO-UNDO.
DEFINE VARIABLE i-cnt AS INTEGER NO-UNDO.
DO i-cnt = 1 TO LENGTH(ch-escape-chars):
ch-string = REPLACE(ch-string,
SUBSTRING(ch-escape-chars, i-cnt, 1),
"~~" + SUBSTRING(ch-escape-chars, i-cnt, 1)).
END.
where
ch-escape-chars holds the characters you want escaped.
ch-string is the incoming string.
"~~" is the escaped escape character (a literal tilde).
It sounds like roll your own would be the only way. For my purposes I emulated the mysql_real_escape_string function
/* TODO: Progress automatically changes all CHR(0) characters in a non-db string to a space, CHR(32) (hex 20). */
/* the backslash needs to go first */
/* there is no concept of static vars in progress (non class) so global variables */
DEFINE VARIABLE cEscape AS CHARACTER EXTENT INITIAL [
"~\",
/*"~000",*/
"~n",
"~r",
"'",
"~""
]
.
DEFINE VARIABLE cReplace AS CHARACTER EXTENT INITIAL [
"\\",
/*"\0",*/
"\n",
"\r",
"\'",
'\"'
]
.
FUNCTION mysql_real_escape_string RETURNS CHARACTER (INPUT pcString AS CHAR):
DEF VAR ii AS INTEGER NO-UNDO.
MESSAGE pcString '->'.
DO ii = 1 TO EXTENT(cEscape):
ASSIGN pcString = REPLACE (pcString, cEscape[ii], cReplace[ii]).
END.
MESSAGE pcString.
RETURN pcString.
END.
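The comment about the backslash going first matters more than it looks. Here is a short Python sketch of the same replace-in-order loop, with a hypothetical `escape` helper and the same pairs as above minus the NUL entry, showing what happens when the order is reversed:

```python
ESCAPES = [
    ("\\", "\\\\"),   # the backslash has to be handled first
    ("\n", "\\n"),
    ("\r", "\\r"),
    ("'", "\\'"),
    ('"', '\\"'),
]


def escape(text: str, pairs=ESCAPES) -> str:
    """Apply each (raw, escaped) replacement in list order."""
    for raw, escaped in pairs:
        text = text.replace(raw, escaped)
    return text


good = escape("it's")                            # it\'s
bad = escape("it's", list(reversed(ESCAPES)))    # it\\'s
print(good)
print(bad)
```

With the backslash last, the backslash that was just added in front of the quote gets doubled, so the quote ends up unescaped again.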