Is there a function to decode a percent-encoded UTF-8 string, like one from a form?

I want to store some data with an HTML form and Rebol CGI. My form looks like this:
<form action="test.cgi" method="post" >
Input:
<input type="text" name="field"/>
<input type="submit" value="Submit" />
</form>
But for unicode characters like Chinese, I get the encoded form of the data with percent signs, for instance %E4%BA%BA.
(This is for the Chinese character "人" ... its UTF-8 form as a Rebol binary literal is #{E4BABA})
Is there a function in the system, or an existing library that can decode this directly? dehex does not appear to currently cover this case. I'm currently decoding this manually by removing the percent signs and constructing the corresponding binary, like this:
data: to-string read system/ports/input
print data
;-- this prints "field=%E4%BA%BA"
k-v: parse data "="
print k-v
;-- this prints ["field" "%E4%BA%BA"]
v: append insert replace/all k-v/2 "%" "" "#{" "}"
print v
;-- This prints "#{E4BABA}" ... a string!, not binary!
;-- LOAD will help construct the corresponding binary
;-- then TO-STRING will decode that binary from UTF-8 to character codepoints
write %test.txt to-string load v

I have a library called AltWebForm that en/decodes percent-encoded web form data:
do http://reb4.me/r3/altwebform
load-webform "field=%E4%BA%BA"
The library is described here: Rebol and Web Forms.

Looks to be related to ticket #1986, where it is discussed whether this is a "bug" or the Internet changing out from under its own spec:
Have DEHEX convert UTF-8 sequences from browsers as Unicode.
If you have specific experience on what has become standard in Chinese, and want to weigh in, that would be valuable.
Just as an aside, the specific case above could have been handled in PARSE alternately as:
key-value: {field=%E4%BA%BA}
utf8-bytes: copy #{}
either parse key-value [
copy field-name to {=}
skip
some [
and {%}
copy enhexed-byte 3 skip (
append utf8-bytes dehex enhexed-byte
)
]
] [
print [field-name {is} to string! utf8-bytes]
] [
print {Malformed input.}
]
That will output:
field is 人
With some comments included:
key-value: {field=%E4%BA%BA}
;-- Generate empty binary value by copying an empty binary literal
utf8-bytes: copy #{}
either parse key-value [
;-- grab field-name as the chars right up to the equals sign
copy field-name to {=}
;-- skip the equal sign as we went up to it, without moving "past" it
skip
;-- apply the enclosed rule SOME (non-zero) number of times
some [
;-- match a percent sign as the immediate next symbol, without
;-- advancing the parse position
and {%}
;-- grab the next three chars, starting with %, into enhexed-byte
copy enhexed-byte 3 skip (
;-- If we get to this point in the match rule, this parenthesized
;-- expression lets us evaluate non-dialected Rebol code to
;-- append the dehexed byte to our utf8 binary
append utf8-bytes dehex enhexed-byte
)
]
] [
print [field-name {is} to string! utf8-bytes]
] [
print {Malformed input.}
]
(Note also that "simple parse" is getting the axe in favor of enhancements to SPLIT. So writing code like parse data "=" can now be expressed instead as split data "=", or other cool variants if you check them out...samples are in the ticket.)
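For readers more familiar with other languages, the same decoding steps can be sketched in Python. This is only an illustration of the technique, not part of Rebol or AltWebForm. The manual route mirrors the question's approach and, like it, assumes every byte of the value is percent-encoded; the library route uses the standard `urllib.parse.parse_qs`, which decodes as UTF-8 by default.

```python
from urllib.parse import parse_qs

raw = "field=%E4%BA%BA"

# Manual route, mirroring the steps in the question:
# split on "=", drop the percent signs, build the bytes, decode as UTF-8.
# This only works when every byte of the value is percent-encoded.
key, _, value = raw.partition("=")
utf8_bytes = bytes.fromhex(value.replace("%", ""))
decoded = utf8_bytes.decode("utf-8")
print(key, "is", decoded)

# Library route: parse_qs percent-decodes and assumes UTF-8 by default.
fields = parse_qs(raw)
print(fields)
```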

Get UTF-16 code unit at a given index in ABAP

I want to get the UTF-16 code unit at a given index in ABAP.
Same can be done in JavaScript with charCodeAt().
For example "d".charCodeAt(); will give back 100.
Is there a similar functionality in ABAP?
This can be done with class CL_ABAP_CONV_OUT_CE
DATA(lo_converter) = cl_abap_conv_out_ce=>create( encoding = '4103' ). "Little Endian
TRY.
CALL METHOD lo_converter->convert
EXPORTING
data = 'a'
n = 1
IMPORTING
buffer = DATA(lv_buffer). "lv_buffer will be 6100 (little endian)
CATCH ...
ENDTRY.
Codepage 4102 is for UTF-16 Big endian.
It is possible to encode not just a single character, but a string as well:
EXPORTING
data = 'abc'
n = 3
"n" is the number of characters you want to encode. It can be less than the actual length of the string.
When you say you "want to get the UTF-16 code unit",
either you mean the Unicode code point, e.g. the character d is always U+0064 (official "name" of Unicode character, the two bytes 0x0064 being the hexadecimal representation of decimal 100),
or you mean you want to encode d to UTF-16 little endian (SAP code page 4103) or big endian (SAP code page 4102), which gives the 2 bytes 0x6400 or the 2 bytes 0x0064 respectively.
For the second case, see József's answer.
For the first case, you may get it using the method UCCP (UniCode Code Point) or UCCPI (UniCode Code Point Integer) of class CL_ABAP_CONV_OUT_CE:
DATA: l_unicode_point_hex TYPE x LENGTH 2,
l_unicode_point_int TYPE i.
l_unicode_point_hex = cl_abap_conv_out_ce=>UCCP( 'd' ).
ASSERT l_unicode_point_hex = '0064'.
l_unicode_point_int = cl_abap_conv_out_ce=>UCCPI( 'd' ).
ASSERT l_unicode_point_int = 100.
EDIT: Note that the two methods always return the same values whatever the SAP system code page is (4102, 4103 or whatever).
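The difference between the code point and its two UTF-16 byte orders can be checked outside ABAP. The following is a small Python sketch (not SAP code), using only built-in string methods; the mapping of `utf-16-le` and `utf-16-be` to code pages 4103 and 4102 follows the answers above.

```python
ch = "d"

code_point = ord(ch)               # Unicode code point as an integer
utf16_le = ch.encode("utf-16-le")  # little endian, as with code page 4103
utf16_be = ch.encode("utf-16-be")  # big endian, as with code page 4102

print(code_point, utf16_le.hex(), utf16_be.hex())
```

For a character in the Basic Multilingual Plane such as d, the big-endian bytes read the same as the code point in hexadecimal, while the little-endian bytes are swapped.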

Swift string indexing combines "\r\n" as one char instead of two

I am dealing with strings containing \r\n in Swift 4.2. I ran into some strange behavior in Swift string indexing: \r\n appears to be treated as one character instead of two by the indexing methods. I wrote a piece of code to demonstrate this behavior:
var text = "ABC\r\n\r\nDEF"
func printChar(_ lower: Int, _ upper: Int) {
let start = text.index(text.startIndex, offsetBy: lower)
let end = text.index(text.startIndex, offsetBy: upper)
print("\"" + text[start..<end] + "\"")
}
printChar(0, 1) // "A"
printChar(1, 2) // "B"
printChar(2, 3) // "C"
printChar(3, 4) // new line
printChar(4, 5) // new line (okay, what's going on here?)
printChar(5, 6) // "D"
printChar(6, 7) // "E"
printChar(7, 8) // "F"
The print result will be
"A"
"B"
"C"
"
"
"
"
"D"
"E"
"F"
Any idea why it's like this?
TLDR: \r\n is a grapheme cluster and is treated as a single Character in Swift because Unicode.
Swift treats \r\n as one Character.
Objective-C NSString treats it as two characters (in terms of the result from length).
On the swift-users forum someone wrote:
– "\r\n" is a single Character. Is this the correct behaviour?
– Yes, a Character corresponds to a Unicode grapheme cluster, and "\r\n" is considered a single grapheme cluster.
And the subsequent response posted a link to Unicode documentation, check out this table which officially states CRLF is a grapheme cluster.
Take a look at the Apple documentation on Characters and Grapheme Clusters.
It's common to think of a string as a sequence of characters, but when working with NSString objects, or with Unicode strings in general, in most cases it is better to deal with substrings rather than with individual characters. The reason for this is that what the user perceives as a character in text may in many cases be represented by multiple characters in the string.
The Swift documentation on Strings and Characters is also worth reading.
This overview from objc.io is interesting as well.
NSString represents UTF-16-encoded text. Length, indices, and ranges are all based on UTF-16 code units.
Another example of this is an emoji like 👍🏻. This single character is actually %uD83D%uDC4D%uD83C%uDFFB: four UTF-16 code units that encode two Unicode scalars (U+1F44D and U+1F3FB). But if you called count on a string with just that emoji you'd (correctly) get 1.
If you wanted to see the scalars you could iterate them as follows:
for scalar in text.unicodeScalars {
print("\(scalar.value) ", terminator: "")
}
Which for "\r\n" would give you 13 10
In the Swift documentation you'll find why NSString is different:
The count of the characters returned by the count property isn’t always the same as the length property of an NSString that contains the same characters. The length of an NSString is based on the number of 16-bit code units within the string’s UTF-16 representation and not the number of Unicode extended grapheme clusters within the string.
Thus this isn't really "strange" behaviour of Swift string indexing, but rather a result of how Unicode treats these characters and how String in Swift is designed. Swift string indexing goes by Character and \r\n is a single Character.
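The three ways of counting (grapheme clusters, Unicode scalars, UTF-16 code units) can be told apart with a short Python sketch. Python's standard library has no grapheme segmentation, so this only shows the scalar and UTF-16 views; `utf16_units` is a helper name made up for this illustration.

```python
def utf16_units(s):
    """Return the UTF-16 code units of s as a list of integers."""
    data = s.encode("utf-16-be")
    return [int.from_bytes(data[i:i + 2], "big") for i in range(0, len(data), 2)]

thumbs_up = "\U0001F44D\U0001F3FB"  # thumbs up followed by a skin tone modifier
crlf = "\r\n"

# Python strings index by code point (Unicode scalar), not by grapheme cluster.
print(len(thumbs_up), [hex(ord(c)) for c in thumbs_up])
# NSString-style length counts UTF-16 code units instead.
print(len(utf16_units(thumbs_up)), [hex(u) for u in utf16_units(thumbs_up)])
# CRLF is two scalars, even though Swift's Character view treats it as one.
print([ord(c) for c in crlf])
```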

Removing Unwanted commas from a csv

I'm writing a program in Progress, OpenEdge, ABL, and whatever else it's known as.
I have a CSV file that is delimited by commas. However, there is a "gift message" field, and users enter messages with "commas", so now my program will see additional entries because of those bad commas.
The CSV fields are not in double quotes, so I CANNOT just use my main method, which is:
/** this next block of code will remove all unwanted commas from the data. **/
if v-line-cnt > 1 then /** we won't run this against the headers. Otherwise they will get deleted **/
assign
v-data = replace(v-data,'","',"\t") /** Here is a special technique to replace the comma delim with a tab **/
v-data = replace(v-data,','," ") /** now that we removed the comma delim above, we can remove all nuisance commas **/
v-data = replace(v-data,"\t",'","'). /** all nuisance commas are gone, we turn the tabs back to commas. **/
Any advice?
edit:
From Progress, I can call Linux commands. So I should be able to execute C++/PHP/shell scripts etc. from my Progress program. I look forward to advice; until then I shall look into using external scripts.
You are not providing quite enough data for a perfect answer but given what you say I think the IMPORT statement should handle this automatically.
In my example here commaimport.csv is a comma-separated csv-file with quotes around text fields. Integers, logical variables etc have no quotes. The last field contains a comma in one line:
commaimport.csv
=======================
"Id1", 123, NO, "This is a message"
"Id2", 124, YES, "This is a another message, with a comma"
"Id3", 323, NO, "This is a another message without a comma"
To import this file I define a temp-table matching the file layout and use the IMPORT statement with comma as delimiter:
DEFINE TEMP-TABLE ttImport NO-UNDO
FIELD field1 AS CHARACTER FORMAT "xxx"
FIELD field2 AS INTEGER FORMAT "zz9"
FIELD field3 AS LOGICAL
FIELD field4 AS CHARACTER FORMAT "x(50)".
INPUT FROM VALUE("c:\temp\commaimport.csv").
REPEAT :
CREATE ttImport.
IMPORT DELIMITER "," ttImport.
END.
INPUT CLOSE.
FOR EACH ttImport:
DISPLAY ttImport.
END.
You don't have to import into a temp-table. You could import into variables instead.
DEFINE VARIABLE c AS CHARACTER NO-UNDO FORMAT "xxx".
DEFINE VARIABLE i AS INTEGER NO-UNDO FORMAT "zz9".
DEFINE VARIABLE l AS LOGICAL NO-UNDO.
DEFINE VARIABLE d AS CHARACTER NO-UNDO FORMAT "x(50)".
INPUT FROM VALUE("c:\temp\commaimport.csv").
REPEAT :
IMPORT DELIMITER "," c i l d.
DISP c i l d.
END.
INPUT CLOSE.
This will render basically the same output.
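Since the question mentions being able to call external scripts, here is a rough Python equivalent of the quoted-field case above. It is a sketch using the standard `csv` module, with the sample rows inlined rather than read from a file; `skipinitialspace=True` is needed because the sample has a space after each comma.

```python
import csv
import io

sample = (
    '"Id1", 123, NO, "This is a message"\n'
    '"Id2", 124, YES, "This is a another message, with a comma"\n'
)

# Quoted fields may contain commas; skipinitialspace drops the blank
# that follows each delimiter so the opening quote is recognized.
rows = list(csv.reader(io.StringIO(sample), skipinitialspace=True))
for row in rows:
    print(row)
```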
You don't show what your data file looks like. But if the problematic field is the last one, and there are no quotes, then your best bet is probably to read it using INPUT UNFORMATTED to get it a line at a time, and then split the line into fields using ENTRY(). That way you can treat everything after the nth comma as a single field no matter how many commas the line has.
For example, say your input file has three columns like this:
boris,14.23,12 the avenue
mark,32.10,flat 1, the grange
percy,1.00,Bleak house, Dartmouth
... so that column three is an address which might contain a comma and is not enclosed in quotes so that IMPORT DELIMITER can't help you.
Something like this would work in that case:
/* ...skipping a lot of definitions here ... */
input from "datafile.csv".
repeat:
import unformatted v-line.
create tt-thing.
assign tt-thing.name = entry(1, v-line, ',')
tt-thing.price = entry(2, v-line, ',')
tt-thing.address = entry(3, v-line, ',').
do v-i = 4 to num-entries(v-line, ','):
tt-thing.address = tt-thing.address
+ ','
+ entry(v-i, v-line, ',').
end.
end.
input close.
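The same idea (treat everything after the nth comma as one field) is a one-liner in languages with a bounded split. A minimal Python sketch, assuming as in the example above that only the last of three columns can contain stray commas:

```python
lines = [
    "boris,14.23,12 the avenue",
    "mark,32.10,flat 1, the grange",
    "percy,1.00,Bleak house, Dartmouth",
]

# Split on at most two commas so the third field keeps any further commas.
records = [line.split(",", 2) for line in lines]
for name, price, address in records:
    print(name, price, address, sep=" | ")
```

This does not help if the free-text field sits in the middle of the row and more than one column can contain commas; in that case the file is ambiguous and has to be fixed at the source.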

How to use Unicode codepoints above U+FFFF in Rebol 3 strings like in Rebol 2?

I know you can't use caret style escaping in strings for codepoints bigger than ^(FF) in Rebol 2, because it doesn't know anything about Unicode. So this doesn't generate anything good, it looks messed up:
print {Q: What does a Zen master's {Cow} Say? A: "^(03BC)"!}
Yet the code works in Rebol 3 and prints out:
Q: What does a Zen master's {Cow} Say? A: "μ"!
That's great, but R3 maxes out its ability to hold a character in a string at all at U+FFFF apparently:
>> type? "^(FFFF)"
== string!
>> type? "^(010000)"
** Syntax error: invalid "string" -- {"^^(010000)"}
** Near: (line 1) type? "^(010000)"
The situation is a lot better than the random behavior of Rebol 2 when it met codepoints it didn't know about. However, there used to be a workaround in Rebol for storing strings if you knew how to do your own UTF-8 encoding (or got your strings by way of loading source code off disk). You could just assemble them from individual characters.
So the UTF-8 encoding of U+010000 is #{F0908080}, and previously you could say:
workaround: rejoin [#"^(F0)" #"^(90)" #"^(80)" #"^(80)"]
And you'd get a string with that single codepoint encoded using UTF-8, that you could save to disk in code blocks and read back in again. Is there any similar trick in R3?
There is a workaround using the string! datatype as well. You cannot use UTF-8 in that case, but you can use a UTF-16 workaround as follows:
utf-16: "^(d800)^(dc00)"
This encodes the ^(10000) code point using a UTF-16 surrogate pair. In general, the following function can do the encoding:
utf-16: func [
code [integer!]
/local low high
] [
case [
code < 0 [do make error! "invalid code"]
code < 65536 [append copy "" to char! code]
code < 1114112 [
code: code - 65536
low: code and 1023
high: code - low / 1024
append append copy "" to char! high + 55296 to char! low + 56320
]
'else [do make error! "invalid code"]
]
]
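The surrogate arithmetic in that function can be cross-checked against a language with a built-in UTF-16 encoder. The Python sketch below follows the same steps (subtract 65536, split into high and low 10 bits, add 55296 and 56320); `utf16_surrogates` is a name chosen for this illustration.

```python
def utf16_surrogates(code):
    """Return the UTF-16 code units for a code point as a list of integers."""
    if not 0 <= code <= 0x10FFFF:
        raise ValueError("invalid code")
    if code < 0x10000:
        return [code]                       # fits in a single code unit
    offset = code - 0x10000                 # 20-bit value to split
    high = 0xD800 + (offset >> 10)          # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)         # bottom 10 bits
    return [high, low]

print([hex(u) for u in utf16_surrogates(0x10000)])
print([hex(u) for u in utf16_surrogates(0x1F44D)])
```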
Yes, there is a trick...which is the trick you should have been using in R2 as well. Don't use a string! Use a binary! if you have to do this sort of thing:
good-workaround: #{F0908080}
It would've worked in Rebol2, and it works in Rebol3. You can save it and load it without any funny business.
In fact, if you care about Unicode at all, ever...stop doing string processing that is using codepoints higher than ^(7F) if you are stuck in Rebol 2 and not 3. We'll see why by looking at that terrible workaround:
terrible-workaround: rejoin [#"^(F0)" #"^(90)" #"^(80)" #"^(80)"]
..."And you'd get a string with that single UTF-8 codepoint"...
The only thing you should get is a string with four individual character codepoints, and with 4 = length? terrible-workaround. Rebol2 is broken because string! is basically no different from binary! under the hood. In fact, in Rebol2 you could alias the two types back and forth without making a copy, look up AS-BINARY and AS-STRING. (This is impossible in Rebol3 because they really are fundamentally different, so don't get attached to the feature!)
It's somewhat deceptive to see these strings reporting a length of 4, and there's a false comfort of each character producing the same value if you convert them to integer!. Because if you ever write them out to a file or port somewhere, and they need to be encoded, you'll get bitten. Note this in Rebol2:
>> to integer! #"^(80)"
== 128
>> to binary! #"^(80)"
== #{80}
But in R3, you have a UTF-8 encoding when binary conversion is needed:
>> to integer! #"^(80)"
== 128
>> to binary! #"^(80)"
== #{C280}
So you will be in for a surprise when your seemingly-working code does something different at a later time, and winds up serializing differently. In fact, if you want to know how "messed up" R2 is in this regard, look at why you got a weird symbol for your "mu". In R2:
>> to binary! #"^(03BC)"
== #{BC}
It just threw the "03" away. :-/
So if for some reason you need to work with Unicode strings and can't switch to R3, try something like this for the cow example:
mu-utf8: #{CEBC}
utf8: rejoin [#{} {Q: What does a Zen master's {Cow} Say? A: "} mu-utf8 {"!}]
That gets you a binary. Only convert it to string for debug output, and be ready to see gibberish. But it is the right thing to do if you're stuck in Rebol2.
And to reiterate the answer: it's also what to do if for some odd reason you are stuck needing to use those higher codepoints in Rebol3:
utf8: rejoin [#{} {Q: What did the Mycenaean's {Cow} Say? A: "} #{F0908080} {"!}]
I'm sure that would be a very funny joke if I knew what LINEAR B SYLLABLE B008 A was. Which leads me to say that most likely, if you're doing something this esoteric you probably only have a few codepoints being cited as examples. You can hold most of your data as string up until you need to slot them in conveniently, and hold the result in a binary series.
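The byte values involved can be verified with any UTF-8 encoder. The Python sketch below is only a cross-check, not Rebol: it shows the UTF-8 bytes for mu (U+03BC) and for U+10000, and uses Latin-1 as a rough stand-in for Rebol 2's one-byte-per-character behavior, which is an analogy rather than a statement about Rebol 2's internals.

```python
mu = "\u03bc"
linear_b = chr(0x10000)
high_char = "\x80"

print(mu.encode("utf-8").hex())         # UTF-8 bytes of the mu character
print(linear_b.encode("utf-8").hex())   # UTF-8 bytes of U+10000
print(high_char.encode("latin-1").hex(), high_char.encode("utf-8").hex())
```

The last line shows why the same character U+0080 serializes as one byte in a byte-per-character model and as two bytes once UTF-8 encoding is applied.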
UPDATE: If one hits this problem, here is a utility function that can be useful for working around it temporarily:
safe-r2-char: charset [#"^(00)" - #"^(7F)"]
unsafe-r2-char: charset [#"^(80)" - #"^(FF)"]
hex-digit: charset [#"0" - #"9" #"A" - #"F" #"a" - #"f"]
r2-string-to-binary: func [
str [string!] /string /unescape /unsafe
/local result s e escape-rule unsafe-rule safe-rule rule
] [
result: copy either string [{}] [#{}]
escape-rule: [
"^^(" s: 2 hex-digit e: ")" (
append result debase/base copy/part s e 16
)
]
unsafe-rule: [
s: unsafe-r2-char (
append result to integer! first s
)
]
safe-rule: [
s: safe-r2-char (append result first s)
]
rule: compose/deep [
any [
(either unescape [[escape-rule |]] [])
safe-rule
(either unsafe [[| unsafe-rule]] [])
]
]
unless parse/all str rule [
print "Unsafe codepoints found in string! by r2-string-to-binary"
print "See http://stackoverflow.com/questions/15077974/"
print mold str
throw "Bad codepoint found by r2-string-to-binary"
]
result
]
If you use this instead of a to binary! conversion, you will get the consistent behavior in both Rebol2 and Rebol3. (It effectively implements a solution for terrible-workaround style strings.)

Clean string from html tags and special characters

I want to clean my text of HTML tags, HTML special characters, and characters like < > [ ] / \ * ,
I used $str = preg_replace("/&#?[a-zA-Z0-9]+;/i", "", $str);
It works well with HTML special characters, but some characters are not removed, like:
( /*/*]]>*/ )
how can I remove these characters?
If you are really using php as it looks like, you can just use:
$str = htmlspecialchars($str);
All HTML chars will be escaped (which could be better than just stripping them). If you really want just to filter these characters, what you need to do is escape those characters on the chars list:
$str = preg_replace("/[\&#\?\]\[\/\\\<\>\*\:\(\);]*/i","",$str);
Notice there's just one "/[]*/i"; I removed the a-zA-Z0-9 because you presumably want to keep those characters. You can also whitelist only the characters allowed in your string (this gives you trouble with accented characters like á é ü if you use them, since you have to list every accepted character):
$str = preg_replace("/[^a-zA-Z0-9áÁéÉíÍãÃüÜõÕñÑ\.\+\-\_\%\$\#\!\=;]*/","",$str);
Notice also that escaping too many characters does no harm, except inside ranges (a-z works as a range, whereas a\-z would match a, or -, or z).
I hope it helps. :)
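The two-step idea (strip entities first, then strip a fixed set of punctuation) can be sketched in Python to make the character class easier to read. This is an illustration, not a translation of the PHP above; `clean` is a hypothetical helper and the set of characters removed is the one listed in the question.

```python
import re

def clean(text):
    """Remove HTML entities, then the characters < > [ ] / \\ * and comma."""
    text = re.sub(r"&#?[a-zA-Z0-9]+;", "", text)   # entities like &amp; or &#169;
    return re.sub(r"[<>\[\]/\\*,]", "", text)      # listed special characters

print(repr(clean("( /*/*]]>*/ )")))
print(repr(clean("a &amp; b &#169; <c>")))
```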
A simple regular expression for HTML tags is:
/<[^>]*>/
(A greedy pattern such as /\<(.*)?\>/ would match from the first < to the last > and delete the text between the tags as well.) So use something like this:
// The regular expression to remove HTML tags, one tag at a time
$htmltagsregex = '/<[^>]*>/';
// what will replace each tag
$nothing = '';
// the string I want to apply it to
$string = 'this is a string with <b>HTML tags</b> that I want to <strong>remove</strong>';
// DO IT
$result = preg_replace($htmltagsregex, $nothing, $string);
and it will return
this is a string with HTML tags that I want to remove
That's all
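One thing to watch with tag-stripping patterns is greediness. The Python sketch below compares a greedy pattern such as `<(.*)?>` with a per-tag pattern `<[^>]*>` on the sample string above; Python's `re` behaves like PCRE for these two patterns. Neither pattern is a real HTML parser, so treat this as a sketch for simple input only.

```python
import re

text = 'this is a string with <b>HTML tags</b> that I want to <strong>remove</strong>'

# Greedy: .* runs from the first "<" to the last ">", swallowing the text between tags.
greedy = re.sub(r"<(.*)?>", "", text)

# Per tag: [^>]* stops at the first ">", so only the tags themselves are removed.
per_tag = re.sub(r"<[^>]*>", "", text)

print(repr(greedy))
print(repr(per_tag))
```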