How to detect if a Unicode char is supported by EBCDIC in .NET 4.0?

We have a web site and WinForms application written in .NET 4.0 that allows users to enter any Unicode char (pretty standard).
The problem is that a small amount of our data gets submitted to an old mainframe application. During testing, a user entered a name containing a character that ended up crashing the mainframe program. The name was BOËNS. The Ë is not supported.
What is the best way to detect if a unicode char is supported by EBCDIC?
I tried using the following regular expression but that restricted some standard special chars (/, _, :) which are fine for the mainframe.
I would prefer a single method that validates each char, or a method that takes a string and returns true or false depending on whether it contains chars not supported by EBCDIC.

First, you would have to get the proper Encoding instance for EBCDIC by calling the static GetEncoding method, which takes the code page id as a parameter.
To make unsupported characters raise an error, request the encoding with exception fallbacks: the GetEncoding overload that also takes an EncoderFallback and a DecoderFallback lets you pass the value of the static ExceptionFallback property on the EncoderFallback class (encoding characters to bytes uses the encoder fallback, not the decoder fallback).
Then, in your code, you would loop through each character in your string and call the GetBytes method to encode the character to its byte sequence. If it cannot be encoded, an EncoderFallbackException is thrown; you would just have to wrap each call to GetBytes in a try/catch block to determine which character is in error.
Note that the above is only required if you want to know the position of the character that failed. If you don't care about the position and only need to know whether the string as a whole will encode, you can call the GetBytes overload that takes a string parameter; it will throw the same EncoderFallbackException if it encounters a character that cannot be encoded.
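The same idea can be sketched outside .NET. The following Python version is only an illustration of the approach described above, not the .NET code itself: the function names are made up for this example, and cp037 is an assumed EBCDIC code page (the mainframe may well use a different one). Note also that a code page may accept more characters than the mainframe program tolerates, in which case an explicit whitelist is still needed.

```python
def find_unencodable(text, codec="cp037"):
    """Return (index, char) pairs for characters the codec cannot encode.

    The codec name is an assumption; substitute the code page your
    mainframe actually uses.
    """
    bad = []
    for index, char in enumerate(text):
        try:
            char.encode(codec)  # strict error handling raises on failure
        except UnicodeEncodeError:
            bad.append((index, char))
    return bad


def is_encodable(text, codec="cp037"):
    """Return True if the whole string encodes without errors."""
    try:
        text.encode(codec)
    except UnicodeEncodeError:
        return False
    return True
```

The per-character loop reports positions; the whole-string check is enough when a yes/no answer is all that is needed.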

You can escape characters in a Regex using \. So if you want to match a dot, you can use @"\.". To match /._,:[]- for example: @"[/._,:\-\[\]]". Now, EBCDIC is 8 bits, but many characters are control characters. Do you have a list of "valid" characters?
I have made this pattern:
string pattern = @"[^a-zA-Z0-9 ¢.<(+&!$*);¬/|,%_>?`:#@'=~{}\-\\" + '"' + "]";
It should find "illegal" characters. If IsMatch then there is a problem.
I have used this: http://nemesis.lonestar.org/reference/telecom/codes/ebcdic.html
Note the special handling of the ". I'm using the @ at the beginning of the string to disable \ escape expansion, so I can't escape the quote with a backslash, and so I add it to the pattern at the end.
To test it:
Regex rx = new Regex(pattern);
bool m1 = rx.IsMatch(@"a-zA-Z0-9 ¢.<(+&!$*);¬/|,%_>?`:#@'=~{}\-\\" + '"');
bool m2 = rx.IsMatch(@"€a-zA-Z0-9 ¢.<(+&!$*);¬/|,%_>?`:#@'=~{}\-\\" + '"');
m1 is false (the input is the list of all the "good" characters), m2 is true (the same list with the € symbol added).
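For comparison, here is a rough Python sketch of the same whitelist check. The names ALLOWED_PUNCTUATION, ILLEGAL and has_illegal_chars are invented for this example, and the character list (including both # and @) is an assumption about what the pattern above intends; adjust it to whatever your mainframe program really accepts.

```python
import re

# Punctuation assumed to be acceptable, taken from the whitelist pattern above.
ALLOWED_PUNCTUATION = " ¢.<(+&!$*);¬/|,%_>?`:#@'=~{}-\\\""

# re.escape takes care of characters that are special inside a character class.
ILLEGAL = re.compile("[^a-zA-Z0-9" + re.escape(ALLOWED_PUNCTUATION) + "]")


def has_illegal_chars(text):
    """Return True if text contains any character outside the whitelist."""
    return ILLEGAL.search(text) is not None
```

Building the class with re.escape avoids hand-escaping the hyphen, backslash and quote, which is where the C# version needs special handling.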

Related

db2 remove all non-alphanumeric, including non-printable, and special characters

This may sound like a duplicate, but the existing solutions do not work.
I need to remove all non-alphanumerics from a varchar field. I'm using the following but it doesn't work in all cases (it works with diamond questionmark characters):
select TRANSLATE(FIELDNAME, '?',
TRANSLATE(FIELDNAME , '', 'ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789'))
from TABLENAME
The inner TRANSLATE is meant to isolate all non-alphanumeric characters, and the outer TRANSLATE then replaces them all with a '?'. This seems to work for the replacement character �. However, for some values it throws "The second, third or fourth argument of the TRANSLATE scalar function is incorrect.", which is expected according to IBM:
The TRANSLATE scalar function does not allow replacement of a character by another character which is encoded using a different number of bytes. The second and third arguments of the TRANSLATE scalar function must end with correctly formed characters.
Is there any way to get around this?
Edit: @Paul Vernon's solution seems to be working:
· 6005308 ??6005308
–6009908 ?6009908
–6011177 ?6011177
��6011183�� ??6011183??
Try regexp_replace(c,'[^\w\d]','') or regexp_replace(c,'[^a-zA-Z\d]','')
E.g.
select regexp_replace(c,'[^a-zA-Z\d]','') from table(values('AB_- C$£abc�$123£')) t(c)
which returns
1
---------
ABCabc123
BTW, note that the allowed regular expression patterns are listed on the page "Regular expression control characters":
Outside of a set, the following must be preceded with a backslash to be treated as a literal:
* ? + [ ( ) { } ^ $ | \ . /
Inside a set, characters that must be quoted to be treated as literals are [ ] \
Characters that might need to be quoted, depending on the context, are - &
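To see what the second pattern keeps and drops without a database at hand, here is a small Python sketch of the same character class. It is not Db2 code, and keep_alphanumeric is a name made up for this example; the class is written as [^a-zA-Z0-9] so that only ASCII letters and digits survive.

```python
import re


def keep_alphanumeric(value):
    """Remove every character that is not an ASCII letter or digit."""
    return re.sub(r"[^a-zA-Z0-9]", "", value)


# Same sample value as the query above, with U+FFFD as the replacement character.
print(keep_alphanumeric("AB_- C$£abc\ufffd$123£"))  # ABCabc123
```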

Trouble understanding C# URL decode with Unicode character(s) in PowerShell

I'm currently working on something that requires me to pass a Base64 string to a PowerShell script. But while decoding the string back to the original I'm getting some unexpected results as I need to use UTF-7 during decoding and I don't understand why. Would someone know why?
The Mozilla documentation would suggest that it's insufficient to use Base64 if you have Unicode characters in your string. Thus you need to use a workaround that consists of using encodeURIComponent and a replace. I don't really get why the replace is needed and shortened it to btoa(escape('✓ à la mode')) to encode the string. The result of that operation would be JXUyNzEzJTIwJUUwJTIwbGElMjBtb2Rl.
Using PowerShell to decode the string back to the original, I first need to undo the Base64 encoding. To do so, System.Convert can be used (which results in a byte array), and its output can be converted to a UTF-8 string using System.Text.Encoding. Together this would look like the following:
$bytes = [System.Convert]::FromBase64String($inputstring)
$utf8string = [System.Text.Encoding]::UTF8.GetString($bytes)
What's left to do is URL decode the whole thing. As it is a UTF-8 string, I'd expect to only need to run the URL decode without any further parameters. But if you do that, you end up with an accented a that looks like � in a file or ? on the console. To get the actual original string, it's necessary to tell the URL decode to use UTF-7 as the character set. It's nice that this works, but I don't really get why it's necessary, since the string should be UTF-8 and UTF-8 certainly supports an accented a. See the last two lines of the entire script for what I mean. With those two lines you will end up with one line that has the garbled text and one that has the original text, in the same file encoded as UTF-8.
Entire PowerShell script:
Add-Type -AssemblyName System.Web
$inputstring = "JXUyNzEzJTIwJUUwJTIwbGElMjBtb2Rl"
$bytes = [System.Convert]::FromBase64String($inputstring)
$utf8string = [System.Text.Encoding]::UTF8.GetString($bytes)
[System.Web.HttpUtility]::UrlDecode($utf8string) | Out-File -Encoding utf8 C:\temp\output.txt
[System.Web.HttpUtility]::UrlDecode($utf8string, [System.Text.UnicodeEncoding]::UTF7) | Out-File -Append -Encoding utf8 C:\temp\output.txt
Clarification:
The problem isn't the conversion of the Base64 to UTF-8. The problem is some seemingly inconsistent behavior of the UrlDecode of C#. If you run escape('✓ à la mode') in your browser, you will end up with the following string: %u2713%20%E0%20la%20mode. So we have a %uXXXX escape for the check mark and a single-byte %XX escape (%E0) for the à. If we use this directly in UrlDecode, we end up with the same error. My current assumption would be that it's an issue with the encoding of the PowerShell window and pasting characters into it.
Turns out it actually isn't all that strange. It's just that, for what I want to do, it's advantageous to use a newer function. I'm still not sure why it works if you use the UTF-7 encoding. But anyway, as an explanation:
... The hexadecimal form for characters, whose code unit value is 0xFF or less, is a two-digit escape sequence: %xx. For characters with a greater code unit, the four-digit format %uxxxx is used.
https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/escape
As TesselatingHecksler pointed out, "What is the proper way to URL encode Unicode characters?" would indicate that the %u format was never formally standardized. A newer function to escape characters exists though, which is encodeURIComponent.
The encodeURIComponent() function encodes a Uniform Resource Identifier (URI) component by replacing each instance of certain characters by one, two, three, or four escape sequences representing the UTF-8 encoding of the character (will only be four escape sequences for characters composed of two "surrogate" characters).
The output of this function actually works with the C# implementation of UrlDecode without supplying an additional encoding of UTF-7.
The originally linked Mozilla article about Base64 encoding of UTF-8 strings modifies the whole process in a way that allows you to just call the Base64 decode function in order to get the whole string. This is achieved by converting the URL-encoded version of the string to bytes.
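The difference between the two escape formats can be reproduced with a short Python sketch. This is only an illustration using Python's urllib, not the .NET UrlDecode implementation, and the variable names are invented for the example. Python's quote is used here as a stand-in for encodeURIComponent; for this particular string both percent-encode the UTF-8 bytes.

```python
import base64
from urllib.parse import quote, unquote

original = "✓ à la mode"

# The Base64 payload from the question decodes to the legacy escape() output.
legacy = base64.b64decode("JXUyNzEzJTIwJUUwJTIwbGElMjBtb2Rl").decode("ascii")
print(legacy)  # %u2713%20%E0%20la%20mode

# Percent-encoding the UTF-8 bytes instead gives a string that a
# UTF-8-based decoder can turn back into the original text.
utf8_escaped = quote(original, safe="")
print(utf8_escaped)  # %E2%9C%93%20%C3%A0%20la%20mode
round_tripped = unquote(utf8_escaped)

# Decoding the legacy form as UTF-8 fails for the lone %E0 byte, which is
# not a valid UTF-8 sequence and becomes U+FFFD.
legacy_decoded = unquote(legacy)
```

The lone byte %E0 is the Latin-1 code for à, which is why a UTF-8 decoder cannot interpret it on its own.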

how to remove # character from national data type in cobol

I am facing an issue while converting Unicode data into national characters.
When I convert the Unicode data into national using the NATIONAL-OF function, a junk character like # is appended after the string.
E.g
Ws-unicode pic X(200)
Ws-national pic N(600)
-- let the value in Ws-unicode be これらの変更は, received from the Java side.
move function national-of ( Ws-unicode ,1208 ) to Ws-national.
-- after conversion the value looks like これらの変更は #.
I do not want the extra # character added after conversion.
Please help me find a possible solution. I have tried replacing N'#' with a space using the INSPECT statement.
It worked well but failed in specific scenarios, such as when the user's input itself contains #: in that case the genuine # is also converted to a space.
Below is a snippet of code I used to convert EBCDIC to UTF. Before I was capturing string lengths, I was also getting # symbols:
STRING
FUNCTION DISPLAY-OF (
FUNCTION NATIONAL-OF (
WS-EBCDIC-STRING(1:WS-XML-EBCDIC-LENGTH)
WS-EBCDIC-CCSID
)
WS-UTF8-CCSID
)
DELIMITED BY SIZE
INTO WS-UTF8-STRING
WITH POINTER WS-XML-UTF8-LENGTH
END-STRING
SUBTRACT 1 FROM WS-XML-UTF8-LENGTH
What this code does is string the UTF-8 representation of the EBCDIC string into another variable. The WITH POINTER clause will capture the new length of the string + 1 (+ 1 because the pointer is positioned at the next position after the string ended).
Using this method, you should be able to know exactly how long the second string is and use that string with the exact length.
That should remove the unwanted #s.
EDIT:
One thing I forgot to mention, in my case, the # signs were actually EBCDIC low values when viewing the actual hex on the mainframe
Use INSPECT with REVERSE and stop after the first occurrence of #.

How do I replace spaces with %20 in PowerShell?

I'm creating a PowerShell script that will assemble an HTTP path from user input. The output has to convert any spaces in the user input to the product specific codes, "%2F".
Here's a sample of the source and the output:
The site URL can be a constant, though a variable would be a better approach for reuse. As used in the program it is: "/http:%2F%2FSPServer/Projects/"
$Company="Company"
$Product="Product"
$Project="The new project"
$SitePath="$SiteUrl/$Company/$Product/$Project"
As output I need:
'/http:%2F%2FSPServer%2FProjects%2FCompany%2FProduct%2FThe%2Fnew%2Fproject'
To replace " " with %20 and / with %2F and so on, do the following:
[uri]::EscapeDataString($SitePath)
The solution of @manojlds converts all odd characters in the supplied string.
If you want to do escaping for URLs only, use
[uri]::EscapeUriString($SitePath)
This will leave, e.g., slashes (/) or equal signs (=) as they are.
Example:
# Returns http%3A%2F%2Ftest.com%3Ftest%3Dmy%20value
[uri]::EscapeDataString("http://test.com?test=my value")
# Returns http://test.com?test=my%20value
[uri]::EscapeUriString("http://test.com?test=my value")
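The distinction between escaping everything and escaping only what is unsafe in a URL can also be sketched in Python. This is an analogy using urllib.parse, not the .NET methods themselves; the variable names and the choice of safe characters are assumptions made for the example.

```python
from urllib.parse import quote, quote_plus

url = "http://test.com?test=my value"

# Escape every reserved character, similar in spirit to EscapeDataString.
escaped_all = quote(url, safe="")
print(escaped_all)  # http%3A%2F%2Ftest.com%3Ftest%3Dmy%20value

# Leave URL delimiters alone, similar in spirit to EscapeUriString.
escaped_url = quote(url, safe=":/?=&")
print(escaped_url)  # http://test.com?test=my%20value

# Form-style encoding turns a space into + rather than %20.
form_style = quote_plus("my value")
print(form_style)  # my+value
```

Which variant is right depends on whether the value is a whole URL or a single component being inserted into one.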
On newer operating systems this may not work as written; I had problems with it on Server 2012 R2 and Windows 10.
[System.Net.WebUtility] is what you should use if you get errors that [System.Web.HttpUtility] is not there.
$Escaped = [System.Net.WebUtility]::UrlEncode($SitePath)
The output transformation you need (spaces to %20, forward slashes to %2F) is called URL encoding. It replaces (escapes) characters that have a special meaning when part of a URL with their hex equivalent preceded by a % sign.
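You can use .NET Framework classes from within PowerShell (if the type is not found, load the assembly first with Add-Type -AssemblyName System.Web). Be aware that UrlEncode encodes a space as + rather than %20.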
[System.Web.HttpUtility]::UrlEncode($SitePath)
Encodes a URL string. These method overloads can be used to encode the entire URL, including query-string values.
http://msdn.microsoft.com/en-us/library/system.web.httputility.urlencode.aspx

Decoding URL containing unicode characters

I have the following code in a Mako template:
<a href="#" onclick='getCompanyHTML("${fund.investments[inv_name].name | u}"); return false;'>${inv_name}</a>
This applies url escaping to the name string of an object representing a company. The resulting escaped string is then used in a url. The Mako documentation states that url encoding is provided using urllib.quote_plus(string.encode('utf-8')).
On the server I receive the company name part into the argument investment_name:
def Investment(client, fund_name, investment_name, **kwargs):
client = urllib.unquote_plus(client)
fund_name = urllib.unquote_plus(fund_name)
investment_name = urllib.unquote_plus(investment_name)
I then use investment_name as a key back in to the same dictionary from which it was extracted in the template.
This works fine for all the standard cases, such as spaces, slashes, and single quotes in the company name. However, it fails if the company name contains unicode characters outside of the ascii character set.
For instance, the url for company name "Eptisa Servicios de Ingeniería S.L." is rendered as "Eptisa+Servicios+de+Ingenier%C3%ADa+S.L." When this value arrives back at the server, I'm reversing the url escaping but clearly failing to decode the unicode properly because my attempt to use the result as a dictionary key generates a key error.
I've tried adding unicode decoding in these two forms, without luck:
investment_name = urllib.unquote_plus(investment_name.decode('utf-8'))
investment_name = urllib.unquote_plus(investment_name.encode('raw_unicode_escape').decode('utf-8'))
Can anyone suggest what I must do to "Eptisa+Servicios+de+Ingenier%C3%ADa+S.L." to turn it back into "Eptisa Servicios de Ingeniería S.L."?
Do it in the reverse order: first unquote then .decode('utf-8')
Do not mix bytes and Unicode strings.
Example
import urllib
q = "Eptisa+Servicios+de+Ingenier%C3%ADa+S.L."
b = urllib.unquote_plus(q)
u = b.decode("utf-8")
print u
Note: print u might produce UnicodeEncodeError. To fix it:
print u.encode(character_encoding_your_console_understands)
Or set the PYTHONIOENCODING environment variable.
On Unix you could try locale.getpreferredencoding() as the character encoding; on Windows, see the output of chcp.
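The example above is Python 2. As a sketch of the same round trip on Python 3 (an assumption, since the question itself uses Python 2), urllib.parse handles both steps, because unquote_plus decodes the percent-encoded bytes as UTF-8 by default:

```python
from urllib.parse import quote_plus, unquote_plus

name = "Eptisa Servicios de Ingeniería S.L."

# Encoding: the string is encoded to UTF-8, then percent-escaped.
quoted = quote_plus(name)
print(quoted)  # Eptisa+Servicios+de+Ingenier%C3%ADa+S.L.

# Decoding: unquote first, then interpret the bytes as UTF-8
# (unquote_plus does both when given a str).
restored = unquote_plus(quoted)
print(restored == name)  # True
```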