How to convert Unicode characters to escape codes - unicode

So, I have a bunch of strings like this: {\b\cf12 よろてそ } . I'm thinking I could iterate over each character and replace any unicode (Edit: Anything where AscW(char) > 127 or < 0) with a unicode escape code (\u###). However, I'm not sure how to programmatically do so. Any suggestions?
Clarification:
I have a string like {\b\cf12 よろてそ } and I want a string like {\b\cf12 [STUFF]}, where [STUFF] will display as よろてそ when I view the rtf text.

You can simply use the AscW() function to get the correct value:-
sRTF = "\u" & CStr(AscW(char))
Note that, unlike other Unicode escapes, RTF uses the decimal signed 16-bit integer (2 bytes) representation of a Unicode character, which is exactly what AscW() returns. That makes the conversion in VB6 quite easy.
Edit
As MarkJ points out in a comment, you would only do this for characters outside the range 0-127, but some characters inside that range also need special handling (in RTF the backslash and the braces { and } must themselves be escaped).
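For comparison, here is a minimal Python sketch of the same idea, not the VB6 code itself. The function name rtf_escape and the "?" fallback character are my own assumptions: as far as I know, RTF readers skip one character after each \uN by default (the \uc1 setting), so a placeholder is appended. Each UTF-16 code unit is written as a signed 16-bit decimal, matching what AscW() returns.

```python
import struct

def rtf_escape(text: str, fallback: str = "?") -> str:
    """Escape text for embedding in RTF (hypothetical helper, a sketch only)."""
    parts = []
    for ch in text:
        if ch in "\\{}":
            # RTF control characters inside the ASCII range need a backslash.
            parts.append("\\" + ch)
        elif ord(ch) < 128:
            parts.append(ch)
        else:
            # Encode to UTF-16 and read each code unit as a signed 16-bit int,
            # which is the decimal form RTF expects after \u.
            for (unit,) in struct.iter_unpack("<h", ch.encode("utf-16-le")):
                # Assumption: one fallback character follows each \uN.
                parts.append("\\u%d%s" % (unit, fallback))
    return "".join(parts)

print(rtf_escape("{\\b\\cf12 \u3088\u308d\u3066\u305d }"))
```

Note that this sketch escapes every backslash and brace, so it should be applied only to the text portion and not to RTF control words that are already in place.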

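Another, more roundabout way would be to add MSScript.OCX to the project and call VBScript's Escape function. Note that this produces %uXXXX hexadecimal escapes rather than RTF's decimal \u form, so the output would still need converting. For example: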
Sub main()
    Dim s As String
    s = ChrW$(&H3088) & ChrW$(&H308D) & ChrW$(&H3066) & ChrW$(&H305D)
    Debug.Print MyEscape(s)
End Sub
Function MyEscape(s As String) As String
    Dim scr As Object
    Set scr = CreateObject("MSScriptControl.ScriptControl")
    scr.Language = "VBScript"
    scr.Reset
    MyEscape = scr.eval("escape(" & dq(s) & ")")
End Function
Function dq(s)
    dq = Chr$(34) & s & Chr$(34)
End Function
The Main routine passes in the original Japanese characters and the debug output says:
%u3088%u308D%u3066%u305D
HTH

Replace emdash with double dash

I want to replace ― back into --
I tried with the utf8 encodings but that doesn't work
string = "blablabla -- blablabla ―"
I want to replace the long dash (if there is one) with double hyphens. I tried it the simple way but that didn't work:
string= string.replace ("―", "--")
I also tried to encode it with utf8 and use the codes of the special characters
stringutf8= string.encode("utf-8")
emdash= u"\u2014"
hyphen= u"\u002D"
if emdash in stringutf8:
    stringutf8.replace(emdash, 2*hyphen)
Any suggestions?
I am working with text files in which sometimes apparently the two hyphens are replaced automatically with a long dash...
thanks a lot!
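You are dealing with strings here. A string is a sequence of characters, so replace the character itself and leave the encoding out of the equation. Note also that replace() returns a new string rather than modifying the original, so the result must be assigned to something.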
string = 'blablabla -- blablabla \u2014'
emdash = '\u2014'
hyphen = '\u002D'
string2 = string.replace(emdash, 2*hyphen)
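If the text files may contain more than one kind of long dash, a small sketch along these lines might help. The helper name replace_long_dashes and the choice of U+2014 (em dash) and U+2015 (horizontal bar) are my assumptions about what the files contain; adjust the table to whatever characters actually appear.

```python
# Map each long-dash code point to a double hyphen (assumed set of dashes).
DASH_TABLE = str.maketrans({"\u2014": "--", "\u2015": "--"})

def replace_long_dashes(text: str) -> str:
    """Return a copy of text with long dashes replaced by '--'."""
    # translate() returns a new string; the original is left unchanged.
    return text.translate(DASH_TABLE)

string = "blablabla -- blablabla \u2014"
print(replace_long_dashes(string))
```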

Strange results when deleting all special characters from a string in Progress / OpenEdge

I have the code snippet below (as suggested in this previous Stack Overflow answer ... Deleting all special characters from a string in progress 4GL) which is attempting to remove all extended characters from a string so that I may transmit it to a customer's system which will not accept any extended characters.
do v-int = 128 to 255:
assign v-string = replace(v-string,chr(v-int),"").
end.
It is working perfectly with one exception (which makes me fear there may be others I have not caught). When it gets to 255, it will replace all 'y's in the string.
If I do the following ...
display chr(255) = chr(121). /* 121 is asc code of y */
I get true as the result.
And therefore, if I do the following ...
display replace("This is really strange",chr(255),"").
I get the following result:
This is reall strange
I have verified that 'y' is the only character affected by running the following:
def var v-string as char init "abcdefghijklmnopqrstuvwxyz".
def var v-int as int.
do v-int = 128 to 255:
assign v-string = replace(v-string,chr(v-int),"").
end.
display v-string.
Which results in the following:
abcdefghijklmnopqrstuvwxz
I know I can fix this by removing 255 from the range but I would like to understand why this is happening.
Is this a character collation set issue or am I missing something simpler?
Thanks for any help!
This is a bug. Here's a Progress Knowledge Base article about it:
http://knowledgebase.progress.com/articles/Article/000046181
The workaround is to specify the codepage in the CHR() statement, like this:
CHR(255, "UTF-8", "1252")
Here it is in your example:
def var v-string as char init "abcdefghijklmnopqrstuvwxyz".
def var v-int as int.
do v-int = 128 to 255:
    assign v-string = replace(v-string, chr(v-int, "UTF-8", "1252"), "").
end.
display v-string.
You should now see the 'y' in the output.
This seems to be a bug!
The REPLACE() function returns an unexpected result when replacing character CHR(255) (ÿ) in a String.
The REPLACE() function modifies the value of the target character, but additionally it changes any occurrence of characters 'Y' and 'y' present in the String.
This behavior seems to affect only the character ÿ. Other characters are correctly changed by REPLACE().
Using default codepage ISO-8859-1
Link to knowledgebase

Need code for removing all unicode characters in vb6

I need code for removing all unicode characters in a vb6 string.
If this is UTF-16 text (as normal VB6 String values all are) and you can ignore the issue of surrogate pairs, then this is fairly quick and reasonably concise:
Private Sub DeleteNonAscii(ByRef Text As String)
    Dim I As Long
    Dim J As Long
    Dim Char As String
    I = 1
    For J = 1 To Len(Text)
        Char = Mid$(Text, J, 1)
        If (AscW(Char) And &HFFFF&) <= &H7F& Then
            Mid$(Text, I, 1) = Char
            I = I + 1
        End If
    Next
    Text = Left$(Text, I - 1)
End Sub
This includes the workaround (the And &HFFFF& mask) for the unfortunate choice VB6 made in returning a signed 16-bit integer from the AscW() function. It should have been a Long for symmetry with ChrW$(), but it is what it is.
It should beat the pants off any regular expression library in clarity, maintainability, and performance. If better performance is required for truly massive amounts of text then SAFEARRAY or CopyMemory stunts could be used.
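For readers comparing across languages, the same filter can be sketched in Python. This is an illustration only, not a VB6 solution, and the name delete_non_ascii is hypothetical. Python strings are sequences of code points rather than UTF-16 code units, so the surrogate-pair caveat above does not arise here.

```python
def delete_non_ascii(text: str) -> str:
    """Keep only characters in the ASCII range 0..127 (a sketch)."""
    return "".join(ch for ch in text if ord(ch) <= 0x7F)

print(delete_non_ascii("abc\u3088d\u00e9"))
```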
Public Shared Function StripUnicodeCharactersFromString(ByVal inputValue As String) As String
Return Regex.Replace(inputValue, "[^\u0000-\u007F]", String.Empty)
End Function
In VB6, I'm not sure whether the following will work:
sRTF = "\u" & CStr(AscW(char))
You could do this for all char values above 127.
StrConv is the command for converting strings.
StrConv Function
Returns a Variant (String) converted as specified.
Syntax
StrConv(string, conversion, LCID)
The StrConv function syntax has these named arguments:
Part        Description
string      Required. String expression to be converted.
conversion  Required. Integer. The sum of values specifying the type of conversion to perform. `128` (vbFromUnicode) converts from Unicode to the local code page (or to the code page of the optional LCID).
LCID        Optional. The LocaleID, if different than the system LocaleID. (The system LocaleID is the default.)

How to add \ before all special characters in MATLAB?

I am trying to add '\' before all special characters in a string in MATLAB, could anyone please help me out. Here is the example:
tStr = 'Hi, I'm a Big (Not So Big) MATLAB addict; Since my school days!';
I want this string to be changed to:
'Hi\, I\'m a Big \(Not so Big \) MATLAB addict\; Since my school days\!'
In MATLAB, a single quote inside a single-quoted string is escaped by doubling it (''), not with a backslash (\) as in C. Thus, your string must be written like this:
tStr = 'Hi\, I\''m a Big (Not so Big ) MATLAB addict\; Since my school days!'
I took the list of special characters defined on the MathWorks webpage to do this:
special = '[]{}()=''.().....,;:%%{%}!#';
tStr = 'Hi, I''m a Big (Not So Big) MATLAB addict; Since my school days!';
outStr = '';
for l = tStr
    if any(special == l)
        outStr = [outStr, '\', l];
    else
        outStr = [outStr, l];
    end
end
which will automatically add those \s. You do need to use two single quotes ('') in place of the apostrophe in your input string. If tStr is obtained with the function input(), or something similar, this procedure will still work.
Edited:
Or using regular expressions:
regexprep(tStr,'([[\]{}()=''.(),;:%%{%}!#])','\\$1')
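The same regular-expression approach can be sketched outside MATLAB, here in Python for comparison. The SPECIAL set is my own deduplicated reading of the list above and the function name escape_special is hypothetical, so treat both as assumptions.

```python
import re

# Assumed set of characters to prefix with a backslash.
SPECIAL = "[]{}()='.,;:%!#"

# re.escape() makes every member safe to place inside a character class.
_PATTERN = re.compile("([" + re.escape(SPECIAL) + "])")

def escape_special(text: str) -> str:
    """Insert a backslash before each special character (a sketch)."""
    # In the replacement, \\ is a literal backslash and \1 is the match.
    return _PATTERN.sub(r"\\\1", text)

t_str = "Hi, I'm a Big (Not So Big) MATLAB addict; Since my school days!"
print(escape_special(t_str))
```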

EDIFACT macro (readable message structure)

I'm working within the EDI area and would like some help with an EDIFACT macro to make the EDIFACT files more readable.
The message looks like this:
data'data'data'data'
I would like to have the macro converting the structure to:
data'
data'
data'
data'
Pls let me know how to do this.
Thanks in advance!
BR
Jonas
If you merely want to view the files in a more readable format, try downloading the Softshare EDI Notepad. It's a fairly good tool just for that purpose, it supports X12, EDIFACT and TRADACOMS standards, and it's free.
Replacing in VIM (assuming that the standard EDIFACT separators/escape characters for UNOA character set are in use):
:s/\([^?]'\)\(.\)/\1\r\2/g
Breaking down the regex:
\([^?]'\) - search for ' which occurs after any character except ? (the standard escape character) and capture these two characters as the first atom. These are the last two characters of each segment.
\(.\) - Capture any single character following the segment terminator (ie. don't match if the segment terminator is already on the end of a line)
Then replace all matches on this line with a new line between the segment terminator and the beginning of the next segment.
Otherwise you could end up with this:
...
FTX+AAR+++FORWARDING?: Freight under Vendor?'
s care.'
NAD+BY+9312345123452'
CTA+PD+0001:Terence Trent D?'
Arby'
...
instead of this:
...
FTX+AAR+++FORWARDING?: Freight under Vendor?'s care.'
NAD+BY+9312345123452'
CTA+PD+0001:Terence Trent D?'Arby'
...
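The same substitution can be sketched in Python, in case a script is more convenient than an editor. It assumes the standard separators (' as segment terminator, ? as release character), and break_segments is a hypothetical name. Like the Vim regex, it does not try to handle an escaped question mark (??) directly before a terminator.

```python
import re

# A terminator not preceded by the release character '?', and followed by
# at least one more character on the same line.
_TERMINATOR = re.compile(r"(?<!\?)'(?=.)")

def break_segments(message: str) -> str:
    """Put each EDIFACT segment on its own line (a sketch)."""
    return _TERMINATOR.sub("'\n", message)

print(break_segments("CTA+PD+0001:Terence Trent D?'Arby'NAD+BY+9312345123452'"))
```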
Is this what you are looking for?
Option Explicit
Dim stmOutput: Set stmOutput = CreateObject("ADODB.Stream")
stmOutput.Open
stmOutput.Type = 2 'adTypeText
stmOutput.Charset = "us-ascii"
Dim stm: Set stm = CreateObject("ADODB.Stream")
stm.Type = 1 'adTypeBinary
stm.Open
stm.LoadFromFile "EDIFACT.txt"
stm.Position = 0
stm.Type = 2 'adTypeText
stm.Charset = "us-ascii"
Dim c: c = ""
Do Until stm.EOS
    c = stm.ReadText(1)
    Select Case c
        Case Chr(39)
            stmOutput.WriteText c & vbCrLf
        Case Else
            stmOutput.WriteText c
    End Select
Loop
stm.Close
Set stm = Nothing
stmOutput.SaveToFile "EDIFACT.with-CRLF.txt"
stmOutput.Close
Set stmOutput = Nothing
WScript.Echo "Done."