Java changing Unicode character Θ to ?

I'm trying to use the Unicode character Θ (\u0398) in my Java program, but when I print that character, I get ?.
System.getProperty("file.encoding") shows the value Cp1252 on my machine. I have tried changing this property to UTF-8, but I'm still getting ?.
The sample code that I use in my test program is given below.
char[] CHAR_TABLE = { '#', '\u0398', '1' };
for (int i = 0; i < CHAR_TABLE.length; i++) {
    System.out.println(CHAR_TABLE[i]);
}
The output of this code is:
#
?
1
I was facing the same issue in another Java application (on another machine), but after a service restart everything was fine.
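One workaround I plan to try (a minimal sketch, assuming Java 10+ for the Charset-based PrintStream constructor) is to wrap System.out in a UTF-8 PrintStream so the output does not depend on file.encoding; the console still has to be able to render Θ, so this alone may not be enough on a Cp1252 console:
import java.io.PrintStream;
import java.nio.charset.StandardCharsets;

public class ThetaPrint {
    public static void main(String[] args) {
        // Use an explicit charset instead of the platform default (Cp1252 here).
        PrintStream out = new PrintStream(System.out, true, StandardCharsets.UTF_8);
        out.println('\u0398'); // should print Θ if the console can display it
    }
}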

How to print unicode to console in Eiffel?

Evidently, in Python:
print u'\u0420\u043e\u0441\u0441\u0438\u044f'
outputs:
Россия
How do I do this in Eiffel?
An example on the front page of eiffel.org (at the time of writing) suggests the following code:
io.put_string_32 ("[
Hello!
¡Hola!
Bonjour!
こんにちは!
Здравствуйте!
Γειά σου!
]")
This has been supported since EiffelStudio 20.05. (I tested the example with EiffelStudio 22.05.)
In this particular case, using print instead of io.put_string_32 works as well. However, in some boundary cases, when all character codes in the string are below 256, you may need to specify the string type explicitly:
print ({STRING_32} "Zürich") -- All character code points are below 256.
Of course, you can write the character code points explicitly:
io.put_string_32 ("%/0x041f/%/0x0440/%/0x0438/%/0x0432/%/0x0435/%/0x0442/%/0x0021/")
The output code page defaults to UTF-8 for text files and the current console code page for CONSOLE (the standard object type of io). If needed, the default encoding can be changed to any other encoding:
io.standard_default.set_encoding ({SYSTEM_ENCODINGS}.console_encoding)
On my Linux system, this code prints exactly what you want:
make
        -- Initialisation of `Current'
    local
        l_utf_converter: UTF_CONVERTER
        l_string: STRING_32
    do
        create l_string.make_empty
        l_string.append_code (0x420)
        l_string.append_code (0x43e)
        l_string.append_code (0x441)
        l_string.append_code (0x441)
        l_string.append_code (0x438)
        l_string.append_code (0x44f)
        print (l_utf_converter.utf_32_string_to_utf_8_string_8 (l_string))
    end
In a nutshell, STRING_32 uses UTF-32 and the Linux console uses UTF-8. The UTF_CONVERTER class can be used to convert UTF-32 to UTF-8.
I do not know if it is possible on a Windows console.

Display 3-byte unicode character in Windows PowerShell

I want to support Unicode and as many characters as possible in my PowerShell script. I want to use UTF-8 as the encoding. So for testing purposes I simply type this line and press Enter:
[char]0x02A7
And it successfully shows the character ʧ.
But when I try to display a Unicode character with a code point above 0xFFFF:
[char]0x01F600
It throws an error saying that the value 128512 cannot be converted to System.Char.
Instead it should show the smiley 😀.
What is wrong here?
Edit:
As Jeroen Mostert stated in the comments, I have to use a different approach for Unicode characters with code points above 0xFFFF. So I wrote this script:
$s = [Char]::ConvertFromUtf32(0x01F600)
Write-Host $s
In the PowerShell IDE I get a beautiful smiley 😀. But when I run the script standalone (in its own window) I don't get the smiley.
Instead it shows two strange characters.
What is wrong here?
Aside from [Char]::ConvertFromUtf32(), here's a way to calculate the surrogate pair by hand for code points that don't fit in 16 bits (http://www.russellcottrell.com/greek/utilities/surrogatepaircalculator.htm):
$S = 0x1F600
[int]$H = [Math]::Truncate(($S - 0x10000) / 0x400) + 0xD800
[int]$L = ($S - 0x10000) % 0x400 + 0xDC00
[char]$H + [char]$L
😀
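As for the standalone window showing two strange characters: that is usually the legacy console host and its font rather than the string itself. A hedged sketch of settings that sometimes help (not guaranteed, since the console font also has to contain the emoji glyph):
# Ask the console host to exchange text as UTF-8.
[Console]::OutputEncoding = [System.Text.Encoding]::UTF8
$OutputEncoding = [System.Text.Encoding]::UTF8

# The string itself is correct; whether 😀 renders depends on the console and its font.
$s = [Char]::ConvertFromUtf32(0x1F600)
Write-Host $s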

Autohotkey String-comparison

For some reason, I cannot get an AutoHotkey string comparison to work in the script I need it in, even though it works in a test script.
Tester
password = asdf
^!=::
InputBox,input,Enter Phrase,Enter Phrase,,,,,,,30,
if ( input == password ){
MsgBox, How original your left home row fingers are
Return
} else {
MsgBox, You entered "%input%"
Return
}
Main
password = password
!^=::
InputBox,input,Enter Password,Enter Password,HIDE,,,,,,30,
if ( input == password ){
MsgBox,"That is correct sir"
;Run,C:\Copy\Registry\disable.bat
return
}else{
MsgBox,That is not correct sir you said %input%
Return
}
Main keeps giving me the "not correct" response. Any ideas?
Your "main" script works just fine.
The == comparison operator is case-sensitive, you know.
I found that strings in the clipboard were not comparing properly to strings in my source file when the source file contained non-ASCII characters. After converting the file to UTF-8 with BOM, the comparison worked correctly.
The documentation doesn't say directly that this affects string comparisons, but it does say that it has an effect. In the FAQ section it states:
Why are the non-ASCII characters in my script displaying or sending incorrectly?
Short answer: Save the script as UTF-8 with BOM.
Although AutoHotkey supports Unicode text, it is optimized for backward-compatibility, which means defaulting to the ANSI encoding rather than the more internationally recommended UTF-8. AutoHotkey will not automatically recognize a UTF-8 file unless it begins with a byte order mark.
Source: https://web.archive.org/web/20230203020016/https://www.autohotkey.com/docs/v1/FAQ.htm#nonascii
So perhaps it does more than just display and send incorrectly; it may also store values incorrectly, causing invalid comparisons.

Why does the filename requested from the server start with Unicode characters?

I use FTP to list the file attributes on the server. I request the names of the files and put them into an array. I print the array directly like this:
NSLog(@"%@", array);
What I got is like this:
\U6587\U4ef6\U540d\Uff1afilename.txt
\U6587\U4ef6\U540d\Uff1afilename1.txt
......
When I want to print the Unicode "\U6587\U4ef6\U540d\Uff1a" to see what it is, I get the compile error "incomplete universal character name".
However, if I print a single name instead of the whole array, I get the name correctly without the Unicode escapes. But I need to do something with the names in the array. I want to know why the Unicode is there, and whether it is proper to just strip those characters and then work with the real file name.
In C99, and therefore presumably Objective C too, there are two Unicode escapes:
\uXXXX
\UXXXXXXXX
The lower-case u is followed by 4 hex digits; the upper-case U is followed by 8 hex digits (of which the first two should be zeroes and the third should be 0 or 1 to be valid Unicode, since the maximum Unicode code point is U+10FFFF).
I believe that if you replace the upper-case U's with lower-case u's, you should get the code to compile.
On my Mac OS 10.7.4 system, compiling with GCC 4.7.0 (home built), I compiled this code:
#include <stdio.h>

int main(void)
{
    char array[] = "\u6587\u4ef6\u540d\uff1a";
    puts(array);
    return 0;
}
and got this output:
文件名:
I can't answer why the characters are there, but the colon-like character at the end suggests that the site might be preceding the actual file name with a tag of some sort (analogous to 'file:'); indeed, 文件名 is Chinese for "file name" and U+FF1A is a fullwidth colon.
Adding to what Jonathan said, you might have to use stringWithUTF8String:, but I agree that the error is with the capital U rather than u.
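For illustration, a minimal sketch of that idea (rawName is a made-up placeholder for whatever UTF-8 C string your FTP code hands back):
#import <Foundation/Foundation.h>

int main(void) {
    @autoreleasepool {
        // Hypothetical input: the label the server sends, followed by a file name.
        const char *rawName = "\u6587\u4ef6\u540d\uff1a" "filename.txt";
        NSString *name = [NSString stringWithUTF8String:rawName];
        NSLog(@"%@", name); // prints 文件名：filename.txt
    }
    return 0;
}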

How to detect if a Unicode char is supported by EBCDIC in .NET 4.0?

We have a web site and WinForms application written in .NET 4.0 that allows users to enter any Unicode char (pretty standard).
The problem is that a small amount of our data gets submitted to an old mainframe application. While we were testing, a user entered a name with characters that ended up crashing the mainframe program. The name was BOËNS. The Ë is not supported.
What is the best way to detect whether a Unicode char is supported by EBCDIC?
I tried using the following regular expression, but that restricted some standard special chars (/, _, :) which are fine for the mainframe.
I would prefer to have one method that validates each char, or a method that takes a string and returns true or false depending on whether it contains chars not supported by EBCDIC.
First, you would have to get the proper Encoding instance for EBCDIC by calling the static Encoding.GetEncoding method, using the overload that takes the code page id along with an EncoderFallback and a DecoderFallback; pass EncoderFallback.ExceptionFallback so that characters which cannot be encoded raise an exception instead of being silently replaced.
Then, in your code, you would loop through each character in your string and call the GetBytes method to encode the character to its byte sequence. If it cannot be encoded, an EncoderFallbackException is thrown; you would just have to wrap each call to GetBytes in a try/catch block to determine which character is in error.
Note, the above is only required if you want to know the position of the character that failed. If you don't care about the position, just whether the string will encode as a whole, you can call the GetBytes overload that takes a string parameter and it will throw the same EncoderFallbackException if a character that cannot be encoded is encountered.
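A minimal sketch of that approach (it assumes EBCDIC code page 37, IBM037; substitute whatever code page your mainframe actually uses):
using System;
using System.Text;

static class EbcdicCheck
{
    // Returns the index of the first character that cannot be encoded, or -1 if all can.
    public static int FirstUnsupportedIndex(string value)
    {
        Encoding ebcdic = Encoding.GetEncoding(37,
            EncoderFallback.ExceptionFallback, DecoderFallback.ExceptionFallback);

        for (int i = 0; i < value.Length; i++)
        {
            try
            {
                ebcdic.GetBytes(value.Substring(i, 1));
            }
            catch (EncoderFallbackException)
            {
                return i;
            }
        }
        return -1;
    }
}
If you only need a yes/no answer, a single ebcdic.GetBytes(value) on the whole string inside a try/catch raises the same EncoderFallbackException without the loop.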
You can escape characters in Regex using the backslash. So if you want to match a dot, you can write @"\.". To match /._,:[]- for example: @"[/._,:\-\[\]]". Now, EBCDIC is 8 bits, but many characters are control characters. Do you have a list of "valid" characters?
I have made this pattern:
string pattern = @"[^a-zA-Z0-9 ¢.<(+&!$*);¬/|,%_>?`:#@'=~{}\-\\" + '"' + "]";
It should find "illegal" characters. If IsMatch then there is a problem.
I have used this: http://nemesis.lonestar.org/reference/telecom/codes/ebcdic.html
Note the special handling of the ". I'm using the @ at the beginning of the string to disable \ escape expansion, so I can't escape the closing quote, and so I add it to the pattern at the end.
To test it:
Regex rx = new Regex(pattern);
bool m1 = rx.IsMatch(@"a-zA-Z0-9 ¢.<(+&!$*);¬/|,%_>?`:#@'=~{}\-\\" + '"');
bool m2 = rx.IsMatch(@"€a-zA-Z0-9 ¢.<(+&!$*);¬/|,%_>?`:#@'=~{}\-\\" + '"');
m1 is false (it's the list of all the "good" characters); m2 is true (to the other list I've added the € symbol).