Can I read Cyrillic (Russian) characters from the CLI in Groovy? - encoding

I have a Groovy script that takes user input from the CLI. The CLI supports Cyrillic characters, and both the terminal encoding and charset are UTF-8. Yet when Groovy reads input containing Cyrillic characters, all it sees is "???????". Additionally, Groovy cannot create a directory or file with the given parameter. Does anyone have ideas on how to force Groovy to accept the Cyrillic characters? Thanks.

Ensure the reader you're using has the same encoding as your CLI. If they match, the problem may be in displaying the characters instead. You can verify the Unicode code points Groovy receives like this:
// test.groovy
def input = System.in.withReader('UTF-8') { it.readLine() }
input.eachWithIndex { ch, index ->
    println "$ch, ${Character.codePointAt(input, index)}"
}
Run this from the CLI:
$ echo $LANG
en_US.UTF-8
$ echo Здра́вствуйте | groovy test.groovy
З, 1047
д, 1076
р, 1088
а, 1072
́, 769
в, 1074
с, 1089
т, 1090
в, 1074
у, 1091
й, 1081
т, 1090
е, 1077
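Note that the output above lists 13 code points for what looks like 12 letters: the acute accent in "Здра́вствуйте" is a separate combining character, U+0301. A small Python sketch (used only to illustrate the decomposition; it is not part of the Groovy solution) shows the same thing:

```python
import unicodedata

# "Здра́вствуйте" contains U+0301 COMBINING ACUTE ACCENT as its own
# code point, which is why the per-character output has an "extra" entry.
s = "Здра\u0301вствуйте"
for ch in s:
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}")
```

If the code points print correctly here but your terminal shows "?", the encoding mismatch is on the output side, not the reader.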

Related

Convert from memcached output binary protocol to readable UTF8 in command line

Right now a lot of symbols are not readable via the memcat/memccat or telnet (get) commands.
Is there any way to convert the telnet/memcat/memccat output to readable UTF-8 (other than using native clients)?
aim-server[~/www/next/src]$ telnet localhost 11211
Trying 127.0.0.1...
Connected to localhost.
Escape character is '^]'.
get 20_dev_cache:c9bf76a9fa1e19ad92ab7195c64e47f8
VALUE 20_dev_cache:c9bf76a9fa1e19ad92ab7195c64e47f8 84 7914
�a:9:{s:14:"Licenses";a:2`�8:"QSL/TestModule"; 49:"JJJC-UNKK-MXZL-DAGW��� 42#20:"NNHJ2�6312-1 ";} R2��Tok �#�1#�#jC/TaxJar#E1:"�1
coreVersion#03#05:"majo`,3:"5.4#6#in� 8 B`build#
~35�� q#Wa ceKeyValue#$ �#!ad MURL#3 ehttps://xlocal/next/src/`'.php#(!jwav IN |1 L public_key#73 ssh-rsa A
ADAQAB CAQDWq4/EAVRilQslmKeA9A6y8f5i+oJSg0dfwaXUnbP6f4YwuJq8TOzr/q05HoQGS/biMExeu/YF2nu/Vo2RoBCV9rW5j+wPeIicgUQHarO+zoLTFM2+xdR7aG2MMEW4NO+4fdXgmRiqm9z6LJW4wpISZWiBqbJxsjaxeCvCVSsAtxlmP8Fg19lDB9OiMsll+GkMVAprH4xxhUiPz2hs2c6f9kf5kG2lf2lNS2bovobNFT2etds9so7HdbJ8GPVJf2wC0xUaRR7QNm0+HZ2SzfybYR11WQqmBrWVhZBVxhN343Knjh3jjW3jx/eOWWVeHezG8apD7YC58/ZVUhW3KdFbB3huUXNYmY8FWtYUkC+QEIjLkste57FvIMGb2Opwm4+inVf4hsVBs/a14bJgJGIn+7mLwvgsoAePunpRyv3tk0Wz+yM2KFy7xjHxt4wIIRUVRrp9D4ZHJbB9H7f0YcJRfCv/guIXywOJ1e1TDsjGfImbk77M+gV/gSHdnrlR2QRQYar7PcNWicVRKbDmbz9FERxhoSyd0Bhq+Jgp31oc2qxQ89#"6# 6:"auth�� KhgMQ+CBK9tAHrv00P4VIMj6tmBIENviEVAcMLbN+JbtnqXDrRpm RIbwxMurQeTqUAqgfva/nHoucBFAZETCX+LnsrIG2KvoVjP3XcZsrGQ== aim#example.comB�1#� installedAdC�224C�D)
�2#��G#�$A�� w��Y��0#)��i � �Ec��7:"enab �";b:0�;�/$C$�$�4#~utomatedShippingRefunds71LBS/SeentyOnePo#�
CDev/�`S3Imag�"� � �#+#Z
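One common reason for output like this is that the client stored the value compressed (many memcached clients zlib-compress large values), so the raw bytes are not text in any charset until decompressed. A minimal sketch, assuming zlib compression over UTF-8 text; the stored bytes here are simulated rather than captured from a real server:

```python
import zlib

# Simulate a value a client stored zlib-compressed (an assumption:
# real bytes would come from the memcached "get" response).
stored = zlib.compress("Название".encode("utf-8"))
print(stored[:4])                               # binary header, unreadable as text
print(zlib.decompress(stored).decode("utf-8"))  # readable UTF-8 again
```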

Does the postgres COPY function support utf 16 encoded files?

I am trying to use the PostgreSQL COPY function to insert a UTF-16 encoded CSV into a table. However, when running the query below:
COPY temp
FROM 'C:\folder\some_file.csv'
WITH (
DELIMITER E'\t',
FORMAT csv,
HEADER);
I get the error below:
ERROR: invalid byte sequence for encoding "UTF8": 0xff
CONTEXT: COPY temp, line 1
SQL state: 22021
and when I run the same query but add ENCODING 'UTF-16' or ENCODING 'UTF 16' to the WITH block, I get the error below:
ERROR: argument to option "encoding" must be a valid encoding name
LINE 13: ENCODING 'UTF 16' );
^
SQL state: 22023
Character: 377
I've looked through the Postgres documentation to try to find the correct encoding name, but haven't managed to find anything. Is this because the COPY function does not support UTF-16 encoded files? I would have thought this would almost certainly be possible!
I'm running Postgres 12 on Windows 10 Pro.
Any help would be hugely appreciated!
No, you cannot do that.
UTF-16 is not in the list of supported encodings.
PostgreSQL will never support an encoding that is not an extension of ASCII.
You will have to convert the file to UTF-8.

Mule ESB execute script from Groovy with unicode params

I have this little chain of components in my Mule ESB project:
<set-payload value="Получена заявка ##[sessionVars['ticketID']]" doc:name="Set SMS Text"/>
<scripting:transformer doc:name="Send SMS" ignoreBadInput="true">
<scripting:script engine="Groovy"><![CDATA[
    def command = ["/tmp/call.sh", message.payload]
    def proc = command.execute()
    proc.waitFor()
]]></scripting:script>
</scripting:transformer>
And /tmp/call.sh listing:
#!/bin/bash
echo $# > /tmp/call.out
When the message passes through the Mule chain, in /tmp/call.out I see "Џолучена заЯвка #4041" instead of the expected "Получена заявка #4041" ("Получена заявка" is Russian), i.e. there is a problem with the output of Unicode characters, while ASCII characters come through fine.
When I check /tmp/call.out with a hex editor, I see that all the Russian characters are one byte long (in UTF-8 they should be two bytes each), i.e. the output of my Groovy component is not UTF-8.
There is no problem with Unicode output to the Mule log when I use the Echo and Logger components. The SMTP component also works perfectly: I successfully receive Unicode mail from Mule.
Can you help me with Unicode arguments in a Groovy command call from Mule ESB?
Solved by selecting UTF-8 encoding in the Run configuration options (menu Run -> Run Configurations...). By default it was MacCyrillic.
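The one-byte characters seen in the hex editor fit that diagnosis: MacCyrillic encodes each Russian letter as a single byte, while UTF-8 uses two. A quick Python illustration (codec names are Python's; this only demonstrates the byte-length difference, not the Mule fix itself):

```python
s = "Получена заявка"
# MacCyrillic: 1 byte per character; UTF-8: 2 bytes per Cyrillic letter.
print(len(s.encode("mac_cyrillic")))  # 15
print(len(s.encode("utf-8")))         # 29
```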

Encoding of file names in Java

I am running a small Java application on an embedded Linux platform. After replacing the Java VM JamVM with OpenJDK, file names with special characters are not stored correctly. Special characters like umlauts are replaced by question marks.
Here is my test code:
import java.io.File;
import java.io.IOException;

public class FilenameEncoding
{
    public static void main(String[] args) {
        String name = "umlaute-äöü";
        System.out.println("\nname = " + name);
        System.out.print("name in Bytes: ");
        for (byte b : name.getBytes()) {
            System.out.print(Integer.toHexString(b & 255) + " ");
        }
        System.out.println();
        try {
            File f = new File(name);
            f.createNewFile();
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}
Running it gives the following output:
name = umlaute-???
name in Bytes: 75 6d 6c 61 75 74 65 2d 3f 3f 3f
and a file called umlaute-??? is created.
Setting the properties file.encoding and sun.jnu.encoding to UTF-8 gives the correct strings in the terminal, but the created file is still umlaute-???
Running the VM with strace, I can see the system call
open("umlaute-???", O_RDWR|O_CREAT|O_EXCL|O_LARGEFILE, 0666) = 4
This shows that the problem is not a file system issue, but one in the VM.
How can the encoding of the file name be set?
If you are using Eclipse, go to Window -> Preferences -> General -> Workspace and select the "Text file encoding" option you want from the pull-down menu. By changing mine around, I was able to recreate your problem (and also change back to the fix).
If you are not, you can add an environment variable on Windows (System Properties -> Environment Variables, and under System variables select New...). The name should be JAVA_TOOL_OPTIONS (without quotes) and the value should be set to -Dfile.encoding=UTF8 (or whatever encoding gets yours working).
I found the answer through this post, btw:
Setting the default Java character encoding?
Linux Solutions
- (Permanent) Running env | grep LANG in the terminal will give you one or two lines showing what encoding Linux is currently set up with. You can then set LANG to UTF-8 (yours might be set to ASCII) in the /etc/sysconfig/i18n file (I tested this on Fedora, kernel 2.6.40). Basically, I switched from UTF-8 (where I had odd characters) to ASCII (where I had question marks) and back.
- (On starting the JVM, but may not fix the problem) You can start the JVM with the encoding you want using java -Dfile.encoding=**** FilenameEncoding
Here is the output from the two ways:
[youssef@JoeLaptop bin]$ java -Dfile.encoding=UTF8 FilenameEncoding
name = umlaute-הצ�
name in Bytes: 75 6d 6c 61 75 74 65 2d d7 94 d7 a6 ef bf bd
UTF-8
UTF8
[youssef@JoeLaptop bin]$ java FilenameEncoding
name = umlaute-???????
name in Bytes: 75 6d 6c 61 75 74 65 2d 3f 3f 3f 3f 3f 3f 3f
US-ASCII
ASCII
Here are some references for the Linux stuff
http://www.cyberciti.biz/faq/set-environment-variable-linux/
and here is one about the -Dfile.encoding
Setting the default Java character encoding?
I know it's an old question, but I had the same problem.
None of the mentioned solutions worked for me, but the following solved it:
- Set the source encoding to UTF-8 (project.build.sourceEncoding to UTF-8 in the Maven properties)
- Program arguments: -Dfile.encoding=utf8 and -Dsun.jnu.encoding=utf8
- Use java.nio.file.Path instead of java.io.File
Your problem is that javac is expecting a different encoding for your .java file than the one you saved it with. Didn't javac warn you when you compiled?
Maybe you saved it with encoding ISO-8859-1 or windows-1252 while javac expects UTF-8.
Provide the correct encoding to javac with the -encoding flag, or the equivalent for your build tool.
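The 3f bytes in the strace output are the generic substitution character: when the file name is serialized with an ASCII-only encoding, each umlaut becomes '?'. The mechanism can be reproduced outside the JVM in a few lines of Python (used here purely as an illustration):

```python
name = "umlaute-äöü"
# An ASCII encoder cannot represent ä/ö/ü and substitutes '?' (0x3f),
# matching the bytes seen in the question's output and strace trace.
encoded = name.encode("ascii", errors="replace")
print(encoded.hex(" "))  # 75 6d 6c 61 75 74 65 2d 3f 3f 3f
```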

Python 3 doesn't read unicode file on a new server

My web pages are served by a script that dynamically imports a bunch of files with
try:
    with open(filename, 'r') as f:
        exec(f.read())
except IOError:
    pass
(Actually, can you suggest a better way to import a file? I'm sure there is one.)
Sometimes the files have strings in different languages, like
# contents of language.ru
title = "Название"
Those were all saved as UTF-8 files. Python has no problem running the script from the command line or serving a page from my MacBook:
OK: [server command line] python3.0 page.py /index.ru
OK: http://whitebox.local/index.ru
but it throws an error when trying to serve a page from a server we just moved to:
157 try:
158 with open (filename, 'r') as f:
159 exec(f.read())
160 except IOError: pass
161
/usr/local/lib/python3.0/io.py in read(self=, n=-1)
...
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe1 in position 627: ordinal not in range(128)
All the files were copied from my laptop where they were perfectly served by Apache. What is the reason?
Update: I found out the default encoding for open() is platform-dependent, so it was UTF-8 on my laptop and ASCII on the server. I wonder if there is a per-program way to set it in Python 3 (sys.setdefaultencoding is used in the site module and then deleted from the namespace).
Use open(filename, 'r', encoding='utf8').
See Python 3 docs for open.
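The platform dependence noted in the question's update can be checked directly: when open() gets no encoding argument, it falls back to the locale's preferred encoding, which is what differed between the MacBook and the server. A short sketch:

```python
import locale

# This is the encoding open() uses when none is passed explicitly;
# it returned UTF-8 on the laptop and ASCII on the new server.
print(locale.getpreferredencoding(False))
```

Passing encoding='utf8' explicitly, as above, removes the dependence on the server's locale.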
Use the codecs library. I'm using Python 2.6.6, where the usual open does not take an encoding argument:
import codecs
codecs.open('filename', 'r', encoding='UTF-8')
You can use something like
with open(fname, 'r', encoding="ascii", errors="surrogateescape") as f:
    data = f.read()

# make changes to the string 'data'

with open(fname + '.new', 'w', encoding="ascii", errors="surrogateescape") as f:
    f.write(data)
More information is in the Python Unicode documentation.