Different encoding in Jython's Java and Python level

Different encoding in Jython's Java and Python level - encoding

I'm using Sikuli (see sikuli.org) which uses jython2.5.2.
Here is a summary of the class Region on the Java level:
public class Region {
// < other class methods >
public int type(String text) {
System.out.println("javadebug: "+text); // debug output
// do actual typing
}
}
On the Pythonlevel there is a Wrapperclass:
import Region as JRegion // import java class
class Region(JRegion):
# < other class methods >
def type(self, text):
print "pythondebug: "+text // debug output
JRegion.type(self, text)
This works as intended for ascii chars, but when I use ö, ä or ü as text, this happens:
// python input:
# -*- encoding: utf-8 -*-
someregion = Region()
someregion.type("ä")
// output:
pythondebug: ä
javadebug: Ã¤
The character seems to be converted wrongly when passed to the Java object.
I would like to know what exactly is going wrong here and how to fix this, so that the characters entered in the pythonmethod are the same in the javamethod.
Thanks for your help

Looking from the Jython code you have to tell Java, that the string is UTF-8 encoded:
def type(self, text):
jtext = java.lang.String(text, "utf-8")
print "pythondebug: " + text // debug output
JRegion.type(self, jtext)

Related

How to make my own encoding for a file in VSCode Editor

Is it possible to have an own encoding in VSCode editor, inheritd from an exising?
class myEcoding implements utf-8
{
// changes for some codes
}
I have some files, which contains german characters like "ä ö ü" that are encoded as unicode numbers in this file.
So for example, the file conatins the following line
Pr\u00FCfsignal
While I want to edit this file with the correct german characters, it should exist on the harddisk in the form above.
This is how I want to see it in the editor
Prüfsignal
I already have a function, that can transform a string in both directions:
function translate(content: string, direction: boolean): string {
if (direction) {
content = content
.replace(/\\u00E4/g, "ä")
.replace(/\\u00F6/g, "ö")
.replace(/\\u00FC/g, "ü")
.replace(/\\u00C4/g, "Ä")
.replace(/\\u00D6/g, "Ö")
.replace(/\\u00DC/g, "Ü")
.replace(/\\u00DF/g, "ß")
.replace(/\\u00B0/g, "°")
.replace(/\\u00B1/g, "±")
.replace(/\\u00B5/g, "µ");
}
else {
content = content
.replace(/ä/g, "\\u00E4")
.replace(/ö/g, "\\u00F6")
.replace(/ü/g, "\\u00FC")
.replace(/Ä/g, "\\u00C4")
.replace(/Ö/g, "\\u00D6")
.replace(/Ü/g, "\\u00DC")
.replace(/ß/g, "\\u00DF")
.replace(/°/g, "\\u00B0")
.replace(/±/g, "\\u00B1")
.replace(/µ/g, "\\u00B5");
}
return content;
}
Can this be solved with a custom encoding, and if yes, any hints?
Is there possibly a better solution?

There has been an open feature request for some years to Provide encoding-related APIs for editor extensions:
https://github.com/microsoft/vscode/issues/824
For now you could just wrap that function in a loop that encodes all files in the working directory.

D-language: How to print Unicode characters to the console?

I have the following simple program to generate a random Unicode string from the union of 3 unicode character-sets.
#!/usr/bin/env rdmd
import std.uni;
import std.random : randomSample;
import std.stdio;
import std.conv;
/**
* Random salt generator
*/
dstring get_salt(uint s)
{
auto unicodechars = unicode("Cyrillic") | unicode("Armenian") | unicode("Telugu");
dstring unichars = to!dstring(unicodechars);
return to!dstring(randomSample(unichars, s));
}
void main()
{
writeln("Random salt:");
writeln(get_salt(32));
}
However, the output of the writeln is:
$ ./teste.d
Random salt:
rw13 13437 78580112 104 3914645
What are these numbers? Unicode code-points? How do I print the actual characters? I am on Ubuntu Linux with Locale set to UTF-8

This line is the problem you have:
dstring unichars = to!dstring(unicodechars);
It converts the CodepointSet object unicode returns to string, not the characters it covers. The set has a name and boundaries of characters but not characters itself. It took this:
InversionList!(GcPolicy)(CowArray!(GcPolicy)([1024, 1157, 1159, 1320, 1329, 1367, 1369, 1376, 1377, 1416, 1418, 1419, 1423, 1424, 3073, 3076, 3077, 3085, 3086, 3089, 3090, 3113, 3114, 3124, 3125, 3130, 3133, 3141, 3142, 3145, 3146, 3150, 3157, 3159, 3160, 3162, 3168, 3172, 3174, 3184, 3192, 3200, 7467, 7468, 7544, 7545, 11744, 11776, 42560, 42648, 42655, 42656, 64275, 64280, 5]))
And pulled random chars out of that string! Instead, you want:
dstring unichars = to!dstring(unicodechars.byCodepoint);
Calling the byCodepoint method on that object will yield the actual characters (well, code points, unicode is messy) inside the range, then you get a string out of that and randomize it.

ANTLR4: Using non-ASCII characters in token rules

On page 74 of the ANTRL4 book it says that any Unicode character can be used in a grammar simply by specifying its codepoint in this manner:
'\uxxxx'
where xxxx is the hexadecimal value for the Unicode codepoint.
So I used that technique in a token rule for an ID token:
grammar ID;
id : ID EOF ;
ID : ('a' .. 'z' | 'A' .. 'Z' | '\u0100' .. '\u017E')+ ;
WS : [ \t\r\n]+ -> skip ;
When I tried to parse this input:
Gŭnter
ANTLR throws an error, saying that it does not recognize ŭ. (The ŭ character is hex 016D, so it is within the range specified)
What am I doing wrong please?

ANTLR is ready to accept 16-bit characters but, by default, many locales will read in characters as bytes (8 bits). You need to specify the appropriate encoding when you read from the file using the Java libraries. If you are using the TestRig, perhaps through alias/script grun, then use argument -encoding utf-8 or whatever. If you look at the source code of that class, you will see the following mechanism:
InputStream is = new FileInputStream(inputFile);
Reader r = new InputStreamReader(is, encoding); // e.g., euc-jp or utf-8
ANTLRInputStream input = new ANTLRInputStream(r);
XLexer lexer = new XLexer(input);
CommonTokenStream tokens = new CommonTokenStream(lexer);
...

Grammar:
NAME:
[A-Za-z][0-9A-Za-z\u0080-\uFFFF_]+
;
Java:
import org.antlr.v4.runtime.CharStream;
import org.antlr.v4.runtime.CharStreams;
import org.antlr.v4.runtime.CommonTokenStream;
import org.antlr.v4.runtime.TokenStream;
import com.thalesgroup.dms.stimulus.StimulusParser.SystemContext;
final class RequirementParser {
static SystemContext parse( String requirement ) {
requirement = requirement.replaceAll( "\t", " " );
final CharStream charStream = CharStreams.fromString( requirement );
final StimulusLexer lexer = new StimulusLexer( charStream );
final TokenStream tokens = new CommonTokenStream( lexer );
final StimulusParser parser = new StimulusParser( tokens );
final SystemContext system = parser.system();
if( parser.getNumberOfSyntaxErrors() > 0 ) {
Debug.format( requirement );
}
return system;
}
private RequirementParser() {/**/}
}
Source:
Lexers and Unicode text

For those having the same problem using antlr4 in java code, ANTLRInputStream beeing deprecated, here is a working way to pass multi-char unicode data from a String to a the MyLexer lexer :
String myString = "\u2013";
CharBuffer charBuffer = CharBuffer.wrap(myString.toCharArray());
CodePointBuffer codePointBuffer = CodePointBuffer.withChars(charBuffer);
CodePointCharStream cpcs = CodePointCharStream.fromBuffer(codePointBuffer);
OneLexer lexer = new MyLexer(cpcs);
CommonTokenStream tokens = new CommonTokenStream(lexer);

You can specify the encoding of the file when actually reading the file.
For Kotlin/Java that could look like this, no need to specify the encoding in the grammar!
val inputStream: CharStream = CharStreams.fromFileName(fileName, Charset.forName("UTF-16LE"))
val lexer = BlastFeatureGrammarLexer(inputStream)
Supported Charsets by Java/Kotlin

How do I generate binary RFC822-style headers in Python 3.2?

How do I convince email.generator.Generator to use binary in Python 3.2? This seems like precisely the use case for the policy framework that was introduced in Python 3.3, but I would like my code to run in 3.2.
from email.parser import Parser
from email.generator import Generator
from io import BytesIO, StringIO
data = "Key: \N{SNOWMAN}\r\n\r\n"
message = Parser().parse(StringIO(data))
with open("/tmp/rfc882test", "w") as out:
Generator(out, maxheaderlen=0).flatten(message)
Fails with UnicodeEncodeError: 'ascii' codec can't encode character '\u2603' in position 0: ordinal not in range(128).

Your data is not a valid RFC2822 header, which I suspect misleads you. It's a Unicode string, but RFC2822 is always only ASCII. To have non-ASCII characters you need to encode them with a character set and either base64 or quoted-printable encoding.
Hence, valid code would be this:
from email.parser import Parser
from email.generator import Generator
from io import BytesIO, StringIO
data = "Key: =?utf8?b?4piD?=\r\n\r\n"
message = Parser().parse(StringIO(data))
with open("/tmp/rfc882test", "w") as out:
Generator(out, maxheaderlen=0).flatten(message)
Which of course avoids the error completely.
The question is how to generate such headers as =?utf8?b?4piD?= and the answer lies in the email.header module.
I made this example with:
>>> from email import header
>>> header.Header('\N{SNOWMAN}', 'utf8').encode()
'=?utf8?b?4piD?='
To handle files that have a Key: Value format the email module is the wrong solution. Handling such files are easy enough without the email module, and you will not have to work around the restrictions of RF2822. For example:
# -*- coding: UTF-8 -*-
import io
import sys
if sys.version_info > (3,):
def u(s): return s
else:
def u(s): return s.decode('unicode-escape')
def parse(infile):
res = {}
payload = ''
for line in infile:
key, value = line.strip().split(': ',1)
if key in res:
raise ValueError(u("Key {0} appears twice").format(key))
res[key] = value
return res
def generate(outfile, data):
for key in data:
outfile.write(u("{0}: {1}\n").format(key, data[key]))
if __name__ == "__main__":
# Ensure roundtripping:
data = {u('Key'): u('Value'), u('Foo'): u('Bar'), u('Frötz'): u('Öpöpöp')}
with io.open('/tmp/outfile.conf', 'wt', encoding='UTF8') as outfile:
generate(outfile, data)
with io.open('/tmp/outfile.conf', 'rt', encoding='UTF8') as infile:
res = parse(infile)
assert data == res
That code took 15 minutes to write, and works in both Python 2 and Python 3. If you want line continuations etc that's easy to add as well.
Here is a more complete one that supports comments etc.

A useful solution comes from http://mail.python.org/pipermail/python-dev/2010-October/104409.html :
from email.parser import Parser
from email.generator import BytesGenerator
# How do I get surrogateescape from a BytesIO/StringIO?
data = "Key: \N{SNOWMAN}\r\n\r\n" # write this to headers.txt
headers = open("headers.txt", "r", encoding="ascii", errors="surrogateescape")
message = Parser().parse(headers)
with open("/tmp/rfc882test", "wb") as out:
BytesGenerator(out, maxheaderlen=0).flatten(message)
This is for a program that wants to read and write a binary Key: value file without caring about the encoding. To consume the headers as decoded text without being able to write them back out with Generator(), Parser().parse(open("headers.txt", "r", encoding="utf-8")) should be sufficient.

How to a recover a text from a wrong encoding?

I have got some files created from some asian OS (chinese and japanese XPs)
the file name is garbled, for example:
ÐÂ¸è+¾«Ñ¡Õä²ØºÏ¼
how i can recover the original text?
I tried with this in c#
Encoding unicode = Encoding.Unicode;
Encoding cinese = Encoding.GetEncoding(936);
byte[] chineseBytes = chinese.GetBytes(garbledString);
byte[] unicodeBytes = Encoding.Convert(unicode, chinese, chineseBytes);
//(Then convert byte in string)
and tried to change unicode to windows-1252 but no luck

It's a double-encoded text. The original is in Windows-936, then some application assumed the text is in ISO-8869-1 and encoded the result to UTF-8. Here is an example how to decode it in Python:
>>> print 'ÐÂ¸è+¾«Ñ¡Õä²ØºÏ¼'.decode('utf8').encode('latin1').decode('cp936')
新歌+精选珍藏合辑
I'm sure you can do something similar in C#.

Encoding unicode = Encoding.Unicode;
That's not what you want. “Unicode” is Microsoft's totally misleading name for what is really the UTF-16LE encoding. UTF-16LE plays no part here, what you have is a simple case where a 936 string has been misdecoded as 1252.
Windows codepage 1252 is similar but not the same as ISO-8859-1. There is no way to tell which is in the example string as it does not contain any of the bytes 0x80-0x9F which are different in the two encodings, but I'm assuming 1252 because that's the standard codepage on a western Windows install.
Encoding latin= Encoding.getEncoding(1252);
Encoding chinese= Encoding.getEncoding(936);
chinese.getChars(latin.getBytes(s));

The first argument to Encoding.Convert is the source encoding, Shouldn't that be chinese in your case? So
Encoding.Convert(chinese, unicode, chineseBytes);
might actually work. Because, after all, you want to convert CP-936 to Unicode and not vice-versa. And I'd suggest you don't even try bothering with CP-1252 since your text there is very likely not Latin.

This is an old question, but I just ran into the same situation while trying to migrate WordPress upload files off of an old Windows Server 2008 R2 server. bobince's answer set me on the right track, but I had to search for the right encoding/decoding pair.
With the following C#, I found the relevant encoding/deciding pair:
using System;
using System.Text;
public class Program
{
public static void Main()
{
// garbled
string s = "2020ç«¹æ…¶æœ¬æ¨‚ä»æ³¢åˆ‡äºžæ´²æ³•çµ-Intro-2-1024x643.jpg";
// expected
string t = "2020竹慶本樂仁波切亞洲法筵-Intro-2-1024x643.jpg";
foreach( EncodingInfo ei in Encoding.GetEncodings() ) {
Encoding e = ei.GetEncoding();
foreach( EncodingInfo ei2 in Encoding.GetEncodings() ) {
Encoding e2 = ei2.GetEncoding();
var s2 = e2.GetString(e.GetBytes(s));
if (s2 == t) {
var x = ei.CodePage;
Console.WriteLine($"e1={ei.DisplayName} (CP {ei.CodePage}), e2={ei2.DisplayName} (CP {ei2.CodePage})");
Console.WriteLine(t);
Console.WriteLine(s2);
}
}
}
Console.WriteLine("-----------");
Console.WriteLine(t);
Console.WriteLine(Encoding.GetEncoding(65001).GetString(Encoding.GetEncoding(1252).GetBytes(s)));
}
}
It turned out that the correct encoding/deciding in my case was:
e1=Western European (Windows) (CP 1252), e2=Unicode (UTF-8) (CP 65001)
So the last line of code is a one-liner for the correct conversion Console.WriteLine(Encoding.GetEncoding(65001).GetString(Encoding.GetEncoding(1252).GetBytes(s)));.

We Keep Coding

iphone swift flutter scala powershell matlab mongodb postgresql perl eclipse

Different encoding in Jython's Java and Python level - encoding

Looking from the Jython code you have to tell Java, that the string is UTF-8 encoded: def type(self, text): jtext = java.lang.String(text, "utf-8") print "pythondebug: " + text // debug output JRegion.type(self, jtext)

Related

How to make my own encoding for a file in VSCode Editor

D-language: How to print Unicode characters to the console?

ANTLR4: Using non-ASCII characters in token rules

How do I generate binary RFC822-style headers in Python 3.2?

How to a recover a text from a wrong encoding?

Categories

Resources