D-language: How to print Unicode characters to the console?

I have the following simple program to generate a random Unicode string from the union of three Unicode character sets.
#!/usr/bin/env rdmd
import std.uni;
import std.random : randomSample;
import std.stdio;
import std.conv;

/**
 * Random salt generator
 */
dstring get_salt(uint s)
{
    auto unicodechars = unicode("Cyrillic") | unicode("Armenian") | unicode("Telugu");
    dstring unichars = to!dstring(unicodechars);
    return to!dstring(randomSample(unichars, s));
}

void main()
{
    writeln("Random salt:");
    writeln(get_salt(32));
}
However, the output of the writeln is:
$ ./teste.d
Random salt:
rw13 13437 78580112 104 3914645
What are these numbers? Unicode code points? How do I print the actual characters? I am on Ubuntu Linux with the locale set to UTF-8.

This line is the problem you have:
dstring unichars = to!dstring(unicodechars);
It converts the CodepointSet object that unicode returns to a string, not the characters it covers. The set has a name and the boundaries of its character ranges, but not the characters themselves. It took this:
InversionList!(GcPolicy)(CowArray!(GcPolicy)([1024, 1157, 1159, 1320, 1329, 1367, 1369, 1376, 1377, 1416, 1418, 1419, 1423, 1424, 3073, 3076, 3077, 3085, 3086, 3089, 3090, 3113, 3114, 3124, 3125, 3130, 3133, 3141, 3142, 3145, 3146, 3150, 3157, 3159, 3160, 3162, 3168, 3172, 3174, 3184, 3192, 3200, 7467, 7468, 7544, 7545, 11744, 11776, 42560, 42648, 42655, 42656, 64275, 64280, 5]))
And pulled random chars out of that string! Instead, you want:
dstring unichars = to!dstring(unicodechars.byCodepoint);
Calling the byCodepoint method on that object yields the actual characters (well, code points; Unicode is messy) inside the set; you then build a string out of those and sample from it.
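To illustrate what the fix does, here is a rough analogue in Python (not D): expand a set of code-point ranges into actual characters, then sample from them. The ranges below are the Unicode blocks for the three scripts and are only an approximation of what std.uni's script sets contain; they also include a few unassigned code points.
import random

# Approximate code-point ranges (Unicode blocks) for the three scripts.
ranges = [(0x0400, 0x04FF),   # Cyrillic
          (0x0530, 0x058F),   # Armenian
          (0x0C00, 0x0C7F)]   # Telugu

# Expand the ranges into concrete characters, as byCodepoint does for the D set.
chars = [chr(cp) for lo, hi in ranges for cp in range(lo, hi + 1)]

def get_salt(n):
    # random.sample picks n distinct characters, similar to randomSample.
    return ''.join(random.sample(chars, n))

print("Random salt:")
print(get_salt(32))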

Related

Dart is not printing hex string

I got the string \x01\x01 from a TCP/IP socket; when I try to print it to the console, no output appears.
void main() {
  var out = "\x01\x01";
  print("printing out as --> $out <--");
  final runes = out.runes.toList();
  print(runes);
}
It gives the output as
printing out as --> <--
[1, 1]
dart pad link: https://dartpad.dev/?id=854e4479bfec03d7e8fd40621c845567
I tried to use the hex package and it gives a "Non-hex character detected" error.
Questions:
How do I print these types of strings to the console?
If some conversion is needed, how do I know the data is of this type?
My socket client looks like the following:
socket.listen(
  // handle data from the server
  (Uint8List data) async {
    var serverResponse = String.fromCharCodes(data);
    print('Server: $serverResponse');
    final runes = serverResponse.runes.toList();
    print(runes);
  },
EDIT
The socket server is the x0vnc server; reading the traffic with Wireshark, I can see the server sent 01 01.
To display the hex escapes as literal text, you have to escape the backslashes like this:
var out = '\\x01\\x01';
This will print \x01\x01 literally.
I suspect you have misunderstood what the server is sending.
Given you've not stated the server language, I'm going to guess that something like ab = b'\x01\x01' generates an array of two bytes, both with the value 1.
If you treat this as an ASCII value, then 1 is a non-printable character.
As such you need to iterate over the array and convert each byte into a suitable visual format.
This might mean that when you see a 1 you print \x01.
Edit:
Actually, Dart will convert an int to a string for you:
void main() {
  final bytes = <int>[1, 2, 3];
  for (final byte in bytes) {
    print(byte.toString());
  }
}
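That makes the bytes visible as decimal numbers. To render them in a \xNN form instead (the format the question started from), each byte can be formatted as two hex digits; in Dart, int.toRadixString(16) plays that role. A minimal sketch of the idea in Python, purely for illustration:
# Render raw bytes as \xNN escapes so values like 0x01 become visible.
data = bytes([1, 1])
print(''.join('\\x{:02x}'.format(b) for b in data))   # prints \x01\x01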

How do I print a UTF-16 string in Zig?

I've been trying to code a UTF-16 string structure, and although the standard library provides a unicode module, it doesn't seem to provide a way to print out a slice of u16.
I've tried this:
const std = @import("std");
const unicode = std.unicode;
const stdout = std.io.getStdOut().outStream();

pub fn main() !void {
    const unicode_str = unicode.utf8ToUtf16LeStringLiteral("😎 hello! 😎");
    try stdout.print("{}\n", .{unicode_str});
}
This outputs:
[12:0]u16@202e9c
Is there a way to print a unicode string ([]u16) without converting it back into a non-unicode string ([]u8)?
Both []const u8 and []const u16 store encoded Unicode code points. Unicode code points fit within the range 0..1,114,112, so an actual Unicode string with one array index per code point would have to be []const u21. UTF-8 and UTF-16 both require a multi-unit encoding for code points that don't fit in a single code unit. Unless there is a compatibility reason to use UTF-16 (like some Windows functions), you should probably be using []const u8 Unicode strings.
To print UTF-16 to a UTF-8 stream, you have to decode the UTF-16 and re-encode it as UTF-8. There is currently no format specifier that does this automatically.
You can either convert the entire string at once, requiring allocation:
const utf8string = try std.unicode.utf16leToUtf8Alloc(alloc, utf16le);
Or, without allocation:
var writer = std.io.getStdOut().writer();
var it = std.unicode.Utf16LeIterator.init(utf16le);
while (try it.nextCodepoint()) |codepoint| {
    var buf: [4]u8 = [_]u8{undefined} ** 4;
    const len = try std.unicode.utf8Encode(codepoint, &buf);
    try writer.writeAll(buf[0..len]);
}
Note that this will be very slow without using a buffered writer if you are writing somewhere that requires a syscall to write.
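For reference, the decode-then-re-encode step is ordinary transcoding, not something Zig-specific. The same operation in Python, shown only to make the data flow concrete:
# Decode a UTF-16LE byte sequence to code points, then re-encode as UTF-8.
utf16le = "😎 hello! 😎".encode("utf-16-le")
utf8 = utf16le.decode("utf-16-le").encode("utf-8")
print(utf8.decode("utf-8"))   # 😎 hello! 😎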

ANTLR4: Using non-ASCII characters in token rules

On page 74 of the ANTLR4 book it says that any Unicode character can be used in a grammar simply by specifying its code point in this manner:
'\uxxxx'
where xxxx is the hexadecimal value for the Unicode codepoint.
So I used that technique in a token rule for an ID token:
grammar ID;
id : ID EOF ;
ID : ('a' .. 'z' | 'A' .. 'Z' | '\u0100' .. '\u017E')+ ;
WS : [ \t\r\n]+ -> skip ;
When I tried to parse this input:
Gŭnter
ANTLR throws an error, saying that it does not recognize ŭ. (The ŭ character is hex 016D, so it is within the range specified.)
What am I doing wrong please?
ANTLR is ready to accept 16-bit characters but, by default, many locales will read in characters as bytes (8 bits). You need to specify the appropriate encoding when you read the file using the Java libraries. If you are using the TestRig, perhaps through the alias/script grun, then pass the argument -encoding utf-8 (or whatever encoding applies). If you look at the source code of that class, you will see the following mechanism:
InputStream is = new FileInputStream(inputFile);
Reader r = new InputStreamReader(is, encoding); // e.g., euc-jp or utf-8
ANTLRInputStream input = new ANTLRInputStream(r);
XLexer lexer = new XLexer(input);
CommonTokenStream tokens = new CommonTokenStream(lexer);
...
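To see why the lexer never matches ŭ when the input is read byte-by-byte, here is a small byte-level illustration (in Python, only to show the effect): the single code point U+016D arrives as two 8-bit characters, neither of which falls in the grammar's \u0100..\u017E range.
# U+016D encoded as UTF-8 is two bytes; misread as 8-bit characters they
# become U+00C5 and U+00AD, so the lexer never sees code point 0x016D.
text = "G\u016dnter"                    # Gŭnter
utf8_bytes = text.encode("utf-8")        # b'G\xc5\xadnter'
misread = utf8_bytes.decode("latin-1")   # two chars where the u-breve should be
print([hex(ord(c)) for c in misread])    # ['0x47', '0xc5', '0xad', '0x6e', ...]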
Grammar:
NAME:
    [A-Za-z][0-9A-Za-z\u0080-\uFFFF_]+
    ;
Java:
import org.antlr.v4.runtime.CharStream;
import org.antlr.v4.runtime.CharStreams;
import org.antlr.v4.runtime.CommonTokenStream;
import org.antlr.v4.runtime.TokenStream;
import com.thalesgroup.dms.stimulus.StimulusParser.SystemContext;
final class RequirementParser {

    static SystemContext parse( String requirement ) {
        requirement = requirement.replaceAll( "\t", " " );
        final CharStream charStream = CharStreams.fromString( requirement );
        final StimulusLexer lexer = new StimulusLexer( charStream );
        final TokenStream tokens = new CommonTokenStream( lexer );
        final StimulusParser parser = new StimulusParser( tokens );
        final SystemContext system = parser.system();
        if( parser.getNumberOfSyntaxErrors() > 0 ) {
            Debug.format( requirement );
        }
        return system;
    }

    private RequirementParser() {/**/}
}
Source:
Lexers and Unicode text
For those having the same problem using ANTLR4 from Java code, ANTLRInputStream being deprecated, here is a working way to pass multi-char Unicode data from a String to the MyLexer lexer:
String myString = "\u2013";
CharBuffer charBuffer = CharBuffer.wrap(myString.toCharArray());
CodePointBuffer codePointBuffer = CodePointBuffer.withChars(charBuffer);
CodePointCharStream cpcs = CodePointCharStream.fromBuffer(codePointBuffer);
MyLexer lexer = new MyLexer(cpcs);
CommonTokenStream tokens = new CommonTokenStream(lexer);
You can specify the encoding of the file when actually reading it.
For Kotlin/Java that could look like this; there is no need to specify the encoding in the grammar!
val inputStream: CharStream = CharStreams.fromFileName(fileName, Charset.forName("UTF-16LE"))
val lexer = BlastFeatureGrammarLexer(inputStream)
Supported Charsets by Java/Kotlin

Different encoding in Jython's Java and Python level

I'm using Sikuli (see sikuli.org) which uses jython2.5.2.
Here is a summary of the class Region on the Java level:
public class Region {
    // < other class methods >

    public int type(String text) {
        System.out.println("javadebug: " + text); // debug output
        // do actual typing
    }
}
On the Python level there is a wrapper class:
import Region as JRegion  # import java class

class Region(JRegion):
    # < other class methods >

    def type(self, text):
        print "pythondebug: " + text  # debug output
        JRegion.type(self, text)
This works as intended for ascii chars, but when I use ö, ä or ü as text, this happens:
# python input:
# -*- encoding: utf-8 -*-
someregion = Region()
someregion.type("ä")

# output:
pythondebug: ä
javadebug: Ã¤
The character seems to be converted wrongly when passed to the Java object.
I would like to know what exactly is going wrong here and how to fix it, so that the characters entered in the Python method are the same in the Java method.
Thanks for your help
Looking at the Jython code, you have to tell Java that the string is UTF-8 encoded:
def type(self, text):
    jtext = java.lang.String(text, "utf-8")
    print "pythondebug: " + text  # debug output
    JRegion.type(self, jtext)
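An alternative sketch (assuming, as in the question, that the Jython str holds the raw UTF-8 bytes from the source file) is to decode on the Python side; Jython passes unicode objects to Java as java.lang.String:
# Hypothetical alternative: decode to a Python unicode object, which
# Jython hands to Java as a java.lang.String.
def type(self, text):
    utext = text.decode("utf-8")
    print "pythondebug: " + text  # debug output (still the raw bytes)
    JRegion.type(self, utext)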

How do I generate binary RFC822-style headers in Python 3.2?

How do I convince email.generator.Generator to use binary in Python 3.2? This seems like precisely the use case for the policy framework that was introduced in Python 3.3, but I would like my code to run in 3.2.
from email.parser import Parser
from email.generator import Generator
from io import BytesIO, StringIO

data = "Key: \N{SNOWMAN}\r\n\r\n"
message = Parser().parse(StringIO(data))

with open("/tmp/rfc882test", "w") as out:
    Generator(out, maxheaderlen=0).flatten(message)
Fails with UnicodeEncodeError: 'ascii' codec can't encode character '\u2603' in position 0: ordinal not in range(128).
Your data is not a valid RFC2822 header, which I suspect is what misleads you. It's a Unicode string, but RFC2822 headers are always ASCII only. To include non-ASCII characters you need to encode them with a character set and either base64 or quoted-printable encoding.
Hence, valid code would be this:
from email.parser import Parser
from email.generator import Generator
from io import BytesIO, StringIO

data = "Key: =?utf8?b?4piD?=\r\n\r\n"
message = Parser().parse(StringIO(data))

with open("/tmp/rfc882test", "w") as out:
    Generator(out, maxheaderlen=0).flatten(message)
Which of course avoids the error completely.
The question is how to generate such headers as =?utf8?b?4piD?= and the answer lies in the email.header module.
I made this example with:
>>> from email import header
>>> header.Header('\N{SNOWMAN}', 'utf8').encode()
'=?utf8?b?4piD?='
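For reference, the same module can decode such an encoded word back to text, which is a quick way to sanity-check a generated header (interactive session, matching the one above):
>>> from email import header
>>> raw, charset = header.decode_header('=?utf8?b?4piD?=')[0]
>>> raw.decode(charset)
'☃'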
To handle files that have a Key: Value format, the email module is the wrong solution. Handling such files is easy enough without the email module, and you will not have to work around the restrictions of RFC2822. For example:
# -*- coding: UTF-8 -*-
import io
import sys

if sys.version_info > (3,):
    def u(s): return s
else:
    def u(s): return s.decode('unicode-escape')

def parse(infile):
    res = {}
    payload = ''
    for line in infile:
        key, value = line.strip().split(': ', 1)
        if key in res:
            raise ValueError(u("Key {0} appears twice").format(key))
        res[key] = value
    return res

def generate(outfile, data):
    for key in data:
        outfile.write(u("{0}: {1}\n").format(key, data[key]))

if __name__ == "__main__":
    # Ensure roundtripping:
    data = {u('Key'): u('Value'), u('Foo'): u('Bar'), u('Frötz'): u('Öpöpöp')}
    with io.open('/tmp/outfile.conf', 'wt', encoding='UTF8') as outfile:
        generate(outfile, data)
    with io.open('/tmp/outfile.conf', 'rt', encoding='UTF8') as infile:
        res = parse(infile)
    assert data == res
That code took 15 minutes to write, and works in both Python 2 and Python 3. If you want line continuations etc that's easy to add as well.
Here is a more complete one that supports comments etc.
A useful solution comes from http://mail.python.org/pipermail/python-dev/2010-October/104409.html :
from email.parser import Parser
from email.generator import BytesGenerator
# How do I get surrogateescape from a BytesIO/StringIO?

data = "Key: \N{SNOWMAN}\r\n\r\n"  # write this to headers.txt
headers = open("headers.txt", "r", encoding="ascii", errors="surrogateescape")
message = Parser().parse(headers)

with open("/tmp/rfc882test", "wb") as out:
    BytesGenerator(out, maxheaderlen=0).flatten(message)
This is for a program that wants to read and write a binary Key: value file without caring about the encoding. If you only need to consume the headers as decoded text, without writing them back out with Generator(), Parser().parse(open("headers.txt", "r", encoding="utf-8")) should be sufficient.
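As for the in-code question above about BytesIO/StringIO: one way to get the same surrogateescape behaviour from bytes that are already in memory is to wrap a BytesIO in a TextIOWrapper. This is a sketch under the assumption that BytesGenerator in 3.2 handles the surrogate-escaped headers the same way as in the file-based version above:
import io
from email.parser import Parser
from email.generator import BytesGenerator

# Bytes as they might arrive from a file or socket.
raw = "Key: \N{SNOWMAN}\r\n\r\n".encode("utf-8")

# Decode as ASCII with surrogateescape so undecodable bytes survive round-tripping.
text = io.TextIOWrapper(io.BytesIO(raw), encoding="ascii", errors="surrogateescape")
message = Parser().parse(text)

with open("/tmp/rfc882test", "wb") as out:
    BytesGenerator(out, maxheaderlen=0).flatten(message)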