ANTLR4: Using non-ASCII characters in token rules - unicode

On page 74 of the ANTLR4 book it says that any Unicode character can be used in a grammar simply by specifying its codepoint in this manner:
'\uxxxx'
where xxxx is the hexadecimal value for the Unicode codepoint.
So I used that technique in a token rule for an ID token:
grammar ID;
id : ID EOF ;
ID : ('a' .. 'z' | 'A' .. 'Z' | '\u0100' .. '\u017E')+ ;
WS : [ \t\r\n]+ -> skip ;
When I tried to parse this input:
Gŭnter
ANTLR throws an error, saying that it does not recognize ŭ. (The ŭ character is U+016D, so it is within the specified range.)
What am I doing wrong please?

ANTLR is ready to accept 16-bit characters but, by default, many locales will read in characters as bytes (8 bits). You need to specify the appropriate encoding when you read from the file using the Java libraries. If you are using the TestRig, perhaps through the alias/script grun, then use the argument -encoding utf-8 (or whatever your encoding is). If you look at the source code of that class, you will see the following mechanism:
InputStream is = new FileInputStream(inputFile);
Reader r = new InputStreamReader(is, encoding); // e.g., euc-jp or utf-8
ANTLRInputStream input = new ANTLRInputStream(r);
XLexer lexer = new XLexer(input);
CommonTokenStream tokens = new CommonTokenStream(lexer);
...
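As an aside: since ANTLR 4.7 the ANTLRInputStream shown above is deprecated in favor of CharStreams, which takes an explicit charset. A minimal sketch, assuming the same generated XLexer and the inputFile variable from the snippet above:
import java.nio.charset.StandardCharsets;
import org.antlr.v4.runtime.CharStream;
import org.antlr.v4.runtime.CharStreams;
import org.antlr.v4.runtime.CommonTokenStream;

// Read the file as UTF-8 and hand it straight to the generated lexer.
CharStream input = CharStreams.fromFileName(inputFile, StandardCharsets.UTF_8);
XLexer lexer = new XLexer(input);
CommonTokenStream tokens = new CommonTokenStream(lexer);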

Grammar:
NAME:
[A-Za-z][0-9A-Za-z\u0080-\uFFFF_]+
;
Java:
import org.antlr.v4.runtime.CharStream;
import org.antlr.v4.runtime.CharStreams;
import org.antlr.v4.runtime.CommonTokenStream;
import org.antlr.v4.runtime.TokenStream;
import com.thalesgroup.dms.stimulus.StimulusParser.SystemContext;

final class RequirementParser {

    static SystemContext parse( String requirement ) {
        requirement = requirement.replaceAll( "\t", " " );
        final CharStream charStream = CharStreams.fromString( requirement );
        final StimulusLexer lexer = new StimulusLexer( charStream );
        final TokenStream tokens = new CommonTokenStream( lexer );
        final StimulusParser parser = new StimulusParser( tokens );
        final SystemContext system = parser.system();
        if( parser.getNumberOfSyntaxErrors() > 0 ) {
            Debug.format( requirement );
        }
        return system;
    }

    private RequirementParser() {/**/}
}
Source:
Lexers and Unicode text

For those having the same problem using ANTLR4 in Java code, with ANTLRInputStream being deprecated, here is a working way to pass multi-char Unicode data from a String to the MyLexer lexer:
String myString = "\u2013";
CharBuffer charBuffer = CharBuffer.wrap(myString.toCharArray());
CodePointBuffer codePointBuffer = CodePointBuffer.withChars(charBuffer);
CodePointCharStream cpcs = CodePointCharStream.fromBuffer(codePointBuffer);
MyLexer lexer = new MyLexer(cpcs);
CommonTokenStream tokens = new CommonTokenStream(lexer);

You can specify the encoding when actually reading the file.
For Kotlin/Java that could look like this; no need to specify the encoding in the grammar!
val inputStream: CharStream = CharStreams.fromFileName(fileName, Charset.forName("UTF-16LE"))
val lexer = BlastFeatureGrammarLexer(inputStream)
Supported Charsets by Java/Kotlin

Related

How do I print a UTF-16 string in Zig?

I've been trying to code a UTF-16 string structure, and although the standard library provides a unicode module, it doesn't seem to provide a way to print out a slice of u16.
I've tried this:
const std = @import("std");
const unicode = std.unicode;
const stdout = std.io.getStdOut().outStream();
pub fn main() !void {
    const unicode_str = unicode.utf8ToUtf16LeStringLiteral("😎 hello! 😎");
    try stdout.print("{}\n", .{unicode_str});
}
This outputs:
[12:0]u16@202e9c
Is there a way to print a unicode string ([]u16) without converting it back into a non-unicode string ([]u8)?
Both []const u8 and []const u16 store encoded Unicode codepoints. Unicode codepoints fit within the range 0..1,114,112, so an actual Unicode string with one array index per codepoint would have to be []const u21. UTF-8 and UTF-16 both require encoding for codepoints that don't fit. Unless there is a compatibility reason for UTF-16 (like some Windows functions), you should probably be using []const u8 Unicode strings.
To print utf-16 to a utf-8 stream, you have to decode utf-16 and re-encode it into utf-8. There is currently no formatting specifier to do this automatically.
You can either convert the entire string at once, requiring allocation:
const utf8string = try std.unicode.utf16leToUtf8Alloc(alloc, utf16le);
Or, without allocation:
var writer = std.io.getStdOut().writer();
var it = std.unicode.Utf16LeIterator.init(utf16le);
while (try it.nextCodepoint()) |codepoint| {
    var buf: [4]u8 = [_]u8{undefined} ** 4;
    const len = try std.unicode.utf8Encode(codepoint, &buf);
    try writer.writeAll(buf[0..len]);
}
Note that this will be very slow without a buffered writer if you are writing somewhere that requires a syscall per write.

Convert Persian Unicode to ASCII

I need to get the ASCII code of a Persian string to use it in a program, but the method below gives question marks: "??? ????"
public string PerisanAscii()
{
    // Persian string
    string unicodeString = "صبح بخیر";
    // Create two different encodings.
    Encoding ascii = Encoding.ASCII;
    Encoding unicode = Encoding.Unicode;
    // Convert the string into a byte array.
    byte[] unicodeBytes = unicode.GetBytes(unicodeString);
    // Perform the conversion from one encoding to the other.
    byte[] asciiBytes = Encoding.Convert(unicode, ascii, unicodeBytes);
    // Convert the new byte[] into a char[] and then into a string.
    char[] asciiChars = new char[ascii.GetCharCount(asciiBytes, 0, asciiBytes.Length)];
    ascii.GetChars(asciiBytes, 0, asciiBytes.Length, asciiChars, 0);
    string asciiString = new string(asciiChars);
    return asciiString;
}
Can you help me?
Best regards,
Mohsen
You can convert Persian UTF-8 data to Windows-1256 (Arabic Windows):
var enc1256 = Encoding.GetEncoding("windows-1256");
var data = enc1256.GetBytes(unicodeString);
System.IO.File.WriteAllBytes(path, data);
ASCII does not support Persian. You may need the old-school Iran System encoding standard. This is determined by your AutoCAD application. I don't know if there is a direct Encoding for it in Windows or not, but you can convert characters manually too: it's a simple mapping, as sketched below.
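For illustration only, such a table-driven mapping might look like the Java sketch below (Java because this compilation's other examples use it; the idea transfers directly to C#). The byte values are made-up placeholders, not the real Iran System code page, and a real mapping would also need to handle contextual letter forms:
import java.util.HashMap;
import java.util.Map;

final class IranSystemMapper {
    // Placeholder table: fill in the real Iran System byte for each
    // Persian character from a reference chart. Values below are made up.
    private static final Map<Character, Byte> TABLE = new HashMap<>();
    static {
        TABLE.put('ص', (byte) 0xA1); // hypothetical value
        TABLE.put('ب', (byte) 0xA2); // hypothetical value
        TABLE.put('ح', (byte) 0xA3); // hypothetical value
        TABLE.put(' ', (byte) 0x20); // ASCII passes through unchanged
    }

    static byte[] toIranSystem(String s) {
        byte[] out = new byte[s.length()];
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            Byte mapped = TABLE.get(c);
            // Fall back to '?' for unmapped characters, as Encoding.ASCII does.
            out[i] = (mapped != null) ? mapped : (c < 128 ? (byte) c : (byte) '?');
        }
        return out;
    }
}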

Storing Special Characters in Windows Azure Blob Metadata

I have an app that is storing images in a Windows Azure Block Blob. I'm adding meta data to each blob that gets uploaded. The metadata may include some special characters. For instance, the registered trademark symbol (®). How do I add this value to meta data in Windows Azure?
Currently, when I try, I get a 400 (Bad Request) error anytime I try to upload a file that uses a special character like this.
Thank you!
You might use HttpUtility to encode/decode the string:
blob.Metadata["Description"] = HttpUtility.HtmlEncode(model.Description);
Description = HttpUtility.HtmlDecode(blob.Metadata["Description"]);
http://lvbernal.blogspot.com/2013/02/metadatos-de-azure-vs-caracteres.html
The supported characters in blob metadata must be ASCII characters. To work around this you can either escape the string (percent-encode it), base64-encode it, etc.
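For instance, base64 keeps the stored value pure ASCII no matter what the original text contains. A minimal sketch of the round-trip, in Java (the .NET equivalents are Convert.ToBase64String and Convert.FromBase64String):
import java.nio.charset.StandardCharsets;
import java.util.Base64;

// Encode arbitrary Unicode text into an ASCII-safe metadata value...
String description = "Registered \u00AE";
String safe = Base64.getEncoder()
        .encodeToString(description.getBytes(StandardCharsets.UTF_8));
// ...store `safe` in the metadata, then decode it after reading it back.
String restored = new String(Base64.getDecoder().decode(safe), StandardCharsets.UTF_8);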
HttpUtility.HtmlEncode may not work; if encoded Unicode characters are in your string (e.g. &#8217;), it will fail. So far, I have found that Uri.EscapeDataString does handle this edge case and others. However, there are a number of characters that get encoded unnecessarily, such as space (' ' = chr(32) = %20).
I mapped the illegal ASCII characters that metadata will not accept, and built this to un-escape everything else:
static List<string> illegals = new List<string> { "%1", "%2", "%3", "%4", "%5", "%6", "%7", "%8", "%A", "%B", "%C", "%D", "%E", "%F", "%10", "%11", "%12", "%13", "%14", "%15", "%16", "%17", "%18", "%19", "%1A", "%1B", "%1C", "%1D", "%1E", "%1F", "%7F", "%80", "%81", "%82", "%83", "%84", "%85", "%86", "%87", "%88", "%89", "%8A", "%8B", "%8C", "%8D", "%8E", "%8F", "%90", "%91", "%92", "%93", "%94", "%95", "%96", "%97", "%98", "%99", "%9A", "%9B", "%9C", "%9D", "%9E", "%9F", "%A0", "%A1", "%A2", "%A3", "%A4", "%A5", "%A6", "%A7", "%A8", "%A9", "%AA", "%AB", "%AC", "%AD", "%AE", "%AF", "%B0", "%B1", "%B2", "%B3", "%B4", "%B5", "%B6", "%B7", "%B8", "%B9", "%BA", "%BB", "%BC", "%BD", "%BE", "%BF", "%C0", "%C1", "%C2", "%C3", "%C4", "%C5", "%C6", "%C7", "%C8", "%C9", "%CA", "%CB", "%CC", "%CD", "%CE", "%CF", "%D0", "%D1", "%D2", "%D3", "%D4", "%D5", "%D6", "%D7", "%D8", "%D9", "%DA", "%DB", "%DC", "%DD", "%DE", "%DF", "%E0", "%E1", "%E2", "%E3", "%E4", "%E5", "%E6", "%E7", "%E8", "%E9", "%EA", "%EB", "%EC", "%ED", "%EE", "%EF", "%F0", "%F1", "%F2", "%F3", "%F4", "%F5", "%F6", "%F7", "%F8", "%F9", "%FA", "%FB", "%FC", "%FD", "%FE" };
private static string MetaDataEscape(string value)
{
    // Example of an escaped result:
    // CDC%20Guideline%20for%20Prescribing%20Opioids%20Module%206%3A%20%0Ahttps%3A%2F%2Fwww.cdc.gov%2Fdrugoverdose%2Ftraining%2Fdosing%2F
    var sz = Uri.EscapeDataString(value.Trim());
    // Un-escape every sequence that is not in the illegal list, restoring
    // the characters the metadata store does accept.
    for (int i = 1; i < 255; i++)
    {
        var hex = "%" + i.ToString("X");
        if (!illegals.Contains(hex))
        {
            sz = sz.Replace(hex, Uri.UnescapeDataString(hex));
        }
    }
    return sz;
}
The result is:
Before => "1080x1080 Facebook Images"
Uri.EscapeDataString => "1080x1080%20Facebook%20Images"
After => "1080x1080 Facebook Images"
I am sure there is a more efficient way, but the hit seems negligible for my needs.

Different encoding in Jython's Java and Python level

I'm using Sikuli (see sikuli.org), which uses Jython 2.5.2.
Here is a summary of the class Region on the Java level:
public class Region {
    // < other class methods >

    public int type(String text) {
        System.out.println("javadebug: " + text); // debug output
        // do actual typing
    }
}
On the Python level there is a wrapper class:
import Region as JRegion  # import the Java class

class Region(JRegion):
    # < other class methods >

    def type(self, text):
        print "pythondebug: " + text  # debug output
        JRegion.type(self, text)
This works as intended for ASCII chars, but when I use ö, ä or ü as text, this happens:
# Python input:
# -*- encoding: utf-8 -*-
someregion = Region()
someregion.type("ä")

# Output:
pythondebug: ä
javadebug: ä
The character seems to be converted wrongly when passed to the Java object.
I would like to know what exactly is going wrong here and how to fix it, so that the characters entered in the Python method are the same in the Java method.
Thanks for your help
Looking at the Jython code, you have to tell Java that the string is UTF-8 encoded:
def type(self, text):
    jtext = java.lang.String(text, "utf-8")
    print "pythondebug: " + text  # debug output
    JRegion.type(self, jtext)

WP7's WebBrowser.NavigateToString() and text encoding

Does anyone know how to load a UTF-8-encoded string using the WebBrowser.NavigateToString() method? For now I end up with a bunch of mis-displayed characters.
Here's the simple string that won't display correctly:
webBrowser.NavigateToString("ąęłóńżźćś");
The code file is saved with UTF-8 encoding (with signature).
Thanks.
Using ConvertExtendedASCII as suggested works, but is very slow. Using a StringBuilder instead was (in my case) about 800 times faster:
public string FixHtml(string HTML)
{
    StringBuilder sb = new StringBuilder();
    char[] s = HTML.ToCharArray();
    foreach (char c in s)
    {
        if (Convert.ToInt32(c) > 127)
            sb.Append("&#" + Convert.ToInt32(c) + ";");
        else
            sb.Append(c);
    }
    return sb.ToString();
}
First up, NavigateToString() is expecting a full HTML document.
Secondly, as you're passing HTML, it's best to pass HTML entities rather than relying on encodings. Unfortunately, not that many entity codes are actually supported by the browser, so you should look at using the numeric Unicode values where necessary.
Much like this:
webBrowser1.NavigateToString("<html><body><p>&#243; &#213;</p></body></html>");
Try this article; it should help. In short, it proposes using the following snippet to convert your string into the appropriate format:
private static string ConvertExtendedASCII(string HTML)
{
    string retVal = "";
    char[] s = HTML.ToCharArray();
    foreach (char c in s)
    {
        if (Convert.ToInt32(c) > 127)
            retVal += "&#" + Convert.ToInt32(c) + ";";
        else
            retVal += c;
    }
    return retVal;
}
If you have the UTF-8 in memory in a byte array then you could try NavigateToStream with a MemoryStream rather than using NavigateToString. You should try to ensure there is a BOM on the UTF-8 buffer if you can.
Note that the string in the question is not a UTF-8 string. It is a UTF-16 string with some garbage in it. By placing zeros between the bytes and storing it in a System.String, you corrupted it.