I've been trying to code a UTF-16 string structure, and although the standard library provides a unicode module, it doesn't seem to provide a way to print out a slice of u16.
I've tried this:
const std = #import("std");
const unicode = std.unicode;
const stdout = std.io.getStdOut().outStream();
pub fn main() !void {
const unicode_str = unicode.utf8ToUtf16LeStringLiteral("😎 hello! 😎");
try stdout.print("{}\n", .{unicode_str});
}
This outputs:
[12:0]u16#202e9c
Is there a way to print a unicode string ([]u16) without converting it back into a non-unicode string ([]u8)?
Both []const u8 and []const u16 store encoded unicode codepoints. Unicode codepoints fit within the range 0..1,114,112 so an actual unicode string with one array index per codepoint would have to be []const u21. utf-8 and utf-16 both require encoding for codepoints that don't fit. Unless there is a compatability reason for utf-16 (like some windows functions), you should probably be using []const u8 unicode strings.
To print utf-16 to a utf-8 stream, you have to decode utf-16 and re-encode it into utf-8. There is currently no formatting specifier to do this automatically.
You can either convert the entire string at once, requiring allocation:
const utf8string = try std.unicode.utf16leToUtf8Alloc(alloc, utf16le);
Or, without allocation:
var writer = std.io.getStdOut().writer();
var it = std.unicode.Utf16LeIterator.init(utf16le);
while (try it.nextCodepoint()) |codepoint| {
var buf: [4]u8 = [_]u8{undefined} ** 4;
const len = try std.unicode.utf8Encode(codepoint, &buf);
try writer.writeAll(buf[0..len]);
}
Note that this will be very slow without using a buffered writer if you are writing somewhere that requires a syscall to write.
I need to get the ASCII code of a Persian string to use it in a program. But the method below give the ? marks: "??? ????"
public string PerisanAscii()
{
//persian string
string unicodeString = "صبح بخیر";
// Create two different encodings.
Encoding ascii = Encoding.ASCII;
Encoding unicode = Encoding.Unicode;
// Convert the string into a byte array.
byte[] unicodeBytes = unicode.GetBytes(unicodeString);
// Perform the conversion from one encoding to the other.
byte[] asciiBytes = Encoding.Convert(unicode, ascii, unicodeBytes);
// Convert the new byte[] into a char[] and then into a string.
char[] asciiChars = new char[ascii.GetCharCount(asciiBytes, 0, asciiBytes.Length)];
ascii.GetChars(asciiBytes, 0, asciiBytes.Length, asciiChars, 0);
string asciiString = new string(asciiChars);
return asciiString;
}
Can you help me?
Best regards,
Mohsen
You can convert Persian UTF8 data to Windows-1256 (Arabic Windows):
var enc1256 = Encoding.GetEncoding("windows-1256");
var data = enc1256.GetBytes(unicodeString);
System.IO.File.WriteAllBytes(path, data);
ASCII does not support Persian. You may need old school Iran System encoding standard. This is determined by your Autocad application. I don't know if there is a direct Encoding in windows for it or not. But you can convert characters manually too. It's a simple mapping.
I have an app that is storing images in a Windows Azure Block Blob. I'm adding meta data to each blob that gets uploaded. The metadata may include some special characters. For instance, the registered trademark symbol (®). How do I add this value to meta data in Windows Azure?
Currently, when I try, I get a 400 (Bad Request) error anytime I try to upload a file that uses a special character like this.
Thank you!
You might use HttpUtility to encode/decode the string:
blob.Metadata["Description"] = HttpUtility.HtmlEncode(model.Description);
Description = HttpUtility.HtmlDecode(blob.Metadata["Description"]);
http://lvbernal.blogspot.com/2013/02/metadatos-de-azure-vs-caracteres.html
The supported characters in the blob metadata must be ASCII characters. To work around this you can either escape the string ( percent encode), base64 encode etc.
joe
HttpUtility.HtmlEncode may not work; if Unicode characters are in your string (i.e. ’), it will fail. So far, I have found Uri.EscapeDataString does handle this edge case and others. However, there are a number of characters that get encoded unnecessarily, such as space (' '=chr(32)=%20).
I mapped the illegal ascii characters metadata will not accept and built this to restore the characters:
static List<string> illegals = new List<string> { "%1", "%2", "%3", "%4", "%5", "%6", "%7", "%8", "%A", "%B", "%C", "%D", "%E", "%F", "%10", "%11", "%12", "%13", "%14", "%15", "%16", "%17", "%18", "%19", "%1A", "%1B", "%1C", "%1D", "%1E", "%1F", "%7F", "%80", "%81", "%82", "%83", "%84", "%85", "%86", "%87", "%88", "%89", "%8A", "%8B", "%8C", "%8D", "%8E", "%8F", "%90", "%91", "%92", "%93", "%94", "%95", "%96", "%97", "%98", "%99", "%9A", "%9B", "%9C", "%9D", "%9E", "%9F", "%A0", "%A1", "%A2", "%A3", "%A4", "%A5", "%A6", "%A7", "%A8", "%A9", "%AA", "%AB", "%AC", "%AD", "%AE", "%AF", "%B0", "%B1", "%B2", "%B3", "%B4", "%B5", "%B6", "%B7", "%B8", "%B9", "%BA", "%BB", "%BC", "%BD", "%BE", "%BF", "%C0", "%C1", "%C2", "%C3", "%C4", "%C5", "%C6", "%C7", "%C8", "%C9", "%CA", "%CB", "%CC", "%CD", "%CE", "%CF", "%D0", "%D1", "%D2", "%D3", "%D4", "%D5", "%D6", "%D7", "%D8", "%D9", "%DA", "%DB", "%DC", "%DD", "%DE", "%DF", "%E0", "%E1", "%E2", "%E3", "%E4", "%E5", "%E6", "%E7", "%E8", "%E9", "%EA", "%EB", "%EC", "%ED", "%EE", "%EF", "%F0", "%F1", "%F2", "%F3", "%F4", "%F5", "%F6", "%F7", "%F8", "%F9", "%FA", "%FB", "%FC", "%FD", "%FE" };
private static string MetaDataEscape(string value)
{
//CDC%20Guideline%20for%20Prescribing%20Opioids%20Module%206%3A%20%0Ahttps%3A%2F%2Fwww.cdc.gov%2Fdrugoverdose%2Ftraining%2Fdosing%2F
var x = HttpUtility.HtmlEncode(value);
var sz = value.Trim();
sz = Uri.EscapeDataString(sz);
for (int i = 1; i < 255; i++)
{
var hex = "%" + i.ToString("X");
if (!illegals.Contains(hex))
{
sz = sz.Replace(hex, Uri.UnescapeDataString(hex));
}
}
return sz;
}
The result is:
Before ==> "1080x1080 Facebook Images"
Uri.EscapeDataString =>
"1080x1080%20Facebook%20Images"
After => "1080x1080 Facebook
Images"
I am sure there is a more efficient way, but the hit seems negligible for my needs.
I'm using Sikuli (see sikuli.org) which uses jython2.5.2.
Here is a summary of the class Region on the Java level:
public class Region {
// < other class methods >
public int type(String text) {
System.out.println("javadebug: "+text); // debug output
// do actual typing
}
}
On the Pythonlevel there is a Wrapperclass:
import Region as JRegion // import java class
class Region(JRegion):
# < other class methods >
def type(self, text):
print "pythondebug: "+text // debug output
JRegion.type(self, text)
This works as intended for ascii chars, but when I use ö, ä or ü as text, this happens:
// python input:
# -*- encoding: utf-8 -*-
someregion = Region()
someregion.type("ä")
// output:
pythondebug: ä
javadebug: ä
The character seems to be converted wrongly when passed to the Java object.
I would like to know what exactly is going wrong here and how to fix this, so that the characters entered in the pythonmethod are the same in the javamethod.
Thanks for your help
Looking from the Jython code you have to tell Java, that the string is UTF-8 encoded:
def type(self, text):
jtext = java.lang.String(text, "utf-8")
print "pythondebug: " + text // debug output
JRegion.type(self, jtext)
does anyone know how to load a UTF8-encoded string using WebBrowser.NavigateToString() method? For now I end up with a bunch of mis-displayed characters.
Here's the simple string that won't display correctly:
webBrowser.NavigateToString("ąęłóńżźćś");
The code file is saved with UTF-8 encoding (with signature).
Thanks.
Using ConvertExtendedASCII as suggested works, but is very slow. Using a StringBuilder instead was (in my case) about 800 times faster:
public string FixHtml(string HTML)
{
StringBuilder sb = new StringBuilder();
char[] s = HTML.ToCharArray();
foreach (char c in s)
{
if (Convert.ToInt32(c) > 127)
sb.Append("&#" + Convert.ToInt32(c) + ";");
else
sb.Append(c);
}
return sb.ToString();
}
First up, NavigateToString() is expecting a full html document.
Secondly, as you're passing HTML, it's best to pass HTML entities, rather than relying on encodings. Unfortunately, not that many entity codes are actually supported by the browser so you should look at using the numeric unicode values where necessary.
Much like this:
webBrowser1.NavigateToString("<html><body><p>ó Õ</p></body></html>");
Try this article. It should help. Shortly speaking, it proposes to use following snippet to convert your string into appropriate format:
private static string ConvertExtendedASCII(string HTML)
{
string retVal = "";
char[] s = HTML.ToCharArray();
foreach (char c in s)
{
if (Convert.ToInt32(c) > 127)
retVal += "&#" + Convert.ToInt32(c) + ";";
else
retVal += c;
}
return retVal;
}
If you have the UTF8 in memory in a byte array then you could try NavigateToStream with a MemoryStream rather than using NavigateToString. You should try to ensure their is a BOM on the UTF8 buffer if you can.
Note that the string in the question is not a UTF8 string. It is a UTF16 string with some garbage in it. By placing zeros between the bytes and storing it in a System.String you corrupted it.