To read the contents of a .txt file I am using:
List<String> linesList = await file.readAsLines(encoding: latin1);
return linesList;
Files with UTF-8 encoding work perfectly with the code above.
But for UTF-16LE encoding it returns a list with twice as many entries as there are lines in the file, and all of them are empty except the first. That first entry contains ÿþ#
As package:utf is now abandoned (and therefore will never support null safety), another way to read a UTF-16LE file as a String is to take advantage of the fact that Dart Strings use UTF-16 code units internally. You can therefore read the file, interpret the data as 16-bit (unsigned) integers, and then create a String using those as UTF-16 code units:
Basically:
import 'dart:io';

var f = File("utf-16le.txt");
var bytes = f.readAsBytesSync();
// Note that this assumes that the system's native endianness is the same as
// the file's.
var utf16CodeUnits = bytes.buffer.asUint16List();
var s = String.fromCharCodes(utf16CodeUnits);
I also leave it as an exercise for the reader to deal with potential BOMs at the beginning of the file.
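A minimal sketch of that BOM handling (assuming a UTF-16 file whose optional BOM occupies the first two bytes):
import 'dart:io';
import 'dart:typed_data';

String readUtf16(String path) {
  var bytes = File(path).readAsBytesSync();
  var endian = Endian.host;
  var offset = 0;
  // Check for a BOM and pick the byte order accordingly.
  if (bytes.length >= 2 && bytes[0] == 0xFF && bytes[1] == 0xFE) {
    endian = Endian.little;
    offset = 2;
  } else if (bytes.length >= 2 && bytes[0] == 0xFE && bytes[1] == 0xFF) {
    endian = Endian.big;
    offset = 2;
  }
  var data = ByteData.sublistView(bytes, offset);
  var codeUnits = List<int>.generate(
      data.lengthInBytes ~/ 2, (i) => data.getUint16(i * 2, endian));
  return String.fromCharCodes(codeUnits);
}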
Also see https://github.com/dart-lang/convert/issues/30, which requests that the Dart SDK provide UTF-16 conversion functions.
The first credit goes to @Richard_Heap, who commented on the question above. He mentioned a Dart package that encodes and decodes UTF formats; using it, I have been able to decode the .txt files as expected in my Flutter app.
First I identify the UTF format with these BOM-detection functions from the package @Richard_Heap mentioned:
List<int> bytes = await file.readAsBytes();
hasUtf16beBom(bytes)
hasUtf16Bom(bytes)
hasUtf16leBom(bytes)
hasUtf32beBom(bytes)
hasUtf32Bom(bytes)
hasUtf32leBom(bytes)
The package provides different decoder and encoder functions that can be used once the UTF format is known from the functions above. For example, I used:
String decodedString = decodeUtf16le(bytes);
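Putting it together, here is a sketch of picking the decoder from the BOM (this assumes the package also exposes decodeUtf16be alongside decodeUtf16le, and falls back to UTF-8 when no BOM is found):
import 'dart:convert';
import 'dart:io';
// ...plus the import for the UTF package mentioned above

Future<String> readTextFile(File file) async {
  List<int> bytes = await file.readAsBytes();
  // Choose a decoder based on the BOM at the start of the file.
  if (hasUtf16leBom(bytes)) return decodeUtf16le(bytes);
  if (hasUtf16beBom(bytes)) return decodeUtf16be(bytes);
  // No UTF-16 BOM: fall back to UTF-8.
  return utf8.decode(bytes);
}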
Check out the charset package. It is null-safe and, according to its documentation, supports both UTF-16BE and UTF-16LE.
Usage:
import 'package:charset/charset.dart';
main() {
  // default
  print(utf16.decode([254, 255, 78, 10, 85, 132, 130, 229, 108, 52]));
  print(utf16.encode("上善若水"));

  // detect
  print(hasUtf16Bom([0xFE, 0xFF, 0x6C, 0x34]));

  // advanced
  Utf16Encoder encoder = utf16.encoder as Utf16Encoder;
  print(encoder.encodeUtf16Be("上善若水", false));
  print(encoder.encodeUtf16Le("上善若水", true));
}
Related
I am trying to encode a Unicode character in Dart, but this results in an invalid byte array.
The character: 🔥
The bytes: [FF, FE, 3D, D8, 25, DD]
The bytes include a BOM. After decoding them, the string is parsed correctly: I can see the emoji inside my IDE.
Then I try to encode the String again, but that gives me a byte array I don't understand:
[FF, FE, FD, FF, FD, FF]
I am using the package utf_convert to encode the string:
import 'package:utf_convert/utf_convert.dart' as utf;
List<int> convert(String input) {
  return utf.encodeUtf16le(input, true).cast<int>();
}
Is this a bug in this package, or am I overlooking something here?
Edit 1:
I wrote some simple tests to capture the problem:
import 'package:test/test.dart';
import 'package:utf_convert/utf_convert.dart' as utf;

void main() {
  var emojiString = '🔥';
  var emojiBytes = <int>[0xFF, 0xFE, 0x3D, 0xD8, 0x25, 0xDD];

  test('Decode Emoji', () {
    var emoji = utf.decodeUtf16le(emojiBytes);
    expect(emoji, emojiString);
  });

  test('Encode Emoji', () {
    var bytes = utf.encodeUtf16le(emojiString, true).cast<int>();
    expect(bytes, emojiBytes);
  });
}
The function "Decode Emoji" succeeds, but the second one, "Encode Emoji" fails with the assertion:
Expected: [255, 254, 61, 216, 37, 221] Actual: [255, 254, 253, 255, 253, 255]
After doing a lot of research, I think this is a bug in the library. The code there is a fork of a discontinued package found here.
The solution I used for now was some other piece of code that still exists inside the Dart library; I found a hint in this SO post.
Then I implemented a new library on my own, which others facing the same issue can use too. I hosted it on GitHub and pub.dev under the MIT license.
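The core idea is small, since Dart Strings already store UTF-16 code units. Here is a minimal sketch of such an encoder (an illustration of the approach, not my library's actual code):
import 'dart:typed_data';

Uint8List encodeUtf16Le(String input, {bool writeBom = false}) {
  var codeUnits = input.codeUnits;
  var bytes = Uint8List(((writeBom ? 1 : 0) + codeUnits.length) * 2);
  var data = ByteData.sublistView(bytes);
  var offset = 0;
  if (writeBom) {
    // U+FEFF written little-endian comes out as FF FE.
    data.setUint16(offset, 0xFEFF, Endian.little);
    offset += 2;
  }
  for (var unit in codeUnits) {
    data.setUint16(offset, unit, Endian.little);
    offset += 2;
  }
  return bytes;
}
For the emoji above, encodeUtf16Le('🔥', writeBom: true) yields [0xFF, 0xFE, 0x3D, 0xD8, 0x25, 0xDD], matching the bytes the failing test expected.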
I've been trying to code a UTF-16 string structure, and although the standard library provides a unicode module, it doesn't seem to provide a way to print out a slice of u16.
I've tried this:
const std = @import("std");
const unicode = std.unicode;
const stdout = std.io.getStdOut().outStream();

pub fn main() !void {
    const unicode_str = unicode.utf8ToUtf16LeStringLiteral("😎 hello! 😎");
    try stdout.print("{}\n", .{unicode_str});
}
This outputs:
[12:0]u16@202e9c
Is there a way to print a unicode string ([]u16) without converting it back into a non-unicode string ([]u8)?
Both []const u8 and []const u16 store encoded Unicode codepoints. Unicode codepoints fall in the range 0 to 1,114,111 (0x10FFFF), so an actual Unicode string with one array index per codepoint would have to be []const u21. UTF-8 and UTF-16 both encode codepoints that don't fit in their unit type. Unless there is a compatibility reason to use UTF-16 (like some Windows functions), you should probably be using []const u8 Unicode strings.
To print utf-16 to a utf-8 stream, you have to decode utf-16 and re-encode it into utf-8. There is currently no formatting specifier to do this automatically.
You can either convert the entire string at once, requiring allocation:
const utf8string = try std.unicode.utf16leToUtf8Alloc(alloc, utf16le);
Or, without allocation:
var writer = std.io.getStdOut().writer();
var it = std.unicode.Utf16LeIterator.init(utf16le);
while (try it.nextCodepoint()) |codepoint| {
    var buf: [4]u8 = undefined;
    const len = try std.unicode.utf8Encode(codepoint, &buf);
    try writer.writeAll(buf[0..len]);
}
Note that if you are writing somewhere that requires a syscall per write, this will be very slow without a buffered writer.
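For example, a sketch that wraps stdout in a buffered writer (assuming a Zig version that provides std.io.bufferedWriter):
var buffered = std.io.bufferedWriter(std.io.getStdOut().writer());
var writer = buffered.writer();
var it = std.unicode.Utf16LeIterator.init(utf16le);
while (try it.nextCodepoint()) |codepoint| {
    var buf: [4]u8 = undefined;
    const len = try std.unicode.utf8Encode(codepoint, &buf);
    try writer.writeAll(buf[0..len]);
}
// Nothing reaches the OS until the buffer fills or flush is called.
try buffered.flush();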
I'm trying to draw QR barcodes in a PDF file using iTextSharp. If I use English text the barcodes are fine and decode properly, but with Chinese text the barcode decodes as question marks. For example, the character '测' (\u6D4B) is decoded as '?'. I tried all supported character sets, but none of them helped.
What combination of parameters should I use for the QR barcode in iTextSharp in order to encode Chinese text correctly?
iText and iTextSharp apparently don't natively support this, but you can write some code to handle it on your own. The trick is to get the QR code parser to work with an arbitrary byte array instead of a string. What's really nice is that the iTextSharp code is almost ready for this but doesn't expose the functionality.
Unfortunately many of the required classes are sealed, so you can't just subclass them; you'll have to recreate them. You can either download the entire source and add these changes or just create separate classes with the same names. (Please check over the license to make sure you are allowed to do this.) My changes below don't have any error correction, so make sure you do that, too.
The first class that you'll need to recreate is iTextSharp.text.pdf.qrcode.BlockPair and the only change you'll need to make is to make the constructor public instead of internal. (You only need to do this if you are creating your own code and not modifying the existing code.)
The second class is iTextSharp.text.pdf.qrcode.Encoder. This is where we'll make the most changes. Add an overload to Append8BitBytes that looks like this:
static void Append8BitBytes(byte[] bytes, BitVector bits) {
    for (int i = 0; i < bytes.Length; ++i) {
        bits.AppendBits(bytes[i], 8);
    }
}
The string version of this method converts text to a byte array and then uses the above so we're just cutting out the middle man. Next, add a new overload to the constructor that takes in a byte array instead of a string. We'll then just cut out the string detection part and force the system to byte-mode, otherwise the code below is pretty much the same.
public static void Encode(byte[] bytes, ErrorCorrectionLevel ecLevel, IDictionary<EncodeHintType, Object> hints, QRCode qrCode) {
    String encoding = DEFAULT_BYTE_MODE_ENCODING;
    // Step 1: Choose the mode (encoding).
    Mode mode = Mode.BYTE;
    // Step 2: Append "bytes" into "dataBits" in appropriate encoding.
    BitVector dataBits = new BitVector();
    Append8BitBytes(bytes, dataBits);
    // Step 3: Initialize QR code that can contain "dataBits".
    int numInputBytes = dataBits.SizeInBytes();
    InitQRCode(numInputBytes, ecLevel, mode, qrCode);
    // Step 4: Build another bit vector that contains header and data.
    BitVector headerAndDataBits = new BitVector();
    // Step 4.5: Append ECI message if applicable
    if (mode == Mode.BYTE && !DEFAULT_BYTE_MODE_ENCODING.Equals(encoding)) {
        CharacterSetECI eci = CharacterSetECI.GetCharacterSetECIByName(encoding);
        if (eci != null) {
            AppendECI(eci, headerAndDataBits);
        }
    }
    AppendModeInfo(mode, headerAndDataBits);
    int numLetters = dataBits.SizeInBytes();
    AppendLengthInfo(numLetters, qrCode.GetVersion(), mode, headerAndDataBits);
    headerAndDataBits.AppendBitVector(dataBits);
    // Step 5: Terminate the bits properly.
    TerminateBits(qrCode.GetNumDataBytes(), headerAndDataBits);
    // Step 6: Interleave data bits with error correction code.
    BitVector finalBits = new BitVector();
    InterleaveWithECBytes(headerAndDataBits, qrCode.GetNumTotalBytes(), qrCode.GetNumDataBytes(),
        qrCode.GetNumRSBlocks(), finalBits);
    // Step 7: Choose the mask pattern and set to "qrCode".
    ByteMatrix matrix = new ByteMatrix(qrCode.GetMatrixWidth(), qrCode.GetMatrixWidth());
    qrCode.SetMaskPattern(ChooseMaskPattern(finalBits, qrCode.GetECLevel(), qrCode.GetVersion(),
        matrix));
    // Step 8. Build the matrix and set it to "qrCode".
    MatrixUtil.BuildMatrix(finalBits, qrCode.GetECLevel(), qrCode.GetVersion(),
        qrCode.GetMaskPattern(), matrix);
    qrCode.SetMatrix(matrix);
    // Step 9. Make sure we have a valid QR Code.
    if (!qrCode.IsValid()) {
        throw new WriterException("Invalid QR code: " + qrCode.ToString());
    }
}
The third class is iTextSharp.text.pdf.qrcode.QRCodeWriter, and once again we just need to add an overloaded Encode method that supports a byte array and calls our new Encoder.Encode overload created above:
public ByteMatrix Encode(byte[] bytes, int width, int height, IDictionary<EncodeHintType, Object> hints) {
    ErrorCorrectionLevel errorCorrectionLevel = ErrorCorrectionLevel.L;
    if (hints != null && hints.ContainsKey(EncodeHintType.ERROR_CORRECTION))
        errorCorrectionLevel = (ErrorCorrectionLevel)hints[EncodeHintType.ERROR_CORRECTION];
    QRCode code = new QRCode();
    Encoder.Encode(bytes, errorCorrectionLevel, hints, code);
    return RenderResult(code, width, height);
}
The last class is iTextSharp.text.pdf.BarcodeQRCode, to which we once again add a new constructor overload:
public BarcodeQRCode(byte[] bytes, int width, int height, IDictionary<EncodeHintType, Object> hints) {
    newCode.QRCodeWriter qc = new newCode.QRCodeWriter();
    bm = qc.Encode(bytes, width, height, hints);
}
The last trick is to make sure when calling this that you include the byte order mark (BOM) so that decoders know to decode this properly, in this case UTF-8.
//Create an encoder that supports outputting a BOM
System.Text.Encoding enc = new System.Text.UTF8Encoding(true, true);
//Get the BOM
byte[] bom = enc.GetPreamble();
//Get the raw bytes for the string
byte[] bytes = enc.GetBytes("测");
//Combine the byte arrays
byte[] final = new byte[bom.Length + bytes.Length];
System.Buffer.BlockCopy(bom, 0, final, 0, bom.Length);
System.Buffer.BlockCopy(bytes, 0, final, bom.Length, bytes.Length);
//Create a barcode using our new constructor
var q = new BarcodeQRCode(final, 100, 100, null);
//Add it to the document
doc.Add(q.GetImage());
Looks like you may be out of luck. I tried too and got the same results as you did. Then I looked at the Java API:
"*CHARACTER_SET the values are strings and can be Cp437, Shift_JIS and
ISO-8859-1 to ISO-8859-16. The default value is ISO-8859-1.*"
Lastly, I looked at the iTextSharp BarcodeQRCode class source code to confirm that only those character sets are supported. I'm by no means an authority on Unicode or encoding, but according to ISO/IEC 8859, the character sets above won't work for Chinese.
Essentially the same trick that Chris used in his answer can be implemented by specifying the UTF-8 charset in the barcode hints.
var hints = new Dictionary<EncodeHintType, Object>() {{EncodeHintType.CHARACTER_SET, "UTF-8"}};
var q = new BarcodeQRCode("\u6D4B", 100, 100, hints);
If you want to be safer, you can start your string with the BOM character '\uFEFF', as Chris suggested, so it would be "\uFEFF\u6D4B".
UTF-8 is unfortunately not supported by the QR code specification, and there is a lot of discussion on this subject, but the fact is that most QR code readers will correctly read a code created by this method.
I have several characters that aren't recognized properly.
Characters like:
º
á
ó
(etc..)
This means that the characters' encoding is not UTF-8, right?
So can you tell me what character encoding it could be, please?
We don't have nearly enough information to really answer this, but the gist of it is: you shouldn't just guess. You need to work out where the data is coming from, and find out what the encoding is. You haven't told us anything about the data source, so we're completely in the dark. You might want to try Encoding.Default if these are files saved with something like Notepad.
If you know what the characters are meant to be and how they're represented in binary, that should suggest an encoding... but again, we'd need to know more information.
Read this first: http://www.joelonsoftware.com/articles/Unicode.html
There are two encodings: the one used to encode the string and the one used to decode it. They must be the same to get the expected result; if they differ, some characters will be displayed incorrectly. We can try to guess if you post the actual and expected results.
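For example, here is a small C# sketch of such a mismatch, using the 'ó' from the question: text encoded as UTF-8 but decoded as Windows-1252 comes out garbled.
using System;
using System.Text;

class MojibakeDemo
{
    static void Main()
    {
        // Encode with one encoding...
        byte[] utf8Bytes = Encoding.UTF8.GetBytes("ó");
        // ...then decode with another: the two UTF-8 bytes are read
        // as two separate Windows-1252 characters.
        string garbled = Encoding.GetEncoding(1252).GetString(utf8Bytes);
        Console.WriteLine(garbled); // prints "Ã³", not "ó"
    }
}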
A while back I wrote a couple of methods to narrow down the possibilities for situations just like this.
using System;
using System.Collections.Generic;
using System.Text;

static void Main(string[] args)
{
    Encoding[] matches = FindEncodingTable('Ÿ');
    Encoding[] enc2 = FindEncodingTable(159, 'Ÿ');
}

// Locates all encodings with the specified character at the specified position.
// "CharacterPosition": decimal position of the character in the unknown encoding table, e.g. 159 in the extended ASCII table.
// "character": the character to locate in the encoding table, e.g. 'Ÿ' in the extended ASCII table.
static Encoding[] FindEncodingTable(int CharacterPosition, char character)
{
    List<Encoding> matches = new List<Encoding>();
    byte myByte = (byte)CharacterPosition;
    byte[] bytes = { myByte };
    foreach (EncodingInfo encInfo in Encoding.GetEncodings())
    {
        Encoding thisEnc = Encoding.GetEncoding(encInfo.CodePage);
        char[] chars = thisEnc.GetChars(bytes);
        if (chars[0] == character)
        {
            matches.Add(thisEnc);
        }
    }
    return matches.ToArray();
}

// Locates all encodings that can represent the specified character.
static Encoding[] FindEncodingTable(char character)
{
    List<Encoding> matches = new List<Encoding>();
    foreach (EncodingInfo encInfo in Encoding.GetEncodings())
    {
        Encoding thisEnc = Encoding.GetEncoding(encInfo.CodePage);
        char[] chars = { character };
        byte[] temp = thisEnc.GetBytes(chars);
        // Round-trip the bytes: if the character survives, this encoding can represent it.
        if (temp.Length > 0 && thisEnc.GetChars(temp)[0] == character)
            matches.Add(thisEnc);
    }
    return matches.ToArray();
}
Encoding is the transformation of some existing content so that it can be parsed by the required destination protocol.
An example of encoding can be seen when browsing the internet:
The URL you visit, www.example.com, may have a search facility that runs custom searches via the URL address:
www.example.com?search=...
The query variables in the URL require URL encoding. If you were to write:
www.example.com?search=cat food cheap
the browser wouldn't understand your request, because you have used an invalid character: ' ' (a space).
To correct this encoding error, replace each ' ' with '%20' to form this URL:
www.example.com?search=cat%20food%20cheap
Different systems use different forms of encoding; in this example I have used standard hex (percent) encoding for a URL. In other applications and instances you may find the need to use other types of encoding.
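For example, in C# the built-in Uri.EscapeDataString performs this percent-encoding (a small sketch):
using System;

class UrlEncodeDemo
{
    static void Main()
    {
        // The space is percent-encoded as %20.
        string query = Uri.EscapeDataString("cat food cheap");
        Console.WriteLine("www.example.com?search=" + query);
        // Output: www.example.com?search=cat%20food%20cheap
    }
}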
Good Luck!
I have some files created on Asian OSes (Chinese and Japanese XP).
The file names are garbled, for example:
иè+¾«Ñ¡Õä²ØºÏ¼
How can I recover the original text?
I tried this in C#:
Encoding unicode = Encoding.Unicode;
Encoding chinese = Encoding.GetEncoding(936);
byte[] chineseBytes = chinese.GetBytes(garbledString);
byte[] unicodeBytes = Encoding.Convert(unicode, chinese, chineseBytes);
// (Then convert the bytes to a string)
I also tried changing unicode to windows-1252, but no luck.
It's double-encoded text. The original is in Windows-936; then some application assumed the text was in ISO-8859-1 and encoded the result to UTF-8. Here is an example of how to decode it in Python:
>>> print 'иè+¾«Ñ¡Õä²ØºÏ¼'.decode('utf8').encode('latin1').decode('cp936')
新歌+精选珍藏合辑
I'm sure you can do something similar in C#.
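A sketch of the same two-step re-decode in C# (assuming the garbled text survives copy-paste intact; any character outside Windows-1252 would be lost in the GetBytes step):
using System;
using System.Text;

class RecoverFileName
{
    static void Main()
    {
        // On .NET Core/5+, register the code-page encodings first:
        // Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);
        string garbled = "иè+¾«Ñ¡Õä²ØºÏ¼";
        // Reinterpret the characters as the single bytes of the 1252
        // misdecoding, then decode those bytes as CP936.
        byte[] raw = Encoding.GetEncoding(1252).GetBytes(garbled);
        string recovered = Encoding.GetEncoding(936).GetString(raw);
        Console.WriteLine(recovered);
    }
}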
Encoding unicode = Encoding.Unicode;
That's not what you want. "Unicode" is Microsoft's totally misleading name for what is really the UTF-16LE encoding. UTF-16LE plays no part here; what you have is a simple case where a 936 string has been misdecoded as 1252.
Windows codepage 1252 is similar but not the same as ISO-8859-1. There is no way to tell which is in the example string as it does not contain any of the bytes 0x80-0x9F which are different in the two encodings, but I'm assuming 1252 because that's the standard codepage on a western Windows install.
Encoding latin = Encoding.GetEncoding(1252);
Encoding chinese = Encoding.GetEncoding(936);
char[] chars = chinese.GetChars(latin.GetBytes(s));
The first argument to Encoding.Convert is the source encoding. Shouldn't that be chinese in your case? So
Encoding.Convert(chinese, unicode, chineseBytes);
might actually work, because after all you want to convert CP-936 to Unicode and not vice versa. And I'd suggest you don't even bother with CP-1252, since your text is very likely not Latin.
This is an old question, but I just ran into the same situation while trying to migrate WordPress upload files off of an old Windows Server 2008 R2 server. bobince's answer set me on the right track, but I had to search for the right encoding/decoding pair.
With the following C#, I found the relevant encoding/decoding pair:
using System;
using System.Text;
public class Program
{
    public static void Main()
    {
        // garbled
        string s = "2020竹慶本樂ä»æ³¢åˆ‡äºžæ´²æ³•çµ-Intro-2-1024x643.jpg";
        // expected
        string t = "2020竹慶本樂仁波切亞洲法筵-Intro-2-1024x643.jpg";

        foreach (EncodingInfo ei in Encoding.GetEncodings())
        {
            Encoding e = ei.GetEncoding();
            foreach (EncodingInfo ei2 in Encoding.GetEncodings())
            {
                Encoding e2 = ei2.GetEncoding();
                var s2 = e2.GetString(e.GetBytes(s));
                if (s2 == t)
                {
                    Console.WriteLine($"e1={ei.DisplayName} (CP {ei.CodePage}), e2={ei2.DisplayName} (CP {ei2.CodePage})");
                    Console.WriteLine(t);
                    Console.WriteLine(s2);
                }
            }
        }

        Console.WriteLine("-----------");
        Console.WriteLine(t);
        Console.WriteLine(Encoding.GetEncoding(65001).GetString(Encoding.GetEncoding(1252).GetBytes(s)));
    }
}
It turned out that the correct encoding/decoding pair in my case was:
e1=Western European (Windows) (CP 1252), e2=Unicode (UTF-8) (CP 65001)
So the last line of the code above, Console.WriteLine(Encoding.GetEncoding(65001).GetString(Encoding.GetEncoding(1252).GetBytes(s)));, is a one-liner for the correct conversion.