Converting a byte array to a string given encoding

I read from a file to a byte array:
auto text = cast(immutable(ubyte)[]) read("test.txt");
I can get the type of character encoding using the following function:
enum EncodingType {ANSI, UTF8, UTF16LE, UTF16BE, UTF32LE, UTF32BE}
EncodingType DetectEncoding(immutable(ubyte)[] data){
switch (data[0]){
case 0xEF:
if (data[1] == 0xBB && data[2] == 0xBF){
return EncodingType.UTF8;
} break;
case 0xFE:
if (data[1] == 0xFF){
return EncodingType.UTF16BE;
} break;
case 0xFF:
if (data[1] == 0xFE){
if (data[2] == 0x00 && data[3] == 0x00){
return EncodingType.UTF32LE;
}else{
return EncodingType.UTF16LE;
}
}
break; // D does not allow implicit fall-through between cases
case 0x00:
if (data[1] == 0x00 && data[2] == 0xFE && data[3] == 0xFF){
return EncodingType.UTF32BE;
}
break;
default:
break;
}
return EncodingType.ANSI;
}
I need a function that takes a byte array and returns the text string (utf-8).
If the text is encoded in UTF-8, then the transformation is trivial. Similarly, if the encoding is UTF-16 or UTF-32 native byte order for the system.
string TextDataToString(immutable(ubyte)[] data){
import std.utf;
final switch (DetectEncoding(data[0..4])){
case EncodingType.ANSI:
return null;/*???*/
case EncodingType.UTF8:
return cast(string) data[3..$];
case EncodingType.UTF16LE:
wstring result;
version(LittleEndian) { result = cast(wstring) data[2..$]; }
version(BigEndian) { result = "";/*???*/ }
return toUTF8(result);
case EncodingType.UTF16BE:
return null;/*???*/
case EncodingType.UTF32LE:
dstring result;
version(LittleEndian) { result = cast(dstring) data[4..$]; }
version(BigEndian) { result = "";/*???*/ }
return toUTF8(result);
case EncodingType.UTF32BE:
return null;/*???*/
}
}
But I could not figure out how to convert a byte array containing ANSI-encoded text (for example, Windows-1251) or UTF-16/32 text that is NOT in the native byte order.
I marked the relevant places in the code with /*???*/.
As a result, the following code should work with a text file in any encoding:
string s = TextDataToString(text);
writeln(s);
Please help!

BOMs are optional. You cannot use them to reliably detect the encoding. Even if there is a BOM, using it to distinguish UTF from code page encodings is problematic, because the byte sequences are usually valid (if nonsensical) in those, too. E.g. 0xFE 0xFF is "юя" in Windows-1251.
Even if you could tell UTF from code page encodings, you couldn't tell the different code pages from one another. You could analyze the whole text and make guesses, but that's very error prone and not very practical.
So, I'd advise you to not try to detect the encoding. Instead, require a specific encoding, or add a mechanism to specify it.
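To illustrate why BOM sniffing is ambiguous, here is a minimal Python sketch (Python is used only for illustration; the helper name `decode_text` is hypothetical). The same two bytes decode cleanly under both interpretations, so the caller has to say which encoding is meant:

```python
def decode_text(data: bytes, encoding: str) -> str:
    """Decode raw bytes with an encoding supplied by the caller (no guessing)."""
    return data.decode(encoding)


# These two bytes are a UTF-16 big-endian BOM, but they are also
# two ordinary Cyrillic letters in Windows-1251.
bom_like = b"\xfe\xff"

as_cp1251 = decode_text(bom_like, "cp1251")  # interpreted as code page text
as_utf16 = decode_text(bom_like, "utf-16")   # BOM is consumed, no text remains

print(repr(as_cp1251), repr(as_utf16))
```

Both calls succeed without error, which is exactly why the bytes alone cannot settle the question.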
As for transcoding from a different byte order, here is an example for UTF16BE:
import std.algorithm: map;
import std.bitmanip: bigEndianToNative;
import std.conv: to;
import std.exception: enforce;
import std.range: chunks;
alias C = wchar;
enforce(data.length % C.sizeof == 0);
auto result = data
.chunks(C.sizeof)
.map!(x => bigEndianToNative!C(x[0 .. C.sizeof]))
.to!string;
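The same chunk-and-swap idea can be sketched in Python for comparison. This is only an illustration under the assumption that the BOM has already been stripped from the input; the function name `utf16be_to_str` is hypothetical. In practice `data.decode("utf-16-be")` does the whole job in one call, and the manual version only makes the byte-order step visible:

```python
import struct
import sys


def utf16be_to_str(data: bytes) -> str:
    """Convert big-endian UTF-16 bytes (without BOM) to a Python string."""
    if len(data) % 2 != 0:
        raise ValueError("UTF-16 data must have an even number of bytes")
    count = len(data) // 2
    # Read each 2-byte chunk as a big-endian 16-bit code unit.
    units = struct.unpack(f">{count}H", data)
    # Re-pack the code units in the machine's native byte order.
    native = struct.pack(f"={count}H", *units)
    codec = "utf-16-le" if sys.byteorder == "little" else "utf-16-be"
    return native.decode(codec)


text = utf16be_to_str("Привет".encode("utf-16-be"))
utf8_bytes = text.encode("utf-8")  # final UTF-8 representation
```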

Problem receiving data from bluetooth at very high speeds

I'm using the flutter_bluetooth_serial 0.4.0 package. Its listen function takes a callback that receives each incoming piece of data as a Uint8List. In my case the communication has to run at a very high speed, and when it does, the receiver cannot tell where a string ends: it keeps joining incoming data until it overflows at 230 bytes and then shows everything as if it were a single string. I have tried to solve this in several ways, but I only receive the complete string (18 bytes) when I reduce the transmission speed. I also tried adding a character such as '\n' to mark where one string ends and the next begins, without success. Reading character by character would also solve it for me, because the messages follow a fixed sending pattern.
Do you have any idea how I could solve this? Is there a package that works better for this purpose, or a way to determine where a string ends? Thank you!
Here is the code snippet where I receive the data:
_dataSubscription = connection.input!.listen(_onDataReceived);
void _onDataReceived(Uint8List data) {
print('Data incoming: ${ascii.decode(data)}');
// Allocate buffer for parsed data
var backspacesCounter = 0;
for (var byte in data) {
if (byte == 8 || byte == 127) backspacesCounter++;
}
var buffer = Uint8List(data.length - backspacesCounter);
var bufferIndex = buffer.length;
// Apply backspace control character
backspacesCounter = 0;
for (int i = data.length - 1; i >= 0; i--) {
if (data[i] == 8 || data[i] == 127) {
backspacesCounter++;
} else {
if (backspacesCounter > 0) {
backspacesCounter--;
} else {
buffer[--bufferIndex] = data[i];
}
}
// print(backspacesCounter);
// print(buffer);
// print(bufferIndex);
}
I tried using delimiter characters such as '\n' to mark where one string ends and the next begins, and I looked for a way to read character by character, but the package does not seem to have a function for that.
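For the delimiter idea mentioned above, one common approach is to buffer incoming chunks and emit a message only once its delimiter has arrived, since a stream may split or merge messages arbitrarily. The following is a language-neutral sketch in Python, not flutter_bluetooth_serial code; the class name `LineFramer` and its `feed` method are hypothetical, and it assumes the sender terminates every message with '\n':

```python
class LineFramer:
    """Accumulates stream chunks and returns only complete messages."""

    def __init__(self, delimiter: bytes = b"\n"):
        self._buffer = bytearray()
        self._delimiter = delimiter

    def feed(self, chunk: bytes) -> list:
        """Add a received chunk; return every message completed so far."""
        self._buffer.extend(chunk)
        # Everything before the last delimiter is complete; the tail is not.
        *complete, rest = bytes(self._buffer).split(self._delimiter)
        self._buffer = bytearray(rest)
        return complete


framer = LineFramer()
first = framer.feed(b"abc\nde")   # one full message, "de" stays buffered
second = framer.feed(b"f\ngh")    # completes "def", "gh" stays buffered
third = framer.feed(b"i\n")       # completes "ghi"
```

The same buffering could be done inside the listen callback, keeping the leftover bytes between calls.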

Is there a way to sort string lists by numbers inside of the strings?

Is there a way to sort something like:
List<String> hi = ['1hi', '2hi','5hi', '3hi', '4hi'];
to this?
['1hi', '2hi','3hi', '4hi', '5hi']
Just calling List<String>.sort() by itself will do a lexicographic sort. That is, your strings will be sorted in character code order, and '10' will be sorted before '2'. That usually isn't expected.
A lexicographic sort will work if your numbers have leading 0s to ensure that all numbers have the same number of digits. However, if the number of digits is variable, you will need to parse the values of the numbers for sorting. A more general approach is to provide a callback to .sort() to tell it how to determine the relative ordering of two items.
Luckily, package:collection has a compareNatural function that can do this for you:
import 'package:collection/collection.dart';
List<String> hi = ['1hi', '2hi','5hi', '3hi', '4hi'];
hi.sort(compareNatural);
If your situation is a bit more complicated and compareNatural doesn't do what you want, a more general approach is to make the .sort() callback do parsing itself, such as via a regular expression:
/// Returns the integer prefix from a string.
///
/// Returns null if no integer prefix is found.
int? parseIntPrefix(String s) {
var re = RegExp(r'(-?[0-9]+).*');
var match = re.firstMatch(s);
if (match == null) {
return null;
}
// group(1) is non-null whenever the pattern matched.
return int.parse(match.group(1)!);
}
int compareIntPrefixes(String a, String b) {
var aValue = parseIntPrefix(a);
var bValue = parseIntPrefix(b);
if (aValue != null && bValue != null) {
return aValue - bValue;
}
if (aValue == null && bValue == null) {
// If neither string has an integer prefix, sort the strings lexically.
return a.compareTo(b);
}
// Sort strings with integer prefixes before strings without.
if (aValue == null) {
return 1;
} else {
return -1;
}
}
void main() {
List<String> hi = ['1hi', '2hi','5hi', '3hi', '4hi'];
hi.sort(compareIntPrefixes);
}
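For single-digit prefixes like these, you can also sort the list directly:
hi.sort();
(This works because digits sort before letters in character code order, but it breaks for multi-digit prefixes: '10hi' would sort before '2hi'.)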
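The difference between a lexicographic sort and a sort on the parsed integer prefix can be illustrated with a short Python sketch. It is only an analogue of the Dart comparator above, not a translation: the helper name `int_prefix_key` is hypothetical, and unlike the Dart version it breaks ties between equal prefixes by comparing the full strings.

```python
import re


def int_prefix_key(s: str):
    """Sort key: strings with an integer prefix first, ordered numerically."""
    match = re.match(r"-?[0-9]+", s)
    if match is None:
        # No integer prefix: sort after all prefixed strings, lexically.
        return (1, 0, s)
    return (0, int(match.group()), s)


items = ["10hi", "2hi", "1hi", "hi"]
lexicographic = sorted(items)                 # compares character codes
by_prefix = sorted(items, key=int_prefix_key)  # compares parsed numbers

print(lexicographic)
print(by_prefix)
```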

How to decode UTF-8 knowing character count but not byte count?

I need to decode a UTF-8-encoded string I don’t know the byte count for. I do know the character count.
With the byte count, I would do this:
NSString(bytes: UnsafePointer<Byte>(bytes),
length: byteCount,
encoding: String.Encoding.utf8.rawValue)
How can I use the character count instead?
A possible solution is to use the UTF-8 UnicodeCodec to decode
bytes until the wanted number of characters is reached
(or an error occurs):
func decodeUTF8<S: Sequence>(bytes: S, numCharacters: Int) -> String
where S.Iterator.Element == UInt8 {
var iterator = bytes.makeIterator()
var utf8codec = UTF8()
var string = ""
while string.count < numCharacters {
switch (utf8codec.decode(&iterator)) {
case let .scalarValue(val):
string.unicodeScalars.append(val)
default:
// Error or out of bytes:
return string
}
}
return string
}
(You could also return nil or throw an error in the error case.)
Example:
let bytes = "H€llo".utf8
let dec = decodeUTF8(bytes: bytes, numCharacters: 3)
print(dec) // H€l
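For comparison, the decode-until-count idea can be sketched in Python with an incremental decoder. Note the assumption: this version counts Unicode code points, whereas Swift's character count refers to grapheme clusters, so the two differ for text with combining marks. The function name `decode_utf8_prefix` is hypothetical.

```python
import codecs


def decode_utf8_prefix(data: bytes, num_code_points: int) -> str:
    """Decode UTF-8 bytes until the wanted number of code points is reached."""
    decoder = codecs.getincrementaldecoder("utf-8")()
    chars = []
    for byte in data:
        if len(chars) >= num_code_points:
            break
        try:
            # Returns "" while a multi-byte sequence is still incomplete.
            piece = decoder.decode(bytes([byte]))
        except UnicodeDecodeError:
            # Invalid input: return what was decoded so far.
            break
        if piece:
            chars.append(piece)
    return "".join(chars)


dec = decode_utf8_prefix("H€llo".encode("utf-8"), 3)
print(dec)
```

As in the Swift version, an error or running out of bytes simply returns the text decoded so far; raising an exception instead would be an equally reasonable choice.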

How can I check if a string contains Chinese in Swift?

How can I check whether a string contains Chinese characters in Swift?
For example, I want to check if there's Chinese inside:
var myString = "Hi! 大家好!It's contains Chinese!"
Thanks!
This answer to "How to determine if a character is a Chinese character" can easily be translated from Ruby to Swift (now updated for Swift 3):
extension String {
var containsChineseCharacters: Bool {
return self.range(of: "\\p{Han}", options: .regularExpression) != nil
}
}
if myString.containsChineseCharacters {
print("Contains Chinese")
}
In a regular expression, "\p{Han}" matches all characters with the
"Han" Unicode property, which – as I understand it – are the characters
from the CJK languages.
Looking at questions on how to do this in other languages (such as this accepted answer for Ruby), the common technique seems to be to check whether each character in the string falls in a CJK range. The Ruby answer can be adapted to Swift as a String extension with the following code:
extension String {
var containsChineseCharacters: Bool {
return self.unicodeScalars.contains { scalar in
let cjkRanges: [ClosedInterval<UInt32>] = [
0x4E00...0x9FFF, // main block
0x3400...0x4DBF, // extended block A
0x20000...0x2A6DF, // extended block B
0x2A700...0x2B73F, // extended block C
]
return cjkRanges.contains { $0.contains(scalar.value) }
}
}
}
// true:
"Hi! 大家好!It's contains Chinese!".containsChineseCharacters
// false:
"Hello, world!".containsChineseCharacters
The ranges may already exist in Foundation somewhere rather than manually hardcoding them.
The above is for Swift 2.0, for earlier, you will have to use the free contains function rather than the protocol extension (twice):
extension String {
var containsChineseCharacters: Bool {
return contains(self.unicodeScalars) {
// older version of compiler seems to need extra help with type inference
(scalar: UnicodeScalar)->Bool in
let cjkRanges: [ClosedInterval<UInt32>] = [
0x4E00...0x9FFF, // main block
0x3400...0x4DBF, // extended block A
0x20000...0x2A6DF, // extended block B
0x2A700...0x2B73F, // extended block C
]
return contains(cjkRanges) { $0.contains(scalar.value) }
}
}
}
The accepted answer only tells you whether the string contains a Chinese character. I created a version suited to my own case:
enum ChineseRange {
case notFound, contain, all
}
extension String {
var findChineseCharacters: ChineseRange {
// Find the first run of one or more Han characters.
guard let a = self.range(of: "\\p{Han}+", options: .regularExpression) else {
return .notFound
}
// `a` is non-optional here. The run covers the whole string only if every character is Han.
return a == self.startIndex..<self.endIndex ? .all : .contain
}
}
if "你好".findChineseCharacters == .all {
print("All Chinese")
}
if "Chinese".findChineseCharacters == .notFound {
print("Not found Chinese")
}
if "Chinese你好".findChineseCharacters == .contain {
print("Contains Chinese")
}
gist here: https://gist.github.com/williamhqs/6899691b5a26272550578601bee17f1a
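The same three-way classification can also be expressed with explicit code point ranges instead of a regular expression. The Python sketch below is only an illustration: it reuses the four CJK ranges listed in the earlier answer (which do not cover every Han block), and the names `is_han` and `find_chinese` are hypothetical.

```python
from enum import Enum

# Ranges taken from the earlier answer: main block and extensions A, B, C.
CJK_RANGES = (
    (0x4E00, 0x9FFF),
    (0x3400, 0x4DBF),
    (0x20000, 0x2A6DF),
    (0x2A700, 0x2B73F),
)


class ChineseRange(Enum):
    NOT_FOUND = 0
    CONTAIN = 1
    ALL = 2


def is_han(ch: str) -> bool:
    """True if the character's code point lies in one of the listed ranges."""
    return any(lo <= ord(ch) <= hi for lo, hi in CJK_RANGES)


def find_chinese(text: str) -> ChineseRange:
    """Classify a string as having no, some, or only Han characters."""
    han_count = sum(1 for ch in text if is_han(ch))
    if han_count == 0:
        return ChineseRange.NOT_FOUND
    return ChineseRange.ALL if han_count == len(text) else ChineseRange.CONTAIN
```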
Try this in Swift 2:
var myString = "Hi! 大家好!It's contains Chinese!"
var a = false
for c in myString.characters {
let cs = String(c)
a = a || (cs != cs.stringByApplyingTransform(NSStringTransformMandarinToLatin, reverse: false))
}
print("\(myString) contains Chinese characters = \(a)")
I have created a Swift 3 String extension for checking how many Chinese characters a String contains. It is similar to the code by Airspeed Velocity but more comprehensive: it checks various Unicode ranges to see whether a character is Chinese. See the Chinese character ranges listed in the tables under section 18.1 of the Unicode standard specification: http://www.unicode.org/versions/Unicode9.0.0/ch18.pdf
The String extension can be found on GitHub: https://github.com/niklasberglund/String-chinese.swift
Usage example:
let myString = "Hi! 大家好!It contains Chinese!"
let chinesePercentage = myString.chinesePercentage()
let chineseCharacterCount = myString.chineseCharactersCount()
print("String contains \(chinesePercentage) percent Chinese. That's \(chineseCharacterCount) characters.")

How to insert multiple columns using PQputCopyData

I am trying to insert two columns using PQputCopyData with the following code. But when it checks the final result, it reports the error invalid byte sequence for encoding "UTF8", and the data is not inserted into the database.
Both columns are of type character varying. What am I doing wrong here?
const char *buffer = "john,doe";
PGresult *res;
res=PQexec(conn,"COPY john FROM STDIN DELIMITER ',';");
cout<<buffer;
if(PQresultStatus(res) != PGRES_COPY_IN)
{
cout<<"copy in not ok";
}
else
{
if(PQputCopyData(conn,buffer,400) == 1)
{
if(PQputCopyEnd(conn,NULL) == 1)
{
PGresult *res = PQgetResult(conn);
if(PQresultStatus(res) == PGRES_COMMAND_OK)
{
cout<<"done";
}
else
{
cout<<PQerrorMessage(conn); // Here I get the error: invalid byte sequence for encoding "UTF8"
}
}
else
{
cout<<PQerrorMessage(conn);
}
}
}
if(PQputCopyData(conn,buffer,400) == 1)
What's wrong is passing 400 instead of the actual size of the contents of buffer, which makes it send unallocated garbage after the real data. Use strlen(buffer) instead.
Also, each line must end with a newline, so buffer should be:
const char *buffer = "john,doe\n";
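The two rules above (send exactly the payload's byte length, and end every row with a newline) can be illustrated with a small Python sketch. It does not call libpq at all; it only shows how a row for the COPY text format might be assembled and measured before being handed to a function such as PQputCopyData. The helper name `make_copy_row` is hypothetical, and the escaping covers only backslash, newline, carriage return and the delimiter.

```python
def make_copy_row(fields, delimiter=","):
    """Build one newline-terminated row in COPY text format, as UTF-8 bytes."""
    escaped = []
    for field in fields:
        field = field.replace("\\", "\\\\")                      # backslash first
        field = field.replace("\n", "\\n").replace("\r", "\\r")  # line breaks
        field = field.replace(delimiter, "\\" + delimiter)       # delimiter
        escaped.append(field)
    return (delimiter.join(escaped) + "\n").encode("utf-8")


row = make_copy_row(["john", "doe"])
nbytes = len(row)  # the length to send: the real payload size, not a guess
print(row, nbytes)
```

Measuring the encoded bytes rather than the character count matters once the fields contain non-ASCII text.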