Read Unicode characters with bufio scanner in Go

I'm trying to read a plain text file that contains names like this: "CASTAÑEDA"
The code is basically like this:
file, err := os.Open("C:/Files/file.txt")
if err != nil {
    log.Fatal(err)
}
defer file.Close()
scanner := bufio.NewScanner(file)
for scanner.Scan() {
    fmt.Println(scanner.Text())
}
Then, when "CASTAÑEDA" is read it prints "CASTA�EDA"
Is there any way to handle those characters when reading with bufio?
Thanks.

Your file is most probably not UTF-8. Because of that (Go expects all strings to be UTF-8), your console output looks mangled. I would advise using the packages golang.org/x/text/encoding/charmap and golang.org/x/text/transform to convert the file's data to UTF-8. Judging by your file path, you are on Windows, so your character encoding might be Windows-1252 (for example, if you edited the file with notepad.exe).
Try something like this:
package main

import (
    "bufio"
    "fmt"
    "log"
    "os"

    "golang.org/x/text/encoding/charmap"
    "golang.org/x/text/transform"
)

func main() {
    file, err := os.Open("C:/temp/file.txt")
    if err != nil {
        log.Fatal(err)
    }
    defer file.Close()

    // Insert your encoding here.
    dec := transform.NewReader(file, charmap.Windows1252.NewDecoder())
    scanner := bufio.NewScanner(dec)
    for scanner.Scan() {
        fmt.Println(scanner.Text())
    }
}
You can find more encodings in the golang.org/x/text/encoding/charmap package; pick the one that matches your file and insert it into the example above.
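For instance, if the file turned out to be saved as DOS code page 850 instead (an assumption purely for illustration), only the decoder line in the example above changes; the imports stay the same:
// Same program as above; only the charmap passed to the decoder differs.
dec := transform.NewReader(file, charmap.CodePage850.NewDecoder())
scanner := bufio.NewScanner(dec)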

The issue you're encountering is that your input is likely not UTF-8 (which is what bufio and most of the Go standard library expect). Instead, your input probably uses some extended-ASCII code page, which is why the unaccented characters pass through cleanly (UTF-8 is a superset of 7-bit ASCII) but the 'Ñ' does not.
In this situation, the byte representation of the accented character is not valid UTF-8, so the Unicode replacement character (U+FFFD) is produced. You've got a few options:
Convert your input files to UTF-8 before passing them to Go. There are many utilities that can do this, and editors often have this feature.
Use golang.org/x/text/encoding/charmap together with NewReader from golang.org/x/text/transform to transform your input to UTF-8, and pass the resulting Reader to bufio.NewScanner.
Change the line in the loop to os.Stdout.Write(scanner.Bytes()); fmt.Println(). This might avoid the bytes being interpreted as UTF-8 beyond newline splitting; writing the bytes directly to os.Stdout further avoids any (mis)interpretation of the contents. A minimal sketch follows below.
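A minimal sketch of that last option, reusing the file path from the question; the line bytes are forwarded to stdout untouched, in whatever encoding the file happens to use:
package main

import (
    "bufio"
    "fmt"
    "log"
    "os"
)

func main() {
    file, err := os.Open("C:/Files/file.txt") // path from the question
    if err != nil {
        log.Fatal(err)
    }
    defer file.Close()

    scanner := bufio.NewScanner(file)
    for scanner.Scan() {
        // Forward the raw line bytes unchanged; only the newline
        // splitting done by the scanner interprets the data.
        os.Stdout.Write(scanner.Bytes())
        fmt.Println()
    }
    if err := scanner.Err(); err != nil {
        log.Fatal(err)
    }
}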

Related

Easy way to escape special characters in a text field

I am trying to use COPY in Go to load hundreds of thousands of lines of text files into a Postgres database. Sometimes it fails because lines contain special (non-ASCII) characters. If I replace the non-ASCII characters it works fine.
Is there a simple way to store the disallowed characters in a text field or another kind of field? Or a Postgres function that validates text so a transaction doesn't fail on bad characters?
I would recommend using db.Exec().
col1Val := `string 1 with special characters`
col2Val := `string 2 with special characters`
sqlStr := `INSERT INTO table (col1, col2) VALUES ($1, $2)`
_, err = db.Exec(sqlStr, col1Val, col2Val)
if err != nil {
    panic(err)
}
db.Exec sends the values as query parameters, so it handles special characters itself and also prevents SQL injection.
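As a rough sketch of how that can look end to end (the lib/pq driver, the connection string, the names table with its name column, and the sample values are all assumptions for illustration, not taken from the question):
package main

import (
    "database/sql"
    "log"

    _ "github.com/lib/pq" // assumed Postgres driver; any database/sql driver works
)

func main() {
    // Illustrative connection string only.
    db, err := sql.Open("postgres", "postgres://user:pass@localhost/mydb?sslmode=disable")
    if err != nil {
        log.Fatal(err)
    }
    defer db.Close()

    lines := []string{"CASTAÑEDA", "naïve café"} // sample values with non-ASCII characters

    for _, line := range lines {
        // $1 is a parameter placeholder: the driver sends the value separately
        // from the SQL text, so no manual escaping of special characters is needed.
        if _, err := db.Exec(`INSERT INTO names (name) VALUES ($1)`, line); err != nil {
            log.Fatal(err)
        }
    }
}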

Any chance for Indy 10 to output Unicode with Delphi 6?

I gave Indy 10 a try on Delphi 6.
The problem is that with the old Indy I was able to output Unicode through UTF-8 as an AnsiString by setting the proper encoding in ResponseInfo.ContentType. Now I have lost the Unicode output. Here is an example of how I output a Unicode string with the old Indy:
var
  MyUnicodeBodyString: WideString;

function MyUTF8Encode(const s: WideString): UTF8String;
var
  Len: Integer;
begin
  Len := WideCharToMultiByte(CP_UTF8, 0, PWideChar(s), Length(s), nil, 0, nil, nil);
  SetLength(Result, Len);
  if Len > 0 then
    WideCharToMultiByte(CP_UTF8, 0, PWideChar(s), Length(s), PAnsiChar(Result), Len, nil, nil);
end;

begin
  // ...
  AResponseInfo.ContentText := MyUTF8Encode(MyUnicodeBodyString);
end;
When I do the same with Indy 10, the output looks like
Товар
(i.e. the UTF-8 string with each byte then encoded as Unicode again).
When I change the output to just
AResponseInfo.ContentText := MyUnicodeBodyString;
I see normal output for ASCII and for the characters of the "language for non-Unicode programs" (set in the Windows Control Panel). Other languages are garbled.
Indy 10 is programmed with "string" and probably assumes that "string" is WideString, but in Delphi 6 string is an alias for AnsiString.
Can I influence the output of Indy 10 HTTP Server without replacing every string in Indy 10 source code with WideString ?
Indy 10 is programmed with "string" and probably assumes that "string" is WideString
That is incorrect. Indy's existence predates Delphi's switch to Unicode in Delphi 2009, so Indy has a lot of backwards compatibility for handling AnsiString in Delphi 2007 and earlier. In those versions, Indy does not use or assume WideString anywhere in its public API (well, except for in the IIdTextEncoding interface), everything is based on AnsiString instead.
in Delphi 6 string is an alias for AnsiString.
Yes, exactly. Which is why the preferred way to send non-ASCII content in an older ANSI version of Delphi is to use ANSI-encoded strings, eg:
var
  MyAnsiBodyString: AnsiString;
...
AResponseInfo.CharSet := 'utf-8';
AResponseInfo.ContentText := MyAnsiBodyString;
...
If the AnsiString is encoded in the default OS ANSI codepage (as it typically should be), then Indy will simply convert the AnsiString to Unicode using that codepage by default, and then encode that Unicode result as UTF-8 for transmission.
Can I influence the output of Indy 10 HTTP Server without replacing every string in Indy 10 source code with WideString ?
Yes. In pre-Unicode versions of Delphi, most of Indy's components/classes have additional properties/parameters to specify an ANSI byte encoding, allowing Indy to properly convert an AnsiString to Unicode before charset-converting the Unicode to bytes for transmission (and vice versa on reception).
So, if you want to send an AnsiString that is already pre-encoded as UTF-8, one approach is to manually set the AResponseInfo.ContentLength property, as well as the IOHandler.DefAnsiEncoding property, eg:
var
  MyUtf8Str: UTF8String;
...
MyUtf8Str := MyUTF8Encode(MyUnicodeBodyString);
AResponseInfo.CharSet := 'utf-8';
AResponseInfo.ContentText := MyUtf8Str;
AResponseInfo.ContentLength := Length(MyUtf8Str);
AContext.Connection.IOHandler.DefAnsiEncoding := IndyTextEncoding_UTF8;
...
If you don't set the ContentLength manually, TIdHTTPResponseInfo.WriteHeader() will calculate that value for you, by converting the ContentText to WideString using the RTL's default ANSI->Unicode conversion, and then encoding that WideString to UTF-8 to get the byte count. However, the initial ANSI->Unicode conversion will not know your AnsiString is encoded in UTF-8 and thus will not process it correctly.
If you don't set the DefAnsiEncoding manually, TIdIOHandler.Write() will use the default DefAnsiEncoding setting of IndyTextEncoding_OSDefault to convert the ContentText to Unicode using the OS's default ANSI codepage, which is likely not UTF-8 and so will not convert the text to Unicode properly before then encoding the Unicode result to UTF-8 bytes.
Another approach is to use AResponseInfo.ContentStream instead of AResponseInfo.ContentText. That way, you can simply store your UTF-8 bytes in a TMemoryStream or TStringStream and then TIdHTTPResponseInfo.WriteContent() can send those bytes as-is, eg:
AResponseInfo.CharSet := 'utf-8';
AResponseInfo.ContentStream := TStringStream.Create(MyUTF8Encode(MyUnicodeBodyString));
Or:
var
  MyUtf8Str: UTF8String;
...
MyUtf8Str := MyUTF8Encode(MyUnicodeBodyString);
AResponseInfo.CharSet := 'utf-8';
AResponseInfo.ContentStream := TMemoryStream.Create;
AResponseInfo.ContentStream.WriteBuffer(PAnsiChar(MyUtf8Str)^, Length(MyUtf8Str));
AResponseInfo.ContentStream.Position := 0;
Or:
AResponseInfo.CharSet := 'utf-8';
AResponseInfo.ContentStream := TMemoryStream.Create;
WriteStringToStream(AResponseInfo.ContentStream, MyUTF8Encode(MyUnicodeBodyString), IndyTextEncoding_UTF8, IndyTextEncoding_UTF8);
AResponseInfo.ContentStream.Position := 0;

Inno script function "LoadStringFromFile" NOT reading Unicode correctly [duplicate]

I have a function called GetServerName. I pass it a file name (for example 'test.txt') and the name of the section I need (for example 'server').
The test.txt file contains something like this:
data1 | abcd
data2 | efgh
server| 'serverName1'
data3 | ijkl
I need to extract the server name, so I will call something like GetServerName('test.txt', 'server') and it should return serverName1.
My problem is that test.txt used to be an ANSI-encoded file. Now it can be either an ANSI-encoded or a Unicode-encoded file. The function below worked correctly for the ANSI-encoded file, but gives problems if the file is encoded in Unicode. I suspect the LoadStringsFromFile function, because when I debug I can see it returns Unicode characters instead of human-readable text. How can I solve this simply? (Or how can I detect the encoding of the file and convert the Unicode string to ANSI for comparison? Then I can do the rest myself.)
function GetServerName(const FileName, Section: string): string;
// Get Smartlink server name
var
  DirLine: Integer;
  LineCount: Integer;
  SectionLine: Integer;
  Lines: TArrayOfString;
  //Lines: String;
  AHA: TArrayOfString;
begin
  Result := '';
  if LoadStringsFromFile(FileName, Lines) then
  begin
    LineCount := GetArrayLength(Lines);
    for SectionLine := 0 to LineCount - 1 do
    begin
      AHA := StrSplit(Trim(Lines[SectionLine]), '|');
      if AHA[0] = Section then
      begin
        Result := AHA[1];
        Exit;
      end;
    end;
  end;
end;
First, note that Unicode is not an encoding; Unicode is a character set. UTF-8, UTF-16, UTF-32, etc. are encodings. So we do not know which encoding you actually use.
In the Unicode version of Inno Setup, the LoadStringsFromFile function (plural – do not confuse it with the singular LoadStringFromFile) uses the current Windows ANSI encoding by default.
But if the file has a UTF-8 BOM, it will treat the contents accordingly. The BOM is a common way to autodetect the UTF-8 (and other UTF-*) encodings. You can create a file in the UTF-8 encoding with a BOM using Windows Notepad.
UTF-16 or other encodings are not supported natively.
For implementation of reading UTF-16 file, see Reading UTF-16 file in Inno Setup Pascal script.
For working with files in any encoding, including UTF-8 without BOM, see Inno Setup - Convert array of string to Unicode and back to ANSI or Inno Setup replace a string in a UTF-8 file without BOM.

Autohotkey String-comparison

For some reason, I cannot get an AutoHotkey string comparison to work in the script I need it in, but it works in a test script.
Tester
password = asdf
^!=::
InputBox,input,Enter Phrase,Enter Phrase,,,,,,,30,
if ( input == password ){
    MsgBox, How original your left home row fingers are
    Return
} else {
    MsgBox, You entered "%input%"
    Return
}
Main
password = password
!^=::
InputBox,input,Enter Password,Enter Password,HIDE,,,,,,30,
if ( input == password ){
    MsgBox, "That is correct sir"
    ;Run,C:\Copy\Registry\disable.bat
    Return
} else {
    MsgBox, That is not correct sir you said %input%
    Return
}
Main keeps giving me the invalid result. Any ideas?
Your "main" script works just fine.
The == comparator is case-sensitive, you know.
I found that strings in the clipboard were not comparing properly to strings in my source file when the strings in the source file contained non-ASCII characters. After converting the file to UTF-8 with BOM, the comparison worked correctly.
The documentation doesn't say directly that it will affect string comparisons, but it does say that it has an effect. In the FAQ section it states:
Why are the non-ASCII characters in my script displaying or sending incorrectly?
Short answer: Save the script as UTF-8 with BOM.
Although AutoHotkey supports Unicode text, it is optimized for backward-compatibility, which means defaulting to the ANSI encoding rather than the more internationally recommended UTF-8. AutoHotkey will not automatically recognize a UTF-8 file unless it begins with a byte order mark.
Source: https://web.archive.org/web/20230203020016/https://www.autohotkey.com/docs/v1/FAQ.htm#nonascii
So perhaps it does more than just display and send incorrectly: it may also store values incorrectly, causing invalid comparisons.