flex (lexer) and Unicode file input

I know flex can only work with UTF-8.
Is there a way to take a Unicode file (UTF-16 LE) as input and convert it to UTF-8 internally?
My program receives a file name as input, then opens and uses the Unicode file:
...
yyin = fopen("fname", "r");
yyparse();
fclose(yyin);
...
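One possible approach (a sketch under assumptions, not a tested recipe): read the UTF-16 LE file yourself, convert it to UTF-8 in memory, and hand the converted buffer to the scanner with flex's yy_scan_string() instead of assigning yyin. The sketch below uses std::wstring_convert/std::codecvt_utf8_utf16 (deprecated since C++17 but still shipped), assumes a little-endian host, and assumes the scanner and parser are generated by flex/bison and compiled as C:

#include <codecvt>
#include <cstddef>
#include <fstream>
#include <iterator>
#include <locale>
#include <string>
#include <vector>

// Entry points provided by the flex-generated scanner and bison-generated parser
// (declared here by hand; normally they come from the generated code/headers).
extern "C" {
    struct yy_buffer_state;
    yy_buffer_state* yy_scan_string(const char* str);
    void yy_delete_buffer(yy_buffer_state* buffer);
    int yyparse(void);
}

int main(int argc, char** argv)
{
    if (argc < 2) return 1;

    // Read the whole UTF-16 LE file as raw bytes.
    std::ifstream in(argv[1], std::ios::binary);
    std::vector<char> raw((std::istreambuf_iterator<char>(in)),
                          std::istreambuf_iterator<char>());

    // Reinterpret the bytes as UTF-16 code units (assumes a little-endian host)
    // and skip the BOM if present.
    const char16_t* units = reinterpret_cast<const char16_t*>(raw.data());
    std::size_t count = raw.size() / 2;
    if (count > 0 && units[0] == 0xFEFF) { ++units; --count; }
    std::u16string utf16(units, count);

    // Convert UTF-16 -> UTF-8 (codecvt_utf8_utf16 is deprecated in C++17 but still works).
    std::wstring_convert<std::codecvt_utf8_utf16<char16_t>, char16_t> conv;
    std::string utf8 = conv.to_bytes(utf16);

    // Feed the in-memory UTF-8 buffer to flex instead of setting yyin.
    yy_buffer_state* buf = yy_scan_string(utf8.c_str());
    int rc = yyparse();
    yy_delete_buffer(buf);
    return rc;
}

Alternatively, the conversion could be wired into a redefined YY_INPUT macro, but converting the whole buffer up front keeps the generated scanner untouched.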

Related

IW8ISO8859P8 to utf-8 conversion

My Perl script reads values extracted from an Oracle DB based on the IW8ISO8859P8 codepage into a string. I also have an input file (saved as UTF-8) from which I read another string.
I am trying to compare both strings. Printing the first string gives me gibberish, e.g. òøáä -îåñáåú, whereas the other string gives me the Hebrew letters, e.g. הסבות. How can I encode the first string so that I get the correct Hebrew string?
Thanks
Shimon
I tried using the form $string1 = Encode (utf-8, $string1), but it did not help.

Inno script function "LoadStringFromFile" NOT reading Unicode correctly [duplicate]

I have a function called GetServerName. I need to pass the file name (for example 'test.txt') as well as the needed section string (for example 'server').
The test.txt file contains something like this:
data1 | abcd
data2 | efgh
server| 'serverName1'
data3 | ijkl
I need to extract the server name, so I will call something like GetServerName('test.txt', 'server') and it should return serverName1.
My problem is that test.txt used to be an ANSI-encoded file; now it can be either an ANSI-encoded or a Unicode-encoded file. The function below worked correctly for the ANSI-encoded file, but gives problems if the file is encoded in Unicode. I suspect the LoadStringsFromFile function, because when I debug I can see it returns unreadable characters instead of human-readable ones. How can I solve this simply? (Or: how can I detect the encoding of my file and convert the Unicode string to ANSI for comparison? Then I can do the rest myself.)
function GetServerName(const FileName, Section: string): string;
// Get the Smartlink server name
var
  LineCount: Integer;
  SectionLine: Integer;
  Lines: TArrayOfString;
  AHA: TArrayOfString;
begin
  Result := '';
  if LoadStringsFromFile(FileName, Lines) then
  begin
    LineCount := GetArrayLength(Lines);
    for SectionLine := 0 to LineCount - 1 do
    begin
      // Split the line on '|' (StrSplit is a helper defined elsewhere in the script)
      AHA := StrSplit(Trim(Lines[SectionLine]), '|');
      if AHA[0] = Section then
      begin
        Result := AHA[1];
        Exit;
      end;
    end;
  end;
end;
First, note that Unicode is not an encoding; Unicode is a character set. UTF-8, UTF-16, UTF-32, etc. are encodings. So we do not know which encoding you actually use.
In the Unicode version of Inno Setup, the LoadStringsFromFile function (plural – do not confuse with the singular LoadStringFromFile) uses the current Windows ANSI encoding by default.
But if the file has a UTF-8 BOM, it will treat the contents accordingly. The BOM is a common way to autodetect UTF-8 (and the other UTF-*) encodings. You can create a UTF-8 file with a BOM using Windows Notepad.
UTF-16 and other encodings are not supported natively.
For an implementation of reading a UTF-16 file, see Reading UTF-16 file in Inno Setup Pascal script.
For working with files in any encoding, including UTF-8 without BOM, see Inno Setup - Convert array of string to Unicode and back to ANSI or Inno Setup replace a string in a UTF-8 file without BOM.
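The BOM autodetection described above is not specific to Inno Setup. Purely for illustration (this is not Pascal Script, and the function name is made up here), a minimal sketch of the check in C++:

#include <cstddef>
#include <cstdio>
#include <string>

// Reads the first bytes of a file and reports the encoding suggested by a
// byte order mark, falling back to "ANSI or unknown" when no BOM is found.
std::string DetectBomEncoding(const char* path)
{
    unsigned char buf[3] = {0, 0, 0};
    std::FILE* f = std::fopen(path, "rb");
    if (!f) return "unreadable";
    std::size_t n = std::fread(buf, 1, sizeof(buf), f);
    std::fclose(f);

    if (n >= 3 && buf[0] == 0xEF && buf[1] == 0xBB && buf[2] == 0xBF) return "UTF-8 with BOM";
    if (n >= 2 && buf[0] == 0xFF && buf[1] == 0xFE) return "UTF-16 LE (or UTF-32 LE)";
    if (n >= 2 && buf[0] == 0xFE && buf[1] == 0xFF) return "UTF-16 BE";
    return "ANSI or unknown (no BOM)";
}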

Writing UTF16 file with std::fstream

Is it possible to imbue a std::fstream so that a std::string containing UTF-8-encoded text can be streamed to a UTF-16 file?
I tried the following using the utf8-to-utf16 facet, but the resulting file is still UTF-8:
std::fstream utf16_stream("test.txt", std::ios_base::trunc | std::ios_base::out);
utf16_stream.imbue(std::locale(std::locale(), new codecvt_utf8_utf16<wchar_t,
    std::codecvt_mode(std::generate_header | std::little_endian)>));
std::string utf8_string = "\x54\xE2\x83\xac\x73\x74";
utf16_stream << utf8_string;
References for the codecvt_utf8_utf16 facet seem to indicate it can be used to read and write UTF-8 files, not UTF-16 - is that correct, and if so, is there a simple way to do what I want to do?
File streams (by virtue of the requirements on std::basic_filebuf, §22.4.1.4.2 [locale.codecvt.virtuals]/3) do not support N:M character encoding conversions, as would be the case with UTF-8 internal / UTF-16 external.
You'd have to build a UTF-16 string, e.g. by using wstring_convert, reinterpret it as a sequence of bytes, and output it using an ordinary (non-converting) std::ofstream.
Or, alternatively, convert the UTF-8 to a wide string first and then use std::codecvt_utf16, which produces UTF-16 as a sequence of bytes and can therefore be used with file streams.
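A minimal sketch of the first suggestion (build the UTF-16 string with wstring_convert, then write its bytes through a plain, non-converting stream); wstring_convert and codecvt_utf8_utf16 are deprecated since C++17 but still available, and the example bytes here are the UTF-8 encoding of "T€st":

#include <codecvt>
#include <fstream>
#include <locale>
#include <string>

int main()
{
    std::string utf8_string = "\x54\xE2\x82\xAC\x73\x74"; // "T€st" encoded as UTF-8

    // UTF-8 -> UTF-16 code units (char16_t, host byte order).
    std::wstring_convert<std::codecvt_utf8_utf16<char16_t>, char16_t> conv;
    std::u16string utf16 = conv.from_bytes(utf8_string);

    // Write a BOM plus the raw code units through a non-converting binary stream.
    std::ofstream out("test.txt", std::ios_base::binary | std::ios_base::trunc);
    const char16_t bom = 0xFEFF; // stored in host byte order
    out.write(reinterpret_cast<const char*>(&bom), sizeof(bom));
    out.write(reinterpret_cast<const char*>(utf16.data()),
              static_cast<std::streamsize>(utf16.size() * sizeof(char16_t)));
    // The result is UTF-16 LE only on a little-endian host; swap bytes explicitly otherwise.
}

Because the stream itself performs no conversion, the BOM and the byte order are handled explicitly, which is exactly what the standard's restriction on basic_filebuf forces you into.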

Perl unicode file with non-unicode content

A piece of software is producing UTF-8 files, but it writes content to the file that isn't proper Unicode text. I can't change that software and have to take the output as it is. I don't know if this will show up here correctly, but a German umlaut "ä" is shown in the file as "Ã¤".
If I open the file in Notepad++, it tells me the file is UTF-8 (without BOM) encoded. Now, if I say "convert to ANSI" in Notepad++ and then switch the file encoding back to UTF-8 (without converting), the German umlauts in the file are correct. How can I achieve the exact same behaviour in Perl? Whatever I have tried up to now, the umlaut mess just got worse.
To reproduce, create a UTF-8 encoded file and write this content to it:
MÃ¤nner SchÃ¼ler VÃ¶gel SÃ¼ÃŸ
Then, on a UTF-8 MySQL database, create a table with a varchar field and utf8_unicode collation. Now, use this script:
use utf8;
use DBI;
use Encode;

if (open FILE, "test.csv") {
    my $db = DBI->connect(
        'DBI:mysql:your_db;host=127.0.0.1;mysql_compression=1', 'root', 'Yourpass',
        { PrintError => 1 }
    );
    my $sql = qq{SET NAMES 'utf8';};
    $db->do($sql);
    while (my $line = <FILE>) {
        my $sth = $db->prepare("INSERT IGNORE INTO testtable (testline) VALUES (?);");
        $sth->execute($line);
    }
    close FILE;
}
The exact contents of the file get written to the database. But the output I expect in the database is with proper German umlauts:
Männer Schüler Vögel Süß
So, how can I convert this correctly?
It's ironic: as I see it, the software you talk about is not writing 'non-Unicode content' (that's nonsense) - it encodes it as UTF-8 twice. Take the ä character, for example: it's represented by two bytes in UTF-8, 0xC3 0xA4. But then something in that program decides to treat those bytes as Latin-1 instead, so they become two separate characters (which are eventually encoded into UTF-8 again, and that's what gets saved to the file).
I suppose the simplest way of reversing this is to make Perl treat the string read from the file as a sequence of bytes (and not as a sequence of characters). It can be done in a way as simple (and as ugly) as this:
open my $fh, '<:utf8', $file_name or die $!;
my $string = <$fh>;     # a sequence of characters (decoded once)
utf8::decode($string);  # in place: those characters are now treated as octets and decoded again
                        # (utf8::decode returns a status flag, so do not assign its result to $string)
Sounds like something is converting it a second time, assuming it to be something like ISO 8859-15 and then converting that to UTF-8. You can reverse this by converting UTF-8 to ISO 8859-15 (or whichever encoding seems to make sense for your data).
As seen on http://www.fileformat.info/info/unicode/char/E4/index.htm, the bytes 0xC3 0xA4 are the valid UTF-8 encoding of ä. When viewed as ISO 8859-15 (or 8859-1, or Windows-1252, or a number of other 8-bit encodings), they display as the string Ã¤.
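The reversal both answers describe is language-independent: decode the text as UTF-8 once, then write each resulting code point back out as a single Latin-1 byte, and the bytes you get are the original, singly-encoded UTF-8. Purely to make that byte-level operation explicit, a C++ sketch (the function name is invented here; std::codecvt_utf8 is deprecated since C++17 but still available):

#include <codecvt>
#include <locale>
#include <stdexcept>
#include <string>

// Undo one level of double UTF-8 encoding:
// "Ã¤" (bytes C3 83 C2 A4) -> "ä" (bytes C3 A4).
std::string undo_double_utf8(const std::string& doubly_encoded)
{
    // Decode the UTF-8 bytes into code points.
    std::wstring_convert<std::codecvt_utf8<char32_t>, char32_t> conv;
    std::u32string points = conv.from_bytes(doubly_encoded);

    // Each code point must fit into one byte (i.e. be Latin-1);
    // the resulting byte string is the original single-encoded UTF-8.
    std::string bytes;
    for (char32_t cp : points) {
        if (cp > 0xFF)
            throw std::runtime_error("input is not double-encoded UTF-8");
        bytes.push_back(static_cast<char>(cp));
    }
    return bytes;
}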