Zend Framework and UTF-8 characters (æøå)

I use Zend Framework and I have a problem with JSON and UTF-8.
Output
\u00c3\u00ad\u00c4\u008d
íÄ
I use...
JavaScript (jQuery)
contentType : "application/json; charset=utf-8",
dataType : "json"
Zend Framework
$view->setEncoding('UTF-8');
$view->headMeta()->appendHttpEquiv('Content-Type', 'text/html;charset=utf-8');
header('Content-Type: application/json; charset=utf-8');
utf8_encode();
Zend_Json::encode
Database
resources.db.params.charset = "utf8"
resources.db.params.driver_options.1002 = "SET NAMES utf8"
resources.db.isDefaultTableAdapter = true
Collation
utf8_unicode_ci
Type
MyISAM
Server
PHP Version 5.2.6
What did I do wrong? Thank you for your reply!

utf8_encode();
If you've got UTF-8 strings from your database and UTF-8 strings from your browser, then you don't need to utf8_encode any more. You've already got UTF-8 strings; calling this function again will just give you the UTF-8 representation of what you'd get if you read UTF-8 bytes as ISO-8859-1 by mistake.
Pass your untouched UTF-8 strings straight to the JSON encoder.
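To make the double-encoding effect concrete, here is a minimal sketch; the source text "íč" is an assumption chosen because its UTF-8 bytes match the escapes in the question, and PowerShell is used only to show the byte mechanics:
$original  = 'íč'                                        # assumed source text, U+00ED U+010D
$utf8Bytes = [Text.Encoding]::UTF8.GetBytes($original)   # C3 AD C4 8D
# Reading those UTF-8 bytes back as ISO-8859-1 is the mistake that utf8_encode() bakes in
# when it is applied to a string that is already UTF-8:
$mojibake = [Text.Encoding]::GetEncoding('ISO-8859-1').GetString($utf8Bytes)
$mojibake   # JSON-encoding this yields \u00c3\u00ad\u00c4\u008d, as in the question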

I think this question is somehow related to yours.
My problem was when encoding some text (Arabic, Hebrew or Chinese, as you might see there).
It turns out that this is the Unicode escape notation understood by JavaScript/ECMAScript, which is what you are seeing.
I hope that explains it in detail.

Related

Illegal character '?' when I create the JSON using ConvertTo-Json

I am not a PowerShell guy, so please excuse me if my question is confusing.
We are creating a JSON file using ConvertTo-Json, and it successfully creates the JSON file. However, when I cat the contents of the JSON file, it has '??' at the beginning, but the same is not seen when I download the file / view the file in the file system.
Below is the PowerShell code which is used to create the JSON file:
$packageJson = @{
    packageName = "ABC.DEF.GHI"
    version = "1.1.1"
    branchName = "somebranch"
    oneOps = @{
        platform = "XYZ"
        component = "JNL"
    }
}
$packageJson | ConvertTo-Json -depth 100 | Out-File "$packageName.json"
The above code creates the file successfully, and when I view the file everything looks fine, but when I cat the file it has a leading '??' as shown below:
??{
"packageName": "ABC.DEF.GHI",
"version": "0.1.0-looper-poc0529",
"oneOps": {
"platform": "XYZ",
"component": "JNL"
},
"branchName": "somebranch"
}
Due to this I am unable to parse the JSON file, and it gives out the following error:
com.jayway.jsonpath.InvalidJsonException: com.fasterxml.jackson.core.JsonParseException: Unexpected character ('?' (code 65533 / 0xfffd)): expected a valid value (number, String, array, object, 'true', 'false' or 'null')
Those aren't ? characters. Those are two different unprintable characters that make up a Unicode byte order mark. You see ? because that's how the debugger, text editor, OS, or font in question renders unprintable characters.
To fix this, either change the output encoding, or use a character set on the other end that understands UTF-8. The former is a simpler fix, but the latter is probably better in the long run. Eventually you'll end up with data that needs an extended character.
tl;dr
It sounds like your Java code expects a UTF-8-encoded file without BOM, so direct use of the .NET Framework is needed:
[IO.File]::WriteAllText("$PWD/$packageName.json", ($packageJson | ConvertTo-Json))
As Tom Blodget points out, BOM-less UTF-8 is mandated by the IETF's JSON standard, RFC 8259.
Unfortunately, Windows PowerShell's default output encoding for Out-File and also for the redirection operator > is UTF-16LE ("Unicode"), in which:
(most) characters are represented as 2-byte units.
the file starts with a special 2-byte unit (0xFF 0xFE, the UTF-16LE encoding of the Unicode character U+FEFF), the so-called BOM (byte-order mark) or Unicode signature, which serves to identify the encoding.
If target programs do not understand this encoding, they treat the BOM as data (and would subsequently misinterpret the actual data), which causes the problem you saw.
The specific symptom you saw - a complaint about character U+FFFD, which is used as the generic stand-in for an invalid character in the input - suggests that your Java code likely expects UTF-8 encoding.
Unfortunately, using Out-File -Encoding utf8 is not a solution, because PowerShell invariably writes a BOM for UTF-8 as well, which Java doesn't expect.
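You can see both behaviors by dumping the leading bytes of a small test file; this is only an illustrative sketch, and the temp-file path is an arbitrary choice:
'{}' | Out-File "$env:TEMP\bom-test.json"                 # Windows PowerShell default ("Unicode")
[IO.File]::ReadAllBytes("$env:TEMP\bom-test.json")[0..1] | ForEach-Object { '{0:X2}' -f $_ }
# -> FF FE   (UTF-16LE BOM)
'{}' | Out-File "$env:TEMP\bom-test.json" -Encoding utf8  # still prepends a BOM in Windows PowerShell
[IO.File]::ReadAllBytes("$env:TEMP\bom-test.json")[0..2] | ForEach-Object { '{0:X2}' -f $_ }
# -> EF BB BF   (UTF-8 BOM)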
Workarounds:
If you can be sure that the JSON string contains only characters in the 7-bit ASCII range (no accented characters), you can get away with Out-File -Encoding Ascii, as TheIncorrigible1 suggests.
Otherwise, use the .NET Framework directly to create your output file with BOM-less UTF-8 encoding (see the sketch after this list).
The answers to this question demonstrate solutions, one of which is shown in the "tl;dr" section at the top.
If it's an option, use the cross-platform PowerShell Core edition instead, whose default encoding is sensibly BOM-less UTF-8, for compatibility with the rest of the world.
Note that not all Windows PowerShell functionality is available in PowerShell Core, however, and vice versa, but future development efforts will focus on PowerShell Core.
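As an alternative to the WriteAllText call in the tl;dr (which defaults to BOM-less UTF-8), you can make the intent explicit by constructing a UTF8Encoding instance without a BOM; the file name here is just an assumed example:
# Explicit BOM-less UTF-8: the $false constructor argument means "do not emit a BOM"
$utf8NoBom = New-Object System.Text.UTF8Encoding($false)
[IO.File]::WriteAllText("$PWD\package.json", ($packageJson | ConvertTo-Json -Depth 100), $utf8NoBom)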
A more general solution that's not specific to Out-File is to set these before you call ConvertTo-Json:
$OutputEncoding = [Console]::OutputEncoding = [Text.UTF8Encoding]::UTF8;

Powershell base64 vs regular base64

I use PowerShell to convert a string:
$Text = 'ouser:v3$34##85b&g%fD79a3nf'
$Bytes = [System.Text.Encoding]::Unicode.GetBytes($Text)
$EncodedText =[Convert]::ToBase64String($Bytes)
$EncodedText
However, when using https://www.base64decode.org/ or some Java libraries for base64 encoding, I get a different, shorter version.
Sample string:
This is a secret and should be hiden
powershell result:
VABoAGkAcwAgAGkAcwAgAGEAIABzAGUAYwByAGUAdAAgAGEAbgBkACAAcwBoAG8AdQBsAGQAIABiAGUAIABoAGkAZABlAG4A
normal base64 result:
VGhpcyBpcyBhIHNlY3JldCBhbmQgc2hvdWxkIGJlIGhpZGVu
Using the website I am able to decode both versions; however, using my Java code I am only able to decode the latter. Why is that? Is there more than one version of base64? Where do those differences come from?
Adding the comment from @Raziel as an answer for better discoverability of this question.
[System.Text.Encoding]::Unicode is UTF-16; the latter is UTF-8. There's [System.Text.Encoding]::UTF8 that you can use.
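So, to get the shorter "normal" result from PowerShell, encode the string as UTF-8 bytes before Base64-encoding it; using the sample string from the question:
$Text  = 'This is a secret and should be hiden'         # sample string from the question
$Bytes = [System.Text.Encoding]::UTF8.GetBytes($Text)   # UTF-8 instead of UTF-16 ("Unicode")
[Convert]::ToBase64String($Bytes)                        # -> VGhpcyBpcyBhIHNlY3JldCBhbmQgc2hvdWxkIGJlIGhpZGVu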

Trouble understanding C# URL decode with Unicode character(s) in PowerShell

I'm currently working on something that requires me to pass a Base64 string to a PowerShell script. But while decoding the string back to the original, I'm getting some unexpected results: I need to use UTF-7 during decoding, and I don't understand why. Would someone know why?
The Mozilla documentation would suggest that it's insufficient to use Base64 if you have Unicode characters in your string. Thus you need to use a workaround that consists of using encodeURIComponent and a replace. I don't really get why the replace is needed, so I shortened it to btoa(escape('✓ à la mode')) to encode the string. The result of that operation is JXUyNzEzJTIwJUUwJTIwbGElMjBtb2Rl.
Using PowerShell to decode the string back to the original, I first need to undo the Base64 encoding. To do that, System.Convert can be used (which results in a byte array), and its output can be converted to a UTF-8 string using System.Text.Encoding. Together this looks like the following:
$bytes = [System.Convert]::FromBase64String($inputstring)
$utf8string = [System.Text.Encoding]::UTF8.GetString($bytes)
What's left to do is URL-decode the whole thing. As it is a UTF-8 string, I'd expect to only need to run the URL decode without any further parameters. But if you do that, you end up with an accented a that looks like � in a file or ? on the console. To get the actual original string, it's necessary to tell the URL decode to use UTF-7 as the character set. It's nice that this works, but I don't really get why it's necessary, since the string should be UTF-8 and UTF-8 certainly supports an accented a. See the last two lines of the entire script for what I mean. With those two lines you will end up with one line that has the garbled text and one that has the original text, in the same file encoded as UTF-8.
Entire PowerShell script:
Add-Type -AssemblyName System.Web
$inputstring = "JXUyNzEzJTIwJUUwJTIwbGElMjBtb2Rl"
$bytes = [System.Convert]::FromBase64String($inputstring)
$utf8string = [System.Text.Encoding]::UTF8.GetString($bytes)
[System.Web.HttpUtility]::UrlDecode($utf8string) | Out-File -Encoding utf8 C:\temp\output.txt
[System.Web.HttpUtility]::UrlDecode($utf8string, [System.Text.UnicodeEncoding]::UTF7) | Out-File -Append -Encoding utf8 C:\temp\output.txt
Clarification:
The problem isn't the conversion of the Base64 to UTF-8. The problem is some inconsistent behavior of C#'s UrlDecode. If you run escape('✓ à la mode') in your browser, you end up with the following string: %u2713%20%E0%20la%20mode. So we have a Unicode escape for the check mark and a %-escape for the à. If we use this directly in UrlDecode, we end up with the same error. My current assumption would be that it's an issue with the encoding of the PowerShell window and pasting characters into it.
Turns out it actually isn't all that strange. It's just that, for what I want to do, it's advantageous to use a newer function. I'm still not sure why it works if you use the UTF-7 encoding, but as an explanation:
... The hexadecimal form for characters, whose code unit value is 0xFF or less, is a two-digit escape sequence: %xx. For characters with a greater code unit, the four-digit format %uxxxx is used.
https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/escape
As TesselatingHecksler pointed out, What is the proper way to URL encode Unicode characters? would indicate that the %u format was never formally standardized. A newer way to escape characters exists, though, which is encodeURIComponent.
The encodeURIComponent() function encodes a Uniform Resource Identifier (URI) component by replacing each instance of certain characters by one, two, three, or four escape sequences representing the UTF-8 encoding of the character (will only be four escape sequences for characters composed of two "surrogate" characters).
The output of this function actually works with the C# implementation of UrlDecode without supplying an additional encoding of UTF-7.
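As an illustration, here is a sketch of that round trip; the Base64 value below is assumed to be the output of btoa(encodeURIComponent('✓ à la mode')) and is not taken from the question:
Add-Type -AssemblyName System.Web
$inputstring = 'JUUyJTlDJTkzJTIwJUMzJUEwJTIwbGElMjBtb2Rl'       # assumed btoa(encodeURIComponent(...)) output
$bytes       = [System.Convert]::FromBase64String($inputstring)
$utf8string  = [System.Text.Encoding]::UTF8.GetString($bytes)   # '%E2%9C%93%20%C3%A0%20la%20mode'
[System.Web.HttpUtility]::UrlDecode($utf8string)                # '✓ à la mode', no UTF-7 needed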
The originally linked Mozilla article about Base64-encoding UTF-8 strings modifies the whole process in a way that allows you to just call the Base64 decode function in order to get the whole string back. This is achieved by converting the URL-encoded version of the string to bytes.

Why does Perl's LWP give me a different encoding than the original website?

Let's say I have this code:
use strict;
use LWP qw ( get );
my $content = get ( "http://www.msn.co.il" );
print STDERR $content;
The error log shows something like "\xd7\x9c\xd7\x94\xd7\x93\xd7\xa4\xd7\xa1\xd7\x94",
which I'm guessing is UTF-16?
The website declares its encoding with
<META HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=windows-1255">
so why do these characters appear and not the windows-1255 chars?
And another weird thing is that I have two servers:
the first server returns CP1255 chars and I can simply convert them to UTF-8,
while the current server gives me these chars and I can't do anything with them...
Is there any configuration in Apache/Perl/some module that is messing up the encoding?
Forcing something...?
The result on my website on the second server is that the Perl file and the headers are all UTF-8, so when I write text that isn't English chars, the content from the example above shows up OK (even though it's weird UTF chars), but my own static text looks like "×ס'××ר××:"
One more thing that I tested:
Through perl:
my $content = `curl "http://www.anglo-saxon.co.il"`;
I get UTF-8 encoding.
Through Bash:
curl "http://www.anglo-saxon.co.il"
and here I get CP1255 (Windows-1255) encoding...
Also, when I run the script in Bash it gives CP1255, and when I run it through the web it's UTF-8 again...
I fixed the problem by converting the content from UTF-8 to what it's supposed to be, and then back to UTF-8:
use Text::Iconv;
my $converter = Text::Iconv->new("utf8", "CP1255");
$content = $converter->convert($content);
$converter = Text::Iconv->new("CP1255", "utf8");
$content = $converter->convert($content);
All of this manual encoding and decoding is unnecessary. The HTML is lying to you when it says that the page is encoded in windows-1255; the server says it's serving UTF-8, and it is. Blame Microsoft HTML-generation tools.
Anyway, since the server does return the correct encoding, this works:
my $response = LWP::UserAgent->new->get("http://www.msn.co.il/");
my $content = $response->decoded_content;
$content is now a perl character string, ready to do whatever you need. If you want to convert it to some other encoding, then calling Encode::encode on it is appropriate; do not use Encode::decode as it's already been decoded once.
http://www.msn.co.il is in UTF-8, and indicates that properly. The string "\xd7\x9c\xd7\x94\xd7\x93\xd7\xa4\xd7\xa1\xd7\x94" is also proper UTF-8 (להדפסה). I don't see the problem.
I think your second problem is due to you mixing different encodings (UTF-8 and Windows-1255). You might want to encode/decode your strings properly.
First, note that you should import get from LWP::Simple. Second, everything works fine with:
#!/usr/bin/perl
use strict; use warnings;
use LWP::Simple qw ( getstore );
getstore 'http://www.msn.co.il', 'test.html';
which indicates to me that the problem is the encoding of the filehandle to which you are sending the output.
The string with the hex values that you gave appears to be a UTF-8 encoding. You are getting this because Perl ‘likes to’ use UTF-8 when it deals with strings. The LWP::Simple->get() method automatically decodes the content from the server which includes undoing any Content-Encoding as well as converting to UTF-8.
You could dig into the internals and get a version that does change the character encoding (see HTTP::Message's decoded_content, which is used by HTTP::Response's decoded_content, which you can get from LWP::UserAgent's get). But it may be easier to re-encode the data in your desired encoding with something like
use Encode;
...;
my $cp1255_bytes = encode('CP1255', decode('UTF-8', $utf8_bytes));
The mixed readable/garbage characters you see are due to mixing multiple, incompatible encodings in the same stream. Probably the stream is labeled as UTF-8 but you are putting CP1255 encoded characters into it. You either need to label the stream as CP1255 and put only CP1255-encoded data into it, or label it as UTF-8 and put only UTF-8-encoded data into it. Remind yourself that bytes are not characters and convert between them appropriately.

Decode a UTF-8 email header

I have an email subject of the form:
=?utf-8?B?T3.....?=
The body of the email is utf-8 base64 encoded - and has decoded fine.
I am currently using Perl's Email::MIME module to decode the email.
What is the meaning of the =?utf-8 delimiter and how do I extract information from this string?
The encoded-word tokens (as per RFC 2047) can occur in values of some headers. They are parsed as follows:
=?<charset>?<encoding>?<data>?=
The charset is UTF-8 in this case; the encoding is B, which means Base64 (the other option is Q, which means Quoted-Printable).
To read it, first decode the base64, then treat it as UTF-8 characters.
Also read the various Internet Mail RFCs for more detail, mainly RFC 2047.
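The decoding steps themselves are language-agnostic; as a sketch (shown here in PowerShell, with a made-up sample header value), you split on the ? delimiters, Base64-decode the data, then decode the bytes with the named charset:
$word = '=?utf-8?B?VMOpc3Q=?='                       # made-up sample, encodes "Tést"
if ($word -match '^=\?([^?]+)\?B\?([^?]*)\?=$') {
    $bytes = [Convert]::FromBase64String($Matches[2])
    [Text.Encoding]::GetEncoding($Matches[1]).GetString($bytes)   # -> Tést
}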
Since you are using Perl, Encode::MIME::Header could be of use:
SYNOPSIS
use Encode qw/encode decode/;
$utf8 = decode('MIME-Header', $header);
$header = encode('MIME-Header', $utf8);
ABSTRACT
This module implements RFC 2047 Mime Header Encoding. There are 3 variant encoding names: MIME-Header, MIME-B and MIME-Q. The difference is described below.
              decode()          encode()
MIME-Header   Both B and Q      =?UTF-8?B?....?=
MIME-B        B only; Q croaks  =?UTF-8?B?....?=
MIME-Q        Q only; B croaks  =?UTF-8?Q?....?=
I think that the Encode module handles that with the MIME-Header encoding, so try this:
use Encode qw(decode);
my $decoded = decode("MIME-Header", $encoded);
Check out RFC2047. The 'B' means that the part between the last two '?'s is base64-encoded. The 'utf-8' naturally means that the decoded data should be interpreted as UTF-8.
MIME::Words from MIME-tools works well for this too. I ran into some issues with Encode and found MIME::Words succeeded on some strings where Encode did not.
use MIME::Words qw(:all);
$decoded = decode_mimewords(
    'To: =?ISO-8859-1?Q?Keld_J=F8rn_Simonsen?= <keld@dkuug.dk>',
);
This is a standard extension for charset labeling of headers, specified in RFC2047.