WebClient form data with French special characters is not transmitted correctly

I am using HtmlUnit 2.13 to send data through a form. Everything works fine, except that French characters like é, è, â, ê, n°, etc. are not transmitted correctly.
For example, I have a text input for address information, let's say "Cité BLABLABLA n° 123": only "Cit" is transferred; everything after the "é" (and after the other characters mentioned before) is lost.

Make sure you declare <meta charset="UTF-8"> inside the head tag and you should be okay.
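To see the whole round trip, here is a minimal HtmlUnit sketch (the URL, form name and input names are made up for illustration). Like a browser, HtmlUnit encodes the submitted values with the page's charset (or the form's accept-charset attribute), so once the page declares UTF-8 the "é" should survive the submit:

import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.html.HtmlForm;
import com.gargoylesoftware.htmlunit.html.HtmlPage;
import com.gargoylesoftware.htmlunit.html.HtmlSubmitInput;

public class FormSubmitDemo {
    public static void main(String[] args) throws Exception {
        WebClient webClient = new WebClient();
        // The page's declared charset (or the form's accept-charset)
        // decides how HtmlUnit encodes the POSTed values.
        HtmlPage page = webClient.getPage("http://example.com/address-form.html");
        HtmlForm form = page.getFormByName("addressForm");
        form.getInputByName("address").setValueAttribute("Cité BLABLABLA n° 123");
        HtmlSubmitInput submit = form.getInputByName("send");
        page = submit.click();
        webClient.closeAllWindows(); // close() in newer HtmlUnit versions
    }
}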

Related

How to properly encode HTML entities in emails? e.g. &nearr; for Gmail

So I modified some emails I send to get rid of images and replace them with special Unicode characters. For example, I had an arrow image and replaced it with &nearr;, wrapping it in a <span> to give it the color I want.
When I look at the source in Gmail (3 dots > Show Original) I see this:
...
--1234567890123456789012345678
Content-Type: text/html; charset=UTF-8
Content-Transfer-Encoding: quoted-printable
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.=
w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns=3D"http://www.w3.org/1999/xhtml">
<head>
<meta http-equiv=3D"Content-Type" content=3D"text/html; charset=3DUTF-8" />
</head>
<body>
...
... <span style=3D"font-family:arial,verdana;font-weight:bold;color:#209a20">&nearr;</span> ...
...
</body>
</html>
--1234567890123456789012345678--
Which is what I'd expect since that's what I wrote in my code.
Now the problem is that in the Gmail web interface the entity stays visible as the literal text "&nearr;" instead of rendering as an arrow.
What am I doing wrong? Isn't UTF-8 a Unicode encoding that should support this character?
I would understand if some of these special characters were displayed as square boxes or something, but I do not understand how they can remain encoded while the &nbsp; turns into a space correctly.
It also makes me question whether other email clients will display these correctly (would love feedback on that too).
In the 1950s computers could handle only capital letters, digits and some punctuation.
Before 1970, EBCDIC was invented (only to later die out) for handling lower case and a few more punctuation characters.
Then came a plethora of encodings to handle European accents, Cyrillic, Greek, and eventually Chinese. (There are some interesting stories on the invention of typewriters for handling Chinese!)
Eventually, the Unicode group got together and slowly created a universal standard. It has been evolving for a few decades and continues to grow -- emojis are a big, ongoing addition.
But, meanwhile, how does one put emoji, etc., in URLs, type them on a keyboard, and so on? Those standards are lagging way behind. So, there are kludges in place.
HTML allows "entities", such as &nearr; for that arrow.
Putting such in a URL would require something like %E2%86%97.
Several encodings also base their kludge on the hex encoding of the UTF-8 bytes.
HTML's numeric character references are based on the decimal value of the "codepoint": &#8599; also works for that arrow. (Java goes the hex way instead, with escapes like \u2197.)
MySQL INSERT: UNHEX('E28697') -- the same UTF-8 bytes in hex.
Keyboards -- good luck.
I don't know of anything other than HTML that reacts favorably to &nearr;.
Ever notice a + in a URL? That is the encoding for a single space. (Also %20 works there.)
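To tie those kludges together, a small Java sketch that prints each representation of the arrow (Java is just the vehicle here; the printed values are the point):

import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

public class ArrowKludges {
    public static void main(String[] args) throws Exception {
        String arrow = "\u2197"; // NORTH EAST ARROW, codepoint U+2197 = 8599 decimal

        // Percent-encoding for URLs: each UTF-8 byte as %XX
        System.out.println(URLEncoder.encode(arrow, "UTF-8")); // %E2%86%97
        System.out.println(URLEncoder.encode(" ", "UTF-8"));   // + (the space kludge)

        // The raw UTF-8 bytes -- what MySQL's UNHEX('E28697') produces
        for (byte b : arrow.getBytes(StandardCharsets.UTF_8)) {
            System.out.printf("%02X", b); // E28697
        }
        System.out.println();

        // HTML numeric character reference: the decimal codepoint
        System.out.println("&#" + arrow.codePointAt(0) + ";"); // &#8599;
    }
}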
Try the HTML numeric character code rather than the named HTML entity.
So &#8599; for the north east arrow, as per
https://www.toptal.com/designers/htmlarrows/arrows/north-east-arrow/
Best reference for this is usually https://unicode-table.com/en/

Character encoding mystery (working flawlessly for me, but unfortunately not for all)

I have a content manager website where character encoding works wonderfully for almost everyone, but not for a few "lucky" users, which drives me crazy.
I have a (search) keyword database table (a log) which sometimes shows very strange keywords (the keyword was originally in Hungarian, but something happened to it).
For example the database shows:
gyerekeknek papã rbã³l
instead of
gyerekeknek papírból
I can't replicate the culprit word/encoding or URL, because it works flawlessly for me with every accent and even with different languages (such as Cyrillic letters, Chinese, special characters, etc.).
An average search url look like this:
https://example.com/keyword=árvíztűrő+tükörfúrógép
OR: https://example.com/keyword=%c3%a1rv%c3%adzt%c5%b1r%c5%91+t%c3%bck%c3%b6rf%c3%bar%c3%b3g%c3%a9p
And it works absolutely fine (the database row for this is "árvíztűrő tükörfúrógép" as expected)
The strange thing: the $_GET['keyword'] param already seems to be in the wrong encoding ("gyerekeknek papã rbã³l") when it arrives, before any jQuery/PHP validation and/or processing.
My website and database are UTF-8 encoded everywhere. My website has:
AddDefaultCharset utf-8 (in .htaccess)
All files in UTF8 without BOM
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" /> on every single page
mysqli_query($con,"SET NAMES 'utf8'");
mysqli_query($con,"SET CHARACTER SET 'utf8'");
mysqli_query($con,"SET COLLATION_CONNECTION='utf8_general_ci'"); On every database connection
Any idea or help is greatly appreciated,
Thank you!
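One plausible cause, offered as a guess rather than a diagnosis: those few clients deliver the keyword as UTF-8 bytes that get decoded as ISO-8859-1 somewhere on the way to the log. A minimal Java sketch of the byte mechanics; note that "í" is C3 AD in UTF-8, AD is an invisible soft hyphen in Latin-1, and lowercasing turns Ã into ã, which together reproduce the logged "papã rbã³l":

import java.nio.charset.StandardCharsets;

public class MojibakeDemo {
    public static void main(String[] args) {
        String original = "gyerekeknek papírból";
        // Take the UTF-8 bytes but decode them as ISO-8859-1:
        String garbled = new String(original.getBytes(StandardCharsets.UTF_8),
                                    StandardCharsets.ISO_8859_1);
        System.out.println(garbled);
        // -> "gyerekeknek papÃ­rbÃ³l" (the character after Ã is an invisible
        //    soft hyphen); lowercased, Ã becomes ã -- i.e. the logged value.
    }
}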

Laravel issue with language character encoding

Privjet! ("Hi!")
I don't understand why the non-ASCII characters of my different languages, like "ç, ñ, я", are not being displayed.
The text in question is hardcoded; it is not served from a DB.
I have seen identical questions here
Charset=utf8 not working in my PHP page
I have seen that I should write this:
header('Content-type: text/html; charset=utf-8');
But where the heck does that go? I can't write it into the page as-is; the browser just mirrors the words and displays them as plain text, with no parsing.
My encoding for the frontpage says this:
<head>
<meta charset="utf-8">
</head>
which is supposed to be Unicode.
I tried to test my page in validator.w3.org and it went:
Sorry, I am unable to validate this document because on line 60 it contained one or more bytes that I cannot interpret as utf-8 (in other words, the bytes found are not valid values in the specified Character Encoding). Please check both the content of the file and the character encoding indication.
Line 60 actually has the word Español (Spanish) with that weird ñ.
Any hint?
thank you
best regards

Encoding international chars (Hebrew, Thai, Russian, Chinese, ...)

International HTML files archived by wget should contain characters like these (Hebrew and Thai, for example):
אב
הם
and ยคน
Instead they are saved like this:
íäáåãéú and ÃÒ¡à§é
How do I get these displayed properly?
iconv filename.html
iconv: illegal input sequence at position 1254
SOLVED: There was nothing wrong with the files.
I just hadn't noticed that the default php.ini sets the charset in the HTTP header; to use various charsets like <meta http-equiv="Content-Type" content="text/html; charset=windows-874">, you need to set: default_charset = "" (empty).
....
The pages aren't "saved like this", whatever you're using to view the file is simply interpreting the encoding incorrectly. To know what encoding the file is in you should have paid attention to the HTTP Content-Type header during download; that's gone now.
Your only other chance is to parse the equivalent HTML meta tag in the <head>, if the document has one.
Otherwise, you can only guess the encoding of the document.
See What Every Programmer Absolutely, Positively Needs To Know About Encodings And Character Sets To Work With Text for more required background knowledge.
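A rough sketch of that meta-tag fallback (the regex and error handling are deliberately simplistic): decode the raw bytes as ISO-8859-1, which maps every byte to a character so nothing is lost, find the declared charset, then re-decode. Once the charset is known, the iconv call above also works if you pass it explicitly: iconv -f windows-874 -t utf-8 filename.html.

import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class CharsetSniffer {
    private static final Pattern CHARSET =
            Pattern.compile("charset=[\"']?([\\w-]+)", Pattern.CASE_INSENSITIVE);

    public static void main(String[] args) throws Exception {
        byte[] raw = Files.readAllBytes(Paths.get(args[0]));
        // ISO-8859-1 preserves every byte, so the scan itself can't corrupt anything.
        Matcher m = CHARSET.matcher(new String(raw, StandardCharsets.ISO_8859_1));
        if (m.find()) {
            String charset = m.group(1); // e.g. "windows-874" for the Thai pages
            System.out.println(new String(raw, charset));
        } else {
            System.out.println("No charset declaration found -- you can only guess.");
        }
    }
}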

Java EE Web Project and Character Encoding

We built a Java EE web project and use JDBC for storing our data.
The problem is that German umlauts like äöü are in use and properly stored in the MySQL database. We don't know why, but in the browser those characters are broken, displaying weird stuff like
ö�
instead.
I've already tried setting the encoding of the jdbc connection like described in this question:
JDBC character encoding
And the encoding of the html page is correctly set:
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1" />
Any ideas how to fix that?
Update
connection.prepareStatement("SET CHARACTER SET utf8").execute();
won't make umlauts work.
changing the meta-tag to
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
doesn't change anything either.
"We don't know why, but in the browser those characters are broken"
Well, that's the first thing to find out. You should trace your data at every stage:
As you fetch it out of the database (with logging)
When you inject it into the page (with logging)
On the wire (via Wireshark)
When you log, don't just log the strings: log the Unicode characters that make up the strings, as integers. Just cast each character in the string to an integer and log it. It's primitive, but it'll tell you what you need to know.
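For instance, a throwaway sketch of that codepoint logging (wire it into your real logging instead of System.out):

public class CodepointDump {
    // Cast each char to an int so you see the codepoints you actually have.
    static void dump(String label, String s) {
        StringBuilder sb = new StringBuilder(label).append(':');
        for (int i = 0; i < s.length(); i++) {
            sb.append(' ').append((int) s.charAt(i));
        }
        System.out.println(sb);
    }

    public static void main(String[] args) {
        dump("expected", "äöü");    // expected: 228 246 252
        dump("mojibake", "Ã¤Ã¶Ã¼"); // mojibake: 195 164 195 182 195 188
    }
}

If the values fetched from the database already log as 195 164 ... rather than 228 ..., the data was corrupted on the way in, and no amount of output encoding will fix it.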
When you look on the wire, of course, you'll be seeing bytes rather than characters as such. You should work out what bytes you expect for your chosen encoding, and check those against what's actually coming across the network.
You've specified the encoding in the HTML - but have you told whatever's generating your page that you want it in ISO Latin 1? That's likely to be responsible for both setting the content-type header and performing the actual conversion from text to bytes.
Additionally, is there any reason why you're using ISO Latin 1 instead of UTF-8? Why would you deliberately restrict yourself like that? (ISO Latin 1 can only handle the first 256 characters of Unicode, instead of the full range of Unicode characters. UTF-8 can handle everything, and is just as efficient for ASCII.)
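As a sketch of what "telling the generating side" can mean in a plain servlet, together with requesting UTF-8 at the JDBC connection level (the URL and credentials are placeholders):

import java.io.IOException;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.SQLException;
import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;

public class UmlautServlet extends HttpServlet {
    @Override
    protected void doGet(HttpServletRequest req, HttpServletResponse resp)
            throws IOException {
        // Sets the Content-Type header AND the text-to-byte conversion,
        // so the header and the actual bytes can't disagree.
        resp.setContentType("text/html; charset=UTF-8");
        resp.getWriter().println("<p>äöü</p>");
    }

    private Connection openConnection() throws SQLException {
        // Connection-level encoding, instead of per-statement SET CHARACTER SET.
        return DriverManager.getConnection(
                "jdbc:mysql://localhost/app?useUnicode=true&characterEncoding=UTF-8",
                "user", "password");
    }
}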