In Wireshark, how can I see non-ASCII characters in packets? Some of my network data is in UTF-8 encoding and I would like Wireshark to recognize it.
Is there a plugin for it?
I found this but maybe there is something new about that.
I want to see Arabic, Chinese and Hebrew.
In Wireshark, how can I see non-ASCII characters in a packet? Some of my data in the packets is in the UTF-8 encoding and I would like Wireshark to recognize it.
Is there a plugin for it?
No - that's not what Wireshark plugins do. They can dissect packets, and do statistical analysis, but they can't affect the routines that a dissector uses.
I found this but maybe there is something new about that.
What's new is that you can (and should) specify the character encoding when dissecting a string.
What's not new is that Wireshark currently doesn't try to fully handle strings it's told are UTF-8 encoded. It can't just blindly try to display them, as they might not be valid UTF-8 strings, or they might contain non-graphic characters. We haven't yet added full UTF-8 support (or support for other character encodings that handle non-ASCII characters).
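To illustrate the problem a dissector faces with untrusted "UTF-8" data, here is a rough Python sketch (not Wireshark code; the sample bytes and escape format are made up purely for illustration):
raw = b"abc \xff\xfe \xe4\xb8\xad \x07"                 # invalid UTF-8 sequences, a CJK character and a control byte
text = raw.decode("utf-8", errors="replace")             # invalid sequences become U+FFFD instead of breaking the decode
safe = "".join(ch if ch.isprintable() else "\\x%02x" % ord(ch) for ch in text)
print(safe)                                              # non-graphic characters are shown as escapes rather than printed raw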
I have seen many resources about the usage of Base64 on today's internet. As I understand it, all of those resources seem to spell out a single use case in different ways: encode binary data in Base64 to avoid it being misinterpreted/corrupted as something else during transit (by intermediate systems). But I found nothing that explains the following:
Why would binary data be corrupted by intermediate systems? If I am sending an image from a server to a client, any intermediate servers/systems/routers will simply forward the data to the next appropriate server/system/router in the path to the client. Why would an intermediate server/system/router need to interpret something it receives? Is there any example of such systems, on today's internet, that may corrupt or wrongly interpret the data they receive?
Why do we fear only binary data being corrupted? We use Base64 because we are sure that those 64 characters can never be corrupted/misinterpreted. But by the same logic, any text characters that do not belong to the Base64 set could be corrupted/misinterpreted. Why, then, is Base64 used only to encode binary data? Extending the same idea: when we use a browser, are JavaScript and HTML files transferred in Base64 form?
There are two reasons why Base64 is used:
systems that are not 8-bit clean. This stems from "the before time" when some systems took ASCII seriously and only ever considered (and transferred) 7 bits out of any 8-bit byte (since ASCII uses only 7 bits, that would be "fine", as long as all content was actually ASCII).
systems that are 8-bit clean, but try to decode the data using a specific encoding (i.e. they assume it's well-formed text).
Both of these have similar effects when binary (i.e. non-text) data is transferred through them: they try to interpret the binary data as textual data in a character encoding, which obviously doesn't make sense (since there is no character encoding in binary data), and as a consequence they modify the data in an unfixable way.
Base64 solves both of these in a fairly neat way: it maps all possible binary data streams into valid ASCII text: the 8th bit is never set on Base64-encoded data, because only regular old ASCII characters are used.
This pretty much solves the second problem as well, since most commonly used character encodings (with the notable exception of UTF-16 and UCS-2, among a few lesser-used ones) are ASCII compatible, which means: all valid ASCII streams happen to also be valid streams in most common encodings and represent the same characters (examples of these encodings are the ISO-8859-* family, UTF-8 and most Windows codepages).
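A quick way to convince yourself of this (plain Python, purely illustrative):
import base64
data = bytes(range(256))                    # every possible byte value, clearly not text
encoded = base64.b64encode(data)            # only A-Z, a-z, 0-9, '+', '/' and '=' appear in the output
assert all(b < 0x80 for b in encoded)       # the 8th bit is never set, so 7-bit paths cannot damage it
assert base64.b64decode(encoded) == data    # and the round trip is lossless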
As to your second question, the answer is two-fold:
textual data often comes with some kind of meta-data (either an HTTP header or a meta tag inside the data) that describes the encoding to be used to interpret it. Systems built to handle this kind of data understand those tags and either tolerate or interpret them.
in some cases (notably for mail transport) we do have to use various encoding techniques to ensure text doesn't get mangled. This might mean using quoted-printable encoding or sometimes even wrapping text data in Base64.
Last but not least: Base64 has a serious drawback and that's that it's inefficient. For every 3 bytes of data to encode, it produces 4 bytes of output, thus increasing the size of the data by ~33%. That's why it should be avoided when it's not necessary.
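The overhead is easy to see (again just a Python illustration):
import base64
payload = b"\x00" * 3_000_000               # 3,000,000 bytes of binary data
print(len(base64.b64encode(payload)))       # 4,000,000 -- every 3 input bytes become 4 output bytes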
One of the uses of Base64 is sending email.
Mail servers transmitted data as text, terminal-style. It was also common to have line-ending translation, e.g. \r\n into a single \n and vice versa. Note: there was also no guarantee that 8 bits could be used (the email standard is old, and it also allowed non-"internet" email, with bang-path addresses using ! instead of @). Also, systems might not be fully ASCII.
Also, a line containing just "." marks the end of the body, and mbox files mark the start of each new mail with a line beginning "From " (which is why ">From" escaping exists), so even once the 8-bit flag was common in mail servers, the problems were not totally solved.
Base64 was a good way to remove all of these problems: the content is just sent as characters that all servers must know, and encoding/decoding requires only sender and receiver agreement (and the right programs), without worrying about the many relay servers in between. Note: \r, \n, etc. are simply ignored by the decoder.
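For example (a Python sketch, not mail-server code), the line breaks that MIME inserts every 76 characters are harmless because the decoder skips them:
import base64
data = bytes(range(256))
wrapped = base64.encodebytes(data)          # MIME-style output, broken into 76-character lines
assert b"\n" in wrapped
assert base64.b64decode(wrapped) == data    # the decoder simply ignores the line breaks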
Note: you can also use Base64 (usually the URL-safe variant, which replaces + and / with - and _) to encode strings in URLs, without worrying about how web browsers interpret them. You may also see Base64 in configuration files (e.g. to embed icons): that way a specially crafted image cannot be misinterpreted as configuration. In short, Base64 is handy for encoding binary data into protocols that were not designed for binary data.
My client J2ME application reads a text input stream using UTF-8:
reader = new InputStreamReader(in,"UTF-8");
and my server, when it gets connected, sends text using this statement:
AContext->Connection->IOHandler->WriteLn(cxMemo1->Text,TEncoding::UTF8);
but the resulting text shows weird characters like ?????????????????????????? ?????????????
Where am I going wrong?
Also, when I tried to load data from a UTF-8 encoded file this way:
AContext->Connection->IOHandler->WriteFile("c:\\fids.xml");
it's all the same!
Indy 10 completely supports UTF-8 encoding. I have myself worked with its TIdFTP component and successfully uploaded Unicode text files. From what I can make of it:
Your connection/transfer type is set to ftASCII rather than ftBinary.
Your J2ME applet/host platform does not support UTF-8.
'?' characters occur when data is going through a Unicode-to-Ansi conversion to an Ansi charset that does not support the Unicode characters being converted.
What version of C++Builder are you using? In versions prior to CB2009, you should tell Indy the encoding of the AnsiString data that you are passing in. Indy defaults to ASCII (ie: TIdTextEncoding::ASCII) for most String-based operations. That can be overridden when needed, either with optional AAnsiEncoding parameters, the TIdIOHandler::DefAnsiEncoding property, or the global Idglobal::GIdDefaultAnsiEncoding setting. If you do not specify the correct encoding, the AnsiString data may not be converted to Unicode correctly before then being converted to UTF-8. For example:
AContext->Connection->IOHandler->WriteLn(cxMemo1->Text, TIdTextEncoding_UTF8, TIdTextEncoding_Default);
Or:
AContext->Connection->IOHandler->DefAnsiEncoding = TIdTextEncoding_Default;
AContext->Connection->IOHandler->WriteLn(cxMemo1->Text, TIdTextEncoding_UTF8);
You can optionally also use the TIdIOHandler::DefStringEncoding property if you do not want to specify the UTF-8 encoding on every call:
AContext->Connection->IOHandler->DefStringEncoding = TIdTextEncoding_UTF8;
AContext->Connection->IOHandler->WriteLn(cxMemo1->Text);
Now, with that said, the fact that WriteFile() is also sending data that J2ME is not handling correctly tells me that Indy is not the root of the issue. WriteFile() simply dups the raw file data as-is to the connection without any interpretation at all. If you send a UTF-8 encoded file, then UTF-8 encoded octets will be sent to J2ME.
I suggest you use a packet sniffer, such as Wireshark, to verify the data that Indy is sending. That will tell you for sure whether Indy is really at fault or not.
PS: notice in the examples above that I use Indy's TIdTextEncoding macros instead of TEncoding directly. This is because Indy's TIdTextEncoding logic works around some bugs in Embarcadero's TEncoding classes. Also, we're going to phase out direct support for TEncoding in Indy 11 and expand on TIdTextEncoding so Indy has more control than Embarcadero offers.
In Lua, for an iPad Corona project, I'm requesting a UTF-8 server text file (containing Chinese characters) using network.request, but the result when displayed in the console or in the app shows as "garbage". Google Chrome, for instance, displays the same UTF-8 page fine, as I'm setting the http header when the server sends this (using PHP) to 'Content-Type: text/plain; charset=utf-8' (and there's no BOM, byte order mark either). The "garbage" I'm seeing in Lua looks similar to when I "force" Chrome to render the page as ISO-8859-1 using the options menu.
Does anyone have any help or pointers?
If all else fails, how would I convert the "garbage" string back to its UTF-8 origins within Lua?
Thanks for any help!
Lua doesn't know anything about UTF-8; Lua strings are just sequences of bytes. It sounds like Corona itself is parsing the strings as ISO8859-1. The most likely cause for this is them doing something really stupid and naive like treating each byte of the string as a Unicode code point.
I'm afraid I don't know Corona, so can't provide any specific solutions, but I'd suggest looking to see what functions it's got that involve encodings --- there may be a specific function to render a string with a particular encoding, for example.
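The misinterpretation itself, and how it can be reversed, is easy to demonstrate outside Corona. This is plain Python, not Lua or Corona code, and only illustrates the concept; whether the reverse step can be done inside Corona depends on what encoding functions it exposes:
utf8_bytes = "中文".encode("utf-8")              # the bytes the server actually sends
garbage = utf8_bytes.decode("iso-8859-1")        # each byte treated as its own code point -> mojibake
print(garbage)                                    # looks like the "garbage" described above
recovered = garbage.encode("iso-8859-1").decode("utf-8")
print(recovered)                                  # back to the original Chinese text, losslessly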
Can you show the code for your network.request() call?
If you're downloading an HTML page, you should use network.download().
I had this exact same problem, except with Japanese characters. Although Lua doesn't support UTF-8, Corona acts like it does. What that means is that... if you pass a UTF-8 String to display.newText(...), it should display properly. Now, if you output to the console, it will actually print out the raw bytes of the String. And, if you try to print the length of the string, it will actually print out the number of bytes.
So, in summary, Lua treats all strings as an array of bytes. It knows nothing about UTF-8. Some Corona API methods, when passed UTF-8 strings, will display the strings correctly.
I had issues when I mixed UTF-8 with plain ASCII characters, which I believe confused Corona (what I mean is that I mixed English characters with Japanese characters... still all UTF-8, though). I have a hunch that each character in the string must be of the same length in bytes for Corona to display it properly. Try printing out one character at a time to see if that helps. Please feel free to post comments here if you run into trouble. I'd like to figure this issue out myself, too.
I have a text editor that can load ASCII and Unicode files. It automatically detects the encoding by looking for the BOM at the beginning of the file and/or searching the first 256 bytes for characters > 0x7f.
What other encodings should be supported, and what characteristics would make that encoding easy to auto-detect?
Definitely UTF-8. See http://www.joelonsoftware.com/articles/Unicode.html.
As far as I know, there's no guaranteed way to detect this automatically (although the probability of a mistaken diagnosis can be reduced to a very small amount by scanning).
I don't know about encodings, but make sure it can support the multiple different line ending standards! (\n vs \r\n)
If you haven't checked out Michael Kaplan's blog yet, I suggest doing so: http://blogs.msdn.com/michkap/
Specifically this article may be useful: http://www.siao2.com/2007/04/22/2239345.aspx
There is no way to reliably detect an encoding. The best thing you could do is something like IE does and rely on letter distributions in different languages, as well as on characters typical for a language. But that's a long shot at best.
I would advise getting your hands on some large library of character sets (check out projects like iconv) and make all of those available to the user. But don't bother auto-detecting. Simply allow the user to select his preference of a default charset, which itself would be UTF-8 by default.
Latin-1 (ISO-8859-1) and its Windows extension CP-1252 must definitely be supported for western users. One could argue that UTF-8 is a superior choice, but people often don't have that choice. Chinese users would require GB-18030, and remember there are Japanese, Russians, and Greeks too, who all have their own encodings besides UTF-8-encoded Unicode.
As for detection, most encodings are not safely detectable. In some (like Latin-1), certain byte values are just invalid. In UTF-8, nearly any byte value can occur, but not every sequence of byte values is valid. In practice, however, you would not do the decoding yourself, but use an encoding/decoding library, try to decode and catch errors. So why not support all encodings that this library supports?
You could also develop heuristics, like decoding for a specific encoding and then test the result for strange characters or character combinations or frequency of such characters. But this would never be safe, and I agree with Vilx- that you shouldn't bother. In my experience, people normally know that a file has a certain encoding, or that only two or three are possible. So if they see you chose the wrong one, they can easily adapt. And have a look at other editors. The most clever solution is not always the best, especially if people are used to other programs.
UTF-16 is not very common in plain text files. UTF-8 is much more common because it is backward compatible with ASCII and is specified in standards like XML.
1) Check for BOM of various Unicode encodings. If found, use that encoding.
2) If no BOM, check if file text is valid UTF-8, reading until you reach a sufficient non-ASCII sample (since many files are almost all ASCII but may have a few accented characters or smart quotes) or the file ends. If valid UTF-8, use UTF-8.
3) If not Unicode, it's probably the current platform's default codepage.
4) Some encodings are easy to detect, for example Japanese Shift-JIS will have heavy use of the prefix bytes 0x82 and 0x83 indicating hiragana and katakana.
5) Give the user the option to change the encoding if the program's guess turns out to be wrong.
Whatever you do, use more than 256 bytes for a sniff test. It's important to get it right, so why not check the whole doc? Or at least the first 100KB or so.
Try UTF-8 and obvious UTF-16 (lots of alternating 0 bytes), then fall back to the ANSI codepage for the current locale.
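Here is a rough sketch of those heuristics in Python. The helper name, the sample sizes, the NUL-byte threshold, and the cp1252 fallback are all arbitrary assumptions for illustration, not a recommendation:
import codecs

def guess_encoding(data, fallback="cp1252"):
    # 1) BOM check; the "utf-16" codec consumes the BOM itself
    for bom, name in ((codecs.BOM_UTF8, "utf-8-sig"),
                      (codecs.BOM_UTF16_LE, "utf-16"),
                      (codecs.BOM_UTF16_BE, "utf-16")):
        if data.startswith(bom):
            return name
    # 2) strict UTF-8 decode: fails on any invalid byte sequence
    try:
        data.decode("utf-8")
        return "utf-8"
    except UnicodeDecodeError:
        pass
    # 3) lots of NUL bytes suggests BOM-less UTF-16; odd/even position hints at the byte order
    sample = data[:4096]
    if sample.count(0) > len(sample) // 4:
        return "utf-16-le" if sample[1::2].count(0) > sample[0::2].count(0) else "utf-16-be"
    # 4) give up and assume the platform ANSI codepage
    return fallback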
There doesn't seem to be an accepted way of sending down a header parameter in non-ASCII format.
The header for file download usually looks like
Content-disposition: attachment; filename="theasciifilename.doc"
Except that if you put a UTF-8 encoded string in the filename parameter, Firefox will handle it fine, whereas IE will throw up.
There is a document on CodeProject that explains a method for encoding the filename.
This document encodes Bản Kiểm Kê.doc to B%e1%ba%a3n%20Ki%e1%bb%83m%20K%c3%aa.doc by hex encoding the bytes.
Problem #1: the first character in that string, ả, has a value of 0x1EA3 -- encode that number in hex and you get %a3%1e. How did this guy get %e1%ba%a3? (I'm obviously missing something simple here)
Problem #2: While IE acknowledges this encoding, Firefox doesn't! What to do?
The specs basically don't permit anything other than US-ASCII. HTTP headers are US-ASCII. HTTP's payload defaults to ISO 8859-1 but that refers to the content body, not the headers.
Arguably the Right Thing to do would be to use MIME's technique for encoding non-ASCII data in headers, as described in RFC 2047, but I have no idea whether browsers actually support that.
EDIT: Whoops, no, RFC 2047 section 5 explicitly says that the encoded form is not permitted in Content-Disposition. Looks like you're out of luck - there is no standard.
EDIT 2: There is a standard - RFC 2231 defines how this is now supposed to work. It has support from some browsers, but is not supported in IE. I found some test cases which demonstrate how it works and what browser support is available.
Answer to question #1: You are confusing Unicode and UTF-8. The Unicode code point of 'ả' is U+1EA3, but that is not its UTF-8 encoding (the %a3%1e you computed is just the code point's bytes in little-endian order). In UTF-8 that character requires three bytes, 0xE1 0xBA 0xA3. URL encoding is poorly defined for non-ASCII data, but %e1%ba%a3 is the valid UTF-8 encoding to use for that character.
For Problem #2 you need to URL encode the file name for both Internet Explorer and Firefox. The only difference is that you need to use the format of RFC 2231 in Firefox.
This applies to Firefox 3 and Internet Explorer 7.
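As a concrete illustration of the two header forms (a Python sketch; the browser detection itself is left out, and the variable names are made up):
from urllib.parse import quote

filename = "Bản Kiểm Kê.doc"
encoded = quote(filename, safe="")   # percent-encodes the UTF-8 bytes: B%E1%BA%A3n%20Ki%E1%BB%83m%20K%C3%AA.doc

# What IE 7 accepts: the percent-encoded name in the plain filename parameter
header_for_ie = 'Content-Disposition: attachment; filename="%s"' % encoded

# The RFC 2231 form that Firefox 3 understands: note filename* and the leading charset declaration
header_rfc2231 = "Content-Disposition: attachment; filename*=UTF-8''%s" % encoded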
In the link you've got above, e1 ba a3 is the UTF-8 encoding of the character mentioned, not the character code.
Answer (sort of) to problem #2:
Since you've discovered that the naming scheme in one browser does not work in the other, your only solution is to do it differently for each browser, similar to the example here.
In case the link goes away, the solution is basically:
1. If the browser is IE, URL-encode the filename
2. Generate the Content-Disposition header
Of course determining if the browser is IE by User-agent (which is about the only way you can do it) is fraught with all sorts of the usual peril.
As North American centric as this sounds, if it is important that this work in a large number of browsers you do not control which may have the User-agent blocked, or modified, then simply avoid UTF-8 encoded characters in the filename and always use "Download" or something.
Unfortunately, there currently is no single way that would work in all User Agents.
See http://greenbytes.de/tech/tc2231/ for test cases, then complain to Microsoft, Google and Apple.