Strange encoding of an HTTP GET. Lots of square brackets

I'm seeing many packets that appear to be encoded in some fashion, which is puzzling me. Here is a sample: GET/somestuff/abc/fergo-scd.pdf0ð[[#6]][[#3]]u[[#29]][[#31]][[#4]] HTTP/1.1. Any help appreciated.

Related

Can someone tell what encoding this is?

I've been working on a small project and came across some information that has some sort of encoding (I assume).
7C-FC-1B-C9-97-1B-A9-EB-2E-45-2A-73-CE-E3-17-F9
01-3E-6A-50-09-ED-1C-A1-80-A0-27-B9-0C-D3-C4-9D
89-4C-B3-52-4A-B8-93-CB-95-4F-E2-9A-0C-59-7C-FD
Does anyone know what sort of encoding this is? I looked into UTF-8 since this came from a SQL file. No luck there.
I think that is simply written in hexadecimal, not encoded.
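To make that concrete, here is a minimal Python sketch (Python is only my choice for illustration, it is not from the question) that turns the first row back into raw bytes. Each row is 16 bytes; what those bytes represent cannot be told from the hex notation alone.

```python
# First row from the question, copied as-is.
row = "7C-FC-1B-C9-97-1B-A9-EB-2E-45-2A-73-CE-E3-17-F9"

# Hexadecimal is only a notation: drop the dashes and parse pairs of hex digits.
raw = bytes.fromhex(row.replace("-", ""))
print(len(raw))  # 16


def is_valid_utf8(data: bytes) -> bool:
    """Return True if data decodes cleanly as UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# The byte 0xFC never occurs in valid UTF-8, so this row is not UTF-8 text.
print(is_valid_utf8(raw))  # False
```

That matches the "no luck with UTF-8" observation: the rows are binary data shown in hex, not text in some character encoding.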

UTF-8 encoding in emails, parsing the body

So I don't really want this question to be language specific, however I suspect Go (my language choice) is playing a part here.
I'm trying to find a string within the body of a raw email. To do so, I am getting the encoding, and in the majority of cases it is quoted-printable.
OK, so that's fine: I am encoding my search query as quoted-printable and then searching for it. That works.
However, in one specific case the raw email I see in Gmail looks fine, but when I retrieve the raw email from the Gmail API, although the encoding and everything else is identical, it's encoding the " as =22.
Research shows me that's because the charset is utf-8.
I haven't quite got my head around whether that's encoded as utf-8 and then quoted-printable or the other way around, but that's not quite the question either....
If I look at the email where the " is =22, I see the charset is utf-8, and when I look at another where it's not encoded, the charset is UTF-8 (notice the case). I can't believe that the case here is what's causing this to happen, but it doesn't seem a robust enough way to work out whether =22 is actually a literal =22 or a " encoded as utf-8.
My original thought was to always decode the quoted-printable and then re-encode it before doing the search, but I don't think this is going to be a robust approach going forward, and I thought others might have a better suggestion.
In conclusion, I'm trying to find a string in a raw email, but the encoding is causing me problems getting my search string to match the encoding of the body.
The =22-type encoding actually has nothing to do with the charset (whether that is utf-8 lowercase or UTF-8 uppercase or any other charset).
It is the Content-Transfer-Encoding: quoted-printable encoding.
The quoted-printable encoding is just a way of hex-encoding octets, typically limited to octets that fall outside of the printable ASCII range. It seems odd that the DQUOTE character would be encoded in this way, but it's perfectly legal to do so.
If you want to search for strings in the body of the message, you'll need to first decode the body of the message. Otherwise you will not be successful.
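As a rough illustration of decode-then-search, here is a minimal Python sketch. The question uses Go; Python's standard quopri module is used here only for illustration, and body_contains is a hypothetical helper name. It assumes you have already isolated the body of a single text part and know its charset from the Content-Type header.

```python
import quopri


def body_contains(raw_body: bytes, needle: str, charset: str = "utf-8") -> bool:
    """Decode a quoted-printable body, then search the decoded text.

    Assumption: raw_body is the body of one text part whose
    Content-Transfer-Encoding is quoted-printable.
    """
    # Step 1: undo the transfer encoding (=22 becomes the octet 0x22, etc.).
    octets = quopri.decodestring(raw_body)
    # Step 2: interpret the octets using the declared charset.
    text = octets.decode(charset, errors="replace")
    # Step 3: search the decoded text with the plain, unencoded query.
    return needle in text


raw = b"caf=C3=A9 =22ok=22"
print(body_contains(raw, '"ok"'))  # True
print(body_contains(raw, "café"))  # True
```

Because the search runs on decoded text, it does not matter whether the sender chose to encode the " as =22 or left it alone. In Go, the standard library's mime/quotedprintable package provides a decoding reader for the same first step.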
I would recommend reading rfc2045 at a minimum.
You may also end up needing to read rfc2047 if you want to search headers at some point, but that gets... tricky due to various bugs that sending clients have.
Now that I've been "triggered" into a rant about MIME, let me explain why decoding headers is so hard to get right. I'm sure just about every developer who has ever worked on an email client could tell you this, but I guess I'm going to be the one to do it.
Here's just a short list of the problems every developer faces when they go to implement a decoder for headers which have been (theoretically) encoded according to the rfc2047 specification:
1. First off, there are technically two variations of header encoding formats specified by rfc2047 - one for phrases and one for unstructured text fields. They are very similar but you can't use the same rules for tokenizing them. I mention this because it seems that most MIME parsers miss this very subtle distinction and so, as you might imagine, do most MIME generators. Hell, most MIME generators probably never even heard of specifications to begin with it seems.
This brings us to:
2. There are so many variations of how MIME headers fail to be tokenizable according to the rules of rfc2822 and rfc2047. You'll encounter fun stuff such as:
a. encoded-word tokens illegally being embedded in other word tokens
b. encoded-word tokens containing illegal characters in them (such as spaces, line breaks, and more) effectively making it so that a tokenizer can no longer, well, tokenize them (at least not easily)
c. multi-byte character sequences being split between multiple encoded-word tokens which means that it's not possible to decode said encoded-word tokens individually
d. the payloads of encoded-word tokens being split up into multiple encoded-word tokens, often splitting in a location which makes it impossible to decode the payload in isolation
You can see some examples here.
3. Something that many developers seem to miss is the fact that each encoded-word token is allowed to be in different character encodings (you might have one token in UTF-8, another in ISO-8859-1 and yet another in koi8-r). Normally, this would be no big deal because you'd just decode each payload, then convert from the specified charset into UTF-8 via iconv() or something. However, due to the fun brokenness that I mentioned above in (2c) and (2d), this becomes more complicated.
If that isn't enough to make you want to throw your hands up in the air and mutter some profanities, there's more...
4. Undeclared 8bit text in headers. Yep. Some mailers just didn't get the memo that they are supposed to encode non-ASCII text. So now you get to have the fun experience of mixing and matching undeclared 8bit text of God-only-knows what charset along with the content of (probably broken) encoded-words.
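To illustrate points (2c) and (3) only, here is a minimal Python sketch built on the standard library's email.header.decode_header. It is not how GMime or MimeKit do it, and decode_rfc2047 is a hypothetical helper name. As far as I know, decode_header joins the payload bytes of adjacent encoded-words that share a charset before handing them back, which is what lets a split multi-byte character survive. The sketch does nothing for embedded or malformed tokens (2a, 2b) or for undeclared 8bit text (4).

```python
from email.header import decode_header


def decode_rfc2047(value: str) -> str:
    """Decode a header value containing RFC 2047 encoded-words.

    Only well-formed encoded-words are handled; broken charsets
    fall back to replacement characters rather than raising.
    """
    parts = []
    for payload, charset in decode_header(value):
        if isinstance(payload, bytes):
            # Charset conversion happens per run of same-charset payload bytes.
            try:
                parts.append(payload.decode(charset or "ascii", errors="replace"))
            except LookupError:
                # Unknown charset label: keep going with a lossy fallback.
                parts.append(payload.decode("ascii", errors="replace"))
        else:
            # No encoded-words at all: decode_header returns the str unchanged.
            parts.append(payload)
    return "".join(parts)


# (2c) the two octets of "é" are split across two encoded-word tokens.
print(decode_rfc2047("=?UTF-8?Q?caf=C3?= =?UTF-8?Q?=A9?="))
# (3) neighbouring tokens declared in different charsets.
print(decode_rfc2047("=?UTF-8?Q?caf=C3=A9?= =?ISO-8859-1?Q?_na=EFve?="))
```

Decoding each token to text on its own would turn the first example into two replacement characters, which is why the charset conversion has to wait until the payload bytes have been joined.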
If you want to see how to deal with these issues, you can take a look at how I did it using C in my GMime library, here: https://github.com/jstedfast/gmime/blob/master/gmime/gmime-utils.c#L1894 (in case line offsets change in the future, look for _g_mime_utils_header_decode_text() and the various internal methods it uses in that source file - I have written comments explaining how it deals with the above issues).
Or you can see how I did it using C# in my MimeKit library, here: https://github.com/jstedfast/MimeKit/blob/master/MimeKit/Utils/Rfc2047.cs
For more information about why & how dealing with email is hard, check out Joshua Cranmer's blog series: http://quetzalcoatal.blogspot.com/search/label/email-hard

Trim UTF8 from the end of a UTF16 string

Here's an interesting puzzle, and I'd welcome any thoughts that anyone has. I don't think that there's a right or wrong answer here.
My program is loading a file which contains (amongst other things) string data structures. It helpfully declares what type the structure is (UTF8, UTF16 etc), and its length (of course) before the structure, so my program knows how to handle the data. Up to now this has worked perfectly every time.
Now I have been given a data file to load which has rubbish on the end of it - and, when I say rubbish, I mean that it looks like UTF8 at the end of a structure which is declared as UTF16.
D·a·v·e···E·d·m·u·n·d·s·o·n·d`upbqp!c°rÞPrpupÎÐAgâh(28.RïSÿ
The Dave Edmundson part is fine - it's everything after that which needs to be trimmed in this case. The fly in the ointment is that I still need to be able to handle legitimate UTF16 extended characters (like Korean, Chinese etc).
I could just fling my hands up in the air and say, "this data is corrupt", and spit out an error. But I'd like to be able to clean it if at all possible. Any ideas that anyone might have would be very welcome!
This is a logical issue, hence no code - if anyone is interested, I'm using Objective-C, but all I'm really after is some intelligent conversation on how I might approach this issue. I don't need code to be written for me!

What kind of encoding is this and what does it translate to? }(]&[([%!"=))%%!")"

}(]&[([%!"=))%%!")"}"{"={}]+&"{*="!&"&&]!+}"{])"=]#)!!)]"][}{*/[(#{*%*[#=&)}""]}
{]]%/)(="{![{)=&%{}&+{#)%==#"(*})+%%)(+{)(%*{}!&"=&[#]&*)%+/})+/)!!#{%)+)()+!=[(
)=}([={[!{/)+)&/"]!/=!+*%(&/)#")!*[#(#(="][*=+(*&/!)!()+#)#}[%]*"#")*(#]{&*%%)}#
%*({"+)/#&""=&=/})={)}")"%}]%+&*#={)//=+(}/"!{%!!{=%/}}!}](}*"/]&&%=}*[*(&["={%{
}#&){#%[%[)%)+%}/&#%(/=((][}}["]=!&))!/[]#&"=]=[*+#*{])="]"/[#]]*!"}![)%})(&"/*#
...
I've never seen an encoding like this before. I've tried all sorts of online automatic decoders to no avail, and googling "code with curly braces and parenthesis" isn't doing me any good. Above are the first four lines out of exactly 1000 lines.
Here's a pastebin link since it wouldn't fit into the 30000 body character limit. https://pastebin.com/4LfEvd4b

I need help decoding what I believe is a base64 encoded message

I was given an encoded message as a challenge, and I am completely lost. Sorry if the answer to this is extremely obvious, but I cannot figure out what this code is supposed to represent. The code is:
JVCEK6CNIRCXOTKEIFTU2RCFPBGVIQLYJVCECZ2NIRCXQTKUIF4E2RCBM5GUIRLYJVKEC52NIRAWOTKEIV4E2VCBO5GVIRLHJVCEC6CNKRCXOTKUIFTU2RCBPBGUIRLYJVKEKZ2NIRAXQTKEIV4E2VCFM5GUIRLYJVKEC6CNKRCWOTKEIV4E2VCBPBGVIRLHJVCEK6CNKRAXQTKUIVTU2RCBPBGUIRLYJVKECZ2NIRCXQTKUIV3U2RCFM5GUIRLYJVCEK6CNKRCWOTKEIV4E2VCBPBGUIRLHJVCEK6CNKRAXQTKEIFTU2RCFPBGVIQLYJVCEKZ2NIRCXQTKEIF3U2VCBM5GUIRLYJVCEC6CNIRCWOTKEIF4E2RCFPBGVIQLHJVCEK6CNIRAXOTKUIVTU2RCFPBGUIRLYJVKEKZ2NIRCXQTKEIV4E2RCFM5GUIQLYJVCEK6CNKRCWOTKEIV4E2VCBPBGVIRLHJVCEK6CNIRAXOTKEIVTU2RCFPBGVIQLYJVCECZ2NIRCXQTKEIF3U2VCFM5GUIRLYJVCEK52NIRAWOTKEIF4E2VCFPBGVIRLHJVCEK6CNKRAXQTKUIFTU2RCBPBGVIRLYJVCEKZ2NIRCXQTKEIV4E2VCFM5GUIRLXJVCEK52NIRAWOTKEIV4E2RCBPBGVIRLHJVCEC6CNKRAXQTKEIVTU2RCFO5GVIQLXJVKEKZ2NIRCXOTKEIV3U2VCBM5GUIRLXJVKEK52NIRCWOTKEIV3U2VCBO5GVIQLHJVCEK52NIRCXOTKEIFTU2RCFO5GUIQLXJVCEKZ2NIRAXQTKUIF3U2RCBHU======
I believe it is base64 because of the = trailing at the end, but when converting it to ASCII, I get a bunch of nothing (making me think it is something else). I am not really asking for a direct answer to what the string is, but rather where I can go from here. I have already tried converting the binary value to jpg, but to no avail. Any help is greatly appreciated! (I also noticed it repeated JV*E a lot, I do not know whether that is significant or not.)
It's more likely base32 than base64, since there are no lower case letters, 1s, 8s, 9s, +s, or /s. Decoding it as base32 produces:
MDExMDEwMDAgMDExMTAxMDAgMDExMTAxMDAgMDExMTAwMDAgMDExMTAwMTEgMDAxMTEwMTAgMDAxMDExMTEgMDAxMDExMTEgMDExMTAxMTEgMDExMTAxMTEgMDExMTAxMTEgMDAxMDExMTAgMDExMTEwMDEgMDExMDExMTEgMDExMTAxMDEgMDExMTAxMDAgMDExMTAxMDEgMDExMDAwMTAgMDExMDAxMDEgMDAxMDExMTAgMDExMDAwMTEgMDExMDExMTEgMDExMDExMDEgMDAxMDExMTEgMDExMTAxMTEgMDExMDAwMDEgMDExMTAxMDAgMDExMDAwMTEgMDExMDEwMDAgMDAxMTExMTEgMDExMTAxMTAgMDAxMTExMDEgMDExMDExMTEgMDEwMDEwMDAgMDExMDAxMTEgMDAxMTAxMDEgMDEwMTAwMTEgMDEwMDEwMTAgMDEwMTEwMDEgMDEwMTAwMTAgMDEwMDEwMDAgMDEwMDAwMDEgMDAxMTAwMDA=
Hopefully, that helps.
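The base32 output consists only of base64-alphabet characters and ends in =, so the message looks layered. Peeling layers one at a time can be sketched as below. This is a minimal Python sketch on a made-up sample, not the challenge string; looks_like_base32 and peel are hypothetical helper names, and the layer order assumed here (base32, then base64, then space-separated 8-bit binary) is a guess that should be checked by looking at the output of each step.

```python
import base64
import re

# Standard base32 alphabet: A-Z and 2-7, padded with "=" to a multiple of 8.
BASE32_RE = re.compile(r"[A-Z2-7]+=*")


def looks_like_base32(text: str) -> bool:
    """Cheap alphabet and length check; it cannot prove the guess is right."""
    return len(text) % 8 == 0 and BASE32_RE.fullmatch(text) is not None


def peel(text: str) -> str:
    """Assumed layers: base32 -> base64 -> space-separated 8-bit binary -> ASCII."""
    inner_b64 = base64.b32decode(text)
    bits = base64.b64decode(inner_b64).decode("ascii")
    return "".join(chr(int(group, 2)) for group in bits.split())


# Made-up sample: the two-letter message "hi" wrapped in the same three layers.
sample = base64.b32encode(base64.b64encode(b"01101000 01101001")).decode("ascii")
print(looks_like_base32(sample))  # True
print(peel(sample))               # hi
```

The alphabet check is the same reasoning as above: lower case letters or the digits 1, 8 and 9 rule out base32, while their absence in a long string makes base32 the more likely candidate.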