Identify if a file uses BER encoding

I'm new to ASN.1 encoding, and I'm wondering if a BER encoded file has a header or anything that identifies it as a BER encoded file. I mean, if someone just hands me a file, could I tell that it is BER (or CER or DER) encoded?
Then I could have a function that operates like this:
if FILE is BER-encoded
return "BER"
else if FILE is DER-encoded
return "DER"
else
return "It's something else"
But I'm not sure if BER encoding works that way, or if you have to have something to decode it with before you even know if it's BER.

There is no special header that would identify a BER encoding. However, there is a lot of redundancy in the format so that you can identify many byte sequences as not valid BER if you analyse them fully.
Every DER encoding is also a valid BER encoding, but not necessarily the other way round. You could read a byte sequence as BER, re-encode the abstract value with DER, and check whether you get the same result. If so, it was originally DER.
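For illustration, here is a rough Python sketch of that decode/re-encode test, along the lines of the pseudocode in the question. It uses the third-party pyasn1 package and a hypothetical file name; since it decodes without a schema, it only proves the bytes parse as BER, not that they mean anything:

from pyasn1.codec.ber import decoder as ber_decoder
from pyasn1.codec.der import encoder as der_encoder

def classify(data):
    # Try to read the bytes as a single BER value.
    try:
        value, rest = ber_decoder.decode(data)
    except Exception:
        return "It's something else"
    if rest:  # trailing bytes that are not part of the first value
        return "It's something else"
    # Every DER encoding is valid BER; if re-encoding the decoded value
    # with DER reproduces the input exactly, it was DER all along.
    return "DER" if der_encoder.encode(value) == data else "BER"

print(classify(open("suspect.bin", "rb").read()))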

As Henry says, there is no special header.
However, you can find out whether your file contains BER-encoded data by trying to decode it.
Go to http://asn1-playground.oss.com/, make sure BER is checked, select Binary file in the Data dropdown (upper right box) and upload your binary file.
If your file contains valid BER-encoded data, it will show in the CONSOLE OUTPUT.
However, you may be disappointed by the result: even if it is valid, you may not see any useful information (you need the schema, aka specification, to understand the data).

Related

Has anyone ever heard of ASCIIHEX encoding?

This type of encoding is used in SOAP messages.
I'm receiving a message encoded in ASCIIHEX and I have no idea how this encoding actually works, although I have a clear description of the encoding method:
"If this mode is used, every single original byte is encoded as a sequence of two characters representing it in hexadecimal. So, if the original byte was 0x0a, the transmitted bytes are 0x30 and 0x41 (‘0’ and ‘a’ in ASCII)."
The buffer received : "1f8b0800000000000000a58e4d0ac2400c85f78277e811f2e665329975bbae500f2022dd2978ff95715ae82cdcf9415efec823c6710247582d5965c32c65aab0f5fc0a5204c415855e7c190ef61b34710bcdc7486d2bab8a7a4910d022d5e107d211ed345f2f37a103da2ddb1f619ab8acefe7fdb1beb6394998c7dfbde3dcac3acf3f399f3eeae152012e010000"
The actual file contains this : "63CD13C1697540000000662534034000030000120011084173878R 00000001000018600050000000100460000009404872101367219 000000000000 DNSO_038114 000000002001160023Replacem000000333168625 N0000 00000000"
The provider sent me the file that contains the string above. I tried to start from the buffer string and reproduce the result the provider sent, but with no success. I also tried searching for this "ASCIIHEX" encoding and found nothing. If someone knows anything about this encoding or can give me any advice, I would really appreciate it. I have pretty much no experience with SOAP services.
Based on the comments above, it's possible the buffer is compressed: it starts with 1F 8B, which is the signature for gzip compression. See Wikipedia's list of file signatures.
Write the bytes that correspond to the hex strings into a file. Name that file with a gz or tar.gz extension and try to extract it or open it with some file archiver tool.
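If you would rather check it in memory first, here is a minimal Python sketch of the same idea (the buffer is truncated here; paste the full string from the message):

import gzip

buf = "1f8b0800000000000000a58e4d0a..."  # full ASCIIHEX payload goes here
raw = bytes.fromhex(buf)                 # undo the two-characters-per-byte encoding
print(gzip.decompress(raw).decode("ascii", errors="replace"))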
Another thing you could try would be to not send the Compress element in your request, assuming it's an optional field and you can do that. If you can, check whether the buffer changes, has the proper length, and shows patterns similar to the original content (those zeros at the end, for example).

Is it possible to base64-decode part of a base64-encoded message?

I am working on a project where I am getting parts of base64 encoded data, but not the whole thing. Is it possible to figure out what that part of the base64 encoded data was?
For example. Say I base64 encode hello world
It becomes aGVsbG8gd29ybGQ=
But say I am only able to capture sbG8gd29y
Which base64-decodes to ݽ
I am familiar with how the base64 encoding process works, but I cannot think of a way to figure out what part of a base64-encoded message this is, other than randomly adding data to the front and back of the chunk and comparing the results with dictionary words; the problem is I am not even 100% sure the data I am working with includes dictionary words.
Thanks
I just spent a little time using an online converter (http://www.convertstring.com/EncodeDecode/Base64Decode).
If you take your captured section and run it through the converter, you can see that it's an invalid length for a base64-encoded string.
For a captured section to have a valid length, you will need to add some extra characters (0-3, depending on the length of the section). A valid base64 string has a length that is exactly divisible by 4.
Pick a character ('a' for example) and run through the possibilities of adding the correct number of characters to the section, front and back. With your added characters the string will be decodable, and one of the decoded values will be more readable: that will be the one that has the partially decoded data.
E.g.:
sbG8gd29yaaa
and
aaasbG8gd29y
decodes to:
����ݽɦ�
and
i��lo wor
You can make a rudimentary programmatic test for readability by counting the number of 'normal' characters within the string (a-z, for example). You will need to make up your own mind about what counts as 'normal'; it will depend on the expected language of the data and the context (is it known to be numeric only, for example).
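As a quick Python sketch of that brute force, reusing the captured fragment from the question ('a' is the arbitrary filler character, and printable ASCII stands in for 'normal'):

import base64
import string

fragment = "sbG8gd29y"
printable = set(string.printable.encode())

candidates = []
for front in range(4):
    for back in range(4):
        s = "a" * front + fragment + "a" * back
        s += "=" * (-len(s) % 4)  # pad out to a multiple of 4
        try:
            decoded = base64.b64decode(s)
        except ValueError:
            continue  # this front/back combination is not decodable
        # Score readability by counting printable ASCII bytes.
        score = sum(b in printable for b in decoded)
        candidates.append((score, s, decoded))

# The most readable candidates should contain the partially decoded data.
for score, s, decoded in sorted(candidates, key=lambda c: -c[0])[:3]:
    print(score, s, decoded)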

How to auto-detect a String's encoding?

I have a String which contains values encoded in some way, like Base64.
The problem is that I really don't know if it's actually Base64 (there are A-Z, a-z, 0-9, +, /), so it could be some other encoding that I'm not familiar with.
Is there a way, or an online site, where I can submit an encoded input and it tells me which encoding it is?
NOTE:
I'm not asking how to know if my String is UTF-8 or iso-8859-1 or something like that.
What I need is to know which encoding my String was encoded with.
EDIT:
To be more clear,
I need something to get an input like: 23Nzi4lUE4qlc+Pmc3blWMS1Irmgo3i8UTQHhoL7VyzqpEV/i9bDhoiteZ0a7/TqcVSkrXR89V2Yj7tEFDGJx4gvWEBs= this is the encoded String that I have.
The output should be the type of the encoded String and its decoding, like:
Base64 -> "Big yellow fish is swimming in the tube."
Maybe there is some program which gets an input and tries to decode it with a list of coding types (Base64 and so on). The output doesn't really matter, because it's the user's decision whether it's good or not.
This site handles base64 de/encoding.
Since Base64 is just one instance of a class of encoding schemes (specifically, encoding a bit stream as a base-n number), you probably will never fare better than testing for just a couple of standard encoding schemes.
You either check the well-formedness of the encoding scheme or try to decode without getting an error thrown using a web service or your own code.
In (possibly pathological) cases there will be more than one encoding scheme for which a given octet stream will successfully decode.
Best practice would be to put the effort you would invest in setting up such verification into committing the data provider to one (or a few) encodings up front (that won't always be possible, of course).
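Here is a sketch of that test-a-few-schemes approach in Python, using the sample string from the question; a successful decode is evidence, not proof, since short inputs can be well-formed in several schemes at once:

import base64
import binascii

def possible_encodings(s):
    tests = {
        "base64": lambda t: base64.b64decode(t, validate=True),
        "base32": lambda t: base64.b32decode(t),
        "base16/hex": lambda t: base64.b16decode(t, casefold=True),
    }
    hits = []
    for name, decode in tests.items():
        try:
            decode(s)
        except (binascii.Error, ValueError):
            continue
        hits.append(name)
    return hits

print(possible_encodings("23Nzi4lUE4qlc+Pmc3blWMS1Irmgo3i8UTQHhoL7Vyzq"
                         "pEV/i9bDhoiteZ0a7/TqcVSkrXR89V2Yj7tEFDGJx4gvWEBs="))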

How do I use the StackExchange API from Matlab?

How do I access data from the StackExchange API using Matlab?
The naive
sitedata = urlread('http://api.stackoverflow.com/1.1/questions?tagged=matlab')
fails since the data is compressed. However, when I write this to file (using fprintf(fileID,'%s',sitedata)), I get a zip-file that cannot be uncompressed.
Try urlwrite() instead:
urlwrite('http://api.stackoverflow.com/1.1/questions?tagged=matlab',...
'tempfile.zip')
gunzip('tempfile.zip')
fid = fopen('tempfile');
str = textscan(fid,'%s','Delimiter','\n');
fclose(fid);
A better version of this snippet would use tempname to dynamically generate temporary filenames.
Matlab's urlread assumes you're getting text data back, not binary. The gzip binary data is getting mangled either when urlread is decoding the character data to Unicode values to stick in Matlab chars, or when the formatted-output fprintf function is writing them out, encoding them to UTF-8 or whatever default character encoding you're using for fileID and changing the byte sequence, or maybe both.
IIRC, urlread will default to using ISO-8859-1 encoding, which means the bytes will be turned into the Unicode code points with the same numeric values - effectively just a widening. So you can get the byte data back by doing sitebytes = uint8(sitedata). (That's a regular uint8() conversion, not a typecast().) (If this isn't the case, you can probably fiddle with urlread's CharSet option.)
If you can't get the right bytes out from urlread by fiddling with the encoding and casts, then you can drop down and make calls against the Java HttpAgent like urlread does and bypass the character set decoding step, or fiddle with its options. See the urlread source for how to do it.
Once you have the right bytes in memory, you can write them out to a file using the lower-level fwrite() function, which won't mangle them by doing character set encoding. Then you'll have a valid gzip file of the site's original response. (I think it'll work if you also just use fwrite(fileID, sitedata, 'uint8') directly on the char string, but it's uglier IMHO.)
You can also unzip it in memory using Java classes and save a trip to the filesystem. Do jsitebytes = typecast(sitebytes, 'int8') to get them as Java-friendly signed bytes, then stick them into a ByteArrayInputStream and read them out through a GZIPInputStream. You'll need to build a little Java helper class because Matlab doesn't play well with passing byte[] buffers by reference like java.io wants, but it may be worthwhile if you do a lot of in-memory munging like this.
When working with web services or fancier data downloads (e.g. sites that need sessions or certificates), I've often ended up dropping down and coding directly against the HttpAgent and java.io classes from within Matlab.

How to determine if a file is IBM1047 encoded

I have a bunch of XML files that are declared as encoding="IBM1047" but they don't seem to be:
when converted with iconv from IBM1047 to UTF-8 or ISO8859-1 (Latin 1) they result in indecipherable garbage
file -i <name_of_file> says "unknown 8-bit encoding"
when parsed by an XML parser the parser complains there is text before the prolog but there isn't; this error doesn't happen if I change the encoding in the XML declaration to something else
It would be nice to find out the real encoding of these files. I tried file -i, as mentioned above, and enca, but the latter is limited to Slavic languages (the files are in French).
I have little control about how these files are produced; short of finding the actual encoding, if I can prove conclusively that the files are not in fact IBM1047 I may get the producer to do something about it.
How do I prove it?
Some special chars:
'é' is '©'
'à' is 'ë'
'è' is 'Û'
'ê' is 'ª'
The only way to prove that any class of data streams is encoded or not encoded in a particular way is to know, for at least one instance of the class, exactly what characters are supposed to be in the stream. If you have agreement on what characters are (supposed to be) in a particular test case, you can then calculate the bits that should be in the IBM 1047 (or any other) encoding of the test case, and compare those bits to the bits you actually see.
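Here is a hedged Python sketch of that comparison, assuming the third-party ebcdic package (pip install ebcdic), which registers an IBM1047 codec with Python; the file name and the sample text are hypothetical, and you must substitute text you know the document contains:

import ebcdic  # importing registers cp1047 and other EBCDIC codecs

known = "référence"  # hypothetical: text the document is KNOWN to contain
expected = known.encode("cp1047")

with open("suspect.xml", "rb") as f:
    data = f.read()

print("IBM1047 bytes present:", expected in data)

# What each accented letter should look like on disk if the file
# really were IBM1047:
for ch in "éàèê":
    print(ch, "->", hex(ch.encode("cp1047")[0]))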
One simple way for EBCDIC data to be mangled, of course, is for it to have passed through some EBCDIC/ASCII gateway along the way that used a translate table designed for some other EBCDIC code page. But if you are working with EBCDIC data you presumably already know that.