ASN.1 sequence with missing tag/length field - tags

I'm implementing a specification that, as the outermost data type, specifies a sequence
LogMessage ::= SEQUENCE {
    version INTEGER (4),
    ...
}
When encoded, I would expect the messages to always start with 30, but this is not the case. Indeed what I see when I look at the messages is that the inner part of the SEQUENCE is encoded, but the outer definition (30 and length field of payload) is omitted.
I cannot find why this is happening (the ASN.1 looks just "normal"). Is there a special mode in which this behavior can be seen? Of course I can "manually" truncate that leading data, but I would like to build a robust solution, and if this is somehow defined in ASN.1, I'm sure my library (pyasn1) would have an option for it.
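For reference, here is a minimal, hypothetical sketch (plain Python, no pyasn1) of how BER wraps the inner fields in the outer SEQUENCE tag and length; the byte you would normally expect at the front of the message is 0x30:

```python
def ber_tlv(tag, content):
    # BER tag-length-value, short-form length only (content < 128 bytes)
    return bytes([tag, len(content)]) + content

# inner field: INTEGER 4  ->  02 01 04
inner = ber_tlv(0x02, bytes([0x04]))

# outer wrapper: SEQUENCE  ->  30 03 02 01 04
outer = ber_tlv(0x30, inner)

print(outer.hex())  # 3003020104
```

A message that starts with the inner `02 ...` octets instead of `30 ...` is missing exactly that outer wrapper.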

Related

Is empty string value generally allowed by the FIX protocol?

When I look at the definition of a String type in the FIX protocol (e.g. here or here), I don't see a minimum length specified. Is it allowed to use empty strings? One online decoder seems to accept an empty string value (see tag 320); another complains that it's invalid.
The FIX 4.4 specification states the following (emphasis in the original text):
Each message is constructed of a stream of <tag>=<value> fields with a
field delimiter between fields in the stream. Tags are of data type
TagNum. All tags must have a value specified. Optional fields without
values should simply not be specified in the FIX message. A Reject
message is the appropriate response to a tag with no value.
That strongly suggests (but does not unambiguously state) to me that the use of an empty value for a string is invalid. It is unsurprising to me that different FIX implementations might treat this edge case in different ways. So, I think the best approach is to avoid using empty values for strings.
+1 for Ciaran's and Grant's answer/comments. Just want to add something.
I generally suggest looking up things like this in the most current specification, since specifications usually get refined/reworded/clarified over time to eliminate unclear or ambiguous statements from older versions.
The answer is on the very page you link to in your question (emphasis mine, search for "Well-formed field"): https://www.fixtrading.org/standards/tagvalue-online/#field-syntax
A well-formed field has the form:
tag=value<SOH>
A field shall be considered malformed if any of the following occurs as a result of encoding:
the tag is empty
the tag delimiter is missing
the value is empty
the value contains an <SOH> character and the datatype of the field is not data or XMLdata
the datatype of the field is data and the field is not immediately preceded by its associated Length field.
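As an illustration only (not any official library), the malformed-field rules above can be sketched for the simple case of non-data fields; the data/XMLdata special cases and the Length-field pairing are deliberately not modeled:

```python
SOH = '\x01'  # the FIX field delimiter

def is_well_formed(field):
    # field is a single "tag=value" unit with the trailing <SOH> removed
    if '=' not in field:
        return False            # tag delimiter missing
    tag, _, value = field.partition('=')
    if not tag:
        return False            # empty tag
    if not value:
        return False            # empty value -> malformed per the spec
    if SOH in value:
        return False            # <SOH> only allowed in data/XMLdata
    return True

print(is_well_formed('320=ABC'))  # True
print(is_well_formed('320='))     # False: empty value
```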

Is it legal to repeat the same value in a MULTIPLECHARVALUE or MULTIPLESTRINGVALUE field?

Let's assume a FIX field is of type MULTIPLECHARVALUE or MULTIPLESTRINGVALUE, and the enumerated values defined for the field are A, B, C and D. I know that "A C D" is a legal value for this field, but is it legal for a value to be repeated in the field? For example, is "A C C D" legal? If so, what are its semantics?
I can think of three possibilities:
"A C C D" is an invalid value because C is repeated.
"A C C D" is valid and semantically the same as "A C D". In other words, set semantics are intended.
"A C C D" is valid and has multiset/bag semantics.
Unfortunately, I cannot find any clear definition of the intended semantics of MULTIPLECHARVALUE and MULTIPLESTRINGVALUE in FIX specification documents.
The FIX50SP2 spec does not answer your question, so I can only conclude that any of the three interpretations could be considered valid.
Like so many questions with FIX, the true answer depends on the counterparty you are communicating with.
So my answer is:
if you are the client app, ask your counterparty what they want (or check their docs).
if you are the server app, you get to decide. Your docs should tell your clients how to act.
If it helps, the QuickFIX/n engine treats MultipleCharValue/MultipleStringValue fields as strings, and leaves it to the application code to parse out the individual values. Thus, it's easy for a developer to support any of the interpretations, or even different interpretations for different fields. (I suspect the other QuickFIX language implementations are the same.)
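For example, if you receive the raw string yourself (as QuickFIX hands it to you), all three interpretations are easy to implement; this is just a sketch of the parsing, not anything from QuickFIX itself:

```python
raw = 'A C C D'          # hypothetical MultipleCharValue payload
tokens = raw.split(' ')

# multiset/bag semantics: keep duplicates
bag = tokens                           # ['A', 'C', 'C', 'D']

# set semantics: collapse duplicates, preserving order
unique = list(dict.fromkeys(tokens))   # ['A', 'C', 'D']

# strict semantics: reject the field when any value repeats
is_valid_strict = len(tokens) == len(set(tokens))  # False
```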
The definition of the MultipleValueString type is a string field containing one or more space-delimited values. I haven't got the official spec, but there are a few locations where this definition can be found:
https://www.onixs.biz/fix-dictionary/4.2/index.html#MultipleValueString (I know onixs.biz to be very faithful to the standard specification)
String field (see definition of "String" above) containing one or more space delimited values.
https://aj-sometechnicalitiesoflife.blogspot.com/2010/04/fix-protocol-interview-questions.html
12. What is MultipleValueString data type? [...]
String field containing one or more space delimited values.
This leaves it up to a specific field of this type whether multiples are allowed or not, though I suspect only a few if any would need to have multiples allowed. As far as I can tell, the FIX specification deliberately leaves this open.
E.g. for ExecInst<18> it would be silly to specify the same instruction multiple times. I would also expect each implementation to behave differently (for instance, one ignoring duplicates, another balking with an error/rejection).

Encoding and decoding implicit tagging

I have a question about explicit and implicit tagging, in the following example
X ::= [APPLICATION 5] IMPLICIT INTEGER
For X, the implicit tag replaces the existing tag on INTEGER with [APPLICATION 5], so the BER encoding of the value 5 would be 45 01 05 in hex. How does the decoder know the type from 45 01 05?
I suspect your real question is, "How can a BER decoder know what to do when implicit tags are used and these tags replace the tags that would otherwise signal the ASN.1 type that needs to be decoded?"
Whether the decoder can handle IMPLICIT tags depends on whether the decoder is informed by the ASN.1 specification, which provides the necessary context. There are requirements imposed on the components of SEQUENCE, SET, and CHOICE to ensure that a decoder can read a tag and know which component needs to be decoded and, therefore, what the type is. This requires knowledge of the ASN.1 specification.
By contrast, a generic BER decoder that is not informed by the ASN.1 specification will have a problem with implicit tags, because it lacks the necessary context to interpret them.
The only way for the decoder to recover the original type from the octet stream is to know what is coming. AFAIK, your decoder should be given a hint about what type to expect in the given circumstances and, most importantly, about the base ASN.1 type that the implicitly tagged type maps to.
Consider checking out this book.
Usually, the BER decoder is generated by an ASN.1 compiler based on the given specification (schema). Then, during the decoding, beside the input encoded data, the users will also specify the type that they want to decode. Using the type information the decoder will know what to decode.
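To make that concrete, here is a small hand-written sketch (not a real ASN.1 toolkit): the raw bytes 45 01 05 only say "APPLICATION 5, one content octet"; the caller has to supply the schema knowledge that this tag maps to INTEGER:

```python
def decode_implicit(data, content_decoder):
    # data is one BER TLV with short-form length; the knowledge that
    # tag 0x45 is really an INTEGER lives in content_decoder
    tag, length = data[0], data[1]
    content = data[2:2 + length]
    return content_decoder(content)

def decode_integer(content):
    # BER INTEGER content octets are big-endian two's complement
    return int.from_bytes(content, 'big', signed=True)

value = decode_implicit(bytes([0x45, 0x01, 0x05]), decode_integer)
print(value)  # 5
```

Without the `decode_integer` hint, the bytes alone are ambiguous: any type could have been implicitly tagged with [APPLICATION 5].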
First, I checked the book "ASN.1: Communication between Heterogeneous Systems", which Ilya Etingof sent me; the following passage gives more details:
"The IMPLICIT marker proceeds as follows: all the following tags, explicitly mentioned or indirectly reached through a type reference, are ignored until the next occurrence (included) of the UNIVERSAL class tag (except if the EXPLICIT marker is encountered before). So, for the type T below:
T ::= [1] IMPLICIT T1
T1 ::= [5] IMPLICIT T2
T2 ::= [APPLICATION 0] IMPLICIT INTEGER
only the tag [1] should be encoded. Another way of explaining the concept of implicit tagging is to say that a tag marked IMPLICIT overwrites the tag that follows it (recursively); hence, for the example above, tag [1] overwrites tag [5], which in turn overwrites tag [APPLICATION 0], which finally overwrites the default tag [UNIVERSAL 2] of the INTEGER type.
A type tagged in implicit mode can be decoded only if the receiving application 'knows' the abstract syntax, i.e. the decoder has been generated from the same ASN.1 module as the encoder was (and such is the case most of the time)."
So I guess that a negotiation of the ASN.1 specification should be performed in the presentation layer at the beginning of the data transfer.

Subresource and path variable conflicts in REST?

Is it considered bad practice to design a REST API that may have an ambiguity in the path resolution? For example:
GET /animals/{id} // Returns the animal with the given ID
GET /animals/dogs // Returns all animals of type dog
OK, that's contrived, because you would actually just do GET /dogs, but hopefully it illustrates what I mean. From a path-resolution standpoint, it seems like you wouldn't know whether you were looking for an animal with id="dogs" or just all the dogs.
Specifically, I'm interested in whether Jersey would have any trouble resolving this. What if you knew the id to be an integer?
"Specifically, I'm interested in whether Jersey would have any trouble resolving this"
No this would not be a problem. If you look at the JAX-RS spec § 3.7.2, you'll see the algorithm for matching requests to resource methods.
[E is the set of matching methods]...
Sort E using the number of literal characters in each member as the primary key (descending order), the number of capturing groups as a secondary key (descending order) and the number of capturing groups with non-default regular expressions (i.e. not ‘([^ /]+?)’) as the tertiary key (descending order)
So basically it's saying that the number of literal characters is the primary sort key (note that it is short-circuiting; if you win on the primary key, you win). So for example, if a request goes to /animals/cat, @Path("/animals/dogs") would obviously not be in the set, so we don't need to worry about it. But if the request is to /animals/dogs, then both methods would be in the set. The set is then sorted by the number of literal characters. Since @Path("/animals/dogs") has more literal characters than @Path("/animals/"), the former wins. The capturing group {id} doesn't count towards literal characters.
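The sorting rule can be sketched outside Java, too. This toy router (an illustration, not Jersey's actual code) expands each template variable to the JAX-RS default capturing group and picks the matching template with the most literal characters:

```python
import re

templates = ['/animals/{id}', '/animals/dogs']

def to_regex(template):
    # replace each {var} with the JAX-RS default group ([^/]+?)
    return re.compile('^' + re.sub(r'\{[^}]*\}', r'([^/]+?)', template) + '$')

def literal_chars(template):
    # count the characters outside template variables
    return len(re.sub(r'\{[^}]*\}', '', template))

def match(path):
    candidates = [t for t in templates if to_regex(t).match(path)]
    # primary sort key: number of literal characters, descending
    return max(candidates, key=literal_chars)

print(match('/animals/dogs'))  # /animals/dogs
print(match('/animals/42'))    # /animals/{id}
```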
"What if you knew the id to be an integer?"
The capture group allows for a regex. So you can use @Path("/animals/{id: \\d+}"). Anything that is not all digits will not match and will lead to a 404, unless of course it is /animals/dogs.

How can I find which value caused a bson.errors.InvalidStringData

I have a system that reads data from various sources and stores them in MongoDB. The data I receive is already properly encoded in utf-8 or in unicode. Documents are loosely related and vary greatly in schema, if you will.
Every now and then, a document has a field value that is pure binary data, like a JPEG image. I know how to wrap that value in a bson.binary.Binary object to avoid the bson.errors.InvalidStringData exception.
Is there a way to tell which part of a document made the pymongo driver raise a bson.errors.InvalidStringData, or do I have to try to convert each field to find it?
(If by chance a binary object happens to be a valid Unicode or UTF-8 string, it will be stored as a string, and that's OK.)
PyMongo has two BSON implementations, one in Python for portability and one in C for speed. _make_c_string in the Python version will tell you what it failed to encode, but the C version, which is evidently what you're using, does not. You can tell which BSON implementation you have with import bson; bson.has_c(). I've filed PYTHON-533; it'll be fixed soon.
(Answering my own question)
You can't tell from the exception, and some rewrite of the driver would be required to support that feature.
The code is in bson/__init__.py. There is a function named _make_c_string that raises InvalidStringData if the string throws a UnicodeError when encoded as UTF-8. The same function is used for both keys and values that are strings.
In other words, at this point in code, the driver does not know if it is dealing with a key or value.
The offending data is passed as a raw string to the exception's constructor, but for a reason I don't understand, it does not come out of the driver.
>>> bad['zzz'] = '0\x82\x05\x17'
>>> try:
...     db.test.insert(bad)
... except bson.errors.InvalidStringData as isd:
...     print isd
...
strings in documents must be valid UTF-8
But that does not matter: you would have to look up the keys for that value anyway.
The best way is to iterate over the values, trying to decode them in utf-8. If a UnicodeDecodeError is raised, wrap the value in a Binary object.
Somewhat like this :
try:
    # This code could deal with other encodings, like latin_1,
    # but that's not the point here
    value.decode('utf-8')
except UnicodeDecodeError:
    value = bson.binary.Binary(str(value))
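Building on that, here is a hypothetical helper that reports which keys of a flat document hold invalid UTF-8 (bson is not needed just to locate the culprits; those are the values to wrap in bson.binary.Binary before inserting):

```python
def find_invalid_utf8_keys(doc):
    # return the keys whose byte-string values fail UTF-8 decoding
    bad = []
    for key, value in doc.items():
        if isinstance(value, bytes):
            try:
                value.decode('utf-8')
            except UnicodeDecodeError:
                bad.append(key)
    return bad

doc = {'name': b'hello', 'zzz': b'0\x82\x05\x17'}
print(find_invalid_utf8_keys(doc))  # ['zzz']
```

Note this sketch only walks the top level; nested documents would need a recursive version.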