UTF-8 gets stored differently on server side (Java) - encoding

I'm trying to figure out the answer to one of my other questions, but anyway, maybe this will help me.
When I persist an entity to the server, the byte[] property holds different information than what I persisted. I'm persisting in UTF-8 to the server.
An example:
{"name":"asd","image":[91,111,98,106,101,99,116,32,65,114,114,97,121,66,117,102,102,101,114,93],"description":"asd"}
This is the payload I send to the server.
This is what the server has
{"id":2,"name":"asd","description":"asd","image":"W29iamVjdCBBcnJheUJ1ZmZlcl0="}
As you can see, the image byte array is different.
What I'm trying to do is get the image bytes saved on the server and display them on the front end, but I don't know how to get the original bytes.

No, you are wrong: both versions stored the ASCII string [object ArrayBuffer].
You are confusing the data with its representation. The data is the same, but your two examples represent the binary data in two different ways:
the first as an array of bytes (decimal representation), the second in a classic encoding for binary data: Base64 (you can recognize it by the trailing = character).
So you just have two different representations of the same data; the data itself is stored in the same manner.
You may need to specify how the binary data should be returned in string form (as in your example), and hence which representation is used.
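To see it concretely, here is a quick standalone check in Java, using only the two values quoted above: it decodes both representations and confirms they are the same 20 bytes, namely the ASCII text [object ArrayBuffer].

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.Base64;

public class RepresentationCheck {
    public static void main(String[] args) {
        // The decimal byte array exactly as it appeared in the request payload
        byte[] fromJsonArray = {91, 111, 98, 106, 101, 99, 116, 32, 65, 114,
                                114, 97, 121, 66, 117, 102, 102, 101, 114, 93};

        // The Base64 string exactly as the server returned it
        byte[] fromBase64 = Base64.getDecoder().decode("W29iamVjdCBBcnJheUJ1ZmZlcl0=");

        System.out.println(Arrays.equals(fromJsonArray, fromBase64));          // true
        System.out.println(new String(fromBase64, StandardCharsets.US_ASCII)); // [object ArrayBuffer]
    }
}
```

Since both sides hold only that literal text, there are no original image bytes on the server to recover; the image is being lost on the client side, before the payload is sent.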

Related

How to implement MongoDB ObjectId validation from scratch?

I'm developing a front-end app where I want to support searching data by the id, so I'm going to have an "object id" field. I want to validate the object id to make sure it's a valid MongoDB ObjectId before sending it to the API.
So I searched for how to do it and found this thread, where all the answers suggest using an implementation provided by a MongoDB driver or ORM such as mongodb or mongoose. However, I don't want to go that way, because I don't want to install an entire database driver/ORM in my front-end app just to use some id validation - I'd rather implement the validation myself.
Unfortunately, I couldn't find an existing implementation. Then I tried checking the ObjectId spec and implementing the validation myself, but that didn't work out either.
The specification says...
The ObjectID BSON type is a 12-byte value consisting of three different portions (fields):
a 4-byte value representing the seconds since the Unix epoch in the highest order bytes,
a 5-byte random number unique to a machine and process,
a 3-byte counter, starting with a random value.
Which doesn't make much sense to me. When it says the ObjectId has 12 bytes, it makes me think that the string representation is going to have 12 characters (1 byte = 1 char), but it doesn't. Most object ids have 24 characters.
Finally, I searched mongodb's and mongoose's source code, but I didn't have much luck with that either. The best I could do was find this line of code, but I don't know where to go from there.
TL;DR: What is the actual algorithm to check if a given string is a valid MongoDB Object Id?
Your find is correct, you just stopped too early. The isValid comes from the underlying bson library: https://github.com/mongodb/js-bson/blob/a2a81bc1bc63fa5bf3f918fbcaafef25aca2df9d/src/objectid.ts#L297
And yes, you got it right - there is not much to validate. Any 12 bytes can be an ObjectId. The reason you see 24 characters is that not all 256 byte values are printable/readable, so the ObjectId is usually presented in hex format - 2 characters per byte. The regexp to validate the 12-byte hex representation would be /[0-9a-f]{24}/i
TL;DR: check the constructor of ObjectId in the bson library for the official validation algorithm
Hint: you don't need most of it, as you are limited to string input on the frontend.
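Putting that regexp to work, a minimal self-contained check could look like the following (written in Java here; the pattern itself is the portable part, and the sample id is only illustrative). Anchors are added so that strings merely containing 24 hex characters don't pass.

```java
import java.util.regex.Pattern;

public class ObjectIdValidator {
    // 24 hex characters = 12 bytes; anchored so only the exact hex form is accepted
    private static final Pattern OBJECT_ID =
            Pattern.compile("^[0-9a-f]{24}$", Pattern.CASE_INSENSITIVE);

    public static boolean isValidObjectId(String id) {
        return id != null && OBJECT_ID.matcher(id).matches();
    }

    public static void main(String[] args) {
        System.out.println(isValidObjectId("6000e9720c683f1b8e638a49")); // true
        System.out.println(isValidObjectId("asd"));                      // false
        System.out.println(isValidObjectId("6000e9720c683f1b8e638a4"));  // false (23 chars)
    }
}
```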

GRPC test client GUI that supports representing a bytes type as a hex string?

MongoDB's ObjectId type is a 12-byte array. When you view the database, you can see it displayed as: ObjectId("6000e9720C683f1b8e638a49").
We also want to share this value with SQL server and pass it into a GRPC request.
When the same value is stored in MS SQL Server as a binary(12) column, it is displayed as: 0x6000E9720C683F1B8E638A49. It's simple enough to convert this representation to the Mongo representation.
However, when trying to pass it via GRPC as a bytes type, BloomRPC requires that we represent it in the format: "id": [96,0,233,114,12,104,63,27,142,99,138,73]
So I'm looking for a GRPC test client GUI application to replace BloomRPC that will support a hex string format similar to MongoDB or SQL server to represent the underlying byte array. Anyone have a line on something like this that could work?
We could just represent it as a string in the proto, but in my opinion that should be unnecessary: it would require our connected services to convert bytes->string->bytes on every GRPC call. The other two tools seem to be happy having a byte array in the background and representing it as a string in the front end, so if we could just get our last tool to behave the same, that would be great.
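Not a pointer to a specific GUI client, but as a stop-gap the conversion the tooling forces on you is mechanical. A rough sketch of both directions in Java (class and method names are made up for illustration), using the value from the question:

```java
public class HexBytes {
    // "0x6000E9720C683F1B8E638A49" -> 12 raw bytes (Java bytes are signed internally)
    public static byte[] hexToBytes(String hex) {
        if (hex.startsWith("0x") || hex.startsWith("0X")) {
            hex = hex.substring(2);
        }
        byte[] out = new byte[hex.length() / 2];
        for (int i = 0; i < out.length; i++) {
            out[i] = (byte) Integer.parseInt(hex.substring(2 * i, 2 * i + 2), 16);
        }
        return out;
    }

    public static String bytesToHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder(bytes.length * 2);
        for (byte b : bytes) {
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        byte[] id = hexToBytes("0x6000E9720C683F1B8E638A49");

        // Print the unsigned values in the array form BloomRPC expects
        StringBuilder json = new StringBuilder("[");
        for (int i = 0; i < id.length; i++) {
            json.append(id[i] & 0xFF).append(i < id.length - 1 ? "," : "]");
        }
        System.out.println("\"id\": " + json);   // "id": [96,0,233,114,12,104,63,27,142,99,138,73]
        System.out.println(bytesToHex(id));      // 6000E9720C683F1B8E638A49
    }
}
```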

Get BLOB out of Avro FlowFile

I'm retrieving some binary files (in this case, some PDFs) from a database using ExecuteSQL, which returns the result in an Avro FlowFile. I can't figure out how to get the binary result out of the Avro records.
I've tried using ConvertAvroToJSON, which gives me an object like:
{"MYBLOB": {"bytes": "%PDF-1.4\n [...] " }}
However, using EvaluateJSONPath and grabbing $.MYBLOB.bytes causes corruption, because the binary bytes get converted to UTF-8.
None of the record writer options with ConvertRecord seem appropriate for binary data.
The best solution I can think of is to base64-encode the binary before it leaves the database; then I'm dealing with only character data and can decode it in NiFi. But that adds extra steps and I'd prefer not to do it.
You may need a scripted solution in this case (as a workaround) to get the field and decode it using your own encoding. In any case, please feel free to file a Jira case; ConvertAvroToJSON is deprecated, but we should support character sets for the JsonRecordSetWriter in ExecuteSQLRecord/ConvertRecord (if that also doesn't work for you).
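If you do go the scripted route, the core of it is just reading the Avro records and writing the raw bytes of the BLOB field out without ever passing them through a string. A rough sketch using the plain Avro Java API (the field name MYBLOB is taken from the example above; wiring the input/output streams to the FlowFile depends on which scripting processor you use):

```java
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.ByteBuffer;

import org.apache.avro.file.DataFileStream;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericRecord;

public class AvroBlobExtractor {
    // Reads Avro records from 'in' and writes the raw bytes of the MYBLOB field to 'out'
    public static void extract(InputStream in, OutputStream out) throws Exception {
        try (DataFileStream<GenericRecord> records =
                     new DataFileStream<>(in, new GenericDatumReader<GenericRecord>())) {
            while (records.hasNext()) {
                GenericRecord record = records.next();
                ByteBuffer blob = (ByteBuffer) record.get("MYBLOB");
                byte[] bytes = new byte[blob.remaining()];
                blob.get(bytes);
                out.write(bytes); // raw binary, no charset conversion anywhere
            }
        }
    }
}
```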

Outlook EntryID syntax

I am writing a tool to back up my mails. In order to determine whether I have already backed up a mail, I use the EntryID.
The EntryID is, however, very long, so I have problems serializing my data structure as JSON when using the EntryID as the key in a hash.
Furthermore, I noticed that the first part of the EntryID remains identical across all my mails. My suspicion is therefore that the first part identifies the Outlook server and the last part the e-mails themselves, so there should be no need to use the whole EntryID to identify a single mail in my account.
Does anybody know the syntax of this EntryID? I did not find anything on the Microsoft site; maybe I used the wrong query.
Thx a lot
Example of EntryID:
00000000AC032ADC2BFB3545BD2CEE24F67EAFF507000C7E507D761D09469E2B3AC3FA5E65770034EA28BA320000FD962E1BCA05E74595C077ACB6D7D7D30001C72579700000
Quite long, isn't it?
All entry ids must be treated as black boxes. The first 4 bytes (8 hex characters) are the flags (0s for the long-term entry id). The next 16 bytes (32 hex characters) are the provider UID registered with MAPI; everything after that is provider-specific data.
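For inspection only (for example, to confirm that the common prefix you noticed is exactly the flags plus the provider UID), you can slice the hex string along the boundaries described above; just don't build any logic on what the provider-specific part contains. A small sketch using the EntryID from the question:

```java
public class EntryIdInspector {
    public static void main(String[] args) {
        String entryId = "00000000AC032ADC2BFB3545BD2CEE24F67EAFF507000C7E507D761D09469E2B"
                       + "3AC3FA5E65770034EA28BA320000FD962E1BCA05E74595C077ACB6D7D7D30001C72579700000";

        String flags        = entryId.substring(0, 8);   // first 4 bytes: flags
        String providerUid  = entryId.substring(8, 40);  // next 16 bytes: provider UID
        String providerData = entryId.substring(40);     // the rest: provider-specific, opaque

        System.out.println("flags        : " + flags);
        System.out.println("provider UID : " + providerUid);
        System.out.println("provider data: " + providerData);
    }
}
```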

Deduplication Suggestions for Email Storage

The proposed storage model is to store attachments in separate files (or blobs), and to store the email itself as a MIME multipart message with references to the attached files and how they were encoded. This allows the user to Show Original, but does not require me to actually store the less efficient base64 with the message. Most of the time I will be able to store just the base64 line length that was used.
This way, we can perform attachment-level deduplication.
But how can the deduplication go further? Here are my thoughts:
All attachments and emails could be compressed (byte-level deduplicated) individually of course.
I could compress sets of maybe 12 attachments together in a single file. Compressing multiple files of the same type (for example, PDFs), even those from the same sender, may be more effective.
The MIME messages can also be compressed in sets.
I am not concerned about search efficiency because there will be full-text indexing.
Searching the emails would of course use a type of full-text index, which would not be compressed.
A decompressed cache would be created when an email first arrives, and would only be deleted after the email has not been viewed for a time.
Do you have any advice in this area? What is normal for an email storage system?
decode all base64 MIME parts, not only attachments
calculate a secure hash of each part's content
replace the part with a reference in the email body, or create a custom header with the list of extracted MIME parts
store it in blob storage under the secure hash (content-addressable storage); see the sketch after this list
use a reference counter for deletions and garbage collection, or a smarter double counter (https://docs.wildduck.email/#/in-depth/attachment-deduplication, https://medium.com/@andrewsumin/efficient-storage-how-we-went-down-from-50-pb-to-32-pb-99f9c61bf6b4)
or store each hash-to-email-id reference relation in the db
carefully check and control base64 folds; some emails have a shorter line in the middle, some have additional characters (dot, whitespace) at the end
store the encoding parameters (folds, tail) in the reference in the email body for exact reconstruction
compress compressible attachments, but be careful with content-addressable storage, because compression changes the content hash
JPEG images can be significantly and losslessly compressed using JPEG XL or https://github.com/dropbox/lepton
WAV files can be compressed using FLAC, etc.
the content type is sender-specified; the same attachment can arrive with different content types
quoted-printable encoded parts are hard to decode and reconstruct exactly; there are many encoder parameters, because each encoder escapes different characters and folds lines differently
be careful with the reference format, so that a malicious sender cannot craft an email containing a reference and fetch an attachment they do not own; or detect and escape references in received emails
small MIME parts may not be worth extracting until a certain number of duplicates are present in the system
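A minimal sketch of the decode / hash / store steps from the first few points above (the class name and on-disk layout are made up for illustration; reference counting, fold/tail bookkeeping and the exact reference format are left out):

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.security.MessageDigest;
import java.util.Base64;

public class AttachmentDeduplicator {
    private final Path blobRoot; // content-addressable store: one file per unique content hash

    public AttachmentDeduplicator(Path blobRoot) {
        this.blobRoot = blobRoot;
    }

    // Decodes one base64-encoded MIME part, stores the decoded bytes under their
    // SHA-256 hash, and returns the hash to use as the reference in the email body.
    public String extractPart(String base64Body) throws Exception {
        byte[] content = Base64.getMimeDecoder().decode(base64Body); // tolerant of folded lines
        String hash = toHex(MessageDigest.getInstance("SHA-256").digest(content));

        Path blob = blobRoot.resolve(hash);
        if (!Files.exists(blob)) {   // same content already stored -> nothing to write
            Files.write(blob, content);
        }
        return hash;
    }

    private static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder(bytes.length * 2);
        for (byte b : bytes) {
            sb.append(String.format("%02x", b & 0xff));
        }
        return sb.toString();
    }
}
```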