Deduplication Suggestions for Email Storage - email

The proposed storage model is to store attachments in separate files (or blobs), and to store the email itself as a MIME multipart message, with references to the attached file and how it was encoded. This allows the user to Show Original, but does not require me to actually store the less efficient base64 with the message. Most of the time I will be able to store just the base64 line length that was used.
This way, we can perform attachment-level deduplication.
But how can the deduplication go further? Here are my thoughts:
All attachments and emails could be compressed (byte-level deduplicated) individually of course.
I could compress sets of maybe 12 attachments together in a single file. Compressing multiple files of the same type (for example, PDFs), even those from the same sender, may be more effective.
The MIME messages can also be compressed in sets.
I am not concerned about search efficiency because there will be full text indexing used.
Searching of the emails would of course use a type of full text indexing, that would not be compressed.
Decompressed cache would be created as the email first arrives, and would only be deleted after the email is not viewed for a time.
Do you have any advice in this area? What is normal for an email storage system?

decode all base64 mime parts, not only attachments
calculate secure hash of its content
replace part with reference in email body, or create custom header with list of extracted mime parts
store in blob storage under secure hash (content addresable storage)
use reference counter for deletions and garbage collection, or smarter double counter (https://docs.wildduck.email/#/in-depth/attachment-deduplication, https://medium.com/#andrewsumin/efficient-storage-how-we-went-down-from-50-pb-to-32-pb-99f9c61bf6b4)
or store each reference relation hash-emailid in db
carefully check and control base64 folds, some email have shorter line in middle, some have additional characters (dot, whitespace) at the end
store encoding parameters (folds, tail) in reference in email body for exact reconstruction
compress compressible attachments, be carefull with content addresable storage because compression changes its content hash
jpeg images can be significantly losslessly compressed using JPEG XL or https://github.com/dropbox/lepton
wav files can be compressed using flac, etc.
content-type is sender specified, same attachment can have different content-types
quoted printable encoded parts are hard to decode and reconstruct exactly. There are many encoder parameters, because each encoder escapes different characters and fold lines differently.
be carefull with reference format, so malicious sender could not create email with reference and fetch attachment he does not own. Or detect and escape reference in received emails
small mime parts may not be worth extracting before specific number of duplicities are present in system

Related

Get BLOB out of Avro FlowFile

I've retrieving some binary files (in this case, some PDFs) from a database using ExecuteSQL, which returns the result in an Avro FlowFile. I can't figure out how to get the binary result out of the Avro records.
I've tried using ConvertAvroToJSON, which gives me an object like:
{"MYBLOB": {"bytes": "%PDF-1.4\n [...] " }}
However, using EvaluateJSONPath and grabbing $.MYBLOB.bytes, causes corruption because the binary bytes get converted to UTF8.
None of the record writer options with ConvertRecord seem appropriate for binary data.
The best solution I can think of is to base64 encode the binary before it leaves the database, then I'm dealing with only character data and can decode it in NiFi. But that's extra steps and I'd prefer not to do that.
You may need a scripted solution in this case (as a workaround), to get the field and decode it using your own encoding. In any case please feel free to file a Jira case, ConvertAvroToJSON is deprecated but we should support Character Sets for the JsonRecordSetWriter in ExecuteSQLRecord/ConvertRecord (if that also doesn't work for you).

utf-8 gets stored differently on server side (JAVA)

Im trying to figure out the answer to one of my other questions but anyways maybe this will help me.
When I persist and entity to the server, the byte[] property holds different information than what I persisted. Im persisting in utf-8 to
the server.
An example.
{"name":"asd","image":[91,111,98,106,101,99,116,32,65,114,114,97,121,66,117,102,102,101,114,93],"description":"asd"}
This is the payload I send to the server.
This is what the server has
{"id":2,"name":"asd","description":"asd","image":"W29iamVjdCBBcnJheUJ1ZmZlcl0="}
as you can see the image byte array is different.
WHat im trying to do it get the image bytes saved on the server and display them on the front end. But i dont know how to get the original bytes.
No, you are wrong. Both method stored the ASCII string [object ArrayBuffer].
You are confusing the data with its representation. The data it is the same, but on both examples, you represent binary data in two different way:
The first as an array of bytes (decimal representation), on the second a classic for binary data representation: BASE64 (you may discover it because of final character =.
So you just have different representation of the same data. But so the data is stored on the same manner.
You may need to specify how to get binary data in string form (as in your example), and so the actual representation.

outlook EntryId syntax

I am writing a tool to backup my mails. In order to understand if I have already backed up a mail I use the entryID.
The Entry ID is however very very long and so I have problems in serializing my datastructure with JSON, using the entryID as index in a hash.
Furthermore I noticed that the first part of the entryID remains identic throughout all my mails. Therefore my suspect, that the first part identifies the Outlook Server, and the last part the e-mails themselves. Therefore there should no need to use the whole entryID to identify a single mail in my account.
Anybody knows the syntax of this entryID, I did not find nothing on the Microsoft Site, maybe I did the wrong query.
Thx a lot
Example of EntryID:
00000000AC032ADC2BFB3545BD2CEE24F67EAFF507000C7E507D761D09469E2B3AC3FA5E65770034EA28BA320000FD962E1BCA05E74595C077ACB6D7D7D30001C72579700000
quite long, isntĀ“t it ?
All entry ids must be treated as black boxes. The first 4 bytes (8 hex characters) are the flags (0s for the long term entry id). Next 16 bytes (32 hex characters) are the provider UID registered with the M

Unique identifier for an email

I am writing a C# application which allows users to store emails in a MS SQL Server database. Many times, multiple users will be copied on an email from a customer. If they all try to add the same email to the database, I want to make sure that the email is only added once.
MD5 springs to mind as a way to do this. I don't need to worry about malicious tampering, only to make sure that the same email will map to the same hash and that no two emails with different content will map to the same hash.
My question really boils down to how one would combine multiple fields into one MD5 (or other) hash value. Some of these fields will have a single value per email (e.g. subject, body, sender email address) while others will have multiple values (varying numbers of attachments, recipients). I want to develop a way of uniquely identifying an email that will be platform and language independent (not based on serialization). Any advice?
What volume of emails do you plan on archiving? If you don't expect the archive require many terabytes, I think this is a premature optimization.
Since each field can be represented as a string or array of bytes, it doesn't matter how many values it contains, it all looks the same to a hash function. Just hash them all together and you will get a unique identifier.
EDIT Psuedocode example
# intialized the hash object
hash = md5()
# compute the hashes for each field
hash.update(from_str)
hash.update(to_str)
hash.update(cc_str)
hash.update(body_str)
hash.update(...) # the rest of the email fields
# compute the identifier string
id = hash.hexdigest()
You will get the same output if you replace all the update calls with
# concatenate all fields and hash
hash.update(from_str + to_str + cc_str + body_str + ...)
How you extract the strings and interface will vary based on your application, language, and api.
It doesn't matter that different email clients might produce different formatting for some of the fields when given the same input, this will give you a hash unique to the original email.
Have you looked at some other headers like (in my mail, OS X Mail):
X-Universally-Unique-Identifier: 82d00eb8-2a63-42fd-9817-a3f7f57de6fa
Message-Id: <EE7CA968-13EB-47FB-9EC8-5D6EBA9A4EB8#example.com>
At least the Message-Id is required. That field could well be the same for the same mailing (send to multiple recipients). That would be more effective than hashing.
Not the answer to the question, but maybe the answer to the problem :)
Why not just hash the raw message? It already encodes all the relevant fields except the envelope sender and recipient, and you can add those as headers yourself, before hashing. It also contains all the attachments, the entire body of the message, etc, and it's a natural and easy representation. It also doesn't suffer from the easily generated hash collisions of mikerobi's proposal.

RESTful design - how to model entity's attachments

I am trying to model entity's attachments in REST. Let's say a defect entity can have multiple attachments attached to it. Every attachment has a description and some other properties (last modified, file size...) . The attachment itself is a file in any format (jpeg, doc ...)
I was wondering how should I model it RESTfully
I thought about the following two options:
First approach (using same resource, different representations):
GET , content-type:XML on http://my-app/defects/{id}/attachments will return the defect's
attachments metadata in XML format (description, last modified, file size...)
GET , content-type:gzip on http://my-app/defects/{id}/attachments will return the defect's attachments in a zip file
GET , content-type:mime multi-part on http://my-app/defects/{id}/attachments will return the defect's attachments in a multi-part message (binary data and XML metadata altogether)
POST, content-type:XML on http://my-app/defects/{id}/attachments will create new attachment, metadata only no file attached (then the user has to send PUT request with the binary data)
POST , content-type:mime\multi-part on http://my-app/defects/{id}/attachments will create the attachment, the client can send both metadata and file itself in a single roundtrip
Second approach (separate the attachment's data from the metadata):
GET , content-type:XML on http://my-app/defects/{id}/attachments will return the defect's
attachments metadata in XML format (description, last modified, file size...)
GET , content-type:gzip on http://my-app/defects/{id}/attachments/files will return the defect's attachments binary data in a single zip
Creating a new attachment, first call:
POST, content-type:XML on http://my-app/defects/{id}/attachments will create new attachment, metadata only no file attached (then the user has to send PUT request with the binary data)
Then add the binary data itself:
POST , content-type:mime\multi-part on http://my-app/defects/{id}/attachments/{id}/file will create the attachment file
On one hand the first approach is more robust and efficient since the client can create\get the attachments metadata and binary data in single round trip. On the other hand, I am a bit reluctant to use the mime-multipart representation as it's more cumbersome to consume and produce.
EDIT: I checked out flicker upload REST API. It seems they are using multi part messages to include both the photo and the photo attributes.
Much of this problem has already been solved by the Atom Pub spec. See here
One thing to be careful about in your proposed solutions is that you are using content negotiation to deliver different content. I believe that is considered bad. Content negotiation should only deliver different representations of the same content.
Don't manage metadata separately. A two-part action defeats the point of REST.
One smooth GET/POST/PUT/DELETE with one -- relatively -- complex payload is what's typically done.
The fact that it's multiple underlying "objects" in "tables" is irrelevant to REST.
At the REST level, it's just one complex object's state transmitted with one message.