Unique identifier for an email - email

I am writing a C# application which allows users to store emails in a MS SQL Server database. Many times, multiple users will be copied on an email from a customer. If they all try to add the same email to the database, I want to make sure that the email is only added once.
MD5 springs to mind as a way to do this. I don't need to worry about malicious tampering, only to make sure that the same email will map to the same hash and that no two emails with different content will map to the same hash.
My question really boils down to how one would combine multiple fields into one MD5 (or other) hash value. Some of these fields will have a single value per email (e.g. subject, body, sender email address) while others will have multiple values (varying numbers of attachments, recipients). I want to develop a way of uniquely identifying an email that will be platform and language independent (not based on serialization). Any advice?

What volume of emails do you plan on archiving? If you don't expect the archive to require many terabytes, I think this is a premature optimization.
Since each field can be represented as a string or array of bytes, it doesn't matter how many values it contains, it all looks the same to a hash function. Just hash them all together and you will get a unique identifier.
EDIT Pseudocode example
# initialize the hash object
hash = md5()
# compute the hashes for each field
hash.update(from_str)
hash.update(to_str)
hash.update(cc_str)
hash.update(body_str)
hash.update(...) # the rest of the email fields
# compute the identifier string
id = hash.hexdigest()
You will get the same output if you replace all the update calls with
# concatenate all fields and hash
hash.update(from_str + to_str + cc_str + body_str + ...)
How you extract the strings and interface will vary based on your application, language, and api.
It doesn't matter that different email clients might produce different formatting for some of the fields when given the same input; this will give you a hash unique to the original email.
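One caveat with plain concatenation is that field boundaries are lost, so the values ("ab", "c") and ("a", "bc") produce the same input to the hash. The following is a minimal Python sketch of one way around that, assuming each field arrives as a name plus a list of string values. The function name `email_fingerprint`, the length-prefix layout, and the use of SHA-256 instead of MD5 are illustrative choices, not something from the question.

```python
import hashlib

def email_fingerprint(fields):
    """Hash (name, values) pairs; values is a list of strings (hypothetical layout)."""
    h = hashlib.sha256()  # swapped in for MD5; any stable hash works the same way
    for name, values in fields:
        # Feed the field name, the number of values, then each value,
        # each prefixed with its byte length so boundaries stay unambiguous.
        for piece in [name, str(len(values)), *values]:
            data = piece.encode("utf-8")
            h.update(len(data).to_bytes(8, "big"))
            h.update(data)
    return h.hexdigest()

same_a = email_fingerprint([("to", ["ab", "c"])])
same_b = email_fingerprint([("to", ["ab", "c"])])
shifted = email_fingerprint([("to", ["a", "bc"])])
print(same_a == same_b, same_a == shifted)
```

Multi-valued fields such as recipients would probably need to be put in a fixed order (for example sorted) before hashing, so that two copies listing the same recipients in a different order still match.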

Have you looked at some other headers like (in my mail, OS X Mail):
X-Universally-Unique-Identifier: 82d00eb8-2a63-42fd-9817-a3f7f57de6fa
Message-Id: <EE7CA968-13EB-47FB-9EC8-5D6EBA9A4EB8@example.com>
The Message-Id, at least, is expected on practically every message (RFC 5322 says it SHOULD be present). That field should be the same for every copy of the same mailing (sent to multiple recipients). Using it would be more effective than hashing.
Not the answer to the question, but maybe the answer to the problem :)
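As a rough illustration of that idea, here is a small Python sketch that pulls the Message-Id out of a raw message with the standard library `email` package. The sample message and the helper name `message_id` are made up for the example, and a real application would still need a fallback for messages that lack the header.

```python
from email import message_from_string

# Hypothetical raw message, used only for illustration.
RAW = (
    "From: customer@example.com\n"
    "To: support@example.com\n"
    "Message-Id: <EE7CA968-13EB-47FB-9EC8-5D6EBA9A4EB8@example.com>\n"
    "Subject: Hello\n"
    "\n"
    "Body text.\n"
)

def message_id(raw):
    """Return the Message-ID header value, or None if the header is missing."""
    msg = message_from_string(raw)
    value = msg.get("Message-ID")  # header lookup is case-insensitive
    return value.strip() if value else None

print(message_id(RAW))
```

The returned string could then serve as the candidate unique key in the database, with a content hash as the fallback when it is `None`.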

Why not just hash the raw message? It already encodes all the relevant fields except the envelope sender and recipient, and you can add those as headers yourself, before hashing. It also contains all the attachments, the entire body of the message, etc, and it's a natural and easy representation. It also doesn't suffer from the easily generated hash collisions of mikerobi's proposal.

Design REST endpoint with sublist as query param

I have a GET request that will give me a winner based on a list of inputs.
e.g. [{rabbit:3, tiger:2}, {rabbit:1, donkey:3}, {bird:2}] // the winner is {rabbit:1, donkey:3}
I would like to design a GET endpoint that will take a list.
One way I could think of is like this:
/GET
winner?rabbit,3?tiger,2&rabbit,1?donkey,3
A request param map would look like key:{rabbit,3?tiger,2}: value=[]
Alternatively, I could do:
/GET
winner?id1=rabbit,3?tiger,2&id2=rabbit,1?donkey,3
but I don't need the id information at all.
While this serves the purpose for what I need, I am wondering what the best way would be to represent a query param with a sub-object.
There really isn't a great answer here.
As far as HTTP is concerned, any spelling that is consistent with the production rules described by RFC 3986 is fine.
If you have a representation that is easily described by a URI Template, then you (and your clients) can take advantage of the general purpose template libraries.
BUT... templates are not so flexible that they can be used to describe arbitrary message schemas. We've got support for strings, and lists (of strings) and associative arrays (of strings), and... that's pretty much it.
On the web, we might handle an arbitrary case using a form with a textarea control that accepts a string representation of the message; the browser would then create a key value pair, where the value is an encoded representation of the information in the text area.
So, for example, you could copy a string representation of a JSON document into the form, submit the form, and the browser would compose the matching query-part. On the server, you would reverse the process to get at the JSON document.
There's nothing particularly magic about using a key value pair, of course. Another possibility would be to ignore the key of the key value, and just use the properly encoded value as the query. Again, the server just reverses the process.
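A minimal Python sketch of the form-style round trip described above, assuming the list is serialized as JSON and carried in a single query parameter. The parameter name `q` and the `/winner` path are placeholders for the example.

```python
import json
from urllib.parse import urlencode, parse_qs

entries = [{"rabbit": 3, "tiger": 2}, {"rabbit": 1, "donkey": 3}, {"bird": 2}]

# Client side: serialize the document and percent-encode it as one value.
query = urlencode({"q": json.dumps(entries, separators=(",", ":"))})
url = "/winner?" + query

# Server side: reverse the process to recover the JSON document.
decoded = json.loads(parse_qs(query)["q"][0])
print(decoded == entries)
```

The resulting URL is not pleasant to read, but it carries arbitrary nesting without inventing a new query syntax.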
Another somewhat common attempt is to use key value pairs, treating the keys as "paths" - which is to say each key identifies a location in the original document, and the value indicates the information available at that location.
?/0/rabbit=3&/0/tiger=2&/1/rabbit=1&/1/donkey=3&/2/bird=2
In this example, the schema of the keys is based on JSON Pointer (RFC 6901), which is one possible way to flatten hierarchical data into key value pairs. It may not be "best", but it is at least leaning in the direction of readily standardizable. (A standardized form would be an improvement, but I wasn't able to identify one.)
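The path-style flattening can be sketched in Python as follows. This assumes the input is a list of flat objects with integer values, as in the question; the helper names `flatten` and `unflatten` are invented for the example.

```python
from urllib.parse import urlencode, parse_qsl

def flatten(entries):
    """Turn a list of {name: count} objects into (JSON Pointer, value) pairs."""
    pairs = []
    for index, entry in enumerate(entries):
        for name, count in entry.items():
            # RFC 6901 escapes "~" as "~0" and "/" as "~1" inside a token.
            token = name.replace("~", "~0").replace("/", "~1")
            pairs.append((f"/{index}/{token}", str(count)))
    return pairs

def unflatten(pairs):
    """Rebuild the list from (JSON Pointer, value) pairs."""
    entries = {}
    for key, value in pairs:
        _, index, token = key.split("/", 2)
        name = token.replace("~1", "/").replace("~0", "~")
        entries.setdefault(int(index), {})[name] = int(value)
    return [entries[i] for i in sorted(entries)]

entries = [{"rabbit": 3, "tiger": 2}, {"rabbit": 1, "donkey": 3}, {"bird": 2}]
query = urlencode(flatten(entries), safe="/")  # keep "/" readable in the keys
print(query)  # /0/rabbit=3&/0/tiger=2&/1/rabbit=1&/1/donkey=3&/2/bird=2
```

On the server, `unflatten(parse_qsl(query))` recovers the original list, including which animals belong to which entry.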
The most obvious seems:
GET /winner?rabbit=3&tiger=2&rabbit=1&donkey=3

What is the use of Message-ID in email?

From what I've read, every Message-ID must be unique. However, it is possible to create repeated Message-IDs by forcing the header to a fixed value. So I don't understand the point of saying that the Message-ID should be unique when duplicates are so easy to create. If anyone with a little reading and basic programming knowledge can generate them, why do Message-IDs exist, and what are they used for?
Short answer: For threading in email clients.
The message-id header is defined in RFC 2822:
The "Message-ID:" field contains a single unique message identifier.
The "References:" and "In-Reply-To:" field each contain one or more
unique message identifiers
The message ID is used to show which message is a reply to which other message, for example. That way mail clients can show a tree of emails with their replies even if other things like the Subject don't change. (Counting leading Re:s of the subject line would be a bad way to determine ancestors and children: not every mail client adds them, and some use language specific ones.)
https://datatracker.ietf.org/doc/html/rfc5322#section-3.6.4
in conjunction with the References and In-Reply-To fields, mail clients use Message-ID to organize multiple messages into threads.
https://en.wikipedia.org/wiki/Message-ID
and at least some clients will consider two messages with the same ID to be the same thing and discard one of them.
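To make the threading use concrete, here is a small Python sketch that groups messages into a reply tree using only Message-ID and In-Reply-To. The dictionary layout (`id`, `in_reply_to`) and the function name `build_threads` are assumptions for the example; real clients also consult the References header.

```python
from collections import defaultdict

def build_threads(messages):
    """Return (roots, children) where children maps a Message-ID to its replies."""
    known = {m["id"] for m in messages}
    children = defaultdict(list)
    roots = []
    for m in messages:
        parent = m.get("in_reply_to")
        if parent in known:
            children[parent].append(m["id"])
        else:
            # No parent, or the parent is not in this mailbox: start a new thread.
            roots.append(m["id"])
    return roots, dict(children)

mailbox = [
    {"id": "<a@example.com>"},
    {"id": "<b@example.com>", "in_reply_to": "<a@example.com>"},
    {"id": "<c@example.com>", "in_reply_to": "<b@example.com>"},
    {"id": "<d@example.com>", "in_reply_to": "<missing@example.com>"},
]
roots, children = build_threads(mailbox)
print(roots, children)
```

If two unrelated messages shared a Message-ID, a structure like this could not tell them apart, which is why uniqueness matters even though nothing enforces it.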

outlook EntryId syntax

I am writing a tool to back up my mails. To tell whether I have already backed up a mail, I use the EntryID.
The EntryID is, however, very long, so I have problems serializing my data structure as JSON when I use the EntryID as the key in a hash.
Furthermore, I noticed that the first part of the EntryID is identical across all my mails. I therefore suspect that the first part identifies the Outlook server and the last part the e-mails themselves, so there should be no need to use the whole EntryID to identify a single mail in my account.
Does anybody know the syntax of this EntryID? I did not find anything on the Microsoft site; maybe I used the wrong query.
Thanks a lot
Example of EntryID:
00000000AC032ADC2BFB3545BD2CEE24F67EAFF507000C7E507D761D09469E2B3AC3FA5E65770034EA28BA320000FD962E1BCA05E74595C077ACB6D7D7D30001C72579700000
Quite long, isn't it?
All entry ids must be treated as black boxes. The first 4 bytes (8 hex characters) are the flags (0s for the long term entry id). Next 16 bytes (32 hex characters) are the provider UID registered with the M
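For inspection only, the two leading fields described in the answer (4 bytes of flags, then a 16-byte provider UID) can be sliced off a hex-encoded entry ID as in this Python sketch. The function name `split_entry_id` is made up, and the rest of the ID is left unparsed, in keeping with the advice to treat entry IDs as black boxes.

```python
def split_entry_id(entry_id_hex):
    """Split a hex entry ID into (flags, provider_uid, remainder), all as hex strings."""
    raw = bytes.fromhex(entry_id_hex)
    flags = raw[:4]           # first 4 bytes (8 hex characters)
    provider_uid = raw[4:20]  # next 16 bytes (32 hex characters)
    remainder = raw[20:]      # left opaque on purpose
    return flags.hex().upper(), provider_uid.hex().upper(), remainder.hex().upper()

SAMPLE = (
    "00000000AC032ADC2BFB3545BD2CEE24F67EAFF507000C7E507D761D09469E2B3AC3FA5E"
    "65770034EA28BA320000FD962E1BCA05E74595C077ACB6D7D7D30001C72579700000"
)
flags, provider_uid, remainder = split_entry_id(SAMPLE)
print(flags, provider_uid)
```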

What's the most efficient way to get all emails with a specific type of attachment over IMAP?

From what I can tell, IMAP SEARCH doesn't support searching by whether an email has attachments (except Gmail's variation, which I'm not interested in...I need a general IMAP solution). Is that correct?
Assuming that's the case, my understanding is that I have to issue a FETCH and filter on the client side.
If this is correct, what's the FETCH that will yield the smallest amount of information that will allow me to filter by attachment type? I believe it's FETCH BODYSTRUCTURE, but I'd like confirmation.
I looked at FETCH BODY[MIME], but it appears that it needs a section number (or numbers) and that MIME can't be used by itself. I believe that there can be any number of sections and subsections, and there's no way to specify to search all sections. Is that correct?
I'm looking for a protocol level answer. I don't need an answer using any specific language or library.
Thanks!
Generally, to get all attachments, you first look up their number and names in imap_fetchstructure()->parts, which is an array describing the parts of the message.
Then, to get a file's content, you call imap_fetchbody with the attachment's index plus 1 as the section number.
For example, attachment number one is found in section number 2.
I built my IMAP solution this way and it's working well.
Based on that, you can add your own search step.
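A rough Python sketch of the client-side filter the question describes, assuming the server's BODYSTRUCTURE response for one message is already available as a string. The sample response and the helper name `has_part_type` are invented, and the regular expression is a crude check rather than a full BODYSTRUCTURE parser: it relies on each non-multipart part starting with a quoted type and subtype.

```python
import re

def has_part_type(bodystructure, maintype, subtype):
    """Heuristically test whether any part is declared as maintype/subtype."""
    pattern = r'\("%s"\s+"%s"' % (re.escape(maintype), re.escape(subtype))
    # MIME types are case-insensitive, and servers differ in the case they return.
    return re.search(pattern, bodystructure, re.IGNORECASE) is not None

# Hypothetical response for a message with a text body and one PDF attachment.
SAMPLE = (
    '(("text" "plain" ("charset" "utf-8") NIL NIL "7bit" 12 1 NIL NIL NIL NIL)'
    '("application" "pdf" ("name" "report.pdf") NIL NIL "base64" 4096 NIL '
    '("attachment" ("filename" "report.pdf")) NIL NIL) '
    '"mixed" ("boundary" "abc") NIL NIL NIL)'
)
print(has_part_type(SAMPLE, "application", "pdf"))
```

Only messages that pass this check would then need a follow-up fetch of the matching section.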

How to separate a person's identity from his personal data?

I'm writing an app whose main purpose is to keep a list of users'
purchases.
I would like to ensure that even I as a developer (or anyone with full
access to the database) could not figure out how much money a
particular person has spent or what he has bought.
I initially came up with the following scheme:
--------------+------------+-----------
user_hash | item | price
--------------+------------+-----------
a45cd654fe810 | Strip club | 400.00
a45cd654fe810 | Ferrari | 1510800.00
54da2241211c2 | Beer | 5.00
54da2241211c2 | iPhone | 399.00
User logs in with username and password.
From the password calculate user_hash (possibly with salting etc.).
Use the hash to access users data with normal SQL-queries.
Given enough users, it should be almost impossible to tell how much
money a particular user has spent by just knowing his name.
Is this a sensible thing to do, or am I completely foolish?
I'm afraid that if your application can link a person to their data, any developer/admin can.
The only thing you can do is making it harder to do the link, to slow the developer/admin, but if you make it harder to link users to data, you will make it harder for your server too.
Idea based on @no's idea:
You can have a classic user/password login to your application (hashed password, or whatever), and a special "pass" used to keep your data secure. This "pass" wouldn't be stored in your database.
When your client logs in to your application, they would have to provide user/password/pass. The user/password is checked against the database, and the pass is used to load/write data.
When you need to write data, you make a hash of the "username/pass" pair and store it as the key linking your client to your data.
When you need to load data, you make a hash of the "username/pass" pair and load all the data matching this hash.
This way it's impossible to make a link between your data and your user.
On the other hand (as I said in a comment to @no), beware of collisions. Plus, if your user types a wrong "pass", you can't check it.
Update: For the last part, I had another idea: you can store in your database a hash of the "pass/password" pair, so that you can check whether the "pass" is okay.
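The scheme above might look roughly like this in Python. It is only a sketch under stated assumptions: PBKDF2 with SHA-256 stands in for "a hash", the iteration count and the `data:` / `verify:` salt prefixes are arbitrary, and the function names are invented for the example.

```python
import hashlib

ITERATIONS = 100_000  # assumed work factor, tune for your hardware

def _derive(secret, salt):
    digest = hashlib.pbkdf2_hmac(
        "sha256", secret.encode("utf-8"), salt.encode("utf-8"), ITERATIONS
    )
    return digest.hex()

def data_key(username, secret_pass):
    """Key stored next to the purchases; the pass itself is never stored."""
    return _derive(secret_pass, "data:" + username)

def pass_verifier(secret_pass, password):
    """Value stored with the account so that a mistyped pass can be detected."""
    return _derive(secret_pass, "verify:" + password)

key = data_key("alice", "correct horse")
print(key == data_key("alice", "correct horse"), key == data_key("alice", "wrong"))
```

Anyone who can observe the pass at login time can still recompute the key, so this only protects against someone who has the database and nothing else.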
Create a users table with:
user_id: an identity column (auto-generated id)
username
password: make sure it's hashed!
Create a product table like in your example:
user_hash
item
price
The user_hash will be based off of user_id which never changes. Username and password are free to change as needed. When the user logs in, you compare username/password to get the user_id. You can send the user_hash back to the client for the duration of the session, or an encrypted/indirect version of the hash (could be a session ID, where the server stores the user_hash in the session).
Now you need a way to hash the user_id into user_hash and keep it protected.
1. If you do it client-side as @no suggested, the client needs to have user_id. Big security hole (especially if it's a web app): the hash can easily be tampered with, and the algorithm is freely available to the public.
2. You could have it as a function in the database. Bad idea, since the database has all the pieces to link the records.
3. For web sites or client/server apps you could have it in your server-side code. Much better, but then one developer has access to the hashing algorithm and data.
4. Have another developer write the hashing algorithm (which you don't have access to) and stick it on another server (which you also don't have access to) as a TCP/web service. Your server-side code would then pass the user ID and get a hash back. You wouldn't have the algorithm, but you can send all the user IDs through to get all their hashes back. Not a lot of benefits to #3, though the service could have logging and such to try to minimize the risk.
If it's simply a client-database app, you only have choices #1 and 2. I would strongly suggest adding another [business] layer that is server-side, separate from the database server.
Edit:
This overlaps some of the previous points. Have 3 servers:
Authentication server: Employee A has access. Maintains user table. Has web service (with encrypted communications) that takes user/password combination. Hashes password, looks up user_id in table, generates user_hash. This way you can't simply send all user_ids and get back the hashes. You have to have the password which isn't stored anywhere and is only available during authentication process.
Main database server: Employee B has access. Only stores user_hash. No userid, no passwords. You can link the data using the user_hash, but the actual user info is somewhere else.
Website server: Employee B has access. Gets login info, passes to authentication server, gets hash back, then disposes login info. Keeps hash in session for writing/querying to the database.
So Employee A has user_id, username, password and algorithm. Employee B has user_hash and data. Unless employee B modifies the website to store the raw user/password, he has no way of linking to the real users.
Using SQL profiling, Employee A would get user_id, username and password hash (since user_hash is generated later in code). Employee B would get user_hash and data.
Keep in mind that even without actually storing the person's identifying information anywhere, merely associating enough information all with the same key could allow you to figure out the identity of the person associated with certain information. For a simple example, you could call up the strip club and ask which customer drove a Ferrari.
For this reason, when you de-identify medical records (for use in research and such), you have to remove birthdays for people over 89 years old (because people that old are rare enough that a specific birthdate could point to a single person) and remove any geographic coding that specifies an area containing fewer than 20,000 people. (See http://privacy.med.miami.edu/glossary/xd_deidentified_health_info.htm)
AOL found out the hard way when they released search data that people can be identified just by knowing what searches are associated with an anonymous person. (See http://www.fi.muni.cz/kd/events/cikhaj-2007-jan/slides/kumpost.pdf)
The only way to ensure that the data can't be connected to the person it belongs to is to not record the identity information in the first place (make everything anonymous). Doing this, however, would most likely make your app pointless. You can make this more difficult to do, but you can't make it impossible.
Storing user data and identifying information in separate databases (and possibly on separate servers) and linking the two with an ID number is probably the closest thing that you can do. This way, you have isolated the two data sets as much as possible. You still must retain that ID number as a link between them; otherwise, you would be unable to retrieve a user's data.
In addition, I wouldn't recommend using a hashed password as a unique identifier. When a user changes their password, you would then have to go through and update all of your databases to replace the old hashed password IDs with the new ones. It is usually much easier to use a unique ID that is not based on any of the user's information (to help ensure that it will stay static).
This ends up being a social problem, not a technological problem. The best solutions will be a social solution. After hardening your systems to guard against unauthorized access (hackers, etc), you will probably get better mileage working on establishing trust with your users and implementing a system of policies and procedures regarding data security. Include specific penalties for employees who misuse customer information. Since a single breach of customer trust is enough to ruin your reputation and drive all of your users away, the temptation of misusing this data by those with "top-level" access is less than you might think (since the collapse of the company usually outweighs any gain).
The problem is that if someone already has full access to the database then it's just a matter of time before they link up the records to particular people. Somewhere in your database (or in the application itself) you will have to make the relation between the user and the items. If someone has full access, then they will have access to that mechanism.
There is absolutely no way of preventing this.
The reality is that by having full access we are in a position of trust. This means that the company managers have to trust that even though you can see the data, you will not act in any way on it. This is where little things like ethics come into play.
Now, that said, a lot of companies separate the development and production staff. The purpose is to remove Development from having direct contact with live (i.e., real) data. This has a number of advantages, with security and data reliability being at the top of the heap.
The only real drawback is that some developers believe they can't troubleshoot a problem without production access. However, this is simply not true.
Production staff would then be the only ones with access to the live servers. They will typically be vetted to a larger degree (criminal history and other background checks), commensurate with the type of data you have to protect.
The point of all this is that this is a personnel problem; and not one that can truly be solved with technical means.
UPDATE
Others here seem to be missing a very important and vital piece of the puzzle. Namely, that the data is being entered into the system for a reason. That reason is almost universally so that it can be shared. In the case of an expense report, that data is entered so that accounting can know who to pay back.
Which means that the system, at some level, will have to match users and items without the data entry person (i.e., a salesperson) being logged in.
And because that data has to be tied together without all parties involved standing there to type in a security code to "release" the data, then a DBA will absolutely be able to review the query logs to figure out who is who. And very easily I might add regardless of how many hash marks you want to throw into it. Triple DES won't save you either.
At the end of the day all you've done is make development harder with absolutely zero security benefit. I can't emphasize this enough: the only way to hide data from a dba would be for either 1. that data to only be accessible by the very person who entered it or 2. for it to not exist in the first place.
Regarding option 1, if the only person who can ever access it is the person who entered it.. well, there is no point for it to be in a corporate database.
It seems like you're right on track with this, but you're just overthinking it (or I simply don't understand it).
Write a function that builds a new string based on the input (which will be their username or something else that can't change over time).
Use the returned string as a salt when building the user hash (again, I would use the userID or username as an input for the hash builder because they won't change like the user's password or email).
Associate all user actions with the user hash.
No one with only database access can determine what the hell the user hashes mean. Even an attempt at brute forcing it by trying different seed, salt combinations will end up useless because the salt is determined as a variant of the username.
I think you've answered your own question with your initial post.
Actually, there's a way you could possibly do what you're talking about...
You could have the user type his name and password into a form that runs a purely client-side script which generates a hash based on the name and pw. That hash is used as a unique id for the user, and is sent to the server. This way the server only knows the user by hash, not by name.
For this to work, though, the hash would have to be different from the normal password hash, and the user would be required to enter their name / password an additional time before the server would have any 'memory' of what that person bought.
The server could remember what the person bought for the duration of their session and then 'forget', because the database would contain no link between the user accounts and the sensitive info.
edit
In response to those who say hashing on the client is a security risk: It's not if you do it right. It should be assumed that a hash algorithm is known or knowable. To say otherwise amounts to "security through obscurity." Hashing doesn't involve any private keys, and dynamic hashes could be used to prevent tampering.
For example, you take a hash generator like this:
http://baagoe.com/en/RandomMusings/javascript/Mash.js
// From http://baagoe.com/en/RandomMusings/javascript/
// Johannes Baagoe <baagoe@baagoe.com>, 2010
function Mash() {
  var n = 0xefc8249d;
  var mash = function(data) {
    data = data.toString();
    for (var i = 0; i < data.length; i++) {
      n += data.charCodeAt(i);
      var h = 0.02519603282416938 * n;
      n = h >>> 0;
      h -= n;
      h *= n;
      n = h >>> 0;
      h -= n;
      n += h * 0x100000000; // 2^32
    }
    return (n >>> 0) * 2.3283064365386963e-10; // 2^-32
  };
  mash.version = 'Mash 0.9';
  return mash;
}
See how n changes: each time you hash a string, you get something different.
Hash the username+password using a normal hash algo. This will be the same as the key of the 'secret' table in the database, but will match nothing else in the database.
Append the hashed pass to the username and hash it with the above algorithm.
Base-16 encode var n and append it in the original hash with a delimiter character.
This will create a unique hash (it will be different each time) which can be checked by the system against each column in the database. The system can be set up to allow a particular unique hash only once (say, once a year), preventing MITM attacks, and none of the user's information is passed across the wire. Unless I'm missing something, there is nothing insecure about this.