java imap, performance issues, fetching all mails - email

I will use the JavaMail API to handle mail like Thunderbird etc. I have to fetch folders having 1000 messages. My design will be: when the user performs a sync on a folder, I will get all UIDs of the messages in the folder:
Message[] msgs = ufolder.getMessagesByUID(1, UIDFolder.LASTUID);
// Use a suitable FetchProfile
FetchProfile fp = new FetchProfile();
fp.add(FetchProfile.Item.ENVELOPE);
fp.add(FetchProfile.Item.FLAGS);
fp.add(UIDFolder.FetchProfileItem.UID);
// Prefetch the items in one batch instead of one round trip per message
ufolder.fetch(msgs, fp);
I will then compare the list of UIDs with the list stored in my db.
For the deleted ones, for example when a message is not in the folder but is in the db, I will mark it as deleted.
For the new ones, for example when a message is in the folder but not in the db, I will mark it as possibly new. But, because message UIDs are not safe (they can be changed by the mail server in some cases), for the new mails I will additionally use a custom hash value built from the Message-ID header + subject + receive date, combined into an MD5 hash. Only for the possibly new mails will I use this hash to catch the really new ones.
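Roughly, the hash I have in mind would be built like this (a sketch only; the exact field formatting is up to me, and messageHash is just an illustrative helper):

import java.security.MessageDigest;
import javax.mail.internet.MimeMessage;

// Stable key built from headers that survive a move between folders (sketch of the idea above)
static String messageHash(MimeMessage msg) throws Exception {
    String id = msg.getMessageID();                      // Message-ID header
    String subject = msg.getSubject();
    java.util.Date received = msg.getReceivedDate();
    String raw = id + "|" + subject + "|" + (received == null ? "" : received.getTime());
    byte[] digest = MessageDigest.getInstance("MD5").digest(raw.getBytes("UTF-8"));
    StringBuilder hex = new StringBuilder();
    for (byte b : digest) hex.append(String.format("%02x", b));
    return hex.toString();
}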
For the moved messages, because their UIDs will change in the new folder, a moved message will be flagged as deleted in the first folder and will appear as a new message in the new folder, but it will have the same custom hash value because the Message-ID header and the other properties remain the same during the move.
Question about the performance issue: on each click on a folder (folder sync) I will compare all UIDs in the folder with the local UID list stored in the db to learn which ones were deleted. I could not find a better way to accomplish this. As you know, Thunderbird catches a deleted message immediately without a re-login, even if the folder is large and the deleted message is very old (5 years). I think Thunderbird also compares all message UIDs in that folder with a list stored locally.
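The compare step itself is just a set difference against the UIDs stored in the db; roughly (loadUidsFromDb and folderId are placeholders for my storage layer):

import java.util.HashSet;
import java.util.Set;
import javax.mail.Message;

// UIDs currently on the server for this folder
Set<Long> serverUids = new HashSet<>();
for (Message m : msgs) {
    serverUids.add(ufolder.getUID(m));
}
Set<Long> dbUids = loadUidsFromDb(folderId);   // placeholder: UIDs stored locally last time

Set<Long> deleted = new HashSet<>(dbUids);
deleted.removeAll(serverUids);                 // in the db but no longer on the server
Set<Long> possiblyNew = new HashSet<>(serverUids);
possiblyNew.removeAll(dbUids);                 // on the server but not yet in the db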
How can I implement a better mechanism for the sync, to get better performance? Does Thunderbird apply a different approach? How can Thunderbird accomplish it so quickly?
If we were interested only in new messages, I could have kept the last stored UID and compared only messages newer than that, but for the deleted ones I already have to compare the full folder. Additionally, the UIDNEXT value is always -1 on my mail server; even if it were set correctly, it would not help with the deleted ones, so a full compare is a must I think. Am I wrong?
Note: I cannot use or add message listeners because the application is server-client based and the mail handling task is on the server side, and we do not support threads, listeners, etc. The events are triggered from the client, the request is processed on the server, a response is returned, and the client handles the response in the GUI.

What you want is called CONDSTORE or QUICK RESYNC (QRESYNC), RFC 7162 in both cases. That's what Thunderbird uses.
That's a pair of extensions that support commands like "give me all the UIDs that have changed since the last time I connected", "tell me what's been deleted", and so on.
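Roughly, the client remembers the highest MODSEQ it has seen and asks only for what changed since then; QRESYNC additionally reports expunged UIDs at SELECT time. An illustrative raw exchange, assuming the server advertises these capabilities (tags, mailbox name, UIDVALIDITY and MODSEQ values are placeholders):

a1 SELECT INBOX (CONDSTORE)
a2 UID FETCH 1:* (FLAGS) (CHANGEDSINCE 715194045)

The FETCH returns only the messages whose flags changed since the stored MODSEQ. With QRESYNC enabled, a SELECT such as

a3 SELECT INBOX (QRESYNC (1234567 715194045))

also returns VANISHED (EARLIER) responses listing the UIDs expunged since that MODSEQ, which covers the "deleted messages" case without a full UID compare.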

If you can't use threads to listen for these events from the mail server, your options are very limited. Probably the best thing you can do is limit the resynchronization to the messages that are visible to the client.

Related

JavaMail "UID" really unique?

I have recently been working with JavaMail. Right now, I am trying to store all mails in a file. For such a thing, one would need a unique ID, so I assumed UID would fit best here. However, I noticed something odd: a mail in the "Inbox" folder with the subject "Hello" has the UID 10. If I fetched the same message from the "All Messages" folder, I'd get the same message (because I am in "All Messages") with the same content, but with a different UID.
This isn't actually that much of a problem; however, is it possible that two completely different mails from different folders might have the same UID? In that case, I would have to rethink the way I store mails.
Thanks in advance.
The UIDs are not JavaMail UIDs, they're IMAP UIDs, defined by the IMAP RFC.
The UIDs are unique per-folder, based on the UIDVALIDITY value for the folder. There is no unique ID for the folder itself.
Depending on your needs, you might consider using the Message-ID for the message, although note that while it's very, very likely to be unique, there's no guarantee that it is unique, and there's no guarantee that it exists for every message.
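If you need a key that is stable within a folder, the usual approach is to combine the folder's UIDVALIDITY with the message UID, and fall back to the Message-ID when you want something folder-independent. A minimal JavaMail sketch (folder and message are assumed to be an already-opened IMAP folder and one of its messages):

import javax.mail.UIDFolder;
import javax.mail.internet.MimeMessage;

// UIDVALIDITY + UID identifies a message within one folder;
// if UIDVALIDITY changes, all UIDs stored for that folder must be discarded.
UIDFolder ufolder = (UIDFolder) folder;
String perFolderKey = ufolder.getUIDValidity() + ":" + ufolder.getUID(message);

// Folder-independent, but not guaranteed to be unique or even present:
String messageId = ((MimeMessage) message).getMessageID();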

Avoid duplicate POSTs with REST

I have been using POST in a REST API to create objects. Every once in a while, the server will create the object, but the client will be disconnected before it receives the 201 Created response. The client only sees a failed POST request, and tries again later, and the server happily creates a duplicate object...
Others must have had this problem, right? But when I google around, everyone just seems to ignore it.
I have 2 solutions:
A) Use PUT instead, and create the (GU)ID on the client.
B) Add a GUID to all objects created on the client, and have the server enforce their UNIQUE-ness.
A doesn't match existing frameworks very well, and B feels like a hack. How do other people solve this in the real world?
Edit:
With Backbone.js, you can set a GUID as the id when you create an object on the client. When it is saved, Backbone will do a PUT request. Make your REST backend handle PUT to non-existing id's, and you're set.
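On the server side, a handler for PUT to a client-chosen id can look roughly like this (JAX-RS here only as an illustration; the in-memory map stands in for whatever persistence you use):

import java.util.concurrent.ConcurrentHashMap;
import javax.ws.rs.PUT;
import javax.ws.rs.Path;
import javax.ws.rs.PathParam;
import javax.ws.rs.core.Response;

@Path("/objects")
public class ObjectsResource {
    private static final ConcurrentHashMap<String, String> store = new ConcurrentHashMap<>();

    // Creating via PUT to a client-generated id is idempotent:
    // repeating the same request just writes the same resource again.
    @PUT
    @Path("/{id}")
    public Response createOrReplace(@PathParam("id") String id, String body) {
        boolean existed = store.containsKey(id);
        store.put(id, body);
        return existed ? Response.ok().build()
                       : Response.status(Response.Status.CREATED).build();
    }
}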
Another solution that's been proposed for this is POST Once Exactly (POE), in which the server generates single-use POST URIs that, when used more than once, will cause the server to return a 405 response.
The downsides are that 1) the POE draft was allowed to expire without any further progress on standardization, and thus 2) implementing it requires changes to clients to make use of the new POE headers, and extra work by servers to implement the POE semantics.
By googling you can find a few APIs that are using it though.
Another idea I had for solving this problem is that of a conditional POST, which I described and asked for feedback on here.
There seems to be no consensus on the best way to prevent duplicate resource creation in cases where the client cannot generate the unique URI (so PUT is not an option) and POST is therefore needed.
I always use B -- detection of dups due to whatever problem belongs on the server side.
Detection of duplicates is a kludge, and can get very complicated. Genuine distinct but similar requests can arrive at the same time, perhaps because a network connection is restored. And repeat requests can arrive hours or days apart if a network connection drops out.
All of the discussion of identifiers in the other answers is aimed at giving an error in response to duplicate requests, but this will normally just prompt a client to get or generate a new id and try again.
A simple and robust pattern to solve this problem is as follows: Server applications should store all responses to unsafe requests, then, if they see a duplicate request, they can repeat the previous response and do nothing else. Do this for all unsafe requests and you will solve a bunch of thorny problems. Repeat DELETE requests will get the original confirmation, not a 404 error. Repeat POSTS do not create duplicates. Repeated updates do not overwrite subsequent changes etc. etc.
"Duplicate" is determined by an application-level id (that serves just to identify the action, not the underlying resource). This can be either a client-generated GUID or a server-generated sequence number. In this second case, a request-response should be dedicated just to exchanging the id. I like this solution because the dedicated step makes clients think they're getting something precious that they need to look after. If they can generate their own identifiers, they're more likely to put this line inside the loop and every bloody request will have a new id.
Using this scheme, all POSTs are empty, and POST is used only for retrieving an action identifier. All PUTs and DELETEs are fully idempotent: successive requests get the same (stored and replayed) response and cause nothing further to happen. The nicest thing about this pattern is its Kung-Fu (Panda) quality. It takes a weakness: the propensity for clients to repeat a request any time they get an unexpected response, and turns it into a force :-)
I have a little google doc here if anyone cares.
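A minimal sketch of the store-and-replay idea (in-memory only; a real server would persist the stored responses and expire them eventually, and the names here are made up):

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Supplier;

// Replay cache for unsafe requests, keyed by the action id agreed with the client.
public class ReplayCache {
    private final Map<String, String> responses = new ConcurrentHashMap<>();

    // Runs the action at most once per action id; repeats get the stored response.
    public String handle(String actionId, Supplier<String> action) {
        return responses.computeIfAbsent(actionId, id -> action.get());
    }
}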
You could try a two step approach. You request an object to be created, which returns a token. Then in a second request, ask for a status using the token. Until the status is requested using the token, you leave it in a "staged" state.
If the client disconnects after the first request, they won't have the token and the object stays "staged" indefinitely or until you remove it with another process.
If the first request succeeds, you have a valid token and you can grab the created object as many times as you want without it recreating anything.
There's no reason why the token can't be the ID of the object in the data store. You can create the object during the first request. The second request really just updates the "staged" field.
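A sketch of that two-step flow (names and states are illustrative; a real implementation would keep this in the data store rather than in memory):

import java.util.Map;
import java.util.UUID;
import java.util.concurrent.ConcurrentHashMap;

public class StagedObjects {
    private final Map<String, String> status = new ConcurrentHashMap<>();

    // Step 1: create the object in a "staged" state and hand back its id as the token.
    public String create() {
        String token = UUID.randomUUID().toString();
        status.put(token, "staged");
        return token;
    }

    // Step 2: the client confirms with the token; repeating this call is harmless.
    public boolean confirm(String token) {
        return status.replace(token, "staged", "active") || "active".equals(status.get(token));
    }
}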
Server-issued Identifiers
If you are dealing with the case where it is the server that issues the identifiers, create the object in a temporary, staged state. (This is an inherently non-idempotent operation, so it should be done with POST.) The client then has to do a further operation on it to transfer it from the staged state into the active/preserved state (which might be a PUT of a property of the resource, or a suitable POST to the resource).
Each client ought to be able to GET a list of their resources in the staged state somehow (maybe mixed with other resources) and ought to be able to DELETE resources they've created if they're still just staged. You can also periodically delete staged resources that have been inactive for some time.
You do not need to reveal one client's staged resources to any other client; they need exist globally only after the confirmatory step.
Client-issued Identifiers
The alternative is for the client to issue the identifiers. This is mainly useful where you are modeling something like a filestore, as the names of files are typically significant to user code. In this case, you can use PUT to do the creation of the resource as you can do it all idempotently.
The downside of this is that clients are able to create IDs, and so you have no control at all over what IDs they use.
There is another variation of this problem. Having the client generate a unique id means we are asking the customer to solve this problem for us. Consider an environment with publicly exposed APIs and hundreds of clients integrating with them. Practically, we have no control over the client code or the correctness of its implementation of uniqueness. Hence, it would probably be better to have the intelligence to recognize whether a request is a duplicate. One simple approach here would be to calculate and store a checksum of every request based on attributes from the user input, define some time threshold (x minutes), and compare every new request from the same client against the ones received in the past x minutes. If the checksum matches, it could be a duplicate request, so add some challenge mechanism for the client to resolve this.
If a client is making two different requests with the same parameters within x minutes, it might be worth ensuring that this is intentional, even if it comes with a unique request id.
This approach may not be suitable for every use case; however, I think it will be useful for cases where the business impact of executing the second call is high and can potentially cost the customer. Consider a payment processing engine where an intermediate layer ends up retrying a failed request, or a customer double-clicks, resulting in the client layer submitting two requests.
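A rough sketch of that server-side check (the checksum input and the x-minute window are whatever makes sense for the API; real code would also evict old entries):

import java.security.MessageDigest;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class DuplicateDetector {
    private final Map<String, Long> recentChecksums = new ConcurrentHashMap<>();
    private final long windowMillis;

    public DuplicateDetector(long windowMillis) { this.windowMillis = windowMillis; }

    // True if the same client sent a request with the same payload checksum within the window.
    public boolean isLikelyDuplicate(String clientId, String payload) throws Exception {
        byte[] digest = MessageDigest.getInstance("SHA-256").digest((clientId + "|" + payload).getBytes("UTF-8"));
        StringBuilder key = new StringBuilder(clientId).append(':');
        for (byte b : digest) key.append(String.format("%02x", b));
        long now = System.currentTimeMillis();
        Long previous = recentChecksums.put(key.toString(), now);
        return previous != null && now - previous < windowMillis;
    }
}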
Design
Automatic (without the need to maintain a manual black list)
Memory optimized
Disk optimized
Algorithm [solution 1]
REST arrives with UUID
Web server checks if UUID is in Memory cache black list table (if yes, answer 409)
Server writes the request to the DB (if it was not filtered by ETS)
DB checks if the UUID is repeated before writing
If yes, answer 409 from the server, and add the UUID to the blacklist in the Memory Cache and on disk
If not repeated, write to the DB and answer 200
Algorithm [solution 2]
REST arrives with UUID
Save the UUID in the Memory Cache table (expire for 30 days)
Web server checks if UUID is in Memory Cache black list table [return HTTP 409]
Server writes the request to DB [return HTTP 200]
In solution 2, the threshold used to build the Memory Cache blacklist is kept ONLY in memory, so the DB is never checked for duplicates. The definition of 'duplicate' is "any request that comes in within a period of time". We also replicate the Memory Cache table to disk, so we fill it before starting up the server.
In solution 1, there will never be a duplicate, because we always check the disk ONLY once before writing, and if it's duplicated, the next roundtrips will be handled by the Memory Cache. This solution is better for BigQuery, because requests there are not idempotent, but it's also less optimized.

NSMutableURLRequest on succession of another NSMutableURLRequest's success

Basically, I want to implement SYNC functionality where, if an internet connection is not available, data gets stored in a local SQLite database. Whenever an internet connection is available, SYNC gets into action.
Now, say for example, 5 records are stored locally, and then an internet connection becomes available. I want the server to be updated. So, what I currently do is:
Post first record to the server.
Wait for the success of first request.
Post a local NSNotification to a routine, saying that the first record has been updated on the server & now the second request can go.
The routine fires the second post request on server and so on...
Question: Is this approach right and efficient enough to implement SYNC functionality, or is there anything I should change?
NOTE: There is no limit on the number of records to be SYNCed.
Well it depends on the requirements on the data that you save. If it is just for backup then you should be fine.
If the 5 records are somehow dependent on each other and you need to access this data from another device/application you should take care on the server side that either all 5 records are written or none. Otherwise you will have an inconsistent state if only 3 get written.
If other users are also reading/writing this data concurrently on the server, then you need to implement some kind of lock on all records before writing, and also decide how to handle conflicts when someone attempts to overwrite somebody else's changes.
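For the all-or-nothing part, a plain database transaction on the server side is usually enough; a minimal JDBC sketch (table and column names are made up):

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.util.List;

// Write all records in one transaction: either all of them are committed or none are.
void saveBatch(Connection conn, List<String> records) throws Exception {
    conn.setAutoCommit(false);
    try (PreparedStatement ps = conn.prepareStatement("INSERT INTO records(payload) VALUES (?)")) {
        for (String r : records) {
            ps.setString(1, r);
            ps.addBatch();
        }
        ps.executeBatch();
        conn.commit();
    } catch (Exception e) {
        conn.rollback();   // nothing is persisted if any insert fails
        throw e;
    }
}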

IMAP: Search for messages with UID greater than X (or generally, after my last search)

I'm writing a script to analyze my mailbox and want to periodically check for new messages. The search criteria would be: give me the UIDs for all emails with UID greater than X, where X is the UID of the last email I processed.
Or, more generally, I'm looking for a way to only see messages since my last search.
Note that I'm not looking for seen/unseen messages; the script opens the mailbox as read-only, and I'd like it to not interfere with my flags, etc.
I know I can specify a date in the IMAP search, but the granularity of that seems to be by day, so not exactly what I need.
I'm starting with Gmail as the IMAP server, but would like to support generic IMAP servers in the future.
Is there a way to search for emails with UID greater than X? Or another means of specifying all messages since message X?
You can use IMAP SEARCH for UIDs. Assuming your most recently fetched UID is 1999, I think you would do:
SEARCH UID 2000:*
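In JavaMail terms you can do the same with a UID range instead of a raw SEARCH; a sketch assuming store is an already-connected IMAP Store and lastUid is the highest UID you processed:

import javax.mail.Folder;
import javax.mail.Message;
import javax.mail.UIDFolder;
import com.sun.mail.imap.IMAPFolder;

IMAPFolder inbox = (IMAPFolder) store.getFolder("INBOX");
inbox.open(Folder.READ_ONLY);
// Everything with a UID above the last one we processed
Message[] newer = inbox.getMessagesByUID(lastUid + 1, UIDFolder.LASTUID);
// Caveat: many servers answer a range like N:* with the last message in the folder
// even when its UID is <= lastUid, so re-check getUID() on the results.
for (Message m : newer) {
    if (inbox.getUID(m) > lastUid) {
        // genuinely new message: process it here
    }
}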
Why not use IMAP IDLE for this?
With IMAP IDLE, the server notifies you every time a new message arrives.

Mirth losing data in mapper variables

I have a database reader channel set up that actually reads the database at 10 second intervals and sends to a web service just fine. We get a valid response from the wsdl.
However, I need to update the database record so that it is flagged as having been processed. In this case we are simply changing a field from 100 to 101. However, when I try to update the field OR send an email containing ANY data that has been stored in mapper variables, I get nothing. The database does not update. Emails send blanks for the fields.
When I go into the channel messages for processed messages I can see good data in the Raw Message and Encoded Message tabs. There are no values in the Mappings tab.
Any suggestions on troubleshooting?
The Run-on-Update statement does not have access to the channel map, as it runs after message encoding (and even after the post-processor, I believe).
It DOES have access to the globalChannelMap and the responseMap. Put your new ID in the globalChannelMap and you should be good to go.
If you also want to send an email, I would recommend you instead add an SMTP Writer destination, which will have access to any channelMap variables created in 'Destination 1', as well as the globalChannelMap.