Email search using Elasticsearch

I am using Elasticsearch to search an email inbox. I want to be able to search subject lines as well as message bodies. Simple enough. But if a search matches a subject line, every message in that thread is returned, because they all share the same subject line. How can I use Elasticsearch to collapse all those messages into one result, while still returning individual messages that match in the message body? In other words, I want results that behave like Gmail search results. Is this possible without having to do two separate searches (one over subjects, one over message bodies) and then combining the results? I'd like to avoid that if possible.
Each message includes a thread identifier, which is unique per message thread.
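Since each message carries a thread identifier, Elasticsearch's field collapsing can do this in one query: collapse hits on the thread ID so each thread shows up once, and use inner_hits to surface the individual matching messages inside it. A minimal sketch in Python, assuming an index named "emails" with fields "subject", "body" and a keyword field "thread_id" (all of these names are placeholders):

import requests

query = {
    "query": {
        # Match the search terms against both subject and body
        "multi_match": {
            "query": "quarterly report",
            "fields": ["subject", "body"]
        }
    },
    # Collapse hits that share a thread_id so each thread is returned once...
    "collapse": {
        "field": "thread_id",
        # ...while inner_hits still exposes the best-matching individual
        # messages inside each collapsed thread
        "inner_hits": {"name": "matching_messages", "size": 3}
    }
}

resp = requests.post("http://localhost:9200/emails/_search", json=query)
for hit in resp.json()["hits"]["hits"]:
    thread_id = hit["fields"]["thread_id"][0]
    messages = hit["inner_hits"]["matching_messages"]["hits"]["hits"]
    print(thread_id, [m["_id"] for m in messages])

Note that collapse requires thread_id to be mapped as a keyword or numeric field with doc_values enabled.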

Related

What is the use of Message-ID in email?

From what I've read, every Message-ID must be unique, yet it is easy to create duplicates by forcing the header to a fixed value. So I don't understand the point of saying that Message-IDs should be unique when anyone with a little reading and basic programming knowledge can easily generate duplicates. Why do Message-IDs exist, and what are they used for?
Short answer: For threading in email clients.
The message-id header is defined in RFC 2822:
The "Message-ID:" field contains a single unique message identifier.
The "References:" and "In-Reply-To:" field each contain one or more
unique message identifiers
The message ID is used to show which message is a reply to which other message, for example. That way mail clients can show a tree of emails with their replies even if other things like the Subject don't change. (Counting leading "Re:"s in the subject line would be a bad way to determine ancestors and children: not every mail client adds them, and some use language-specific ones.)
RFC 5322 §3.6.4 (https://datatracker.ietf.org/doc/html/rfc5322#section-3.6.4) defines the field, and as Wikipedia (https://en.wikipedia.org/wiki/Message-ID) summarizes: in conjunction with the References and In-Reply-To fields, mail clients use Message-ID to organize multiple messages into threads. At least some clients will also consider two messages with the same ID to be the same thing and discard one of them.
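For illustration, Python's standard library can both generate unique Message-IDs and set the threading headers mail clients use to build that tree (the subject here is made up):

from email.message import EmailMessage
from email.utils import make_msgid

# make_msgid() generates a unique, RFC-compliant identifier such as
# '<123456789.12345.678@hostname>'
parent_id = make_msgid()

reply = EmailMessage()
reply["Subject"] = "Re: status update"
reply["Message-ID"] = make_msgid()   # unique id for the reply itself
reply["In-Reply-To"] = parent_id     # the direct parent
reply["References"] = parent_id      # the ancestor chain used for threading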

NiFi Email ConsumeIMAP filter by subject, from, to and date

Using ConsumeIMAP to read emails from an inbox, I am trying to select only emails that:
- have an attachment to download
- were sent "From" xyz@yahoo.com
- were sent "To" abc@gmail.com
- have "Daily" in their subject
- were sent at 8 am EST
Please let me know if this can be set in any component. I tried EvaluateJsonPath, ExtractEmailHeaders and RouteOnAttribute, but no luck yet.
It sounds like you have been exploring the correct path. You should be able to achieve this using a flow consisting of:
ConsumeIMAP >> ExtractEmailHeaders >> RouteOnAttribute
ConsumeIMAP will download messages from the email server and create a single FlowFile for each message, storing the raw bytes of the email message in the FlowFile content.
ExtractEmailHeaders attempts to parse a FlowFile's contents as email (must be RFC-2822 compliant), extract email headers, and write each header field to a FlowFile attribute, including:
email.headers.from.*
email.headers.to.*
email.headers.subject
email.headers.sent_date
Note that ExtractEmailHeaders does no filtering; it just populates FlowFile attributes based on the FlowFile content, making the FlowFiles easier to route downstream in the flow. Start by creating a flow with just these two processors and verify that the output of the ExtractEmailHeaders processor meets these expectations. If not, it's possible the email messages are malformed or not RFC-2822 compliant.
After you have successfully sent email FlowFiles through ExtractEmailHeaders, you can do the filtering with one or more RouteOnAttribute processors, using the NiFi Expression Language to define your match conditions, e.g.:
${email.headers.subject:contains("Daily")}
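Conditions can also be combined in a single routing property. For example, a sketch that matches on both subject and sender (the attribute names assume ExtractEmailHeaders' indexed defaults, and the address is a placeholder):

${email.headers.subject:contains("Daily"):and(${email.headers.from.0:equals("xyz@yahoo.com")})}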
If you have verified that your flow is working correctly through ExtractEmailHeaders, but the filtering in RouteOnAttribute is not working as expected, make sure your attribute expressions and assumptions about email header values (e.g., capitalization, datetime format) are correct. Consult the Apache NiFi Expression Language Guide and if you have specific questions relating to the expression language itself, search here or post another question on that specifically.
I hope this helps!

Multiple In-Reply-To

Mail user agents usually display threads of emails by chaining messages together according to the In-Reply-To and References header fields, which contain the Message-IDs of other messages. Although a mail usually replies to only one other message, it may be the case that one message answers multiple others. The standard allows multiple entries in both fields. What can I expect when I send an email that References or is In-Reply-To multiple IDs this way?
Is it good practice to do so?
Does it confuse widespread MUAs?
Is there any common ground on how to display such a message in a threaded view?
The "In-Reply-To:" field will contain the contents of the
"Message-ID:" field of the message to which this one is a reply (the
"parent message"). If there is more than one parent message, then
the "In-Reply-To:" field will contain the contents of all of the
parents' "Message-ID:" fields. If there is no "Message-ID:" field in
any of the parent messages, then the new message will have no "In-
Reply-To:" field.
Technically there COULD be a reason to reply to multiple emails at once, and it would be valid to place multiple message IDs in the In-Reply-To header; I can't think of any program that actually supports this, though. As for delivery, these headers don't matter: delivery is driven by the To, Cc and Bcc headers.
The In-Reply-To and References headers control how threads are displayed. I'm not sure whether any mail clients have trouble handling multiple message IDs in the In-Reply-To header; 99% of the time there is only a single message ID there, so it's feasible that some mail applications won't support it. They should, however, support the additional References entries, and those shouldn't pose an issue.
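To make the multi-parent case concrete, here is a small Python sketch; per the RFC text above, multiple message IDs in In-Reply-To and References are simply whitespace-separated:

from email.message import EmailMessage
from email.utils import make_msgid

parent_a = make_msgid()  # first message being answered
parent_b = make_msgid()  # second message being answered

reply = EmailMessage()
reply["Message-ID"] = make_msgid()
# Both parents, space-separated, as the RFC allows:
reply["In-Reply-To"] = f"{parent_a} {parent_b}"
reply["References"] = f"{parent_a} {parent_b}"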

How to pass a large number of input parameters to RESTful service?

I have a RESTful service that returns detailed data about a machine for the supplied list of IDs. GET api/machine/
http://service.com/api/machine/1,2,3,4
Until now this has been fine, since I was getting a small number of machines at a time, but now I need to get all machines (more than 1000), and the resulting URL exceeds the 2000-character limit on URLs.
I have gotten both of the options below to work and I'm looking for some community feedback on which way to go.
Option 1: Split up my GET and make multiple calls, each with a subset of the IDs. Pros: I am reading data, so using the HTTP verb GET makes sense. Cons: anyone new to the service who doesn't know about this limit, or who doesn't use my client, would run into problems.
Option 2: Add a PUT/POST method and include the full list of IDs in the body. Pros: a single call fetches all the data. Cons: I am now doing a read via a PUT/POST.
Probably your best course of action would be something along the lines of option 2: build a JSON document on your side with an array of the IDs you want to send, and put it in the body of the message. If there's a possibility of it still being far too large, you can split it into several messages: when you receive the response to one, send the next item in the queue, and so on.
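A sketch of what that could look like with Python's requests library (the endpoint path and payload shape are assumptions, not part of the original API):

import requests

machine_ids = list(range(1, 1001))  # well past any practical URL length limit

# The full ID list travels in the request body, so its size is not an issue
resp = requests.post("http://service.com/api/machine/query",
                     json={"ids": machine_ids})
machines = resp.json()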
Another option, used by the Facebook API among others, is to create a "/batch" POST method which can be used to make multiple requests in one go.
So instead of having http://service.com/api/machine/1,2,3,4,5,.... you'll have a batch of requests with /machine/1, /machine/2, /machine/3, etc.
The advantage is that you keep clean RESTful URLs (no more comma-separated values) and it scales very well, since you can batch as many requests as you want.
The disadvantage is that it is slightly more complex to build.
For more information, see https://developers.facebook.com/docs/graph-api/making-multiple-requests

RESTful way to create multiple items in one request

I am working on a small client-server program to collect orders. I want to do this in a "REST(ful) way".
What I want to do is:
Collect all orderlines (product and quantity) and send the complete order to the server
At the moment I see two options to do this:
Send each orderline to the server: POST qty and product_id
I actually don't want to do this because I want to limit the number of requests to the server so option 2:
Collect all the orderlines and send them to the server at once.
How should I implement option 2? A couple of ideas I have: wrap all orderlines in a JSON object and send that to the server, or POST the orderlines as an array.
Is it a good idea or good practice to implement option 2, and if so, how should I do it?
What is good practice?
I believe that another correct way to approach this would be to create another resource that represents your collection of resources.
Example, imagine that we have an endpoint like /api/sheep/{id} and we can POST to /api/sheep to create a sheep resource.
Now, if we want to support bulk creation, we should consider a new flock resource at /api/flock (or /api/<your-resource>-collection if you lack a better meaningful name). Remember that resources don't need to map to your database or app models. This is a common misconception.
Resources are a higher-level representation, unrelated to your data. Operating on a resource can have significant side effects, like firing an alert to a user, updating other related data, initiating a long-lived process, etc. For example, we could map a file system or even the unix ps command as a REST API.
I think it is safe to assume that operating a resource may also mean to create several other entities as a side effect.
Although bulk operations (e.g. batch create) are essential in many systems, they are not formally addressed by the RESTful architecture style.
I found that POSTing a collection as you suggested basically works, but problems arise when you need to report failures in response to such a request. Such problems are worse when multiple failures occur for different causes or when the server doesn't support transactions.
My suggestion to you is that if there is no performance problem, for example when the service provider is on the LAN (not WAN) or the data is relatively small, it's worth it to send 100 POST requests to the server. Keep it simple, start with separate requests and if you have a performance problem try to optimize.
Facebook explains how to do this: https://developers.facebook.com/docs/graph-api/making-multiple-requests
Simple batched requests
The batch API takes in an array of logical HTTP requests represented as JSON arrays - each request has a method (corresponding to HTTP method GET/PUT/POST/DELETE etc.), a relative_url (the portion of the URL after graph.facebook.com), an optional headers array (corresponding to HTTP headers) and an optional body (for POST and PUT requests). The Batch API returns an array of logical HTTP responses represented as JSON arrays - each response has a status code, an optional headers array and an optional body (which is a JSON encoded string).
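Following the quoted format, a batch call for the orderline case and its decoded responses might look like this in Python (the /batch and /orderlines endpoints are illustrative):

import requests

# One logical POST per orderline, wrapped in a single physical request
batch = [
    {"method": "POST", "relative_url": "/orderlines",
     "body": '{"product_id": 7, "qty": 2}'},
    {"method": "POST", "relative_url": "/orderlines",
     "body": '{"product_id": 12, "qty": 1}'},
]
resp = requests.post("http://example.com/api/batch", json=batch)

# One logical response per logical request, in order
for logical in resp.json():
    print(logical["code"], logical["body"])  # body is a JSON-encoded string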
Your idea seems valid to me. The implementation is a matter of your preference. You can use JSON or just parameters for this ("order_lines[]" array) and do
POST /orders
Since you are going to create more than one resource in a single action (an order and its lines), it's vital to validate each and every one of them and save them only if all of them pass validation, i.e. you should do it in a transaction.
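A minimal sketch of that all-or-nothing behaviour, assuming a Flask-style handler; save_order is a hypothetical helper standing in for inserts wrapped in one database transaction:

from flask import Flask, request, jsonify

app = Flask(__name__)

def save_order(lines):
    # Hypothetical helper: insert the order and all its lines inside a
    # single database transaction and return the new order's id.
    return 42

@app.route("/orders", methods=["POST"])
def create_order():
    lines = request.get_json().get("order_lines", [])
    # Validate every line before saving anything
    errors = [f"line {i}: missing product_id or qty"
              for i, line in enumerate(lines)
              if "product_id" not in line or "qty" not in line]
    if errors:
        # Nothing was saved; report all failures at once
        return jsonify({"errors": errors}), 422
    order_id = save_order(lines)
    return jsonify({"id": order_id}), 201, {"Location": f"/orders/{order_id}"}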
I've actually been wrestling with this lately, and here's what I'm working towards.
If a POST that adds multiple resources succeeds, return a 200 OK (I was considering a 201, but the user ultimately doesn't land on a resource that was created) along with a page that displays all resources that were added, either in read-only or editable fashion. For instance, a user is able to select and POST multiple images to a gallery using a form comprising only a single file input. If the POST request succeeds in its entirety the user is presented with a set of forms for each image resource representation created that allows them to specify more details about each (name, description, etc).
In the event that one or more resources fails to be created, the POST handler aborts all processing and appends each individual error message to an array. Then a 409 Conflict is returned and the user is routed to an error page that presents the contents of the error array, as well as a way back to the form that was submitted.
I guess it's better to send separate requests over a single connection (HTTP keep-alive). Of course, your web server has to support it.
You don't want to send HTTP headers for 100 orderlines, and you don't want to generate any more requests than necessary.
Send the whole order in one JSON object to the server, to: server/order or server/order/new.
Return something that points to: server/order/order_id
Also consider using PUT instead of POST for the create.
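A client-side sketch of that shape (the URL and field names are assumptions):

import requests

order = {"order_lines": [{"product_id": 7, "qty": 2},
                         {"product_id": 12, "qty": 1}]}
resp = requests.post("http://server/order", json=order)
print(resp.headers["Location"])  # points at server/order/<order_id>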