The Best Way to Parse Dates from an Email - email

I'm currently developing an app that can parse dates from an email, i.e. extract the times and dates mentioned in an email (similar to Gmail).
Currently I do this in PHP, but it is a tad clunky.
What's the best language to do this in, and are there any existing open source solutions?

I think PHP is as capable as any other language. Can we see the code you're using so we can suggest improvements? I'd use a regular expression... you just need a good one that supports a variety of formats.

What I do in my email client is extract all the tokens delimited by whitespace and then iterate over them using heuristics to decide how to classify each token. For instance if the token has a ':' character in it then I treat it as a time, to be parsed as ##:##:##. If it has '.' or '-' treat it as a day/month/year combo, and you have to decide which end is which... could be any number of combinations. If the token starts with a letter (i.e. isalpha(*string)) then you do a month name lookup. If it's a number it could be the day or year... decide based on length and whether you have an existing day or year already etc. If the token starts with '-' or '+' then it's a timezone, parse accordingly.
Seems to work in the field quite well, my email client has been around for 10 years or so. My code is C++, but you can write the same in PHP easily, it's not particularly language specific.
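A minimal Python sketch of the token heuristics described above, not the actual C++ code from that email client. The names classify_token and classify are made up for illustration, and the timezone check is moved to the front (an assumption on my part) so that a token like -0400 is not mistaken for a date because of its '-'.

```python
MONTH_NAMES = ["january", "february", "march", "april", "may", "june", "july",
               "august", "september", "october", "november", "december"]

# Accept both full names and three-letter abbreviations.
MONTHS = {}
for number, name in enumerate(MONTH_NAMES, start=1):
    MONTHS[name] = number
    MONTHS[name[:3]] = number


def classify_token(token):
    """Return a (kind, value) pair for one whitespace-delimited token."""
    token = token.strip(",;()")
    if not token:
        return ("unknown", token)
    if token[0] in "+-" and token[1:].isdigit():
        return ("timezone", token)          # e.g. -0400
    if ":" in token:
        return ("time", token)              # e.g. 16:30:12
    if "." in token or "-" in token:
        return ("date", token)              # e.g. 02.11.2016, order still ambiguous
    if token[0].isalpha():
        month = MONTHS.get(token.lower())
        return ("month", month) if month else ("word", token)
    if token.isdigit():
        # Four digits is almost certainly a year; shorter numbers need context.
        kind = "year" if len(token) == 4 else "day_or_year"
        return (kind, int(token))
    return ("unknown", token)


def classify(text):
    """Classify every whitespace-delimited token in the text."""
    return [classify_token(token) for token in text.split()]
```

Calling classify("Mon, 2 Nov 2016 16:30:12 -0400") labels each token; deciding which number is the day and which end of a dotted date is the year still has to happen in a second pass, as described above.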

If you mean the date it was sent (or received), you retrieve it from the mail headers (for example the 'Date:' header), which use a standard date format; see RFC 2822.
Anyway, if you use JavaMail (it's open source now) you can get the sent date with
Date sentDate = mail.getSentDate();
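For comparison, the same header lookup can be sketched with Python's standard email package. The sample message and the function name sent_date are made up for illustration.

```python
from email import message_from_string
from email.utils import parsedate_to_datetime

# A made-up message, just enough to carry a Date header.
RAW_MESSAGE = (
    "From: alice@example.com\r\n"
    "Date: Wed, 02 Nov 2016 16:30:12 -0400\r\n"
    "Subject: Lunch\r\n"
    "\r\n"
    "See you at noon.\r\n"
)


def sent_date(raw_message):
    """Return the Date header as a timezone-aware datetime, or None if absent."""
    message = message_from_string(raw_message)
    header = message["Date"]
    if header is None:
        return None
    return parsedate_to_datetime(header)


print(sent_date(RAW_MESSAGE).isoformat())  # 2016-11-02T16:30:12-04:00
```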


Localization for REST APIs

I am starting this discussion to gather more info on localization practices for APIs. It seems HTTP does NOT provide sufficient guidance, and even the state of practice falls short.
The basic problem is that APIs may need to provide content that is dependent on the user culture, country, language and timezone. For example a German user would like to read messages in German language, with European metric dates, numbers, units, using Euro currency and in Central European Timezone.
Reading through RFC 7231 Section 5.3.5 Accept-Language and further into RFC 4647 one may think Accept-Language is sophisticated enough and is what should be done. There are several notable shortcomings though:
1. Language tags may not be precise enough, e.g. a user may request only a language without a country code and thus leave ambiguity, as in: "de, en;q=0.8"
2. Even if the user supplies both language and country preferences, it is not clear how to tie together the selection of the message locale and the value formatting locale. For example, a user requests "hu-HU, en-US;q=0.9" while the application lacks Hungarian messages but is written in Java, which knows how to format dates in Hungarian. Should the app use English messages with Hungarian dates, or English messages with US dates? The actual situation may be more complex.
3. Timezone is not present in the language tags. There seems to be no standard HTTP header for it.
I see Microsoft has thought about #2 in ASP.NET and introduced the notions of Culture and UICulture to separate the selection of message language from formatting.
In the Java world, Spring has introduced TimeZoneAwareLocaleContext to address #3.
The W3C has issued a guideline on using Accept-Language for locale settings. It more or less says that Accept-Language is not enough.
So what is your thinking?
Do you know of APIs that solve this problem in a comprehensive way? Pointers?
Should APIs accept multiple values for selecting message language, value formatting locale and timezone?
Should Accept-Language be used at all?
Ok guys,
here is a summary of how I answer my question. I hope this helps future API authors.
The fundamental requirements for a UI built on top of an API, excluding currency presentation, seem to be:
Select the best language out of the available product translations using RFC 4647 list of language ranges
Select the best data format out of the available using RFC 4647 list of language ranges
Allow clients to provide distinct preferences for translation and format. There will be cases where people will not find the best translation and yet prefer to see the proper formatting aligned with their culture.
Allow clients to specify a timezone using IANA TZDB identifiers
Format data elements using Unicode CLDR http://cldr.unicode.org/
Use named placeholders in localization bundles e.g. "{drive} is corrupt" is easier to translate properly than "{1} is corrupt"
For the REST HTTP headers, I suggest using 3 headers:
accept-language - used for selecting translation and following the guidelines of RFC 7231 https://www.rfc-editor.org/rfc/rfc7231#section-5.3.5
format-locale - used to select data formatting style if different from the translation language preferences. Again list of language range elements. Defaults to accept-language if omitted.
timezone - used to select timezone for rendering date and time values. This should be valid timezone ID from the IANA TZDB https://www.iana.org/time-zones
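A rough Python sketch of how a server might read these three headers. The header names come from the list above; the function names, the lower-cased header dictionary and the fallback to UTC for a missing or unknown timezone ID are my assumptions, and the parsing covers only the common "tag;q=weight" form rather than everything RFC 7231 allows.

```python
from datetime import timezone
from zoneinfo import ZoneInfo, ZoneInfoNotFoundError


def parse_language_ranges(header_value):
    """Turn 'de, en;q=0.8' into ['de', 'en'], highest weight first."""
    weighted = []
    for part in header_value.split(","):
        tag, _, params = part.strip().partition(";")
        tag = tag.strip()
        if not tag:
            continue
        quality = 1.0
        params = params.strip()
        if params.startswith("q="):
            try:
                quality = float(params[2:])
            except ValueError:
                quality = 0.0  # treat a malformed weight as "not acceptable"
        if quality > 0:
            weighted.append((tag, quality))
    # sorted() is stable, so ranges with equal weight keep their header order.
    return [tag for tag, _ in sorted(weighted, key=lambda item: -item[1])]


def resolve_preferences(headers):
    """Resolve translation, formatting and timezone preferences.

    `headers` is assumed to be a dict with lower-case header names.
    """
    languages = parse_language_ranges(headers.get("accept-language", ""))
    format_header = headers.get("format-locale")
    # format-locale defaults to accept-language when omitted.
    formats = parse_language_ranges(format_header) if format_header else languages
    tz_name = headers.get("timezone")
    try:
        tz = ZoneInfo(tz_name) if tz_name else timezone.utc
    except (ZoneInfoNotFoundError, ValueError, OSError):
        tz = timezone.utc  # assumption: unknown IDs fall back to UTC
    return {"languages": languages, "formats": formats, "timezone": tz}
```

Matching the resulting ranges against the translations and formats the product actually ships (the RFC 4647 lookup step) is left out here.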
Implementation wise it seems Java 8 and later have full capability to implement a globalized application. Other languages and older Java versions seem to have varying degrees of issues.
I would keep all data in a universal locale independent format. For numbers using . as a decimal separator, date and time using ISO 8601 and in UTC, etc.
Provide localized text only if it is absolutely necessary. In that case, get the locale from the accept-language header field, and if you have the localized string, pass that. If not, fall back to the string you have.
For example, you might have a multilingual product database that contains product data in several languages. When you write an API for the database, you can select the product data in the user's language (if any).
Here is a sample.

REST API semantics for querying dates/times?

I have a node.js application that stores many dates in a database. They are stored in the ISO format, such as '2016-11-02T16:30:12-04:00'.
Some fields which are dates are just dates, others are date/times. An example of a date/time would be "last modified" for a record, whereas a person's birthday is just a date.
The question is about best practices for storage and query patterns on these things. Because a date always has a time, you must choose how to store for example a birthday. Following the 5 laws of API dates and times this is of course done in UTC.
There are edge cases though where proper API behavior seems unclear. Suppose someone submits a birthdate to the API of '2016-11-02T16:30:12-04:00'. This is bad news, because a search like /users?birthdate=2016-11-02 will fail, as that date will get converted to '2016-11-02T00:00:00Z' and fail to match in the DB. What then should correct behavior be?
1. When someone POSTs a user, convert date fields into dates at midnight UTC, and then have the convention that querying birthdates should assume the same?
2. Convert date queries for certain fields into implicit ranges, i.e. searching for 2016-11-02 is really looking for 2016-11-02T00:00:00Z <= x <= 2016-11-02T23:59:59Z?
3. Match only on the exact moment, and rely on the client to know that a birthday of '2016-11-02T16:30:12-04:00' really means 4:30PM EST, and does not mean just on November 2nd?
What's the established pattern / best practice here for distinguishing between dates and datetimes?
I have been studying REST best practices and standards for a while, and I can't recall reading anything about that apart from the use of the ISO standard. From your description it seems to be something that really depends on the application and its use cases.
I would go for your option #2: if a GET request comes with a date but no time, consider it a query for the whole day, and do the "conversion" in your GET response server code. Maybe you'd want to support both a "date" and a separate "time" query string parameters if the precise time might matter occasionally. This can also help you to keep clients "unaware" of the database storage format you choose, and may even allow you to support localized date formats.
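A small Python sketch of that whole-day interpretation. The function names are made up, and I use a half-open range (start inclusive, next midnight exclusive) instead of "<= 23:59:59" so that fractional seconds at the end of the day are not missed; that choice is mine, not something the question prescribes.

```python
from datetime import date, datetime, time, timedelta, timezone


def day_range_utc(date_text):
    """Expand 'YYYY-MM-DD' into a half-open [start, end) range in UTC."""
    day = date.fromisoformat(date_text)
    start = datetime.combine(day, time.min, tzinfo=timezone.utc)
    return start, start + timedelta(days=1)


def matches_day(stored_iso, date_text):
    """True if a stored ISO 8601 timestamp (with offset) falls on that UTC day."""
    start, end = day_range_utc(date_text)
    stored = datetime.fromisoformat(stored_iso)
    return start <= stored < end


print(matches_day("2016-11-02T16:30:12-04:00", "2016-11-02"))  # True
print(matches_day("2016-11-02T22:30:12-04:00", "2016-11-02"))  # False: already Nov 3 in UTC
```

The second call shows the catch: a value submitted late in the evening with a negative offset lands on the next UTC day.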
The problem here is the usage of UTC, which implies that there's a time associated with it. There's not: a birthdate is considered (in iCalendar) a 'floating date' and does not have a specific time associated with it.
If your birthdate is November 3rd, and you move to Australia, your birthdate does not actually change to November 2nd, because your birthdate does not have a time, does not have a timezone and is the same wherever you are in the world.
The solution is simple. If you allow users to submit a date/time for birthday searches, then you should just 'cut off' the time and timezone. Assume that you're only going to be using the date portion and just search your database based on that.
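In Python that 'cut off' could look roughly like this; the function name is made up, and the input is assumed to be an ISO 8601 date or date/time string as in the question.

```python
from datetime import date


def birthdate_from_input(value):
    """Keep only the calendar date; ignore any time or offset that follows."""
    # The first ten characters of an ISO 8601 date or date/time are YYYY-MM-DD.
    # Deliberately no conversion to UTC: the date stays exactly as written.
    return date.fromisoformat(value[:10])


print(birthdate_from_input("2016-11-02T16:30:12-04:00"))  # 2016-11-02
print(birthdate_from_input("2016-11-02"))                 # 2016-11-02
```

Storing the result in a plain date column then makes /users?birthdate=2016-11-02 an ordinary equality match.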
Ideally you don't allow users to submit a time at all, though. I think this just creates confusion. Just force API clients to submit a date only.
Those '5 laws' are an extreme over-simplification and don't apply to many situations.

Proper REST URI date range

Which date range format would be more preferable in a REST URI? I prefer the first one. However, I have seen compressed timestamps as well. Thoughts?
api.example.com/report/id?start=2015-08-07&end=2015-08-15
api.example.com/report/id?start=20150807&end=20150815
api.example.com/report/id?range=20150807-20150815
To state that it is proper or preferable would be overstating my case, but I have had success with the following, which is based upon the ISO 8601 format with .. as a range separator. I have used this both as a URL segment and for query parameters in the past. Pick whichever fits your requirements most closely; I would consider the time portion optional.
2012-01-01T00:00:00.000Z..2012-12-31T00:00:00.000Z
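Parsing such a range on the server is straightforward. Here is a hedged Python sketch; parse_iso and parse_range are made-up names, and both ends are assumed to use the same style (both with an offset or both without), since Python refuses to compare the two kinds.

```python
from datetime import datetime


def parse_iso(value):
    """Parse an ISO 8601 date or date/time; the time portion is optional."""
    # Older Python versions reject a trailing "Z", so spell it as an offset.
    if value.endswith("Z"):
        value = value[:-1] + "+00:00"
    return datetime.fromisoformat(value)


def parse_range(text):
    """Split 'START..END' and return both ends as datetimes."""
    start_text, separator, end_text = text.partition("..")
    if not separator:
        raise ValueError("expected START..END")
    start, end = parse_iso(start_text), parse_iso(end_text)
    if end < start:
        raise ValueError("range end is before range start")
    return start, end


start, end = parse_range("2012-01-01T00:00:00.000Z..2012-12-31T00:00:00.000Z")
print(start.isoformat(), end.isoformat())
```

The same function accepts the date-only form, e.g. parse_range("2015-08-07..2015-08-15").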

Cross Language/Cross Platform Date and Time Transfer

What is the best way to transfer dates and times across languages and platforms? I am using GWT on the client/browser side and .NET C# on the server, with JSON as the data-interchange format. I am currently storing all the dates and times on the server as .NET DateTime. I have noticed that if I use the GWT DatePicker or DateBox to pick a date and send it to the server as milliseconds (by doing date.getTime()), where the server takes this parameter as a DateTime, I see an hour offset due to BST. I have situations where the date and time must be in separate boxes on the UI, and setting the time along with the correct date is crucial because of scheduling.
The best way to interchange Date and Time values would be to serialize them into culture-independent, UTC based strings like: 2010-09-18T18:37:11. The problem is, Date and Time related operations tend to be implemented incorrectly...
As for your problem, I assume that it pops up during deserialization of the JSON time, i.e. .NET treats this time as local (DateTimeKind.Local or DateTimeKind.Unspecified) and converts it. I'm not sure how best to deal with it; the brute-force approach would probably be to send a serialized string like the one above and deserialize it manually like this:
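// AssumeUniversal alone still converts the result to local time; AdjustToUniversal keeps it in UTC.
DateTime date = DateTime.Parse(dateString, CultureInfo.InvariantCulture, DateTimeStyles.AssumeUniversal | DateTimeStyles.AdjustToUniversal);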
I'd recommend using a standard such as ISO 8601 to transfer date/time info in string form. At my company, date/time information encoded in a JSON object is almost always in this format, e.g. "2015-10-12T18:41:11+01:00". This string can be parsed and understood correctly by clients in different programming languages (Obj-C, Java, C/C++).
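To illustrate the idea in a neutral language, here is a small Python sketch of both wire formats mentioned in this thread: epoch milliseconds (what date.getTime() produces) and ISO 8601 strings with an explicit offset. The function names are made up; rejecting values without an offset is my assumption about what a strict endpoint would want.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def from_epoch_millis(millis):
    """Interpret milliseconds since the Unix epoch as a UTC instant."""
    return EPOCH + timedelta(milliseconds=millis)


def to_wire(moment):
    """Serialize an aware datetime as an ISO 8601 string in UTC."""
    if moment.tzinfo is None:
        raise ValueError("refusing to serialize a datetime without an offset")
    return moment.astimezone(timezone.utc).isoformat()


def from_wire(text):
    """Parse an ISO 8601 string and insist on an explicit offset."""
    moment = datetime.fromisoformat(text)
    if moment.tzinfo is None:
        raise ValueError("timestamp has no UTC offset")
    return moment


print(to_wire(from_wire("2015-10-12T18:41:11+01:00")))  # 2015-10-12T17:41:11+00:00
```

Because every value carries its offset, the receiving side never has to guess whether a time was meant as local or UTC, which is where the one-hour BST shift comes from.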

Should dateTime elements include time zone information in SOAP messages?

I've been searching for a definitive answer to this, and the XML Schema datatypes document seems to suggest that timezones are accepted, yet I found at least one implementation which does not properly convert time zones (NUSOAP).
To make sure that the problem is not at my end, I'd like to know if a format such as 2009-11-05T11:53:22+02:00 is indeed valid and should be parsed with timezone information, i.e. as 2009-11-05T13:53:22.
Given the following sentences from the w3c schema documentation:
"Local" or untimezoned times are presumed to be the time in the timezone of some unspecified locality as prescribed by the appropriate legal authority;
and
When a timezone is added to a UTC dateTime, the result is the date and time "in that timezone".
it does not sound like there is a definitive answer to this. I would assume that it is the usual ambiguity: Both versions are principally valid, and the question of what version to use depends on the configuration/behavior/expectations of the system one is interfacing with.
And even if there were a definitive answer, I would definitely not rely on it, but rather expect that every other web service and library has its own way of dealing with this :/
You converted the timezone incorrectly.
2009-11-05T11:53:22+02:00
is equivalent to
2009-11-05T09:53:22Z
Is that what NUSOAP did?
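You can check the arithmetic with a few lines of Python (standard library only; the function name is made up). The offset says how far local time is ahead of UTC, so it is subtracted, not added, to get back to UTC.

```python
from datetime import datetime, timezone


def to_utc(text):
    """Convert an ISO 8601 / xsd:dateTime string with an offset to UTC."""
    moment = datetime.fromisoformat(text)
    if moment.tzinfo is None:
        raise ValueError("no offset given, so the UTC instant is unknown")
    return moment.astimezone(timezone.utc)


print(to_utc("2009-11-05T11:53:22+02:00").isoformat())  # 2009-11-05T09:53:22+00:00
```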