NLP Date Parsing - date

I've been experimenting with a number of NLP text parsers, but have found that most fail at even some of the simplest tasks that occur in actual texts (that is, texts that aren't preprocessed to show how "great" the systems are). An example is the following:
From Sundays until Thursdays every week
I've yet to find a single parser that can parse this correctly. I've tried quite a number, including Stanford's SUTime. Can anyone recommend software that can handle dates in natural text?

I did not find one either when I went looking so I wrote my own. It's part of my natural language engine for .NET.
Here's what the demo shows when you enter that phrase (qualified to next week rather than every week - it can handle that too but it's infinite):
Some comments:
1) Handling all possible English-language temporal expressions is a huge task. I've been working on this problem for years to come up with a clean way to represent temporal expressions, plus the many rules needed to parse English expressions of time.
2) In addition to finding a way to represent typical calendar date-times and ranges of such, you also need ways to represent infinite sequences like 'every Monday', and half-infinite sequences like 'every weekday before ...'. And then you'll need an algebra on top of that for combining temporal expressions.
3) Temporal expressions are often ambiguous in the English language and interpretation may vary from culture to culture.
4) The result must often be interpreted in the context of the sentence and/or the conversation history. "Who called Monday?" is a different Monday from "Remind me on Monday" and is different again from "Show me statistics for Monday".
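To make point 2 a bit more concrete, here is a minimal Python sketch (not the .NET engine mentioned above) of one way to represent infinite and half-infinite date sequences lazily, with a tiny "algebra" for bounding and combining them. The names `every_weekday`, `before` and `union` are made up for this illustration, and it assumes the sequences being merged are ascending and do not overlap.

```python
import heapq
from datetime import date, timedelta
from itertools import islice, takewhile

def every_weekday(weekday, start):
    """Infinite sequence: every date on or after `start` that falls on
    `weekday` (Monday=0 ... Sunday=6), in ascending order."""
    d = start + timedelta(days=(weekday - start.weekday()) % 7)
    while True:
        yield d
        d += timedelta(days=7)

def before(seq, limit):
    """Half-infinite to finite: keep dates strictly before `limit`.
    Assumes `seq` is ascending."""
    return takewhile(lambda d: d < limit, seq)

def union(*seqs):
    """Combine ascending sequences into one ascending sequence, lazily.
    Assumes the inputs do not share dates (duplicates are not removed)."""
    return heapq.merge(*seqs)

# "From Sundays until Thursdays every week", starting Monday 2024-01-01:
sun_to_thu = union(*(every_weekday(w, date(2024, 1, 1)) for w in (6, 0, 1, 2, 3)))
for d in islice(sun_to_thu, 5):
    print(d.isoformat())  # 2024-01-01 .. 2024-01-04, then 2024-01-07
```

Because everything is a generator, 'every week' never has to be materialized; the caller decides how much of the sequence to consume.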

Related

drools working with dates

In the official documentation I can't find any information on how to write conditional statements for fact fields of type java.util.Date in guided rules. For example, how do I compare such a field to the current date, check whether it is equal while ignoring the time, or check whether the date is before some time from now?
Drools isn't a real-time program and it doesn't have an innate idea of Time or Now. If you need to investigate relations of some fact property w.r.t. some point of time X, you'll have to establish a fact carrying X as its data, and write your rules based on that.
A more or less coarse approximation of a fact representing Now can be made using timers. You can implement a rule that modifies a fact containing a value representing Time (e.g. java.util.Date) every second, or less frequently.
Stripping out the time of day is something you'll have to do using Java or DRL functions. Alternatively, if it is days you are interested in, use some custom class representing days, with some suitable day 1 defined by you.
You can write a condition like
inputDate>=11-Nov-2014
and supply your current date through the inputDate field of the rule's input fact.

Tool or technique to compare and group diffs by similarity

I have developed a system that allows visitors to submit typo corrections for my blog. It works by having a small client-side app which then sends unified diffs to a server. Behind that, I have an interface which allows me to see all diffs in a nice graphical way, sort them, etc.
However I am thinking that as time passes, many visitors will submit corrections for the same things before I have time to fix them. So I would need a way to group similar or identical diffs together.
Identical diffs are easy enough. But there might be people who fix errors differently, e.g. using American or British spellings, different rules for punctuation, varying understandings of unclear phrases, that kind of thing. Grouping similar diffs would be tremendously helpful.
Are there techniques, algorithms, or tools that are specifically designed or can be used to compute the similarity of diffs?
I believe that you have two problems to solve: 1. recognizing fixes for the same text (e.g. the same typo location), and 2. grouping all the patches related to that location, potentially removing those with the same or nearly equal solutions.
Problem 1. The unified diff format is somewhat OK as it gives the lines, but a word-level or character-level diff (for example, counting each word as a line, as wdiff does) might be more precise and help you group the patches more precisely.
Problem 2. If the patches are identical, as you noted, it is trivial. If they are different, solving problem 1 has already done much of the work. You can of course apply a normalization such as "inflected word part removal" (removing 's', 'ing' and so on at the end of words, for example) or "lower casing" to the replacement parts of the unified diffs before comparing them, thus helping to group together nearly identical solutions.
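As a rough illustration of both steps, here is a minimal Python sketch using the standard library's `difflib.SequenceMatcher` on word lists. It assumes you have the original and the corrected text available (e.g. by applying each submitted diff); `word_changes`, `normalize` and `group_corrections` are hypothetical helper names, and the normalization shown is only lower-casing plus punctuation stripping, not stemming.

```python
import difflib
from collections import defaultdict

def word_changes(original, corrected):
    """Return (start, end, replacement_words) for every changed span,
    where start/end are word indices into the original text."""
    a, b = original.split(), corrected.split()
    matcher = difflib.SequenceMatcher(a=a, b=b, autojunk=False)
    return [(i1, i2, tuple(b[j1:j2]))
            for tag, i1, i2, j1, j2 in matcher.get_opcodes()
            if tag != "equal"]

def normalize(words):
    """Lower-case and strip surrounding punctuation so that nearly
    identical replacements compare equal."""
    return tuple(w.lower().strip(".,;:!?\"'") for w in words)

def group_corrections(original, corrections):
    """Group corrected texts by location first (problem 1), then by
    normalized replacement (problem 2)."""
    groups = defaultdict(lambda: defaultdict(list))
    for text in corrections:
        for i1, i2, replacement in word_changes(original, text):
            groups[(i1, i2)][normalize(replacement)].append(text)
    return groups

groups = group_corrections(
    "I recieve the mail",
    ["I receive the mail", "I Receive the mail", "I get the mail"],
)
for location, solutions in groups.items():
    for replacement, texts in solutions.items():
        print(location, replacement, len(texts))
```

All three submissions land under the same location (word 1), and the two that differ only in capitalization collapse into one solution.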
Problem 1 is the problem posed by integrating or merging patches. Problem 2 is more relevant to your particular case.
Maybe you could adopt the Damerau-Levenshtein algorithm. It is used to calculate the distance between two strings.
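For reference, a minimal Python sketch of that idea. Note this is the "optimal string alignment" variant (often called restricted Damerau-Levenshtein), which counts an adjacent transposition as one edit but never edits the same substring twice; the function name `osa_distance` is just for this example.

```python
def osa_distance(a, b):
    """Optimal string alignment distance between strings a and b:
    insertions, deletions, substitutions and adjacent transpositions
    each cost 1."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i  # delete all of a[:i]
    for j in range(cols):
        d[0][j] = j  # insert all of b[:j]
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution or match
            )
            # adjacent transposition, e.g. "ie" <-> "ei"
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[-1][-1]

print(osa_distance("recieve", "receive"))  # 1 (one transposition)
print(osa_distance("colour", "color"))     # 1 (one deletion)
```

Two replacement strings could then be put in the same group when their distance is below some threshold you choose, for example 1 or 2 for single words.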

How do you represent forever (infinitely in the future) in iso8601?

An API defines that a date should be sent as iso8601, but we have a requirement to send "forever" as a date, and the standard does not seem to cover this. Can anyone suggest a better solution than Dec 31 9999? Is there a different standard that would be more appropriate?
Quoting ISO 8601:2004(E):
3.5 Expansion
By mutual agreement of the partners in information interchange, it is permitted to expand the component
identifying the calendar year, which is otherwise limited to four digits. This enables reference to dates and
times in calendar years outside the range supported by complete representations, i.e. before the start of the
year [0000] or after the end of the year [9999].
And also relevant may be section 3.7 Mutual agreement which basically says you're free to define your own representations as long as you don't interfere with the representations defined in ISO 8601. So 9999-12-32 or 9999-13-00 could be mutually agreed upon for your proposed forever value.
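A minimal Python sketch of what such a mutually agreed sentinel could look like on the wire. The value `9999-13-00` and the helper names are only assumptions for illustration; the point is that the sentinel is deliberately not a valid calendar date, so it has to be checked before the string reaches a standard parser.

```python
from datetime import date

# Assumption: both parties have agreed that this string means "forever".
# It is intentionally not a valid ISO 8601 calendar date.
FOREVER = "9999-13-00"

def encode_expiry(d):
    """Serialize a date for the API; None stands for 'forever'."""
    return FOREVER if d is None else d.isoformat()

def decode_expiry(s):
    """Parse a date from the API; returns None for 'forever'."""
    if s == FOREVER:
        return None
    return date.fromisoformat(s)  # raises ValueError on anything else invalid

print(encode_expiry(None))              # 9999-13-00
print(decode_expiry("2024-02-29"))      # 2024-02-29
print(decode_expiry(FOREVER) is None)   # True
```

A receiver that does not know about the agreement will reject the value outright instead of silently treating it as a real date, which is the main trade-off against using Dec 31 9999.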
As to what's common practice, I'd say it depends.
I'd go for 3.7 whenever possible. But it's important to assess your role within the whole set-up. E.g. if you're using a 3rd party API within your own set of components for the sake of convenience or future compatibility, there should be no problem at all. If you're part of a bigger system and you'd have to convince tens of other system parties/components/modules/etc. I'd say it's not worth the trouble.
Also very important to check legacy code. And at least sketch out a plan on how to do the migration in case it breaks set-ups beyond belief. That could be anything from documenting your API "extension" to actually sending patches to the legacy code maintainers.

What's a good data structure for periodic or recurring dates?

Is there a published data structure for storing periodic or recurring dates? Something that can handle:
1) The pump needs recycling every five days.
2) Payday is every second Friday.
3) Thanksgiving Day is the second Monday in October (US: the fourth Thursday in November).
4) Valentine's Day is every February 14th.
5) Solstice is (usually) every June 21st and December 21st.
6) Easter is the Sunday after the first full moon on or after the day of the vernal equinox (okay, this one's a bit of a stretch).
I reckon cron's internal data structure can handle #1, #4, #5 (two rules), and maybe #2, but I haven't had a look at it. MS Outlook and other calendars seem to be able to handle the first five, but I don't have that source code lying around.
Use an iCalendar implementation library (implementations exist for Ruby, Java, PHP, Python and .NET), and then add support for calculating special dates.
With all these variations in the way you specify the recurrence, I would shy away from one single data structure implementation to accommodate all 5 scenarios.
Instead, I would (and have for a previous project) build simple structures that address each type of recurrence. You could wrap them all up so that it feels like a single data structure, but under the hood they could do whatever they like. By implementing an interface, I was able to treat each type of recurrence similarly so it felt like a one-size-fits-all data structure. I could ask any instance for all the dates of recurrence within a certain time frame, and that did the trick.
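Roughly, that approach might look like the following Python sketch. This is not the code from that project; the class names and the `occurrences(start, end)` method are assumptions made up for the example, and `FixedAnnual` assumes the given month and day exist in every year (so not February 29th).

```python
from abc import ABC, abstractmethod
from datetime import date, timedelta

class Recurrence(ABC):
    @abstractmethod
    def occurrences(self, start, end):
        """Return all dates d with start <= d <= end, ascending."""

class EveryNDays(Recurrence):
    """E.g. 'every five days', or 'every second Friday' with n=14."""
    def __init__(self, anchor, n):
        self.anchor, self.n = anchor, n

    def occurrences(self, start, end):
        offset = (start - self.anchor).days
        k = max(0, -(-offset // self.n))  # ceiling division, never before anchor
        d = self.anchor + timedelta(days=k * self.n)
        result = []
        while d <= end:
            result.append(d)
            d += timedelta(days=self.n)
        return result

class YearlyRule(Recurrence):
    """Base for rules that produce exactly one date per year."""
    def in_year(self, year):
        raise NotImplementedError

    def occurrences(self, start, end):
        candidates = (self.in_year(y) for y in range(start.year, end.year + 1))
        return [d for d in candidates if start <= d <= end]

class FixedAnnual(YearlyRule):
    """E.g. 'every February 14th'. Assumes the day exists every year."""
    def __init__(self, month, day):
        self.month, self.day = month, day

    def in_year(self, year):
        return date(year, self.month, self.day)

class NthWeekdayOfMonth(YearlyRule):
    """E.g. 'fourth Thursday in November' (weekday: Monday=0)."""
    def __init__(self, month, weekday, n):
        self.month, self.weekday, self.n = month, weekday, n

    def in_year(self, year):
        first = date(year, self.month, 1)
        offset = (self.weekday - first.weekday()) % 7
        return first + timedelta(days=offset + 7 * (self.n - 1))

rules = {
    "pump": EveryNDays(date(2024, 1, 1), 5),
    "valentine": FixedAnnual(2, 14),
    "us_thanksgiving": NthWeekdayOfMonth(11, 3, 4),
}
for name, rule in rules.items():
    print(name, rule.occurrences(date(2024, 11, 1), date(2024, 11, 30)))
```

The caller only ever sees `occurrences`, so something irregular like Easter can be added later as one more class without touching the others.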
I'd also want to know more about how these dates need to be used before settling on a specific implementation.
If you want to build a data structure by hand, I'd recommend a hash table, where the holidays or events are the keys and the new date of occurrence is the value. If an event has multiple occurrences, the hashed value could instead locate a section in a linked list that holds all of the occurrences (this would make both finding and insertion run in O(1)).

Temporal Extraction (i.e. Extract date/time entities from free form text) - How?

Has anyone found a simple but effective way to extract date references from text? I've done a fair amount of searching for temporal extraction tools, but there isn't a lot out there. There are a few white papers, but the topic seems to fall into a subset of the whole semantic web thingy and isn't given much attention.
I'm just looking for something that is 80% effective. There is no need to capture things like "the month after Jan 2009", but basic common date entities would be nice.
I'm open to all suggestions, even fancy regex expressions.
Fire away!
(and thanks - Henry)
If the target temporal expressions in your data only come in a limited set of formats, use regular expressions and an iterative approach to refine your system.
Otherwise, use SUTime from the Stanford NLP toolkit, which might be overkill but will definitely meet your demands.
One way I have done this is to just look for anything that is 4 digits and convert it to a number. If the number falls within the range of years you are interested in, you probably have a year you can use. If you are interested in any matching months and days, you could check adjacent words to see if they are a month name or a number between 1 and 31. I am confident this would satisfy your 80% requirement.
Regex for years: [0-9]{4} - you will need to convert to a number and see if it's within the range of years you consider valid.
Regex for months: jan|january|feb|february ... etc for each month
Regex for days of the month: [0-9]{1,2} - you would need to convert to a number and see if it is 1-31
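Putting those three pieces together, a minimal Python sketch might look like this. It only handles the "Month [day,] year" word order, the year range 1900 to 2100 is an arbitrary assumption, and `extract_dates` is a made-up name; treat it as a starting point for the 80% case rather than a complete extractor.

```python
import re

MONTHS = {"jan": 1, "feb": 2, "mar": 3, "apr": 4, "may": 5, "jun": 6,
          "jul": 7, "aug": 8, "sep": 9, "oct": 10, "nov": 11, "dec": 12}

# Month name (short or long), optional day of month, then a 4-digit year.
DATE_RE = re.compile(
    r"\b(jan(?:uary)?|feb(?:ruary)?|mar(?:ch)?|apr(?:il)?|may|june?|july?"
    r"|aug(?:ust)?|sep(?:t(?:ember)?)?|oct(?:ober)?|nov(?:ember)?|dec(?:ember)?)"
    r"\.?\s+(?:(\d{1,2})(?:st|nd|rd|th)?,?\s+)?(\d{4})\b",
    re.IGNORECASE,
)

def extract_dates(text, min_year=1900, max_year=2100):
    """Return (year, month, day) tuples found in text; day is None when
    only a month and a year were given."""
    results = []
    for m in DATE_RE.finditer(text):
        month_name, day, year = m.group(1), m.group(2), int(m.group(3))
        if not min_year <= year <= max_year:
            continue  # 4 digits, but not a year we consider valid
        if day is not None and not 1 <= int(day) <= 31:
            continue  # number next to the month is not a day of the month
        month = MONTHS[month_name[:3].lower()]
        results.append((year, month, int(day) if day else None))
    return results

text = "The meeting moved from Jan 5, 2009 to March 2010; call 5551234567 later."
print(extract_dates(text))  # [(2009, 1, 5), (2010, 3, None)]
```

The range checks after the match do the "convert to a number and see if it's valid" step, so the regex itself can stay fairly loose.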
I'm drawing a blank on how to find what to feed it, but this library will parse a wide range of dates and could be used as the "is this a real date" function. (Full disclosure, I'm the author of that lib)