Address Unification - forms

I'm creating a business directory where I need to display results based on area and keywords. The problem is the scope might span countries with fairly irregular address structures. I currently have the following form fields (and their respective database fields):
Fields (All required):
- Address 1
- Address 2
- Area (key search criterion)
- Keywords (key search criterion)
The problem is I'm not sure how reliable this setup is. I would have to rely on the entered data being accurate enough for searches to work, and that goes against the principle of validating everything before inserting it into the database. Is there a standard way of looking up areas across countries? And if so, how?

I decided to solve this by running (and verifying) addresses through batch geocoding, which converts the addresses to geocodes (latitude/longitude pairs) one can use with mapping plugins. There seem to be many solutions in this regard (Google "batch geocode addresses"), although you may have to research further for accuracy. Though I initially started with OpenLayers for mapping, I found Leaflet faster to understand and deploy (with emphasis on mobile); that said, I am speaking from my own experience of learning it and being able to implement it in time.
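For anyone looking for a starting point, here is a minimal batch-geocoding sketch in Python using the geopy library with the free Nominatim service; this is just one possible stack, the field names mirror the form above, and accuracy checking is left out:

```python
# Minimal batch-geocoding sketch using geopy + Nominatim (free, rate-limited).
# Field names mirror the form fields above; accuracy checks are left out.
from geopy.geocoders import Nominatim
from geopy.extra.rate_limiter import RateLimiter

geolocator = Nominatim(user_agent="business-directory")  # identify your app
geocode = RateLimiter(geolocator.geocode, min_delay_seconds=1)  # respect usage policy

records = [
    {"address1": "1600 Amphitheatre Parkway", "address2": "", "area": "Mountain View, CA"},
    {"address1": "10 Downing Street", "address2": "", "area": "London"},
]

for rec in records:
    query = ", ".join(part for part in (rec["address1"], rec["address2"], rec["area"]) if part)
    location = geocode(query)
    if location:  # store lat/lon alongside the record for map plugins and radius search
        rec["lat"], rec["lon"] = location.latitude, location.longitude
    else:
        rec["lat"] = rec["lon"] = None  # flag for manual review instead of rejecting outright

print(records)
```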


CQRS projections, joining data from different aggregates via probe commands

In CQRS, when we need to create custom-tailored projections for our read models, we usually prefer "denormalized" projections (assume we are talking about projecting onto a DB). It is not uncommon for the information needed by the application/UI to come from different aggregates (possibly from different BCs).
Imagine we need a projected table to contain a customer's information together with her full address, and that Customer and Address are different aggregates in our system (possibly in different BCs). That is, addresses are generated and maintained independently of customers. In other words, when a new customer is created, there is no guarantee that an AddressCreatedEvent will be produced subsequently; this event may already have been processed prior to the creation of the customer. All we have at the time of CreateCustomerCommand is the UUID of an existing address.
We have several solutions here:
1. Enrich CreateCustomerCommand and the subsequent CustomerCreatedEvent to contain the full address of the customer (looking up this information on the fly from the UI or the controller). This way the projection handler will just update the table directly upon receiving CustomerCreatedEvent.
2. Use the addrUuid provided in CustomerCreatedEvent to perform an ad-hoc query in the projection handler to get the missing part of the address information before updating the table.
These are the commonly discussed solutions to this problem. However, as noted by many others, there are problems with each approach. Enriching events can be difficult to justify, as well described by Enrico Massone in this question, for example. Querying other views/projections (a kind of JOIN) will work but introduces coupling (see the same link).
I would like to describe another method here which, I believe, nicely addresses these concerns. I apologize beforehand for not giving proper credit if this is a known technique; honestly, I have not seen it described elsewhere (at least not as explicitly).
"A picture speaks a thousand words", as they say:
The idea is this:
1. We keep CreateCustomerCommand and CustomerCreatedEvent simple, with only an addrUuid attribute (no enriching).
2. In the API controller we send two commands to the command handler (aggregates): the first one, as usual, is CreateCustomerCommand, to create the customer and project the customer information together with addrUuid to the table, leaving the other columns (full address, etc.) empty for the time being. (Warning: see the update; we may have a concurrency issue here and need to issue the probe command from a Saga.)
3. Right after this, once we have obtained the custUuid of the newly created customer, we issue a special ProbeAddressCommand to the Address aggregate, triggering an AddressProbedEvent which encapsulates the full state of the address together with a special attribute probeInitiatorUuid, which is, of course, our custUuid from the previous command.
4. The projection handler then acts upon AddressProbedEvent by simply filling in the missing pieces of information in the table, looking up the required row by matching the provided probeInitiatorUuid (i.e. custUuid) and addrUuid.
So we have two phases: create the Customer and probe for the related Address. They are depicted in the diagram with (1) and (2) respectively.
Obviously, we can send as many such "probe" commands (in parallel) as needed by our projection: ProbeBillingCommand, ProbePreferencesCommand, etc., effectively populating or "filling in" the denormalized projection with missing data from each handled "probe" event.
The advantage of this method is that we keep the commands/events in the first phase simple (only UUIDs to other aggregates), while avoiding synchronous coupling (joining) of the projections. The whole approach has a nice EDA feel to it.
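To make the two phases concrete, here is a minimal Python sketch of the projection handler; the event names and attributes (addrUuid, probeInitiatorUuid) come from the description above, while the dataclasses and the dict standing in for the read-model table are illustrative assumptions:

```python
# Sketch of the two-phase projection handler described above. Event names and
# attributes (addrUuid, probeInitiatorUuid) follow the text; the storage layer
# (a plain dict keyed by custUuid) is a stand-in for the real read-model table.
from dataclasses import dataclass

@dataclass
class CustomerCreatedEvent:
    cust_uuid: str
    name: str
    addr_uuid: str  # only a reference -- no enrichment

@dataclass
class AddressProbedEvent:
    addr_uuid: str
    probe_initiator_uuid: str  # the custUuid that triggered the probe
    street: str
    city: str

read_model = {}  # custUuid -> projected row

def on_customer_created(e: CustomerCreatedEvent):
    # Phase 1: insert the row with address columns left empty for the time being.
    read_model[e.cust_uuid] = {"name": e.name, "addr_uuid": e.addr_uuid,
                               "street": None, "city": None}

def on_address_probed(e: AddressProbedEvent):
    # Phase 2: locate the row by probeInitiatorUuid + addrUuid, fill in the blanks.
    row = read_model.get(e.probe_initiator_uuid)
    if row and row["addr_uuid"] == e.addr_uuid:
        row["street"], row["city"] = e.street, e.city
```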
My question is then: is this a known technique? It seems I have not seen it described anywhere. And what can go wrong with this approach?
I would be more than happy to update this question with references to other sources which describe this method.
UPDATE 1:
There is one significant flaw with this approach that I can already see: the ProbeAddressCommand cannot be issued before the projection handler has had a chance to process CustomerCreatedEvent. But this is impossible to know from the API gateway (or controller).
The solution would probably involve a Saga, say CustomerAddressJoinProjectionSaga, which starts upon receiving CustomerCreatedEvent and only then issues ProbeAddressCommand. The Saga ends upon registering AddressProbedEvent, or, if many other aggregates are involved in probing, when all such events have been received.
So here is the updated diagram.
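For illustration, here is a rough Python sketch of such a saga; the command_bus dispatcher is hypothetical, the event/command shapes follow the sketch above, and a real saga framework would also persist its state between events:

```python
# Sketch of CustomerAddressJoinProjectionSaga from UPDATE 1. The saga starts on
# CustomerCreatedEvent, issues ProbeAddressCommand, and ends on AddressProbedEvent.
# `command_bus` is a hypothetical dispatcher; a production saga framework would
# also persist this state between events.
from dataclasses import dataclass

@dataclass
class ProbeAddressCommand:
    addr_uuid: str
    probe_initiator_uuid: str

class CustomerAddressJoinProjectionSaga:
    def __init__(self, command_bus):
        self.command_bus = command_bus
        self.pending = {}  # custUuid -> addrUuid still awaiting a probe reply

    def on_customer_created(self, event):
        # Per UPDATE 1, the saga (not the controller) issues the probe, so it is
        # only sent after CustomerCreatedEvent has actually flowed through the system.
        self.pending[event.cust_uuid] = event.addr_uuid
        self.command_bus.send(ProbeAddressCommand(
            addr_uuid=event.addr_uuid, probe_initiator_uuid=event.cust_uuid))

    def on_address_probed(self, event):
        # The saga ends once its (only) probe has been answered; with more
        # aggregates involved, it would end when all probe events are in.
        self.pending.pop(event.probe_initiator_uuid, None)
```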
UPDATE 2:
As noted by Levi Ramsey (see answer below) my example is rather convoluted with respect to the choice of aggregates. Indeed, Customer and Address are often conceptualized as belonging together (same Aggregate Root). So it is a better illustration of the problem to think of something like Student and Course instead, assuming for the sake of simplicity that there is a straightforward relation between the two: a student is taking a course. This way it is more obvious that Student and Course are independent aggregates (students and courses can be created and maintained at different times and different places in the system).
But the question still remains: how can we obtain a projection containing the full information about a student (full name, etc.) and the courses she is registered for (title, credits, the instructor's full name, prerequisites, etc.), all in the same table, if the UI requires it?
A couple of thoughts:
I question why address needs to be a separate aggregate, much less one in a different bounded context, in view of the requirement that customers have an address. If customer addresses are meaningful in some other bounded context (e.g. you want to know "which addresses have more customers"), then that context can subscribe to the events from the customer service.
As an alternative, if there's a particularly strong reason to model addresses separately from customers, why not have the read side prospectively listen for events from the address aggregate and store the latest address for a given address UUID, in case a customer ends up with that address? I would expect the reliability per unit of effort of that approach to be somewhat greater.
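A rough sketch of this alternative in Python, with made-up event shapes: the read side caches the latest state per address UUID, so customer rows can be completed locally without any probe commands:

```python
# Rough sketch of the alternative: the read side subscribes to address events
# prospectively and caches the latest state per address UUID, so a customer row
# can be completed locally whenever it arrives -- no probe commands needed.
# The event shapes are made up for illustration.
from dataclasses import dataclass

@dataclass
class AddressChangedEvent:          # hypothetical event from the Address aggregate
    addr_uuid: str
    street: str
    city: str

@dataclass
class CustomerCreatedEvent:
    cust_uuid: str
    name: str
    addr_uuid: str

latest_address = {}  # addrUuid -> latest known address state
customers = {}       # custUuid -> projected row

def on_address_changed(e: AddressChangedEvent):
    # Keep the newest state whether or not any customer references it yet.
    latest_address[e.addr_uuid] = {"street": e.street, "city": e.city}

def on_customer_created(e: CustomerCreatedEvent):
    addr = latest_address.get(e.addr_uuid, {})  # may still be empty; filled later
    customers[e.cust_uuid] = {"name": e.name, "addr_uuid": e.addr_uuid, **addr}
```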

Which features should be added for NER in search result snippets

I want to cluster queries with the help of the snippets the search engine results are currently returning for them. While using the noun phrases in the snippets worked well for Google results, I felt I should try a different approach for Bing snippets, and hence was going for named entity extraction.
I have identified the following entities that can be extracted as of now using standard tools:
- Person Names
- Organisation Names
- Locations
But I think I should be extracting more entities. Could anyone help me out here to identify more entities that may be useful?
This is an endless list, once you get to real data problems.
For example, dates are a common thing to extract. But booking codes (such as for airline tickets) or tracking codes (such as for parcels) are something Google Mail already recognizes and extracts.
I don't think this is a very good question for a Q&A site. You may also want to read more literature and see what kind of data you can get; which entities you want to extract is clearly data-driven. When analyzing log files, for example, you might be interested in extracting host names, IPs, usernames, and daemon/service names.
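As a starting point, here is a quick way to inventory what entity types an off-the-shelf tool already extracts, sketched with spaCy's small English model (one possible choice; any NER library with comparable labels would do):

```python
# Quick inventory of the entity types a standard tool extracts from a snippet.
# Uses spaCy's small English model:
#   pip install spacy && python -m spacy download en_core_web_sm
import spacy

nlp = spacy.load("en_core_web_sm")
snippet = ("Microsoft announced on March 3 that CEO Satya Nadella will visit "
           "Berlin, with tickets from $120.")

for ent in nlp(snippet).ents:
    print(ent.text, "->", ent.label_)  # e.g. ORG, PERSON, GPE, DATE, MONEY
```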

Database design decision: normalization or repetition?

I'm building a delivery system; so far, my design looks like this:
The problem is that, very frequently, I'll need a structure (array, JSON, objects...) that looks like this (very hierarchical):
The problem with this is that it creates a lot of repetition of StreetAddress, DeliveryPoint and Customer, since each Itinerary would create lots of them, and itineraries look very much like one another.
The good part is that everything would be pretty with just a few joins.
With the first schema, it would be very weird to create the second structure, but it's possible.
Any ideas on how to control the repetition and still get an easy to query schema for the above structure?
I'm using:
PostgreSQL 9.1
PHP 5.5
Symfony Framework Standard Edition 2.4.0-BETA1 (With Doctrine)
[In case anyone wants to know how I drew the schemas: www.gliffy.com]
Repetition and normalization are not always opposing questions.
Here's the basic problem:
Normalization doesn't care about repetition per se, but about functional dependency.
Repetition is the wrong question; functional dependency is the right question. In some of your cases it is remarkably hard to determine functional dependencies for addresses, because there are so many conventions out there, and even if you did, you'd still run into formatting issues.
A simple way to get to the bottom of this is to ask about the reasons why a given piece of data may change. Good, normalized design limits the reasons why a given piece of data may need to change. With that in mind, it looks like you need to store historical locations for customers, and it looks to me like you might want to do something slightly different.
Instead of:
Delivery -> Customer -> StreetAddress -> Itinerary
it looks to me like it would make more sense to have:
Customer -> StreetAddress
and
Delivery -> Itinerary -> StreetAddress
In this model you may have duplicate information, and you may need dates on the street address indicating when it is valid from and to, but that doesn't strike me as a normalization problem, especially given the normalization problems that addresses already pose. From there you can easily track the customer a given delivery was made to, while in your model it isn't clear you can track the street address or itinerary of a given delivery.
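Here is one possible way to express that split as PostgreSQL DDL (run here via psycopg2 to stay close to the PostgreSQL stack mentioned above; the table and column names are illustrative, and the valid_from/valid_to pair records address history):

```python
# Illustrative DDL for the suggested split, executed with psycopg2 against
# PostgreSQL. Table and column names are made up; valid_from/valid_to on the
# customer-address link records historical locations for customers.
import psycopg2

DDL = """
CREATE TABLE customer  (id serial PRIMARY KEY, name  text NOT NULL);
CREATE TABLE itinerary (id serial PRIMARY KEY, label text NOT NULL);

CREATE TABLE street_address (
    id    serial PRIMARY KEY,
    line1 text NOT NULL,
    line2 text,
    area  text NOT NULL
);

CREATE TABLE customer_address (        -- Customer -> StreetAddress, with history
    customer_id integer NOT NULL REFERENCES customer(id),
    address_id  integer NOT NULL REFERENCES street_address(id),
    valid_from  date NOT NULL,
    valid_to    date,                  -- NULL = current address
    PRIMARY KEY (customer_id, address_id, valid_from)
);

CREATE TABLE delivery (                -- Delivery -> Itinerary -> StreetAddress
    id           serial PRIMARY KEY,
    customer_id  integer NOT NULL REFERENCES customer(id),
    itinerary_id integer NOT NULL REFERENCES itinerary(id),
    address_id   integer NOT NULL REFERENCES street_address(id)
);
"""

with psycopg2.connect("dbname=deliveries") as conn:  # connection string assumed
    conn.cursor().execute(DDL)
```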

machine learning and code generator from strings

The problem: Given a set of hand-categorized strings (or a set of ordered vectors of strings), generate a categorization function to categorize more input. In my case, that data (or most of it) is not natural language.
The question: are there any tools out there that will do that? I'm thinking of some kind of reasonably polished, download-install-and-go kind of thing, as opposed to some library or a brittle academic program.
(Please don't get stuck on details as the real details would restrict answers to less generally useful responses AND are under NDA.)
As an example of what I'm looking at: the input I want to filter is computer-generated status strings pulled from logs, with error messages (as an example) being filtered based on who needs to be informed or what action needs to be taken.
Doing Things Manually
If the error messages are being generated automatically and the list of exceptions behind the messages is not terribly large, you might just want to have a table that directly maps each error message type to the people who need to be notified.
This should make it easy to keep track of exactly who/which-groups will be getting what types of messages and to update the routing of messages should you decide that some of the messages are being misdirected.
Typically, a small fraction of the types of errors make up a large fraction of error reports. For example, Microsoft noticed that 80% of crashes were caused by 20% of the bugs in their software. So, to get something useful, you wouldn't even need to start with a complete table covering every type of error message. Instead, you could start with just a list that maps the most common errors to the right person and routes everything else to a person for manual routing. Each time an error is routed manually, you could then add an entry to the routing table so that errors of that type are handled automatically in the future.
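A minimal sketch of such a routing table in Python; the error-type names and addresses are made up for illustration:

```python
# Minimal routing-table sketch: map known error types to recipients, send
# everything else to a human, and grow the table from the manual decisions.
# Error-type names and addresses are made up for illustration.
routing_table = {
    "DiskFullError":        ["ops@example.com"],
    "PaymentDeclined":      ["billing@example.com"],
    "NullPointerException": ["dev-team@example.com"],
}
FALLBACK = ["triage@example.com"]  # human router for unknown types

def route(error_type: str) -> list[str]:
    return routing_table.get(error_type, FALLBACK)

def learn(error_type: str, recipients: list[str]) -> None:
    # After a manual routing decision, remember it for next time.
    routing_table[error_type] = recipients

print(route("DiskFullError"))   # ['ops@example.com']
print(route("WeirdNewError"))   # ['triage@example.com']
learn("WeirdNewError", ["ops@example.com"])
print(route("WeirdNewError"))   # ['ops@example.com']
```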
Document Classification
Unless the error messages are being editorialized by the people who submit them and you want to use this information when routing them, I wouldn't recommend treating this as a document classification task. However, if this is what you want to do, here's a list of reasonably good packages for document classification, organized by programming language:
Python - To do this using the Python-based Natural Language Toolkit (NLTK), see the Document Classification section in the freely available NLTK book (a minimal sketch follows this list).
Ruby - If Ruby is more of your thing, you can use the Classifier gem. Here's sample code that detects whether Family Guy quotes are funny or not-funny.
C# - C# programmers can use nBayes. The project's home page has sample code for a simple spam/not-spam classifier.
Java - Java folks have Classifier4J, Weka, Lucene Mahout, and, as adi92 mentioned, Mallet.
Learning Rules with Weka - If rules are what you want, Weka might be of particular interest, since it includes a rule set based learner. You'll find a tutorial on using Weka for text categorization here.
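As promised above, here is a minimal sketch of the NLTK approach applied to log-style strings rather than natural language; the training data is made up, and a real set would come from your hand-categorized strings:

```python
# Minimal sketch of the NLTK approach from the Document Classification chapter,
# applied to log-style strings instead of natural language. The training data
# here is invented; a real set would come from your hand-categorized strings.
import nltk

def features(line: str) -> dict:
    # Bag-of-tokens features work even on non-natural-language strings.
    return {f"contains({tok})": True for tok in line.lower().split()}

train = [
    ("ERROR disk /dev/sda1 full",          "ops"),
    ("ERROR payment gateway timeout",      "billing"),
    ("WARN disk usage at 91 percent",      "ops"),
    ("ERROR card declined for order 1231", "billing"),
]

classifier = nltk.NaiveBayesClassifier.train(
    [(features(text), label) for text, label in train])

print(classifier.classify(features("ERROR disk quota exceeded")))  # likely 'ops'
```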
Mallet has a bunch of classifiers which you can train and deploy entirely from the command line.
Weka is nice too because it has a huge number of classifiers and preprocessors for you to play with.
Have you tried spam or email filters? By using text files that have been marked with appropriate categories, you should be able to categorize further text input. That's what those programs do, anyway, but instead of labeling your outputs as 'spam' and 'not spam', you could use other categories.
You could also try something involving AdaBoost for a more hands-on approach to rolling your own. This library from Google looks promising, but probably doesn't meet your ready-to-deploy requirements.

What is a RESTful resource in the context of large data sets, e.g. weather data?

So I am working on a web service to access our weather forecast data (10,000 locations, 40 parameters each, hourly values for the next 14 days = about 130 million values).
So I read all about RESTful services and their ideology.
I understand that a URL addresses a resource.
But what is a resource in my case?
The common use case is that you want to get the data for a couple of parameters over a timespan at one or more locations. Clearly, giving every value its own URL is not practical and would result in hundreds of requests. I have the feeling that my specific problem doesn't exactly fit the RESTful pattern.
Update: To clarify, there are two usage patterns for the service:
1. Raw data: rows and rows of data for several locations and parameters.
2. Interpreted data: the raw data calculated into symbols (suns and clouds, for example) and other parameters.
There is not one 'forecast'; different clients have different needs for the data.
The reason I think this doesn't fit the REST pattern is that while I can indeed have a 'forecast' resource, I still have to submit a lot of request parameters. So a simple GET request on a resource doesn't work; I end up POSTing data all over the place.
So I am working on a web service to access our weather forecast data (10,000 locations, 40 parameters each, hourly values for the next 14 days = about 130 million values). ... But what is a resource in my case?
That depends on the details of your problem domain. Simply having a large amount of data is not a good reason to avoid REST. There are smart ways and dumb ways to model and expose that data.
As you rightly see, your main goal at this point should be to understand what exactly a resource is. Knowing only enough about weather forecasting to follow the Weather Channel, I won't be much help here. It's for domain experts like yourself to make that call.
If you were to explain in a little more detail the major domain concepts you're working with, it might make it a little easier to give specific advice.
For example, one resource might be a Forecast. When weather people talk about forecasts, what words keep coming up? When you think about breaking a forecast down into smaller elements, what words do you use to describe the pieces?
Do this process recursively, and you'll probably be able to make a list of important terms. Don't forget that these terms can describe things or actions. Think about what these terms really mean, what data you can use to model them, how they can be aggregated.
At this point you'll have the makings of something you can start building a RESTful system around - but not before.
Don't forget that a RESTful system is not a data dump wrapped in HTTP - it's a hypertext-driven system.
Also don't forget that media types are the point of contact between your server and its clients. A media type is only limited by your imagination and can model datasets of any size if you're clever about it. It can contain XML, JSON, YAML, binary elements such as a Bloom Filter, or whatever works for the problem.
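To make the resource idea concrete, here is one hypothetical shape for a 'forecast' resource, sketched in Python with Flask; the URL structure and parameter names are pure illustration, but it shows how the time span, parameters, and location can select a representation via plain GETs instead of POSTs:

```python
# Hypothetical shape for a 'forecast' resource, sketched with Flask. The query
# string selects a representation of the resource (parameters, time span), so
# cacheable, bookmarkable GETs replace the POSTs described in the question.
from flask import Flask, jsonify, request

app = Flask(__name__)

@app.route("/locations/<int:loc_id>/forecast")
def forecast(loc_id):
    params = request.args.get("parameters", "temperature").split(",")
    start = request.args.get("from")  # e.g. 2024-05-01T00:00Z
    end = request.args.get("to")      # e.g. 2024-05-03T00:00Z
    # A real implementation would pull the requested slice from the data store.
    return jsonify({"location": loc_id, "parameters": params,
                    "from": start, "to": end, "values": []})

# Example request:
# GET /locations/42/forecast?parameters=temperature,wind&from=...&to=...
```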
Firstly, there is no once-and-for-all right answer.
Each valid URL is something that makes sense to query; think of them as equivalents to providing query forms for people looking for your data - that might help you narrow down the scenarios.
It is a matter of personal taste, and possibly of the toolkit you use, what goes into the basic URL path and what is encoded as parameters. The debate is a bit like the XML debate over putting values in elements vs. attributes. It is not always a rational or logically decided issue, nor will everybody be kind in their comments on your decisions.
If you are using a backend like Rails, that implies certain conventions. Even if you're not using Rails, it makes sense to work in the same way unless you have a strong reason to change. That way, people writing clients to talk to Rails-based services will find yours easier to understand and it saves you on documentation time ;-)
Maybe you can use forecast as the resource and go deeper to fine-grained services with XLink.
Would it be possible to do something like this? Since you have so many parameters, I was thinking you could relate them to a mix of ID/parameter combinations to decrease the URL size:
/WeatherForeCastService//day/hour
www.weatherornot.com/today/days/x // (where x is the number of days)
www.weatherornot.com/today/9am/hours/h // (where h is the number of hours)