How can I match up user inputs to ambiguous city names? - postgresql

We have a set of tables shown below we use for our other tables to reference for location data. Some examples are:
Find all companies within X miles of X City
Create a company profile's location as X City
We solve the problem of multiple cities with similar names by matching on State as well, but now we have run into a different set of problems. We use Google's Place Autocomplete both for geocoding and for matching a user's query against our cities. This works fairly well until Google's format deviates from ours.
Example:
St. Louis !== Saint Louis and
Ameca del Torro !== Ameca Torro
Is there a way to fuzzy match cities in our queries?
Our query to match cities now looks like:
SELECT c.id
FROM city c
INNER JOIN state s
ON s.id = c.state_id
WHERE c.name = 'Los Angeles' AND s.short_name = 'CA'
I've also considered denormalizing city and simply storing coordinates on each company, which would still allow the radius search. We have around 2 million rows in our company table now, so the radius search would run on that table directly rather than on the city table with a JOIN on company. This would also mean we couldn't (easily, anyway) create custom regions for cities or add other attributes to cities in the future.
I found this answer but it is basically affirming our way of normalizing input is a good method, but not how we match to our local Table (unless Google offers a City Name export I don't know about).

The short answer is that you can use Postgres's full text search functionality, with a customized search configuration.
Since you're dealing with place names, you probably want to avoid stemming, so you can use the simple configuration as a starting point. You can also add stop-words that make sense for place names (with the examples above, you can probably consider "St.", "Saint", and "del" as stop-words).
A pretty basic outline of setting up your customized configuration is below:
1. Create a stopwords file and put it in your $SHAREDIR/tsearch_data Postgres directory. See https://www.postgresql.org/docs/9.1/static/textsearch-dictionaries.html#TEXTSEARCH-STOPWORDS.
2. Create a dictionary that uses this stopwords list (you can probably use pg_catalog.simple as your template dictionary). See https://www.postgresql.org/docs/9.1/static/textsearch-dictionaries.html#TEXTSEARCH-SIMPLE-DICTIONARY.
3. Create a search configuration for place names. See https://www.postgresql.org/docs/9.1/static/textsearch-configuration.html.
4. Alter your search configuration to use the dictionary you created in Step 2 (cf. the link above).
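To get a feel for what such a configuration does to a place name before setting it up in Postgres, here is a minimal Python sketch of the same idea: lowercase the name, split it into tokens, and drop stop-words. The stop-word list and the function name place_key are assumptions made for illustration; this only mimics the behaviour and is not how Postgres implements it.

```python
import re

# Hypothetical stop-word list for place names, taken from the examples above.
STOPWORDS = {"st", "saint", "del"}


def place_key(name: str) -> tuple[str, ...]:
    """Lowercase the name, split on non-alphanumerics, and drop stop-words."""
    tokens = re.findall(r"[^\W_]+", name.lower())
    return tuple(token for token in tokens if token not in STOPWORDS)


# Both spellings reduce to the same key, so they would match each other.
print(place_key("St. Louis") == place_key("Saint Louis"))
print(place_key("Ameca del Torro") == place_key("Ameca Torro"))
```

The trade-off is visible here too: with "St." and "Saint" removed, a bare "Louis" would also match, which is one more reason to keep filtering on State.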
Another consideration is how to consider internationalization. It seems that the issue for your second example (Ameca del Torro vs. Ameca Torro) might be a Spanish vs. English representation of the name. If that's the case, you could also consider storing both a "localized" and "universal" (e.g. English) version of the city name.
At the end, your query (using full-text search) might look like this (where the 'places' is the name of your search configuration):
SELECT cities."id"
FROM cities
INNER JOIN "state" ON "state".id = cities.state_id
WHERE
"state".short_name = 'CA'
AND TO_TSVECTOR('places', cities.name) @@ TO_TSQUERY('places', 'Los & Angeles')

Related

Search selected tables in database and line up results?

I'm trying to build a search page on my case management system, but I'm struggling to get something dynamic in place.
I would like to search multiple tables for a given string.
And return a list of cases these refer to.
Ie I have 3 tables (the project include several more, but to explain I just use 3).
1: Case main table, including caseID, title, and description.
2: notes table, including a ntext note field.
This is 1:* from the case table so each case can have multiple notes
3: address table, including street and city for the case
This is also 1:* from the case table so each case can have multiple addresses
I would like to search for ie "Sunset Boulevard", and if the string is found in either the case title, the note or the address I would like to return the list of cases that match.
I can do a normal SELECT statement to get the caseID and Title, and in the WHERE clause specify which tables to include, i.e.:
SELECT CaseID, Title
FROM Cases
WHERE Cases.caseID IN ( SELECT CaseID FROM notes WHERE Notes.note like '%Sunset boulevard%' )
OR Cases.caseID IN ( SELECT CaseID FROM address WHERE address.address1 like '%Sunset boulevard%' )
And then expand the where clause to all columns I want to search.
But that won't give me any hint on where the searched string has been found.
I also found another article here https://stackoverflow.com/a/709137
and could use this to search entire database for fields, but this will still not give me a list of cases.
Anyone got a "best practice" for creating small search engine on website?
Best practice is to move such heavy search functionality out of the OLTP database and into a dedicated search engine, e.g. Solr, Sphinx, or Elasticsearch.
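If the search has to stay in the database for the time being, one way to also see where the string was found is to run one SELECT per table and combine them with UNION ALL, tagging each branch with its source. The following is only a sketch, not the approach recommended above: it uses Python's built-in sqlite3 module so it can run standalone, and the table layout, sample rows, and the found_in column are assumptions based on the question.

```python
import sqlite3

# In-memory database with a hypothetical version of the three tables.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE cases   (case_id INTEGER PRIMARY KEY, title TEXT, description TEXT);
    CREATE TABLE notes   (case_id INTEGER, note TEXT);
    CREATE TABLE address (case_id INTEGER, address1 TEXT, city TEXT);
    INSERT INTO cases   VALUES (1, 'Noise complaint', ''),
                               (2, 'Sunset Boulevard dispute', '');
    INSERT INTO notes   VALUES (1, 'Caller lives near Sunset Boulevard');
    INSERT INTO address VALUES (1, '12 Sunset Boulevard', 'Los Angeles');
""")

# One branch per searched column; each branch labels where the hit came from.
SEARCH_SQL = """
    SELECT case_id, 'title'   AS found_in FROM cases   WHERE title    LIKE :q
    UNION ALL
    SELECT case_id, 'note'    AS found_in FROM notes   WHERE note     LIKE :q
    UNION ALL
    SELECT case_id, 'address' AS found_in FROM address WHERE address1 LIKE :q
    ORDER BY case_id, found_in
"""


def search_cases(term: str) -> list[tuple[int, str]]:
    """Return (case_id, found_in) pairs for every table that contains the term."""
    return conn.execute(SEARCH_SQL, {"q": f"%{term}%"}).fetchall()


for case_id, found_in in search_cases("sunset boulevard"):
    print(case_id, found_in)
```

Each extra table or column becomes one more UNION ALL branch, which is manageable for a handful of tables but is exactly the kind of growth a dedicated search engine handles better.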

Project Academic Knowledge | Query for and list papers by AA.AuId?

I've got a list of author names but I don't have Id's for any of them.
I'd like to:
Query by author name and store the most probable AuId.
List all papers written by a given AuId.
Is there any way to do this with the current interpret/evaluate APIs? It seems like everything is tied to a paper entity and I want to be sure I am only ever selecting and using one AuId.
Thanks.
I am not aware of such a feature. But indirectly, you could first search for the author name (AA.AuN in the expr-field), obtain all the (unique) various author IDs (AA.AuId in the attributes field), and search for their publications.
(You could even add orderby=logprob:desc, but to be honest, I am not 100% sure what logprob does.)
So, the first step could be to search for the author name (e.g. John Smith) like this and fetch all those AA.AuId where the names (AA.AuN) seem to fit John Smith (let's just add the orderby=logprob:desc):
https://api.labs.cognitive.microsoft.com/academic/v1.0/evaluate?&expr=Composite(AA.AuN=%27john%20smith%27)&count=100&attributes=AA.AuN,AA.AuId&orderby=logprob:desc&subscription-key={YOUR-KEY}
As a second step, if you have an Author ID AA.AuId (here, for example, 3038752200), use this to list their papers (ordered by year, in a descending manner orderby=Y:desc):
https://api.labs.cognitive.microsoft.com/academic/v1.0/evaluate?&expr=Composite(AA.AuId=3038752200)&count=100&attributes=AA.AuN,AA.AuId,DOI,Ti,VFN,Y&orderby=Y:desc&subscription-key={YOUR-KEY}
The approach would be more promising if you had an institutional affiliation as well. Then you could change the expr field to Composite(And(AA.AuN='{AUTHOR-NAME}',AA.AfId={AFFILIATION-ID})) so as to search for all {AUTHOR-NAMES} affiliated to {AFFILIATION-ID}.
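The two requests above can be assembled programmatically. The sketch below only builds the URLs (it makes no network call) and collects unique author IDs from a response. The function names are made up for illustration, and the response shape used in unique_author_ids (an "entities" list whose items carry an "AA" list of author objects) is an assumption to check against a real response.

```python
from urllib.parse import quote, urlencode

BASE_URL = "https://api.labs.cognitive.microsoft.com/academic/v1.0/evaluate"


def evaluate_url(expr, attributes, orderby, key, count=100):
    """Build an evaluate URL with the same parameters as the examples above."""
    params = {
        "expr": expr,
        "count": count,
        "attributes": ",".join(attributes),
        "orderby": orderby,
        "subscription-key": key,
    }
    return BASE_URL + "?" + urlencode(params, quote_via=quote)


def author_search_url(name, key):
    """Step 1: search by author name (lowercased, as in the example URL)."""
    expr = f"Composite(AA.AuN='{name.lower()}')"
    return evaluate_url(expr, ["AA.AuN", "AA.AuId"], "logprob:desc", key)


def papers_url(author_id, key):
    """Step 2: list the papers of one author ID, newest first."""
    expr = f"Composite(AA.AuId={author_id})"
    attributes = ["AA.AuN", "AA.AuId", "DOI", "Ti", "VFN", "Y"]
    return evaluate_url(expr, attributes, "Y:desc", key)


def unique_author_ids(response, name):
    """Collect distinct AuIds whose AuN equals the name (assumed response shape)."""
    wanted = name.lower()
    seen = {}
    for entity in response.get("entities", []):
        for author in entity.get("AA", []):
            if author.get("AuN") == wanted and "AuId" in author:
                seen.setdefault(author["AuId"], None)
    return list(seen)


print(author_search_url("John Smith", "{YOUR-KEY}"))
print(papers_url(3038752200, "{YOUR-KEY}"))
```

Since each paper entity lists all of its co-authors, filtering on AuN in unique_author_ids keeps only the IDs that belong to the name you searched for.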

Partially matching a post code with Algolia

I've loaded a dataset into an Algolia search index. Each item in the index is a shop with a catchment area (the catchment area is just an array of UK Postcodes that a store covers). For example:
['DS4 6', 'DS4 7', 'DS5 8', 'DS6 9' ... ]
The search feature is working to a point. If people search for "DS4" then Algolia returns several stores, but most people are typing their full post code (for example DS4 8XX) and this isn't returning anything even though "DS4" is indexed several times.
Is there a configuration in Algolia to search for the first part of a word, even when a person has 'typed past it'?
To clarify this a bit further. I could store every single individual postcode in a catchment area but there are millions and millions of them. A full UK postcode would be "DS4 7EN", so there are two more characters on the end representing a street in the UK. I've got the first part of a postcode: eg "DS4 7" because it seems excessive to store everything when I only really care about the wider area, ie: DS4, DS5, CV43, AB2 (and so on).
I could also probably use a places api and geocode the address. But I already have this catchment area postcode data, so it seems a shame not to use it if I can.
Algolia, like most search engines, supports prefix search to allow search-as-you-type results. The InstantSearch libraries rely on this to update results live as the user types. Without prefix search, you would have to wait for the user to enter an entire word before displaying any meaningful result.
In your case, since the catchment areas are indexed, e.g., DS4 6, when a user types DS4 6XX, no records will match the query since the query acts as a filter on the records based on their searchable attributes.
That said, I see two possible workarounds that you can implement.
The first solution is to use the removeWordsIfNoResults index setting and set it to "Last Word" (the lastWords value in the API). This will remove the last word of the query if there are no results. For instance, with the query DS4 6XX it will remove 6XX to just keep DS4 and retrieve the items that match this query. Note that this solution relies on the fact that DS4 6XX has two words (separated by a space), and it won't work with DS46XX.
The second solution is to change the structure of the records to add the full postcode in each item of the index. Since these are shops, I believe that it should be possible. This way your users will be able to search for both the full postcode DS4 6XX and the catchment areas DS4 6. Unless I misunderstood your problem, I don't see the need to store the full list of postcodes associated to a catchment area.
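A complementary option, which is not one of the two workarounds above, is to shorten the query on the client before sending it to Algolia, so that a full postcode is reduced to the same "DS4 7" form that is indexed. This works because the second half of a full UK postcode is always one digit followed by two letters. The sketch below is an assumption-laden illustration: the function name to_sector and the simplified pattern are made up for this example, and unusual postcode formats are not covered.

```python
import re

# Simplified pattern for a full UK postcode: outward part (e.g. DS4, CV43),
# optional space, then one digit and two letters. Special cases are ignored.
_FULL_POSTCODE = re.compile(r"^([A-Z]{1,2}\d[A-Z\d]?)\s*(\d)[A-Z]{2}$")


def to_sector(query: str) -> str:
    """Reduce a full postcode to 'OUTWARD D'; return other input tidied up."""
    cleaned = " ".join(query.upper().split())
    match = _FULL_POSTCODE.match(cleaned)
    if match is None:
        # Partial input such as "DS4" is passed through for prefix search.
        return cleaned
    outward, sector_digit = match.groups()
    return f"{outward} {sector_digit}"


print(to_sector("DS4 7EN"))
print(to_sector("ds46xx"))
print(to_sector("DS4"))
```

Unlike the last-word removal, this also handles postcodes typed without a space, and the query keeps the sector digit, so it only matches the stores whose catchment area covers that sector.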

postgresql multiple column search with ranking

I want to search for multiple columns in multiple tables. Like this:
Given tables:
Users
id
first_name
last_name
email
Companies
user_id
address
Lands
name
company_id
Let's say the user is Johny Bravo (johny.bravo@gmail.com), working in Washington in the United States.
I want to find the record based on query
"ate" -> from United States, or
"rav" from Bravo
When I type "rav" my Johny Bravo rank is higher than Johny Bravos with other emails so it is first in results
How can I implement such functionality?
I've looked at tsvector and ts_rank, but it seems that only a right-hand wildcard is supported (to_tsquery('Brav:*') will work). I also don't need full-text-search features (I will be searching addresses and user names, so there is no need to alias names etc.). I could do a wildcard search, but then I would have to calculate the ranking manually in the application.
You could use pg_trgm extension.
You must have the contrib installed, then you install the extension:
create extension pg_trgm;
Then you can create trigram indexes:
create index users_trgm_idx on users using gist (user_data gist_trgm_ops);
You can then run a query that returns the 10 most similar values:
select * from users order by user_data <-> 'rav' limit 10;
Note that you can replace user_data with an immutable function, which can concatenate all of the info into one (text) field thus enabling search across more fields.
To get a "ranking score", you can use the similarity function, which returns 1 for identical strings and 0 for completely unrelated ones.
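To see roughly how such a score comes about, here is a small Python approximation of trigram similarity: each word is lowercased and padded with two leading spaces and one trailing space, split into three-character pieces, and the score is the number of shared trigrams divided by the number of distinct trigrams in both strings. This is a sketch for intuition only; the function names are made up, and the extension itself may differ in details.

```python
import re


def trigrams(text: str) -> set[str]:
    """Return the set of padded trigrams for every word in the text."""
    grams = set()
    for word in re.findall(r"[^\W_]+", text.lower()):
        padded = f"  {word} "
        grams.update(padded[i:i + 3] for i in range(len(padded) - 2))
    return grams


def similarity(a: str, b: str) -> float:
    """Shared trigrams divided by all distinct trigrams of both strings."""
    ta, tb = trigrams(a), trigrams(b)
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


# Rank hypothetical candidates for the query "rav", best match first.
candidates = ["Johny Bravo", "Johny Smith", "Bravo"]
ranked = sorted(candidates, key=lambda c: similarity("rav", c), reverse=True)
print(ranked)
```

Note that a short query inside a longer string scores low in absolute terms, because the padding trigrams at the word boundaries do not match. The ordering is still useful for ranking.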
If you need full text search across whole database, a better solution might be a separate search facility, such as Apache Solr.

Italicize a specified string inside of a field in FileMaker Pro

I administer a simple FileMaker Pro 12 database for a company. The current project we are working on requires us to italicize proper names. For example, if the database were a movie database, I would have the following caption:
Wendy,
Peter Pan
At the moment all captions like these are stored in one field. I would normally have two fields to separate the proper name from the character name, but doing so at this point would be very time consuming. I would like to make a script that italicizes the proper names in this field by looping through an array of proper names and italicizing each name when a match is found. This would be extremely useful. Normally I could do this easily in another language, using a foreach loop with a string array, but FileMaker's scripting language is foreign to me. Is there a simple solution someone can point me in the direction of?
You could probably loop through the list of proper names (where is it, and in what form?) and set the field to a calculation using:
Substitute ( field ; searchString ; TextStyleAdd ( searchString ; italic ) )
where searchString is the current value of the inner loop. The outer loop is, of course, looping through all found records. Hard to be more specific with so few details.
That said, IMHO it would take no more time and effort to fix the real problem here once and for all, i.e. separate the two facts into two individual fields.
Note also that there is an assumption here that the proper names match only themselves; for example, in "Peter Pan, Peter Pan" both occurrences would be italicized using the above method, even if only one of them is meant as the proper name.
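For readers more at home in another language, the loop-and-substitute logic described above can be sketched in Python. This is not FileMaker code: the <i>...</i> markers stand in for the italic text style, and the function name and sample data are assumptions for illustration.

```python
def italicize(caption: str, proper_names: list[str]) -> str:
    """Wrap every occurrence of each proper name in italic markers."""
    for name in proper_names:
        # Same idea as Substitute(field; name; styled name) in the answer.
        caption = caption.replace(name, f"<i>{name}</i>")
    return caption


proper_names = ["Peter Pan"]
captions = ["Wendy,\nPeter Pan", "Peter Pan, Peter Pan"]

# The outer loop corresponds to looping through the found records.
for caption in captions:
    print(italicize(caption, proper_names))
```

The second caption shows the caveat from the note above: every occurrence is styled, whether or not it is meant as the proper name.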