postgresql multiple column search with ranking

I want to search for multiple columns in multiple tables. Like this:
Given tables:
Users
id
first_name
last_name
email
Companies
user_id
address
Lands
name
company_id
Let's say the user is Johny Bravo (johny.bravo@gmail.com), working in Washington in the United States.
I want to find the record based on a query:
"ate" -> matches United States, or
"rav" -> matches Bravo
When I type "rav", my Johny Bravo should rank higher than Johny Bravos with other emails, so he comes first in the results.
How can I implement such functionality?
I've looked at ts_vector and ts_rank, but it seems that only a prefix (right-side) wildcard is supported: to_tsquery('Brav:*') will work, but there is no left-side equivalent. I also don't need full-text-search features such as stemming or aliasing, since I will be searching addresses and usernames. I can do a plain wildcard search, but then I would have to calculate the ranking manually in the application.

You could use the pg_trgm extension.
You must have the contrib modules installed; then you create the extension:
create extension pg_trgm;
Then you can create trigram indexes (note that user is a reserved word in Postgres, so the table name has to be quoted):
create index user_idx on "user" using gist (user_data gist_trgm_ops);
You can then run a query which will give you the 10 most similar values:
select * from "user" order by user_data <-> 'rav' limit 10;
Note that you can replace user_data with an immutable expression or function which concatenates all of the relevant columns into one text value, thus enabling search across multiple fields.
To get "ranking score", you can use similarity function, which returns 1 for identical strings and 0 for completely unrelated.
If you need full text search across the whole database, a better solution might be a separate search facility, such as Apache Solr.

Related

Defining relevant indices for database indexing

I need to define and create indices for a postgresql DB used for translation memory.
This is related to this (Database design question regarding performance) question I've posted, and the oversimplified design follows this (How to design a database for translation dictionary?) answer. The only difference is that I have a Segment (basically a sentence instead of a word).
Tables:
I. languages

ID  NAME
--  ---------
1   English
2   Slovenian

II. segments

ID  CONTENT       LANGUAGE_ID
--  ------------  -----------
1   Hello World   1
2   Zdravo, svet  2

III. translation_records (TranslationRecord has more columns, omitted here, like domain, user etc.)

ID  SOURCE_SEGMENT_ID  TARGET_SEGMENT_ID
--  -----------------  -----------------
1   1                  2
I want to index the segments table for searching existing translations and for searching combinations of words in the DB.
My question is: is it enough to create an index on the CONTENT column of the segments table, or should I also tokenize CONTENT into a new TOKENS column and index that as well?
Also, am I missing something else that might be important for creating such indices?
---EDIT---
Querying examples:
When a user enters new text to translate, the app returns a predefined number of existing translation records whose source segment's content matches the entered text by a certain percentage.
When a user triggers a manual query, the app lists a predefined number of existing translation records whose source segment's content includes the words marked by the user (i.e. the concordance search).
Since there is only one table for all language combinations, the first condition for querying would be the language combination (an attribute of translation_record).
---EDIT---
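A hedged sketch of the two query shapes described above (pg_trgm for the percentage match, full text search for the concordance search; the index names, the % operator's default similarity threshold, and the 'simple' configuration are assumptions, not part of the question):
create extension if not exists pg_trgm;

-- fuzzy match: "content matches by a certain percent"
create index segments_content_trgm_idx
  on segments using gin (content gin_trgm_ops);

select tr.id, s.content, similarity(s.content, 'Hello World') as score
from translation_records tr
join segments s on s.id = tr.source_segment_id
where s.language_id = 1
  and s.content % 'Hello World'  -- % applies pg_trgm's similarity threshold
order by score desc
limit 10;

-- concordance search: segments containing the marked words
create index segments_content_fts_idx
  on segments using gin (to_tsvector('simple', content));

select s.id, s.content
from segments s
where to_tsvector('simple', s.content) @@ to_tsquery('simple', 'hello & world');
Under this design the tsvector index makes a separate TOKENS column unnecessary: the tokenization lives in the index expression itself.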

How can I match up user inputs to ambiguous city names?

We have a set of tables (shown below) that our other tables reference for location data. Some example uses are:
Find all companies within X miles of X City
Create a company profile's location as X City
We solve the problem of multiple cities with similar names by matching on State as well, but now we have run into a different set of problems. We use Google's Place Autocomplete both for geocoding and for matching a user's query to our cities. This works fairly well until Google's format deviates from ours.
Example:
St. Louis !== Saint Louis and
Ameca del Torro !== Ameca Torro
Is there a way to fuzzy match cities in our queries?
Our query to match cities now looks like:
SELECT c.id
FROM city c
INNER JOIN state s
ON s.id = c.state_id
WHERE c.name = 'Los Angeles' AND s.short_name = 'CA'
I've also considered denormalizing city and simply storing coordinates to still support the radius search. We have around 2 million rows in our company table now, so a radius search would be performed on that rather than on the city table with a JOIN on company. This would also mean we couldn't (simply, anyway) create custom regions for cities, or add other attributes to cities in the future.
I found this answer, but it is basically affirming that our way of normalizing input is a good method, not how we match to our local table (unless Google offers a city-name export I don't know about).
The short answer is that you can use Postgres's full text search functionality, with a customized search configuration.
Since you're dealing with place names, you probably want to avoid stemming, so you can use the simple configuration as a starting point. You can also add stop words that make sense for place names (with the examples above, you could probably treat "St.", "Saint", and "del" as stop words).
A pretty basic outline of setting up your customized configuration is below:
1. Create a stopwords file and put it in your $SHAREDIR/tsearch_data Postgres directory. See https://www.postgresql.org/docs/9.1/static/textsearch-dictionaries.html#TEXTSEARCH-STOPWORDS.
2. Create a dictionary that uses this stopwords list (you can probably use pg_catalog.simple as your template dictionary). See https://www.postgresql.org/docs/9.1/static/textsearch-dictionaries.html#TEXTSEARCH-SIMPLE-DICTIONARY.
3. Create a search configuration for place names. See https://www.postgresql.org/docs/9.1/static/textsearch-configuration.html.
4. Alter your search configuration to use the dictionary you created in Step 2 (cf. the link above). A SQL sketch of these steps follows.
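A hedged SQL sketch of Steps 2-4 (the names places_dict and places, and the stopword file places.stop, are illustrative):
-- Step 2: dictionary based on the simple template, with custom stop words
create text search dictionary places_dict (
    template = pg_catalog.simple,
    stopwords = places  -- reads $SHAREDIR/tsearch_data/places.stop
);

-- Step 3: search configuration for place names
create text search configuration places (copy = pg_catalog.simple);

-- Step 4: route plain words through the custom dictionary
alter text search configuration places
    alter mapping for asciiword, word with places_dict;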
Another consideration is internationalization. It seems that the issue with your second example (Ameca del Torro vs. Ameca Torro) might be a Spanish vs. English representation of the name. If that's the case, you could also consider storing both a "localized" and a "universal" (e.g. English) version of the city name.
In the end, your query (using full text search) might look like this (where 'places' is the name of your search configuration):
SELECT cities."id"
FROM cities
INNER JOIN "state" ON "state".id = cities.state_id
WHERE
"state".short_name = 'CA'
AND TO_TSVECTOR('places', cities.name) @@ TO_TSQUERY('places', 'Los & Angeles')

How to get best matching products by number of matches in postgres

A Postgres 9.1 shopping cart contains the product table:
create table products (
  id char(30) primary key,
  name char(50),
  description text
);
The cart has a search field. If something is entered into it, an autocomplete dropdown must show the best-matching products, ordered by the number of products matching the criteria.
How can such a query be implemented in Postgres 9.1? The search should be performed on the name and description fields of the products table.
A substring match is sufficient; full text search with sophisticated text matching is not strictly required.
Update
The word joote can be part of a product name or description.
For example, the text for the first match may contain
.. See on jootetina ..
and for another product
Kasutatakse jootetina tegemiseks ..
and another with upper case
Jootetina on see ..
In this case the query should return the word jootetina with a match count of 3.
How can this be made to work like the autocomplete that happens when a search term is typed into the Google Chrome address bar?
How to implement this?
Or, if this is difficult, how can the word jootetina be returned from all those texts which match the search term joote?
select word, count(distinct id) as total
from (
select id,
regexp_split_to_table(name || ' ' || description, E'\\s+') as word
from products
) s
where position(lower('joote') in lower(word)) > 0
group by word
order by 2 desc, 1
First of all, do not use the data type char(n). That's a misunderstanding; you want varchar(n) or just text. I suggest text.
Any downsides of using data type "text" for storing strings?
With that fixed, you need a smart index-based approach, or this becomes a performance nightmare: either trigram GIN indexes on the original columns, or a text_pattern_ops btree index on a materialized view of individual words (with counts).
Pattern matching with LIKE, SIMILAR TO or regular expressions in PostgreSQL
The MV approach is probably superior if there are many repetitions among words.
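A hedged sketch of both options (index and view names are illustrative; note that materialized views require Postgres 9.3+, so on 9.1 you would maintain a regular table instead):
-- option 1: trigram GIN indexes on the original columns
create extension if not exists pg_trgm;
create index products_name_trgm_idx on products using gin (name gin_trgm_ops);
create index products_description_trgm_idx on products using gin (description gin_trgm_ops);

-- option 2: materialized view of individual words with counts
create materialized view product_words as
select word, count(distinct id) as total
from (
   select id,
          regexp_split_to_table(lower(name || ' ' || description), E'\\s+') as word
   from   products
   ) s
group by word;

create index product_words_word_idx on product_words (word text_pattern_ops);

-- left-anchored autocomplete lookup that can use the btree index
select word, total
from product_words
where word like 'joote%'
order by total desc, word;
The left-anchored LIKE is what lets the text_pattern_ops index kick in; a substring match anywhere in the word ('%joote%') would need the trigram indexes from option 1 instead.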

Cassandra: Column Families for complex queries?

Every source tells me that supporting complex queries in Cassandra is complicated, and you usually need to create a new column family to support specific queries (like JOINs in a relational database).
I don't understand why you would actually need another column family for a query.
An example of this was demonstrated by IBM here: http://www.ibm.com/developerworks/library/os-apache-cassandra/
The system has Books with the following columns: Author, Price, tag1, tag2, tag...
If I wanted to perform a query like "get all authors that have written books with the tag sci-fi", they recommend creating a column family called TagsToAuthor. Why is this necessary? I believe you could use either of the following two solutions without creating a new column family (see the sketch after this list for the recommended alternative):
1. Create a Tag column family with the columns: Book1, Book2, Book..., Author1, Author2, Author...
2. Create a Tag column family and a BookTag column family that contains the columns book_id and tag_id. Although Cassandra doesn't have join functionality, you can simply get the tag id from the Tag column family, then get the list of book_ids by querying BookTag, then use those ids to query Book, just like you would in a normal relational database.
What are the disadvantages of these solutions?
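For contrast, a minimal sketch in modern CQL (the table and column names are illustrative; the IBM article predates CQL) of the denormalized TagsToAuthor approach, which answers the query with a single partition read:
CREATE TABLE tags_to_author (
    tag    text,
    author text,
    PRIMARY KEY (tag, author)  -- tag is the partition key
);

-- one partition read answers "all authors with tag sci-fi"
SELECT author FROM tags_to_author WHERE tag = 'sci-fi';
The usual objection to solutions 1 and 2 is that they need multiple sequential lookups and a client-side join, which is exactly what the extra column family is meant to avoid.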

Most efficient database schema for counting keywords

I'm working on an iPhone app with a GAE backend. I currently have a database of ~8000 products and each product has 5 keywords, mined from reviews, that are the words used most often to describe the product. Once I deploy the app, I'd like to allow users to add new products, and add their 5 keywords to existing products. So, when "reviewing" an existing product, they would add their 5 words, and these would be reflected in the Top 5 words if they push a word over into the Top 5. These keywords will be selected via a large whitelist with indirect selection so I can control the user input. I'd like the application to scale to thousands of users without hitting my backend too hard.
My question is:
What's the most efficient database schema for keeping track of all the words for a product and calculating the top 5 for each product once it's updated?
My two ideas (which may be terrible):
Have a "words" column which contains a 2d array, one dimension is the word, the other is the count for that word. They would then be incremented/decremented as needed.
Have a database with each word as a column and each product as a row and the corresponding row/column would contain the count.
The easiest way to do this would be to have a 'tags' kind, defined something like this (you haven't specified a backend language, so I'm assuming Python):
from google.appengine.ext import db

class Tag(db.Model):
    # Tags should be child entities of Products and have a key name based
    # on the tag, eg. created with Tag(parent=a_product, key_name='awesome', ...)
    count = db.IntegerProperty(required=True, default=0)

    @classmethod
    def increment_tags(cls, product, tag_names):
        def _tx():
            tags = cls.get_by_key_name(tag_names, parent=product)
            for i, tag in enumerate(tags):
                if tag is None:
                    # New tag
                    tags[i] = tag = cls(key_name=tag_names[i], parent=product)
                tag.count += 1
            db.put(tags)
        return db.run_in_transaction(_tx)

    @classmethod
    def get_top_product_tags(cls, product, num=5):
        return [x.key().name() for x
                in cls.all().ancestor(product).order('-count').fetch(num)]
The increment_tags method increments the count property on all the relevant tags. Since they all have the same parent entity, they're in the same entity group, so it can do this in a single transaction.
The get_top_product_tags method does a simple datastore query to find the num top-ranked tags for a product.
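A hypothetical usage sketch (the product entity and the tag list are made up):
# product is assumed to be a previously stored Product entity
Tag.increment_tags(product, ['durable', 'compact', 'cheap'])
print Tag.get_top_product_tags(product)  # up to 5 tags, highest count first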
You should use a normalized schema and let SQL and the database engine be your friend. Have a single table with a design like this:
create table KeywordUse
( AppID int
, UserID int
, Sequence int
, Word varchar(50) -- or whatever makes sense
)
You can also have an identity primary key if you like, but AppID + UserID + Sequence is a candidate key (i.e. the combination of these three must be unique).
To find the top 5 keywords for any app, do a SQL query like this:
select top 5
count(AppID) as Frequency -- If you have an identity PK count that instead.
, Word
from KeywordUse
where AppID = #AppIDVariable...
group by Word, AppID
order by count(AppID) desc
If you are really, really worried about performance you could denormalize the results of this query into a table that shows the words for each app. Then you'd have to work out how often to refresh that snapshot.
REVISED ANSWER:
As Nick Johnson so generously pointed out, aggregate functions are not available in GQL. However, the philosophy of my answer remains unchanged. Let the database engine do its job.
The table should be AppID, Word, and Frequency (AppID and Word forming the PK). Each use of a word is then added to its Frequency as it is applied. When you want the top five words for an app, you select by AppID and order by Frequency (descending) with a LIMIT of 5.
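For illustration, the same design expressed in standard SQL (GQL has no aggregates, which is why Frequency is maintained as a running count; the upsert syntax assumes a SQL engine with ON CONFLICT support, e.g. Postgres 9.5+):
create table WordFrequency
( AppID     int
, Word      varchar(50)
, Frequency int not null default 0
, primary key (AppID, Word)
);

-- record one more use of a word
insert into WordFrequency (AppID, Word, Frequency)
values (1, 'durable', 1)
on conflict (AppID, Word) do update
  set Frequency = WordFrequency.Frequency + 1;

-- top five words for an app
select Word, Frequency
from WordFrequency
where AppID = 1
order by Frequency desc
limit 5;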
You would need a separate table to track user keywords if that is important.