How could I make Sphinx recognize "auto" and "car" as similar words?
Let's imagine I have three database records:
Andy likes to drive his auto.
Mary doesn't like to drive her car.
Bob is going to buy an automobile.
Here are sample queries and their results…
query: car
result: Mary doesn't like to drive her car.
-------------------------------------
query: auto
result: Andy likes to drive his auto.
-------------------------------------
query: automobile
result: Bob is going to buy an automobile.
…but I want Sphinx to return…
query: car
result:
Andy likes to drive his auto.
Mary doesn't like to drive her car.
Bob is going to buy an automobile.
-------------------------------------
query: auto
result:
Andy likes to drive his auto.
Mary doesn't like to drive her car.
Bob is going to buy an automobile.
-------------------------------------
query: automobile
result:
Andy likes to drive his auto.
Mary doesn't like to drive her car.
Bob is going to buy an automobile.
I know that Sphinx has stopwords, but what should I put into the stopwords dictionary to make Sphinx think this way?
Thank you.
All you have to do is supply Sphinx with a correctly formatted text file of wordforms, referenced in your .conf file.
Documentation is here: http://www.sphinxsearch.com/docs/manual-0.9.9.html#conf-wordforms
auto > car
automobile > car
four-wheeled-vehicle-intended-for-public-roads > car
cars > car
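For reference, the wordforms file is wired up through the index definition. A minimal sketch of the relevant part of sphinx.conf, assuming hypothetical index and source names:

    index products
    {
        source    = products_src
        path      = /var/data/products
        wordforms = /usr/local/sphinx/data/wordforms.txt
    }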
Let me give you an example of wordforms morphology with the terms "gearing" and "leverage", as these are equivalent terms in finance and should be considered synonyms (both mean "financial leverage").
Originally your "wordforms.txt" file should contain them listed like this:
gear > gear
geared > gear
gearing > gear
gears > gear
……
leverage > leverage
leveraged > leverage
leverages > leverage
leveraging > leverage
It means that originally these two words are not connected. To fix that, you should modify the content of "wordforms.txt" this way:
gear > leverage
geared > leverage
gearing > leverage
gears > leverage
……
leveraged > leverage
leverages > leverage
leveraging > leverage
This edit connects them (and all their forms). After you edit the "wordforms.txt" file, save it and re-index your indexes to apply the changes (the indexer command is shown below).
Now when you search for "gearing" or "leverage", your results will match both words, along with all their morphological forms.
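Re-indexing is done with Sphinx's indexer tool; the index name here is a placeholder:

    indexer --rotate products

The --rotate flag lets a running searchd pick up the rebuilt index without downtime.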
I'm going to design a simple knowledge base with Objects, Relations between them, and Questions about Objects and Relations. I understand well how to do this in an RDBMS. But now I'm going to study graph DB capabilities, in particular OrientDB.
Visual representation of what I want:
Vertex: Moon --> "What is the mass of the M.?", etc.
  |
  o---> This edge must also have a 'one to many' (edge to questions)
  |     relation, e.g. "How far is the M. from the E.?", etc.
  |
  | <-- Edge: has a satellite (or stands nearby)
  |
Vertex: Earth --> "What is the age of the E.?"
I found the Link list and Link set datatypes in the manual, but I'm still not sure (a) whether they are really what I need and (b) how I should properly represent Questions.
I need your help.
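For concreteness, one hedged way such a model could be sketched in OrientDB SQL is to reify the relation as its own vertex, so that questions can attach to both objects and relations uniformly; all class and property names below are made up for illustration:

    CREATE CLASS Obj EXTENDS V
    CREATE CLASS Relation EXTENDS V
    CREATE CLASS Question EXTENDS V
    CREATE CLASS Links EXTENDS E
    CREATE CLASS Asks EXTENDS E

    CREATE VERTEX Obj SET name = 'Earth'
    CREATE VERTEX Obj SET name = 'Moon'
    CREATE VERTEX Relation SET label = 'has satellite'
    CREATE EDGE Links FROM (SELECT FROM Obj WHERE name = 'Earth') TO (SELECT FROM Relation WHERE label = 'has satellite')
    CREATE EDGE Links FROM (SELECT FROM Relation WHERE label = 'has satellite') TO (SELECT FROM Obj WHERE name = 'Moon')
    CREATE VERTEX Question SET text = 'How far is the Moon from the Earth?'
    CREATE EDGE Asks FROM (SELECT FROM Question WHERE text LIKE 'How far%') TO (SELECT FROM Relation WHERE label = 'has satellite')

With this shape, a Question vertex can point at an Obj vertex or at a Relation vertex through the same Asks edge class, which covers both "questions about objects" and "questions about relations".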
We have a set of tables that our other tables reference for location data. Some example uses are:
Find all companies within X miles of X City
Create a company profile's location as X City
We solve the problem of multiple cities with similar names by matching on State as well, but now we've run into a different set of problems. We use Google's Place Autocomplete for both geocoding and matching a user's query to our cities. This works fairly well until Google's format deviates from ours.
Example:
St. Louis !== Saint Louis and
Ameca del Torro !== Ameca Torro
Is there a way to fuzzy match cities in our queries?
Our query to match cities now looks like:
SELECT c.id
FROM city c
INNER JOIN state s
ON s.id = c.state_id
WHERE c.name = 'Los Angeles' AND s.short_name = 'CA'
I've also considered denormalizing city and simply storing coordinates to still allow the radius search. We have around 2 million rows in our company table now, so a radius search would be performed on that rather than on the city table with a JOIN on company. This would also mean we wouldn't (simply, anyway) be able to create custom regions for cities or add other attributes to cities in the future.
I found this answer, but it basically affirms that our way of normalizing input is a good method; it doesn't say how to match against our local table (unless Google offers a city-name export I don't know about).
The short answer is that you can use Postgres's full-text search functionality with a customized search configuration.
Since you're dealing with place names, you probably want to avoid stemming, so you can use the simple configuration as a starting point. You can also add stop words that make sense for place names (with the examples above, you can probably consider "St.", "Saint", and "del" as stop words).
A pretty basic outline of setting up your customized configuration is below (a SQL sketch follows the list):
1. Create a stopwords file and put it in your $SHAREDIR/tsearch_data Postgres directory. See https://www.postgresql.org/docs/9.1/static/textsearch-dictionaries.html#TEXTSEARCH-STOPWORDS.
2. Create a dictionary that uses this stopwords list (you can probably use pg_catalog.simple as your template dictionary). See https://www.postgresql.org/docs/9.1/static/textsearch-dictionaries.html#TEXTSEARCH-SIMPLE-DICTIONARY.
3. Create a search configuration for place names. See https://www.postgresql.org/docs/9.1/static/textsearch-configuration.html.
4. Alter your search configuration to use the dictionary you created in step 2 (cf. the link above).
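Putting those steps together, a hedged SQL sketch; the dictionary and stopword-file names are placeholders, and placenames.stop must already exist in $SHAREDIR/tsearch_data:

    -- Step 2: simple-template dictionary with a custom stopword list
    CREATE TEXT SEARCH DICTIONARY placenames_dict (
        TEMPLATE = pg_catalog.simple,
        STOPWORDS = placenames        -- reads placenames.stop
    );

    -- Step 3: a configuration for place names
    CREATE TEXT SEARCH CONFIGURATION places (COPY = pg_catalog.simple);

    -- Step 4: route word-like tokens through the custom dictionary
    ALTER TEXT SEARCH CONFIGURATION places
        ALTER MAPPING FOR asciiword, word, hword, hword_part
        WITH placenames_dict;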
Another consideration is internationalization. It seems the issue with your second example (Ameca del Torro vs. Ameca Torro) might be a Spanish vs. English representation of the name. If that's the case, you could also consider storing both a "localized" and a "universal" (e.g. English) version of the city name.
In the end, your query (using full-text search) might look like this (where 'places' is the name of your search configuration):
SELECT cities."id"
FROM cities
INNER JOIN "state" ON "state".id = cities.state_id
WHERE
"state".short_name = 'CA'
AND TO_TSVECTOR('places', cities.name) @@ TO_TSQUERY('places', 'Los & Angeles')
I've been looking for an answer on the web for quite a long time, but I couldn't find one. So I hope Stack Overflow users can help/advise me a bit.
I have 7,000 addresses (like "67, place Lobligeois 75017 Paris, France"), and I would like to get a shapefile that contains the 7,000 buildings corresponding to these 7,000 addresses.
My idea is to:
1. Use the Mapquest API to get the "OSM node" for each of these 7,000 addresses.
2. Use the Overpass API to get, for all buildings in Paris, their "ways" and "nodes".
3. Match (1) and (2) to get the "ways" corresponding to my 7,000 nodes/addresses.
4. Load in QGIS a shapefile (found at download.bbbike.org/osm/bbbike/Paris/) of all Paris buildings (a shapefile where "OSM_ID" equals "way").
5. Find in my shapefile the "ways" obtained in (3) and delete all buildings that do not match.
Is this a good idea? Or is there a simpler way to do it (I hope)?
By the way, I am not able to download the data from my step 2; overpass-turbo.eu fails each time. Do you have any idea why (is my bbox too big)?
I would be delighted to get some advice/help.
Charles H.
Try this: https://github.com/kiselev-dv/gazetteer/tree/develop/Gazetteer
You can get a CSV with addresses, address components, OSM IDs, and geometry as WKT strings.
After that, you can compare the points from step 1 by OSM ID or by address and filter the CSV rows you need.
Finally, open the CSV in QGIS and save it as a shapefile.
There are a couple of things I recommend.
Don't bother trying to extract the buildings. That will put a big hurt on your browser. Instead, grab one of the Geofabrik daily extracts for the Paris region. While those won't include the address nodes, they will have all the buildings.
Next, do an Overpass query for just the addresses on nodes, using the NominatimArea function. It looks like there are 30 MB worth of them in Paris (!!), so you may have to break that area down into smaller districts, if Paris has any. Export that as GeoJSON and convert it to a shapefile.
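A hedged sketch of what that query might look like in overpass-turbo, using its {{geocodeArea:...}} shortcut to resolve the area; the timeout and tag filter are placeholders to adjust:

    [out:json][timeout:120];
    {{geocodeArea:Paris}}->.searchArea;
    node["addr:housenumber"](area.searchArea);
    out body;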
What do we have as of now? We are using Mahout's GenericItemBasedRecommender to get a list of recommended products for a user, with TanimotoCoefficientSimilarity as the ItemSimilarity.
Where do we want to go from here? The above works fine when we don't care about product category, but what we want are product-category-specific recommendations. Say a user has been buying, browsing, liking, etc. mostly in the Men's and Gadgets categories; we would then want to show this user recommendations in those specific categories, saying "Recommended for you in [X]", where X would be Men's or Gadgets in this case. We are thinking about the two options below to achieve this, and we need some leads/opinions/feedback to make sure we are going in the right direction. Options:
Firstly, we'll have to move to a non-Tanimoto version of calculating item similarity, so that we account for users buying, liking, etc., and not only view/browsing data.
Figuring out the product categories for a particular user (this is where we need direction): our product category hierarchy is basically a tree, and we need to know which top 4 nodes (with the best recommendations) in the tree we should show to the user. Also, if node X is a category we are showing to the user and node Y is a parent of node X, we don't want to show the user products in category Y, or in any ancestor for that matter. A couple of ways of achieving this:
For every user, calculate A = the SUM of the similarity scores of recommended items at each leaf node, and recursively aggregate up to the root. With B = the number of items recommended under each node, we then have a value V = A/B at each node. We pick the top 4 V values from the tree and recommend those categories to the user (see the sketch after the example below). The challenge here is that if we try to calculate this online during the request, it would be tough to keep the entire request under 150 ms. An example:
Root level:  Category12 (A=11, B=4)   (category1 + category2)
                 /               \
Leaf level:  category1 (A=6, B=2)   category2 (A=5, B=2)
Recommended products in Category 1: Item1 (score = 2), Item2 (score = 4)
Recommended products in Category 2: Item3 (score = 1), Item4 (score = 4)
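A minimal sketch of that bottom-up aggregation, assuming a hypothetical CategoryNode class; the per-leaf scores would come from the recommender:

    import java.util.ArrayList;
    import java.util.List;

    class CategoryNode {
        String name;
        List<CategoryNode> children = new ArrayList<>();
        double sumScores; // A: sum of similarity scores of recommended items
        int itemCount;    // B: number of recommended items under this node

        // Recursively roll A and B up from the leaves into this node.
        void aggregate() {
            for (CategoryNode child : children) {
                child.aggregate();
                sumScores += child.sumScores;
                itemCount += child.itemCount;
            }
        }

        // V = A / B, guarding against empty categories.
        double value() {
            return itemCount == 0 ? 0.0 : sumScores / itemCount;
        }
    }

Since this is a single pass over the tree, it can be precomputed offline per user and cached, which sidesteps the 150 ms concern.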
Second option: for every category, create a cluster of users based on their behaviour (likes, buying, viewing, etc.) and then figure out the top 4 categories to which the user belongs. Not sure if we can achieve this using clustering in Mahout, but I think we can do it offline.
Please provide your feedback/suggestions/leads/thoughts.
Thanks in advance!
If you want to model more than one kind of action in your data, I would suggest using the SVD recommender instead, with the ALSWRFactorizer set to implicit feedback. With that done, you can have user,item,preference in your data, where the preference value is how strongly associated the user is with the item. You can play with the numbers; for example, a purchase is a 20 and a view is just a 2. I'm just throwing out numbers here; I wouldn't know what will work best for your data. You can also model things proportionally: if a purchase is 30 times less likely to happen than a view, then a purchase should be 30 times stronger than a view.
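A rough sketch of that setup against Mahout's Taste API; the file name and all hyperparameters are placeholders you would tune:

    import java.io.File;
    import java.util.List;
    import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
    import org.apache.mahout.cf.taste.impl.recommender.svd.ALSWRFactorizer;
    import org.apache.mahout.cf.taste.impl.recommender.svd.SVDRecommender;
    import org.apache.mahout.cf.taste.model.DataModel;
    import org.apache.mahout.cf.taste.recommender.RecommendedItem;
    import org.apache.mahout.cf.taste.recommender.Recommender;

    public class ImplicitSvdExample {
        public static void main(String[] args) throws Exception {
            // user,item,preference triples; e.g. purchase=20, view=2
            DataModel model = new FileDataModel(new File("prefs.csv"));
            ALSWRFactorizer factorizer = new ALSWRFactorizer(
                model,
                10,    // number of latent features
                0.065, // regularization (lambda)
                15,    // training iterations
                true,  // usesImplicitFeedback
                40.0); // alpha: confidence weight for implicit data
            Recommender recommender = new SVDRecommender(model, factorizer);
            List<RecommendedItem> recs = recommender.recommend(1L, 10);
            System.out.println(recs);
        }
    }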
Mahout provides a way to influence the recommendations through the IDRescorer. You implement your own logic there and decide how to affect the recommendations. For example, an IDRescorer could check whether a recommendation candidate belongs to the target category and, if it does, boost the score by X. There's an example here (link) from the Mahout in Action book (which you should definitely read) showing a rescorer.
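A hedged sketch of such a rescorer; isInTargetCategory() is a hypothetical lookup you would back with your own catalog data:

    import java.util.List;
    import org.apache.mahout.cf.taste.common.TasteException;
    import org.apache.mahout.cf.taste.recommender.IDRescorer;
    import org.apache.mahout.cf.taste.recommender.RecommendedItem;
    import org.apache.mahout.cf.taste.recommender.Recommender;

    public class CategoryBoost {
        // Hypothetical catalog lookup; wire this to your category tree.
        static boolean isInTargetCategory(long itemID) {
            return itemID % 2 == 0; // placeholder logic for illustration
        }

        static List<RecommendedItem> recommendWithBoost(
                Recommender recommender, long userID) throws TasteException {
            IDRescorer categoryBoost = new IDRescorer() {
                @Override
                public double rescore(long itemID, double originalScore) {
                    // Boost candidates in the user's top category by 50%.
                    return isInTargetCategory(itemID) ? originalScore * 1.5
                                                      : originalScore;
                }
                @Override
                public boolean isFiltered(long itemID) {
                    return false; // keep all candidates, only adjust scores
                }
            };
            return recommender.recommend(userID, 10, categoryBoost);
        }
    }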
Hope this helps.
So I need a DB that can store info for about 300 million users. Each user will have two vectors: their 5 favorite items and their 5 most similar users (these users are also contained in the user set).
ex:

preferences            users
user  | item           user  | user
--------------         --------------
user1 | item1          user1 | user2
user1 | item2          user1 | user4
user1 | item3          user2 | user8
user2 | item3          ...
user2 | item4
...
So basically I need two tables, both many-to-many relationships, and both relatively big.
I've been exploring Cassandra (but I'm open to other solutions), and I was wondering how I should define the schema and what type of indexing I need for this to be optimized and working properly.
I will need to query in two ways:
1. by user, of course, and
2. by whatever item is in their list (so I can get a list of users with the same favorite item).
I've already set up Cassandra and started messing with it, but I can't even get lists to work because I need 'composite' primary keys? I don't understand why.
Any help/a push in the right direction is greatly appreciated.
Thanks!
I am not sure you've adequately described your use case. It is the access patterns that first and foremost define your key design, which is ultimately what defines your workload characteristics with NoSQL databases. For example, are you going to have to search for users based on a certain geography or something along those lines, or is this just a simple "grab one user and their favorite items and/or their similar users"?
Based on what you've described, you should probably just create a keyspace keyed by user ID, where the value holds denormalized copies of the "favorite items" and a list of "similar user IDs". Assuming your next action is to do something with those similar users, you can quickly fetch them from the list of IDs. A rough sketch of what that could look like in CQL is below.
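This is only a sketch under the assumptions above; all table and column names are made up:

    CREATE TABLE user_profile (
        user_id        text PRIMARY KEY,
        favorite_items list<text>,  -- denormalized top-5 items
        similar_users  list<text>   -- denormalized top-5 similar user IDs
    );

    -- A reverse-lookup table keeps "users with the same favorite item" a
    -- straight key lookup; (item_id, user_id) is the composite primary
    -- key the asker ran into.
    CREATE TABLE users_by_item (
        item_id text,
        user_id text,
        PRIMARY KEY (item_id, user_id)
    );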
The important point is how big your keys are (I mean in characters/bytes) and whether you will be able to fit them into memory so you get really fast performance. If your machines have limited memory for your key size, then you need to plan for a number of nodes that can accommodate a given number of keys, and let those nodes run on separate servers. At least, that is the most important part for Oracle NoSQL Database (ONDB); I am part of that team. The good news is that 300M is still very small.
Hope it helps,
-Robert