Tag hierarchies and handling of - tags

This is a real issue that applies to tagging items in general (and yes, this applies to StackOverflow too, and no, it is not a question about StackOverflow).
Tagging helps cluster similar items, whatever they may be (jokes, blog posts, SO questions etc.). However, there is usually (though not strictly) a hierarchy of tags, meaning that some tags imply other tags. To use a familiar example, the "c#" SO tag also implies ".net"; as another example, in a jokes database a "blondes" tag implies the "derisive" tag, as do "irish", "belge", "canadian" etc., depending on the joke's country of origin.
How have you handled this, if you have, in your projects? I will supply an answer describing two different methods I have used in two separate cases (actually, the same mechanism implemented in two different environments), but I am interested not only in similar mechanisms but also in your opinion on the hierarchy issue.

This is a tough question. The two extremes are an ontology (everything is hierarchical) and a folksonomy (tags have no hierarchy). I have answered this on WikiAnswers, with a reference to Clay Shirky's "Ontology is Overrated" article which claims you should set no hierarchy.

Actually I would say that it is not so much a hierarchical system as a semantic net with perceived distances between the tags' meanings. What I mean is: mathematics is closer to experimental physics than to gardening.
One way to build such a net: form pairs of tags and let people judge the perceived distance (on a scale such as 1-10, meaning something like [synonyms, alike, ..., antonyms]), and when searching, search for all tags within a certain distance.
Does the measure have to be the same when coming from the opposite direction ([a,b] close -> [b,a] close)? And is proximity transitive ([a,b] close and [b,c] close -> [a,c] close)?
Maybe the starting word triggers a different semantic field by default. If you start at "social worker", "analyst" is near. If you start at "programmer", "analyst" is near as well. But starting at either of these points, you probably would not count the other as near ("social worker" is by no means close to "programmer").
You would therefore have only pairs judged, and judged in both directions (in random order).
[TagRelations]
tagId integer
closeTagId integer
proximity integer
Example for selection of similar tags:
select closeTagId from TagRelations where tagId = :tagID and proximity < 3
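As a rough illustration of the table and query above, here is a minimal sketch using Python's built-in sqlite3 module. The tag ids and proximity scores are made up for the example, and the two directions of a pair are deliberately stored with different scores, since the answer leaves open whether the measure is symmetric.

```python
import sqlite3

# In-memory database with the TagRelations layout described above.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE TagRelations ("
    " tagId INTEGER, closeTagId INTEGER, proximity INTEGER,"
    " PRIMARY KEY (tagId, closeTagId))"
)

# Hypothetical judgements: 1 = mathematics, 2 = experimental physics,
# 3 = gardening. Lower proximity means "closer".
judged_pairs = [
    (1, 2, 2),  # mathematics -> experimental physics: close
    (2, 1, 4),  # the reverse direction was judged less close
    (1, 3, 9),  # mathematics -> gardening: far apart
    (3, 1, 9),
]
conn.executemany("INSERT INTO TagRelations VALUES (?, ?, ?)", judged_pairs)


def close_tags(tag_id, max_proximity=3):
    """Return ids of tags judged closer than max_proximity to tag_id."""
    cursor = conn.execute(
        "SELECT closeTagId FROM TagRelations"
        " WHERE tagId = ? AND proximity < ? ORDER BY proximity",
        (tag_id, max_proximity),
    )
    return [row[0] for row in cursor]
```

Because each direction is its own row, close_tags(1) and close_tags(2) can give different answers, which is the asymmetry discussed above.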

The mechanism I implemented does not use the given tags directly, but goes through an indirect lookup table (not strictly in the DBMS sense) that links each tag to its implied tags (obviously, a tag is also linked with itself for this to work).
In a Python project, the lookup table is a dictionary keyed on tags, whose values are sets of tags (where tags are plain strings).
In a database project (which RDBMS engine it was is irrelevant), there were the following tables:
[Tags]
tagID integer primary key
tagName text
[TagRelations]
tagID integer # first part of two-field key
tagID_parent integer # second part of key
trlValue float
where trlValue is a value in the interval (0, 1], used as a weight for each linked tag; a self-to-self tag relation always carries 1.0 in trlValue, while the rest are calculated algorithmically (exactly how is not important). Think of the example jokes database I gave: a ['blonde', 'derisive', 0.5] record correlates with a ['pondian', 'derisive', 0.5] record, so given one derisive joke the system can suggest all the other derisive jokes.
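The Python variant of the lookup table could look roughly like the sketch below. The dictionary contents and the helper names (implied_tags, expand, has_tag) are illustrative assumptions, not the original project's code, and the weights from the database variant are left out.

```python
# Each tag maps to the set of tags it implies, including itself.
implied_tags = {
    "c#": {"c#", ".net"},
    ".net": {".net"},
    "blondes": {"blondes", "derisive"},
    "irish": {"irish", "derisive"},
    "derisive": {"derisive"},
}


def expand(tags):
    """Return the given tags plus every tag they imply."""
    expanded = set()
    for tag in tags:
        # Unknown tags imply only themselves.
        expanded |= implied_tags.get(tag, {tag})
    return expanded


def has_tag(item_tags, wanted):
    """True if an item carries the wanted tag directly or by implication."""
    return wanted in expand(item_tags)
```

With this, a joke tagged only "irish" is found by a search for "derisive", without anyone having to add the second tag by hand.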

Partially matching a post code with Algolia

I've loaded a dataset into an Algolia search index. Each item in the index is a shop with a catchment area (the catchment area is just an array of UK Postcodes that a store covers). For example:
['DS4 6', 'DS4 7', 'DS5 8', 'DS6 9' ... ]
The search feature works up to a point. If people search for "DS4" then Algolia returns several stores, but most people type their full postcode (for example DS4 8XX), and this returns nothing even though "DS4" is indexed several times.
Is there a configuration in Algolia to search for the first part of a word, even when a person has 'typed past it'?
To clarify a bit further: I could store every individual postcode in a catchment area, but there are millions and millions of them. A full UK postcode would be "DS4 7EN", so there are two more characters on the end representing a street in the UK. I only store the first part of a postcode, e.g. "DS4 7", because it seems excessive to store everything when I only really care about the wider area, i.e. DS4, DS5, CV43, AB2 (and so on).
I could also probably use a places api and geocode the address. But I already have this catchment area postcode data, so it seems a shame not to use it if I can.
Algolia, like most search engines, supports prefix search in order to allow search-as-you-type results. The InstantSearch libraries build on this: results are updated live as the user types. Without prefix search, you would have to wait for the user to enter an entire word before displaying any meaningful result.
In your case, since the catchment areas are indexed as, e.g., DS4 6, a user typing DS4 6XX gets no matches: the query acts as a filter on the records based on their searchable attributes, and no record contains a word starting with 6XX.
That said, I see two possible workarounds that you can implement.
The first solution is to use the removeWordsIfNoResults index setting and set it to lastWords. This will remove the last word of the query if there are no results. For instance, with the query DS4 6XX it will remove 6XX to keep just DS4 and retrieve the items that match this query. Note that this solution relies on the fact that DS4 6XX has two words (separated by a space), so it won't work with DS46XX.
The second solution is to change the structure of the records to add the full postcode in each item of the index. Since these are shops, I believe that it should be possible. This way your users will be able to search for both the full postcode DS4 6XX and the catchment areas DS4 6. Unless I misunderstood your problem, I don't see the need to store the full list of postcodes associated to a catchment area.
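To make the first workaround (dropping the last query word when nothing matches) more concrete, here is a toy model in plain Python. It is not Algolia's engine or API: the matching rule is simplified to "every query word must be a prefix of some word in one catchment entry", and the shop names and data are invented.

```python
def entry_matches(query_words, entry):
    """True if every query word is a prefix of some word in the entry."""
    entry_words = entry.upper().split()
    return all(
        any(word.startswith(q.upper()) for word in entry_words)
        for q in query_words
    )


def search(query, shops):
    """Search shops; on zero hits, drop the last query word and retry."""
    words = query.split()
    while words:
        hits = [
            name
            for name, catchment in shops.items()
            if any(entry_matches(words, entry) for entry in catchment)
        ]
        if hits:
            return hits
        words = words[:-1]  # remove the last word and try again
    return []


# Hypothetical index: shop name -> catchment area entries.
shops = {
    "Shop A": ["DS4 6", "DS4 7"],
    "Shop B": ["DS5 8"],
}
```

In this model, search("DS4 6XX", shops) falls back to "DS4" and finds Shop A, while search("DS46XX", shops) finds nothing because there is no separate last word to drop, which mirrors the limitation noted above.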

list of all valid parameters and criteria that can be used in RMA queries

I would like to get specific neuron models and even though I believe I understand the RMA query system, I can not find a list of the valid keywords/arguments/criteria/parameters that would correspond to what I am looking for.
For example 'homo sapiens' as donor species is valid, and makes sense.
But if 'm__biophys_perisomatic' returns all cells with perisomatic biophysical models, what about 'all active' ones (just an example, I would be interested in many other categories)?
I assume it is obvious but I will not stumble upon it until I have posted this question.
Thanks for your question. You can see what fields and associations are available for a table using the describe route. For example:
http://api.brain-map.org/api/v2/data/NeuronalModel/describe.xml
From your question, I believe you're looking at this table:
http://api.brain-map.org/api/v2/data/ApiCellTypesSpecimenDetail/describe.xml
You can use m__biophys_all_active to see if a cell in that table has an all-active model.
FYI: The ApiCellTypesSpecimenDetail table is a denormalized table, which means it combines a complex set of relationships among tables into a single flat table.
You could similarly use the following, more generic query to find the all-active models.
http://api.brain-map.org/api/v2/data/query.xml?criteria=model::NeuronalModel,rma::criteria,neuronal_model_template[name$eq'Biophysical - all active']&num_rows=150
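If you build such queries from code, a small helper can assemble and percent-encode the URL. This is only a sketch: the function name rma_query_url is made up, and it reproduces just the query shown above with the template name as a parameter.

```python
from urllib.parse import quote, urlencode

BASE = "http://api.brain-map.org/api/v2/data/query.xml"


def rma_query_url(template_name, num_rows=150):
    """Build the NeuronalModel query above for a given template name."""
    criteria = (
        "model::NeuronalModel,rma::criteria,"
        f"neuronal_model_template[name$eq'{template_name}']"
    )
    # quote_via=quote encodes spaces as %20 rather than '+'.
    params = urlencode(
        {"criteria": criteria, "num_rows": num_rows}, quote_via=quote
    )
    return f"{BASE}?{params}"


url = rma_query_url("Biophysical - all active")
```

The resulting url can be passed to any HTTP client; other template names would have to be looked up via the API itself rather than guessed.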

Sphinx: Show all results order by previous searches

I use SphinxQL for searching and filtering in a product database, and I store the last x search phrases of each user. I wonder whether it is possible to show all products (all rows) to every user, but ordered by relevance to their previous searches.
Let's say a user searched for mobile phones (iphone, galaxy s7...), i.e. the electronics category. I want to show them all products in random order, but with products from the electronics category appearing more often and products matching the searched keywords even more often.
Is it even possible with Sphinx?
Thanks, and sorry for my English.
An alternative would perhaps be to attach random numbers to each result: a high and a low number, with overlapping ranges.
sql_query = SELECT id, RAND()*100 AS rand_low, (RAND()*100)+50 AS rand_high, ...
sql_attr_uint = rand_low
sql_attr_uint = rand_high
You can then arrange the ranking expression to pick one of these numbers depending on whether the document matches or not, and sort by the result.
SELECT id FROM index WHERE MATCH('_all_ MAYBE electronics MAYBE (galaxy s7)')
OPTION ranker=expr('IF(doc_word_count>1,rand_high,rand_low)');
The results will be mixed up, but results that match one of the words have a greater chance of showing up first (because they use the higher-weighted random number). It is still only a chance, because a rand_high CAN still be smaller than a rand_low.
... you can change the size of the number 'overlap' to tweak the mix of matching/non-matching results.
(Added as a new answer as it's quite a different idea, although it uses the same 'all' keyword.)
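The effect of the overlapping ranges can be tried out outside Sphinx with a small Python simulation. This is only a model of the idea: it draws fresh random numbers on every call, whereas in the setup above they are fixed at indexing time, and the function and parameter names are invented.

```python
import random


def biased_shuffle(doc_ids, matched_ids, shift=50, rng=None):
    """Order documents by a random score that favours matched ones.

    Non-matching documents score in [0, 100); matching documents score
    in [shift, shift + 100). A smaller shift means more overlap.
    """
    rng = rng or random.Random()
    scored = []
    for doc_id in doc_ids:
        rand_low = rng.random() * 100
        rand_high = rng.random() * 100 + shift
        score = rand_high if doc_id in matched_ids else rand_low
        scored.append((score, doc_id))
    scored.sort(reverse=True)  # highest score first
    return [doc_id for _, doc_id in scored]


# Documents 3 and 7 stand in for products matching previous searches.
order = biased_shuffle(range(1, 11), {3, 7}, shift=50)
```

With shift=0 the matching documents get no advantage at all, and with shift=100 the ranges no longer overlap, so matching documents always come first; values in between give the mix described above.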
Sphinx doesn't have a 'mode' that does just that, but you can get very close...
You can use the MAYBE operator:
MATCH('_all_ MAYBE electronics MAYBE (galaxy s7)')
The complication is that you need a way to match all products. Depending on your data you may already have a word you can use (e.g. a word like 'the' that is in every single product), or you can add such a word to every document during indexing.
... using MAYBE allows the matching results to have a higher weight.
But you don't want to sort strictly by weight, so you need a different algorithm, something to shuffle the results a bit (as you don't really want 'random'!)
SELECT id, IDIV(id,10000) AS band, WEIGHT() AS w
FROM index WHERE MATCH('_all_ MAYBE electronics MAYBE (galaxy s7)')
ORDER BY band DESC, w DESC;
This creates banding by ID: as in theory results can be spread over the whole id-space, this will mix them up a bit, but the category results will still tend to be shown first within each band.
(If you have one, an attribute other than ID might be better, something more spread out. Or you can add a deliberate random attribute to the results.)
... there are all sorts of variations; your imagination is the only limitation. This basic technique can be used to mix things up quite a bit.
(There are other possibilities: Sphinx's little-known GROUP N BY clause can be used to produce a sampling search result. This isn't random, but it might give a similar enough result, i.e. just mixing up the results.)

FileMaker: Dropdown list with exact values in text

I have a drop down list of subjects. Two particular subjects are Mathematics and Additional Mathematics. When I choose Mathematics from the drop down list, records from Additional Mathematics and Mathematics are both displayed. Worse is that records from Additional Mathematics are shown first. Many colleagues made mistakes because of this.
How do I make the drop down list such that when clicked, the exact terms are used instead?
This is a problem that is not necessarily unique to FileMaker. You are searching for a name that is imprecise because it matches multiple names. Instead, you might want to search for a unique key belonging to the subject whose name is 'Mathematics' as displayed in your drop down. It is the use of that unique key that allows you to perform a precise search, even when the name of one subject is a partial or complete match for another.
This solution requires you to add a unique serial number which is, in your case, to alter the Subjects table and add a field called 'idnumber' or similar. The field type should be Number, and the options should include Auto-Enter-Serial number-Generate and On creation-increment by 1. The trick here lies in making sure no two subjects have the same 'idnumber' even when you aren't paying attention, so set the next value to something greater than the number of subjects that already exist. Then from another layout assign each existing subject a unique idnumber, noting that if there are a great many subjects you could script that step.
I should mention that many recommend a best practice of never changing a production layout, but rather to duplicate the layout and make the required changes to the duplicate. This minimizes the effects of testing your changes etc.
Finally, change your layout in inspector such that the drop down list shows Use values from field: 'idnumber'. Select Also display values from second field: 'Subject' and Show values only from second field. Now your drop down is the same clean selection as before. The field will not look correct yet because it will show a number. To make it look correct you can insert another field, selecting 'Subject'. Place that field over top of the 'idnumber' and send 'idnumber' to the back. Fill the 'Subject' field with the correct background solid color instead of none, and enjoy your new precision search capability! The entire process is handled server side so it should not matter that client access is IWP.
If you're using the selection to do a find, put an "==" before the text you're searching on. This will tell FileMaker to do an exact field contents search, instead of a "contains" search.

How to form an Endeca query where a field must start with certain letters

Is it possible to form an Endeca query to retrieve records where a field must start with certain letters? Say, get all users whose name starts with A? I looked at Range filters, but they support only numerical fields, and I also tried wildcard search, but nothing has worked well so far.
Creating a dimension is one way of approaching the problem, as Paul Lemke mentioned. Wildcard search is not an option because of the performance overhead and the irrelevant records it returns.
But we solved it using a couple of other alternatives.
Create a new property for the Object called "StartWith", store the first letter of the Object in it and make it searchable. We found this easier than creating a Dimension.
There is a problem in that letters like 'A' are usually stop words in Endeca. There are a couple of workarounds for this:
Get the ASCII value of the first letter and store the numerical value in that property. One more advantage of this approach is that we can use Range Filters. But it cannot handle 'AB'-style requirements.
Prepend some characters, e.g. store ^^^My name and search for ^^^M. The advantage of this approach is that you can search for conditions like names starting with AB.
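The two workarounds amount to computing extra properties for each record before it is indexed. The sketch below shows that preprocessing step in plain Python; it does not touch any Endeca API, and the property names StartWithCode and MarkedName are invented for the example.

```python
MARKER = "^^^"  # the prefix characters suggested above


def derived_properties(name):
    """Compute the extra properties to index alongside a record's name."""
    return {
        # Workaround 1: numeric code of the first letter ('A' -> 65),
        # which can be used with range filters.
        "StartWithCode": ord(name[0].upper()),
        # Workaround 2: marked copy of the name, so a search for
        # "^^^AB" only hits names that begin with "AB".
        "MarkedName": MARKER + name,
    }


# Hypothetical records, used to simulate both kinds of lookup.
names = ["Abigail", "Alice", "Bob", "Carol"]
records = [derived_properties(n) for n in names]

# Range-filter style lookup: first letter between 'A' and 'B'.
a_to_b = [
    n for n, r in zip(names, records)
    if ord("A") <= r["StartWithCode"] <= ord("B")
]

# Marker style lookup: names starting with "AB", case-insensitively.
starts_ab = [
    n for n, r in zip(names, records)
    if r["MarkedName"].upper().startswith(MARKER + "AB")
]
```

The numeric code only ever captures one letter, which is why the second workaround is needed for multi-letter prefixes such as AB.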
Endeca at its current version (6.1) does not have a search filter that works like a "startswith" function in other programming languages.
I do have two options that might possibly get you close:
If you are truly only looking for the first letter, you can set up a Dimension value for each letter of the alphabet (A, B, C...). You can then refine on each letter and see only the values that start with A, B, C, etc. The only downside is that you can only filter on the dimension values you have set up. So if you added "A", you couldn't filter for anything that started with "AB". You could go down the line and add "AB", "BA", "CA", and so on, but that would get unwieldy very fast.
If you want something closer to a "startswith" function the only other option is to use a wildcard search. Basically you would do a property search like this: N=0&Ntk=Username&Ntt=ab*
The trick with wildcard searching is it will do that across multiple words in that property. So assuming you had a data set of these values:
Smithers Smith
Larry Smith
Jenna-Smith
Doing a search of sm* would actually return all 3 results because "sm" was in their last name. Even the one with the dash would be returned, as Endeca thinks that is a separate word. (It might be possible to turn that off though, not sure.)
So basically it comes down to this: store a single word in a property, set that property to allow wildcard search, then do a "blah*" search against that property and you should have the results you're looking for.
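The per-word behaviour described above can be mimicked in a few lines of Python. This is a simplified model, not Endeca's actual tokenizer: it assumes values are split into words on any non-alphanumeric character, including the dash.

```python
import re


def wildcard_prefix_match(value, pattern):
    """True if any word in value starts with the pattern's prefix.

    pattern is a trailing-wildcard term such as "sm*".
    """
    prefix = pattern.rstrip("*").lower()
    words = re.split(r"[^0-9a-z]+", value.lower())
    return any(word.startswith(prefix) for word in words if word)


values = ["Smithers Smith", "Larry Smith", "Jenna-Smith"]
hits = [v for v in values if wildcard_prefix_match(v, "sm*")]
```

Here hits contains all three values, whereas a property holding a single word, such as just "Larry", would not match sm*, which is why the one-word property makes the wildcard behave like a "startswith".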
Have you tried the First relevance rank module which is supposed to rank based on proximity to the beginning of the field?
It sounds similar to what you are looking for and together with a wild card may produce your intended results.