Sphinx Filtering based on categories using OR - sphinx

I have the following text fields I search with Sphinx: Title, Description, keywords.
However, sometimes things are narrowed down using categories. We have 3 category fields: CatID1, CatID2 and CatID3.
So, for example, I need to see if the word "Kittens" is in the Title, Description, or Keywords, but I also want to filter so that only items with the categories (Animals - ID Number 8), (Pets - ID Number 9) or (Felines - ID Number 10) in any of those CatID fields are returned.
To clarify, only show items that have an 8, 9 or 10 in CatID1, 2 or 3.
Any ideas on how I would accomplish this using sphinx filtering or searching the CatID1 fields as keywords?
Note: filtering already works great when I use only one category, i.e.:
if (!empty($cat_str)) {
    $cl->SetFilter('catid1', array($cat_str));
}
Thanks!
Craig

SetFilter takes an array. In your example you are putting $cat_str into an array, i.e. an array of one item.
So you just need to build an array with all the ids.
$cl->SetFilter( 'catid', array( $cat1, $cat2, $cat3 ));
But that's not very flexible, so you would probably build the array dynamically rather than hard-coding it like that. How to build the array is up to your application.
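As a rough sketch of building the id list dynamically (written in Python only for illustration; the helper name build_category_filter and the comma-separated input format are assumptions, not part of the question), the application could normalise whatever the user selected into a list of unique integers first:

```python
def build_category_filter(raw_ids):
    """Turn input such as '8, 9,10' into a sorted list of unique ints."""
    ids = set()
    for part in raw_ids.split(','):
        part = part.strip()
        # Ignore empty or non-numeric pieces instead of failing on them.
        if part.isdigit():
            ids.add(int(part))
    return sorted(ids)


category_ids = build_category_filter('8, 9,10, 9')
print(category_ids)
```

The resulting list is what would be handed to SetFilter as the values array.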
Also, storing the ids in three separate attributes makes them hard to search. Notice that the example above uses an attribute called catid. This would be a single multi-value attribute (MVA) that contains the ids from all three cat fields. That way it's easy to search for ids in ANY of the columns at once.
http://sphinxsearch.com/docs/current.html#mva
If you are using a SQL source, you could do it with something like:
sql_query = SELECT id, title ... , CONCAT_WS(',', CatID1, CatID2, CatID3) AS catid FROM ...
sql_attr_multi = uint catid from field
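To make the matching rule concrete, here is a small plain-Python model of how a filter on a multi-value attribute behaves: a document passes when any one of its values is in the filter list. It does not call Sphinx at all; the function name and the sample documents are made up for illustration.

```python
def mva_filter_matches(doc_values, filter_values):
    """True when the document has at least one value in the filter list."""
    return not set(doc_values).isdisjoint(filter_values)


# Hypothetical documents: id -> combined category ids (the catid MVA).
docs = {
    1: [8, 3, 5],
    2: [1, 2, 4],
    3: [7, 10, 9],
}

matching = [doc_id for doc_id, cats in sorted(docs.items())
            if mva_filter_matches(cats, [8, 9, 10])]
print(matching)
```

Documents 1 and 3 pass because each holds at least one of 8, 9 or 10, regardless of which of the three original columns the id came from.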

Related

Doctrine : PostgreSQL group by different than select

I have two tables :
user and activityHistory (that has a key to the user)
I am trying to query activityHistory while grouping by user. PostgreSQL does not allow me to do that, while SQLite does.
return $qb
    ->select('a.id')
    ->leftJoin('a.user', 'user')
    ->leftJoin(
        ActivityHistory::class,
        'b',
        'WITH',
        'a.id = b.id AND a.createdAt > b.createdAt'
    )
    ->groupBy('user.id')
    ->orderBy('a.createdAt', 'ASC')
    ->getQuery()->getArrayResult();
I am getting this error only with PostgreSQL: Grouping error: 7 ERROR: column "a0_.id" must appear in the GROUP BY clause or be used in an aggregate function
The point is I don't want to group by the activityHistory id, I only want it to be selected. How can I do that? (I heard about aggregates, but I think they only work with functions like SUM etc.)
First of all, let's clarify how aggregation works. Aggregation is the act of grouping by certain field(s) and selecting either those fields or calling aggregation functions and passing ungrouped fields.
You misunderstand how this works (hence the question), so let me provide a few very simple examples:
Example 1
Let's consider that there is a town and there are individuals living in that town. Each individual has an eye color, but, if you are wondering what the eye color of the people of the town is, then your question does not make sense, because the group itself does not have an eye color, unless specified otherwise.
Example 2
Let's modify the example above by grouping the people of the town by eye color. Such an aggregation will have one row for each existing eye color, and you can select the eye color along with the average age, number of individuals, etc. as you like, because you are grouping by eye color.
Your example
You have users and they are performing actions. So, an activity is performed by a single user, but a user may perform many activities. So, if you want to group by your user id, then the "eye color" that you are not grouping by here is the history id.
You will have a single record per user, so you are grouping multiple history items into the same row. After the grouping, asking for the history item's id no longer makes sense, because the row does not have one single history id.
But you can use string_agg(some_column, ',') which will take all the values in the group and put them into one string, separated by commas.
You can then call explode(',', $yourvalues) in PHP to convert such a value into an array.
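Here is a minimal sketch of that idea, using Python's built-in sqlite3 module so it can be run anywhere. Note the swap: SQLite's group_concat stands in for PostgreSQL's string_agg, and the table and column names are simplified assumptions rather than the asker's real schema.

```python
import sqlite3

conn = sqlite3.connect(':memory:')
conn.executescript("""
    CREATE TABLE activity_history (id INTEGER PRIMARY KEY, user_id INTEGER);
    INSERT INTO activity_history (id, user_id) VALUES (1, 10), (2, 10), (3, 20);
""")

# One row per user; the ungrouped ids are collapsed into one string per group.
rows = conn.execute(
    "SELECT user_id, group_concat(id, ',') "
    "FROM activity_history GROUP BY user_id ORDER BY user_id"
).fetchall()

# Equivalent of PHP's explode(','): split the string back into a list.
ids_by_user = {
    user_id: sorted(int(x) for x in joined.split(','))
    for user_id, joined in rows
}
print(ids_by_user)
```

The ids are sorted after splitting because the order of values inside the aggregated string should not be relied on unless the query states it explicitly.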

How to group data in a table (tableau)

I have a table which contains the sums of different values. You can see in the attached picture that Filling Method equals FM-Paper plus FM-Electronic. How can I display this so that it is visually more attractive?
Caveat: There is some risk in the format of your data - double counting values. Considering this is just for display purposes only, here's what I'd suggest:
Creating a visual where users are able to intuitively realize that one or more categories are a subset of another category is the goal. (Tableau does this natively when the data is shaped correctly.)
To mimic a subcategory, create a calculated field like so:
IF [Category] = 'FM-Paper' OR [Category] = 'FM-Electronic' THEN
    STR(' ') + [Category]
ELSE [Category]
END
This simply adds space before the two 'sub-categories.' If it looks better or is better understood, you could also do something like '*****' or '---->'.
You could take this a step further and remove the parent category if it is not needed, leaving only the 'sub-categories', by replacing your values with a calculated field:
IF [Category] = 'Filling Method' THEN NULL
ELSE [Value]
END

Using MYSQLI to select rows in which part of a column matches part of an input

I have a database in which one of the columns contains a series of information 'tags' about the row that are stored as a comma-separated list (a string) of dynamic length. I am using mysqli within PHP, and I want to select rows in which any of these items match any of the items in an input string.
For example, there could be a row describing an apple, containing the tags: "tasty, red, fruit, sour, sweet, green." I want this to show up as a result in a query like: "SELECT * FROM table WHERE info tags IN ('blue', 'red', 'yellow')", because it has at least one item ("red") overlapping. Kind of like "array_intersect" in PHP.
I think I could use IN if each row had only one tag, and I could use LIKE if I used only one input tag, but both are of dynamic length. I know I can loop over all the input tags, but I was hoping to put this in a single query. Is that possible? If not, can I use a different structure to store the tags in the database to make this possible (something other than a comma separated string)?
I think the best approach would be to create a tags table (id + label) and then a separate "table_tags" table which holds table_id and tag_id.
That means using JOINs to get the final result.
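A rough sketch of that normalised layout, using Python's sqlite3 module as a stand-in for MySQL so it is runnable as-is. The names items, item_tags and items_with_any_tag are placeholders chosen for the example; only the tags table (id + label) and the link table idea come from the answer.

```python
import sqlite3

conn = sqlite3.connect(':memory:')
conn.executescript("""
    CREATE TABLE items (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE tags (id INTEGER PRIMARY KEY, label TEXT UNIQUE);
    CREATE TABLE item_tags (item_id INTEGER, tag_id INTEGER,
                            PRIMARY KEY (item_id, tag_id));
    INSERT INTO items VALUES (1, 'apple'), (2, 'sky');
    INSERT INTO tags VALUES (1, 'red'), (2, 'fruit'), (3, 'blue'), (4, 'vast');
    INSERT INTO item_tags VALUES (1, 1), (1, 2), (2, 3), (2, 4);
""")


def items_with_any_tag(conn, labels):
    """Return names of items that carry at least one of the given tags."""
    if not labels:
        return []
    # One bound placeholder per input tag, so the list can have any length.
    placeholders = ', '.join('?' for _ in labels)
    sql = (
        "SELECT DISTINCT items.name FROM items "
        "JOIN item_tags ON item_tags.item_id = items.id "
        "JOIN tags ON tags.id = item_tags.tag_id "
        "WHERE tags.label IN (" + placeholders + ") "
        "ORDER BY items.name"
    )
    return [row[0] for row in conn.execute(sql, list(labels))]


print(items_with_any_tag(conn, ['blue', 'red', 'yellow']))
```

DISTINCT keeps an item from being listed twice when it matches more than one of the input tags.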
Another (but lazy) solution would be to prefix and suffix the tag list with commas (and drop the spaces), so the full column contains something like:
,tasty,red,fruit,sour,sweet,green,
Then you can do a LIKE search without worrying about overlapping words (e.g. red vs. bored) and still get a proper match by using LIKE '%,WORD,%'.
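For several input tags in a single query, the LIKE conditions can be joined with OR. The sketch below assumes the comma-wrapped storage format just described and again uses sqlite3 in place of MySQL; the table name items and the helper find_by_any_tag are invented for the example.

```python
import sqlite3

conn = sqlite3.connect(':memory:')
conn.execute("CREATE TABLE items (name TEXT, tags TEXT)")
conn.executemany("INSERT INTO items VALUES (?, ?)", [
    ('apple', ',tasty,red,fruit,sour,sweet,green,'),
    ('meeting', ',bored,long,'),
])


def find_by_any_tag(conn, wanted):
    """Return item names whose tag column contains any of the wanted tags."""
    if not wanted:
        return []
    # One "tags LIKE ?" per input tag, combined with OR.
    clauses = ' OR '.join('tags LIKE ?' for _ in wanted)
    # Wrapping each tag in commas stops 'red' from matching 'bored'.
    params = ['%,' + tag + ',%' for tag in wanted]
    sql = "SELECT name FROM items WHERE " + clauses + " ORDER BY name"
    return [row[0] for row in conn.execute(sql, params)]


print(find_by_any_tag(conn, ['blue', 'red', 'yellow']))
```

This does not escape % or _ inside a tag, and a leading-wildcard LIKE cannot use an ordinary index, so it is best suited to small tables.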

Unable to use Sphinx MVA sql_attr_multi

I have a field called "tags" and it has values (say) "Music, Art, Sports, Food" etc. How can I use the setFilter function in PHP-Sphinx for this field? I know that it has to be an integer and should be passed as an array in PHP. So, if I use a numeric field for tags, what about the delimiters (in this case commas)? Currently, I am using "sql_attr_multi" like this:
sql_attr_multi = uint tags from field
I have to filter the search based on any of the keywords the user has selected (Music, Sports, Food etc.). As such, an MVA seems to be the only right option for this, but I am just not able to figure out how to do it. I can store all tag elements as numeric values and make the tags field an int. But what about the commas, and how would I convert the whole string (Music, Art, Sports, Food) to integers? And later, how do I call setFilter using PHP?
Any help is highly appreciated.
Well, using an MVA suggests you already have unique ids for each tag, which you would if you had a separate table for tags (with a PK) and a many-to-many table joining your documents and tags. (That's a very common way to store tags, in normal form.)
If you have a text column containing the tag text, it would be easier to just use a field. You can easily filter by fields in the main text query, for example:
crispy creams @tags Food
(that's an extended mode query).
(But fields can't do grouping like attributes can.)

Most efficient database schema for counting keywords

I'm working on an iPhone app with a GAE backend. I currently have a database of ~8000 products and each product has 5 keywords, mined from reviews, that are the words used most often to describe the product. Once I deploy the app, I'd like to allow users to add new products, and add their 5 keywords to existing products. So, when "reviewing" an existing product, they would add their 5 words, and these would be reflected in the Top 5 words if they push a word over into the Top 5. These keywords will be selected via a large whitelist with indirect selection so I can control the user input. I'd like the application to scale to thousands of users without hitting my backend too hard.
My question is:
What's the most efficient database schema for keeping track of all the words for a product and calculating the top 5 for each product once it's updated?
My two ideas (which may be terrible):
Have a "words" column which contains a 2d array, one dimension is the word, the other is the count for that word. They would then be incremented/decremented as needed.
Have a database with each word as a column and each product as a row and the corresponding row/column would contain the count.
The easiest way to do this would be to have a 'tags' kind, defined something like this (you haven't specified a backend language, so I'm assuming Python):
from google.appengine.ext import db

class Tag(db.Model):
    # Tags should be child entities of Products and have key name based on the tag
    # eg, created with Tag(parent=a_product, key_name='awesome', ...)
    count = db.IntegerProperty(required=True, default=0)

    @classmethod
    def increment_tags(cls, product, tag_names):
        def _tx():
            # Returns one entry per requested name; None where no entity exists yet
            tags = cls.get_by_key_name(tag_names, parent=product)
            for i, tag in enumerate(tags):
                if tag is None:
                    # New tag
                    tags[i] = tag = cls(key_name=tag_names[i], parent=product)
                tag.count += 1
            db.put(tags)
        return db.run_in_transaction(_tx)

    @classmethod
    def get_top_product_tags(cls, product, num=5):
        return [x.key().name() for x
                in cls.all().ancestor(product).order('-count').fetch(num)]
The increment_tags method increments the count property on all the relevant tags. Since they all have the same parent entity, they're in the same entity group, so it can do this in a single transaction.
The get_top_product_tags method does a simple datastore query to find the num top ranked tags for a product.
You should use a normalized schema and let SQL and the database engine be your friend. Have a single table with a design like this:
create table KeywordUse
( AppID int
, UserID int
, Sequence int
, Word varchar(50) -- or whatever makes sense
)
You can also have an identity primary key if you like, but AppID + UserID + Sequence is a candidate key (i.e. the combination of these three must be unique).
To find the top 5 keywords for any app, do a SQL query like this:
select top 5
    count(AppID) as Frequency -- If you have an identity PK count that instead.
  , Word
from KeywordUse
where AppID = @AppIDVariable...
group by Word, AppID
order by count(AppID) desc
If you are really, really worried about performance you could denormalize the results of this query into a table that shows the words for each app. Then you'd have to work out how often to refresh that snapshot.
REVISED ANSWER:
As Nick Johnson so generously pointed out, aggregate functions are not available in GQL. However, the philosophy of my answer remains unchanged. Let the database engine do its job.
The table should be AppID, Word, and Frequency. (AppID and Word together are the PK.) Then each use of a word would be added to its Frequency as it is applied. When you want to know the top five words for an app, you select by AppID = @Value and order by Frequency (descending) with a LIMIT of 5.
You would need a separate table to track user keywords if that is important.
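A minimal sketch of the revised design, with Python's sqlite3 module standing in for the real backend (the App Engine datastore would need its own API calls, as in the other answer). The table name KeywordFrequency and both helper functions are hypothetical; only the AppID, Word and Frequency columns come from the answer.

```python
import sqlite3

conn = sqlite3.connect(':memory:')
conn.execute("""
    CREATE TABLE KeywordFrequency (
        AppID INTEGER NOT NULL,
        Word TEXT NOT NULL,
        Frequency INTEGER NOT NULL DEFAULT 0,
        PRIMARY KEY (AppID, Word)
    )
""")


def record_words(conn, app_id, words):
    """Add one to the running count of each submitted word."""
    with conn:  # commit all increments together, or roll back on error
        for word in words:
            # Create the row on first use, then increment it.
            conn.execute(
                "INSERT OR IGNORE INTO KeywordFrequency (AppID, Word) "
                "VALUES (?, ?)", (app_id, word))
            conn.execute(
                "UPDATE KeywordFrequency SET Frequency = Frequency + 1 "
                "WHERE AppID = ? AND Word = ?", (app_id, word))


def top_words(conn, app_id, limit=5):
    """Most frequent words for one app; ties are broken alphabetically."""
    rows = conn.execute(
        "SELECT Word FROM KeywordFrequency WHERE AppID = ? "
        "ORDER BY Frequency DESC, Word ASC LIMIT ?", (app_id, limit))
    return [row[0] for row in rows]


record_words(conn, 1, ['fast', 'cheap', 'fast'])
record_words(conn, 1, ['fast', 'shiny'])
print(top_words(conn, 1, 2))
```

Because the counts are maintained at write time, reading the top five is a simple ordered lookup with no aggregation.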