Get all facets in Lucene.NET using SimpleFacetedSearch

I am trying to implement faceted search using the SimpleFacetedSearch example that was added to Lucene 2.9.4, and I want to know whether it is possible to get all the facets in Lucene.NET using SimpleFacetedSearch.
Say, for example, I have three columns indexed:
ID | A  | B
1  | F1 | E1
2  | F2 | E2
3  | F1 | E1
4  | F3 | E3
5  | F2 | E2
According to my understanding of SimpleFacetedSearch, I have to parse a query, pass it to SimpleFacetedSearch and then search - which only returns the facets that match the parsed query.
But I want all of the facets without having to parse a query: that is, the facet counts of all possible facets in the index.
For the table above, I want the output to be:
A=F1(2),F2(2),F3(1)
B=E1(2),E2(2),E3(1)
In short I do not want to parse any query and want all facets returned for the entire index.
Thanks

You can use the MatchAllDocsQuery query, so you would create your query as Query query = new MatchAllDocsQuery(). Then you simply call Search with that query passed. You do not have to parse it, since this is part of Lucene's query API; you only need to parse a query when it is coming from the user. Basically, use a QueryParser when your query is being entered by a user, but use the Query API to add terms when you want to generate queries programmatically. I do not think they did a good job of teaching that in the example code for SFS.
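For illustration, here is a minimal sketch of that idea. Treat it as a sketch only: the exact namespaces and members of the contrib SimpleFacetedSearch class (the constructor taking an IndexReader plus the field names to group by, the Search(Query, int) overload, and the Hits/HitsPerFacet/Name/HitCount members) are assumed from the contrib example code, and indexDirectory stands in for whatever Lucene.Net Directory your index lives in:

using System;
using Lucene.Net.Index;
using Lucene.Net.Search;

// Open a read-only reader over the existing index.
IndexReader reader = IndexReader.Open(indexDirectory, true);

// Facet on field "A"; create a second instance for "B" if you want its counts as well.
var facetedSearch = new SimpleFacetedSearch(reader, new[] { "A" });

// MatchAllDocsQuery matches every document, so the facet counts cover the whole index.
// No QueryParser is involved because the query is built programmatically.
Query query = new MatchAllDocsQuery();
SimpleFacetedSearch.Hits hits = facetedSearch.Search(query, 10);

foreach (SimpleFacetedSearch.HitsPerFacet facet in hits.HitsPerFacet)
{
    // For the sample data this should print F1(2), F2(2), F3(1).
    Console.WriteLine("{0}({1})", facet.Name, facet.HitCount);
}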
Do keep in mind that max values are set in SimpleFacetedSearch, mainly MAXFACETS=2048, which means an exception will be thrown if you have more than 2048 facet combinations present. You can tweak this value if you need to, but keep in mind that faceting is an expensive operation and you will increase the search time by going through so many facets.
Judging by the example you gave, I am not sure you are picturing the faceted output correctly. The sample output would be { (F1, E1) - 2, (F2, E2) - 2, (F3, E3) - 1 }, where each facet combination is in parentheses and its count follows the dash.

Related

Splitting up ODATA query with multiple ApplyTo returns incorrect TotalCount

I'm using an ODATA-enabled grid with ASP.NET Core 6 and EF Core 3 backend. The grid applies sorting, filtering and paging with ODATA which I can easily integrate into EF Core using query.ApplyTo(efContext.Table).
There is a special requirement though: all data rows are grouped in sets, and filtering should not apply per row but rather per set of rows.
Let's say the rows are products, which are bundled in packages. A user can only buy the whole package; however, they need to be able to search the products in detail.
For example: there are columns such as Price or Brand that can be used for filtering. If a user filters on Price <= 100 && Brand = 'Contoso' then all sets need to be shown where at least one row satisfies the filter. All rows in a set need to be shown if one passes the filter.
Currently I implemented this by calling ApplyTo twice, with different AllowedQueryOptions:
public IQueryable<Product> ListPackages(ODataQueryOptions<Product> query)
{
    var packageIds = ((IQueryable<Product>)query.ApplyTo(efContext.Products, AllowedQueryOptions.Filter))
        .Select(product => product.PackageID)
        .Distinct()
        .ToHashSet();

    var packages = efContext.Products.Where(product => packageIds.Contains(product.PackageID)).AsQueryable();

    return (IQueryable<Product>)query.ApplyTo(packages, AllowedQueryOptions.Select | AllowedQueryOptions.OrderBy | AllowedQueryOptions.Top | AllowedQueryOptions.Skip | AllowedQueryOptions.Count | AllowedQueryOptions.Expand);
}
It seems that the TotalCount that is returned is based on the results of the first ApplyTo() call, even though Count is not applied there. The second call is actually the one that returns the total set of rows. How could I fix this?

Limit results on OR condition in Sphinx

I am trying to limit results by somehow grouping them. This query attempt should make things clear:
#namee ("Cameras") limit 5| #namee ("Mobiles") limit 5| #namee ("Washing Machine") limit 5| #namee ("Graphic Cards") limit 5
where namee is the column.
Basically I am trying to limit results based upon specific criteria.
Is this possible? Is there an alternative way of doing what I want to do?
I am on Sphinx 2.2.9.
There is no Sphinx syntax to do this directly.
The easiest would be just to run 4 separate queries directly and 'union' them in the application itself. Performance isn't going to be terrible.
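Roughly, the application-side union could look like the sketch below. It is only a sketch: my_index is a placeholder index name, and runSphinxQuery stands in for whatever call you use to execute one SphinxQL statement against searchd (for instance a MySQL client library pointed at the SphinxQL listener) and return the matching rows:

using System;
using System.Collections.Generic;

static List<Dictionary<string, object>> SearchByCategory(
    Func<string, IEnumerable<Dictionary<string, object>>> runSphinxQuery)
{
    var terms = new[] { "Cameras", "Mobiles", "Washing Machine", "Graphic Cards" };
    var results = new List<Dictionary<string, object>>();

    foreach (var term in terms)
    {
        // One query per term, each with its own LIMIT 5, then "union" the
        // result sets in the application instead of asking Sphinx to do it.
        var sphinxQl = string.Format(
            "SELECT * FROM my_index WHERE MATCH('@namee \"{0}\"') LIMIT 5", term);
        results.AddRange(runSphinxQuery(sphinxQl));
    }

    return results;
}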
... If you REALLY want to do it in Sphinx, you can exploit a couple of tricks to get close, but it gets very complicated.
You would need to create 4 separate indexes (or up to as many terms as you need!), each with the same data, but with the field called something different (they duplicate each other!). You would also need an attribute on each one (more on why later):
source str1 {
    sql_query     = SELECT id, namee AS field1, 1 as idx FROM ...
    sql_attr_uint = idx
}
source str2 {
    sql_query     = SELECT id, namee AS field2, 2 as idx FROM ...
    sql_attr_uint = idx
}
... etc
Then create a single distributed index over the 4 indexes.
Then you can run a single query to get all the results kinda magically unioned...
MATCH('@@relaxed @field1 ("Cameras") | @field2 ("Mobiles") | @field3 ("Washing Machine") | @field4 ("Graphic Cards")')
(The @@relaxed is important, as the fields are different - the matches must come from different indexes.)
Now to limiting them... Because each keyword match must come from a different index, and each index has a unique attribute, the attribute identifies which term matched.
In Sphinx there is a nice GROUP N BY, where you only get a certain number of results per attribute value, so (putting all that together) you could do:
SELECT *,WEIGHT() AS weight
FROM dist_index
WHERE MATCH('@@relaxed @field1 ("Cameras") | @field2 ("Mobiles") | @field3 ("Washing Machine") | @field4 ("Graphic Cards")')
GROUP 4 BY idx
ORDER BY weight DESC;
Simples, eh?
(Note it only works if you want 4 from each index; if you want different limits it is much more complicated!)

Using groupBy in Spark and getting back to a DataFrame

I am having difficulty working with data frames in Spark with Scala. If I have a data frame from which I want to extract a column of unique entries, when I use groupBy I don't get a data frame back.
For example, I have a DataFrame called logs that has the following form:
machine_id | event | other_stuff
34131231 | thing | stuff
83423984 | notathing | notstuff
34131231 | thing | morestuff
and I would like the unique machine ids where event is "thing", stored in a new DataFrame, to allow me to do some filtering of some kind. Using
val machineId = logs
.where($"event" === "thing")
.select("machine_id")
.groupBy("machine_id")
I get a GroupedData value back, which is a pain in the butt to use (or I don't know how to use this kind of object properly). Having got this list of unique machine ids, I then want to use it to filter another DataFrame and extract all events for individual machine ids.
I can see I'll want to do this kind of thing fairly regularly and the basic workflow is:
Extract unique ids from a log table.
Use unique ids to extract all events for a particular id.
Use some kind of analysis on this data that has been extracted.
It's the first two steps I would appreciate some guidance with here.
I appreciate this example is kind of contrived, but hopefully it explains what my issue is. It may be that I don't know enough about GroupedData objects or (as I'm hoping) I'm missing something in data frames that makes this easy. I'm using Spark 1.5 built on Scala 2.10.4.
Thanks
Just use distinct not groupBy:
val machineId = logs.where($"event"==="thing").select("machine_id").distinct
Which will be equivalent to SQL:
SELECT DISTINCT machine_id FROM logs WHERE event = 'thing'
GroupedData is not intended to be used directly. It provides a number of methods, of which agg is the most general; these can be used to apply aggregate functions and convert the result back to a DataFrame. In terms of SQL, what you have after where and groupBy is equivalent to something like this:
SELECT machine_id, ... FROM logs WHERE event = 'thing' GROUP BY machine_id
where the ... has to be provided by agg or an equivalent method.
A group by in Spark followed by aggregation and then a select statement will return a data frame. For your example it should be something like:
val machineId = logs
  .groupBy("machine_id", "event")
  .agg(max("other_stuff"))
  .where($"event" === "thing")
  .select($"machine_id")

How to use Berkeley DB's non-SQL, Key/Value API to implement a fuzzy query (LIKE keyword)

I can understand this blog post, but it does not seem to apply to the case of using Berkeley DB's non-SQL, Key/Value API to implement "SELECT * FROM table WHERE name LIKE '%abc%'".
Table structure
-------------------------------------------
key data(name)
-------------------------------------------
0 abc
1 abcd
2 you
3 spring
4 sabcd
5 timeab
I guess iterating over all records is not an efficient way, but it does do the trick.
You're correct. Absent any other tables, you'd have to scan all the entries and test each data item. In many cases, it's as simple as this.
If you're using SQL LIKE, I doubt you'll be able to do better unless your data items have a well-defined structure.
However, if the "WHERE name LIKE %abc%" query you have is really WHERE name="abc", then you might choose to take a performance penalty on your db_put call to create a reverse index, in addition to your primary table:
-------------------------------------------
key(name) data(index)
-------------------------------------------
abc 0
abcd 1
sabcd 4
spring 3
timeab 5
you 2
This table, sorted in alphabetical order, requires a lexical key comparison function, and uses support for duplicate keys in BDB. Now, to find the key for your entry, you could simply do a db_get ("abc"), or better, open a cursor with DB_SETRANGE on "abc".
Depending on the kinds of LIKE queries you need to do, you may be able to use the reverse index technique to narrow the search space.

Adding a constraint to a Splunk search yields more result rows

Consider the following two Splunk searches:
index=a | join type=inner MyKey [search index=b]
and:
index=a | join type=inner MyKey [search index=b | where MyVal > 0]
Remarkably, the latter of the searches - the search whose subsearch has a constraint - has three times as many result rows as the former.
The Splunk documentation page for the join command suggests semantics that are close enough for the sake of argument to SQL's join:
A join is used to combine the results of a search and subsearch if specified fields are common to each. You can also join a table to itself using the selfjoin command.
Based on this information, I assume the two Splunk searches above should be equivalent to the following SQL, respectively:
SELECT *
FROM a
INNER JOIN b ON a.MyKey = b.MyKey
and:
SELECT *
FROM a
INNER JOIN b ON a.MyKey = b.MyKey
WHERE b.MyVal > 0
How is it possible that adding a constraint increases the number of result rows?
Interestingly, the following Splunk search produces a third result - one that matches what I got when I put the same data in an SQL database:
index=a | join type=outer MyKey [search index=b | eval hasmatch=1]
| where hasmatch=1
Some more notes:
the MyVal field has no duplicates in either table / index
I have verified that the raw events in Splunk's indexes match the raw source data in event counts and values for MyVal
the only search-time operation configured for the relevant sourcetypes in props.conf is a report that extracts the fields based on a stanza in transforms.conf (the source data is in a CSV dialect)
Can anyone give me some clues here? As far as I'm concerned this behaviour is nonsensical.
I would suggest that you re-post this question over at the official Splunk forums (splunk-base.splunk.com). There are some very useful and experienced "Splunkers" there (employees, customers, partners, etc.) and it may be a better place for your question.
Cheers,
MHibbin