Sphinx search for distinct values with counts - sphinx

I have an "objects" table (actually a result of a SQL join), with data like this:
ID, content, category_id
1, some searchable data, 5
2, some more data, 6
3, more data, 5
4, another example, 7
I'd like to use Sphinx to index this table and return distinct category_id values, as well as how many records had data hits, sorted by the number of hits.
For instance, if I search this index with the term "data", I'd like the result to be:
5, 2 hits
6, 1 hit
This would be pretty trivial with grouping and counts in MySQL, but I can't get my head around doing the same thing with a Sphinx search.
What should my sql_query be? How should I use the PHP API to get the results I need?

$cl->SetGroupBy( "category_id", SPH_GROUPBY_ATTR, "@count desc" );
The number of distinct category_id values will be in total_found, and each returned match carries its per-group hit count in the @count attribute.
The overall number of records that had hits is not returned directly; the easiest way to get it is to run a second, non-grouped query, where it will be in total_found.
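For reference, here is a minimal sketch of the same grouped query using the Sphinx Python client (sphinxapi.py), which mirrors the PHP API. The index name objects_idx, the host/port, and the assumption that category_id is declared as an attribute (e.g. via sql_attr_uint) are placeholders, not details from the question:
import sphinxapi

cl = sphinxapi.SphinxClient()
cl.SetServer('localhost', 9312)
# Group matches by the category_id attribute; order groups by hit count.
cl.SetGroupBy('category_id', sphinxapi.SPH_GROUPBY_ATTR, '@count desc')

result = cl.Query('data', 'objects_idx')
if result:
    # total_found = number of distinct category_id groups that matched
    print('distinct categories:', result['total_found'])
    for match in result['matches']:
        # each grouped match exposes @groupby (the category_id)
        # and @count (how many records in that group matched)
        attrs = match['attrs']
        print(attrs['@groupby'], attrs['@count'], 'hits')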

Related

Flask-sqlalchemy efficiently get count of query ordered by relationship

I have 2 models: Post and PostLike. I want to return a query of Posts filtered by their PostLike created_at field, as well as return the total count of posts in that query efficiently (without using .count()).
Simplified models:
class Post(db.Model):
    ...
    post_likes = db.relationship('PostLike', back_populates="post")

class PostLike(db.Model):
    ...
    created_at = db.Column(db.Float)
    post_id = db.Column(UUID(as_uuid=True), db.ForeignKey("post.id"), index=True)
    post = db.relationship("Post", back_populates="post_likes", foreign_keys=[post_id])
Here are the queries I'm trying to run:
# Get all posts
posts = Post.query.join(PostLike, Post.post_likes).order_by(PostLike.created_at.desc()).all()
# Get total # of posts
posts = Post.query.join(PostLike, Post.post_likes).order_by(PostLike.created_at.desc()).count()
There are 3 problems with those queries:
1. I'm not sure those queries are the best for my use case. Are they?
2. The count query returns the wrong number: it is higher than the number of results from the .all() query. Why?
3. This is not performant, as it calls .count() directly. How do I implement an efficient query that also retrieves the count? Something like .statement.with_only_columns([func.count()])?
I'm using Postgres, and I'm expecting up to millions of rows to count. How do I achieve this efficiently?
re: efficiency of .count()
The comments in other answers (linked in a comment to this question) appear to be outdated for current versions of MySQL and PostgreSQL. Testing against a remote table containing a million rows showed that whether I used
n = session.query(Thing).count()
which renders a SELECT COUNT(*) FROM (SELECT … FROM table_name), or I use
n = session.query(sa.func.count(sa.text("*"))).select_from(Thing).scalar()
which renders SELECT COUNT(*) FROM table_name, the row count was returned in the same amount of time (0.2 seconds in my case). This was tested using SQLAlchemy 1.4.0b2 against both MySQL version 8.0.21 and PostgreSQL version 12.3.
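Applied to the models in the question, the faster-rendering form might look like the sketch below. Counting DISTINCT Post.id is my own addition, since the join to PostLike otherwise produces one row per like and inflates the total - which is also the likely reason the count query in the question returned a higher number than .all():
import sqlalchemy as sa

# Renders roughly: SELECT count(DISTINCT post.id) FROM post JOIN post_like ON ...
total = (
    db.session.query(sa.func.count(sa.distinct(Post.id)))
    .join(PostLike, Post.post_likes)
    .scalar()
)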

MongoDB query is slow even when searching by indexes

I have a collection called calls containing the properties DateStarted, DateEnded, IdAccount, From, To, FromReversed, ToReversed. In other words, this is what a call document looks like:
{
_id : "LKDJLDKJDLKDJDLKJDLKDJDLKDJLK",
IdAccount: 123,
DateStarted: ISODate('2020-11-05T05:00:00Z'),
DateEnded: ISODate('2020-11-05T05:20:00Z'),
From: "1234567890",
FromReversed: "0987654321",
To: "1231231234",
ToReversed: "4321321321"
}
On our website we want to give customers the option to search for their calls with custom criteria. When they search for calls they must specify DateStarted and DateEnded; those fields are required, the other ones are optional. The IdAccount will be injected on our end so that the customer can only get calls that belong to their account.
Because we have about 5 million records we have created the following indexes
db.calls.ensureIndex({"IdAccount":1});
db.calls.ensureIndex({"DateStarted":1});
db.calls.ensureIndex({"DateEnded":1});
db.calls.ensureIndex({"From":1});
db.calls.ensureIndex({"FromReversed":1});
db.calls.ensureIndex({"To":1});
db.calls.ensureIndex({"ToReversed":1});
The reason why we did not create a compound index is that we want to be able to search by custom criteria. For example, we may want to search for all calls with a date earlier than December 11 and from a specific account.
Because of the indexes all these queries execute very fast:
db.calls.find({'DateStarted' : {'$gte': ISODate('2020-11-05T05:00:00Z')}}).limit(200).explain();
db.calls.find({'DateEnded' : {'$lte': ISODate('2020-11-05T05:00:00Z')}}).limit(200).explain();
db.calls.find({'IdAccount' : 123}).limit(200).explain();
// etc...
Even queries that use regexes execute very fast, but only if the pattern is anchored with ^, meaning it must match from the start of the field:
db.calls.find({ 'From' : /^305/ }).limit(200).explain();
That is the reason why we created the fields FromReversed and ToReversed. If I want to search for a To phone number that ends with 3985, I will execute:
db.calls.find({ 'ToReversed' : /^5893/ }).limit(200).explain(); // note I have to reverse the search pattern too
So the only queries that are slow are the ones that are not anchored at the start, such as this one:
db.calls.find({ 'ToReversed' : /1234/ }).limit(200).explain();
Question
Why is it that when I combine all the conditions the query becomes very slow? For example, this query is very slow:
db.calls.find({
'DateStarted':{'$gte':ISODate('2018-11-05T05:00:00Z')},
'DateEnded':{'$lte':ISODate('2020-11-05T05:00:00Z')},
'IdAccount':123,
'ToReversed' : /^5893/
}).limit(200).explain();
The problem is the 'ToReversed' : /^5893/ condition. If I execute that query by itself it is really fast, even with a pattern that does not reach the limit of 200 results quickly. Should I add a compound index as well, just for the scenario where it is slow?
I need to give our customers the option to search for phone numbers that end with or start with specific digits. The moment I add extra conditions to the query it becomes really slow.
Edit
From researching on the internet, I found that using the hint option makes it faster: it goes from 20 seconds to 5 seconds.
db.calls.find({
'DateStarted':{'$gte':ISODate('2018-11-05T05:00:00Z')},
'DateEnded':{'$lte':ISODate('2020-11-05T05:00:00Z')},
'IdAccount':123,
'ToReversed' : /^5893/
}).hint({'ToReversed':1}).limit(200).explain();
This is still slow; it would be great if I could get it down to about 1 second, given that the simple queries take milliseconds.
For the find query you showed, which filters on 4 fields, the optimal index would ideally cover all 4 fields:
db.calls.createIndex( {
"DateStarted": 1,
"DateEnded": 1,
"IdAccount": 1,
"ToReversed": 1
} )
As to which columns should appear first, you should generally place the most restrictive columns first. Check the cardinality of your data to determine this.
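As a sketch, creating that index and running the combined query from pymongo might look like this (connection string and database name are placeholders). Note that MongoDB's usual guideline is to put equality fields before range fields, so it may also be worth testing an order such as IdAccount, ToReversed, DateStarted, DateEnded and comparing the explain() output:
import datetime
import pymongo

client = pymongo.MongoClient("mongodb://localhost:27017")
calls = client.mydb.calls

# Compound index covering all four filtered fields, as suggested above.
calls.create_index([
    ("DateStarted", pymongo.ASCENDING),
    ("DateEnded", pymongo.ASCENDING),
    ("IdAccount", pymongo.ASCENDING),
    ("ToReversed", pymongo.ASCENDING),
])

# The combined filter should now be answerable from a single index.
cursor = calls.find({
    "DateStarted": {"$gte": datetime.datetime(2018, 11, 5, 5, 0, 0)},
    "DateEnded": {"$lte": datetime.datetime(2020, 11, 5, 5, 0, 0)},
    "IdAccount": 123,
    "ToReversed": {"$regex": "^5893"},
}).limit(200)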

Postgres full text search against arbitrary data - possible or not?

I was hoping to get some advice or guidance around a problem I'm having.
We currently store event data in Postgres 12 (AWS RDS) - this data could contain anything. To reduce the amount of data (a lot of keys, for example, are common across all events) we flatten it and store it across 3 tables -
event_keys - the key names from events
event_values - the values from events
event_key_values - a lookup table, containing the event_id, and key_id and value_id.
We first insert the key and value (or return the existing ids), and finally store the ids in the event_key_values table. So 2 simple events such as
[
  {
    "event_id": 1,
    "name": "updates",
    "meta": {
      "id": 1,
      "value": "some random value"
    }
  },
  {
    "event_id": 2,
    "meta": {
      "id": 2,
      "value": "some random value"
    }
  }
]
would become
event_keys
id key
1 name
2 meta.id
3 meta.value
event_values
id value
1 updates
2 1
3 some random value
4 2
event_key_values
event_id key_id value_id
1 1 1
1 2 2
1 3 3
2 2 4
2 3 3
All values are converted to text before storing, and a GIN index has been added to the event_keys and event_values tables.
When attempting to search this data, we are able to retrieve results, however once we hit 1 million or more rows (we are expecting billions!) it can take anywhere from 10 seconds to minutes to find data. The key-values can have multiple search operations applied to them - equality, contains (case-sensitive and case-insensitive) and regex. To complicate things a bit more, the user can also search against all events, or a filtered selection (so only search against the last 10 days, events belonging to a certain application, etc).
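To make that concrete, the searches are roughly of the following shape (a simplified sketch only - the real SQL is generated by Jooq, and the key name and 'contains' predicate here are just illustrative):
import psycopg2

conn = psycopg2.connect("dbname=events")  # connection string is a placeholder

with conn.cursor() as cur:
    # Find events whose value for a given key contains a search term
    # (the equality and regex searches follow the same join shape).
    cur.execute(
        """
        SELECT ekv.event_id
        FROM event_key_values ekv
        JOIN event_keys ek   ON ek.id = ekv.key_id
        JOIN event_values ev ON ev.id = ekv.value_id
        WHERE ek.key = %s
          AND ev.value ILIKE %s
        """,
        ("meta.value", "%random%"),
    )
    event_ids = [row[0] for row in cur.fetchall()]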
Some things I have noticed from testing
searching with multiple WHERE conditions on the same key, e.g. meta.id, uses the GIN index; however, a WHERE condition spanning multiple keys does not hit the index.
searching with multiple WHERE conditions on both the event_keys and event_values tables does not hit the GIN index.
using 'raw' SQL - we use Jooq in this project, and this was to rule out any issues caused by its SQL generation.
I have tried a few things
denormalising the data and storing everything in one table - however this resulted in the database (200 GB disk) becoming filled within a few hours, with the index taking up more space than the data.
storing the key-values as a JSONB value against an event_id, the JSON blob containing the flattened key-value pairs as a map - this had the same issues as above, with the index taking up 1.5 times the space as the data.
building a document from the available key-values by concatenation, using both a sub-query and a CTE - from testing with a few million rows this takes forever, even when attempting to tune parameters such as work_mem!
From reading solutions and examples here, it seems full text search provides the most benefit and performance when applied against known columns, e.g. a table with first_name and last_name and a GIN index across those two columns, but I hope I am wrong. I don't believe the JOINs across tables are the issue, or that event_values needing to be stored in TOAST storage due to its size is the issue (I have tried with truncated test values, all of the same length, 128 chars, and the results still take 60+ seconds).
From running EXPLAIN ANALYSE it appears no matter how I tweak the queries or tables, most of the time is spent searching the tables sequentially.
Am I simply spending time trying to make Postgres full text search fit a problem it may never work for (or at least never perform acceptably on)? Or should I look at other solutions? One possible advantage of the data is that it is 'immutable' and never updated once persisted, so syncing it to something like Elasticsearch and running search queries against that first might be a solution.
I would really like to use Postgres for this as I've seen it is possible, and read several articles where fantastic performance has been achieved - but maybe this data just isn't suited?
Edit:
Due to the size of the values (some of these could be large JSON blobs of several hundred KB), the GIN index on event_values is based on the MD5 hash - the index is used for equality checks but, as expected, never for pattern searches. For event_keys the GIN index is against the key_name column. Users can search against key names, values or both, for example "List all event keys beginning with 'meta.hashes'".

Couchbase N1QL Query getting distinct on the basis of particular fields

I have a document structure which looks something like this:
{
...
"groupedFieldKey": "groupedFieldVal",
"otherFieldKey": "otherFieldVal",
"filterFieldKey": "filterFieldVal"
...
}
I am trying to fetch all documents which are unique with respect to groupedFieldKey. I also want to fetch otherFieldKey from ANY of these documents. This otherFieldKey has minor variations from one document to another, but I am comfortable with getting ANY of these values.
SELECT DISTINCT groupedFieldKey, otherFieldKey
FROM bucket
WHERE filterFieldKey = "filterFieldVal";
This query fetches all the documents because of the minor variations.
SELECT groupedFieldKey, maxOtherFieldKey
FROM bucket
WHERE filterFieldKey = "filterFieldVal"
GROUP BY groupedFieldKey
LETTING maxOtherFieldKey = MAX(otherFieldKey);
This query works as expected, but it takes a long time due to the GROUP BY step. As this query is used to show products in the UI, this is not desired behaviour. I have tried applying indexes, but they have not produced fast results.
Actual details of the records:
Number of records = 100,000
Size per record = Approx 10 KB
Time taken to load the first 10 records: 3s
Is there a better way to do this? A way of getting DISTINCT only on particular fields would be good.
EDIT 1:
You can follow this discussion thread in Couchbase forum: https://forums.couchbase.com/t/getting-distinct-on-the-basis-of-a-field-with-other-fields/26458
GROUP BY must materialize all the qualifying documents. You can try a covering index:
CREATE INDEX ix1 ON bucket(filterFieldKey, groupedFieldKey, otherFieldKey);
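A covering index here means every field the query references (filterFieldKey, groupedFieldKey, otherFieldKey) is available in the index itself, so the 10 KB documents never need to be fetched for the grouping. Running the grouped query from the Python SDK might look like the sketch below (the import layout assumes the 3.x Python SDK and may differ between SDK versions; connection details are placeholders):
from couchbase.auth import PasswordAuthenticator
from couchbase.cluster import Cluster, ClusterOptions

cluster = Cluster(
    "couchbase://localhost",
    ClusterOptions(PasswordAuthenticator("user", "password")),
)

# MAX(otherFieldKey) in the projection is equivalent to the LETTING form above.
result = cluster.query(
    """
    SELECT groupedFieldKey, MAX(otherFieldKey) AS maxOtherFieldKey
    FROM bucket
    WHERE filterFieldKey = "filterFieldVal"
    GROUP BY groupedFieldKey
    """
)
for row in result:
    print(row["groupedFieldKey"], row["maxOtherFieldKey"])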

MongoDB Order by two columns

Does someone know how I can sort my documents by two values?
For example:
I have a stats collection with the fields "_id", "title", "link", "stats", "range".
The "stats" field can contain values like ['duration', 'pace', 'distance'],
and the "range" field can contain values like 0-10 km, 20-20 min and so on, depending on the stats value.
I'd like to order by stats and after that by range.
This link
In the link above I've sorted by stats, and now I want to sort by range within each value of stats!
My current code :
guides = yield gen.Task(Guides.objects.find, query={}, limit=20,
sort={'stats': 1})
To order your documents by more than one field, you would need a compound sort statement and ideally a compound index backing it.
For compound indexes in MongoDB see: http://docs.mongodb.org/manual/tutorial/create-a-compound-index/
Your sort would look something like:
guides = yield gen.Task(Guides.objects.find, query={}, limit=20,
sort={'stats': 1, 'range': 1})
You need to use something that is ordered, in this case a list of tuples, like:
sort([("stats", pymongo.ASCENDING), ("range", pymongo.ASCENDING)])
since, of course, Python dicts are not ordered. That should work, I believe.
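Putting the two answers together with plain pymongo (the connection string, database and collection names are placeholders):
import pymongo

client = pymongo.MongoClient("mongodb://localhost:27017")
stats_collection = client.mydb.stats

# Compound index backing the two-field sort, as suggested above.
stats_collection.create_index([("stats", pymongo.ASCENDING), ("range", pymongo.ASCENDING)])

# Sort by stats first, then by range within each stats value.
guides = (
    stats_collection.find({})
    .sort([("stats", pymongo.ASCENDING), ("range", pymongo.ASCENDING)])
    .limit(20)
)
for guide in guides:
    print(guide["title"], guide["stats"], guide["range"])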