I have two tables in PostgreSQL:
urls (table with indexed pages; host is an indexed column; 30 million rows)
hosts (table with information about hosts; host is an indexed column; 1 million rows)
One of the most frequent SELECTs in my application is:
SELECT urls.*
FROM urls
JOIN hosts ON urls.host = hosts.host
WHERE urls.projects_id = ?
AND hosts.is_spam IS NULL
ORDER BY urls.id DESC LIMIT ?
In projects that have more than 100,000 rows in the urls table, the query executes very slowly.
As the tables have grown, the query has been executing slower and slower. I've read a lot about NoSQL databases (like MongoDB) which are designed to handle tables this big, and I'm considering moving my data to MongoDB. Everything would be easy if I didn't have to check the hosts table while selecting data from the urls table. I've heard that MongoDB doesn't support joins, so my question is: how do I solve the above problem? I could put the host information in the urls collection, but the field hosts.is_spam can be updated by the user, and then I would have to update the whole urls collection. I don't know if that is the right solution.
I would be grateful for any advice.
If you don't use joins, relational DBs can also work pretty fast. I think this is a case where you need to denormalize for the sake of performance.
Option 1
Copy the is_spam column to the urls table. When this value changes for a host, update all of its related urls. This is okay if you don't do it too often.
Option 2
I don't know your app, but I assume that the number of spam hosts is relatively small. In that case, you could put their ids into an in-memory store (memcached, Redis, ...), query all the urls, and filter out the spam urls in the app, roughly as sketched below. This way your pagination gets a little broken, but sometimes it's a viable option.
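A minimal sketch of that app-side filter (the set contents and field names are made-up examples, and urls stands for the rows already fetched for the project):
// Spam hosts kept in memory; in practice this would be loaded from Redis/memcached.
const spamHosts = new Set(['spam-a.example', 'spam-b.example']);
// Keep only urls whose host is not flagged as spam.
const visibleUrls = urls.filter((u) => !spamHosts.has(u.host));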
You are correct that the problem is the join, but my guess is that it's just the wrong kind of join. As Frank H. mentioned, PostgreSQL should be able to process this type of query rather handily, depending on how common hosts.is_spam is. You probably want to cluster the urls table on id to optimize the ORDER BY/LIMIT phase. Since you only care about urls.*, you can minimize disk I/O by creating a partial index on hosts.host where is_spam is not null, which makes it easy to get just the short list of hosts to avoid.
Try this:
select urls.*
from urls
left join hosts
on urls.host = hosts.host
and hosts.is_spam is not null
where urls.projects_id = ?
and hosts.host is null
Or this:
select *
from urls
where urls.projects_id = ?
and not exists (
select 1
from hosts
where hosts.host = urls.host
and hosts.is_spam is not null
)
This will allow PostgreSQL to use an anti-join to pull only urls which are not mapped to a known spammy host. The results may be different from your query if there are urls with a null or invalid host.
It is true that MongoDB doesn't support joins. In a case like this, I would structure my urls collection like this
urls : {
name,
some_other_property,
host
}
You can then fetch the host for a particular URL, and check the is_spam field for it in the hosts collection. Note that this will need to be done by the client querying the DB and cannot be done at the DB itself like you would with a JOIN.
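A minimal sketch of that client-side check in the mongo shell (collection and field names follow the question; someUrlId is a placeholder):
// Look up one url document, then check its host against the hosts collection.
var url = db.urls.findOne({ _id: someUrlId });
var host = db.hosts.findOne({ host: url.host });
var isSpam = host !== null && host.is_spam != null; // if true, drop the url in application code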
Similar to the answer by @xbones, but with specific examples.
Putting a host_id field in your urls documents is one way to go. It requires that you first pull a result of url documents, then pull the list of spam hosts, and then filter locally in your client code.
Roughly:
var urls = db.urls.find({projects_id: 'ID'}, {_id: 1, host_id: 1}).toArray();
var spamIds = db.hosts.find({is_spam: 1}, {_id: 1}).toArray()
                      .map(function (h) { return String(h._id); });
// keep only the ids of urls whose host is not a known spam host
var ids_array = urls.filter(function (u) { return spamIds.indexOf(String(u.host_id)) === -1; })
                    .map(function (u) { return u._id; });
urls = db.urls.find({_id: {$in: ids_array}});
Or:
var urls = db.urls.find({projects_id: 'ID'}).toArray();
var spamIds = db.hosts.find({is_spam: 1}, {_id: 1}).toArray()
                      .map(function (h) { return String(h._id); });
// keep only urls whose host is not a known spam host
urls = urls.filter(function (u) { return spamIds.indexOf(String(u.host_id)) === -1; });
The first example assumes the result of the projects_id query could be huge (and your url documents are bigger), so you grab only the smallest amount of data possible, filter locally, and then batch-get the full final url documents.
The second example just gets the full url documents to start with and filters them down locally.
I have a document structure which looks something like this:
{
...
"groupedFieldKey": "groupedFieldVal",
"otherFieldKey": "otherFieldVal",
"filterFieldKey": "filterFieldVal"
...
}
I am trying to fetch all documents which are unique with respect to groupedFieldKey. I also want to fetch otherFieldKey from ANY of these documents. The value of otherFieldKey has minor variations from one document to another, but I am comfortable with getting ANY of these values.
SELECT DISTINCT groupedFieldKey, otherFieldKey
FROM bucket
WHERE filterFieldKey = "filterFieldVal";
This query fetches all the documents because of the minor variations.
SELECT groupedFieldKey, maxOtherFieldKey
FROM bucket
WHERE filterFieldKey = "filterFieldVal"
GROUP BY groupedFieldKey
LETTING maxOtherFieldKey = MAX(otherFieldKey);
This query works as expected, but it takes a long time due to the GROUP BY step. As this query is used to show products in the UI, this is not desirable behaviour. I have tried applying indexes, but that has not given fast results.
Actual details of the records:
Number of records = 100,000
Size per record = Approx 10 KB
Time taken to load the first 10 records: 3s
Is there a better way to do this? A way of getting DISTINCT only on particular fields would be good.
EDIT 1:
You can follow this discussion thread in Couchbase forum: https://forums.couchbase.com/t/getting-distinct-on-the-basis-of-a-field-with-other-fields/26458
GROUP BY must materialize all the documents. You can try a covering index:
CREATE INDEX ix1 ON bucket(filterFieldKey, groupedFieldKey, otherFieldKey);
I'm using Cloudant, with no auth, and CORS enabled.
It works very well; limit and skip work fine.
But I can't find how to search for something.
I'm trying to find a document where cp is 24000, for example with this query:
https://1c54473b-be6e-42d6-b914-d0ecae937981-bluemix.cloudant.com/etablissements/_all_docs?skip=0&limit=10&include_docs=true&q=cp:24000
But, the query doesn't return the right document.
I've also tried
https://1c54473b-be6e-42d6-b914-d0ecae937981-bluemix.cloudant.com/etablissements/_all_docs?skip=0&limit=10&include_docs=true&_search({'cp':24000})
with no luck.
Oh, and by the way, do you know if the jquery.couch.js lib has been discontinued? I can't find it on GitHub, nor on my hard disk even though I'm using Fauxton, and it is not in the directory either.
The /db/_all_docs endpoint hits the primary index of the database where all of the documents in the database can be found in _id order.
If you wish to query the database to get a subset of the data, you have three options:
Cloudant Query - hit the POST /db/_find endpoint passing in a JavaScript object containing the selector which defines the query you wish to perform (like the WHERE clause of a SQL query) e.g. {selector: {cp: 24000}}
MapReduce - create a Map function in a design document that filters the documents you are interested in. This creates a materialized view that can be queried and filtered later, e.g. function(doc){ emit(doc.cp, null); }
Cloudant Search - this uses the Apache Lucene library to generate an index on the fields you specify. You can then query the index: q=cp:24000, which looks similar to the query you are looking to perform.
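The first option, for example, is a single HTTP POST against the database; here is a minimal sketch (the database URL is taken from the question, and the use of Node's built-in fetch is an assumption):
// POST a Cloudant Query selector to the _find endpoint (roughly the WHERE clause of a SQL query).
fetch('https://1c54473b-be6e-42d6-b914-d0ecae937981-bluemix.cloudant.com/etablissements/_find', {
  method: 'POST',
  headers: { 'Content-Type': 'application/json' },
  body: JSON.stringify({ selector: { cp: 24000 }, limit: 10 })
})
  .then((res) => res.json())
  .then((data) => console.log(data.docs)); // the matching documents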
QUERYING MONGODB: RETRIEVE SHOPS BY NAME AND BY LOCATION WITH ONE SINGLE QUERY
Hi folks!
I'm building a "search shops" application using the MEAN stack.
I store shop documents in a MongoDB "location" collection like this:
{
_id: .....
name: ...//shop name
location : //...GEOJson
}
The UI provides users a single input for searching shops. Basically, I would like to perform one single query that retrieves, in the same results array:
All shops near the user (possibly limited to x)
All shops named "like" the input value
On the logical side, I think this is an "$or-like" query.
Based on this answer
Using full text search with geospatial index on Mongodb
assigning two special indexes (2dsphere and full text) to the collection is probably not the right way to achieve this. Anyway, I think this is a different case, because I really don't want to apply sequential filters to the results; I "simply" want to retrieve data matching 2 distinct criteria.
If I do set indexes on my collection, the obvious approach is to perform two distinct queries with two distinct methods ($near for locations and $text for the name), and then merge the results with some server-side logic to remove duplicate documents and sort them in some way useful for the user experience (roughly as sketched below), but I'm still wondering whether there exists a method to achieve this result with one single query.
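A rough sketch of that two-query merge in the mongo shell, assuming a 2dsphere index on location and a text index on name (lng, lat, userInput, the distance, and the limits are placeholders):
// Shops near the user (needs a 2dsphere index on location).
var nearShops = db.location.find({
  location: { $near: { $geometry: { type: 'Point', coordinates: [lng, lat] }, $maxDistance: 5000 } }
}).limit(20).toArray();
// Shops whose name matches the input (needs a text index on name).
var namedShops = db.location.find({ $text: { $search: userInput } }).limit(20).toArray();
// Merge and de-duplicate by _id in application code.
var seen = {};
var results = nearShops.concat(namedShops).filter(function (shop) {
  var key = String(shop._id);
  if (seen[key]) { return false; }
  seen[key] = true;
  return true;
});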
So, the question is: is it possible, or is this kind of approach outside MongoDB's purpose?
Hope this is clear and hope that someone can teach something today!
Thanks
I'm looking at using Postgres as a database to let our clients segment their customers.
The idea is that they can select a bunch of conditions in our front-end admin, and these conditions will get mapped to a SQL query. Right now, I'm thinking the best structure could be something like this:
SELECT DISTINCT id FROM users
WHERE id IN (
-- condition 1
)
AND id IN (
-- condition 2
)
AND id IN (
-- etc
)
Efficiency and query speed are super important to us, and I'm wondering if this is the best way of structuring things. When going through each of the WHERE clauses, will Postgres pass the id values from one to the next?
The ideal scenario would be, for a group of 1m users:
Query 1 filters down to 100k
Query 2 filters down from 100k to 10k
Query 3 filters down from 10k to 5k
As opposed to:
Query 1 filters from 1m to 100k
Query 2 filters down from 1m to 50k
Query 3 filters down from 1m to 80k
The intersection of all the queries is mashed together, down to 5k
Maybe I'm misunderstanding something here, I'd love to get your thoughts!
Thanks!
Postgres uses a query planner to figure out how to most efficiently apply your query. It may reorder things or change how certain query operations (such as joins) are implemented, based on statistical information periodically collected in the background.
To determine how the query planner will structure a given query, stick EXPLAIN in front of it:
EXPLAIN SELECT DISTINCT id FROM users ...;
This will output the query plan for that query. Note that an empty table may get a totally different query plan from a table with (say) 10,000 rows, so be sure to test on real(istic) data.
Database engines are much more sophisticated than that.
The specific order of the conditions should not matter. The engine will take your query as a whole and try to figure out the best way to get the data according to all the conditions you specified, the indexes each table has, the number of records each condition will filter out, etc.
If you want to get an idea of how your query will actually be solved you can ask the engine to "explain" it for you: http://www.postgresql.org/docs/current/static/sql-explain.html
However, please note that there is a lot of technical background on how DB engines actually work in order to understand what that explanation means.
Let's say I have 2 collections, where each document may look like this:
Collection 1:
target:
_id,
comments:
[
{ _id,
message,
full_name
},
...
]
Collection 2:
user:
_id,
full_name,
username
I am paging through comments via $slice; let's say I take the first 25 entries.
For these entries I need the corresponding usernames, which I get from the second collection. What I want is to get the comments sorted by their referenced username. The problem is that I can't store the username in the comments, because usernames may change often, and if one did, I would need to update every target document that contains the old username.
I can only imagine one way to solve this: read out all of the full_names and query them in the user collection. The result would be sortable, but it is not paged, so it takes a lot of resources to do that with large documents.
Is there anything I am missing with this problem?
Thanks in advance
If comments are an embedded array, you will have to do work on the client side to sort the comments array unless you store it in sorted order. Your application requirements for username force you to either read out all of the usernames of the users who commented to do the sort, or to store the username in the comments and have (much) more difficult and expensive updates.
Sorting and pagination don't work unless you can return the documents in sorted order. You should consider a different schema where comments form a separate collection so that you can return them in sorted order and paginate them. Store the username in each comment to facilitate the sort on the MongoDB side. Depending on your application's usage pattern this might work better for you.
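A minimal sketch of that separate-collection schema in the mongo shell (collection and field names are assumptions, and targetId/userId are placeholders):
// One document per comment, denormalizing the username so MongoDB can sort server-side.
db.comments.insertOne({
  target_id: targetId,
  user_id: userId,
  username: 'alice',
  message: 'Nice post!'
});
// A sorted, paginated page of 25 comments for one target.
db.comments.find({ target_id: targetId })
           .sort({ username: 1 })
           .skip(0)
           .limit(25);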
It also seems strange to sort on usernames and expect/allow usernames to change frequently. If you could drop these requirements it'd make your life easier :D