Get all documents from an index - elasticsearch - rest

How can I get all documents from an index in Elasticsearch without specifying the size in the query, like:
GET http://localhost:8090/my_index/_search?size=1000&scroll=1m&pretty=true -d '{"size": 0,"query":{"query_string":{ "match_all" : {}}}}'
Thanks

According to the ES scan/scroll documentation, the size parameter is not simply the total number of results:
The size parameter allows you to configure the maximum number of hits
to be returned with each batch of results. Each call to the scroll API
returns the next batch of results until there are no more results left
to return, ie the hits array is empty.
To retrieve all the results you need to make subsequent calls to the scroll API in the manner described in the aforementioned documentation, or use a ready-made implementation such as the scan helper in the Python client. Here is an example script that dumps the resulting JSON documents to stdout:
import json

import elasticsearch
from elasticsearch.helpers import scan

es = elasticsearch.Elasticsearch('https://localhost:8090')

# scan() drives the scroll API for you and yields every matching document.
es_response = scan(
    es,
    index='my_index',
    doc_type='my_doc_type',
    query={"query": {"match_all": {}}}
)

for item in es_response:
    print(json.dumps(item))

As per the latest documentation, you will have to use the search_after parameter to retrieve more than 10,000 records from an index. Take a look here: https://www.elastic.co/guide/en/elasticsearch/reference/current/paginate-search-results.html#search-after
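
For completeness, here is a rough sketch of search_after paging with the Python client. This is only illustrative: it assumes the elasticsearch-py 7.x search(body=...) call style and a hypothetical unique, sortable field named "id" to act as the tie-breaker (neither is part of the original question).

from elasticsearch import Elasticsearch
import json

es = Elasticsearch('https://localhost:8090')

search_after = None
while True:
    body = {
        "size": 1000,
        "query": {"match_all": {}},
        "sort": [{"id": "asc"}]   # assumed unique tie-breaker field
    }
    if search_after is not None:
        body["search_after"] = search_after
    page = es.search(index='my_index', body=body)
    hits = page['hits']['hits']
    if not hits:
        break
    for hit in hits:
        print(json.dumps(hit['_source']))
    # The sort values of the last hit seed the next request.
    search_after = hits[-1]['sort']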

Related

Is it required to specify Sort for skip limit in Mongodb

I am using Skip & Limit with the MongoDB C# driver to fetch tickets batch-wise, like below:
var data = db.collectionName.Find({}).Skip(1000).Limit(500).ToList()
Data fetching is happening as expected. I need confirmation on whether Sort() is mandatory for the Skip & Limit methods, like below, or whether the sort will be handled by MongoDB if it is not specified:
var data = db.collectionName.Find({}).Sort("{_id:1}").Skip(1000).Limit(500).ToList()
I have removed Sort from the query to improve the time taken to complete the fetch operation.
No, Sort() is not mandatory for the Skip() and Limit() methods. You can use them directly as in your query:
var data = db.collectionName.Find({}).Skip(1000).Limit(500).ToList()
To learn more about the default sort order, refer to the link below:
https://stackoverflow.com/questions/11599069/how-does-mongodb-sort-records-when-no-sort-order-is-specified#:~:text=When%20we%20run%20a%20Mongo,objects%20in%20forward%20natural%20order.
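
For illustration, a minimal pymongo sketch of the same paging (the database name, collection name, and connection string are placeholders, not from the question). Without an explicit sort the server returns documents in natural order, so adding a sort on _id, as in your second query, is what makes the batch boundaries deterministic:

from pymongo import ASCENDING, MongoClient

client = MongoClient('mongodb://localhost:27017')
coll = client['mydb']['tickets']

# Skip/Limit works with no Sort at all:
batch = list(coll.find({}).skip(1000).limit(500))

# An explicit sort on _id makes the paging order deterministic:
batch_sorted = list(coll.find({}).sort('_id', ASCENDING).skip(1000).limit(500))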

Couchbase N1QL Query getting distinct on the basis of particular fields

I have a document structure which looks something like this:
{
    ...
    "groupedFieldKey": "groupedFieldVal",
    "otherFieldKey": "otherFieldVal",
    "filterFieldKey": "filterFieldVal"
    ...
}
I am trying to fetch all documents which are unique with respect to groupedFieldKey. I also want to fetch otherFieldKey from ANY of these documents. This otherFieldKey has minor changes from one document to another, but I am comfortable with getting ANY of these values.
SELECT DISTINCT groupedFieldKey, otherFieldKey
FROM bucket
WHERE filterFieldKey = "filterFieldVal";
This query returns almost all the documents because of the minor variations in otherFieldKey.
SELECT groupedFieldKey, maxOtherFieldKey
FROM bucket
WHERE filterFieldKey = "filterFieldVal"
GROUP BY groupedFieldKey
LETTING maxOtherFieldKey = MAX(otherFieldKey);
This query works as expected, but it is taking a long time due to the GROUP BY step. As this query is used to show products in the UI, this is not desired behaviour. I have tried applying indexes, but they have not given fast results.
Actual details of the records:
Number of records = 100,000
Size per record = Approx 10 KB
Time taken to load the first 10 records: 3s
Is there a better way to do this? A way of getting DISTINCT only on particular fields would be good.
EDIT 1:
You can follow this discussion thread in Couchbase forum: https://forums.couchbase.com/t/getting-distinct-on-the-basis-of-a-field-with-other-fields/26458
GROUP BY must materialize all the qualifying documents. You can try a covering index:
CREATE INDEX ix1 ON bucket(filterFieldKey, groupedFieldKey, otherFieldKey);
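
If it helps, here is a rough sketch of running the grouped query from the Python SDK once the covering index is in place. This assumes the Couchbase Python SDK 4.x (import paths differ between SDK versions), and the connection string and credentials are placeholders:

from couchbase.auth import PasswordAuthenticator
from couchbase.cluster import Cluster
from couchbase.options import ClusterOptions

# Placeholder connection details.
cluster = Cluster('couchbase://localhost',
                  ClusterOptions(PasswordAuthenticator('user', 'password')))

result = cluster.query(
    'SELECT groupedFieldKey, MAX(otherFieldKey) AS maxOtherFieldKey '
    'FROM bucket '
    'WHERE filterFieldKey = "filterFieldVal" '
    'GROUP BY groupedFieldKey'
)
for row in result.rows():
    print(row)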

How to skip elements before certain document

Background:
I am writing a mobile application which has a lazy-loading page. My backend uses Go and MongoDB with the mongo-go driver. There are 10 elements on that page and I want to get the next ten when I scroll to the bottom. I am planning to send the ObjectID (_id) as a request query parameter and get the next ten elements starting right after that document.
I have written what I want in mongo shell syntax so that more people can understand it and help with the shell syntax.
Is there a way to get the index of a document by its _id, or maybe a way to skip everything up to it with skip()?
Something like db.collection.find().skip(idOfDocument + 1).limit(10)
I found the answer here.
nextDocuments = db.collection.find({'_id': {$gt: last_id}}).limit(10)
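
The same idea in pymongo, for illustration only (the question's backend uses the Go driver; the database, collection name, and connection string below are placeholders). Sorting on _id keeps the page order stable across requests:

from bson import ObjectId
from pymongo import ASCENDING, MongoClient

client = MongoClient('mongodb://localhost:27017')
coll = client['mydb']['items']

def next_page(last_id=None, page_size=10):
    # Fetch the next page_size documents after the given _id.
    query = {} if last_id is None else {'_id': {'$gt': ObjectId(last_id)}}
    return list(coll.find(query).sort('_id', ASCENDING).limit(page_size))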

How can you measure the space that a set of documents takes up (in bytes) in mongo db?

What I would like to do is figure out how much space in bytes a certain set of documents takes up. E.g. something like:
collection.stuff.stats({owner: someOwner}, {sizeInBytes: 1})
Where the first parameter is a query, and the second is like a projection of the statistics you want calculated.
I read that there's a bsonsize function you can use to measure the size of a single document. I'm wondering if maybe I could use that along with the aggregation methods to calculate the size of a search. But if I was going to do that, I'd want to know how bsonsize works. How does it work? Is it expensive to run?
Are there other options for measuring the size of data in mongo?
One perhaps "quick and dirty" way to find this would be to assign your results to a cursor, then insert that result into a new collection and call db.collection.stats on it. It would look like this in the shell:
var myCursor = db.collection.find({key:value});
while(myCursor.hasNext()) {
db.resultColl.insert(myCursor.next())
}
db.resultColl.stats();
Which should return the information on the subset of documents
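
As for the aggregation idea raised in the question: here is a sketch using the $bsonSize operator, assuming MongoDB 4.4 or later where it is available (earlier versions would need the cursor approach above). The database, collection name, and owner filter are placeholders:

from pymongo import MongoClient

client = MongoClient('mongodb://localhost:27017')
coll = client['mydb']['stuff']

pipeline = [
    {'$match': {'owner': 'someOwner'}},
    # Sum the BSON size of every matching document.
    {'$group': {'_id': None, 'sizeInBytes': {'$sum': {'$bsonSize': '$$ROOT'}}}},
]
print(list(coll.aggregate(pipeline)))
# e.g. [{'_id': None, 'sizeInBytes': 123456}]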

What is the recommended max number of emits in a map function?

I am new to MongoDB and planning to use map-reduce to compute a large amount of data.
As you know, the map function matches the criteria and then emits the required data for a given field. In my map function I have multiple emits. As of now I have 50 fields emitted from a single document. That means a single document in the collection explodes into 40 documents in the temp collection. So if I have 1 million documents to process, there will be 1 million * 40 documents in the temp collection by the end of the map phase.
The next step is to sort this collection. (I haven't used the sort param of map; will it help?)
I thought of splitting the map function into two, but then there is one more problem: while performing the map function, if by chance I run into an exception I want to skip the entire document (i.e. not emit any data from that document), but if I split the function I won't be able to.
On mongodb.org I found a comment which said: "When I run the MR job with sort, it takes 1.5 days to reach 23% at the first stage of MR. When I run the MR job without sort, it takes about 24-36 hours for the whole job. Also, turning off jsMode sped up my MR twice (before I turned off sorting)."
Will enabling sort help? Or will turning off jsMode help? I am using Mongo 2.0.5.
Any suggestions?
Thanks in advance, G
The next step is to sort this collection. (I haven't used the sort param of map; will it help?)
I don't know what you mean; MRs don't have sort params, only the incoming query has a sort param, and that only sorts the data going in. Unless you are looking for some specific behaviour that avoids sorting the final output by using an incoming sort, you don't normally need to sort.
How are you looking to use this MR? Obviously it won't be in real time, or you would just kill your servers, so I'm guessing it is a background process that runs and formats data the way you want. I would suggest looking into incremental MRs so that you do delta updates throughout the day to limit the amount of resources used at any given time; a rough sketch follows below.
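
A rough sketch of such an incremental pass with pymongo, purely for illustration: the collection names, the ts field, the cut-off, and the map/reduce bodies are placeholders. Using out: {reduce: ...} re-reduces the new deltas into the previous output collection.

from datetime import datetime, timedelta

from bson.code import Code
from pymongo import MongoClient

client = MongoClient('mongodb://localhost:27017')
db = client['mydb']

mapper = Code("function () { emit(this.someField, 1); }")
reducer = Code("function (key, values) { return Array.sum(values); }")

# Only process documents newer than the last run (placeholder cut-off).
last_run = datetime.utcnow() - timedelta(hours=1)

db.command(
    'mapReduce', 'events',
    map=mapper,
    reduce=reducer,
    out={'reduce': 'events_mr'},      # merge deltas into the existing output
    query={'ts': {'$gt': last_run}},  # delta filter so each run stays small
)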
So if I have 1 million documents to process, there will be 1 million * 40 documents in the temp collection by the end of the map phase.
Are you emitting multiple times? If not, then the temp collection should have only one document per key, of the format:
{
    _id: emitted_id,
    value: [ /* each of the docs that you emit */ ]
}
This is shown here: http://kylebanker.com/blog/2009/12/mongodb-map-reduce-basics/
Or will turning off jsMode help? I am using Mongo 2.0.5.
Turning off jsMode is unlikely to do anything significant, and results from it have varied.