A better way to delete the data in a MongoDB collection while keeping the indexes

I was working on my unit tests, where the database should be reset for each case.
Since the indexes should be kept but the data cleared, what would be the faster way to do that reset?

I am using pymongo; I am not sure whether the driver heavily impacts performance.
There are two ways that come to my mind:
Simply execute delete_many()
Drop the whole collection and recreate the indexes
I found that:
for collections containing < 25K entries, the first way is faster
for collections containing ~ 25K entries, the two ways perform similarly
for collections containing > 25K entries, the second way is faster
Below is the script I used to test this.
import time

import pymongo
from bson import ObjectId

client = pymongo.MongoClient("mongodb://192.168.50.33:27017/?readPreference=primary&ssl=false")

def test_drop_recreate(col):
    col.drop()
    col.create_index([("mike", pymongo.DESCENDING)])

def test_delete_many(col):
    col.delete_many({})

def main():
    db = client.get_database("_tests")
    STEP = 1000
    to_insert = []
    for count in range(0, STEP * 101, STEP):
        # Get the MongoDB collection
        col = db.get_collection("a")

        # Insert dummy data
        to_insert.extend([{"_id": ObjectId(), "mike": "ABC"} for _ in range(STEP)])
        if to_insert:
            col.insert_many(to_insert)

        # Record the execution time of dropping the collection and recreating its indexes
        _start = time.time()
        test_drop_recreate(col)
        ms_drop = time.time() - _start

        # Insert dummy data
        if to_insert:
            col.insert_many(to_insert)

        # Record the execution time of simply executing `delete_many()`
        _start = time.time()
        test_delete_many(col)
        ms_del = time.time() - _start

        if ms_drop > ms_del:
            print(f"{count},-{(ms_drop / ms_del) - 1:.2%}")
        else:
            print(f"{count},+{(ms_del / ms_drop) - 1:.2%}")

if __name__ == '__main__':
    main()
After I ran this script a few times, I generated a graph from the output to visualize the result.
(Above 0) means deletion takes longer
(Below 0) means dropping and recreating takes longer
The value represents the additional time consumed.
For example: +20 means deletion takes 20 times longer than drop & recreate.
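For reference, a small plotting sketch (not part of the original post; the file name is a placeholder) that turns the script's printed "count,+X%" / "count,-X%" lines into such a graph:
# Hypothetical helper to plot the script's CSV-like output.
import matplotlib.pyplot as plt

counts, ratios = [], []
with open("timing_output.csv") as fh:              # placeholder: the redirected script output
    for line in fh:
        count, pct = line.strip().split(",")
        counts.append(int(count))
        ratios.append(float(pct.rstrip("%")))      # keeps the +/- sign

plt.axhline(0, color="grey", linewidth=0.5)
plt.plot(counts, ratios)
plt.xlabel("documents in the collection")
plt.ylabel("extra time of the slower method (%)")
plt.show()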

I tried to "save" all the existing indexes and recreate them after the collection drop, but there are tons of edge cases and hidden parameters to the index object. Here is my shot:
def convert_key_to_list_index(son_index):
    """
    Convert an index SON object to a list that can be used to create the same index.

    :param SON son_index: One entry of the output of "list_indexes()", e.g.:
        SON([
            ('v', 2),
            ('key', SON([('timestamp', 1.0)])),
            ('name', 'timestamp_1'),
            ('expireAfterSeconds', 3600.0)
        ])
    :return list: List of (field, direction) tuples to pass to the pymongo function create_index
    """
    key_index_list = []
    index_key_definitions = son_index["key"]
    for field, direction in index_key_definitions.items():
        item = (field, int(direction))
        key_index_list.append(item)
    return key_index_list

def drop_and_recreate_indexes(db, collection_name):
    """
    Use list_indexes() and not index_information() to get the "key" as a SON object instead of a regular dict,
    because the SON object is an ordered dict, and the order of the index keys matters for the
    recreation step.
    """
    son_indexes = list(db[collection_name].list_indexes())
    db[collection_name].drop()
    # Re-create the collection indexes
    for son_index in son_indexes:
        # Skip the default index for the field "_id".
        # Notice: for a collection without any index except the default "_id_" index,
        # the collection will not be recreated (recreating an index recreates the collection).
        if son_index['name'] == '_id_':
            continue
        recreate_index(db, collection_name, son_index)

def recreate_index(db, collection_name, son_index):
    """
    Take a SON object that represents an index in the db, and create the same index in the db.

    :param pymongo.database.Database db:
    :param str collection_name:
    :param SON son_index: An object returned from the call to pymongo.collection.Collection.list_indexes()

    # Notice: currently only these fields are supported: ["name", "unique", "expireAfterSeconds"]
    # For more fields/options see: https://pymongo.readthedocs.io/en/stable/api/pymongo/collection.html
    # ?highlight=create_index#pymongo.collection.Collection.create_index
    :return:
    """
    list_index = convert_key_to_list_index(son_index)
    # This dict collects the optional parameters to pass to the "create_index" function.
    create_index_kwargs = {}
    # "name" will always be part of the SON object that represents the index.
    create_index_kwargs["name"] = son_index.get("name")
    # "unique" may or may not exist in the SON object
    unique = son_index.get("unique")
    if unique:
        create_index_kwargs["unique"] = unique
    # "expireAfterSeconds" may or may not exist in the SON object
    expire_after_seconds = son_index.get("expireAfterSeconds", None)
    if expire_after_seconds:
        create_index_kwargs["expireAfterSeconds"] = expire_after_seconds
    db[collection_name].create_index(list_index, **create_index_kwargs)
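A hypothetical usage sketch (connection details and names are placeholders), e.g. in a unittest setUp:
# Hypothetical usage of drop_and_recreate_indexes() for a test reset.
import pymongo

client = pymongo.MongoClient("mongodb://localhost:27017")   # placeholder URI
db = client.get_database("_tests")

drop_and_recreate_indexes(db, "a")
print(list(db["a"].list_indexes()))   # the non-_id indexes are back, the data is gone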

Related

How does Scala Slick determine which rows to update in this query

I was asked how Scala Slick determines which rows need to be updated, given this code:
def updateFromLegacy(criteria: CertificateGenerationState,
                     fieldA: CertificateGenerationState,
                     fieldB: Option[CertificateNotification]) = {
  val a: Query[CertificateStatuses, CertificateStatus, Seq] =
    CertificateStatuses.table.filter(status => status.certificateState === criteria)
  val b: Query[(Column[CertificateGenerationState], Column[Option[CertificateNotification]]),
               (CertificateGenerationState, Option[CertificateNotification]),
               Seq] =
    a.map(statusToUpdate => (statusToUpdate.certificateState, statusToUpdate.notification))
  val c: (CertificateGenerationState, Option[CertificateNotification]) = (fieldA, fieldB)
  b.update(c)
}
The above code is (as I see it):
a) looking for all rows that have "criteria" for "certificateState"
b) a query for said columns is created
c) a tuple with the values I want to update to is created
then the query is used to find the rows where the tuple needs to be applied.
Background
I wonder where Slick keeps track of the IDs of the rows to update.
What I would like to find out
What is happening behind the covers?
What is Seq in "val a: Query[CertificateStatuses, CertificateStatus, Seq]"
Can someone maybe point out the Slick source where the moving parts are located?
OK - I reformatted your code a little bit to make it easier to read here and divided it into chunks. Let's go through it one by one:
val a: Query[CertificateStatuses, CertificateStatus, Seq] =
  CertificateStatuses.table
    .filter(status => status.certificateState === criteria)
Above is a query that translates roughly to something along these lines:
SELECT * // Slick would list all your columns here, but it's essentially the same thing
FROM certificate_statuses
WHERE certificate_state = $criteria
Below, this query is mapped; that is, an SQL projection is applied to it:
val b: Query[
  (Column[CertificateGenerationState], Column[Option[CertificateNotification]]),
  (CertificateGenerationState, Option[CertificateNotification]),
  Seq] = a.map(statusToUpdate =>
    (statusToUpdate.certificateState, statusToUpdate.notification))
So instead of * you will have this:
SELECT certificate_state, notification
FROM certificate_statuses
WHERE certificate_state = $criteria
And the last part reuses this constructed query to perform the update:
val c: (CertificateGenerationState, Option[CertificateNotification]) =
  (fieldA, fieldB)

b.update(c)
Translates to:
UPDATE certificate_statuses
SET certificate_state = $fieldA, notification = $fieldB
WHERE certificate_state = $criteria
I understand that the last step may be a little bit less straightforward than the others, but that's essentially how you do updates with Slick (here - although it's in the monadic version).
As for your questions:
What is happening behind the covers?
This is actually outside my area of expertise. That being said, it's a relatively straightforward piece of code, and I guess the update transformation may be of some interest. I provided a link to the relevant piece of the Slick sources at the end of this answer.
What is Seq in "val a:Query[CertificateStatuses, CertificateStatus, Seq]"
It's the collection type. Query specifies 3 type parameters:
mixed type - the Slick representation of the table (or column - Rep)
unpacked type - the type you get after executing the query
collection type - the collection type in which the above unpacked types are placed for you as the result of the query.
So, as an example:
CertificateStatuses - this is your Slick table definition
CertificateStatus - this is your case class
Seq - this is how your results would be retrieved (it would basically be Seq[CertificateStatus])
I have it explained here: http://slides.com/pdolega/slick-101#/47 (and 3 next slides or so)
Can someone maybe point out the Slick source where the moving parts are located?
I think this part may be of interest - it shows how a query is converted into an update statement: https://github.com/slick/slick/blob/51e14f2756ed29b8c92a24b0ae24f2acd0b85c6f/slick/src/main/scala/slick/jdbc/JdbcActionComponent.scala#L320
It may also be worth emphasizing this:
I wonder where Slick keeps track of the IDs of the rows to update.
It doesn't. Look at the generated SQL. You can see it by adding the following configuration to your logging (but you also have it in this answer):
<logger name="slick.jdbc.JdbcBackend.statement" level="DEBUG" />
(I assumed logback above).

How to get from n to n items in mongodb

I'm trying to create an Android app which pulls the first 1-10 documents in the MongoDB collection and shows those items in a list; later, when the list reaches the end, I want to pull documents 11-20 of the collection, and so on.
def get_all_tips(from_item, max_items):
    db = client.MongoTip
    tips_list = db.tips.find().sort([['_id', -1]]).limit(max_items).skip(from_item)
    if tips_list.count() > 0:
        from bson import json_util
        return json_util.dumps(tips_list, default=json_util.default), response_ok
    else:
        return "Please move along nothing to see here", response_invalid
But the above code does not work the way I intended; instead it returns max_items documents starting from from_item.
Example: calling get_all_tips(3,4)
It returns:
Document 3, Document 4, Document 5, Document 6
I'm expecting:
Document 3, Document 4
In your code you are specifying two parameters.
from_item: which is the starting index of the documents to return
max_items: number of items to return
Therefore, calling get_all_tips(3,4) will return 4 documents starting from document 3, which is exactly what's happening.
Proposed fixes:
If you want it to return documents 3 and 4 call get_all_tips(3,2) instead, which means return a maximum of two documents starting from 3.
If you'd rather like to specify the start and end indexes in your function, I recommend the following changes:
def get_all_tips(from_item, to_item):
    if to_item < from_item:
        return "to_item must be greater than from item", bad_request
    db = client.MongoTip
    tips_list = db.tips.find().sort([['_id', -1]]).limit(to_item - from_item).skip(from_item)
That being said, I'd like to point out that the MongoDB documentation does not recommend using skip for pagination of large collections.
MongoDB 3.2: cursor.skip
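For larger collections, a common skip-free alternative is keyset pagination on _id. A hedged sketch (not from the original answer), reusing the question's client and response_ok names:
# Hedged sketch of _id-based ("keyset") pagination instead of skip().
from bson import json_util

def get_tips_page(last_id=None, page_size=10):
    db = client.MongoTip
    # Only fetch documents older than the last _id the client has already seen
    query = {"_id": {"$lt": last_id}} if last_id is not None else {}
    tips = list(db.tips.find(query).sort([("_id", -1)]).limit(page_size))
    return json_util.dumps(tips, default=json_util.default), response_ok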

(Spark/Scala) What would be the most effective way to compare specific data in one RDD to a line of another?

Basically, I have two sets of data in two text files. One set of data is in the format:
a,DataString1,DataString2 (One line) (The first character is in every entry but not relevant)
.... (and so on)
The second set of data is in format:
Data, Data Data Data, Data Data, Data, Data Data Data (One line)(separated by either commas or spaces, but I'm able to use a regular expression to handle this, so that's not the problem)
.... (And so on)
So what I need to do is check if DataString1 AND DataString2 are both present on any single line of the second set of data.
Currently I'm doing this like so:
// spark context is defined above; java.util.regex.Pattern is imported above as well
case class test(data_one: String, data_two: String)
// the case class is used just to organize data_one in a simpler form to work with

val data_one = sc.textFile("path")
val data_two = sc.textFile("path")

val rdd_one = data_one.map(_.split(",")).map(c => test(c(1), c(2)))
val rdd_two = data_two.map(_.split("[,\\s*]"))

val data_two_array = rdd_two.collect()
// this causes data_two_array to be an array of arrays of strings.

rdd_one.foreach { line =>
  for (array <- data_two_array) {
    for (string <- array) {
      // comparison logic here that checks whether both dataString1 and dataString2
      // happen to be on the same line
    }
  }
}
How could I make this process more efficient? At the moment it does work correctly, but as data sizes grow this becomes very inefficient.
The double for loop scans all elements, at a cost of m*n where m and n are the sizes of the two sets. You can start with a join to eliminate rows; since you have 2 columns to verify, make sure the join takes care of both. A rough sketch of that idea follows.
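The join idea could look roughly like the following PySpark sketch. This is a translation of the approach rather than the asker's Scala; paths and names are placeholders, and it assumes the two strings of a pair are distinct:
# Hedged PySpark sketch of the join-based approach.
import re
from pyspark import SparkContext

sc = SparkContext(appName="pair-match-sketch")

# First data set: "a,DataString1,DataString2" -> (token, (s1, s2)) for each of the two strings
pair_tokens = (sc.textFile("path/to/data_one")                       # placeholder path
                 .map(lambda line: line.split(","))
                 .flatMap(lambda c: [(c[1], (c[1], c[2])), (c[2], (c[1], c[2]))])
                 .distinct())

# Second data set: tokenize each line and tag it with its line index -> (token, line_id)
line_tokens = (sc.textFile("path/to/data_two")                       # placeholder path
                 .zipWithIndex()
                 .flatMap(lambda li: [(tok, li[1])
                                      for tok in set(re.split(r"[,\s]+", li[0])) if tok]))

# Join on the token, then keep pairs whose two strings both hit the same line.
# (Assumes DataString1 != DataString2; identical strings would need a separate check.)
matches = (pair_tokens.join(line_tokens)                 # (token, ((s1, s2), line_id))
           .map(lambda kv: ((kv[1][0], kv[1][1]), 1))    # ((pair, line_id), 1)
           .reduceByKey(lambda a, b: a + b)
           .filter(lambda kv: kv[1] >= 2)                # both strings matched that line
           .map(lambda kv: kv[0][0])                     # recover the (s1, s2) pair
           .distinct())

print(matches.take(10))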

Motor Index not created on empty Collection

I have the following code to setup my database:
self.con = motor.MotorClient(host, port)
self.Db = self.con.DB
self.Col = self.Db.Col
self.Col.create_index("c")
self.Col.create_index("h")
When I run index_information() I only see index information for the _id field. However, if I move the create_index() calls to after some entries are inserted, index_information() shows the new indexes. Does this mean I have to wait until I have entries in the collection before I can create indexes? Is there another way to do this, since I start with an empty collection?
You can create an index on an empty, or non-existent, MongoDB collection, and the index appears in index_information:
>>> from tornado import ioloop, gen
>>> import motor
>>>
>>> con = motor.MotorClient()
>>> db = con.test
>>> col = db.collection
>>>
>>>
>>> @gen.coroutine
... def coro():
... yield db.drop_collection("collection")
... yield col.create_index("c")
... yield col.create_index("h")
... print((yield col.index_information()))
...
>>> ioloop.IOLoop.current().run_sync(coro)
{u'c_1': {u'key': [(u'c', 1)], u'v': 1}, u'_id_': {u'key': [(u'_id', 1)], u'v': 1}, u'h_1': {u'key': [(u'h', 1)], u'v': 1}}
Since I don't see any "yield" statements in your example code, or any callbacks, I suspect you're not using Motor correctly. Motor is asynchronous; in order to wait for any Motor method that talks to the database server to complete you must either pass a callback to the method, or yield the Future the method returns.
For more information consult the tutorial:
http://motor.readthedocs.org/en/stable/tutorial.html#inserting-a-document
The discussion of calling asynchronous methods with Motor (and this applies to all Tornado libraries, not just Motor) begins at the "inserting a document" section.
You can very easily create an index in MongoDB (even on an empty collection) using
a field name and a direction.
field_name: can be any field on which you want to create the index.
direction: can be any one of these values: 1, -1, "2dsphere", "text" or "hashed"
Refer to the MotorCollection documentation for details.
The code below creates indexes using the Motor library and Python.
db.collection_name.create_index([("field_name", 1)])  # To create an ascending index
db.collection_name.create_index([("geoloc_field_name", "2dsphere")])  # To create a geo index
db.collection_name.create_index([("field_name", "text")])  # To create a text-based index
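As a further illustration of the first answer's point (not from the original posts), here is a minimal sketch using Motor's asyncio API, where each call is awaited; the URI and names are placeholders:
# Hedged sketch with Motor's asyncio API; the key point is awaiting create_index().
import asyncio
from motor.motor_asyncio import AsyncIOMotorClient

async def setup_indexes():
    client = AsyncIOMotorClient("mongodb://localhost:27017")  # placeholder URI
    col = client.DB.Col
    # Without awaiting (or yielding) these coroutines, no command reaches the server,
    # which is why the indexes never showed up in index_information().
    await col.create_index("c")
    await col.create_index("h")
    print(await col.index_information())

asyncio.run(setup_indexes())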

PyMongo updating array records with calculated fields via cursor

Basically the collection output of an elaborate aggregate pipeline for a very large dataset is similar to the following:
{
    "_id" : {
        "clienta" : NumberLong(460011766),
        "clientb" : NumberLong(2886729962)
    },
    "states" : [
        [
            "fixed", "fixed.rotated", "fixed.rotated.off"
        ]
    ],
    "VBPP" : [
        244,
        182,
        184,
        11,
        299
    ],
    "PPF" : 72.4
}
The intuitive, albeit slow, way to update these fields to be calculations of their former selves (length and variance of an array) with PyMongo before converting to arrays is as follows:
import numpy as np

records_list = []
cursor = db.clientAgg.find({}, {'_id' : 0,
                                'states' : 1,
                                'VBPP' : 1,
                                'PPF': 1})
for record in cursor:
    records_list.append(record)

for dicts in records_list:
    dicts['states'] = len(dicts['states'])
    dicts['VBPP'] = np.var(dicts['VBPP'])
I have written various forms of this basic flow to optimize for speed, but bringing 500k dictionaries into memory to modify them before converting them to arrays to feed a machine learning estimator is costly. I have tried various ways to update the records directly via a cursor with variants of the following, with no success:
cursor = db.clientAgg.find().skip(0).limit(50000)

def iter():
    for item in cursor:
        yield item

l = []
for x in iter():
    x['VBPP'] = np.var(x['VBPP'])
    # Or
    # db.clientAgg.update({'_id': x['_id']}, {'$set': {'x.VBPS': somefunction as above}}, upsert=False, multi=True)
I also unsuccessfully tried using Mongo's usual operators since the variance is as simple as subtracting the mean from each element of the array, squaring the result, then averaging the results.
If I could successfully modify the collection directly then I could utilize something very fast like Monary or IOPro to load data directly from Mongo and into a numpy array without the additional overhead.
Thank you for your time
MongoDB has no way to update a document with values calculated from the document's fields; currently you can only use update to set values to constants that you pass in from your application. So you can set document.x to 2, but you can't set document.x to document.y + document.z or any other calculated value.
See https://jira.mongodb.org/browse/SERVER-11345 and https://jira.mongodb.org/browse/SERVER-458 for possible future features.
In the immediate future, PyMongo will release a bulk API that allows you to send a batch of distinct update operations in a single network round-trip, which will improve your performance.
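With today's PyMongo, that batch-of-updates idea can be sketched with bulk_write. This is a hedged sketch, not part of the original answer; the URI is a placeholder and the collection name follows the question:
# Hedged sketch: compute the values client-side, push them back in batched bulk writes.
import numpy as np
import pymongo
from pymongo import UpdateOne

client = pymongo.MongoClient("mongodb://localhost:27017")   # placeholder URI
col = client.get_database("test").clientAgg                 # collection name from the question

ops = []
for doc in col.find({}, {"states": 1, "VBPP": 1}):
    ops.append(UpdateOne(
        {"_id": doc["_id"]},
        {"$set": {"states": len(doc["states"]), "VBPP": float(np.var(doc["VBPP"]))}}))
    if len(ops) == 1000:          # send in batches to keep memory bounded
        col.bulk_write(ops, ordered=False)
        ops = []
if ops:
    col.bulk_write(ops, ordered=False)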
Addendum:
I have two other ideas. First, run some Javascript server-side. E.g., to set all documents' b fields to 2 * a:
db.eval(function() {
    var collection = db.test_collection;
    collection.find().forEach(function(doc) {
        var b = 2 * doc.a;
        collection.update({_id: doc._id}, {$set: {b: b}});
    });
});
The second idea is to use the aggregation framework's $out operator, new in MongoDB 2.5.2, to transform the collection into a second collection that includes the calculated field:
db.test_collection.aggregate({
    $project: {
        a: '$a',
        b: {$multiply: [2, '$a']}
    }
}, {
    $out: 'test_collection2'
});
Note that $project must explicitly include all the fields you want; only _id is included by default.
For a million documents on my machine the former approach took 2.5 minutes, and the latter 9 seconds. So you could use the aggregation framework to copy your data from its source to its destination, with the calculated fields included. Then, if desired, drop the original collection and rename the target collection to the source's name.
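If you prefer to stay in Python, the same pipeline can be issued through PyMongo; a hedged sketch (connection details are placeholders, collection names follow the shell example above):
# Hedged PyMongo equivalent of the shell $project/$out pipeline above.
import pymongo

client = pymongo.MongoClient("mongodb://localhost:27017")   # placeholder URI
db = client.test
db.test_collection.aggregate([
    {"$project": {"a": "$a", "b": {"$multiply": [2, "$a"]}}},
    {"$out": "test_collection2"},
])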
My final thought on this is that MongoDB 2.5.3 and later can stream large result sets from an aggregation pipeline using a cursor. There's no reason Monary can't use that capability, so you might file a feature request there. That would allow you to get documents from a collection in the form you want, via Monary, without having to actually store the calculated fields in MongoDB.