How to avoid adding duplicate data in Scrapy using MongoDB? - mongodb

I want to avoid adding duplicate data and instead either 1) update a single field (the number of views) or 2) update all the fields that have changed on the website. To do so, I'm using an ID (origin_id) that I found on the website I'm scraping.
Pipelines
import pymongo
from scrapy.exceptions import DropItem
class MongoDBPipeline(object):
    def __init__(self):
        # `settings` is the project's Scrapy settings object.
        connection = pymongo.MongoClient(
            settings['MONGODB_SERVER'],
            settings['MONGODB_PORT']
        )
        db = connection[settings['MONGODB_DB']]
        self.collection = db[settings['MONGODB_COLLECTION']]
    def process_item(self, item, spider):
        for field in item:
            # Iterating an item yields field names, so check the value.
            if not item[field]:
                raise DropItem("Missing {0}!".format(field))
        # Update item if it is in the database and insert otherwise.
        self.collection.update({'origin_id': item['origin_id']}, dict(item), upsert=True)
        return item
MongoDB record
{
    "_id" : ObjectId("59725e919a1a6b7f0350027a"),
    "origin_id" : "12256699",
    "views" : "556",
    "url" : "...",
    "title" : "..."
}
Please let me know if you want more details ...

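If a document with that origin_id already exists, increment its views field by 1 with $inc. Note that $inc requires views to be stored as a number, not as the string shown in the record above.
The other fields hold non-numeric values, so they can only be overwritten with $set.
Passing upsert=True inserts the document when no match is found, which avoids an extra query to check whether a document with that origin_id already exists in the collection.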
# update_one replaces the deprecated Collection.update.
self.collection.update_one(
    {'origin_id': item['origin_id']},
    {
        '$set': {'url': item['url'], 'title': item['title']},
        '$inc': {'views': 1}
    },
    upsert=True)
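If the goal is instead the second option from the question (overwrite every scraped field, including views, with the latest value), the filter and update documents can be derived from the item itself. The sketch below is only an illustration: build_upsert is a hypothetical helper, not part of Scrapy or PyMongo, and it assumes the item always carries at least one field besides origin_id.

```python
def build_upsert(item, key="origin_id"):
    """Return (filter, update) for update_one(filter, update, upsert=True).

    Hypothetical helper: every field except the key goes into $set.
    """
    doc = dict(item)
    filter_ = {key: doc.pop(key)}
    # $set overwrites only the listed fields. On insert, MongoDB also
    # copies the equality fields of the filter into the new document.
    return filter_, {"$set": doc}


flt, upd = build_upsert({"origin_id": "12256699", "views": "557", "title": "..."})
print(flt)
print(upd)
# In the pipeline this would be used as:
#   self.collection.update_one(flt, upd, upsert=True)
```

Because the update uses $set rather than a replacement document, fields that are stored in MongoDB but absent from the scraped item are left untouched.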

Related

MongoDB updating the wrong subdocument in array

I've recently started using MongoDB using Mongoose (from NodeJS), but now I got stuck updating a subdocument in an array.
Let me show you...
I've set up my Restaurant in MongoDB like so:
_id: ObjectId("5edaaed8d8609c2c47fd6582")
name: "Some name"
tables: Array
0: Object
id: ObjectId("5ee277bab0df345e54614b60")
status: "AVAILABLE"
1: Object
id: ObjectId("5ee277bab0df345e54614b61")
status: "AVAILABLE"
As you can see a restaurant can have multiple tables, obviously.
Now I would like to update the status of a table for which I know the _id. I also know the _id of the restaurant that has the table.
But I only want to update the status if the table with the corresponding tableId has the status 'AVAILABLE'.
My update statement:
const result = await Restaurant.updateOne(
  {
    _id: ObjectId("5edaaed8d8609c2c47fd6582"),
    'tables._id': ObjectId("5ee277bab0df345e54614b61"),
    'tables.status': 'AVAILABLE'
  },
  { $set: { 'tables.$.status': 'CONFIRMED' } }
);
Guess what happens when I run the update-statement above?
It strangely updates the FIRST table (with the wrong table._id)!
However, when I remove the 'tables.status' filter from the query, it does update the right table:
const result = await Restaurant.updateOne(
  {
    _id: ObjectId("5edaaed8d8609c2c47fd6582"),
    'tables._id': ObjectId("5ee277bab0df345e54614b61")
  },
  { $set: { 'tables.$.status': 'CONFIRMED' } }
);
Problem here is that I need the status to be 'AVAILABLE', or else it should not update!
Can anybody point me in the right direction with this?
According to the docs, the positional $ operator acts as a placeholder for the first element that matches the query document. In your query, however, 'tables._id' and 'tables.status' are two separate conditions on the array: each can be satisfied by a different element, so $ may end up pointing at a table that satisfies only one of them.
To target the element that satisfies both conditions, use the filtered positional operator $[identifier] together with arrayFilters.
Your query will then look something like this:
const result = await Restaurant.updateOne(
  {
    _id: ObjectId("5edaaed8d8609c2c47fd6582"),
    'tables._id': ObjectId("5ee277bab0df345e54614b61"),
    'tables.status': 'AVAILABLE'
  },
  {
    $set: { 'tables.$[table].status': 'CONFIRMED' } // update part
  },
  {
    arrayFilters: [{ "table._id": ObjectId("5ee277bab0df345e54614b61"), 'table.status': 'AVAILABLE' }] // options part
  }
);
This way, you update only the table element that has both that tableId and that status.
Hope it helps.
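To see why the original filter can select the wrong table, here is a small plain-Python model of the two matching rules. It does not talk to MongoDB; the function names and the short ids "t1" and "t2" are made up for illustration. The first function mimics two independent dotted conditions on the array, the second mimics a per-element match as done by arrayFilters or $elemMatch.

```python
def matches_dotted(tables, table_id, status):
    # 'tables._id': X and 'tables.status': Y are checked separately,
    # so a different element may satisfy each condition.
    return (any(t["_id"] == table_id for t in tables)
            and any(t["status"] == status for t in tables))


def matches_elem(tables, table_id, status):
    # Per-element semantics: one element must satisfy both conditions.
    return any(t["_id"] == table_id and t["status"] == status for t in tables)


tables = [
    {"_id": "t1", "status": "AVAILABLE"},
    {"_id": "t2", "status": "CONFIRMED"},
]

# Table t2 is already CONFIRMED, yet the dotted filter still matches
# the document because t1 is AVAILABLE.
print(matches_dotted(tables, "t2", "AVAILABLE"))  # True
print(matches_elem(tables, "t2", "AVAILABLE"))    # False
```

Under the per-element rule, nothing is updated when the requested table is not AVAILABLE, which is the behavior the question asks for.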

Meteor/Mongo - add/update element in sub array dynamically

So I have found quite few related posts on SO on how to update a field in a sub array, such as this one here
What I want to achieve is basically the same thing, but updating a field in a subarray dynamically, instead of just calling the field name in the query.
Now I also found how to do that directly in the main object, but can't seem to do it in the sub array.
Code to insert dynamically in sub-object:
_.each(data.data, function(val, key) {
  var obj = {};
  obj['general.'+key] = val;
  insert = 0 || (Documents.update(
    { _id: data._id },
    { $set: obj}
  ));
});
Here is the tree of what I am trying to do:
Documents: {
_id: '123123'
...
smallRoom:
[
_id: '456456'
name: 'name1'
description: 'description1'
],
[
...
]
}
Here is my code:
// insert a new object in smallRoom, with only the _id so far
var newID = new Mongo.ObjectID;
var createId = {_id: newID._str};
Documents.update({_id: data._id},{$push:{smallRooms: createId}})
And the part to insert the other fields:
_.each(data.data, function(val, key) {
  var obj = {};
  obj['simpleRoom.$'+key] = val;
  console.log(Documents.update(
    {
      _id: data._id, // the document id that I want to update
      smallRoom: {
        $elemMatch: {
          _id: newID._str, // the smallRoom id that I want to update
        }
      }
    },
    {
      $set: obj
    }
  ));
});
Ok, having said that, I understand I can insert the whole object straight away, not having to push every single field.
But I guess this question is more like: how does it work if smallRoom had 50 fields and I wanted to update 3 arbitrary fields? (So I would NEED to use the _.each loop, as I wouldn't know in advance which fields to update and would not want to replace the whole object.)
I'm not sure I 100% understand your question, but I think the answer to what you are asking is to use the $ symbol.
Example:
Documents.update(
  {
    _id: data._id, 'smallRoom._id': newID._str
  },
  {
    $set: { 'smallRoom.$.name': 'new name' }
  }
);
You are finding the document that matches _id: data._id, then finding the object in the array smallRoom that has an _id equal to newID._str. Then you are using the $ sign to tell Mongo to update that object's name key. Note that dotted keys such as 'smallRoom._id' must be quoted.
Hope that helps.
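For the dynamic case in the question, where the field names are only known at run time, the $set document can be built in a loop and sent in a single update. The following is a minimal sketch in Python rather than Meteor JavaScript; build_positional_set is a hypothetical helper, and the field values are placeholders.

```python
def build_positional_set(array_field, changes):
    """Build one $set document that updates several keys of the matched
    array element, using the positional $ operator in each path."""
    return {
        "$set": {
            f"{array_field}.$.{key}": value
            for key, value in changes.items()
        }
    }


update = build_positional_set(
    "smallRoom", {"name": "name2", "description": "description2"}
)
print(update)
```

The positional $ only resolves when the query part of the update also references the array, for example with a 'smallRoom._id' condition as in the answer above.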

Unique array value in Mongo

I'm having a hard time finding a way to make a collection index work the way I need. The collection has an array that will contain two elements, and no other document's array may contain the same two elements (in any order):
db.collection.insert({ users: [1,2] }) // should be valid
db.collection.insert({ users: [2,3] }) // should be valid
db.collection.insert({ users: [1,3] }) // should be valid
db.collection.insert({ users: [3,2] }) // should be invalid, since there's another array with that same value.
But, if I use db.collection.createIndex({users:1}, {unique: true}), it won't allow me to have two arrays with a common element:
db.collection.insert({ users: [1,2] }) // valid
db.collection.insert({ users: [2,3] }) // invalid, since 2 is already on another document
One of the solutions I tried was to make the array one level deeper. Creating the very same index, but adding documents a little different would make it almost the way I need, but it would still allow two arrays to have the same value in the reverse orders:
db.chat.insert({ users : { people : [1,2] }}) // valid
db.chat.insert({ users : { people : [2,3] }}) // valid
db.chat.insert({ users : { people : [2,1] }}) // valid, but it should be invalid, since there's another document with [1,2] array value.
db.chat.insert({ users : { people : [1,2] }}) // invalid
Is there a way to achieve this on a index level?
MongoDB does not index an array as a whole; an index on an array field indexes each element separately. But...
If we want a single atomic insert or update that guarantees the uniqueness of the array's content, we need to compute a value that is the same for every permutation of the array's items and create a unique index on it.
One way is to sort the array items (which solves the permutation problem) and concatenate them (which produces a single value). An example in JavaScript:
function index(arr) {
  // Copy before sorting so the caller's array is not reordered.
  return arr.slice().sort().join();
}
users1 = [1, 2], usersIndex1 = index(users1); // "1,2"
users2 = [2, 1], usersIndex2 = index(users2); // "1,2"
// Create the unique index on the derived field
db.collection.createIndex({usersIndex: 1}, {unique: true});
db.collection.insert({users: users1, usersIndex: usersIndex1}); // Ok
db.collection.insert({users: users2, usersIndex: usersIndex2}); // Error: duplicate key
If the arrays are long, you can apply a hash function to the strings to keep the indexed values small, though this comes at the price of possible collisions.
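The same idea can be sketched in Python, including the optional hashing step. The function names are hypothetical, the elements are assumed to be mutually comparable (all numbers or all strings), and SHA-256 is only one possible choice of hash.

```python
import hashlib


def canonical_key(users):
    """Order-independent key: sort the elements, then join them."""
    return ",".join(str(u) for u in sorted(users))


def hashed_key(users):
    """Fixed-length variant for long arrays (64 hex characters)."""
    return hashlib.sha256(canonical_key(users).encode("utf-8")).hexdigest()


print(canonical_key([2, 1]))                   # 1,2
print(canonical_key([1, 2]))                   # 1,2
print(hashed_key([3, 2]) == hashed_key([2, 3]))  # True
```

The resulting string would be stored in its own field (usersIndex in the answer above) and covered by a unique index, so the database rejects the duplicate atomically. Hashing only pays off when the joined string would be longer than the digest.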
You have to write a custom validation in the pre save hook:
Coffee Version
Pre Save Hook
chatSchema.pre('save', (next) ->
  data = @
  @.constructor.findOne {}, (err, doc) ->
    return next err if err?
    return next "Duplicate" if customValidation(data, doc) == false
    return next()
)
Custom Validation
customValidation = (oldDoc, newDoc)->
  # whatever you need, e.g.
  return !lodash.isEqual(oldDoc, newDoc)
Js Version
var customValidation;
chatSchema.pre('save', function(next) {
  var data;
  data = this;
  return this.constructor.findOne({}, function(err, doc) {
    if (err != null) {
      return next(err);
    }
    if (customValidation(data, doc) === false) {
      return next("Duplicate");
    }
    return next();
  });
});
customValidation = function(oldDoc, newDoc) {
  return !lodash.isEqual(oldDoc, newDoc);
};
You should first look for an existing record and insert only if none is found:
db.chat.findOne({users: {$all: [3,2]}})
  .then(function(doc){
    if(doc){
      return res.json('already exists');
    } else {
      db.chat.insert({users: [3,2]})
    }
  })
  .catch(next);

Update nested array object (put request)

I have an array inside a document of a collection called pown.
{
_id: 123..,
name: pupies,
pups:[ {name: pup1, location: somewhere}, {name: pup2, ...}]
}
Now a user of my REST service sends the entire first entry as a PUT request:
{name: pup1, location: inTown}
After that I want to update this element in my database.
Therefore I tried this:
var updatedPup = req.body;
var searchQuery = {
  _id : 123...,
  pups : { name : req.body.name }
}
var updateQuery = {
  $set: {'pups': updatedPup }
}
db.pown.update(searchQuery, updateQuery, function(err, data){ ... })
Unfortunately, it is not updating anything.
Does anyone know how to update an entire array element?
As Neil pointed out, you need to be familiar with dot notation (used to select fields) and the positional operator $ (used to select a particular element in an array, i.e. the element matched by the search query, which therefore has to reference the array, for example { "pups.name": req.body.name }). If you want to replace the whole element in the array:
var updateQuery = {
  "$set": {"pups.$": updatedPup}
}
If you only need to change the location:
var updateQuery = {
  "$set": {"pups.$.location": updatedPup.location}
}
The problem here is that the selection in your query actually wants to update an embedded array element in your document. The first thing is that you want to use "dot notation" instead, and then you also want the positional $ modifier to select the correct element:
db.pown.update(
  { "pups.name": req.body.name },
  { "$set": { "pups.$.location": req.body.location } }
)
That would be the nice way to do things. Mostly because you really only want to modify the "location" property of the sub-document. So that is how you express that.

Auto increment in MongoDB to store sequence of Unique User ID

I am making an analytics system. The API call provides a Unique User ID, but the IDs are not sequential and are too sparse.
I need to give each Unique User ID an auto-increment id to mark an analytics datapoint in a bitarray/bitset. So the first user encountered would correspond to the first bit of the bitarray, the second user to the second bit, and so on.
So is there a solid and fast way to generate incremental Unique User IDs in MongoDB?
As the selected answer says, you can use findAndModify to generate sequential IDs.
But I strongly disagree with the opinion that you should not do that. It all depends on your business needs. Having a 12-byte ID may be very resource consuming and cause significant scalability issues in the future.
I have a detailed answer here.
You can, but you should not
https://web.archive.org/web/20151009224806/http://docs.mongodb.org/manual/tutorial/create-an-auto-incrementing-field/
Each object in Mongo already has an id, and the ids are sortable in insertion order. What is wrong with getting the collection of user objects, iterating over it, and using the position as the incremented ID? Or go for a kind of map-reduce job entirely.
I know this is an old question, but I shall post my answer for posterity...
It depends on the system that you are building and the particular business rules in place.
I am building a moderate to large scale CRM in MongoDb, C# (Backend API), and Angular (Frontend web app) and found ObjectId utterly terrible for use in Angular Routing for selecting particular entities. Same with API Controller routing.
The suggestion above worked perfectly for my project.
db.contacts.insert({
  "id": db.contacts.find().count() + 1,
  "name": "John Doe",
  "emails": [
    "john@doe.com",
    "john.doe@business.com"
  ],
  "phone": "555111322",
  "status": "Active"
});
The reason it is perfect for my case, but not for all cases, is that, as the above comment states, if you delete 3 records from the collection, you will get collisions.
My business rules state that, due to our in-house SLAs, we are not allowed to delete correspondence data or client records for longer than the potential lifespan of the application I'm writing. Therefore, I simply mark records with an enum "Status", which is either "Active" or "Deleted". You can delete something from the UI, and it will say "Contact has been deleted", but all the application has done is change the status of the contact to "Deleted". When the app calls the repository for a list of contacts, I filter out deleted records before pushing the data to the client app.
Therefore, db.collection.find().count() + 1 is a perfect solution for me...
It won't work for everyone, but if you will not be deleting data, it works fine.
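The collision described above is easy to reproduce without a database. The sketch below models the collection as a plain Python list (an assumption made purely for illustration) and compares a hard delete with the soft-delete rule from this answer.

```python
def next_id_by_count(records):
    # Same rule as db.contacts.find().count() + 1
    return len(records) + 1


# Hard delete: removing a record lowers the count.
records = [{"id": 1}, {"id": 2}, {"id": 3}]
records = [r for r in records if r["id"] != 2]
new_id = next_id_by_count(records)
collides = any(r["id"] == new_id for r in records)
print(new_id, collides)  # 3 True

# Soft delete: the record stays, only its status changes.
soft = [
    {"id": 1, "status": "Active"},
    {"id": 2, "status": "Deleted"},
    {"id": 3, "status": "Active"},
]
soft_id = next_id_by_count(soft)
print(soft_id)  # 4
```

Even with soft deletes, counting and inserting are two separate operations, so two clients that insert at the same moment can read the same count and produce the same id.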
Edit
In recent versions of PyMongo, where count() is no longer available:
db.contacts.count_documents({}) + 1
First, add a record with "_id" = 1 to your db.
$database = "demo";
$collections = "democollaction";
echo getnextid($database, $collections);
function getnextid($database, $collections) {
    $m = new MongoClient();
    $db = $m->selectDB($database);
    // Select the collection by name before querying it.
    $collection = $db->selectCollection($collections);
    $cursor = $collection->find()->sort(array("_id" => -1))->limit(1);
    $array = iterator_to_array($cursor);
    foreach ($array as $value) {
        return $value["_id"] + 1;
    }
}
I had a similar issue, namely I was interested in generating unique numbers, which can be used as identifiers, but doesn't have to. I came up with the following solution. First to initialize the collection:
fun create(mongo: MongoTemplate) {
    mongo.db.getCollection("sequence")
        .insertOne(Document(mapOf("_id" to "globalCounter", "sequenceValue" to 0L)))
}
And then a service that returns unique (and ascending) numbers:
@Service
class IdCounter(val mongoTemplate: MongoTemplate) {
    companion object {
        const val collection = "sequence"
    }
    private val idField = "_id"
    private val idValue = "globalCounter"
    private val sequence = "sequenceValue"
    fun nextValue(): Long {
        val filter = Document(mapOf(idField to idValue))
        val update = Document("\$inc", Document(mapOf(sequence to 1)))
        val updated: Document = mongoTemplate.db.getCollection(collection).findOneAndUpdate(filter, update)!!
        return updated[sequence] as Long
    }
}
I believe that it doesn't have the weaknesses related to concurrent environments that some of the other solutions may suffer from.
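The reason a single counter document copes with concurrency is that the read and the increment happen as one atomic step. The sketch below is an in-memory model only: a lock stands in for the atomicity MongoDB provides for a single-document find-and-update, and the class and function names are hypothetical. Whether the driver hands back the value from before or after the increment depends on its return-document option; this model returns the value after it.

```python
import threading


class CounterDocument:
    """In-memory stand-in for the {"_id": "globalCounter"} document."""

    def __init__(self):
        self._value = 0
        self._lock = threading.Lock()

    def next_value(self):
        # Read and increment as one indivisible step.
        with self._lock:
            self._value += 1
            return self._value


def draw_ids(counter, workers=8, per_worker=100):
    """Let several threads draw ids at once and collect all of them."""
    results = []
    results_lock = threading.Lock()

    def work():
        local = [counter.next_value() for _ in range(per_worker)]
        with results_lock:
            results.extend(local)

    threads = [threading.Thread(target=work) for _ in range(workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results


ids = draw_ids(CounterDocument())
print(len(ids), len(set(ids)))
```

Every thread receives a distinct number, which is the property the count-based approaches in the other answers cannot guarantee.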
// await collection.insertOne({ autoIncrementId: 1 });
const { value: { autoIncrementId } } = await collection.findOneAndUpdate(
  { autoIncrementId: { $exists: true } },
  {
    $inc: { autoIncrementId: 1 },
  },
);
return collection.insertOne({ id: autoIncrementId, ...data });
I used something like nested queries in MySQL to simulate auto increment, which worked for me. To get the latest id and add one to it, you can use:
lastContact = db.contacts.find().sort({$natural:-1}).limit(1)[0];
db.contacts.insert({
  "id": lastContact ? lastContact["id"] + 1 : 1,
  "name": "John Doe",
  "emails": ["john@doe.com", "john.doe@business.com"],
  "phone": "555111322",
  "status": "Active"
})
It solves the removal issue of Alex's answer. So no duplicate id will appear if any record is removed.
More explanation: I just get the id of the latest inserted document, add one to it, and then set it as the id of the new record. And ternary is for cases that we don't have any records yet or all of the records are removed.
This could be another approach:
const mongoose = require("mongoose");
const contractSchema = mongoose.Schema(
  {
    account: {
      type: mongoose.Schema.Types.ObjectId,
      required: true,
    },
    idContract: {
      type: Number,
      default: 0,
    },
  },
  { timestamps: true }
);
contractSchema.pre("save", function (next) {
  var docs = this;
  mongoose
    .model("contract", contractSchema)
    .countDocuments({ account: docs.account }, function (error, counter) {
      if (error) return next(error);
      docs.idContract = counter + 1;
      next();
    });
});
module.exports = mongoose.model("contract", contractSchema);
// First check the table length
const data = await table.find()
if (data.length === 0) {
  const id = 1
  // then post your query along with your id
}
else {
  // find last item and then its id
  const length = data.length
  const lastItem = data[length-1]
  const lastItemId = lastItem.id // or { id } = lastItem
  const id = lastItemId + 1
  // now apply new id to your new item
  // this works even if you delete an item from the middle
}