MongoDB C# collection.Save vs Insert+Update

From the C# documentation:
The Save method is a combination of Insert and Update. If the Id member of the document has a value, then it is assumed to be an existing document and Save calls Update on the document (setting the Upsert flag just in case it actually is a new document after all).
I'm creating my IDs manually in a base class that all my domain objects inherit from. So all my domain objects have an ID when they are inserted into MongoDB.
Question is, should I use collection.Save and keep my interface simple, or does this actually result in some overhead in the Save call (with the Upsert flag), and should I therefore use collection.Insert and Update instead?
What I'm thinking is that the Save method first calls Update and then figures out that my new object didn't exist in the first place, and then calls Insert instead. Am I wrong? Has anyone tested this?
Note: I insert bulk data with InsertBatch, so big data chunks won't matter in this case.
Edit, Follow up
I wrote a small test to find out if calling Update with Upsert flag had some overhead so Insert might be better. Turned out that they run at the same speed. See my test code below. MongoDbServer and IMongoDbServer is my own generic interface to isolate the storage facility.
IMongoDbServer server = new MongoDbServer();
Stopwatch sw = new Stopwatch();
long d1 = 0;
long d2 = 0;
for (int w = 0; w <= 100; w++)
{
    sw.Restart();
    for (int i = 0; i <= 10000; i++)
    {
        ProductionArea area = new ProductionArea();
        server.Save(area);
    }
    sw.Stop();
    d1 += sw.ElapsedMilliseconds;

    sw.Restart();
    for (int i = 0; i <= 10000; i++)
    {
        ProductionArea area = new ProductionArea();
        server.Insert(area);
    }
    sw.Stop();
    d2 += sw.ElapsedMilliseconds;
}
long a1 = d1 / 100;
long a2 = d2 / 100;

The Save method is not going to make two trips to the server.
The heuristic is this: if the document being saved does not have a value for the _id field, then a value is generated for it and then Insert is called. If the document being saved has a non-zero value for the _id, then Update is called with the Upsert flag, in which case it is up to the server to decide whether to do an Insert or an Update.
I don't know if an Upsert is more expensive than an Insert. I suspect they are almost the same and what really matters is that either way it is a single network round trip.
If you know it's a new document you might as well call Insert. And calling InsertBatch is way more performant than calling many individual Inserts. So definitely prefer InsertBatch to Save.
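As a rough sketch of that last point, using the legacy C# driver API the question is written against (the connection string, database, and collection names below are made up for illustration):

// Assumes the legacy MongoDB C# driver (MongoCollection<T>); names are illustrative.
var client = new MongoClient("mongodb://localhost");
var database = client.GetServer().GetDatabase("mydb");
var collection = database.GetCollection<ProductionArea>("productionareas");

// The domain base class has already assigned an _id to each document,
// so these are known to be new documents: no need for Save/Upsert.
var areas = new List<ProductionArea>();
for (int i = 0; i < 10000; i++)
{
    areas.Add(new ProductionArea());
}

// A single InsertBatch call instead of 10,000 individual Save or Insert calls.
collection.InsertBatch(areas);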


Script is taking 11 - 20 seconds to look up an item in an 18,000 row data set

I have two Google Sheets workbooks.
One is the "master" source of lookup data with a key based on manufacturer item #, which could be anything from 1234 to A-01/234-Name_1. This sheet, referenced via SpreadsheetApp.openByUrl, has 18,000 rows and 13 columns. The key column has been converted to plain text and the sheet is sorted by this column.
The second is the "template" where people enter item #s that they need to look up against the master, typically 20 - 1500 items at a time.
The script is in the template. It is very slow and routinely times out after 30 minutes. It was written by someone else and I am new to Apps Script, but I think I've managed to understand what the script is doing and where the bottleneck is occurring.
It does a bunch of stuff, but this is the meat of the lookup:
var numrows = master.getDataRange().getNumRows();
var masterdata = master.getDataRange().getValues();
var itemnumberlist = template.getDataRange().getValues();
var retreiveddata = [];
// iterate through the manf item number list to find all matches in the
// master and return those matches to another sheet
for (i = 1; i < template.getDataRange().getValues().length; i++) {
  for (j = 0; j < numrows; j++) {
    if (masterdata[j][1].toString() === itemnumberlist[i][1].toString()) {
      retreiveddata.push(data[j]);
      anothersheet.appendRow(data[j]);
    }
  }
}
I used Logger.log() to determine that each time through the i loop is taking 11 - 19 seconds, which just seems insane.
I've been doing some google searching and I've tried a couple of different things...
First I tried moving the writing of found data out of the for loop so the script would be doing all of its reading first and then writing in one big chunk, but I couldn't get it exactly right. My two attempts are below.
var mycounter = 0;
for (i = 0; i < template.getDataRange().getValues().length; i++) {
  for (j = 0; j < numrows; j++) {
    if (masterdata[j][0].toString() === itemnumberlist[i][0].toString()) {
      retreiveddata.push(masterdata[j]);
      mycounter = mycounter + 1;
    }
  }
}

// Attempt 1
// var myrange = retreiveddata.length;
// for(k = 0; k < myrange; k++) {
//   anothersheet.appendRow(retreiveddata.pop([k]);
// }

// Attempt 2
var myotherrange = anothersheet.getRange(2, 1, myothercounter, 13)
myotherrange.setValues(retreiveddata);
I can't remember for sure, because this was on Friday, but I think both attempts resulted in the script trying to write the entire master file into "anothersheet".
So I temporarily set this aside and decided to try something else. I was trying to recreate the issue in a couple of sample spreadsheets, but I was unable to do so. The same script is getting through my 15,000 row sample "master" file in less than 1 second per lookup. The only thing I can think of is that I used a random number as my key instead of a weird text string.
That led me to think that maybe I could use a hash algorithm on both the master data and the values to be looked up, but this is presenting a whole other set of issues.
I borrowed these functions from another forum post:
function GetMD5Hash(value) {
  var rawHash = Utilities.computeDigest(Utilities.DigestAlgorithm.MD5, value);
  var txtHash = '';
  for (j = 0; j < rawHash.length; j++) {
    var hashVal = rawHash[j];
    if (hashVal < 0)
      hashVal += 256;
    if (hashVal.toString(16).length == 1)
      txtHash += "0";
    txtHash += hashVal.toString(16);
    Utilities.sleep(100);
  }
  return txtHash;
}

function RangeGetMD5Hash(input) {
  if (input.map) {                 // Test whether input is an array.
    return input.map(GetMD5Hash);  // Recurse over array if so.
    Utilities.sleep(100);
  } else {
    return GetMD5Hash(input)
  }
}
It literally took me all day to get the hash value for all 18,000 item #s in my master spreadsheet. Neither GetMD5Hash nor RangeGetMD5Hash will return a value consistently. I can only do a few rows at a time. Sometimes I get "Loading..." indefinitely. Sometimes I get "#Name" with a message about GetMD5Hash being undefined (despite the fact that it worked on the previous row). And sometimes I get "#Error" with a message about an internal error.
This method actually reduces the lookup time of each item to 2 - 3 seconds (much better, but not great). However, I can't get the hash function to consistently work on the input data.
At this point I'm so frustrated and behind on my other work that I thought I'd reach out to the smart people on these forums and hope for some sort of miracle response.
To summarize, I'm looking for suggestions on these three items:
What am I doing wrong in my attempt to move the write out of the for loop?
Is there a way to get my hash value faster or utilize a different method to accomplish the same goal?
What else can I try to help speed up the script?
Any suggestions you can offer would be greatly appreciated!
-Mandy
It sounds like you hit on the right approach with attempting to move the appendRow() call out of the loop. Anytime you are reading or writing to a spreadsheet you can expect the individual call to take 1 to 2 seconds, so this will eat up a lot of time when you get matches. Storing the matches in an array and writing them all at once is the way to go.
Another thing I notice is that your script calls getValues() in the for loop's condition statement. The condition is evaluated on every iteration of the loop, so this is potentially wasting a lot of time even when you don't have matches.
A final tweak that may be helpful depending on your desired behaviour. You can stop the inner for loop after it finds the first match, which, if you only care about the first match or know there will only be one match, will save you a lot of iterations. To do this, put "break" immediately after the retreiveddata.push(masterdata[j]); line.
To fix the getValues issue, change:
for (i = 1; i < template.getDataRange().getValues().length; i++) {
To:
for (i = 1; i < itemnumberlist.length; i++) {
Here is that fix, along with the appendRow fix and the break call:
for (i = 1; i < itemnumberlist.length; i++) {
  for (j = 0; j < numrows; j++) {
    if (masterdata[j][0].toString() === itemnumberlist[i][0].toString()) {
      retreiveddata.push(masterdata[j]);
      break; // stop searching after first match, move on to next item
    }
  }
}

// make sure you have data to write before trying to write it
if (retreiveddata.length > 0) {
  var myotherrange = anothersheet.getRange(2, 1, retreiveddata.length, retreiveddata[0].length);
  myotherrange.setValues(retreiveddata);
}
If you are re-using the same sheet for "anothersheet" on each execution, you may also want to call anothersheet.clear() to erase any existing data before you write your fresh results.
I would pass on the hashing approach altogether; comparing strings is comparing strings, so whether they are hashes or actual part numbers I wouldn't expect a significant difference.

how to get a parentNode's index i using d3.js

Using d3.js, were I after (say) some value x of a parent node, I'd use:
d3.select(this.parentNode).datum().x
What I'd like, though, is the data's (i.e. the datum's) index. Suggestions?
Thanks!
The index of an element is only well-defined within a collection. When you're selecting just a single element, there's no collection and the notion of an index is not really defined. You could, for example, create a number of g elements and then apply different operations to different (overlapping) subsets. Any individual g element would have several indices, depending on the subset you consider.
In order to do what you're trying to achieve, you would have to keep a reference to the specific selection that you want to use. Having this and something that identifies the element, you can then do something like this.
var value = d3.select(this.parentNode).datum().x;
var index = -1;
selection.each(function(d, i) { if(d.x == value) index = i; });
This relies on having an attribute that uniquely identifies the element.
If you have only one selection, you could simply save the index as another data attribute and access it later.
var gs = d3.selectAll("g").data(data).append("g")
    .each(function(d, i) { d.index = i; });

var something = gs.append(...);

something.each(function() {
  d3.select(this.parentNode).datum().index;
});

Random Sampling from Mongo

I have a mongo collection with documents. There is one field in every document that is 0 or 1. I need to randomly sample 1000 records from the database and count the number of documents that have that field set to 1. I need to do this sampling 1000 times. How do I do it?
For people coming to this answer: you should now use the $sample aggregation stage, which is new in MongoDB 3.2.
https://docs.mongodb.org/manual/reference/operator/aggregation/sample/
db.collection_of_things.aggregate(
[ { $sample: { size: 15 } } ]
)
Then add another $group stage to the pipeline to count up the 0s and 1s.
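A minimal sketch of the combined pipeline in the mongo shell (assuming the 0/1 field is called flag; substitute your real field name):

db.collection_of_things.aggregate([
    { $sample: { size: 1000 } },           // take 1000 random documents
    { $group: { _id: "$flag", count: { $sum: 1 } } }  // count docs per 0/1 value
])

This returns one document per distinct flag value, with the number of sampled documents that carry it.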
For MongoDB 3.0 and before, I use an old trick from the SQL days (which I think Wikipedia uses for their random page feature). I store a random number between 0 and 1 in every object I need to randomize; let's call that field "r". You then add an index on "r".
db.coll.ensureIndex({r: 1});
Now to get random x objects, you use:
var startVal = Math.random();
db.coll.find({r: {$gt: startVal}}).sort({r: 1}).limit(x);
This gives you random objects in a single find query. Depending on your needs, this may be overkill, but if you are going to be doing lots of sampling over time, this is a very efficient way without putting load on your backend.
Here's an example in the mongo shell, assuming a collection named collname and a value of interest in thefield:
var total = db.collname.count();
var count = 0;
var numSamples = 1000;
for (i = 0; i < numSamples; i++) {
  var random = Math.floor(Math.random() * total);
  var doc = db.collname.find().skip(random).limit(1).next();
  if (doc.thefield) {
    count += (doc.thefield == 1);
  }
}
I was going to edit my comment on @Stennie's answer with this, but you could also use a separate auto-incrementing ID index here as an alternative if you were to skip over HUGE amounts of records (talking huge here).
I wrote another answer to a question a lot like this one, where someone was trying to find the nth record of the collection:
php mongodb find nth entry in collection
The second half of my answer basically describes one potential method by which you could approach this problem. You would still need to loop 1000 times to get the random row of course.
If you are using mongoengine, you can use a SequenceField to generate an incremental counter.
class User(db.DynamicDocument):
    counter = db.SequenceField(collection_name="user.counters")
Then to fetch a random list of say 100, do the following
def get_random_users(number_requested):
    users_to_fetch = random.sample(range(1, User.objects.count() + 1),
                                   min(number_requested, User.objects.count()))
    return User.objects(counter__in=users_to_fetch)
where you would call
get_random_users(100)

Dataset capacities

Is there any limit on the number of rows in a DataSet? Basically I need to generate Excel files with data extracted from SQL Server and add formatting. There are two approaches I have: either take the entire data (around 450,000 rows) and loop through it in .NET code, OR loop through around 160 records at a time, pass every record as an input to a proc, get the relevant data, generate the file, and move on to the next 160. Which is the best way? Is there any other way this can be handled?
If I take 450,000 records at a time, will my application crash?
Thanks,
Rohit
You should not try to read 450,000 rows into your application at one time. You should instead use a DataReader or another cursor-like method and look at the data one row at a time. Otherwise, even if your application does run, it'll be extremely slow and use up all of the computer's resources.
Basically I need to generate Excel files with data extracted from SQL Server and add formatting
A DataSet is generally not ideal for this. A process that loads a dataset, loops over it, and then discards it, means that the memory from the first row processed won't be released until the last row is processed.
You should use a DataReader instead. It discards each row once it has been processed, when the subsequent call to Read moves on to the next row.
Is there any limit on the number of rows in a DataSet?
Since the DataRowCollection.Count property is an int, a DataSet is limited to at most 2,147,483,647 rows per table; in practice, other constraints (most obviously available memory) make the real limit far smaller.
From your comments, this is an outline of how I might construct the loop:
// Requires System.Data.SqlClient
using (connection)
{
    SqlCommand command = new SqlCommand(
        @"SELECT Company, Dept, EmpName
          FROM Table
          ORDER BY Company, Dept, EmpName", connection);
    connection.Open();
    SqlDataReader reader = command.ExecuteReader();
    string CurrentCompany = "";
    string CurrentDept = "";
    string LastCompany = "";
    string LastDept = "";
    SomeExcelObject xl = null;
    if (reader.HasRows)
    {
        while (reader.Read())
        {
            CurrentCompany = reader["Company"].ToString();
            CurrentDept = reader["Dept"].ToString();
            // Start a new workbook whenever the Company/Dept grouping changes
            if (CurrentCompany != LastCompany || CurrentDept != LastDept)
            {
                xl = CreateNewExcelDocument(CurrentCompany, CurrentDept);
            }
            LastCompany = CurrentCompany;
            LastDept = CurrentDept;
            AddNewEmpName(xl, reader["EmpName"].ToString());
        }
    }
    reader.Close();
}

EF4: Object Context consuming too much memory

I have a reporting tool that runs against an MS SQL Server using EF4. The general bulk of this report involves looping over around 5000 rows and then pulling numerous other rows for each one of these.
I pull the initial rows through one data context. The code that pulls the related rows uses another data context, wrapped in a using statement. It would appear, though, that the memory consumed by the second data context is never freed, and usage shoots up to 1.5 GB before an out-of-memory exception is thrown.
Here is a snippet of the code so you can get the idea:
var outlets = (from o in db.tblOutlets
               where o.OutletType == 3
                     && o.tblCalls.Count() > number
                     && o.BelongsToUser.HasValue
                     && o.tblUser.Active == true
               select new { outlet = o, callcount = o.tblCalls.Count() })
              .OrderByDescending(p => p.callcount);

var outletcount = outlets.Count();
//var outletcount = 0;
//var average = outlets.Average(p => p.callcount);

foreach (var outlet in outlets)
{
    using (relenster_v2Entities db_2 = new relenster_v2Entities())
    {
        // loop over calls and add history
        // check the last time the history table was added to for this call
        var lastEntry = (from h in db_2.tblOutletDistributionHistories
                         where h.OutletID == outlet.outlet.OutletID
                         orderby h.VisitDate descending
                         select h).FirstOrDefault();

        DateTime? beginLooking = null;
I had hoped that by using a second data context, memory could be released after each iteration. It would appear it is not (or the GC is not running in time).
With the input from @adrift I altered the code so that the saving of the changes took place after each iteration of the loop, rather than all at the end. It would appear that there is a limit (in my case anyway) of around 150,000 pending writes that the data context can happily hold before consuming too much memory.
By allowing it to write changes after each iteration, it appeared to manage memory more effectively; although it still seemed to use about as much memory, it didn't throw an exception.
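For reference, a minimal sketch of the adjusted loop (same entity names as the snippet above; the only change is that each iteration now saves and disposes its own context, so pending writes never pile up):

foreach (var outlet in outlets)
{
    using (relenster_v2Entities db_2 = new relenster_v2Entities())
    {
        var lastEntry = (from h in db_2.tblOutletDistributionHistories
                         where h.OutletID == outlet.outlet.OutletID
                         orderby h.VisitDate descending
                         select h).FirstOrDefault();

        // ... build the history rows for this outlet here ...

        // Persist this iteration's changes now instead of queuing
        // hundreds of thousands of pending writes for one big save.
        db_2.SaveChanges();
    } // disposing the context releases everything it was tracking
}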