I have a reporting tool that runs against an MS SQL Server using EF4. The bulk of the report involves looping over around 5,000 rows and then pulling numerous related rows for each of them.
I pull the initial rows through one data context. The code that pulls the related rows uses another data context, wrapped in a using statement. It would appear, though, that the memory consumed by the second data context is never freed, and usage shoots up to 1.5 GB before an out-of-memory exception is thrown.
Here is a snippet of the code so you can get the idea:
var outlets = (from o in db.tblOutlets
               where o.OutletType == 3
                     && o.tblCalls.Count() > number
                     && o.BelongsToUser.HasValue
                     && o.tblUser.Active == true
               select new { outlet = o, callcount = o.tblCalls.Count() })
              .OrderByDescending(p => p.callcount);

var outletcount = outlets.Count();
//var outletcount = 0;
//var average = outlets.Average(p => p.callcount);

foreach (var outlet in outlets)
{
    using (relenster_v2Entities db_2 = new relenster_v2Entities())
    {
        //loop over calls and add history
        //check the last time the history table was added to for this call
        var lastEntry = (from h in db_2.tblOutletDistributionHistories
                         where h.OutletID == outlet.outlet.OutletID
                         orderby h.VisitDate descending
                         select h).FirstOrDefault();

        DateTime? beginLooking = null;
I had hoped that by using a second data context, memory could be released after each iteration. It would appear it is not (or the GC is not running in time).
With the input from @adrift I altered the code so that the changes are saved after each iteration of the loop, rather than all at the end. It would appear that there is a limit (in my case, anyway) of around 150,000 pending writes that the data context can happily hold before consuming too much memory.
By allowing it to write changes after each iteration, it would appear that it can manage memory more effectively; although it still seemed to use about as much, it didn't throw an exception.
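For illustration, a minimal sketch of what that per-iteration save looks like (the loop body is abbreviated here, since the original snippet above is truncated; SaveChanges is the standard EF4 ObjectContext call):

foreach (var outlet in outlets)
{
    using (relenster_v2Entities db_2 = new relenster_v2Entities())
    {
        // ... query lastEntry and add the new history rows for this outlet ...

        // Flush this iteration's pending writes before the context is disposed,
        // rather than letting ~150,000 of them accumulate across all the outlets.
        db_2.SaveChanges();
    }
}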
I like h2o.ai for machine learning using R.
https://cran.r-project.org/web/packages/h2o/h2o.pdf
I like random forests, but I'm making a few thousand predictions in a loop.
It is spamming up my memory with a new prediction frame for every call.
I can't afford to keep them all in memory. I'm making my very nice computer work very hard, and it doesn't have the capacity to hold all the balls in the air at once.
If I could assign a destination frame name to the prediction then each new one would overwrite the old ones.
How do I assign a destination frame name when I am performing "h2o.predict" on an object?
Things that I have tried that did not work:
h2o.predict(object = rf.hex, newdata = test.hex, predictions_frame = "predict.hex")
h2o.predict(object = rf.hex, newdata = test.hex, destination_frame = "predict.hex")
h2o.predict(object = rf.hex, newdata = test.hex, model_id = "predict.hex")
There is no way that I am aware of.
But as an alternative, inside your loop, you could call h2o.rm() on the return value from h2o.predict(). It is worth calling h2o.gc() as well. Something like:
for (data in alldata) {
  # ... prepare newdata
  p = h2o.predict(model, newdata)
  # ... do something with p here
  h2o.rm(p)
  h2o.rm(newdata)  # If also not needed any more
  h2o.gc()
}
Aside: you said "I'm making a few thousand predictions in a loop". Assuming they are all against the same model, remember that you can batch them up and put all of the rows in a single newdata data frame. One h2o.predict() call with 1,000 rows is much more efficient than making 1,000 h2o.predict() calls with one row of newdata each.
The datagrid that I use on the client is based on SQL row number; it also requires the total number of pages for its paging. I use PagedList on the server.
SQL Profiler shows that PagedList makes two DB calls: the first to get the total number of records and the second to get the current page. The problem is that I can't find a way to extract that total number of records from the PagedList. Therefore, I currently have to make an extra call to get the total, which results in three calls per request, two of which are absolutely identical. I understand that I probably won't be able to get rid of the call that gets the total, but I hate making it twice. Here is an extract from my code; I'd really appreciate any help with this:
var t = from c in myDb.MyTypes.Filter<MyType>(filterXml) select c;
response.Total = t.Count(); // my first call to get the total

double d = uiRowNumber / uiRecordsPerPage;
int page = (int)Math.Ceiling(d) + 1;

var q = from c in myDb.MyTypes.Filter<MyType>(filterXml).OrderBy(someOrderString)
        select new ReturnType
        {
            Something = c.Something
        };

response.Items = q.ToPagedList(page, uiRecordsPerPage);
PagedList has a .TotalItemCount property which reflects the total number of records in the set (not the number in a particular page). Thus response.Items.TotalItemCount should do the trick.
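Put against the code above, that means the separate Count() query can be dropped and the total read from the paged result itself. A small sketch, assuming response.Total is simply an int on your response object:

response.Items = q.ToPagedList(page, uiRecordsPerPage);
response.Total = response.Items.TotalItemCount; // total across all pages, no separate t.Count() round trip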
I have a collection with a field image that contains images (BinData). I want to find out what percentage of the database is used by the images. What is the most efficient way to calculate the total size of all images?
I want to avoid fetching all images from the DB server, so I came up with this code:
mapper = Code("""
    function() {
        var n = 0;
        if (this.image) {
            n = this.image.length();
        }
        emit('sizes', n);
    }
""")
reducer = Code("""
    function(key, sizes) {
        var total = 0;
        for (var i = 0; i < sizes.length; i++) {
            total += sizes[i];
        }
        return total;
    }
""")
result = db.files.map_reduce(mapper, reducer, "image_sizes")
During execution, the memory usage of MongoDB gets quite high; it looks as if all the data is loaded into memory. How can this be optimized? Also, does it make sense to call this.image.length() to find out how many bytes the images occupy on disk?
You cannot avoid loading all the data into memory. MongoDB treats a document as its atomic unit, and since the map-reduce queries every document, it will pull each one into memory.
As an alternative, what might help is simply looking at how many bytes the collection takes up, but that of course only works if the only thing stored in the collection is the images. In the shell, you can do this with:
db.files.stats()
The result includes a storageSize field that shows approximately how much storage is needed for your images. This is not nearly as accurate as going through all of the images, though.
Is there any limit on the number of rows in a DataSet? Basically I need to generate Excel files with data extracted from SQL Server and add formatting. I have two approaches: either pull the entire data set (around 450,000 rows) and loop through it in .NET code, or loop through around 160 records, pass each record as an input to the proc, get the relevant data, generate the file, and move on to the next of the 160. Which is the best way? Is there any other way this can be handled?
If I take 450,000 records at a time, will my application crash?
Thanks,
Rohit
You should not try to read 450,000 rows into your application at one time. You should instead use a DataReader or another cursor-like method and look at the data a row at a time. Otherwise, even if your application does run, it will be extremely slow and use up all of the computer's resources.
Basically I need to generate excel files with data extracted from SQL server and add formatting
A DataSet is generally not ideal for this. A process that loads a DataSet, loops over it, and then discards it means that the memory from the first row processed won't be released until the last row is processed.
You should use a DataReader instead. It discards each row once it has been processed, as you advance to the next row with a subsequent call to Read.
Is there any limit of rows for a dataset
At the very least, since the DataRowCollection.Count property is an int, it's limited to 2,147,483,647 rows; however, there may be some other constraint that makes it smaller.
From your comments, this is an outline of how I might construct the loop:
using (connection)
{
    SqlCommand command = new SqlCommand(
        @"SELECT Company, Dept, EmpName
          FROM Table
          ORDER BY Company, Dept, EmpName", connection);

    connection.Open();
    SqlDataReader reader = command.ExecuteReader();

    string CurrentCompany = "";
    string CurrentDept = "";
    string LastCompany = "";
    string LastDept = "";
    SomeExcelObject xl = null;

    if (reader.HasRows)
    {
        while (reader.Read())
        {
            CurrentCompany = reader["Company"].ToString();
            CurrentDept = reader["Dept"].ToString();

            // Start a new Excel document whenever the company/department changes
            if (CurrentCompany != LastCompany || CurrentDept != LastDept)
            {
                xl = CreateNewExcelDocument(CurrentCompany, CurrentDept);
            }

            LastCompany = CurrentCompany;
            LastDept = CurrentDept;
            AddNewEmpName(xl, reader["EmpName"].ToString());
        }
    }
    reader.Close();
}
From the MongoDB C# driver documentation:
The Save method is a combination of Insert and Update. If the Id member of the document has a value, then it is assumed to be an existing document and Save calls Update on the document (setting the Upsert flag just in case it actually is a new document after all).
I'm creating my IDs manually in a base class that all my domain objects inherit from. So all my domain objects have an ID when they are inserted into MongoDB.
Question is, should I use collection.Save and keep my interface simple, or does this actually result in some overhead in the Save call (with the Upsert flag), and should I therefore use collection.Insert and Update instead?
What I'm thinking is that the Save method first calls Update and then figures out that my new object didn't exist in the first place, and then calls Insert instead. Am I wrong? Has anyone tested this?
Note: I insert bulk data with InsertBatch, so big data chunks won't matter in this case.
Edit, follow-up:
I wrote a small test to find out whether calling Update with the Upsert flag has enough overhead that Insert might be better. It turned out that they run at the same speed. See my test code below. MongoDbServer and IMongoDbServer are my own generic interface and implementation, used to isolate the storage facility.
IMongoDbServer server = new MongoDbServer();
Stopwatch sw = new Stopwatch();

long d1 = 0;
long d2 = 0;

for (int w = 0; w <= 100; w++)
{
    sw.Restart();
    for (int i = 0; i <= 10000; i++)
    {
        ProductionArea area = new ProductionArea();
        server.Save(area);
    }
    sw.Stop();
    d1 += sw.ElapsedMilliseconds;

    sw.Restart();
    for (int i = 0; i <= 10000; i++)
    {
        ProductionArea area = new ProductionArea();
        server.Insert(area);
    }
    sw.Stop();
    d2 += sw.ElapsedMilliseconds;
}

long a1 = d1 / 100;
long a2 = d2 / 100;
The Save method is not going to make two trips to the server.
The heuristic is this: if the document being saved does not have a value for the _id field, then a value is generated for it and then Insert is called. If the document being saved has a non-zero value for the _id, then Update is called with the Upsert flag, in which case it is up to the server to decide whether to do an Insert or an Update.
I don't know if an Upsert is more expensive than an Insert. I suspect they are almost the same and what really matters is that either way it is a single network round trip.
If you know it's a new document you might as well call Insert. And calling InsertBatch is way more performant than calling many individual Inserts. So definitely prefer InsertBatch to Save.
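As a rough sketch (assuming collection is a MongoCollection<ProductionArea> from the legacy driver, matching the test code above), the batched version of one timing loop would look something like this:

// Build the new documents first, then write them in a single batch.
var areas = new List<ProductionArea>();
for (int i = 0; i < 10000; i++)
{
    areas.Add(new ProductionArea());
}

// One InsertBatch round trip instead of 10,000 individual Save/Insert calls.
collection.InsertBatch(areas);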