I work in theoretical physics and I do a lot of computer simulations. An important part of my job is analyzing the results. I run simulations and store the numerical results in files with simple names. Typically I end up with a lot of data files with very similar names, and after a while I no longer remember which parameters a given file corresponds to. I was thinking that maybe there is a better way to store numerical results from a simulation, e.g. some database (SQL, MongoDB etc.) where I could add comments about the parameters of the program, names, dates etc. - a sort of library of numerical data. I would just like to have everything in one place, well organized. Do you know of anything like this? How do you store your numerical data from computer simulations?
More details
A typical procedure looks like this. Let's say we want to simulate the time evolution of the three-body problem. We have three bodies of different masses interacting through Newtonian forces. I want to test how these objects move in space depending on: relative mass values, initial positions - 6 parameters. I run the simulation for one choice of parameters and save it in a file: three_body_m1=0p1_m2=0p3_(the rest).dat - all double precision, in total 1+3*3 (3d) columns of data in one file. Then I launch gnuplot, python etc. and visualize the data. In principle there is no relation between the data from different simulations, but I can use them to make comparison plots.
Within the same nodejs context, you can:
Stream the big xyz data file to a server using the socket.io-stream + fs modules and save the filename + parameters to a database using the mongodb module (at most a page of code, more if the server interaction gets complex).
If the data fits in RAM and you don't have to persist it immediately, you can use the redis module to send everything to the server cache easily (as key-value pairs such as data -> xyzData, parameters -> simulationParameters and user -> name_surname) and read it back at high speed. If other processes on the server need the data as a file, you can stream to a ramdisk instead and get most of the RAM bandwidth as a file cache (needs more RAM, of course, but it is fast).
mongodb is slow (even with optimizations) for saving millions of particles' xyz data, but it is the easiest and quickest thing to install for saving and sharing parameters.
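For example, here is a minimal sketch of saving one run's parameters as a mongodb document, assuming a local mongod on the default port; the database, collection and field names are made up for illustration:
var mongodb = require('mongodb');

mongodb.MongoClient.connect('mongodb://localhost:27017/simdb', function (err, db) {
    if (err) return console.log('fail:', err);
    db.collection('runs').insert({
        file: 'three_body_m1=0p1_m2=0p3.dat',        // where the raw columns live on disk
        parameters: { m1: 0.1, m2: 0.3, m3: 0.6,      // hypothetical parameter values
                      x0: [0, 0, 0], v0: [1, 0, 0] },
        date: new Date(),
        comment: 'three body, test of relative mass dependence'
    }, function (e, r) {
        db.close();
    });
});
Later you can query the runs collection by parameter values instead of decoding file names.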
Using all of them together could be even better:
Saving: stream the file to physical disk using socket.io-stream and fs; send the parameters to mongodb.
Loading: check redis to see whether the user is registered and whether the data is in the cache; if yes, take it from there; if not, stream it from physical disk and write some of it to redis at the same time (a sketch of this cache-aside pattern follows below).
Editing: if a cached copy exists, edit it and let another server-side process update the physical disk from that cache; if not, update the physical disk directly.
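A minimal sketch of the Loading step above, written against the classic callback-style redis module (v3 and earlier); the key names and file path are made up:
var fs = require('fs');
var redis = require('redis');
var cache = redis.createClient(); // assumes a redis server on localhost:6379

// try the cache first, fall back to the physical disk and warm the cache on a miss
function loadRun(key, filePath, callback) {
    cache.get(key, function (err, cached) {
        if (!err && cached !== null) return callback(null, cached);  // cache hit
        fs.readFile(filePath, 'utf8', function (err2, data) {        // cache miss
            if (err2) return callback(err2);
            cache.set(key, data, function () {});                    // warm the cache for next time
            callback(null, data);
        });
    });
}

loadRun('data:three_body_m1=0p1_m2=0p3', '/data/three_body_m1=0p1_m2=0p3.dat', function (err, data) {
    if (!err) console.log('got ' + data.length + ' bytes');
});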
The communication scheme could be:
the data server talks to the cache server to see whether there are any pending writes/reads/edits, and consumes jobs from there.
the compute server talks to the cache server to produce read/write/edit jobs or to consume compute jobs.
clients can talk to the cache server for reading only.
admins can also place their own data, produce compute jobs, or read results.
The compute server, data server and cache server can easily run on the same computer, or be moved to other computers thanks to nodejs and its countless modules such as redis, socket.io-stream, fs, ejs, express (for the clients, for example), etc.
A cache server can offload some data to another cache server and keep a redirection to it (or some mapping of the data to it).
A cache server can talk to N data servers and M compute servers at the same time, as long as the RAM holds.
You have a slow network? You can use node's zlib module to gzip the data on the fly with just 3-5 lines of extra code (at both ends).
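For instance, a sketch of what those extra lines could look like around the streaming example further down; filename, stream and outPath are placeholders taken from that example:
var zlib = require('zlib');

// sender side: compress before piping into the socket.io-stream stream
fs.createReadStream(filename).pipe(zlib.createGzip()).pipe(stream);

// receiver side: decompress before writing to disk
stream.pipe(zlib.createGunzip()).pipe(fs.createWriteStream(outPath));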
You don't have money?
Nodejs works on a Raspberry Pi (as a data server, maybe?).
An Nvidia GTX 660 can work with an Intel Galileo (as a compute server?) using nodejs plus some extra native modules for OpenCL (which could be hard to implement); connecting (and powering) the GPU from a Galileo may not be easy either, but it should be much faster than a cluster of Raspberry Pi boards for fp32 number crunching.
In that case, bypass the cache tier; RAM is expensive for now.
data server cluster              client   client
            \                       \     /
             \                       \   /
     mainframe cache and database server ----- compute cluster
              |                     \
              |                      \
     support cache server           admin
A very simple example that sends some files to another computer (or the same one):
var pipeline_n = 8;
var fs = require("fs");

// server part accepting files
{
    var io = require('socket.io').listen(80);
    var ss = require('socket.io-stream');
    var path = require('path');
    var ctr = 0;  // counts incoming streams (used to name the output files)
    var ctr2 = 0; // counts finished streams
    io.of('/user').on('connection', function (socket) {
        var z1 = new Date();
        for (var i = 0; i < pipeline_n; i++) {
            ss(socket).on('data' + i.toString(), function (stream, data) {
                var t1 = new Date();
                stream.pipe(fs.createWriteStream("m://bench_server" + ctr + ".txt"));
                ctr++;
                stream.on("finish", function (p) {
                    var len = stream._readableState.pipes.bytesWritten;
                    var t2 = new Date();
                    ctr2++;
                    if (ctr2 == pipeline_n) {
                        var z2 = new Date();
                        console.log(len * pipeline_n);
                        console.log((z2 - z1));
                        console.log("throughput: " + ((len * pipeline_n) / ((z2 - z1) / 1000.0)) / (1024 * 1024) + " MB/s");
                    }
                });
            });
        }
    });
}

// client (or another server) part sending a file
// (you can change it to send parts of the same file instead of the same file n times);
// this is just dummy file-sending code to stress the other server
for (var i = 0; i < pipeline_n; i++) {
    var io = require('socket.io-client');
    var ss = require('socket.io-stream');
    var socket = io.connect('http://127.0.0.1/user');
    var stream = ss.createStream();
    var filename = 'm://bench.txt'; // ramdrive or a cluster of hdd raid
    ss(socket).emit('data' + i.toString(), stream, { name: filename });
    fs.createReadStream(filename).pipe(stream);
}
Here is a test of insert vs. bulk-insert performance in mongodb (this may be the wrong way to benchmark, but it is simple; just uncomment the part you want to measure):
var mongodb = require('mongodb');
var client = mongodb.MongoClient;
var url = 'mongodb://localhost:2019/evdb2';
client.connect(url, function (err, db) {
    if (err) {
        console.log('fail:', err);
    } else {
        console.log('success:', url);
        var collection = db.collection('tablo');
        var bulk = collection.initializeUnorderedBulkOp();
        db.close(); // comment this line out when uncommenting one of the benchmarks below

        //// benchmark single inserts
        //var t = new Date();
        //var ctr = 0;
        //for (var i = 0; i < 1024 * 64; i++) {
        //    collection.insert({ x: i + 1, y: i, z: i * 10 }, function (e, r) {
        //        ctr++;
        //        if (ctr == 1024 * 64) {
        //            db.close();
        //            var t2 = new Date();
        //            console.log("insert-64k: " + 1000.0 / ((t2.getTime() - t.getTime()) / (1024 * 64)) + " insert/s");
        //        }
        //    });
        //}

        //// benchmark bulk insert
        //var t3 = new Date();
        //for (var i = 0; i < 1024 * 64; i++) {
        //    bulk.insert({ x: i + 1, y: i, z: i * 10 });
        //}
        //bulk.execute();
        //var t4 = new Date();
        //console.log("bulk-insert-64k: " + 1000.0 / ((t4.getTime() - t3.getTime()) / (1024 * 64)) + " insert/s");
        //db.close();
    }
});
Be sure to set up the mongodb and/or redis servers before running this, and install the necessary modules from the nodejs command prompt with "npm install module_name".
Related
How can I calculate the arithmetic mean of a large vector (series) in a distributed setting, where the data is partitioned across multiple nodes? I do not want to use the map-reduce paradigm. Is there any distributed algorithm to compute the mean efficiently, besides the trivial approach of computing the individual sum on each node, bringing the results to the master node and dividing by the size of the vector (series)?
Distributed average consensus is an alternative.
The problem with the trivial map-reduce approach with a master is that, if you have a vast set of data where in essence everything depends on everything else, it can take a very long time to compute the result, by which time the information is badly out of date and therefore wrong, unless you lock the entire dataset, which is impractical for a massive set of distributed data. Using distributed average consensus (the same methods work for algorithms other than the mean), you get a more up-to-date, better guess at the current value of the mean in real time, without locking the data.
Here is a link to a paper on it, but it is math-heavy:
http://web.stanford.edu/~boyd/papers/pdf/lms_consensus.pdf
You can google for many papers on it.
The general concept is like this: on each node you have a socket listener. You evaluate your local sum and average, then publish them to the other nodes. Each node listens for the other nodes and receives their sums and averages on a timescale that makes sense. You can then evaluate a good guess at the total average as sumOverAllNodes(storedAverage[node] * storedCount[node]) / sumOverAllNodes(storedCount[node]). If you have a truly large dataset, you can just listen for new values as they are stored on the node, amend the local count and average, and then publish them.
If even this is taking too long, you could average over a random subset of the data in each node.
Here is some C# code that gives you an idea (it uses Fleck so it runs on more versions of Windows than the Windows-10-only Microsoft websockets implementation). Run this on two nodes, one with
<appSettings>
<add key="thisNodeName" value="UK" />
</appSettings>
in the app.config, and use "EU-North" in the other. The two instances exchange means using websockets; you just need to add your own back-end enumeration of the database.
using System;
using System.Collections.Generic;
using System.Configuration;
using System.Linq;
using System.Net.WebSockets;
using System.Text;
using System.Threading;
using Fleck;
namespace WebSocketServer
{
class Program
{
static List<IWebSocketConnection> _allSockets;
static Dictionary<string,decimal> _allMeans;
static Dictionary<string,decimal> _allCounts;
private static decimal _localMean;
private static decimal _localCount;
private static decimal _localAggregate_count;
private static decimal _localAggregate_average;
static void Main(string[] args)
{
_allSockets = new List<IWebSocketConnection>();
_allMeans = new Dictionary<string, decimal>();
_allCounts = new Dictionary<string, decimal>();
var serverAddresses = new Dictionary<string,string>();
//serverAddresses.Add("USA-WestCoast", "ws://127.0.0.1:58951");
//serverAddresses.Add("USA-EastCoast", "ws://127.0.0.1:58952");
serverAddresses.Add("UK", "ws://127.0.0.1:58953");
serverAddresses.Add("EU-North", "ws://127.0.0.1:58954");
//serverAddresses.Add("EU-South", "ws://127.0.0.1:58955");
foreach (var serverAddress in serverAddresses)
{
_allMeans.Add(serverAddress.Key, 0m);
_allCounts.Add(serverAddress.Key, 0m);
}
var thisNodeName = ConfigurationManager.AppSettings["thisNodeName"]; // for example "UK"
var serverSocketAddress = serverAddresses.First(x=>x.Key==thisNodeName);
serverAddresses.Remove(thisNodeName);
var websocketServer = new Fleck.WebSocketServer(serverSocketAddress.Value);
websocketServer.Start(socket =>
{
socket.OnOpen = () =>
{
Console.WriteLine("Open!");
_allSockets.Add(socket);
};
socket.OnClose = () =>
{
Console.WriteLine("Close!");
_allSockets.Remove(socket);
};
socket.OnMessage = message =>
{
Console.WriteLine(message + " received");
var parameters = message.Split('~');
var remoteHost = parameters[0];
var remoteMean = decimal.Parse(parameters[1]);
var remoteCount = decimal.Parse(parameters[2]);
_allMeans[remoteHost] = remoteMean;
_allCounts[remoteHost] = remoteCount;
};
});
while (true)
{
//evaluate my local average and count
Random rand = new Random(DateTime.Now.Millisecond);
_localMean = 234.00m + (rand.Next(0, 100) - 50)/10.0m;
_localCount = 222m + rand.Next(0, 100);
//evaluate my local aggregate average using means and counts sent from all other nodes
//could publish aggregate averages to other nodes, if you wanted to monitor disagreement between nodes
var total_mean_times_count = 0m;
var total_count = 0m;
foreach (var server in serverAddresses)
{
total_mean_times_count += _allCounts[server.Key]*_allMeans[server.Key];
total_count += _allCounts[server.Key];
}
//add on local mean and count which were removed from the server list earlier, so won't be processed
total_mean_times_count += (_localMean * _localCount);
total_count = total_count + _localCount;
_localAggregate_average = (total_mean_times_count/total_count);
_localAggregate_count = total_count;
Console.WriteLine("local aggregate average = {0}", _localAggregate_average);
System.Threading.Thread.Sleep(10000);
foreach (var serverAddress in serverAddresses)
{
using (var wscli = new ClientWebSocket())
{
var tokSrc = new CancellationTokenSource();
using (var task = wscli.ConnectAsync(new Uri(serverAddress.Value), tokSrc.Token))
{
task.Wait();
}
using (var task = wscli.SendAsync(new ArraySegment<byte>(Encoding.UTF8.GetBytes(thisNodeName+"~"+_localMean + "~"+_localCount)),
WebSocketMessageType.Text,
false,
tokSrc.Token
))
{
task.Wait();
}
}
}
}
}
}
}
Don't forget to add a static lock or to separate the activity by synchronising at given times (not shown, for simplicity).
There are two simple approaches you can use.
One is, as you correctly noted, to calculate the sum on every node and then combine the sums and divide by the total amount of data:
avg = (sum1+sum2+sum3)/(cnt1+cnt2+cnt3)
Another possibility is to calculate the average on every node and then use weighted average:
avg = (avg1*cnt1 + avg2*cnt2 + avg3*cnt3) / (cnt1+cnt2+cnt3)
= avg1*cnt1/(cnt1+cnt2+cnt3) + avg2*cnt2/(cnt1+cnt2+cnt3) + avg3*cnt3/(cnt1+cnt2+cnt3)
I don't see anything wrong with these trivial ways and am wondering why you would want to use a different approach.
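As a tiny sketch (in JavaScript, with made-up per-node numbers), both combinations give the same result:
// per-node partial results, e.g. gathered over sockets (hypothetical numbers)
var partials = [
    { sum: 1050.0, avg: 10.5, cnt: 100 },
    { sum: 2440.0, avg: 12.2, cnt: 200 },
    { sum:  495.0, avg:  9.9, cnt:  50 }
];
var totalCnt = partials.reduce(function (a, p) { return a + p.cnt; }, 0);

// option 1: combine the sums, then divide by the total count
var avgFromSums = partials.reduce(function (a, p) { return a + p.sum; }, 0) / totalCnt;

// option 2: weighted average of the per-node averages
var avgWeighted = partials.reduce(function (a, p) { return a + p.avg * p.cnt; }, 0) / totalCnt;

console.log(avgFromSums, avgWeighted); // both print 11.3857...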
I am using SQL.js (SQLite) in my Chrome app. I am loading an external db file to perform queries, and now I want to save my changes to local storage to make them persistent. This is already described here:
https://github.com/kripken/sql.js/wiki/Persisting-a-Modified-Database
I am using the same approach as defined in the article:
function toBinString (arr) {
var uarr = new Uint8Array(arr);
var strings = [], chunksize = 0xffff;
// There is a maximum stack size. We cannot call String.fromCharCode with as many arguments as we want
for (var i=0; i*chunksize < uarr.length; i++){
strings.push(String.fromCharCode.apply(null, uarr.subarray(i*chunksize, (i+1)*chunksize)));
}
return strings.join('');
}
function toBinArray (str) {
var l = str.length,
arr = new Uint8Array(l);
for (var i=0; i<l; i++) arr[i] = str.charCodeAt(i);
return arr;
}
Save data to storage:
var data =toBinString(db.export());
chrome.storage.local.set({"localDB":data});
And to get data from storage:
chrome.storage.local.get('localDB', function(res) {
var data = toBinArray(res.localDB);
//sample example usage
db = new SQL.Database(data);
var result = db.exec("SELECT * FROM user");
});
Now when I make a query, I am getting this error:
Error: file is encrypted or is not a database
Are there any differences between storing values in chrome.storage and localStorage? Because it works fine using localStorage; see the working example here:
http://kripken.github.io/sql.js/examples/persistent.html
As the documentation suggests here:
https://developer.chrome.com/apps/storage
we don't need to use stringify and parse with the chrome.storage API, unlike with localStorage; we can save objects and arrays directly.
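For example (with a hypothetical key and a plain array; chrome.storage still requires values to be JSON-serializable), the difference I mean is:
var someArray = [1, 2, 3];

// localStorage stores strings only, so you stringify/parse yourself
localStorage.setItem('localDB', JSON.stringify(someArray));
var fromLocalStorage = JSON.parse(localStorage.getItem('localDB'));

// chrome.storage serializes JSON-compatible values for you
chrome.storage.local.set({ localDB: someArray }, function () {
    chrome.storage.local.get('localDB', function (res) {
        console.log(res.localDB); // [1, 2, 3]
    });
});
// note: a typed array such as the Uint8Array from db.export() is not a plain JSON value,
// which seems to be what triggers the serialization error described below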
When I try to save the result returned from db.export() without any conversion, I get this error:
Cannot serialize value to JSON
So please help me: what is the right approach to save the db export in Chrome's storage? Is there anything I am doing wrong?
I am uploading files using the following code:
using (var s = File.OpenRead(@"C:\2gbDataTest.zip"))
{
    var t = Task.Run<ObjectId>(() =>
    {
        return fs.UploadFromStreamAsync("2gbDataTest.zip", s);
    });
    return t.Result;
}

// works for files below 2 GB
var t1 = fs.DownloadAsBytesAsync(id);
Task.WaitAll(t1);
var bytes = t1.Result;
I am getting an error.
I am new to MongoDB and C#; can anyone please show me how to download files greater than 2 GB in size?
You are hitting the limit on how large an in-memory byte array download can be, so your only choice is to use a Stream instead, just as you do when uploading, something like this (with a valid destination):
IGridFSBucket fs;
ObjectId id;
FileStream destination;
await fs.DownloadToStreamAsync(id, destination);
// Just writing out the complete code for others; this works.
// Thanks to "Adam Comerford"
var fs = new GridFSBucket(database);
using (var newFs = new FileStream(filePathToDownload, FileMode.Create))
{
    // id is the file's ObjectId
    var t1 = fs.DownloadToStreamAsync(id, newFs);
    Task.WaitAll(t1);
    newFs.Flush();
    newFs.Close();
}
Gmail has an issue where conversation labels are not applied to new messages that arrive in the conversation thread (issue details here).
We found a Google Apps Script that fixes the labels on individual messages in the Gmail Inbox to address this issue. The script is as follows:
function relabeller() {
  var labels = GmailApp.getUserLabels();
  for (var i = 0; i < labels.length; i++) {
    Logger.log("label: " + i + " " + labels[i].getName());
    var threads = labels[i].getThreads(0, 100);
    for (var j = 1; threads.length > 0; j++) {
      Logger.log((j - 1) * 100 + threads.length);
      labels[i].addToThreads(threads);
      threads = labels[i].getThreads(j * 100, 100);
    }
  }
}
However, this script times out on mailboxes with more than 20,000 messages because of the 5-minute execution time limit on Google Apps Script.
Can anyone suggest a way to optimize this script so that it doesn't time out?
OK, I've been working on this for a few days because I was really frustrated with the strange way that Gmail labels/doesn't label messages in conversations.
I'm flabbergasted actually that labels aren't automatically applied to new messages in a conversation. This is not reflected at all in the Gmail UI. There's no way to look at a thread and determine that the labels only apply to some messages in the thread, and you cannot add labels to a single message in the UI. As I was working through my script below, I noticed that you can't even programmatically add labels to a single message. So there really is no reason for the current behavior.
With my rant out of the way, I have a few notes about the script.
I sort of combined Saqib's code with Serge's code.
The script has two parts: an initial run that relabels all threads that have a user label attached, and a maintenance run that labels recent emails (it currently looks back 4 days). Only one part executes during a single run. Once the initial run is completed, only the maintenance part will run. You can set a trigger to run it once per day, or more or less often, depending on your needs.
The initial run halts after 4 minutes to avoid being terminated by the 5 minute script time limit. It sets a trigger to run again after 4 minutes (both of these times can be changed using constants in the script). The trigger gets deleted at the next run.
There is no run-time check in the maintenance section. If you have lots of emails in the last 4 days, the maintenance section might hit the script time limit. I could probably change the script to be more efficient here, but so far it's worked for me so I am not really motivated to improve on it.
There's a try/catch statement in the initial run to try to catch the Gmail "write quota error" and exit gracefully (i.e. writing the current progress so it can be picked up again later), but I don't know if it works because I couldn't get the error to happen.
You'll get an email when the time limit is reached, and when the initial run is finished.
For some reason, the log doesn't always clear fully between runs, even when using the Logger.clear() command. So the status logs that it emails to the user have more than just the most recent run info. I don't know why this occurs.
I have used this to process 20,000 emails in around half an hour (including wait times). I actually ran it twice, so it processed 40,000 emails in one day. I guess the Gmail read/write limit of 10,000 isn't what is being applied here (maybe applying a label to 100 threads at a time counts as a single write event instead of 100?). It gets through about 5,000 threads in a 4 minute run, according to the status email it sends.
Sorry for the long lines. I blame the widescreen monitors. Let me know what you think!
function relabelGmail() {
var startTime= (new Date()).getTime(); // Time at start of script
var BATCH=100; // total number of threads to apply label to at once.
var LOOKBACKDAYS=4; // Days to look back for maintenance section of script. Should be at least 2
var MAX_RUN_TIME=4*60*1000; // Time in ms for max execution. 4 minutes is a good start.
var WAIT_TIME=4*60*1000; // Time in ms to wait before starting the script again.
Logger.clear();
// ScriptProperties.deleteAllProperties(); return; // Uncomment this line and run once to start over completely
if(ScriptProperties.getKeys().length==0){ // this is to create keys on the first run
ScriptProperties.setProperties({'itemsProcessed':0, 'initFinished':false, 'lastrun':'20000101', 'itemsProcessedToday':0,
'currentLabel':'null-label-NOTREAL', 'currentLabelStart':0, 'autoTrig':0, 'autoTrigID':'0'});
}
var itemsP = Number(ScriptProperties.getProperty('itemsProcessed')); // total counter
var initTemp = ScriptProperties.getProperty('initFinished'); // keeps track of when initial run is finished.
var initF = (initTemp.toLowerCase() == 'true'); // Make it boolean
var lastR = ScriptProperties.getProperty('lastrun'); // String of date corresponding to itemsProcessedToday in format yyyymmdd
var itemsPT = Number(ScriptProperties.getProperty('itemsProcessedToday')); // daily counter
var currentL = ScriptProperties.getProperty('currentLabel'); // Label currently being processed
var currentLS = Number(ScriptProperties.getProperty('currentLabelStart')); // Thread number to start on
var autoT = Number(ScriptProperties.getProperty('autoTrig')); // Number to say whether the last run made an automatic trigger
var autoTID = ScriptProperties.getProperty('autoTrigID'); // Unique ID of last written auto trigger
// First thing: google terminates scripts after 5 minutes.
// If 4 minutes have passed, this script will terminate, write some data,
// and create a trigger to re-schedule itself to start again in a few minutes.
// If an auto trigger was created last run, it is deleted here.
if (autoT) {
var allTriggers = ScriptApp.getProjectTriggers();
// Loop over all triggers. If trigger isn't found, then it must have ben deleted.
for(var i=0; i < allTriggers.length; i++) {
if (allTriggers[i].getUniqueId() == autoTID) {
// Found the trigger and now delete it
ScriptApp.deleteTrigger(allTriggers[i]);
break;
}
}
autoT = 0;
autoTID = '0';
}
var today = dateToStr_();
if (today != lastR) { // New day, so reset the daily counter
itemsPT = 0;
}
if (!initF) { // Don't do any of this if the initial run has been completed
var labels = GmailApp.getUserLabels();
// Find position of last label attempted
var curLnum=0;
for ( ; curLnum < labels.length; curLnum++) {
if (labels[curLnum].getName() == currentL) {break};
}
if (curLnum == labels.length) { // If label isn't found, start over at the beginning
curLnum = 0;
currentLS = 0;
itemsP=0;
currentL=labels[0].getName();
}
// Now start working through the labels until the quota is hit.
// Use a try/catch to stop execution if your quota has been hit.
// Google can actually automatically email you, but we need to clean up a bit before terminating the script so it can properly pick up again tomorrow.
try {
for (var i = curLnum; i < labels.length; i++) {
currentL = labels[i].getName(); // Next label
Logger.log('label: ' + i + ' ' + currentL);
var threads = labels[i].getThreads(currentLS,BATCH);
for (var j = Math.floor(currentLS/BATCH); threads.length > 0; j++) {
var currTime = (new Date()).getTime();
if (currTime-startTime > MAX_RUN_TIME) {
// Make the auto-trigger
autoT = 1; // So the auto trigger gets deleted next time.
var autoTrigger = ScriptApp.newTrigger('relabelGmail')
.timeBased()
.at(new Date(currTime+WAIT_TIME))
.create();
autoTID = autoTrigger.getUniqueId();
// Now write all the values.
ScriptProperties.setProperties({'itemsProcessed':itemsP, 'initFinished':initF, 'lastrun':today, 'itemsProcessedToday':itemsPT,
'currentLabel':currentL, 'currentLabelStart':currentLS, 'autoTrig':autoT, 'autoTrigID':autoTID});
// Send an email
var emailAddress = Session.getActiveUser().getEmail();
GmailApp.sendEmail(emailAddress, 'Relabel job in progress', 'Your Gmail Relabeller has halted to avoid termination due to excess ' +
'run time. It will run again in ' + WAIT_TIME/1000/60 + ' minutes.\n\n' + itemsP + ' threads have been processed. ' + itemsPT +
' have been processed today.\n\nSee the log below for more information:\n\n' + Logger.getLog());
return;
} else {
// keep on going
var len = threads.length;
Logger.log( j * BATCH + len);
labels[i].addToThreads(threads);
currentLS = currentLS + len;
itemsP = itemsP + len;
itemsPT = itemsPT + len;
threads = labels[i].getThreads( (j+1) * BATCH, BATCH);
}
}
currentLS = 0; // Reset LS counter
}
initF = true; // Initial run is done
} catch (e) { // Clean up and send off a notice.
// Write current values back to ScriptProperties
ScriptProperties.setProperties({'itemsProcessed':itemsP, 'initFinished':initF, 'lastrun':today, 'itemsProcessedToday':itemsPT,
'currentLabel':currentL, 'currentLabelStart':currentLS, 'autoTrig':autoT, 'autoTrigID':autoTID});
var emailAddress = Session.getActiveUser().getEmail();
var errorDate = new Date();
GmailApp.sendEmail(emailAddress, 'Error "' + e.name + '" in Google Apps Script', 'Your Gmail Relabeller has failed in the following stack:\n\n' +
e.stack + '\nThis may be due to reaching your daily Gmail read/write quota. \nThe error message is: ' +
e.message + '\nThe error occurred at the following date and time: ' + errorDate + '\n\nThus far, ' +
itemsP + ' threads have been processed. ' + itemsPT + ' have been processed today. \nSee the log below for more information:' +
'\n\n' + Logger.getLog());
return;
}
// Write current values back to ScriptProperties. Send completion email.
ScriptProperties.setProperties({'itemsProcessed':itemsP, 'initFinished':initF, 'lastrun':today, 'itemsProcessedToday':itemsPT,
'currentLabel':currentL, 'currentLabelStart':currentLS, 'autoTrig':autoT, 'autoTrigID':autoTID});
var emailAddress = Session.getActiveUser().getEmail();
GmailApp.sendEmail(emailAddress, 'Relabel job completed', 'Your Gmail Relabeller has finished its initial run.\n' +
'If you continue to run the script, it will skip the initial run and instead relabel ' +
'all emails from the previous ' + LOOKBACKDAYS + ' days.\n\n' + itemsP + ' threads were processed. ' + itemsPT +
' were processed today. \nSee the log below for more information:' + '\n\n' + Logger.getLog());
return; // Don't run the maintenance section after initial run finish
} // End initial run section statement
// Below is the 'maintenance' section that will be run when the initial run is finished. It finds all new threads
// (as defined by LOOKBACKDAYS) and applies any existing labels to all messages in each thread. Note that this
// won't miss older threads that are labeled by the user because all messages in a thread get the label
// when the label action is first performed. If another message is then sent or received in that thread,
// then this maintenance section will find it because it will be deemed a "new" thread at that point.
// You may need to search further back the first time you run this if it took more than 3 days to finish
// the initial run. For general maintenance, though, 4 days should be plenty.
// Note that I have not implemented a script-run-time check for this section.
var threads = GmailApp.search('newer_than:' + LOOKBACKDAYS + 'd', 0, BATCH); //
var len = threads.length;
for (var i=0; len > 0; i++) {
for (var t = 0; t < len; t++) {
var labels = threads[t].getLabels();
for (var l = 0; l < labels.length; l++) { // Add each label to the thread
labels[l].addToThread(threads[t]);
}
}
itemsP = itemsP + len;
itemsPT = itemsPT + len;
threads = GmailApp.search('newer_than:' + LOOKBACKDAYS + 'd', (i+1) * BATCH, BATCH);
len = threads.length;
}
// Write the property data
ScriptProperties.setProperties({'itemsProcessed':itemsP, 'initFinished':initF, 'lastrun':today, 'itemsProcessedToday':itemsPT,
'currentLabel':currentL, 'currentLabelStart':currentLS, 'autoTrig':autoT, 'autoTrigID':autoTID});
}
// Takes a date object and turns it into a string of form yyyymmdd
function dateToStr_(dateObj) { //takes in a date object, but uses current date if not a date
if (!(dateObj instanceof Date)) {
dateObj = new Date();
}
var dd = dateObj.getDate();
var mm = dateObj.getMonth()+1; //January is 0!
var yyyy = dateObj.getFullYear();
if(dd<10){dd='0'+dd};
if(mm<10){mm='0'+mm};
var dateStr = '' + yyyy + mm + dd;
return dateStr;
}
Edit: 3/24/2017
I guess I should turn on notifications or something, because I never saw the question from user29020. In case anyone ever has the same question, here's what I do: I run it as a maintenance function by setting a daily trigger to run each night between 1 and 2 AM.
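For reference, here is a one-off sketch of how such a nightly trigger can be installed (it assumes the relabelGmail function above; the hour is interpreted in the script's time zone):
function createNightlyTrigger() {
  // run relabelGmail once a day, in the 1-2 AM window
  ScriptApp.newTrigger('relabelGmail')
      .timeBased()
      .everyDays(1)
      .atHour(1)
      .create();
}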
An additional note: It seems that at some point in the last year or so, labeling calls to Gmail have slowed down significantly. It now takes around 0.2 seconds per thread, so I would expect an initial run of 20k emails to take at least 20 runs or so before it makes it all the way through. This also means that if you typically receive more than 100-200 emails a day, the maintenance section might also start to take too long and start to fail. Now that's a lot of emails, but I bet there are some people that receive that many, and it seems much more likely that you would hit that than the 1000 or so daily emails that would have been needed for failure back when I first wrote the script.
Anyway, one mitigation would be to reduce the LOOKBACKDAYS to less than 4, but I wouldn't recommend putting it less than 2.
From the documentation:
method getInboxThreads()
Retrieve all Inbox threads irrespective of labels
This call will fail when the size of all threads is too large for the system to handle. Where the thread size is unknown, and potentially very large, please use the 'paged' call, and specify ranges of the threads to retrieve in each call.*
So you should handle a certain number of threads, label the messages and set up a time trigger to run each "page" every 10 minutes or so until all the messages are labelled.
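For example, a minimal sketch of installing such a recurring trigger (it assumes the inboxLabeller function from the draft below):
function createPagedTrigger() {
  // run the paged labeller every 10 minutes until all messages are labelled
  ScriptApp.newTrigger('inboxLabeller')
      .timeBased()
      .everyMinutes(10)
      .create();
}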
EDIT: I have given this a try; please consider it a draft to start with:
The script will process 100 threads at a time and send you an email to inform you of its progress and show the log.
When it's finished it will notify you with an email as well. It uses ScriptProperties to store its state (don't forget to update the mail address at the end of the script). I tried it with a time trigger set to 5 minutes and it seems to run smoothly for now...
function inboxLabeller() {
  if (ScriptProperties.getKeys().length == 0) { // this is to create keys on the first run
    ScriptProperties.setProperties({'threadStart':0, 'itemsprocessed':0, 'notF':true});
  }
  var items = Number(ScriptProperties.getProperty('itemsprocessed')); // total counter
  var tStart = Number(ScriptProperties.getProperty('threadStart'));   // the value to start with
  var notFinished = ScriptProperties.getProperty('notF');             // the "main switch" ;-)
  Logger.clear();
  while (notFinished) { // the main loop
    var threads = GmailApp.getInboxThreads(tStart, 100);
    Logger.log('Number of threads=' + Number(tStart + threads.length));
    if (threads.length == 0) {
      notFinished = false;
      break;
    }
    for (t = 0; t < threads.length; ++t) {
      var mCount = threads[t].getMessageCount();
      var mSubject = threads[t].getFirstMessageSubject();
      var labels = threads[t].getLabels();
      var labelsNames = '';
      for (var l in labels) { labelsNames += labels[l].getName(); }
      Logger.log('subject ' + mSubject + ' has ' + mCount + ' msgs with labels ' + labelsNames);
      for (var l in labels) {
        labels[l].addToThread(threads[t]);
      }
    }
    tStart = tStart + 100;
    items = items + 100;
    ScriptProperties.setProperties({'threadStart':tStart, 'itemsprocessed':items});
    break;
  }
  if (notFinished) {
    GmailApp.sendEmail('mymail', 'inboxLabeller progress report', 'Still working, ' + items + ' processed \n - see logger below \n \n' + Logger.getLog());
  } else {
    GmailApp.sendEmail('mymail', 'inboxLabeller End report', 'Job completed : ' + items + ' processed');
    ScriptProperties.setProperties({'threadStart':0, 'itemsprocessed':0, 'notF':true});
  }
}
This will find individual messages that do not have a label and apply the label of the associated thread. It takes much less time because it's not relabeling every single message.
function label_unlabeled_messages() {
  var unlabeled = GmailApp.search("has:nouserlabels -label:inbox -label:sent -label:chats -label:draft -label:spam -label:trash");
  for (var i = 0; i < unlabeled.length; i++) {
    Logger.log("thread: " + i + " " + unlabeled[i].getFirstMessageSubject());
    var labels = unlabeled[i].getLabels();
    for (var j = 0; j < labels.length; j++) {
      Logger.log("labels: " + i + " " + labels[j].getName());
      labels[j].addToThread(unlabeled[i]);
    }
  }
}
I am running an import that will have thousands of records on each run. I am just looking for confirmation of my assumptions:
Which of these makes the most sense:
Run SaveChanges() every AddToClassName() call.
Run SaveChanges() every n number of AddToClassName() calls.
Run SaveChanges() after all of the AddToClassName() calls.
The first option is probably slow, right? Since it will need to analyze the EF objects in memory, generate SQL, etc.
I assume that the second option is the best of both worlds, since we can wrap a try/catch around that SaveChanges() call and only lose n records at a time if one of them fails. Maybe store each batch in a List<>. If the SaveChanges() call succeeds, discard the list; if it fails, log the items.
The last option would probably end up being very slow as well, since every single EF object would have to be in memory until SaveChanges() is called. And if the save failed nothing would be committed, right?
I would test it first to be sure. Performance doesn't have to be that bad.
If you need to enter all rows in one transaction, call it after all of the AddToClassName calls. If rows can be entered independently, save changes after every row. Database consistency is important.
I don't like the second option. It would be confusing for me (from the final user's perspective) if I made an import into the system and it declined 10 rows out of 1000 just because 1 was bad. You can try to import the batch of 10 and, if it fails, retry one by one and then log the failures.
Test whether it takes a long time. Don't write 'probably' - you don't know it yet. Only when it is actually a problem should you think about another solution (as marc_s suggests).
EDIT
I've done some tests (time in milliseconds):
10000 rows:
SaveChanges() after 1 row: 18510,534
SaveChanges() after 100 rows: 4350,3075
SaveChanges() after 10000 rows: 5233,0635
50000 rows:
SaveChanges() after 1 row:78496,929
SaveChanges() after 500 rows:22302,2835
SaveChanges() after 50000 rows:24022,8765
So it is actually faster to commit after n rows than after all.
My recommendation is to:
SaveChanges() after n rows.
If one commit fails, try it one by one to find faulty row.
Test classes:
TABLE:
CREATE TABLE [dbo].[TestTable](
[ID] [int] IDENTITY(1,1) NOT NULL,
[SomeInt] [int] NOT NULL,
[SomeVarchar] [varchar](100) NOT NULL,
[SomeOtherVarchar] [varchar](50) NOT NULL,
[SomeOtherInt] [int] NULL,
CONSTRAINT [PkTestTable] PRIMARY KEY CLUSTERED
(
[ID] ASC
)WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]
) ON [PRIMARY]
Class:
public class TestController : Controller
{
//
// GET: /Test/
private readonly Random _rng = new Random();
private const string _chars = "ABCDEFGHIJKLMNOPQRSTUVWXYZ";
private string RandomString(int size)
{
var randomSize = _rng.Next(size);
char[] buffer = new char[randomSize];
for (int i = 0; i < randomSize; i++)
{
buffer[i] = _chars[_rng.Next(_chars.Length)];
}
return new string(buffer);
}
public ActionResult EFPerformance()
{
string result = "";
TruncateTable();
result = result + "SaveChanges() after 1 row:" + EFPerformanceTest(10000, 1).TotalMilliseconds + "<br/>";
TruncateTable();
result = result + "SaveChanges() after 100 rows:" + EFPerformanceTest(10000, 100).TotalMilliseconds + "<br/>";
TruncateTable();
result = result + "SaveChanges() after 10000 rows:" + EFPerformanceTest(10000, 10000).TotalMilliseconds + "<br/>";
TruncateTable();
result = result + "SaveChanges() after 1 row:" + EFPerformanceTest(50000, 1).TotalMilliseconds + "<br/>";
TruncateTable();
result = result + "SaveChanges() after 500 rows:" + EFPerformanceTest(50000, 500).TotalMilliseconds + "<br/>";
TruncateTable();
result = result + "SaveChanges() after 50000 rows:" + EFPerformanceTest(50000, 50000).TotalMilliseconds + "<br/>";
TruncateTable();
return Content(result);
}
private void TruncateTable()
{
using (var context = new CamelTrapEntities())
{
var connection = ((EntityConnection)context.Connection).StoreConnection;
connection.Open();
var command = connection.CreateCommand();
command.CommandText = @"TRUNCATE TABLE TestTable";
command.ExecuteNonQuery();
}
}
private TimeSpan EFPerformanceTest(int noOfRows, int commitAfterRows)
{
var startDate = DateTime.Now;
using (var context = new CamelTrapEntities())
{
for (int i = 1; i <= noOfRows; ++i)
{
var testItem = new TestTable();
testItem.SomeVarchar = RandomString(100);
testItem.SomeOtherVarchar = RandomString(50);
testItem.SomeInt = _rng.Next(10000);
testItem.SomeOtherInt = _rng.Next(200000);
context.AddToTestTable(testItem);
if (i % commitAfterRows == 0) context.SaveChanges();
}
}
var endDate = DateTime.Now;
return endDate.Subtract(startDate);
}
}
I just optimized a very similar problem in my own code and would like to point out an optimization that worked for me.
I found that much of the time in processing SaveChanges, whether processing 100 or 1000 records at once, is CPU bound. So, by processing the contexts with a producer/consumer pattern (implemented with BlockingCollection), I was able to make much better use of CPU cores and got from a total of 4000 changes/second (as reported by the return value of SaveChanges) to over 14,000 changes/second. CPU utilization moved from about 13 % (I have 8 cores) to about 60%. Even using multiple consumer threads, I barely taxed the (very fast) disk IO system and CPU utilization of SQL Server was no higher than 15%.
By offloading the saving to multiple threads, you have the ability to tune both the number of records prior to commit and the number of threads performing the commit operations.
I found that creating 1 producer thread and (# of CPU Cores)-1 consumer threads allowed me to tune the number of records committed per batch such that the count of items in the BlockingCollection fluctuated between 0 and 1 (after a consumer thread took one item). That way, there was just enough work for the consuming threads to work optimally.
This scenario of course requires creating a new context for every batch, which I find to be faster even in a single-threaded scenario for my use case.
If you need to import thousands of records, I'd use something like SqlBulkCopy, and not the Entity Framework for that.
MSDN docs on SqlBulkCopy
Use SqlBulkCopy to Quickly Load Data from your Client to SQL Server
Transferring Data Using SqlBulkCopy
Use a stored procedure.
Create a User-Defined Data Type in Sql Server.
Create and populate an array of this type in your code (very fast).
Pass the array to your stored procedure with one call (very fast).
I believe this would be the easiest and fastest way to do this.
Sorry, I know this thread is old, but I think this could help other people with this problem.
I had the same problem, but there is a possibility to validate the changes before you commit them. My code looks like this and it works fine. With chUser.LastUpdated I check whether it is a new entry or only a change, because it is not possible to reload an entry that is not in the database yet.
// Validate Changes
var invalidChanges = _userDatabase.GetValidationErrors();
foreach (var ch in invalidChanges)
{
// Delete invalid User or Change
var chUser = (db_User) ch.Entry.Entity;
if (chUser.LastUpdated == null)
{
// Invalid, new User
_userDatabase.db_User.Remove(chUser);
Console.WriteLine("!Failed to create User: " + chUser.ContactUniqKey);
}
else
{
// Invalid Change of an Entry
_userDatabase.Entry(chUser).Reload();
Console.WriteLine("!Failed to update User: " + chUser.ContactUniqKey);
}
}
_userDatabase.SaveChanges();