Meteor: uploading file from client to Mongo collection vs file system vs GridFS

Meteor is great, but it lacks native support for traditional file uploading. There are several options to handle file uploading:
From the client, data can be sent using:
Meteor.call('saveFile',data) or collection.insert({file:data})
'POST' form or HTTP.call('POST')
On the server, the file can be saved to:
a mongodb file collection by collection.insert({file:data})
file system in /path/to/dir
mongodb GridFS
What are the pros and cons of these methods, and how are they best implemented? I am aware that there are also other options, such as saving to a third-party site and obtaining a URL.

You can achieve file uploading with Meteor without using any additional packages or a third-party service.
Option 1: DDP, saving file to a mongo collection
/*** client.js ***/
// assign a change event to the input tag
'change input': function(event, template){
    var file = event.target.files[0]; // assuming 1 file only
    if (!file) return;
    var reader = new FileReader(); // create a reader according to the HTML5 File API
    reader.onload = function(event){
        var buffer = new Uint8Array(reader.result); // convert to binary
        Meteor.call('saveFile', buffer);
    };
    reader.readAsArrayBuffer(file); // read the file as an arraybuffer
}
/*** server.js ***/
Files = new Mongo.Collection('files');
Meteor.methods({
    'saveFile': function(buffer){
        Files.insert({data: buffer});
    }
});
Explanation
First, the file is grabbed from the input using the HTML5 File API. A reader is created with new FileReader, and the file is read via readAsArrayBuffer. If you console.log this ArrayBuffer it shows as {}, and DDP can't send it over the wire as-is, so it has to be converted to a Uint8Array.
When you pass this to Meteor.call, Meteor automatically runs EJSON.stringify(Uint8Array) and sends it over DDP. You can check the data in the Chrome console's WebSocket traffic; you will see a string resembling base64.
On the server side, Meteor calls EJSON.parse() and converts it back to a binary buffer.
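For completeness, here is a minimal sketch of reading the file back on the client and turning it into a Blob, assuming the Files collection is also declared on the client and is published/subscribed; the downloadFile helper, the hard-coded name and the 'application/octet-stream' type are illustrative assumptions, not part of the original answer:
/*** client.js (hypothetical download helper) ***/
Files = new Mongo.Collection('files'); // must also be declared on the client (or in shared code)

function downloadFile(fileId){
    var doc = Files.findOne(fileId);
    if (!doc) return;
    // doc.data comes back as a Uint8Array after EJSON deserialisation
    var blob = new Blob([doc.data], {type: 'application/octet-stream'});
    var url = window.URL.createObjectURL(blob);
    var a = document.createElement('a');
    a.href = url;
    a.download = 'filename'; // use a real name if you store one alongside the data
    a.click();
    window.URL.revokeObjectURL(url);
}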
Pros
Simple, no hacky way, no extra packages
Stick to the Data on the Wire principle
Cons
More bandwidth: the resulting base64 string is ~ 33% larger than the original file
File size limit: can't send big files (limit ~ 16 MB?)
No caching
No gzip or compression yet
Takes up lots of memory if you publish files
Option 2: XHR, post from client to file system
/*** client.js ***/
// assign a change event to the input tag
'change input': function(event, template){
    var file = event.target.files[0];
    if (!file) return;
    var xhr = new XMLHttpRequest();
    xhr.open('POST', '/uploadSomeWhere', true);
    xhr.onload = function(event){...};
    xhr.send(file);
}
/*** server.js ***/
var fs = Npm.require('fs');
// using the internal webapp or iron:router
WebApp.connectHandlers.use('/uploadSomeWhere', function(req, res){
    //var start = Date.now()
    var file = fs.createWriteStream('/path/to/dir/filename');
    file.on('error', function(error){...});
    file.on('finish', function(){
        res.writeHead(...);
        res.end(); // end the response
        //console.log('Finish uploading, time taken: ' + (Date.now() - start));
    });
    req.pipe(file); // pipe the request to the file
});
Explanation
The file is grabbed on the client, an XHR object is created, and the file is sent via 'POST' to the server.
On the server, the data is piped into the underlying file system. You can additionally determine the filename, perform sanitisation, or check whether it already exists, etc., before saving, as in the sketch that follows.
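A minimal sketch of that sanitisation step, under a few assumptions that are not in the original answer: the client appends a ?name= query parameter, the target directory is /path/to/dir, and the status codes are just one reasonable choice.
/*** server.js (hypothetical filename handling) ***/
var fs = Npm.require('fs');
var path = Npm.require('path');
var url = Npm.require('url');

WebApp.connectHandlers.use('/uploadSomeWhere', function(req, res){
    // hypothetical: the client appends ?name=<original filename> to the upload URL
    var query = url.parse(req.url, true).query;
    var raw = query.name || 'upload.bin';
    // keep only the base name and drop anything that isn't a safe character
    var safe = path.basename(raw).replace(/[^a-zA-Z0-9._-]/g, '_');
    var target = path.join('/path/to/dir', safe);
    if (fs.existsSync(target)){
        res.writeHead(409); // refuse to overwrite an existing file
        return res.end();
    }
    var file = fs.createWriteStream(target);
    file.on('error', function(error){ res.writeHead(500); res.end(); });
    file.on('finish', function(){ res.writeHead(200); res.end(); });
    req.pipe(file);
});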
Pros
Takes advantage of XHR 2, so you can send an ArrayBuffer; no new FileReader() is needed, unlike in option 1
An ArrayBuffer is less bulky than a base64 string
No size limit: I sent a ~200 MB file on localhost with no problem
The file system is faster than mongodb (more on this in the benchmark below)
Cachable and gzip
Cons
XHR 2 is not available in older browsers (e.g. below IE10), but of course you can implement a traditional <form> POST instead. I used xhr = new XMLHttpRequest() rather than HTTP.call('POST') because the current HTTP.call in Meteor is not yet able to send an ArrayBuffer (point me to it if I am wrong).
/path/to/dir/ has to be outside the Meteor project, otherwise writing a file to /public triggers a reload
Option 3: XHR, save to GridFS
/*** client.js ***/
//same as option 2
/*** version A: server.js ***/
var db = MongoInternals.defaultRemoteCollectionDriver().mongo.db;
var GridStore = MongoInternals.NpmModule.GridStore;
WebApp.connectHandlers.use('/uploadSomeWhere', function(req, res){
    //var start = Date.now()
    var file = new GridStore(db, 'filename', 'w');
    file.open(function(error, gs){
        file.stream(true); // true will close the file automatically once piping finishes
        file.on('error', function(e){...});
        file.on('end', function(){
            res.end(); // send end of response
            //console.log('Finish uploading, time taken: ' + (Date.now() - start));
        });
        req.pipe(file);
    });
});
/*** version B: server.js ***/
var db = MongoInternals.defaultRemoteCollectionDriver().mongo.db;
var GridStore = Npm.require('mongodb').GridStore; // also need to add Npm.depends({mongodb:'2.0.13'}) in package.js
WebApp.connectHandlers.use('/uploadSomeWhere', function(req, res){
    //var start = Date.now()
    var file = new GridStore(db, 'filename', 'w').stream(true); // start the stream
    file.on('error', function(e){...});
    file.on('end', function(){
        res.end(); // send end of response
        //console.log('Finish uploading, time taken: ' + (Date.now() - start));
    });
    req.pipe(file);
});
Explanation
The client script is the same as in option 2.
According to the last line of mongo_driver.js in Meteor 1.0.x, a global object called MongoInternals is exposed. You can call defaultRemoteCollectionDriver() to get the current database db object, which is required for the GridStore. In version A, the GridStore is also exposed by MongoInternals. The mongo driver used by current Meteor is v1.4.x.
Then, inside a route, you can create a new write object by calling var file = new GridStore(...) (API). You then open the file and create a stream.
I also included a version B. In this version, the GridStore is obtained from a newer mongodb driver via Npm.require('mongodb'); this driver is the latest, v2.0.13, as of this writing. The new API doesn't require you to open the file; you can call stream(true) directly and start piping.
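For reading the file back out, here is a minimal sketch of a download route along the same lines; the /downloadSomeWhere route, the hard-coded 'filename' and the content type are assumptions, and it follows the version A GridStore API shown above:
/*** server.js (hypothetical download route, version A API) ***/
var db = MongoInternals.defaultRemoteCollectionDriver().mongo.db;
var GridStore = MongoInternals.NpmModule.GridStore;
WebApp.connectHandlers.use('/downloadSomeWhere', function(req, res){
    var file = new GridStore(db, 'filename', 'r');
    file.open(function(error, gs){
        if (error){ res.writeHead(404); return res.end(); }
        res.writeHead(200, {'Content-Type': 'application/octet-stream'});
        gs.stream(true).pipe(res); // true closes the GridStore once the read stream ends
    });
});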
Pros
Same as option 2: sent as an ArrayBuffer, with less overhead than the base64 string in option 1
No need to worry about file name sanitisation
Separation from the file system: no need to write to a temp dir, and the db can be backed up, replicated, sharded, etc.
No need to implement any other package
Cachable and can be gzipped
Stores much larger sizes than a normal mongo collection
Using pipe reduces memory load
Cons
Unstable Mongo GridFS. I included version A (mongo 1.x) and version B (mongo 2.x). In version A, when piping large files (> 10 MB), I got lots of errors, including corrupted files and unfinished pipes. This problem is solved in version B using mongo 2.x; hopefully Meteor will upgrade to mongodb 2.x soon.
API confusion. In version A, you need to open the file before you can stream, but in version B, you can stream without calling open. The API doc is also not very clear, and the stream is not 100% syntax-interchangeable with Npm.require('fs'): in fs, you listen for file.on('finish'), but in GridFS you listen for file.on('end') when writing finishes.
GridFS doesn't provide write atomicity, so if there are multiple concurrent writes to the same file, the final result may be very different.
Speed. Mongo GridFS is much slower than the file system.
Benchmark
As you can see in options 2 and 3, I included var start = Date.now(), and when writing ends, I console.log the elapsed time in ms. Below are the results, on a dual-core machine with 4 GB RAM, an HDD, and Ubuntu 14.04.
file size    GridFS (ms)    FS (ms)
100 KB       50             2
1 MB         400            30
10 MB        3500           100
200 MB       80000          1240
You can see that FS is much faster than GridFS. For a 200 MB file, it takes ~80 s using GridFS but only ~1 s with FS. I haven't tried an SSD; the result may be different. However, in real life, the bandwidth may dictate how fast the file streams from client to server; a 200 MB/s transfer speed is not typical, whereas a transfer speed of ~2 MB/s (GridFS) is more the norm.
Conclusion
By no means is this comprehensive, but you can decide which option is best for your needs.
DDP is the simplest and sticks to the core Meteor principle, but the data is bulkier, not compressible during transfer, and not cachable. Still, this option may be good if you only need small files.
XHR coupled with the file system is the 'traditional' way. Stable API, fast, 'streamable', compressible, and cachable (ETag etc.), but the files need to live in a separate folder.
XHR coupled with GridFS gives you the benefit of replica sets, scalability, no touching of the file system dir, and support for large files (and many files, if the file system restricts their number); it is also cachable and compressible. However, the API is unstable, you get errors with multiple writes, and it's s..l..o..w..
Hopefully Meteor DDP will soon support gzip, caching, etc., and GridFS will get faster...

Hi, just to add on to Option 1 regarding viewing of the file. I did it without EJSON.
<template name='tryUpload'>
    <p>Choose file to upload</p>
    <input name="upload" class='fileupload' type='file'>
</template>
Template.tryUpload.events({
    'change .fileupload': function(event, template){
        console.log('change & view');
        var f = event.target.files[0]; // assuming upload of 1 file only
        if (!f) return;
        var r = new FileReader();
        r.onload = function(event){
            var buffer = new Uint8Array(r.result); // convert to binary
            // the Uint8Array already holds the bytes, so it can be viewed directly;
            // note that String.fromCharCode.apply can overflow the stack for large files
            var toString = String.fromCharCode.apply(null, buffer);
            console.log(toString);
            //Meteor.call('saveFiles',buffer);
        };
        r.readAsArrayBuffer(f);
    }
});

Related

Azure Data Lake HDFS upload file size limit

Does anyone know the maximum file size for uploading a file via the Azure HDFS REST API? (https://learn.microsoft.com/en-us/azure/data-lake-store/data-lake-store-data-operations-rest-api)
I have found 256 MB in some places and 32 MB in others, so I am wondering.
Or are there similar limits for other SDKs?
I was wrestling with the same problem some months ago, and it turned out that the IIS in front of ADLS sets maxAllowedContentLength to its default value of 30000000 bytes (about 28.6 MB). This essentially means that whenever we try to push anything bigger than ~30 MB, the request never reaches ADL, because IIS throws 404.13 before that. Reference.
As already suggested in the links, ADLS has a driver with a 4 MB buffer. I'm using the .NET SDK myself, and the following code has served me well:
public async Task AddFile(byte[] content, string path)
{
    const int fourMb = 4 * 1024 * 1024;
    var buffer = new byte[fourMb];
    using (var stream = new MemoryStream(content))
    {
        if (!_adlsFileSystemClient.FileSystem.PathExists(_account, path))
        {
            _adlsFileSystemClient.FileSystem.Create(_account, path);
        }
        int bytesToRead;
        while ((bytesToRead = stream.Read(buffer, 0, buffer.Length)) > 0)
        {
            if (bytesToRead < fourMb)
            {
                Array.Resize(ref buffer, bytesToRead);
            }
            using (var s = new MemoryStream(buffer))
            {
                await _adlsFileSystemClient.FileSystem.AppendAsync(_account, path, s);
            }
            //skipped for brevity
In my tests, I am finding a maximum file size limit somewhere between 28MB and 30MB.
Using the Azure Data Lake Storage REST API, I have had no issues creating files as large as 28MB. However, when I try to create a file that is 30MB, I receive a 404 Not Found error.
The following references align with the file size limit and 404 error I am observing. The references are about the SDK, but it could be that the SDK is also calling the REST API under the covers. My tests are calling the REST API directly.
NotFound error on call to Data Lake Store Create
https://stackoverflow.com/a/41469724/10363

Mirth is reading too slow from disk

I am using Mirth version 3.0.1. I am reading a file (using a File Reader) with 34,000 records. Every record has 45 pipe (|) separated columns. Mirth is taking too much time to read the file from disk. Mirth is installed on the same server where the file is located. Earlier, I was facing a Java heap space issue, which I resolved by setting -Xms1024m -Xmx4096m in the files mcserver.vmoptions and mcservice.vmoptions. Now I have to solve the reading performance issue. Please find attached the channel for the same.
The answer to this problem is highly dependent on the solution itself. As an example, if you are doing transformations when you benchmark, it might be that the problem is not with reading the files, but rather with doing massive amounts of filtering and transformations in Mirth. Since Mirth converts everything you configure into basically one gigantic Javascript that executes on the server, it might just as well be that this is causing the performance problem. Pre-processor scripts might also create a problem if you do something that causes Mirth to read the whole file.
It might also be that the 34,000 lines in the file contain huge quantities of information, simply making the file very big and expensive to process. If every record in the file is supposed to create a new message within Mirth, you might also want to check the batch settings for the reader.
And in addition to this, the performance of read operations from disk is of course heavily affected by the infrastructure and hardware of the platform itself. You did mention that you are reading the files locally and that you had to increase the memory for Mirth. All of this could of course be a problem in itself. To make a benchmark, you would want to compare this to something else. Maybe write a small Java program that just reads the file, to compare performance outside of Mirth.
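As an illustration only, a quick read-speed check outside of Mirth could look something like the following. This sketch uses Node.js rather than the Java program suggested above, and the file path is a placeholder.
var fs = require('fs');

var start = Date.now();
var lineCount = 0;
fs.createReadStream('/path/to/input.txt')
    .on('data', function(chunk) {
        // count newline-terminated records as they stream past
        lineCount += chunk.toString().split('\n').length - 1;
    })
    .on('end', function() {
        console.log(lineCount + ' lines read in ' + (Date.now() - start) + ' ms');
    });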
Thanks for the suggestions.
I used router.routeMessage('channelName','PartOfMsg') to route 5,000 records at a time (from one channel to a second channel) out of the file of 34,000 records. This helped to read from the file faster while processing the records at the same time.
For the Mirth community, below is the code to route the message from one channel to another. This solution also covers the requirement of processing a bulk of records in batches.
In Source Transformer,
debug = "ON";
XML.ignoreWhitespace = true;
logger.debug('Inside source transformer "SplitFileIntoFiles" of channel: SplitFile');
var
    subSegmentCounter = 0,
    xmlMessageProcessCounter = 0,
    singleFileLimit = 5000,
    isError = false,
    xmlMessageProcess = new XML(<delimited><row><column1></column1><column2></column2></row></delimited>),
    newSubSegment = <row><column1></column1><column2></column2></row>,
    totalPatientRecords = msg.children().length();
logger.debug('Total number of records found in the patient input file are: ');
logger.debug(totalPatientRecords);
try {
    for each (seg in msg.children())
    {
        xmlMessageProcess.appendChild(newSubSegment);
        xmlMessageProcess['row'][xmlMessageProcessCounter] = msg['row'][subSegmentCounter];
        if (xmlMessageProcessCounter == singleFileLimit - 1)
        {
            logger.debug('Now sending the 5000 records to the next channel from channel DOR Batch File Process IHI');
            router.routeMessage('DOR SendPatientsToMedicare', xmlMessageProcess);
            logger.debug('After sending the 5000 records to the next channel from channel DOR Batch File Process IHI');
            xmlMessageProcessCounter = 0;
            delete xmlMessageProcess['row'];
        }
        subSegmentCounter++;
        xmlMessageProcessCounter++;
    } // End of FOR loop
} // End of try block
catch (exception)
{
    logger.error('The exception has been raised in source transformer "SplitFileIntoFiles" of channel: SplitFile');
    logger.error(exception);
    globalChannelMap.put('isFailed', true);
    globalChannelMap.put('errDesc', exception);
    return true;
}
if (xmlMessageProcessCounter > 1)
{
    try
    {
        logger.debug('Now sending the remaining records to the next channel from channel DOR Batch File Process IHI');
        router.routeMessage('DOR SendPatientsToMedicare', xmlMessageProcess);
        logger.debug('After sending the remaining records to the next channel from channel DOR Batch File Process IHI');
        delete xmlMessageProcess['row'];
    }
    catch (exception)
    {
        logger.error('The exception has been raised in source transformer "SplitFileIntoFiles" of channel: SplitFile');
        logger.error(exception);
        globalChannelMap.put('isFailed', true);
        globalChannelMap.put('errDesc', exception);
        return true;
    }
}
return true;
// End of JavaScript
Hope this will help.

WinRT writing to TCP stream not working

I have started development of a "WinRT" app (a "Metro"-style app for Windows 8). The app should read and write some data via a TCP stream. Reading works fine, but writing does not. Below is the code that uses the full .NET Framework (which works):
var client = new TcpClient();
client.Connect(IPAddress.Parse("192.168.178.51"), 60128);
var stream = client.GetStream();
var writer = new StreamWriter(stream);
writer.WriteLine("ISCP\0\0\0\x10\0\0\0.....");
writer.Flush();
In comparison the following code does not work:
var tcpClient = new StreamSocket();
await tcpClient.ConnectAsync(new HostName("192.168.178.51"), "60128");
var writer = new DataWriter(tcpClient.OutputStream);
writer.WriteString("ISCP\0\0\0\x10\0\0\0....");
writer.FlushAsync();
WriteString returns the correct length of the string (25), yet the other end does not receive the correct command. Via Wireshark I also see a correct packet for the full .NET version, but not for the WinRT version.
How can I fix this?
(Wireshark captures of the .NET version and the WinRT version were attached here.)
After your call to writer.WriteString(), you need to actually commit the data that is now in the buffer by calling writer.StoreAsync().
Any call to writer.WriteXxx() only stores data in memory. Once you call writer.StoreAsync(), the data in memory is sent.
My guess is that StreamWriter.WriteLine does this for you in a single call.

Node.js: how to flush socket?

I'm trying to flush a socket before sending the next chunk of the data:
var net = require('net');

net.createServer(function(socket) {
    socket.on('data', function(data) {
        console.log(data.toString());
    });
}).listen(54358, '127.0.0.1');

var socket = net.createConnection(54358, '127.0.0.1');
socket.setNoDelay(true);
socket.write('mentos');
socket.write('cola');
This, however, doesn't work despite the setNoDelay option: it prints "mentoscola" instead of "mentos\ncola". How do I fix this?
Looking over the WritableStream API and the associated example, it seems that you should set your breaks or delimiters yourself.
exports.puts = function (d) {
    process.stdout.write(d + '\n');
};
Because your socket is a stream, data will be written/read without your direct control, and #write won't change your data or assume you mean to break between writes, since you could be streaming a large piece of information over the socket and might want to set other delimiters.
I'm definitely no expert in this area, but that seems like the logical answer to me.
Edit: This is a duplicate of Nodejs streaming, and the conclusion there was the same as the answer I gave: working with streams isn't line-by-line; set your own delimiters.
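A minimal sketch of that delimiter approach, reusing the server and port from the question; the trailing '\n' framing is just one possible choice:
var net = require('net');

net.createServer(function(socket) {
    var pending = '';
    socket.on('data', function(data) {
        pending += data.toString();
        var lines = pending.split('\n');
        pending = lines.pop(); // keep any incomplete trailing line for the next 'data' event
        lines.forEach(function(line) {
            console.log(line); // prints "mentos", then "cola"
        });
    });
}).listen(54358, '127.0.0.1');

var socket = net.createConnection(54358, '127.0.0.1');
socket.write('mentos\n');
socket.write('cola\n');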
Maybe all data written in the same tick is sent as a batch.
Maybe at the receiving side, the node will combine the separate data segments before emitting the data event.

Best way for limit rate downloads in play framework scala

Problem: limit the download rate of binary files.
def test = {
    Logger.info("Call test action")
    val file = new File("/home/vidok/1.jpg")
    val fileIn = new FileInputStream(file)
    response.setHeader("Content-type", "application/force-download")
    response.setHeader("Content-Disposition", "attachment; filename=\"1.jpg\"")
    response.setHeader("Content-Length", file.length + "")
    val bufferSize = 1024 * 1024
    val bb = new Array[Byte](bufferSize)
    val bis = new java.io.BufferedInputStream(fileIn)
    var bytesRead = bis.read(bb, 0, bufferSize)
    while (bytesRead > 0) {
        bytesRead = bis.read(bb, 0, bufferSize)
        //sleep(1000)?
        response.writeChunk(bytesRead)
    }
}
But it's working only for text files. How can I make it work for binary files?
You've got the basic idea right: each time you've read a certain number of bytes (which are stored in your buffer) you need to:
evaluate how fast you've been reading (= X B/ms)
calculate the difference between X and how fast you should have been reading (= Y ms)
use sleep(Y) on the downloading thread if needed to slow the download rate down
There's already a great question about this right here that should have everything you need. I think especially the ThrottledInputStream solution (which is not the accepted answer) is rather elegant.
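To make the loop above concrete, here is a sketch of the throttling idea, written in Node.js purely for illustration (the question itself is Scala/Play); the throttledCopy name, targetBytesPerMs parameter and destination stream are assumptions, not an existing API:
var fs = require('fs');

function throttledCopy(srcPath, dest, targetBytesPerMs) {
    var start = Date.now();
    var sent = 0;
    var src = fs.createReadStream(srcPath, { highWaterMark: 64 * 1024 });
    src.on('data', function(chunk) {
        dest.write(chunk);
        sent += chunk.length;
        var elapsed = Date.now() - start;       // how long we have actually been sending
        var expected = sent / targetBytesPerMs; // how long it should have taken at the target rate
        var delay = expected - elapsed;         // the Y ms from the steps above
        if (delay > 0) {
            src.pause();                        // slow the read side down
            setTimeout(function() { src.resume(); }, delay);
        }
    });
    src.on('end', function() { dest.end(); });
}

// e.g. limit to roughly 100 KB/s when piping to any writable stream:
// throttledCopy('/home/vidok/1.jpg', someWritableStream, 100 * 1024 / 1000);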
A couple of points to keep in mind:
Downloading with one thread for everything is the simplest way; however, it's also the least efficient if you want to keep serving requests.
Usually, you'll want to at least offload the actual downloading of a file to its own separate thread.
To speed things up, consider downloading files in chunks (using HTTP Content-Range) and Java NIO. However, keep in mind that this makes things a lot more complex.
I wouldn't implement something that any good web server should be able to do for me. In enterprise systems, this kind of thing is normally handled by a web entry server or firewall. But if you have to do it yourself, then the answer by tmbrggmn looks good to me. NIO is a good tip.