I have a console app in Swift 3 that reads line by line from a very large file.txt (~200 GB):
guard let reader = LineReader(path: "/Path/to/file.txt") else { return }
for line in reader {
    // do something with each line
}
It takes 8+ hours to read all the data from the file. My server has 6 hardware cores; how can I read this file in 6 threads?
LineReader from here: https://github.com/andrewwoz/LineReader
P.S. The data is split from the beginning into separate files of 1 GB each.
I'd never thought about multithreaded reading of a 200 GB .txt file, but I'd probably let the console app detect how many cores are available (e.g. 6) and split the work into that many parts, one part for every process.
As far as I know, Ubuntu will automatically distribute the processes evenly across cores.
Hope this helps.
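A minimal sketch of that idea, assuming the pre-split 1 GB pieces from the question's P.S. are named /Path/to/file.txt.0 through /Path/to/file.txt.5 (hypothetical names) and that each piece can be processed independently:

import Foundation

// Hypothetical paths to the pre-split 1 GB pieces (see the question's P.S.).
let pieces = (0..<6).map { "/Path/to/file.txt.\($0)" }

// concurrentPerform spreads the iterations across the available cores
// and returns only when all of them have finished.
DispatchQueue.concurrentPerform(iterations: pieces.count) { index in
    guard let reader = LineReader(path: pieces[index]) else { return }
    for line in reader {
        // do something with each line
    }
}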
!!! This solution works only if you read the file using POSIX fopen(), as here: https://github.com/andrewwoz/LineReader
let threadsCount = 6 // match the number of hardware cores
let reader = LineReader(path: pathToFile)
var threads = [Thread]()

func readTxtFile() {
    // All threads pull lines from the shared reader; the POSIX stdio
    // functions behind it lock the underlying FILE*, which is why the
    // fopen()-based reader is required (see the note above).
    while let line = reader?.nextLine {
        autoreleasepool {
            // do something with each line
        }
    }
}

for threadNumber in 0..<threadsCount {
    threads.append(Thread { readTxtFile() })
    threads[threadNumber].start()
}

// Keep the main thread alive while the worker threads run.
select(0, nil, nil, nil, nil)
Also, the real speedup comes only from hardware cores, not from HT threads. If your CPU has 2 cores and 4 threads, use 2 threads in the code.
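To avoid hard-coding the thread count, ProcessInfo reports the number of logical processors; note that this count includes HT threads, so on macOS you can ask sysctl for the physical core count instead. A Darwin-only sketch:

import Foundation

// Logical processors (this count includes hyper-threaded ones).
let logical = ProcessInfo.processInfo.activeProcessorCount

// Physical cores via sysctl (macOS only).
var physical: Int32 = 0
var size = MemoryLayout<Int32>.size
sysctlbyname("hw.physicalcpu", &physical, &size, nil, 0)

let threadsCount = physical > 0 ? Int(physical) : logical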
So I am currently building an app to display some user analytics. In order to check that all my background calculations and the corresponding plots look decent, I have written a function to generate some mock data, called addMockData, which looks something like this:
func addMockData() {
    let classToHoldData = ClassToHoldData()
    for i in 0...15 {
        let otherClassToHoldData = OtherClassToHoldData()
        var fx = ...
        var fy = ...
        var fz = ...
        for j in 0...12000 {
            fx.append(...)
            fy.append(...)
            fz.append(...)
        }
        otherClassToHoldData.fx = fx
        otherClassToHoldData.fy = fy
        otherClassToHoldData.fz = fz
        classToHoldData.info.append(otherClassToHoldData)
    }
    try! realm.write {
        realm.objects(UserModel.self)[index].data.append(classToHoldData)
    }
}
I call addMockData in the AppDelegate in the application(...) method, so when I build and run the app for the first time, addMockData gets called. This works fine in the simulator: the data is generated without a hitch, with memory usage peaking at around 450 MB while generating the mock data.
The issue arises when I run the program on an actual device, in my case an iPad Air (3rd gen). There, after generating roughly half of the mock data, it terminates with the message "Message from debugger: Terminated due to memory issue". The memory usage steadily rises until it reaches roughly 1.7 GB, when it crashes. It seems like it does not deallocate all the data generated in the for loop.
I have tried wrapping my for loops in an autoreleasepool {} and have checked that Zombie Objects are disabled.
What else could I try? Any help is greatly appreciated!
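For reference, one shape the autoreleasepool attempt could take (a sketch, not code from the question; it reuses the question's hypothetical types, fills the elided mock values with random stand-ins, and helps only to the extent that the per-iteration temporaries are autoreleased Objective-C/Realm objects):

func addMockData() {
    let classToHoldData = ClassToHoldData()
    for _ in 0...15 {
        // Drain this iteration's temporaries before building the next batch.
        autoreleasepool {
            let other = OtherClassToHoldData()
            var fx = [Double](), fy = [Double](), fz = [Double]()
            for _ in 0...12000 {
                fx.append(Double.random(in: -1...1)) // stand-in mock value
                fy.append(Double.random(in: -1...1))
                fz.append(Double.random(in: -1...1))
            }
            other.fx = fx
            other.fy = fy
            other.fz = fz
            classToHoldData.info.append(other)
        }
    }
    try! realm.write {
        realm.objects(UserModel.self)[index].data.append(classToHoldData)
    }
}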
I'm a Swift newbie and I'm working on a Swift Mac app as a demo project. The app stores stock symbols in a SQLite table, fetches the stock price, calculates the value, and then finally displays the results in a table view.
I'm looking for ways to improve execution speed when fetching data to populate my table view, so I used a dispatch queue as shown below. The problem is that the Stock Price and Stock Value columns (calculated in the async closure) are always empty. What am I doing wrong? The function getStocksData returns an NSMutableArray, which is the data source for my table view.
func getStocksData() -> NSMutableArray {
    sharedInstance.database!.open()
    let resultSet: FMResultSet! = sharedInstance.database!.executeQuery("select stock_id, symbol, company, qty from stocks", withArgumentsIn: [])
    let stocksDBRowsArray: NSMutableArray = NSMutableArray()
    if resultSet != nil {
        while resultSet.next() {
            let stockInfo: StockInfo = StockInfo()
            stockInfo.StockID = resultSet.string(forColumn: "stock_id")!
            stockInfo.Symbol = resultSet.string(forColumn: "symbol")!
            stockInfo.StockCompany = resultSet.string(forColumn: "company")!
            stockInfo.Qty = resultSet.string(forColumn: "qty")!
            // create a queue with a unique label to fetch the stock price
            let queue = DispatchQueue(label: stockInfo.Symbol)
            queue.async {
                // code to fetch the stock price goes here
                .....
                stockInfo.StockPrice = stockPrice
                stockInfo.StockValue = stockPrice * stockInfo.Qty
            }
            stocksDBRowsArray.add(stockInfo)
        }
    }
    sharedInstance.database!.close()
    return stocksDBRowsArray
}
When your getStocksData() method returns, the StockInfo values may or may not yet have been filled with fetched values. The StockInfo values eventually get filled, at some undefined point in time, on an undefined application thread.
That's what the provided code snippet does. Of course this is not what you want, but what you want is not very clear either.
The documentation for Swift's Dispatch library is very terse and won't help you understand much of what is happening. I instead suggest that you study the documentation for dispatch_async, which is the C equivalent of the Swift DispatchQueue.async. It is documented in much more detail, and you'll read this key sentence:
Calls to this function always return immediately after the block has been submitted and never wait for the block to be invoked.
In general, don't hesitate to switch to the Objective-C documentation when the Swift documentation is lacking. Mastering some Swift technologies sometimes requires this little inconvenience. You'll learn a great deal of information there.
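One common way out (a sketch, not the answerer's code; the completion-based shape and fetchStockPrice(for:) are assumptions standing in for the elided fetch code) is to hand the caller a completion handler and use a DispatchGroup to learn when every price has arrived:

// Hypothetical stand-in for the elided price-fetching code.
func fetchStockPrice(for symbol: String) -> String {
    return "0.0"
}

func getStocksData(completion: @escaping (NSMutableArray) -> Void) {
    let stocks = NSMutableArray()
    let group = DispatchGroup()
    // ... build the StockInfo objects from the database exactly as before,
    // adding each one to `stocks` ...
    for case let stockInfo as StockInfo in stocks {
        group.enter()
        DispatchQueue.global().async {
            stockInfo.StockPrice = fetchStockPrice(for: stockInfo.Symbol)
            group.leave()
        }
    }
    // Runs on the main queue once every async block above has finished.
    group.notify(queue: .main) {
        completion(stocks) // safe to reload the table view from here
    }
}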
I have to generate a big file on the fly: read from the database and send it to the client.
I read some documentation and I did this:
val streamContent: Enumerator[Array[Byte]] = Enumerator.outputStream { os =>
  // new PrintWriter() reads from the database and, for each record,
  // does some logic and writes to the output stream
}
Ok.stream(streamContent.andThen(Enumerator.eof)).withHeaders(
  CONTENT_DISPOSITION -> s"attachment; filename=someName.csv"
)
I'm rather new to Scala in general, only a week in, so don't go by my reputation.
My questions are:
1) Is this the best way? I found that with a big file this will load into memory, and I also don't know what the chunk size is in this case; if it sends one chunk per write(), that is not very convenient.
2) I found the method Enumerator.fromStream(data: InputStream, chunkedSize: Int) a little better because it has a chunk size, but I don't have an InputStream because I'm creating the file on the fly.
There's a note in the docs for Enumerator.outputStream:
Not [sic!] that calls to write will not block, so if the iteratee that is being fed to is slow to consume the input, the OutputStream will not push back. This means it should not be used with large streams since there is a risk of running out of memory.
Whether this can happen depends on your situation. If you can and will generate gigabytes in seconds, you should probably try something different. I'm not exactly sure what, but I'd start at Enumerator.generateM(). For many cases, though, your method is perfectly fine. Have a look at this example by Gaëtan Renaudeau for serving a Zip file that's generated on the fly in the same way you're using it:
val enumerator = Enumerator.outputStream { os =>
  val zip = new ZipOutputStream(os)
  val r = new scala.util.Random() // this was missing from the original snippet
  Range(0, 100).map { i =>
    zip.putNextEntry(new ZipEntry("test-zip/README-" + i + ".txt"))
    zip.write("Here are 100000 random numbers:\n".map(_.toByte).toArray)
    // Let's do 100 writes of 1'000 numbers
    Range(0, 100).map { j =>
      zip.write((Range(0, 1000).map(_ => r.nextLong).map(_.toString).mkString("\n")).map(_.toByte).toArray)
    }
    zip.closeEntry()
  }
  zip.close()
}
Ok.stream(enumerator >>> Enumerator.eof).withHeaders(
"Content-Type"->"application/zip",
"Content-Disposition"->"attachment; filename=test.zip"
)
Please keep in mind that Ok.stream has been replaced by Ok.chunked in newer versions of Play, in case you want to upgrade.
As for the chunk size, you can always use Enumeratee.grouped to gather a bunch of values and send them as one chunk.
val grouper = Enumeratee.grouped(
  Traversable.take[Array[Double]](100) &>> Iteratee.consume()
)
Then you'd do something like
Ok.stream(enumerator &> grouper >>> Enumerator.eof)
From what I understood here, "V8 has a generational garbage collector. Moves objects around randomly. Node can't get a pointer to raw string data to write to socket." So I shouldn't store data that comes from a TCP stream in a string, especially if that string grows bigger than Math.pow(2,16) bytes. (Hope I'm right so far.)
What is then the best way to handle all the data that's coming from a TCP socket? So far I've been trying to use _:_:_ as a delimiter because I think it's somewhat unique and won't clash with the rest of the data.
A sample of the data that would come would be something_:_:_maybe a large text_:_:_ maybe tons of lines_:_:_more and more data
This is what I tried to do:
var net = require('net');

var server = net.createServer(function (socket) {
    socket.on('connect', function () {
        console.log('someone connected');
        var buf = new Buffer(Math.pow(2, 16)); // new buffer with size 2^16
        socket.on('data', function (data) {
            if (data.toString().search('_:_:_') === -1) {
                // No separator in the data that just arrived: write it to the
                // buffer; it's part of a message that will be completed later.
                buf.write(data.toString());
            } else {
                // There is a separator. The first part is the end of a previous
                // message, the last part is the start of a message to be completed
                // in the future; parts between separators are independent messages.
                var parts = data.toString().split('_:_:_');
                var msg;
                if (parts.length === 2) {
                    msg = buf.toString('utf-8', 0, 4) + parts[0];
                    console.log('MSG: ' + msg);
                    buf = (new Buffer(Math.pow(2, 16))).write(parts[1]);
                } else {
                    msg = buf.toString() + parts[0];
                    for (var i = 1; i <= parts.length - 1; i++) {
                        if (i !== parts.length - 1) {
                            msg = parts[i];
                            console.log('MSG: ' + msg);
                        } else {
                            buf.write(parts[i]);
                        }
                    }
                }
            }
        });
    });
});

server.listen(9999);
Whenever I try to console.log('MSG' + msg), it prints out the whole buffer, so it's useless for checking whether something worked.
How can I handle this data the proper way? Would the lazy module work, even though this data is not line-oriented? Is there some other module to handle streams that are not line-oriented?
It has indeed been said that there's extra work going on because Node has to take that buffer and then push it into V8 / cast it to a string. However, doing a toString() on the buffer isn't any better. There's no good solution to this right now, as far as I know, especially if your end goal is to get a string and fool around with it. It's one of the things Ryan mentioned at NodeConf as an area where work needs to be done.
As for the delimiter, you can choose whatever you want. A lot of binary protocols choose to include a fixed header, so that you can put things in a normal structure, which often includes a length. In this way, you slice apart a known header and get information about the rest of the data without having to iterate over the entire buffer. With a scheme like that, one can use a tool like:
node-binary - https://github.com/substack/node-binary
node-ctype - https://github.com/rmustacc/node-ctype
As an aside, buffers can be accessed via array syntax, and they can also be sliced apart with .slice().
Lastly, check here: https://github.com/joyent/node/wiki/modules -- find a module that parses a simple TCP protocol and seems to do it well, and read some code.
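To make the fixed-header idea concrete, here is a small sketch of length-prefixed framing. It is written in Swift to match the earlier examples in this document, and the 4-byte big-endian length header is an assumption chosen for illustration, not something the answer specifies:

import Foundation

// Assumed frame format: a 4-byte big-endian length, then the payload.
func encodeFrame(_ payload: Data) -> Data {
    var length = UInt32(payload.count).bigEndian
    var frame = Data(bytes: &length, count: 4)
    frame.append(payload)
    return frame
}

// Consumes as many complete frames as `buffer` currently holds; leftover
// bytes stay in `buffer` until the socket delivers more data.
func drainFrames(from buffer: inout Data, handler: (Data) -> Void) {
    while buffer.count >= 4 {
        let length = [UInt8](buffer.prefix(4)).reduce(0) { ($0 << 8) | Int($1) }
        guard buffer.count >= 4 + length else { return } // frame incomplete; wait
        handler(buffer.dropFirst(4).prefix(length))
        buffer = Data(buffer.dropFirst(4 + length)) // re-base the remaining bytes
    }
}

On each data event you would append the received bytes to the buffer and call drainFrames; partial frames simply wait for more bytes, and no delimiter scanning is needed.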
You should use the new streams2 API: http://nodejs.org/api/stream.html
Here are some very useful examples: https://github.com/substack/stream-handbook
https://github.com/lvgithub/stick
I've got a piece of code that opens a data reader and, for each record (which contains a URL), downloads and processes that page.
What's the simplest way to make it multi-threaded so that, let's say, there are 10 slots which can be used to download and process pages simultaneously, and as slots become available the next rows are read, etc.?
I can't use WebClient.DownloadDataAsync
Here's what I have tried to do, but it hasn't worked (i.e. the "worker" is never run):
using (IDataReader dr = q.ExecuteReader())
{
    ThreadPool.SetMaxThreads(10, 10);
    int workerThreads = 0;
    int completionPortThreads = 0;
    while (dr.Read())
    {
        do
        {
            ThreadPool.GetAvailableThreads(out workerThreads, out completionPortThreads);
            if (workerThreads == 0)
            {
                Thread.Sleep(100);
            }
        } while (workerThreads == 0);
        Database.Log l = new Database.Log();
        l.Load(dr);
        ThreadPool.QueueUserWorkItem(delegate(object threadContext)
        {
            Database.Log log = threadContext as Database.Log;
            Scraper scraper = new Scraper();
            dc.Product p = scraper.GetProduct(log, log.Url, true);
            // Note: this event is created and set locally but never waited
            // on, so it has no effect here.
            ManualResetEvent done = new ManualResetEvent(false);
            done.Set();
        }, l);
    }
}
You do not normally need to play with the Max threads (I believe it defaults to something like 25 per proc for worker, 1000 for IO). You might consider setting the Min threads to ensure you have a nice number always available.
You don't need to call GetAvailableThreads either. You can just start calling QueueUserWorkItem and let it do all the work. Can you repro your problem by simply calling QueueUserWorkItem?
You could also look into the Task Parallel Library, which has helper methods to make this kind of stuff more manageable and easier.
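The "10 slots, refill as they free up" pattern itself maps directly onto a counting semaphore. A sketch in Swift (to match the other examples in this document; urlsFromDatabase and processPage are hypothetical stand-ins for the data-reader rows and the download-and-scrape step):

import Foundation

let urlsFromDatabase = ["http://example.com/1", "http://example.com/2"] // hypothetical rows
func processPage(_ url: String) {
    // hypothetical download-and-process step
}

let slots = DispatchSemaphore(value: 10) // at most 10 pages in flight
let group = DispatchGroup()

for url in urlsFromDatabase {
    slots.wait() // take a slot; blocks the reader loop while all 10 are busy
    group.enter()
    DispatchQueue.global().async {
        processPage(url)
        slots.signal() // free the slot so the next row can start
        group.leave()
    }
}
group.wait() // block until every queued page has finished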