Simplest way to process a list of items in a multi-threaded manner - .net-2.0

I've got a piece of code that opens a data reader and for each record (which contains a url) downloads & processes that page.
What's the simplest way to make it multi-threaded so that, let's say, there are 10 slots that can be used to download and process pages simultaneously, and as slots become available the next rows are read, etc.?
I can't use WebClient.DownloadDataAsync.
Here's what I have tried to do, but it hasn't worked (i.e. the "worker" delegate never runs):
using (IDataReader dr = q.ExecuteReader())
{
    ThreadPool.SetMaxThreads(10, 10);
    int workerThreads = 0;
    int completionPortThreads = 0;
    while (dr.Read())
    {
        do
        {
            ThreadPool.GetAvailableThreads(out workerThreads, out completionPortThreads);
            if (workerThreads == 0)
            {
                Thread.Sleep(100);
            }
        } while (workerThreads == 0);
        Database.Log l = new Database.Log();
        l.Load(dr);
        ThreadPool.QueueUserWorkItem(delegate(object threadContext)
        {
            Database.Log log = threadContext as Database.Log;
            Scraper scraper = new Scraper();
            dc.Product p = scraper.GetProduct(log, log.Url, true);
            ManualResetEvent done = new ManualResetEvent(false);
            done.Set();
        }, l);
    }
}

You do not normally need to play with the Max threads (I believe it defaults to something like 25 per proc for worker, 1000 for IO). You might consider setting the Min threads to ensure you have a nice number always available.
You don't need to call GetAvailableThreads either. You can just start calling QueueUserWorkItem and let it do all the work. Can you repro your problem by simply calling QueueUserWorkItem?
You could also look into the Task Parallel Library, which has helper methods to make this kind of thing more manageable and easier.
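For illustration only, not part of the original answer: one way to cap the work at roughly 10 concurrent downloads on .NET 2.0, without touching the ThreadPool limits, is to gate QueueUserWorkItem with a Semaphore and wait for the last item with an event. The urls array and ProcessUrl method below are stand-ins for the data-reader loop and the Scraper call from the question:

using System;
using System.Threading;

static class ThrottledDownloader
{
    // At most 10 work items may run at the same time.
    private static readonly Semaphore slots = new Semaphore(10, 10);

    public static void Run(string[] urls)
    {
        if (urls.Length == 0) return;

        int pending = urls.Length;
        ManualResetEvent allDone = new ManualResetEvent(false);

        foreach (string url in urls)
        {
            slots.WaitOne(); // block until one of the 10 slots is free
            ThreadPool.QueueUserWorkItem(delegate(object state)
            {
                try
                {
                    ProcessUrl((string)state); // download & process one page
                }
                finally
                {
                    slots.Release(); // free the slot for the next item
                    if (Interlocked.Decrement(ref pending) == 0)
                        allDone.Set(); // last item finished
                }
            }, url);
        }

        allDone.WaitOne(); // wait until every queued item has completed
    }

    private static void ProcessUrl(string url)
    {
        // Placeholder for the real Scraper.GetProduct(...) work.
        Console.WriteLine("processed " + url);
    }
}

Applied to the original loop, the WaitOne call before QueueUserWorkItem means the data reader only advances when a slot is free, which is the behavior the question asks for.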

Related

What could cause a meteor app to sometimes create two documents?

I have a Meteor app that takes the text of an article and splits it into paragraphs, sentences, words, and characters, then stores that as a JSON object which I save as a document in a collection. The document I am testing with now ends up at 15133 bytes in MongoDB.
When I insert the document it takes about 20 or 30 seconds to insert. Then sometimes it starts going through my article creation routine again and inserts another document. Sometimes it ends up inserting 3 or more documents. Sometimes it behaves as it should and only inserts 1 document into the collection.
What should I be looking for that could be causing this behavior?
Here is my code, as requested:
Meteor.methods({
  'createArticle': function (text, title) {
    var article = {}
    article.title = title
    article.userID = "sdfgsdfg"
    article.text = text
    article.paragraphs = []
    var paragraphs = splitArticleIntoParagraphs(text)
    console.log("paragraphs", paragraphs)
    _.each(paragraphs, function (paragraph, p) {
      if (paragraph !== "") {
        console.log("paragraph", paragraph)
        article.paragraphs[p] = {}
        article.paragraphs[p].read = false
        article.paragraphs[p].text = paragraph
        console.log("paragraphs[p]", article.paragraphs[p])
        var sentences = splitParagraphIntoSentences(paragraph)
        article.paragraphs[p].sentences = []
      }
      _.each(sentences, function (sentence, s) {
        if (sentence !== "") {
          article.paragraphs[p].sentences[s] = {}
          console.log("sentence", sentence)
          article.paragraphs[p].sentences[s].text = sentence
          article.paragraphs[p].sentences[s].read = false
          console.log("paragraphs[p].sentences[s]", article.paragraphs[p].sentences[s])
          var wordsForward = splitSentenceIntoWordsForward(sentence)
          console.log("wordsForward", JSON.stringify(wordsForward))
          article.paragraphs[p].sentences[s].forward = {}
          article.paragraphs[p].sentences[s].forward.words = wordsForward
          // var wordsReverse = splitSentenceIntoWordsReverse(sentence)
          _.each(wordsForward, function (word, w) {
            if (word) {
              // console.log("word", JSON.stringify(word))
              // article.paragraphs[p].sentences[s] = {}
              // article.paragraphs[p].sentences[s].forward = {}
              // article.paragraphs[p].sentences[s].forward.words = []
              article.paragraphs[p].sentences[s].forward.words[w] = {}
              article.paragraphs[p].sentences[s].forward.words[w].wordID = word._id
              article.paragraphs[p].sentences[s].forward.words[w].simp = word.simp
              article.paragraphs[p].sentences[s].forward.words[w].trad = word.trad
              console.log("word.simp", word.simp)
              var characters = word.simp.split('')
              console.log("characters", characters)
              article.paragraphs[p].sentences[s].forward.words[w].characters = []
              _.each(characters, function (character, c) {
                if (character) {
                  console.log("character", character, p, s, w, c)
                  article.paragraphs[p].sentences[s].forward.words[w].characters[c] = {}
                  article.paragraphs[p].sentences[s].forward.words[w].characters[c].text = character
                  article.paragraphs[p].sentences[s].forward.words[w].characters[c].wordID = Words.findOne({simp: character})._id
                }
              })
            }
          })
        }
      })
    })
    // console.log("article", JSON.stringify(article))
    // console.log(JSON.stringify(article.paragraphs[10].sentences[1].forward))//.words[4].characters[0])
    console.log("done")
    var id = Articles.insert(article)
    console.log("id", id)
    return id
  }
})
I call the method here:
Template.articleList.events({
  "click #addArticle": function(event) {
    event.preventDefault();
    var title = $('#title').val();
    var text = $('#text').val();
    $('#title').value = '';
    $('#text').value = '';
    $('#text').attr('rows', '3');
    Meteor.call('createArticle', text, title);
  }
})
An important thing to keep in mind is that Meteor methods do not work very well for CPU-intensive tasks. Since your Meteor server runs in a single thread, any blocking computation, like yours, will affect all client connections, e.g. by delaying DDP heartbeats. This, in turn, can result in clients thinking that the connection was dropped.
As @ghybs suggested in one of the comments, your method is probably triggered several times by an impatient DDP client that thinks the server has disconnected. The easiest way to prevent this behavior is to add the noRetry flag to Meteor.apply, as explained here:
https://docs.meteor.com/api/methods.html#Meteor-apply
I believe Meteor.call does not have this option.
Another strategy would be to make sure that your methods are idempotent, i.e. calling them more than once should not produce any additional effects. This is usually true, at least when you use method simulation, because retrying a db insert will re-use the same document id, which will fail on the second try. For some reason this is not happening in your case.
Finally, the problem you described suggests that a different pattern should probably be used for a computationally expensive task like yours. If I were you, I would start by splitting the job into several steps:
First, I would make sure that the document is uploaded to the server with a POST request rather than through DDP.
Then I would implement a "process file" server-side method that grabs the file which is already on the server or in the database (in case you used a files collection). The first thing the method should do is call this.unblock(), but that's not all.
Ideally the computational task should be executed in a separate process. Only when that process completes would the method return, telling the caller that the job is done. But since we called this.unblock(), the caller can perform other tasks, e.g. calling other methods/subscriptions, while waiting for the result.
Sometimes, even a separate process will not be good enough. I've experienced situations where I had to delegate the task to one or more separate worker servers.

Batching large result sets using Rx

I've got an interesting question for Rx experts. I have a relational table holding information about events. An event consists of an id, a type, and the time at which it happened. In my code, I need to fetch all the events within a certain, potentially wide, time range.
SELECT * FROM events WHERE event.time > :before AND event.time < :after ORDER BY time LIMIT :batch_size
To improve reliability and deal with large result sets, I query the records in batches of size :batch_size. Now, I want to write a function that, given :before and :after, will return an Observable representing the result set.
Observable<Event> getEvents(long before, long after);
Internally, the function should query the database in batches. The distribution of events along the time scale is unknown. So the natural way to address batching is this:
fetch first N records
if the result is not empty, use the last record's time as a new 'before' parameter, and fetch the next N records; otherwise terminate
if the result is not empty, use the last record's time as a new 'before' parameter, and fetch the next N records; otherwise terminate
... and so on (the idea should be clear)
My question is:
Is there a way to express this function in terms of higher-level Observable primitives (filter/map/flatMap/scan/range etc), without using the subscribers explicitly?
So far, I've failed to do this, and come up with the following straightforward code instead:
private void observeGetRecords(long before, long after, Subscriber<? super Event> subscriber) {
    long start = before;
    while (start < after) {
        final List<Event> records;
        try {
            records = getRecordsByRange(start, after);
        } catch (Exception e) {
            subscriber.onError(e);
            return;
        }
        if (records.isEmpty()) break;
        records.forEach(subscriber::onNext);
        start = Iterables.getLast(records).getTime();
    }
    subscriber.onCompleted();
}

public Observable<Event> getRecords(final long before, final long after) {
    return Observable.create(subscriber -> observeGetRecords(before, after, subscriber));
}
Here, getRecordsByRange implements the SELECT query using DBI and returns a List. This code works fine, but it lacks the elegance of high-level Rx constructs.
NB: I know that I can return an Iterator as the result of the SELECT query in DBI. However, I don't want to do that and prefer to run multiple queries instead. This computation does not have to be atomic, so transaction isolation is not an issue.
Although I don't fully understand why you want such time-reuse, here is how I'd do it:
BehaviorSubject<Long> start = BehaviorSubject.create(0L);
start
    .subscribeOn(Schedulers.trampoline())
    .flatMap(tstart ->
        getEvents(tstart, tstart + twindow)
            .publish(o ->
                o.takeLast(1)
                    .doOnNext(r -> start.onNext(r.time))
                    .ignoreElements()
                    .mergeWith(o)
            )
    )
    .subscribe(...)

Combining parts of Stream

I've got an observable watching a log that is continuously being written to. Each line is a new onNext call. Sometimes the log outputs a single log item over multiple lines. Detecting this is easy; I just can't find the right Rx call.
I'd like to find a way to collect the single log items into a List of lines, and onNext the list when the single log item is complete.
Buffer doesn't seem right as this isn't time based, it's algorithm based.
GroupBy might be what I want, but the documentation is confusing for it. It also seems that the observables it creates probably won't have onComplete called until the completion of the source observable.
This solution can't delay the log much (preferably not at all). I need to be reading the log as close to real time as possible, and order matters.
Any push in the right direction would be great.
This is a typical reactive parsing problem. You could use Rxx Parsers, or for a native solution you can build your own state machine, either with Scan or by defining an async iterator. Scan is preferable for simple parsers and often follows a Scan-Where-Select pattern.
Async iterator state machine example: Turnstile
Scan parser example (untested):
IObservable<string> lines = ReadLines();
IObservable<IReadOnlyList<string>> parsed = lines.Scan(
    new
    {
        ParsingItem = (IEnumerable<string>)null,
        Item = (IEnumerable<string>)null
    },
    (state, line) =>
        // I'm assuming here that items never span lines partially.
        IsItem(line)
            ? IsItemLastLine(line)
                ? new
                {
                    ParsingItem = (IEnumerable<string>)null,
                    Item = (state.ParsingItem ?? Enumerable.Empty<string>()).Concat(new[] { line })
                }
                : new
                {
                    ParsingItem = (state.ParsingItem ?? Enumerable.Empty<string>()).Concat(new[] { line }),
                    Item = (IEnumerable<string>)null
                }
            : new
            {
                ParsingItem = (IEnumerable<string>)null,
                Item = (IEnumerable<string>)new[] { line }
            })
    .Where(result => result.Item != null)
    .Select(result => result.Item.ToList().AsReadOnly());
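IsItem and IsItemLastLine are left undefined in the answer above; they stand for whatever per-line checks identify a line that belongs to a multi-line item and the item's final line. A purely hypothetical sketch, assuming continuation lines are prefixed with "| " and the last line of a multi-line item ends with "<end>":

static bool IsItem(string line)
{
    // Hypothetical log format: lines belonging to a multi-line item start with "| ".
    return line.StartsWith("| ");
}

static bool IsItemLastLine(string line)
{
    // Hypothetical log format: the final line of a multi-line item ends with "<end>".
    return line.TrimEnd().EndsWith("<end>");
}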

NodeJS: What is the proper way to handle TCP socket streams? Which delimiter should I use?

From what I understood here, "V8 has a generational garbage collector. Moves objects around randomly. Node can't get a pointer to raw string data to write to socket." So I shouldn't store data that comes from a TCP stream in a string, especially if that string becomes bigger than Math.pow(2,16) bytes. (I hope I'm right so far.)
What, then, is the best way to handle all the data coming from a TCP socket? So far I've been trying to use _:_:_ as a delimiter, because I think it's reasonably unique and won't clash with the rest of the data.
A sample of the data that would come would be something_:_:_maybe a large text_:_:_ maybe tons of lines_:_:_more and more data
This is what I tried to do:
net = require('net');
var server = net.createServer(function (socket) {
  socket.on('connect', function () {
    console.log('someone connected');
    buf = new Buffer(Math.pow(2, 16)); // new buffer with size 2^16
    socket.on('data', function (data) {
      if (data.toString().search('_:_:_') === -1) { // If there's no separator in the data that just arrived...
        buf.write(data.toString()); // ... write it on the buffer. it's part of another message that will come.
      } else { // if there is a separator in the data that arrived
        parts = data.toString().split('_:_:_'); // the first part is the end of a previous message, the last part is the start of a message to be completed in the future. Parts between separators are independent messages
        if (parts.length == 2) {
          msg = buf.toString('utf-8', 0, 4) + parts[0];
          console.log('MSG: ' + msg);
          buf = (new Buffer(Math.pow(2, 16))).write(parts[1]);
        } else {
          msg = buf.toString() + parts[0];
          for (var i = 1; i <= parts.length - 1; i++) {
            if (i !== parts.length - 1) {
              msg = parts[i];
              console.log('MSG: ' + msg);
            } else {
              buf.write(parts[i]);
            }
          }
        }
      }
    });
  });
});
server.listen(9999);
Whenever I try to console.log('MSG' + msg), it prints out the whole buffer, so it's useless for checking whether anything worked.
How can I handle this data the proper way? Would the lazy module work, even though this data is not line-oriented? Is there some other module for handling streams that are not line-oriented?
It has indeed been said that there's extra work going on because Node has to take that buffer and then push it into V8 / cast it to a string. However, doing a toString() on the buffer isn't any better. There's no good solution to this right now, as far as I know, especially if your end goal is to get a string and fool around with it. It's one of the things Ryan mentioned at NodeConf as an area where work needs to be done.
As for delimiter, you can choose whatever you want. A lot of binary protocols choose to include a fixed header, such that you can put things in a normal structure, which a lot of times includes a length. In this way, you slice apart a known header and get information about the rest of the data without having to iterate over the entire buffer. With a scheme like that, one can use a tool like:
node-binary - https://github.com/substack/node-binary
node-ctype - https://github.com/rmustacc/node-ctype
As an aside, buffers can be accessed via array syntax, and they can also be sliced apart with .slice().
Lastly, check here: https://github.com/joyent/node/wiki/modules -- find a module that parses a simple tcp protocol and seems to do it well, and read some code.
You should use the new streams2 API: http://nodejs.org/api/stream.html
Here are some very useful examples: https://github.com/substack/stream-handbook
https://github.com/lvgithub/stick

What is a better way to write the program below? (C# 3.0)

Consider the program below:
private static bool CheckFactorPresent(List<FactorReturn> factorReturnCol)
{
    bool IsPresent = true;
    StringBuilder sb = new StringBuilder();
    //Get the exposure names from Exposure list.
    //Since this will remain same, so it has been done outside the loop
    List<string> lstExposureName = (from item in Exposures
                                    select item.ExposureName).ToList<string>();
    foreach (FactorReturn fr in factorReturnCol)
    {
        //Build the factor names from the ReturnCollection dictionary
        List<string> lstFactorNames = fr.ReturnCollection.Keys.ToList<string>();
        //Check if all the Factor Names are present in ExposureName list
        List<string> result = lstFactorNames.Except(lstExposureName).ToList();
        if (result.Count() > 0)
        {
            result.ForEach(i =>
            {
                IsPresent = false;
                sb.AppendLine("Factor" + i + "is not present for week no: " + fr.WeekNo.ToString());
            });
        }
    }
    return IsPresent;
}
Basically, I am checking whether all the factor names (lstFactorNames) are present in the exposure names list (lstExposureName) by using lstFactorNames.Except(lstExposureName). Then, if Count() > 0, I write the error messages to the StringBuilder (sb).
I am sure that someone can definitely write a better implementation than the one presented.
I am looking forward to seeing one, so that I can learn something new from it.
I am using C# 3.0 and .NET Framework 3.5.
Thanks
Save for some naming convention issues, I'd say that looks fine (for what I can figure out without seeing the rest of the code or the purpose of the effort). The naming conventions, though, need some work. A sporadic mix of Hungarian notation, PascalCase, camelCase, and abbreviations is a little disorienting. Try naming your local variables camelCase exclusively and things will look a lot better. Best of luck to you; things are looking good so far!
- EDIT -
Also, you can clean up the iteration at the end by just running a simple foreach:
...
foreach (var except in result)
{
    isPresent = false;
    builder.AppendFormat("Factor{0} is not present for week no: {1}\r\n", except, fr.WeekNo);
}
...
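As a further sketch, not from the original answer: the whole method can be collapsed into a single LINQ query that collects the missing-factor messages and derives the return value from them. This assumes the same Exposures list and FactorReturn shape as in the question, plus the usual System.Linq and System.Text usings:

private static bool CheckFactorPresent(List<FactorReturn> factorReturnCol)
{
    // Exposure names do not change per factor return, so compute them once.
    var exposureNames = new HashSet<string>(Exposures.Select(e => e.ExposureName));

    // One message for every factor name that has no matching exposure.
    var missing = (from fr in factorReturnCol
                   from factorName in fr.ReturnCollection.Keys
                   where !exposureNames.Contains(factorName)
                   select string.Format("Factor {0} is not present for week no: {1}", factorName, fr.WeekNo))
                  .ToList();

    // Mirrors the original method's sb; presumably logged or inspected elsewhere.
    var builder = new StringBuilder();
    missing.ForEach(m => builder.AppendLine(m));

    return missing.Count == 0;
}

Whether the messages or only the boolean are actually needed by the caller determines whether the StringBuilder is worth keeping at all.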