Parallel on List of Objects? - c#-3.0

I have .NET 3.5 and I want to use Parallel.ForEach. I have a List of Accounts which needs to be refreshed from another system. For this I am thinking of creating a list of account objects and calling accountObj.Process, which will do the processing. I want to make sure my approach is right and that everything will be in place for this.
If any of you have already done this, can you point me to a correct implementation/example, etc.?
How does Parallel.ForEach work internally? Does it create one thread for each item of the loop, or does it work with a finite set of threads?
Ocean

Parallel.ForEach sounds like a perfectly reasonable approach. It doesn't create a thread per item - it partitions the list into tasks for different threads, but keeps a bound on the number created. If your tasks are network-bound, I believe PFX may notice that and increase the number of threads being used. It's certainly worth giving it a try.
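For example, assuming an Account class with a Process method as described in the question (the names here are just the question's own, not a real API), a minimal sketch could be:
Parallel.ForEach(accounts, account =>
{
    // each account is handed to a thread-pool worker;
    // Process must be safe to run concurrently for different accounts
    account.Process();
});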
I suggest you read the parallelism book from the MS Patterns and Practices group for more detailed advice.

I suggest you read the parallelism book from the MS Patterns and Practices group for more detailed advice.
Thanks Jon :)
The credit review sample in chapter 2 of the book shows how to do this. You can use Parallel.ForEach:
Parallel.ForEach(accounts.AllAccounts, account =>
{
    Trend trend = SampleUtilities.Fit(account.Balance);
    double prediction = trend.Predict(account.Balance.Length + NumberOfMonths);
    account.ParPrediction = prediction;
    account.ParWarning = prediction < account.Overdraft;
});
Or PLINQ:
accounts.AllAccounts
    .AsParallel()
    .ForAll(account =>
    {
        Trend trend = SampleUtilities.Fit(account.Balance);
        double prediction = trend.Predict(account.Balance.Length + NumberOfMonths);
        account.PlinqPrediction = prediction;
        account.PlinqWarning = prediction < account.Overdraft;
    });
In both cases the TPL assigns work from a pool of threads, the .NET ThreadPool. The TPL uses adaptive range partitioning and adaptive concurrency to maximize throughput. You can use a custom Partitioner to get finer control over how the collection is split up across threads. You can also set the maximum degree of concurrency with MaxDegreeOfParallelism. In general it's better to let the TPL do its own optimization unless you see performance issues.
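For instance, a rough sketch of capping the concurrency (accounts.AllAccounts is from the sample above, and Process is the method from the question):
var options = new ParallelOptions { MaxDegreeOfParallelism = 4 };
Parallel.ForEach(accounts.AllAccounts, options, account =>
{
    // at most 4 accounts are processed concurrently
    account.Process();
});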
Note: If you only have .NET 3.5 then the Task Parallel Library (TPL) features are not present. At one point there was a CTP of the TPL for 3.5, but that's no longer available. I believe the TPL binaries are also part of the Rx for .NET 3.5 download. That might give you a way to use the TPL with 3.5.

Related

Improve Hasura Subscription Performance

We developed a web app that relies on real-time interaction between our users. We use Angular for the frontend and Hasura with GraphQL on Postgres as our backend.
What we noticed is that when more than 300 users are active at the same time we experience severe performance degradation.
Therefore, we want to improve our subscriptions setup. We think that possible issues could be:
Too many subscriptions
Too large and complex subscriptions, with too many forks in the subscription
Concerning 1., each user has approximately 5-10 active subscriptions when using the web app. Concerning 2., we have subscriptions that are complex, as we join up to 6 tables together.
The solutions we are thinking of:
Use more queries and limit the use of subscriptions to fields that absolutely need to be real-time.
Split up complex queries/subscriptions in multiple smaller ones.
Are we missing another possible cause? What else can we use to improve the overall performance?
Thank you for your input!
Preface
The OP's question is quite broad and impossible to answer in the general case.
So what I describe here reflects my experience with optimizing subscriptions - it's for the OP to decide whether it reflects their situation.
Short description of system
Users of the system upload documents, extract information, prepare new documents, and converse during the process (IM-like functionality); there are AI bots that try to reduce the burden of repetitive tasks, and services that exchange data with external systems.
There are a lot of entities and a lot of interaction between both human and robot participants. Plus quite complex authorization rules: visibility of data depends on the organization, departments and the content of documents.
What we started with
At first it was:
a programmer wrote a GraphQL query for all the data needed by the application
changed the query to a subscription
done
It was OK for the first 2-3 months, then:
queries became more complex and then even more complex
the number of subscriptions grew
the UI became laggy
the DB instance was always near 100% load, even during nights and weekends, because somebody had left the application open
First we optimized the queries themselves, but that did not suffice:
some things are legitimately costly: JOINs, existence predicates, and the data itself grew significantly
the network part: you can optimize the DB, but just transferring all the needed data has its cost
Optimization of subscriptions
Step I. Split subscriptions: subscribe for change date, query on change
Instead of one complex subscription for the whole data set, split it into parts:
A. A subscription for a single field that indicates that an entity was changed
E.g.
Instead of:
subscription {
  document {
    id
    title
    # other fields
    pages {   # array relation
      ...
    }
    tasks {   # array relation
      ...
    }
    # multiple other array/object relations
    # pagination and ordering
  }
}
A query like this returns thousands of rows.
Create a function that:
accepts hasura_session - so that results are individual per user
returns just one field: max_change_date
So it became:
subscription {
  doc_change_date {
    max_change_date
  }
}
Always one row and always one field
B. Change of application logic
Query the whole data
Subscribe to doc_change_date
Memorize the value of max_change_date
If max_change_date changes - re-query the data
Notes
It's absolutely OK if the subscription function sometimes returns false positives.
There is no need to replicate all the predicates from the source query in the subscription function.
E.g.
In our case, visibility of data depends on organizations and departments (and more).
So if a user of one department creates/modifies a document, this change is not visible to users of other departments.
But such changes happen only once or twice a minute per organization.
So the subscription function can ignore that granularity and calculate max_change_date for the whole organization.
It's beneficial to have a faster and cruder subscription function: it will trigger refreshes of data more frequently, but the overall cost will be lower.
Step II. Multiplex subscriptions
The first step is a crucial one.
And hasura has subscription multiplexing: https://hasura.io/docs/latest/graphql/core/databases/postgres/subscriptions/execution-and-performance.html#subscription-multiplexing
So in theory hasura could be smart enough to solve your problems.
But if you think that "explicit is better than implicit", there is another step you can take.
In our case:
users upload documents
combine them into dossiers
create new document types
converse with each other
So the subscriptions became: doc_change_date, dossier_change_date, msg_change_date and so on.
But actually it could be beneficial to have just one subscription: "hey! there are changes for you!"
So instead of multiple subscriptions, the application makes just one.
Note
We thought about 2 formats for the multiplexed subscription:
A. The subscription returns just one field, {max_change_date}, accumulated over all entities
B. The subscription returns a more granular result: {doc_change_date, dossier_change_date, msg_change_date}
Right now "A" works for us, but maybe we will change to "B" in the future.
Step III. What we would do differently with hasura 2.0
This is something we have not tried yet.
Hasura 2.0 allows registering VOLATILE functions for queries.
That allows creating functions with memoization in the DB:
you define a cache for the function call, presumably in a table
then on a function call you first look in the cache
if the value does not exist, add it to the cache
return the result from the cache
That allows further optimizations both for subscription functions and query functions.
Note
Actually it's possible to do that without waiting for hasura 2.0, but it requires trickery on the postgresql side:
you create a VOLATILE function that does the real work
and another function, defined as STABLE, that calls the VOLATILE function. This function can be registered in hasura
It works, but this trick is hard to recommend.
Who knows, maybe future postgresql versions or updates will make it impossible.
Summary
That's everything that I can say on the topic right now.
Actually I would have been glad to read something like this a year ago.
If somebody sees some pitfalls - please comment, I would be glad to hear opinions and maybe alternative approaches.
I hope that this explanation will help somebody, or at least provoke thought about other ways to deal with subscriptions.

Spring Batch: dynamic composite reader/processor/writer

I've seen this (2010) and this (SO, 2012), but still have not got the answer I need...
Is there an option in Spring Batch to have a dynamic composite reader/processor/writer?
The idea is to have the ability to replace processor at runtime, and in case of multiple processors (AKA composite-processor), to have the option to add/remove/replace/change order of processors. As mentioned, same for reader/writer.
I thought of something like reading the processors list from a DB (using a cache?) where the items (bean names) can be changed. Does this make sense?
EDIT - why do I need this?
There are cases where I use processors as "filters", and it may occur that the business (the client) changes the requirements (yes, it is very annoying) and asks to switch among filters (change the priority).
Another use case is having multiple readers to get the data from different data warehouses, and again - the client changes the warehouse from time to time (integration phase), and I do not want my app to be restarted each and every time. There are many other use cases, of course, plus this.
Thanks
I've started working on this project:
https://github.com/OhadR/spring-batch-dynamic-composite
that implements the requirements in the question above. If someone wanna contribute - feel free!

What are the (dis)advantages of early bound?

I'm researching the pros and cons of early and late binding in CRM. I've got a good idea on the subject, but there are some points I'm unclear about.
Some say that early binding is the fastest, others that late binding is. Is there any significant difference?
How does one handle early binding for custom entities?
How does one handle early binding for default entities with custom fields?
There are a lot of links, but the most useful ones I got my mouse on were these. Any other pointers?
Pro both
Pro early
Pro late
Some say that early binding is the fastest, others that late binding is. Is there any significant difference?
a. Since early bound is just a wrapper over the late bound entity class, and contains all the functionality thereof, it can't have a faster runtime than late bound. But this difference is extremely small, and I defer to Eric Lippert on "what's fastest" types of questions. The one difference in speed that isn't negligible is the speed of development: early bound is much faster for development, and much less error prone IMHO.
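To illustrate why early bound can't beat late bound at runtime: the generated classes are essentially typed wrappers over the same attribute bag. A simplified sketch of what a CrmSvcUtil-generated property roughly looks like (not the exact generated code):
public partial class Contact : Microsoft.Xrm.Sdk.Entity
{
    public string FirstName
    {
        // the early-bound property just forwards to the late-bound attribute collection
        get { return GetAttributeValue<string>("firstname"); }
        set { SetAttributeValue("firstname", value); }
    }
}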
How does one handle early binding for custom entities?
a. CrmSvcUtil generates the early bound classes for custom entities exactly like the default ones (I created this tool to make generating the classes even easier. Update: it has since moved over to GitHub. Update 2: it is now in the XrmToolBox Plugin Store; search for "Early Bound Generator"). Each time a change is made to a CRM entity, the entity type definitions will need to be updated - but only if you want to use a new property or entity, or you've removed a property or entity that you currently use. You can use early bound entity classes that are out of date, as long as you don't set the values of any properties that don't actually exist, which is exactly the same requirement as late bound.
How does one handle early binding for default entities with custom fields?
a. See the answer to question 2.
One of the little gotchas when working with early bound entities is the need to enable early bound proxy types on your IOrganizationService. This is easy for the OrganizationServiceProxy, but may take a few more steps for plugins and especially custom workflow activities.
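For the OrganizationServiceProxy case, a minimal sketch (the URI and credentials are placeholders):
using (var proxy = new OrganizationServiceProxy(organizationUri, null, clientCredentials, null))
{
    // opt this connection in to the generated early-bound types
    proxy.EnableProxyTypes();
    IOrganizationService service = proxy;
    // early-bound entities such as Contact can now be used with this service
}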
Edit 1 - My Tests
Below is my code, testing against a pretty inactive local dev environment. Feel free to test for yourself.
using (var service = TestBase.GetOrganizationServiceProxy())
{
    var earlyWatch = new Stopwatch();
    var lateWatch = new Stopwatch();

    for (int i = 0; i < 100; i++)
    {
        earlyWatch.Start();
        var e = new Contact() { FirstName = "Early", LastName = "BoundTest" };
        e.Id = service.Create(e);
        earlyWatch.Stop();

        lateWatch.Start();
        var l = new Entity();
        l.LogicalName = "contact";
        l["firstname"] = "Late";
        l["lastname"] = "BoundTest";
        l.Id = service.Create(l);
        lateWatch.Stop();

        service.Delete(e);
        service.Delete(l);
    }

    var earlyTime = earlyWatch.ElapsedMilliseconds;
    var lateTime = lateWatch.ElapsedMilliseconds;
    var percent = earlyWatch.ElapsedTicks / (double)lateWatch.ElapsedTicks;
}
My two test results (please note that running two tests is not statistically significant enough to draw any real conclusion, but I think they lend weight to the performance decrease not being big enough to justify giving up the development gains) were run against a local dev environment with very little other activity to disrupt the tests.
Number Creates | Early (MS) | Late (MS) | % diff (from ticks)
10 | 1242 | 1106 | 12.3%
100 | 8035 | 7960 | .1%
Now let's plug in the numbers and see the difference. 12% seems like a lot, but 12% of what? The actual difference was .136 seconds. Let's say you create 10 contacts every minute: .136 x 60 min/hour x 24 hours/day = 195.84 s/day, or about 3 minutes a day. Let's say you spend 3 developer hours attempting to figure out which is faster. In order for the program to save that much time, it would take about 60 days of 24/7 processing at 10 contacts/minute for the faster code to "pay back" its 3 hours of decision making.
So the rule is: always pick the method that is more readable/maintainable first, rather than the one that is faster. And if the performance isn't good enough, then look at other possibilities. But 98 times out of 100, it really isn't going to affect performance in a way that is detectable by an end user.
Premature optimization is the root of all evil -- Donald Knuth
Probably not. If you want to know for certain, I would suggest running some tests and profiling the results.
However, these MSDN articles suggest late binding is faster.
Best Practices for Developing with Microsoft Dynamics CRM
Use Early-Bound Types
Use the Entity class when your code must work on entities and attributes that are not known at the time the code is written. In addition, if your custom code works with thousands of entity records, use of the Entity class results in slightly better performance than the early-bound entity types. However, this flexibility has a disadvantage because you cannot verify entity and attribute names at compile time. If your entities are already defined at code time and slight performance degradation is acceptable, you should use the early-bound types that you can generate by using the CrmSvcUtil tool. For more information, see Use the Early Bound Entity Classes in Code.
Choose your Development Style for Managed Code for Microsoft Dynamics CRM
Entity Programming (Early Bound vs. Late Bound vs. Developer Extensions)
Early Bound ... Serialization costs increase as the entities are converted to late bound types during transmission over the network.
2 & 3. You don't have to take any special action with custom fields or entities. Svcutil will generate classes for both.
Use the Early Bound Entity Classes in Code
The class created by the code generation tool includes all the entity's attributes and relationships. By using the class in your code, you can access these attributes and be type safe. A class with attributes and relationships is created for all entities in your organization. There is no difference between the generated types for system and custom entities.
As a side note, I wouldn't get too hung up on it, they are both acceptable implementation approaches and in the majority of situations I doubt the performance impact will be significant enough to worry about. Personally I prefer late binding, but that's mostly because I don't like having to generate the classes.
Edit.
I performed some quick profiling on this by creating accounts in CRM, in sets of 200 and 5000. It confirms the information provided by Microsoft: in both runs late binding was about 8.5 seconds quicker. Over very short runs late binding is significantly faster - about 90%. However, early binding quickly picks up speed, and by the time 5000 records are created late binding is only 2% faster.
Full details blogged here.

J Oliver EventStore V2.0 questions

I am embarking upon an implementation of a project using CQRS and intend to use the J Oliver EventStore V2.0 as my persistence engine for events.
1) In the documentation, ExampleUsage.cs uses 3 serializers in "BuildSerializer". I presume this is just to show the flexibility of the deserialization process?
2) In the "Restart after failure" case where some events were not dispatched I believe I need startup code that invokes GetUndispatchedCommits() and then dispatch them, correct?
3) Again, in "ExampleUseage.cs" it would be useful if "TakeSnapshot" added the third event to the eventstore and then "LoadFromSnapShotForward" not only retrieve the most recent snapshot but also retrieved events that were post snapshot to simulate the rebuild of an aggregate.
4) I'm failing to see the use of retaining older snapshots. Can you give a use case where they would be useful?
5) If I have a service that is handling receipt of commands and generation of events what is a suggested strategy for keeping track of the number of events since the last snapshot for a given aggregate. I certainly don't want to invoke "GetStreamsToSnapshot" too often.
6) In the SqlPersistence.SqlDialects namespace the sql statement name is "GetStreamsRequiringSnaphots" rather than "GetStreamsRequiringSnapShots"
1) There are a few "base" serializers - such as the Binary, JSON, and BSON serializers. The other two in the example - the GZip/Compression and Encryption serializers - are wrapping serializers and are only meant to modify what's already been serialized into a byte stream. For the example, I'm just showing flexibility. You don't have to encrypt if you don't want to. In fact, I've got stuff running in production that uses simple JSON, which makes debugging very easy because everything is text.
2) The SynchronousDispatcher and AsynchronousDispatcher implementations are both configured to query and find any undispatched commits. You shouldn't have to do anything special.
3) Greg Young talked about how he used to "inline" his snapshots with the main event stream, but there were a number of optimistic concurrency and race conditions in high-performance systems that came up. He therefore decided to move them "out of band". I have followed this decision for many of the same reasons.
In addition, snapshots are really a performance consideration only when you have extremely low SLAs. If you have a stream with a few thousand events on it and you don't have low SLAs, why not just take the minimal performance hit instead of adding additional complexity to your system? In other words, snapshots are "ancillary" concepts. They're in the EventStore API, but they're an optional concept that should be considered only for certain use cases.
4) Let's suppose you had an aggregate with tens of millions of events and you wanted to run a "what if" scenario from before your most recent snapshot. It's a lot cheaper to go from another snapshot forward. The really nice thing about snapshots being a secondary concept is that if you wanted to drop older snapshots you could and it wouldn't affect your system at all.
5) There is a method in each implementation of IPersistStreams called GetStreamsRequiringSnapshots. You provide a threshold, for example 50, which finds all streams having 50 or more events since their last snapshot. This can (and probably should) be done asynchronously from your normal processing.
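A rough sketch of that idea follows; note that the thread above refers to the threshold query both as GetStreamsToSnapshot and GetStreamsRequiringSnapshots, so check your IPersistStreams interface for the exact name and signature before relying on this:
// run periodically on a background thread, outside normal command processing
void SnapshotLargeStreams(IPersistStreams persistence)
{
    const int threshold = 50; // streams with 50+ events since their last snapshot
    foreach (var streamHead in persistence.GetStreamsToSnapshot(threshold))
    {
        // load the stream identified by streamHead, rebuild the aggregate,
        // and persist a new snapshot for it via the snapshot API
    }
}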
6) "Snapshots" is the correct casing for that word. Much like "website" used to be "Web site" but because of common usage it became "website".

How do I adapt my recommendation engine to cold starts?

I am curious what methods/approaches exist to overcome the "cold start" problem: when a new user or item enters the system, the lack of information about this new entity makes generating recommendations a problem.
I can think of doing some prediction-based recommendation (based on gender, nationality and so on).
You can cold start a recommendation system.
There are two types of recommendation systems: collaborative filtering and content-based. Content-based systems use metadata about the things you are recommending; the question is then which metadata is important. The second approach, collaborative filtering, doesn't care about the metadata; it just uses what people did or said about an item to make a recommendation. With collaborative filtering you don't have to worry about which terms in the metadata are important - in fact you don't need any metadata at all to make the recommendation. The problem with collaborative filtering is that you need data. Before you have enough data you can use content-based recommendations. You can provide recommendations based on both methods: at the beginning be 100% content-based, then, as you get more data, start to mix in collaborative filtering.
That is the method I have used in the past.
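A minimal sketch of that kind of blend (the names and the ramp-up constant are illustrative, not from any particular library):
// Blend content-based and collaborative scores, shifting weight toward
// collaborative filtering as the user accumulates interactions.
double HybridScore(double contentScore, double collaborativeScore, int userInteractionCount)
{
    const int rampUp = 50; // assumed number of interactions at which CF is fully trusted
    double w = Math.Min(1.0, userInteractionCount / (double)rampUp);
    return w * collaborativeScore + (1.0 - w) * contentScore;
}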
Another common technique is to treat the content-based portion as a simple search problem: you just put in the metadata as the text or body of your document and then index your documents. You can do this with Lucene & Solr without writing any code.
If you want to know how basic collaborative filtering works, check out Chapter 2 of "Programming Collective Intelligence" by Toby Segaran
Maybe there are times you just shouldn't make a recommendation? "Insufficient data" should qualify as one of those times.
I just don't see how prediction recommendations based on "gender, nationality and so on" will amount to more than stereotyping.
IIRC, places such as Amazon built up their databases for a while before rolling out recommendations. It's not the kind of thing you want to get wrong; there are lots of stories out there about inappropriate recommendations based on insufficient data.
I'm working on this problem myself, but this paper from Microsoft on Boltzmann machines looks worthwhile: http://research.microsoft.com/pubs/81783/gunawardana09__unified_approac_build_hybrid_recom_system.pdf
This has been asked several times before (naturally, I cannot find those questions now :/), but the general conclusion was that it's better to avoid such recommendations. In various parts of the world the same names belong to different sexes, and so on...
Recommendations based on "similar users liked..." clearly must wait. You can give out coupons or other incentives to survey respondents if you are absolutely committed to doing predictions based on user similarity.
There are two other ways to cold-start a recommendation engine.
Build a model yourself.
Get your suppliers to fill in key information to a skeleton model. (Also may require $ incentives.)
Lots of potential pitfalls in all of these, which are too common sense to mention.
As you might expect, there is no free lunch here. But think about it this way: recommendation engines are not a business plan. They merely enhance the business plan.
There are three things needed to address the Cold-Start Problem:
The data must have been profiled such that you have many different features (with product data the term used for 'feature' is often 'classification facets'). If you don't properly profile data as it comes in the door, your recommendation engine will stay 'cold' as it has nothing with which to classify recommendations.
MOST IMPORTANT: You need a user-feedback loop with which users can review the personalization engine's suggestions. For example, a Yes/No button for 'Was This Suggestion Helpful?' should queue the item for review, moving it from one training dataset (i.e. the 'Recommend' training dataset) to the other (i.e. the 'Do Not Recommend' training dataset); see the sketch after this list.
The model used for (Recommend/Do Not Recommend) suggestions should never be considered a one-size-fits-all recommendation. In addition to classifying the product or service to suggest to a customer, how the firm classifies each specific customer matters too. If it is functioning properly, one should expect that customers with different features will get different (Recommend/Do Not Recommend) suggestions in a given situation. That is the 'personalization' part of personalization engines.
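A toy sketch of such a feedback loop (entirely illustrative; the dataset names mirror the ones used above):
// Route explicit Yes/No feedback into the matching training dataset
// so the next model retraining can learn from it.
class FeedbackExample
{
    public string UserId;      // which customer the feedback came from
    public string ItemId;      // which suggested product or service was reviewed
    public DateTime Timestamp;
}

class FeedbackLoop
{
    readonly List<FeedbackExample> recommendTrainingSet = new List<FeedbackExample>();
    readonly List<FeedbackExample> doNotRecommendTrainingSet = new List<FeedbackExample>();

    // Called when the user answers "Was This Suggestion Helpful?"
    public void RecordFeedback(string userId, string itemId, bool wasHelpful)
    {
        var dataset = wasHelpful ? recommendTrainingSet : doNotRecommendTrainingSet;
        dataset.Add(new FeedbackExample { UserId = userId, ItemId = itemId, Timestamp = DateTime.UtcNow });
    }
}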