What is the ideal method for sharing read-only data between the master and slave threads? From my understanding there are two options:
1) Set shared data as global variable in main so that the slave threads can read them.
2) Pass shared variables to slave threads as parameters.
From my experiments, there is hardly any difference in performance even with a big data set. In fact, 1) performs slightly worse than 2). I know that for 2), kdb will serialize and deserialize the parameters. Does it do the same for 1)? That would explain the performance degradation, considering the global variable is bigger than the thread-specific parameter. Are there alternative methods to do this?
Secondly, as slave threads cannot modify global variables, I reckon the only way to share results with the main thread is to return them. Please comment if this is not the case.
EDIT: Performance is measured in terms of the runtime before and after the call to peach.
Passing values into a function via peach like this
{}[v;] peach vector
sounds nice, and works well, unless v is very large. Each thread gets a copy (even if v is a global).
So the answer depends on your use case. Do you have enough RAM? Can you afford the memory copy given the number of threads? If the answer is yes, then you can afford to do this without too much ill effect (remember that allocation will affect time).
I prefer to use globals for this reason.
They are not used for RTL but rather verification, correct? They would not be synthesizable.
Do they have better memory management features, in turn optimizing program time? If I recall correctly, SystemVerilog has automatic garbage collection, so there is no need to deallocate memory.
The official IEEE documentation does a great job of explaining how they work. I am just wondering in what scenarios I would use one vs an array. One guess would be that they have associated methods that allow for easier data manipulation?
Thank you in advance for your knowledge and expertise.
A queue can be synthesisable if it has a bounded maximum size. Only a few synthesis tools support it, probably none of the FPGA synthesis tools.
The key advantage of a queue is the efficiency of adding or removing one element, especially when it is accessed at the head or tail of the queue. A dynamic array may require reallocation and copying of the entire array when its size is modified. The penalty for a queue is the extra time it takes to access elements in the middle of the queue, and extra space compared with a dynamic array holding the same number of elements.
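As a rough analogy only (Python rather than SystemVerilog), the trade-off above resembles the difference between a contiguous list and a double-ended queue: removing from the head of a contiguous array shifts every remaining element, while a queue structure removes from either end in constant time. The function names here are made up for illustration, and the sketch says nothing about how any particular simulator implements queues.

```python
from collections import deque


def drain_from_head_list(n):
    """Remove n items from the head of a contiguous list; each pop shifts the rest."""
    items = list(range(n))
    removed = 0
    while items:
        items.pop(0)  # O(len(items)): remaining elements are moved down
        removed += 1
    return removed


def drain_from_head_deque(n):
    """Remove n items from the head of a deque; each pop is O(1)."""
    items = deque(range(n))
    removed = 0
    while items:
        items.popleft()  # O(1): no copying of the remaining elements
        removed += 1
    return removed
```

Timing the two functions for a large n (for example with the standard timeit module) should show the list version growing much faster than the deque version.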
I hope that 2 answers this question.
I've got an application with a lot of stateless microservices, which pass their variable context from one to another. I've got a case where I start a few chains of services with the same context in parallel and then wait for them to finish. Each service can modify its variable context, but after all of the chains have finished I have to merge their variable contexts and ensure there are no conflicts.
It's illustrated in the examples below:
It's possible to solve this problem by storing the whole history of variable modifications, but it's a huge data overhead which I'd like to avoid.
Another solution I see is to find some hashing function that lets me calculate the hash of a modification history from the existing hash and the new data, and also lets me check whether one history is a prefix of another knowing only their hashes. But I'm unable to find such a function.
I'm looking for any applicable algorithm that has as little data overhead as possible.
What you need are vector clocks, an old idea that can be used to merge parallel data modifications and to detect conflicts.
"It's possible to solve this problem by storing the whole history of variable modifications, but it's a huge data overhead which I'd like to avoid."
With vector clocks you don't keep the entire history, but a counter for each variable and node (so each variable has a vector of counters).
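A minimal sketch of the idea, assuming each variable carries one clock that maps a service (node) id to a counter. The function names and the node ids "svcA" and "svcB" are made up for illustration. One clock dominates another when every one of its counters is at least as large; if neither dominates, the two modifications were concurrent and are reported as a conflict.

```python
from typing import Dict

VectorClock = Dict[str, int]  # node id -> number of modifications made by that node


def increment(clock: VectorClock, node: str) -> VectorClock:
    """Return a copy of the clock with the given node's counter advanced by one."""
    updated = dict(clock)
    updated[node] = updated.get(node, 0) + 1
    return updated


def _less_or_equal(a: VectorClock, b: VectorClock) -> bool:
    """True if every counter in a is <= the matching counter in b (missing = 0)."""
    return all(count <= b.get(node, 0) for node, count in a.items())


def compare(a: VectorClock, b: VectorClock) -> str:
    """Classify the relation between two clocks."""
    a_le_b = _less_or_equal(a, b)
    b_le_a = _less_or_equal(b, a)
    if a_le_b and b_le_a:
        return "equal"
    if a_le_b:
        return "before"    # a's history is a prefix of b's
    if b_le_a:
        return "after"     # b's history is a prefix of a's
    return "conflict"      # concurrent modifications


def merge(a: VectorClock, b: VectorClock) -> VectorClock:
    """Element-wise maximum: the clock of the merged value."""
    return {node: max(a.get(node, 0), b.get(node, 0)) for node in set(a) | set(b)}
```

For example, if two chains start from the same clock and services svcA and svcB each modify the same variable, compare reports "conflict"; if only one chain touched it, the result is "before" or "after" and the newer value can be taken safely. The overhead is one counter per variable per node rather than the whole history.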
Storing the whole history of variable modifications doesn't sound too awful, actually. For example, you can put modification information onto a queue, then have a service that processes that queue a batch of elements at a time and puts the result into one single place.
This is a common approach, for example, in situations when there is huge parallel workload and you can't synchronize access to only one place with a lock.
Later you can even scale out workers that process the queue.
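The queue-and-batch approach described above can be sketched roughly as follows. This is an in-process illustration using the standard library queue; a real deployment would presumably use a message broker, and the (variable, value) tuple format and function names are assumptions made for the example.

```python
from queue import Queue, Empty


def drain_batch(modifications: Queue, max_items: int) -> list:
    """Take up to max_items pending modifications off the queue without blocking."""
    batch = []
    while len(batch) < max_items:
        try:
            batch.append(modifications.get_nowait())
        except Empty:
            break  # queue is empty for now; process what we have
    return batch


def apply_batch(state: dict, batch: list) -> None:
    """Apply (variable, value) modifications in order to the single shared state."""
    for variable, value in batch:
        state[variable] = value
```

Because only the consumer writes to the shared state, the producers never need to take a lock on it.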
I have to deliver an application as a standalone Matlab executable to a client. The code includes a series of calls to a function that internally creates several cell arrays.
My problem is that an out-of-memory error happens when the number of calls to this function increases in response to the increase in the user load. I guess this is low-level memory fragmentation as the workspace variables are independent from the number of loops.
As mentioned here, quitting and restarting Matlab is the only solution for this type of out-of-memory errors at the moment.
My question is how I can implement such a mechanism in a standalone application, so that it saves its data, quits and restarts itself when an out-of-memory error occurs (or when a high likelihood of such an error is somehow predicted).
Is there any best practice available?
Thanks.
This is a bit of a tough one. Instead of looking to restart to clear things out, could you change the code to break the work into chunks to make it more efficient? Fragmentation is mostly proportional to the peak cell-related memory usage and how much the size of data items varies, and less to the total usage over time. If you can break a large piece of work into smaller pieces done in sequence, this can lower the "high water mark" of your fragmented memory usage. You can also save on memory usage by using "flyweight" data structures that share their backing data values, or sometimes converting to cell-based structures to reference objects or numeric codes. Can you share an example of your code and data structure with us?
In theory, you could get a clean slate by saving your workspace and relevant state out to a mat file and having the executable launch another instance of itself with an option to reload that state and proceed, and then having the original executable exit. But that's going to be pretty ugly in terms of user experience and your ability to debug it.
Another option would be to offload the high-fragmentation code into another worker process which could be killed and restarted, while the main executable process survives. If you have the Parallel Computing Toolbox, which can now be compiled into standalone Matlab executables, this would be pretty straightforward: open a worker pool of one or two workers, and run the fraggy code inside them using synchronous calls, periodically killing the workers and bringing up new ones. The workers are independent processes which start out with non-fragmented memory spaces. If you don't have PCT, you could roll your own by compiling your application as two separate apps, a driver app and a worker app, and have the driver app spin up a worker and control it via IPC, passing your data back and forth as MAT files or bytestreams. That's not going to be a lot of fun to code, though.
Perhaps you could also push some of the fraggy code down into the Java layer, which handles cell-like data structures more gracefully.
Changing the code to be less fraggy in the first place is probably the simpler and easier approach, and results in a less complicated application design. In my experience it's often possible. If you share some code and data structure details, maybe we can help.
Another option is to periodically check for memory fragmentation with a function like chkmem.
You could have this function called silently from your code every couple of iterations, or use a timer object to have it called every X minutes...
The idea is to use the undocumented functions feature memstats and feature dumpmem to get the largest free memory blocks available, in addition to the largest variables currently allocated. Using that, you could make a guess as to whether there are signs of memory fragmentation.
When fragmentation is detected, you would warn the user and instruct them how to save their current session (export to a MAT-file), restart the app, and restore the session upon restart.
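The check described above could be reduced to a comparison like the one below, shown in Python purely to illustrate the logic. The rule itself is an assumption, not something chkmem is known to do: treat fragmentation as likely when the largest contiguous free block could not hold a given number of copies of the largest variable. The function name and the safety_factor default are made up; the two byte counts would come from whatever the memory statistics report.

```python
def fragmentation_suspected(largest_free_block_bytes: int,
                            largest_variable_bytes: int,
                            safety_factor: float = 2.0) -> bool:
    """Guess whether fragmentation is becoming a problem.

    Assumed heuristic: warn if the largest contiguous free block is smaller
    than safety_factor copies of the largest currently allocated variable.
    """
    return largest_free_block_bytes < safety_factor * largest_variable_bytes
```

A result of True would be the trigger for warning the user and offering to save the session.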
Context:
I want to store some temporary results in some temporary tables. These tables may be reused in several queries that may occur close in time, but at some point the evolutionary algorithm I'm using may not need some old tables any more and keep generating new tables. There will be several queries, possibly concurrently, using those tables. Only one user doing all those queries. I don't know if that clarifies everything about sessions and so on, I'm still uncertain about how that works.
Objective:
What I would like to do is create temporary tables (if they don't already exist), keep them in memory as far as possible and, if at some point there is not enough memory, delete those that would otherwise be committed to the HDD (I guess those will be the least recently used).
Examples:
The client will be doing queries for EMAs with different parameters and an aggregation of them with different coefficients, each individual may vary in terms of the coefficients used and so the parameters for the EMAs may repeat as they are still in the gene pool, and may not be needed after a while. There will be similar queries with more parameters and the genetic algorithm will find the right values for the parameters.
Questions:
Is that what "on commit drop" means? I've seen descriptions about sessions and transactions but I don't really understand those concepts. Sorry if the question is stupid.
If it is not, do you know about any simple way to get Postgres to do this?
Workaround:
In the worst case I should be able to estimate how many tables I can keep in memory and try to implement the LRU myself, but it's never going to be as good as what Postgres could do.
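Such a client-side LRU could be sketched roughly as follows. The class name, the max_tables limit and the table names are all hypothetical, and the sketch only tracks names: the caller would still have to create the tables and issue the DROP TABLE statements for whatever is evicted.

```python
from collections import OrderedDict
from typing import List


class TempTableLRU:
    """Track temporary table names and report which ones to drop (hypothetical helper)."""

    def __init__(self, max_tables: int):
        self.max_tables = max_tables
        self._tables = OrderedDict()  # oldest use first, most recent use last

    def touch(self, name: str) -> List[str]:
        """Record a use of `name`; return the table names that should now be dropped."""
        if name in self._tables:
            self._tables.move_to_end(name)  # mark as most recently used
            return []
        self._tables[name] = True
        evicted = []
        while len(self._tables) > self.max_tables:
            oldest, _ = self._tables.popitem(last=False)  # least recently used
            evicted.append(oldest)
        return evicted
```

Calling touch before every query that uses a table keeps the ordering current; each returned name is a candidate for dropping.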
Thank you very much.
This is a complicated topic and probably one to discuss in some depth. I think it is worth both explaining why PostgreSQL doesn't support this and also what you can do instead with recent versions to approach what you are trying to do.
PostgreSQL has a pretty good approach to caching diverse data sets across multiple users. In general you don't want to allow a programmer to specify that a temporary table must be kept in memory if it becomes very large. Temporary tables however are managed quite differently from normal tables in that they are:
Buffered by the individual back-end, not the shared buffers
Locally visible only, and
Unlogged.
What this means is that typically you aren't generating a lot of disk I/O for temporary tables. The tables do not normally flush WAL segments, and they are managed by the local back-end so they don't affect shared buffer usage. This means that only occasionally is data going to be written to disk and only when necessary to free memory for other (usually more frequent) tasks. You certainly aren't forcing disk writes and only need disk reads when something else has used up memory.
The end result is that you don't really need to worry about this. PostgreSQL already tries, to a certain extent, to do what you are asking it to do, and temporary tables have much lower disk I/O requirements than standard tables do. It does not force the tables to stay in memory though and if they become large enough, the pages may expire into the OS disk cache, and eventually on to disk. This is an important feature because it ensures that performance gracefully degrades when many people create many large temporary tables.
We used Drools as part of a solution to act as a sort of filter in a very intense processing application, maybe running up to 100 rules on 500,000 + working memory objects.
It turns out that it is extremely slow.
Does anybody else have any experience using Drools in a batch-type processing application?
It kind of depends on your rules. 500K objects is reasonable given enough memory (it has to populate a RETE network in memory, so memory usage is a multiple of 500K objects, i.e. space for objects plus space for network structure, indexes etc.). It's possible you are paging to disk, which would be really slow.
Of course, if you have rules that match combinations of the same type of fact, that can cause an explosion of combinations to try, which will be really, really slow even if you have only 1 rule.
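To put a number on that explosion: a rule with k patterns over the same fact type has to consider, before any constraints filter candidates out, up to n to the power k combinations of n facts. The function below is just that arithmetic, written in Python; its name is made up for illustration.

```python
def candidate_combinations(n_facts: int, patterns_of_same_type: int) -> int:
    """Upper bound on fact combinations a join over one fact type may consider."""
    return n_facts ** patterns_of_same_type


# One rule joining two patterns of the same type over 500,000 facts:
pairs = candidate_combinations(500_000, 2)  # 250_000_000_000
```

Indexed constraints can cut this down considerably, but an unconstrained join on 500K facts is in the hundreds of billions of candidate pairs.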
If you had any more information on the analysis you are doing that would probably help with possible solutions.
I've used Drools with a stateful working memory containing over 1M facts. With some tuning of both your rules and the underlying JVM, performance can be quite good after a few minutes of initial start-up. Let me know if you want more details.
I haven't worked with the latest version of Drools (last time I used it was about a year ago), but back then our high-load benchmarks proved it to be utterly slow. A huge disappointment after having based much of our architecture on it.
At least something good I remember about drools is that their dev team was available on IRC and very helpful, you might give them a try, they're the experts after all: irc.codehaus.org #drools
I'm just learning drools myself, so maybe I'm missing something, but why is the whole batch of five hundred thousand objects added to working memory at once? The only reason I can think of is that there are rules that kick in only when two or more items in the batch are related.
If that isn't the case, then perhaps you could use a stateless session and assert one object at a time. I assume rules will run 500k times faster in that case.
Even if it is the case, do all your rules need access to all 500k objects? Could you speed things up by applying per-item rules one at a time, and then in a second phase of processing apply batch level rules using a different rulebase and working memory? This would not change the volume of data, but the RETE network would be smaller because the simple rules would have been removed.
An alternative approach would be to try and identify the related groups of objects and assert the objects in groups during the second phase, further reducing the volume of data in working memory as well as splitting up the RETE network.
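The grouping idea can be sketched in a language-neutral way as follows (Python here, not the Drools API). Both key_of and run_rules are hypothetical stand-ins: key_of decides which objects are related, and run_rules represents creating a fresh session, asserting one group and firing the rules.

```python
from collections import defaultdict
from typing import Callable, Hashable, Iterable, List


def process_in_groups(objects: Iterable,
                      key_of: Callable[[object], Hashable],
                      run_rules: Callable[[List], List]) -> List:
    """Partition objects into related groups and evaluate each group separately."""
    groups = defaultdict(list)
    for obj in objects:
        groups[key_of(obj)].append(obj)

    results = []
    for group in groups.values():
        # Each call sees only one group, so working memory stays small.
        results.extend(run_rules(group))
    return results
```

This only helps if no rule needs to see objects from two different groups at once.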
Drools is not really designed to be run on a huge number of objects. It's optimized for running complex rules on a few objects.
The working memory initialization for each additional object is too slow and the caching strategies are designed to work per working memory object.
Use a stateless session and add the objects one at a time?
I had problems with OutOfMemory errors after parsing a few thousand objects. Setting a different default optimizer solved the problem.
OptimizerFactory.setDefaultOptimizer(OptimizerFactory.SAFE_REFLECTIVE);
We were looking at drools as well, but for us the number of objects is low so this isn't an issue. I do remember reading that there are alternate versions of the same algorithm that take memory usage more into account, and are optimized for speed while still being based on the same algorithm. Not sure if any of them have made it into a real usable library though.
This optimizer can also be set by using the JVM parameter
-Dmvel2.disable.jit=true