I'm testing this Go code on Ubuntu 11.04 running in VirtualBox:
package main

import (
	"fmt"
	"time"
	"big"
)

var c chan *big.Int

func sum(start, stop, step int64) {
	bigStop := big.NewInt(stop)
	bigStep := big.NewInt(step)
	bigSum := big.NewInt(0)
	for i := big.NewInt(start); i.Cmp(bigStop) < 0; i.Add(i, bigStep) {
		bigSum.Add(bigSum, i)
	}
	c <- bigSum
}

func main() {
	s := big.NewInt(0)
	n := time.Nanoseconds()
	step := int64(4)
	c = make(chan *big.Int, int(step))
	stop := int64(100000000)
	for j := int64(0); j < step; j++ {
		go sum(j, stop, step)
	}
	for j := int64(0); j < step; j++ {
		s.Add(s, <-c)
	}
	n = time.Nanoseconds() - n
	fmt.Println(s, float64(n)/1000000000.)
}
Ubuntu has access to all 4 of my cores. I checked this by running several executables simultaneously and watching System Monitor.
But when I run this code, it uses only one core and gains nothing from parallel processing.
What am I doing wrong?
You probably need to review the Concurrency section of the Go FAQ, specifically these two questions, and work out which (if not both) apply to your case:
Why doesn't my multi-goroutine program use multiple CPUs?
You must set the GOMAXPROCS shell environment variable or use the similarly-named function of the runtime package to allow the run-time support to utilize more than one OS thread.
Programs that perform parallel computation should benefit from an increase in GOMAXPROCS. However, be aware that concurrency is not parallelism.
Why does using GOMAXPROCS > 1 sometimes make my program slower?
It depends on the nature of your program. Programs that contain several goroutines that spend a lot of time communicating on channels will experience performance degradation when using multiple OS threads. This is because of the significant context-switching penalty involved in sending data between threads.
Go's goroutine scheduler is not as good as it needs to be. In future, it should recognize such cases and optimize its use of OS threads. For now, GOMAXPROCS should be set on a per-application basis.
For more detail on this topic see the talk entitled Concurrency is not Parallelism.
I'm creating a server-side app in Swift 3. I've chosen libevent for the networking code because it's cross-platform and doesn't suffer from the C10k problem. Libevent implements its own event loop, but I want to keep CFRunLoop and GCD (DispatchQueue.main.after etc.) functional as well, so I need to glue them together somehow.
This is what I've come up with:
var terminated = false

DispatchQueue.main.after(when: DispatchTime.now() + 3) {
    print("Dispatch works!")
    terminated = true
}

while !terminated {
    switch event_base_loop(eventBase, EVLOOP_NONBLOCK) { // libevent
    case 1:
        break // No events were processed
    case 0:
        print("DEBUG: Libevent processed one or more events")
    default: // -1
        print("Unhandled error in network backend")
        exit(1)
    }

    RunLoop.current().run(mode: RunLoopMode.defaultRunLoopMode,
                          before: Date(timeIntervalSinceNow: 0.01))
}
This works, but introduces a latency of 0.01 sec. While RunLoop is sleeping, libevent won't be able to process events. Lowering this timeout increases CPU usage significantly when the app is idle.
I was also considering using only libevent, but third party libs in the project can use dispatch_async internally, so this can be problematic.
Running libevent's loop in a different thread makes synchronization more complex. Is this the only way of solving this latency issue?
LINUX UPDATE. The above code does not work on Linux (2016-07-25-a Swift snapshot): RunLoop.current().run exits with an error. Below is a working Linux version reimplemented with a timer and dispatch_main. It suffers from the same latency issue:
let queue = dispatch_get_main_queue()
let timer = dispatch_source_create(DISPATCH_SOURCE_TYPE_TIMER, 0, 0, queue)
let interval = 0.01

let block: () -> () = {
    guard !terminated else {
        print("Quitting")
        exit(0)
    }
    switch server.loop() {
    case 1: break // Just idling
    case 0: break //print("Libevent: processed event(s)")
    default: // -1
        print("Unhandled error in network backend")
        exit(1)
    }
}

block()

let fireTime = dispatch_time(DISPATCH_TIME_NOW, Int64(interval * Double(NSEC_PER_SEC)))
dispatch_source_set_timer(timer, fireTime, UInt64(interval * Double(NSEC_PER_SEC)), UInt64(NSEC_PER_SEC) / 10)
dispatch_source_set_event_handler(timer, block)
dispatch_resume(timer)
dispatch_main()
A quick search of the Open Source Swift Foundation libraries on GitHub reveals that the support in CFRunLoop is (perhaps obviously) implemented differently on different platforms. This means, in essence, that RunLoop and libevent, with respect to cross-platform-ness, are just different ways to achieve the same thing. I can see the thinking behind the thought that libevent is probably better suited to server implementations, since CFRunLoop didn't grow up with that specific goal, but as far as being cross-platform goes, they're both barking up the same tree.
That said, the underlying synchronization primitives used by RunLoop and libevent are inherently private implementation details and, perhaps more importantly, different between platforms. From the source, it looks like RunLoop uses epoll on Linux, as does libevent, but on macOS/iOS/etc., RunLoop is going to use Mach ports as its fundamental primitive, while libevent looks like it's going to use kqueue. You might, with enough effort, be able to make a hybrid RunLoopSource that ties to a libevent source for a given platform, but this would likely be very fragile, and generally ill-advised, for a couple of reasons. First, it would be based on private implementation details of RunLoop that are not part of the public API, and therefore subject to change at any time without notice. Second, unless you went through and did this for every platform supported by both Swift and libevent, you would have broken the cross-platform-ness of it, which was one of your stated reasons for going with libevent in the first place.
One additional option you might not have considered would be to use GCD by itself, without RunLoops. Look at the docs for dispatch_main. In a server application, there's (typically) nothing special about a "main thread," so dispatching to the "main queue" should be good enough (if needed at all). You can use dispatch "sources" to manage your connections, etc. I can't personally speak to how dispatch sources scale up to the C10K/C100K/etc. level, but they've seemed pretty lightweight and low-overhead in my experience. I also suspect that using GCD like this would be the most idiomatic way to write a server application in Swift. I've written up a small example of a GCD-based TCP echo server as part of another answer here.
If you were bound and determined to use both RunLoop and libevent in the same application, it would, as you guessed, be best to give libevent its own separate thread, but I don't think it's as complex as you might think. You should be able to dispatch_async from libevent callbacks freely, and similarly marshal replies from GCD-managed threads to libevent using libevent's multi-threading mechanisms fairly easily (i.e. either by running with locking on, or by marshaling your calls into libevent as events themselves). Similarly, third-party libraries using GCD should not be an issue even if you chose to use libevent's loop structure. GCD manages its own thread pools and would have no way of stepping on libevent's main loop, etc.
You might also consider architecting your application such that it didn't matter what concurrency and connection handling library you used. Then you could swap out libevent, GCD, CFStreams, etc. (or mix and match) depending on what worked best for a given situation or deployment. Choosing a concurrency approach is important, but ideally you wouldn't couple yourself to it so tightly that you couldn't switch if circumstances called for it.
When you have such an architecture, I'm generally a fan of the approach of using the highest level abstraction that gets the job done, and only driving down to lower level abstractions when specific circumstances require it. In this case, that would probably mean using CFStreams and RunLoops to start, and switching out to "bare" GCD or libevent later, if you hit a wall and also determined (through empirical measurement) that it was the transport layer and not the application layer that was the limiting factor. Very few non-trivial applications actually get to the C10K problem in the transport layer; things tend to have to scale "out" at the application layer first, at least for apps more complicated than basic message passing.
I am starting with Expert Advisors on MetaTrader Terminal software and I have many algorithms to use with it. These algorithms were developed in MATLAB using its powerful built-in functions ( e.g. svd, pinv, fft ).
To test my algorithms I have some alternatives:
1. Write all the algorithms in MQL5.
2. Write the algorithms in C++ and then make a DLL to call from MQL5.
3. Write the algorithms in Python to embed in C and then make a DLL.
4. Convert the MATLAB source code to C and then make a DLL.
About the problems:
1. Impracticable, because MQL5 does not have these built-in functions, so I would have to implement them one by one by hand.
2. I have not tried this yet, but I think it will take a long time to implement the algorithms ( I wrote some algorithms in C, but it took a good while and the result wasn't as fast as MATLAB ).
3. I am getting a lot of errors when compiling to a DLL, but if I compile to an executable there is no error ( this would be a good alternative, since converting MATLAB to Python is quite simple and fast to do ).
4. I am trying this now, but I think there is a lot of work to do.
I researched other software similar to MetaTrader Terminal, but I didn't find a good one.
I would like to know if there is a simpler ( and faster ) way to embed another language in MQL5, or some alternative for my issue.
Thanks.
Yes, there is an alternative ... 5 ) Go Distributed :
having a similar motivation for using non-MQL4 code for fast & complex mathematics in external quantitative models for FX-trading, I have started to use both { MATLAB | python | ... } and MetaTrader Terminal environments in an interconnected form of a heterogeneous distributed processing system.
MQL4 part is responsible for:
anAsyncFxMarketEventFLOW processing
aZmqInteractionFRAMEWORK setup and participation in message-patterns handling
anFxTradeManagementPOLICY processing
anFxTradeDetectorPolicyREQUESTOR sending analysis RQST-s to remote AI/ML-predictor
anFxTradeEntryPolicyEXECUTOR processing upon remote node(s) indication(s)
{ MATLAB | python | ... } part is responsible for:
aZmqInteractionFRAMEWORK setup and participation in message-patterns handling
anFxTradeDetectorPolicyPROCESSOR receiving & processing analysis RQST-s from remote { MQL4 | ... } -requestor
anFxTradeEntryPolicyREQUESTOR sending trade entry requests to remote { MQL4 | other-platform | ... }-market-interfacing-node(s)
Why to start thinking in a Distributed way?
The core advantage is in re-using the strengths of MATLAB and other COTS AI/ML-packages, without any need to reverse engineer the still creeping MQL4 interfacing options ( yes, in the last few years, DLL-interfaces had several dirty hits from newer updates: strings ceased to be strings and started to become a struct (!!!) etc. -- many man*years of pain with a code-base under maintenance, so there is some un-forgettable experience of what ought to be avoided ... ).
The next advantage is to become able to add failure-resilience. A distributed system can work in ( 1 + N ) protected shadowing.
The next advantage is to become able to increase performance. A distributed system can provide a pool of processors - be it in a { SEQ | PAR }-mode of operations ( a pipeline-process or a parallel-form process execution ).
MATLAB node just joins:
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% MATLAB script to setup zeromq-matlab
clear all;
if ~ispc
    s1 = zmq( 'subscribe', 'ipc', 'MATLAB' );          %% using IPC transport on <localhost>
else
    disp( '0MQ IPC not supported on Windows.' )
    disp( 'Setup TCP transport class instead' )
    disp( 'Setting up TCP')                            %% using TCP transport on <localhost>
    s1 = zmq( 'subscribe', 'tcp', 'localhost', 5555 );
end
recv_data1 = [];                                       %% setup RECV buffer
This said, one can preserve strengths on each side and avoid any form of duplications of already implemented native, high-performance tuned, libraries, while the distributed mode of operations also adds some brand new potential benefits for Expert Advisor modus operandi.
one may add a remote keyboard interface to an EA automation and use some custom-specific commands ( CLI )
a fast, non-blocking, distributed remote logging
GPU / GPU-grid computing being used from inside MetaTrader Terminal
may like to check other posts on extending MetaTrader Terminal programming models
A Distributed System, on top of a Communication Framework:
MATLAB already has a port of the ZeroMQ Communication Framework available, the same framework that MetaTrader Terminal has thanks to Austin CONRAD's wrapper ( though the MQH interfaces to a ver 2.1.11 DLL, the services needed work like a charm ). Both sides are therefore ready to use it, and each type of node can join its respective role in whatever form one designs into a truly heterogeneous distributed system.
My recent R&D uses several instances of python-side processes to operate AI/ML-predictor, r/KBD, r/RealTimeANALYSER and a centralised r/LOG services, that are actively used, over many PUSH/PULL + XREQ/XREP + PUB/SUB Scalable Formal Communication Patterns, from several instances of MetaTrader Terminal-s by their respective MQL4-code.
MATLAB functions could be re-used in the same way.
I've written some code that has an RTC component in it. It's a bit difficult to do proper emulation of the code because the clock speed is set to 50MHz so to see any 'real time' events take place would take forever. I did try to do simulation for 2 seconds in modelsim but it ended up crashing.
What would be a better way to do it if I don't have an evaluation board to burn and test using scope?
If you could provide a little more specific example of exactly what you're trying to test and what is chewing up your simulation cycles that would be helpful.
In general, if you have a lot of code that you need to test in simulation, it's helpful if you can create testbenches of the sub-modules and test them first. Often, if you simulate at the top (chip) level and try to stimulate sub-modules that are buried deep in the hierarchy of a design, it takes many clock ticks just to get data into and out of the sub-module. If you simulate the sub-module directly, you have direct access to the module's I/O and can test the things you want to test in that module in fewer cycles than if you try to get to it from the top level.
If you are trying to test logic that has very deep fifos that you are trying to fill or a specific count of a large counter you're trying to hit, you can either add logic to your code to help create those conditions in fewer cycles (like a load instruction on the counter) or you can force the values of internal signals of your design from the testbench itself.
These are just a couple of general ideas. Again, if you provide more detail about what it is you're simulating there are probably people on this forum that can provide help that is more specific to your problem.
As already mentioned by Ciano, if you provided more information about your design we would be able to give a more accurate answer. However, there are several tips that hardware designers should follow, especially for complex system simulation. Some of them (the ones I mostly use) are listed below:
Hierarchical simulation (as Ciano already posted): instead of simulating the entire system, try to simulate a smaller set of modules.
Selective configuration: most systems require some initialization processes, such as reset initialization time, external chip register initialization, etc. Usually a few of them are not required for simulation purposes, and you may use a global constant to skip these stages when simulating, like:
constant SIMULATION_ENABLE : STD_LOGIC := '1';
-- ...
-- in reset condition:
if SIMULATION_ENABLE = '1' then
    currentState <= state_executeSystem; -- jump the initialization procedures
else
    currentState <= state_initializeSystem;
end if;
Be careful: do not modify your code directly (hard-coded changes). As the system grows, it becomes impossible to remember which parts of it you modified for simulation. Use constants instead, as in the above example, to configure modules for a simulation profile.
Scaled time/size constants: instead of always using the real values for times and sizes (such as event times, memory sizes, register file size, etc.), use scaled values whenever possible. For example, if you are building an RTC that generates an interrupt to the main system every 60 seconds, scale your constants (if possible) to generate interrupts at a much shorter interval (for example 6 ms or 60 us). Of course, the choice of scale depends on your system. In my designs, I use two global configuration files, one for simulation and the other for synthesis. Most constant values in the simulation file are scaled down to shorten simulation time.
Increase the abstraction: for bigger modules it might be useful to create a simplified and more abstract module, acting as a model of your module. For example, if you have a processor that has this RTC (the one you mentioned) as a peripheral, you may create a simplified module of this RTC. Supposing that you only need its interrupt, you may create a simplified model such as:
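type time_array is array (natural range <>) of time;
constant INTERRUPT_EVENTS : time_array(1 to 2) := (
    32 ns,
    100 ms
);

process
begin
    for i in INTERRUPT_EVENTS'range loop
        rtcInterrupt <= '0';
        wait for INTERRUPT_EVENTS(i);  -- idle until the next scheduled event
        rtcInterrupt <= '1';
        wait until rising_edge(clk);   -- keep the interrupt raised until the next clock edge
    end loop;
    wait;  -- all events issued: suspend the process forever
end process;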
I have checked java.nio.file.Files.copy but that blocks a thread until the copy is done. Are there any libraries that allow one to copy a file in a non-blocking way? I need to perform many of these operations simultaneously and cannot afford to have so many threads blocked.
While I could write something myself using non-blocking streams, I would rather use something tried and tested that would guarantee a correct copy every time (or detect if something went wrong).
Check this: Iterate over lines in a file in parallel (Scala)?
import scala.io.Source

val chunkSize = 128 * 1024 // number of lines per group
val iterator = Source.fromFile(path).getLines.grouped(chunkSize)
iterator.foreach { lines =>
  // process the lines of each group in parallel
  lines.par.foreach { line => process(line) }
}
This reads (copies) the file in chunks and processes each chunk in parallel; here "par" is used.
So it is fairly non-blocking in terms of processors (cores).
You can apply the same chunking idea with Akka, Futures, or Promises to spread the work even more widely.
You can tune the chunk size depending on your performance characteristics, level of system load, etc.
One more link explains a possible way to read / write data from a (property) file in parallel using Akka Actors. This is not quite what you want, but it may give you an idea.
The idea: you can build your own non-blocking way of reading / copying files.
--
And about your statement "While I could write something myself using non-blocking streams":
I would remind you that each OS / file system (FS) may have its own view of what blocks and where. For example, Windows locks a file (a write lock at least) while one thread is writing to it. On Linux it is configurable. So if you want something stable, I would suggest thinking it through and going with your own wrapper (over the FS) based on events, chunks, and states.
I have used the Process class, issuing an operating system command to copy the file. Of course, one has to check under which OS the application is running, and issue the appropriate command, but this allows for fast and asynchronous copies.
As Marius rightly mentions in the comments, Scala Process blocks, so I run it wrapped in a Future.
Java 8's Process introduces an isAlive() method. A non-blocking alternative would be to use Java 8 processes and use the scheduler to poll at regular intervals to see whether the process has finished. However, I did not need to go to this extent.
Have you checked out the async stuff in scala-io?
http://jesseeichar.github.io/scala-io-doc/0.4.2/index.html#!/core/async%20read%20write
In my application I've written all of my Rx code to use Scheduler.Default.
I wanted to know if there's a difference between specifying Scheduler.Default and not specifying a scheduler at all?
What is the strategy employed by System.Reactive.Concurrency.DefaultScheduler?
Rx uses an appropriate strategy depending on the platform-specific PlatformServices that are loaded, so the approach can differ from case to case. The OOB implementation looks at whether Threads are available on your platform, and if so uses Threads and the platform Timer implementation to schedule items; otherwise it uses Tasks. The latter case arises in Windows 8 Apps, for example.
You can find a good video about how platform services are implemented from the creator here: http://channel9.msdn.com/Shows/Going+Deep/Bart-De-Smet-Rx-20-RTM-and-RTW
Look here for information about how the built-in operators behave when you do and don't specify a scheduler: http://msdn.microsoft.com/en-us/library/hh242963(v=vs.103).aspx
Yes, there is a difference between specifying Scheduler.Default and not specifying a scheduler. Using Scheduler.Default will introduce asynchronous and possibly concurrent behavior, while not supplying a scheduler leaves it up to the discretion of the operator. Some operators will execute synchronously, others asynchronously, and others will jump threads.
It is probably a bad idea (for performance and possibly even correctness since too much concurrency might lead you into a deadlock situation) to supply Scheduler.Default to every Rx operator. If you do not have a specific scheduling requirement, then do not supply a scheduler and let the operator pick what it needs.
For example,
this will complete synchronously:
int result = 0;
Observable.Return(42).Subscribe(v => result = v);
// result == 42 at this point
while this will complete asynchronously (and likely on another thread):
int result = 0;
Observable.Return(42, Scheduler.Default).Subscribe(v => result = v);
// result == 0 at this point (most likely)
// some time later
// result == 42