I have an agent-based simulation (using Java-based Repast) that generates a time series in its output for my different treatments. I am measuring performance through time, and at each time tick the performance is the mean of 30 runs (30 samples). In all of the treatments the performance starts near 0 and ends at 100%, but at different speeds. I was wondering if there is any stochastic model or probabilistic way to compare the speed or growth of these time series. I want to find out which one "significantly" grows faster.
Thanks.
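This question has no answer in the thread, so here is only a hedged sketch of one common approach: reduce each individual run to a scalar "time to reach a threshold" and compare the 30 values per treatment with a nonparametric test. The 90% threshold, the function names, and the array shapes are illustrative assumptions, not something from the original post.

```python
import numpy as np
from scipy.stats import mannwhitneyu

def time_to_threshold(runs, threshold=90.0):
    """runs: array of shape (n_runs, n_ticks), performance in percent (0-100)."""
    return np.array([np.argmax(run >= threshold) for run in runs])

def compare_treatments(treatment_a, treatment_b):
    # treatment_a, treatment_b: hypothetical (30, n_ticks) arrays exported from Repast
    t_a = time_to_threshold(treatment_a)
    t_b = time_to_threshold(treatment_b)
    stat, p = mannwhitneyu(t_a, t_b, alternative="two-sided")
    return p  # a small p-value suggests the treatments reach 90% at significantly different times
```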
There is an option to set the "Number of intervals" during the simulation in Dymola, but I am not sure how to choose the intervals in a real model.
So I did some tests. Here are my conclusions and questions:
The number of intervals shouldn't be too small, because when it is too small the simulation result can deviate from the true value.
So I always set the number of intervals above 500; that seems to work fine right now, but I am not sure if I should increase it.
With a larger number of intervals the simulation takes longer, so maybe I could decrease it.
In short, I am looking for a principle to determine how to choose the number of intervals.
The intervals are just the ones for the plot window, not for the simulation.
If you choose 500 you will get 500 points plotted with linear interpolation. The actual step size for the simulation is chosen by the DASSL integrator to ensure stability. The tolerance option can influence that behaviour. If you choose another solver you can even set a fixed number of integration steps, but I would strongly recommend leaving that decision to the solver.
In TensorFlow, I used to run CNN training for a fixed number of epochs and save checkpoints after a specified number of epochs. To evaluate the model, the checkpoints are restored and predictions are made on the validation dataset.
I want to automate the learning process instead of using a fixed number of epochs. Please explain how the loss value over mini-batches can be used to determine the stopping point. Also, please help me with implementing learning rate decay in TensorFlow. Which is better, constant or exponential decay, and how do I determine the decay factor?
First, for the number of iterations: you can stop training when the loss stops improving on the batches, i.e. when the difference between two loss values, averaged across batches (to reduce batch-to-batch fluctuations), is less than a chosen threshold.
But you probably realized that the threshold is a hyperparameter too!
In fact there are quite a few attempts to completely automate ML, but no matter what you do you still end up with some hyperparameters.
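A minimal sketch of that stopping rule (the threshold, the 200-epoch cap, and the train_one_epoch callable are all assumptions for illustration):

```python
import numpy as np

def train_with_early_stopping(train_one_epoch, max_epochs=200, threshold=1e-4):
    """train_one_epoch() is a hypothetical callable returning the mini-batch losses of one epoch."""
    prev_avg = None
    for epoch in range(max_epochs):
        batch_losses = train_one_epoch()          # one full pass over the training data
        avg_loss = float(np.mean(batch_losses))   # average across batches to smooth fluctuations
        if prev_avg is not None and abs(prev_avg - avg_loss) < threshold:
            print(f"stopping at epoch {epoch}: loss change below {threshold}")
            break
        prev_avg = avg_loss
```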
Secondly, the decay factor is used when you feel the loss has stopped improving and you think you are in a local minimum, oscillating in and out of the well without actually settling in (this metaphor only works in two dimensions, but I still find it useful).
Almost every time it is done in the literature it looks very hand-tuned: you train for 200 epochs, see that the loss has reached a plateau, decrease your learning rate with a step function (the staircase=True argument in TF), and then repeat.
What is commonly used is to divide the learning rate by 10 (exponential decay), but as before it is quite arbitrary!
For details on how to implement learning rate decay in TF you can see dga's answer in this SO question.
It is pretty straightforward!
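A minimal sketch of the staircase-style exponential decay described above, assuming the TF 1.x tf.train API (the toy loss and the concrete decay numbers are illustrative, not prescriptive):

```python
import tensorflow as tf               # assumes TensorFlow 1.x

w = tf.Variable(5.0)                  # toy parameter standing in for your model weights
loss = tf.square(w)                   # toy loss standing in for your model's loss

global_step = tf.Variable(0, trainable=False)
learning_rate = tf.train.exponential_decay(
    learning_rate=0.1,                # initial learning rate
    global_step=global_step,
    decay_steps=10000,                # decay every 10k steps ...
    decay_rate=0.1,                   # ... by dividing the rate by 10
    staircase=True)                   # step function rather than a smooth curve
train_op = tf.train.GradientDescentOptimizer(learning_rate).minimize(
    loss, global_step=global_step)    # each training step increments global_step
```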
What can help with the schedule and the values you use is cross-validation, but oftentimes you can simply look at your loss and do it by hand.
There is no silver bullet in deep learning; it is just trial and error.
I have a Simulink model with a master clock of 4410 Hz. I know for a fact that the computation time of some algorithms (e.g. cubic spline interpolation on a 4410-sample frame being accumulated in real time) is much longer than the master time period (the spline computation takes roughly 0.7 seconds). I would expect Simulink to output the frame elements AFTER the initial 1 second plus a propagation delay (as in hardware description languages, e.g. VHDL), but it actually starts outputting the elements of the frame just after one second (which is the length of the frame, 4410/4410 seconds). This wouldn't be a problem if my output values weren't unexpected/wrong.
How does Simulink build the simulation in this case? It would appear that it stops the simulation for larger computation times, then continues it afterwards.
A Simulink simulation assumes infinite computation capacity; it does not simulate computation times. It does not stop the simulation, and it does not use a real clock at all. While Simulink is a bit more complicated with its different solvers, you can take a look at discrete event simulation, which should give a simple example of isolating the simulation clock from your real clock.
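Not Simulink, just a toy Python discrete-event loop to illustrate the point: the simulation clock is purely logical, so a slow handler (emulated here with time.sleep) never delays simulated time.

```python
import heapq
import time

# Toy event queue: five ticks of a 4410 Hz master clock (timestamps in simulated seconds)
events = [(t / 4410.0, "sample") for t in range(5)]
heapq.heapify(events)

wall_start = time.time()
while events:
    sim_time, _ = heapq.heappop(events)
    time.sleep(0.1)   # pretend the handler needs 0.1 s of real computation
    print(f"simulated t = {sim_time:.6f} s, wall-clock t = {time.time() - wall_start:.2f} s")
# Simulated time advances only by the event timestamps, no matter how long each
# handler takes on the real clock.
```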
I am using the Q-learning algorithm on a simulation. The simulation has a limited number of iterations (600 to 700), and the learning process is run over several runs of this simulation (100 runs).
I am new to reinforcement learning, and I have a question about how to handle exploration/exploitation in this kind of simulation (I am using ε-greedy exploration).
I am using a decreasing exploration rate, and I am wondering whether I should decrease it over the whole set of simulation runs, or decrease it within each simulation run (initialize epsilon to 0.9 at the start of each run and then decrease it).
Thank you
You won't need such a high initial epsilon. It might be better to initialize the Q-values to a very high value (optimistic initialization), so that unexplored actions are always picked over actions that have been explored at least once.
Considering your state space, it doesn't matter much whether you decrease epsilon over the whole set of runs or within each individual run, but per run sounds like the better option.
How fast you decrease it will also depend on the circumstances of the world and how fast the agent learns. I'm trying to make my alpha and epsilon correlate with the error, but it's tricky to do that.
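A small sketch of the two ideas above, epsilon reset and decayed within each run plus optimistic Q-value initialization (the action count, the decay schedule, and the env_step transition function are invented for illustration):

```python
import random
from collections import defaultdict

N_ACTIONS = 4
Q = defaultdict(lambda: [10.0] * N_ACTIONS)   # optimistic init: untried actions look best

def choose_action(state, epsilon):
    if random.random() < epsilon:
        return random.randrange(N_ACTIONS)                    # explore
    return max(range(N_ACTIONS), key=lambda a: Q[state][a])   # exploit

def run_simulation(env_step, n_iterations=650, eps_start=0.9, eps_min=0.05, decay=0.995):
    epsilon = eps_start                    # reset at the start of each simulation run
    state = 0
    for _ in range(n_iterations):
        action = choose_action(state, epsilon)
        state = env_step(state, action)    # hypothetical environment transition
        epsilon = max(eps_min, epsilon * decay)
```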
I have questions about the performance of a real application running on a cluster vs. the cluster's peak performance.
Let's say one HPC cluster report that it has peak performance of 1 Petaflops. How is this calculated?
To me, it seems that there are two ways of measuring this. One is the performance calculated from the hardware specifications. The other comes from running HPL. Is my understanding correct?
When I read about a real application running on the system at full scale, the developers mention that it achieves about 10% of the peak performance. How is this measured, and why can't it reach peak performance?
Thanks
Peak performance is what the system is theoretically able to deliver. It is the product of the total number of CPU cores, the core clock frequency, and the number of FLOPs one core makes per clock tick. That performance can never be reached in practice because no real application consists of 100% fully vectorised tight loops that only operate on data held in the L1 data cache. In many cases data doesn't even fit in the last-level cache and the memory interface is usually not fast enough to deliver data at the same rate at which the CPU is able to process it. One ubiquitous example from HPC is the multiplication of a sparse matrix with a vector. It is so memory intensive (i.e. many loads and stores per arithmetic operation) that on many platforms it only achieves a fraction of the peak performance.
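A quick worked instance of that product, with invented hardware numbers:

```python
cores = 25_000                 # total CPU cores in the cluster (assumed)
clock_hz = 2.5e9               # core clock frequency: 2.5 GHz (assumed)
flops_per_cycle = 16           # e.g. one 512-bit FMA unit: 8 doubles x 2 ops per cycle
peak = cores * clock_hz * flops_per_cycle
print(f"theoretical peak: {peak / 1e15:.1f} PFLOP/s")   # -> 1.0 PFLOP/s
```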
Things get even worse when multiple nodes are networked together on a massive scale, as data transfers can introduce huge additional delays. Performance in those cases is determined mainly by the ratio of local data processing to data transfer. HPL is particularly good in that respect - it does a lot of vectorised local processing and does not move much data across the CPUs/nodes. That's not the case with many real-world parallel programs, which is also the reason why many are questioning the applicability of HPL for assessing cluster performance nowadays. Alternative benchmarks are already emerging, for example the HPCG benchmark (from the people who brought you HPL).
The theoretical (peak) value is based on the capability of each individual core in the cluster, which depends on clock frequency, number of floating point units, parallel instruction issuing capacity, vector register sizes, etc. which are design characteristics of the core. The flops/s count for each core in the cluster is then aggregated to get the cluster flops/s count.
For a car the equivalent theoretical performance would be the maximum speed it can reach given the specification of its engine.
For a program to reach the theoretical count, it has to perform specific operations in a specific order so that the instruction-level parallelism is maximum and all floating-point units are working constantly without delay due to synchronization or memory access, etc. (See this SO question for more insights)
For a car, it is equivalent to measuring top speed on a straight line with no wind.
But of course, the chances that such a program computes something of interest are small. So benchmarks like HPL use actual problems in linear algebra, with a highly optimized and tuned implementation that is still imperfect due to I/O operations and the fact that the order of operations is not optimal.
For a car, it could be compared to measuring the top average speed on a race track with straight lines, curves, etc.
If the program requires a lot of network or disk communication, which are operations that cost many clock cycles, then the CPU often has to sit idle waiting for data before it can perform arithmetic operations, effectively wasting a lot of computing power. The actual performance is then estimated by dividing the number of floating-point operations (additions and multiplications) the program performs by the time it takes to perform them.
For a car, this would correspond to measuring the top average speed in town with red lights, etc. by calculating the length of the trip divided by the time needed to accomplish it.
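To tie the two measurements together, here is a small worked example with invented numbers: the sustained rate is the application's operation count divided by its runtime, and the often-quoted percentage is that rate over the theoretical peak.

```python
flop_count = 3.6e18            # additions + multiplications performed by the application (assumed)
runtime_s = 36_000.0           # wall-clock time of the full-scale run: 10 hours (assumed)
peak = 1.0e15                  # theoretical peak of the cluster: 1 PFLOP/s (assumed)

sustained = flop_count / runtime_s                  # 1.0e14 FLOP/s
print(f"sustained: {sustained / 1e12:.0f} TFLOP/s, "
      f"{100 * sustained / peak:.0f}% of peak")     # -> 100 TFLOP/s, 10% of peak
```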