In Karpathy's well-written post on policy gradients (http://karpathy.github.io/2016/05/31/rl/), he mentions a "block of differentiable compute".
What exactly is a "block of differentiable compute"?
In a variational autoencoder, the objective function has two terms: one that makes the input and the output x match (the reconstruction term), and a regularizer that pushes q(z) and p(z) to be close in KL divergence.
What I don't understand is why we can assume that p(z) is a standard Gaussian with mean 0 and variance 1.
Why not, say, a variance less than 1, so that more information is condensed with narrower Gaussians in the hidden layer?
Thank you
Provided the network is sufficiently powerful to synthesize complex functions, the shape of the prior should be, in theory, largely immaterial. In the specific case of the variance of the Gaussian you take as prior, the network can easily adapt to a different variance by scaling the relevant statistics of the posterior distributions Q(z|X), and suitably rescaling the sampled values in the next layer of the network. The resulting network would have precisely the same behaviour (and loss) as the previous one. So, the variance of the prior Gaussian merely fixes the unit of measure for the latent space. The topic is discussed in the excellent tutorial on Variational Autoencoders by Doersch (Section 2.4.3); you might also be interested in having a look at my blog.
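To make the rescaling argument concrete, here is a short worked check (my own notation, not from the tutorial), using the closed-form KL divergence between Gaussians. With a standard-normal prior,

$$\mathrm{KL}\!\left(\mathcal{N}(\mu,\sigma_q^2)\,\|\,\mathcal{N}(0,1)\right)=\tfrac{1}{2}\left(\mu^2+\sigma_q^2-1-\log\sigma_q^2\right).$$

If you instead choose the prior N(0, s^2) and let the encoder rescale its outputs to s*mu and s*sigma_q (with the decoder dividing the samples it receives by s, so the reconstruction term is untouched), then

$$\mathrm{KL}\!\left(\mathcal{N}(s\mu,s^2\sigma_q^2)\,\|\,\mathcal{N}(0,s^2)\right)=\tfrac{1}{2}\left(\frac{s^2\mu^2+s^2\sigma_q^2}{s^2}-1-\log\frac{s^2\sigma_q^2}{s^2}\right)=\tfrac{1}{2}\left(\mu^2+\sigma_q^2-1-\log\sigma_q^2\right),$$

exactly the same loss as before, which is the sense in which the prior variance only sets the unit of measure for the latent space.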
I'm computing multiple integrals using MATLAB.
I'm using the integral function to compute it, but I was wondering: would it be faster to use trapz instead of integral?
I know that trapz introduces a bit of error in the computation, but despite that, which is the best function to compute integrals in MATLAB?
Short and sweet:
Use trapz for discrete data, or for sampled functional data when you can tolerate the (potentially extremely) low accuracy of the integral value.
Use integral for integrands that have a functional form, adjusting tolerances as needed for speed.
As mentioned by the MATLAB documentation, trapz is intended "to perform numerical integrations on discrete data sets" and leverages the trapezoidal rule for the integrations. The error between the true integral and the trapz approximation is almost entirely dependent on the input x vector (sometimes called the abscissa in integration parlance) with no automatic adaptability. The good part is that if the underlying function is "nice" (i.e., continuous, smooth, no sharp peaks or excessive oscillations, etc.), trapz will likely be the fastest function to approximate the integral since it
Doesn't have to call a function for values (they're input)
Doesn't automatically adapt (which takes time and can be complex to implement).
However, for general integrals, trapz may also be the most inaccurate and may require a denser x vector to calculate a low-error value.
For discrete data, this is a shortcoming that must be lived with, but if the integrand has a functional form, integral and its family are highly recommended.
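As a quick illustration of the trade-off (a sketch using an arbitrary integrand of my own choosing, not from the question):

```matlab
f  = @(x) exp(-x.^2).*sin(5*x);     % a smooth integrand with a functional form

% trapz: fixed, non-adaptive trapezoidal rule on whatever grid you supply;
% the accuracy is dictated entirely by that grid.
xs = linspace(0, 2, 1e4);
q_trapz = trapz(xs, f(xs));

% integral: adaptive quadrature; tolerances trade speed for accuracy.
q_int = integral(f, 0, 2, 'RelTol', 1e-10, 'AbsTol', 1e-12);

disp(abs(q_trapz - q_int))          % discrepancy shrinks as the trapz grid is refined
```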
The black-box numeric integrators in MATLAB have evolved over the years, and MathWorks co-founder Cleve Moler has a nice blog post going over some of that evolution. The post discusses the quad, quadl, and quadgk functions and how quadgk became the core for integral and its ilk. The basic breakdown of the three functions is
quad uses a three-point and five-point Simpson's Rule
quadl uses a four-seven-thirteen point[1] Lobatto-Kronrod[2] rule
quadgk uses a seven-fifteen point Gauss-Kronrod[2] rule
to acquire both an approximation of the integral and an error approximation for adaptive quadrature. The summary of the history lesson and test problems is that quadgk was written with vectorization incorporated[3], uses a higher-order rule which excludes end-points, and gives extremely accurate answers faster than its competitors. As a result, quadgk is the core of the new and highly recommended integral family.
[1] Adaptive quadrature usually lists the number of points used to form its approximation of the value and the error. Typically, there are two numbers that indicate the number of points to form the low-order and high-order approximations. quadl is interesting in that it uses a four-point Gauss-Lobatto rule and seven-point and thirteen-point Kronrod extensions for its error handling.
[2] Gaussian quadrature, which is an integration technique that chooses its abscissae to exactly integrate a family of polynomials over a given interval instead of prescribing them as in Newton-Cotes, has a lot of names associated with it to indicate a lot of "stuff" that's going on without being explicit about it (which can be very annoying to newcomers). "Gauss" refers to the aforementioned method of choosing abscissae and associated weights for the integration. "Lobatto" indicates an extension to Gauss-Legendre integration methods that incorporates end-points (others may not like my link between these two, but I find the parallels pleasing). "Kronrod" indicates an extension to any particular Gauss rule that creates a higher-order rule by taking a given set of abscissae and adding to it; this creates a "nesting" (the low-order points are part of the high-order point set) that results in fewer function evaluations overall.
[3] Since vectorization is written into integral, integrands or limits that are vector-valued must use the 'ArrayValued' flag to tell the program to make functional evaluations differently so as not to create a size-mismatch error. It might be possible to program around this to a certain extent, but the MathWorks decided not to.
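For example (my own toy integrand, just to show the flag):

```matlab
% Without 'ArrayValued', integral hands the integrand a whole vector of
% abscissae at once, which would clash with an integrand that already
% returns a vector for each scalar input.
g = @(t) [sin(t); cos(t); t.^2];                 % 3-by-1 output per scalar t
Q = integral(g, 0, pi, 'ArrayValued', true);     % Q is 3-by-1: [2; 0; pi^3/3]
```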
OK, so I'm doing RMSprop or SGD to get a neural network to learn its parameters. But after a while, both training and validation errors appear to have stagnated (outside of random fluctuations: I'm using dropout).
So I decided to try using conjugate gradient to refine the values. I still obviously don't want it to overfit, so I kept the dropout... But, of course, this makes the objective function noisy. So, I guess my question is: does conjugate gradient (or L-BFGS, etc.) require noiseless functions? Or can they work in the presence of noise?
Thanks!
Deterministic gradient-based optimization algorithms such as conjugate gradient and L-BFGS are highly sensitive to noise, because both the derivative estimates and the line searches they rely on are thrown off by the fluctuations the noise introduces across the function domain.
To optimize noisy objective functions it is better to use heuristic algorithms: genetic algorithms, simulated annealing, ant colony optimization, particle swarm optimization... They are not based on gradients and therefore do not share this weakness.
You can read more about these algorithms in the book:
Duc Pham, D. Karaboga, Intelligent Optimisation Techniques. London, United Kingdom: Springer-Verlag London, 2000.
If you are interested in Simulated Annealing, you can also read:
Peter Rossmanith, “Simulated Annealing,” in Algorithms Unplugged, B. Vöcking, H. Alt, M. Dietzfelbinger, R. Reischuk, C. Scheideler, H. Vollmer, and D. Wagner, Eds. Berlin, Germany: Springer-Verlag Berlin Heidelberg, 2011, ch. 41, pp. 393-400.
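If you want to see why such methods tolerate noise, here is a minimal simulated-annealing sketch (MATLAB, with a toy noisy objective of my own standing in for your dropout-perturbed loss; every name in it is illustrative):

```matlab
noisyLoss = @(w) sum((w - 1).^2) + 0.1*randn;   % toy objective with additive noise

w  = zeros(5, 1);          % current point
fw = noisyLoss(w);
T  = 1.0;                  % initial "temperature"
for k = 1:2000
    wNew = w + 0.1*randn(size(w));              % random proposal (no gradients used)
    fNew = noisyLoss(wNew);
    % Always accept improvements; accept worsenings with probability exp(-df/T),
    % which is what lets the method shrug off noisy or rugged objectives.
    if fNew < fw || rand < exp(-(fNew - fw)/T)
        w = wNew;  fw = fNew;
    end
    T = 0.999*T;                                % geometric cooling schedule
end
disp(w)                    % should end up near the all-ones minimizer
```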
I have been trying to implement a navigation system for a robot that uses an Inertial Measurement Unit (IMU) and camera observations of known landmarks in order to localise itself in its environment. I have chosen the indirect-feedback Kalman Filter (a.k.a. Error-State Kalman Filter, ESKF) to do this. I have also had some success with an Extended KF.
I have read many texts and the two I am using to implement the ESKF are "Quaternion kinematics for the error-state KF" and "A Kalman Filter-based Algorithm for IMU-Camera Calibration" (pay-walled paper, google-able).
I am using the first text because it better describes the structure of the ESKF, and the second because it includes details about the vision measurement model. In my question I will be using the terminology from the first text: 'nominal state', 'error state' and 'true state', which refer to the IMU integrator, the Kalman filter, and the composition of the two (nominal minus errors), respectively.
The diagram below shows the structure of my ESKF implemented in Matlab/Simulink; in case you are not familiar with Simulink I will briefly explain the diagram. The green section is the Nominal State integrator, the blue section is the ESKF, and the red section is the sum of the nominal and error states. The 'RT' blocks are 'Rate Transitions' which can be ignored.
My first question: Is this structure correct?
My second question: How are the error-state equations for the measurement models derived?
In my case I have tried using the measurement model of the second text, but it did not work.
Kind Regards,
Your block diagram combines two indirect methods for bringing IMU data into a KF:
You have an external IMU integrator (in green, labelled "INS", sometimes called the mechanization, and described by you as the "nominal state", but I've also seen it called the "reference state"). This method freely integrates the IMU externally to the KF and is usually chosen so you can do this integration at a different (much higher) rate than the KF predict/update step (the indirect form). Historically I think this was popular because the KF is generally the computationally expensive part.
You have also fed your IMU into the KF block as u, which I am assuming is the "command" input to the KF. This is an alternative to the external integrator. In a direct KF you would treat your IMU data as measurements; in order to do that, the state would have to include (position, velocity, and) acceleration and (orientation and) angular velocity, since otherwise there is no possible H such that Hx can produce estimated IMU output terms. If you instead feed your IMU measurements in as a command, your predict step can simply act as an integrator, so you only have to model as far as velocity and orientation.
You should pick only one of those options. I think the second one is easier to understand, but it is closer to a direct Kalman filter, and requires you to predict/update for every IMU sample, rather than at the (I assume) slower camera framerate.
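For what it's worth, here is a hedged sketch of what that "predict step as an integrator" looks like with the IMU sample as the command u (all symbols and numbers below are illustrative placeholders, and attitude is reduced to a single yaw angle to keep it short):

```matlab
dt = 0.01;                              % IMU sample period
x  = zeros(5, 1);                       % state: [px; py; vx; vy; yaw]
P  = 0.1*eye(5);                        % state covariance
Qn = 1e-3*eye(5);                       % process noise covariance (placeholder)

u  = [0.2; 0.0; 0.05];                  % one IMU sample: [ax; ay; yaw rate]

% Predict: integrate the command instead of modelling acceleration in the state.
x(1:2) = x(1:2) + dt*x(3:4);            % position <- velocity
x(3:4) = x(3:4) + dt*u(1:2);            % velocity <- measured acceleration
x(5)   = x(5)   + dt*u(3);              % yaw      <- measured yaw rate

F = [eye(2),     dt*eye(2), zeros(2,1);  % Jacobian of the transition w.r.t. x
     zeros(2,2), eye(2),    zeros(2,1);
     zeros(1,4), 1];
P = F*P*F' + Qn;                         % covariance propagation
```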
Regarding measurement equations for version (1), in any KF you can only predict things you can know from your state. The KF state in this case is a vector of error terms, and thus you can only predict things like "position error". As a result you need to pre-condition your measurements in z to be position errors. So make your measurement the difference between your "estimated true state" and your position from "noisy camera observations". This exact idea may be represented by the xHat input to the indirect KF. I don't know anything about the MATLAB/Simulink stuff going on there.
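And a corresponding sketch of the pre-conditioned measurement update for version (1) (again, every symbol and number is a placeholder; a 15-element error state of position, velocity, attitude, accelerometer-bias and gyro-bias errors is assumed purely for illustration):

```matlab
pNominal = [10; 5; 1];                 % position from the INS/nominal-state integrator
pCamera  = [10.2; 4.9; 1.05];          % position fix from the landmark observations
dx = zeros(15, 1);                     % error state (typically zero after each reset)
P  = 0.1*eye(15);                      % error-state covariance
R  = 0.05*eye(3);                      % camera measurement noise covariance

z  = pNominal - pCamera;               % the KF "measurement" is a position *error*
H  = [eye(3), zeros(3, 12)];           % selects the position-error block of the state
S  = H*P*H' + R;                       % innovation covariance
K  = P*H'/S;                           % Kalman gain
dx = dx + K*(z - H*dx);                % updated error-state estimate
P  = (eye(15) - K*H)*P;                % covariance update
```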
Regarding real-world considerations for the summing block (in red) I refer you to another answer about indirect Kalman filters.
Q1) Your Simulink model looks to be appropriate. Let me shed some light on quaternion-mechanization-based KFs, which I've worked on for navigation applications.
Since the Kalman filter is an elegant mathematical technique that borrows from the science of stochastics and measurement, it can help you reduce the noise in the system without the need for elaborately modeling that noise.
All KF systems start with some preliminary understanding of the model that you want to make free of noise. The measurements are fed back to evolve the states better (the measurement equation Y = CX). In your case, the states that you are talking about are errors in the quaternions, which would be the four values dq1, dq2, dq3, dq4.
A KF working well in your application would accurately determine the attitude/orientation of the device by controlling the error around the quaternion. A quaternion represents the spatial orientation of a body using a scalar and a vector, more specifically an angle and an axis.
The error equations that you are talking about are covariances, which contribute to the Kalman gain. The covariances denote spread around the mean, and they are useful in understanding how the central/average behavior of the system changes with time. Low covariances denote less deviation from the mean behavior for any system. As the KF cycles run, the covariances keep getting smaller.
The Kalman Gain is finally used to compensate for the error between the estimates of the measurements and the actual measurements that are coming in from the camera.
Again, this elegant technique first ensures that the error in the quaternion values converge around zero.
Q2) The EKF is a great technique to use as long as you have a non-linear measurement construction. Be very careful in using the EKF if there are too many transformations in your system, i.e., don't try to reconstruct measurements using transformations of your states; this seriously affects the model's sanctity, and since the noise covariances would not undergo similar transformations, there is a chance of hitting a singularity as soon as matrices become non-invertible.
You could look at constant-gain KF schemes, which would save you from covariance propagation and save substantial computational effort and time. These techniques are quite new and look very promising. They absorb P (error covariance), Q (model noise covariance) and R (measurement noise covariance) into the fixed gain, and work well with EKF schemes.
I am facing a problem translating an ODE model into a stochastic model.
The original model contains two expressions:
1) (k6f2*PKB_S473P^n6/(km6^n6+PKB_S473P^n6))*AS160
2) k9f1*S6K*mTORC1a^n9/(km9^n9+mTORC1a^n9)
that are described with Hill kinetics.
I was wondering how to translate the previous two expressions into mass action laws, so that I can simulate them with a stochastic simulator.
Thanks all in advance!
Transforming Hill kinetics into mass action kinetics cannot be done in a fully satisfying way, because for large concentrations the former leads to bounded rates while the latter does not: a mass action law will not show the saturation effect that is seen with Hill kinetics. At best, you can approximate the Hill function locally with mass action, but then your simulation should stay near a stable steady state. Alternatively, if you introduce intermediate species and reactions with appropriate coefficients, you can obtain Michaelis-Menten kinetics (Hill kinetics with a unit Hill coefficient). In principle, you could even regain cooperativity using more complex intermediate species, but this is probably rather involved.
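For concreteness, the standard intermediate-species construction looks like this (a sketch, with illustrative rate constants $k_1$, $k_{-1}$, $k_2$ and a generic enzyme/substrate pair standing in for your kinases): take the three elementary mass-action reactions

$$E + S \xrightarrow{k_1} ES,\qquad ES \xrightarrow{k_{-1}} E + S,\qquad ES \xrightarrow{k_2} E + P.$$

Each of these can be fed directly to a stochastic (e.g. Gillespie) simulator. Under the quasi-steady-state assumption $d[ES]/dt \approx 0$, the deterministic limit of this scheme gives

$$v \;=\; \frac{k_2\,E_{\mathrm{tot}}\,[S]}{K_M + [S]},\qquad K_M = \frac{k_{-1}+k_2}{k_1},$$

which is the Michaelis-Menten form mentioned above (a Hill function with $n = 1$). Matching your expressions then amounts to choosing $k_2 E_{\mathrm{tot}}$ and $K_M$ to reproduce the fitted maximal rate and half-saturation constant; a Hill coefficient larger than 1 needs additional sequential binding steps.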