How to compute deterministic policy gradients in DDPG? - matlab

I am writing a MATLAB script that uses Deep Determininstic Policy Gradient to control an Active Suspension System (Dynamic System), but I am stuck on updating the actor network. All of the examples and articles I read, use tensorflow libraries like tf.gradients(). However, I need to know exactly how to apply chain rule to compute the deterministic policy gradient shown in the image to implement it in my MATLAB code.

Related

REINFORCE algorithm for a continuous action space

I have recently started exploring and playing around with reinforcement learning, and have managed to wrap my head around discrete action spaces, and have working implementations of a few environments in OpenAI Gym using Q-learning and Expected SARSA. However, I am running into some trouble understanding the handling of continuous action spaces.
From what I have understood so far, I have constructed a neural network that outputs the mean of a Gaussian distribution, with the standard deviation being fixed for now. Then using the output from the neural network I sample an action from the Gaussian distribution and perform this action in the environment. For each step in an episode I save the starting state, action and reward. Once the episode is over I am supposed to train the network, but this is were I am struggling.
From what I understand, the loss of policy network is calculated by the log-probability of the chosen action multiplied by the discounted reward of that action. For discrete actions this seems straightforward enough, have a softmax layer as your final layer and define a custom loss function that defines the loss as the logarithm of the softmax output layer multiplied by the target value which we set to be the discounted reward.
But how do you do this for a continuous action? The neural network outputs the mean, not the probability of an action or even the action itself, so how do I define a loss function to pass to keras to perform the learning step in TensorFlow for the continuous case?
I have read through a variety of articles on policy optimization, and while the article might mention the continuous case, all of the associated code always focuses on the discrete action space case for policy optimization, which is starting to become fairly disheartening. Can someone help me understand how to implement the continuous case in TensorFlow 2.0?

Output range for continuous control policy network

I tried to implement the simple vanilla policy gradient (REINFORCE) in a continuous control problem by adapting this pytorch implementation to the continuous case and I stumbled upon the following issue.
Usually, when the action space is discrete, the output of the policy network is bounded in (0,1)^n by the softmax function which gives the probability that the agent would pick a certain action given the state (input to the network). However, when the action space is continuous, for example if we have K action such that each action ak has lower and upper bounds lk anduk, I haven't found a way (empirical or theoretical) to limit the output of the network (which is usually the means and the standard deviations of the action probability distribution given the state) using lk and uk.
From the few trials I made, without constraining the output of the policy network, it was very hard, if not impossible, to learn a good policy, but i might be doing something wrong since i am new to reinforcement learning.
My intuition suggests me to limit the means and the standard deviations output of the policy network using, for example, a sigmoid and then scaling them with the absolute difference between lk and uk. I'm not quite sure how to do it properly though, considering also that the sampled action could exceed whatever bound you impose on the distribution parameters when using, for example, a gaussian distribution.
Am I missing something? Are there established ways to limit the output of the policy network for continuous action spaces or there's no need to do that at all?
I am not sure this is the right place for this question, if not I will be glad if you point to me a better place.

Episodic Semi-gradient Sarsa with Neural Network

While trying to implement the Episodic Semi-gradient Sarsa with a Neural Network as the approximator I wondered how I choose the optimal action based on the currently learned weights of the network. If the action space is discrete I can just calculate the estimated value of the different actions in the current state and choose the one which gives the maximimum. But this seems to be not the best way of solving the problem. Furthermore, it does not work if the action space can be continous (like the acceleration of a self-driving car for example).
So, basicly I am wondering how to solve the 10th line Choose A' as a function of q(S', , w) in this pseudo-code of Sutton:
How are these problems typically solved? Can one recommend a good example of this algorithm using Keras?
Edit: Do I need to modify the pseudo-code when using a network as the approximator? So, that I simply minimize the MSE of the prediction of the network and the reward R for example?
I wondered how I choose the optimal action based on the currently learned weights of the network
You have three basic choices:
Run the network multiple times, once for each possible value of A' to go with the S' value that you are considering. Take the maximum value as the predicted optimum action (with probability of 1-ε, otherwise choose randomly for ε-greedy policy typically used in SARSA)
Design the network to estimate all action values at once - i.e. to have |A(s)| outputs (perhaps padded to cover "impossible" actions that you need to filter out). This will alter the gradient calculations slightly, there should be zero gradient applied to last layer inactive outputs (i.e. anything not matching the A of (S,A)). Again, just take the maximum valid output as the estimated optimum action. This can be more efficient than running the network multiple times. This is also the approach used by the recent DQN Atari games playing bot, and AlphaGo's policy networks.
Use a policy-gradient method, which works by using samples to estimate gradient that would improve a policy estimator. You can see chapter 13 of Sutton and Barto's second edition of Reinforcement Learning: An Introduction for more details. Policy-gradient methods become attractive for when there are large numbers of possible actions and can cope with continuous action spaces (by making estimates of the distribution function for optimal policy - e.g. choosing mean and standard deviation of a normal distribution, which you can sample from to take your action). You can also combine policy-gradient with a state-value approach in actor-critic methods, which can be more efficient learners than pure policy-gradient approaches.
Note that if your action space is continuous, you don't have to use a policy-gradient method, you could just quantise the action. Also, in some cases, even when actions are in theory continuous, you may find the optimal policy involves only using extreme values (the classic mountain car example falls into this category, the only useful actions are maximum acceleration and maximum backwards acceleration)
Do I need to modify the pseudo-code when using a network as the approximator? So, that I simply minimize the MSE of the prediction of the network and the reward R for example?
No. There is no separate loss function in the pseudocode, such as the MSE you would see used in supervised learning. The error term (often called the TD error) is given by the part in square brackets, and achieves a similar effect. Literally the term ∇q(S,A,w) (sorry for missing hat, no LaTex on SO) means the gradient of the estimator itself - not the gradient of any loss function.

Neural network for approximation function for board game

I am trying to make a neural network for approximation of some unkown function (for my neural network course). The problem is that this function has very many variables but many of them are not important (for example in [f(x,y,z) = x+y] z is not important). How could I design (and learn) network for this kind of problem?
To be more specific the function is an evaluation function for some board game with unkown rules and I need to somehow learn this rules by experience of the agent. After each move the score is given to the agent so actually it needs to find how to get max score.
I tried to pass the neighborhood of the agent to the network but there are too many variables which are not important for the score and agent is finding very local solutions.
If you have a sufficient amount of data, your ANN should be able to ignore the noisy inputs. You also may want to try other learning approaches like scaled conjugate gradient or simple heuristics like momentum or early stopping so your ANN isn't over learning the training data.
If you think there may be multiple, local solutions, and you think you can get enough training data, then you could try a "mixture of experts" approach. If you go with a mixture of experts, you should use ANNs that are too "small" to solve the entire problem to force it to use multiple experts.
So, you are given a set of states and actions and your target values are the score after the action is applied to the state? If this problem gets any hairier, it will sound like a reinforcement learning problem.
Does this game have discrete actions? Does it have a discrete state space? If so, maybe a decision tree would be worth trying?

Feasibility of Machine Learning techniques for Network Intrusion Detection

Is there a machine learning concept (algorithm or multi-classifier system) that can detect the variance of network attacks(or try to).
One of the biggest problems for signature based intrusion detection systems is the inability to detect new or variant attacks.
Reading up, anomaly detection seems to still be a statistical based en-devour it refers to detecting patterns in a given data set which isn't the same as detecting variation in packet payloads. Anomaly based NIDS monitors network traffic and compares it against an established baseline of a normal traffic profile. The baseline characterizes what is "normal" for the network - such as the normal bandwidth usage, the common protocols used, correct combinations of ports numbers and devices etc
Say some one uses Virus A to propagate through a network then some one writes a rule to stop Virus A but another person writes a "variation" of Virus A called Virus B purely for the purposes of evading that initial rule but still using most if not all of the same tactics/code. Is there not a way to detect variance?
If there is whats the umbrella term it would come under, as ive been under the illusion that anomaly detection was it.
Could machine learning be used for pattern recognition(rather than pattern matching) at the packet payload level?
i think your intution to look at machine learning techniques is correct, or will turn out to be correct (One of the biggest problems for signature based intrusion detection systems is the inability to detect new or variant attacks.) The superior performance of ML techiques is in general due to the ability of these algorithms to generalize (a multiplicity of soft constraints rather than a few hard constraints). and to adapt (updates based on new training instances to frustrate simple countermeasures)--two attributes that i would imagine are crucial for identifying network attacks.
The theoretical promise aside, there are practical difficulties with applying ML techniques to problems like the one recited in the OP. By far the most significant is the difficultly in gathering data to train the classifier. In particular, reliably labeling data points as "intrusion" is probably not easy; likewise, my guess is that these instances are sparsely distributed in the raw data."
I suppose it's this limitation that has led to the increased interest (as evidenced at least by the published literature) in applying unsupervised ML techniques to problems like network intrusion detection.
Unsupervised techniques differ from supervised techniques in that the data is fed to the algorithms without a response variable (i.e., without the class labels). In these cases you are relying on the algorithm to discern structure in the data--i.e., some inherent ordering in the data into reasonably stable groups or clusters (possibly what you the OP had in mind by "variance." So with an unsupervised technique, there is no need to explicitly show the algorithm instances of each class, nor is it necessary to establish baseline measurements, etc.
The most frequently used unsupervised ML technique applied to problems of this type is probably the Kohonen Map (also sometimes called self-organizing map or SOM.)
i use Kohonen Maps frequently, but so far not for this purpose. There are however, numerous published reports of their successful application in your domain of interest, e.g.,
Dynamic Intrusion Detection Using Self-Organizing Maps
Multiple Self-Organizing Maps for Intrusion Detection
I know MATLAB has at least one available implementation of Kohonen Map--the SOM Toolbox. The homepage for this Toolbox also contains a brief introduction to Kohonen Maps.