Should I use a float or classes as output for the final layer in my neural network?

I am working on a deep learning problem where I am trying to predict time-to-failure on laboratory earthquake data from an observed seismic time series. The target is a single integer (the time until the next earthquake) ranging, say, from 1 to 10.
I could design the last layer to return a single float and use, say, mean squared error (MSE) as the loss to push that float close to the desired integer. Or I could treat each integer possibility as a "class" and optimize with a cross-entropy (CE) loss.
Are there any arguments in favour of either of these options?
Also, what if the target is a float ranging from 1 to 10? I could still turn this into a class/CE problem by binning the values.
So far, I have tried the CE option (which works at some level) and am thinking of trying the MSE option, but I wanted to step back and think before proceeding - in particular, about why one approach might outperform the other.
I am working with PyTorch 1.0.1 and Python 3.7.
Thanks for any guidance.

I decided to just implement a float head with an L1Loss in PyTorch, and I created a simple but effective synthetic data set to test the implementation. The data set consisted of images into which a number of small squares were randomly drawn. The training label was simply the number of squares divided by 10, a float with one decimal digit.
The net trained very quickly and to a high degree of precision - the test samples were correct to one decimal digit.
As to the original question, the runs I made definitely favoured the float head over the class head.
My take on this is that the class implementation had a basic imprecision in the assignment of the classes and, perhaps more importantly, no concept of a "metric". That is, the float implementation, even if it misses the exact match, will try to generate an output label "close" to the true label, while the class implementation has no concept of "close".
One warning with PyTorch: if you are fitting to one float, be sure to wrap it in a length-1 vector in the data generator. PyTorch cannot handle a "naked" float target (even though it does become a vector when batches are assembled), but it doesn't complain. This cost me a bunch of time.
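For reference, here is a minimal sketch of such a float-head setup (the network body, layer sizes and dummy data below are placeholder assumptions for illustration, not the actual square-counting net):

import torch
import torch.nn as nn

class FloatHeadNet(nn.Module):
    # Placeholder body with a single-float regression head.
    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(64 * 64, 128)
        self.head = nn.Linear(128, 1)   # single float output

    def forward(self, x):
        x = x.view(x.size(0), -1)       # flatten the image
        return self.head(torch.relu(self.fc1(x)))

model = FloatHeadNet()
criterion = nn.L1Loss()

images = torch.randn(8, 1, 64, 64)   # dummy batch of 64x64 images
targets = torch.rand(8, 1)           # note the (batch, 1) shape, not (batch,)

loss = criterion(model(images), targets)
loss.backward()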

Related

Activation function to get day of week

I'm writing a program to predict when something will happen. I don't know which activation function to use to get the output as a day of the week (1-7).
I tried the sigmoid function, but then I need to input the predicted day and it outputs the probability of it; I don't want it to work this way.
I expect the activation function to return values from 0 to infinity; is ReLU the best activation function for this task?
EDIT:
Also, what if I wanted to output more than 7 days - for example, x will happen on the 9th day from today, or the 15th day from today, etc.? I'm looking for a dynamic way to do this.
What you are trying to do is solve a classification problem with a regression approach. That's at least unconventional.
You can use any activation function you want and define your output as you want, e.g. linear or ReLU with an output range from 1 to 7, or something between -1 (or 0) and 1 like tanh or sigmoid, and then map the output (-1 -> 1; -0.3 -> 2; ...).
The problem for you will be that you get a floating-point number as a result. So your model not only has to learn how to classify correctly but also how to predict the (almost) exact number you want in your output neuron. That makes the problem more complicated than it has to be. With a model like that it is also likely that for some outlier data points you will get unexpected return values like 0, -1 or 8. What do you do then?
To sum it up: listen to @venkata krishnan, use softmax and seven output neurons, and map the result to a number between 1 and 7 outside the neural network if you have to.
EDIT
What comes to my mind after reading the comments again would be a mix of what you want and what you should do.
You could try to make the second-to-last layer a 7-neuron softmax layer and map its outputs to a single neuron in the last layer.
I have never tried that, nor have I ever read about anything like it, so I can't tell you whether it's a good idea (likely not), but you might consider it worth a try.
I want to add to the point of @venkata krishnan, who raises a valid point about your problem setting. You will find an answer to your original question further down, but I strongly suggest you read the following comment first.
Generally, you want to discern between categorical, ordinal and interval variables. I have given a relatively lengthy explanation in a different answer on Stack Overflow; it might be helpful for understanding this concept in more detail.
In your scenario, you mostly want an understanding of "how wrong" you are. Of course, it is perfectly reasonable to do what you are doing and interpret the target as an interval variable, and therefore assume an ordering (and a distance) between different values.
What is problematic, though, is that you are assuming a continuous space on a discrete variable. E.g., it does not make any sense to interpret an output of 4.3, since you can only decide between 4 (Friday, assuming you start numbering your days at 0) and 5 (Saturday). Any value in between would have to be rounded, which is perfectly fine - until you want to perform backpropagation on this loss.
It is problematic because you are essentially introducing a non-convex and non-continuous function, no matter how you "round" your values. Again, to exemplify this, suppose you round to the nearest integer; then, at the value of 4.5, you would see a sudden jump in the loss, which is non-differentiable and will therefore give your optimizer a hard time, potentially limiting the convergence of your system.
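A two-line check in PyTorch, for instance, makes this concrete: the gradient of a rounding operation is zero almost everywhere, so no learning signal flows through it.

import torch

x = torch.tensor([4.3], requires_grad=True)
y = torch.round(x)   # step function: derivative is 0 wherever it is defined
y.backward()
print(x.grad)        # tensor([0.]) -> the optimizer receives no signal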
If, instead, you use several output neurons, as suggested by @venkata krishnan, you might lose the information of distance (how many days you are off) on paper, but you can of course still interpret your loss in any way you like. This would certainly be the better option for a discrete-valued variable.
To answer your original question: I personally would make sure that your output (and hence your loss) is bounded both above and below, as you could otherwise get undefined/inconsistent values that lead to subpar optimization. One way to do this is to rescale a sigmoid function (the image of sigmoid over R is (0, 1)). You can then just multiply your output by 6 to get a value range of (0, 6), which (after rounding) covers all the values you want.
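As a sketch, the rescaling could look like this in PyTorch (the framework choice and layer sizes are assumptions for illustration):

import torch
import torch.nn as nn

class BoundedDayHead(nn.Module):
    # Regression head bounded to (0, 6) via a rescaled sigmoid.
    def __init__(self, in_features=128):
        super().__init__()
        self.fc = nn.Linear(in_features, 1)

    def forward(self, x):
        return torch.sigmoid(self.fc(x)) * 6.0   # sigmoid in (0, 1), scaled to (0, 6)

head = BoundedDayHead()
day = head(torch.randn(1, 128))
print(day.round().item())   # round only at inference time, not inside the loss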
As far as I know, there is no such thing as an activation function that will yield 0 to infinity. You can apply 7 output nodes with a softmax activation function, which will return a probability for each day. There is another solution which may work: you can use 3 output nodes with a binary (step) activation function, each returning either 0 or 1. That means you can encode 8 different outputs with only 3 nodes: 000, 001, 010, 011, 100, 101, 110 and 111. You can use 7 of them.
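For completeness, a minimal sketch of the seven-node softmax option (feature size is a hypothetical choice; CrossEntropyLoss in PyTorch applies the softmax internally):

import torch
import torch.nn as nn

model = nn.Linear(128, 7)            # seven output logits, one per day
criterion = nn.CrossEntropyLoss()    # applies log-softmax + NLL internally

features = torch.randn(4, 128)       # dummy batch of feature vectors
days = torch.tensor([0, 3, 6, 1])    # day-of-week labels 0..6

loss = criterion(model(features), days)
predicted_day = model(features).argmax(dim=1)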

Simulink data types

I'm reading an IMU on an Arduino board with an S-function block in Simulink, using double or single data types, though I only need 2 decimals of precision (as in "xyz.ab"). I want to improve performance by changing data types, and I wonder:
is there a way to decrease the precision to 2 decimals in the S-function block, or by adding/using other conversion blocks/code in Simulink, aside from using the Fixed-Point Tool?
For true fixed-point transfer, the fixed-point toolbox is the most general answer, as stated in Phil's comment.
However, to avoid toolbox use, you could also devise your own fixed-point integer format and add a block that takes a floating-point input and converts it into an integer format (and vice versa on the output).
E.g. if you know the range is -327.68 <= var <= 327.67, you could just define your float as an int16 divided by 100. In a MATLAB Function block you would then just say
y=int16(u*100.0);
to convert the input to the S-function.
On the output it would be a reversal
y=double(u)/100.0;
(EML/MATLAB Function code can be avoided by using multiply, divide and data type conversion blocks.)
However, be mindful of the bits available, and note that the scaling (*, /) operations are done on the floating-point side rather than the integer side.
2^(nrOfBits-1)-1 shows you what magnitude you can represent including the sign; for unsigned types uint8/16/32 the range is 2^(nrOfBits)-1. Then you use the scaling to fit the representable bits into your floating-point range. The scaled range divided by 2^nrOfBits tells you what the resolution will be (how large the steps are).
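A quick sanity check of these formulas (in Python rather than MATLAB, using the int16/divide-by-100 format from above as an assumed example):

import numpy as np

bits, scale = 16, 100.0
max_q = 2 ** (bits - 1) - 1                  # 32767: largest signed magnitude
print(max_q / scale, -(max_q + 1) / scale)   # representable range: 327.67 .. -327.68
print(1.0 / scale)                           # resolution (step size): 0.01

v = 3.14159                                  # round-trip through the integer format
q = np.int16(round(v * scale))               # to int16, with explicit rounding
print(q, float(q) / scale)                   # 314, 3.14 -> two decimals preserved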
You will need to scale the variables correspondingly on the Arduino side as well when you go to an integer interface of this type. (I'm assuming you have access to that code - if not it'd be hard to use any other interface than what is already provided)
Note that MATLAB's intXX(doubleVar*scale) conversion rounds to the nearest integer, whereas a C-style cast on the Arduino side truncates toward zero. To make the rounding explicit and consistent on both sides, include the round function, e.g.:
y = int16(round(doubleVar*scale));
You don't need to use a base-10 scale; any scaling and offset can be used, but it's easier to make out the numbers manually if you keep to base 10 (i.e. 0.1, 10.0, 100.0, 1000.0, etc.).
As a final note, if the Arduino code interface is floating point (single/double) and can't be changed to an integer type, you will not get any speedup from rounding decimals, since the full floating-point value is what is transferred anyway. Even if you do manage to reduce the data a bit using integers, I suspect this might not give a huge speedup unless you transfer large amounts of data; the interface code will have a comparatively large overhead anyway.
Good luck with your project!

Compute e^x for float values in SystemVerilog?

I am building a neural network running on an FPGA, and the last piece of the puzzle is running a sigmoid function in hardware. This is either:
1/(1 + e^-x)
or
(atan(x) + 1) / 2
Unfortunately, x here is a float value (a real value in SystemVerilog).
Are there any tips on how to implement either of these functions in SystemVerilog?
This is really confusing to me since both of these functions are complex, and I don't even know where to begin implementing them given the added complexity of operating on float values.
One simple way is to create a memory/array for this function; however, that option can be highly inefficient.
x would be the input address for the memory, and the value at that location is the output of the function.
Suppose the value of your function is as follows (this is just an example):
x = 0 => f(0) = 1
x = 1 => f(1) = 2
x = 2 => f(2) = 3
x = 3 => f(3) = 4
So you can create an array that stores the output values:
int a[4] = '{1, 2, 3, 4};
I just finished this by Vivado HLS, which allows you to write circuits in C.
Here is my C code.
#include <math.h>

// Element-wise exponential over fixed-size arrays (HLS requires static sizes).
void exp_array(float a[10], float b[10])
{
    int i;
    for (i = 0; i < 10; i++)
    {
        b[i] = expf(a[i]);   // single-precision exp; avoids shadowing math.h's exp()
    }
}
One caveat is that it is impossible to create an unsized array here, so the dimensions must be fixed at compile time. Maybe there is another way that I don't know of.
As you seem to realize, type real is not synthesizable. You need to operate on the integer mantissa and integer exponent separately and combine them when you are done, having tracked the sign. Once you take care of e^-x, the rest should be straightforward.
try this page for a quick explanation: https://www.geeksforgeeks.org/floating-point-representation-digital-logic/
and search on "floating point digital design" for more explanations/examples.
Do you really need a floating-point number for this? Is fixed point sufficient?
Considering (atan(x) + 1) / 2, quite likely the only useful values of x are those where the exponent is fairly small (if the exponent is large, atan(x) is essentially pi/2).
atan of a fixed-point number can be calculated in HW fairly easily; there are CORDIC methods (see https://zipcpu.com/dsp/2017/08/30/cordic.html) and direct methods; see for example https://dspguru.com/dsp/tricks/fixed-point-atan2-with-self-normalization/
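For intuition, here is a small software model of the CORDIC idea in Python (vectoring mode, floating point for readability; a real hardware version would use only shifts, adds and a small precomputed angle table):

import math

def cordic_atan(x, iterations=16):
    # Rotate the vector (1, x) toward the x-axis; the accumulated
    # rotation angle converges to atan(x).
    angle = 0.0
    cx, cy = 1.0, x
    for i in range(iterations):
        step = math.atan(2.0 ** -i)          # precomputed table entry in HW
        if cy > 0:
            cx, cy = cx + cy * 2.0 ** -i, cy - cx * 2.0 ** -i
            angle += step
        else:
            cx, cy = cx - cy * 2.0 ** -i, cy + cx * 2.0 ** -i
            angle -= step
    return angle

print(cordic_atan(1.0), math.atan(1.0))      # both approximately 0.7854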
FPGA design flows generally do not support floating-point numbers in the FPGA fabric. Fixed point with limited precision is more commonly used.
A limited-precision fixed-point approach:
Use Matlab to create an array of samples of your math function such that the largest value is +/-0.99999. For 8-bit precision (actually 7 bits plus a sign bit), multiply those numbers by 128 and round off the fractional part. Write those numbers to a text file in 2's-complement hex format. In SystemVerilog you can implement a ROM using that text file: use $readmemh() to read these numbers into a memory-style variable (one that has both a packed and an unpacked dimension). Link to a tutorial:
https://projectf.io/posts/initialize-memory-in-verilog/.
Now you have a ROM with limited-precision samples of your function.
Section 21.4, "Loading memory array data from a file", in the SystemVerilog specification provides the definition of $readmemh(). Here is that doc:
https://ieeexplore.ieee.org/document/8299595
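By way of illustration, the sample file could also be generated with a short Python script instead of Matlab (the file name, table depth, bit width and the sigmoid choice here are assumptions for the example):

import math

DEPTH, BITS = 256, 8                  # 256 samples, 8-bit signed (7 bits + sign)
SCALE = 2 ** (BITS - 1) - 1           # 127: keeps +/-0.99999 within int8 range

with open("sigmoid_rom.mem", "w") as f:
    for i in range(DEPTH):
        x = (i / (DEPTH - 1)) * 12.0 - 6.0      # sample x over [-6, 6]
        y = 2.0 / (1.0 + math.exp(-x)) - 1.0    # sigmoid rescaled to (-1, 1)
        q = int(round(y * SCALE)) & 0xFF        # quantize, 2's-complement hex
        f.write(f"{q:02x}\n")                   # one sample per line for $readmemh

The resulting file can then be loaded into the ROM variable with $readmemh() as in the tutorial above.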
If you need floating point, one possibility is to use a processor soft core with a floating-point unit implemented in FPGA fabric, and run software on that core. The core interfaces to the rest of the FPGA fabric over a physical bus such as AXI4-Stream. See:
https://www.xilinx.com/products/design-tools/microblaze.html to get started.
It is a very different workflow from ordinary FPGA design and uses different tools. A C or C++ compiler with math libraries (tan, exp, div, etc.) would be used along with the processor core.
Another possibility is an FPGA with a hard processor core; Xilinx Zynq is one of them. This is a complex and powerful approach. A free book provides knowledge on how to use the Zynq:
http://www.zynqbook.com/.
This workflow is even more complex than the soft-core approach because the Zynq is a more complex platform (hard processor & FPGA integrated on one chip).
It's pretty hard to implement non-linear functions like that in hardware, and on top of that, floating-point arithmetic is even more costly. It's definitely better (and recommended) to work with fixed-point arithmetic, as mentioned in the answers above. The number of precision bits in fixed-point arithmetic will depend on your accuracy requirements and error tolerance.
For hardware implementations, any non-linear function can be approximated as a piecewise-linear function, using a ROM-based implementation approach as described in previous answers. The number of sample points you take from the non-linear function determines your accuracy: the more samples you store, the better the approximation. Often in hardware, the number of samples you can store is restricted by the amount of fast/local memory available to you. In this case, to optimize memory resources, you can add a little extra compute and perform linear interpolation to calculate the needed values.
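A software model of that trade-off (Python for illustration; a hardware version would do the same arithmetic in fixed point):

import math

N, LO, HI = 16, -6.0, 6.0    # a coarse 16-entry sigmoid table over [-6, 6]
table = [1.0 / (1.0 + math.exp(-(LO + i * (HI - LO) / (N - 1)))) for i in range(N)]

def sigmoid_interp(x):
    pos = (x - LO) / (HI - LO) * (N - 1)     # fractional table index
    pos = min(max(pos, 0.0), N - 1.0)        # clamp to the table range
    i = int(pos)
    if i == N - 1:
        return table[-1]
    frac = pos - i                           # linear interpolation between samples
    return table[i] * (1.0 - frac) + table[i + 1] * frac

print(sigmoid_interp(0.7), 1.0 / (1.0 + math.exp(-0.7)))   # close with only 16 samples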

How to turn off denormal number support in MATLAB?

I am trying to turn off denormal number support in MATLAB, so that basically any computation that would result in a denormal number would instead just result in zero (DAZ, FTZ).
I've researched several sites, including the one below, but I haven't found anything about doing this.
http://blogs.mathworks.com/cleve/2014/07/21/floating-point-denormals-insignificant-but-controversial-2/
I've never heard of such an option in Matlab. It would likely require deep manipulation of a lot of the floating-point math, effectively requiring a new datatype to be supported, if this were to be an easily toggleable option in Matlab. You could write your own MEX C code to do this (more here and here) for an individual function.
And of course you can get something like this with one line of Matlab – here's an example:
a = [1e-300 1e-310 1e-310];
b = [1e-301 1e-311 1e-310];
x = a-b;
x(abs(x(:)) < realmin(class(x))) = 0;
where realmin is the smallest normalized floating-point number. However, the floating point math is still performed using the extended denormal/subnormal values in a. It's just the output that's clipped to zero.
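For comparison, the same post-hoc clipping in Python/NumPy, where np.finfo(...).tiny plays the role of realmin (the smallest positive normal number):

import numpy as np

a = np.array([1e-300, 1e-310, 1e-310])
b = np.array([1e-301, 1e-311, 1e-310])
x = a - b
x[np.abs(x) < np.finfo(x.dtype).tiny] = 0.0   # flush subnormal results to zero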
Unless you're doing this for fun and experimentation, or possibly running code on an embedded platform, I'd really recommend against disabling denormals as a form of optimization. Instead, focus on why your values are so small and how you might rescale your problem to avoid the issue entirely.

Mahout: Why is using setProbes() having this effect?

I'm using Mahout 0.7 to do some classification. I have an encoder for a continuous variable:
ContinuousValueEncoder durationPlanEncoder = new ContinuousValueEncoder("duration_plan");
The feature associated with this encoder is a number of days and can range from about 6 to 16.
I'm using an OnlineLogisticRegression model and I use the encoder to train it:
durationPlanEncoder.addToVector(null, <duration_plan double val>, trainDataVector);
For simplicity (since I'm trying to understand this whole classification thing while also learning Mahout), I am using 2 variables: 1) a categorical variable with 6 categories - one of which ("dev") always predicts the =1 category; and 2) this "duration_plan" variable.
What I expect to find is that, when I give the classifier test data consisting of the category "dev" and a "duration_plan" value, the accuracy of the classifier will increase as the "duration_plan" value gets closer to its average value across the training data. This is not what I'm seeing, however. Instead, the accuracy of the classifier improves as the value of "duration_plan" goes to 0.0. However, there are no training vectors with duration_plan=0.0! Why would this be the case?
Then I modified my durationPlanEncoder as follows:
durationPlanEncoder.setProbes(2);
and the accuracy improved. It got even better when I increased the number of probes to 20, then 200. Why? What is setProbes() doing, and is this an anomaly, or is this actually how I should be doing it?
The final part of my question is to mention that, even after setting setProbes(20), changing the value of "duration_plan" in the test data has no effect on the accuracy of the classifier - which I don't think is how it should be. If I give a value for duration_plan that doesn't even exist in any of the training data, and thus is never correlated with the =1 class, I would expect the classifier to classify the test sample as =0. Right? Which makes me think I must be coding something just plain wrong. Any suggestions are appreciated.
Mahout documentation is woefully sparse.
Thanks.