Predicting patterns in number sequences - neural-network

My problem is as follows. As inputs I have sequences of whole numbers, around 200-500 per sequence. Each number in a sequence is marked as good or bad. The first number in each sequence is always good, but whether subsequent numbers are considered good is determined by the numbers that came before them. There is a mathematical function which governs how the numbers affect those that come after them, but the specifics of this function are unknown. All we know for sure is that it starts off accepting every number and then gradually starts rejecting numbers until finally every number is considered bad. Out of every sequence, only around 50 numbers will ever be accepted before this happens.
It is possible that the validity of a number is not only determined by which numbers came before it, but also by whether these numbers were themselves considered good or bad.
For example: (good numbers in bold)
4 17 8 47 52 18 13 88 92 55 8 66 76 85 36 ...
92 13 28 12 36 73 82 14 18 10 11 21 33 98 1 ...
Attempting to determine the logic behind the system through guesswork seems like an impossible task. So my question is: can a neural network be trained to predict whether a number will be good or bad? If so, approximately how many sequences would be required to train it? (Assuming sequences of 200-500 numbers that are 32-bit integers.)

Since your data is sequential and there are dependencies between the numbers, it should be possible to train a recurrent neural network (RNN); the recurrent weights take care of the relationships between numbers.
As a general rule of thumb, the more uncorrelated input sequences you have, the better. This survey article can help you get started with RNNs: https://arxiv.org/abs/1801.01078

This is definitely possible. #salehinejad gives a good answer, but you might want to look at specific RNN architectures, such as the LSTM!
It's very good for sequence prediction: you just feed the network the numbers one by one (sequentially).
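As a minimal sketch of the feed-one-number-at-a-time idea (this is not a trained model: the weights below are random, and the layer sizes, the scaling constant and the function names are my own choices; a real solution would use an LSTM from a framework such as PyTorch or Keras):

```python
import numpy as np

rng = np.random.default_rng(0)
HIDDEN = 16
Wx = rng.normal(scale=0.1, size=(HIDDEN, 1))       # input -> hidden
Wh = rng.normal(scale=0.1, size=(HIDDEN, HIDDEN))  # hidden -> hidden (the recurrent weights)
Wo = rng.normal(scale=0.1, size=(1, HIDDEN))       # hidden -> good/bad logit

def predict_sequence(seq):
    """Feed the numbers one by one, carrying a hidden state across steps."""
    h = np.zeros((HIDDEN, 1))
    probs = []
    for x in seq:
        x_scaled = x / 2**31                 # squash 32-bit integers into [-1, 1]
        h = np.tanh(Wx * x_scaled + Wh @ h)  # hidden state remembers earlier numbers
        p = 1.0 / (1.0 + np.exp(-(Wo @ h)))  # sigmoid: P(this number is good)
        probs.append(float(p))
    return probs
```

Each prediction depends on the entire prefix of the sequence through the hidden state h, which is exactly the dependency on earlier numbers described in the question.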


Is it possible to rank the features based on their importance using an autoencoder?

I am using an autoencoder for the first time. I have come to know that it reduces the dimensionality of the input data set, but I am not sure what that actually means. Does it select some specific features from the input features? Is it possible to rank the features using an autoencoder?
My data looks like as below:
age height weight working_hour rest_hour Diabetic
54 152 72 8 4 0
62 159 76 7 3 0
85 157 79 7 4 1
24 153 75 8 4 0
50 153 79 8 4 1
81 154 80 7 3 1
The features are age, height, weight, working_hour and rest_hour. The target column is Diabetic. Here I have 5 features and I want to use fewer features. That is why I want to implement an autoencoder to select the best features for the prediction.
Generally this is not possible with a vanilla autoencoder (AE). An AE performs a non-linear mapping to a hidden dimension and back to the original. However, you have no chance of interpreting this mapping. You could use constrained AEs, but I would not recommend that when you are working with AEs for the first time.
However, you just want a reduction of the input dimension. What you can do is train an embedding: train the AE with the desired number of nodes in the bottleneck and use the output of the encoder as input for your other algorithm.
You can split the AE into two functions: encoder (E) and decoder (D). Your forward propagation is then D(E(x)), where x is your input. After you have finished training the AE (with a reasonable reconstruction error!), you compute only E(x) and feed it to your other algorithm.
Another way would be PCA, which is basically a linear AE. You can define a maximum number of hidden dimensions and evaluate their contribution to the reconstruction error. Furthermore, it is much easier to implement and you do not need knowledge of TensorFlow or PyTorch.
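The PCA route can be sketched in plain NumPy using the table from the question (the choice of 2 retained components is arbitrary here, and the variable names are mine):

```python
import numpy as np

# Features from the question: age, height, weight, working_hour, rest_hour.
X = np.array([
    [54, 152, 72, 8, 4],
    [62, 159, 76, 7, 3],
    [85, 157, 79, 7, 4],
    [24, 153, 75, 8, 4],
    [50, 153, 79, 8, 4],
    [81, 154, 80, 7, 3],
], dtype=float)

Xc = X - X.mean(axis=0)               # centre each feature
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)

k = 2                                 # number of "hidden dimensions" to keep
Z = Xc @ Vt[:k].T                     # reduced representation, shape (6, 2)
explained = (S[:k] ** 2).sum() / (S ** 2).sum()  # share of variance retained
```

Here Z plays the role of the encoder output E(x): you would feed it to your classifier instead of the raw 5 features, and `explained` tells you how much of the data's variance the reduction keeps.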

CRC16 (ModBus) - computing algorithm

I am using the ModBus RTU, and I'm trying to figure out how to calculate the CRC16.
I don't need a code example. I am simply curious about the mechanism.
I have learned that a basic CRC is a polynomial division of the data word, which is padded with zeros, depending on the length of the polynomial.
The following test example is supposed to check if my basic understanding is correct:
data word: 0100 1011
polynomial: 1001 (x^3 + 1)
padded by 3 bits because of the highest exponent x^3
calculation: 0100 1011 000 / 1001 -> remainder: 011
Calculation:
01001011000
 1001
-----------
00000011000
      1001
-----------
00000001010
       1001
-----------
00000000011  -> remainder: 011
Edit1: So far verified by Mark Adler in previous comments/answers.
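As a sketch, the modulo-2 long division above can be written as a small Python function (the function name and interface are my own):

```python
def mod2_remainder(data, poly, poly_bits):
    """Modulo-2 long division: XOR the polynomial in wherever the top remaining bit is set.

    `data` must already be padded with poly_bits - 1 zero bits.
    """
    for shift in range(data.bit_length() - poly_bits, -1, -1):
        if data & (1 << (shift + poly_bits - 1)):
            data ^= poly << shift
    return data

# The worked example: 0100 1011 padded by 3 zeros, divided by 1001.
remainder = mod2_remainder(0b01001011000, 0b1001, 4)  # -> 0b011
```

The intermediate values of `data` in the loop match the rows of the long division shown above.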
While searching for an answer I have seen a lot of different approaches involving reversing, dependence on little or big endian, etc., which alter the outcome from the given 011.
Modbus RTU CRC16
Of course I would love to understand how different versions of CRCs work, but my main interest is to simply understand what mechanism is applied here. So far I know:
x^16 + x^15 + x^2 + 1 is the polynomial: 0x18005 or 0b11000000000000101
initial value is 0xFFFF
example message in hex: 01 10 C0 03 00 01
CRC16 of above message in hex: C9CD
I did calculate this manually like the example above, but I'd rather not write this down in binary in this question. I presume my transformation into binary is correct. What I don't know is how to incorporate the initial value -- is it used to pad the data word with it instead of zeros? Or do I need to reverse the answer? Something else?
1st attempt: Padding by 16 bits with zeros.
The calculated remainder in binary would be 1111 1111 1001 1011, which is FF9B in hex and incorrect for CRC16/Modbus, but correct for CRC16/BUYPASS.
2nd attempt: Padding by 16 bits with ones, due to initial value.
The calculated remainder in binary would be 0000 0000 0110 0100, which is 0064 in hex and incorrect.
It would be great if someone could explain or clarify my assumptions. I honestly did spend many hours searching for an answer, but every explanation is based on code examples in C/C++ or other languages, which I don't understand. Thanks in advance.
EDIT1: According to this site, "1st attempt" points to another CRC16-method with same polynomial but a different initial value (0x0000), which tells me, the calculation should be correct.
How do I incorporate the initial value?
EDIT2: Mark Adler's answer does the trick. However, now that I can compute CRC16/Modbus, there are some questions left for clarification. Not needed, but appreciated.
A) The order of computation would be: ... ?
1st: apply RefIn to the complete input (including padded bits)
2nd: XOR the InitValue with the first 16 bits (in CRC16)
3rd: apply RefOut to the complete output/remainder (remainder at most 16 bits in CRC16)
B) Referring to RefIn and RefOut: is it always reflecting 8 bits for input and all bits for output, regardless of whether I use CRC8, CRC16 or CRC32?
C) What do the 3rd (Check) and 8th (XorOut) columns on the website I am referring to mean? The latter seems rather easy; I am guessing it's applied by XORing the value after RefOut, just like the InitValue?
Let's take this a step at a time. You now know how to correctly calculate CRC-16/BUYPASS, so we'll start from there.
Let's take a look at CRC-16/CCITT-FALSE. That one has an initial value that is not zero, but still has RefIn and RefOut as false, like CRC-16/BUYPASS. To compute the CRC-16/CCITT-FALSE on your data, you exclusive-or the first 16 bits of your data with the Init value of 0xffff. That gives fe ef C0 03 00 01. Now do what you know on that, but with the polynomial 0x11021. You will get what is in the table, 0xb53f.
Now you know how to apply Init. The next step is dealing with RefIn and RefOut being true. We'll use CRC-16/ARC as an example. RefIn means that we reflect the bits in each byte of input. RefOut means that we reflect the bits of the remainder. The input message is then: 80 08 03 c0 00 80. Dividing by the polynomial 0x18005 we get 0xb34b. Now we reflect all of those bits (not in each byte, but all 16 bits), and we get 0xd2cd. That is what you see as the result in the table.
We now have what we need to compute CRC-16/MODBUS, which has both a non-zero Init value (0xffff) and RefIn and RefOut as true. We start with the message with the bits in each byte reflected and the first 16 bits inverted. That is 7f f7 03 c0 00 80. Divide by 0x18005 and you get the remainder 0xb393. Reflect those bits and we get 0xc9cd, the expected result.
The exclusive-or of Init is applied after the reflection, which you can verify using CRC-16/RIELLO in that table.
Answers for added questions:
A) RefIn has nothing to do with the padded bits. You reflect the input bytes. However in a real calculation, you reflect the polynomial instead, which takes care of both reflections.
B) Yes.
C) Yes, XorOut is what you exclusive-or the final result with. Check is the CRC of the nine bytes "123456789" in ASCII.
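Putting the CRC-16/MODBUS recipe together (Init 0xffff, RefIn and RefOut true): in code one normally uses the reflected polynomial 0xA001, which, as noted in A), takes care of both reflections at once. A Python sketch (the function name is mine):

```python
def crc16_modbus(data: bytes) -> int:
    crc = 0xFFFF                           # Init value
    for byte in data:
        crc ^= byte                        # input reflection is folded into the reflected algorithm
        for _ in range(8):
            if crc & 1:
                crc = (crc >> 1) ^ 0xA001  # 0x8005 with its bits reflected
            else:
                crc >>= 1
    return crc                             # already reflected; XorOut is 0x0000

# The message from the question:
assert crc16_modbus(bytes.fromhex("0110C0030001")) == 0xC9CD
```

Shifting right instead of left is what makes the reflected polynomial equivalent to reflecting every input byte and the final remainder.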

Octave/Matlab: Arranging Space-Time Data in a Matrix

This is a question about coding common practice, not a specific error or other malfunctions.
I have a matrix of values of a variable that changes in space and time. What is the common practice: should the columns correspond to time values or to space values?
That is, if there is a definite common practice in the first place.
Update: Here is an example of the data in tabular form. The time vector is much longer than the space vector.
t y(x1) y(x2)
1 100 50
2 100 50
3 100 50
4 99 49
5 99 49
6 99 49
7 98 49
8 98 48
9 98 48
10 97 48
It depends on your goal and ultimately doesn't matter that much. This is more a question of your convenience.
If you do care about performance, there is a slight difference. Your code achieves maximum cache efficiency when it traverses monotonically increasing memory locations. In Matlab, data is stored column-wise, so processing data column-wise results in maximum cache efficiency. Thus, if you frequently access all the data at certain time layers, store space in columns. If you frequently access all the data at certain spatial points, store time in columns.
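As an illustration of the "store time in columns" case, here is the question's table in NumPy (Fortran order mimics Matlab/Octave's column-major layout; the variable names are mine):

```python
import numpy as np

# Rows = time steps, columns = spatial points, as in the question's table.
t = np.arange(1, 11)
y = np.asfortranarray([[100, 50], [100, 50], [100, 50],
                       [ 99, 49], [ 99, 49], [ 99, 49],
                       [ 98, 49], [ 98, 48], [ 98, 48],
                       [ 97, 48]])

# With column-major storage, the full time series at one spatial point
# (one column) occupies contiguous memory -- the cache-friendly access.
series_at_x1 = y[:, 0]
```

In Matlab/Octave the same layout means `y(:, 1)` walks contiguous memory, so extracting a full time series at one spatial point is the fast operation.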

What is the shortest human-readable hash without collision?

I have a total number of W workers with long worker IDs. They work in groups, with a maximum of M members in each group.
To generate a unique group name for each worker combination, concatenating the IDs is not feasible. I am thinking of applying MD5() to the flattened, sorted worker ID list. I am not sure how many digits I should keep for it to be memorable to humans while safe from collisions.
Will log base (26+10) of W^M be enough? How many redundant characters should I keep? Is there any other specialized hash function that works better for this scenario?
The total number of combinations of 500 objects taken up to 10 at a time would be approximately 2.5091E+20, which would fit in 68 bits (about 13 characters in base36), but I don't see an easy algorithm to assign each combination a number. An easier scheme would be this: if you assign each person a 9-bit number (0 to 511) and concatenate up to 10 such numbers, you get 90 bits. To encode those in base36, you would need 18 characters.
If you want to use a hash with just 6 characters in base36 (about 31 bits), the probability of a collision depends on the total number of groups used during the lifetime of the application. If we assume that each day there are 10 new groups (that were not encountered before) and that the application will be used for 10 years, we would get 36500 groups. Using the calculator provided by Nick Barnes shows that there is a 27% chance of a collision in this case. You can adjust the assumptions to your particular situation and then change the hash length to fit your desired maximum chance of a collision.
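Both pieces can be sketched as follows (the function names, the 6-character truncation, and the choice of MD5 plus base36 are assumptions taken from the question and the discussion above, not a prescribed design):

```python
import hashlib
import math

ALPHABET = "0123456789abcdefghijklmnopqrstuvwxyz"

def group_name(worker_ids, length=6):
    """Hash the sorted ID list so the name is independent of member order."""
    key = ",".join(sorted(map(str, worker_ids)))
    n = int.from_bytes(hashlib.md5(key.encode()).digest(), "big")
    chars = []
    while n:
        n, r = divmod(n, 36)
        chars.append(ALPHABET[r])
    return "".join(reversed(chars))[:length]

def collision_prob(n_groups, length=6):
    """Birthday-bound approximation for n_groups names drawn from 36**length values."""
    space = 36 ** length
    return 1.0 - math.exp(-n_groups * (n_groups - 1) / (2.0 * space))
```

With 36500 groups and 6 base36 characters, `collision_prob` gives roughly 0.26, matching the 27% figure from the calculator mentioned above.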

An instance of online data clustering

I need to derive clusters of integers from an input array of integers in such a way that the variation within the clusters is minimized. (The integers, or data values, in the array correspond to the gas usage of 16 cars running between cities. At the end I will derive 4 clusters of the 16 cars based on the clusters of the data values.)
Constraints: the number of elements is always 16, the number of clusters is 4, and the size of each cluster is 4.
One simple way I am planning to do this is to sort the input array and then divide it into 4 groups as shown below. I think that I could also use k-means clustering.
However, here is where I am stuck: the data in the array change over time. Basically I need to monitor the array every second and regroup/recluster the values so that the variation within each cluster is minimized, while still satisfying the above constraints. One idea I have is to select two groups based on their means and variances and move data values between them to minimize the within-group variation. However, I have no idea how to select the data values to move between the groups, nor how to select those groups. I cannot sort the array every second because I cannot afford N log N per second. It would be great if you could guide me towards a simple solution.
`sorted input array: (12 14 16 16 18 19 20 21 24 26 27 29 29 30 31 32)`
cluster-1: (12 14 16 16)
cluster-2: (18 19 20 21)
cluster-3: (24 26 27 29)
cluster-4: (29 30 31 32)
Let me first point out that sorting a small number of objects is very fast. In particular when the data has been sorted before, an "evil" bubble sort or insertion sort is usually linear: consider in how many places the order may have changed! All of the classic complexity discussion doesn't really apply when the data fits into the CPU's first-level caches.
Did you know that most quicksort implementations fall back to insertion sort for small arrays? Because it does a fairly good job on small arrays and has little overhead.
All the complexity discussions are only for really large data sets; they are in fact proven only for infinitely sized data. Before you reach infinity, a simple algorithm of higher complexity order may still perform better, and for n < 10, quadratic insertion sort often outperforms O(n log n) sorting.
k-means, however, won't help you much:
Your data is one-dimensional. Do not bother to even look at multidimensional methods; they will perform worse than proper one-dimensional methods (which can exploit the fact that the data can be ordered).
If you want guaranteed runtime, k-means with possibly many iterations is quite uncontrolled.
You can't easily add constraints such as the 4-cars rule to k-means.
I believe the solution to your task (because the data is one-dimensional and because of the constraints you added) is:
Sort the integers
Divide the sorted list into k even-sized groups
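The two steps above, with insertion sort for the near-sorted per-second updates, can be sketched as (the function names are mine):

```python
def insertion_sort(a):
    """Near-linear when the array is already almost sorted between updates."""
    for i in range(1, len(a)):
        x, j = a[i], i - 1
        while j >= 0 and a[j] > x:
            a[j + 1] = a[j]   # shift larger elements right
            j -= 1
        a[j + 1] = x
    return a

def recluster(values, k=4):
    """Sort, then split into k even-sized clusters (16 values -> 4 clusters of 4)."""
    a = insertion_sort(list(values))
    size = len(a) // k
    return [a[i * size:(i + 1) * size] for i in range(k)]

# The question's data, in arbitrary order:
clusters = recluster([12, 29, 16, 30, 18, 26, 20, 31, 24, 14, 27, 16, 29, 19, 32, 21])
```

On the question's data this reproduces the four clusters shown above; on each one-second update you would re-run `recluster` on the previous (almost sorted) order, which is where insertion sort's near-linear behaviour pays off.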