Octave/Matlab: Arranging Space-Time Data in a Matrix

This is a question about coding common practice, not a specific error or other malfunctions.
I have a matrix of values of a variable that changes in space and time. What is the common practice: use the columns for time values or for space values?
That is, assuming there is a definite common practice in the first place.
Update: Here is an example of the data in tabular form. The time vector is much longer than the space vector.
 t   y(x1)   y(x2)
 1   100     50
 2   100     50
 3   100     50
 4    99     49
 5    99     49
 6    99     49
 7    98     49
 8    98     48
 9    98     48
10    97     48

It depends on your goal and ultimately doesn't matter much; it is mostly a question of convenience.
If you do care about performance, there is a slight difference. Your code achieves maximum cache efficiency when it traverses monotonically increasing memory locations. In Matlab, data is stored column-wise, so processing data column-wise gives maximum cache efficiency. Thus, if you frequently access all the data at certain time layers, store space in columns; if you frequently access all the data at certain spatial points, store time in columns.
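As an illustration, here is a NumPy sketch of the time-in-rows, space-in-columns layout, using the values from the table above (`order='F'` is needed because NumPy, unlike Matlab, defaults to row-major storage):

```python
import numpy as np

# 10 time steps x 2 spatial points, as in the table above.
# order='F' requests Matlab-style column-major storage.
y = np.zeros((10, 2), order='F')

# Time in rows, space in columns:
# y[t, :] is one time layer, y[:, i] is the full history of point x_i.
y[:, 0] = [100, 100, 100, 99, 99, 99, 98, 98, 98, 97]
y[:, 1] = [50, 50, 50, 49, 49, 49, 49, 48, 48, 48]

# With column-major storage a whole column is contiguous in memory,
# so looping over the time series of one spatial point is cache-friendly.
print(y[:, 0])
```

With the opposite access pattern (whole time layers at once), the transposed layout would be the cache-friendly choice.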


Is it possible to rank the features based on their importance using autoencoder?

I am using an autoencoder for the first time. I have come to know that it reduces the dimensionality of the input data set, but I am not sure what that actually means. Does it select some specific features from the input features? Is it possible to rank the features using an autoencoder?
My data looks like this:
age  height  weight  working_hour  rest_hour  Diabetic
 54     152      72             8          4         0
 62     159      76             7          3         0
 85     157      79             7          4         1
 24     153      75             8          4         0
 50     153      79             8          4         1
 81     154      80             7          3         1
The features are age, height, weight, working_hour and rest_hour. The target column is Diabetic. Here I have 5 features and I want to use fewer; that is why I want to use an autoencoder to select the best features for the prediction.
Generally it is not possible with a vanilla autoencoder (AE). An AE performs a non-linear mapping to a hidden dimension and back to the original space, but you have no way of interpreting this mapping. You could use constrained AEs, but I would not recommend them when you are working with AEs for the first time.
However, you just want a reduction of the input dimension. What you can do is train an embedding: train the AE with the desired number of nodes in the bottleneck and use the output of the encoder as the input for your other algorithm.
You can split the AE into two functions: encoder (E) and decoder (D). Your forward propagation is then D(E(x)), where x is your input. After you have finished training the AE (with a reasonable reconstruction error!), you compute only E(x) and feed it to your other algorithm.
Another option is PCA, which is basically a linear AE. You can define a maximum number of hidden dimensions and evaluate their contribution to the reconstruction error. Furthermore, it is much easier to implement and you do not need knowledge of TensorFlow or PyTorch.
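As a sketch of the PCA route, here is how the toy table above could be embedded into a lower-dimensional space using nothing but NumPy's SVD; the bottleneck size k=2 is an arbitrary choice for illustration:

```python
import numpy as np

# The six rows from the toy table above (features only, target excluded).
X = np.array([
    [54, 152, 72, 8, 4],
    [62, 159, 76, 7, 3],
    [85, 157, 79, 7, 4],
    [24, 153, 75, 8, 4],
    [50, 153, 79, 8, 4],
    [81, 154, 80, 7, 3],
], dtype=float)

# Center the data, then take the top-k right singular vectors (PCA).
Xc = X - X.mean(axis=0)
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)

k = 2                                # assumed bottleneck size
Z = Xc @ Vt[:k].T                    # embedding, analogous to E(x)
X_rec = Z @ Vt[:k] + X.mean(axis=0)  # reconstruction, analogous to D(E(x))

print("embedding shape:", Z.shape)
print("reconstruction error:", np.linalg.norm(X - X_rec))
```

`Z` would then be fed to the downstream predictor in place of the original 5 features, just like the encoder output of an AE.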

What is the shortest human-readable hash without collision?

I have a total number of W workers with long worker IDs. They work in groups, with a maximum of M members in each group.
To generate a unique group name for each worker combination, concatenating the IDs is not feasible. I am thinking of applying MD5() to the flattened, sorted worker-ID list. I am not sure how many digits I should keep for it to be memorable to humans while safe from collisions.
Will log base (26+10) of W^M characters be enough? How many redundant characters should I keep? Is there any other specialized hash function that works better for this scenario?
The total number of combinations of 500 objects taken up to 10 at a time would be approximately 2.5091E+20, which would fit in 68 bits (about 13 characters in base36), but I don't see an easy algorithm to assign each combination a number. An easier scheme would be this: if you assign each person a 9-bit number (0 to 511) and concatenate up to 10 such numbers, you get 90 bits. To encode those in base36, you would need 18 characters.
If you want to use a hash with just 6 characters in base36 (about 31 bits), the probability of a collision depends on the total number of groups used during the lifetime of the application. If we assume that each day there are 10 new groups (that were not encountered before) and that the application will be used for 10 years, we get 36,500 groups. Using the calculator provided by Nick Barnes shows that there is a 27% chance of a collision in this case. You can adjust the assumptions to your particular situation and then change the hash length to fit your desired maximum chance of a collision.
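A minimal sketch of the truncated-hash idea, assuming base36 names and the 6-character / 36,500-group scenario above (the function names are made up for illustration):

```python
import hashlib
import math

def group_name(worker_ids, n_chars=6):
    """Short group name: MD5 of the sorted ID list, truncated to
    n_chars base36 characters."""
    key = ",".join(sorted(worker_ids)).encode()
    digest = int.from_bytes(hashlib.md5(key).digest(), "big")
    alphabet = "0123456789abcdefghijklmnopqrstuvwxyz"
    chars = []
    for _ in range(n_chars):
        digest, r = divmod(digest, 36)
        chars.append(alphabet[r])
    return "".join(chars)

# Sorting first makes the name independent of member order:
print(group_name(["w42", "w7", "w13"]) == group_name(["w7", "w13", "w42"]))

def collision_probability(n, k=6):
    """Birthday bound: chance of any collision among n groups
    with a hash space of 36**k values."""
    return 1 - math.exp(-n * (n - 1) / (2 * 36 ** k))

print(round(collision_probability(36500), 2))  # about 0.26, in line with the ~27% above
```

Increasing `n_chars` by one multiplies the hash space by 36, so each extra character buys roughly a 36-fold drop in collision probability.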

Predicting patterns in number sequences

My problem is as follows. As inputs I have sequences of whole numbers, around 200-500 per sequence. Each number in a sequence is marked as good or bad. The first number in each sequence is always good, but whether subsequent numbers are still considered good is determined by which numbers came before them. There is a mathematical function which governs how the numbers affect those that come after them, but the specifics of this function are unknown. All we know for sure is that it starts off accepting every number and then gradually starts rejecting numbers until finally every number is considered bad. Out of every sequence, only around 50 numbers will ever be accepted before this happens.
It is possible that the validity of a number is determined not only by which numbers came before it, but also by whether those numbers were themselves considered good or bad.
For example: (good numbers in bold)
4 17 8 47 52 18 13 88 92 55 8 66 76 85 36 ...
92 13 28 12 36 73 82 14 18 10 11 21 33 98 1 ...
Attempting to determine the logic behind the system through guesswork seems like an impossible task. So my question is: can a neural network be trained to predict whether a number will be good or bad? If so, approximately how many sequences would be required to train it? (Assuming sequences of 200-500 numbers that are 32-bit integers.)
Since your data is sequential and there are dependencies between the numbers, it should be possible to train a recurrent neural network; the recurrent weights take care of the relationships between the numbers.
As a general rule of thumb, the more uncorrelated input sequences you have, the better. This survey article can help you get started with RNNs: https://arxiv.org/abs/1801.01078
This is definitely possible. @salehinejad gives a good answer, but you might want to look at specific RNN architectures, such as the LSTM!
It's very good for sequence prediction. You just feed the network the numbers one by one (sequentially).
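To make the "feed the numbers one by one" idea concrete, here is a minimal recurrent cell in plain NumPy; the sizes, the input scaling, and the random (untrained) weights are all illustrative assumptions, not a trained model:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy recurrent cell: the hidden state carries information about earlier
# numbers in the sequence, which is what lets the model condition
# "good/bad" on history. Sizes and weights are illustrative, untrained.
hidden = 8
W_in = rng.normal(scale=0.1, size=(hidden, 1))
W_h = rng.normal(scale=0.1, size=(hidden, hidden))
W_out = rng.normal(scale=0.1, size=(1, hidden))

def predict_sequence(numbers):
    """Return a good/bad probability for each number, fed one at a time."""
    h = np.zeros((hidden, 1))
    probs = []
    for x in numbers:
        h = np.tanh(W_in * (x / 100.0) + W_h @ h)   # update hidden state
        logit = (W_out @ h).item()
        probs.append(1.0 / (1.0 + np.exp(-logit)))  # sigmoid -> P(good)
    return probs

probs = predict_sequence([4, 17, 8, 47, 52])
print([round(p, 2) for p in probs])
```

In practice one would use an LSTM/GRU from a framework and train the weights against the labeled sequences; this sketch only shows the data flow.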

An instance of online data clustering

I need to derive clusters of integers from an input array of integers in such a way that the variation within the clusters is minimized. (The integers, or data values, in the array correspond to the gas usage of 16 cars running between cities. In the end I will derive 4 clusters of the 16 cars based on the clusters of the data values.)
Constraints: the number of elements is always 16, the number of clusters is 4, and the size of each cluster is 4.
One simple way I am planning to do is that I will sort the input array and then divide them into 4 groups as shown below. I think that I can also use k-means clustering.
However, where I am stuck is as follows: the data in the array change over time. Basically I need to monitor the array every second and regroup/recluster the values so that the variation within each cluster is minimized, while still satisfying the above constraints. One idea is to select two groups based on their means and variations and move data values between them to minimize within-group variation. However, I have no idea how to select the data values to move between the groups, or how to select those groups. I cannot sort the array every second because I cannot afford N log N every second. It would be great if you could guide me toward a simple solution.
sorted input array: `(12 14 16 16 18 19 20 21 24 26 27 29 29 30 31 32)`
cluster-1: (12 14 16 16)
cluster-2: (18 19 20 21)
cluster-3: (24 26 27 29)
cluster-4: (29 30 31 32)
Let me first point out that sorting a small number of objects is very fast. In particular, when the data has been sorted before, an "evil" bubble sort or insertion sort is usually linear. Consider how few places the order can have changed in! None of the classic complexity discussion really applies when the data fits into the CPU's first-level cache.
Did you know that most quicksort implementations fall back to insertion sort for small arrays? Because it does a fairly good job on small arrays and has little overhead.
The complexity discussions only matter for really large data sets; strictly speaking, the bounds are asymptotic, proven in the limit of infinitely large data. Before you reach infinity, a simple algorithm of higher complexity order may still perform better, and for n < 10, quadratic insertion sort often outperforms O(n log n) sorting.
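For instance, a plain insertion sort, which is close to linear when only a couple of values have drifted since the last one-second tick (the example array here is hypothetical):

```python
def insertion_sort(a):
    """In-place insertion sort: O(n^2) worst case, but close to O(n)
    when the array is already nearly sorted, as between 1-second updates."""
    for i in range(1, len(a)):
        x = a[i]
        j = i - 1
        # shift larger elements right until x fits
        while j >= 0 and a[j] > x:
            a[j + 1] = a[j]
            j -= 1
        a[j + 1] = x
    return a

# A nearly sorted array (two values drifted since the last tick):
vals = [12, 14, 16, 18, 16, 19, 20, 21, 24, 26, 27, 29, 30, 29, 31, 32]
print(insertion_sort(vals))
```

With only 16 elements, each pass touches a handful of cache-resident values, so re-sorting every second is essentially free.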
k-means, however, won't help you much:
Your data is one-dimensional. Do not bother to even look at multidimensional methods; they will perform worse than proper one-dimensional methods (which can exploit the fact that the data can be ordered).
If you want guaranteed runtime, k-means with its possibly many iterations is quite uncontrolled.
You can't easily add constraints such as the 4-cars rule to k-means.
I believe the solution to your task (because the data is one-dimensional, and because of the constraints you added) is:
Sort the integers
Divide the sorted list into k even-sized groups
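The two steps above can be sketched in a few lines of Python, using the 16 gas-usage values from the question:

```python
def cluster_1d(values, k=4):
    """Sort, then cut into k equal-sized contiguous groups. With fixed
    equal group sizes there is only one contiguous partition of the
    sorted data, and contiguity is what keeps within-group variation low."""
    s = sorted(values)
    size = len(s) // k
    return [s[i * size:(i + 1) * size] for i in range(k)]

# The 16 gas-usage values from the question:
clusters = cluster_1d([12, 14, 16, 16, 18, 19, 20, 21,
                       24, 26, 27, 29, 29, 30, 31, 32])
print(clusters)
# -> [[12, 14, 16, 16], [18, 19, 20, 21], [24, 26, 27, 29], [29, 30, 31, 32]]
```

This reproduces exactly the four clusters listed in the question.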

Clustering on a large dataset

I'm trying to cluster a large (gigabyte-scale) dataset. In order to cluster, you need the distance of every point to every other point, so you end up with an N^2-sized distance matrix, which in the case of my dataset would be on the order of exabytes. pdist in Matlab blows up instantly, of course ;)
Is there a way to cluster subsets of the large data first, and then maybe do some merging of similar clusters?
I don't know if this helps any, but the data are fixed length binary strings, so I'm calculating their distances using Hamming distance (Distance=string1 XOR string2).
A simplified version of the nice method from
Tabei et al., Single versus Multiple Sorting in All Pairs Similarity Search,
say for pairs with Hamming distance 1:
sort all the bit strings on the first 32 bits
look at blocks of strings where the first 32 bits are all the same;
these blocks will be relatively small
pdist each of these blocks for Hammingdist(left 32) = 0 + Hammingdist(the rest) <= 1.
This misses the fraction (e.g. 32/128) of the nearby pairs which have
Hammingdist(left 32) = 1 + Hammingdist(the rest) = 0.
If you really want these, repeat the above with "first 32" -> "last 32".
The method can be extended.
Take for example Hammingdist <= 2 on 4 32-bit words; the mismatches must split across the words like one of
2000 0200 0020 0002 1100 1010 1001 0110 0101 0011,
so at least 2 of the words must have 0 mismatches; sort on those in the same way.
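A toy sketch of the distance-1 blocking scheme in Python, using 64-bit integers as the bit strings (the parameters and data are illustrative):

```python
from collections import defaultdict
from itertools import combinations

def near_pairs_dist1(strings, prefix_bits=32, total_bits=64):
    """Find pairs at Hamming distance <= 1 that agree on the high
    prefix_bits: block on the high bits, then compare exhaustively
    inside each (relatively small) block."""
    blocks = defaultdict(list)
    for s in strings:
        blocks[s >> (total_bits - prefix_bits)].append(s)
    pairs = []
    for members in blocks.values():
        for a, b in combinations(members, 2):
            if bin(a ^ b).count("1") <= 1:
                pairs.append((a, b))
    # Pairs that differ only inside the high bits are missed here;
    # as noted above, rerun with the roles of the two halves swapped.
    return pairs

# Toy data: 64-bit values; the first two differ in a single low bit.
data = [0x0123456700000000, 0x0123456700000001, 0xFFFF000012345678]
print(near_pairs_dist1(data))
```

The point is that the O(N^2) comparison only ever runs inside small blocks, never over the whole dataset.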
(Btw, sketchsort-0.0.7.tar is 99 % src/boost/, build/, .svn/ .)
How about sorting them first? Maybe something like a modified merge sort? You could start with chunks of the dataset that fit in memory and perform a normal sort on each.
Once you have the sorted data, clustering can be done iteratively. Maybe keep a rolling centroid of the N-1 points seen so far and compare it against the Nth point being read in. Then, depending on your cluster-distance threshold, you either pool it into the current cluster or start a new one.
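A hedged sketch of that rolling-centroid idea for one-dimensional data (the threshold and the data are made up, and a real implementation would also need a merging pass for clusters that drift together):

```python
def stream_cluster(points, threshold):
    """Greedy one-pass clustering: keep a running centroid per cluster;
    each new point joins the nearest centroid within `threshold`,
    or starts a new cluster."""
    centroids, counts, assignments = [], [], []
    for p in points:
        best, best_d = None, threshold
        for i, c in enumerate(centroids):
            d = abs(p - c)
            if d <= best_d:
                best, best_d = i, d
        if best is None:
            centroids.append(float(p))
            counts.append(1)
            assignments.append(len(centroids) - 1)
        else:
            counts[best] += 1
            # incremental (rolling) centroid update
            centroids[best] += (p - centroids[best]) / counts[best]
            assignments.append(best)
    return assignments, centroids

labels, cents = stream_cluster([1, 2, 1, 50, 51, 49, 100], threshold=10)
print(labels)  # -> [0, 0, 0, 1, 1, 1, 2]
```

Because only the centroids are kept in memory, this scales to datasets far larger than RAM, at the cost of sensitivity to the arrival order.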
The EM-tree and K-tree algorithms in the LMW-tree project can cluster problems this big and larger. Our most recent result is clustering 733 million web pages into 600,000 clusters. There is also a streaming variant of the EM-tree where the dataset is streamed from disk for each iteration.
Additionally, these algorithms can cluster bit strings directly where all cluster representatives and data points are bit strings, and the similarity measure that is used is Hamming distance. This minimizes the Hamming distance within each cluster found.