Designing a clustering process using RapidMiner - classification

I haven't had much experience with machine learning or clustering, so I'm at a bit of a loss as to how to approach this problem. My data of interest consists of 4 columns, one of which is just an id. The other 3 contain numerical data, values >= 0. The clustering I need is actually quite straightforward, and I could do it by hand, but it will get less clear later on so I want to start out with the right sort of process. I need 6 clusters, which depend on the 3 columns (call them A, B and C) as follows:
A     B     C          Cluster
----  ----  ---------  -------
0     0     0          0
0     0     >0         1
0     >0    <=B        2
0     >0    >B         3
>0    any   <=(A+B)    4
>0    any   >(A+B)     5
At this stage, these clusters will give an insight to the data to inform further analysis.
Since I'm quite new to this, I haven't yet learned much about the various clustering algorithms, so I don't really know where to start. Could anyone suggest an appropriate model to use, or a few that I can research?

This does not look like clustering to me.
Instead, I figure you want a simple decision tree classification.
It should already be available in RapidMiner.

You could use the "Generate Attributes" operator.
This creates new attributes from existing ones.
It would be relatively tiresome to create all the rules, but they would be something like
cluster : if (A==0 && B==0 && C==0, 0, if (A==0 && B==0 && C>0, 1, ...))
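Outside RapidMiner, the full rule set from the question's table is small enough to sanity-check in plain Python first (a sketch; the column names A, B, C are as in the question, and the nested-if structure mirrors what the Generate Attributes expression would look like):

```python
def assign_cluster(a, b, c):
    """Return the cluster id (0-5) for one row, per the table in the question."""
    if a == 0 and b == 0:
        return 0 if c == 0 else 1   # rows 1 and 2 of the table
    if a == 0:                      # here b > 0
        return 2 if c <= b else 3   # rows 3 and 4
    # here a > 0, b can be anything
    return 4 if c <= a + b else 5   # rows 5 and 6
```

Once the logic checks out here, translating it into one nested `if(...)` expression inside Generate Attributes is mechanical.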

Related

Neural Network Architecture for Binary Sequential Data

I want to develop a neural network to generate samples of sequential binary data. For example, I give my network a stream of binary data: 1 0 0 0 0 1 1 1 0 0 1 (11 digits). I want my generator to be able to output something with a similar structure to my data. Given the previous example, I want something along the lines of 0 0 1 1 1 0 1 0 0 0 0 (11 digits). From my input to my output there is similar structure in the data.
My current approach is using a GAN with an LSTM to decipher patterns. This doesn't work too well.
Obviously I would like to generate far longer streams of data, but the concept is the same. Does anyone have any suggestions on what type of model to use? I know this is a really unconventional optimization problem, but I feel like this is a necessary step in breaking down my problem.
Lastly, it might help to think about the problem like this. If I were to create a simulator to model some environment my binary string could represent the days of rain vs. no rain. Evidently, I want to generate some sort of data that is believable and matches similar patterns to the actual data.
EDIT:
I am also open to any ideas on just modeling in general like maybe using markov chains, etc.
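As a concrete baseline for the Markov-chain idea mentioned in the edit, here is a first-order chain sketched in Python: estimate P(next bit | current bit) from the observed stream, then sample a new stream with the same transition statistics. (This is a sketch, not a GAN replacement; the Laplace smoothing and the start state are arbitrary choices.)

```python
import random

def fit_transitions(bits):
    """Estimate P(next bit | current bit) with Laplace smoothing."""
    counts = {0: [1, 1], 1: [1, 1]}  # counts[s][t] = times bit s was followed by bit t
    for prev, cur in zip(bits, bits[1:]):
        counts[prev][cur] += 1
    return {s: [c / sum(counts[s]) for c in counts[s]] for s in counts}

def sample(trans, length, rng, start=0):
    """Generate a new stream that follows the fitted transition probabilities."""
    out = [start]
    for _ in range(length - 1):
        p_one = trans[out[-1]][1]
        out.append(1 if rng.random() < p_one else 0)
    return out

stream = [1, 0, 0, 0, 0, 1, 1, 1, 0, 0, 1]   # the 11-digit example from the question
trans = fit_transitions(stream)
generated = sample(trans, 11, random.Random(0))
```

A first-order chain only captures pairwise structure; a higher-order chain (conditioning on the last k bits) is the natural next step before reaching for an LSTM or GAN.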

Find "complemented" bit vectors clusters

I have a huge list of bit vectors (BVs) that I want to group into clusters.
The idea behind these clusters is to be able to later choose BVs from each cluster and combine them to generate a BV with (almost) all ones (the number of ones must be maximized).
For example, imagine 1 means an app is up and 0 means it is down on node X at a specific moment in time. We want to find the minimal list of nodes that keeps the app up:
App BV for node X in cluster 1: 1 0 0 1 0 0
App BV for node Y in cluster 2: 0 1 1 0 1 0
Combined BV for App (X+Y): 1 1 1 1 1 0
I have been checking the different clustering algorithms, but I did not find one that takes this "complemental" behavior into account, because in this case each column of the BV does not refer to a feature (it only means up or down in a specific timeframe).
Regarding other algorithms like k-means or hierarchical clustering, it is not clear to me whether I can include this consideration for the later grouping in the clustering algorithm.
Finally, I am using the Hamming distance to determine the intra-cluster and inter-cluster distances, since it seems to be the most appropriate metric for binary data. However, the results show that the clusters are not closely grouped and well separated, so I wonder whether I am applying the most suitable grouping method, or whether I should filter the input data before grouping.
Any clue or idea regarding a grouping/clustering method or data filtering is welcome.
This does not at all sound like a clustering problem.
None of these algorithms will help you.
Instead, I would rather call this a matchmaking problem. I'd assume finding the true optimum is at least NP-hard (it resembles set cover), so you'll need to come up with a fast approximation, ideally something specific to your use case.
Also, you haven't specified how to combine two vectors (you wrote +, but that likely isn't what you want): is it XOR or OR? Nor whether it is possible to combine more than two, and what the cost is when doing so. A strategy would be to find, for each vector, the nearest neighbor of its inverse bit vector, and always combine the best pair.
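One way to sketch that fast-approximation idea in Python (assuming the combination is OR) is the classic greedy set-cover heuristic: repeatedly pick the vector that covers the most still-uncovered positions. It is not guaranteed optimal, but it is fast and often close.

```python
def greedy_cover(vectors):
    """Greedily pick vectors whose OR covers the most still-uncovered positions.

    Returns (chosen indices, combined coverage vector). Stops early if no
    remaining vector adds coverage (all-ones may be unreachable).
    """
    n = len(vectors[0])
    covered = [0] * n
    chosen = []

    def gain(i):
        # number of currently-uncovered positions that vector i would cover
        return sum(1 for c, b in zip(covered, vectors[i]) if c == 0 and b == 1)

    while 0 in covered:
        best = max(range(len(vectors)), key=gain)
        if gain(best) == 0:
            break
        chosen.append(best)
        covered = [c | b for c, b in zip(covered, vectors[best])]
    return chosen, covered

# The X/Y example from the question:
vectors = [[1, 0, 0, 1, 0, 0],
           [0, 1, 1, 0, 1, 0]]
chosen, covered = greedy_cover(vectors)
# covered is [1, 1, 1, 1, 1, 0] - the last position cannot be covered
```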

Comparing a network across multiple values of a variable

I have a network and simulate it in NetLogo. In my network I have n nodes, each with a random data value from [0, 1, 2, ..., 19].
At the beginning, one random node becomes a sink and 3 random nodes start to send their data to it. I declare a variable named gamma. After the nodes send their data to the sink, the sink decides whether or not to store that data in its memory, based on gamma. After 0.5 s this process repeats; at each step some nodes are sinks and request some data. This is the way I distribute data in my network.
After all that, I have to vary gamma from 0 to 1 to determine the best value, running my code each time to plot a count of something. I mean: first run the code with gamma=1, then run it again with gamma=0.98, and so on:
if Entropy <= gamma
[
do something
]
If I press the setup button each time I change gamma, my network setup changes and I cannot compare the same network under a different gamma.
How can I compare my network with multiple values of gamma?
I mean, is it possible to save my whole process and run it exactly the same again?
You can use random-seed to always create the same network and then use a new seed (created and set with random-seed new-seed) to generate the random numbers and ask order etc for your processing. The tool BehaviorSpace will allow you to do many runs with different values of gamma.
Using this approach will guarantee you the same network. However, just because a particular value of gamma is best for one network, does not make it the best for other networks. So you could create multiple networks with different seeds and have NetLogo select each network (as #David suggests) or you could simply allow NetLogo to create the different networks and run many simulations so that you have a more robust answer that works over an 'average' network.
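The same pattern, sketched in Python for illustration (NetLogo's random-seed plays the role of random.Random here; the node data and the entropy rule are stand-ins for your actual model): one fixed seed rebuilds the identical network every time, while a separate per-run seed drives the processing randomness.

```python
import random

NETWORK_SEED = 42   # arbitrary fixed value; same seed -> same network every time

def build_network(n_nodes):
    """Rebuild the identical 'network' on every call, like setup after random-seed."""
    rng = random.Random(NETWORK_SEED)
    return [rng.randrange(20) for _ in range(n_nodes)]   # node data in 0..19

def run_simulation(network, gamma, run_seed):
    """Per-run randomness comes from its own seed, independent of the network."""
    rng = random.Random(run_seed)
    # stand-in for the real rule: count how often a random 'entropy' is <= gamma
    return sum(1 for _ in network if rng.random() <= gamma)

net_a = build_network(10)
net_b = build_network(10)
# net_a == net_b, so runs with gamma=1, gamma=0.98, ... are directly comparable
```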
It is possible if you design some tests first. Since you put random data each time you press setup, the previous graph is not the same as the new one, so you'll need to load the same data every time you want to test.
An idea:
Make text files with the node data and the value of gamma. For 4 nodes you'd have something like:
dat1.txt
1 3 2 9
1
dat2.txt
1 3 2 9
0.98
dat3.txt
1 3 2 9
0.96
And so on...
You can generate these files with a procedure and a specific seed (see random-numbers); this means that if you want to generate 30 tests (30 sets of 4 nodes in the above example), you'll need 30 different seeds.
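The generation step could look like this in Python (a sketch of the procedure described above; in NetLogo you would do the same with random-seed and file-print). Note the node data is generated once from the seed and reused in every file, so only gamma varies between tests:

```python
import random

def write_tests(n_nodes=4, gammas=(1.0, 0.98, 0.96), seed=123):
    """Write dat1.txt, dat2.txt, ... with the same node data and varying gamma."""
    rng = random.Random(seed)                      # one seed per test set
    data = [rng.randrange(20) for _ in range(n_nodes)]  # same nodes in every file
    for t, gamma in enumerate(gammas, start=1):
        with open(f"dat{t}.txt", "w") as f:
            f.write(" ".join(map(str, data)) + "\n")   # line 1: node data
            f.write(f"{gamma:g}\n")                    # line 2: gamma

write_tests()
```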

Solving approach for a series

I am having great trouble finding the solution to this series.
index   1  2  3   4   5
number  0  1  5  15  35
Here, say, the first index is an exception, but what is the rule for this series, so that given an index I can get the number? Please add an explanation of your solving approach.
I would also like some extra examples of solving approaches for other series of this kind.
The approaches to solve a general series matching problem vary a lot, depending on the information you have about the series. You can start with reading up on time series.
For this series you can easily google it and find out the terms are binomial coefficients, of the form n!/((n-4)!·4!) = C(n, 4). Taking the index i into account, the value is C(i+2, 4) = (i+2)!/(4!·(i-2)!), and the i=1 "exception" is covered automatically because C(3, 4) = 0 by convention.
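A quick check in Python that binomial coefficients reproduce the series (the value at index i is C(i+2, 4); math.comb returns 0 when k > n, which covers the i=1 exception):

```python
import math

# value at index i is C(i+2, 4); C(3, 4) = 0 handles the first-index exception
values = [math.comb(i + 2, 4) for i in range(1, 6)]
# values == [0, 1, 5, 15, 35], matching the series in the question
```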

Neural networks - Finding o/p from two distinct i/p patterns

I have two distinct types of input patterns (with an unknown relationship between them) and I need to design a neural network that produces an output based on both patterns. However, I am unsure of how to design such a network.
I am a newbie in NNs, but I am trying to read as much as I can. In my problem, as far as I can understand, there are two input matrices of order 6*1 and an output matrix of order 6*1. So how should I start with this? Is it OK to use backpropagation and a single hidden layer?
e.g.->
Input 1   Input 2   Output
0.59      1         0.7
0.70      1         0.4
0.75      1         0.5
0.83      0         0.6
0.91      0         0.8
0.94      0         0.9
How do I decide the order of the weight matrix and the transfer function?
Please help. Any link pertaining to this will also do. Thanks.
The simplest thing to try is to concatenate the 2 input vectors. This way you'll have 1 input vector of length 12, and this becomes a "text-book" learning problem from R^{12} to R^{6}.
The downside of this is that you lose the information that each set of 6 inputs comes from a different source, but from your description it doesn't sound like you know much about these sources anyway. If you do have special knowledge of the 2 sources, you can apply some pre-processing (like subtracting the mean, or dividing by the standard deviation) to each source to make them more similar, but most learning algorithms should also work OK without it.
As for which algorithm to try, I think the canonical order is: linear machines (perceptron), then SVMs, then multi-layer networks (trained with backprop). The reason is that the more powerful the machine, the better your chances of fitting the training set, but the lower your chances of fitting the "true" pattern (overfitting).
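For instance, reading the table in the question as six samples with two scalar inputs each, the concatenation step plus a minimal linear learner (a stand-in for a perceptron or backprop net, trained here with plain stochastic gradient descent) could be sketched as:

```python
# Each sample's two inputs are concatenated into one feature vector [x1, x2];
# after that, any standard learner applies. The data is the table above.
X = [[0.59, 1], [0.70, 1], [0.75, 1],
     [0.83, 0], [0.91, 0], [0.94, 0]]
y = [0.7, 0.4, 0.5, 0.6, 0.8, 0.9]

w = [0.0, 0.0]   # one weight per concatenated feature
b = 0.0
lr = 0.1
for _ in range(5000):                      # SGD over the tiny dataset
    for xi, yi in zip(X, y):
        pred = w[0] * xi[0] + w[1] * xi[1] + b
        err = pred - yi
        w[0] -= lr * err * xi[0]
        w[1] -= lr * err * xi[1]
        b -= lr * err
```

A linear model will not fit this data perfectly (the output is clearly not linear in the inputs), which is exactly the kind of evidence that tells you when to move up the canonical order to an SVM or a hidden-layer network.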