Clustering data with different approaches - cluster-analysis

I have the following type of data:
The *.edge file contains the connections between the IDs of different users:
1 23
4 67
...
The *.feat file contains properties of the IDs. The first column (column 0) holds the user IDs; the remaining columns represent features whose names are listed in another file. For example, user ID 1 does not have the feature in column 1 (value 0), but user ID 4 does (value 1):
1: 0 0 1 0 1 1 0 1 1
4: 1 0 1 1 1 0 1 1 1
...
Now I want to cluster the data using different algorithms such as k-means, DBSCAN, hierarchical clustering and so on. But from what I have read, there are several problems with multidimensional data. Is that the case here?

There are problems with very high-dimensional data, but 10 is not high. You have other problems: k-means needs coordinates to compute means, not a graph with edges. Also, the values should be continuous, not binary. You need to study these methods in more detail. If you say "But as I read ...", then try to give a reference.
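To illustrate why the representation matters, here is a minimal Python sketch that compares the two users from the *.feat sample with the Jaccard distance, a measure commonly used for binary features. It is not part of the answer above: the function name `jaccard_distance` is hypothetical, and feeding such pairwise distances to a distance-based method (for example hierarchical clustering) is an assumption about what might suit this data.

```python
def jaccard_distance(a, b):
    """Jaccard distance between two equal-length 0/1 feature vectors."""
    both = sum(1 for x, y in zip(a, b) if x and y)    # features set in both vectors
    either = sum(1 for x, y in zip(a, b) if x or y)   # features set in at least one
    if either == 0:
        return 0.0  # convention chosen here: two all-zero vectors count as identical
    return 1.0 - both / either

# The two sample rows from the *.feat file, keyed by user ID.
features = {
    1: [0, 0, 1, 0, 1, 1, 0, 1, 1],
    4: [1, 0, 1, 1, 1, 0, 1, 1, 1],
}

d = jaccard_distance(features[1], features[4])
print(d)  # 0.5 (4 shared features out of 8 present in either user)
```

Unlike a mean of coordinates, this distance is well defined for 0/1 data, which is why it is sketched here instead of k-means.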

Generate a list of unique random numbers where the number should not be same as index of number in list using kdb

I need to write a function that generates a list of unique random numbers where the number should not be the same as the index of the list.
Valid output - 1 3 2 0
Invalid output - 0 3 2 1 /- Number 0 and index 0 are the same
Invalid output - 1 0 2 3 /- Numbers 2 and 3 match their indexes
I can think of using the deal function (?), but the numbers in the resulting list can match their index:
-5?5
Can I do something with the seed so that it generates numbers different from the index of the number?
Any other solution that does not use the roll/deal function (?) would also be of great help.
EDIT:
I came up with the solution below (different or optimized approaches are welcome):
{$[max(til x)=o:{neg[x]?x}x;.z.s[x];o]}
There are probably a few ways to go about this. I came up with this:
q){$[any (r:neg[x]?x)=til x;.z.s x;r]}5
3 2 4 0 1
q){$[any (r:neg[x]?x)=til x;.z.s x;r]}5
2 0 4 1 3
This generates a random permutation and, if any element equals its index, calls itself again to generate another one until no element matches its index.
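For readers who do not use q, the same retry idea can be sketched in Python. This is an illustration only, not a translation endorsed by the answer; the name `random_derangement` and the explicit guard for n = 1 are assumptions added here.

```python
import random

def random_derangement(n, rng=random):
    """Return a random permutation of range(n) in which no value equals its index."""
    if n == 1:
        # A single element can only sit at index 0, so retrying would never end.
        raise ValueError("no derangement exists for n == 1")
    while True:
        perm = list(range(n))
        rng.shuffle(perm)  # uniform random permutation, like a full deal
        if all(value != index for index, value in enumerate(perm)):
            return perm    # accept only when no element matches its index

print(random_derangement(5))
```

For large n, roughly a fraction 1/e of all permutations have no fixed point, so the loop is expected to need fewer than three attempts on average.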

Extending Rabin-Karp algorithm to hash a 2D matrix

I'm trying to solve a problem here; it asks for the size of the biggest common subsquare between two matrices.
e.g.
Matrix #1
3 3
1 2 0
1 2 1
1 2 3
Matrix #2
3 3
0 1 2
1 1 2
3 1 2
Answer: 2
Biggest common subsquare is:
1 2
1 2
I know that the Rabin-Karp algorithm can be extended to work on a 2D matrix, but I can't understand exactly how to do that. I tried to understand the author's code in the editorial, but it's too complicated. I also searched for a good explanation, but I couldn't find a clear one.
Can anyone explain simply how I can use the Rabin-Karp algorithm to hash a matrix? I know I will hash rows and columns, but I can't see how to mix their hashes together to come up with a hash of the matrix, or how the rolling hash function is handled in this case.
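One common way to combine row and column hashes is sketched below in Python. It is not the editorial's code: the names `square_hashes` and `largest_common_square`, the two bases and the modulus are all arbitrary choices made for illustration. Each row is first hashed with a rolling window of width k; those row hashes are then treated as the "characters" of a second rolling hash that moves down each column. Because a common k x k square always contains a common (k-1) x (k-1) square, the size can be binary searched.

```python
MOD = (1 << 61) - 1   # large prime modulus (arbitrary choice)
BASE_ROW = 131        # base for hashing along a row (arbitrary choice)
BASE_COL = 137        # base for hashing down a column (arbitrary choice)

def square_hashes(matrix, k):
    """Return the set of hashes of every k x k sub-square of matrix."""
    n, m = len(matrix), len(matrix[0])
    if k > n or k > m:
        return set()
    # Step 1: roll a 1D hash along each row for every window of width k.
    pow_row = pow(BASE_ROW, k - 1, MOD)
    row_hashes = []
    for row in matrix:
        h = 0
        for x in row[:k]:
            h = (h * BASE_ROW + x + 1) % MOD
        hashes = [h]
        for j in range(k, m):
            h = (h - (row[j - k] + 1) * pow_row) % MOD  # drop leftmost cell
            h = (h * BASE_ROW + row[j] + 1) % MOD       # append new cell
            hashes.append(h)
        row_hashes.append(hashes)
    # Step 2: roll a second hash down each column of row hashes.
    pow_col = pow(BASE_COL, k - 1, MOD)
    result = set()
    for j in range(m - k + 1):
        h = 0
        for i in range(k):
            h = (h * BASE_COL + row_hashes[i][j]) % MOD
        result.add(h)
        for i in range(k, n):
            h = (h - row_hashes[i - k][j] * pow_col) % MOD  # drop top row
            h = (h * BASE_COL + row_hashes[i][j]) % MOD     # append bottom row
            result.add(h)
    return result

def largest_common_square(a, b):
    """Binary search the largest k for which a and b share a k x k hash."""
    lo, hi = 0, min(len(a), len(a[0]), len(b), len(b[0]))
    while lo < hi:
        mid = (lo + hi + 1) // 2
        if square_hashes(a, mid) & square_hashes(b, mid):
            lo = mid
        else:
            hi = mid - 1
    return lo

a = [[1, 2, 0], [1, 2, 1], [1, 2, 3]]
b = [[0, 1, 2], [1, 1, 2], [3, 1, 2]]
print(largest_common_square(a, b))  # 2
```

Equal hashes only indicate a probable match; a stricter version would compare the actual cells whenever two hashes collide.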

Choosing which variables to normalize while applying logistic regression

Suppose a dataset comprises both continuous and binary independent variables. Usually the label/outcome column is converted to a one-hot vector, whereas continuous variables can be normalized. But what should be applied to binary variables?
AGE RACE GENDER NEURO EMOT
15.95346 0 0 3 1
14.57084 1 1 0 0
15.8193 1 0 0 0
15.59754 0 1 0 0
How does this apply to logistic regression and neural networks?
If the range of the continuous values is small, encode each value in binary form and use each bit of that binary form as a predictor.
For example, number 2 = 10 in binary.
Therefore
predictor_bit_0 = 0
predictor_bit_1 = 1
Try it and see if it works. Just to warn you, this method is very subjective and may or may not yield good results for your data. I'll keep you posted if I find a better solution.
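The bit encoding described in this answer can be sketched in Python as follows. The helper name `to_bit_predictors` is hypothetical, and the sketch assumes the value is already a non-negative integer; a continuous column such as AGE would first have to be rounded or binned, which the answer does not specify.

```python
def to_bit_predictors(value, n_bits):
    """Split a non-negative integer into n_bits binary predictors, least significant bit first."""
    if value < 0 or value >= 1 << n_bits:
        raise ValueError("value does not fit in n_bits")
    # Bit i of the value becomes predictor_bit_i.
    return [(value >> i) & 1 for i in range(n_bits)]

bits = to_bit_predictors(2, 2)
print(bits)  # [0, 1], i.e. predictor_bit_0 = 0 and predictor_bit_1 = 1
```

Each element of the returned list would then be used as its own 0/1 input column.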

Create a new variable in Tableau

I am new to Tableau and trying to get myself oriented to this system. I am an R user and typically work with wide data formats, so getting things wrangled into the proper long format has been tricky. Here is my current problem.
Assume I have a data file structured like this:
ID Disorder Value
1 A 0
1 B 1
1 C 0
2 A 1
2 B 1
2 C 1
3 A 0
3 B 0
3 C 0
What I would like to do is combine the variables, so that the presence of a set of disorders is used for summary variables. For example, how could I achieve something like the following output? The sum is the number of people with the disorder, and the percentage is the number of people with the disorder divided by the total number of people.
Disorders Sum Percentage
A 1 33.3
B 2 66.6
C 1 33.3
AB 2 66.6
BC 2 66.6
AC 1 33.3
ABC 2 66.6
The approach really depends on how flexible it has to be. Ultimately, a wide data source with one column per disorder would make this easier. You will still need to blend the results on a data scaffold that contains the combinations of codes you want in order to get this to work in Tableau. If this needs to scale, you'll want to do the transformation work using custom SQL or another ETL tool like Alteryx. I posted a solution to this question for you over on the Tableau forum, where I can upload files: http://community.tableausoftware.com/message/316168
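To make the transformation itself concrete, here is a small Python sketch of the aggregation, done outside Tableau. It is only an illustration of the kind of pre-processing the answer refers to, not the solution posted on the Tableau forum. The function name `summarize` is hypothetical, and the rule that a person counts toward a combination when they have at least one of its disorders is an assumption inferred from the expected table in the question.

```python
from itertools import combinations

# The long-format sample from the question: (ID, Disorder, Value).
rows = [
    (1, "A", 0), (1, "B", 1), (1, "C", 0),
    (2, "A", 1), (2, "B", 1), (2, "C", 1),
    (3, "A", 0), (3, "B", 0), (3, "C", 0),
]

def summarize(rows):
    """Map each disorder combination to (count, percentage of all people)."""
    ids = {pid for pid, _, _ in rows}
    disorders = sorted({d for _, d, _ in rows})
    has = {(pid, d) for pid, d, value in rows if value == 1}
    summary = {}
    for size in range(1, len(disorders) + 1):
        for combo in combinations(disorders, size):
            # Assumption: count a person if they have ANY disorder in the combination.
            count = sum(1 for pid in ids if any((pid, d) in has for d in combo))
            summary["".join(combo)] = (count, 100.0 * count / len(ids))
    return summary

summary = summarize(rows)
for name, (count, pct) in summary.items():
    print(name, count, round(pct, 1))
```

The resulting table of combinations could then serve as the scaffold that the Tableau data is blended against.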

matlab-how to separate one row one column of a large binary number into many columns, one per entry

I am trying to send a JPEG through a binary symmetric channel (bsc). MATLAB has a function for this, but it requires the information to be in one row, with each bit as a separate entry.
For example:
This is what I have: 01010101 (1x1), and this is what I need: 0 1 0 1 0 1 0 1 (1x8).
I've tried reshape and a lot of other things, but no dice. Any help would be appreciated.
~kelly
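The conversion being asked for can be sketched in Python to show the idea; this is not MATLAB code, and the helper name `split_bits` is hypothetical. The sketch assumes the bits are held as text, because a value stored as a number would lose its leading zero (01010101 becomes 1010101).

```python
def split_bits(bit_string):
    """Turn a string such as '01010101' into a list with one 0/1 integer per character."""
    if any(ch not in "01" for ch in bit_string):
        raise ValueError("expected only the characters '0' and '1'")
    return [int(ch) for ch in bit_string]

print(split_bits("01010101"))  # [0, 1, 0, 1, 0, 1, 0, 1]
```

In MATLAB, a commonly used idiom for the same step is to subtract the character '0' from a character array, which yields a numeric row vector with one entry per digit.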