Read a big data file with headlines into a matrix - matlab

I have a file that looks like this (with real data and much bigger):
A B C D E F G H I
1 105.28 1 22 84 2 10.55 21 2
2 357.01 0 32 34 1 11.43 28 1
3 150.23 3 78 22 0 12.02 11 0
4 357.01 0 32 34 1 11.43 28 1
5 357.01 0 32 34 1 11.43 28 1
6 357.01 0 32 34 1 11.43 28 1
...
17000 357.01 0 32 34 1 11.43 28 1
I want to import all the numerical value into a matrix, skipping the headlines. For that purpose I use this code:
Filename = 'test.txt';
A = dlmread(Filename,' ',1,0); %Imports the whole data into a matrix
The problem with this is just that A is a 17 000 * 1 vector instead of a matrix with several columns. If I manual edit the data file, remove the headlines and just run this it works:
A = dlmread(Filename); %Imports the whole data into a matrix
But I would prefer not to do this since the headlines are used later on in the code. Any advice how to get this work?
edit: solved by using
' '
instead of just
' '

Use the import tool.
Make sure you choose the data.
Generate script.

Related

KDB/Q How to implement moving rank efficiently?

I am trying to implement a moving rank function, taking parameters of n, the number of items, and m, the column name. Here is how I implement it:
mwindow: k){[y;x]$[y>0;x#(!#x)+\:!y;x#(!#x)+\:(!-y)+y+1]};
mrank: {[n;x] sum each x > prev mwindow[neg n;x]};
But this seems to take quite some time if n is moderately large, say 100.
I figure it is because it has to calculate from scratch, unlike msum, which keeps a running variable and only calculate the difference between the newly added and the dropped.
There's a number of general sliding window functions here that you can use to generate rolling lists on which to apply your rank: https://code.kx.com/q/kb/programming-idioms/#how-do-i-apply-a-function-to-a-sequence-sliding-window
Those approaches seem to fill the lists out with zeros/nulls however which I think won't really suit your use of rank. Here's another possible approach which might be more suitable to rank (though I haven't tested this for performance on the large scale):
q)mwin:{x each (),/:{neg[x]sublist y,z}[y]\[z]}
q)update r:mwin[rank;4;c] from ([]c:10?100)
c r
----------
84 ,0
25 1 0
31 2 0 1
0 3 1 2 0
51 1 2 0 3
29 2 0 3 1
25 0 3 2 1
73 2 1 0 3
0 2 1 3 0
6 2 3 0 1
q)update r:last each mwin[rank;4;c] from ([]c:10?100)
c r
----
38 0
72 1
13 0
77 3
64 1
9 0
37 1
79 3
97 3
63 1
q)

Plot selected rows with the average and standard deviation (GNUPlot)

I have a csv file with experiment results that goes like this:
64 4 8 1 1 2 1 ttt 62391 4055430 333 0.0001 10 161 108 288 0
64 4 8 1 1 2 1 ttt 60966 3962810 322 0.0001 10 164 112 295 0
64 4 8 1 1 2 1 ttt 61530 3999475 325 0.0001 10 162 112 291 0
64 4 8 1 1 2 1 ttt 61430 4054428 332 0.0001 10 158 110 286 0
64 4 8 1 1 2 1 ttt 63891 4152938 339 0.0001 9 149 109 274 0
64 4 32 1 1 2 1 ttt 63699 4204182 345 0.0001 4 43 179 240 0
64 4 32 1 1 2 1 ttt 63326 4116218 336 0.0001 4 45 183 248 0
64 4 32 1 1 2 1 ttt 62654 4135211 340 0.0001 4 48 178 248 0
64 4 32 1 1 2 1 ttt 63192 4107506 339 0.0001 4 49 175 245 0
64 4 32 1 1 2 1 ttt 62707 4138666 345 0.0001 4 46 179 245 0
64 4 64 1 1 2 1 ttt 60968 3962929 323 0.0001 4 46 191 256 0
64 4 64 1 1 2 1 ttt 58765 3819787 305 0.0001 4 50 196 267 0
64 4 64 1 1 2 1 ttt 58946 3831499 308 0.0001 5 52 187 260 0
64 4 64 1 1 2 1 ttt 60646 3942047 321 0.0001 4 47 187 254 0
64 4 64 1 1 2 1 ttt 59723 3882044 311 0.0001 4 46 201 269 0
64 8 8 1 1 2 1 ttt 63414 4185382 382 0.0001 33 517 109 643 0
64 8 8 1 1 2 1 ttt 62429 4057899 372 0.0001 33 538 110 667 0
64 8 8 1 1 2 1 ttt 60622 3940452 384 0.0001 33 556 115 689 0
64 8 8 1 1 2 1 ttt 64433 4188192 369 0.0001 33 519 110 644 0
My goal is to be able to plot various combinations (choose which, in different charts) of the columns before the "ttt", with the average and standard deviation of the columns (choose which) after "ttt" (by grouping them by the before "ttt" columns).
Is this possible in GNUPlot and if yes how? If not, do you have any alternate suggestions regarding my problem?
Here is a completely revised and more general version.
Since you want to filter by 3 columns you need to have 3 properties to distinguish the data in the plot. This would be for example color, x-position and pointtype. What the script basically does:
Generates random data for testing (take your file instead)
$Data looks like this:
8 64 57773 0
4 32 64721 2
8 32 56757 1
4 16 56226 2
8 8 56055 1
8 64 59874 0
8 32 58733 0
4 16 55525 2
8 32 58869 0
8 64 64470 0
4 32 60930 1
8 64 57073 2
...
the variables ColX, ColC, ColP, and ColS define which columns are taken for x-position, color, pointtype and statistics.
find unique values of ColX, ColC, ColP, (check help smooth frequency) and put them to datablocks $ColX, $ColC, and $ColP.
put the unique values to arrays ArrX, ArrC, ArrP
loop all possible combinations and do statistics on ColS and put it to $Data2. Add 3 columns at the beginning for color, x-position and pointtype.
$Data2 looks like this:
1 1 1 0 8 4 61639.4 2788.4
1 1 2 0 8 8 59282.1 2740.2
1 2 1 0 16 4 59372.3 2808.6
1 2 2 0 16 8 60502.3 2825.0
1 3 1 0 32 4 59850.7 2603.8
1 3 2 0 32 8 60617.7 1979.8
1 4 1 0 64 4 60399.4 3273.6
1 4 2 0 64 8 59930.7 2919.8
2 1 1 1 8 4 59172.6 2288.2
2 1 2 1 8 8 58992.2 2888.0
2 2 1 1 16 4 59350.1 2364.6
2 2 2 1 16 8 61034.0 2368.5
2 3 1 1 32 4 59920.8 2867.6
2 3 2 1 32 8 59711.9 3464.2
2 4 1 1 64 4 60936.7 3439.7
2 4 2 1 64 8 61078.7 2349.3
3 1 1 2 8 4 58976.0 2376.3
3 1 2 2 8 8 61731.5 1635.7
3 2 1 2 16 4 58276.0 2101.7
3 2 2 2 16 8 58594.5 3358.5
3 3 1 2 32 4 60471.5 3737.6
3 3 2 2 32 8 59909.1 2024.0
3 4 1 2 64 4 62044.2 1446.7
3 4 2 2 64 8 60454.0 3215.1
Finally, plot the data. I couldn't figure out how plotting style with yerror works properly together with variable pointtypes. So, instead I split it into two plot commands with vectors and with points. The third one keyentry is just to get an empty line in the legend and the forth one is to get the pointtype into the legend.
I hope you can figure out all the other details and adapt it to your data.
Code:
### grouped statistics on filtered (unsorted) data
reset session
set colorsequence classic
# generate some random test data
rand1(n) = 2**(int(rand(0)*2)+2) # values 4,8
rand2(n) = 2**(int(rand(0)*4)+3) # values 8,16,32,64
rand3(n) = int(rand(0)*10000)+55000 # values 55000 to 65000
rand4(n) = int(rand(0)*3) # values 0,1,2
set print $Data
do for [i=1:200] {
print sprintf("% 3d% 4d% 7d% 3d", rand1(0), rand2(0), rand3(0), rand4(0))
}
set print
print $Data # (just for test purpose)
ColX = 2 # column for x
ColC = 4 # column for color
ColP = 1 # column for pointtype
ColS = 3 # column for statistics
# get unique values of the columns
set table $ColX
plot $Data u (column(ColX)) smooth freq
unset table
set table $ColC
plot $Data u (column(ColC)) smooth freq
unset table
set table $ColP
plot $Data u (column(ColP)) smooth freq
unset table
# put unique values into arrays
set table $Dummy
array ArrX[|$ColX|-6] # gnuplot creates 6 extra lines
array ArrC[|$ColC|-6]
array ArrP[|$ColP|-6]
plot $ColX u (ArrX[$0+1]=$1)
plot $ColC u (ArrC[$0+1]=$1)
plot $ColP u (ArrP[$0+1]=$1)
unset table
print ArrX, ArrC, ArrP # just for test purpose
# define filter function
Filter(c,x,p) = ArrX[x]==column(ColX) && ArrC[c]==column(ColC) && \
ArrP[p]==column(ColP) ? column(ColS) : NaN
# loop all values and do statistics, write data into $Data2
set print $Data2
do for [c=1:|ArrC|] {
do for [x=1:|ArrX|] {
do for [p=1:|ArrP|] {
undef var STATS*
stats $Data u (Filter(c,x,p)) nooutput
if (exists('STATS_mean') && exists('STATS_stddev')) {
print sprintf("% 3d% 3d% 3d% 3d% 3d% 3d% 9.1f % 7.1f", c, x, p, ArrC[c], ArrX[x], ArrP[p], STATS_mean, STATS_stddev)
}
}
}
print ""; print ""
}
set print
# print $Data2 # just for testing purpose
set xlabel sprintf("Column %d", ColX)
set ylabel sprintf("Column %d", ColS)
set xrange[0.5:|ArrX|+1]
set xtics () # remove all xtics
do for [x=1:|ArrX|] { set xtics add (sprintf("%d",ArrX[x]) x)} # set xtics "manually"
# function for x position and offsets,
# actually not dependent on 'n' but to shorten plot command
# columns in $Data2: 1=color, 2=x, 3=pointtype
width = 0.5 # float number!
step = width/(|ArrC|-1)
PosX(n) = column(2) - width/2.0 + step*(column(1)-1) + (column(3)-1)*step*0.3
plot \
for [c=1:|ArrC|] $Data2 u (PosX(0)):($7-$8):(0):(2*$8) index c-1 w vectors \
heads size 0.04,90 lw 2 lc c ti sprintf("%g",ArrC[c]),\
for [c=1:|ArrC|] '' u (PosX(0)):7:($3*2+4):(c) index c-1 w p ps 1.5 pt var lc var not, \
keyentry w p ps 0 ti "\n", \
for [p=1:|ArrP|] '' u (0):(NaN) w p pt p*2+4 ps 1.5 lc rgb "black" ti sprintf("%g",ArrP[p])
### end of code
Result:
I do not think gnuplot can produce exactly what you are asking for in a single plot command. I will show you two alternatives in the hope that one or both is a useful starting point.
Alternative 1: standard boxplot
spacing = 1.0
width = 0.25
unset key
set xlabel "Column 3"
set ylabel "Column 9"
plot 'data' using (spacing):9:(width):3 with boxplot lw 2
This collects points based on the value in column 3 and for each such value it produces a boxplot. This is a widely used method of showing the distribution of point values in a category, but it is a quartile analysis not a display of mean + standard deviation.
Alternative 2: calculate mean and standard deviation for categories known in advance
The boxplot analysis has the advantage that you do not need to know in advance what values may be present in column 3. Gnuplot can calculate mean and standard deviation based on a column 3 value, but you need to specify in advance what that value is. Here is a set of commands tailored to the specific example file you provided. It calculates, but does not plot, the requested categorical mean and standard deviation. You can use these numbers to construct a plot, but that will require additional commands. You could, for example, save the values for each category in a new file, or array, or datablock and then go back and plot these together.
col3entry = "8 32 64"
do for [i in col3entry] {
stats "data" using ($3 == real(i) ? $9 : NaN) name "Condition".i nooutput
print i, ": ", value("Condition".i."_mean"), value("Condition".i."_stddev")
}
output:
8: 62345.1111111111 1259.34784220021
32: 63115.6 392.552977316438
64: 59809.6 881.583711283279

Matlab replace consecutive zero value with others value

I have this matrix:
A = [92 92 92 91 91 91 146 146 146 0
0 0 112 112 112 127 127 127 35 35
16 16 121 121 121 55 55 55 148 148
0 0 0 96 96 0 0 0 0 0
0 16 16 16 140 140 140 0 0 0]
How can I replace consecutive zero value with shuffled consecutive value from matrix B?
B = [3 3 3 5 5 6 6 2 2 2 7 7 7]
The required result is some matrix like this:
A = [92 92 92 91 91 91 146 146 146 0
6 6 112 112 112 127 127 127 35 35
16 16 121 121 121 55 55 55 148 148
7 7 7 96 96 5 5 3 3 3
0 16 16 16 140 140 140 2 2 2]
You simply can do it like this:
[M,N]=size(A);
for i=1:M
for j=1:N
if A(i,j)==0
A(i,j)=B(i+j);
end
end
end
If I understand it correctly from what you've described, your solution is going to need the following steps:
Loop over the rows of your matrix, e.g. for row = 1:size(A, 1)
Loop over the elements of each row, identify where each run of zeroes starts and store the indices and the length of the run. For example you might end up with a matrix like: consecutiveZeroes = [ 2 1 2 ; 4 1 3 ; 4 6 5 ; 5 8 3 ] indicating that you have a run at (2, 1) of length 2, a run at (4, 1) of length 3, a run at (4, 6) of length 5, and a run at (5, 8) of length 3.
Now loop over the elements of B counting up how many elements there are of each value. For example you might store this as replacementValues = [ 3 3 ; 2 5 ; 2 6 ; 3 2 ; 3 7 ] meaning three 3's, two 5's, two 6's etc.
Now take a row from your consecutiveZeroes matrix and randomly choose a row of replacementValues that specifies the same number of elements, replace the zeroes in A with the values from replacementValues, and delete the row from replacementValues to show that you've used it.
If there isn't a row in replacementValues that describes a long enough run of values to replace one of your runs of zeroes, find a combination of two or more rows from replacementValues that will work.
You can't do this with just a single pass through the matrix, because presumably you could have a matrix A like [ 15 7 0 0 0 0 0 0 3 ; 2 0 0 0 5 0 0 0 9 ] and a vector B like [ 2 2 2 3 3 3 7 7 5 5 5 5 ], where you can only achieve what you want if you use the four 5's and two 7's and not the three 2's and three 3's to substitute for the run of six zeroes, because you have to leave the 2's and 3's for the two runs of three zeroes in the next row. The easiest approach if efficiency is not critical would probably be to run the algorithm multiple times trying different random combinations until you get one that works - but you'll need to decide how many times to try before giving up in case the input data actually has no solution.
If you get stuck on any of these steps I suggest asking a new, more specific question.

matlab replace value in matrix

I am trying to replace the first and second column of an edgelist matrix (edgenumber x 3) by specific node numbers as such:
5 1 1
1 38 1
2 1 1
28 17 1
18 1 1
25 1 1
that the node numbers (connection from node 5 to node 1) are replaced by the corresponding values from a vector. The edgelist is generated from an unweighted 40x40 adjacency matrix.
The vector degree_list of size 40x1 contains the "real"node numbers of this edgelist which i want to add to a larger 321x321 adjacency matrix. (if there's an easier way to do that then by concatenating the edgelists, that would also be great).
degree_list=[183,150,151,39,184,185,152,...];
So of the above edgelist I would like to replace all 1s in coll 1 and 2 by 183, all 2s by 150, etc.
Then I need to keep this new edgelist which I the add to the larger edgelist, transform it back to an adjacency matrix and have my new correct bigger adjM.
I have tried to find a solution here and on other websites but not been successful. Thank you very much for your help,
Chris
Code
a1 = [
5 1 1
1 38 1
2 1 1
28 17 1
18 1 1
25 1 1]
degree_list=[183,150,151,39,184,185,152];
col12 = a1(:,[1 2])
col12_uniq = unique(col12)
degree_list = [degree_list numel(degree_list)+1:max(col12_uniq)]
uniq_dim3 = bsxfun(#eq,col12,permute(repmat(col12_uniq,1,2),[3 2 1]))
match_dim3 = bsxfun(#times,uniq_dim3,permute(degree_list(col12_uniq),[3 1 2]))
a1_out = [sum(match_dim3,3) a1(:,3)]
Output
a1 =
5 1 1
1 38 1
2 1 1
28 17 1
18 1 1
25 1 1
a1_out =
184 183 1
183 38 1
150 183 1
28 17 1
18 183 1
25 183 1

Comparing content of two columns in MATLAB

I need to compare the content of two tables, more exactly two columns (one column per table), in MATLAB to see for each element of the first column, if there is an equal element in the second column.
Should I use a for loop or is there an existing MATLAB function that does this?
If the order is important, you do element-wise comparison, after which you use all
%# create two arrays
A = [1,2;1,3;2,5;3,3];
B = [2,2;1,3;1,5;3,3];
%# compare the second column of A and B, and check if the comparison is `true` for all elements
all(A(:,2)==B(:,2))
ans =
1
If the order is unimportant and all elements are unique, use ismember
all(ismember(A(:,1),B(:,1))
ans =
1
If the order is unimportant, and there are repetitions, use sort
all(sort(A(:,1))==sort(B(:,2)))
ans =
0
did you know you could do this:
>> a = [1:5];
>> b = [5:-1:1];
>> a == b
ans =
0 0 1 0 0
so you could compare 2 columns in matlab by using the == operator on the whole column. And you could use the result from that as a index specifier to get the equal values. Like this:
>> a(a == b)
ans =
3
This mean, select all the elements out of a for which a == b is true.
For example you could also select all the elements out of a which are larger than 3:
>> a(a > 3)
ans =
4 5
Using this knowledge I would say you could solve your problem.
For arithmetic values, both solutions mentioned will work. For strings or cell arrays of strings, use strcmp/strcmpi.
From the help file:
TF = strcmp(C1, C2) compares each element of C1 to the same element in C2, where C1 and C2 are equal-size cell arrays of strings. Input C1 or C2 can also be a character array with the right number of rows. The function returns TF, a logical array that is the same size as C1 and C2, and contains logical 1 (true) for those elements of C1 and C2 that are a match, and logical 0 (false) for those elements that are not.
An example (also from the help file):
Example 2
Create 3 cell arrays of strings:
A = {'MATLAB','SIMULINK';'Toolboxes','The MathWorks'};
B = {'Handle Graphics','Real Time Workshop';'Toolboxes','The MathWorks'};
C = {'handle graphics','Signal Processing';' Toolboxes', 'The MATHWORKS'};
Compare cell arrays A and B with sensitivity to case:
strcmp(A, B)
ans =
0 0
1 1
Compare cell arrays B and C without sensitivity to case. Note that 'Toolboxes' doesn't match because of the leading space characters in C{2,1} that do not appear in B{2,1}:
strcmpi(B, C)
ans =
1 0
0 1
To get a single return value rather than an array of logical values, use the all function as explained by Jonas.
You can use for loop (code below) to compare the content of the column 1 and column % 2 in the same table:
clc
d=[ 19 24 16 12 35 0
16 16 18 0 23 18
16 10 7 10 13 24
19 8 30 0 12 26
16 12 4 1 13 12
24 0 31 0 40 0
12 11 10 6 20 0
16 11 6 2 25 9
17 9 21 0 17 8
20 0 7 10 22 0
13 16 12 18 17 13
17 23 17 0 23 20
25 0 10 3 17 15
14 4 4 17 12 10
19 24 21 5 35 0
15 20 5 0 10 31
13 8 0 16 40 0
18 27 26 1 19 14
12 0 2 0 12 4
20 0 6 2 15 21
20 0 26 0 18 26
12 11 1 13 19 15
14 0 20 0 9 16
14 15 6 12 40 0
20 0 8 10 18 12
10 11 14 0 13 11
5 0 22 0 8 12 ];
x1=d(:,1);
y1=d(:,2);
for i=1:27
if x1(i)>y1(i);
z(i)=x1(i);
else
z(i)=y1(i);
end
end
z'