Can we discard a numerical variable based on a t-test when our target variable is categorical? - data-cleaning

I have a numeric variable within my data.
sample(d$timedelta, 20)
[1] 601 561 44 162 554 443 604 68 140 446 178 506 348 402 401 700 127 717 669 68
My target is a binary variable (Popularity = 1/0)
I want to drop the variable if there is no statistically significant difference in timedelta between the two groups:
pop.1.time = d$timedelta[d$Popularity==1]
pop.0.time = d$timedelta[d$Popularity==0]
t.test(pop.1.time, pop.0.time, var.equal = FALSE, paired = FALSE)
Can I drop timedelta altogether if the above test shows that there is no difference between the two groups?
Is that a valid approach? Or am I misinterpreting the meaning of a t-test?
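One point that may help when weighing this approach: a t-test only compares the group means. The sketch below is in Python rather than R, uses made-up numbers (not the question's data), and `welch_t` is a hypothetical helper that computes only the Welch statistic, not the p-value. It shows two groups with identical means where the variable still separates the classes perfectly, so "no difference in means" and "no information about the target" are not the same thing.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(x, y):
    """Welch's t statistic for two independent samples (unequal variances)."""
    standard_error = sqrt(variance(x) / len(x) + variance(y) / len(y))
    return (mean(x) - mean(y)) / standard_error


# Made-up values for illustration only.
pop1 = [100, 110, 690, 700]  # Popularity == 1: only very low or very high values
pop0 = [390, 400, 400, 410]  # Popularity == 0: only mid-range values

t = welch_t(pop1, pop0)
print(t)  # both group means are 400, so the statistic is 0

# A simple distance-from-400 rule still separates the two groups exactly.
separates = all(abs(v - 400) > 100 for v in pop1) and all(
    abs(v - 400) <= 100 for v in pop0
)
print(separates)
```

In this constructed case the test reports no difference at all, yet a model that can use the spread or a non-linear split would find the variable useful.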

Related

Importing CSV into Matlab

Chemical composition of a certain material
Hi,
I am trying to import the data below, in CSV format, into MATLAB. Its dimensions are [1000x10].
HCL;H2SO4;CH4; SULPHUR;CHLORINE;S2O3;SO2;NH3;CO2;O2
144 2 3 141 140 6 7 137 136 10 11 133
13 131 130 16 17 127 126 20 21 123 122 24
25 119 118 28 29 115 114 32 33 111 110 36
108 38 39 105 104 42 43 101 100 46 47 97
96 50 51 93 92 54 55 89 88 58 59 85
61 83 82 64 65 79 78 68 69 75 74 72
73 71 70 76 77 67 66 80 81 63 62 84
60 86 87 57 56 90 91 53 52 94 95 49
48 98 99 45 44 102 103 41 40 106 107 37
109 35 34 112 113 31 30 116 117 27 26 120
121 23 22 124 125 19 18 128 129 15 14 132
12 134 135 9 8 138 139 5 4 142 143 1
I am able to import this data with my code:
fid = fopen(uigetfile('.csv'),'rt');
FileName = fopen(fid);
headers = fgets(fid); %get first line
headers = textscan(headers,'%s','delimiter',';'); %read first line
format = repmat('%f',1,size(headers{1,1},1)); %count the columns and build the format string
data = textscan(fid,format,'delimiter',';'); %read rest of the file
data = [data{:}];
I get the data as a matrix in the variable data [1000x10], and the names of all the components (HCL, H2SO4, and so on) in a cell array named headers {1x1}.
Now I have two questions. The built-in import tool in MATLAB gives you the flexibility to import data as separate column vectors, a numeric matrix, a cell array or a table. Is it possible to do the same through code, so that after the import I get named column vectors in my workspace: HCL as [1000x1], H2SO4 as [1000x1], and so on for every column?
If yes, then please help me.
If that is not possible, there is an alternative. I already have the names of the column vectors in the headers cell array. How can I extract those names, use them as variable names through code, and assign each column of the data matrix [1000x10] to the column vector with the corresponding name?
For example, if I write
x = headers{1}{1}; I get x = "HCL"
x = genvarname(x); I get x = x0x22HCL0x2, BUT
I want x to be replaced with HCL, so that I can then assign
HCL = data(:,1), and likewise for the other variables H2SO4, SULPHUR, CHLORINE.
In other words, I am trying to implement the import tool's column-vector option in my own code.
Kindly help me solve this issue. Thanks.
Have you tried the built-in readtable function?
You can access each column of the table by using the named column header.
If you'd like, you can use the two variables you already have (data and headers) to create a table in MATLAB. I'm not terribly familiar with its use, but it seems to be well documented. I'm sure someone else can expand upon this.
Edit:
After re-reading your question, I think this is closer to what you are after.
n=10;
what='HCL';%change this to any of the strings you are interested in
numstr = repmat('%f',1,n);
hdrstr = repmat('%s',1,n);
headers = textscan(headers,hdrstr,'delimiter',';');
headers = [headers{:}]; %flatten the 1x10 cell of 1x1 cells into a 1x10 cell of strings
data = cell2mat(textscan(fid,numstr,'delimiter',';'));
datout = data(:,strcmp(headers,what));%datout will be 1000x1 HCL data
Depending on what you want to do, you can loop through these appropriately
I know this is not what you asked for, but I would convert to a struct:
x=cell2struct(num2cell(data),headers{1},2)
The reason is simple: with individual variables you cannot easily select, for example, the third row across all of them. With a struct, simply use x(3).
If at some point you need the vectors you originally asked for and you can't use the struct, use [x.HCL]
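The idea behind both answers, keeping the columns addressable by header name instead of creating one workspace variable per name, can be sketched as follows. This is a Python illustration rather than MATLAB; the three-column `text` sample is a hypothetical stand-in for the real 1000x10 file and assumes every row is ';'-separated like the header line.

```python
import csv
import io

# Hypothetical stand-in for the CSV file: a ';'-separated header line
# followed by ';'-separated numeric rows.
text = "HCL;H2SO4; SULPHUR\n144;2;3\n13;131;130\n25;119;118\n"

reader = csv.reader(io.StringIO(text), delimiter=";")
headers = [name.strip() for name in next(reader)]  # strip stray spaces in names
rows = [[float(value) for value in row] for row in reader]

# One list per column, keyed by its header name.
columns = {name: [row[i] for row in rows] for i, name in enumerate(headers)}

print(columns["HCL"])      # the whole HCL column
print(columns["SULPHUR"])  # the whole SULPHUR column
```

A name-to-column mapping like this avoids generating variable names at run time, which is the part that `genvarname` was making awkward in the question.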

Read/load parts of an irregular file in Matlab

I would like to partly load a PTX file in Matlab (please see the following example).
I need to read the first two rows (2 numbers) into 2 variables, say a and b, and read the data from the 5th row to the end into a matrix.
Thanks for your help.
114
221
1 0 0
1 0 0 0
-5.566405 -7.161944 -1.144557 0.197208 24 29 35
-5.560656 -7.154540 -1.137673 0.222400 29 32 39
-5.559846 -7.153491 -1.131895 0.254002 37 40 49
-5.560894 -7.154833 -1.126452 0.305013 51 54 63
-5.560084 -7.153783 -1.120633 0.290013 72 76 88
-5.561128 -7.155119 -1.115189 0.243214 105 113 134
-5.563203 -7.157782 -1.109926 0.227604 130 143 177
-5.569191 -7.165479 -1.105504 0.201602 121 140 173
-7.833616 -10.078705 -1.546952 0.130007 94 112 134
Look at the tdfread function in order to get the data into Matlab. It should be something like datafile = tdfread(filename, '\t'). Once you have that, index into the variable returned from that function like
a = datafile(1, 1);
b = datafile(2, 1);
data = datafile(5:end, :);
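The same split (two scalars from the first two rows, a matrix from the fifth row onward) can be sketched in Python. The `ptx_text` string is a shortened copy of the example in the question and stands in for reading the real file; it assumes whitespace-separated values, as shown there.

```python
# Shortened copy of the example file from the question.
ptx_text = """114
221
1 0 0
1 0 0 0
-5.566405 -7.161944 -1.144557 0.197208 24 29 35
-5.560656 -7.154540 -1.137673 0.222400 29 32 39
"""

lines = ptx_text.splitlines()

a = int(lines[0])  # first row
b = int(lines[1])  # second row

# Rows 3 and 4 are skipped; everything from row 5 on becomes the matrix.
data = [[float(value) for value in line.split()] for line in lines[4:] if line.strip()]

print(a, b)
print(len(data), len(data[0]))
```

Parsing the header rows and the data block separately sidesteps the problem that the rows have different numbers of columns.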

MATLAB accessing conditional values and performing operation in single column

Just started MATLAB 2 days ago and I can't figure out a non-loop method (since I read they were slow/inefficient and MATLAB has better alternatives) to perform a simple task.
I have a matrix of 5 columns and 270 rows. What I want to do is:
if the value of an element in column 5 of matrix goodM is below 90, I want to take that element and subtract it from 90.
So far I tried:
test = goodM(:,5) <= 90;
goodM(test) = 999;
It changes the matching values in column 1, not column 5, to 999. In addition, this method doesn't allow me to perform operations on the elements below 90 in column 5. Is there an elegant way to do this?
edit:: goodM(:,5)(test) = 999; doesn't seem to work either, so I have no idea how to specify the target column.
I am assuming you are looking to operate on elements that have values below 90 as your text in the question reads, rather than 'below or equal to' as represented by '<=' as used in your code. So try this -
ind = find(goodM(:,5) < 90) %// Find indices in column 5 that have values less than 90
goodM(ind,5) = 90 - goodM(ind,5) %// Operate on those elements using indices obtained from previous step
Try this code:
b=90-a(a(:,5)<90,5);
For example:
a =
265 104 479 13 176
26 110 447 208 144
379 163 179 366 464
301 48 274 391 26
429 374 174 184 297
495 375 312 373 82
465 272 399 447 420
205 170 373 122 84
1 417 63 65 252
271 277 412 113 500
then,
b=90-a(a(:,5)<90,5);
b =
64
8
6
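For comparison, the same selection can be written in Python using the example matrix from the answer above. Plain Python has no logical indexing, so the row mask becomes a condition inside a comprehension; the name `COL` is introduced here only for illustration.

```python
# Example matrix copied from the answer above.
a = [
    [265, 104, 479, 13, 176],
    [26, 110, 447, 208, 144],
    [379, 163, 179, 366, 464],
    [301, 48, 274, 391, 26],
    [429, 374, 174, 184, 297],
    [495, 375, 312, 373, 82],
    [465, 272, 399, 447, 420],
    [205, 170, 373, 122, 84],
    [1, 417, 63, 65, 252],
    [271, 277, 412, 113, 500],
]

COL = 4  # fifth column (Python indices start at 0)

# Equivalent of b = 90 - a(a(:,5) < 90, 5): only the selected elements.
b = [90 - row[COL] for row in a if row[COL] < 90]
print(b)  # [64, 8, 6]

# Equivalent of updating the matrix in place, as in the first answer.
for row in a:
    if row[COL] < 90:
        row[COL] = 90 - row[COL]
```

The first form returns only the transformed elements, while the second keeps the full matrix and overwrites column 5 where the condition holds.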

Add a constant value to a vector’s elements

I would like to add a constant value of 180 to a vector of values after the maximum value is reached. That is, if H=[12 26 67 92 167 178 112 98 76 85], how do I write MATLAB code so that 180 is added to all values after 178? The answer should be H=[12 26 67 92 167 178 292 278 256 265].
This should work on earlier Matlab versions as well:
H=[12 26 67 92 167 178 112 98 76 85]
[n, n] = max(H);
H(n+1:end) = H(n+1:end) + 180
Try the following:
n=find(H==max(H));
H(n+1:end)=H(n+1:end)+180;
Since the desired vector values are in increasing order, the idea here is to find the index of the maximum value and add 180 to all subsequent elements.
EDIT
Better approach for finding the index of the maximum, as suggested by @LeonidBeschastny
[~,n]=max(H);
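The same two steps (locate the maximum, then shift everything after it) can be sketched in Python with the vector from the question. `shifted` is a name introduced here; like `max` in MATLAB, `list.index` returns the first position if the maximum occurs more than once.

```python
H = [12, 26, 67, 92, 167, 178, 112, 98, 76, 85]

n = H.index(max(H))  # position of the (first) maximum

# Keep everything up to and including the maximum, add 180 to the rest.
shifted = H[: n + 1] + [value + 180 for value in H[n + 1 :]]

print(shifted)  # [12, 26, 67, 92, 167, 178, 292, 278, 256, 265]
```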

Matlab - Remove bad data from vector of values

I have a vector, stdclock, which holds values that follow this pattern:
stdclock=[13 25 38 50 63 75 88 100 113 125 138 150 163 175 188 200 213 2517 2529 2542 2554 2567 2579 2592 2604 2617 2629 2642 2654 2667 2679 2692 2704 2717]
This data is generated through an encoding of 17 values that come 12 or 13 numbers apart (e.g. 25-13=12, 38-25=13, etc.). You'll see that the first 17 values follow this pattern. Each group of 17 values encodes an object, which we'll call an 'item', and is independent of the subsequent 17 values. Then, between values 17 and 18, there's a much larger difference than 12 or 13, but it could be any number higher than, say, 15. This difference represents a qualitative separation in the data, such that the first 17 values encode one item, the next 17 values encode another item, and so on. The difference between the 17th and 18th value will never be as small as 12 or 13. Therefore, I can check for any differences >= 15 and be sure that I can separate my data in this way. Alternatively, I can reshape the vector as a 17 x (length(stdclock)/17) matrix.
So far so good. The problem is that this vector is generated by hardware which can sometimes have errors, such that one or more values are simply dropped and not recorded. I want to figure out an algorithm that will detect that values are missing from an 'item' and then remove all remaining values from that item.
I can't quite wrap my head around how to do this in a way that will work for all patterns of errors (e.g. if an item can have missing numbers anywhere, in any pattern, and neighboring items may also have missing numbers anywhere in any pattern, or nowhere).
Any help would be appreciated. An example of a 'corrupted' item would be like this
stdclock=[13 25 38 50 63 75 88 100 113 125 138 150 163 175 188 200 213 2529 2542 2554 2567 2579 2592 2604 2642 2654 2679 2692 2704]
where this stdclock is the same as the one on top, but I went through in the second item and randomly removed numbers, including the first and last numbers.
If you can assume that the difference between consecutive groups is always larger than some threshold, you can use the approach below: identify consecutive groups, and throw out all groups of a length less than 17. It turns out that the threshold for a new group can be set as low as 15, since a missing data point will split a group of 17 into two shorter groups, which will then both be removed.
stdclock=[13 25 38 50 63 75 88 100 113 125 138 150 163 175 188 200 213 2529 2542 2554 2567 2579 2592 2604 2642 2654 2679 2692 2704];
%# a difference of more than groupDelta indicates a new (pseudo-)group
groupDelta = 15;
groupJump = [1 diff(stdclock) > groupDelta];
%# number the groups
groupNumber = cumsum(groupJump);
%# count, for each group, the numbers.
groupCounts = hist(groupNumber,1:groupNumber(end));
%# if a group contains fewer than 17 entries, throw it out
badGroup = find(groupCounts < 17);
stdclock(ismember(groupNumber,badGroup)) = [];
stdclock =
13 25 38 50 63 75 88 100 113 125 138 150 163 175 188 200 213
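The grouping logic of this answer can also be sketched in Python, using the corrupted vector from the question. `GROUP_DELTA` and `ITEM_LEN` are names introduced here, and the sketch relies on the same assumption as the answer: any gap larger than 15 starts a new (pseudo-)group.

```python
stdclock = [13, 25, 38, 50, 63, 75, 88, 100, 113, 125, 138, 150, 163, 175,
            188, 200, 213, 2529, 2542, 2554, 2567, 2579, 2592, 2604, 2642,
            2654, 2679, 2692, 2704]

GROUP_DELTA = 15  # a larger gap than this starts a new (pseudo-)group
ITEM_LEN = 17     # a complete item has exactly this many values

# Split the sequence wherever the gap to the previous value is too large.
groups = [[stdclock[0]]]
for previous, current in zip(stdclock, stdclock[1:]):
    if current - previous > GROUP_DELTA:
        groups.append([current])
    else:
        groups[-1].append(current)

# Discard every group with fewer than ITEM_LEN entries.
cleaned = [value for group in groups if len(group) >= ITEM_LEN for value in group]

print([len(group) for group in groups])  # [17, 7, 2, 3]
print(cleaned)
```

A dropped value inside an item produces a gap of at least 24, so the item falls apart into pieces shorter than 17 and all of them are discarded.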