How to exclude Character and have just numeric in R data - character

I should make a regression analysis about the relationship between income and authoritarianism. I have a database with 6k samples and in some of them I have characters instead of numerics. How can I quickly exclude characters instead of numerics?

You have some range in your case and you cannot use directly. you should choose some categorical variables for those ranges. For example, replace the ranges < 3000 by 1, 3000<ranges<4999 by 2 and so on. Also, I found that your dataset has some missing values and for them, you have two options:
1) you can remove those rows
2) you can replace them by the mean of non-missing values.

Related

Look up an account then average associated values excluding zeros

On one sheet, I have account code and in the cell next to it, I need to look up the account code on the next sheet to average the cost excluding those cells that are zero in col. b from the average calculation.
The answer for London should be: £496.33 but having tried various sumifs / countifs I cannot get it to work.
You probably need COUNTIFS which -- similar to the SUMIFS you are already using -- allows to define multiple critera and ranges.
So, if the column R contains the values, you want to build the average upon, and the column H in the respective row must equal $B$28 to be included in the sum, the respective COUNTIFS looks as follows
=SUMIFS('ESL Info'!$R:$R,'ESL Info'!H:H,$B$28)/COUNTIFS('ESL Info'!$H:$H,$B$28, 'ESL Info'!$R:$R, "<>0")
ie additionally to the value in the H-column to equal B28 it also requires the value R-column (ie the actual number you are summing up) to be different from 0
You could also add the same criteria 'ESL Info'!$R:$R, "<>0" to your SUMIFS, but that isn't necessary, because a 0 doesn't provide anything to you sum, thus it doesn't matter if it's included in the sum or not ...
And depending on the Excel version you are using, you may even have the AVERAGEIFS function available, which does exactly what you want
=AVERAGEIFS('ESL Info'!$R:$R,'ESL Info'!$H:$H;$B$28,'ESL Info'!$R:$R,"<>0")

When to use a cell, matrix, or table in Matlab

I am fairly new to matlab and I am trying to figure out when it is best to use cells, tables, or matrixes to store sets of data and then work with the data.
What I want is to store data that has multiple lines that include strings and numbers and then want to work with the numbers.
For example a line would look like
'string 1' , time, number1, number 2
. I know a matrix works best if al elements are numbers, but when I use a cell I keep having to convert the numbers or strings to a matrix in order to work with them. I am running matlab 2012 so maybe that is a part of the problem. Any help is appreciated. Thanks!
Use a matrix when :
the tabular data has a uniform type (all are floating points like double, or integers like int32);
& either the amount of data is small, or is big and has static (predefined) size;
& you care about the speed of accessing data, or you need matrix operations performed on data, or some function requires the data organized as such.
Use a cell array when:
the tabular data has heterogeneous type (mixed element types, "jagged" arrays etc.);
| there's a lot of data and has dynamic size;
| you need only indexing the data numerically (no algebraic operations);
| a function requires the data as such.
Same argument for structs, only the indexing is by name, not by number.
Not sure about tables, I don't think is offered by the language itself; might be an UDT that I don't know of...
Later edit
These three types may be combined, in the sense that cell arrays and structs may have matrices and cell arrays and structs as elements (because thy're heterogeneous containers). In your case, you might have 2 approaches, depending on how you need to access the data:
if you access the data mostly by row, then an array of N structs (one struct per row) with 4 fields (one field per column) would be the most effective in terms of performance;
if you access the data mostly by column, then a single struct with 4 fields (one field per column) would do; first field would be a cell array of strings for the first column, second field would be a cell array of strings or a 1D matrix of doubles depending on how you want to store you dates, the rest of the fields are 1D matrices of doubles.
Concerning tables: I always used matrices or cell arrays until I
had to do database related things such as joining datasets by a unique key; the only way I found to do this in was by using tables. It takes a while to get used to them and it's a bit annoying that some functions that work on cell arrays don't work on tables vice versa. MATLAB could have done a better job explaining when to use one or the other because it's not super clear from the documentation.
The situation that you describe, seems to be as follows:
You have several columns. Entire columns consist of 1 datatype each, and all columns have an equal number of rows.
This seems to match exactly with the recommended situation for using a [table][1]
T = table(var1,...,varN) creates a table from the input variables,
var1,...,varN . Variables can be of different sizes and data types,
but all variables must have the same number of rows.
Actually I don't have much experience with tables, but if you can't figure it out you can always switch to using 1 cell array for the first column, and a matrix for all others (in your example).

How to implement data I have to svmtrain() function in MATLAB?

I have to write a script using MATLAB which will classify my data.
My data consists of 1051 web pages (rows) and 11000+ words (columns). I am holding the word occurences in the matrix for each page. The first 230 rows are about computer science course (to be labeled with +1) and remaining 821 are not (to be labeled with -1). I am going to label few part of these rows (say 30 rows) by myself. Then SVM will label the remaining unlabeled rows.
I have found that I could solve my problem using MATLAB's svmtrain() and svmclassify() methods. First I need to create SVMStruct.
SVMStruct = svmtrain(Training,Group)
Then I need to use
Group = svmclassify(SVMStruct,Sample)
But the point that I do not know what Training and Group are. For Group Mathworks says:
Grouping variable, which can be a categorical, numeric, or logical
vector, a cell vector of strings, or a character matrix with each row
representing a class label. Each element of Group specifies the group
of the corresponding row of Training. Group should divide Training
into two groups. Group has the same number of elements as there are
rows in Training. svmtrain treats each NaN, empty string, or
'undefined' in Group as a missing value, and ignores the corresponding
row of Training.
And for Training it is said that:
Matrix of training data, where each row corresponds to an observation
or replicate, and each column corresponds to a feature or variable.
svmtrain treats NaNs or empty strings in Training as missing values
and ignores the corresponding rows of Group.
I want to know how I can adopt my data to Training and Group? I need (at least) a little code sample.
EDIT
What I did not understand is that in order to have SVMStruct I have to run
SVMStruct = svmtrain(Training, Group);
and in order to have Group I have to run
Group = svmclassify(SVMStruct,Sample);
Also I still did not get what Sample should be like?
I am confused.
Training would be a matrix with 1051 rows (the webpages/training instances) and 11000 columns (the features/words). I'm assuming you want to test for the existence of each word on a webpage? In this case you could make the entry of the matrix a 1 if the word exists for a given webpage and a 0 if not.
You could initialize the matrix with Training = zeros(1051,11000); but filling the entries would be up to you, presumably done with some other code you've written.
Group is a 1-D column vector with one entry for every training instance (webpage) than tells you which of two classes the webpage belongs to. In your case you would make the first 230 entries a "+1" for computer science and the remaining 821 entries a "-1" for not.
Group = zeros(1051,1); % gives you a matrix of zeros with 1051 rows and 1 column
Group(1:230) = 1; % set first 230 entries to +1
Group(231:end) = -1; % set the rest to -1

Perl - determining the intersection of several numeric ranges

I would like to be able to load long list of positive integer ranges and create a new "summary" range list that is the union of the intersections of each pairs of ranges. And, I want to do this in Perl. For example:
Sample ranges: (1..30) (45..90) (15..34) (92..100)
Intersection of ranges: (15..30)
The only way I could think of was using a bunch of nested if statements to determine the starting point of sample A, sample B, sample C, etc. and figure out the overlap this way, but it's not possible to do that with hundreds of sample, each containing numerous ranges.
Any suggestions are appreciated!
The first thing you should do when you need to do some thing is take a look at CPAN to see what tools are available of if someone has solved your problem for you already.
Set::IntSpan and Set::IntRange are on the first page of results for "set" on CPAN.
What you want is the union of the intersection of each pair of ranges, so the algorithm is as follows:
Create an empty result set.
Create a set for each range.
For each set in the list,
For each later set in the list,
Find the intersection of those two sets.
Find the union of the result set and this intersection. This is the new result set.
Enumerate the elements of the result set.
I don't have code to share, but I would expand each range into hash, or use a Set module, and then use intersection operations on the sets.

Extract a specific row from a combination matrix

Suppose I have 121 elements and want to get all combinations of 4 elements taken at a time, i.e. 121c4.
Since combnk(1:121, 4) takes a lot of time, I want to go for 2% of that combination by providing:
z = 1:50:length(121c4(:, 1))
For example: 1st row, 5th row, 100th row and so on, up to 121c4, picking only those rows from a 121c4 matrix without generating the complete combination (it's consuming too much for large numbers like 625c4).
If you haven't defined an ordering on the combinations, why not just use
randi(121,p,4)
where p is the number of combinations you want in your set ? With this approach you may, or may not, want to replace duplicates.
If you have defined an ordering on the combinations, tell us what it is.