Imported .sav file (SPSS) with a variable coded 0 and 1 reads in as NA

When I import a .sav file (SPSS) into R, a variable that has been coded 0 and 1 reads in with one level as NA. I've encountered this with several .sav files that I imported into R. You would expect the 0 level to read as NA, but instead the levels are reversed from the way they are described in the accompanying report. For example, for the variable sex, the report describes males (n=99, coded 0) and females (n=110, coded 1); after import, males (n=99) appear under a level named Males, but females (n=110) are NA. In another dataset, the main grouping variables are A and B: group A shows 0, group B shows 94, and there are 96 NAs (all of group A), so the same reversal is occurring, though I can't demonstrate it here.

An explanation would be interesting, but my immediate problem is how to get the NAs into the other level (or a new variable) for further analysis. I've selected 2 columns and sliced 10 rows as a small example; I'm not sure how to build a small reproducible data frame for you here. This is the group variable with one other variable (MH0Q1):
group MH0Q1
2 <NA> <NA>
3 <NA> <NA>
4 <NA> <NA>
5 <NA> 1
6 B <NA>
7 B <NA>
8 B 2
8 B <NA>
9 B <NA>
10 B <NA>
11 B 2
The attributes of MH0Q1:
$ MH0Q1 : Named num 0 1 2
.. ..- attr(*, "names")= chr "0" "1" "2"
Thanks

You haven't said how you are reading the .sav file into R (e.g. with foreign::read.spss() or haven::read_sav()). It may be that the problematic variable is categorical (value-labelled) in the .sav file and is being converted to an R factor differently than you expect, so the labels end up attached to the wrong codes.
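The attribute dump above shows a named numeric whose names are the character strings "0", "1", "2". One plausible way an entire level can silently become NA is a type mismatch between numeric codes and character label keys. A minimal pandas sketch of that failure mode (the labels dict is hypothetical, not the asker's actual metadata):

```python
import pandas as pd

# Hypothetical illustration: codes are stored as numbers, but the value
# labels' keys are character strings ("0", "1"), as in the attribute dump.
codes = pd.Series([0, 1, 1, 0])                   # numeric codes as imported
labels_by_string = {"0": "Male", "1": "Female"}   # keys stored as strings

# Mapping numeric codes against string keys: every value becomes NaN,
# because the integer 0 is not the same key as the string "0".
wrong = codes.map(labels_by_string)
print(wrong.isna().all())          # True: the whole column looks "missing"

# Converting codes and keys to a common type fixes the lookup.
right = codes.astype(str).map(labels_by_string)
print(right.tolist())              # ['Male', 'Female', 'Female', 'Male']
```

The same idea applies in R: check whether the labels attribute and the underlying values have matching types before converting to a factor.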

Related

Correlation matrix from categorical and non-categorical variables (Matlab)

In Matlab, I have a dataset in a table of the form:
SCHOOL SEX AGE ADDRESS STATUS JOB GUARDIAN HEALTH GRADE
UR F 12 U FT TEA MOTHER 1 11
GB M 22 R FT SER FATHER 5 15
GB M 12 R FT OTH FATHER 3 12
GB M 11 R PT POL FATHER 2 10
where some variables are binary, some categorical, and some numerical. Would it be possible to extract a correlation matrix from it, with the correlation coefficients between the variables? I tried using both corrcoef and corrplot from the Econometrics Toolbox, but I run into errors such as 'observed data must be convertible to type double'.
Does anyone have a take on how this can be done? Thank you.
As said above, you first need to transform your categorical and binary variables to numerical values.
So if your data is in a table (T) do something like:
T.SCHOOL = categorical(T.SCHOOL);
A worked example can be found in the MATLAB help, where they use the patients dataset, which seems to be similar to your data.
You could then transform your categorical columns to double:
T.SCHOOL = double(T.SCHOOL);
Be careful with double though, as it transforms categorical variables to arbitrary numbers, see the matlab forum.
Also note that you introduce an order into your categorical variables if you simply transform them to numbers. If, for example, you transform the JOB values 'TEA', 'SER', 'OTH' to 1, 2, 3, you make the variable ordinal: 'TEA' is then treated as less than 'OTH'.
If you want to avoid that you can re-code the categorical columns into 'binary' dummy variables:
dummy_schools = dummyvar(T.SCHOOL);
which returns a matrix of size nrows × numel(unique(T.SCHOOL)), i.e. one dummy column per category.
And then there is the whole discussion of whether it is even meaningful to calculate correlations of categorical variables at all.
I hope this helps :)
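For readers outside MATLAB, the dummy-variable route above translates directly to other tools. A pandas sketch of the same idea (the data values are made up to resemble the question's table; get_dummies plays the role of dummyvar):

```python
import pandas as pd

# A small frame resembling the question's table (values are made up).
df = pd.DataFrame({
    "SCHOOL": ["UR", "GB", "GB", "GB"],
    "SEX":    ["F", "M", "M", "M"],
    "AGE":    [12, 22, 12, 11],
    "GRADE":  [11, 15, 12, 10],
})

# One-hot encode the categorical columns (the dummyvar analogue), keep the
# numeric ones as-is, then compute the full correlation matrix.
encoded = pd.get_dummies(df, columns=["SCHOOL", "SEX"], dtype=float)
corr = encoded.corr()
print(corr.shape)   # one row/column per encoded variable
```

Note that with this toy data SCHOOL_UR and SEX_F are perfectly correlated (both flag only the first row), which shows how dummy columns avoid imposing an artificial ordering while still yielding numeric correlations.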
I think you need to make all the data numeric, i.e. recode the non-numerical columns, for example:
SCHOOL SEX AGE ADDRESS STATUS JOB GUARDIAN HEALTH GRADE
1 1 12 1 1 1 1 1 11
2 2 22 2 1 2 2 5 15
2 2 12 2 1 3 2 3 12
2 2 11 2 2 4 2 2 10
and then do the correlation.

tm to tidytext conversion

I am trying to learn tidytext. I can follow the examples on the tidytext website as long as I use the built-in packages (e.g. janeaustenr). However, most of my data are text files in a corpus. I can reproduce the tm-to-tidytext conversion example for sentiment analysis (ap_sentiments) on the tidytext website. I am having trouble, however, understanding how the tidytext data are structured. For example, the Austen novels are stored by "book" in the janeaustenr package. For my tm data, what is the equivalent of the book vector? Here is the specific example for my data:
cname <- file.path(".", "greencomments", "all")
I can then use tidytext successfully after running the tm preprocessing:
practice <- tidy(tdm)
practice
partysentiments <- practice %>%
inner_join(get_sentiments("bing"), by = c(term = "word"))
partysentiments
# A tibble: 170 x 4
term document count sentiment
<chr> <chr> <dbl> <chr>
1 benefit 1 1.00 positive
2 best 1 2.00 positive
3 better 1 7.00 positive
4 cheaper 1 1.00 positive
5 clean 1 24.0 positive
7 clear 1 1.00 positive
8 concern 1 2.00 negative
9 cure 1 1.00 positive
10 destroy 1 3.00 negative
But, I can't reproduce the simple ggplots of word frequencies in tidytext. Since my data/corpus are not arranged with a column for "book" in the dataframe, the code (and therefore much of the tidytext functionality) won't work.
Here is an example of the issue. This works fine:
practice %>%
count(term, sort = TRUE)
# A tibble: 989 x 2
term n
<chr> <int>
1 activ 3
2 air 3
3 altern 3
But how do I arrange the tm corpus to match the structure of the books in the janeaustenr package? Is "document" the equivalent of "book"? I have text files in folders for the corpus. I have tried substituting "document" for "book" in the code, and it doesn't work. Maybe I need to rename something? Apologies in advance - I am not a programmer.
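In the tidied term-document matrix shown above, document is the grouping column that plays the role book plays in the janeaustenr examples: the per-book plots become per-document plots once you group by it. A language-agnostic sketch of that group-then-count pattern in pandas (the frame is hypothetical, shaped like the `practice` tibble above):

```python
import pandas as pd

# Hypothetical tidy term-document frame like the `practice` tibble above;
# `document` plays the role that `book` plays in the janeaustenr examples.
practice = pd.DataFrame({
    "term":     ["clean", "clean", "concern", "best", "clean"],
    "document": ["1", "1", "1", "2", "2"],
    "count":    [24, 3, 2, 2, 5],
})

# Word frequencies per document: group by the document column, the way the
# tidytext examples group by book, then sum the counts within each group.
freq = (practice.groupby(["document", "term"], as_index=False)["count"]
                .sum()
                .sort_values("count", ascending=False))
print(freq.head())
```

The R equivalent would be grouping `practice` by `document` before counting, exactly where the book examples group by `book`.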

Why is my variable unaffected by certain built-in functions if it stores data read from a file?

I have a file called 01.in, in the same folder I'm running q. It contains one line, which has a string of digits in it. For instance, let's assume it contains the following string: 1122.
I read the data from this file, transformed it into a list of integer digits, and stored it in a variable a using the following line:
a:("i"$read0 `:01.in)-"i"$"0"
Now if I try to use some dyadic built-in functions such as xprev or rotate, the q interpreter outputs either nothing, or the original list. For example:
q)a
1 1 2 2
q)-1 xprev a
q)0 xprev a
1 1 2 2
q)1 xprev a
q)-1 rotate a
1 1 2 2
q)0 rotate a
1 1 2 2
q)1 rotate a
1 1 2 2
Those same functions work if I use them on the list 1 1 2 2 directly. I'm trying to understand why what I'm doing isn't working as I expected. Just a heads up: I'm very new to q, so I apologize if this is something obvious that I'm missing.
With the way you are reading the file, you are creating a nested list:
q)type a
0h
q)0N!a;
,1 1 2 2i
Here I use 0N! to show the raw structure: the leading , indicates that a is a one-item (enlisted) list whose single element is the vector 1 1 2 2i, not the vector itself, so xprev and rotate operate on the outer one-item list rather than on the digits. Instead, try reading it in something like this:
q)a:"I"$'first read0`:01.in
q)a
1 1 2 2i
q)-1 xprev a
1 2 2 0Ni
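The same pitfall can be sketched outside q with NumPy: rotating a one-item container along its outer axis is a no-op, exactly like rotating the enlisted list above.

```python
import numpy as np

# Analogue of the q pitfall: `nested` is not the vector itself but a
# container holding one vector, so a rotate acts on the outer (length-1)
# axis and appears to do nothing.
nested = np.array([[1, 1, 2, 2]])   # shape (1, 4): one item, like ,1 1 2 2i
flat = nested[0]                    # shape (4,): the vector itself

# Rotating along the outer axis of a one-item container changes nothing.
print(np.roll(nested, 1, axis=0).tolist())  # [[1, 1, 2, 2]]

# Rotating the vector itself behaves as expected.
print(np.roll(flat, 1).tolist())            # [2, 1, 1, 2]
```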

Create a new variable in Tableau

I am new to Tableau and trying to get myself oriented to this system. I am an R user and typically work with wide data formats, so getting things wrangled into the proper long format has been tricky. Here is my current problem.
Assume I have a data file that is structured as such
ID Disorder Value
1 A 0
1 B 1
1 C 0
2 A 1
2 B 1
2 C 1
3 A 0
3 B 0
3 C 0
What I would like to do is combine the variables, such that the presence of a set of disorders is used for summary variables. For example, how could I go about achieving something like this as my output? The sum is the number of people with the disorder, and the percentage is the number of people with the disorder divided by the total number of people.
Disorders Sum Percentage
A 1 33.3
B 2 66.6
C 1 33.3
AB 2 66.6
BC 2 66.6
AC 1 33.3
ABC 2 66.6
The approach really depends on how flexible this has to be. Ultimately, a wide data source with your Disorder values as columns would make this easier. You will still need to blend the results onto a data scaffold that contains the combinations of codes you want in order to make this work in Tableau. If this needs to scale, you'll want to do the transformation work in custom SQL or another ETL tool such as Alteryx. I posted a solution to this question for you over on the Tableau forum, where I can upload files: http://community.tableausoftware.com/message/316168
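The custom-SQL route mentioned above can be sketched directly against the question's sample rows. Reading the expected output as "number of people with at least one disorder in the set" (which reproduces the Sum column exactly; the percentages round to 66.7 rather than the truncated 66.6), a SQLite sketch with assumed table and column names:

```python
import sqlite3

# Sketch of the custom-SQL transformation, using SQLite and the question's
# sample data; the table and column names are assumptions.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE disorders (ID INTEGER, Disorder TEXT, Value INTEGER)")
rows = [(1, "A", 0), (1, "B", 1), (1, "C", 0),
        (2, "A", 1), (2, "B", 1), (2, "C", 1),
        (3, "A", 0), (3, "B", 0), (3, "C", 0)]
con.executemany("INSERT INTO disorders VALUES (?, ?, ?)", rows)

def summary(disorder_set):
    """People with at least one disorder in the set, plus the percentage."""
    placeholders = ",".join("?" * len(disorder_set))
    n, = con.execute(
        f"SELECT COUNT(DISTINCT ID) FROM disorders "
        f"WHERE Value = 1 AND Disorder IN ({placeholders})",
        disorder_set).fetchone()
    total, = con.execute("SELECT COUNT(DISTINCT ID) FROM disorders").fetchone()
    return n, round(100.0 * n / total, 1)

for s in ["A", "B", "C", "AB", "BC", "AC", "ABC"]:
    print(s, *summary(list(s)))
```

Generating one row per disorder combination like this is exactly the scaffold the Tableau blend needs.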

Storing and accessing data for future analysis of intra and inter-operator variability

I'm about to start a short project which will involve a reasonable amount of data that I would like to store in a sensible manner - preferably a PostgreSQL database.
I will give a quick outline of the task. I will be processing and analysing data for a series of images, each of which will have a unique ID. For each image, I and other operators will complete some simple image-processing tasks, including adjusting angles and placing regions, with the end result being numerous quantitative parameters, e.g. mean, variance, etc. We expect there will be intra- and inter-operator variability in these measures, which is what I would like to analyse.
My initial plan was to store the data in the following way
ID Operator Attempt Date Result1 Result2 Reconstruction Method Iterations
1 AB 1 01/01/13 x x FBP
1 AB 2 01/01/13 x x FBP
1 CD 1 01/01/13 x x FBP
1 CD 2 01/01/13 x x FBP
2 AB 1 01/01/13 x x FBP
2 AB 2 01/01/13 x x FBP
2 CD 1 01/01/13 x x FBP
2 CD 2 01/01/13 x x FBP
1 AB 1 11/01/13 x x FBP
1 AB 2 01/01/13 x x MLEM
Now what I would like to compare (using correlation and Bland-Altman plots) are the differences in results for the same operator processing the same image (the images must have the same ID, Date, and Reconstruction Method) for all operators, i.e. for each identical image and operator, how do attempts 1 and 2 differ. I want to do the same analysis for inter-operator variability, i.e. how does AB compare to CD for ID 1 across all images reconstructed with FBP, or EF to AB for all images reconstructed with MLEM. Images with the same unique ID but acquired on different dates or with different reconstruction techniques should not be compared, as they will contain differences beyond operator variability.
I have various R scripts to do the analysis, but what I am uncertain of is how to access my data and arrange it in a sensible format to carry out the analysis, and whether my planned storage method is optimal for doing so. In the past I have used Perl to access the database and pull out the numbers, but I have recently discovered RPostgreSQL, which may be more suitable.
I guess my question is, for such a database how can I pick out:
(a) all unique images (ID, acquired on same date with same reconstruction method) and compare the difference in all Result1 for operator AB (CD etc) for attempt 1 and 2
(b) the same thing comparing all Result1 attempt 1s between AB and CD, CD and EF etc
Here is an example of the output I would like for (a)
ID Operator Date Result1 (Attempt 1) Result1(Attempt 2)
1 AB 01/01/13 10 12
2 AB 01/01/13 22 21
3 AB 03/01/13 15 17
4 AB 04/01/13 27 25
5 AB 06/01/13 14 12
1 AB 11/01/13 3 6
I would then analyse the last 2 columns
An example output for (b) comparing AB and CD
ID Date Result1 (Op: AB, Att: 1) Result1(Op: CD: Att 1)
1 01/01/13 10 12
2 01/01/13 22 21
3 05/01/13 12 14
1 11/01/13 19 24
These are just a rough idea!
(a) all unique images (ID, acquired on same date with same
reconstruction method) and compare the difference in all Result1 for
operator AB (CD etc) for attempt 1 and 2
For (a) you can use SQL statements that make use of the DISTINCT keyword and an ORDER BY clause (note that SQL uses ORDER BY, not SORT BY).
For example
SELECT DISTINCT "ID", "Date", "Reconstruction Method" FROM YourTable ORDER BY "Date", "Reconstruction Method";
(b) the same thing comparing all Result1 attempt 1s between AB and CD,
CD and EF etc
For (b) you can use SQL statements that make use of a WHERE clause.
For example
SELECT * FROM YourTable WHERE Operator = 'AB';
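Beyond DISTINCT and WHERE, the pairing the question actually asks for (attempt 1 against attempt 2 for the same image, operator, date and method) is a self-join. A sketch using SQLite rather than PostgreSQL, with table and column names assumed from the question:

```python
import sqlite3

# Self-join sketch for output (a): pair Attempt 1 with Attempt 2 for the
# same image, operator, date, and reconstruction method. Table and column
# names are assumptions based on the question; syntax is SQLite.
con = sqlite3.connect(":memory:")
con.execute("""CREATE TABLE results
               (ID INTEGER, Operator TEXT, Attempt INTEGER,
                Date TEXT, Result1 REAL, Method TEXT)""")
con.executemany("INSERT INTO results VALUES (?, ?, ?, ?, ?, ?)", [
    (1, "AB", 1, "01/01/13", 10, "FBP"),
    (1, "AB", 2, "01/01/13", 12, "FBP"),
    (2, "AB", 1, "01/01/13", 22, "FBP"),
    (2, "AB", 2, "01/01/13", 21, "FBP"),
    (1, "AB", 1, "01/01/13", 3,  "MLEM"),   # same ID, different method
])

pairs = con.execute("""
    SELECT a1.ID, a1.Operator, a1.Date,
           a1.Result1 AS attempt1, a2.Result1 AS attempt2
    FROM results a1
    JOIN results a2
      ON  a1.ID = a2.ID AND a1.Operator = a2.Operator
      AND a1.Date = a2.Date AND a1.Method = a2.Method
      AND a1.Attempt = 1 AND a2.Attempt = 2
    WHERE a1.Operator = 'AB'
    ORDER BY a1.ID""").fetchall()
print(pairs)
```

The join keys enforce the "same ID, date, and reconstruction method" rule, so the unmatched MLEM row drops out; swapping the Attempt conditions for two Operator conditions gives the inter-operator comparison in (b).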