tm to tidytext conversion

I am trying to learn tidytext. I can follow the examples on the tidytext website as long as I use the packaged data (e.g., janeaustenr). However, most of my data are text files in a tm corpus. I can reproduce the tm-to-tidytext conversion example for sentiment analysis (ap_sentiments) from the tidytext website, but I am having trouble understanding how the tidytext data are structured. For example, the Austen novels are stored by "book" in the janeaustenr package. For my tm data, what is the equivalent of the "book" vector? Here is the specific setup for my data:
cname <- file.path(".", "greencomments", "all")
I can then use tidytext successfully after running the tm preprocessing:
practice <- tidy(tdm)
practice
partysentiments <- practice %>%
inner_join(get_sentiments("bing"), by = c(term = "word"))
partysentiments
# A tibble: 170 x 4
term document count sentiment
<chr> <chr> <dbl> <chr>
1 benefit 1 1.00 positive
2 best 1 2.00 positive
3 better 1 7.00 positive
4 cheaper 1 1.00 positive
5 clean 1 24.0 positive
7 clear 1 1.00 positive
8 concern 1 2.00 negative
9 cure 1 1.00 positive
10 destroy 1 3.00 negative
But I can't reproduce the simple ggplot word-frequency plots from tidytext. Since my data/corpus have no "book" column in the data frame, that code (and therefore much of the tidytext functionality) won't work.
Here is an example of the issue. This works fine:
practice %>%
count(term, sort = TRUE)
# A tibble: 989 x 2
term n
<chr> <int>
1 activ 3
2 air 3
3 altern 3
But how do I arrange the tm corpus to match the structure of the books in the janeaustenr package? Is "document" the equivalent of "book"? I have text files in folders for the corpus. I have tried substituting "document" for "book" in the code, and it doesn't work. Maybe I need to rename something? Apologies in advance: I am not a programmer.
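For what it's worth, in the tidied term-document output shown above, the `document` column is the corpus-side analogue of `book`: each text file in the corpus becomes one document, so grouping by `document` gives per-file summaries the way the Austen examples group by `book`. A minimal sketch of that structure (shown in Python/pandas purely for illustration; values are made up to echo the tibble above):

```python
import pandas as pd

# Tidy term-document data, one row per (term, document) pair,
# mirroring the structure that tidy(tdm) produces in R
practice = pd.DataFrame({
    "term":     ["benefit", "best", "clean", "concern"],
    "document": ["1", "1", "2", "2"],
    "count":    [1, 2, 24, 2],
})

# "document" plays the role that "book" plays in the janeaustenr data:
# group by it to get per-document totals
per_document = practice.groupby("document")["count"].sum()
```

The same idea in tidytext would be `practice %>% count(document, term, sort = TRUE)`, i.e. adding `document` as a grouping variable wherever the examples use `book`.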

Correlation matrix from categorical and non-categorical variables (Matlab)

In Matlab, I have a dataset in a table of the form:
SCHOOL SEX AGE ADDRESS STATUS JOB GUARDIAN HEALTH GRADE
UR F 12 U FT TEA MOTHER 1 11
GB M 22 R FT SER FATHER 5 15
GB M 12 R FT OTH FATHER 3 12
GB M 11 R PT POL FATHER 2 10
Where some variables are binary, some categorical, and some numerical. Would it be possible to extract a correlation matrix from it, with the correlation coefficients between the variables? I tried both corrcoef and corrplot from the Econometrics Toolbox, but I run into errors such as 'observed data must be convertible to type double'.
Does anyone have a take on how this can be done? Thank you.
As said above, you first need to transform your categorical and binary variables to numerical values.
So if your data is in a table (T) do something like:
T.SCHOOL = categorical(T.SCHOOL);
A worked example can be found in the Matlab help here, where they use the patients dataset, which seems to be similar to your data.
You could then transform your categorical columns to double:
T.SCHOOL = double(T.SCHOOL);
Be careful with double, though, as it maps the categories to arbitrary numbers; see the MATLAB forum.
Also note that you introduce an order into your categorical variables if you simply transform them to numbers. If, for example, you transform JOB 'TEA', 'SER', 'OTH' to 1, 2, 3, you make the variable ordinal: 'TEA' is then < 'OTH'.
If you want to avoid that you can re-code the categorical columns into 'binary' dummy variables:
dummy_schools = dummyvar(T.SCHOOL);
Which returns a matrix of size nrows × (number of unique values in T.SCHOOL).
And then there is the whole discussion, whether it is useful to calculate correlations of categorical variables. Like here.
I hope this helps :)
I think you need to make all the data numeric, i.e. change/code the non-numerical columns, for example:
SCHOOL SEX AGE ADDRESS STATUS JOB GUARDIAN HEALTH GRADE
1 1 12 1 1 1 1 1 11
2 2 22 2 1 2 2 5 15
2 2 12 2 1 3 2 3 12
2 2 11 2 2 4 2 2 10
and then do the correlation.
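The same encode-then-correlate idea can be sketched outside MATLAB as well. Here is a hedged Python/pandas version (toy values invented to mirror the table above), showing both the integer-coding route (the double(categorical(...)) step) and the dummy-variable route (dummyvar):

```python
import pandas as pd

# Toy data mirroring the table in the question (values invented)
df = pd.DataFrame({
    "SCHOOL": ["UR", "GB", "GB", "GB"],
    "SEX":    ["F", "M", "M", "M"],
    "AGE":    [12, 22, 12, 11],
    "GRADE":  [11, 15, 12, 10],
})

# Recode categorical columns to integer codes; as noted above, this
# imposes an arbitrary order on the categories
coded = df.copy()
for col in ["SCHOOL", "SEX"]:
    coded[col] = coded[col].astype("category").cat.codes

corr = coded.corr()  # correlation matrix over the now-numeric columns

# Alternative that avoids the artificial ordering: one-hot dummies
dummies = pd.get_dummies(df, columns=["SCHOOL", "SEX"])
```

The caveat from the answer above still applies: whether a Pearson correlation of coded categoricals is meaningful is a separate question from whether it is computable.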

Calculating group means with own group excluded in MATLAB

To put it generically, the issue is: I need to create group means that exclude the own-group observations before calculating the mean.
As an example: let's say I have firms, products and product characteristics. Each firm (f=1,...,F) produces several products (i=1,...,I). I would like to create a group mean for a certain characteristic of the product i of firm f, using all products of all firms, excluding firm f product observations.
So I could have a dataset like this:
firm prod width
1 1 30
1 2 10
1 3 20
2 1 25
2 2 15
2 4 40
3 2 10
3 4 35
To reproduce the table:
firm = [1,1,1,2,2,2,3,3];
prod = [1,2,3,1,2,4,2,4];
width = [30,10,20,25,15,40,10,35];
x = [firm' prod' width'];
Then I want to estimate a mean which will use values of all products of all other firms, that is excluding all firm 1 products. In this case, my grouping is at the firm level. (This mean is to be used as an instrumental variable for the width of all products in firm 1.)
So, the mean that I should find is: (25+15+40+10+35)/5=25
Then repeat the process for other firms.
firm prod width mean_desired
1 1 30 25
1 2 10 25
1 3 20 25
2 1 25 21
2 2 15 21
2 4 40 21
3 2 10 23.33
3 4 35 23.33
I guess my biggest difficulty is to exclude the own firm values.
This question is related to this page: Calculating group mean/medians in MATLAB where group ID is in a separate column. But there, the own group is not excluded.
p.s.: just out of curiosity if anyone works in economics, I am actually trying to construct Hausman or BLP instruments.
Here's a way that avoids loops, but may be memory-expensive. Let x denote your three-column data matrix.
m = bsxfun(@ne, x(:,1).', unique(x(:,1))); % or m = ~sparse(x(:,1), 1:size(x,1), true);
result = m*x(:,3);
result = result./sum(m,2);
This builds a zero-one matrix m by comparing each entry of the firm column of x with the unique values of that column (first line), so that row g of m marks the observations belonging to firms other than firm g. Multiplying m by the width column of x (second line) then gives, for each firm, the sum of the other firms' widths; dividing by the respective count of other-firm observations, sum(m,2) (third line), gives the desired means.
If you need the results repeated as per the original firm column, use result(x(:,1))
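The same loop-free idea translates directly to NumPy; here is a sketch on the question's data, using a boolean "other-firm" mask in place of bsxfun:

```python
import numpy as np

firm  = np.array([1, 1, 1, 2, 2, 2, 3, 3])
width = np.array([30, 10, 20, 25, 15, 40, 10, 35], dtype=float)

firms = np.unique(firm)
# m[g, j] is True when observation j belongs to a firm OTHER than firms[g]
m = firm[None, :] != firms[:, None]

# Row g of m dotted with width sums the widths of all other firms' products;
# dividing by the row sums of m (the "other" counts) gives the mean
leave_out_mean = (m @ width) / m.sum(axis=1)

# Expand back to one value per original observation
per_row = leave_out_mean[np.searchsorted(firms, firm)]
```

For firm 1 this reproduces (25+15+40+10+35)/5 = 25, as worked out in the question.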

Create a new variable in Tableau

I am new to Tableau and trying to get myself oriented to this system. I am an R user and typically work with wide data formats, so getting things wrangled into the proper long format has been tricky. Here is my current problem.
Assume I have a data file that is structured as such
ID Disorder Value
1 A 0
1 B 1
1 C 0
2 A 1
2 B 1
2 C 1
3 A 0
3 B 0
3 C 0
What I would like to do is to combine the variables, such that the presence of a set of disorders are used for summary variables. For example, how could I go about achieving something like this as my output? The sum is the number of people with the disorder, and the percentage is the number of people with the disorder divided by the total number of people.
Disorders Sum Percentage
A 1 33.3
B 2 66.6
C 1 33.3
AB 2 66.6
BC 2 66.6
AC 1 33.3
ABC 2 66.6
The approach really depends on how flexible it has to be. Ultimately, a wide data source with your Disorder values as columns would make this easier. You will still need to blend the results against a data scaffold that has the combinations of codes you want for this to work in Tableau. If this needs to scale, you'll want to do the transformation in custom SQL or another ETL tool such as Alteryx. I posted a solution to this question for you over on the Tableau forum, where I can upload files: http://community.tableausoftware.com/message/316168
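Outside Tableau, the target table can also be derived directly from the long data. Reading the example numbers, a person appears to count toward a set of disorders if they have any member of the set (e.g. "AB" counts IDs with A or B, giving 2). A hedged Python/pandas sketch under that assumption:

```python
from itertools import combinations

import pandas as pd

# Long-format data from the question
df = pd.DataFrame({
    "ID":       [1, 1, 1, 2, 2, 2, 3, 3, 3],
    "Disorder": list("ABC") * 3,
    "Value":    [0, 1, 0, 1, 1, 1, 0, 0, 0],
})

wide = df.pivot(index="ID", columns="Disorder", values="Value")
n_people = len(wide)

rows = []
for size in (1, 2, 3):
    for combo in combinations("ABC", size):
        # a person counts if they have ANY disorder in the set
        count = int(wide[list(combo)].any(axis=1).sum())
        rows.append({"Disorders": "".join(combo),
                     "Sum": count,
                     "Percentage": round(100 * count / n_people, 1)})

summary = pd.DataFrame(rows)
```

The pivot step is exactly the long-to-wide reshape the answer above recommends doing upstream of Tableau.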

SPSS table merge with extra variable present in both datasets: which value is kept?

I have what I hope is a simple SPSS question. If I conduct a table merge using the syntax below, but both "bigfile" and "smallfile" have values for some variable [say, ChildID], will 'mergefile' have the values for ChildID from smallfile or from bigfile?
Match files files=bigfile
/table=smallfile
/by JoinID.
dataset name mergefile.
execute.
Thank you very much.
-Dan
From the fine manual:
The order in which files are specified determines the order of
variables in the new active dataset. In addition, if the same variable
name occurs in more than one input file, the variable is taken from
the file specified first.
So this indicates that ChildID should take the values that were in bigfile for your particular example. Let's demonstrate this to make sure.
data list free /JoinID ChildID X.
begin data
1 1 4
1 1 5
1 1 6
1 1 7
2 2 8
2 2 9
3 3 2
3 3 1
end data.
dataset name bigfile.
data list free /JoinID ChildID Y.
begin data
1 5 4
2 5 8
3 5 2
end data.
dataset name smallfile.
match files file = 'bigfile'
/table = 'smallfile'
/by JoinID.
dataset name mergefile.
list ALL.
Which produces the output.
JoinID ChildID X Y
1.00 1.00 4.00 4.00
1.00 1.00 5.00 4.00
1.00 1.00 6.00 4.00
1.00 1.00 7.00 4.00
2.00 2.00 8.00 8.00
2.00 2.00 9.00 8.00
3.00 3.00 2.00 2.00
3.00 3.00 1.00 2.00
You may also be interested in the rename subcommand (as well as drop and keep) for MATCH FILES, to prevent overwriting or to specify which file the final variables should come from. My workflow typically drops the duplicate variables from one of the files, since the merge will fail if the same string variable has different lengths in the two files.
An example using the rename and drop subcommands is below (with the same example data as above). This lets you keep the values from the later file, if that is what you prefer.
match files file = 'bigfile'
/rename = (ChildId = Old)
/table = 'smallfile'
/by JoinID
/drop Old.
dataset name mergefile2.
list ALL.
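For readers more at home in R or pandas, the table-lookup merge above is a left join, and the "variable is taken from the file specified first" behavior can be imitated by dropping the right-hand copy of the clashing column. A sketch on the same demonstration data (pandas, unlike SPSS, keeps both copies by default and disambiguates them with suffixes):

```python
import pandas as pd

# The same data as the SPSS demonstration above
big = pd.DataFrame({
    "JoinID":  [1, 1, 1, 1, 2, 2, 3, 3],
    "ChildID": [1, 1, 1, 1, 2, 2, 3, 3],
    "X":       [4, 5, 6, 7, 8, 9, 2, 1],
})
small = pd.DataFrame({
    "JoinID":  [1, 2, 3],
    "ChildID": [5, 5, 5],
    "Y":       [4, 8, 2],
})

# MATCH FILES with /table is a left join; keep ChildID from the first file
merged = (big.merge(small, on="JoinID", how="left", suffixes=("", "_small"))
             .drop(columns="ChildID_small"))
```

The result matches the listed SPSS output: ChildID keeps bigfile's values while Y is looked up from smallfile.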

Storing and accessing data for future analysis of intra and inter-operator variability

I'm about to start a short project that will involve a reasonable amount of data, which I would like to store in a sensible manner, preferably a PostgreSQL database.
I will give a quick outline of the task. I will be processing and analysing data for a series of images, each of which has a unique ID. For each image, other operators and I will complete some simple image-processing tasks, including adjusting angles and placing regions, with the end result being numerous quantitative parameters (e.g. mean, variance, etc.). We expect there will be intra- and inter-operator variability in these measures, which is what I would like to analyse.
My initial plan was to store the data in the following way
ID Operator Attempt Date Result1 Result2 Reconstruction Method Iterations
1 AB 1 01/01/13 x x FBP
1 AB 2 01/01/13 x x FBP
1 CD 1 01/01/13 x x FBP
1 CD 2 01/01/13 x x FBP
2 AB 1 01/01/13 x x FBP
2 AB 2 01/01/13 x x FBP
2 CD 1 01/01/13 x x FBP
2 CD 2 01/01/13 x x FBP
1 AB 1 11/01/13 x x FBP
1 AB 2 01/01/13 x x MLEM
Now what I would like to compare (using correlation and Bland-Altman plots) are the differences in results for the same operator processing the same image (the images must have the same ID, date, and reconstruction technique), for all operators; i.e., for each identical image and operator, how do attempts 1 and 2 differ? I want to do the same analysis for inter-operator variability, i.e. how does AB compare to CD for all images reconstructed with FBP, or EF to AB for all images reconstructed with MLEM. Images with the same unique ID but acquired on different dates or with different reconstruction techniques should not be compared, as they will contain differences beyond operator variability.
I have various R scripts to do the analysis, but what I am uncertain of is how to access my data and arrange it in a sensible format for the analysis, and whether my planned storage method is optimal for doing so. In the past I have used Perl to access the database and pull out the numbers, but I have recently discovered RPostgreSQL, which may be more suitable.
I guess my question is, for such a database how can I pick out:
(a) all unique images (ID, acquired on same date with same reconstruction method) and compare the difference in all Result1 for operator AB (CD etc) for attempt 1 and 2
(b) the same thing comparing all Result1 attempt 1s between AB and CD, CD and EF etc
Here is an example of the output I would like for (a)
ID Operator Date Result1 (Attempt 1) Result1(Attempt 2)
1 AB 01/01/13 10 12
2 AB 01/01/13 22 21
3 AB 03/01/13 15 17
4 AB 04/01/13 27 25
5 AB 06/01/13 14 12
1 AB 11/01/13 3 6
I would then analyse the last 2 columns
An example output for (b) comparing AB and CD
ID Date Result1 (Op: AB, Att: 1) Result1(Op: CD: Att 1)
1 01/01/13 10 12
2 01/01/13 22 21
3 05/01/13 12 14
1 11/01/13 19 24
These are just a rough idea!
(a) all unique images (ID, acquired on same date with same
reconstruction method) and compare the difference in all Result1 for
operator AB (CD etc) for attempt 1 and 2
For (a) you can use SQL statements that make use of DISTINCT and ORDER BY.
For example
SELECT DISTINCT ID, Date, "Reconstruction Method" FROM YourTable ORDER BY Date, "Reconstruction Method";
(b) the same thing comparing all Result1 attempt 1s between AB and CD,
CD and EF etc
For (b) you can use SQL statements that make use of the WHERE clause.
For example
SELECT * FROM YourTable WHERE Operator = 'AB';
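To get the paired layout requested in (a), a plain DISTINCT isn't enough: you need a self-join that matches each attempt-1 row to the attempt-2 row for the same (ID, Operator, Date, Method). A runnable sketch using SQLite from Python (table and column names invented for illustration; the same SQL runs against PostgreSQL, e.g. via RPostgreSQL's dbGetQuery):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("""CREATE TABLE results
               (ID INTEGER, Operator TEXT, Attempt INTEGER,
                Date TEXT, Method TEXT, Result1 REAL)""")
con.executemany(
    "INSERT INTO results VALUES (?, ?, ?, ?, ?, ?)",
    [(1, "AB", 1, "01/01/13", "FBP", 10),
     (1, "AB", 2, "01/01/13", "FBP", 12),
     (2, "AB", 1, "01/01/13", "FBP", 22),
     (2, "AB", 2, "01/01/13", "FBP", 21)])

# Pair each operator's attempt 1 with their attempt 2 on the same image
pairs = con.execute("""
    SELECT a.ID, a.Operator, a.Date,
           a.Result1 AS attempt1, b.Result1 AS attempt2
    FROM results a
    JOIN results b
      ON a.ID = b.ID AND a.Operator = b.Operator
     AND a.Date = b.Date AND a.Method = b.Method
    WHERE a.Attempt = 1 AND b.Attempt = 2
    ORDER BY a.ID""").fetchall()
```

For (b), the same join pattern with `a.Operator = 'AB' AND b.Operator = 'CD'` and `a.Attempt = b.Attempt` gives the inter-operator table, with one column per operator.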