SPSS table merge with extra variable present in both datasets: which value is kept?

I have what I hope is a simple SPSS question. If I conduct a table merge using the syntax below, but both "bigfile" and "smallfile" have values for some variable [say, ChildID], will 'mergefile' have the values for ChildID from smallfile or from bigfile?
match files file=bigfile
/table=smallfile
/by JoinID.
dataset name mergefile.
execute.
Thank you very much.
-Dan

From the fine manual:
The order in which files are specified determines the order of
variables in the new active dataset. In addition, if the same variable
name occurs in more than one input file, the variable is taken from
the file specified first.
So in your example, ChildID in mergefile should hold the values that were in bigfile. Let's demonstrate this to make sure.
data list free /JoinID ChildID X.
begin data
1 1 4
1 1 5
1 1 6
1 1 7
2 2 8
2 2 9
3 3 2
3 3 1
end data.
dataset name bigfile.
data list free /JoinID ChildID Y.
begin data
1 5 4
2 5 8
3 5 2
end data.
dataset name smallfile.
match files file = 'bigfile'
/table = 'smallfile'
/by JoinID.
dataset name mergefile.
list ALL.
This produces the following output:
JoinID ChildID X Y
1.00 1.00 4.00 4.00
1.00 1.00 5.00 4.00
1.00 1.00 6.00 4.00
1.00 1.00 7.00 4.00
2.00 2.00 8.00 8.00
2.00 2.00 9.00 8.00
3.00 3.00 2.00 2.00
3.00 3.00 1.00 2.00
You may also be interested in the rename sub-command (as well as drop and keep) for match files, either to prevent overwriting or to specify which file the final variables should come from. My workflow typically drops the duplicate variables from one of the files, because the merge fails if a same-named string variable has different lengths in the two files.
An example of using the rename and drop sub-commands is below (using the same example data from above). This will allow you to keep the values from the subsequent files if that is what you prefer.
match files file = 'bigfile'
/rename = (ChildID = Old)
/table = 'smallfile'
/by JoinID
/drop Old.
dataset name mergefile2.
list ALL.
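
For readers who think more easily outside SPSS, the rule quoted from the manual can be mimicked in a few lines of Python. This is only an illustrative sketch of the merge semantics, not SPSS itself: the function name `table_merge` and its `prefer_table` flag are made up for this example, and the data are the same small files used above.

```python
# Sketch of a "table" (lookup) merge: every case in the big file is kept,
# and variables from the lookup table are attached by key.
bigfile = [
    {"JoinID": 1, "ChildID": 1, "X": 4},
    {"JoinID": 1, "ChildID": 1, "X": 5},
    {"JoinID": 1, "ChildID": 1, "X": 6},
    {"JoinID": 1, "ChildID": 1, "X": 7},
    {"JoinID": 2, "ChildID": 2, "X": 8},
    {"JoinID": 2, "ChildID": 2, "X": 9},
    {"JoinID": 3, "ChildID": 3, "X": 2},
    {"JoinID": 3, "ChildID": 3, "X": 1},
]
smallfile = [
    {"JoinID": 1, "ChildID": 5, "Y": 4},
    {"JoinID": 2, "ChildID": 5, "Y": 8},
    {"JoinID": 3, "ChildID": 5, "Y": 2},
]


def table_merge(cases, table, key, prefer_table=False):
    """Attach table variables to each case (hypothetical helper).

    With prefer_table=False a variable present in both inputs keeps the
    value from `cases`, i.e. the file listed first. With prefer_table=True
    the table's value wins, similar in effect to the rename/drop approach.
    Unmatched keys simply get no extra variables in this sketch.
    """
    lookup = {row[key]: row for row in table}
    merged = []
    for case in cases:
        row = dict(case)
        for name, value in lookup.get(case[key], {}).items():
            if prefer_table or name not in row:
                row[name] = value
        merged.append(row)
    return merged


mergefile = table_merge(bigfile, smallfile, "JoinID")
mergefile2 = table_merge(bigfile, smallfile, "JoinID", prefer_table=True)
print(mergefile[0])   # ChildID comes from bigfile
print(mergefile2[0])  # ChildID comes from smallfile
```

The first call corresponds to the plain match files example, the second to the variant that renames and drops the bigfile copy of ChildID.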

Related

Running total using two columns

Given a table with data like:
A     B
Qty.  Running Total
5     5
5     10
5     15
I can create the running total using the formula =SUM($A$2:A2) and then drag down to get the running total after each quantity (here Qty.)
How can I calculate a running total over two columns, which may or may not be adjacent, as shown below?
A       B      C       D
Qty. 1  Other  Qty. 2  RT
2       blah   2       4
2       phew   2       8
3       xyz    2       13
Enter the formula =SUM(A2,C2,D1) in cell D2. It does not matter that the formula refers to the non-numeric header cell D1: SUM() ignores text in referenced cells, whereas ordinary addition (=A2+C2+D1) would return an error. Then fill the formula down.
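
The same logic, each row's total being the two quantities plus the previous row's total, can be checked outside the spreadsheet. The following is a small Python sketch using the sample values from the question; the list names `qty1` and `qty2` are just labels chosen for this example.

```python
from itertools import accumulate

# Columns A ("Qty. 1") and C ("Qty. 2") from the sample table.
qty1 = [2, 2, 3]
qty2 = [2, 2, 2]

# Add the two quantities row by row, then accumulate the row sums.
running_total = list(accumulate(a + c for a, c in zip(qty1, qty2)))
print(running_total)  # [4, 8, 13]
```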

tm to tidytext conversion

I am trying to learn tidytext. I can follow the examples on the tidytext website as long as I use the packaged datasets (janeaustenr, e.g.). However, most of my data are text files in a corpus. I can reproduce the tm to tidytext conversion example for sentiment analysis (ap_sentiments) on the tidytext website. I am having trouble, however, understanding how the tidytext data are structured. For example, the Austen novels are stored by "book" in the janeaustenr package. For my tm data, however, what is the equivalent for calling the vector for book? Here is the specific example for my data:
cname <- file.path(".", "greencomments" , "all")
I can then use tidytext successfully after running the tm preprocessing:
practice <- tidy(tdm)
practice
partysentiments <- practice %>%
inner_join(get_sentiments("bing"), by = c(term = "word"))
partysentiments
# A tibble: 170 x 4
term document count sentiment
<chr> <chr> <dbl> <chr>
1 benefit 1 1.00 positive
2 best 1 2.00 positive
3 better 1 7.00 positive
4 cheaper 1 1.00 positive
5 clean 1 24.0 positive
7 clear 1 1.00 positive
8 concern 1 2.00 negative
9 cure 1 1.00 positive
10 destroy 1 3.00 negative
But, I can't reproduce the simple ggplots of word frequencies in tidytext. Since my data/corpus are not arranged with a column for "book" in the dataframe, the code (and therefore much of the tidytext functionality) won't work.
Here is an example of the issue. This works fine:
practice %>%
count(term, sort = TRUE)
# A tibble: 989 x 2
term n
<chr> <int>
1 activ 3
2 air 3
3 altern 3
But how do I arrange the tm corpus to match the structure of the books in the janeaustenr package? Is "document" the equivalent of "book"? My corpus consists of text files in folders. I have tried substituting "document" for "book" in the code, and it doesn't work. Maybe I need to rename the column? Apologies in advance: I am not a programmer.

Imported .sav file (SPSS) with variable coded 0 and 1

When I import a .sav file (SPSS) into R where a variable has been coded 0 and 1, one of the two values is read in as NA. I've encountered this with several .sav files. You would think the value coded 0 would be the one read as NA, but the result is reversed relative to the accompanying report. For example, for the variable sex, the report describes males (n=99, coded 0) and females (n=110, coded 1); after import, the males (n=99) appear as Males, but the females (n=110) are NA. In another dataset, the main grouping variable has groups A and B. Group A shows 0, Group B shows 94, and there are 96 NAs (all of group A), so the same reversal is occurring, although I can't demonstrate that here. An explanation would be interesting, but my immediate problem is how to get the NAs into the other value (or into a new variable, for further analysis). I'm not sure how to create a small data frame for you here, so I've selected 2 columns and sliced 10 rows. This is the group variable with one other variable (MH0Q1):
group MH0Q1
2 <NA> <NA>
3 <NA> <NA>
4 <NA> <NA>
5 <NA> 1
6 B <NA>
7 B <NA>
8 B 2
8 B <NA>
9 B <NA>
10 B <NA>
11 B 2
The attributes of MH0Q1:
$ MH0Q1 : Named num 0 1 2
.. ..- attr(*, "names")= chr "0" "1" "2"
Thanks
You haven't said how you are reading the .sav file into R. The problematic variable may be categorical in the .sav file and may be converted to an R factor differently than you expect.

How do I identify the correct clustering algorithm for the available data?

I have sample data on flight routes: the number of searches for each route, the gross profit for the route, and the number of transactions for the route. I want to bucket flight routes that show similar characteristics based on these variables. What steps should I follow to settle on a particular clustering algorithm?
Below is sample data which I would like to cluster.
Route Clicks Impressions CPC Share of Voice Gross-Profit Number of Transactions Conversions
AAE-ALG 2 25 0.22 $4.00 2 1
AAE-CGK 5 40 0.21 $6.00 1 1
AAE-FCO 1 25 0.25 $13.00 4 1
AAE-IST 8 58 0.30 $18.00 3 2
AAE-MOW 22 100 0.11 $1.00 6 5
AAE-ORN 11 70 0.21 $22.00 3 2
AAE-ORY 8 40 0.18 $3.00 4 4
To me this looks like an N-dimensional clustering problem, where N is the number of features; here N = 8 (Route, Clicks, Impressions, CPC, Share of Voice, Gross-Profit, Number of Transactions, Conversions).
I think that if you preprocess the feature values so that distances between them are meaningful, you can apply K-means to cluster your data.
E.g. Route can be represented by the distance* between its two airports, dA; then the distance between two routes is the absolute difference of those values: d = ABS(dA - dA')
Don't forget to scale your features.
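
As a rough illustration of the "scale, then K-means" suggestion, here is a minimal pure-Python sketch. It makes several assumptions that are not in the question: only the six numeric values present in each sample row are used (the Route column is left out, since encoding it is a separate decision), k = 2 is an arbitrary choice, and the centroids are seeded with the first k rows purely to keep the run deterministic. The helper names `standardize` and `kmeans` are made up for this example.

```python
import math

# Six numeric values per route, copied in order from the sample rows.
routes = ["AAE-ALG", "AAE-CGK", "AAE-FCO", "AAE-IST",
          "AAE-MOW", "AAE-ORN", "AAE-ORY"]
rows = [
    [2, 25, 0.22, 4.0, 2, 1],
    [5, 40, 0.21, 6.0, 1, 1],
    [1, 25, 0.25, 13.0, 4, 1],
    [8, 58, 0.30, 18.0, 3, 2],
    [22, 100, 0.11, 1.0, 6, 5],
    [11, 70, 0.21, 22.0, 3, 2],
    [8, 40, 0.18, 3.0, 4, 4],
]


def standardize(data):
    """Rescale every column to mean 0 and standard deviation 1."""
    cols = list(zip(*data))
    means = [sum(c) / len(c) for c in cols]
    stds = [math.sqrt(sum((v - m) ** 2 for v in c) / len(c)) or 1.0
            for c, m in zip(cols, means)]
    return [[(v - m) / s for v, m, s in zip(row, means, stds)]
            for row in data]


def kmeans(points, k, iterations=100):
    """Plain Lloyd's algorithm, seeded with the first k points."""
    centroids = [list(p) for p in points[:k]]
    labels = None
    for _ in range(iterations):
        # Assignment step: nearest centroid by Euclidean distance.
        new_labels = [
            min(range(k), key=lambda j: math.dist(p, centroids[j]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the old centroid if a cluster is empty
                centroids[j] = [sum(c) / len(members) for c in zip(*members)]
    return labels, centroids


scaled = standardize(rows)
labels, centroids = kmeans(scaled, k=2)
for route, label in zip(routes, labels):
    print(route, label)
```

Without the scaling step, Impressions (values up to 100) would dominate the Euclidean distance over CPC (values below 1), which is why the features are standardized first. In practice, you would also try several values of k and several random seeds before trusting any grouping.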

Create a new variable in Tableau

I am new to Tableau and trying to get myself oriented to this system. I am an R user and typically work with wide data formats, so getting things wrangled into the proper long format has been tricky. Here is my current problem.
Assume I have a data file that is structured as such
ID Disorder Value
1 A 0
1 B 1
1 C 0
2 A 1
2 B 1
2 C 1
3 A 0
3 B 0
3 C 0
What I would like to do is combine the variables so that the presence of any of a set of disorders is used for summary variables. For example, how could I go about achieving something like this as my output? The sum is the number of people with the disorder, and the percentage is the number of people with the disorder divided by the total number of people.
Disorders Sum Percentage
A 1 33.3
B 2 66.6
C 1 33.3
AB 2 66.6
BC 2 66.6
AC 1 33.3
ABC 2 66.6
The approach really depends on how flexible it has to be. Ultimately, a wide data source with each Disorder as its own column would make this easier. To get this to work in Tableau, you will still need to blend the results onto a data scaffold that holds the combinations of codes you want. If this needs to scale, you'll want to do the transformation work using custom SQL or another ETL tool like Alteryx. I posted a solution to this question for you over on the Tableau forum where I can upload files: http://community.tableausoftware.com/message/316168
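
To make the target numbers concrete, here is a small Python sketch of the transformation itself, done outside Tableau. It assumes, based on the desired output in the question, that a combination such as AB counts a person who has at least one of the listed disorders. The variable names are chosen for this example only.

```python
from itertools import combinations

# Long-format sample data from the question: (ID, Disorder, Value).
records = [
    (1, "A", 0), (1, "B", 1), (1, "C", 0),
    (2, "A", 1), (2, "B", 1), (2, "C", 1),
    (3, "A", 0), (3, "B", 0), (3, "C", 0),
]

# Collect, per person, the set of disorders flagged with Value == 1.
positives = {}
for person, disorder, value in records:
    positives.setdefault(person, set())
    if value == 1:
        positives[person].add(disorder)

disorders = sorted({disorder for _, disorder, _ in records})
total_people = len(positives)

# For every combination, count people with at least one of its disorders.
summary = {}
for size in range(1, len(disorders) + 1):
    for combo in combinations(disorders, size):
        count = sum(1 for have in positives.values() if have & set(combo))
        summary["".join(combo)] = (count, round(100 * count / total_people, 1))

for name, (count, pct) in summary.items():
    print(name, count, pct)
```

The counts match the table in the question; the percentages are rounded here, so two out of three people prints as 66.7 rather than the truncated 66.6 shown there.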