Create a new variable in Tableau - tableau-api

I am new to Tableau and trying to get myself oriented to this system. I am an R user and typically work with wide data formats, so getting things wrangled into the proper long format has been tricky. Here is my current problem.
Assume I have a data file that is structured like this:
ID Disorder Value
1 A 0
1 B 1
1 C 0
2 A 1
2 B 1
2 C 1
3 A 0
3 B 0
3 C 0
What I would like to do is combine the variables so that the presence of a set of disorders is used for summary variables. For example, how could I go about achieving something like this as my output? The sum is the number of people with the disorder, and the percentage is the number of people with the disorder divided by the total number of people.
Disorders Sum Percentage
A 1 33.3
B 2 66.6
C 1 33.3
AB 2 66.6
BC 2 66.6
AC 1 33.3
ABC 2 66.6

The approach really depends on how flexible it has to be. Ultimately, a wide data source with one column per Disorder would make this easier. You will still need to blend the results onto a data scaffold that lists the combinations of codes you want in order to get this to work in Tableau. If this needs to scale, you'll want to do the transformation work using custom SQL or another ETL tool like Alteryx. I posted a solution to this question for you over on the Tableau forum, where I can upload files: http://community.tableausoftware.com/message/316168
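If the transformation is done outside Tableau, the summary in the question can be sketched in a few lines of Python. This is only an illustration, not the solution posted on the forum: the function name `summarize` is made up, and the rule that a person counts toward a combination when they have any disorder in it is inferred from the expected output in the question (AB = 2).

```python
from itertools import combinations

# Long-format rows from the question: (ID, Disorder, Value).
rows = [
    (1, "A", 0), (1, "B", 1), (1, "C", 0),
    (2, "A", 1), (2, "B", 1), (2, "C", 1),
    (3, "A", 0), (3, "B", 0), (3, "C", 0),
]

def summarize(rows):
    """Return {combination: (count, percentage)}.

    Assumption: a person counts toward a combination if they have
    ANY disorder in it (inferred from the question's expected output).
    """
    ids = sorted({pid for pid, _, _ in rows})
    disorders = sorted({d for _, d, _ in rows})
    # Map each person to the set of disorders flagged with Value == 1.
    present = {pid: set() for pid in ids}
    for pid, disorder, value in rows:
        if value == 1:
            present[pid].add(disorder)
    summary = {}
    # Enumerate every non-empty combination of disorders.
    for size in range(1, len(disorders) + 1):
        for combo in combinations(disorders, size):
            count = sum(1 for pid in ids if present[pid] & set(combo))
            summary["".join(combo)] = (count, 100.0 * count / len(ids))
    return summary

summary = summarize(rows)
for name, (count, pct) in summary.items():
    print(name, count, round(pct, 1))
```

The resulting table could then be loaded into Tableau as the scaffold with the counts already attached. The number of combinations doubles with every added disorder, so this only stays practical for a modest number of codes.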

Related

tm to tidytext conversion

I am trying to learn tidytext. I can follow the examples on the tidytext website as long as I use the packaged data (janeaustenr, for example). However, most of my data are text files in a corpus. I can reproduce the tm to tidytext conversion example for sentiment analysis (ap_sentiments) on the tidytext website. I am having trouble, however, understanding how the tidytext data are structured. For example, the Austen novels are stored by "book" in the janeaustenr package. For my tm data, however, what is the equivalent of the "book" vector? Here is the specific example for my data:
cname <- file.path(".", "greencomments" , "all")
I can then use tidytext successfully after running the tm preprocessing:
practice <- tidy(tdm)
practice
partysentiments <- practice %>%
inner_join(get_sentiments("bing"), by = c(term = "word"))
partysentiments
# A tibble: 170 x 4
term document count sentiment
<chr> <chr> <dbl> <chr>
1 benefit 1 1.00 positive
2 best 1 2.00 positive
3 better 1 7.00 positive
4 cheaper 1 1.00 positive
5 clean 1 24.0 positive
7 clear 1 1.00 positive
8 concern 1 2.00 negative
9 cure 1 1.00 positive
10 destroy 1 3.00 negative
But, I can't reproduce the simple ggplots of word frequencies in tidytext. Since my data/corpus are not arranged with a column for "book" in the dataframe, the code (and therefore much of the tidytext functionality) won't work.
Here is an example of the issue. This works fine:
practice %>%
count(term, sort = TRUE)
# A tibble: 989 x 2
term n
<chr> <int>
1 activ 3
2 air 3
3 altern 3
But how do I arrange the tm corpus to match the structure of the books in the janeaustenr package? Is "document" the equivalent of "book"? I have text files in folders for the corpus. I have tried replacing this in the code, and it doesn't work. Maybe I need to rename this? Apologies in advance - I am not a programmer.

Clustering data with different approaches

I have the following type of data:
The *.edge file has the connections between the IDs of different users:
1 23
4 67
...
The *.feat file contains properties of the IDs. Here the first column (column 0) is the user ID. The other columns represent features whose names are listed in another file. For example, user ID 1 does not have the feature of column 1 (0), but user ID 4 does (1):
1: 0 0 1 0 1 1 0 1 1
4: 1 0 1 1 1 0 1 1 1
...
Now I want to cluster the data using different algorithms like k-means, DBSCAN, hierarchical clustering and so on. But as I read, there are several problems with multidimensional data?
There are problems with very high-dimensional data, but 10 is not high. You have other problems: k-means needs coordinates to compute means, not a graph with edges. Also, the values should be continuous, not binary. You need to study these methods in more detail. If you say "But as I read ...", then try to give a reference.
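To illustrate the point about means and binary values, here is a small Python sketch that averages the two feature rows from the question, which is the operation k-means performs when it updates a cluster center. The helper name `centroid` is made up for this example.

```python
# The two binary feature rows shown in the question.
user_1 = [0, 0, 1, 0, 1, 1, 0, 1, 1]
user_4 = [1, 0, 1, 1, 1, 0, 1, 1, 1]

def centroid(points):
    """Column-wise mean, as k-means computes for a cluster center."""
    n = len(points)
    return [sum(column) / n for column in zip(*points)]

center = centroid([user_1, user_4])
print(center)  # [0.5, 0.0, 1.0, 0.5, 1.0, 0.5, 0.5, 1.0, 1.0]
```

The center contains values such as 0.5 that no user can actually have, which is one reason a mean-based method is an awkward fit for yes/no features.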

Calculating group means with own group excluded in MATLAB

To be generic the issue is: I need to create group means that exclude own group observations before calculating the mean.
As an example: let's say I have firms, products and product characteristics. Each firm (f=1,...,F) produces several products (i=1,...,I). I would like to create a group mean for a certain characteristic of the product i of firm f, using all products of all firms, excluding firm f product observations.
So I could have a dataset like this:
firm prod width
1 1 30
1 2 10
1 3 20
2 1 25
2 2 15
2 4 40
3 2 10
3 4 35
To reproduce the table:
firm=[1,1,1,2,2,2,3,3]
prod=[1,2,3,1,2,4,2,4]
hp=[30,10,20,25,15,40,10,35]
x=[firm' prod' hp']
Then I want to estimate a mean which will use values of all products of all other firms, that is excluding all firm 1 products. In this case, my grouping is at the firm level. (This mean is to be used as an instrumental variable for the width of all products in firm 1.)
So, the mean that I should find is: (25+15+40+10+35)/5=25
Then repeat the process for other firms.
firm prod width mean_desired
1 1 30 25
1 2 10 25
1 3 20 25
2 1 25
2 2 15
2 4 40
3 2 10
3 4 35
I guess my biggest difficulty is to exclude the own firm values.
This question is related to this page here: Calculating group mean/medians in MATLAB where group ID is in a separate column. But here, we do not exclude the own group.
p.s.: just out of curiosity if anyone works in economics, I am actually trying to construct Hausman or BLP instruments.
Here's a way that avoids loops, but may be memory-expensive. Let x denote your three-column data matrix.
m = bsxfun(@ne, x(:,1).', unique(x(:,1))); % or m = ~sparse(x(:,1), 1:size(x,1), true);
result = m*x(:,3);
result = result./sum(m,2);
This creates a zero-one matrix m such that each row of m multiplied by the width column of x (second line of code) gives the sum of other groups. m is built by comparing each entry in the firm column of x with the unique values of that column (first line). Then, dividing by the respective count of other groups (third line) gives the desired result.
If you need the results repeated as per the original firm column, use result(x(:,1))
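For comparison, the same numbers can be obtained without building the mask matrix, by subtracting each firm's own sum and count from the grand totals. The sketch below uses Python rather than MATLAB and is a different route than the code above; the function name `leave_own_group_out_means` is made up for illustration.

```python
# Data from the question: firm ID and width for each product.
firm = [1, 1, 1, 2, 2, 2, 3, 3]
width = [30, 10, 20, 25, 15, 40, 10, 35]

def leave_own_group_out_means(groups, values):
    """For each row, return the mean of all values outside that row's group."""
    total = sum(values)
    n = len(values)
    group_sum = {}
    group_count = {}
    for g, v in zip(groups, values):
        group_sum[g] = group_sum.get(g, 0) + v
        group_count[g] = group_count.get(g, 0) + 1
    # Grand total minus own-group total, divided by the number of other rows.
    return [(total - group_sum[g]) / (n - group_count[g]) for g in groups]

mean_desired = leave_own_group_out_means(firm, width)
print(mean_desired[0])  # 25.0 for every product of firm 1
```

This needs memory proportional to the number of rows plus the number of firms. It assumes there are at least two firms; with a single firm the denominator would be zero.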

Comparing, matching and combining columns of data

I need some help matching data and combining it. I currently have four columns of data in an Excel sheet, similar to the following:
Column: 1 2 3 4
U 3 A 0
W 6 B 0
R 1 C 0
T 9 D 0
... ... ... ...
Column two is a data value that corresponds to the letter in column one. What I need to do is compare column 3 with column 1 and whenever it matches copy the corresponding value from column 2 to column 4.
You might ask why don't I do this manually ? I have a spreadsheet with around 100,000 rows so this really isn't an option!
I do have access to MATLAB and have the information imported, if this would be more easily completed within that environment, please let me know.
As mentioned by @bla:
a formula similar to =IF(A1=C1,B1,0)
should serve (Excel).
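Note that the formula above compares cells in the same row only. If a letter in column 3 may match a letter anywhere in column 1, a lookup table is one option. Here is a minimal Python sketch under that assumption; the function name `match_values` and the column 3 values are hypothetical and chosen only so that some rows match.

```python
def match_values(keys, values, targets, default=0):
    """For each target, return the value paired with the matching key, else default."""
    lookup = {}
    for key, value in zip(keys, values):
        lookup.setdefault(key, value)  # keep the first occurrence of a repeated key
    return [lookup.get(target, default) for target in targets]

col1 = ["U", "W", "R", "T"]   # letters from the question
col2 = [3, 6, 1, 9]           # values from the question
col3 = ["T", "A", "U", "D"]   # hypothetical: chosen so that two rows match
col4 = match_values(col1, col2, col3)
print(col4)  # [9, 0, 3, 0]
```

Building the dictionary once keeps the work roughly linear in the number of rows, which matters at around 100,000 rows.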

How are the columns and rows counted in pascal function in Functional Programming Principles in Scala at coursera?

I'm learning Scala while going through the Coursera course Functional Programming Principles in Scala.
The first exercise says:
1
1 1
1 2 1
1 3 3 1
1 4 6 4 1
The numbers at the edge of the triangle are all 1, and each number
inside the triangle is the sum of the two numbers above it. Write a
function that computes the elements of Pascal’s triangle by means of a
recursive process.
Do this exercise by implementing the pascal function in Main.scala,
which takes a column c and a row r, counting from 0 and returns the
number at that spot in the triangle. For example, pascal(0,2)=1,
pascal(1,2)=2 and pascal(1,3)=3.
At the start, I understand, as he refers to the 'numbers' we are all familiar with, but then he goes on to use the term "elements." What does he mean by this? What does he want me to compute?
I assumed that he got bored with the word "number" and thought, after defining the names of the numbers in the triangle as 'numbers' he just wanted to use something new, thus "element," but no matter how I count I cannot get the references to work.
I cannot even really understand the term 'column' seeing as the numbers are not vertically above each other.
Can you please explain how he gets pascal(1,3) == 3?
You're thinking about columns a bit wrong. By "xth column," he means the "xth entry" in a given row.
So, if you are looking at the function pascal(c,r), you would want to figure out what the cth number is in the rth row.
So, for example:
pascal(1,2) corresponds to the second entry in the 3rd row
1
1 1
1 *2* 1
pascal(1,3) wants you to look at the second entry in the 4th row.
1
1 1
1 2 1
1 *3* 3 1
Just count from the left. (0,2) is the leftmost number in the row
1 2 1
so (1,3) would be the second number in
1 3 3 1
You can simply make the triangle "rectangle", and everything will become apparent:
cols-> 0 1 2 3 4
row-0 1
row-1 1 1
row-2 1 2 1
row-3 1 3 3 1
row-4 1 4 6 4 1
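The rectangular layout above amounts to indexing a list of rows, first by row and then by position within the row. Here is a small Python sketch of just that indexing, with the triangle hard-coded. It is not a solution to the exercise, and the name `lookup` is made up.

```python
# The first five rows of the triangle, written out by hand.
triangle = [
    [1],
    [1, 1],
    [1, 2, 1],
    [1, 3, 3, 1],
    [1, 4, 6, 4, 1],
]

def lookup(c, r):
    """Return the entry in column c of row r, both counted from 0."""
    return triangle[r][c]

print(lookup(0, 2), lookup(1, 2), lookup(1, 3))  # 1 2 3
```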
And you were right that the triangle's "elements" are numbers; there is a subtle difference between the two terms, but it is insubstantial in this case.
P.S. I would personally advise using the course forum for such questions:
It will avoid controversial issues with the honor code.
Your fellow students will understand the problem at hand more quickly.
They will have access to material that is not available to those not taking the course.
It will help build a sense of membership among the course students, and give you all a chance to create new, possibly fruitful, relationships.
What you're asking is against the Coursera Honor Code: https://www.coursera.org/maestro/auth/normal/tos.php#honorcode
http://www.aiqus.com/questions/41299/coursera-cheating-scala-course
I loved solving this exercise.
My thought process was the following:
Understanding that the problem is a literal description of the binomial coefficient. https://en.wikipedia.org/wiki/Binomial_coefficient
Understanding that the ask is a literal plug into the formula row! / (col! * (row - col)!), and the formula is right there on the wiki page
Now the only thing that is missing is implementing a tail-recursive factorial function
Bonus: if you use an extension method like this
extension (int: Int) {
def ! = factorialTailRec(int)
}
// you get to write
(r.!) / ((c.!) * ((r - c).!))
You get to write almost the identical mathematical formula. And at that moment I realised the similarities between doing maths and programming. And I cried a little with the beauty of it.
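For readers who do not use Scala, the same formula can be sketched in Python. Python does not optimize tail calls, so the accumulator idea behind a tail-recursive factorial is written here as a loop. The function names are made up, and this follows the binomial-coefficient approach of this answer rather than the recursive process the exercise asks for.

```python
def factorial(n, acc=1):
    """Factorial with an accumulator, the loop form of a tail-recursive definition."""
    while n > 1:
        acc, n = acc * n, n - 1
    return acc

def pascal(c, r):
    """Entry in column c of row r (both from 0): r! / (c! * (r - c)!)."""
    return factorial(r) // (factorial(c) * factorial(r - c))

print(pascal(0, 2), pascal(1, 2), pascal(1, 3))  # 1 2 3
```

Python integers do not overflow, whereas a factorial on Scala's Int would for rows beyond 12.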