how to copy/move a column from one dataset to another dataset by matching variable - merge

I have a Dataset1 which has 25 columns, with a single column for individual ID. I have Dataset2 with 15 columns and also a column for individual ID. I am trying to move or copy one of the columns from Dataset1 and put it into Dataset2 sorted by individual, but the same individuals are not necessarily in both datasets. Is there an easy way to do this? I've tried playing around with dplyr but I'm really new to R and haven't had any luck. I don't want to completely merge the datasets, I just want to add one column of data to the second dataset but without losing information on individual ID.
Thanks!

an example
library(nycflights13) # for data example
head(flights)
head(airlines)
left_join(airlines, flights %>% select(carrier, flight))
your use case
library(dplyr)
left_join(Dataset1, Dataset2 %>% select(individual ID, name_col_to_copy) , by = "individual ID")

Related

How to filter one source by clicking and filtering a bar chart from another source in Tableau?

I used an Apriori algorithm to view the frequent relationships in the dataset and I want to do a dashboard to better visualize this data but I don't know how to do this filter.
This is the bar chart that I created to show the support (amount of times something happend) and the confidence (probability of B happening given A) of these associations:
Apriori Chart
Next to it on the dashboard, I'll have a table with the full dataset used in this Apriori analysis where I have more information such as ID, Income, Hours Worked, etc:
Table from different data source
How can I create this relationship? The two data sources don't have a column in common that I can use for that.
I would need some way to:
Split the values in the antecedents columns by comma and filter only those columns with value equal to 1 in the other dataset
**Dataset A**
'Age Range <=30, Joblevel 1, Maritalstatus Single'
->
'Age Range <=30'
'Joblevel 1'
'Maritalstatus Single'
**Dataset B**
'Age Range <=30' == 1
'Joblevel 1' == 1
'Maritalstatus Single' == 1
Clicking this would filter the table next to it
Is there any way I can do this in Tableau?
You can download the tbwx i used in this example here https://community.tableau.com/servlet/JiveServlet/download/1083124-384949/Apriori.twbx
Thanks in advance for the help!
I am not able to check your twbx on the machine I'm using but I think you should be able to do this. The fields in the 2 data sources need to match so manipulate the data sources the make this happen.
For data source 1 there's a function SPLIT which will mean you are able to split the comma separated string to 3 fields.
Putting those 3 fields to the Detail shelf of your bar chart (or even Rows and hiding the header) will mean you can use them in an action filter.
Your second data source is a cross tab - post pivot. You should be able to pivot this data source. Highlight the measures and pivot them. This will give you the field Pivot Field Names and Pivot Field Values.
You only want to keep those with a value of 1 so create a calculated field
[Lookup1]: IF [Pivot Field Values] = 1 THEN [Pivot Field Names] END
Duplicate this field twice so you have Lookup1, Lookup2 and Lookup 3.
Then you should be able to action filter the table.
In the action filter set it up so SplitField1 = Lookup1, SplitField2 = Lookup2, etc.
Fingers crossed this works, I haven't been able to test so I am pulling it out of my head.

DAX: Distinct and then aggregate twice

I'm trying to create a Measure in Power BI using DAX that achieves the below.
The data set has four columns, Name, Month, Country and Value. I have duplicates so first I need to dedupe across all four columns, then group by Month and sum up the value. And then, I need to average across the Month to arrive at a single value. How would I achieve this in DAX?
I figured it out. Reply by #OscarLar was very close but nested SUMMARIZE causes problems because it cannot aggregate values calculated dynamically within the query itself (https://www.sqlbi.com/articles/nested-grouping-using-groupby-vs-summarize/).
I kept the inner SUMMARIZE from #OscarLar's answer changed the outer SUMMARIZE with a GROUPBY. Here's the code that worked.
AVERAGEX(GROUPBY(SUMMARIZE(Data, Data[Name], Data[Month], Data[Country], Data[Value]), Data[Month], "Month_Value", sumx(CURRENTGROUP(), Data[Value])), [Month_Value])
Not sure I completeley understood the question since you didn't provide example data or some DAX code you've already tried. Please do so next time.
I'm assuming parts of this can not (for reasons) be done using power query so that you have to use DAX. Then I think this will do what you described.
Create a temporary data table called Data_reduced in which duplicate rows have been removed.
Data_reduced =
SUMMARIZE(
'Data';
[Name];
[Month];
[Country];
[Value]
)
Then create the averaging measure like this
AveragePerMonth =
AVERAGEX(
SUMMARIZE(
'Data_reduced';
'Data_reduced'[Month];
"Sum_month"; SUM('Data_reduced'[Value])
);
[Sum_month]
)
Where Data is the name of the table.

How clean data in two columns of a csv file, with many rows as documents

I need to clean textual data from two columns (say c(8,9)) of a cvs file with around 800 rows, where each row represent a different document, containing comments from interviewees on a company. I need to apply the stopwords, but I don't know how to use tm functions on those rows of these two columns.
library(tm)
setwd("C:\\temp")
netf <- read.csv2("netflix.csv")
corp <-tm_map(netf, c(8,9), removeWords, stopwords("english"))
Error in UseMethod("tm_map",x):
no applicable method for "tm_map" applied to an object of class "data.frame"
I would like to get the two columns with their 800 rows cleaned by stopwords

Counting the number each element in a comma seperated column in Tableau

I am new to using Tableau.I want to count the number of times each genre appears in the data set.
In the data set(image attached), I have several genres for one show. I want to count the number of each genre in the data set and display it in Tableau
If you have access to database, then take the dump of data in a excel.
Split the data by , and then create a individual column for every word in the genre column.
Now take the excel as source to tableau, In tableau pivot the splitted columns of Genre.
Go to sheet in tableau, Place the pivot field values in rows and count of pivot field values as measures.
You should be able to see the desired result.
This can be done by an alternative method if you know the distinct list of all genre.
what you need to do is to create a separate calculative field for each Genre using
if contains(Genre,'action') then 1 else 0 end
and then use the Sum of these field as the count of Series per genre.
I know this is a hideous task but, it can be done if you do not have any other option.

Need help building complex multi-table queries

This question is something that a lot of people learning bioinformatics and new to DNA data analysis are struggling with:
Lets say I have 20 tables with the same column headings. Each table represents a patient sample and each row represents a locus (site) which has mutated in that sample. Each site is uniquely identified by two columns together - chromosome number and base number (eg. 1 and 43535, 1 and 33456, 1 and 3454353). There are several columns which give different characteristics of each mutation including a column called Gene which gives the gene at that site.. Multiple sites can be mutated in a gene - meaning the Gene column can have the same value multiple times in one table.
I want to query all these tables at the same time by lets say Gene. I input a value from the Gene column and I want as output the names of all the tables (samples) in which the gene name is present in the Gene column and also the entire line(s) (preferably) for each sample so that I can compare the characteristics of the mutation in that gene across multiple samples on one output page.
I also want to input a number say 4 and want as output a list of genes which have mutated in at least 4 of 20 patients (list of genes whose names appear in the Gene column in atleast 4 of 20 tables).
What is the "easiest way" to do this? What is the "best way" assuming I want to make more flexible queries, besides these two?
I am a MD, do not have any particular software expertise but I am willing to put in the necessary time to build this query system. A few lines of code won't put me off..
Eg data:
Func Gene ExonicFunc Chr Start End Ref Obs
exonic ACTRT2 nonsynonymous SNV 1 2939346 2939346 G A
exonic EIF4G3 nonsynonymous SNV 1 21226201 21226201 G A
exonic CSMD2 nonsynonymous SNV 1 34123714 34123714 C T
This is just a third of the columns. Multiple columns were removed to fit the page size here...
Thank you.
Create a view that union's all the tables together. You should probably add additional information about which table ti comes from:
create view allpatients as
select 'a' as whichtable, t.*
from tableA t
union all
select 'b' as whichtable, t.*
from tableB t
...
You might find that it is easier to "instantiate" the view by creating a table with all patients. Just have a stored procedure that recreates the table by combining the 20 tables.
Alternatively, you could find that you have large individual tables (millions of rows). In this case, you would want to treat each of the original tables as a partition.
If what you have is a bunch of Excel files, you can import them all into the same table, with a distinct column for patient id. There is no need to create 20 different tables for this -- in fact, it would be a bad idea.
Once you do, go to Access' query design, SQL view and use these queries:
To create a query that returns all fields for the input gene name:
select *
from gene_data
where gene = [GeneName]
To create a query that returns gene names that are mutated in more than 4 samples:
select gene
from
(select gene, sample_id
from gene_data
group by gene, sample_id) g
group by gene
having count(sample_id) > 4
After this, change to design view -- you'll see how to create similar queries using the GUI.