How to clean text data in two columns of a CSV file, with many rows as documents - tm

I need to clean textual data from two columns (say c(8,9)) of a csv file with around 800 rows, where each row represents a different document containing comments from interviewees on a company. I need to remove stopwords, but I don't know how to apply tm functions to the rows of these two columns.
library(tm)
setwd("C:\\temp")
netf <- read.csv2("netflix.csv")
corp <-tm_map(netf, c(8,9), removeWords, stopwords("english"))
Error in UseMethod("tm_map", x) :
  no applicable method for 'tm_map' applied to an object of class "data.frame"
I would like to get the two columns, with all 800 rows, cleaned of stopwords.
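The error occurs because tm_map() expects a corpus, not a data frame. Here is one possible sketch; the file name and column indices come from the question, while the lowercasing step and stringsAsFactors argument are assumptions:
library(tm)

# Assumes the csv from the question; columns 8 and 9 hold the comment text
netf <- read.csv2("netflix.csv", stringsAsFactors = FALSE)

# removeWords() also works on plain character vectors, so the two columns
# can be cleaned in place, one document (row) at a time (lowercasing is
# optional, but stopwords("english") is all lower case)
netf[[8]] <- removeWords(tolower(netf[[8]]), stopwords("english"))
netf[[9]] <- removeWords(tolower(netf[[9]]), stopwords("english"))

# Alternative: build a corpus per column if further tm transformations are needed
corp <- VCorpus(VectorSource(netf[[8]]))
corp <- tm_map(corp, removeWords, stopwords("english"))
The first approach keeps the data frame structure intact; the corpus route is only needed if other tm transformations (punctuation removal, stemming, and so on) are wanted as well.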

Related

PowerBI Splitting a delimited column into duplicate rows

I am working with a customer survey in MS Forms; two of the questions allow multiple selections. What I want to know is: how do I split these columns into new rows while duplicating the data in the other columns?
I know how to split a column by delimiter; I'm just struggling to figure out the correct approach to split the columns and duplicate the rows.
Here are the two columns (crossed out for sensitive info); there are about 10+ additional columns with data that I would like duplicated with each split.
Follow the steps below:
(The original answer showed the input table, the two Power Query transformation steps, and the resulting output as screenshots.)
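For reference, the same split-into-rows transformation can be written directly in Power Query M: split the multi-select answer into a list, then expand the list into rows so the remaining columns are duplicated. This is only a minimal sketch with made-up column names ("Respondent", "Question1") and an assumed ";" delimiter:
let
    // Hypothetical sample input: one multi-select answer column plus one other column
    Source = #table(
        {"Respondent", "Question1"},
        {{"R1", "Price;Quality"}, {"R2", "Quality;Support;Price"}}),
    // Turn the multi-select answer into a list of individual values
    ToList = Table.TransformColumns(Source, {{"Question1", each Text.Split(_, ";")}}),
    // Expand the list into rows; all other columns are duplicated for each value
    ToRows = Table.ExpandListColumn(ToList, "Question1")
in
    ToRows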

Create sample values for failure records in Spark

I have a scenario where my dataframe has 3 columns: a, b and c. I need to validate if the length of all the columns is equal to 100. Based on the validation I create status columns a_status, b_status, c_status with values 5 (Success) and 10 (Failure). In failure scenarios I need to update a count and create new columns a_sample, b_sample, c_sample, each with some 5 failing sample values separated by ",". To create the sample columns I tried this:
df = df.select(df.columns.toList.map(col(_)) :::
  df.columns.toList.map(x =>
    lit(getSample(df.select(x, x + "_status").filter(x + "_status = 10").select(x).take(5)))
      .alias(x + "_sample")
  ): _*)
The getSample method just takes the array of rows and concatenates them into a string. This works fine for a limited number of columns and a small data size; however, with more than 200 columns and over 1 million rows it has a huge performance impact. Is there an alternative approach?
While the details of your problem statement are unclear, you can break up the task into two parts:
1. Transform the data into a format where you identify the several different types of rows you need to sample.
2. Collect a sample by row type.
The industry jargon for "row type" is stratum/strata, and the way to do (2) without collecting data to the driver (which you don't want to do when the data is large) is via stratified sampling, which Spark implements via df.stat.sampleBy(). As a statistical function, it doesn't work with exact row counts but with fractions. If you absolutely must get a sample with an exact number of rows, there are two strategies:
Oversample by fraction and then filter out unneeded rows, e.g., using the row_number() window function followed by a filter such as row_num <= n.
Build a custom user-defined aggregate function (UDAF), firstN(col, n). This will be much faster but a lot more work. See https://docs.databricks.com/spark/latest/spark-sql/udaf-scala.html
An additional challenge for your use case is that you want this done per column. This is not a good fit with Spark's transformations such as grouping or sampleBy, which operate on rows. The simple approach is to make several passes through the data, one column at a time. If you absolutely must do this in a single pass through the data, you'll need to build a much more custom UDAF or Aggregator, e.g., the equivalent of takeFirstNFromAWhereBHasValueC(n, colA, colB, c).
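As a rough sketch of the sampleBy-plus-trim strategy for a single column (assuming the a_status column from the question, with 10 marking failures and at most 5 sample rows wanted; the fraction and seed values are placeholders):
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

// Stratified oversample: keep a fraction of the failure rows (a_status = 10), none of the successes
val fractions = Map(10 -> 0.01, 5 -> 0.0)
val sampled = df.stat.sampleBy("a_status", fractions, seed = 42L)

// Trim to an exact number of rows per stratum with a window + row_number filter
val w = Window.partitionBy(col("a_status")).orderBy(rand(42L))
val exactN = sampled
  .withColumn("row_num", row_number().over(w))
  .filter(col("row_num") <= 5)
  .drop("row_num")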

Tallying unknown words across columns in Tableau (or from comma separated column)

I have an issue that I have been trying to solve for the better part of a week now. I have a large database (in Google Sheets) representing case studies. I have some columns with multiple categories listed (in this example 'species', 'genera', and 'morphologies'), and I want to be able to tally how many times each category occurs in the data set.
I use Tableau to visualise the data, and the final output will be a large public Tableau dashboard. I know I can do a "find" based on a specific string, but I'd like the dataset to be dynamic and able to handle new data being added without having to update calculated fields. Is there a way of finding unique terms (either from a single column of comma-separated values, or from multiple columns) and tallying them?
Things I have tried so far:
1 - A pivot table in Tableau. Works well, but messes with all the other data, since it repeats lines.
2 - A pivot table on its own data source in Tableau. Also works well, and avoids the problem of messing with the other data. However, now each figure is disconnected from the others, so I can't build a large dashboard where everything is filtered by everything else (i.e. filtering species and genera by country at the same time).
3 - An SQL-style QUERY() in Google Sheets, which finds all unique terms and queries them, and which can then be plotted in Tableau. Also works well, but has a similar problem of the data being disconnected from all the other terms in the dataset.
Any ideas for a field calculation that will find, list and tally unique terms in a single comma-separated column (or across multiple columns), without changing the data structure?
I have placed a sample data set here (Google Sheets), which is a smaller version of what I'm actually working on. In it I have marked the comma-separated columns in grey, and they're followed by a bunch of columns with the values split into separate columns. I only need to analyse one or the other (i.e. either a calculation that separates comma-separated values, or one that works across multiple columns).
I've also added a sample Tableau workbook here.

How to copy/move a column from one dataset to another by matching on a variable

I have Dataset1, which has 25 columns, with a single column for individual ID. I have Dataset2 with 15 columns and also a column for individual ID. I am trying to move or copy one of the columns from Dataset1 and put it into Dataset2, matched by individual, but the same individuals are not necessarily in both datasets. Is there an easy way to do this? I've tried playing around with dplyr, but I'm really new to R and haven't had any luck. I don't want to completely merge the datasets; I just want to add one column of data to the second dataset without losing information on individual ID.
Thanks!
an example
library(nycflights13) # for data example
head(flights)
head(airlines)
left_join(airlines, flights %>% select(carrier, flight))
your use case
library(dplyr)
left_join(Dataset2,
          Dataset1 %>% select(`individual ID`, name_col_to_copy),
          by = "individual ID")

Need help building complex multi-table queries

This question is something that a lot of people learning bioinformatics and new to DNA data analysis are struggling with:
Let's say I have 20 tables with the same column headings. Each table represents a patient sample, and each row represents a locus (site) which has mutated in that sample. Each site is uniquely identified by two columns together - chromosome number and base number (e.g. 1 and 43535, 1 and 33456, 1 and 3454353). There are several columns which give different characteristics of each mutation, including a column called Gene which gives the gene at that site. Multiple sites can be mutated in a gene - meaning the Gene column can have the same value multiple times in one table.
I want to query all these tables at the same time by, let's say, Gene. I input a value from the Gene column, and as output I want the names of all the tables (samples) in which that gene name is present in the Gene column, and preferably also the entire line(s) for each sample, so that I can compare the characteristics of the mutation in that gene across multiple samples on one output page.
I also want to input a number, say 4, and get as output a list of genes which have mutated in at least 4 of the 20 patients (a list of genes whose names appear in the Gene column in at least 4 of the 20 tables).
What is the "easiest way" to do this? What is the "best way" assuming I want to make more flexible queries, besides these two?
I am an MD and do not have any particular software expertise, but I am willing to put in the necessary time to build this query system. A few lines of code won't put me off.
Eg data:
Func Gene ExonicFunc Chr Start End Ref Obs
exonic ACTRT2 nonsynonymous SNV 1 2939346 2939346 G A
exonic EIF4G3 nonsynonymous SNV 1 21226201 21226201 G A
exonic CSMD2 nonsynonymous SNV 1 34123714 34123714 C T
This is just a third of the columns. Multiple columns were removed to fit the page size here...
Thank you.
Create a view that unions all the tables together. You should probably add additional information about which table each row comes from:
create view allpatients as
select 'a' as whichtable, t.*
from tableA t
union all
select 'b' as whichtable, t.*
from tableB t
...
You might find that it is easier to "instantiate" the view by creating a table with all patients. Just have a stored procedure that recreates the table by combining the 20 tables.
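A minimal sketch of that instantiation, assuming SQL Server syntax; the combined table name allpatients_tbl is made up, and tableA, tableB, ... are the same per-patient tables as in the view above:
-- Rebuild a single combined table from the per-patient tables
-- (this could be wrapped in the stored procedure mentioned above)
if object_id('allpatients_tbl') is not null
    drop table allpatients_tbl;

select 'a' as whichtable, t.*
into allpatients_tbl
from tableA t;

insert into allpatients_tbl
select 'b' as whichtable, t.*
from tableB t;
-- ... repeat the insert for the remaining tables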
Alternatively, you could find that you have large individual tables (millions of rows). In this case, you would want to treat each of the original tables as a partition.
If what you have is a bunch of Excel files, you can import them all into the same table, with a distinct column for patient id. There is no need to create 20 different tables for this -- in fact, it would be a bad idea.
Once you do, go to Access' query design, SQL view and use these queries:
To create a query that returns all fields for the input gene name:
select *
from gene_data
where gene = [GeneName]
To create a query that returns gene names that are mutated in at least 4 samples:
select gene
from
(select gene, sample_id
from gene_data
group by gene, sample_id) g
group by gene
having count(sample_id) >= 4
After this, change to design view -- you'll see how to create similar queries using the GUI.