Loading multiple non-CSV tables into R and performing a function on each file (average)

First day on R. I may be expecting too much from it but here is what I'm looking for:
I have multiple files (140 tables), and each table has two columns (V1 = values, V2 = frequencies). I use the following to get the average from each table:
Sum(V1*V2)/Sum(V2)
I was wondering if it's possible to do this once instead of 140 times, i.e. to load all the files and export a file that shows the average of each table next to the original file name.
I use read.table to load the files, as read.csv doesn't work well for some reason.
I'd appreciate any input!
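For illustration, a minimal sketch of the kind of loop being asked about, assuming the 140 files sit in one folder and share a pattern like *.txt (the path, pattern and output name are made up):
files <- list.files("path/to/tables", pattern = "\\.txt$", full.names = TRUE)
avgs <- sapply(files, function(f) {
  tbl <- read.table(f)                 # same reader as above; columns default to V1, V2
  sum(tbl$V1 * tbl$V2) / sum(tbl$V2)   # Sum(V1*V2)/Sum(V2)
})
write.csv(data.frame(file = basename(files), avg = avgs),
          "averages.csv", row.names = FALSE)   # one average per original file name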

Tableau counting multiple blanks

I have a dataset containing roughly 50 columns. I want to check the data inputs and show how many blanks, if any, are in each column.
I originally started creating calculated fields per column using SUM(IF ISNULL([column]) THEN 1 ELSE 0 END). This works but doesn't seem very efficient if I want to do it for multiple columns.
I tried doing a pivot as well, however I need the data in its original form for other analysis I would like to do.
Would appreciate any help.
Thanks
Fiona

Is it possible to limit the number of rows in the output of a Dataprep flow?

I'm using Dataprep on GCP to wrangle a large file with a billion rows. I would like to limit the number of rows in the output of the flow, as I am prototyping a Machine Learning model.
Let's say I would like to keep one million rows out of the original billion. Is it possible to do this with Dataprep? I have reviewed the documentation on sampling, but that only applies to the input of the Transformer tool and not the output of the process.
You can do this, but it takes a bit of extra work in your Recipe: set up a formula in a new column using something like RANDBETWEEN to give you a random integer between 1 and 1,000 (in this million-out-of-a-billion case). From there, filter rows to keep whichever one of those integer values you like, and your output will contain just that randomized subset. Have the last part of the recipe remove the temporary column.
So indeed there are two approaches to this.
As Courtney Grimes said, you can use one of the two functions that generate a random number within a range:
randbetween
rand
These can be used to slice an "even" portion of your data. As suggested, add a randbetween(1,1000) column, then filter on one chosen value between 1 and 1000; that keeps roughly 1/1000 of the data (a million rows out of a billion).
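For intuition only, here is the same one-value-in-a-thousand idea sketched in R rather than Dataprep recipe syntax (the data frame is just a stand-in for the real table):
df <- data.frame(x = runif(1e5))                       # stand-in for the big table
df$tag <- sample.int(1000, nrow(df), replace = TRUE)   # like randbetween(1, 1000)
kept <- df[df$tag == 1, ]                              # keep roughly 1/1000 of the rows
kept$tag <- NULL                                       # drop the temporary column, as in the recipe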
Alternatively, if you just want a million records in your output but either
don't want to rely on knowing the size of the entire table, or
just want the first million rows, regardless of how many rows there are,
you can use two of the row filtering methods (top rows / range).
P.S.
By understanding the $sourcerownumber metadata parameter (see the in-product documentation), you can filter/keep a portion of the data (as in the first scenario) in one step, i.e. without creating an additional column.
BTW, an easy way to discover how-tos in Trifacta is to just type what you're looking for in the "search transformation" pane (accessed via Ctrl-K). By searching for "filter", you'll get most of the relevant options for your problem.
Cheers!

Compare two SQL datasets using MySQL Workbench

I have three tables in MySQL Workbench: ch (17 million rows) and cl (9 million rows), which together are supposed to form one table, and a third table named alldoc (121k rows).
Basically I need to combine ch and cl into one table and compare the result with the alldoc data. Technically they are supposed to be the same, but people made mistakes, which is why I need to compare them. There are 100 columns.
I plan to write a query column by column until I hit 100, because all the tables have 100 columns; only the row counts are different.
Thank you in advance. I know it's complicated, but I really need to compare these two datasets by writing a query.

Tallying unknown words across columns in Tableau (or from a comma separated column)

I have an issue that I have been trying to solve for the better part of a week now. I have a large database (in Google Sheets) representing case studies. I have some columns with multiple categories listed (in this example 'species', 'genera', and 'morphologies'), and I want to be able to tally how many times each category occurs in the data set.
I use Tableau to visualise the data, and the final output will be a large public Tableau visualisation. I know I can do a "find" based on a specific string, but I'd like the dataset to be dynamic and able to handle new data being added without having to update calculated fields. Is there a way of finding unique terms (either from a single column of comma separated values, or from multiple columns) and tallying them?
Things I have tried so far:
1 - A pivot table in Tableau. Works well, but messes with all the other data, since it repeats lines.
2 - A pivot table on its own data source in Tableau. Also works well, and avoids the problem of messing with the other data. However, now each figure is disconnected from the others, so I can't do a large dashboard where everything is filtered by each other (i.e. filtering species and genera by country at the same time).
3 - An SQL-style query() in Google Sheets, which finds all unique terms and queries them, which can then be plotted in Tableau. Also works well, but has a similar problem of the data being disconnected from all the other terms in the dataset.
Any ideas for a calculated field that will find, list, and tally unique terms in a single comma separated column (or across multiple columns), without changing the data structure?
I have placed a sample data set here (Google Sheets), which is a smaller version of what I'm actually working on. In it I have marked the comma separated columns in grey, and they're followed by a set of columns with the values split out. I only need to analyse one of those forms (i.e. either a calculation that separates comma separated values, or one that works across the split columns).
I've also added a sample Tableau workbook here.
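For what it's worth, the tallying step itself (outside Tableau) is only a few lines in, for example, R; this is just to illustrate the split-and-count idea, with a made-up 'species' column:
df <- data.frame(species = c("cat, dog", "dog", "cat, mouse"), stringsAsFactors = FALSE)
terms <- trimws(unlist(strsplit(df$species, ",")))   # split comma separated values into single terms
table(terms)                                         # tally each unique term: cat 2, dog 2, mouse 1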

How to save a matrix where every element is a cell containing a 17-column MATLAB table?

Let's say I have data from a certain activity over 500 days. The duration of the activity varies across those 500 days, and every day's activity is 17 columns wide.
Each day's activity looks like this:
I created a .mat file called 'activity_database.mat' containing a 500 x 1 matrix of zeros, and then tried to do this in MATLAB:
clear
load 'activity_database.mat'
for v=1:500
% DO SOMETHING TO GET A TABLE
activity_data{v}=merged_table;
save('activity_database.mat','activity_data')
end
Now, after running the code, when I try to load activity_database.mat I receive the following error:
Error using load
Unable to read MAT-file C:\Users\jackryan\activity_database.mat. File might be corrupt.
What am I doing wrong here? Also, the database actually has 50,000 elements (about 30 GB), so I am expecting an out-of-space error too. Is there a way to store all this data within reasonable space bounds?
Instead of accumulating all the data in a single file, you could save one file per day, in a specified order. Something like:
first_date = datenum(2012, 12, 20);
db_folder = '//somewhere/over/the/rainbow/';
for v=1:500
%// DO SOMETHING TO GET A TABLE
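%// build a per-day file name from the date, e.g. activity_day_20121220.mat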
mat_name = sprintf('activity_day_%s.mat', datestr(first_date+v-1,'yyyymmdd'));
save(fullfile(db_folder,mat_name), 'merged_table');
end
You should not have problems with over-sized .mat files, and you can selectively load the data for particular days.