Trailing rows in datastore with multiple csv files - matlab

Matlab 2015b
I have several large (100-300 MB) CSV files that I want to merge into one, filtering out some of the columns along the way. They are shaped like this:
timestamp | variable1 | ... | variable200
01.01.16 00:00:00 | 1.59 | ... | 0.5
01.01.16 00:00:01 | ...
.
.
For this task I am using a datastore that covers all the CSV files:
ds = datastore('file*.csv');
When I read all of the entries and try to write them back to a CSV file using writetable, I get an error saying that the input has to be a cell array.
When looking at the cell array read from the datastore in debug mode, I noticed several rows containing only a timestamp, which are not in the original files. These rows sit between the last row of one file and the first row of the following one. Their timestamps are the logical continuation of the last timestamp (as you would get by auto-filling in Excel).
Is this a bug or intended behaviour?
Can I avoid reading these rows in the first place, or do I have to filter them out afterwards?
Thanks in advance.

As it seems nobody else had this problem, I will share how I dealt with it in the end:
toDelete = strcmp(data.(2), '');
data(toDelete, :) = [];
I took the second column of the table and checked it for empty strings. Then I deleted all faulty rows by assigning an empty array to them via logical indexing (as shown in the MATLAB documentation).
Sadly I found no way to prevent the faulty rows from being loaded, but in the end the amount of data was not too big to do this processing step in memory.
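To make the filtering step easy to try out, here is a minimal sketch on a tiny hand-made table. The table contents and the variable names timestamp and variable1 are made up for illustration; they only imitate what the datastore returned in my case, with the second column read as text and left empty in the faulty rows.

```matlab
% Made-up stand-in for the table read from the datastore (assumption):
% the middle row imitates a trailing row that contains only a timestamp.
timestamp = {'01.01.16 00:00:00'; '01.01.16 00:00:01'; '01.01.16 00:00:02'};
variable1 = {'1.59'; ''; '0.5'};
data = table(timestamp, variable1);

% Mark rows whose second column is an empty string ...
toDelete = strcmp(data.(2), '');
% ... and delete them by assigning an empty array via logical indexing.
data(toDelete, :) = [];
```

After this, data only contains the rows that have a value in the second column, and it is still a table.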

Related

How to convert a string to a table with `textscan`?

I'm using matlab to read in COVID-19 data provided by Johns Hopkins as a .csv-file using urlread, but I'm not sure how to use textscan in the next step in order to convert the string into a table. The first two columns of the .csv-file are strings specifying the region, followed by a large number of columns containing the registered number of infections by date.
Currently, I just save the string returned by urlread locally and open this file with importdata afterwards, but surely there should be a more elegant solution.
You have mixed up two things: either you want to read from the downloaded CSV file using `textscan` (and `fopen`, `fclose` of course), or you want to use `urlread` (or rather `webread`, as MATLAB recommends not using `urlread` anymore). I'll go with the latter, since I have never done this myself^^
So, first we read in the data and split it into rows:
url = "https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_19-covid-Confirmed.csv";
% read raw data as single character array
web = webread(url);
% split the array into a cell array representing each row of the table
row = strsplit(web,'\n');
Then we allocate a table (pre-allocation is good practice in MATLAB: it stores arrays in contiguous memory, so telling it beforehand how much space you need avoids repeated re-allocation while the table grows):
len = length(row);
% get the CSV-header as information about the number of columns
Head = strsplit(row{1},',');
% allocate table
S = strings(len,2);
N = NaN(len,length(Head)-2);
T = [table(strings(len,1),strings(len,1),'VariableNames',Head(1:2)),...
repmat(table(NaN(len,1)),1,length(Head)-2)];
% rename columns of table
T.Properties.VariableNames = Head;
Note that I used a little trick to allocate that many separate columns of `NaN`s: repeating a single table. However, concatenating this table with the table of strings is difficult, as both would contain the default column names Var1 and Var2. That is why I renamed the columns of the first table right away.
Now we can actually fill the table (which is a bit nasty thanks to whoever thought it was a good idea to write `Korea, South` into a comma-separated file):
for i = 2:len
% split this row into columns
col = strsplit(row{i},',');
% quick conversion
num = str2double(col);
% keep strings where the result is NaN
lg = isnan(num);
str = cellfun(@string,col(lg));
T{i,1} = str(1);
T{i,2} = strjoin(str(2:end));% this is a nasty workaround necessary due to "Korea, South"
T{i,3:end} = num(~lg);
end
This should also work for the days that are yet to come. Let me know what you are actually going to do with the data.
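Since the question asked about `textscan` specifically: `textscan` also accepts a character vector instead of a file identifier, and the `%q` specifier reads a double-quoted field as one piece, commas included. Below is a minimal sketch, assuming that fields containing commas are wrapped in double quotes. The sample text is made up to stand in for what `webread` returns and has only two date columns; the names body, region and counts are just for illustration.

```matlab
% Made-up sample standing in for the text returned by webread (assumption)
web = sprintf(['Province/State,Country/Region,1/22/20,1/23/20\n', ...
               'Hubei,China,10,20\n', ...
               'Somewhere,"Korea, South",1,2\n']);

% Drop the header line by hand: find the first line break and cut there
firstBreak = find(web == sprintf('\n'), 1);
body = web(firstBreak+1:end);

% Two (possibly quoted) text columns followed by two numeric columns
C = textscan(body, '%q %q %f %f', 'Delimiter', ',');

region = C{2};           % cell array of character vectors
counts = [C{3}, C{4}];   % numeric matrix, one row per region
```

For the real file the format string would need one `%f` per date column, which could be built with something like `['%q %q' repmat(' %f', 1, numDateColumns)]`, where numDateColumns is a hypothetical count taken from the header line.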

Extract specific values from cells from a CSV

I have to combine a lot of files, mostly CSV. I already have code to combine them, but first I need to trim the CSV files so that I get only the data I want. Each CSV starts with 8 rows of 2 columns that contain the data I want, and just below those there is a row that introduces 8 columns. I am only having trouble grabbing data from the first 8 rows of the 2 columns.
Example of the csv first 3 rows:
Target Name: MIAW
Target OS: Windows
Last Updated On: June 27 2019 07:35:11
This is the data that I want. The first 3 rows look like this, with 2 columns. My idea is to store each of the 3 values of the 2nd column in its own variable and then use them in the rest of my code.
My only problem is extracting the data: because of the way the CSVs are formatted there is no header at all, so it is hard to come up with an easy way to read the 2nd column. Below is an example. This will of course be used to process several files, so it will end up in a foreach, but I first want to come up with the simple code for 1 file so I can adapt it to a foreach myself.
$a = MIAW-Results-20190627T203644Z.csv
Write-Host "$a[1].col2"
This would work if and only if I had a header called col2. I could name the column after the first value in the 2nd column, but that value changes for each CSV file. So the code I tried would not work if, for example, I were to import several files using:
$InFiles = Get-ChildItem -Path $PSScriptRoot\ *.csv -File |
Where-Object Name -like *results*
Each CSV will have a different value for the first value on the 2nd column.
Is there an easier way to just grab the 3 rows of the second column that I need? I need to grab each one and store each in a different variable.

loading multiple non-CSV tables to R, and perform a function on each file.

First day on R. I may be expecting too much from it but here is what I'm looking for:
I have multiple files (140 tables), and each table has two columns (V1 = values, V2 = frequencies). I use the following code to get the average from each table:
sum(V1*V2)/sum(V2)
I was wondering if it's possible to do this once instead of 140 times, i.e. to load all files and get an exported file that shows the average of each table next to the original file name.
I use read.table to load the files, as read.csv doesn't work well for some reason.
I'd appreciate any input!

How to save a matrix whose every element is a cell containing a 17 column MATLAB table?

Let's say I have data from a certain activity over 500 days. The duration of the activity varies over those 500 days, and every day's activity is 17 columns wide.
Everyday activity looks like this:
I created a (500 X 1) mat file of zeros called 'activity_database.mat' and then I tried to do this in MATLAB:
clear
load 'activity_database.mat'
for v=1:500
% ... do something to get a table (merged_table) ...
activity_data{v}=merged_table;
save('activity_database.mat','activity_data')
end
Now, after running the code, when I try to load activity_database.mat, I receive the following error:
Error using load
Unable to read MAT-file C:\Users\jackryan\activity_database.mat. File might be corrupt.
What am I doing wrong here? Also, the database actually has 50000 elements, so I am expecting an out-of-space error too (about 30 GB). Is there a way to store all this data within reasonable space bounds?
Instead of accumulating all the data in a single file, you could save one file per day, in a specified order. Something like:
first_date = datenum(2012, 12, 20);
db_folder = '//somewhere/over/the/rainbow/';
for v=1:500
%// DO SOMETHING TO GET A TABLE
mat_name = sprintf('activity_day_%s.mat', datestr(first_date+v-1,'yyyymmdd'));
save(fullfile(db_folder,mat_name), 'merged_table');
end;
This way you should not run into problems with over-sized .mat files, and you can selectively load the data for the days you need.
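Loading a single day back could then look roughly like the sketch below. It is self-contained for illustration only: it writes three tiny made-up tables into a temporary folder (standing in for db_folder) and reads one of them back; the column name x and the variable names wanted_date and day_table are assumptions.

```matlab
% Temporary folder standing in for the real database folder (assumption)
db_folder = tempname;
mkdir(db_folder);
first_date = datenum(2012, 12, 20);

% Write one small made-up table per day, using the naming scheme from above
for v = 1:3
    merged_table = table((1:v)', 'VariableNames', {'x'});
    mat_name = sprintf('activity_day_%s.mat', datestr(first_date+v-1, 'yyyymmdd'));
    save(fullfile(db_folder, mat_name), 'merged_table');
end

% Load only the day of interest; load returns a struct with one field per variable
wanted_date = datenum(2012, 12, 21);
mat_name = sprintf('activity_day_%s.mat', datestr(wanted_date, 'yyyymmdd'));
S = load(fullfile(db_folder, mat_name), 'merged_table');
day_table = S.merged_table;
```

Only the requested file is read from disk, so memory use depends on the days you actually load rather than on the size of the whole database.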

Read csv file excluding first column and first line

I have a csv file containing 8 lines and 1777 columns.
I need to read all the contents in MATLAB, excluding the first line and the first column. The first line and first column contain strings, and MATLAB can't parse them.
Do you have any idea?
data = csvread(filepath);
The code above reads all the contents
As suggested, csvread with a range will read in the numeric data. If you would like to read in the strings as well (which are presumably column headers), you can use readtable:
t = readtable(filepath);
This will create a table with the column headers in your file as variable names of the columns of the table. This way you can keep the strings associated with the data, if need be.
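If the first column holds row labels, readtable can also be told to treat it as row names, which leaves only the numeric part as table variables. Here is a minimal sketch, assuming a small made-up file written to a temporary location; the header names and values are invented for the example.

```matlab
% Write a small made-up CSV with a header line and a label column (assumption)
filepath = [tempname '.csv'];
fid = fopen(filepath, 'w');
fprintf(fid, 'name,a,b,c\n');
fprintf(fid, 'row1,1,2,3\n');
fprintf(fid, 'row2,4,5,6\n');
fclose(fid);

% First line becomes the variable names, first column becomes the row names
t = readtable(filepath, 'ReadRowNames', true);

% Plain numeric matrix without the header line and the label column
data = table2array(t);
```

The labels stay available through t.Properties.RowNames and t.Properties.VariableNames in case they are needed later.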