I'm using MATLAB to read in COVID-19 data provided by Johns Hopkins as a .csv file using urlread, but I'm not sure how to use textscan in the next step to convert the string into a table. The first two columns of the .csv file are strings specifying the region, followed by a large number of columns containing the registered number of infections by date.
Currently, I just save the string returned by urlread locally and open this file with importdata afterwards, but surely there should be a more elegant solution.
You have mixed up two things: either you want to read from the downloaded csv-file using `textscan` (and `fopen`/`fclose`, of course), or you want to use `urlread` (or rather `webread`, as MATLAB recommends not to use `urlread` anymore). I'll go with the latter, since I have never done this myself ^^
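For completeness, the file-based route would look roughly like this; this is only a minimal sketch, and 'confirmed.csv' is a made-up local file name:
fid = fopen('confirmed.csv', 'r');
head = fgetl(fid);                              % header line with the column names
rows = textscan(fid, '%s', 'Delimiter', '\n');  % remaining lines as raw strings
fclose(fid);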
So, first we read in the data and split it into rows:
url = "https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_19-covid-Confirmed.csv";
% read raw data as single character array
web = webread(url);
% split the array into a cell array representing each row of the table
row = strsplit(web,'\n');
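One caveat: if the downloaded text ends with a newline, strsplit leaves a trailing empty string in row, so it is safer to drop empty entries before going on:
% defensively drop empty rows (e.g. caused by a trailing newline)
row = row(~cellfun(@isempty, row));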
Then we allocate a table (pre-allocation is good practice in MATLAB, as it stores variables at consecutive addresses in RAM, so tell MATLAB beforehand how much space you need):
len = length(row);
% get the CSV-header as information about the number of columns
Head = strsplit(row{1},',');
% allocate table with len-1 rows (the first csv row is the header):
% two string columns followed by one numeric column per date
T = [table(strings(len-1,1),strings(len-1,1),'VariableNames',Head(1:2)),...
    repmat(table(NaN(len-1,1)),1,length(Head)-2)];
% rename columns of table
T.Properties.VariableNames = Head;
Note that I did a little trick to allocate that many separate columns of `NaN`s by repeating a single table. However, concatenating this table with the table of strings would be difficult if both contained the column names Var1 and Var2; that is why I named the columns of the first table right away.
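As a tiny illustration of that naming clash (to the best of my knowledge, repmat keeps table variable names unique by appending suffixes like Var1_1):
A = table(strings(3,1), strings(3,1), 'VariableNames', {'a','b'}); % named right away
B = repmat(table(NaN(3,1)), 1, 2);  % columns come out as Var1, Var1_1, ...
C = [A B];                          % works, because no variable names clash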
Now we can actually fill the table (which is a bit nasty, thanks to whoever found it nice to write `Korea, South` into a comma-separated file):
for i = 2:len
    % split this row into columns
    col = strsplit(row{i},',');
    % quick conversion; non-numeric fields become NaN
    num = str2double(col);
    % keep the fields where the result is NaN as strings
    lg = isnan(num);
    str = cellfun(@string,col(lg));
    T{i-1,1} = str(1);
    T{i-1,2} = strjoin(str(2:end)); % nasty workaround necessary due to "Korea, South"
    T{i-1,3:end} = num(~lg);
end
This should also work for the days that are yet to come. Let me know what you actually end up doing with the data!
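For what it's worth, a shorter route might be to save the response to a temporary file and let readtable do the parsing, which should also cope with the quoted "Korea, South" field; this is just an untested sketch:
% alternative sketch: download to a temp file and parse with readtable
tmp = [tempname '.csv'];
websave(tmp, url);    % url as defined above
T2 = readtable(tmp);  % readtable respects quoted fields
delete(tmp);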
I'm trying to write a Python wrapper which is designed to read multiple records from CSV (or MySQL) and update the values of a predefined range of a sheet. The range consists of value cells and formula cells; my purpose is to update the value cells only and keep the formula cells in the range unchanged.
To do this, I first tried setting the cell values one by one, using an if to skip formula cells, but that was too slow because there were more than ten thousand cells. Then I tried setDataArray, which was fast enough, but the formulas were overridden by values. Finally, I created an array, set values and formulas into it, and used setFormulaArray to put them into the range; that did what I needed, but it took more than a minute to finish.
I know that setFormulaArray will update the formulas, but I don't need that to happen; however, as there is no option in the API to skip formulas, I can only use the same formula to update the original formula in a given cell.
Is there any way to improve the performance of setFormulaArray, or any way to update only the value cells and skip the formula cells in a range?
Below is my code:
import uno
import time

# connect to a running LibreOffice instance and open the Calc file
local_ctx = uno.getComponentContext()
smgr_local = local_ctx.ServiceManager
resolver = smgr_local.createInstanceWithContext("com.sun.star.bridge.UnoUrlResolver", local_ctx)
url = "uno:socket,host=localhost,port=2083,tcpNoDelay=1;urp;StarOffice.ComponentContext"
uno_ctx = resolver.resolve(url)
uno_smgr = uno_ctx.ServiceManager
desktop = uno_smgr.createInstanceWithContext("com.sun.star.frame.Desktop", uno_ctx)
PropertyValue = uno.getClass('com.sun.star.beans.PropertyValue')
inProps = (PropertyValue("Hidden", 0, True, 0),)  # loadComponentFromURL expects a sequence of properties
document = desktop.loadComponentFromURL("file:///D:/salse.ods", "_blank", 0, inProps)
# get the sheet and read the original data and formulas from it
sheets = document.getSheets()
xs = sheets["salse"]
cursor = xs.createCursor()
cursor.gotoStartOfUsedArea(False)
cursor.gotoEndOfUsedArea(True)
cra = cursor.getRangeAddress()
rng = xs.getCellRangeByPosition(cra.StartColumn, cra.StartRow, cra.EndColumn, cra.EndRow)
ft = rng.getFormulaArray()
# some code here to change values in ft ...
# the call below took more than a minute
rng.setFormulaArray(ft)
This is only a test of setFormulaArray performance.
Update 1:
I still can't find the ideal solution. However, my workaround is to create an identical sheet for each of the sheets in the workbook and, in the original sheets, add references to the newly created sheets in exactly the same cells. For example:
sheetA is the original sheet
sheetA-1 is the newly created one
then modify the data cells in sheetA to reference sheetA-1 (e.g. cell A1 in sheetA gets the formula ='sheetA-1'.A1)
then use setDataArray to update the entire range in sheetA-1; the data automatically changes in the corresponding cells of sheetA
I want to convert the second column of table T using datenum.
The elements of this column are '09:30:31.848', '15:35:31.325', etc. When I use datenum('09:30:31.848','HH:MM:SS.FFF') everything works, but when I try to apply datenum to the whole column it doesn't work. I tried the command datenum(T(:,2),'HH:MM:SS.FFF') and I receive this error message:
"The input to DATENUM was not an array of character vectors"
Thank you
You are not accessing the data in the table, but rather a slice of the table (so it stays a table). Refer to the data in the table using T.colName:
times_string = ['09:30:31.848'; '15:35:31.325'];
T = table(times_string)
times_num = datenum(T.times_string, 'HH:MM:SS.FFF')
Alternatively, you can slice the table using curly braces to extract the data (if you want to use the column number instead of name):
times_num = datenum(T{:,2}, 'HH:MM:SS.FFF')
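To see why the parenthesized version fails, compare the classes of the two slices; a minimal sketch, assuming a table whose second column holds the time strings:
class(T(:,2))  % 'table' - parentheses return a one-column sub-table
class(T{:,2})  % 'char' (or 'cell') - curly braces return the underlying data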
I have some corrupted rows in my large CSV file where some data values get shifted due to missing line breaks. This results in values appearing under the wrong column header. For example, if three columns exist in my table, then after corruption I start to see values pushed into the wrong columns.
Is there a way for me to drop all rows where, for example, I see a non-int in a column that I know should, in fact, be an Int?
What you can do is loop through the lines, and when line.split(",").length doesn't equal what you want, filter that line out. The same filter can also check that a given column parses as an Int (the column index 0 below is an assumption). Something like this:
import scala.io.Source
import scala.util.Try
val n = 5 // or however many columns you require
Source.fromFile(input_file).getLines().map(_.split(","))
  .filter(cols => cols.length == n && Try(cols(0).toInt).isSuccess)
This should do what you want :)
MATLAB 2015b
I have several large (100-300 MB) csv files that I want to merge into one while filtering out some of the columns. They are shaped like this:
timestamp | variable1 | ... | variable200
01.01.16 00:00:00 | 1.59 | ... | 0.5
01.01.16 00:00:01 | ...
.
.
For this task I am using a datastore class including all the csv files:
ds = datastore('file*.csv');
When I read all of the entries and try to write them back to a csv file using writetable, I get an error that the input has to be a cell array.
When looking at the cell array read from the datastore in debug mode, I noticed that there are several rows containing only a timestamp, which are not in the original files. These rows sit between the last row of one file and the first rows of the following one. The timestamps of these rows are the logical continuation of the last timestamp (as you would get them in Excel).
Is this a bug or intended behaviour?
Can I avoid reading these rows in the first place, or do I have to filter them out afterwards?
Thanks in advance.
As it seems nobody else had this problem, I will share how I dealt with it in the end:
toDelete = strcmp(data.(2), '');
data(toDelete, :) = [];
I took the second column of the table and checked for empty strings. Afterwards I removed all faulty rows via logical indexing by assigning an empty array (as shown in the MATLAB documentation).
Sadly, I found no way to prevent loading the faulty rows in the first place, but in the end the amount of data was not too big to do this processing step in memory.
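Putting it together, the whole post-processing step looked roughly like this; readall and the output file name are my assumptions:
ds = datastore('file*.csv');
data = readall(ds);               % read all files into one table
toDelete = strcmp(data.(2), '');  % the spurious rows have an empty second column
data(toDelete, :) = [];           % drop them via logical indexing
writetable(data, 'merged.csv');   % 'merged.csv' is a hypothetical output name
For the column-filtering part of the question, setting ds.SelectedVariableNames before reading should restrict which columns the datastore loads at all.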
I have a .csv file and I can't read it on Octave. On R I just use the command below and everything is read alright:
myData <- read.csv("myData.csv", stringsAsFactors = FALSE)
However, when I go to Octave, it doesn't read the file properly with the command below:
myData = csvread('myData.csv',1,0);
When I open the file with Notepad, the data looks something like the text below. Note there isn't a comma separating the last column name (i.e. Column3) from the first value (i.e. Value1), and the same thing happens between the last value of the first row (i.e. Value3) and the first value of the second row (i.e. Value4):
Column1,Column2,Column3Value1,Value2,Value3Value4,Value5,Value6
Column1 is meant for date values (with format yyyy-mm-dd hh:mm:ss); I don't know if that has anything to do with the problem.
Alex's answer already explains why csvread does not work for your case: that function only reads numeric data and returns an array. Since your fields are all strings, you need something that reads a csv file into a cell array.
That function is named csv2cell and is part of the io package.
As a separate note, if you plan to do operations on those dates, you may want to convert the date strings into serial date numbers. This will let you keep your dates in a numeric array, which allows for faster operations and reduced memory usage. Also, the financial package has many functions for dealing with dates.
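A minimal sketch of that route, assuming the io package is installed (e.g. via pkg install -forge io) and the file layout described in the question:
pkg load io
myData = csv2cell('myData.csv'); % cell array, header included as row 1
% convert the first column's date strings to serial date numbers
dates = datenum(myData(2:end,1), 'yyyy-mm-dd HH:MM:SS');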
csvread only reads numeric data, so a date does not qualify unfortunately.
In Octave you might want to check out the dataframe package; in MATLAB you would use readtable.
Otherwise there are also more primitive functions you can use like textscan.
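For example, a rough textscan version for the three-column layout above might look like this (the column types are assumptions):
fid = fopen('myData.csv', 'r');
C = textscan(fid, '%s %s %s', 'Delimiter', ',', 'HeaderLines', 1);
fclose(fid);
% C{1} now holds the date strings, C{2} and C{3} the remaining columns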