I want to import a CSV file (data.csv) which is structured as follows:
Area,MTU,Biomass MW,Lignite MW
CTA|DE(50Hertz),"01.01.2015 00:00 - 01.01.2015 00:15 (CET)","1227","6485"
CTA|DE(50Hertz),"01.01.2015 00:15 - 01.01.2015 00:30 (CET)","","6421"
CTA|DE(50Hertz),"01.01.2015 00:30 - 01.01.2015 00:45 (CET)","1230","n/e"
CTA|DE(50Hertz),"01.01.2015 00:45 - 01.01.2015 01:00 (CET)","N/A","6299"
I've tried to use textscan, which works pretty well, but the process stops once the empty quotation marks are reached. The argument 'EmptyValue',0 does not work.
file = fopen('data.csv');
data = textscan(file, '%q %q "%f" "%f"', 'Delimiter', ',',...
'headerlines', 1,'TreatAsEmpty', {'N/A', 'n/e'}, 'EmptyValue',0);
fclose(file);
Any idea on how to import the whole file?
textscan(file,'%q%q%q%q%[^\n\r]','Delimiter',',','headerlines',1);
worked just fine for me. You get values like "01.01.2015 00:00 - 01.01.2015 00:15 (CET)", but those are trivial to write a separate parser for. Don't try to do it all in one step; that will cause you much pain and suffering. Break it up into simple single steps.
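For instance, here is a rough sketch of that follow-up parsing step (the variable names and the str2double/datetime handling are my own assumptions, not tested against your full file):
file = fopen('data.csv');
raw = textscan(file, '%q%q%q%q%[^\n\r]', 'Delimiter', ',', 'headerlines', 1);
fclose(file);
% str2double turns '', 'N/A' and 'n/e' into NaN; map those to 0
biomass = str2double(raw{3}); biomass(isnan(biomass)) = 0;
lignite = str2double(raw{4}); lignite(isnan(lignite)) = 0;
% interval start time from the MTU column, e.g. '01.01.2015 00:00 - ...'
starts = datetime(cellfun(@(s) s(1:16), raw{2}, 'UniformOutput', false), ...
    'InputFormat', 'dd.MM.yyyy HH:mm');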
Also, I highly recommend right-clicking your file in the "Current Folder" window in MATLAB, then selecting "Import Data". This makes importing CSV (or tab-separated, or fixed-width) data files trivial.
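If you would rather stay programmatic, readtable should also cope with this file in one call; a sketch, assuming readtable's documented 'TreatAsEmpty' option and that it sanitizes the headers to BiomassMW and LigniteMW:
T = readtable('data.csv', 'Delimiter', ',', 'TreatAsEmpty', {'N/A', 'n/e'});
% empty fields and the placeholders come back as NaN; zero them if needed
T.BiomassMW(isnan(T.BiomassMW)) = 0;
T.LigniteMW(isnan(T.LigniteMW)) = 0;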
I am trying to save a csv generated from a table.
If I 'Export all as CSV' from QPAD the file is 22MB.
If I do `:path.csv 0: csv 0: table the file is 496MB.
The file contains same data.
I do have some columns which are lists of dates and lists of symbols, which cause some issues when saving to csv.
To get over that I use this {`$$[1=count x;string first x;`$" "sv string x]}
i.e. one of the cols is called allDates and looks like this:
someOtherCol allDates              stackedSymCol
------------------------------------------------
val1         ,2001.01.01           ,`sym1
val2         2001.01.01 2001.01.02 `sym2`sym3
Where is this massive difference in size coming from, and how can I reduce the size?
If I remove these 3 columns which are lists of lists, the file goes down significantly.
Doing an ungroup is not an option.
I think the important question here is why QPAD is capable of handling columns which are lists of lists of type 'D', 'S', etc., and how I can achieve that without casting those columns to a space-delimited string. That casting is what is causing my saved csv to be so massive.
i.e. I can do an 'Export all to csv' from QPAD on this and it is 21MB:
but if I want to save it programmatically, I need to change those allDates and DESK_NAME columns, and it goes up to 500MB.
UPDATE: Thanks everyone. I did not know that QPAD is truncating data like that on exports. That is worrying.
These csvs will not be identical. QPad truncates nested lists (including strings). The csv exported directly from kdb will be complete.
E.g.
([]a:3#enlist til 1000;b:3#enlist til 1000)
The QPad csv export of this looks like this at the end: 30j, 31j ....
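By way of illustration, here is a sketch of getting the complete csv from q for that same table (stringifying the nested columns first; the file name and the inline lambda are mine):
q)t:([]a:3#enlist til 1000;b:3#enlist til 1000)
q)`:full.csv 0: csv 0: update a:{" "sv string x}'[a], b:{" "sv string x}'[b] from t
q)count read0 `:full.csv   / 4 lines: header plus 3 complete rows, nothing truncated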
Based on the update to your question, it seems you are exporting the data shown in the screenshot, which would not be the same as the data you are transforming to save to csv directly from q.
Based on the screenshot it is likely the csv files are not identical for at least 3 reasons:
QPad is truncating the nested dates at a certain length
QPad adds enlist to nested lists of length 1
QPad adds/keeps backticks before symbols
Example data comparison
Here is a minimal example that should highlight this:
q)example:{n:1 2 20;([]someOtherCol:3?10;allDates:n?\:.z.d;stackedSymCol:n?\:`3)}[]
q)example
someOtherCol allDates              stackedSymCol
------------------------------------------------
1            ,2006.01.13           ,`hfg
1            2008.04.06 2008.01.11 `nha`plc
4            2009.06.12 2016.01.24 2021.02.02 2018.09.02 2011.06.19 2022.09.26 2008.10.29 2010.03.11 2022.07.30 2012.09.06 2021.11.27 2017.11.24 2007.09.10 2012.11.27 2020.03.10 2003.07.02 2007.11.29 2010.07.18 2001.10.23 2000.11.07 `ifd`jgp`eln`kkb`ahm`cal`eni`idj`mod`omb`dkc`ogf`eaj`mbf`kdd`hip`gkg`eef`edi`jak
I have used 'Export All to CSV' to save to C:/q/qpad.csv.
I couldn't get your "razing" function to work as-is, so I modified it slightly and used that to convert nested lists to strings and saved the file to csv.
q)f:{`$$[1=count x;string first x;" "sv string x]}
q)`:C:/q/q.csv 0: csv 0: update f'[allDates], f'[stackedSymCol] from example
Reading from both files and comparing them shows that the contents do not match.
q)a:read0`:C:/q/q.csv
q)b:read0`:C:/q/qpad.csv
q)a~b
0b
Side note
Since kdb+ V4.0 2020.03.17 it is possible to save nested vectors to csv using .h.cd to prepare the text. The variable .h.d is used as the delimiter for sublist items.
q).h.d:" ";
q).h.cd example
"someOtherCol,allDates,stackedSymCol"
"8,2013.09.10,pii"
"6,2007.08.09 2012.12.30,hbg blg"
"8,2011.04.04 2020.08.21 2006.02.12 2005.01.15 2016.05.31 2015.01.03 2021.12.09 2022.03.26 2013.10.15 2001.10.29 2011.02.17 2010.03.28 2005.11.14 2003.08.16 2002.04.20 2004.08.07 2014.09.19 2000.05.24 2018.06.19 2017.08.14,cim pgm gha chp dio gfc beh mbo cfe kec jbn bjh eni obf agb dce gnk jif pci ppc"
q)`:somefile.csv 0: .h.cd example
CSV saved from q
Contents of the csv saved from q and the character count are shown in the example:
q)read0`:C:/q/q.csv
"someOtherCol,allDates,stackedSymCol"
"8,2013.09.10,pii"
"6,2007.08.09 2012.12.30,hbg blg"
"8,2011.04.04 2020.08.21 2006.02.12 2005.01.15 2016.05.31 2015.01.03 2021.12.09 2022.03.26 2013.10.15 2001.10.29 2011.02.17 2010.03.28 2005.11.14 2003.08.16 2002.04.20 2004.08.07 2014.09.19 2000.05.24 2018.06.19 2017.08.14,cim pgm gha chp dio gfc beh mbo cfe kec jbn bjh eni obf agb dce gnk jif pci ppc"
q)count raze read0`:C:/q/q.csv
383
CSV saved from QPad
Similarly the contents of the csv saved from QPad and the character count:
q)read0`:C:/q/qpad.csv
"someOtherCol,allDates,stackedSymCol"
"1,enlist 2006.01.13,enlist `hfg"
"1,2008.04.06 2008.01.11,`nha`plc"
"4,2009.06.12 2016.01.24 2021.02.02 2018.09.02 2011.06.19 2022.09.26 2008.10.29 2010.03.11 2022.07.30 2012.09.06 2021.11.27 2017.11.24 2007.09.10 2012.11.27 ...,`ifd`jgp`eln`kkb`ahm`cal`eni`idj`mod`omb`dkc`ogf`eaj`mbf`kdd`hip`gkg`eef`edi`jak"
q)count raze read0`:C:/q/qpad.csv
338
Conclusion
We can see from these examples the points outlined above. The dates are truncated at a certain length, enlist is added to nested lists of length 1, and backticks are kept before symbols.
The truncated dates could be why the file you exported from QPad is so much smaller. Based on your comments above, the files are not identical, so this is the likely cause.
TL;DR - Both files are created differently and that's why they differ.
I have a .csv file with the first column containing dates, a snippet of which looks like the following:
date,values
03/11/2020,1
03/12/2020,2
3/14/20,3
3/15/20,4
3/16/20,5
04/01/2020,6
I would like to import this data into MATLAB (I think the best way would probably be using the readtable() function, see here). My goal is to bring the dates into MATLAB as a datetime array. As you can see above, the problem is that the dates in the original .csv file are not consistently formatted. Some of them are in the format mm/dd/yyyy and some of them are mm/dd/yy.
Simply calling data = readtable('myfile.csv') on the .csv file results in the following, which is not correct:
'03/11/2020' 1
'03/12/2020' 2
'03/14/0020' 3
'03/15/0020' 4
'03/16/0020' 5
'04/01/2020' 6
Does anyone know a way to automatically account for this type of data in the import?
Thank you!
My version: MATLAB R2017a
EDIT ---------------------------------------
Following the suggestion of Max, I have tried specifying some of the input options for the read command using the following:
T = readtable('example.csv',...
'Format','%{dd/MM/yyyy}D %d',...
'Delimiter', ',',...
'HeaderLines', 0,...
'ReadVariableNames', true)
which results in:
date values
__________ ______
03/11/2020 1
03/12/2020 2
NaT 3
NaT 4
NaT 5
04/01/2020 6
and you can see that this is not working either.
If you are sure all the dates involved do not go back more than 100 years, you can easily apply the pivot method which was in use in the last century (before the Y2K bug warned the world of the danger of the method).
They used to code dates in 2 digits only, knowing that 87 actually meant 1987. A user (or a computer) would add the missing century automatically.
In your case, you can read the full table, parse the dates, then it is easy to detect which dates are inconsistent. Identify them, correct them, and you are good to go.
With your example:
a = readtable('myfile.csv') ; % read the file
dates = datetime(a.date) ; % extract first column and convert to [datetime]
idx2change = dates.Year < 2000 ; % Find which dates were in short format
dates.Year(idx2change) = dates.Year(idx2change) + 2000 ; % Correct truncated years
a.date = dates % reinject corrected [datetime] array into the table
yields:
a =
date values
___________ ______
11-Mar-2020 1
12-Mar-2020 2
14-Mar-2020 3
15-Mar-2020 4
16-Mar-2020 5
01-Apr-2020 6
Instead of specifying the format explicitly (as I also suggested before), one should use import options; in the case of a csv file, use delimitedTextImportOptions:
opts = delimitedTextImportOptions('NumVariables',2,... % how many variables per row?
'VariableNamesLine',1,... % is there a header? If yes, in which line are the variable names?
'DataLines',2,... % in which line does the actual data start?
'VariableTypes',{'datetime','double'}) % as what data types should the variables be read?
readtable('myfile.csv',opts)
because the neat little feature recognizes the format of the datetime automatically, as it knows that it must be a datetime-object =)
I am trying to read a text file with data according to a specific format, using textscan together with a string containing the format to read the whole data set in one line of code. I've found how to read the whole line with fgetl, but I would like to use as few code lines as possible, so I want to avoid my own for loops. textscan seems great for that.
As an example I'll include a part of my code which reads five strings representing a modified dataset, its heritage (name of old dataset), the date and time of the modification and lastly any comment.
fileID = fopen(filePath,'r+');
readContentFormat = '%s = %s | %s %s | %s';
content = textscan(fileID, readContentFormat, 'CollectOutput', 1);
This works for the time being if the comment doesn't have any delimiters (like a white space) in it. However, I would like to be able to write comments at the end of the line.
Is there a way to use textscan and let it know that I want to read the rest of a line as one string/character array (including any white spaces)? I am hoping for something to put in my variable readContentFormat, instead of that last %s. Or is there another method which does not involve looping through each row in the file?
Also, even though my data is very limited I am keen to know any pros or cons with different methods regarding computational efficiency or stability. If you know something you think is worth sharing, please do so.
One way that is satisfactory to me (but please share any other methods anyway!) is to set the delimiters to characters other than white space, and trim away any leading or trailing white spaces with strtrim. This seemed to work well, but I have no idea how demanding the computations are.
Example:
The text file 'testFile.txt' in the current folder has the following lines
File |Heritage |Date and time |Comment
file1.mat | oldFile1.mat | 2018-03-01 14:26:00 | -
file2.mat | oldFile2.mat | 2018-03-01 13:26:00 | -
file3.mat | oldFile3.mat | 2018-03-01 12:26:00 | Time for lunch!
The following code will read the data and put it into a cell array without leading or trailing white spaces, with few lines of code. Neat!
function contentArray = myfun()
fileID = fopen('testFile.txt','r');
content = textscan(fileID, '%s%s%s%s','Delimiter', {'|'},'CollectOutput', 1);
contentArray = strtrim(content{1}(2:4,:));
fclose(fileID);
end
The output:
contentArray =
3×4 cell array
'file1.mat' 'oldFile1.mat' '2018-03-01 14:26:00' '-'
'file2.mat' 'oldFile2.mat' '2018-03-01 13:26:00' '-'
'file3.mat' 'oldFile3.mat' '2018-03-01 12:26:00' 'Time for lunch!'
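As for the format-string route asked about above: textscan's %[^\n] conversion reads the remainder of a line, white space included, so it can stand in for that last %s. A sketch against the question's original '=' / '|' layout (not the pipe-only file above; untested against the exact file):
fileID = fopen(filePath, 'r');
readContentFormat = '%s = %s | %s %s | %[^\n]';  % %[^\n] slurps the trailing comment
content = textscan(fileID, readContentFormat, 'CollectOutput', 1);
fclose(fileID);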
Just starting with Neuroph NN GUI. Trying to create a dataset by importing a .csv file. What's the file format supposed to be?
I have 3 inputs and 1 output, so I assumed the format of the import file would be:
1,2,3,4
6,7,8,9
But I get error 9, 4, or 10, depending on what combination of newlines, commas, etc. I try.
Any help out there ?
many thanks,
john.
That's because you aren't counting the output column. The last columns are for the output.
So, for example, if you have 10 inputs and 1 output, your file will need to have 11 columns.
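For instance, a single training row for 10 inputs and 1 output would look like this (arbitrary values; the 11th column is the output):
0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9,1.0,0.55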
I came here because Neuroph can't import CSVs with a title line. Example of a data file that works for me:
1.0,1.0,1.0
1.0,2.0,2.0
1.0,3.0,3.0
1.0,4.0,4.0
1.0,5.0,5.0
1.0,6.0,6.0
1.0,7.0,7.0
1.0,8.0,8.0
1.0,9.0,9.0
1.0,10.0,10.0
2.0,1.0,2.0
2.0,2.0,4.0
2.0,3.0,6.0
2.0,4.0,8.0
2.0,5.0,10.0
2.0,6.0,12.0
2.0,7.0,14.0
2.0,8.0,16.0
2.0,9.0,18.0
2.0,10.0,20.0
I'm trying to import data into MATLAB from a txt file which is about 1e6 lines long.
The text is as follows:
[04 05 11 12] jiffies=100
[04 06 15 09] jiffies=3455
.
.
.
[00 02 07 07] jiffies=111200
I've managed to extract the first two numbers (which I need) without using a loop;
now I want to read only the number after "jiffies=". If I try to use the same method,
textscan(fid,'%s','delimiter', 'jiffies=')
but it's not working. Any method without using loops?
You can skip values by using a star (*).
The complete formatting string for all data in your file is
'[%d %d %d %d] jiffies=%d'
To skip all the numbers in the front, simply put a star between % and d.
C = textscan(fid,'[%*d %*d %*d %*d] jiffies=%d');
which returns
C{1}
ans =
100
3455
111200
If you want to import all of the data at once, you can open this file in the Import Tool:
uiimport(filename);
This will correctly remove the brackets and the 'jiffies=' text, and can return a numeric array of all the numbers. You can also generate code from the Import Tool (click on the Import Selection dropdown, then Generate script or function). This may prove to handle any errors in the file better than using textscan directly.