Reading large csv files with strings containing commas as one field - matlab

I have a large .csv file (~26000 rows) that I want to read into MATLAB. The complication is that one of the fields contains a collection of strings delimited by commas.
I'm having trouble reading it. I tried things like tdfread, which won't work here. Are there any tricks with textscan I should be aware of?
Is there any other way?

I'm not sure what is generating your CSV file, but that is your problem.
The point of a CSV file is that the file itself designates the separation of fields. If the text in a field contains unqualified commas, then nothing you do on the reading side will help you. How would ANY program know whether a comma is part of the text in a single field or a field delimiter?
Proper CSV would have a text qualifier. Some generators/readers give you the option to use one. The standard text qualifier is a " (double quote). It's changeable, though, because your text may contain those, too.
Again, it's all about generating proper CSV content.
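To illustrate the text-qualifier point, here is a minimal sketch. It assumes the file can be regenerated with double quotes around the comma-containing field. The sample text and variable names are made up for the example. textscan's %q conversion reads a double-quoted field as a single string and drops the quotes.

```matlab
% Hypothetical sample: the second field is quoted and contains commas.
txt = sprintf('1,"red, green, blue",3.5\n2,"cyan, magenta",4.0');

% textscan accepts a character vector in place of a file identifier.
% %q reads a double-quoted field as one string, commas included.
C = textscan(txt, '%f %q %f', 'Delimiter', ',');

ids    = C{1};   % numeric column
labels = C{2};   % cell array of strings, quotes removed
values = C{3};   % numeric column
```

The same format string should work with a file identifier from fopen in place of txt.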

There's a chance that xlsread won't give you the answer you expect -- do the strings always appear in the same columns, for example? I think (as everyone else seems to :-) that it would be more robust to just use
fid = fopen('yourfile.csv');
and then either textscan
t = textscan(fid, '%s', 'delimiter', sprintf('\n'));
t = t{1};
or just fgetl (the example in the help is perfect).
After that you can do some line-by-line processing -- using textscan again on the text content of each line, for example, is a nice, quick way to get a cell-array that will allow fast analysis of each line.
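A rough sketch of that line-by-line idea follows. It assumes (this is not stated in the question) that only the middle field contains the embedded commas, so the first and last comma-separated pieces are ordinary fields. The temporary file and all variable names are invented for the demo.

```matlab
% Write a small hypothetical file so the example is self-contained.
fname = [tempname '.csv'];
fid = fopen(fname, 'w');
fprintf(fid, '7,apple, pear, plum,3.2\n8,kiwi,1.5\n');
fclose(fid);

ids = []; items = {}; values = [];
fid = fopen(fname, 'r');
ln = fgetl(fid);
while ischar(ln)                         % fgetl returns -1 at end of file
    f = textscan(ln, '%s', 'Delimiter', ',');
    f = f{1};                            % all comma-separated pieces of this line
    ids(end+1)    = str2double(f{1});    % first piece: assumed numeric id
    items{end+1}  = f(2:end-1);          % middle pieces: the embedded strings
    values(end+1) = str2double(f{end});  % last piece: assumed numeric value
    ln = fgetl(fid);
end
fclose(fid);
delete(fname);
```

Each entry of items is itself a cell array, so lines with different numbers of embedded strings are not a problem.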

You have a problem because you're reading it in as a .csv, and you have commas within your data. You can open it in Excel and manipulate the data, possibly removing the unwanted commas with Excel formulas. I work with .csv files for DB imports quite a bit. I imagine MATLAB has similar rules, namely: no commas in your data.
Can you tell us more about your data? Are there commas throughout, or just in one column? Maybe you can read it in as tab delimited?

Are you using a Unix system? The reason I am asking is that you could use a command-line function such as sed and regular expressions to clean those data files before you pass them into Matlab. Here is a link that explains how to do exactly what you are looking for.

Since, as others have observed, your file is CSV with commas inside what you think of as a single field, it's going to be hard to persuade Matlab that that really is only one field. I think your best strategy is going to be to read one line at a time, into a string acting as a buffer, and to translate it, field-by-field, into the variables or other data structures that you want. Since Matlab has in-built regular expression capabilities this shouldn't be too hard.
And, as others have already suggested, posting a sample of your data would help us to help you.
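As a sketch of the regular-expression route, the snippet below assumes a layout nobody has confirmed: one leading field, one trailing field, and everything in between belonging to the comma-containing field. The sample line and names are hypothetical.

```matlab
% Hypothetical buffered line: id, free text with commas, numeric value.
buf = '7,apple, pear, plum,3.2';

% First token: up to the first comma. Last token: after the last comma.
% The greedy middle token keeps any commas in between.
tok = regexp(buf, '^([^,]*),(.*),([^,]*)$', 'tokens', 'once');

id    = str2double(tok{1});
text  = tok{2};
value = str2double(tok{3});
```

If more than one field can contain commas, a pattern like this cannot recover the field boundaries, and the file has to be fixed at the source.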

One easy solution is:
folder = 'C:\folder1\folder2\';   % avoid naming a variable "path"; it shadows the built-in
fname = 'data.csv';
data = dataset('xlsfile', fullfile(folder, fname));
Of course you could also do the following:
[fname, folder] = uigetfile('C:\folder1\folder2\*.csv');
data = dataset('xlsfile', fullfile(folder, fname));
Now you will have loaded the data as a dataset. An easy way to get column 1, for example, is
double(data(:,1))
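For comparison, and not as part of the answer above: newer MATLAB releases provide readtable, which plays a similar role to dataset. The sketch below assumes the comma-containing field is double-quoted. The file contents and variable names are made up.

```matlab
% Write a small hypothetical CSV with a header row and a quoted field.
fname = [tempname '.csv'];
fid = fopen(fname, 'w');
fprintf(fid, 'id,label,value\n1,"a, b",10\n2,"c",20\n');
fclose(fid);

T = readtable(fname);   % column names are taken from the header row
delete(fname);

firstColumn = T.id;     % numeric column
labels      = T.label;  % text column, quotes removed
```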

Related

Writing to Text Files as a Struct vs.Cell Array

So I am trying to write data to a .txt file and I have encountered some difficulties. My understanding is that if I want to write struct data to a text file, I need to first convert it to a table using struct2table and then use writetable to write it to the file. When I do this, however, the data is comma delimited in the text file. I really like the table format as it appears in MATLAB, but I can't find a way to make it appear that way in the text file (I assume that Excel reads in data that is comma delimited, which is why the data is formatted that way).
Now, if I format the data as a cell array and then write that to the .txt file, it looks great; however, the struct format is nice in that I can access data points more easily and plot the data. So, I am a little lost on what route I should take to solve this problem.
One idea I had was to format the data as a struct array and then convert it to a cell array when I want to write it to a .txt file. The other idea I had was to somehow manipulate the data format when I use writetable. I can, for instance, change the delimiter to a tab and that looks great, except that the data does not line up (e.g. if I have Frequency and Power as column headers, the numbers below them in the table are not aligned with those headers). This, of course, is trivial if using fprintf, but I can't use that on a struct array (from what I understand).
I hope this is comprehensive enough, but if there is anything else I can provide please let me know and thank you in advance.
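One possible approach, sketched under the assumption that the struct array has scalar numeric fields named Frequency and Power (names borrowed from the question, values invented): fprintf cannot take the struct array directly, but it can be called once per element with fixed-width conversions so the columns line up.

```matlab
% Hypothetical struct array with two scalar fields per element.
s = struct('Frequency', {10, 200, 3000}, 'Power', {0.5, 12.25, 7});

fname = [tempname '.txt'];
fid = fopen(fname, 'w');

% '%-12s' and '%-12g' left-justify each entry in a 12-character column.
fprintf(fid, '%-12s%-12s\n', 'Frequency', 'Power');
for k = 1:numel(s)
    fprintf(fid, '%-12g%-12g\n', s(k).Frequency, s(k).Power);
end
fclose(fid);

written = fileread(fname);   % read back only to inspect the result
delete(fname);
```

The column width of 12 is arbitrary; it only needs to exceed the longest header or value.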

How to read data from txt between 2 lines with the same symbol?

I have a lot of .txt files (<1000 lines each). The data format is the following (see the picture): there are some lines at the beginning that I don't need, then a line with '*', then the lines with data that I need to extract from the file, then another line with '*' and some comments that I don't need.
Is there any way to do that? I have a lot of such files. The problem is that the number of lines before the first '*' is different in every file. So, is there any way to read the data between the two '*' lines? I tried all the functions, but I am a beginner and just cannot come up with the right idea...
This is quite simple with regular expressions:
usefulData = regexp(fileread('abg06.txt'), '(?<=\*).*?(?=\*)', 'match','once');
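Here is a small self-contained sketch of the same pattern applied to an in-memory string. It assumes the marker is a single '*' on its own line and that the data block holds three numeric columns; both are guesses, since the actual file is not shown.

```matlab
% Hypothetical file contents: header lines, '*', data, '*', comments.
txt = sprintf('header line\nmore header\n*\n1 2 3\n4 5 6\n*\ntrailing comment\n');

% Lazy match of everything between the first '*' and the next '*'.
% In MATLAB regular expressions '.' also matches newlines by default.
block = regexp(txt, '(?<=\*).*?(?=\*)', 'match', 'once');

nums = sscanf(block, '%f');      % all numbers in the block, as a column
data = reshape(nums, 3, []).';   % assumed: 3 values per row
```

For a real file, txt would come from fileread as in the one-liner above.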

How can I get around MATLAB's specifications of csvread?

I am trying to create a program that takes in multiple csv files. However, they include both strings and numbers.
I have csv files that looks something like this:
"Project","Task","Value Type", "Value"
"105", "06.05.02", "cost", "3434"
"105", "06.05.02", "obligation", "3434"
"106", "06.05.02", "cost", "500"
"106", "06.05.02", "obligation", "500"
The number of columns is fixed (there are actually 23; I only listed four for readability), but each csv has a different number of rows. If I save it as an xls file, it works perfectly. However, this takes too long if there are a lot of files, and the end user doesn't want to deal with that.
Similar questions suggested textread, but the first row would be
textread('filename.csv', '%s%s%s%s', 'delimiter', ',');
while the rest of the file is
textread('filename.csv', '%f%s%s%f', 'delimiter', ',');
In comparison to the simplicity of having the numbers, strings, and raw data in corresponding arrays using xlsread, having 23 different arrays seems a bit complicated.
What would be the best solution here?
The files are large, but not large enough that I am worried about efficiency.
Is there a way to change the extension of the files from .csv to .xls from within my program? (I looked this up as well, but couldn't find anything that worked) I would really like to use xlsread, but if this isn't possible, is there a way to have textread save the first row of a csv with certain variable types(%s%s%s%s) and then save the rest of the rows with a different variable type (%f%s%s%f)?
xlsread does read csv files, so there should be no need to convert. Just read directly (tested on 2013b with a small file of mixed numeric and string data):
[num, text, alldata] = xlsread('test.csv')
Note: this apparently only works on Windows machines. If just changing the extension makes xlsread work for you, you can rename with movefile:
oldfile = 'somefile.csv';
ext = '.xls';                        % new extension, including the dot
[~, name, ~] = fileparts(oldfile);   % file name without folder or extension
newfile = [name, ext];
movefile(oldfile, newfile);
If you have many files, this would go in a loop and oldfile would be taken from the output of something like a dir or ls command giving you all the .csv files.
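A sketch of such a loop is below. It runs in a scratch folder created just for the demo (the folder and file names are hypothetical); in practice folder would point at the directory holding the real .csv files.

```matlab
% Set up a scratch folder with two dummy .csv files.
folder = tempname;
mkdir(folder);
demoNames = {'a.csv', 'b.csv'};
for k = 1:numel(demoNames)
    fid = fopen(fullfile(folder, demoNames{k}), 'w');
    fprintf(fid, 'x,y\n1,2\n');
    fclose(fid);
end

% Rename every .csv file in the folder to .xls.
files = dir(fullfile(folder, '*.csv'));
for k = 1:numel(files)
    oldfile = fullfile(folder, files(k).name);
    [~, name] = fileparts(oldfile);
    newfile = fullfile(folder, [name '.xls']);
    movefile(oldfile, newfile);
end

renamed = dir(fullfile(folder, '*.xls'));
renamedNames = sort({renamed.name});
leftover = dir(fullfile(folder, '*.csv'));

rmdir(folder, 's');   % remove the scratch folder and its contents
```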
Incidentally, while you might see it mentioned in older questions, textread is no longer recommended; use textscan instead for cases where you need more complexity/control over the input. It can be very powerful, but for this case it is probably like cracking a nut with a sledgehammer.
If you don't need the headers, for example, you can take the whole file in one line with:
[project, task, valueType, value] = textread('filename.csv', '%f%s%s%f', 'Delimiter', ',', 'HeaderLines', 1);
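Since the sample in the question has every field wrapped in double quotes, including the numbers, a %f conversion would probably stumble on the quote characters. A hedged alternative, sketched with textscan on an in-memory copy of a few sample rows: read every column as quoted text with %q, take the first row as the header, and convert the numeric columns afterwards. Which columns are numeric is an assumption taken from the question.

```matlab
% A few rows from the question, joined into one character vector.
rows = {'"Project","Task","Value Type", "Value"', ...
        '"105", "06.05.02", "cost", "3434"', ...
        '"106", "06.05.02", "obligation", "500"'};
txt = strjoin(rows, sprintf('\n'));

% Read all four columns as quoted strings (header row included).
C = textscan(txt, '%q %q %q %q', 'Delimiter', ',');

header    = cellfun(@(col) col{1}, C, 'UniformOutput', false);
project   = str2double(C{1}(2:end));   % assumed numeric
task      = C{2}(2:end);
valueType = C{3}(2:end);
value     = str2double(C{4}(2:end));   % assumed numeric
```

With 23 columns, the format string could be built with repmat('%q ', 1, 23) rather than typed out.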

Can I use importdata to return only a part of a text file?

I am using:
importdata(fileName,'',headerLength)
To get data from a text file which is carriage return line feed delimited. The problem I have is that the files are relatively large and there are several thousand of them, which makes the data loading slow. I only want a small part of the file so I would like to know if I can use importdata to realise this?
Something like this:
importdata(fileName,'',headerLength:dataEnd);
This does not work and I can't find any support for doing something like this in the importdata documentation.
Does anyone know of a more suitable function?
If you know the lines (the row numbers) in each file that you wish to load, you can use a slower, more traditional way of reading in your data. readline.m allows you to do this:
http://uk.mathworks.com/matlabcentral/fileexchange/20026-readline-m-v3-0--jun--2009-
This allows you to read whichever lines you want from your data block. It is much slower per line than your normal csvread/textscan, but it could be faster overall if you know which lines you are looking for.
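A different technique from readline.m, shown here only as a sketch: textscan accepts a repeat count, so it can skip the header and stop after a fixed number of rows instead of reading the whole file. The file layout (two header lines, two numeric columns, CRLF line endings) and the names nRows and part are assumptions made for the demo.

```matlab
% Write a hypothetical CRLF-delimited file: 2 header lines, 10 data rows.
fname = [tempname '.txt'];
fid = fopen(fname, 'w');
fprintf(fid, 'header 1\r\nheader 2\r\n');
fprintf(fid, '%d %d\r\n', [1:10; 11:20]);
fclose(fid);

headerLength = 2;   % lines to skip
nRows = 3;          % data rows wanted

fid = fopen(fname, 'r');
% The third argument tells textscan to apply the format nRows times only.
C = textscan(fid, '%f %f', nRows, 'HeaderLines', headerLength);
fclose(fid);
delete(fname);

part = [C{1}, C{2}];
```

To start further down the file, the number of rows to skip can be added to the 'HeaderLines' value.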

Load txt-file with two different delimiters into struct

I'm having trouble loading a .txt file in MATLAB. The main problem is that the rows are not of equal length. I'll attach the file so you can see more clearly what I'm trying to say. First, the file has information about each node in a graph. One row has information like this:
1|1|EL_1_BaDfG|4,41|5,1|6,99|8,76|9,27|13,88|14,19|15,91|19,4|21,48...
it means:
id|type|name|connected_to, weight|connected_to, weight| and so on..
I was trying to use the fscanf function, but it only reads the whole line as one string. How am I supposed to divide it into a struct with the information that I need?
Best regards,
Dejan
Here, you can see file that I'm trying to load
An alternative to Stewie's answer is to use:
fgetl to read each line
Then use
strread (or textscan) to split the string
First split using the | delimiter, then split the subsection(s) containing , a second time.
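The two-pass split could look roughly like the sketch below. It uses strsplit and sscanf instead of strread, works on a shortened, hypothetical version of the sample row, and the struct field names are taken from the legend in the question. Inside a file loop, row would come from fgetl.

```matlab
% Shortened hypothetical row: id|type|name|connected_to,weight|...
row = '1|1|EL_1_BaDfG|4,41|5,1|6,99';

% First pass: split on the | delimiter.
parts = strsplit(row, '|');

node.id   = str2double(parts{1});
node.type = str2double(parts{2});
node.name = parts{3};

% Second pass: each remaining part is a "connected_to,weight" pair.
pairs = cellfun(@(p) sscanf(p, '%d,%d').', parts(4:end), ...
                'UniformOutput', false);
edges = vertcat(pairs{:});

node.connected_to = edges(:, 1);
node.weight       = edges(:, 2);
```

Because each row is parsed on its own, rows with different numbers of pairs are handled without padding.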