How to parse CSV file with empty values in Octave? - matlab

I have the following CSV data that I am trying to parse in Octave. Note that the values in the last column are empty:
102,19700101,,0.485,
111,19700101,,0.48,
I have defined my line format as:
lineFormat = [repmat('%s',1,1), ...
repmat('%f',1,1), ...
repmat('%q',1,1), ...
repmat('%f',1,1), ...
repmat('%q',1,1)];
How can I read this in with textscan? When I try:
C = textscan(fid, lineFormat, 'Delimiter', ',')
I incorrectly get the following (notice that the second line from the CSV is shifted):
C =
{
[1,1] =
{
[1,1] = 102
[2,1] = 19700101
}
[1,2] =
1.9700e+07
NaN
[1,3] =
{
[1,1] =
[2,1] = 0.48
}
[1,4] =
0.48500
110.00000
[1,5] =
{
[1,1] = 111
[2,1] = 19700101
}
}
I've also tried with 'MultipleDelimsAsOne' but the last column value is still omitted. How do I read my CSV data in properly with textscan? This code works as expected in MATLAB, but not in Octave.
Running Octave 4.2.2 on Ubuntu 16.04.

For your example, setting the EndOfLine parameter helped me (Windows 10, Octave 5.1.0):
C = textscan(fid, lineFormat, 'Delimiter', ',', 'EndOfLine', '\n')
The output seems correct:
C =
{
[1,1] =
{
[1,1] = 102
[2,1] = 111
}
[1,2] =
19700101
19700101
[1,3] =
{
[1,1] =
[2,1] =
}
[1,4] =
0.48500
0.48000
[1,5] =
{
[1,1] =
[2,1] =
}
}
Now I wanted to test your %q columns and expanded your example:
102,19700101,,0.485,
111,19700101,,0.48,
111,19700101,,0.48,"test"
111,19700101,"test",0.48,
Unfortunately, the above solution doesn't work properly here:
C =
{
[1,1] =
{
[1,1] = 102
[2,1] = 111
[3,1] = 111
[4,1] =
}
[1,2] =
19700101
19700101
19700101
111
[1,3] =
{
[1,1] =
[2,1] =
[3,1] =
[4,1] = 19700101
}
[1,4] =
0.48500
0.48000
0.48000
[1,5] =
{
[1,1] =
[2,1] =
[3,1] = test
}
}
But when switching from %q to %s in lineFormat, it works as expected:
C =
{
[1,1] =
{
[1,1] = 102
[2,1] = 111
[3,1] = 111
[4,1] = 111
}
[1,2] =
19700101
19700101
19700101
19700101
[1,3] =
{
[1,1] =
[2,1] =
[3,1] =
[4,1] = "test"
}
[1,4] =
0.48500
0.48000
0.48000
0.48000
[1,5] =
{
[1,1] =
[2,1] =
[3,1] = "test"
[4,1] =
}
}
I have no explanation for that; maybe a bug? If you can live with removing the double quotes yourself afterwards, this might (still) be a solution for you.
Hope that helps!
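Removing the quotes afterwards could look roughly like the sketch below. The variable names col and cleaned are made up for illustration, and the sample values are copied from the %s output above; it assumes a quoted field has exactly one leading and one trailing double quote.

```matlab
% Column as returned by textscan with %s (sample values from the output above)
col = {''; ''; ''; '"test"'};

% Strip one leading and one trailing double quote from each entry, if present.
% Entries without surrounding quotes are left unchanged.
cleaned = regexprep(col, '^"(.*)"$', '$1');

disp(cleaned{4})
```

regexprep accepts a cell array of strings, so the whole column is handled in one call.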

It appears this is a bug in Octave: https://savannah.gnu.org/bugs/index.php?57612
I got around this by adding an extra comma to the end of my CSV files whose lines ended in a comma. Since Octave ignores the final comma, adding a second comma causes Octave to not ignore the second-to-last one:
102,19700101,,0.485,,
111,19700101,,0.48,,
Here's a shell one-liner to fix all the CSV files in a directory:
find "${1:-.}" -type f -name '*.csv' -exec sed -i -e 's/,$/,,/' {} \;
This is not a great solution, just a work-around for the existing bug.
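If editing the files is not an option, another possibility is to skip textscan and split each line by hand. This is only a sketch: it assumes every row has exactly five comma-separated fields and that no field contains a quoted comma, and the names rows, fields, ids, dates and values are hypothetical. The two sample rows are taken from the question.

```matlab
% Sample rows from the question; in practice these would come from fgetl
rows = {'102,19700101,,0.485,'; '111,19700101,,0.48,'};

nRows = numel(rows);
fields = cell(nRows, 5);
for k = 1:nRows
    % CollapseDelimiters must be false, otherwise ",," is treated as one delimiter
    parts = strsplit(rows{k}, ',', 'CollapseDelimiters', false);
    fields(k, :) = parts(1:5);
end

ids    = fields(:, 1);              % kept as strings
dates  = str2double(fields(:, 2));  % numeric column
values = str2double(fields(:, 4));  % numeric column
```

Empty fields come back as empty strings, so columns 3 and 5 stay aligned with the others.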

Related

Adding a datapoint to datastruct in matlab

I am trying to add a datapoint to an existing data struct. I have created the following data struct.
ourdata.animal= {'wolf', 'dog', 'cat'}
ourdata.height = [110 51 32]
ourdata.weight = [55 22 10]
Say I want to add another entry to the data struct, with name 'fish', height 3 and weight 1. How do I go about this?
You can simply attach it to the end of the structure:
ourdata.animal{end+1} = 'fish'
ourdata.height(end+1) = 3
ourdata.weight(end+1) = 1
If you want to work with multiple structures, you can write a little function to combine the values of fields in multiple structs. Here's one, using fieldnames() to discover what fields exist:
function out = slapItOn(aStruct, anotherStruct)
% Slap more data on to the end of fields of a struct
out = aStruct;
for fld = string(fieldnames(aStruct))'
out.(fld) = [aStruct.(fld) anotherStruct.(fld)];
end
end
Works like this:
>> ourdata
ourdata =
struct with fields:
animal: {'wolf' 'dog' 'cat'}
height: [110 51 32]
weight: [55 22 10]
>> newdata = slapItOn(ourdata, struct('animal',{{'bobcat'}}, 'height',420, 'weight',69))
newdata =
struct with fields:
animal: {'wolf' 'dog' 'cat' 'bobcat'}
height: [110 51 32 420]
weight: [55 22 10 69]
>>
BTW, I'd suggest that you use string arrays instead of cellstrs for storing your string data. They're better in pretty much every way (except performance). Get them with double quotes:
>> strs = ["wolf" "dog" "cat"]
strs =
1×3 string array
"wolf" "dog" "cat"
>>
Also, consider using a table array instead of a struct array for tabular-looking data like this. Tables are nice!
>> animal = ["wolf" "dog" "cat"]';
>> height = [110 51 32]';
>> weight = [55 22 10]';
>> t = table(animal, height, weight)
t =
3×3 table
animal height weight
______ ______ ______
"wolf" 110 55
"dog" 51 22
"cat" 32 10
>>

Random selection of a member's location in a nested cell of cells: Matlab

I have a nested cell of cells like the one below:
CellArray={1,1,1,{1,1,1,{1,1,{1,{1 1 1 1 1 1 1 1}, 1,1},1,1},1,1,1},1,1,1,{1,1,1,1}};
I need to randomly pick a location in CellArray. Every member's location must have the same chance of being chosen. Thanks.
You can capture the output of the celldisp function, then use a regex to extract the indices:
s=evalc('celldisp(CellArray,'''')');
m = regexp(s, '\{[^\=]*\}', 'match');
Thanks to @excaza, who suggested a clearer use of regexp.
Result:
m =
{
[1,1] = {1}
[1,2] = {2}
[1,3] = {3}
[1,4] = {4}{1}
[1,5] = {4}{2}
[1,6] = {4}{3}
[1,7] = {4}{4}{1}
[1,8] = {4}{4}{2}
[1,9] = {4}{4}{3}{1}
[1,10] = {4}{4}{3}{2}{1}
[1,11] = {4}{4}{3}{2}{2}
[1,12] = {4}{4}{3}{2}{3}
[1,13] = {4}{4}{3}{2}{4}
[1,14] = {4}{4}{3}{2}{5}
[1,15] = {4}{4}{3}{2}{6}
[1,16] = {4}{4}{3}{2}{7}
[1,17] = {4}{4}{3}{2}{8}
[1,18] = {4}{4}{3}{3}
[1,19] = {4}{4}{3}{4}
[1,20] = {4}{4}{4}
[1,21] = {4}{4}{5}
[1,22] = {4}{5}
[1,23] = {4}{6}
[1,24] = {4}{7}
[1,25] = {5}
[1,26] = {6}
[1,27] = {7}
[1,28] = {8}{1}
[1,29] = {8}{2}
[1,30] = {8}{3}
[1,31] = {8}{4}
}
Use randi to select an index:
m{randi(numel(m))}
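If parsing captured display output feels fragile, the same list of locations can be built directly by walking the nested cell. The following is a sketch of that alternative, not the approach from the answer above; paths, stack, pick and value are names chosen for illustration. Each location is stored as a numeric index path such as [4 4 3 2 1] instead of the string {4}{4}{3}{2}{1}.

```matlab
CellArray = {1,1,1,{1,1,1,{1,1,{1,{1 1 1 1 1 1 1 1},1,1},1,1},1,1,1},1,1,1,{1,1,1,1}};

paths = {};     % one index path per non-cell member
stack = {[]};   % index paths of nodes still to visit; [] is the top level
while ~isempty(stack)
    p = stack{end};
    stack(end) = [];
    % Follow the index path down to the node it refers to
    node = CellArray;
    for d = p
        node = node{d};
    end
    if iscell(node)
        % Push children in reverse so they are visited in natural order
        for k = numel(node):-1:1
            stack{end+1} = [p k];
        end
    else
        paths{end+1} = p;
    end
end

% Every member is listed exactly once, so a uniform draw over paths is fair
pick = paths{randi(numel(paths))};
value = CellArray;
for d = pick
    value = value{d};
end
```

For the CellArray from the question this yields 31 paths, in the same order as the regexp result.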

Error using fprintf

My code looks like this:
PosHotspot = dataset('file', 'PositiveHotspotpos.txt', 'Delimiter', '\t');
a = 2;
exon_end = PosHotspot.total_exon;
exonposition = PosHotspot.ExonPos;
Isoformnumber = PosHotspot.Isoform;
fileID = fopen('PosHotspot_results.txt', 'w')
for j = 1:660
exon = exonposition(j:j);
Isoform = Isoformnumber(j:j);
b = exon_end(j:j) - 1;
rng(0, 'twister');
r=randi([a b],1,1000);
less = sum(exon>r);
greater = sum(exon<r);
equal = sum(exon==r);
fprintf(fileID, '%s %4f %4f\n',Isoform,less,greater)
end
fclose(fileID)
However, I keep getting this error:
Error using fprintf
Function is not defined for 'cell' inputs.
Error in PositiveHotspotttest (line 24)
fprintf(fileID, '%s %4f %4f\n',Isoform,less,greater)
I'm certain that it has to do with writing my information from Isoforms to the file.
Here's an example of what my file looks like:
chrom Gene Isoform exon_start ExonPos total_exon exonpos_exontotal
chr20 ADA NM_000022 43255096 4 13 0.307692307692
chr9 ALDOB NM_000035 104187734 7 10 0.7
chr5 ARSB NM_000046 78077674 7 9 0.777777777778
chr5 ARSB NM_000046 78135178 6 9 0.666666666667
chr5 ARSB NM_000046 78181406 5 9 0.555555555556
I want to output the Isoforms to my new file as well as the greater than and less than values. Is there a way to do this?
It's probably pretty simple, but again, I'm new to MATLAB.
Change:
Isoform = Isoformnumber(j:j);
to the more natural:
Isoform = Isoformnumber{j};
This way you retrieve the content of cell j (a character array) instead of a 1-by-1 cell array containing it, which fprintf cannot print.
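The difference between the two kinds of indexing can be seen in a small self-contained sketch. The isoform names are taken from the sample file in the question; the numbers 3 and 5 and the variable names asCell, asChar and outLine are placeholders for illustration.

```matlab
Isoformnumber = {'NM_000022'; 'NM_000035'; 'NM_000046'};
j = 2;

asCell = Isoformnumber(j);   % parentheses: a 1x1 cell array
asChar = Isoformnumber{j};   % braces: the character array stored inside

% sprintf takes the same format arguments as fprintf
outLine = sprintf('%s %4f %4f\n', asChar, 3, 5);
```

Passing asCell instead of asChar to fprintf or sprintf reproduces the "not defined for 'cell' inputs" error.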

read input file Matlab

I have a problem while reading an input file in Matlab. It appears that all the rows have one parameter input except for the last one which is a vector:
INPUT FILE
--------------
Field1: number
Field2: text (one word)
Field3: number
Field4: number
Field5: number
Field6: number
Field7: vector
The code I have implemented looks like:
fid = fopen('input.inp','r');
A = textscan(fid,'%s %f','Delimiter','\t','headerLines',2);
data = cat(2,A{:});
I would like some help dealing with the mix of text and number values, and with the vector in the last row. Thanks
Is this what you are looking for...?
I think you have to use %s %s as the textscan format rather than %f, because a vector such as [7 8 9] cannot be converted to a single float.
I changed the call to A = textscan(fid,'%s %s','Delimiter','\t'); so that both columns are read as strings.
Also, I think you want to concatenate along the first dimension rather than the second.
I suspect you actually want to build key/value pairs from the input file rather than just read each row into a cell, but you don't state that.
INPUT FILE
--------------
Field1: 1
Field2: two
Field3: 3
Field4: 4
Field5: 5
Field6: 6
Field7: [7 8 9]
fid = fopen('D:\tmp\t.txt','r');
A = textscan(fid,'%s %s','Delimiter','\t','headerLines',2);
cat(1,A{:})
ans =
{
[1,1] = Field1: 1
[2,1] = Field3: 3
[3,1] = Field5: 5
[4,1] = Field7: [7 8 9]
[5,1] = Field2: two
[6,1] = Field4: 4
[7,1] = Field6: 6
}
If you want to create key/value pairs, you can split each string into a key and a value with a loop, and then pass the results to the containers.Map class if needed. You have to filter your strings a bit (e.g. remove colons) but you get the gist.
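A2 = cat(1,A{:}); % stack both textscan columns into one column of strings
keySet = {};
valueSet = {};
for n = 1:size(A2,1)
s = A2{n};
ind = strfind(s,' '); % the first space separates key and value
keySet{n} = s(1:ind(1)-1); % text before the first space
valueSet{n} = s(ind(1)+1:end); % text after the first space
end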
The output is
keySet =
{
[1,1] = Field1:
[1,2] = Field3:
[1,3] = Field5:
[1,4] = Field7:
[1,5] = Field2:
[1,6] = Field4:
[1,7] = Field6:
}
valueSet =
{
[1,1] = 1
[1,2] = 3
[1,3] = 5
[1,4] = [7 8 9]
[1,5] = two
[1,6] = 4
[1,7] = 6
}
From the containers.Map documentation:
mapObj = containers.Map(keySet,valueSet)
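Once the map exists, values are looked up by key and converted as needed. The sketch below uses a shortened key set with the colons already removed (an assumption; the loop above keeps them) and converts the stored strings with str2double and str2num. The variable names n and v are hypothetical.

```matlab
% Shortened example: keys with the trailing colon already stripped
keySet   = {'Field1', 'Field2', 'Field7'};
valueSet = {'1', 'two', '[7 8 9]'};
mapObj   = containers.Map(keySet, valueSet);

n = str2double(mapObj('Field1'));   % scalar number
v = str2num(mapObj('Field7'));      % vector; str2num evaluates the text
txt = mapObj('Field2');             % plain text stays as it is
```

Note that str2num evaluates its input, so it should only be used on files you trust.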

Matlab - preprocess CSV file

I have a CSV file in a format similar to the following one:
title1
index columnA1 columnA2 columnA3
1 2 3 6
2 23 23 1
3 2 3 45
4 2 2 101
title2
index columnB1 columnB2 columnB3
1 23 53 6
2 22 13 1
3 5 4 43
4 8 6 102
I want to build a function readCustomCSV which receives a CSV file in the format illustrated above and a row index i, and returns an output file with (say, for i = 3) the following content:
title1
index columnA1 columnA2 columnA3
3 2 3 45
title2
index columnB1 columnB2 columnB3
3 5 4 43
Do you know how to use the csvread function in order to obtain this type of functionality?
It confuses me that there are 2 types of sections. I was thinking of reading the whole thing as a string, splitting it into 2 .csv files, and then reading the corresponding line.
Try using this function:
I assumed that all tables have an equal number of columns and rows. The code can definitely be shortened / improved / extended ;)
function multi_table_csvread (row_index)
filename_INPUT = 'multi_table.csv' ;
filename_OUTPUT = 'selected_row.csv' ;
fIN = fopen(filename_INPUT,'r');
nextLine = fgetl(fIN);
tableIndex = 0;
tableLine = 0;
csvTable = {};
% start reading the csv file, line by line (fgetl returns -1 at end of file)
while ischar(nextLine)
lineStr = strtrim(strsplit(nextLine,',')) ;
% remove empty cells
lineStr(cellfun('isempty',lineStr)) = [] ;
tableLine = tableLine + 1 ;
% if 1 element start new table
if numel(lineStr) == 1
tableIndex = tableIndex + 1;
tableLine = 1;
csvTable{tableIndex,tableLine} = lineStr ;
else
lineStr = add_comas(lineStr) ;
csvTable{tableIndex,tableLine} = lineStr ;
end
nextLine = fgetl(fIN);
end
fclose(fIN);
fOUT = fopen(filename_OUTPUT,'w');
if row_index > size(csvTable,2) -2
error('The row index exceeds the maximum number of rows!')
end
for k = 1 : size(csvTable,1)
title = csvTable{k,1};
columnHeaders = csvTable{k,2};
selected_row = csvTable{k,row_index+2};
fprintf(fOUT,'%s\n',title{:});
fprintf(fOUT,'%s',columnHeaders{:});
fprintf(fOUT,'\n');
fprintf(fOUT,'%s',selected_row{:});
fprintf(fOUT,'\n');
end
fclose(fOUT);
function line_with_comas = add_comas(this_line)
for ii = 1 : length(this_line)-1
this_line{ii} = strcat(this_line{ii},',') ;
end
line_with_comas = this_line ;
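The selection logic can also be sketched without file I/O, which makes it easier to check. This is a simplified illustration rather than a replacement for the function above: it assumes the file uses commas as delimiters, that a title line is any line without a comma, and that each table has a title line followed by one header line. The names rows, rowIndex, kept and lineInTable are made up for the example.

```matlab
% Lines of the file as they would be returned by repeated fgetl calls
rows = {'title1'; 'index,columnA1,columnA2,columnA3'; ...
        '1,2,3,6'; '2,23,23,1'; '3,2,3,45'; '4,2,2,101'; ...
        'title2'; 'index,columnB1,columnB2,columnB3'; ...
        '1,23,53,6'; '2,22,13,1'; '3,5,4,43'; '4,8,6,102'};
rowIndex = 3;

kept = {};
lineInTable = 0;
for k = 1:numel(rows)
    if isempty(strfind(rows{k}, ','))
        lineInTable = 1;                 % a title line starts a new table
    else
        lineInTable = lineInTable + 1;
    end
    % Keep the title (1), the header (2) and the requested data row
    if lineInTable <= 2 || lineInTable == rowIndex + 2
        kept{end+1, 1} = rows{k};
    end
end
```

Writing kept to the output file is then a single loop with fprintf(fOUT, '%s\n', kept{k}).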