Little-endian to big-endian - Matlab

I read the bytes this way:
filename = 'random_path';
f=fopen(filename,'rb');
if f<3, error('Unable to open %s',filename); end
samples= fread(f,202*4096,'int16')';
This file was written in little-endian. Now I want to convert it to big-endian. I tried this, without success:
read= fopen('big_endian','wb');
fwrite(read,int16(swapbytes(samples)),'int16');
fclose(read)

You should try:
%read the data
fid = fopen('random_path','r','ieee-le'); %ieee-le = little endian
data = fread(fid,inf,'int16');
fclose(fid);
%write the data
fid = fopen('random_path_le','w','ieee-be'); %ieee-be = big endian
fwrite(fid,data,'int16');
fclose(fid);
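A likely reason the swapbytes attempt in the question failed: fread with 'int16' returns doubles by default, so swapbytes reorders the eight bytes of each double rather than the two bytes of each sample. Here is a minimal sketch, assuming a scratch file from tempname (made up for this example), that reads the samples as int16 and checks that swapbytes then gives the same result as the machine-format argument of fopen:

```matlab
% Sketch: two equivalent ways to byte-swap int16 samples.
samples = int16([1 256 -2 12345]);
tmp = [tempname '.bin'];                 % hypothetical scratch file

% Write the samples as little-endian.
fid = fopen(tmp, 'w', 'ieee-le');
fwrite(fid, samples, 'int16');
fclose(fid);

% Read the same bytes back, interpreting them as big-endian.
fid = fopen(tmp, 'r', 'ieee-be');
viaFopen = fread(fid, inf, 'int16=>int16')';   % keep int16, do not convert to double
fclose(fid);
delete(tmp);

% Swap the two bytes of every int16 element in memory.
viaSwap = swapbytes(samples);

disp(isequal(viaFopen, viaSwap))
```

If you prefer the swapbytes route on the real file, reading with 'int16=>int16' instead of 'int16' should be the only change needed.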

Related

Writing lines to .csv file fails with command fprintf

I want to access a .csv file, look for empty data blocks, and store all of the lines that have no empty data blocks.
This is my code:
filename = 'C:\Users\try.csv';
file1 = fopen(filename); %Access file with empty data blocks
filename2 = 'C:\Users\try_corrected.csv';
file2 = fopen(filename2); %Access destination file
tline = fgets(file1); %Read the first line of file1
while ischar(tline)
detected = false;
[r,s] = size(tline); %Determine the length of the text line for the for-loop
for(i=1: 1: s)
if( (tline(i)==',' && tline(i+1) ==',') || tline(1)==',' || tline(s-2)==',' )
detected = true %Line should not be printed in destination file
break;
end
end
if detected == false
fprintf(file2,'%s\n',tline);
end
tline = fgets(file1);
end
type 'C:\Users\try_corrected.csv'
fclose(file2);
textdata = textscan(file1, '%s %f %f %f %s %f %f %f %s %f %s','HeaderLines', 1,'Delimiter',',');
fclose(file1);
When I run the "type" command, I should see all the printed strings, but I do not.
Am I using fprintf wrong? I know there is a command called csvwrite, but I thought this could work too.
First, when you open your destination file you need to open it for writing. Without a second parameter, fopen opens the file for read access only. If your destination file does not exist, it returns a file handle of -1.
Use instead:
fopen(filename2,'w')
Here is a simplified version of your code including that amendment:
filename = 'c:\try.csv';
fid_in = fopen(filename,'r'); %Access file with empty data blocks
filename2 = 'C:\try_corrected.csv';
fid_out = fopen(filename2,'w'); %Access destination file
while (~feof(fid_in))
str=fgets(fid_in);
if (~doublecommas(str))
fprintf(fid_out,'%s',str);
end
end
fclose(fid_in);
fclose(fid_out);
type(filename2)
This uses a different method to detect two consecutive commas (an empty field) in the CSV line:
function flag=doublecommas(str)
flag=false; % flag = result returned,
% true if empty CSV fields present in line
if (ischar(str) && length(str)>0)
for cindex=1:length(str)-1
if strcmp(str(cindex:(cindex+1)),',,')
flag=true; break;
end
end
end
return;
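For comparison, here is a regexp-based sketch of the same check. It also flags a leading or trailing comma, which the loop in the question tried to catch. The name hasEmptyField is made up for this example, and the test lines are invented:

```matlab
% True when a CSV line contains an empty field: a leading comma,
% two consecutive commas, or a trailing comma.
% strtrim drops the newline that fgets keeps at the end of the line.
hasEmptyField = @(str) ischar(str) && ...
    ~isempty(regexp(strtrim(str), '^,|,,|,$', 'once'));

% Invented sample lines, each ending in a newline as fgets would return them.
lines = {sprintf('a,b,c\n'), sprintf('a,,c\n'), ...
         sprintf(',b,c\n'),  sprintf('a,b,\n')};
flags = cellfun(hasEmptyField, lines);
disp(flags)
```

The ischar guard is there because fgets returns -1 (a number) at end of file, so the helper can be called on whatever fgets hands back.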

fopen with correct file format and path

How do I read a numbered series of .htm files using the correct file format and path?
path: DATA\WEBPAGE_SOURCE\train75_phish_data\1.htm
files: 1.htm, 2.htm, 3.htm, etc.
Each of 1.htm, 2.htm, 3.htm, etc. contains the source code of a web page.
I tried the following, but got an error when i=21.
data2=fopen(strcat('DATA\WEBPAGE_SOURCE\train75_phish_data\',int2str(i),'.htm'),'r')
I have referred to this, but it still does not work. Any ideas?
http://www.mathworks.com/help/matlab/ref/fopen.html
Here is my code:
data = importdata('DATA/URL/trainURL')
domain_URL = regexp(data,'\w*://[^/]*','match','once')
[sizeData b] = size(domain_URL);
for i = 1:150
A7_data = domain_URL{i};
data2=fopen(strcat('DATA\WEBPAGE_SOURCE\train75_phish_data\',int2str(i),'.htm'),'r')
CharData = fread(data2, '*char')'; %read text file and store data in CharData
img_only = regexp(CharData, '<img.*?>', 'match');
feature7_data=(cellfun(@(n) isempty(n), strfind(img_only, A7_data)))
B7(i)=sum(feature7_data)
end
feature7(B7>=10)=1;
feature7(B7<10&B7>5)=0;
feature7(B7<=5)=-1;
feature7'
Here is my output:
data = importdata('DATA/URL/trainURL') loads a list of saved URLs.
The loop works up to i=20 but fails at iteration 21, where it cannot read 'data2'. I want to loop until 150.
I think you need to handle possible errors in a more principled way. Try this:
data = importdata('DATA/URL/trainURL')
domain_URL = regexp(data,'\w*://[^/]*','match','once')
[sizeData b] = size(domain_URL);
for i = 1:150
A7_data = domain_URL{i};
filename = fullfile('DATA\WEBPAGE_SOURCE\train75_phish_data\',strcat(int2str(i),'.htm'));
if (exist(filename,'file')),
disp(sprintf('file %s exists, processing it',filename));
data2=fopen(filename,'r');
CharData = fread(data2, '*char')'; %read text file and store data in CharData
fclose(data2);
img_only = regexp(CharData, '<img.*?>', 'match');
feature7_data=(cellfun(@(n) isempty(n), strfind(img_only, A7_data)))
B7(i)=sum(feature7_data)
else,
disp(sprintf('file %s does not exist, skipping it!',filename));
end
end
feature7(B7>=10)=1;
feature7(B7<10&B7>5)=0;
feature7(B7<=5)=-1;
feature7'
Note that I also added fclose(data2) after the line that does the fread.
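One side effect of skipping files: B7(i) is never assigned for a missing file, so it is left at 0 (or the vector ends up shorter than expected) and the thresholds afterwards label it -1. A hedged sketch of one way around that, using NaN as a "not processed" marker. The counts and the value of nFiles are made up for illustration:

```matlab
nFiles = 5;                       % the thread loops to 150; 5 keeps the demo short
B7 = nan(1, nFiles);              % NaN means "file missing / not processed"
B7([1 2 4]) = [12 7 3];           % invented counts for files 1, 2 and 4

% Comparisons with NaN are always false, so skipped files keep NaN here.
feature7 = nan(1, nFiles);
feature7(B7 >= 10) = 1;
feature7(B7 < 10 & B7 > 5) = 0;
feature7(B7 <= 5) = -1;
disp(feature7)
```

Whether NaN is the right marker depends on what consumes feature7 later; it is only one option.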

Excessively large overhead in MATLAB .mat file

I am parsing a large text file full of data and then saving it to disk as a *.mat file so that I can easily load in only parts of it (see here for more information on reading in the files, and here for the data). To do so, I read in one line at a time, parse the line, and then append it to the file. The problem is that the file itself is >3 orders of magnitude larger than the data contained therein!
Here is a stripped down version of my code:
database = which('01_hit12.par');
[directory,filename,~] = fileparts(database);
matObj = matfile(fullfile(directory,[filename '.mat']),'Writable',true);
fidr = fopen(database);
hitranTemp = fgetl(fidr);
k = 1;
while ischar(hitranTemp)
if abs(hitranTemp(1)) == 32;
hitranTemp(1) = '0';
end
hitran = textscan(hitranTemp,'%2u%1u%12f%10f%10f%5f%5f%10f%4f%8f%15c%15c%15c%15c%6u%2u%2u%2u%2u%2u%2u%1c%7f%7f','delimiter','','whitespace','');
matObj.moleculeNumber(1,k) = uint8(hitran{1});
matObj.isotopeologueNumber(1,k) = uint8(hitran{2});
matObj.vacuumWavenumber(1,k) = hitran{3};
matObj.lineIntensity(1,k) = hitran{4};
matObj.airWidth(1,k) = single(hitran{6});
matObj.selfWidth(1,k) = single(hitran{7});
matObj.lowStateE(1,k) = single(hitran{8});
matObj.tempDependWidth(1,k) = single(hitran{9});
matObj.pressureShift(1,k) = single(hitran{10});
if rem(k,1e4) == 0;
display(sprintf('line %u (%2.2f)',k,100*k/K));
end
hitranTemp = fgetl(fidr);
k = k + 1;
end
fclose(fidr);
I stopped the code after 13,813 of the 224,515 lines had been parsed because it had been taking a very long time and the file size was getting huge, but the last printout indicated that I had only just cleared 10k lines. I cleared the memory, and then ran:
S = whos('-file','01_hit12.mat');
fileBytes = sum([S.bytes]);
T = dir(which('01_hit12.mat'));
diskBytes = T.bytes;
disp([fileBytes diskBytes diskBytes/fileBytes])
and get the output:
524894 896189009 1707.37141022759
What is taking up the extra 895,664,115 bytes? I know the help page says there should be a little extra overhead, but I feel that nearly a GB of descriptive header is a bit excessive!
New information:
I tried pre-allocating the file, thinking that perhaps MATLAB was doing the same thing it does when a matrix is grown in a loop and reallocating a chunk of disk space for the entire matrix on each write, but that isn't it. Filling the file with zeros of the appropriate data types results in a file for which my short check script returns:
8531570 71467 0.00837677004349727
This makes more sense to me. Matlab is saving the file sparsely, so the disk file size is much smaller than the size of the full matrix in memory. Once it starts replacing values with real data, however, I get the same behavior as before and the file size starts skyrocketing beyond all reasonable bounds.
New new information:
Tried this on a subset of the data, 100 lines long. To stream to disk, the data has to be in v7.3 format, so I ran the subset through my script, loaded it into memory, and then resaved as v7.0 format. Here are the results:
v7.3: 3800 8752 2.30
v7.0: 3800 2561 0.67
No wonder the v7.3 format isn't the default. Does anyone know a way around this? Is this a bug or a feature?
This seems like a bug to me. A workaround is to write in chunks to pre-allocated arrays.
Start off by pre-allocating:
fid = fopen('01_hit12.par', 'r');
data = fread(fid, inf, 'uint8');
nlines = nnz(data == 10) + 1;
fclose(fid);
matObj.moleculeNumber = zeros(1,nlines,'uint8');
matObj.isotopeologueNumber = zeros(1,nlines,'uint8');
matObj.vacuumWavenumber = zeros(1,nlines,'double');
matObj.lineIntensity = zeros(1,nlines,'double');
matObj.airWidth = zeros(1,nlines,'single');
matObj.selfWidth = zeros(1,nlines,'single');
matObj.lowStateE = zeros(1,nlines,'single');
matObj.tempDependWidth = zeros(1,nlines,'single');
matObj.pressureShift = zeros(1,nlines,'single');
Then to write in chunks of 10000, I modified your code as follows:
... % your code plus pre-alloc first
bs = 10000;
while ischar(hitranTemp)
if abs(hitranTemp(1)) == 32;
hitranTemp(1) = '0';
end
for ii = 1:bs,
hitran{ii} = textscan(hitranTemp,'%2u%1u%12f%10f%10f%5f%5f%10f%4f%8f%15c%15c%15c%15c%6u%2u%2u%2u%2u%2u%2u%1c%7f%7f','delimiter','','whitespace','');
hitranTemp = fgetl(fidr);
if hitranTemp==-1, bs=ii; break; end
end
% this part really ugly, sorry! trying to keep it compact...
matObj.moleculeNumber(1,k:k+bs-1) = uint8(builtin('_paren',cellfun(@(c)c{1},hitran),1:bs));
matObj.isotopeologueNumber(1,k:k+bs-1) = uint8(builtin('_paren',cellfun(@(c)c{2},hitran),1:bs));
matObj.vacuumWavenumber(1,k:k+bs-1) = builtin('_paren',cellfun(@(c)c{3},hitran),1:bs);
matObj.lineIntensity(1,k:k+bs-1) = builtin('_paren',cellfun(@(c)c{4},hitran),1:bs);
matObj.airWidth(1,k:k+bs-1) = single(builtin('_paren',cellfun(@(c)c{5},hitran),1:bs));
matObj.selfWidth(1,k:k+bs-1) = single(builtin('_paren',cellfun(@(c)c{6},hitran),1:bs));
matObj.lowStateE(1,k:k+bs-1) = single(builtin('_paren',cellfun(@(c)c{7},hitran),1:bs));
matObj.tempDependWidth(1,k:k+bs-1) = single(builtin('_paren',cellfun(@(c)c{8},hitran),1:bs));
matObj.pressureShift(1,k:k+bs-1) = single(builtin('_paren',cellfun(@(c)c{9},hitran),1:bs));
k = k + bs;
fprintf('.');
end
fclose(fidr);
The final size on disk is 21,393,408 bytes. The usage breaks down as follows:
>> S = whos('-file','01_hit12.mat');
>> fileBytes = sum([S.bytes]);
>> T = dir(which('01_hit12.mat'));
>> diskBytes = T.bytes; ratio = diskBytes/fileBytes;
>> fprintf('%10d whos\n%10d disk\n%10.6f\n',fileBytes,diskBytes,ratio)
8531608 whos
21389582 disk
2.507099
Still fairly inefficient, but not out of control.
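Stripped of the parsing, the pattern is: pre-allocate the variable in the MAT-file once, then assign one contiguous block per write instead of one element per write. A minimal sketch under stated assumptions: the file name comes from tempname, and the variable name x and the sizes are made up for the demo.

```matlab
fname = [tempname '.mat'];              % hypothetical scratch file
m = matfile(fname, 'Writable', true);   % a new file is created on first write

n  = 25;                                % total number of elements (demo value)
bs = 10;                                % chunk size (demo value)
m.x = zeros(1, n, 'single');            % pre-allocate the variable on disk

for k = 1:bs:n
    last  = min(k + bs - 1, n);         % the final chunk may be shorter
    chunk = single(k:last);             % stand-in for one parsed block of lines
    m.x(1, k:last) = chunk;             % one write per chunk, not per element
end

x = m.x;                                % load the whole variable back to check
delete(fname);
disp(isequal(x, single(1:n)))
```

The min(...) on the last index plays the role of the bs=ii adjustment in the code above.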

Reading binary file into matlab

I have a data file that uses the types char (1 byte), char[n] (array of n chars), word (2-byte unsigned int), short (2-byte signed int), dword (4-byte unsigned int), long (4-byte signed int) and float (4-byte real), and is supposedly in the following format. I am reading the data file into MATLAB with fopen, fread, etc., but the values I am getting are not what I expect.
Format:
char[8] <-- outputs 8 ASCII values that spell the correct string identifier
dword <-- version of the data files, msw-major version, lsw-minor version (have tried reading as 1 uint32 and 2 uint16's)
dword
dword
dword
dword <-- number of window displays in program
displayinfo[8] <-- contains display window params in the following structure: (not sure what data type this is)
dword
dword
dword
dword
dword
dword
dword
dword
dword
dword
dword
dword
dword
(end of display window params; some are specified as must be a number in [0,3] and they aren't coming out like that)
char[16]
word <-- supposed to be year data was collected (2013) but coming up as 0
Code:
fid = fopen('MIC1.001','rb');
fileIdentifier = fread(fid, 8,'char');
dataFileMajorVersion = fread(fid,1,'uint16');
dateFileMinorVersion = fread(fid,1,'uint16');
numModules = fread(fid,1,'uint32');
fread(fid,1,'uint32'); % value not used
numSwipesCollected = fread(fid,1,'uint32');
numWindowDisplays = fread(fid,1,'uint32');
% display info vars:
displayType = [];
moduleNumber = [];
channelNumber = [];
beginningBar = [];
endBar = [];
vertExpFactor = [];
voltageOffset =[];
isGridEnabled = [];
isEngineeringUnitEnabled = [];
colorOfDisplay = [];
multiChannelIndex = [];
numChannelsForMultiChannelDisp = [];
multiChannelDispStyle = [];
% or does it go through loop for all 8 whether or not there are 8 displays??
for i=1:numWindowDisplays
displayType = [fread(fid,1,'uint32'); displayType];
moduleNumber = [fread(fid,1,'uint32'); moduleNumber];
channelNumber = [fread(fid,1,'uint32'); channelNumber];
beginningBar = [fread(fid,1,'uint32'); beginningBar];
endBar = [fread(fid,1,'uint32'); endBar];
vertExpFactor = [fread(fid,1,'uint32'); vertExpFactor];
voltageOffset =[fread(fid,1,'uint32'); voltageOffset];
isGridEnabled = [fread(fid,1,'uint32'); isGridEnabled];
isEngineeringUnitEnabled = [fread(fid,1,'uint32'); isEngineeringUnitEnabled];
colorOfDisplay = [fread(fid,1,'uint32'); colorOfDisplay];
multiChannelIndex = [fread(fid,1,'uint32'); multiChannelIndex];
numChannelsForMultiChannelDisp = [fread(fid,1,'uint32'); numChannelsForMultiChannelDisp];
multiChannelDispStyle = [fread(fid,1,'uint32'); multiChannelDispStyle];
end
fread(fid,1,'uint32'); % value only used internally
fread(fid,16,'char'); % unused parameter for future use
yearOfDataCollection = fread(fid,1,'uint16');
I would recommend first reading in all the data at once as a byte array. You'll be able to debug the problem much faster:
fid = fopen('MIC1.001','rb');
data = fread(fid);
fclose(fid);
% could look at it as all chars, just for debugging
char(data)'
The data is read in as a large array of bytes. Then you would go through and parse the bytes by casting them appropriately. You may want to try just testing your method first:
% create a binary file to follow the same format as the specified file
fid = fopen('test.dat','wb');
% Put in 8 character string for file ID
aa = 'myfile0';
fwrite(fid,aa);
% Null terminate it, (I guess)
fwrite(fid,0);
% write the 2 byte file major revision
aa = 1000;
fwrite(fid,aa,'uint16');
% write the 2 byte file minor revision
aa = 5000;
fwrite(fid,aa,'uint16');
% write the 4 byte number of modules
aa = 65536;
fwrite(fid,aa,'uint32');
fclose(fid);
% read the entire file in
fid = fopen('test.dat','rb');
A = fread(fid);
fclose(fid);
% Try to read the file id
disp(char(A(1:8))')
% Try to read the file major revision
majorByte1 = dec2hex(A(9));
majorByte2 = dec2hex(A(10));
% see if it needs byte swapped
tmp1 = hex2dec([majorByte1 majorByte2]);
tmp2 = hex2dec([majorByte2 majorByte1]);
fprintf(1,'File Major: %d ? = %d\nFile Major: %d ? = %d\n',1000,tmp1,1000,tmp2);
For output I get:
myfile0
File Major: 1000 ? = 3715
File Major: 1000 ? = 1000
So, for me, I'll need to byte swap the data, maybe you do too? :-)
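If you go the "read everything as bytes" route, typecast can reinterpret a run of bytes as a wider integer without the hex round trip. A small sketch, assuming the bytes in the file are little-endian (as in the test file above) and using made-up byte values for 1000 and 5000. typecast uses the host byte order, so the sketch swaps only when the host is big-endian:

```matlab
[~, ~, endian] = computer;          % 'L' on little-endian hosts, 'B' on big-endian

bytes = uint8([232 3 136 19]);      % 1000 and 5000 stored as little-endian uint16

% Reinterpret each pair of bytes as one uint16 in host byte order.
major = typecast(bytes(1:2), 'uint16');
minor = typecast(bytes(3:4), 'uint16');

% If the host order differs from the file order, swap the bytes.
if endian == 'B'
    major = swapbytes(major);
    minor = swapbytes(minor);
end

fprintf('major = %d, minor = %d\n', major, minor);
```

Note that fread(fid) returns doubles, so the real data would need a uint8(...) conversion (or fread(fid, inf, '*uint8')) before typecast.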
EDIT
To do this using fread, from the Matlab docs:
A = fread(fileID, sizeA, precision, skip, machineformat) reads data
with the specified machineformat.
For your case:
A = fread(fid,2,'uint16',0,'b');
I'm assuming you're on a little-endian machine. To read the data as little-endian instead, just use 'l' instead of 'b'.
The number of digits for the dec2hex conversion should be specified (0x0A0C != 0xAC):
majorByte1 = dec2hex(A(9), 2);
majorByte2 = dec2hex(A(10), 2);
% see if it needs byte swapped
tmp1 = hex2dec([majorByte1 majorByte2]);
tmp2 = hex2dec([majorByte2 majorByte1]);
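The same check can be done with plain arithmetic, which sidesteps the padding issue entirely. A short sketch using the two bytes that encode 1000 in the test file (232 and 3); the variable names are made up:

```matlab
A = [232; 3];                      % two bytes as fread(fid) returns them (doubles)

% Without a digit count, dec2hex drops the leading zero of small bytes.
unpadded = dec2hex(A(2));          % '3'
padded   = dec2hex(A(2), 2);       % '03'

% Combine the bytes arithmetically instead of via hex strings.
asLittle = A(1) + 256*A(2);        % first byte is least significant
asBig    = 256*A(1) + A(2);        % first byte is most significant

fprintf('little-endian: %d, big-endian: %d\n', asLittle, asBig);
```

With padding, the unswapped value comes out as 59395 rather than the 3715 shown in the earlier output, which is what a true big-endian reading of those two bytes gives.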

How can I import multiple csv files with selected columns using textscan?

I have a large number of csv files to process. I only want selected columns from each file, and I want to load all the files from a certain folder and output them as one combined file. Here is my code, which runs with errors. Could anyone help me solve this problem?
data_directory = 'C:\Users\...\data';
numfiles = 17;
for n = 1:numfiles
filepath = [data_directory,'data_', num2str(n),'_output.csv'];
fid = fopen (filepath, 'rt');
wanted_columns= [2 3 4 5 10 11 12 13 14 15 16 17 35 36 41 42 44 45 59 61];
format = [];
columns = 109;
for i = 1 : columns;
if any (i == wanted_columns)
format = [format '%s'];
else
format = [format '%*s'];
end
end
data = textscan(fid, format, 'Delimiter',',','HeaderLines',1);
fclose(fid);
end
I think you should check whether the file is opened correctly. The error message seems to indicate that this is not the case. If it is not, check if the filepath is correct.
fid = fopen (filepath, 'rt');
if fid == -1
error('Failed to open file');
end
If the error is thrown here, you know that there was a problem with 'fopen'.
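fopen can also tell you why it failed through its second output, which is often quicker than guessing. A small sketch; the file name is made up and assumed not to exist:

```matlab
% Ask fopen for the system error message as well as the file identifier.
[fid, errmsg] = fopen('no_such_file_for_demo.csv', 'rt');   % hypothetical name

if fid == -1
    % errmsg holds the reason the open failed
    fprintf('fopen failed: %s\n', errmsg);
else
    fclose(fid);
end
```

In the loop from the question you could pass filepath and errmsg to error(...) so the message names the file that could not be opened.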
Of course, I don't know which files are on your computer, but I assume the '...' in the filename is not in your actual MATLAB file, only in your question on SO.
But could it be that you repeat the word 'data', while the actual filename only contains 'data' once? Your code will now produce filenames like 'C:\Users\...\datadata_1_output.csv'. Maybe 'data' should be removed from data_directory or from filepath = ...?
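Another possibility is that only the file separator is missing between the folder and the file name. fullfile inserts it for you. A sketch with a placeholder folder name, since the real one is elided in the question:

```matlab
data_directory = 'data';   % placeholder; the real folder is elided in the question
n = 3;                     % example file index

% fullfile joins the parts with the platform's file separator.
filepath = fullfile(data_directory, sprintf('data_%d_output.csv', n));
disp(filepath)
```

Printing filepath once inside the loop is a cheap way to confirm that it matches what is actually on disk.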
Here is another way to set up the format string in a vectorized manner:
fcell = repmat({'%*s '},1,n_columns);
fcell(wanted_columns) = {'%s '};
formatstr = [fcell{:}];
Note that format is a built-in function in MATLAB, so it is better not to use it as a variable name.
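To see the vectorized format string in action without any files, textscan can also parse a character vector directly. A minimal sketch with invented data, 5 columns instead of 109, and made-up variable names:

```matlab
nColumns = 5;                 % the question has 109 columns
wanted   = [2 4];             % columns to keep (demo values)

% Skip every column by default, then switch the wanted ones to '%s'.
fcell = repmat({'%*s'}, 1, nColumns);
fcell(wanted) = {'%s'};
formatStr = [fcell{:}];       % '%*s%s%*s%s%*s'

% Invented CSV text: one header line followed by two data lines.
txt = sprintf('h1,h2,h3,h4,h5\na,b,c,d,e\nf,g,h,i,j');
data = textscan(txt, formatStr, 'Delimiter', ',', 'HeaderLines', 1);

disp(data{1}')                % values from column 2
disp(data{2}')                % values from column 4
```

To combine several files, one option is to collect each file's data in a cell array indexed by n and concatenate the columns after the loop.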