When I import my data (numerical matrix of NYSE stock data), the data isn't loaded properly:
the final part of my CSV data disp() displayed should be -
9.76, 10, 9.99, 9.94, 9.97,9.944,9.95,10,9.956,10.01
What I get when I call the disp(importDataResult) is -
0.0100 0.0099 0.0099 0.0100 etc..
Have you got any idea why when I import the data it is transformed completely? The below link contains my zipped CSV file so you can see the problem (I completely understand if you can't be bothered checking this out, but I'd be interested to know if the same problem applies to others' MATLAB / computers).
https://www.sendspace.com/file/slif0y
The code I'm using is:
function [ c ] = CreateCov_Test()
c = csvread('nyse_data_matrix_no_tags.csv');
disp(c);
end
Here is a screenshot of the issue:
https://s32.postimg.org/os74qfrlx/matlab_screen.png
Thank you very much!
Matlab is not transforming any data. The configuration of who Matlab is displaying variables is controlled with format, the default being format short.
An excerpt from the documentation:
format may be used to switch between different output display formats of all float variables as follows:
format SHORT Scaled fixed point format with 5 digits.
So what does Scaled fixed point format with 5 digits mean, well lets see
>> a = [0.1 10000 100]
>> disp(a)
1.0e+04 *
0.0000 1.0000 0.1000
Note the 1.0e+04 *, its a multiplier for all data in the matrix. When displaying a large matrix, this multiplier is often hidden (as in your case), which admittedly can be rather confusing.
Related
I'm trying to analyze a large text log file (11 GB). All of the data are numerical values, and a snippit of the data are listed below.
-0.0623 0.0524 -0.0658 -0.0015 0.0136 -0.0063 0.0259 -0.003
-0.0028 0.0403 0.0009 -0.0016 -0.0013 -0.0308 0.0511 0.0187
0.0894 0.0368 0*0243 0.0279 0.0314 -0.0212 0.0582 -0.0403 //<====row 3, weird ASCII char * is present
-0.0548 0.0132 0.0299 0.0215 0.0236 0.0215 0.003 -0.0641
-0.0615 0.0421 0.0009 0.0457 0.0018 -0.0259 0.041 0.031
-0.0793 0.01 //<====row 6, the data is misaligned here
0.0278 0.0053 -0.0261 0.0016 0.0233 0.0719
0.0143 0.0163 -0.0101 -0.0114 -0.0338 -0.0415 0.0143 0.129
-0.0748 -0.0432 0.0044 0.0064 -0.0508 0.0042 0.0237 0.0295
0.040 -0.0232 -0.0299 -0.0066 -0.0539 -0.0485 -0.0106 0.0225
Every set of data consists of 2048 rows, and each row has 8 columns.
Here comes the problem: when the data is transformed from binary files to text files using the logging software, a lot of the data are distorted. Take the data above as an example, row 3 column 3 there is a " * " present in the data. And in row 6, one row of data is broken into two rows, one row has 2 data and the other row has 6 data.
I am currently struggling reading this large text files using MATLAB. Since the file itself is so large, I can only use textscan to read the data.
for example:
C = textscan(fd,'%f%f%f%f%f%f%f%f',1,'Delimiter','\t ');
However, I cannot use '%f' as format since there contains several weird ASCII characters such as " * " or " ! " in the data. These distorted data cannot be treated as floating point numbers.
So I choose to use:
C = textscan(fd,'%s%s%s%s%s%s%s%s',1,'Delimiter','\t ');
and then I transfer those strings into doubles to be processed. However, this encounters the problem of broken lines. When it reaches row 6, it gives:
[-0.0793],[0.01],[],[],[],[],[],[];
[0.0278],[0.0053],[-0.0261],[0.0016],[0.0233],[0.0719],[0.0143],[0.0163];
while it is supposed to look like:
-0.0793 0.01 0.0278 0.0053 -0.0261 0.0016 0.0233 0.0719 ===> one set
0.0143 0.0163 -0.0101 -0.0114 -0.0338 -0.0415 0.0143 0.129 ===> another set
Then the data will be offset by one row and the columns are messed up.
Then I try to do:
C = textscan(fd,'%s',1,'Delimiter','\t ');
to read one element at one time. If this element is NaN, it will textscan the next one until it sees something other than NaN. Once it obtains 2048 non-empty elements, it will store those 2048 data into a matrix to be processed. After being processed, this matrix is cleared.
This method works well for the first 20% of the whole file....BUT,
since the file itself is 11GB which is very large, after reading about 20% of the file, MATLAB shows:
Error using ==> textscan
Out of memory. Type HELP MEMORY for your options.
(some people suggest using %f while doing textscan, but it won't work because there are some ASCII chars which are causing problem)
Any suggestions to deal with this file?
EDIT:
I have tried:
C = textscan(fd,'%s%s%s%s%s%s%s%s',2048,'Delimiter','\t ');
Although the result is incorrect due to the misalignment of data (like row 6), this code indeed does not cause the "Out of memory" problem. Out of memory problem only occurs when I try to use
C= textscan(fd,'%s',1,'Delimiter','\t ').
to read the data one entry by one entry. Anyone has any idea why this memory problem happens?
This might seem silly, but are you preallocating an array for this data? If the only issue (as it seems to be) with your last function is memory, perhaps
C = zeros(2048,8);
will alleviate your problem. Try inserting that line before you call textscan. I know that MATLAB often exhorts programmers to preallocate for speed; this is just a shot in the dark, but preallocating memory may fix your issue.
Edit: also see this MATLAB Central discussion of a similar issue. It may be that you will have to run the file in chunks, and then concatenate the arrays when each chunk is finished.
Try something like the code below. It preallocates space and reads numRow* numColumns from the textfile at a time. If you can initialize the bigData matrix then it shouldn't run out of memory ... I think.
Note: That I used 9 for #rows since your sample data had 9 complete rows you will want to use 2024 I presume. This also might need some end of file checks etc. and some error handling. Also any numbers w/ odd ascii text in the will turn into NaN.
Note 2: This still might not work or be very very slow. I had a similar problem reading large text files (10-20GB) that were slightly more complicated. I had to abandon reading them in Matlab. Instead I used Perl for an initial pass which output to binary. Then used Matlab to read the binary back into data. The 2 step approach ended up saving lots and lots of runtime. Link in case you are interested
function bigData = readData(fileName)
fid = fopen(fileName,'r');
numBlocks = 1; %Somehow determine # of blocks??? not sure if you know of a way to determine this
r = 9; %Replace 9 with your size 2048
c = 8;
bigData = zeros(r*numBlocks,8);
for k = 1:numBlocks
[dataBlock, rFlag] = readDataBlock(fid,r,c);
if rFlag
%Or some kind of error.
break
end
bigData((k-1)*r+1:k*r,:) = dataBlock;
end
fclose(fid);
function [dataBlock, rFlag]= readDataBlock(fid,r,c)
C= textscan(fid,'%s',r*c,'Delimiter','\t '); %replace 9*8 by the size of the data block.
dataBlock = [];
if numel(C{1}) == r*c
dataBlock = reshape(str2double(C{1}),9,8);
rFlag = false;
else
rFlag = true;
% ?? Throw an error or whatever is appropriate
end
While I don't really know how to solve your problems with the broken data, I can give some advice how to process big text data. Read it in batches of multiple lines and write the output directly to the hard drive. In your case the second might be unnecessary, if everything is working you could try to replace data with a variable.
The code was originally written for a different purpose, I deleted the parser for my problem and replaced it with parsed_data=buffer; %TODO;
outputfile='out1.mat';
inputfile='logfile1';
batchsize=1000; %process 1000 lines at once
data=matfile(outputfile,'writable',true); %Simply delete this line if you dant "data" to be a variable in your memory
h=dir(inputfile);
bytes_to_read=h.bytes;
data.out={};
input=fopen(inputfile);
buffer={};
while ftell(input)<bytes_to_read
buffer=[buffer,textscan(input,'%s',batchsize-numel(buffer))];
parsed_data=buffer; %TODO;
data.out(end+1,1)={parsed_data};
buffer={}; %In the default case your empty your buffer her.
%If a incomplete line read here partially, leave it in the buffer.
end
fclose(input);
faced with the problem, txt file have 1000 double values, but when I have loaded this file, I get 2000, other values are zero values.
Help me please, I just started learning Matlab.
clear all
file = 'C:\data.txt';
x = load(file);
x = sort(x(:));
x
x = 1.0e+010 *
0.0000
0.0000
0.0000
0.0000
...
Use the Matlab function textscan. This function goes line by line and parses data into C as a cell array. You can also control the number of lines it reads in based on the number N.
C = textscan(fileID,formatSpec,N)
The Matlab website has many examples on how to call this function.
Does anyone here use "MATLAB Tensor Toolbox Version 2.5 (Sandia National Laboratories)" with Matlab to operate tensors? I would like to know how to save the tensor/sptensor from it into a text file.
I found a function "export_data()" but it supports "tensor" type and doesn't work on "sptensor". I think it would be perfect if there is a way to write a "sptensor" into text files with the format like it shows via "disp(sptensor)" function. e.g.
(57,211,138) 0.0000
(57,211,141) 1.1063
(57,211,142) 1.1063
(57,211,143) 0.0000
(57,211,144) 0.6213
I think either I can save the whole sptensor in one time, or we can iterate it and save elements one by one. Do you have any idea to achieve this? How you store the (reconstructed) tensor data for further process?
I don't know this toolbox, but you can write it manually using the following snipet of code in a loop:
fid = fopen('filename.txt','a');
for n = 1:4
x = randn(4,1);
fprintf(fid,'(%f,%f,%f)%f',x(1),x(2),x(3),x(4));
fprintf(fid,'\n');
end
fclose(fid);
You can use the export_data(tensor,"filename.txt"). The export_data function exports the tensor data in simple ASCII text format. The tensor can be any tensor like sptensor, ktensor, ttensor etc.
I want to plot some data in Matlab, but I'm having problems with properly displaying time.
Time is in format:
HH:MM:SS.miliseconds
So for example: 11:16:41.835
I read my .txt file (tab delimited), and now I need to plot data. All data are the same length, and I reckon there is a problem with the time format. Any advice? plot(time, data1) doesn't work either.
Consider using matlab's built-in datavec, for example:
dvec = datevec('11:21:02.647', 'HH:MM:SS.FFF')
dvec = 1.0e+003 *
2.0100 0.0010 0.0010 0.0110 0.0210 0.0026
More info can be found here
For some reason, the hdf5write method in MATLAB is automatically converting my row vectors to column vectors when I re-read them:
>> hdf5write('/tmp/data.h5','/data',rand(1,10));
>> size(hdf5read('/tmp/data.h5','/data'))
ans =
10 1
However, for a row vector in the third dimension, it comes back just fine:
>> hdf5write('/tmp/data.h5','/data',rand(1,1,10));
>> size(hdf5read('/tmp/data.h5','/data'))
ans =
1 1 10
How can I get hdf5write to do the right thing for row vectors? They should be coming back as 1 x 10, not 10 x 1.
edit the problem is slightly more complicated because I am using c-based mex to actually read the data later, instead of hdf5read. Moreover, the problem really is in hdf5write, and this is visible in the hdf5 files themselves:
>> hdf5write('/tmp/data.h5','/data',randn(1,10));
>> ! h5ls /tmp/data.h5
data Dataset {10}
that is, the data is saved as a 1-dimensional array in the hdf5 file. For comparison, I try the same thing with an actual 2-d matrix (to show what it looks like), a 1-d column vector, a 1-d vector along the third dimension, and, for kicks, try the V71Dimensions trick which is in the help for both hdf5read and hdf5write:
>> hdf5write('/tmp/data.h5','/data',randn(10,1)); %1-d col vector
>> ! h5ls /tmp/data.h5
data Dataset {10}
>> hdf5write('/tmp/data.h5','/data',randn(1,1,10)); %1-d vector along 3rd dim; annoying
>> ! h5ls /tmp/data.h5
data Dataset {10, 1, 1}
>> hdf5write('/tmp/data.h5','/data',randn(2,5)); %2-d matrix. notice the reversal in dim order
>> ! h5ls /tmp/data.h5
data Dataset {5, 2}
>> hdf5write('/tmp/data.h5','/data',randn(1,10),'V71Dimensions',true); %1-d row; option does not help
>> ! h5ls /tmp/data.h5
data Dataset {10}
So, the problem does seem to be in hdf5write. The 'V71Dimensions' flag does not help: the resultant hdf5 file is still a Dataset {10} instead of a Dataset {10,1}.
It's the reading that's an issue. From the help
[...] = hdf5read(..., 'V71Dimensions',
BOOL) specifies whether to change the
majority of data sets read from the
file. If BOOL is true, hdf5read
permutes the first two dimensions of
the data set, as it did in previous
releases (MATLAB 7.1 [R14SP3] and
earlier). This behavior was intended
to account for the difference in how
HDF5 and MATLAB express array
dimensions. HDF5 describes data set
dimensions in row-major order; MATLAB
stores data in column-major order.
However, permuting these dimensions
may not correctly reflect the intent
of the data and may invalidate
metadata. When BOOL is false (the
default), the data dimensions
correctly reflect the data ordering as
it is written in the file — each
dimension in the output variable
matches the same dimension in the
file.
Thus:
hdf5write('/tmp/data.h5','/data',rand(1,10));
size(hdf5read('/tmp/data.h5','/data','V71Dimensions',true))
ans =
1 10
I'm affraid for this you will have to use the low-level HDF5 API of Matlab.
In Matlab, the low-level API is available using for instance H5.open(...), H5D.write(...) and so on. The names correspond exactly to those of the C library (see the HDF5 doc). However there is a slight difference in the arguments they take, but the matlab help function will tell you everything you need to know...
The good news is that the Matlab version of the API is still less verbose than the C version. For instance, you don't have to close manually the datatypes, dataspaces, etc. since Matlab closes them for you when the variables go out of scope.