csvwrite with numbers larger than 7 digits - MATLAB

So, I have a script that's designed to parse through a rather large csv file to weed out a handful of data points. Three of the rows (out of 400,000+) in the file are listed below:
Vehicle_ID Frame_ID Tot_Frames Epoch_ms Local_X
2 29 1707 1163033200 8.695
2 30 1707 1163033300 7.957
2 31 1707 1163033400 7.335
What I'm trying to do here is take previously filtered data points like this and write them to another csv file using csvwrite. csvread reads Epoch_ms as a double, which MATLAB displays as 1.1630e+09; that is sufficient for reading, since the full value is maintained internally for use in MATLAB operations.
However, during csvwrite that display precision becomes the stored precision, and each data point is written out as 1.1630e9.
How do I get csvwrite to handle the number with greater precision?

Use dlmwrite with a 'precision' argument, such as '%i'. Its default delimiter is a comma, just like a CSV file:
dlmwrite(filename, data, 'precision', '%i')
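A minimal sketch of the round trip, assuming the filtered points are in a numeric matrix named filtered read from a hypothetical input.csv. One caveat: since Local_X is fractional, '%i' would make MATLAB fall back to exponential notation for those values, so a format like '%.10g' may be more robust here, as it keeps all ten digits of Epoch_ms while still printing the fractional columns sensibly:
filtered = csvread('input.csv', 1, 0);  % skip the one-line header row
dlmwrite('output.csv', filtered, 'precision', '%.10g');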

Related

converting large numerical value to power 10 format

Is there a way to convert a large numerical value to ×10-to-the-power format in SAS?
E.g.: 88888888383383838383 to 8.9*10^19
Thanks in advance.
You can use the format ew., where w is the total number of output characters. Using e8. will result in 8.9E+19. But beware that SAS stores values internally as floating point, with a maximum of 8 bytes, so your example value would be rounded to 88,888,888,383,383,830,528 no matter how it's formatted.
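For comparison, the same idea expressed in MATLAB (the language of the rest of this page): a width-limited scientific format only rounds the displayed text, while the stored double has already been rounded to the nearest representable value.
x = 88888888383383838383;  % stored as a double, so already rounded internally
sprintf('%.1E', x)         % returns '8.9E+19'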

Simple compression algorithm in C++ interpretable by matlab

I'm generating ~1 million text files containing arrays of doubles, tab delimited (these are simulations for research). Example output below. I expect the full set of a million text files to total ~5 TB, which is unacceptable. So I need to compress.
However, all my data analysis will be done in MATLAB, and every MATLAB script will need to access all million of these text files. I can't decompress the whole million using C++ and then run the MATLAB scripts, because I lack the HD space. So my question is: are there some very simple, easy-to-implement algorithms or other ways of reducing my text file sizes so that I can write the compression in C++ and read it in MATLAB?
example text file
0.0220874 0.00297818 0.000285954 1.70E-05 1.52E-07
0.0542912 0.00880725 0.000892849 6.94E-05 4.51E-06
0.0848582 0.0159799 0.00185915 0.000136578 7.16E-06
0.100415 0.0220033 0.00288016 0.000250445 1.38E-05
0.101889 0.0250725 0.00353148 0.000297856 2.34E-05
0.0942061 0.0256 0.00393893 0.000387219 3.01E-05
0.0812377 0.0238492 0.00392418 0.000418365 4.09E-05
0.0645259 0.0206528 0.00372185 0.000419891 3.23E-05
0.0487525 0.017065 0.00313825 0.00037539 3.68E-05
If it matters: the complete text files represent joint probability mass functions, so they sum to 1. And I need lossless compression.
UPDATE: Here is an IDIOT'S guide to writing binary in C++ and reading it in MATLAB, with some very basic explanation along the way.
C++ code to write a small array to a binary file.
#include <iostream>
#include <cstdio>   // for FILE, fopen, fwrite, fclose
using namespace std;
int main()
{
float writefloat;        // buffer for writing one value at a time
const int rows=2;
const int cols=3;
float JPDF[rows][cols];  // the 2x3 array we'll write out
JPDF[0][0]=.19493;
JPDF[0][1]=.111593;
JPDF[0][2]=.78135;
JPDF[1][0]=.33333;
JPDF[1][1]=.151535;
JPDF[1][2]=.591355;
JPDF is a 2x3 array of type float that I save 6 values to.
FILE * out_file;
out_file = fopen ( "test.bin" , "wb" );
The first line declares out_file as a pointer to a FILE structure. The second line, fopen, says make a new file for writing (the 'w' of the second argument) and make it a binary file (the 'b' of 'wb').
fwrite(&rows,sizeof(int),1,out_file);
fwrite(&cols,sizeof(int),1,out_file);
Here I encode the size of my array (# rows, # cols). Note that we pass fwrite the addresses of rows and cols (the & takes the address), not the variables themselves. The second parameter is the size in bytes of each element; since rows and cols are both ints, I use sizeof(int). The third parameter, 1, says write one such element. And out_file is our pointer to the file we're writing to.
for (int i=0; i<cols; i++)
{
  for (int j=0; j<rows; j++)
  {
    writefloat=JPDF[j][i];   // walk down each column (column-major order)
    fwrite (&writefloat, sizeof(float), 1, out_file);
  }
}
fclose (out_file);
return 0;
}
Now I'll iterate through my array and write each value in bytes to my file. The indexing looks a little backwards in that the inner loop iterates down each column rather than across each row. We'll see why in a sec. Again, I'm passing the address of writefloat, which takes on the value of the current array element in each iteration. Since each array element is a float, I'm using sizeof(float) here instead of sizeof(int).
Just to be incredibly, stupidly clear, here's a diagram of how I think of the file we've just created.
[4 bytes: rows][4 bytes: cols][4 bytes: JPDF[0][0]][4 bytes: JPDF[1][0]] ...
[4 bytes: JPDF[1][2]]
..where each chunk of bytes is written in binary (0s and 1s).
To interpret in MATLAB:
FID=fopen('test.bin');
sizes=fread(FID,2,'int')
FID is the file identifier returned by fopen; it works like a handle to the open file. Then we use fread, which operates very similarly to C's fread. The 'int' tells the function how to interpret each chunk of bytes it reads. So sizes=fread(FID,2,'int') says: from the file FID, read 2 values of type int and return them in vector form. Now sizes(1)=rows and sizes(2)=cols.
s=fread(FID,[sizes(1) sizes(2)],'float')
The next part wasn't completely clear to me originally; I thought I'd have to tell fread to skip the 'header' of my binary file that contains the row/col info. However, MATLAB maintains a file position indicator, so the next fread picks up where the last one left off. So now I read out the rest of the binary file, using the fact that I know the dimensions of the array. Note that while the second parameter [M,N] is [rows,cols], fread fills the result in column-major order, which is why we wrote the array data in column order.
The one caveat is that I think I can only use the MATLAB precisions 'int' and 'float' if the architecture the C++ program was compiled for is concordant with MATLAB's (e.g., both are 64-bit, or both are 32-bit). But I'm not sure about this.
The output is:
sizes =
2
3
s =
0.194930002093315 0.111593000590801 0.781350016593933
0.333330005407333 0.151535004377365 0.59135502576828
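One way to sidestep the word-size caveat above is to be explicit about widths on both ends: write fixed-width types (e.g. int32_t and float, via <cstdint>) from C++, and use MATLAB's fixed-width precision strings when reading. A minimal sketch for the same test.bin layout:
FID = fopen('test.bin', 'r');
sizes = fread(FID, 2, 'int32');                  % two 32-bit integers
s = fread(FID, [sizes(1) sizes(2)], 'float32');  % 32-bit floats, column-major
fclose(FID);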
To do better than four bytes per number, you need to determine to what precision you need these numbers. Since they are probabilities, they are all in [0,1]. You should be able to specify a precision as a power of two, e.g. that you need to know each probability to within 2^-n of the actual value. Then you can simply multiply each probability by 2^n, round to the nearest integer, and store just the n bits in that integer.
In the worst case, I can see that you are never showing more than six digits for each probability. You can therefore code them in 20 bits, assuming a constant fixed precision past the decimal point. Multiply each probability by 2^20 (1,048,576), round, and write out 20 bits to the file. Each probability will take 2.5 bytes. That is smaller than the four bytes for a float value.
And either way is way smaller than the average of 11.3 bytes per value in your example file.
You can get even better compression than that if you can exploit known patterns in your data, assuming there are any. I see that in your example, on each line the values go down by some factor at each step. If that is real and not just an artifact of the generation of the example, then you can successively use fewer bits for each sample. Also, if the first sample is really always less than 1/8, then you can drop the top three bits off that one, since those bits would always be zero. If the second column is always less than 1/32, you can drop the first five bits off all of those. And so on. Assuming that the magnitudes in the example are maximums across all of the data sets (obviously not true, but just using that as an illustrative case), and assuming you need six decimal digits after the decimal point, I could code each row of six values in 50 bits, for an average of a little over one byte per probability.
And for one last smidgen of compression, since the values add to one, you don't have to store the last value.
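MATLAB's fread and fwrite can actually pack and unpack such odd bit widths directly via the 'ubitN' precisions, which makes the 20-bit scheme easy to prototype. A sketch, assuming all values are probabilities strictly below 1 (a value of exactly 1 would need 21 bits, so such values would have to be saturated to 2^20-1 or handled specially):
p = [0.0220874 0.00297818 0.000285954];  % sample values from the question
q = round(p * 2^20);                     % quantize to 20-bit integers

fid = fopen('probs.bin', 'w');
fwrite(fid, q, 'ubit20');                % pack 20 bits per value
fclose(fid);

fid = fopen('probs.bin', 'r');
q2 = fread(fid, numel(p), 'ubit20');     % unpack
fclose(fid);
p2 = q2 / 2^20;                          % recovered to within 2^-21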
Matlab can read binary files. Why not save your files as binary instead of text?
Saving each number as a float would only require 4 bytes (a float is 4 bytes on essentially any platform); you could use doubles, but it appears that you aren't using the full double resolution anyway. Under your current scheme, each digit of every number consumes a byte of space. All of your numbers are easily 4+ characters long, some as long as 10 characters. Implementing this change should cut down your file sizes by more than 50%.
Additionally, you might consider using a more elegant data format like HDF5, which both supports compression and is supported by MATLAB.
Update:
There are lots of examples of how to write binary files in C++, just google it.
Additionally, to read a binary file into MATLAB, simply use fread.
The difference between representing a number as ASCII vs binary is really simple. All files are written in binary; the difference is in how that information gets interpreted. Text files are generally read as ASCII, which provides a nice mapping between 8-bit words and characters. When you see a string like "255", what you have is an array of bytes where each byte encodes one character of the string. However, when you are storing numbers, it's really wasteful to store each digit in a separate byte. A single byte can store values between 0-255. So why use three bytes to store the string "255" when I can use a single byte to store the value 255?
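A tiny MATLAB illustration of that point:
uint8('255')   % returns [50 53 53]: three bytes, one per character
uint8(255)     % a single byte holding the value 255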
You can always go ahead and zip everything using a standard library like zlib. Afterwards you could use a custom dll written in C++ that unzips your data in chunks you can manage. So basically:
Data --> Zip --> Dll (Loaded by Matlab via LoadLibrary) --> Matlab
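A sketch of the MATLAB end of that pipeline, assuming the C++ side used zlib's gz* functions to emit ordinary .gz files (the directory and file names here are hypothetical). MATLAB's built-in gunzip avoids the custom DLL entirely, and deleting each decompressed file before moving on keeps the disk footprint at one file at a time:
files = dir('results/*.gz');
for k = 1:numel(files)
    txtFile = gunzip(fullfile('results', files(k).name), tempdir);
    data = dlmread(txtFile{1}, '\t');   % tab-delimited doubles
    % ... analyze data here ...
    delete(txtFile{1});                 % free the space before the next file
end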

reading in text file and organising by lines using MATLAB

I want to read in a text file (using matlab) with data that is not in a convenient matlab matrix form. This is an example:
{926377200,926463600}
[(48, 13), (75, 147), (67, 13)]
{926463600,926550000}
[(67, 48)]
{926550000,926636400}
[]
{926636400,926722800}
[]
{926722800,926809200}
...
All I would like is a vector of all the numbers separated by commas. The numbers always come in pairs, and the odd lines' values are of much greater magnitude each time, so the two kinds can be differentiated by logic later.
I cannot figure out how to use textscan or the other methods. What makes this a bit tricky is that the MATLAB methods require a defined format for the strings separated by delimiters, and here the even lines have an unrestricted number of integer pairs.
You can do this with textscan. You just need to specify the {} etc as whitespace.
For example, if you put your sample data into the file tmp.txt (in the current directory) and run the following:
fid = fopen('tmp.txt','r');
if fid > 0
numbers = textscan(fid,'%f','whitespace','{,}[]() ');
fclose(fid);
numbers = numbers{:}
end
you should see
numbers =
926377200
926463600
48
13
75
147
67
13
926463600
926550000
67
48
926550000
926636400
926636400
926722800
926722800
926809200
Alternatively, just iterate through each character (use fscanf or fread or whatever). If the character is a digit, append it to the number you are building; if it is not, convert what you have so far (e.g., with str2num), discard the delimiter, and start building a new number when you encounter the next digit.
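A sketch of that idea that lets regexp do the character walking instead, assuming the data is in tmp.txt as above (all the numbers in this format are unsigned integers):
txt = fileread('tmp.txt');
numbers = str2double(regexp(txt, '\d+', 'match'))';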

read and rewrite in matlab

How do I write a text file in the same format that it is read in MATLAB?
I looked, and my question is almost the same as the question above.
I want to read in a file which is 84641 x 175.
I want to make a new .txt file that is 84641 x 40, deleting the rest of the columns.
I also have to rewrite the dates. The date is in the first column in the format 6/26/2010, and the time is in the second column in the format 00:00:04.
When I use the code from the question above, I keep getting this error:
??? Error using ==> reshape
Product of known dimensions, 181, not divisible into total number of elements, 14812175.

Error in ==> write at
data = reshape(data{1},N+6,[])';
When I comment this line out, I get errors in the fprintf statements for the date and data writes instead.
Any ideas??
thanks
As the author of the accepted answer in the question you link to, I'll try to explain what I think is going wrong.
The code in my answer is designed to read data from a file which has a date XX/XX/XXXX in the first column, a time XX:XX:XX in the second column, and N additional columns of data.
You list the number of elements in data as 14812175, which is evenly divisible by 175. This implies that your input data file has 2 columns for the date and time, then 169 additional columns of data. This value of 169 is what you have to use for N. When the date and time columns are read from the input file they are broken up into 3 columns each in data (for a total of 6 columns), which when added to the 169 additional columns of data gives you 175.
After reshaping, the size of data should be 84641-by-175. The first 6 columns contain the date and time values. If you want to write the date, the time, and the first 40 columns of the additional data to a new file, you would only have to change one line of the code in my answer. This line:
fprintf(fid,', %.1f',data(i,7:end)); %# Output all columns of data
Should be changed to this:
fprintf(fid,', %.1f',data(i,7:46)); %# Output first 40 columns of data

How do you interchange the rows and columns of a matrix in MATLAB

I have an input data in Excel which has 2000 rows and 60 columns.
I want to read this data into MATLAB, but I need to interchange the rows and the columns so that the matrix will be 60 rows and 2000 columns. How can I do this in MATLAB? Excel only has 256 columns, which cannot hold 2000 columns.
You just need to transpose it: data = data'
To read in the data to MATLAB, start with the xlsread function. Then transpose it, as tzaman showed in his solution.
Your code might look like this:
[filename,path]=uigetfile();
data=xlsread([path,filename]);
data=data';
xlswrite([path,'myfile.xls'],data);
Which would save the transposed data as myfile.xls in the same directory as the original file.
EDIT: Excel 2003 is limited to 256 columns, which is why xlswrite is throwing an error. Have you tried using dlmwrite instead?
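A minimal sketch of that route; dlmwrite's plain-text output has no column limit (the .csv name here is arbitrary):
dlmwrite([path, 'myfile.csv'], data);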