Lossless data compression for extremely large data - Planetary artificial intelligence - lossless-compression

I want to create an environment for artificial intelligence, with planetary size. It will simulate underground life on a very large world. According to Wikipedia, planet Earth has a surface area of 510,072,000 Km^2, I want to create a square of similar proportions, maybe bigger. I will store one meter on each bit, where 0 means dirt and 1 means wall of dirt.
Let's first calculate how to store a single line of this square. One line would be 510072000000m and each byte can store 8 meters, so one line would be 59.38GB and the entire world would be 3.44PB. And I would like to add at least water and lava to each square meter, that would multiply the results by 2.
I need to compress this information with lossless data compression algorithms. I first tried a very direct approach with 7zip and I tried it with a smaller world, where one line would be 6375B. In theory, the world should be 6375^2B = 38.76MB, but when I try it I get a file of 155MB, I do not know why this difference. But when I compress it with 7Zip, I get a file of 40.1MB. It is a huge difference, and with that ratio I would convert my 3.44PB world file into a 912.21GB file.
My first thought is, why am I having such a large file, when maths tell me it should be smaller? Maybe the problem is the code, maybe the problem is that I had errors on maths. The code is as follows: (C#)
// 510072000000m each line = 63759000000B
const long SIZE = 6375;
// Create the new, empty data file.
string fileName = tbFile.Text;
FileStream fs = new FileStream(fileName, FileMode.Create);
// Create the writer for data.
BinaryWriter w = new BinaryWriter(fs);
// Use random numbers to fill the data
Random random = new Random();
// Write data to the file.
for (int i = 0; i < SIZE; i++)
{
for (int j = 0; j < SIZE; j++)
{
w.Write(random.Next(0,256));
}
}
w.Close();
fs.Close();
And the maths are so basic that if I did something wrong I cannot see it.
Can you give me any advice? Just focus on data compression, artificial intelligence is not a problem because I have experience with evolutionary algorithms and the world does not need to be real time, it can take all the time it needs.
Thank you all for your time.

I don't know about C#, but it seems you are currently writing 4 bytes each time (6375 * 6375 * 4 Bytes in MB = 155 MB). So I guess the Write method currently writes a 32 bits integer.

#Scharron has correctly answered the specifics of your question, but I think there is a more fundamental issue:
It is theoretically impossible to compress random data significantly. Indeed most compression algorithms will increase the storage size when given random input data. Perhaps the specifics of your AI algorithm will introduce some patterns than can be compressed, but if you are starting with truly random input data, you will have to store those multiple PB.
The reason you were seeing significant compression is that, as #Scharron pointed out, you were writing 3 zero bytes for every byte of data, leading to much more easily compressible data.

Related

Performance issue with reading DICOM data into cell array

I need to read 4000 or more DICOM files. I have written the following code to read the files and store the data into a cell array so I can process them later. A single DICOM file contains 128 * 931 data. But once I execute the code it took more than 55 minutes to complete the iteration. Can someone point out to me the performance issue of the following code?
% read the file information form the disk to memory
readFile=dir('d:\images','*.dcm');
for i=1:4000
% Read the information form the dicom files in to arrays
data{i}=dicomread(readFile(i).name);
info{i}=dicominfo(readFile(i).name);
data_double{i}=double(data{1,i}); % convert 16 bit data into double
first_chip{i}=data_double{1,i}(1:129,1:129); % extracting first chip data into an array
end
You are reading 128*931*4000 pixels into memory (assuming 16-bit values, that's nearly 1 GB), converting that to doubles (4 GB) and extracting a region (129*129*4000*8 = 0.5 GB). You are keeping all three of these copies, which is a terrible amount of data! Try not keeping all that data around:
readFile = dir('d:\images','*.dcm');
first_chip = cell(size(readFile));
info = cell(size(readFile));
for ii = 1:numel(readFile)
info{ii} = dicominfo(readFile(ii).name);
data = dicomread(info{ii});
data = (1:129,1:129); % extracting first chip data
first_chip{ii} = double(data); % convert 16 bit data into double
end
Here, I have pre-allocated the first_chip and info arrays. If you don't do this, the arrays will be re-allocated every time you add an element, causing expensive copies. I have also extracted the ROI first, then converted to double, as suggested by Rahul in his answer. Finally, I am re-using the DICOM info structure to read the file. I don't know if this makes a big difference in speed, but it saves the dicomread function some effort.
But note that this process will still take a considerable amount of time. Reading DICOM files is complex, and takes time. I suggest you read them all in once, then save the first_chip and info cell arrays into a MAT-file, which will be a lot faster to read in at a later time.
You can run the profiler to check which part of the code is taking up most of the time! But as far as it looks to me is that its your iteration size & the time taken is very much genuine. You could try and use parallel computing ( parfor loop) if you have a multicore processor, that should decrease the runtime significantly depending upon the number of cores that you have.
One suggestion would be to exctract the 'first chip data' first and then convert it to double, as the conversion process takes a significant amount of time.

How to write "Big Data" to a text file using Matlab

I am getting some readings off an accelerometer connected to an Arduino which is in turn connected to MATLAB through serial communication. I would like to write the readings into a text file. A 10 second reading will write around 1000 entries that make the text file size around 1 kbyte.
I will be using the following code:
%%%%%// Communication %%%%%
arduino=serial('COM6','BaudRate',9600);
fopen(arduino);
fileID = fopen('Readings.txt','w');
%%%%%// Reading from Serial %%%%%
for i=1:Samples
scan = fscanf(arduino,'%f');
if isfloat(scan),
vib = [vib;scan];
fprintf(fileID,'%0.3f\r\n',scan);
end
end
Any suggestions on improving this code ? Will this have a time or Size limit? This code is to be run for 3 days.
Do not use text files, use binary files. 42718123229.123123 is 18 bytes in ASCII, 4 bytes in a binary file. Don't waste space unnecessarily. If your data is going to be used later in MATLAB, then I just suggest you save in .mat files
Do not use a single file! Choose a reasonable file size (e.g. 100Mb) and make sure that when you get to that many amount of data you switch to another file. You could do this by e.g. saving a file per hour. This way you minimize the possible errors that may happen if the software crashes 2 minutes before finishing.
Now knowing the real dimensions of your problem, writing a text file is totally fine, nothing special is required to process such small data. But there is a problem with your code. You are writing a variable vid which increases over time. That may cause bad performance because you are not using preallocation and it may consume a lot of memory. I strongly recommend not to keep this variable, and if you need the dater read it afterwards.
Another thing you should consider is verification of your data. What do you do when you receive less samples than you expect? Include timestamps! Be aware that these timestamps are not precise because you add them afterwards, but it allows you to identify if just some random samples are missing (may be interpolated afterwards) or some consecutive series of maybe 100 samples is missing.

Why is saving image as a jpeg file not changing the image when saving it n times

I'm using matlab 2015. I know that by saving a jpeg file n time data is lost, because jpeg compression is lossy, but nothing changes when running code below
Code:
N = 100;
imfinfo('temp.jpg')
for i = 1: N
image = uint8(imread('temp.jpg'));
imwrite(image, 'temp.jpg','jpeg', 'Quality', 90,'Mode','lossy');
figure(0)
imshow(image)
imfinfo('temp.jpg')
end
When you save a jpeg you get a certain loss of information. While being more complicated in real, imagine the last digits of a floating point being cut off to reduce the amount of information, thus the file size. When you read it again, it is already a image which contains less information which can be stored in a jpeg without further loss. To stick with the simplification, cutting off the last digits again won't cause further trouble because they are already zero.

Modifying Every Column of Every Frame of a Video

I would like to write a program that will take a video as input, create an output video file, and will (starting after a certain number of frames), begin writing modified frames to the output file frame by frame.
The modification will need to work on individual columns of pixels, one at a time.
Viewing this as a problem to be solved in Matlab, with each frame as a matrix... I cannot think of a way to make this computationally tractable.
I am hoping that someone might be able to offer suggestions on how I might begin to approach this problem.
Here are some details, in case it helps:
I'm interested in transforming a video in the following way:
Viewing a video as a sequence of (MxN) matrices, where each matrix is called a frame:
Take an input video and create new file for output video
For each column V in frame(i) of output video, replace this column by
column V in frame(i + V - N) of the input video.
For example: the new right-most column (column N) of frame(i) will contain column N of frame(i + N - N) = frame(i)... so that there is no replacement. The new 2nd to right-most column (column N-1) of frame(i) will contain column N-1 of [frame(i+N-1-N) = frame(i-1)].
In order to make this work (i.e. in order to not run out of previous frames), this column replacement will start on frame N of the video.
So... This is basically a variable delay running from left to right?
As you say, you do have two ways of going about this:
a) Use lots of memory
b) Use lots of file access
Your memory requirements increase as a cube power of the size of the video - the size of each frame increases, AND the number of previous frames you need to have open or reference increases. I.e. doubling frame size will require 4x memory per frame, and 2x number of frames open.
I think that Matlab's memory management will probably make this hard to do for e.g. a 1080p video, unless you have a pretty high-end workstation. Do you? A quick test-read of a 720p video gives 1.2MB per frame. 1080p would then be approx 5MB per frame, and you would need to have 1920 frames open: approx 10GB needed.
It will be more efficient to load frames individually, if you don't have enough memory - otherwise you will be using pagefiles and that'll be slower than loading frame-by-frame.
Your basic code reading each frame individually could be something like this:
VR=VideoReader('My_Input_Video_Filename.avi');
VW=VideoWriter('My_Output_Video_Filename.avi','MPEG-4');
NumInFrames=get(VR,'NumberOfFrames');
InWidth=get(VR,'Width');
InHeight=get(VR,'Height');
OutFrame=zeros(InHeight,InWidth,3,'uint8');
for (frame=InWidth+1:NumInFrames)
for (subindex=1:InWidth)
CData=read(VR,frame-subindex);
OutFrame(:,subindex,:)=CData(:,subindex,:);
end
writeVideo(VW,OutFrame);
end
This will probably be slow, and I haven't fully checked it works, but it does use a minimum amount of memory.
The best case for minimum file acess is probably using a ring buffer arrangement and the maximum amount of memory, which would look something like this:
VR=VideoReader('My_Input_Video_Filename.avi');
VW=VideoWriter('My_Output_Video_Filename.avi','MPEG-4');
NumInFrames=get(VR,'NumberOfFrames');
InWidth=get(VR,'Width');
InHeight=get(VR,'Height');
CDatas=read(VR,InWidth);
BufferIndex=1;
OutFrame=zeros(InHeight,InWidth,3,'uint8');
for (frame=InWidth+1:NumInFrames)
CDatas(:,:,:,BufferIndex)=read(VR,frame);
tempindices=circshift(1:InWidth,[1,-1*BufferIndex]);
for (subindex=1:InWidth)
OutFrame(:,subindex,:)=CDatas(:,subindex,:,tempindices(subindex));
end
writeVideo(VW,OutFrame);
BufferIndex=mod(BufferIndex+1,InWidth);
end
The buffer indexing code may need some tweaking there, but something along those lines would be a minimum file access, maximum memory use solution.
For a given PC with more or less memory, you can implement somewhere in between these two as a solution (i.e. reading somewhere between 1 and all frames per iteration) as a best-case.
Matlab will be quite slow for this kind of task, but it will be a good way of getting your algorithm right and working out indexing bugs and that kind of thing. Converting to a compiled language would give a good increase in speed - I converted a Matlab script to a C# program in a couple of hours, and gave a 10x increase in speed over an optimised script where the time taken was in the number of file reads.
Hope this helps, good luck!

Mixing sound files of different size

I want to mix audio files of different size into a one single .wav file without clipping any file.,i.e. The resulting file size should be equal to the largest sized file of all.
There is a sample through which we can mix files of same size
[(http://www.modejong.com/iOS/#ex4 )(Example 4)].
I modified the code to get the mixed file as a .wav file.
But I am not able to understand that how to modify this code for unequal sized files.
If someone can help me out with some code snippet,i'll be really thankful.
It should be as easy as sending all the files to the mixer simultaneously. When any single file gets to the end, just treat it as if the remainder is filled with zeroes. When all files get to the end, you are done.
Note that the example code says it returns an error if there would be clipping (the sum of the waves is greater than the max representable value.). This condition is more likely if you are mixing multiple inputs. The best way around it is to create some "headroom" in the input waves. You can do either do this in preprocessing, by ensuring that each wave's volume is no more than X% of maximum. (~80-90%, depending on number of inputs.). The other way is to do it dynamically in the mixer code by multiplying each sample by some value <1.0 as you add it to the mix.
If you are selecting the waves to mix at runtime and failure due to clipping is unacceptable, you will need to modify the sample code to pin the values at max/min instead of returning an error. Don't just let them overflow or you will get noisy artifacts.
(Clipping creates artifacts as well, but when you haven't created enough headroom before mixing, it is definitely preferrable to overflow. It is a more familiar-sounding type of distortion, similar to what you get when you overdrive your speakers. See this wikipedia article on clipping:
Clipping is preferable to the alternative in digital systems—wrapping—which occurs if the digital hardware is allowed to "overflow", ignoring the most significant bits of the magnitude, and sometimes even the sign of the sample value, resulting in gross distortion of the signal.
How I'd do it:
Much like the mix_buffers function that you linked to, but pass in 2 parameters for mixbufferNumSamples. Iterate over the whole of the longer of the two buffers. When the index has gone beyond the end of the shorter buffer, simply set the sample from that buffer to 0 for the rest of the function.
If you must avoid clipping and do it in real-time and you know nothing else about the two sounds, you must provide enough headroom. The simplest method is by halving each of the samples before mixing:
mixed = s1/2 + s2/2;
This ensures that the resultant mixed sample won't overflow an int16_t. It will have the side effect of making everything quieter though.
If you can run it offline, you can calculate a scale factor to apply to both waveforms which will keep the peaks when summed below the maximum allowed value.
Or you could mix them all at full volume to an int32_t buffer, keeping track of the largest (magnitude) mixed sample and then go back through the buffer multiplying each sample by a scale factor which will make that extreme sample just reach the +32767/-32768 limits.