[Q/KDB+]: wsfull when creating splayed table from csv using `.Q.fs` - kdb

I have a 9.6GB csv file from which I would like to create an on-disk splayed table.
When I run this code, my 32-bit q process (on Win 10, 16GB RAM machine) runs out of memory ('wsfull) and crashes after creating an incomplete 4.68GB splayed table (see the screenshot).
path:{` sv (hsym x 0), 1_x}
symh: {`$1_ string x}
colnames: `ric`open`high`low`close`volume`date
dir: `:F:
db: `db
tbl: `ohlcv
tbldisk: path dir,db,tbl
tblsplayed: path dir,db,tbl,`
dbsympath: symh path dir,db
csvpath: `:F:/prices.csv
.Q.fs[{ .[ tblsplayed; (); ,; .Q.en[path dir,db] flip colnames!("SFFFFID";",")0:x]}] csvpath
What exactly is going on in memory and on disk behind the scenes when reading the csv file with .Q.fs and 0:? Is the csv read row by row or column by column?
I thought that only the ~132kB chunks are held in memory at any given time, hoping that .Q.fs would be 'wsfull-resistant.
Is the q process actually pulling each whole column (splay) into memory, one at a time, as it appends successive chunks?
Considering that (according to this source, among others):
on 32-bit systems the main memory OLTP portion of a database is limited to about 1GB of raw data, i.e. 1/4 of the address space
that would nearly explain running out of memory. As shown in this screenshot taken right after the 'wsfull, a couple of columns are near the 1GB limit.
Here is a run with memory profiling:
.Q.fs[{ 0N!.Q.w[]; .[ tblsplayed; (); ,; .Q.en[path dir,db] flip colnames!("SFFFFID";",")0:x]}] csvpath

I believe q reads a csv row by row. The reason your q session crashes is probably that memory isn't being cleared between chunks during
.Q.fs[{ .[ tblsplayed; (); ,; .Q.en[path dir,db] flip colnames!("SFFFFID";",")0:x]}] csvpath
Try adding .Q.gc[]:
.Q.fs[{ .Q.gc[]; .[ tblsplayed; (); ,; .Q.en[path dir,db] flip colnames!("SFFFFID";",")0:x]}] csvpath

Related

How can you find the size of a delta table quickly and accurately?

The Microsoft documentation here:
https://learn.microsoft.com/en-us/azure/databricks/kb/sql/find-size-of-table#size-of-a-delta-table
suggests two methods:
Method 1:
import com.databricks.sql.transaction.tahoe._
val deltaLog = DeltaLog.forTable(spark, "dbfs:/<path-to-delta-table>")
val snapshot = deltaLog.snapshot // the current delta table snapshot
println(s"Total file size (bytes): ${deltaLog.snapshot.sizeInBytes}")
Method 2:
spark.read.table("<non-delta-table-name>").queryExecution.analyzed.stats
For my table, they both return ~300 MB.
But then in Storage Explorer's folder statistics, or in a recursive dbutils.fs.ls walk, I get ~900 MB.
So those two methods, which are much quicker than literally looking at every file, underreport by 67%. Falling back to the slower methods would be fine, except that when I try to scale up to the entire container it takes 55 hours to scan all 1 billion files and 2.6 PB.
So what is the best way to get the size of a table in ADLS Gen 2? Bonus points if it works for folders that are not tables as that's really the number I need. dbutils.fs.ls is single threaded and only works on the driver, so it's not even very parallelizable. It can be threaded but only within the driver.
deltaLog.snapshot returns just the current snapshot. You can have more files present in the table's directory; those belong to historical versions that have been deleted from or replaced in the current snapshot.
Also it returns 0 without complaints for non-delta paths. So I'm using this piece of code to get a database-level summary:
import com.databricks.sql.transaction.tahoe._
val databasePath = "dbfs:/<path-to-database>"
def size(path: String): Long =
  dbutils.fs.ls(path).map { fi => if (fi.isDir) size(fi.path) else fi.size }.sum
val tables = dbutils.fs.ls(databasePath).par.map { fi =>
  val totalSize = size(fi.path)
  val snapshotSize = DeltaLog.forTable(spark, fi.path).snapshot.sizeInBytes
  (fi.name, totalSize / 1024 / 1024 / 1024, snapshotSize / 1024 / 1024 / 1024)
}
display(tables.seq.sorted.toDF("name", "total_size_gb", "snapshot_size_gb"))
This parallelizes on the driver only, but since it is only file listing it is pretty fast. I admit I don't have a billion files, but if it's slow for you, just use a bigger driver and tune the number of threads.
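If the driver-only recursion is still too slow, the same listing walk can be fanned out over a thread pool, since the dbutils.fs.ls calls are I/O-bound. Here is a rough Python sketch, assuming a Databricks notebook where dbutils is available; the 32-thread pool size is an arbitrary starting point to tune:
from concurrent.futures import ThreadPoolExecutor
database_path = "dbfs:/<path-to-database>"   # same placeholder as above
def tree_size(path):
    # recursively sum file sizes under `path` using only listing calls
    total = 0
    for fi in dbutils.fs.ls(path):
        total += tree_size(fi.path) if fi.isDir() else fi.size
    return total
top_level = dbutils.fs.ls(database_path)
with ThreadPoolExecutor(max_workers=32) as pool:   # listing is I/O-bound, so threads help
    sizes = list(pool.map(lambda fi: (fi.name, tree_size(fi.path)), top_level))
for name, size_bytes in sorted(sizes):
    print(name, round(size_bytes / 1024**3, 2), "GB")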

How to find the number of data mapped by mmap()?

If mmap() was used to read a file, how can I find the amount of data mapped by mmap()?
float *map = (float *)mmap(NULL, FILESIZE, PROT_READ, MAP_SHARED, fd, 0);
The mmap system call does not read data. It just maps the data in your virtual address space (by indirectly configuring your MMU), and that virtual address space is changed by a successful mmap. Later, your program will read that data (or not). In your example, your program might later read map[356] if mmap has succeeded (and you should test against its failure).
Read the documentation of mmap(2) carefully. The second argument (in your code, FILESIZE) defines the size of the mapping in bytes. You might check that it is a multiple of sizeof(float), and divide it by sizeof(float) to get the number of elements in map that are meaningful and obtained from the file. The size of the mapping is rounded up to a multiple of pages. The man page of mmap(2) says:
A file is mapped in multiples of the page size. For a file that is not a multiple of the page size, the remaining memory is zeroed when mapped, and writes to that region are not written out to the file.
Data is mapped in pages. A page is usually 4096 bytes. Read more about paging.
The page size is returned by getpagesize(2) or by sysconf(3) with _SC_PAGESIZE (which usually gives 4096).
Consider reading some book like Operating Systems: Three Easy Pieces (freely downloadable) to understand how virtual memory works and what is a memory mapped file.
On Linux, the /proc/ filesystem (see proc(5)) is very useful to understand the virtual address space of some process: try cat /proc/$$/maps in your terminal, and read more to understand its output. For a process of pid 1234, try also cat /proc/1234/maps
From inside your process, you could even read sequentially the /proc/self/maps pseudo-file to understand its virtual address space, like here.
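To make the arithmetic above concrete, here is a small illustrative sketch in Python rather than C (the mapping length, element count, and page rounding work the same way on Linux; data.bin is a made-up file of raw floats):
import mmap
import os
import struct
fd = os.open('data.bin', os.O_RDONLY)              # hypothetical file of raw floats
filesize = os.fstat(fd).st_size
m = mmap.mmap(fd, filesize, prot=mmap.PROT_READ)   # maps the file, reads nothing yet
float_size = struct.calcsize('f')                  # sizeof(float), normally 4
n_elements = filesize // float_size                # meaningful elements in the mapping
page = mmap.PAGESIZE                               # same value getpagesize() returns
mapped_bytes = -(-filesize // page) * page         # mapping length rounded up to whole pages
print(n_elements, page, mapped_bytes)
m.close()
os.close(fd)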

Can't open matlab file

I have a ".mat" file supposedly containing a [30720000x4 double] matrix (values from accelerometers). When I try to open this file with "Import data" in Matlab I get the following error:
Error using load
Can't read file F:\vibration_exp_2\GR_UB50n\bearing1\GR_UB50n_1_2.mat.
Error using load
Unknown text on line number 1 of ASCII file
F:\vibration_exp_2\GR_UB50n\bearing1\GR_UB50n_1_2.mat
"MATLAB".
Error in uiimport/runImportdata (line 456)
datastruct = load('-ascii', fileAbsolutePath);
Error in uiimport/gatherFilePreviewData (line 424)
[datastruct, textDelimiter, headerLines]= runImportdata(fileAbsolutePath,
type);
Error in uiimport (line 240)
[ctorPreviewText, ctorHeaderLines, ctorDelim] = ...
The file size is 921MB, which is the same as my other files that do open. I also tried opening the file using Python, with no success. Any suggestions? I use MATLAB R2013b.
More info:
How the file was created:
%% acquisition of vibration data
% input:
% sample rate in Hz (max. 51200 Hz, should be used as bearing
% faults are high-frequent)
% time in seconds, stating the duration of the measurement
% (e.g. 600 seconds = 10 minutes)
% filename for the file to be saved
%
% examples:
% data = DAQ(51200, 600, 'NF1_1.mat');
% data = DAQ(51200, 600, 'NF1_2.mat');
function data = DAQ(samplerate,time,filename)
s = daq.createSession('ni'); % Creates the DAQ session
%%% Add the channels as accelerometer channels (meaning IEPE is turned on)
s.addAnalogInputChannel('cDAQ1Mod1','ai0','Accelerometer');
s.addAnalogInputChannel('cDAQ1Mod1','ai1','Accelerometer');
s.addAnalogInputChannel('cDAQ1Mod1','ai2','Accelerometer');
s.addAnalogInputChannel('cDAQ1Mod1','ai3','Accelerometer');
%s.addAnalogInputChannel('cDAQ1Mod2','ai0','Accelerometer');
s.Rate = samplerate;
s.NumberOfScans = samplerate*time;
%%% Defining the Sensitivities in V/g
s.Channels(1).Sensitivity = 0.09478; %31965, top outer
s.Channels(2).Sensitivity = 0.09531; %31966, back outer
s.Channels(3).Sensitivity = 0.09275; %31964, top inner
s.Channels(4).Sensitivity = 0.09363; %31963, back inner
data = s.startForeground(); %Acquiring the data
save(filename, 'data');
More info:
When I open the file using a simple text editor I can see a lot of characters that do not make sense, but also the first line:
MATLAB 5.0 MAT-FILE, Platform: PCWIN64, Created on: Thu Apr 30 16:29:07 2015
More info:
The file itself: https://www.dropbox.com/s/r7mavil79j47xa2/GR_UB50n_1_2.mat?dl=0
It is 921MB.
EDIT:
How can I recover my data?
I've tried this, but got memory errors.
I've also tried this, but it did not work.
I fear I can't add much good news to what you already know, but it hasn't been mentioned yet.
The reason the .mat file can't be loaded is that the data is corrupted. What makes it 'unrecoverable' is the way it is stored internally. The exact format is specified in the MAT-File Format documentation. So I decided to manually construct a simple reader specifically for your .mat file.
It makes sense that splitmat.m can't recover anything, as it basically splits the data into chunks, one stored variable per chunk; in this case there is only one variable stored, and thus only one chunk, which happens to be the corrupted one.
In this case, the data is stored as a miCOMPRESSED element, which is a normal MATLAB array compressed with deflate (the same algorithm gzip uses). (Which, as a side note, doesn't seem like a good fit for 'random' vibration data.) This might explain previous comments about the file being smaller than the full data, as the file size matches exactly the internally stored value.
I extracted the compressed payload and tried to decompress it in a variety of ways. Basically it is a '.gz' without the header, which can be prepended manually. Unfortunately there seems to be a corrupted block near the start of the dataset. I am by no means an expert on gzip, but as far as I know the dictionary is built dynamically as the stream is decoded, which makes all data useless from the point where a block is corrupted. If you are really eager, there seems to be a way to recover data even beyond the corrupted point, but that method is massively time-consuming. Also, the only way to validate data in those sections is manual inspection, which in your case might prove very difficult.
Below is the code that I used to extract the .gz file, so if you want to give it a try, this might get you started. If you manage to decompress the data, you can read it as described in the MAT-File Format, 13f.
corrupted_file_id = fopen('corrupt.mat','r');
%% some header data
% can be skipped replacing this block with
% fread(id,132);
%header of .mat file
header_text = char(fread(corrupted_file_id,116,'char')');
subsystem_data_offset = fread(corrupted_file_id,8,'uint8');
version = fread(corrupted_file_id,1,'int16');
endian_indicator = char(fread(corrupted_file_id,2,'int8')');
data_type = fread(corrupted_file_id,4,'uint8');
%data_type is 15, so it is a compressed matlab array
%% save the content
data_size = fread(corrupted_file_id,1,'uint32');
gz_file_id = fopen('compressed_array.gz','w');
% first write a gzip header (magic bytes, deflate method, flags, zero mtime)
fwrite(gz_file_id,hex2dec('1f8b080000000000'),'uint64',0,'b');
% then copy the compressed payload in 1 kB chunks, plus the remainder
chunk = 1e3;
for idx = 1:floor(data_size/chunk)
    fwrite(gz_file_id,fread(corrupted_file_id,chunk,'uint8'));
end
fwrite(gz_file_id,fread(corrupted_file_id,mod(data_size,chunk),'uint8'));
fclose(gz_file_id);
fclose(corrupted_file_id);
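Since the question mentions trying Python as well: the miCOMPRESSED payload is a zlib/deflate stream, so a rough alternative sketch (assuming the first data element starts right after the 128-byte header and little-endian byte order, as the PCWIN64 header suggests) is to feed it to zlib incrementally and keep whatever decompresses before the corrupted block is reached:
import zlib
with open('corrupt.mat', 'rb') as f:
    f.read(128)                                      # 116-byte text header + subsystem offset + version + endian
    data_type = int.from_bytes(f.read(4), 'little')  # 15 == miCOMPRESSED
    data_size = int.from_bytes(f.read(4), 'little')  # byte length of the compressed element
    payload = f.read(data_size)
d = zlib.decompressobj()
recovered = bytearray()
try:
    for i in range(0, len(payload), 4096):           # feed small chunks so partial output survives
        recovered += d.decompress(payload[i:i + 4096])
except zlib.error as err:
    print('decompression stopped at the corrupted block:', err)
print('recovered', len(recovered), 'bytes of uncompressed MAT data')
Whatever comes out can then be parsed by hand following the array layout described in the MAT-File Format document.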
To answer the question literally, my suggestion would be to first make sure that the file is okay. This tool on File Exchange apparently knows how to diagnose corrupted .MAT files starting with version V5 (R8):
http://www.mathworks.com/matlabcentral/fileexchange/6893-matcat-mat-file-corruption-analysis-tool
The file's size seems to be a problem (indices going out of range). Octave, which should be able to read .mat files, gives the error
memory exhausted or requested size too large for range of Octave's index type
To find out what is wrong you may need to write a test program outside MATLAB, where you have more control over memory management. Examples are here, including instructions on how to build them on your own platform. These stand-alone programs may not have the same memory issues. The program matdgns.c is specifically made to check .mat files for errors.

How to tell PyCUDA to reuse the memory from an earlier kernel?

My program has two kernels, and the second kernel should use the already uploaded input data and the results from the first kernel, so I can save the memory transfers. How would I achieve this?
This is how I launch my kernels:
result = gpuarray.zeros(points, dtype=np.float32)
kernel(
    driver.In(dataT), result, np.int32(points),
    grid = (blocks, 1),
    block = (block_size, 1, 1),
)
In pycuda you won't transfer data to and from the device unless you explicitly request it.
For example, if you allocate memory and transfer some data to the GPU with:
result = np.zeros((height, width), dtype=np.float64)
result_device = gpuarray.to_gpu(result)
The variable result_device is a reference to the data in the GPU. You can pass result_device to any other kernel without incurring a memory transfer back to the CPU.
In this case, a memory transfer back to the host only happens when you call:
result = result_device.get()
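Here is a minimal end-to-end sketch of that pattern; the kernel bodies, names and sizes are invented for illustration and are not from the question:
import numpy as np
import pycuda.autoinit
import pycuda.gpuarray as gpuarray
from pycuda.compiler import SourceModule
mod = SourceModule("""
__global__ void scale(float *d, int n)  { int i = blockIdx.x * blockDim.x + threadIdx.x; if (i < n) d[i] *= 2.0f; }
__global__ void offset(float *d, int n) { int i = blockIdx.x * blockDim.x + threadIdx.x; if (i < n) d[i] += 1.0f; }
""")
scale = mod.get_function("scale")
offset = mod.get_function("offset")
n = 1024
data_device = gpuarray.to_gpu(np.arange(n, dtype=np.float32))  # one upload to the device
block = (256, 1, 1)
grid = ((n + 255) // 256, 1)
scale(data_device, np.int32(n), block=block, grid=grid)   # first kernel works in place on the GPU
offset(data_device, np.int32(n), block=block, grid=grid)  # second kernel reuses the same device buffer
result = data_device.get()                                # only this line copies data back to the host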

How can I prevent GD from running out of memory?

I'm not sure if memory is the culprit here. I am trying to instantiate a GD image from data in memory (it previously came from a database). I try a call like this:
my $image = GD::Image->new($image_data);
$image comes back as undef. The POD for GD says that the constructor will return undef for cases of insufficient memory, so that's why I suspect memory.
The image data is in PNG format. The same thing happens if I call newFromPngData.
This works for very small images, like under 30K. However, slightly larger images, like ~70K, will cause the problem. I wouldn't think that a 70K image should cause these problems, even after it is decompressed.
This script is running under CGI through Apache 2.0, on OS 10.4, if that matters at all.
Are there any memory limitations imposed by Apache by default? Can they be increased?
Thanks for any insight!
EDIT: For clarification, the GD::Image object never gets created, so clearing out the $image_data from memory isn't really an option.
The GD library eats many bytes of memory per byte of image size; it's well over a 10:1 ratio!
When a user uploads an image to our system, we start by checking the file size before loading it into a GD image. If it's over a threshold (1 Megabyte) we don't use it but instead report an error to the user.
If we really cared we could dump it to disk, use the command line "convert" tool to rescale it to a sane size, then load the output into the GD library and remove the temporary file.
convert -define jpeg:size=800x800 tmpfile.jpg -thumbnail '800x800' -
This will scale the image so it fits within an 800 x 800 square. Its longest edge is now 800px, which should load safely. The above command sends the shrunk .jpg to STDOUT. The -define jpeg:size option should tell convert not to bother holding the huge image in memory, but only enough of it to scale to 800x800.
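For what it's worth, here is a rough sketch of that guard-then-shrink flow in Python (illustrative only: it assumes ImageMagick's convert is on the PATH, a JPEG input, and reuses the 1 MB threshold mentioned above):
import os
import subprocess
MAX_BYTES = 1 * 1024 * 1024                      # size threshold before we bother shrinking
def load_or_shrink(path):
    # small files are used as-is; big ones are resized by convert and streamed from stdout
    if os.path.getsize(path) <= MAX_BYTES:
        with open(path, 'rb') as f:
            return f.read()
    cmd = ['convert', '-define', 'jpeg:size=800x800', path, '-thumbnail', '800x800', 'jpg:-']
    return subprocess.run(cmd, check=True, capture_output=True).stdout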
I've run into the same problem a few times.
One of my solutions was simply to increase the amount of memory available to my scripts. The other was to clear the buffer:
Original Script:
$src_img = imagecreatefromstring($userfile2);
$dst_img = imagecreatetruecolor($thumb_width, $thumb_height); // destination image must be created first
imagecopyresampled($dst_img,$src_img,0,0,0,0,$thumb_width,$thumb_height,$origw,$origh);
Edited Script:
$src_img = imagecreatefromstring($userfile2);
$dst_img = imagecreatetruecolor($thumb_width, $thumb_height);
imagecopyresampled($dst_img,$src_img,0,0,0,0,$thumb_width,$thumb_height,$origw,$origh);
imagedestroy($src_img); // free the decoded source image once it is no longer needed
Freeing the memory held by $src_img left enough to handle further processing.