Loading a large binary file with kdb+ (Q) - kdb

I have a set of large files (around 2 GB).
When I attempt to load one of them (let's assume I'm doing it correctly):
ctq_table:flip `QTIM`BID`OFR`QSEQ`BIDSIZE`QFRSIZ`OFRSIZ`MODE`EX`MMID!("ijjiiihcs";4 8 8 4 4 4 2 1 4) 1: `:/q/data/Q200405A.BIN
It gives back a wsfull error. As far as I know, kdb+ is meant to be used for tasks like this.
Is there a way to handle big files without running out of memory (e.g. keeping them on disk, even if it is slower)?

As Igor mentioned in the comments (and getting back to the topic of the question), you can read the large binary file in chunks and write it to disk one piece at a time. This reduces your memory footprint at the expense of speed, due to the additional disk I/O.
In general, chunking can be trickier for byte streams because an arbitrary chunk boundary may cut a message in half (if the messages are variable-width). In your case, however, the messages appear to be fixed-width, so the chunk end-points are easy to calculate.
Either way, I often find it useful to loop using over (/), keep track of the last known (good) index, and start from that index when reading the next chunk. The general idea (untested) would be something like:
file:`:/q/data/Q200405A.BIN;
chunkrows:10000; /number of rows to process in each chunk
columns:`QTIM`BID`OFR`QSEQ`QFRSIZ`OFRSIZ`MODE`EX`MMID;
types:"ijjiiihcs";
widths:4 8 8 4 4 4 2 1 4;
{
  data:flip columns!(types;widths)1:(file;x;chunkrows*sum widths);
  upsertToDisk[data];      /write a function to upsert to disk (partitioned or splayed)
  x+chunkrows*sum widths   /return the byte offset at which the next chunk starts
 }/[hcount[file]>;0]
This continues until the last good index reaches the end of the file. You can adjust chunkrows to suit your memory constraints.
Ultimately, if you're trying to handle large-ish data with the free 32-bit version, you're going to have headaches no matter what you do.

Related

Caching one big RDD or many small RDDs

I have a large RDD (R) which I cut into 20 chunks (C_1, C_2, ..., C_20) such that:
If the time it takes to cache only depends on the size of the RDD (e.g. 10 seconds per MB), then caching the individual chunks is better.
However, I suspect there is some additional overhead I'm not aware of, like seek time in the case of persisting to disk.
So, my questions are:
Are there any additional overheads when writing to memory?
Is it better to cache (i.e. in memory) the large RDD (R) or the 20 individual chunks?
EDIT: To give some more context, I'm currently running the application on my computer, but in the end it will run on a cluster of 10 nodes, each of which has 8 cores. However, since we only have access to the cluster for a small amount of time, I wanted to experiment locally on my computer first.
From my understanding, the application won't need a lot of shuffling, as I can partition it rather nicely so that each chunk runs on a single node.
However, I'm still thinking about the partitioning, so it is not yet 100% decided.
Spark performs its computations in memory, so there is no real extra overhead when you cache data to memory. Caching to memory essentially says: reuse these intermediate results. The only issue you can run into is having too much data in memory, at which point it spills to disk and you incur disk read costs. If you run into memory limitations, you will need unpersist() to evict intermediate results once you are finished with them.
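For illustration, a minimal sketch of that lifecycle in the style of the examples below (the path and DataFrame name are assumptions):
val data = spark.read.parquet("file:///testdata/").cache()
// ...run the actions that reuse `data`...
data.unpersist() // release the cached blocks once this intermediate result is no longer needed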
When determining where to cache your data you need to look at the flow of your data. If you read in a file and then filter it 3 times and write out each one of those filters separately, without caching you will end up reading in that file 3 times.
val data = spark.read.parquet("file:///testdata/").limit(100)
data.select("col1").write.parquet("file:///test1/")
data.select("col2").write.parquet("file:///test2/")
data.select("col3").write.parquet("file:///test3/")
If you read in the file, cache it, and then filter 3 times and write out the results, you will read the file once and then write out each result.
val data = spark.read.parquet("file:///testdata/").limit(100).cache()
data.select("col1").write.parquet("file:///test4/")
data.select("col2").write.parquet("file:///test5/")
data.select("col3").write.parquet("file:///test6/")
The general test you can use for what to cache is: "Am I performing multiple actions on the same RDD?" If yes, cache it. In your example, if you break the large RDD into chunks and the large RDD isn't cached, you will most likely recalculate the large RDD every time you perform an action on it. Then, if you don't cache the chunks and you perform multiple actions on those, you will have to recalculate those chunks every time.
Is it better to cache (i.e. in memory) the large RDD (R) or the 20 individual chunks?
So to answer that, it all depends on what you are doing with each intermediate result. It looks like you will definitely want to properly repartition your large RDD according to the number of executors and then cache it. Then, if you perform more than one action on each one of the chunks that you create from the large RDD, you may want to cache those.
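As a rough sketch of that last point, assuming the 10-node, 8-core cluster from the question, with sc as the SparkContext and a hypothetical rule for cutting R into the 20 chunks:
val r = sc.textFile("file:///testdata/").repartition(10 * 8).cache() // size partitions to the executors, then cache R
val chunks = (0 until 20).map { i =>
  r.filter(line => math.abs(line.hashCode) % 20 == i) // hypothetical chunking rule for C_1 .. C_20
}
chunks.foreach(_.cache()) // only worth caching a chunk if you run more than one action on it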

SnappyData Spark Scala java.sql.BatchUpdateException

So, I have around 35 GB of zip files, each containing 15 CSV files. I have created a Scala script that processes each zip file and each of the CSV files inside it.
The problem is that after some number of files the script throws this error:
ERROR Executor: Exception in task 0.0 in stage 114.0 (TID 3145)
java.io.IOException: java.sql.BatchUpdateException: (Server=localhost/127.0.0.1[1528] Thread=pool-3-thread-63) XCL54.T : [0] insert of keys [7243901, 7243902,
And the string continues with all the keys (records) that were not inserted.
So what I have found is that apparently (I say apparently because of my lack of knowledge of Scala, Snappy and Spark) the memory in use is full... My questions: how do I increase the amount of memory used? Or how do I flush the data that is in memory and save it to disk?
Can I close the started session and free the memory that way?
I have had to restart the server and remove the already-processed files, and then I can continue with the import, but after some more files... again... the same exception.
My CSV files are big... the biggest one is around 1 GB, but this exception happens not just with the big files but as multiple files accumulate... until some size is reached... So where do I change that memory limit?
I have 12 GB of RAM...
You can use RDD persistence and store to disk/memory, or a combination: https://spark.apache.org/docs/2.1.0/programming-guide.html#rdd-persistence
Also, try adding a large number of partitions when reading the file(s): sc.textFile(path, 200000)
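A minimal sketch combining both suggestions (the storage level shown is one reasonable choice, not the only one):
import org.apache.spark.storage.StorageLevel
val rdd = sc.textFile(path, 200000) // many small partitions so each task stays small
rdd.persist(StorageLevel.MEMORY_AND_DISK) // partitions that don't fit in memory spill to disk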
I think you are running out of available memory. The exception message is misleading. If you only have 12 GB of memory on your machine, I wonder whether your data will fit.
What I would do first is figure out how much memory you need:
1. Copy conf/servers.template to conf/servers.
2. Edit that file with something like: localhost -heap-size=3g -memory-size=6g (this essentially allocates 3 GB in your server for computation (Spark, etc.) and 6 GB of off-heap memory for your data (column tables only)).
3. Start your cluster using snappy-start-all.sh.
4. Load some subset of your data (I doubt you have enough memory for all of it).
5. Check the memory used in the SnappyData Pulse UI (localhost:5050).
If you think you have enough memory, load the full data.
Hopefully that works out.
BatchUpdateException tells me that you are creating Snappy tables and inserting data into them. In most cases, BatchUpdateException means low memory (the exception message needs to be better), so I believe you may be right about the memory. To free the memory, you will have to drop the tables that you created. For information about memory size and table sizing, you may want to read these docs:
http://snappydatainc.github.io/snappydata/best_practices/capacity_planning/#memory-management-heap-and-off-heap
http://snappydatainc.github.io/snappydata/best_practices/capacity_planning/#table-memory-requirements
Also, if you have a lot of data that can't fit in memory, you can overflow it to disk. See the following doc about the overflow configuration:
http://snappydatainc.github.io/snappydata/best_practices/design_schema/#overflow-configuration
Hope it helps.

Reading Multiple Files in Multiple Threads using C#, Slow !

I have an Intel Core 2 Duo CPU and I was reading 3 files from my C: drive and showing some matching values from the files in an EditBox on screen. The whole process takes 2 minutes. Then I thought of processing each file in a separate thread, and now the whole process takes 2 minutes 30 seconds, i.e. 30 seconds more than single-threaded processing!
I was expecting the other way around! I can see both graphs in the CPU usage history. Can someone please explain to me what is going on?
Here is my code snippet:
foreach (FileInfo file in FileList)
{
    Thread t = new Thread(new ParameterizedThreadStart(ProcessFileData));
    t.Start(file.FullName);
}
where ProcessFileData is the method that processes the files.
Thanks!
The root of the problem is that the files are on the same drive and, unlike your dual core processor, your hard drive can only do one thing at a time.
If you read two files simultaneously, the disk heads will jump from one file to the other and back again. Given that your hard drive can read each file in roughly 40 seconds, it now has the additional overhead of moving its disk head between the three separate files many times during the read.
The fastest way to read multiple files from a single hard drive is to do it all in one thread and read them one after another. This way, the head only moves once per file read (at the very beginning) and not multiple times per read.
To optimize this process you'll either need to change your logic (do you really need to read the whole contents of all three files?), or purchase a faster hard drive, put the 3 files on three different hard drives and use threading, or use a RAID.
If you read from disk using multiple threads, then the disk heads will bounce around from one part of the disk to another as each thread reads from a different part of the drive. That can reduce throughput significantly, as you've seen.
For that reason, it's actually often a better idea to have all disk accesses go through a single thread, to help minimize disk seeks.
If your task is I/O bound and if it needs to run often, you might look at a tool like "contig" to make sure the layout of your files on disk is optimized / contiguous.
If your processing is mostly I/O bound rather than CPU bound, it makes sense that it takes the same time or even longer.
How do you compare those files? You should think about what the bottleneck of your application is: I/O, CPU, memory...
Multithreading is only interesting for CPU-bound processing, i.e. complex calculations, comparison of data in memory, sorting, etc.
Since your process is I/O bound, you should let the OS do the threading for you. Look at FileStream.BeginRead() for an example of how to queue up your reads. Your EndRead() callback can issue the request for the next block of data, pointing to itself to handle each subsequent completed block.
Also, by creating additional threads you make the OS manage more threads, and if a different CPU happens to be picked to handle the completed read, you've lost all of the CPU caching where your thread originated.
As you've found, you can't "speed up" an application just by adding threads.

What is the fastest way to read a 10 GB file from disk?

We need to read and count different types of messages and run some statistics on a 10 GB text file, e.g. a FIX engine log. We use Linux, 32-bit, 4 CPUs, Intel, coding in Perl, but the language doesn't really matter.
I have found some interesting tips in Tim Bray's WideFinder project. However, we've found that using memory mapping is inherently limited by the 32-bit architecture.
We tried using multiple processes, which seems to work faster if we process the file in parallel using 4 processes on 4 CPUs. Adding multithreading slows it down, maybe because of the cost of context switching. We tried changing the size of the thread pool, but that is still slower than the simple multi-process version.
The memory mapping part is not very stable; sometimes it takes 80 seconds and sometimes 7 seconds on a 2 GB file, maybe from page faults or something related to virtual memory usage. Anyway, mmap cannot scale beyond 4 GB on a 32-bit architecture.
We tried Perl's IPC::Mmap and Sys::Mmap. We looked into MapReduce as well, but the problem is really I/O bound; the processing itself is sufficiently fast.
So we decided to try to optimize the basic I/O by tuning buffering size, type, etc.
Can anyone who is aware of an existing project where this problem was efficiently solved in any language/platform point to a useful link or suggest a direction?
Most of the time you will be I/O bound, not CPU bound, so just read the file through normal Perl I/O and process it in a single thread. Unless you prove that you can do more I/O than your single CPU can handle, don't waste your time with anything more. Anyway, you should ask: why on earth is this in one huge file? Why on earth don't they split it in a reasonable way when they generate it? That would be work an order of magnitude more worthwhile. Then you can put the parts on separate I/O channels and use more CPUs (if you don't use some sort of RAID 0 or NAS or ...).
Measure, don't assume. Don't forget to flush caches before each test. Remember that serialized I/O is an order of magnitude faster than random I/O.
This all depends on what kind of preprocessing you can do, and when.
On some of the systems we have, we gzip such large text files, reducing them to 1/5 to 1/7 of their original size. Part of what makes this possible is that we don't need to process these files until hours after they're created, and at creation time we don't really have any other load on the machines.
Processing them is done more or less in the fashion of zcat thosefiles | ourprocessing (well, it's done over Unix sockets, with a custom-made zcat). It trades CPU time for disk I/O time, and for our system that has been well worth it. There are of course a lot of variables that can make this a very poor design for a particular system.
Perhaps you've already read this forum thread, but if not:
http://www.perlmonks.org/?node_id=512221
It describes using Perl to do it line-by-line, and the users seem to think Perl is quite capable of it.
Oh, is it possible to process the file from a RAID array? If you have several mirrored disks, then the read speed can be improved. Competition for disk resources may be what makes your multi-threaded attempt not work.
Best of luck.
I wish I knew more about the content of your file, but knowing nothing other than that it is text, this sounds like an excellent MapReduce kind of problem.
PS: the fastest read of any file is a linear read. cat file > /dev/null gives the speed at which the file can be read.
Have you thought of streaming the file and filtering out any interesting results to a secondary file? (Repeat until you have a file of manageable size.)
Basically you need to "divide and conquer". If you have a network of computers, copy the 10 GB file to as many client PCs as possible and have each client PC read a different offset of the file. For an added bonus, get each PC to implement multithreading in addition to the distributed reading.
Parse the file once, reading line by line. Put the results in a table in a decent database. Run as many queries as you wish. Feed the beast regularly with new incoming data.
Realize that manipulating a 10 GB file, transferring it across the (even if local) network, exploring complicated solutions, etc. all take time.
I have a co-worker who sped up his FIX reading by going to 64-bit Linux. If it's something worthwhile, drop a little cash to get some fancier hardware.
Hmmm, but what's wrong with the read() command in C? It usually has a 2 GB limit, so just call it 5 times in sequence. That should be fairly fast.
If you are I/O bound and your file is on a single disk, then there isn't much to do. A straightforward single-threaded linear scan across the whole file is the fastest way to get the data off of the disk. Using large buffer sizes might help a bit.
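As an illustration (the question says the language doesn't really matter, so here is a JVM/Scala sketch; the file name and the 8 MB buffer size are assumptions to tune):
import java.io.{BufferedInputStream, FileInputStream}
// single-threaded linear scan with a large buffer; message counting/statistics go inside the loop
val in = new BufferedInputStream(new FileInputStream("fix-engine.log"), 8 * 1024 * 1024)
val buf = new Array[Byte](8 * 1024 * 1024)
var total = 0L
var n = in.read(buf)
while (n > 0) {
  total += n // placeholder: parse the buffer and update message counts here
  n = in.read(buf)
}
in.close()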
If you can convince the writer of the file to stripe it across multiple disks / machines, then you could think about multithreading the reader (one thread per read head, each thread reading the data from a single stripe).
Since you said platform and language don't matter...
If you want stable performance that is as fast as the source medium allows, the only way I am aware of to do this on Windows is overlapped, non-OS-buffered, aligned sequential reads. You can probably get to some GB/s with two or three buffers; beyond that, at some point you need a ring buffer (one writer, 1+ readers) to avoid any copying. The exact implementation depends on the driver/APIs. If there's any memory copying going on in the thread dealing with the I/O (in kernel or user mode), obviously the larger the buffer to copy, the more time is wasted on that rather than on the I/O. So the optimal buffer size depends on the firmware and driver. On Windows, good values to try for disk I/O are multiples of 32 KB. Windows file buffering, memory mapping and all that adds overhead; it is only good if you are doing multiple reads of the same data, reading in a random-access manner, or both. So for reading a large file sequentially a single time, you don't want the OS to buffer anything or do any memcpy's. If using C#, there are also penalties for calling into the OS due to marshaling, so the interop code may need a bit of optimization unless you use C++/CLI.
Some people prefer throwing hardware at problems, but if you have more time than money, in some scenarios it's possible to optimize things to perform 100-1000x better on a single consumer-level computer than on 1000 enterprise-priced computers. The reason is that if the processing is also latency sensitive, going beyond two cores probably adds latency. This is why drivers can push gigabytes per second while enterprise software ends up stuck at megabytes per second by the time it's all done. Whatever reporting, business logic and such the enterprise software does can probably also be done at gigabytes per second on a two-core consumer CPU, if written as if you were back in the '80s writing a game. The most famous example I've heard of approaching an entire business logic in this manner is the LMAX forex exchange, which published some of its ring-buffer-based code, said to be inspired by network card drivers.
Forgetting all the theory, if you are happy with < 1 GB/s, one possible starting point on Windows I've found is looking at the readfile source from winimage, unless you want to dig into SDK/driver samples. It may need some source-code fixes to calculate performance correctly at SSD speeds. Experiment with buffer sizes as well.
The switches /h (multi-threaded) and /o (overlapped, completion-port I/O) with an optimal buffer size (try 32, 64, 128 KB, etc.) and no Windows file buffering in my experience give the best performance when reading from an SSD (cold data) while simultaneously processing (use /a for Adler processing, as otherwise it's too CPU-bound).
I seem to recall a project in which we were reading big files. Our implementation used multithreading: basically n worker threads started at incrementing offsets into the file (0, chunk_size, 2*chunk_size, ..., (n-1)*chunk_size) and each read smaller chunks of information. I can't exactly recall our reasoning for this, as someone else designed the whole thing; the workers weren't the only part of it, but that's roughly how we did it.
Hope it helps.
It's not stated in the problem whether sequence really matters or not. So divide the file into equal parts, say 1 GB each. Since you are using multiple CPUs, multiple threads won't be a problem, so read each part in a separate thread and use RAM with capacity > 10 GB; then all your content will be held in RAM, read by multiple threads.

what is the suggested number of bytes each time for files too large to be memory mapped at one time?

I am opening files using memory mapping. The files are apparently too big (6 GB on a 32-bit PC) to be mapped in one go, so I am thinking of mapping part of the file each time and adjusting the offset for the next mapping.
Is there an optimal number of bytes for each mapping or is there a way to determine such a figure?
Thanks.
There is no optimal size. With a 32-bit process, there is only 4 GB of address space in total, and usually only 2 GB is available to user-mode processes. This 2 GB is then fragmented by code and data from the EXE and DLLs, heap allocations, thread stacks, and so on. Given this, you will probably not find more than 1 GB of contiguous space to map a file into.
The optimal number depends on your app, but I would be concerned about mapping more than 512 MB into a 32-bit process. Even limiting yourself to 512 MB, you might run into some issues depending on your application. Alternatively, if you can go 64-bit, there should be no problem mapping multiple gigabytes of a file into memory; your address space is so large that this shouldn't cause any issues.
You could use an API like VirtualQuery to find the largest contiguous space, but then you're actually forcing out-of-memory errors to occur, as you are removing large amounts of address space.
EDIT: I just realized my answer is Windows-specific, but you didn't say which platform you are discussing. I presume other platforms have similar limiting factors for memory-mapped files.
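As an illustration only (the question doesn't name a platform or language, so here is a Scala/Java NIO sketch; the file name is hypothetical and the 512 MB window follows the suggestion above):
import java.nio.channels.FileChannel
import java.nio.file.{Paths, StandardOpenOption}
val windowSize = 512L * 1024 * 1024 // 512 MB windows, per the suggestion above
val channel = FileChannel.open(Paths.get("big.dat"), StandardOpenOption.READ)
var offset = 0L
while (offset < channel.size()) {
  val length = math.min(windowSize, channel.size() - offset)
  val window = channel.map(FileChannel.MapMode.READ_ONLY, offset, length) // map one window at a time
  // ...process `window`, then advance to the next mapping...
  offset += length
}
channel.close()
// note: each mapping is only released when its buffer is garbage collected, so on a 32-bit
// JVM the address space can still fill up if many windows remain reachable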
Does the file need to be memory mapped?
I've edited 8 GB video files on a 733 MHz PIII (not pleasant, but doable).