Most efficient memory type for kdb+ - kdb

I am currently configuring a server that will run a kdb+ tickerplant with several subscription processes. Is there an optimal physical memory type for realtime kdb data?

Check out the type sizes at http://code.kx.com/q/ref/card/#datatypes
The answer depends on what you mean by "efficient". By far the largest hit you take in latency is memory allocation, so the less you have to allocate the better. That means smaller types.
But of course you have to weigh that up against your use cases.
For your realtime processes, always make sure the tickerplant inserts the time column so that the sorted attribute (`s#) is maintained on the time column for efficient querying.
The tickerplant itself publishes on a timer. The longer the timer, the smaller the hit on CPU, but then the tickerplant is collecting data for a while before publishing. Again, weigh that up against your use cases. Also make sure your tickerplant is writing the log file to a fast local disk so as to decrease publishing delay and iowait.
If you're operating under high load from multiple sources, consider OS tweaks too, like TCP quickack ( http://www.techrepublic.com/article/take-advantage-of-tcp-ip-options-to-optimize-data-transmission/). There are similar tweaks for memory allocation and disk I/O.


Postgresql auto-vacuuming taking too long

I have a DB table with around 5-6 million entries, and it takes around 20 minutes to vacuum. Since one field of this table is updated very frequently, there are a lot of dead rows to deal with.
For an estimate, with our current user base it can have 2 Million dead tuples on daily basis. So, vacuuming of this table requires both:
Read IO: as the whole table is not present in shared memory.
Write IO: as there are a lot of entries to update.
What would be an ideal way to vacuum this table? Should I increase autovacuum_vacuum_cost_limit to allow more operations per autovacuum run? As far as I can see, that will increase IOPS, which again might hinder performance. Currently, I have autovacuum_vacuum_scale_factor = 0.2. Should I decrease it? If I decrease it, it will run more often; write IO will decrease, but there will be more periods with high read IO.
Also, as the user base increases, vacuuming will take more and more time, because the table will grow and vacuum will have to read a lot from disk. So, what should I do?
Some solutions I have thought of:
1. Separate the highly updated column and make a separate table.
2. Tweak the parameters to make it run more often, to decrease write IO (as discussed above). How do I handle more read IO, as vacuum will now run more often?
3. Combine point 2 with increasing RAM to reduce read IO as well.
In general, what approach do people take? I assume people must have very big tables, 10GB or more, that need to be vacuumed.
Separating the column is a viable strategy but would be a last resort to me. PostgreSQL already has a high per-row overhead, and doing this would double it (which might also remove most of the benefit). Plus, it would make your queries uglier, harder to read, harder to maintain, and easier to introduce bugs into. Where splitting it would be most attractive is if index-only scans on a set of columns not including this one are important to you, and splitting it out lets you keep the visibility map for those remaining columns in a better state.
Why do you care that it takes 20 minutes? Is that causing something bad to happen? At that rate, you could vacuum this table 72 times a day, which seems to be way more often than it actually needs to be vacuumed. In v12, the default value for autovacuum_vacuum_cost_delay was dropped 10 fold, to 2ms. This change in default was not driven by changes in the code in v12, but rather by the realization that the old default was just out of date with modern hardware in most cases. I would have no trouble pushing that change into v11 config; but I don't think doing so would address your main concern, either.
Do you actually have a problem with the amount of IO you are generating, or is it just conjecture? The IO done is mostly sequential, but how important that is would depend on your storage hardware. Do you see latency spikes while the vacuum is happening? Are you charged per IO and your bill is too high? High IO is not inherently a problem, it is only a problem if it causes a problem.
"Currently, I have autovacuum_vacuum_scale_factor = 0.2. Should I decrease it? If I decrease it it will run more often, although write IO will decrease but it will lead to more number of time period with high read IO."
Running more often probably won't decrease your write IO by much if any. Every table/index page with at least one obsolete tuple needs to get written, during every vacuum. Writing one page just to remove one obsolete tuple will cause more writing than waiting until there are a lot of obsolete tuples that can all be removed by one write. You might be writing somewhat less per vacuum, but doing more vacuums will make up for that, and probably far more than make up for it.
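To get a feel for how often autovacuum would fire at different scale factors, here is a rough sketch in Python. It uses the documented trigger condition (dead tuples above autovacuum_vacuum_threshold + autovacuum_vacuum_scale_factor * number of tuples) with the default base threshold of 50. The row count of 5.5 million is an assumed midpoint of the "5-6 million" in the question, and the model ignores HOT pruning, so treat the numbers as an estimate only.

```python
# Rough estimate of how often autovacuum triggers on one table.
# Trigger condition (per the PostgreSQL docs):
#   dead tuples > autovacuum_vacuum_threshold
#                 + autovacuum_vacuum_scale_factor * reltuples

def vacuum_trigger(reltuples, scale_factor=0.2, base_threshold=50):
    """Number of dead tuples at which autovacuum will vacuum the table."""
    return base_threshold + scale_factor * reltuples

def vacuums_per_day(reltuples, dead_per_day, scale_factor=0.2, base_threshold=50):
    """Approximate vacuum runs per day if dead tuples accumulate steadily."""
    return dead_per_day / vacuum_trigger(reltuples, scale_factor, base_threshold)

ROWS = 5_500_000          # assumption: midpoint of "5-6 million" rows
DEAD_PER_DAY = 2_000_000  # figure given in the question

if __name__ == "__main__":
    for sf in (0.2, 0.05, 0.01):
        runs = vacuums_per_day(ROWS, DEAD_PER_DAY, scale_factor=sf)
        print(f"scale_factor={sf}: about {runs:.1f} vacuum runs per day")
```

With the current setting this works out to fewer than two runs per day, so lowering the scale factor multiplies the number of full passes over the table rather than shrinking each one much.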
There are two approaches:
Reduce autovacuum_vacuum_cost_delay for that table so that autovacuum becomes faster. It will still consume I/O, CPU and RAM.
Set the fillfactor for the table to a value less than 100 and make sure that the column you update frequently is not indexed. Then you could get HOT updates which don't require VACUUM.

Scala concurrency performance issues

I have a data mining app.
There is one Mining Actor, which receives and processes a JSON document containing 1000 objects. I put these into a list and, for each one, log the data by sending it to a single Logger Actor, which writes the data into many files.
Processing the list sequentially, my app uses 700MB and takes ~15 seconds of 20% cpu power to process (4 core cpu). When I parallelize the list, my app uses 2GB and ~ the same amount of time and cpu to process.
My questions are:
Since I parallelized the list and thus the computation, shouldn't the compute-time decrease?
I think having only one Logger Actor is a bottleneck in this case. The computation may be faster but the bottleneck hides the speed increase. So if I add more Loggers to the pool, the app time should decrease?
Why does the memory usage jump to 2GB? Does the JVM have to store the entire collection in memory to parallelize it? And after the computation is done, the JVM garbage collector should deal with it?
Without more details, any answer is a guess. However, even a guess might point you in the right direction.
Parallelized execution should decrease the running time but your problem might lie elsewhere. For some reason, your CPU is idling a lot even in the single-threaded mode. You do not specify whether you read the input from disk or the network or where you write your output to. You explicitly say that you write logs to a lot of files. Disk and network reading/writing might in your case take much longer than data processing. Most probably your process is idle due to this I/O waiting. You should not expect any speedups from parallelizing a job that spends 80% of its time waiting on I/O. I therefore also suspect that loggers are not the bottleneck here.
The memory usage might jump if your threads allocate a lot of memory each. In that case, the more threads you have the more memory will be required. I don't know what kind of collection you are parallelizing on, but most are stored in memory, completely. Yes, the garbage collector will free any resources that do not require you to explicitly free them, such as files.
How many threads for reading and writing to the hard disk?
The memory increases because I send messages faster than the Logger can write, so the Mailbox balloons in size until the Logger has processed the messages and the GC kicks in.
I solved this by writing state to a protocol buffer file. Before doing any writes, I compare with the protobuf file because reads are significantly cheaper than writes. My resource usage is now 10% for 2 seconds, and less than 400MB RAM.
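The mailbox growth described above is a general producer/consumer effect, not something specific to actors. As an illustration only (this is plain Python threading, not the Akka code from the question, and the function name is made up for the sketch), a bounded queue makes the fast sender block instead of letting pending messages pile up in memory:

```python
import queue
import threading
import time

def run_pipeline(n_messages, max_pending):
    """Push n_messages from a fast producer to a slow consumer.

    The queue is bounded, so put() blocks once max_pending messages
    are waiting. Returns (messages processed, peak queue length seen).
    """
    q = queue.Queue(maxsize=max_pending)
    processed = []

    def consumer():
        while True:
            item = q.get()
            if item is None:          # sentinel: no more messages
                break
            time.sleep(0.001)         # stand-in for a slow file write
            processed.append(item)

    worker = threading.Thread(target=consumer)
    worker.start()

    peak = 0
    for i in range(n_messages):
        q.put(i)                      # blocks while the queue is full
        peak = max(peak, q.qsize())
    q.put(None)
    worker.join()
    return len(processed), peak

if __name__ == "__main__":
    done, peak = run_pipeline(200, max_pending=10)
    print(f"processed {done} messages, peak backlog {peak}")
```

The trade-off is that the producer now runs at the consumer's pace, which is usually what you want when the alternative is unbounded memory growth.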

Efficiently checking for a rare occurrence

I have to process many millions of data records. Each data record has a record-type string at the beginning. Processing is record-type-dependent but does not require an 'if'/'elsif' chain on the type, just selecting an array-slice mask from a hash.
However, on the order of once per million records I might encounter a record type that requires a totally different kind of processing.
I hate to insert an 'if' testing for this record type that will return 'true' so rarely.
Any suggestions?
Thanks
Meir
The answer is: Don't worry about it.
The speed of your CPU is considerably higher than that of your disk IO, so an if test is just not going to make a lot of difference - even if you ignored e.g. branch prediction algorithms.
An SSD will do about 1500 IO operations per second, and to quote Borodin from the comments:
A reasonable average disk read speed is 100MB per second. Say your records are 100 bytes each, that means you can read 1 million records per second, or 1μs per record. A 2011 Intel Core i5 processor runs at 83,000 MIPS, and so can execute 83,000 instructions in the time taken to read one record. It is pointless to avoid a few test and branch instructions amongst all that.
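The arithmetic in that quote is easy to check. The sketch below just restates the quoted figures (100MB/s disk, 100-byte records, 83,000 MIPS) in Python; the variable names are only for illustration.

```python
# Back-of-envelope check of the figures quoted above.
DISK_BYTES_PER_SEC = 100 * 10**6       # 100 MB per second
RECORD_BYTES = 100                     # assumed record size from the quote
CPU_INSTR_PER_SEC = 83_000 * 10**6     # 83,000 MIPS

# Records the disk can deliver per second, and the CPU budget per record.
records_per_sec = DISK_BYTES_PER_SEC // RECORD_BYTES
instr_per_record = CPU_INSTR_PER_SEC // records_per_sec

if __name__ == "__main__":
    print(f"{records_per_sec:,} records/s, {instr_per_record:,} instructions per record")
```

A single extra compare-and-branch is a handful of instructions out of that per-record budget.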
Basically this is true in any code - your IO to storage is almost always your limiting factor, because CPUs have followed Moore's law, but the actual rotational speed of a spinning disk hasn't really changed in 15+ years. SSDs are something of a revolutionary change, but they're still too expensive to use as bulk storage options (and even if that wasn't true, they're still going to be the bottleneck on a sustained data transfer/processing operation).

IPC: Ramdisk V.S. socket

I need to transfer huge amount of data between Java and C++ programs under Linux(CentOS). Performance is the first concern.
What will be the best choice? RAMDisk (/dev/shm/) or local socket?
A socket is fastest because the other end can start processing the data (on a separate cpu core) before you have finished sending data.
Say you're sending 100KB of data: the other end can begin processing as soon as it receives a couple of kilobytes. By the time all 100KB has been sent, it has probably finished processing 90KB or thereabouts, so it only has 10KB left.
With a RAM disk, by contrast, you have to write the entire 100KB before the other end can even start processing it. Measured by how much work remains once the last byte has been written, that makes a socket about 10x faster than a RAM disk, assuming both ends need to do about the same amount of work.
Maybe it takes 1 millisecond to write 100KB to a RAM disk and then 1 millisecond to process it. With a socket it would take 1 millisecond to send the data but only 0.1 millisecond to finish processing after all the data has been sent.
The larger the amount of data being sent, the bigger the performance gain for sockets. For example, it might take 10 seconds to write all the data, and only another 0.1 millisecond to finish processing after all the data has been sent.
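The timing argument above can be written down as a very simple two-stage pipeline model. This is only a sketch under stated assumptions: the data is split into equal chunks, each chunk takes a fixed time to send and a fixed time to process, and there are no other costs. The function names and the 10-chunk split are made up for the illustration.

```python
def batch_time(chunks, send_us, process_us):
    """RAM-disk style: write every chunk first, then process every chunk."""
    return chunks * send_us + chunks * process_us

def pipelined_time(chunks, send_us, process_us):
    """Socket style: the receiver processes one chunk while the next is sent."""
    if chunks == 0:
        return 0
    # First chunk must arrive, then the slower stage paces the rest,
    # then the last chunk still has to be processed.
    return send_us + (chunks - 1) * max(send_us, process_us) + process_us

if __name__ == "__main__":
    # 100KB as 10 chunks; 100 microseconds to send and to process each chunk.
    print("batch:    ", batch_time(10, 100, 100), "us")
    print("pipelined:", pipelined_time(10, 100, 100), "us")
```

Under these assumptions the end-to-end time drops from 2 ms to about 1.1 ms, matching the 1 ms plus 0.1 ms figure above. The 10x applies to the work left after the last byte is sent, while the overall gain in this model is a bit under 2x.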
However, a RAM disk is easier to work with. Sockets use streams of data, which is more time consuming in terms of writing the code and debugging/testing it.
Also, don't assume you need a ram disk. Depending on how the operating system has been configured writing 100MB to a spinning platter hard drive might simply write it to a RAM cache and then put it on the hard drive later on. You can read it from the temporary RAM cache immediately without waiting for the data to be written to the HDD. Always test before making performance assumptions. Do not assume a HDD is slower than RAM, because it might be optimised out for you silently.
The mac I'm typing this on, which is UNIX just like CentOS, currently has about 8GB of RAM dedicated to holding copies of files it guesses I'm going to read at some point in the near future. I didn't have to create a RAM disk manually, it just put them in RAM heuristically. CentOS does the same sort of thing, you have to test it to see how fast it actually is.
But sockets are definitely the fastest option, since you do not need to write all the data to start processing it.

What is the fastest way to read 10 GB file from the disk?

We need to read and count different types of messages/run
some statistics on a 10 GB text file, e.g a FIX engine
log. We use Linux, 32-bit, 4 CPUs, Intel, coding in Perl but
the language doesn't really matter.
I have found some interesting tips in Tim Bray's
WideFinder project. However, we've found that using memory mapping
is inherently limited by the 32 bit architecture.
We tried using multiple processes, which seems to work
faster if we process the file in parallel using 4 processes
on 4 CPUs. Adding multi-threading slows it down, maybe
because of the cost of context switching. We tried changing
the size of thread pool, but that is still slower than
simple multi-process version.
The memory mapping part is not very stable, sometimes it
takes 80 sec and sometimes 7 sec on a 2 GB file, maybe from
page faults or something related to virtual memory usage.
Anyway, Mmap cannot scale beyond 4 GB on a 32 bit
architecture.
We tried Perl's IPC::Mmap and Sys::Mmap. Looked
into Map-Reduce as well, but the problem is really I/O
bound, the processing itself is sufficiently fast.
So we decided to try to optimize the basic I/O by tuning
buffer size, type, etc.
Can anyone who is aware of an existing project where this
problem was efficiently solved in any language/platform
point to a useful link or suggest a direction?
Most of the time you will be I/O bound, not CPU bound, so just read this file through normal Perl I/O and process it in a single thread. Unless you can prove that your I/O can deliver more data than a single CPU can process, don't waste your time with anything more. Anyway, you should ask: Why on Earth is this in one huge file? Why on Earth don't they split it in a reasonable way when they generate it? That would be an order of magnitude more worthwhile. Then you can put the parts on separate I/O channels and use more CPUs (if you don't use some sort of RAID 0 or NAS or ...).
Measure, don't assume. Don't forget to flush caches before each test. Remember that sequential I/O is an order of magnitude faster than random I/O.
This all depends on what kind of preprocessing you can do and when.
On some of the systems we have, we gzip such large text files, reducing them to 1/5 to 1/7 of their original size. Part of what makes this possible is that we don't need to process these files until hours after they're created, and at creation time we don't really have any other load on the machines.
Processing them is done more or less in the fashion of zcat thosefiles | ourprocessing (well, it's done over unix sockets with a custom-made zcat). It trades CPU time for disk I/O time, and for our system that has been well worth it. There are of course a lot of variables that can make this a very poor design for a particular system.
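The same zcat-into-a-processor idea can be sketched in a single process with Python's gzip module, which decompresses as it reads so the uncompressed file never has to exist on disk. This is an illustration, not the poster's setup: the function name, the "|" separator, and the "first field is the message type" layout are assumptions made for the example.

```python
import gzip
from collections import Counter

def count_first_fields(path, sep="|"):
    """Stream a gzip-compressed text file and count lines by their first field."""
    counts = Counter()
    # "rt" decompresses on the fly and yields text lines one at a time.
    with gzip.open(path, "rt", encoding="ascii", errors="replace") as fh:
        for line in fh:
            key = line.rstrip("\n").split(sep, 1)[0]
            counts[key] += 1
    return counts

if __name__ == "__main__":
    import sys
    for key, n in count_first_fields(sys.argv[1]).most_common(10):
        print(key, n)
```

Whether this wins depends on the machine: it reads several times fewer bytes from disk but spends CPU on decompression.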
Perhaps you've already read this forum thread, but if not:
http://www.perlmonks.org/?node_id=512221
It describes using Perl to do it line-by-line, and the users seem to think Perl is quite capable of it.
Oh, is it possible to process the file from a RAID array? If you have several mirrored disks, then the read speed can be improved. Competition for disk resources may be what makes your multiple-threads attempt not work.
Best of luck.
I wish I knew more about the content of your file, but not knowing other than that it is text, this sounds like an excellent MapReduce kind of problem.
PS, the fastest read of any file is a linear read. cat file > /dev/null should be the speed that the file can be read.
Have you thought of streaming the file and filtering out to a secondary file any interesting results? (Repeat until you have a manageable size file.)
Basically you need to "divide and conquer": if you have a network of computers, copy the 10GB file to as many client PCs as possible and get each client PC to read from a different offset of the file. As an added bonus, get each PC to implement multi-threading in addition to the distributed reading.
Parse the file once, reading line by line. Put the results in a table in a decent database. Run as many queries as you wish. Feed the beast regularly with new incoming data.
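A minimal sketch of the parse-once-then-query approach, using SQLite from Python's standard library as a stand-in for "a decent database". It assumes each line of the log is one raw FIX message with SOH-separated tag=value fields (real logs often carry a timestamp or other prefix, which would need stripping first). Tags 35 (MsgType) and 49 (SenderCompID) are standard FIX tags; the table and function names are made up for the example.

```python
import sqlite3

SOH = "\x01"  # standard FIX field delimiter

def parse_line(line):
    """Extract (MsgType, SenderCompID) from one FIX message line."""
    fields = dict(f.split("=", 1) for f in line.rstrip("\n").split(SOH) if "=" in f)
    return fields.get("35"), fields.get("49")

def load_messages(conn, lines):
    """Parse each line once and store the extracted fields in a table."""
    conn.execute("CREATE TABLE IF NOT EXISTS msgs (msg_type TEXT, sender TEXT)")
    # A generator keeps memory flat even for a very large input file.
    conn.executemany("INSERT INTO msgs VALUES (?, ?)", (parse_line(l) for l in lines))
    conn.commit()

def count_by_type(conn):
    """Example query: number of messages per MsgType."""
    return dict(conn.execute("SELECT msg_type, COUNT(*) FROM msgs GROUP BY msg_type"))

if __name__ == "__main__":
    import sys
    db = sqlite3.connect("fixlog.db")
    with open(sys.argv[1], "r", errors="replace") as fh:
        load_messages(db, fh)
    print(count_by_type(db))
```

After the one-off load, further statistics become extra SELECT statements instead of extra passes over the 10 GB file.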
Realize that manipulating a 10 GB file, transferring it across the network (even a local one), exploring complicated solutions, etc. all take time.
I have a co-worker who sped up his FIX reading by going to 64-bit Linux. If it's something worthwhile, drop a little cash to get some fancier hardware.
hmmm, but what's wrong with the read() command in C? Usually has a 2GB limit,
so just call it 5 times in sequence. That should be fairly fast.
If you are I/O bound and your file is on a single disk, then there isn't much to do. A straightforward single-threaded linear scan across the whole file is the fastest way to get the data off of the disk. Using large buffer sizes might help a bit.
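As a baseline to measure against, a single-threaded linear scan with a large reusable buffer might look like the sketch below. The 1 MiB buffer size is just an assumed starting point to experiment with, and the function name is made up. Remember that a second run over the same file may be served from the OS page cache and look much faster than the disk really is.

```python
import time

def sequential_read(path, buffer_size=1 << 20):
    """Read a file front to back in large chunks.

    Returns (total bytes read, elapsed seconds).
    """
    total = 0
    buf = bytearray(buffer_size)          # reused for every read, no per-chunk allocation
    start = time.perf_counter()
    # buffering=0 gives a raw file object, so each readinto() is one OS-level read.
    with open(path, "rb", buffering=0) as fh:
        while True:
            n = fh.readinto(buf)
            if not n:                     # 0 bytes means end of file
                break
            total += n
    return total, time.perf_counter() - start

if __name__ == "__main__":
    import sys
    size, secs = sequential_read(sys.argv[1])
    rate = size / secs / 1e6 if secs > 0 else float("inf")
    print(f"{size} bytes in {secs:.2f}s ({rate:.0f} MB/s)")
```

If the real processing loop cannot get close to this number, the bottleneck is the processing and not the disk.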
If you can convince the writer of the file to stripe it across multiple disks / machines, then you could think about multithreading the reader (one thread per read head, each thread reading the data from a single stripe).
Since you said platform and language doesn't matter...
If you want stable performance that is as fast as the source medium allows, the only way I am aware of to do this on Windows is overlapped, non-OS-buffered, aligned sequential reads. You can probably get to some GB/s with two or three buffers; beyond that, at some point you need a ring buffer (one writer, 1+ readers) to avoid any copying. The exact implementation depends on the driver/APIs. If there's any memory copying going on in the thread dealing with the IO (in either kernel or user mode), then obviously the larger the buffer to copy, the more time is wasted on that rather than on doing the IO. So the optimal buffer size depends on the firmware and driver. On Windows, good values to try are multiples of 32 KB for disk IO. Windows file buffering, memory mapping and all that stuff adds overhead. It is only worthwhile when doing multiple reads of the same data, random access, or both. So for reading a large file sequentially a single time, you don't want the OS to buffer anything or do any memcpy's. If using C# there are also penalties for calling into the OS due to marshaling, so the interop code may need a bit of optimization unless you use C++/CLI.
Some people prefer throwing hardware at problems, but if you have more time than money, in some scenarios it's possible to optimize things to perform 100-1000x better on a single consumer-level computer than on 1000 enterprise-priced computers. The reason is that if the processing is also latency sensitive, going beyond two cores is probably adding latency. This is why drivers can push gigabytes/s while enterprise software ends up stuck at megabytes/s by the time it's all done. Whatever reporting, business logic and such the enterprise software does can probably also be done at gigabytes/s on a two-core consumer CPU, if written the way you would have written a game back in the 80's. The most famous example I've heard of approaching their entire business logic in this manner is the LMAX forex exchange, which published some of their ring-buffer-based code, which was said to be inspired by network card drivers.
Forgetting all the theory, if you are happy with < 1 GB/s, one possible starting point on Windows I've found is looking at readfile source from winimage, unless you want to dig into sdk/driver samples. It may need some source code fixes to calculate perf correctly at SSD speeds. Experiment with buffer sizes also.
The switches /h multi-threaded and /o overlapped (completion port) IO with optimal buffer size (try 32,64,128 KB etc) using no windows file buffering in my experience give best perf when reading from SSD (cold data) while simultaneously processing (use the /a for Adler processing as otherwise it's too CPU-bound).
I seem to recall a project in which we were reading big files. Our implementation used multithreading: basically n worker threads started at incrementing offsets of the file (0, chunk_size, 2 x chunk_size, 3 x chunk_size ... (n-1) x chunk_size) and each read smaller chunks of information. I can't exactly recall our reasoning for this, as someone else was designing the whole thing. The workers weren't the only thing to it, but that's roughly how we did it.
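One detail the offset scheme has to handle for a line-oriented log is that a raw byte offset usually lands in the middle of a line. The sketch below is not the project's code, just one way to do it: it computes per-worker byte ranges and nudges each boundary forward to the next newline so that every worker sees only whole lines. The function names are invented for the example.

```python
import os

def chunk_ranges(path, n_workers):
    """Split a file into up to n_workers contiguous (start, end) byte ranges,
    with every boundary placed just after a newline."""
    size = os.path.getsize(path)
    chunk = max(1, size // n_workers)
    bounds = [0]
    with open(path, "rb") as fh:
        for i in range(1, n_workers):
            target = i * chunk
            if target <= bounds[-1]:
                continue                  # previous boundary already passed this point
            fh.seek(target)
            fh.readline()                 # advance to the end of the current line
            pos = fh.tell()
            if pos >= size:
                break                     # nothing left for further workers
            bounds.append(pos)
    bounds.append(size)
    return list(zip(bounds[:-1], bounds[1:]))

def count_lines(path, start, end):
    """Work done by one worker: count the lines inside its byte range."""
    with open(path, "rb") as fh:
        fh.seek(start)
        return fh.read(end - start).count(b"\n")

if __name__ == "__main__":
    import sys
    for start, end in chunk_ranges(sys.argv[1], 4):
        print(start, end, count_lines(sys.argv[1], start, end))
```

Each range can then be handed to a separate process or thread. Whether that is faster than one linear scan depends on the storage, as other answers here point out.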
Hope it helps
It's not stated in the problem whether sequence matters. So, divide the file into equal parts, say 1 GB each. Since you are using multiple CPUs, multiple threads won't be a problem, so read each part using a separate thread. With more than 10 GB of RAM, the entire contents can then be held in memory, read by multiple threads.