If data fits on a single machine does it make sense to use Spark? - scala

I have 20GB of data that requires processing, all of this data fits on my local machine. I'm planning on using Spark or Scala parallel colleections to implement some algorithms and matrix multiplication against this data.
Since the data fits on a single machine should I use Scala parallel collections ?
Is this true : The main bottleneck in parallel tasks is getting the data to the CPU for processing, so since all of the data is as close as can be to the CPU Spark will not give any significant performance improvement ?
Spark will have the overhead setting up parallel tasks even though it will be just running on one machine, so this overhead is redundant in this case ?

It's hard to provide some non-obvious instructions, like if you had your data and doesn't goes up to the 80% of memory and ..., then use local mode. Having said this, there are couple of points, which, in general, may make you use spark even if your data fits one's machine memory:
really intensive CPU processing, from the top of my head, it might be complicated parsing of texts
stability -- say you have many processing stages and you don't want to lose results, once your single machine goes down.
it's especially important in case you have recurrent calculations, not one-off queries (this way, time you spend on bringing spark to the table might pay-off)
streaming -- you get your data from somewhere in a stream manner, and, though it's snapshot fits single machine, you have to orchestrate it somehow
In your particular case
so since all of the data is as close as can be to the CPU Spark will
not give any significant performance improvement
Of course it's not, Spark is not a voodoo magic that somehow might get your data closer to the CPU, but it can help you scale among machines, thus CPUs (point #1)
Spark will have the overhead setting up parallel tasks even though it
will be just running on one machine, so this overhead is redundant in
this case ?
I may sound captain obvious, but
Take #2 and #3 into consideration, do you need them? If yes, go spark or something else
If no, implement your processing in a dumb way (parallel collections)
Profile and take a look. Are your processing is CPU bound? Can you speed up it, without lot of tweaks? If no, go spark.
There is also [cheeky] point 4) in the list of Why should I use Spark?. It's the hype -- Spark is a very sexy technology which is easy to "sell" to both your devs (it's the cutting edge of big data) and the company (your boss, in case you're building your own product, your customer in case you're building product for somebody else).

Related

A program resistent to power/hardware/OS failures

I need to write a program that performs a parallel search in a large space of possible states, with new areas being discovered (and their exploration started) in the process, and exploration of some areas being terminated early as intermediate results obtained elsewhere eliminate a possibility of discovering new useful results in them. The search is performed using multiple threads running in a heavy cooperation with each other to avoid recalculation of intermediate data.
A complex internal state (including call stacks of several threads and state synchronization primitives they use) has to be maintained and updated during the whole process, and there is no apparent way to split the computation into isolated chunks that can be executed sequentially, each saving and passing a small intermediate result to the next. Also, there is no way to split the computation into independent parallel threads not communicating with each other, without imposing a prohibitive overhead due to recalculation of large amount of intermediate data.
Because of the large search domain, the program would possibly run for months before producing a final result. Hence, there is a significant risk of power, hardware or OS failure during the program execution that can lead to a complete loss of all work that has been done to the moment. In such a case the program will need to restart all its computations from scratch.
I need a solution that can prevent a complete data loss in such cases. I thought of an execution engine/platform that continuously saves the current state of the process to a failure-resistant storage like a redundant disk array or database. But I understand that this approach can significantly slow down the process, even to a degree when there would be no benefit compared to an expected computation time including restarts due to possible failures.
In fact, I do not need an ideal solution that continuously saves the program state, and I can easily bear a loss of hours or maybe even days of work. A possible heavyweight solution that comes to my mind is to run the program inside a virtual machine, saving its snapshots from time to time, and restoring the machine after a possible host failure from a recent snapshot. This approach can also help to recover the program state after a random or preventable guest OS failure.
Is there a similar, but more lightweight solution limited to preserving a state of a single process? Or could you suggest any other approaches that can solve my problem?
You may want to look at using Erlang which allows large numbers of threads to run at relatively low cost. Because the thread cost is low, redundancy can be used to achieve increased reliability.
For the problem you present, a triple-redundancy scheme may be the way to go, where periodic checks for synchronization across the three (or more) systems would determine by vote who has failed.

StarCounter and CAP

I have been reading about a database named Starcounter. It makes a claim that it can handle loads that a "NoSql"-database only can handle without dropping consistency. As far as I understand the CAP-theorem, if you keep consistency, you lose availability or partition tolerance. So what trick makes StarCounter work?
I can imagine that StarCounter is fast, but the claim that NoSql needs to drop consistency to keep up seems a little bit strange to me. Can anyone please explain?
Thanks in advance
Roland
The short answer
The CAP theorem (aka Brewers theorem) cannot be beaten for a single piece of information (like a consistent database). If you have a horizontally scaled database, you won't get consistency and performance. This conclusion comes from the laws of physics and can be deducted from Brewers theorem and Einsteins theories of relativity. You need to scale-in/up, not out. Not very "cloudy", but as the enemies of Galileo would probably confess if they were alive today, nature does a poor job at honouring human fashion.
Scaling consistent data
I'm sure there are other approaches, but Starcounter works by hosting the database image in RAM. Instead of moving database data to the application code, parts of the application code is moved to the database. Only data in the final response gets moved from the original place in RAM memory (where the data was in the first place). This makes most of the data stay put even if there are millions of requests processed every second. The downside is that the database needs to know the programming language of your application logic. The upside, however, is obvious if you have ever tried to serve millions of HTTP requests/sec, each requiring extensive database access.
A more thourough answer
The question is a good one. It is no wonder you find it strange as it was only a few years back that CAP was proven (turned into a theorem). Many developers are as disappointed as a kid would be when theoretical physicist tells him to stop looking for the perpetual motion machine because it cannot work. We still want the scale-out consistent database, don't we?
The CAP theorem
The CAP theorem gives that any piece of information cannot have consistency (C), availability (A) and partition tolerance (P). It applies to a unit of information (such as a database). You can of course have independent pieces of information that operates differently. One piece could be AP, another could be CA and a third could be CP. You just cant have the same information being CAP.
The problem with the impossibility of the 'P' in a consistent and available database results in how a scaled-out database MUST do signalling between the nodes. The conclusion must be, that even in a hundred years from now, CAP gives that a single piece of consistent data will have to live on hardware interconnected using hard wires or light beams.
The problem with the P in CAP
The problem lies in performance if you apply horizontal scaling to an available consistent database. A good performance was the very reason to do horizontally scaling in the first place, this is a very bad thing. As every node needs communicate with the other nodes whenever there is database access to achieve consistency, and given the fact that signalling is ultimately limited by the speed of light, you are left with sad but true fact that database scientist (as well as CPU scientists) are not just being stubborn for failing to see scale-out as a a magical silver bullet. It will not happen because it cannot happen (however, parts of your database could be placed in a AP set, so remember, we are talking about consistent data here). Adding the theories of Einstein to the CAP theorem and the small box wins of the cloudy data-center for consistent data.
Perpetual machines and CAP
The state of things in the database community is a little bit like the state of perpetual motion machines when horse and carriage was the way to get to work. Without any theoretical evidence against it, the patent offices granted hundreds of patents for impossible perpetual machines. Today, we may laugh at this, but we have a similar situations in the database industry with consistent scale-out databases. When you hear somebody claim that they have a scale-out ACID database, be cautious. It was only after the dot com crash mathematicians at MIT proved Brewer right at the CAP theorem was officially born, so the hunt for the impossible has unfortunately not died off just yet. You can compare this, if you want, to the way laggards kept trying to invent the perpetual machine for years after modern theoretical physics should reasonably have put a stop to it. Old habits die hard (my apologies to anyone on Stack overflow still making drawings of bearings and arms moving ad finitum on their own accord - I don't mean to be offensive).
CAP and performance
All is not lost however. Not all pieces of information needs to be consistent. Not all pieces needs to scale-out. You just have the accept Brewers theorem and make the best out of it.
For applications such as Facebook, consistency is dropped. This is okay as data is entered once and then is manipulated by a single users. Still we can experience the side effects in everyday Facebook usage such as things popping in and out of existence for a while.
However, in most business applications, data needs to be correct. The sum of all accounts in your bookkeeping needs to amount to zero. Your stock inventory must equal to 8 if you sold 2 out of 10 items even if there are multiple users buying from the same stock.
The problem with scaling out available data is that you have to make do without partition tolerance. This fancy word simply means that you have to signal between the nodes in your cloud at all times. And as it takes light a few nanoseconds to travel a single meter, this becomes impossible without making your scale-out result in less performance rather than more performance. Of course, this is only true for consistent data. The implications of this has been known by the engineers of Intel, AMD, Oracle et. al for a long time. It is not their scientist haven't heard of scale-out. It is just that they have come to accept the world as Einstein described it.
Some comfort in the gloom
If you do the math, you find that a single PC has instructions to spare on each human being living on Earth for each second it is running (google on 'modern CPU' and 'MIPS'). If you do some more math, like taking the total turnover of Amazon.com (you can find it at wwww.nasdaq.com) divided by the price of an average book, you will find that the total number of sales transactions can fit in RAM of a single modern PC. The cool thing is that the number of items, customers, orders, products etc. occupies the same amount of space in 2012 as it did in 1950. Images, video and audio has increased in size, but numeric and textual information does not grow per item. Sure the number of transactions grows, but not in the same phase as computer power grows. So the logical solution is to scale out read-only and AP data and "scale-in/up" business data.
"Scale-in" instead of "scale-out"
Database engines and business logic running in a VM (like the Java VM or the .NET CLR) typically use fairly effective machine code. This means that moving memory is the overshadowing bottleneck of total throughput for a consistent database. This is often referred to as the memory wall (wikipedia has some useful information).
The trick is to transfer code to the database image instead of data from the database image to the code (if using a MVC or a MVVM pattern). This means that the consuming code executes in the same address space as the database image and that data is never moved (and the disk is merely securing transactions and images). Data can stay in the original database image and does not have to be copied into the memory of the application. Instead of treating the database as a RAM database, the database is treated as primary memory. Everything stays put.
Only data that is part of the final user response is moved out of the database image. For a large scale applications with hundreds of millions of simultaneous users this typically amounts to only a few million requests per second, something that a single PC has no problem with handling given that the HTTP packaging is done on gateway servers. Fortunately, such servers scales out beautifully as they don't need to share data.
As it turns out, the disk is fast at sequential writes so a raided disk can persist terabytes or changes every minute.
Horizontal scaling in Starcounter
Normally you do not scale a Starcounter node. It scales-in rather than out. This works well for a few million simultaneous users. To go above that, you need to add more Starcounter nodes. They can be used to partition data (but then you lose consistency and Starcounter is not designed for partitioning so it is less elegant than solutions such as Volt DB). So a better alternative is to use the additional Starcounter nodes as gateway servers. These servers simple accumulates all incoming HTTP requests for a millisecond at a time. This might sound like a short amount of time, but it is enough to accumulate thousands of request if you decided you need to scale Starcounter. The batch of requests are then sent to the ZLATAN node (Zero LATency Atomicity Node) a thousand times a second. Each such batch can contain thousands of requests. In this way, a few hundred million user sessions can be served by a single ZLATAN node. Although you can have several ZLATAN nodes, there is only one active ZLATAN node at a time. This is how the CAP theorem is honored. To go above that, you need to consider the same tradeoff as Facebook and others.
Another important note is that the ZLATAN node does not serve applications with data. Instead, the applications controller code is run by the ZLATAN node. The cost of serializing/deserializing and sending data to an application is far greater than to process the controller logic cycles. I.e. the code is sent to the database instead of the other way around (a traditional approach is that the applications asks for data or sends data).
Making the "shared-everything" node faster by doing less
The use of the database as a "heap" for the programming language instead of a remote system for serialization and deserialization is a trick that Starcounter calls VMDBMS. If the database is in RAM, you should not move data from one place in RAM to another place in RAM which is the case with most RAM databases.
There is no 'trick'. Starcounter is talking about speed, while CAP/NoSQL are talking about scalability. There is a trade-off between features+scalability vs speed.
Sometimes it's OK to ignore scalability if you can prove there are bottlenecks elsewhere. For instance, a new startup shouldn't worry about their website scaling to a million users, they should worry about getting their first hundred users. (Does anyone remember how often Twitter was down in the early days?) Starcounter can be useful if their transaction rate is much greater than your web page hit rate.
On the other hand, I don't trust anyone who lumps all "NoSQL" Databases together. The various NoSQL databases are more different than alike. They have radically different architectures and properties. Some of them scale to thousands of nodes, some of them don't scale beyond one node. Sometimes adding scalability slows you down. Sometimes removing features speeds you up.
http://strata.oreilly.com/2010/12/strata-gems-mysql-handlersocket.html

Distributing work to multiple cores: Hadoop or Scala's parallel collections?

What is the better way of making full use of multiple cores for parallel processing in a Scala/Hadoop system?
Let's say I need to process 100 million documents. Documents are not very large, but processing them is computationally intensive. If I have a Hadoop cluster with 100 machines with 10 cores each, I could either:
A) send 1000 documents to each machine and let Hadoop start a map on each of the 10 cores (or as many as are available)
or
B) send 1000 documents to each machine (still using Hadoop) and use Scala's parallel collections to make full use of the multiple cores. (I would put all documents in a parallel collection, and then call map on the collection). In other words, use Hadoop for distribution at cluster level, and use parallel collections to manage the distribution to cores within each machine.
Hadoop is going to offer a lot more than just parallelization. It offers a platform to distribute work, a scheduler for handling concurrent jobs, a distributed filesystem, the ability to perform a distributed reduce, and fault tolerance. That said, it is a complicated system and can sometimes be difficult to work with.
If you plan to have multiple users submitting many different jobs, Hadoop is the way to go (out of the two options). However, if you are devoting a cluster to be always be processing documents through the same function, you could, without too much trouble, develop a system with Scala parallel collections and actors for inter-machine communication. The Scala solution would give you more control, the system could respond in real time, and you wouldn't have to deal with a lot of Hadoop configuration that doesn't pertain to your task.
If you need to run varied jobs over large amounts of data (larger than would fit on a single node), then use Hadoop. I can give you more information if you describe your requirements in more detail.
Update: one million is a fairly small number. You might want to do some calculations and see how long it would take on a single machine with parallel collections. The advantage here is that the development time is minimal!
The answer depends on the following question - does your Scala code capable to fully utilize all cores available. Probabbly if you have good intrinsic synchronization between parts of the document to be processed or some other way to parralelyze algorithm without lock contention - then the "B"" is the way. If so - configure one mapper per node and let your mapper to utilize cores in a best way.
If your gain from the parralelization is not that good, and adding more threads (cores) to the processing does not improve performance in a linear way - then the "A" can be better way. Efficiency of "A" also depends on the size of your RAM - you will need enough ram for 10 mappers per node.
I can suspect that ideal solution can be somewhere in between. So my suggestion is to develop mapper which takes number of threads used as a parameter and then do a few tests increasing number of threads per mapper and decreasing number of mappers per node.
Hadoop is not very good for processing a lot of small files, but for processing a small amount of very large files. Is there any way you can merge the files before processing them, or are they all totally different? Hadoop takes care of distribution and parallelism itself, so there is no need to explicitly send X docs to Y machines. And also i don't think you should use hadoop only as a distribution mechanism, that is not what it's made for. You should either use a real map/reduce, or build your own system for whatever you are trying to do, but not try to bend hadoop to your will.

What is the fastest way to read 10 GB file from the disk?

We need to read and count different types of messages/run
some statistics on a 10 GB text file, e.g a FIX engine
log. We use Linux, 32-bit, 4 CPUs, Intel, coding in Perl but
the language doesn't really matter.
I have found some interesting tips in Tim Bray's
WideFinder project. However, we've found that using memory mapping
is inherently limited by the 32 bit architecture.
We tried using multiple processes, which seems to work
faster if we process the file in parallel using 4 processes
on 4 CPUs. Adding multi-threading slows it down, maybe
because of the cost of context switching. We tried changing
the size of thread pool, but that is still slower than
simple multi-process version.
The memory mapping part is not very stable, sometimes it
takes 80 sec and sometimes 7 sec on a 2 GB file, maybe from
page faults or something related to virtual memory usage.
Anyway, Mmap cannot scale beyond 4 GB on a 32 bit
architecture.
We tried Perl's IPC::Mmap and Sys::Mmap. Looked
into Map-Reduce as well, but the problem is really I/O
bound, the processing itself is sufficiently fast.
So we decided to try optimize the basic I/O by tuning
buffering size, type, etc.
Can anyone who is aware of an existing project where this
problem was efficiently solved in any language/platform
point to a useful link or suggest a direction?
Most of the time you will be I/O bound not CPU bound, thus just read this file through normal Perl I/O and process it in single thread. Unless you prove that you can do more I/O than your single CPU work, don't waste your time with anything more. Anyway, you should ask: Why on Earth is this in one huge file? Why on Earth don't they split it in a reasonable way when they generate it? It would be magnitude more worth work. Then you can put it in separate I/O channels and use more CPU's (if you don't use some sort of RAID 0 or NAS or ...).
Measure, don't assume. Don't forget to flush caches before each test. Remember that serialized I/O is a magnitude faster than random.
This all depends on what kind of preprocessing you can do and and when.
On some of systems we have, we gzip such large text files, reducing them to 1/5 to 1/7 of their original size. Part of what makes this possible is we don't need to process these files
until hours after they're created, and at creation time we don't really have any other load on the machines.
Processing them is done more or less in the fashion of zcat thosefiles | ourprocessing.(well it's done over unix sockets though with a custom made zcat). It trades cpu time for disk i/o time, and for our system that has been well worth it. There's ofcourse a lot of variables that can make this a very poor design for a particular system.
Perhaps you've already read this forum thread, but if not:
http://www.perlmonks.org/?node_id=512221
It describes using Perl to do it line-by-line, and the users seem to think Perl is quite capable of it.
Oh, is it possible to process the file from a RAID array? If you have several mirrored disks, then the read speed can be improved. Competition for disk resources may be what makes your multiple-threads attempt not work.
Best of luck.
I wish I knew more about the content of your file, but not knowing other than that it is text, this sounds like an excellent MapReduce kind of problem.
PS, the fastest read of any file is a linear read. cat file > /dev/null should be the speed that the file can be read.
Have you thought of streaming the file and filtering out to a secondary file any interesting results? (Repeat until you have a manageble size file).
Basically need to "Divide and conquer", if you have a network of computers, then copy the 10G file to as many client PCs as possible, get each client PC to read an offset of the file. For added bonus, get EACH pc to implement multi threading in addition to distributed reading.
Parse the file once, reading line by line. Put the results in a table in a decent database. Run as many queries as you wish. Feed the beast regularly with new incoming data.
Realize that manipulating a 10 Gb file, transferring it across the (even if local) network, exploring complicated solutions etc all take time.
I have a co-worker who sped up his FIX reading by going to 64-bit Linux. If it's something worthwhile, drop a little cash to get some fancier hardware.
hmmm, but what's wrong with the read() command in C? Usually has a 2GB limit,
so just call it 5 times in sequence. That should be fairly fast.
If you are I/O bound and your file is on a single disk, then there isn't much to do. A straightforward single-threaded linear scan across the whole file is the fastest way to get the data off of the disk. Using large buffer sizes might help a bit.
If you can convince the writer of the file to stripe it across multiple disks / machines, then you could think about multithreading the reader (one thread per read head, each thread reading the data from a single stripe).
Since you said platform and language doesn't matter...
If you want a stable performance that is as fast as the source medium allows for, the only way I am aware that this can be done on Windows is by overlapped non-OS-buffered aligned sequential reads. You can probably get to some GB/s with two or three buffers, beyond that, at some point you need a ring buffer (one writer, 1+ readers) to avoid any copying. The exact implementation depends on the driver/APIs. If there's any memory copying going on the thread (both in kernel and usermode) dealing with the IO, obviously the larger buffer is to copy, the more time is wasted on that rather than doing the IO. So the optimal buffer size depends on the firmware and driver. On windows good values to try are multiples of 32 KB for disk IO. Windows file buffering, memory mapping and all that stuff adds overhead. Only good if doing either (or both) multiple reads of same data in random access manner. So for reading a large file sequentially a single time, you don't want the OS to buffer anything or do any memcpy's. If using C# there's also penalties for calling into the OS due to marshaling, so the interop code may need bit of optimization unless you use C++/CLI.
Some people prefer throwing hardware at problems but if you have more time than money, in some scenarios it's possible to optimize things to perform 100-1000x better on a single consumer level computer than a 1000 enterprise priced computers. The reason is that if the processing is also latency sensitive, going beyond using two cores is probably adding latency. This is why drivers can push gigabytes/s while enterprise software is ends stuck at megabytes/s by the time it's all done. Whatever reporting, business logic and such the enterprise software do can probably also be done at gigabytes/s on two core consumer CPU, if written like you were back in the 80's writing a game. The most famous example I've heard of approaching their entire business logic in this manner is the LMAX forex exchange, which published some of their ring buffer based code, which was said to be inspired by network card drivers.
Forgetting all the theory, if you are happy with < 1 GB/s, one possible starting point on Windows I've found is looking at readfile source from winimage, unless you want to dig into sdk/driver samples. It may need some source code fixes to calculate perf correctly at SSD speeds. Experiment with buffer sizes also.
The switches /h multi-threaded and /o overlapped (completion port) IO with optimal buffer size (try 32,64,128 KB etc) using no windows file buffering in my experience give best perf when reading from SSD (cold data) while simultaneously processing (use the /a for Adler processing as otherwise it's too CPU-bound).
I seem to recall a project in which we were reading big files, Our implementation used multithreading - basically n * worker_threads were starting at incrementing offsets of the file (0, chunk_size, 2xchunk_size, 3x chunk_size ... n-1x chunk_size) and was reading smaller chunks of information. I can't exactly recall our reasoning for this as someone else was desining the whole thing - the workers weren't the only thing to it, but that's roughly how we did it.
Hope it helps
Its not stated in the problem that sequence matters really or not. So,
divide the file into equal parts say 1GB each, and since you are using multiple CPUs, then multiple threads wont be a problem, so read each file using separate thread, and use RAM of capacity > 10 GB, then all your contents would be stored in RAM read by multiple threads.

Can a shared ready queue limit the scalability of a multiprocessor system?

Can a shared ready queue limit the scalability of a multiprocessor system?
Simply put, most definetly. Read on for some discussion.
Tuning a service is an art-form or requires benchmarking (and the space for the amount of concepts you need to benchmark is huge). I believe that it depends on factors such as the following (this is not exhaustive).
how much time an item which is picked up from the ready qeueue takes to process, and
how many worker threads are their?
how many producers are their, and how often do they produce ?
what type of wait concepts are you using ? spin-locks or kernel-waits (the latter being slower) ?
So, if items are produced often, and if the amount of threads is large, and the processing time is low: the data structure could be locked for large windows, thus causing thrashing.
Other factors may include the data structure used and how long the data structure is locked for -e.g., if you use a linked list to manage such a queue the add and remove oprations take constant time. A prio-queue (heaps) takes a few more operations on average when items are added.
If your system is for business processing you could take this question out of the picture by just using:
A process based architecure and just spawning multiple producer consumer processes and using the file system for communication,
Using a non-preemtive collaborative threading programming language such as stackless python, Lua or Erlang.
also note: synchronization primitives cause inter-processor cache-cohesion floods which are not good and therefore should be used sparingly.
The discussion could go on to fill a Ph.D dissertation :D
A per-cpu ready queue is a natural selection for the data structure. This is because, most operating systems will try to keep a process on the same CPU, for many reasons, you can google for.What does that imply? If a thread is ready and another CPU is idling, OS will not quickly migrate the thread to another CPU. load-balance kicks in long run only.
Had the situation been different, that is it was not a design goal to keep thread-cpu affinities, rather thread migration was frequent, then keeping separate per-cpu run queues would be costly.