How to handle file paths in distributed environment - celery

I'm working on setting up a distributed celery environment to do OCR on PDF files. I have about 3M PDFs and OCR is CPU-bound so the idea is to create a cluster of servers to process the OCR.
As I'm writing my task, I've got something like this:
#app.task
def do_ocr(pk, file_path):
content = run_tesseract_command(file_path)
item = Document.objects.get(pk=pk)
item.content = ocr_content
item.save()
The question I have what the best way is to make the file_path work in a distributed environment. How do people usually handle this? Right now all my files simply live in a simple directory on one of our servers.

If your are in linux environment the easiest way is mount a remote filesystem, using sshfs, in the /mnt folder foreach node in cluster. Then you can pass the node name to do_ocr function and work as all data is local to current node
For example, your cluster has N nodes named: node1, ... ,nodeN
Let's configure node1, foreach node mount remote filesystem. Here's a sample node1's /etc/fstab file
sshfs#user#node2:/var/your/app/pdfs /mnt/node2 fuse port=<port>,defaults,user,noauto,uid=1000,gid=1000 0 0
....
sshfs#user#nodeN:/var/your/app/pdfs /mnt/nodeN fuse port=<port>,defaults,user,noauto,uid=1000,gid=1000 0 0
In current node (node1) create a symlink named as current server pointing to pdf's path
ln -s /var/your/app/pdfs node1
Your mnt folder should contain remote's filesystem and a symlink
user#node1:/mnt$ ls -lsa
0 lrwxrwxrwx 1 user user 16 apr 12 2016 node1 -> /var/your/app/pdfs
0 lrwxrwxrwx 1 user user 16 apr 12 2016 node2
...
0 lrwxrwxrwx 1 user user 16 apr 12 2016 nodeN
Then your function should look like this:
import os
MOUNT_POINT = '/mtn'
#app.task
def do_ocr(pk, node_name, file_path):
content = run_tesseract_command(os.path.join(MOUNT_POINT,node_name,file_path))
item = Document.objects.get(pk=pk)
item.content = ocr_content
item.save()
It works like all files are in the current machine but there's remote-logic working for you transparently

Well, there are multiple ways to handle it, but let's stick to one of the simpliest one:
since you'd like to process big amount of files using multiple servers, my first suggestion would be to use the same OS in each server, so you won't have to worry about cross-platform compatibility
using the word 'cluster' indicates that all of those servers should know their mutual state - it adds complexity, try to switch to the farm of stateless workers (by 'stateless' I mean "not knowing about other's" as they should be aware of at least their own state, e.g.: IDLE, IN_PROGRESS, QUEUE_FULL or more if needed)
for the file list processing part you could use pull or push model:
push model could be easily implemented by a simple app that crawls the files and dispatches them (e.g.: over SCP, FTP, whatever) to a set of available servers; servers can monitor their local directories for changes and pick up new files to process; it's also very easy to scale - just spin up more servers and update the push client (even in runtime); the only limit is your push client's performance
pull model is a little bit more tricky, cause you have to handle more complexity; having a set of servers implicates having a proper starting index per node and offset - it will make error handling more difficult, plus, it doesn't scale easily (imagine adding twice as more servers to speedup the processing and updating indices and offsets properly on each node.. seems like an error-prone solution)
I assume that the network traffic isn't a big concern - having 3M files to process will generate it somewhere, one way or the other..
collecting/storing the results is a different ballpark, but here the list of possible solutions is limitless

Since I miss a lot of your architecture details and your application specifics, you can take this answer as a guiding answer rather than a strict one.
You can take this approach, in the following order:
1- deploy an internal file server that stores all the files in one place and serve them
Example:
http://interanal-ip-address/storage/filenameA.pdf
http://interanal-ip-address/storage/filenameB.pdf
http://interanal-ip-address/storage/filenameC.pdf
and so on ...
2- Install/Deploy Redis
3- Create an upload client/service/process that takes the files you want to upload and pass them to the above storage location (/storage/), so your files will be available once they are uploaded, at the same time push the full file path URL to a predefined Redis List/Queue (build on linked lists data structure), like this: http://internal-ip-address/storage/filenameA.pdf
You can get more details here about LPUSH and RPOP under Redis Lists here: http://redis.io/topics/data-types-intro
Examples:
A file upload form, that stores the files directly to storage area
A file upload utility/command-line/background-process, that you can create it yourself or use some existing tool to upload files to the storage location, that gets the files from specific location, be it a web address or some other server that has your files
4- Now we come to your celery workers, each one of your workers should pull (RPOP) one of the files URLs from Redis queue, download the file from your internal file server (we built in first step), and do the required processing on the way you wanted it to be.
An important thing to note from Redis documentation:
Lists have a special feature that make them suitable to implement
queues, and in general as a building block for inter process
communication systems: blocking operations.
However it is possible that sometimes the list is empty and there is
nothing to process, so RPOP just returns NULL. In this case a consumer
is forced to wait some time and retry again with RPOP. This is called
polling, and is not a good idea in this context because it has several
drawbacks
So Redis implements commands called BRPOP and BLPOP which are versions
of RPOP and LPOP able to block if the list is empty: they'll return to
the caller only when a new element is added to the list, or when a
user-specified timeout is reached.
Let me know if that answers your question.
Things to keep in mind
You can add as many workers as you want since this solution is very
scalable, and your only bottleneck is Redis server, which you can make cluster of and persist your queue in case of power outage or server crash
You can replace redis with RabbitMQ, Beanstalk, Kafka, or any other queuing/messaging system, but Redis has ben nominated in this race due to simplicity and meriad of features introduced out of the box.

Related

Proper way for pods to read input files from the same persistent volume?

I'm new to Kubernetes and plan to use Google Kubernetes Engine. Hypothetically speaking, let's say I have a K8s cluster with 2 worker nodes. Each node would have its own pod housing the same application. This application will grab a file from some persistent volume and generate an output file that will be pushed back into a different persistent volume. Both pods in my cluster would be doing this continuously until there are no input files in the persistent volume left to be processed. Do the pods inherently know NOT to grab the same file that one pod is already using? If not, how would I be able account for this? I would like to avoid 2 pods using the same input file.
Do the pods inherently know NOT to grab the same file that one pod is already using?
Pods are just processes. Two separate processes accessing files from a shared directory are going to run into conflicts unless they have some sort of coordination mechanism.
Option 1
Have one process whose job it is to enumerate the available files. Your two workers connect to this process and receive filenames via some sort of queue/message bus/etc. When they finish processing a file, they request the next one, until all files are processed. Because only a single process is enumerating the files and passing out the work, there's no option for conflict.
Option 2
In general, renaming files is an atomic operation. Each worker creates a subdirectory within your PV. To claim a file, it renames the file into the appropriate subdirectory and then processes it. Because renames are atomic, even if both workers happen to pick the same file at the same time, only one will succeed.
Option 3
If your files have some sort of systematic naming convention, you can divide the list of files between your two workers (e.g., "everything that ends in an even number is processed by worker 1, and everything that ends with an odd number is processed by worker 2").
Etc. There are many ways to coordinate this sort of activity. The wikipedia entry on Synchronization may be of interest.

how to download multiple files from one server and upload to another concurrently in microservice architecture (concurrent piping)

I'm writing a (scala/jvm) microservice that is part of a CI solution.
Its job is to download built artifacts from some external build tool on the cloud and upload them to repositories from which they can be consumed, such as a docker registry or maven style repositories like Nexus.
There are many many such files for each build and many many such builds all the times, so the problem to solve is that of scale.
My microservice is integrated with an event queue (kafka), so it's easy to asynchronously assign tasks to workers.
I'm looking for the best way to manage my resources: nodes of cluster, threads, io, memory, storage - to handle the download and upload of all files, preferably without storing the entire content of each file locally on a file or in memory, but just to pipe from the source server to the target server.
I'm not sure what's the best approach to actually write the pipe code itself or how to best use the workers.
I was thinking of dispatching an event per file-to-pipe, and in each worker to pipe that one file by performing a get operation on the input server, a post operation on the target server and creating an in memory pipe between the streams with some buffer.
In this scenario there could be different transfer speeds for the input and target servers and i'm not sure if that's a problem or not. I think this should be solved by TCP/IP at the OS level and nothing for me to handle applicatively. I think if i use different thread pools for the download client and the upload client i can expect decent usage of non-blocking io to perform the pipe.
Alternatively i can do something else entirely and do some sort of producer/consumer where some workers download files while others upload them? this means more storage and shared storage at that, and a custom configuration for this microservice, which i'm not excited about.
Any other suggestions/insights are also welcome.
The eventual solution should (hopefully) be robust, scalable and as simple as possible.
Are you positive the source cloud service is not going to offer an "export file to Nexus" solution in the near/medium future? If so maybe your solution does not have to be fully efficient.
I would look at FS2 for this job https://github.com/functional-streams-for-scala/fs2/blob/series/1.0/README.md#example

Using Zookeeper or equivalent for .NET configuration management?

I have a proprietary CMS that keeps a lot (20k lines) of configuration files on disk. I have quite a few nodes, all with the same configurations except for one or two elements that designate the node name and the IP.
Since this is proprietary I do not have a lot of leverage for going in and completely overhauling the configuration loading to look at an endpoint, though I might be able to be creative.
My questions are simple, but I do not know a better place to answer them:
Is this a use case for distributed configuration management like Zookeeper? Ideally I'd like to spin up a box and have it look for a service endpoint to load config files rather than have the config files deployed through source. This way I can update the configuration in one place, and have it replicate to all nodes without doing a full deployment.
Can Zookeeper (or equivalent) mimic a file system? Could I mount an NFS point and have it expose configuration as if they were files on the filesystem, even if these are symbolic constructs? Does this make sense?
Your configuration use case seems more like a a job for chef, puppet or similar system. They will allow you to update the configuration in one place, keep them version controlled, and distribute them properly to all target nodes.
Zookeeper makes sense when your application/service needs to dynamically get fresh configuration data during live operation, and when multiple nodes in your system need the same consistent view of that data. If you don't have this requirements, Zookeeper might be too much of an overhead for just laying down mostly static config files on disk.
As for mimicking a filesystem, there is zkfuse which you could use to mount it. But again, it doesn't look like this is what you want. Zookeeper should not be used as an actual file system replacement or file distribution system. It is best for storing small bits of metadata that needs to be consistent across your distributed system.

Copy file using zookeeper

I have a distributed application and I use zookeeper to manage configuration data in all distributed servers.My service in each server needs some dlls to run . I am trying to build a centralized system from where I can copy my dlls to all the server.
Can I achieve that using zookeeper ?
I am aware that "ZooKeeper is generally not designed for large size storage" . My dll files are of size less the 3mb.
There is a 1mb soft limit on how large node data can get. According to the docs you can increase the max data size:
jute.maxbuffer:
(Java system property: jute.maxbuffer)
This option can only be set as a Java system property. There is no zookeeper prefix on it. It specifies the maximum size of the data that can be stored in a znode. The default is 0xfffff, or just under 1M. If this option is changed, the system property must be set on all servers and clients otherwise problems will arise. This is really a sanity check. ZooKeeper is designed to store data on the order of kilobytes in size.
I would not recommend using Zookeeper for this purpose, (you could much more easily host the binaries on a web server instead,) but it does seem possible in theory.
Zookeeper is designed to transfer messages inside the cluster.
Best thing you can do is create a Znode_A that will contain Znodes,
watch znode a for changes. Each Znode in Znode_A will represent a dll and will contain a dll path. Each node on the cluster watch for Znode_A data changes, so when a new dll (znode) will be created the nodes will know to copy the dll from a main repository.
In order to transfer files you can use SCP.
As data you can pass file path of your dlls. Using SCP you can pull files from base repository.

Load distribution to instances of a perl script running on each node of a cluster

I have a perl script (call it worker) installed on each node/machine (4 total) of a cluster (each running RHEL). The script itself is configured as a RedHat Cluster service (which means the RH cluster manager would ensure that one and exactly one instance of this script is running as long as at least one node in the cluster is up).
I have X amount of work to be done every day once a day, which this script does. So far the X was small enough and only one instance of this script was enough to do it. But now the load is going to increase and along with High Availability (viz already implemented using RHCS), I also need load distribution.
Question is how do I do that?
Of course I have a way to split the work in n parts of size X/n each. Options I had in mind:
Create a new load distributor, which splits the work in jobs of X/n. AND one of the following:
Create a named pipe on the network file system (which is mounted and visible on all nodes), post all jobs to the pipe. Make each worker script on each node read (atomic) from the pipe and do the work. OR
Make each worker script on each node listen on a TCP socket and the load distributor send jobs to each this socket in a round robin (or some other algo) fashion.
Theoretical problem with #1 is that we've observed some nasty latency problems with NFS. And I'm not even sure if NFS would support IPC via named pipes across machines.
Theoretical problem with #2 is that I have to implement some monitors to ensure that each worker is running and listening, which being a noob to Perl, I'm not sure if is easy enough.
I personally prefer load distributor creating a pool and workers pulling from it, rather than load distributor tracking each worker and pushing work to each. Any other options?
I'm open to new ideas as well. :)
Thanks!
-- edit --
using Perl 5.8.8, to be precise: This is perl, v5.8.8 built for x86_64-linux-thread-multi
If you want to keep it simple use a database to store the jobs and then have each worker lock the table and get the jobs they need then unlock and let the next worker do it's thing. This isn't the most scalable solution since you'll have lock contention, but with just 4 nodes it should be fine.
But if you start going down this road it might make sense to look at a dedicated job-queue system like Gearman.