Copy files using ZooKeeper

I have a distributed application and I use ZooKeeper to manage configuration data on all the distributed servers. My service on each server needs some DLLs to run. I am trying to build a centralized system from which I can copy my DLLs to all the servers.
Can I achieve that using ZooKeeper?
I am aware that "ZooKeeper is generally not designed for large size storage". My DLL files are less than 3 MB each.

There is a 1 MB soft limit on how large znode data can get. According to the docs you can increase the maximum data size:
jute.maxbuffer:
(Java system property: jute.maxbuffer)
This option can only be set as a Java system property. There is no zookeeper prefix on it. It specifies the maximum size of the data that can be stored in a znode. The default is 0xfffff, or just under 1M. If this option is changed, the system property must be set on all servers and clients otherwise problems will arise. This is really a sanity check. ZooKeeper is designed to store data on the order of kilobytes in size.
I would not recommend using ZooKeeper for this purpose (you could much more easily host the binaries on a web server instead), but it does seem possible in theory.

ZooKeeper is designed to transfer small messages inside the cluster.
The best thing you can do is create a znode, say Znode_A, that will contain child znodes, one per DLL, each holding that DLL's path. Every node in the cluster watches Znode_A for changes, so when a new DLL (child znode) is created, the nodes know to copy that DLL from a main repository.
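For illustration, here is a minimal sketch of that watch pattern using the Python kazoo client; the /dlls znode path and the ensemble addresses are assumptions, and the actual copy step is left as a comment:

from kazoo.client import KazooClient

zk = KazooClient(hosts="zk1:2181,zk2:2181,zk3:2181")  # hypothetical ensemble
zk.start()

@zk.ChildrenWatch("/dlls")
def on_dll_list_change(children):
    # Each child znode is named after a DLL and stores its repository path as data.
    for name in children:
        data, _stat = zk.get("/dlls/" + name)
        print("DLL available in the central repository at:", data.decode("utf-8"))
        # Each node would now copy that file from the main repository (e.g. over SCP or HTTP).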

In order to transfer the files themselves you can use SCP.
As znode data you can store the file paths of your DLLs; using SCP, each node can then pull the files from the base repository.
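For example, once a node learns a new path from the znode data, it could shell out to scp; the host name, target directory, and the helper function below are illustrative:

import subprocess

def pull_dll(remote_path, local_dir="/opt/myservice/bin", repo_host="repo-host"):
    # Assumes key-based SSH access from each node to the repository host.
    subprocess.run(["scp", repo_host + ":" + remote_path, local_dir], check=True)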

Related

Can I increase the disc size while Kafka is running?

I have an Apache Kafka 2.7 cluster in a production environment. The discs we use as data storage are 90% full. How can I increase the disc capacity? The brokers run in a virtual environment; could you also share the answer for the case where physical discs are attached?
Assuming you've configured log.dirs to use specific volume mounts, then no.
The directories in that value must contain the topic-partition folders directly; they are not a parent folder of all volumes, which would make it easier to add directories dynamically.
Drives may be hot-swappable on the physical server (or VM), yes, but you'll still need to edit the Kafka property file to include the new mount path, and restart the broker.
Also, if you add new disks, Kafka will not rebalance existing data, meaning you'll still have discs that are 90% full. In either environment, you'd need to create/acquire a new, larger disc, shut down the machine, replicate the disc, then attach the new one and restart.
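For reference, adding a freshly mounted disc comes down to extending that property and restarting the broker; the paths below are only examples:

# server.properties (example paths); the second path is the newly mounted disc
log.dirs=/var/kafka/data1,/var/kafka/data2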

How to set a different state directory for multiple instances of the same Kafka Streams application on a single machine

From version 2.6.0, a Kafka Streams application with state locks the state.dir directory, and as the documentation says:
The state directory. Kafka Streams persists local states under the state directory. Each application has a subdirectory on its hosting machine that is located under the state directory. The name of the subdirectory is the application ID. The state stores associated with the application are created under this subdirectory. When running multiple instances of the same application on a single machine, this path must be unique for each such instance.
In the scenario of running multiple instances of the same application on a single machine, the path cannot be a random path like /state/dir/{uuid}, because that would bypass the fix for the KAFKA-10716 issue.
My solution is to have a directory like /state/dir with ordinal subdirectories, e.g. 0, 1, 2, ..., and on startup each instance checks these subdirectories starting from 0 and uses the first one that is not locked as its state.dir. As a result, the process id is read from the meta file and the previous tasks are reassigned to the new process correctly.
Is this a correct solution?
What is the best practice to set a different path for each instance on a single machine?
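For what it's worth, a minimal Python sketch of that scan-and-lock idea could look like this; the base path, lock-file name, and instance cap are illustrative, and the returned path would then be passed to the application as state.dir:

import fcntl
import os

BASE = "/state/dir"  # example base path

def claim_state_dir(max_instances=16):
    for i in range(max_instances):
        path = os.path.join(BASE, str(i))
        os.makedirs(path, exist_ok=True)
        fd = os.open(os.path.join(path, ".claim.lock"), os.O_CREAT | os.O_RDWR)
        try:
            fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
            return path, fd  # keep fd open for the lifetime of this instance
        except OSError:
            os.close(fd)     # already claimed by another instance
    raise RuntimeError("no free state directory found")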
I had the same issue and I also came up with a solution similar to yours:
I've created a service registry. Each Kafka Streams instance requests an instance-id when it starts up. The service registry then hands back an integer, starting from 0. If a second instance comes up, it gets id 1. The instance-id is used to set the group.instance.id and state.dir configs.
To make it more reliable, each instance periodically sends a heartbeat request to the service registry. This is needed to make an instance-id available again in case an instance goes down. An instance also unregisters itself in a shutdown hook to free its id. So if instance-0 restarts, it will get id 0 again, because 0 is the lowest available number.
With this solution you don't need to read directories and lock files.
PS: why don't you just increase num.stream.threads? As you describe yourself, you are running on the same machine (scaling vertically). With the solution I provided, you can scale horizontally and point state.dir to the same directory.

How to configure AEM to use the local file system instead of MongoDB for files larger than a specific size?

The project currently uses AEM 6.0 with Mongo 2.6.10. Because of a known issue with the maxPasses assertion, Mongo fails to allocate the required space.
Adobe's official documentation mentions that CRX storage can be configured to use another file system under certain conditions. In this case, the requirement is to store DAM assets larger than 16 MB on local file storage instead of in MongoDB; see the repository setup with repository.xml. However, the details of how to do this are not specified.
The question is: how do I configure repository.xml to use the local file system instead of MongoDB for files larger than a specific size?
I think you have not configured a separate DataStore; that's the reason your larger files are being persisted in MongoDB. AEM allows you to configure the NodeStore and DataStore separately.
Once you configure a separate DataStore, all your larger files (by default > 16 MB) will be stored in the DataStore, while all your regular nodes & properties will be stored in the NodeStore.
There are multiple options for NodeStores & DataStores. In your case, I would suggest you continue using MongoDB as your NodeStore while configuring a separate FileSystem DataStore to store the binaries.
Please check the documentation below on how to configure a separate DataStore:
https://docs.adobe.com/docs/en/aem/6-1/deploy/platform/data-store-config.html
For 6.0 version,
https://docs.adobe.com/docs/en/aem/6-0/deploy/upgrade/data-store-config.html
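As a rough, unverified illustration of what such a configuration looks like (file name, location, and values must be checked against the documentation above), a file data store is usually set up through an OSGi config such as crx-quickstart/install/org.apache.jackrabbit.oak.plugins.blob.datastore.FileDataStore.config containing at least:

path="./crx-quickstart/repository/datastore"
minRecordLength="4096"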

How to handle file paths in a distributed environment

I'm working on setting up a distributed celery environment to do OCR on PDF files. I have about 3M PDFs and OCR is CPU-bound so the idea is to create a cluster of servers to process the OCR.
As I'm writing my task, I've got something like this:
@app.task
def do_ocr(pk, file_path):
    content = run_tesseract_command(file_path)
    item = Document.objects.get(pk=pk)
    item.content = content
    item.save()
The question I have is: what is the best way to make file_path work in a distributed environment? How do people usually handle this? Right now all my files simply live in a directory on one of our servers.
If you are in a Linux environment, the easiest way is to mount a remote filesystem, using sshfs, under the /mnt folder on each node of the cluster. Then you can pass the node name to the do_ocr function and work as if all the data were local to the current node.
For example, say your cluster has N nodes named node1, ..., nodeN.
Let's configure node1: for each other node, mount its remote filesystem. Here's a sample /etc/fstab for node1:
sshfs#user@node2:/var/your/app/pdfs /mnt/node2 fuse port=<port>,defaults,user,noauto,uid=1000,gid=1000 0 0
....
sshfs#user@nodeN:/var/your/app/pdfs /mnt/nodeN fuse port=<port>,defaults,user,noauto,uid=1000,gid=1000 0 0
On the current node (node1), create a symlink named after the current server pointing to the PDFs' path:
ln -s /var/your/app/pdfs /mnt/node1
Your /mnt folder should then contain the remote filesystems and the symlink:
user#node1:/mnt$ ls -lsa
0 lrwxrwxrwx 1 user user 16 apr 12 2016 node1 -> /var/your/app/pdfs
0 lrwxrwxrwx 1 user user 16 apr 12 2016 node2
...
0 lrwxrwxrwx 1 user user 16 apr 12 2016 nodeN
Then your function should look like this:
import os

MOUNT_POINT = '/mnt'

@app.task
def do_ocr(pk, node_name, file_path):
    content = run_tesseract_command(os.path.join(MOUNT_POINT, node_name, file_path))
    item = Document.objects.get(pk=pk)
    item.content = content
    item.save()
It works as if all the files were on the current machine, but the remote access happens transparently for you.
Well, there are multiple ways to handle it, but let's stick to one of the simplest:
since you'd like to process a large number of files using multiple servers, my first suggestion would be to use the same OS on each server, so you won't have to worry about cross-platform compatibility
using the word 'cluster' suggests that all of those servers should know their mutual state; that adds complexity, so try to switch to a farm of stateless workers (by 'stateless' I mean "not knowing about the others", as they should be aware of at least their own state, e.g.: IDLE, IN_PROGRESS, QUEUE_FULL or more if needed)
for the file-list processing part you could use a pull or a push model:
the push model can easily be implemented by a simple app that crawls the files and dispatches them (e.g.: over SCP, FTP, whatever) to a set of available servers; the servers can monitor their local directories for changes and pick up new files to process; it's also very easy to scale: just spin up more servers and update the push client (even at runtime); the only limit is your push client's performance (a minimal dispatcher sketch follows at the end of this answer)
the pull model is a little trickier, because you have to handle more complexity; having a set of servers implies keeping a proper starting index and offset per node, which makes error handling more difficult, and it doesn't scale easily (imagine adding twice as many servers to speed up processing and having to update the indices and offsets properly on each node.. it seems like an error-prone solution)
I assume that network traffic isn't a big concern; having 3M files to process will generate it somewhere, one way or another
collecting/storing the results is a whole different ballpark, but the list of possible solutions there is limitless
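A minimal sketch of such a push dispatcher, assuming a fixed set of SSH-reachable worker hosts and an inbox directory they watch (all names below are examples):

import itertools
import pathlib
import subprocess

WORKERS = ["worker1", "worker2", "worker3"]  # assumed SSH-reachable hosts
INBOX = "/var/ocr/inbox"                     # directory each worker watches

def dispatch(source_dir):
    targets = itertools.cycle(WORKERS)
    for pdf in pathlib.Path(source_dir).glob("*.pdf"):
        host = next(targets)
        subprocess.run(["scp", str(pdf), host + ":" + INBOX + "/"], check=True)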
Since I am missing a lot of your architecture details and application specifics, you can take this as a guiding answer rather than a strict one.
You can take this approach, in the following order:
1- Deploy an internal file server that stores all the files in one place and serves them
Example:
http://internal-ip-address/storage/filenameA.pdf
http://internal-ip-address/storage/filenameB.pdf
http://internal-ip-address/storage/filenameC.pdf
and so on ...
2- Install/Deploy Redis
3- Create an upload client/service/process that takes the files you want to upload and puts them in the storage location above (/storage/), so your files are available as soon as they're uploaded; at the same time, push the full file URL to a predefined Redis list/queue (built on a linked-list data structure), like this: http://internal-ip-address/storage/filenameA.pdf
You can get more details about LPUSH and RPOP under Redis Lists here: http://redis.io/topics/data-types-intro
Examples:
A file upload form that stores the files directly in the storage area
A file upload utility/command-line tool/background process, which you can build yourself or base on an existing tool, that picks the files up from a specific location (be it a web address or some other server that has your files) and uploads them to the storage location
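A minimal sketch of that enqueue step with the redis-py client (queue name, host, and URL prefix are assumptions):

import redis

r = redis.Redis(host="redis-host", port=6379)

def enqueue_pdf(filename):
    # Build the public URL served by the internal file server and push it to the queue.
    url = "http://internal-ip-address/storage/" + filename
    r.lpush("pdf_queue", url)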
4- Now we come to your Celery workers: each worker should pull (RPOP) one of the file URLs from the Redis queue, download the file from the internal file server we built in the first step, and do the required processing the way you want it.
An important thing to note from Redis documentation:
Lists have a special feature that make them suitable to implement queues, and in general as a building block for inter process communication systems: blocking operations.
However it is possible that sometimes the list is empty and there is nothing to process, so RPOP just returns NULL. In this case a consumer is forced to wait some time and retry again with RPOP. This is called polling, and is not a good idea in this context because it has several drawbacks.
So Redis implements commands called BRPOP and BLPOP which are versions of RPOP and LPOP able to block if the list is empty: they'll return to the caller only when a new element is added to the list, or when a user-specified timeout is reached.
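And a minimal sketch of the matching worker loop using the blocking BRPOP, as the quote recommends (queue name, host, timeout, and download directory are assumptions):

import redis
import requests

r = redis.Redis(host="redis-host", port=6379)

def worker_loop():
    while True:
        item = r.brpop("pdf_queue", timeout=30)
        if item is None:
            continue  # timed out, nothing queued yet
        _key, url = item
        url = url.decode("utf-8")
        local_path = "/tmp/" + url.rsplit("/", 1)[-1]
        with open(local_path, "wb") as f:
            f.write(requests.get(url).content)
        # hand local_path to the OCR task here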
Let me know if that answers your question.
Things to keep in mind
You can add as many workers as you want, since this solution is very scalable; your only bottleneck is the Redis server, which you can cluster, and whose queue you can persist in case of a power outage or server crash
You can replace Redis with RabbitMQ, Beanstalk, Kafka, or any other queuing/messaging system, but Redis wins this race due to its simplicity and the myriad of features it offers out of the box.

Using Zookeeper or equivalent for .NET configuration management?

I have a proprietary CMS that keeps a lot of configuration (20k lines of config files) on disk. I have quite a few nodes, all with the same configuration except for one or two elements that designate the node name and the IP.
Since this is proprietary, I don't have much leverage to go in and completely overhaul the configuration loading to look at an endpoint, though I might be able to get creative.
My questions are simple, but I don't know a better place to ask them:
Is this a use case for distributed configuration management like Zookeeper? Ideally I'd like to spin up a box and have it look for a service endpoint to load config files rather than have the config files deployed through source. This way I can update the configuration in one place, and have it replicate to all nodes without doing a full deployment.
Can Zookeeper (or an equivalent) mimic a file system? Could I mount an NFS point and have it expose the configuration as if it were files on the filesystem, even if these are symbolic constructs? Does this make sense?
Your configuration use case seems more like a job for Chef, Puppet, or a similar system. They will allow you to update the configuration in one place, keep it version-controlled, and distribute it properly to all target nodes.
Zookeeper makes sense when your application/service needs to dynamically fetch fresh configuration data during live operation, and when multiple nodes in your system need the same consistent view of that data. If you don't have these requirements, Zookeeper is probably too much overhead for just laying down mostly static config files on disk.
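As a minimal sketch of that dynamic case (shown with the Python kazoo client for brevity; the znode path, ensemble addresses, and the reload hook are assumptions, and the same watch pattern is available from .NET clients):

from kazoo.client import KazooClient

def apply_new_config(text):
    print("new configuration:", text)  # stand-in for the real reload logic

zk = KazooClient(hosts="zk1:2181,zk2:2181,zk3:2181")
zk.start()

@zk.DataWatch("/config/my-service")
def on_config_change(data, stat):
    # Fires once at registration and again whenever the znode data changes.
    if data is not None:
        apply_new_config(data.decode("utf-8"))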
As for mimicking a filesystem, there is zkfuse, which you could use to mount it. But again, it doesn't look like this is what you want. Zookeeper should not be used as an actual file system replacement or file distribution system; it is best for storing small bits of metadata that need to be consistent across your distributed system.