I have a Neo4j service, but before the deployment starts up, I need to pre-fill it with data (about 2GB of data). Currently, I wrote a Kubernetes Job to transform the data from a CSV and format it for the database using the neo4j-admin tool. It saves the formatted data to a persistent volume. After waiting for the job to complete, I mount the volume in the Neo4j container and the container is effectively read-only on this data for the rest of its life.
Is there a better way to do this more automatically?
I don't want to have to wait for the job to complete to run another command to create the Neo4j deployment. I looked into initContainers, but that isn't suitable because I don't want to redo the data filling when a pod is re-created. I just want subsequent pods to read from the same persistent volume. Is there a way to wait for the job to complete first?
As Jobs can't natively spawn new objects once finished (and if exited gracefully, using PreStop to invoke further actions won't work), you might want to monitor the API objects instead.
Programatically accessing the API to determine when the Job is finished and then, create your Deployment object might be a feasible, automated way to do it.
Doing it this way, you don't have to worry for redoing the data processing with initContainers as you can essentially call the deployment and remount the already existing volume.
Also, using the official Go library allows you to either run within the cluster, in a pod or externally.
I assume that your neo4j application data won't be updated from your neo4j deployment based on you said that the deployment loads the volume as read-only.
If that is the case why do you want kubernetes to do the data loading? Use object storage like s3 or azure data lake and ensure that there is some data workflow pipeline that will update the object storage. There are many tools that provides data pipeline features such as oozie, airflow.
In your deployment, then you can refer to the object storage via Persistent Volume Claim.
Related
This is more of an architecture question. I have a data engineering background and have been using airflow to orchestrate ETL tasks using airflow for a while. I have limited knowledge of containerization and kuberentes. I have a task to come up with a good practice framework for productionalizting our Data science models using an orchestration engine namely airflow.
Our Data science team creates many NLP models to process different text documents from various resources. Previously the model was created by an external team which requires us to create an anacoda environment install libraries on it and run the model. The running of model was very manual where a data engineer would spin us a EC2 instance, and setup the model download the files to the ec2 instance and process the files using the model and take the output for further processing.
We are trying to move away from this to an automated pipeline where we have an airflow dag that basically orchestrates this all. The point where I am struggling is the running the model part.
This is the logical step I am thinking of doing. Please let me know if you think this would be feasible. All of these will be down in airflow. Step 2,3,4 are the ones I am totally unsure how to achieve.
Download files from ftp to s3
**Dynamically spin up a kubernetes cluster and create parallel pod based on number of files to be process.
Split files between those pods so each pod can only process its subset of files
Collate output of model from each pod into s3 location**
Do post processing on them
I am unsure how I can spin up a kuberentes cluster in airflow on runtime and especially how I split files between pods so each pod only processes on its own chunk of files and pushes output to shared location.
The running of the model has two methods. Daily and Complete. Daily would be a delta of files that have been added since last run whereas complete is a historical reprocessing of the whole document catalogue that we run every 6 months. As you can imagine the back catalogue would require alot of parallel processing and pods in parallel to process the number of documents.
I know this is a very generic post but my lack of kuberentes is the issue and any help would be appreciated in pointing me in the right direction.
Normally people schedule the container or PODs as per need on top of k8s cluster, however, I am not sure how frequent you need to crate the k8s cluster.
K8s cluster setup :
You can create the K8s cluster in different ways that are more dependent on the cloud provider and options they provide like SDK, CLI, etc.
Here is one example you can use this option with airflow to create the AWS EKS clusters : https://leftasexercise.com/2019/04/01/python-up-an-eks-cluster-part-i/
Most cloud providers support the CLI option so maybe using just CLI also you can create the K8s cluster.
If you want to use GCP GKE you can also check for the operators to create cluster : https://airflow.apache.org/docs/apache-airflow-providers-google/stable/operators/cloud/kubernetes_engine.html
Split files between those pods so each pod can only process its subset
of files
This is more depends on the file structure, you can mount the S3 direct to all pods, or you can keep the file into NFS and mount it to POD but in all cases you have to manage the directory structure accordingly, you can mount it to POD.
Collate output of model from each pod into s3 location**
You can use boto3 to upload files to S3, Can also mount S3 bucket direct to POD.
it's more now on your structure how big files are generated, and stored.
I want to learn what is the proper way to reach the processed files when restarting the spring batch application on Kubernetes. Especially if the target type is file, it is being deleted together with the pod after the job failed.
We are considering to use persistent volume or backing up the created file somewhere such as DB or sftp server by implementing a listener.
Is there anyone have the experience of persistent volume usage(nfs or other solutions) for file operations. We are concerned about the performance and unexpected problems. Do you have any suggestions?
Thank you.
You should not rely on the ephemeral file system of a Pod for files that should persist and survive a Job (Pod) failure.
You need to use a persistent volume for that, so that Spring Batch can find the (incomplete) output file in a restart scenario and resume writing where it left off.
If you want data persistence, you may begin by using hostPath volumes first. This will restrict which nodes your pods may be spawned on. But is the simplest and gives you the best performance.
https://kubernetes.io/docs/concepts/storage/volumes/#hostpath
If you want dynamic allocation, you will need to configure storage solutions such as GlusterFS, NFS, CEPH etc.
I was trying to find a solution how to run a job handled by 2 pods in a cluster.
The job is ran by the cronjob scheduler, to run every (say) 15 mins. This job is to fetch records from the db table and process it. There is only READ permission provided to access the table records. I am trying to see, is there any way to configure in k8s, that only one pod run the job.
This way I want to prevent the duplicate processing.
The alternate is have a temporary lock file in the persistent storage and the application in the pod puts a lock to it and releases after processing.
If there is any out of box solution available with in k8s, please let me know.
This is implemented using a traditional resource lock mechanism. A lock file is created during the process and the pods do no run if there is any lock file exists.
This way only one pods will run the job any point of time.
I have an application (JHipster Gateway, UAA, Registry, 5 microservices) and each application source builds a Docker image and pushes to GitLab registry. Currently I'm running everything on Rancher using a Docker-Compose file. My volumes for Mongo databases are currently in each container.
I need advice about volume mounts. Here are my options as I see them.
Leave data in containers and monitor and backup
Use external mounts and monitor volumes on host.
If I leave Mongo data in the containers, do I just set up to just cluster and when the internal volumes fill, the database just scales? I am looking for some explanation to help my choice with Mongo database mounts, internal or external (on host)?
Thanks in advance,
David L. Whitehurst
Never store any data you care about directly in containers. There are good arguments in favor of both named volumes (native to Docker, some support in a multi-host Swarm environment, fewer host-specific dependencies) and host bind mounts (much easier to back up and maintain, possible to examine directly if needed) but use some sort of mounted storage.
The most important note here is that it's fairly routine to delete and recreate containers. If the software you're running or its underlying library stack has a security issue, you generally need to get (or build) an updated image, delete your existing container, and rebuild it against the new image. If data is stored only inside a container, then during this very routine delete-and-recreate operation, there's significant risk of losing data.
In principle, if you're really careful, and you have a replicated data store, you can roll this over without external volumes and not lose data. It's tricky, and takes a lot of patience; you'll be forced to take down one replica, wait for its data to be rebalanced across the other replicas, start up a new replica, wait for it to accept some of the data, and so on. If you can take a point release by stopping a container, deleting it, starting a new one with the same data store, and have it come up instantly with populated data, that's much easier to manage.
(The other corollary here is that you don't "back up containers", since they don't have any data you care about. You do back up the data stored on the host or in Docker named volumes, and you can always recreate the container from its image plus the external data.)
I'm learning about Containers and Kubernetes and was evaluating if we can move our monolith, stateful appplication to kubernetes?
I was also looking at https://kubernetes.io/blog/2018/03/principles-of-container-app-design/ and "Self-Containment" looks close. We can consider using "storage".
Properties of my application:
1. Runs on a JVM
2. Does not have a database. Saves all its data/content to TAR files on the file-system
3. Should be able to backup and retain state if the container goes down.
In our current scenarios, we deploy the app to a VM and our IT teams generally take snapshots of these VM's as backups and restore them if the app fails or they have to restore to a point where the app was working good. I wanted to avoid this.
Please advice.
You call it as web application, but based on what it does it just a process which writes to file system.
If you move to k8s, write to NFS or persistent storage from pod. If you can only run one instance, then you can't use k8s horizontal scaling.