I'm currently experimenting with Dataproc and I followed the Google tutorial to spin up a Hadoop cluster with Jupyter and Spark. Everything works smoothly. I use the following command:
gcloud dataproc clusters create test-cluster \
--project proj-name \
--bucket notebooks-storage \
--initialization-actions \
gs://dataproc-initialization-actions/jupyter/jupyter.sh
This command spins up a cluster with one master and two workers (VM type: n1-standard-4).
I tried adding the following flag:
--num-preemptible-workers 2
But it only adds two preemptible workers alongside the two previous standard VMs. I would like all of my workers to be preemptible VMs, because all of my data is stored on Google Cloud Storage and I don't care about the size of the Hadoop storage.
Is this a sound thing to do? Is there any way of doing that?
Thanks!
In general, it is not a good idea to have a cluster that is exclusively or mostly pVMs. pVMs carry no guarantee that they will be available at the time of cluster creation, or even still be in your cluster N hours from now. Preemption is very bad for jobs (especially ones that run for many hours). Also, even though your data is in GCS, any shuffle operations will result in data being written to local disks. Think of pVMs only as supplemental compute power.
For these, and other, reasons we recommend at most a 1:1 ratio of preemptible to standard workers.
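To make the 1:1 guidance concrete, here is a minimal sketch of a cluster spec that caps the preemptible (secondary) worker count at the standard worker count. The field names follow the Dataproc API's cluster config shape, and the project and cluster names are placeholders; such a dict could be passed to the `google-cloud-dataproc` Python client, or translated into the equivalent `gcloud` flags.

```python
def build_cluster_spec(project_id, cluster_name, num_standard=2, num_preemptible=2):
    """Build a Dataproc cluster dict; preemptible count is capped at the standard count."""
    # Enforce the recommended at-most-1:1 preemptible-to-standard ratio.
    num_preemptible = min(num_preemptible, num_standard)
    return {
        "project_id": project_id,
        "cluster_name": cluster_name,
        "config": {
            "master_config": {"num_instances": 1,
                              "machine_type_uri": "n1-standard-4"},
            "worker_config": {"num_instances": num_standard,
                              "machine_type_uri": "n1-standard-4"},
            # Secondary workers form the preemptible pool.
            "secondary_worker_config": {"num_instances": num_preemptible,
                                        "is_preemptible": True},
        },
    }

# Asking for 4 preemptible workers with only 2 standard workers: the cap
# brings the request back down to 2.
spec = build_cluster_spec("proj-name", "test-cluster",
                          num_standard=2, num_preemptible=4)
```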
An alternative, since you're working with a notebook, is to use a single node cluster: https://cloud.google.com/dataproc/docs/concepts/configuring-clusters/single-node-clusters
This is more of an architecture question. I have a data engineering background and have been using Airflow to orchestrate ETL tasks for a while. I have limited knowledge of containerization and Kubernetes. I have a task to come up with a good-practice framework for productionalizing our data science models using an orchestration engine, namely Airflow.
Our data science team creates many NLP models to process different text documents from various sources. Previously the model was created by an external team, which required us to create an Anaconda environment, install libraries on it, and run the model. Running the model was very manual: a data engineer would spin up an EC2 instance, set up the model, download the files to the EC2 instance, process the files using the model, and take the output for further processing.
We are trying to move away from this to an automated pipeline where we have an Airflow DAG that orchestrates all of this. The point where I am struggling is the running-the-model part.
These are the logical steps I am thinking of. Please let me know if you think this would be feasible. All of these will be done in Airflow. Steps 2, 3, and 4 are the ones I am totally unsure how to achieve.
1. Download files from FTP to S3
2. **Dynamically spin up a Kubernetes cluster and create parallel pods based on the number of files to be processed**
3. **Split the files between those pods so each pod only processes its own subset of files**
4. **Collate the output of the model from each pod into an S3 location**
5. Do post-processing on them
I am unsure how I can spin up a Kubernetes cluster from Airflow at runtime, and especially how I can split the files between pods so that each pod only processes its own chunk of files and pushes its output to a shared location.
Running the model has two modes: daily and complete. Daily processes a delta of files that have been added since the last run, whereas complete is a historical reprocessing of the whole document catalogue that we run every 6 months. As you can imagine, the back catalogue would require a lot of pods running in parallel to process the number of documents.
I know this is a very generic post, but my lack of Kubernetes knowledge is the issue, and any help pointing me in the right direction would be appreciated.
Normally people schedule containers or pods as needed on top of an existing K8s cluster; however, I am not sure how frequently you need to create the K8s cluster itself.

K8s cluster setup:

You can create the K8s cluster in different ways, depending on the cloud provider and the options they provide (SDK, CLI, etc.).

Here is one example of using this option with Airflow to create AWS EKS clusters: https://leftasexercise.com/2019/04/01/python-up-an-eks-cluster-part-i/

Most cloud providers support a CLI option, so you may be able to create the K8s cluster using just the CLI.

If you want to use GCP GKE, you can also check the operators for creating a cluster: https://airflow.apache.org/docs/apache-airflow-providers-google/stable/operators/cloud/kubernetes_engine.html
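As a hedged sketch of the GKE route: the dict below is a minimal cluster definition of the kind you would hand to the `GKECreateClusterOperator` from `apache-airflow-providers-google` as its `body`. The cluster name, machine type, project, and zone are all placeholder assumptions; check the operator docs linked above for the exact fields your provider version expects.

```python
def build_gke_cluster_body(name, num_nodes):
    """Minimal GKE cluster definition to pass as the operator's `body`."""
    return {
        "name": name,
        "initial_node_count": num_nodes,
        "node_config": {"machine_type": "e2-standard-4"},  # placeholder size
    }

body = build_gke_cluster_body("nlp-batch-cluster", 3)

# Inside a DAG this would be wired up roughly as follows (not executed here;
# project_id and location are placeholders):
#
# from airflow.providers.google.cloud.operators.kubernetes_engine import (
#     GKECreateClusterOperator,
# )
# create_cluster = GKECreateClusterOperator(
#     task_id="create_cluster",
#     project_id="my-project",
#     location="us-central1-a",
#     body=body,
# )
```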
Split files between those pods so each pod can only process its subset of files
This depends more on the file structure. You can mount S3 directly to all the pods, or you can keep the files on NFS and mount that to the pods, but in either case you have to manage the directory structure accordingly.
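For the split itself, one coordination-free approach is to compute each pod's subset deterministically from the pod's index (for example the ordinal of a StatefulSet pod, or an index passed in as an environment variable), so the pods never need to talk to each other. A minimal sketch, with the file names and pod count as assumptions:

```python
def files_for_pod(all_files, pod_index, num_pods):
    """Round-robin assignment: pod i takes files i, i+num_pods, i+2*num_pods, ...

    Sorting first makes the assignment deterministic even if each pod
    lists the bucket independently and gets the files in a different order.
    """
    if not 0 <= pod_index < num_pods:
        raise ValueError("pod_index must be in [0, num_pods)")
    return sorted(all_files)[pod_index::num_pods]

files = [f"doc_{i}.txt" for i in range(10)]
# With 3 pods, every file lands on exactly one pod.
subsets = [files_for_pod(files, i, 3) for i in range(3)]
```

Each pod then downloads (or reads from the mount) only its own subset, which also makes the 6-monthly complete run a matter of launching more pods rather than changing the logic.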
Collate output of model from each pod into s3 location
You can use boto3 to upload files to S3, or you can mount the S3 bucket directly to the pods. It depends more on your structure: how big the generated files are, and how they are stored.
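For the boto3 route, a hedged sketch of the upload step: the bucket name and key layout below are assumptions, and `upload_outputs` takes the S3 client as a parameter so the logic can be exercised without AWS credentials. Giving each pod its own prefix means pods can never overwrite each other's results.

```python
def output_key(run_id, pod_index, filename):
    """Namespace each pod's output under its own prefix (layout is an assumption)."""
    return f"model-output/{run_id}/pod-{pod_index}/{filename}"

def upload_outputs(s3_client, bucket, run_id, pod_index, local_paths):
    """Upload this pod's result files to its own S3 prefix; returns the keys written."""
    keys = []
    for path in local_paths:
        key = output_key(run_id, pod_index, path.rsplit("/", 1)[-1])
        # boto3 S3 client call; in a pod this would be boto3.client("s3").
        s3_client.upload_file(path, bucket, key)
        keys.append(key)
    return keys
```

Collating then just means having the post-processing task list everything under `model-output/<run_id>/` once all pods have finished.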
Our organisation has recently moved its infrastructure from AWS to Google Cloud, and I figured Dataproc clusters are a good solution for running our existing Spark jobs. But when it came to comparing the pricing, I also realised that I could just fire up a Google Kubernetes Engine cluster and install Spark on it to run Spark applications.
Now my question is: how do "running Spark on GKE" and using Dataproc compare? Which one would be the better option in terms of autoscaling, pricing, and infrastructure? I've read Google's documentation on GKE and Dataproc, but there isn't enough for me to be sure about the advantages and disadvantages of using GKE or Dataproc over the other.
Any expert opinion will be extremely helpful.
Thanks in advance.
Spark on Dataproc is proven and in use at many organizations. Though it's not fully managed, you can automate cluster creation and teardown, submitting jobs, etc. through the GCP API, but it's still another stack you have to manage.
Spark on GKE is something newer: Spark started adding features from 2.4 onwards to support Kubernetes, and Google even updated Kubernetes support for the preview a couple of days back (link).
I would just go with Dataproc if I had to run jobs in a prod environment as we speak; otherwise you could experiment yourself with Docker and see how it fares, but I think it needs a little more time to be stable. From a purely cost perspective it would be cheaper with Docker, as you can share resources with your other services.
Adding my two cents to the above answer.
I would favor Dataproc, because it's managed and supports Spark out of the box. No hassles. More importantly, it is cost-optimized: you may not need clusters all the time, and you can have ephemeral clusters with Dataproc.

With GKE, I would need to explicitly discard the cluster and recreate it when necessary, and additional care needs to be taken.

I could not come across any direct GCP service offering for data lineage. In that case, I would probably use Apache Atlas with the Spark-Atlas-Connector on a Spark installation managed by myself, and then running Spark on GKE, with all the control in my hands, would make a compelling choice.
I've written a Spark program which needs to be executed on an EMR cluster. But there are some dependent files and modules used by the Python program. So is there any way to set up the dependent components on a running cluster?
Can we mount an S3 bucket on the cluster nodes and put all the dependent components on S3? Is this a good idea, and how can we mount S3 buckets on EMR using Python?
(During cluster creation): You can use Amazon EMR bootstrap custom actions, which can execute a bash script at the time of cluster creation. You can install all the dependent components using this script. The bootstrap action will be performed on all nodes of the cluster.
(On a running cluster): You can use the Amazon EMR step option to create an s3-dist-cp command-runner step to copy files from S3.
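As a concrete sketch of the running-cluster option: the step dict below follows the shape boto3's EMR `add_job_flow_steps` accepts, with the bucket, destination path, and cluster ID as placeholder assumptions.

```python
def s3_dist_cp_step(src, dest, name="copy-deps-from-s3"):
    """Build a command-runner step that copies files from S3 onto the cluster."""
    return {
        "Name": name,
        "ActionOnFailure": "CONTINUE",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",  # EMR's built-in command runner
            "Args": ["s3-dist-cp", f"--src={src}", f"--dest={dest}"],
        },
    }

step = s3_dist_cp_step("s3://my-bucket/deps/", "hdfs:///deps/")

# To submit it against a live cluster (not executed here; the JobFlowId
# is a placeholder):
#
# import boto3
# emr = boto3.client("emr")
# emr.add_job_flow_steps(JobFlowId="j-XXXXXXXX", Steps=[step])
```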
Okay, I have an EMR cluster which writes to HDFS, and I am able to view the directory and see the files via:

hadoop fs -ls /user/hadoop/jobs

I am not seeing the /user/hive or jobs directory in Hadoop, but it's supposed to be there. I need to get into the spark shell and run sparql, so I created an identical cluster with the same VPC, security groups, and subnet ID.

[screenshots of what I am supposed to see vs. what I actually see were attached here]

Why this is happening I am not sure. Could this be something to do with a stale rule? Any suggestions?
I am running into the same issue as in this thread with my Scala Spark Streaming application: Why does Spark job fail with "too many open files"?
But given that I am using Azure HDInsight to deploy my YARN cluster, I don't think I can log into every machine and update the ulimit on all of them.
Is there any other way to solve this problem? I cannot reduce the number of reducers by too much either, or my job will become much slower.
You can SSH into all nodes from the head node (the Ambari UI shows the FQDNs of all nodes).
ssh sshuser@nameofthecluster.azurehdinsight.net
You can then write a custom script action that alters the settings on the necessary nodes if you want to automate this.