I am ingesting 200+ files into BigQuery on Dataproc Serverless. The input files are not huge at all; each is only a few MB. Still, many jobs are failing with the error "Insufficient 'DISKS_TOTAL_GB' quota".
When I checked, I had 150 TB+ of disk quota before the jobs started, so it was the Dataproc jobs that ate all the space. Is there any way to configure the persistent disk that gets allocated to each Dataproc job?
You can configure the disk size for Dataproc Serverless Spark workloads via the spark.dataproc.driver.disk.size and spark.dataproc.executor.disk.size properties, as described in the Dataproc Serverless documentation.
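For example, a batch submission that sets the per-driver and per-executor disk explicitly might look like this (a minimal sketch: the bucket, script name and region are placeholders, and 250g is only an illustrative value; check the documentation for the allowed range):

gcloud dataproc batches submit pyspark gs://my-bucket/ingest_to_bq.py \
    --region us-central1 \
    --properties spark.dataproc.driver.disk.size=250g,spark.dataproc.executor.disk.size=250g

With 200+ batches running concurrently, the per-workload disk sizes multiply against the regional DISKS_TOTAL_GB quota, so either lower these values or limit how many batches run at once.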
I have Spark jobs on k8s which read and write Parquet files from Azure Storage (blobs). I recently learned that Azure imposes limits on the number of transactions per second, and my pipelines are exceeding those limits.
This results in throttling, and some tasks in my jobs take 8-10x the usual time (it isn't data skew). One of the recommendations was to apply an exponential backoff policy, but I have not found any such setting in the Spark configuration.
Has anyone faced a similar situation? Any help on this would truly be appreciated.
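One avenue to check (a sketch, not a confirmed answer): Spark itself does not retry throttled storage calls, but the Hadoop Azure connector underneath does, and any Hadoop configuration key can be passed through Spark using the spark.hadoop. prefix. If you are on the ABFS driver from hadoop-azure, its retry behaviour is tuned via the fs.azure.io.retry.* keys; the exact key names and values below are assumptions to verify against the hadoop-azure documentation for your version:

spark-submit \
    --conf spark.hadoop.fs.azure.io.retry.max.retries=30 \
    --conf spark.hadoop.fs.azure.io.retry.backoff.interval=3000 \
    --conf spark.hadoop.fs.azure.io.retry.max.backoff.interval=90000 \
    my_pipeline.py

Here my_pipeline.py is just a placeholder for your job; the same --conf settings can also go into spark-defaults.conf or the driver's SparkConf.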
I am studying for the Professional Data Engineer exam and I wonder what the "Google recommended best practice" is for hot data on Dataproc (given that cost is no concern).
If cost is a concern, I found a recommendation to keep all data in Cloud Storage because it is cheaper.
Can a mechanism be set up, such that all data is on Cloud Storage and recent data is cached on HDFS automatically? Something like AWS does with FSx/Lustre and S3.
What to store in HDFS and what to store in GCS is a case-dependent question. Dataproc supports running Hadoop or Spark jobs against GCS via the GCS connector, which makes Cloud Storage HDFS-compatible without significant performance loss.
The Cloud Storage connector is installed by default on all Dataproc cluster nodes and is available in both Spark and PySpark environments.
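Because of that, jobs can read gs:// paths directly with no extra setup. As an illustration (the cluster and bucket names are placeholders; the examples jar path is the one shipped on Dataproc images):

gcloud dataproc jobs submit spark \
    --cluster my-cluster \
    --class org.apache.spark.examples.JavaWordCount \
    --jars file:///usr/lib/spark/examples/jars/spark-examples.jar \
    -- gs://my-input-bucket/logs/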
After researching a bit: the performance of HDFS and Cloud Storage (or any other blob store) is not completely equivalent. For instance, a "mv" operation in a blob store is emulated as copy + delete.
What the ASF can do is warn that our own BlobStore filesystems (currently s3:, s3n: and swift:) are not complete replacements for hdfs:, as operations such as rename() are only emulated through copying then deleting all the objects, and so a directory rename is not atomic - a requirement of POSIX filesystems which some applications (MapReduce) currently depend on.
Source: https://cwiki.apache.org/confluence/display/HADOOP2/HCFS
I'm currently experimenting with Dataproc and I followed the Google tutorial to spin up a Hadoop cluster with Jupyter and Spark. Everything works smoothly. I use the following command:
gcloud dataproc clusters create test-cluster \
--project proj-name \
--bucket notebooks-storage \
--initialization-actions \
gs://dataproc-initialization-actions/jupyter/jupyter.sh
This command spins up a cluster with one master and two workers (machine type: n1-standard-4).
I tried adding the following flag:
--num-preemptible-workers 2
But it only adds two preemptible workers on top of the two standard VMs. I would like all of my workers to be preemptible VMs, because all of my data is stored on Google Cloud Storage and I don't care about the size of the HDFS storage.
Is this a sound thing to do? Is there any way of doing that?
Thanks!
In general, it is not a good idea to have a cluster that is exclusively or mostly pVMs. pVMs carry no guarantee that they will be available at cluster creation time, or even still be in your cluster N hours from now. Preemption is very bad for jobs (especially ones that run for many hours). Also, even though your data is in GCS, any shuffle operations will result in data being written to local disks. Think of pVMs only as supplemental compute power.
For these and other reasons, we recommend at most a 1:1 ratio of preemptible to standard workers.
An alternative, since you're working with a notebook, is to use a single node cluster: https://cloud.google.com/dataproc/docs/concepts/configuring-clusters/single-node-clusters
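For reference, a single node cluster with the same Jupyter init action can be created like this (a sketch reusing the names from your command; --single-node replaces the worker configuration):

gcloud dataproc clusters create test-cluster \
    --project proj-name \
    --bucket notebooks-storage \
    --single-node \
    --initialization-actions \
    gs://dataproc-initialization-actions/jupyter/jupyter.sh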
We have a Dataproc cluster we dynamically resize for large jobs. I submitted a cluster resize request to reduce the cluster back to its original size (1 master, 2 workers) from 10 workers plus 3 preemptible workers, but it still hasn't completed an hour later.
Is this normal? Is there a way to re-issue the request? At the moment I just get "cluster update in progress"-style messages.
If you downscale a Dataproc 1.2+ cluster using graceful decommissioning, it is expected that the operation can take a long time if there are jobs running on the cluster: the downscale will wait until the YARN containers on the decommissioned nodes have finished.
Also, if you are using HDFS intensively, node decommissioning can take a long time while data is replicated to prevent data loss.
You cannot issue another resize operation until the current one finishes.
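For reference, the decommission timeout can be bounded explicitly on the update command, and you can watch the pending operation with gcloud dataproc operations list. A sketch (the cluster name is a placeholder, and you may want a longer or shorter timeout):

gcloud dataproc clusters update my-cluster \
    --num-workers 2 \
    --graceful-decommission-timeout 1h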
I'm running multiple Dataproc clusters for various Spark Streaming jobs. All clusters are configured as single node.
Recently (circa 10 days ago) I started to experience job failures on all clusters. Each job runs for approximately 3 days and then fails with the same message:
=========== Cloud Dataproc Agent Error ===========
com.google.cloud.hadoop.services.agent.AgentException: Node was restarted while executing a job. This could be user-initiated or caused by Compute Engine maintenance event. (TASK_FAILED)
at com.google.cloud.hadoop.services.agent.AgentException$Builder.build(AgentException.java:83)
at com.google.cloud.hadoop.services.agent.job.AbstractJobHandler.lambda$kill$0(AbstractJobHandler.java:211)
at com.google.cloud.hadoop.services.repackaged.com.google.common.util.concurrent.AbstractTransformFuture$AsyncTransformFuture.doTransform(AbstractTransformFuture.java:211)
at com.google.cloud.hadoop.services.repackaged.com.google.common.util.concurrent.AbstractTransformFuture$AsyncTransformFuture.doTransform(AbstractTransformFuture.java:200)
at com.google.cloud.hadoop.services.repackaged.com.google.common.util.concurrent.AbstractTransformFuture.run(AbstractTransformFuture.java:130)
at com.google.cloud.hadoop.services.repackaged.com.google.common.util.concurrent.MoreExecutors$DirectExecutor.execute(MoreExecutors.java:435)
at com.google.cloud.hadoop.services.repackaged.com.google.common.util.concurrent.AbstractFuture.executeListener(AbstractFuture.java:900)
at com.google.cloud.hadoop.services.repackaged.com.google.common.util.concurrent.AbstractFuture.addListener(AbstractFuture.java:634)
at com.google.cloud.hadoop.services.repackaged.com.google.common.util.concurrent.AbstractFuture$TrustedFuture.addListener(AbstractFuture.java:98)
at com.google.cloud.hadoop.services.repackaged.com.google.common.util.concurrent.AbstractTransformFuture.create(AbstractTransformFuture.java:50)
at com.google.cloud.hadoop.services.repackaged.com.google.common.util.concurrent.Futures.transformAsync(Futures.java:551)
at com.google.cloud.hadoop.services.agent.job.AbstractJobHandler.kill(AbstractJobHandler.java:202)
at com.google.cloud.hadoop.services.agent.job.JobManagerImpl.recoverAndKill(JobManagerImpl.java:145)
at com.google.cloud.hadoop.services.agent.MasterRequestReceiver$NormalWorkReceiver.receivedJob(MasterRequestReceiver.java:142)
at com.google.cloud.hadoop.services.agent.MasterRequestReceiver.pollForJobsAndTasks(MasterRequestReceiver.java:106)
at com.google.cloud.hadoop.services.agent.MasterRequestReceiver.pollForWork(MasterRequestReceiver.java:78)
at com.google.cloud.hadoop.services.agent.MasterRequestReceiver.lambda$doStart$0(MasterRequestReceiver.java:68)
at com.google.cloud.hadoop.services.repackaged.com.google.common.util.concurrent.MoreExecutors$ScheduledListeningDecorator$NeverSuccessfulListenableFutureTask.run(MoreExecutors.java:623)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:748)
======== End of Cloud Dataproc Agent Error ========
This is also the very last thing that can be seen in the logs.
This started happening without any changes in the Spark code, for applications that had previously been running for 50+ days without problems.
All clusters are in the europe-west1-d zone, global region.
All applications are written in Scala.
Anyone experienced something similar? Any help would be welcome.
Since you're saying this is fairly persistent in the last few days, I wonder if something about your input data has changed and if you were running close to 100% utilization before the failures started.
Since Compute Engine VMs don't configure a swap partition, when you run out of RAM all daemons will crash and restart.
To check this, SSH into the VM and run:
sudo journalctl -u google-dataproc-agent
Somewhere in the output there should be a JVM crash header. You can also repeat this for the other Hadoop daemons, like hadoop-hdfs-namenode; they should crash at roughly the same time.
I recommend enabling Stackdriver monitoring [1] on the cluster to get RAM usage over time. If my theory is validated, you can try switching to either a highmem variant of the machine type you're using or a custom VM [2] with the same CPUs but more RAM.
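As a sketch of that suggestion, recreating one of the single node clusters on a highmem variant would look something like this (the cluster name is a placeholder; add back whatever other flags you normally use):

gcloud dataproc clusters create streaming-cluster \
    --single-node \
    --master-machine-type n1-highmem-4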
Additionally, if your jobs use Spark Streaming with checkpointing (or can be converted to it), then consider Dataproc restartable jobs [3]. After such a crash, Dataproc will auto-restart the job for you [4]; a minimal example follows the links below.
[1] https://cloud.google.com/dataproc/docs/concepts/stackdriver-monitoring
[2] https://cloud.google.com/dataproc/docs/concepts/custom-machine-types
[3] https://cloud.google.com/dataproc/docs/concepts/restartable-jobs
[4] How to restart Spark Streaming job from checkpoint on Dataproc?
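To illustrate [3], restartability is just a flag at submit time; a sketch (the cluster, class and jar names are placeholders, and the failure budget is up to you):

gcloud dataproc jobs submit spark \
    --cluster streaming-cluster \
    --class com.example.StreamingJob \
    --jars gs://my-bucket/jars/streaming-job.jar \
    --max-failures-per-hour 5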
This is related to a bug in Dataproc image version 1.1.34. Downgrading to image version 1.1.29 fixes the issue.
To create a cluster with image 1.1.29, use --image-version 1.1.29.
Refer to https://groups.google.com/forum/#!topic/cloud-dataproc-discuss/H5Dy2vmGZ8I for more information.
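For completeness, the full create command would be along the lines of (the cluster name is a placeholder; keep your other flags):

gcloud dataproc clusters create my-cluster \
    --image-version 1.1.29 \
    --single-node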