CDF pipeline throws an error when the GCS bucket doesn't exist - google-cloud-data-fusion

While moving data to the sink, I'm getting the error below in my GCP Data Fusion pipeline. Can someone help?
GCS path cdap-job/dd5d2bba-9cce-11ed-8666-56bac137a1c0 was not cleaned up for bucket gs://df-5999901975431383890-d2icduix5qi63hnabcpaqiyllq due to The specified bucket does not exist..
I tried recreating the temp bucket with the name that appeared in the log, but the name keeps changing.
I had deleted a few temp buckets from the list, and I suspect that caused this issue.

As of today, every Data Fusion instance is created together with a bucket whose name looks like the one you shared in your message, i.e. df-5999901975431383890-d2icduix5qi63hnabcpaqiyllq. This bucket is visible in GCS in the project you created your instance in.
You are probably getting this error because you deleted this bucket from GCS. You can either recreate the bucket or recreate your Data Fusion instance.
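If you decide to recreate the bucket, a minimal sketch with the google-cloud-storage Python client could look like this; the bucket name is taken from the error message above, while the project ID and location are placeholders you would need to replace with your instance's project and region:

from google.cloud import storage

# Placeholders: project and location must match your Data Fusion instance;
# the bucket name comes from the error message in the question.
BUCKET_NAME = "df-5999901975431383890-d2icduix5qi63hnabcpaqiyllq"
PROJECT_ID = "your-project-id"
LOCATION = "us-central1"

client = storage.Client(project=PROJECT_ID)
bucket = client.bucket(BUCKET_NAME)

if not bucket.exists():
    # Recreate the staging bucket Data Fusion expects to find.
    client.create_bucket(bucket, location=LOCATION)
    print(f"recreated gs://{BUCKET_NAME}")
else:
    print(f"gs://{BUCKET_NAME} already exists")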

Related

Cloud Fusion Pipeline works well in Preview mode but throws error in Deployment mode

I have the below pipeline which ingests news data from RSS feeds. The pipeline is constructed using HTTPPoller, XMLMultiParser Transform, JavaScript and a MongoDB Sink. The pipeline works well in Preview mode but throws a "bucket not found" error in Deployment mode.
[Screenshot: RSS Ingest Pipeline]
[Screenshot: error message]
Cloud Data Fusion (CDF) creates a Google Cloud Storage (GCS) bucket in your GCP project, with a name similar to the one in the error message, when you create a CDF instance. Judging by the error message, it's possible that this GCS bucket has been deleted. Try to deploy the same pipeline in a new CDF instance (with the bucket present this time) and it should not raise the same exception.
This bucket is used as a Hadoop Compatible File System (HCFS), which is required to run pipelines on Dataproc.
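If you only want to check whether the instance's staging bucket is still there before redeploying, a short sketch with the google-cloud-storage client (the project ID is a placeholder) is enough:

from google.cloud import storage

# Placeholder: the project that hosts the Data Fusion instance.
client = storage.Client(project="your-project-id")

# The Data Fusion staging bucket is created with a "df-" name prefix.
staging_buckets = [b.name for b in client.list_buckets(prefix="df-")]

if staging_buckets:
    print("found staging bucket(s):", staging_buckets)
else:
    print("no df-* staging bucket found; recreate it or recreate the instance")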

PySpark Structured Streaming error with Delta file version

I have a job that streams data from a Delta table with Parquet files to an output table in JSON format. Both tables live in an Azure Data Lake container.
I get the following error, which I can't make sense of:
java.lang.IllegalArgumentException: requirement failed: Did not get the first delta file version: 921 to compute Snapshot
What does this mean? I don't want to delete my checkpoint files or the transaction log etc.
Thanks in advance
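For context, the kind of job described above might look roughly like the following PySpark sketch; the paths and checkpoint location are placeholders (not the asker's actual configuration) and the delta-spark package is assumed to be available:

from pyspark.sql import SparkSession

# Minimal sketch only; all paths below are placeholders.
spark = SparkSession.builder.appName("delta-to-json-stream").getOrCreate()

source_path = "abfss://container@account.dfs.core.windows.net/tables/source_delta"
output_path = "abfss://container@account.dfs.core.windows.net/tables/target_json"
checkpoint_path = "abfss://container@account.dfs.core.windows.net/checkpoints/delta_to_json"

(spark.readStream
    .format("delta")                                # stream changes from the Delta table
    .load(source_path)
    .writeStream
    .format("json")                                 # write each micro-batch as JSON files
    .option("checkpointLocation", checkpoint_path)  # records progress through the Delta log
    .start(output_path))

The checkpointLocation is where Structured Streaming records which Delta version it has processed, which is likely where the version number in the error comes from.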
Note: Restoring Azure Data Lake Storage Gen2 flat and hierarchical namespaces is not supported.
For more details, refer to the Microsoft document "Point-in-time restore".
Point-in-time restore allows you to recover data from actions that only affected block blobs. Any activities that acted on containers are irreversibly lost. For example, if you use the Delete Container action to delete a container from the storage account, that container cannot be restored using a point-in-time restore operation. If you wish to restore individual blobs later, remove individual blobs rather than a whole container.

HadoopDataSource: Skipping Partition {} as no new files detected # s3:

So, I have an S3 folder with several subfolders acting as partitions (based on the date of creation). I have a Glue Table for those partitions and can see the data using Athena.
Running a Glue job and trying to access the Data Catalog, I get the following error:
HadoopDataSource: Skipping Partition {} as no new files detected # s3:...
The line that gives me problems is the following:
glueContext.getCatalogSource(database = "DB_NAME", tableName = "TABLE_NAME", redshiftTmpDir = "", transformationContext = "datasource0").getDynamicFrame().toDF()
At any point I want to be able to access all the data in those S3 subfolders, since it is updated regularly.
I think the problem is the Glue job bookmark not detecting new files, but this code is not running directly as part of a job; it runs in a library used by a job.
Removing "transformationContext" or setting it to an empty string hasn't worked.
So the Hadoop output you are getting is not an error but just a log line saying that the partition is empty.
But the partition that is getting logged, {}, looks off. Can you check that?
In addition, could you run the job with bookmarks disabled, to make sure that this is not the cause of the problem?
I also found this unresolved GitHub issue, maybe you can comment there too, so that the issue gets some attention.
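If you want to rule bookmarks out entirely, the job can be started with the job argument --job-bookmark-option job-bookmark-disable. On the code side, the equivalent read with the Python Glue API and no transformation context (DB_NAME and TABLE_NAME are placeholders) looks roughly like this:

from awsglue.context import GlueContext
from pyspark.context import SparkContext

glue_context = GlueContext(SparkContext.getOrCreate())

# Omitting transformation_ctx means no bookmark state is kept for this source,
# so every run reads all partitions again.
df = (glue_context.create_dynamic_frame
      .from_catalog(database="DB_NAME", table_name="TABLE_NAME")
      .toDF())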

Spark: No such file or directory

Here is my code:
s3.textFile("s3n://data/hadoop/data_log/req/20160618/*")
.map(doMap)
.saveAsTextFile()
spark 1.4.1, standalone cluster
Sometimes (not always, and this is important) it throws this error:
[2016-09-13 03:22:51,545: ERROR/Worker-1] err: java.io.FileNotFoundException:
No such file or directory
's3n://data/hadoop/data_log/req/20160618/hadoop.req.2016061811.log.0.gz'
But when I use
aws s3 ls s3://data/hadoop/data_log/req/20160618/hadoop.req.2016061811.log.0.gz
The file exists.
How to avoid this problem?
The problem is S3 consistency.
Even though the file is listed, it does not exist. Try aws s3 cp on the file and you will get the same exception.
"Read-after-write consistency is only valid for GETS of new objects - LISTS might not contain the new objects until the change is fully propagated."
Is listing Amazon S3 objects a strong consistency operation or eventual consistency operation?
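To reproduce the mismatch outside Spark, you can compare a LIST against a GET-style HEAD request with boto3; the bucket and key below are taken from the question and may need adjusting for your account:

import boto3
from botocore.exceptions import ClientError

# Bucket and key from the question; treat them as placeholders.
BUCKET = "data"
KEY = "hadoop/data_log/req/20160618/hadoop.req.2016061811.log.0.gz"

s3 = boto3.client("s3")

# LIST: may still return the key even when the object is gone, because
# listings were only eventually consistent at the time of the question.
listed = s3.list_objects_v2(Bucket=BUCKET, Prefix=KEY)
print("listed:", [obj["Key"] for obj in listed.get("Contents", [])])

# GET-style check: fails with a 404 if the object really does not exist.
try:
    s3.head_object(Bucket=BUCKET, Key=KEY)
    print("head_object: object exists")
except ClientError as err:
    print("head_object failed:", err.response["Error"]["Code"])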

Google Cloud Storage FUSE - Using gcsfuse fills up local instance memory

I've been using gcsfuse (FUSE) for some weeks and everything was running smoothly until my instance disk (10 GB) filled up out of nowhere.
While trying to identify the cause and erasing some temporary files, I found out that unmounting the bucket fixed the issue.
It's supposed to upload to the cloud, right? So why is it taking up space as if it counted against local instance storage?
Thanks for the help, guys.
Here is the reason why you would see this behaviour.
Pasting from the gcsfuse documentation (https://cloud.google.com/storage/docs/gcs-fuse):
Local storage: Objects that are new or modified will be stored in their entirety in a local temporary file until they are closed or synced. When working with large files, be sure you have enough local storage capacity for temporary copies of the files, particularly if you are working with Google Compute Engine instances. For more information, see the readme documentation.
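A quick way to confirm that these temporary copies are what is filling the disk is to watch local free space while writing a large file through the mount. A minimal sketch, assuming a mount point at /mnt/gcs-bucket and the default temp directory under /tmp (both are placeholders):

import os
import shutil

# Placeholders: your gcsfuse mount point and the local temp directory
# gcsfuse stages files in (the system temp directory by default).
MOUNT_POINT = "/mnt/gcs-bucket"
TEMP_DIR = "/tmp"

def free_gb(path):
    # Free space, in GiB, on the filesystem containing `path`.
    return shutil.disk_usage(path).free / 2**30

print(f"free before: {free_gb(TEMP_DIR):.2f} GiB")

# While the file is open, gcsfuse keeps the whole object as a local temporary
# copy; the space is only released after the file is closed and uploaded.
with open(os.path.join(MOUNT_POINT, "large-test-file.bin"), "wb") as f:
    for _ in range(1024):
        f.write(os.urandom(1024 * 1024))  # write ~1 GiB in 1 MiB chunks
    f.flush()
    print(f"free while the file is open: {free_gb(TEMP_DIR):.2f} GiB")

print(f"free after close: {free_gb(TEMP_DIR):.2f} GiB")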