Cloud Fusion Pipeline works well in Preview mode but throws error in Deployment mode - deployment

I have the below pipeline which ingests news data from RSS Feeds. Pipeline is contructed using HTTPPoller, XMLMultiParser Transorm, Javascript and MongoDB Sink. The pipeline works well in Preview mode but throws "bucket not found" error in Deployment mode
RSS Ingest Pipeline
Error

Cloud Data Fusion (CDF) creates a Google Cloud Storage (GCS) bucket with the name format similar to the one mentioned in the error message in your GCP project when you create a CDF instance. Judging by the error message, its possible that the GCS bucket may have been deleted. Try to deploy the same pipeline in a new CDF instance (with the bucket present this time) and it should not raise the same exception.
This bucket is used as a Hadoop Compatible File System (HCFS) which is required to run pipelines on Dataproc

Related

Streamsets job logs to S3

I am looking for solution where we can store all logs (Info, Debug etc) of a Streamsets pipeline (Job) to S3 buckets ?
Currently logs are only available at log console of Streamsets UI only
Looking at the code streamsets uses log4j for all its logging, so you could use something like https://github.com/parvanov/log4j-s3 and create a log write appender.
an alternative is that you could create a new pipeline to consume the logs and write to s3.
Logs are stored in $SDC_HOME/log directory on the VM running SDC. from there you can copy them to S3. Or you can configure your own path, and map it to an S3 bucket on OS level.

Custom spark log location configuration in Azure databricks

We execute Databricks notebook pipelines using Azure data factory. We have configured 'Log delivery' option to get logs on DBFS. Currently, when two pipeline runs simelteneouly we are not clearly able to segregate logs per pipeline. It is possible using spark, when the instance is readily available in the databricks, to point logs directory to be ex /var/spark/{random_id}/logs/ ?

Data Fusion Dataproc compute profle in a different account

I'm trying to execute a pipeline on a Data Proc cluster in a different project by the one Data Fusion instance is deployed but I am having some trouble. Data Proc instance seems to be created correctly but the start of the job fails. Any idea on how to solve?
Here the stack trace of the error
Thanks
This seems like the project where the Google Cloud Dataproc is doesn't have SSH port open. Can you check that your project allow port 22 connection? Cloud Data Fusion uses SSH to upload and monitor the job in the Cloud Dataproc.

How to copy yarn ssh logs automatically using scala to blob storage

We have a requirement to download the yarn ssh logs to blob storage automatically. I found that the yarn logs does get added to storage account under /app-logs/user/logs/ etc path but they are in a binary format and there is no documented way to convert these into text format. So we are trying to run the external command yarn logs -application <application_id> using scala at the end of our application run to capture the logs and save them to the blob storage but facing issues with that. Looking for a solution to get these logs automatically downloaded to storage account as part of the spark pipeline itself.
I tried redirecting the output of the yarn logs command to a temp file and then copying the file from local to blob storage. These commands work fine when I ssh into the head node of the spark cluster and run them. But they are not working when executed from jupyter notebook or scala application.
("yarn logs -applicationId application_1561088998595_xxx > /tmp/yarnlog_2.txt") !!
("hadoop dfs -fs wasbs://dev52mss#sahdimssperfdev.blob.core.windows.net -copyFromLocal /tmp/yarnlog_2.txt /tmp/") !!
When I run these commands using jupyter notebook, the first command works fine to redirect to a local file but the second one to move the file to blob fails with the following error:
warning: there was one feature warning; re-run with -feature for details
java.lang.RuntimeException: Nonzero exit value: 1
at scala.sys.package$.error(package.scala:27)
at scala.sys.process.ProcessBuilderImpl$AbstractBuilder.slurp(ProcessBuilderImpl.scala:132)
at scala.sys.process.ProcessBuilderImpl$AbstractBuilder.$bang$bang(ProcessBuilderImpl.scala:102)
... 56 elided
Initially I tried capturing the output of the command as a Dataframe and writing the dataframe to blob. It succeeded for small logs but for huge logs it failed with the error:
Serialized task 15:0 was 137500581 bytes, which exceeds max allowed: spark.rpc.message.maxSize (134217728 bytes). Consider increasing spark.rpc.message.maxSize or using broadcast variables for large values
val yarnLog = Seq(Process("yarn logs -applicationId " + "application_1560960859861_0003").!!).toDF()
yarnLog.write.mode("overwrite").text("wasbs://container#storageAccount.blob.core.windows.net/Dev/Logs/application_1560960859861_0003.txt")
Note: You can directly access the log files using Azure Storage => Blobs => Select Container => app logs
Azure HDInsight stores its log files both in the cluster file system and in Azure storage. You can examine log files in the cluster by opening an SSH connection to the cluster and browsing the file system, or by using the Hadoop YARN Status portal on the remote head node server. You can examine the log files in Azure storage using any of the tools that can access and download data from Azure storage.
Examples are AzCopy, CloudXplorer, and the Visual Studio Server Explorer. You can also use PowerShell and the Azure Storage Client libraries, or the Azure .NET SDKs, to access data in Azure blob storage.
For more details, refer "Manage logs for Azure HDInsight cluster".
Hope this helps.
Currently, you will need to use the 'yarn logs' command to view Yarn logs.
As regards your requirement, there are two methods to achieve this;
Method 1:
Schedule a daily copy of the app-logs folder into a desired container within the blob storage. This will do a differential copy every day at a specific time of the day. For this one, I had to use Azure Data Factory to achieve the scheduling. Quite easy and no manual copy or coding required.
However, because the yarn applications logs are stored in TFile binary format and can only be read using ‘yarn logs’ command, it means that you need to have another tool application to read the file when from the destination later on. You can use the tool here to read the files https://github.com/shanyu/hadooplogparser
Alternatively, you can have your own simple script that converts it to a readable file before the transfer. Sample script below
**
yarn logs -applicationId application_15645293xxxxx > /tmp/source/applog_back.txt
hadoop dfs -fs wasbs://hdiblob #sandboxblob.blob.core.windows.net -copyFromLocal /tmp/source/applog_back.txt /tmp/destination
**
Method 2:
This is the simplest and cheapest method. You can disable the retention period of the Yarn Application logs, this means the logs will be retained indefinitely. To do this, change the config “yarn.log-aggregation.retain-seconds” to value -1. This config can be found in yarn-site.xml.
Once this is done, you can always read your Yarn Applications logs anytime from the cluster using the Yarn UI or CLI.
Hope this helps

Triggering a Dataflow job when new files are added to Cloud Storage

I'd like to trigger a Dataflow job when new files are added to a Storage bucket in order to process and add new data into a BigQuery table. I see that Cloud Functions can be triggered by changes in the bucket, but I haven't found a way to start a Dataflow job using the gcloud node.js library.
Is there a way to do this using Cloud Functions or is there an alternative way of achieving the desired result (inserting new data to BigQuery when files are added to a Storage bucket)?
This is supported in Apache Beam starting with 2.2. See Watching for new files matching a filepattern in Apache Beam.
Maybe this post would help on how to trigger Dataflow pipelines from App Engine or Cloud Functions?
https://cloud.google.com/blog/big-data/2016/04/scheduling-dataflow-pipelines-using-app-engine-cron-service-or-cloud-functions