StreamSets job logs to S3

I am looking for a solution to store all logs (INFO, DEBUG, etc.) of a StreamSets pipeline (job) in S3 buckets.
Currently the logs are only available in the log console of the StreamSets UI.

Looking at the code, StreamSets uses log4j for all of its logging, so you could use something like https://github.com/parvanov/log4j-s3 and create a log appender that writes to S3.
An alternative is to create a new pipeline that consumes the log files and writes them to S3.
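As a rough sketch, the appender would be wired into SDC's log4j configuration roughly like this. The appender class and property names below are illustrative placeholders only, not the real log4j-s3 API; take the actual names from that project's README, and note that the SDC log4j file usually lives in the SDC configuration directory (e.g. sdc-log4j.properties).
# Illustrative placeholders only - the real appender class and its
# property names come from the log4j-s3 project's documentation.
log4j.appender.s3=com.example.log4js3.S3Appender
log4j.appender.s3.bucket=my-sdc-logs
log4j.appender.s3.region=us-east-1
log4j.appender.s3.layout=org.apache.log4j.PatternLayout
log4j.appender.s3.layout.ConversionPattern=%d [%t] %-5p %c - %m%n
# Then append ", s3" to the existing log4j.rootLogger line.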

Logs are stored in the $SDC_HOME/log directory on the VM running SDC; from there you can copy them to S3. Or you can configure your own log path and map it to an S3 bucket at the OS level.
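A minimal sketch of the copy approach, assuming the AWS CLI is installed and credentialed on the SDC host (the bucket name and install path are placeholders):
# One-off sync of the SDC log directory to an S3 bucket
aws s3 sync "$SDC_HOME/log" s3://my-sdc-logs/sdc-host-1/
# Or schedule it from cron, e.g. every 5 minutes (adjust the install path)
*/5 * * * * aws s3 sync /opt/streamsets-datacollector/log s3://my-sdc-logs/sdc-host-1/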

Related

Cloud Data Fusion pipeline works well in Preview mode but throws an error in Deployment mode

I have the below pipeline which ingests news data from RSS feeds. The pipeline is constructed using HTTPPoller, XMLMultiParser Transform, JavaScript, and a MongoDB Sink. The pipeline works well in Preview mode but throws a "bucket not found" error in Deployment mode.
[Screenshots: RSS Ingest Pipeline, error message]
Cloud Data Fusion (CDF) creates a Google Cloud Storage (GCS) bucket, with a name similar to the one mentioned in the error message, in your GCP project when you create a CDF instance. Judging by the error message, it's possible that the GCS bucket has been deleted. Try to deploy the same pipeline in a new CDF instance (with the bucket present this time) and it should not raise the same exception.
This bucket is used as a Hadoop Compatible File System (HCFS), which is required to run pipelines on Dataproc.
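To confirm whether that staging bucket still exists before redeploying, you can check for it from Cloud Shell; the bucket name below is a placeholder for the one shown in your error message:
# Lists the bucket if it exists; fails with a not-found error otherwise
gsutil ls -b gs://<bucket-name-from-error-message>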

How to automatically copy YARN ssh logs to blob storage using Scala

We have a requirement to download the YARN logs to blob storage automatically. I found that the YARN logs do get added to the storage account under the /app-logs/user/logs/ path, but they are in a binary format and there is no documented way to convert them into text format. So we are trying to run the external command yarn logs -applicationId <application_id> using Scala at the end of our application run to capture the logs and save them to blob storage, but we are facing issues with that. We are looking for a solution to get these logs downloaded to the storage account automatically as part of the Spark pipeline itself.
I tried redirecting the output of the yarn logs command to a temp file and then copying the file from local to blob storage. These commands work fine when I SSH into the head node of the Spark cluster and run them, but they are not working when executed from a Jupyter notebook or a Scala application.
("yarn logs -applicationId application_1561088998595_xxx > /tmp/yarnlog_2.txt") !!
("hadoop dfs -fs wasbs://dev52mss#sahdimssperfdev.blob.core.windows.net -copyFromLocal /tmp/yarnlog_2.txt /tmp/") !!
When I run these commands from a Jupyter notebook, the first command works fine and redirects to a local file, but the second one, which moves the file to blob, fails with the following error:
warning: there was one feature warning; re-run with -feature for details
java.lang.RuntimeException: Nonzero exit value: 1
at scala.sys.package$.error(package.scala:27)
at scala.sys.process.ProcessBuilderImpl$AbstractBuilder.slurp(ProcessBuilderImpl.scala:132)
at scala.sys.process.ProcessBuilderImpl$AbstractBuilder.$bang$bang(ProcessBuilderImpl.scala:102)
... 56 elided
Initially I tried capturing the output of the command as a Dataframe and writing the dataframe to blob. It succeeded for small logs but for huge logs it failed with the error:
Serialized task 15:0 was 137500581 bytes, which exceeds max allowed: spark.rpc.message.maxSize (134217728 bytes). Consider increasing spark.rpc.message.maxSize or using broadcast variables for large values
val yarnLog = Seq(Process("yarn logs -applicationId " + "application_1560960859861_0003").!!).toDF()
yarnLog.write.mode("overwrite").text("wasbs://container#storageAccount.blob.core.windows.net/Dev/Logs/application_1560960859861_0003.txt")
Note: You can directly access the log files using Azure Storage => Blobs => Select Container => app logs
Azure HDInsight stores its log files both in the cluster file system and in Azure storage. You can examine log files in the cluster by opening an SSH connection to the cluster and browsing the file system, or by using the Hadoop YARN Status portal on the remote head node server. You can examine the log files in Azure storage using any of the tools that can access and download data from Azure storage.
Examples are AzCopy, CloudXplorer, and the Visual Studio Server Explorer. You can also use PowerShell and the Azure Storage Client libraries, or the Azure .NET SDKs, to access data in Azure blob storage.
For more details, refer to "Manage logs for Azure HDInsight cluster".
Hope this helps.
Currently, you will need to use the 'yarn logs' command to view Yarn logs.
As regards your requirement, there are two methods to achieve this:
Method 1:
Schedule a daily copy of the app-logs folder into a desired container within the blob storage. This will do a differential copy every day at a specific time of the day. For this one, I had to use Azure Data Factory to achieve the scheduling. Quite easy and no manual copy or coding required.
However, because the YARN application logs are stored in TFile binary format and can only be read using the 'yarn logs' command, you need another tool to read the files from the destination later on. You can use the tool here to read the files: https://github.com/shanyu/hadooplogparser
Alternatively, you can have your own simple script that converts them to a readable file before the transfer. A sample script is below:
yarn logs -applicationId application_15645293xxxxx > /tmp/source/applog_back.txt
hadoop dfs -fs wasbs://hdiblob#sandboxblob.blob.core.windows.net -copyFromLocal /tmp/source/applog_back.txt /tmp/destination
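If, as in the question, the two steps need to run from a Scala/Spark application rather than an interactive SSH session, a sketch using scala.sys.process that passes arguments explicitly and captures stderr (so a nonzero exit code is easier to diagnose) could look like the following; the application id, local path, and storage URI are placeholders:
import java.io.File
import scala.sys.process._

val appId   = "application_1561088998595_0001"                               // placeholder
val tmpFile = new File("/tmp/yarnlog.txt")
val dest    = "wasbs://container@account.blob.core.windows.net/app-logs/"    // placeholder

// Redirect the command's stdout to the local file instead of relying on shell ">"
val dumpExit = (Process(Seq("yarn", "logs", "-applicationId", appId)) #> tmpFile).!

// Capture stderr so the reason for a nonzero exit is visible
val errors  = new StringBuilder
val copyCmd = Process(Seq("hadoop", "fs", "-copyFromLocal", "-f", tmpFile.getAbsolutePath, dest))
val copyExit = copyCmd.!(ProcessLogger(out => (), err => errors.append(err).append('\n')))

if (dumpExit != 0 || copyExit != 0)
  println(s"yarn logs exit=$dumpExit, copy exit=$copyExit, stderr:\n$errors")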
Method 2:
This is the simplest and cheapest method. You can disable the retention period of the YARN application logs, which means the logs will be retained indefinitely. To do this, change the config "yarn.log-aggregation.retain-seconds" to the value -1. This config can be found in yarn-site.xml.
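In yarn-site.xml that entry would look like this:
<property>
<name>yarn.log-aggregation.retain-seconds</name>
<value>-1</value>
</property>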
Once this is done, you can always read your Yarn Applications logs anytime from the cluster using the Yarn UI or CLI.
Hope this helps

How to redirect Apache Spark logs from the driver and the executors to the console of the machine that launches the Spark job using log4j?

I'm trying to build an Apache Spark application that normalizes CSV files from HDFS (changes the delimiter, fixes broken lines). I use log4j for logging, but all the logs just print in the executors, so the only way I can check them is with the yarn logs -applicationId command. Is there any way I can redirect all logs (from the driver and from the executors) to my gateway node (the one which launches the Spark job) so I can check them during execution?
You should have the executors' log4j properties configured to write log files locally on each node. Streaming the logs back to the driver will cause unnecessary latency in processing.
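A minimal sketch of such a per-executor configuration, assuming a log4j 1.x properties file shipped to the executors (the appender name, file name, and rolling settings are illustrative):
# log4j.properties shipped to each executor (illustrative values)
log4j.rootCategory=INFO, local
log4j.appender.local=org.apache.log4j.RollingFileAppender
# spark.yarn.app.container.log.dir resolves to the container's YARN log directory
log4j.appender.local.File=${spark.yarn.app.container.log.dir}/spark-executor.log
log4j.appender.local.MaxFileSize=50MB
log4j.appender.local.MaxBackupIndex=5
log4j.appender.local.layout=org.apache.log4j.PatternLayout
log4j.appender.local.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c{1}: %m%n
The file can then be distributed and activated with the standard options, e.g. spark-submit --files log4j.properties --conf "spark.executor.extraJavaOptions=-Dlog4j.configuration=log4j.properties" (adjust to your cluster and Spark version).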
If you plan on being able to "tail" the logs in near real time, you would need to instrument a solution like Splunk or Elasticsearch, and use tools like Splunk Forwarders, Fluentd, or Filebeat, which are agents on each box that watch all configured log paths and push that data to a destination indexer that parses and extracts log field data.
Now, there are other alternatives like StreamSets or NiFi or Knime (all open source), which offer more instrumentation for collecting event-processing failures and effectively allow for "dead letter queues" to handle errors in a specific way. The part I like about those tools: no programming required.
I think it is not possible. When you execute Spark in local mode you can see the logs in the console. Otherwise, you have to alter the log4j properties to set the log file path.
As per https://spark.apache.org/docs/preview/running-on-yarn.html#configuration,
YARN has two modes for handling container logs after an application has completed. If log aggregation is turned on (with the yarn.log-aggregation-enable config in yarn-site.xml file), container logs are copied to HDFS and deleted on the local machine.
You can also view the container log files directly in HDFS using the HDFS shell or API. The directory where they are located can be found by looking at your YARN configs (yarn.nodemanager.remote-app-log-dir and yarn.nodemanager.remote-app-log-dir-suffix in yarn-site.xml).
I am not sure whether the log aggregation from the worker nodes happens in real time, though.
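To browse the aggregated files with the HDFS shell mentioned above, the path is built from the two YARN settings; a sketch with placeholders:
# Base dir and suffix come from yarn.nodemanager.remote-app-log-dir and
# yarn.nodemanager.remote-app-log-dir-suffix in yarn-site.xml (for example /app-logs and logs)
hdfs dfs -ls <remote-app-log-dir>/<user>/<suffix>/<application_id>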
There is an indirect way to achieve this. Enable the following property in yarn-site.xml:
<property>
<name>yarn.log-aggregation-enable</name>
<value>true</value>
</property>
This will store the logs of all submitted applications in an HDFS location. Then, using the following command, you can download the logs into a single aggregated file:
yarn logs -applicationId application_id_example > app_logs.txt
I came across this GitHub repo which downloads the driver and container logs separately. Clone this repository: https://github.com/hammerlab/yarn-logs-helpers
git clone --recursive https://github.com/hammerlab/yarn-logs-helpers.git
In your .bashrc (or equivalent), source .yarn-logs-helpers.sourceme:
$ source /path/to/repo/.yarn-logs-helpers.sourceme
Then download the aggregated logs, nicely segregated into driver and container logs, with this command:
yarn-container-logs application_example_id

Way to check Spring Cloud Stream source and sink data content

Is there any way I can check what data is in a Spring Cloud Data Flow stream source (say some named destination ":mySource") and sink (say "log" as the sink)?
e.g. dataflow:>stream create --name demo --definition ":mySource>log"
Here, what is in mySource and log, and how do I check it?
Do I have to check a Spring Cloud Data Flow log somewhere to get a clue, if it has logs at all? If so, what is the location of the logs in a Windows environment?
If you're interested in the payload content, you can deploy the stream with DEBUG logs for the Spring Integration package, which will print the header + payload information among many other interesting lifecycle details. The logged payload will be either the one consumed or the one produced, depending on the application type (i.e., source, processor, or sink).
In your case, you can view the payload consumed by the log-sink via:
dataflow:>stream create --name demo --definition ":mySource > log --logging.level.org.springframework.integration=DEBUG"
We have plans to add native provenance/lineage support with the help of Zipkin and Sleuth in the future releases.

Triggering a Dataflow job when new files are added to Cloud Storage

I'd like to trigger a Dataflow job when new files are added to a Storage bucket in order to process and add new data into a BigQuery table. I see that Cloud Functions can be triggered by changes in the bucket, but I haven't found a way to start a Dataflow job using the gcloud node.js library.
Is there a way to do this using Cloud Functions or is there an alternative way of achieving the desired result (inserting new data to BigQuery when files are added to a Storage bucket)?
This is supported in Apache Beam starting with 2.2. See Watching for new files matching a filepattern in Apache Beam.
Maybe this post on how to trigger Dataflow pipelines from App Engine or Cloud Functions would help:
https://cloud.google.com/blog/big-data/2016/04/scheduling-dataflow-pipelines-using-app-engine-cron-service-or-cloud-functions
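If your pipeline is deployed as a Dataflow template, the Cloud Function (or an App Engine cron handler, as in the post above) essentially has to issue the equivalent of the gcloud call below, passing the new file's path as a parameter; the job name, template location, and parameter names are placeholders specific to your pipeline:
# Launch a templated Dataflow job for a newly arrived file (names are placeholders)
gcloud dataflow jobs run load-new-file \
    --gcs-location gs://my-bucket/templates/LoadToBigQuery \
    --parameters inputFile=gs://my-bucket/incoming/new-file.csv
From Node.js, the same launch can be done through the Dataflow REST API's projects.templates.launch method instead of shelling out to gcloud.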