How to redirect Apache Spark logs from the driver and the slaves to the console of the machine that launchs the Spark job using log4j? - scala

I'm trying to build an Apache Spark application that normalizes csv files from HDFS (changes delimiter, fix broken lines). I use log4j for logging but all the logs just print in the executors so the only way i can check them is using yarn logs -applicationId command. Is there any way i can redirect all logs( from driver and from executors) to my gateway node(the one which launchs the spark job) so i can check them during execution?

You should have the executors log4j props configured to write files local to themselves. Streaming back to the driver will cause unnecessary latency in processing.
If you plan on being able to 'tail" the logs in near real-time, you would need to instrument a solution like Splunk or Elasticsearch, and use tools like Splunk Forwarders, Fluentd, or Filebeat that are agents on each box that specifically watch for all configured log paths, and push that data to a destination indexer, that'll parse and extract log field data.
Now, there are other alternatives like Streamsets or Nifi or Knime (all open source), which offer more instrumentation for collecting event processing failures, and effectively allow for "dead letter queues" to handle errors in a specific way. The part I like about those tools - no programming required.

i think it is not possible. When you execute spark in local mode you can able to see it in console. Otherwise you have to alter log4j properties for the log file path.

As per https://spark.apache.org/docs/preview/running-on-yarn.html#configuration,
YARN has two modes for handling container logs after an application has completed. If log aggregation is turned on (with the yarn.log-aggregation-enable config in yarn-site.xml file), container logs are copied to HDFS and deleted on the local machine.
You can also view the container log files directly in HDFS using the HDFS shell or API. The directory where they are located can be found by looking at your YARN configs (yarn.nodemanager.remote-app-log-dir and yarn.nodemanager.remote-app-log-dir-suffix in yarn-site.xml).
I am not sure whether the log aggregation from worker nodes happen in real time !!

There is an indirect way to achieve. Enable the following property in yarn-site.xml.
<property>
<name>yarn.log-aggregation-enable</name>
<value>true</value>
</property>
This will store all your logs of the submitted applications in hdfs location. Then using the following command you can download the logs into a single aggregated file.
yarn logs -applicationId application_id_example > app_logs.txt
I came across this github repo which downloads the driver and container logs separately. Clone this repository : https://github.com/hammerlab/yarn-logs-helpers
git clone --recursive https://github.com/hammerlab/yarn-logs-helpers.git
In your .bashrc (or equivalent), source .yarn-logs-helpers.sourceme:
$ source /path/to/repo/.yarn-logs-helpers.sourceme
Then download the aggregated logs into nicely segregated driver and container logs by this command.
yarn-container-logs application_example_id

Related

Streamsets job logs to S3

I am looking for solution where we can store all logs (Info, Debug etc) of a Streamsets pipeline (Job) to S3 buckets ?
Currently logs are only available at log console of Streamsets UI only
Looking at the code streamsets uses log4j for all its logging, so you could use something like https://github.com/parvanov/log4j-s3 and create a log write appender.
an alternative is that you could create a new pipeline to consume the logs and write to s3.
Logs are stored in $SDC_HOME/log directory on the VM running SDC. from there you can copy them to S3. Or you can configure your own path, and map it to an S3 bucket on OS level.

Is possible for a container to send kafka event when finishes?

We just migrated to a kubernetes cluster, I was wondering if it is possible to send a kafka event when a container/pod finishes automatically with the stdout as message. Right now we are using fluentd with elastic search but the output of a pod is used as input for the next one, we need to poll constantly elastic search for when the output is ready and that causes performance issues on overall execution
I'm not sure of your current setup but my first thought would jump to:
Use something such as fluentd or Logstash on it's own pod per node
Configure volume access to Kubernetes log folder /var/log/containers/*
Use the Kafka output for either fluentd or Logstash with file input (tail) on the logging folder
This approach would require the configuration above on each node however but requires minimal configuration of logging locations etc..
It's not something I've personally configured but have considered it for the future.
More info here

How to send logs from Google Stackdriver to Kafka

I see many docs and posts about how to send logs to Stackdriver but almost no information about how to do the opposite - send logs from the Stackdriver to Kafka.
In my case, our Ops want to collect the logs from our web servers using Google's stackdriver agents and pushing them to stackdriver ... However, for my stream processing needs I want to get the logs into Kafka to use it's unparalleled abilities to retain and reprocess data by any number of consumers, something that I cannot do with PubSub.
So, what are the options for doing this? I only saw a couple of possible avenues - neither sounds too good:
based on this post: (https://powerspace.tech/how-to-stream-data-from-google-pubsub-to-kafka-with-kafka-connect-dbef1c340a76) push data into PubSub first, and then read from it using either Kafka connector or write my own Kafka consumer. I hate the thought of adding yet another hop (serialize/deserialize/ack/etc.) between the source of data and Kafka ....
I noticed a brief mentioning in passing on adding a plugin to Google's version of Fluentd (which is what stackdriver log collection agent is based on) here: https://powerspace.tech/how-to-stream-data-from-google-pubsub-to-kafka-with-kafka-connect-dbef1c340a76 . Not many details - so hard to tell how involved this approach is ...
Any other options?
Thank you!
Enter in to the Kafka console and add certain elements in the console. Once you have added the elements in the Kafka console you need to check if these elements are reflected successfully in the cloud shell. For this you will run the command > $ gcloud pubsub subscriptions pull from-kafka — auto-ack — limit=10 < . Once you run this command it will take some time to sync with the Kafka console. You will get the results after running this command a couple of times.
You will run the commands in the Cloud Shell and see the output in the Kafka VM SSH.
***Image1
Now you will be verifying the exact opposite procedure where in you will be running the command in the Kafka VM and seeing the output in the Cloud Shell. It will take some time for the output to be reflected and you may have to run the command > $ gcloud pubsub subscriptions pull from-kafka — auto-ack — limit=10 < a couple of times to see the output. Your output will look like this
*** image2
The Kafka plugin is deprecated. For more information, refer to https://cloud.google.com/stackdriver/docs/deprecations
Note: This functionality is only available for agents running on Linux. It is not available on Windows.
Kafka is monitored via JMX. Monitoring supports monitoring Kafka version 0.8.2 and higher.
On your VM instance, download kafka-082.conf from the GitHub configuration repository and place it in the directory /etc/stackdriver/collectd.d/:
(cd /etc/stackdriver/collectd.d/ && sudo curl -O https://raw.githubusercontent.com/Stackdriver/stackdriver-agent-service-configs/master/etc/collectd.d/kafka-082.conf)
The downloaded plugin configuration file assumes that your Kafka server is configured to accept JMX connections on port 9999. If you have configured Kafka with a different JMX port, as root, edit the file and follow the instructions to change the JMX port settings.
After adding the configuration file, restart the Monitoring agent by running the following command:
sudo service stackdriver-agent restart
What is monitored:
https://cloud.google.com/monitoring/api/metrics_agent#agent-kafka

How to copy yarn ssh logs automatically using scala to blob storage

We have a requirement to download the yarn ssh logs to blob storage automatically. I found that the yarn logs does get added to storage account under /app-logs/user/logs/ etc path but they are in a binary format and there is no documented way to convert these into text format. So we are trying to run the external command yarn logs -application <application_id> using scala at the end of our application run to capture the logs and save them to the blob storage but facing issues with that. Looking for a solution to get these logs automatically downloaded to storage account as part of the spark pipeline itself.
I tried redirecting the output of the yarn logs command to a temp file and then copying the file from local to blob storage. These commands work fine when I ssh into the head node of the spark cluster and run them. But they are not working when executed from jupyter notebook or scala application.
("yarn logs -applicationId application_1561088998595_xxx > /tmp/yarnlog_2.txt") !!
("hadoop dfs -fs wasbs://dev52mss#sahdimssperfdev.blob.core.windows.net -copyFromLocal /tmp/yarnlog_2.txt /tmp/") !!
When I run these commands using jupyter notebook, the first command works fine to redirect to a local file but the second one to move the file to blob fails with the following error:
warning: there was one feature warning; re-run with -feature for details
java.lang.RuntimeException: Nonzero exit value: 1
at scala.sys.package$.error(package.scala:27)
at scala.sys.process.ProcessBuilderImpl$AbstractBuilder.slurp(ProcessBuilderImpl.scala:132)
at scala.sys.process.ProcessBuilderImpl$AbstractBuilder.$bang$bang(ProcessBuilderImpl.scala:102)
... 56 elided
Initially I tried capturing the output of the command as a Dataframe and writing the dataframe to blob. It succeeded for small logs but for huge logs it failed with the error:
Serialized task 15:0 was 137500581 bytes, which exceeds max allowed: spark.rpc.message.maxSize (134217728 bytes). Consider increasing spark.rpc.message.maxSize or using broadcast variables for large values
val yarnLog = Seq(Process("yarn logs -applicationId " + "application_1560960859861_0003").!!).toDF()
yarnLog.write.mode("overwrite").text("wasbs://container#storageAccount.blob.core.windows.net/Dev/Logs/application_1560960859861_0003.txt")
Note: You can directly access the log files using Azure Storage => Blobs => Select Container => app logs
Azure HDInsight stores its log files both in the cluster file system and in Azure storage. You can examine log files in the cluster by opening an SSH connection to the cluster and browsing the file system, or by using the Hadoop YARN Status portal on the remote head node server. You can examine the log files in Azure storage using any of the tools that can access and download data from Azure storage.
Examples are AzCopy, CloudXplorer, and the Visual Studio Server Explorer. You can also use PowerShell and the Azure Storage Client libraries, or the Azure .NET SDKs, to access data in Azure blob storage.
For more details, refer "Manage logs for Azure HDInsight cluster".
Hope this helps.
Currently, you will need to use the 'yarn logs' command to view Yarn logs.
As regards your requirement, there are two methods to achieve this;
Method 1:
Schedule a daily copy of the app-logs folder into a desired container within the blob storage. This will do a differential copy every day at a specific time of the day. For this one, I had to use Azure Data Factory to achieve the scheduling. Quite easy and no manual copy or coding required.
However, because the yarn applications logs are stored in TFile binary format and can only be read using ‘yarn logs’ command, it means that you need to have another tool application to read the file when from the destination later on. You can use the tool here to read the files https://github.com/shanyu/hadooplogparser
Alternatively, you can have your own simple script that converts it to a readable file before the transfer. Sample script below
**
yarn logs -applicationId application_15645293xxxxx > /tmp/source/applog_back.txt
hadoop dfs -fs wasbs://hdiblob #sandboxblob.blob.core.windows.net -copyFromLocal /tmp/source/applog_back.txt /tmp/destination
**
Method 2:
This is the simplest and cheapest method. You can disable the retention period of the Yarn Application logs, this means the logs will be retained indefinitely. To do this, change the config “yarn.log-aggregation.retain-seconds” to value -1. This config can be found in yarn-site.xml.
Once this is done, you can always read your Yarn Applications logs anytime from the cluster using the Yarn UI or CLI.
Hope this helps

How to turn on console logging for TwitterSource

Based on the Flink Scala quickstart, I've created a sample job that uses org.apache.flink.streaming.connectors.twitter.TwitterSource.
I'm using a local streaming job manager /start-local-streaming.sh, and start the job with ~/flink-0.10.0/bin/flink run target/quickstart-0.1.jar
In the TwitterSource sourcecode, I notice that there are logging statements of the INFO level. How can I get those logged to console?
The Flink daemons are started in the background, that's why Flink is logging to log files in the log/ directory.
To monitor these files, use tail -f log/*. This will print the log entries on the console.
Another approach would be to change the start scripts of Flink to start it in the foreground + change the conf/log4j.properties file to use the ConsoleAppender.