How to copy yarn ssh logs automatically using scala to blob storage - scala

We have a requirement to download the yarn ssh logs to blob storage automatically. I found that the yarn logs does get added to storage account under /app-logs/user/logs/ etc path but they are in a binary format and there is no documented way to convert these into text format. So we are trying to run the external command yarn logs -application <application_id> using scala at the end of our application run to capture the logs and save them to the blob storage but facing issues with that. Looking for a solution to get these logs automatically downloaded to storage account as part of the spark pipeline itself.
I tried redirecting the output of the yarn logs command to a temp file and then copying the file from local to blob storage. These commands work fine when I ssh into the head node of the spark cluster and run them. But they are not working when executed from jupyter notebook or scala application.
("yarn logs -applicationId application_1561088998595_xxx > /tmp/yarnlog_2.txt") !!
("hadoop dfs -fs wasbs://dev52mss#sahdimssperfdev.blob.core.windows.net -copyFromLocal /tmp/yarnlog_2.txt /tmp/") !!
When I run these commands using jupyter notebook, the first command works fine to redirect to a local file but the second one to move the file to blob fails with the following error:
warning: there was one feature warning; re-run with -feature for details
java.lang.RuntimeException: Nonzero exit value: 1
at scala.sys.package$.error(package.scala:27)
at scala.sys.process.ProcessBuilderImpl$AbstractBuilder.slurp(ProcessBuilderImpl.scala:132)
at scala.sys.process.ProcessBuilderImpl$AbstractBuilder.$bang$bang(ProcessBuilderImpl.scala:102)
... 56 elided
Initially I tried capturing the output of the command as a Dataframe and writing the dataframe to blob. It succeeded for small logs but for huge logs it failed with the error:
Serialized task 15:0 was 137500581 bytes, which exceeds max allowed: spark.rpc.message.maxSize (134217728 bytes). Consider increasing spark.rpc.message.maxSize or using broadcast variables for large values
val yarnLog = Seq(Process("yarn logs -applicationId " + "application_1560960859861_0003").!!).toDF()
yarnLog.write.mode("overwrite").text("wasbs://container#storageAccount.blob.core.windows.net/Dev/Logs/application_1560960859861_0003.txt")

Note: You can directly access the log files using Azure Storage => Blobs => Select Container => app logs
Azure HDInsight stores its log files both in the cluster file system and in Azure storage. You can examine log files in the cluster by opening an SSH connection to the cluster and browsing the file system, or by using the Hadoop YARN Status portal on the remote head node server. You can examine the log files in Azure storage using any of the tools that can access and download data from Azure storage.
Examples are AzCopy, CloudXplorer, and the Visual Studio Server Explorer. You can also use PowerShell and the Azure Storage Client libraries, or the Azure .NET SDKs, to access data in Azure blob storage.
For more details, refer "Manage logs for Azure HDInsight cluster".
Hope this helps.

Currently, you will need to use the 'yarn logs' command to view Yarn logs.
As regards your requirement, there are two methods to achieve this;
Method 1:
Schedule a daily copy of the app-logs folder into a desired container within the blob storage. This will do a differential copy every day at a specific time of the day. For this one, I had to use Azure Data Factory to achieve the scheduling. Quite easy and no manual copy or coding required.
However, because the yarn applications logs are stored in TFile binary format and can only be read using ‘yarn logs’ command, it means that you need to have another tool application to read the file when from the destination later on. You can use the tool here to read the files https://github.com/shanyu/hadooplogparser
Alternatively, you can have your own simple script that converts it to a readable file before the transfer. Sample script below
**
yarn logs -applicationId application_15645293xxxxx > /tmp/source/applog_back.txt
hadoop dfs -fs wasbs://hdiblob #sandboxblob.blob.core.windows.net -copyFromLocal /tmp/source/applog_back.txt /tmp/destination
**
Method 2:
This is the simplest and cheapest method. You can disable the retention period of the Yarn Application logs, this means the logs will be retained indefinitely. To do this, change the config “yarn.log-aggregation.retain-seconds” to value -1. This config can be found in yarn-site.xml.
Once this is done, you can always read your Yarn Applications logs anytime from the cluster using the Yarn UI or CLI.
Hope this helps

Related

GCloud authentication race conditions

I'm trying to avoid race conditions with gcloud / gsutil authentication on the same system but different CI/CD jobs on my Gitlab-Runner on a Mac Mini.
I have tried setting the auth manually with
RUN gcloud auth activate-service-account --key-file="gitlab-runner.json"
RUN gcloud config set project $GCP_PROJECT_ID
for the Dockerfile (in which I'm performing a download operation from a Google Cloud Storage bucket).
I'm using a configuration in the bash script to run the docker command and in the same script for authenticating I'm using
gcloud config configurations activate $TARGET
Where I've previously done the above two commands to save them to the configuration.
The configurations are working fine if I start the CI/CD jobs one after the other has finished. But I want to trigger them for all clients at the same time, which causes race conditions with gcloud authentication and one of the jobs trying to download from the wrong project bucket.
How to avoid a race condition? I'm already authenticating before each gsutil command but still its causing the race condition. Do I need something like CloudBuild to separate the runtime environments?
You can use Cloud Build to get separate execution environments but this might be an overkill for your use case, as a Cloud Build worker uses an entire VM which might be just too heavy, linux containers / Docker can provide necessary isolation as well.
You should make sure that each container you run has a unique config file placed in the path expected by gcloud. The issue may come from improper volume mounting (all the containers share the same location from the host/OS), or maybe you should mount a directory containing their configuration file (unique for each bucket) on running an image, or perhaps you should run gcloud config configurations activate in a Dockerfile step (thus creating image variants for different buckets if it’s feasible).
Alternatively, and I think this solution might be easier, you can switch from Cloud SDK distribution to standalone gsutil distribution. That way you can provide a path to a boto configuration file through an environment variable.
Such variables can be specified on running a Docker image.

Mounting Azure Blob Storage to Azure Databricks without using cluster

We have a requirement that while provisioning the Databricks service thru CI/CD pipeline in Azure DevOps we should able to mount a blob storage to DBFS without connecting to a cluster. Is it possible to mount object storage to DBFS cluster by using a bash script from Azure DevOps ?
I looked thru various forums but they all mention about doing this using dbutils.fs.mount but the problem is we cannot run this command in Azure DevOps CI/CD pipeline.
Will appreciate any help on this.
Thanks
What you're asking is possible but it requires a bit of extra work. In our organisation we've tried various approaches and I've been working with Databricks for a while. The solution that works best for us is to write a bash script that makes use of the databricks-cli in your Azure Devops pipeline. The approach we have is as follows:
Retrieve a Databricks token using the token API
Configure the Databricks CLI in the CI/CD pipeline
Use Databricks CLI to upload a mount script
Create a Databricks job using the Jobs API and set the mount script as file to execute
The steps above are all contained in a bash script that is part of our Azure Devops pipeline.
Setting up the CLI
Setting up the Databricks CLI without any manual steps is now possible since you can generate a temporary access token using the Token API. We use a Service Principal for authentication.
https://learn.microsoft.com/en-US/azure/databricks/dev-tools/api/latest/tokens
Create a mount script
We have a scala script that follows the mount instructions. This can be Python as well. See the following link for more information:
https://docs.databricks.com/data/data-sources/azure/azure-datalake-gen2.html#mount-azure-data-lake-storage-gen2-filesystem.
Upload the mount script
In the Azure Devops pipeline the databricks-cli is configured by creating a temporary token using the token API. Once this step is done, we're free to use the CLI to upload our mount script to DBFS or import it as a notebook using the Workspace API.
https://learn.microsoft.com/en-US/azure/databricks/dev-tools/api/latest/workspace#--import
Configure the job that actually mounts your storage
We have a JSON file that defines the job that executes the "mount storage" script. You can define a job to use the script/notebook that you've uploaded in the previous step. You can easily define a job using JSON, check out how it's done in the Jobs API documentation:
https://learn.microsoft.com/en-US/azure/databricks/dev-tools/api/latest/jobs#--
At this point, triggering the job should create a temporary cluster that mounts the storage for you. You should not need to use the web interface, or perform any manual steps.
You can apply this approach to different environments and resource groups, as do we. For this we make use of Jinja templating to fill out variables that are environment or project specific.
I hope this helps you out. Let me know if you have any questions!

CannotPullContainerError: AWS Batch Job

I am trying to run a job in AWS Batch. This is my first attempt.
I have a python script which reads files from a S3 bucket, processes them and makes tables in RDS Postgres.
I have made a docker image with my script, pandas, boto3, SQLAlchemy and pushed it to hub.docker.com
When I try to run a job in AWS Batch it get the below error -
CannotPullContainerError: Error response from daemon: pull access denied for *dockerimagename*, repository does not exist or may require 'docker login'
What is a possible solution? I am stuck with this for a long time.
I had this issue when I was only putting the image name in the Container Image field of the Job Description. So I was putting:
*dockerimagename*
when I should have been putting:
0123456789.dkr.ecr.us-east-1.amazonaws.com/*dockerimagename*
You can get the first part of that by going to your ECR > Repositories in the AWS console and copying the link from there (there's even a button to do it).

How to redirect Apache Spark logs from the driver and the slaves to the console of the machine that launchs the Spark job using log4j?

I'm trying to build an Apache Spark application that normalizes csv files from HDFS (changes delimiter, fix broken lines). I use log4j for logging but all the logs just print in the executors so the only way i can check them is using yarn logs -applicationId command. Is there any way i can redirect all logs( from driver and from executors) to my gateway node(the one which launchs the spark job) so i can check them during execution?
You should have the executors log4j props configured to write files local to themselves. Streaming back to the driver will cause unnecessary latency in processing.
If you plan on being able to 'tail" the logs in near real-time, you would need to instrument a solution like Splunk or Elasticsearch, and use tools like Splunk Forwarders, Fluentd, or Filebeat that are agents on each box that specifically watch for all configured log paths, and push that data to a destination indexer, that'll parse and extract log field data.
Now, there are other alternatives like Streamsets or Nifi or Knime (all open source), which offer more instrumentation for collecting event processing failures, and effectively allow for "dead letter queues" to handle errors in a specific way. The part I like about those tools - no programming required.
i think it is not possible. When you execute spark in local mode you can able to see it in console. Otherwise you have to alter log4j properties for the log file path.
As per https://spark.apache.org/docs/preview/running-on-yarn.html#configuration,
YARN has two modes for handling container logs after an application has completed. If log aggregation is turned on (with the yarn.log-aggregation-enable config in yarn-site.xml file), container logs are copied to HDFS and deleted on the local machine.
You can also view the container log files directly in HDFS using the HDFS shell or API. The directory where they are located can be found by looking at your YARN configs (yarn.nodemanager.remote-app-log-dir and yarn.nodemanager.remote-app-log-dir-suffix in yarn-site.xml).
I am not sure whether the log aggregation from worker nodes happen in real time !!
There is an indirect way to achieve. Enable the following property in yarn-site.xml.
<property>
<name>yarn.log-aggregation-enable</name>
<value>true</value>
</property>
This will store all your logs of the submitted applications in hdfs location. Then using the following command you can download the logs into a single aggregated file.
yarn logs -applicationId application_id_example > app_logs.txt
I came across this github repo which downloads the driver and container logs separately. Clone this repository : https://github.com/hammerlab/yarn-logs-helpers
git clone --recursive https://github.com/hammerlab/yarn-logs-helpers.git
In your .bashrc (or equivalent), source .yarn-logs-helpers.sourceme:
$ source /path/to/repo/.yarn-logs-helpers.sourceme
Then download the aggregated logs into nicely segregated driver and container logs by this command.
yarn-container-logs application_example_id

Data access Spark EC2

After following instruction to install cluster via ec2 script, i'm not able to correctly launch my .jar because they don't find the data file which i put on /root/persistent-hdfs/ on the master and slave nodes.
I read on an other post that i need to prefix the file location with file:// but it doesn't change anything... I have this error :
Exception in thread "main" org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: file://root/persistent-hdfs/data/ds_1.csv
To launch the job i used the ./bin/spark-submit on the master node, am i correct ?
Thank you in advance for your support.
There are a few things you need to do:
The default configuration uses the ephemeral hdfs so you need to turn that off $ /root/ephemeral-hdfs/bin/stop-all.sh and turn persistent on $ /root/persistent-hdfs/bin/start-all.sh.
Put your file into the persistent hdfs root directory for simplicity $ /root/persistent-hdfs/bin/hadoop fs -put /root/ds_1.csv /ds_1.csv. Now check to see it is actually there $ /root/persistent-hdfs/bin/hadoop fs -ls.
Finally, edit Spark's configuration files in /root/spark/conf/spark-defaults.conf and /root/spark/conf/spark-env.sh and change everything that says ephemeral to persistent.
Assuming you put your csv in the root directory of the persistent hdfs (as we did in step 2) you can access it in spark using val rawData = sc.textFile("/ds_1.csv").
Have fun!
Seeing the code of your job would provide more details.
So far looks like workers cannot access the file on the local file system of the driver.
You need to use hadoop fs -put or -cp command to upload your file to HDFS. So workers will be able access the file with hdfs:// uri.
Since you are running your cluster on EC2 I would suggest to put the file to s3 bucket and use s3://... file uri.