File paths become hidden/inaccessible when using Kerberos authentication and Livy (via sparkmagic)

I am using this quickstart guide (https://github.com/aws-quickstart/quickstart-hail) to set up EMR with SageMaker.
Due to security requirements, I had to enable Kerberos (a local KDC within the EMR cluster), and I referenced this guide (https://aws.amazon.com/blogs/machine-learning/securing-data-analytics-with-an-amazon-sagemaker-notebook-instance-and-kerberized-amazon-emr-cluster/) for the Kerberos setup.
Everything was working well, except that the bokeh plots cannot be saved due to an access restriction.
I tried to run ls -la / via the SageMaker notebook (through sparkmagic + Livy), but the plot paths /plots and /var/www/html/plots do not show up and cannot be accessed.
However, when running ls -la over SSH on the master node, I am able to see these folders. Changing the permissions with chmod -R 777 /var/www didn't resolve the issue either.
Any idea whether there is a Kerberos/Livy setting that hides or protects certain file paths from Kerberos-authenticated users?

I found out the reason why this is happening.
When Kerberos authentication is enabled on EMR, sparkmagic/Livy starts the Spark context on a core node instead of the master node. The two nodes have separate local filesystems, so paths that were created on the master node (but not on the core nodes) are not visible from the notebook session.
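If the notebook needs those directories to exist on whichever node hosts the Livy session, one workaround is to create them on the core nodes as well. A rough sketch, assuming you can SSH from the master node to the core nodes (the hostnames below are placeholders; on EMR the SSH user is typically hadoop):

# Create the plot directories on each core node, since that is where the
# Kerberized Livy session's Spark driver actually runs.
for host in ip-10-0-0-11 ip-10-0-0-12; do
  ssh hadoop@"$host" 'sudo mkdir -p /plots /var/www/html/plots && sudo chmod -R 777 /plots /var/www/html/plots'
done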

Related

kernelspec not found after setting JUPYTER_PATH

I am working in Google Vertex AI, which has a two-disk system of a boot disk and a data disk, the latter of which is mounted to /home/jupyter. I am trying to expose python venv environments with kernelspec files, and then keep those environments exposed across repeated stop-start cycles. All of the default locations for kernelspec files are on the boot disk, which is ephemeral and recreated each time the VM is started (i.e., the exposed kernels vaporize each time the VM is stopped). Conceptually, I want to use a VM start-up script to add a persistent data disk path to the JUPYTER_PATH variable, since, according to the documentation, "Jupyter uses a search path to find installable data files, such as kernelspecs and notebook extensions." During interactive testing in the Terminal, I have not found this to be true. I have also tried setting the data directory variable, but it does not help.
export JUPYTER_PATH=/home/jupyter/envs
export JUPYTER_DATA_DIR=/home/jupyter/envs
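A quick way to sanity-check whether Jupyter is actually picking these variables up (these commands assume a standard Jupyter install; run them in the environment that launches the Jupyter server, since variables exported only in an interactive Terminal won't reach a server started as a service):

# Print the config/data/runtime search paths Jupyter resolves;
# /home/jupyter/envs should appear under "data:" if JUPYTER_PATH is in effect.
jupyter --paths

# List every kernelspec Jupyter can currently see and where it found it.
jupyter kernelspec list

# Kernelspecs are looked up in a "kernels" subdirectory of each data path,
# so the specs would need to live under /home/jupyter/envs/kernels/<name>/.
ls /home/jupyter/envs/kernels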
I have a beginner's understanding of jupyter and of the important ramifications of using two-disk systems. Could someone please help me understand:
(1) Why is Jupyter failing to search for kernelspec files on the JUPYTER_PATH or in the JUPYTER_DATA_DIR?
(2) If I am mistaken about how the search paths work, what is the best strategy for maintaining virtual environment exposure when Jupyter is installed on an ephemeral boot disk? (Note, I am aware of nb_conda_kernels, which I am specifically avoiding)
A related post focused on the start-up script can be found at this url. Here I am more interested in the general Jupyter + two-disk use case.

GCloud authentication race conditions

I'm trying to avoid race conditions between gcloud / gsutil authentication in different CI/CD jobs that run on the same system, a GitLab Runner on a Mac Mini.
I have tried setting the auth manually with
RUN gcloud auth activate-service-account --key-file="gitlab-runner.json"
RUN gcloud config set project $GCP_PROJECT_ID
in the Dockerfile (in which I perform a download from a Google Cloud Storage bucket).
I'm also using a named gcloud configuration in the bash script that runs the docker command; in the same script I authenticate with
gcloud config configurations activate $TARGET
where $TARGET is a configuration in which I previously ran the two commands above, so the credentials and project are saved in it.
The configurations work fine if I start the CI/CD jobs one after another. But I want to trigger them for all clients at the same time, and that causes race conditions in the gcloud authentication: one of the jobs ends up downloading from the wrong project's bucket.
How can I avoid this race condition? I'm already authenticating before each gsutil command, but it still happens. Do I need something like Cloud Build to separate the runtime environments?
You can use Cloud Build to get separate execution environments, but that is probably overkill for your use case: a Cloud Build worker is an entire VM, which is heavier than you need. Linux containers / Docker can provide the necessary isolation as well.
Make sure that each container you run has its own configuration placed in the path gcloud expects. The issue likely comes from improper volume mounting (all the containers sharing the same location from the host OS). You could mount a directory containing a per-job configuration (unique for each bucket/project) when running the image, or run gcloud config configurations activate in a Dockerfile step (creating image variants for the different buckets, if that's feasible); see the sketch below.
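A minimal sketch of that per-job isolation, assuming the GitLab CI job ID is available as $CI_JOB_ID and using CLOUDSDK_CONFIG (the environment variable gcloud reads to locate its configuration directory); the image name, key file, and bucket are placeholders:

# Give each CI job its own gcloud configuration directory on the host ...
CONFIG_DIR="$HOME/gcloud-config-$CI_JOB_ID"
mkdir -p "$CONFIG_DIR"

# ... and point gcloud inside the container at it, so concurrent jobs
# never overwrite each other's credentials or active project.
docker run --rm \
  -e CLOUDSDK_CONFIG=/gcloud \
  -v "$CONFIG_DIR":/gcloud \
  -v "$PWD/gitlab-runner.json":/keys/gitlab-runner.json:ro \
  my-build-image:latest \
  bash -c "gcloud auth activate-service-account --key-file=/keys/gitlab-runner.json && \
           gcloud config set project $GCP_PROJECT_ID && \
           gsutil cp gs://my-client-bucket/artifact ./artifact"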
Alternatively, and I think this solution might be easier, you can switch from the Cloud SDK distribution to the standalone gsutil distribution. That way you can provide the path to a boto configuration file through an environment variable, and such variables can be set when running a Docker image.
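For example, a sketch assuming the standalone gsutil is installed in the image and that each client has its own boto file (e.g. generated once with gsutil config -e; names are placeholders):

# Select the per-client credentials purely via the environment,
# so parallel containers cannot interfere with each other.
docker run --rm \
  -e BOTO_CONFIG=/boto/client-a.boto \
  -v "$PWD/boto":/boto:ro \
  my-build-image:latest \
  gsutil cp gs://client-a-bucket/artifact ./artifact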

How to copy YARN logs to blob storage automatically using Scala

We have a requirement to automatically download the YARN logs (the ones we normally view over SSH) to blob storage. I found that the YARN logs do get added to the storage account under the /app-logs/user/logs/ etc. path, but they are in a binary format and there is no documented way to convert them into text. So we are trying to run the external command yarn logs -applicationId <application_id> using Scala at the end of our application run to capture the logs and save them to blob storage, but we are facing issues with that. We are looking for a solution to get these logs automatically downloaded to the storage account as part of the Spark pipeline itself.
I tried redirecting the output of the yarn logs command to a temp file and then copying the file from local to blob storage. These commands work fine when I SSH into the head node of the Spark cluster and run them, but they do not work when executed from a Jupyter notebook or the Scala application.
("yarn logs -applicationId application_1561088998595_xxx > /tmp/yarnlog_2.txt") !!
("hadoop dfs -fs wasbs://dev52mss#sahdimssperfdev.blob.core.windows.net -copyFromLocal /tmp/yarnlog_2.txt /tmp/") !!
When I run these commands from the Jupyter notebook, the first command works fine to redirect to a local file, but the second one, which copies the file to blob storage, fails with the following error:
warning: there was one feature warning; re-run with -feature for details
java.lang.RuntimeException: Nonzero exit value: 1
at scala.sys.package$.error(package.scala:27)
at scala.sys.process.ProcessBuilderImpl$AbstractBuilder.slurp(ProcessBuilderImpl.scala:132)
at scala.sys.process.ProcessBuilderImpl$AbstractBuilder.$bang$bang(ProcessBuilderImpl.scala:102)
... 56 elided
Initially I tried capturing the output of the command as a DataFrame and writing the DataFrame to blob storage. It succeeded for small logs, but for huge logs it failed with the following error:
Serialized task 15:0 was 137500581 bytes, which exceeds max allowed: spark.rpc.message.maxSize (134217728 bytes). Consider increasing spark.rpc.message.maxSize or using broadcast variables for large values
import scala.sys.process._   // needed for Process(...).!!
import spark.implicits._     // needed for .toDF() outside spark-shell
val yarnLog = Seq(Process("yarn logs -applicationId " + "application_1560960859861_0003").!!).toDF()
yarnLog.write.mode("overwrite").text("wasbs://container#storageAccount.blob.core.windows.net/Dev/Logs/application_1560960859861_0003.txt")
Note: You can directly access the log files using Azure Storage => Blobs => Select Container => app logs
Azure HDInsight stores its log files both in the cluster file system and in Azure storage. You can examine log files in the cluster by opening an SSH connection to the cluster and browsing the file system, or by using the Hadoop YARN Status portal on the remote head node server. You can examine the log files in Azure storage using any of the tools that can access and download data from Azure storage.
Examples are AzCopy, CloudXplorer, and the Visual Studio Server Explorer. You can also use PowerShell and the Azure Storage Client libraries, or the Azure .NET SDKs, to access data in Azure blob storage.
For more details, refer to "Manage logs for Azure HDInsight cluster".
Hope this helps.
Currently, you will need to use the 'yarn logs' command to view Yarn logs.
As regards your requirement, there are two methods to achieve this:
Method 1:
Schedule a daily copy of the app-logs folder into a desired container within the blob storage. This does a differential copy every day at a specific time. I used Azure Data Factory to achieve the scheduling; it is quite easy, with no manual copying or coding required.
However, because the YARN application logs are stored in the TFile binary format and can only be read using the 'yarn logs' command, you will need another tool to read the files from the destination later on. You can use this tool to read them: https://github.com/shanyu/hadooplogparser
Alternatively, you can have your own simple script that converts the logs to a readable file before the transfer. Sample script below:
yarn logs -applicationId application_15645293xxxxx > /tmp/source/applog_back.txt
hadoop dfs -fs wasbs://hdiblob@sandboxblob.blob.core.windows.net -copyFromLocal /tmp/source/applog_back.txt /tmp/destination
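If the copy has to happen from the Spark application itself, one variation (a sketch only; the container, storage account, and script path are placeholders, and it assumes the job's user may run yarn and hdfs on the node where the driver runs) is a small helper script that streams the converted log straight into blob storage without a temp file:

#!/usr/bin/env bash
# copy_yarn_logs.sh <applicationId>  (hypothetical helper script)
APP_ID="$1"

# 'yarn logs' converts the aggregated TFile logs to text; 'hdfs dfs -put -f -'
# reads from stdin and writes directly to the wasbs destination.
yarn logs -applicationId "$APP_ID" \
  | hdfs dfs -put -f - "wasbs://container@account.blob.core.windows.net/Dev/Logs/${APP_ID}.txt"

From Scala this could be called at the end of the run with scala.sys.process, passing the command and its arguments as a sequence (e.g. Seq("bash", "/path/copy_yarn_logs.sh", appId).!), which avoids the shell-quoting and redirection pitfalls of the bare string form.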
Method 2:
This is the simplest and cheapest method. You can disable the retention period of the YARN application logs, which means the logs will be retained indefinitely. To do this, change the config "yarn.log-aggregation.retain-seconds" to the value -1. This config can be found in yarn-site.xml.
Once this is done, you can always read your Yarn Applications logs anytime from the cluster using the Yarn UI or CLI.
Hope this helps

Data access Spark EC2

After following the instructions to install a cluster via the EC2 script, I'm not able to launch my .jar correctly, because it can't find the data file that I put in /root/persistent-hdfs/ on the master and slave nodes.
I read in another post that I need to prefix the file location with file://, but it doesn't change anything. I get this error:
Exception in thread "main" org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: file://root/persistent-hdfs/data/ds_1.csv
To launch the job I used ./bin/spark-submit on the master node. Is that correct?
Thank you in advance for your support.
There are a few things you need to do:
1. The default configuration uses the ephemeral HDFS, so turn that off ($ /root/ephemeral-hdfs/bin/stop-all.sh) and turn the persistent one on ($ /root/persistent-hdfs/bin/start-all.sh).
2. Put your file into the persistent HDFS root directory for simplicity: $ /root/persistent-hdfs/bin/hadoop fs -put /root/ds_1.csv /ds_1.csv. Then check that it is actually there: $ /root/persistent-hdfs/bin/hadoop fs -ls.
3. Finally, edit Spark's configuration files, /root/spark/conf/spark-defaults.conf and /root/spark/conf/spark-env.sh, and change everything that says ephemeral to persistent.
Assuming you put your CSV in the root directory of the persistent HDFS (as in step 2), you can access it in Spark using val rawData = sc.textFile("/ds_1.csv").
Have fun!
Seeing the code of your job would provide more details.
So far it looks like the workers cannot access the file on the driver's local file system.
You need to use the hadoop fs -put or -cp command to upload your file to HDFS; the workers will then be able to access it with an hdfs:// URI.
Since you are running your cluster on EC2, I would suggest putting the file in an S3 bucket and using an s3://... file URI, as sketched below.
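A minimal sketch of that flow (bucket and path names are placeholders, the AWS CLI is assumed to be available for the S3 copy, and the exact S3 URI scheme to use from Spark depends on your Hadoop version):

# Upload the file into HDFS so every worker can reach it ...
hadoop fs -put /root/ds_1.csv /data/ds_1.csv
# ... then read it in the job with sc.textFile("hdfs:///data/ds_1.csv")

# Or stage it in S3 instead and read it with an s3:// style URI:
aws s3 cp /root/ds_1.csv s3://my-bucket/data/ds_1.csv
# ... then sc.textFile("s3://my-bucket/data/ds_1.csv") (or s3n:// on older Hadoop)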

How can I deploy Puppet Master configuration files from my build server?

I have a (RedHat) Puppet Master server, with Puppet Master's configuration files in /etc/puppet.
I've placed the entire contents of /etc/puppet into source control and would like my CI server (TeamCity on Windows) to be able to deploy changes to the Puppet Master server.
How can I accomplish this?
My idea is to use scp, but copying to /etc/puppet would require sudo privileges, and I would like to keep the setup simple.
If there are any alternative or better ways of deploying puppet master configuration files, those answers would also be helpful.
It's unlikely that the whole of /etc/puppet should be subject to CI.
It might be more appropriate to move your $manifestdir and $modulepath locations outside that tree and make a CI account their owner. Just be careful to keep them readable by the puppet user.
This way, you could rely on SSH without too much of a security hole (although opening up your manifests for writing is always risky), and avoid the need to make the master configuration writable by a non-root user.
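A minimal sketch of that layout, assuming a hypothetical /srv/puppet tree owned by a dedicated deploy account and an rsync/ssh client available on the TeamCity build agent (directory names and the deploy user are placeholders):

# On the master: move manifests and modules out of /etc/puppet and hand
# ownership to the CI deploy account, while keeping them readable by puppet.
sudo mkdir -p /srv/puppet/manifests /srv/puppet/modules
sudo chown -R deploy:puppet /srv/puppet
sudo chmod -R g+rX /srv/puppet

# Point the master at the new locations in puppet.conf ([master] section):
#   manifestdir = /srv/puppet/manifests
#   modulepath  = /srv/puppet/modules

# From CI, deploy over plain SSH as the unprivileged deploy user:
rsync -az --delete manifests/ deploy@puppetmaster:/srv/puppet/manifests/
rsync -az --delete modules/   deploy@puppetmaster:/srv/puppet/modules/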