Can't access non-public directories on local FS in streamsets pipeline creator - streamsets

I'm new to StreamSets. While following the documentation tutorial, I was getting this error:
FileNotFound: ... HADOOPFS_14 ... (permission denied)
when trying to set the destination location to a local FS directory and preview the pipeline (basically saying that either the file can't be accessed or it does not exist), yet the permissions on the directory in question are drwxrwxr-x. 2 mapr mapr. Eventually I found a workaround by making the destination folder publicly writable (chmod o+w /path/to/dir). Yet the user that started the sdc service while I was following the installation instructions (root) should have had write permissions on that directory.
I set the sdc user env vars to use the name "mapr" (the owner of the directories I'm trying to access), so why was I rejected? What happens when I set the env vars for sdc (because it does not seem to do anything)?
This is a snippet of what my /opt/streamsets-datacollector/libexec/sdcd-env.sh file looks like:
# user that will run the data collector, it must exist in the system
#
export SDC_USER=mapr
# group of the user that will run the data collector, it must exist in the system
#
export SDC_GROUP=mapr
So my question is: what determines the permissions of the sdc service (which I assume is what the StreamSets web UI uses to access FS locations)? Any explanation or links to specific documentation would be appreciated. Thanks.

Looking at the output of ps -ef | grep sdc to see which user the system thinks really owns the sdc process, I found that it was listed as:
sdc 36438 36216 2 09:04 ? 00:01:28 /usr/bin/java -classpath /opt/streamsets-datacollector
So it seems that editing sdcd-env.sh did not have any effect. What did work was editing the /usr/lib/systemd/system/sdc.service file to look like this (note that I have set User and Group to the user that owns the directories used in the StreamSets pipeline):
[Unit]
Description=StreamSets Data Collector (SDC)
[Service]
User=mapr
Group=mapr
LimitNOFILE=32768
Environment=SDC_CONF=/etc/sdc
Environment=SDC_HOME=/opt/streamsets-datacollector
Environment=SDC_LOG=/var/log/sdc
Environment=SDC_DATA=/var/lib/sdc
ExecStart=/opt/streamsets-datacollector/bin/streamsets dc -verbose
TimeoutSec=60
Then restarting the sdc service (with systemctl start sdc, on CentOS 7) showed:
mapr 157013 156955 83 10:38 ? 00:01:08 /usr/bin/java -classpath /opt/streamsets-datacollector...
and I was able to validate and run pipelines with origins and destinations on the local FS that are owned by the user and group set in the sdc.service file.
* NOTE: the specific directories used in the initial post are hadoop-mapr directories mounted via NFS (mapr 6.0) (though the fact that they are NFS should mean that this solution should apply generally) hosted on nodes running centos 7.
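The behaviour described above matches the classic Unix permission check: the process ran as sdc, which is neither the owner (mapr) nor a member of the mapr group, so only the "other" bits (r-x) applied. Below is a minimal Python sketch of that check. It assumes plain mode bits only (no ACLs, no root override, no NFS ID mapping), and the numeric IDs are made up for illustration.

```python
import stat


def can_write(mode, owner_uid, owner_gid, uid, gids):
    """Simplified Unix write check: owner bits, then group bits, then other bits."""
    if uid == owner_uid:
        return bool(mode & stat.S_IWUSR)
    if owner_gid in gids:
        return bool(mode & stat.S_IWGRP)
    return bool(mode & stat.S_IWOTH)


# Hypothetical IDs: mapr is uid/gid 5000, sdc is uid/gid 990.
MAPR_UID, MAPR_GID, SDC_UID, SDC_GID = 5000, 5000, 990, 990

# drwxrwxr-x (0o775): sdc falls through to the "other" bits and is denied.
print(can_write(0o775, MAPR_UID, MAPR_GID, SDC_UID, {SDC_GID}))    # False
# After chmod o+w (0o777) the same process is allowed.
print(can_write(0o777, MAPR_UID, MAPR_GID, SDC_UID, {SDC_GID}))    # True
# Running the service as mapr works without widening the permissions.
print(can_write(0o775, MAPR_UID, MAPR_GID, MAPR_UID, {MAPR_GID}))  # True
```

This is why changing User and Group in the unit file fixes the problem without making the directory world-writable.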

Related

Databricks | how to ship file system logs to DBFS or storage?

I have created an init script that helps me get custom logs in Databricks. By default the log gets created at a local (driver/worker machine) path, log/log4j-active.log, but how can I ship it to DBFS or storage?
%sh
ls logs
I get the output below:
lineage.json
log4j-active.log
log4j-mylog4j-active.log
metrics.json
product.json
stderr
stdout
ttyd_logs
usage.json
I want to copy my log file log4j-mylog4j-active.log to DBFS or blob storage; either would work.
dbutils.fs.cp("logs/log4j-mylog4j-active.log", "dbfs:/cluster-logs/")
I am also trying a filesystem copy, but it fails with:
FileNotFoundException: /logs/log4j-active.log
I have also tried creating a folder and specifying its path in the logging settings (in the cluster's advanced options), but that didn't work either; I don't know why my FS logs are not being shipped to that DBFS location.
How can I transfer my FS logs to DBFS or storage?
Thanks in advance!
You just need to enable logging in your cluster configuration (unfold "Advanced options") and specify where the logs should go. By default it's dbfs:/cluster-logs/ (the cluster ID will be appended to it), but you can specify another path.
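If you still want to copy a single file by hand, note that dbutils.fs resolves a path without a scheme against DBFS, so a driver-local file has to be addressed with a file: prefix. Here is a minimal sketch, assuming the notebook's working directory is the one that contains the logs folder. local_file_uri is a hypothetical helper, and the dbutils call is left commented out because dbutils only exists inside a Databricks notebook.

```python
import os


def local_file_uri(path):
    """Turn a driver-local path into a file: URI that dbutils.fs accepts."""
    return "file:" + os.path.abspath(path)


# Relative paths are resolved against the current working directory.
src = local_file_uri("logs/log4j-mylog4j-active.log")
print(src)

# Inside a Databricks notebook (not runnable elsewhere):
# dbutils.fs.cp(src, "dbfs:/cluster-logs/log4j-mylog4j-active.log")
```

A manual copy like this is a one-off snapshot; the cluster log delivery setting is what keeps shipping logs continuously.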

td-agent does not validate google cloud service account credentials

Trying to configure fluentd output with td-agent and the fluent-google-cloud plugin. The plugin and all dependencies are loaded but fluentd is not outputting to google cloud logging and the td-agent log states error="Unable to read the credential file specified by GOOGLE_APPLICATION_CREDENTIALS: file /home/$(whoami)/.config/gcloud/service_account_credentials.json does not exist".
However, when I go to the file path, the file does exist, and the $GOOGLE_APPLICATION_CREDENTIALS variable is set to the file path as well. What should I do to fix this?
Assuming that both the error and you are correct, I suspect (!) that you're using your user account (== whoami) and finding the file under /home/$(whoami)/.config/gcloud, while the agent is running (under systemctl?) as root and not finding the credentials file in its own home (perhaps /root/.config/gcloud).
It would be helpful if you included more details as to what you've done in order that we can better understand the issue.
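One way to test that theory is to run a small check as the same user the agent runs as (for example via sudo -u). Below is a minimal Python sketch, assuming a Unix host; check_credentials is a hypothetical helper and not part of fluentd or the Google plugin.

```python
import os


def check_credentials(env=None):
    """Report who is running and whether the credentials path is visible to them."""
    env = os.environ if env is None else env
    path = env.get("GOOGLE_APPLICATION_CREDENTIALS")
    return {
        "euid": os.geteuid(),                # effective user ID of this process
        "home": os.path.expanduser("~"),     # home directory as this user sees it
        "path": path,
        "exists": bool(path) and os.path.isfile(path),
        "readable": bool(path) and os.access(path, os.R_OK),
    }


print(check_credentials())
```

If the variable is unset or the file is not visible when this runs as the service user, the agent's environment differs from your login shell's.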

Azure batch Application package not getting copied to Working Directory of Task

I have created an Azure Batch pool with Linux machines and specified an Application Package for the pool.
My command line is:
command='python $AZ_BATCH_APP_PACKAGE_scriptv1_1/tasks/XXX/get_XXXXX_data.py',
and the task fails with:
python3: can't open file '$AZ_BATCH_APP_PACKAGE_scriptv1_1/tasks/XXX/get_XXXXX_data.py':
[Errno 2] No such file or directory
When I connect to the node and look at the working directory, none of the Application Package files are present there.
How do I make sure that the files from the Application Package are available in the working directory, or that I can invoke/execute files under the Application Package from the command line?
Make sure that your async operations have a proper await in place before you start using the package in your code.
Also, please share your design or pseudo-code and how you are approaching it.
Further to add:
This seems to be a pool-level package.
The error suggests that the application environment variable is either used incorrectly or there is some other user-level issue. Please check the link below, especially the section that covers the use of the environment variable.
This looks like a user-level issue because, if downloading the package resource had failed, the error would be visible to you through an exception handler, at the tool level if you are using Batch Explorer \ Batch-labs, or through code-level exception handling.
https://learn.microsoft.com/en-us/azure/batch/batch-application-packages
Reason \ Rationale:
If the pool-level or task-level application package has an error, an error list comes back: the failure is returned as a UserError or an AppPackageError, which will be visible in your code's exception handling.
Key point: you can always RDP into your node and check whether the package is available; information here: https://learn.microsoft.com/en-us/azure/batch/batch-api-basics#connecting-to-compute-nodes
I once created a small sample to help people, so this resource might help you check out the usage here.
Hope the rest helps.
On Linux, the application package with version string is formatted as:
AZ_BATCH_APP_PACKAGE_{0}_{1}
On Windows it is formatted as:
AZ_BATCH_APP_PACKAGE_APPLICATIONID#version
Where 0 is the application name and 1 is the version.
$AZ_BATCH_APP_PACKAGE_scriptv1_1 will take you to the root folder where the application was unzipped.
Does this "exact" path exist in that location?
tasks/XXX/get_XXXXX_data.py
You can see more information here:
https://learn.microsoft.com/en-us/azure/batch/batch-application-packages
Edit: Just saw this question: "or can I invoke/execute files under Application Package from command line"
Yes you can invoke and execute files from the application package directory with the environment variable above.
If you type env on the node you will see the environment variables that have been set.
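The error message in the question prints the variable name literally, which suggests the command line was not run under a shell. Batch does not start one for you, so environment variables are only expanded if you invoke a shell explicitly. Below is a minimal sketch of wrapping the command; wrap_in_shell is a hypothetical helper, not part of the Batch SDK.

```python
import shlex


def wrap_in_shell(command):
    """Run the command through bash so $VARIABLES are expanded on the node."""
    # shlex.quote single-quotes the command, so the local side leaves $... untouched
    # and the bash process on the node performs the expansion.
    return "/bin/bash -c " + shlex.quote(command)


task_command = wrap_in_shell(
    "python $AZ_BATCH_APP_PACKAGE_scriptv1_1/tasks/XXX/get_XXXXX_data.py"
)
print(task_command)
# /bin/bash -c 'python $AZ_BATCH_APP_PACKAGE_scriptv1_1/tasks/XXX/get_XXXXX_data.py'
```

The resulting string would be passed as the task's command line in place of the bare python invocation.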

Capistrano 3 move log directory

How can I change the directory where Capistrano puts its log files? I could not find it in the docs.
Currently the logs appear in myapp/log/... on my dev machine. However, since I am using Laravel, which has a log directory at myapp/storage/logs, I would like Capistrano's logs to appear there as well.
Do you mean the capistrano.log file that is created and appended to whenever you deploy?
You can specify the location by adding the following to deploy.rb:
set :format_options, log_file: "storage/logs/capistrano.log"
This tells Airbrussh (the default logging implementation in Capistrano 3.5.0+) where to place the log file. More information here: https://github.com/mattbrictson/airbrussh#configuration

gsutil make bucket command [gsutil mb] is not working

I am trying to create a bucket using gsutil mb command:
gsutil mb -c DRA -l US-CENTRAL1 gs://some-bucket-to-my-gs
But I am getting this error message:
Creating gs://some-bucket-to-my-gs/...
BadRequestException: 400 Invalid argument.
I am following the documentation from here
What is the reason for this type of error?
I got the same error. It was because I used the wrong location.
The location parameter expects a region, without specifying which zone.
E.g.
gsutil mb -p ${TF_ADMIN} -l europe-west1-b gs://${TF_ADMIN}
should have been
gsutil mb -p ${TF_ADMIN} -l europe-west1 gs://${TF_ADMIN}
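A script that builds the command could guard against passing a zone by mistake. The sketch below is a heuristic only: it assumes zone names have the shape region-letter (such as europe-west1-b), and region_from_location is a hypothetical helper rather than anything gsutil provides.

```python
import re

# Heuristic only: zone names look like "<region>-<letter>", e.g. "europe-west1-b".
_ZONE_RE = re.compile(r"^([a-z]+-[a-z]+\d+)-[a-z]$")


def region_from_location(location):
    """Return the region if a zone was passed, else the location unchanged."""
    match = _ZONE_RE.match(location.lower())
    return match.group(1) if match else location


print(region_from_location("europe-west1-b"))  # europe-west1
print(region_from_location("europe-west1"))    # europe-west1
print(region_from_location("EU"))              # EU
```

Regions and multi-region names such as EU pass through unchanged, so the helper is safe to apply to any -l value.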
One reason this error can occur (confirmed in chat with the question author) is that you have an invalid default_project_id configured in your .boto file. Ensure that the ID matches your project ID in the Google Developers Console.
If you can make a bucket successfully using the Google Developers Console, but not using "gsutil mb", this is a good thing to check.
I was receiving the same error for the same command while using gsutil as well as the web console. Interestingly enough, changing my bucket name from "google-gatk-test" to "gatk" allowed the request to go through. The original name does not appear to violate bucket naming conventions.
Playing with the bucket name is worth trying if anyone else is running into this issue.
Got this error, and adding the default_project_id to the .boto file didn't work.
It took me some time, but in the end I deleted the credentials file from the "Global Config" directory and recreated the account.
Using it on Windows, btw...
This can happen if you are logged into the management console (storage browser), possibly a locking/contention issue.
May be an issue if you add and remove buckets in batch scripts.
In particular this was happening to me when creating regionally diverse (non DRA) buckets :
gsutil mb -l EU gs://somebucket
Also watch underscores, the abstraction scheme seems to use them to map folders. All objects in the same project are stored at the same level (possibly as blobs in an abstracted database structure).
You can see this when downloading from the browser interface (at the moment anyway).
An object copied to gs://somebucket/home/crap.txt might be downloaded via a browser (or curl) as home_crap.txt. As an aside (red herring), somefile.tar.gz can come down as somefile.tar.gz.tar, so a little renaming may be required due to the vagaries of the headers returned from the browser interface anyway. Min real support level is still $150/mth.
I had this same issue when I created my bucket using the following variables:
MY_BUCKET_NAME_1=quiceicklabs928322j22df
MY_BUCKET_NAME_2=MY_BUCKET_NAME_1
MY_REGION=us-central1
But when I added the dollar sign $ so that the second assignment read MY_BUCKET_NAME_2=$MY_BUCKET_NAME_1, the error cleared and I was able to create the bucket. Without the $, the variable holds the literal text MY_BUCKET_NAME_1 rather than the bucket name.
I got this error when I had capital letter in the bucket name
$gsutil mb gs://CLIbucket-anu-100000
Creating gs://CLIbucket-anu-100000/...
BadRequestException: 400 Invalid bucket name: 'CLIbucket-anu-100000'
$gsutil mb -l ASIA-SOUTH1 -p single-archive-352211 gs://clibucket-anu-100
Creating gs://clibucket-anu-100/..
$
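Several of the cases in this thread can be caught before calling gsutil by checking the name locally. The sketch below implements a simplified subset of the documented Cloud Storage bucket naming rules: 3 to 63 characters, only lowercase letters, digits, dashes, underscores and dots, starting and ending with a letter or digit, not starting with "goog" and not containing "google". It leaves out the extra rules for dotted names and IP-address-like names, and is_valid_bucket_name is a hypothetical helper.

```python
import re

# 3-63 characters, lowercase alphanumerics plus . _ -, alphanumeric at both ends.
_NAME_RE = re.compile(r"^[a-z0-9][a-z0-9._-]{1,61}[a-z0-9]$")


def is_valid_bucket_name(name):
    """Check a bucket name against a simplified subset of the naming rules."""
    if not _NAME_RE.match(name):
        return False
    if name.startswith("goog") or "google" in name:
        return False
    return True


for candidate in ("CLIbucket-anu-100000", "clibucket-anu-100",
                  "google-gatk-test", "gatk", "MY_BUCKET_NAME_1"):
    print(candidate, is_valid_bucket_name(candidate))
```

Under these assumptions the uppercase name, the literal MY_BUCKET_NAME_1 and google-gatk-test are all rejected, while clibucket-anu-100 and gatk pass.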