Databricks | how to ship file system logs to DBFS or storage? - scala

I have created an init script that produces custom logs in Databricks. By default the logs are created at the local (driver/worker machine) path logs/log4j-active.log, but how can I ship them to DBFS or external storage?
%sh
ls logs
which gives the following output:
lineage.json
log4j-active.log
log4j-mylog4j-active.log
metrics.json
product.json
stderr
stdout
ttyd_logs
usage.json
I want to copy my log file log4j-mylog4j-active.log to DBFS or Blob Storage; anything would work:
dbutils.fs.cp("logs/log4j-mylog4j-active.log", "dbfs:/cluster-logs/")
I also tried a filesystem copy, but it fails with:
FileNotFoundException: /logs/log4j-active.log
I have also tried creating a folder and specifying its path in the logging section (in the cluster's advanced options), but that didn't work either; I don't know why my file system logs are not being shipped to that DBFS location.
Can anyone help me transfer my file system logs to DBFS or storage?
Thanks in advance!

You just need to enable logging in your cluster configuration (unfold "Advanced options") and specify where the logs should go. By default it's dbfs:/cluster-logs/ (with the cluster ID appended), but you can specify another path.
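If you prefer to copy the file yourself from a notebook, note that dbutils.fs.cp treats a bare path as a DBFS path, so the local driver file needs the file: scheme. A minimal sketch, assuming the logs directory sits in the default driver working directory /databricks/driver:
# copy the custom log from the driver's local disk to DBFS ("file:" marks a local path)
dbutils.fs.cp("file:/databricks/driver/logs/log4j-mylog4j-active.log",
              "dbfs:/cluster-logs/log4j-mylog4j-active.log")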

Related

Reading a JSON file from Azure Data Lake as a file using json.load in Azure Databricks/Synapse notebooks

I am trying to parse JSON data with multiple nested levels. My approach is to give the file name and use open(file_name) to load the data. When I provide the data lake path, it throws an error that the file path is not found. I am able to read the data into dataframes, but how can I read a file from the data lake and open it as a plain file, without converting it to a dataframe?
Current code approach on my local machine, which works:
import json

f = open("File_Name.Json")
data = json.load(f)
Failing scenario when providing the data lake path:
f = open("Datalake path/File_Name.Json")
data = json.load(f)
You need to mount the data lake folder to a location in DBFS (in Databricks), although mounting is a security risk: anyone with access to the Databricks resource will have access to all mounted locations.
Documentation on mounting to dbfs: https://docs.databricks.com/data/databricks-file-system.html#mount-object-storage-to-dbfs
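For reference, a mount call looks roughly like the following. This is only a sketch, assuming ADLS Gen2 accessed with a service principal; every <placeholder> and the /mnt/datalake mount point are values you'd substitute yourself:
configs = {
    "fs.azure.account.auth.type": "OAuth",
    "fs.azure.account.oauth.provider.type": "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
    "fs.azure.account.oauth2.client.id": "<application-id>",
    "fs.azure.account.oauth2.client.secret": dbutils.secrets.get(scope="<scope>", key="<secret-key>"),
    "fs.azure.account.oauth2.client.endpoint": "https://login.microsoftonline.com/<tenant-id>/oauth2/token"
}
# mount the container so its files are visible under /mnt/datalake on the cluster
dbutils.fs.mount(
    source="abfss://<container>@<storage-account>.dfs.core.windows.net/",
    mount_point="/mnt/datalake",
    extra_configs=configs)
Once mounted, the file is reachable as a local path through the /dbfs FUSE mount, e.g. open("/dbfs/mnt/datalake/File_Name.Json").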
The open function works only with local files; it doesn't understand cloud file paths out of the box. You can of course try to mount the cloud storage, but as #ARCrow mentioned, it is a security risk (unless you create a so-called passthrough mount that controls access at the cloud storage level).
But if you're able to read the file into a dataframe, it means the cluster has all the settings necessary for accessing the cloud storage. In that case you can just use the dbutils.fs.cp command to copy the file from cloud storage to the local disk, and then open it with the open function. Something like this:
dbutils.fs.cp("Datalake path/File_Name.Json", "file:///tmp/File_Name.Json")
with open("/tmp/File_Name.Json", "r") as f:
data = json.load(f)

td-agent does not validate Google Cloud service account credentials

I am trying to configure fluentd output with td-agent and the fluent-google-cloud plugin. The plugin and all dependencies are loaded, but fluentd is not outputting to Google Cloud Logging, and the td-agent log states error="Unable to read the credential file specified by GOOGLE_APPLICATION_CREDENTIALS: file /home/$(whoami)/.config/gcloud/service_account_credentials.json does not exist".
However, when I go to the file path, the file does exist, and the $GOOGLE_APPLICATION_CREDENTIALS variable is set to that path as well. What should I do to fix this?
On the assumption that the error and you are both correct, I suspect (!) that you're checking as your user account (== whoami) and finding /home/$(whoami)/.config/gcloud, while the agent is running (under systemd?) as root and not finding the credentials file there (it would look under /root/.config/gcloud instead).
It would be helpful if you included more details about what you've done so that we can better understand the issue.
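If the agent really is running as root (or another system user), one option worth trying is to set the variable in the service's own environment rather than in a user's shell profile. A sketch using a systemd drop-in, where the credentials path is just a placeholder for wherever you keep the key:
# /etc/systemd/system/td-agent.service.d/override.conf (created with: sudo systemctl edit td-agent)
[Service]
Environment=GOOGLE_APPLICATION_CREDENTIALS=/path/to/service_account_credentials.json
Then run sudo systemctl daemon-reload and sudo systemctl restart td-agent so the agent picks up the variable.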

Mount Bucket on Google Storage

I want to mount a Google bucket on a local server. However, when I run the command below, the directory I point it to is empty. Any ideas?
gcsfuse mssng_vcf_files ./mountbucket/
It reports:
File system has been successfully mounted.
but the directory mountbucket/ is empty.
gcsfuse will not show a directory that is only implied by an object with a slash in its name. So if your bucket contains /files/index.txt, it will not show up until you create an object named files/. I am assuming your bucket contains directories and then files; if that is the case, this may be your problem.
gcsfuse supports a flag called --implicit-dirs that changes the behaviour. When this flag is enabled, name lookup requests from the kernel use the GCS API's Objects.list operation to search for objects that would implicitly define the existence of a directory with the name in question. So, in the example above, there would appear to be a directory named "files".
There are some drawbacks, which are documented here:
https://github.com/GoogleCloudPlatform/gcsfuse/blob/master/docs/semantics.md#implicit-directories
So you have two options:
Create the directories in your bucket, which will make your files appear.
Use the --implicit-dirs flag to make them always appear (see the example below).
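For instance, remounting with the flag, using the bucket and mount point from the question, would look like:
gcsfuse --implicit-dirs mssng_vcf_files ./mountbucket/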
Hope this helps.

Can't access non-public directories on local FS in streamsets pipeline creator

I'm new to StreamSets. Following the documentation tutorial, I was getting
FileNotFound: ... HADOOPFS_14 ... (permission denied)
error when trying to set the destination location to a local FS directory and preview the pipeline (basically saying the file either can't be accessed or does not exist), yet the permissions for the directory in question are drwxrwxr-x. 2 mapr mapr. I eventually found a workaround by making the destination folder publicly writable (chmod o+w /path/to/dir). Yet the user that started the sdc service (while I was following the installation instructions) should have had write permissions on that directory (it was root).
I set the sdc user environment variables to use the name "mapr" (the owner of the directories I'm trying to access), so why did I get rejected? What actually happens when I set the environment variables for sdc (because it does not seem to be doing anything)?
This is a snippet of what my /opt/streamsets-datacollector/libexec/sdcd-env.sh file looks like:
# user that will run the data collector, it must exist in the system
#
export SDC_USER=mapr
# group of the user that will run the data collector, it must exist in the system
#
export SDC_GROUP=mapr
So my question is: what determines the permissions for the sdc service (which I assume is what the StreamSets web UI uses to access FS locations)? Any explanation or links to specific documentation would be appreciated. Thanks.
Looking at the output of ps -ef | grep sdc to see who the system thinks the owner of the sdc process really is, I found that it was listed as:
sdc 36438 36216 2 09:04 ? 00:01:28 /usr/bin/java -classpath /opt/streamsets-datacollector
So it seems that editing sdcd-env.sh did not have any effect. What did work was editing the /usr/lib/systemd/system/sdc.service file to look like the following (notice that I have set the user and group to the user that owns the directories used in the StreamSets pipeline):
[Unit]
Description=StreamSets Data Collector (SDC)
[Service]
User=mapr
Group=mapr
LimitNOFILE=32768
Environment=SDC_CONF=/etc/sdc
Environment=SDC_HOME=/opt/streamsets-datacollector
Environment=SDC_LOG=/var/log/sdc
Environment=SDC_DATA=/var/lib/sdc
ExecStart=/opt/streamsets-datacollector/bin/streamsets dc -verbose
TimeoutSec=60
Then restarting the sdc service (with systemctl start sdc, on CentOS 7) showed:
mapr 157013 156955 83 10:38 ? 00:01:08 /usr/bin/java -classpath /opt/streamsets-datacollector...
and I was able to validate and run pipelines with origins and destinations on the local FS that are owned by the user and group set in the sdc.service file.
* NOTE: the specific directories used in the initial post are Hadoop MapR directories mounted via NFS (MapR 6.0), hosted on nodes running CentOS 7 (though the fact that they are NFS mounts should mean that this solution applies generally).
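One small addition: after editing a unit file under /usr/lib/systemd/system, systemd generally needs to re-read it before the change takes effect, so a sequence along these lines is worth running (the ps check confirms which user the process ends up as):
sudo systemctl daemon-reload
sudo systemctl restart sdc
ps -ef | grep sdc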

Failure: GCE credentials requested outside a GCE instance

When I try to copy my files to Google Cloud Storage using
gsutil cp file.gz gs://somebackup
I get this error:
Your "GCE" credentials are invalid. For more help, see "gsutil help creds", or re-run the gsutil config command (see "gsutil help config").
Failure: GCE credentials requested outside a GCE instance.
BTW, this was working until yesterday.
I just ran into this as well and contacted Google support. It's occurring because the instance was created with its Storage access scope set to Read Only, which is visible on the instance details page.
Apparently this can't be changed after the instance is created (!). Our solution was to mount a temp disk, copy the file there, unmount it and then remount it on a second instance (with proper Storage permissions) and do the gsutil copy from there.
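To check which access scopes an instance actually has, you can inspect it from the command line; a sketch where the instance name and zone are placeholders:
gcloud compute instances describe my-instance --zone us-central1-a --format="yaml(serviceAccounts)"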
Try to do a "gcloud auth login" from the command line
I've had this before, and my problem was that I'd set the wrong project.
Make sure you set the project ID, not the project name, when you run:
gcloud config set project <projectID>
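To confirm which account and project gcloud and gsutil are currently using, you can list the active configuration:
gcloud config list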