Synapse Spark exception handling - Can't write to log file - pyspark

I have written PySpark code to hit a REST API, extract the contents in XML format, and write them to Parquet in a data lake container.
I am trying to add logging functionality where I write out not only errors but also updates on the actions/processes we execute.
I am comparatively new to Spark, so I have been relying on online articles and samples. They all explain error handling and logging through "1/0" examples and save the logs in the default folder structure (not in an ADLS account/container/folder), which does not help at all, and most of the code, being written in pure Python, doesn't run as-is.
Could I get some assistance with setting up the following:
Push errors to a log file under a designated folder sitting under a data lake storage account/container/folder hierarchy.
Catching REST specific exceptions.
This is a sample of what I have written:
LogFilepath = "abfss://raw@.dfs.core.windows.net/Data/logging/data.log"
#LogFilepath2 = "adl://.azuredatalakestore.net/raw/Data/logging/data.log"
print(LogFilepath)

try:
    1/0
except Exception as e:
    print('My Error...' + str(e))
    with open(LogFilepath, "a") as f:
        f.write("An error occurred: {}\n".format(e))
I have tried both the ABFSS and ADL file paths with no luck. The log file already exists in the storage account/container/folder.

I have reproduced the above using the abfss path in the open() function, but it gave me the below error.
FileNotFoundError: [Errno 2] No such file or directory: 'abfss://synapsedata@rakeshgen2.dfs.core.windows.net/datalogs.logs'
As per this documentation, we can use open() on an ADLS file with a path like /synfs/{jobId}/mountpoint/{filename}.
For that, we first need to mount the ADLS.
Here I have mounted it using an ADLS linked service. You can instead mount with a storage account access key or SAS as per your requirement.
mssparkutils.fs.mount(
    "abfss://<container_name>@<storage_account_name>.dfs.core.windows.net",
    "/mountpoint",
    {"linkedService": "<ADLS linked service name>"}
)
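As mentioned above, you can also mount with a storage account access key or SAS token instead of a linked service. A minimal sketch, assuming the "accountKey"/"sasToken" configuration keys of the mssparkutils mount API (verify the exact option names for your runtime):
# Hedged sketch: mount using an account key or SAS token instead of a linked service.
# The "accountKey" / "sasToken" keys are assumed from the mssparkutils mount API.
mssparkutils.fs.mount(
    "abfss://<container_name>@<storage_account_name>.dfs.core.windows.net",
    "/mountpoint",
    {"accountKey": "<storage account access key>"}
    # or: {"sasToken": "<SAS token>"}
)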
Now use the below code to achieve your requirement.
from datetime import datetime

currentDateAndTime = datetime.now()
jobid = mssparkutils.env.getJobId()
LogFilepath = '/synfs/' + jobid + '/synapsedata/datalogs.log'
print(LogFilepath)

try:
    1/0
except Exception as e:
    print('My Error...' + str(e))
    with open(LogFilepath, "a") as f:
        f.write("Time : {}- Error : {}\n".format(currentDateAndTime, e))
Here I am writing the date and time along with the error, and there is no need to create the log file first; the above code will create it and append each error.
If you want to generate the logs daily, you can include the date in the log file name as per your requirement.
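For example, a minimal sketch of a date-stamped log file name, assuming the same jobid and mount point as above (the datalogs_YYYY-MM-DD.log name is only an illustration):
from datetime import datetime

# Illustrative daily file name, e.g. datalogs_2023-01-18.log
datestamp = datetime.now().strftime("%Y-%m-%d")
DailyLogFilepath = '/synfs/' + jobid + '/synapsedata/datalogs_' + datestamp + '.log'

try:
    1/0
except Exception as e:
    with open(DailyLogFilepath, "a") as f:
        f.write("Time : {} - Error : {}\n".format(datetime.now(), e))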
My Execution:
Here I have executed it twice.
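To address the other part of the question (catching REST-specific exceptions), here is a minimal sketch assuming the API call is made with the requests library; the URL is a placeholder and LogFilepath is the path built above:
import requests
from datetime import datetime

try:
    # Placeholder URL for the REST API call
    response = requests.get("https://<your-api-endpoint>/data", timeout=30)
    response.raise_for_status()  # raises requests.exceptions.HTTPError on 4xx/5xx
except requests.exceptions.HTTPError as e:
    message = "HTTP error: {}".format(e)
except requests.exceptions.ConnectionError as e:
    message = "Connection error: {}".format(e)
except requests.exceptions.Timeout as e:
    message = "Request timed out: {}".format(e)
except requests.exceptions.RequestException as e:
    message = "Request failed: {}".format(e)
else:
    message = None

if message:
    with open(LogFilepath, "a") as f:
        f.write("Time : {} - {}\n".format(datetime.now(), message))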

Related

Talend - how to configure tFileInputDelimited to not throw an error when the file is not found

Good day,
I am using tFileInputDelimited in Talend Data Studio to read a txt file and get some values from it.
The input file name is something like the following; it contains the day in the file name:
checksum_150123.txt
This file is only created in the last few steps before the job ends, so it does not exist yet at the start.
Thus, every day when the job first runs, the file does not exist, and tFileInputDelimited throws a file-not-found error.
C:\LandingZone\jx\checksum_180123.txt (The system cannot find the file specified)
[ERROR] 14:13:35 my_track.my_precheck_registration_0_1.DL_PRECHECK_REGISTRATION- CollectCheckSum_1_tFileInputDelimited_1 - C:\LandingZone\jx\checksum_180123.txt (The system cannot find the file specified)
I have a requirement not to show this error; may I know how I can configure this?
For that I recommend you use the tFileExist component and then use its Exist variable (for example ((Boolean)globalMap.get("tFileExist_1_EXISTS"))) in a "Run if" trigger.
Hope this answers your question.

Error while loading parquet format file into Amazon Redshift using copy command and manifest file

I'm trying to load a parquet file using a manifest file and getting the below error.
query: 124138 failed due to an internal error. File 'https://s3.amazonaws.com/sbredshift-east/data/000002_0 has an invalid version number: )
Here is my copy command
copy testtable from 's3://sbredshift-east/manifest/supplier.manifest'
IAM_ROLE 'arn:aws:iam::123456789:role/MyRedshiftRole123'
FORMAT AS PARQUET
manifest;
here is my manifest file
{
  "entries": [
    {
      "url": "s3://sbredshift-east/data/000002_0",
      "mandatory": true,
      "meta": {
        "content_length": 1000
      }
    }
  ]
}
I'm able to load the same file using copy command by specifying the file name.
copy testtable from 's3://sbredshift-east/data/000002_0' IAM_ROLE 'arn:aws:iam::123456789:role/MyRedshiftRole123' FORMAT AS PARQUET;
INFO: Load into table 'supplier' completed, 800000 record(s) loaded successfully.
COPY
What could be wrong in my copy statement?
This error happens when the content_length value is wrong. You have to specify the correct content_length, which you can check by running an s3 ls command.
aws s3 ls s3://sbredshift-east/data/
2019-12-27 11:15:19 539 sbredshift-east/data/000002_0
The 539 (file size) should be the same as the content_length value in your manifest file.
I don't know why they require this meta value when you don't need it in the direct copy command.
¯\_(ツ)_/¯
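If you generate the manifest programmatically, here is a minimal sketch (assuming boto3 and the bucket/key from the question) that reads the object size from S3 and writes it as content_length:
import json
import boto3

s3 = boto3.client("s3")

bucket = "sbredshift-east"   # bucket from the question
key = "data/000002_0"        # key from the question

# HeadObject returns the object's size in bytes; use it as content_length
size = s3.head_object(Bucket=bucket, Key=key)["ContentLength"]

manifest = {
    "entries": [
        {
            "url": "s3://{}/{}".format(bucket, key),
            "mandatory": True,
            "meta": {"content_length": size},
        }
    ]
}

with open("supplier.manifest", "w") as f:
    json.dump(manifest, f, indent=2)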
The only way I've gotten the parquet COPY to work with a manifest file is to add the meta key with the content_length.
From what I can gather in my error logs, the COPY command for parquet (with a manifest) might first be reading the files using Redshift Spectrum as an external table. If that's the case, this hidden step does require the content_length, which contradicts their initial statement about COPY commands.
https://docs.amazonaws.cn/en_us/redshift/latest/dg/loading-data-files-using-manifest.html

For some reason, a warning is issued when calling the procedure SYSPROC.ADMIN_CMD ('EXPORT to ...')

I have the following problem:
I am using the following command:
EXPORT TO "D:\ExportFiles\ACTIVATE_DICT.csv" OF DEL MODIFIED BY TIMESTAMPFORMAT="YYYY/MM/DD HH:MM:SS" STRIPLZEROS MESSAGES "D:\ExportFiles\FMessage.txt" SELECT * FROM DB2INST4.ACTIVATE_DICT;
In the Command Editor of the Control Center program, this command successfully exported data from the ACTIVATE_DICT table to the CSV file ACTIVATE_DICT.csv.
But for a number of reasons, I need to execute this command in IBM Data Studio or DataGrip, and there it cannot be executed in this form.
Therefore, I read the following manual
and based on it wrote the following command:
CALL SYSPROC.ADMIN_CMD('EXPORT to /lotus/ExportFiles/ACTIVATE_DICT.csv OF DEL MODIFIED BY TIMESTAMPFORMAT="YYYY/MM/DD HH:MM:SS" STRIPLZEROS MESSAGES /lotus/ExportFiles/FMessage.txt SELECT * FROM DB2INST4.ACTIVATE_DICT');
Here is the message on the result of the command:
[2018-10-11 15:15:23] [ ][3107] There is at least one warning message in the message file.. SQLCODE=3107, SQLSTATE= , DRIVER=4.23.42
[2018-10-11 15:15:23] 1 row retrieved starting from 1 in 75 ms (execution: 29 ms, fetching: 46 ms)
And there is no ACTIVATE_DICT.csv file and no FMessage.txt file in the /lotus/ExportFiles/ directory.
Question: how do I execute this command correctly? Maybe I'm doing something wrong?
sqlcode 3107 is a warning message:
SQL3107W At least one warning message was encountered during LOAD processing.
Explanation
You can load data into a database from a file, tape, or named pipe using the LOAD command. You can specify that any warnings or errors from the LOAD processing be printed to a message file. If no message file is specified, the warnings or errors are printed to standard out (unless the database manager instance is configured as a partitioned-database environment.)
It tells you to read the message log in the message file you specified, in your case /lotus/ExportFiles/FMessage.txt.
Please read the file to see what error is logged, and if you need help understanding what is logged, please post the contents of the file.
This message is returned when at least one warning was received during processing. If a message file is being used, the warnings and errors will be printed there.
This warning does not affect processing.
User response
Review the message file warning.
EXPORT command using the ADMIN_CMD procedure
See the use of the 'MESSAGES ON SERVER' clause, and how to get these messages using the result set returned by this routine in that case.

Databricks dbutils.fs.ls shows files. However, reading them throws an IO error

I am running a Spark Cluster and when I'm executing the below command on Databricks Notebook, it gives me the output:
dbutils.fs.ls("/mnt/test_file.json")
[FileInfo(path=u'dbfs:/mnt/test_file.json', name=u'test_file.json', size=1083L)]
However, when I'm trying to read that file, I'm getting the below mentioned error:
with open("mnt/test_file.json", 'r') as f:
for line in f:
print line
IOError: [Errno 2] No such file or directory: 'mnt/test_file.json'
What might be the issue here? Any help/support is greatly appreciated.
In order to access files on a DBFS mount using local file APIs you need to prepend /dbfs to the path, so in your case it should be
with open('/dbfs/mnt/test_file.json', 'r') as f:
    for line in f:
        print(line)
See more details in the docs at https://docs.databricks.com/data/databricks-file-system.html#local-file-apis, especially regarding limitations. With Databricks Runtime 5.5 and below there's a 2 GB file limit. With 6.0+ there's no longer such a limit, as the FUSE mount has been optimized to deal with larger file sizes.
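As a quick sanity check before reading, you can confirm the FUSE path is visible to local file APIs (a minimal sketch using only the standard library; the path matches the mount in the question):
import os

local_path = "/dbfs/mnt/test_file.json"

# The /dbfs prefix exposes the DBFS mount through FUSE to local file APIs
if os.path.exists(local_path):
    with open(local_path, "r") as f:
        print(f.readline())
else:
    print("Path not visible via /dbfs: {}".format(local_path))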

Spark-SQL: access file in current worker node directory

I need to read a file using spark-sql, and the file is in the current directory.
I use this command to decompress a list of files I have stored on HDFS.
val decompressCommand = Seq(laszippath, "-i", inputFileName , "-o", "out.las").!!
The file is outputted in the current worker node directory, and I know this because executing "ls -a"!! through scala I can see that the file is there. I then try to access it with the following command:
val dataFrame = sqlContext.read.las("out.las")
I assumed that the sql context would try to find the file in the current directory, but it doesn't. Also, it doesn't throw an error but a warning stating that the file could not be found (so spark continues to run).
I attempted to add the file using: sparkContext.addFile("out.las") and then access the location using: val location = SparkFiles.get("out.las") but this didn't work either.
I even ran the command val locationPt = "pwd"!! and then did val fullLocation = locationPt + "/out.las" and attempted to use that value but it didn't work either.
The actual exception that gets thrown is the following:
User class threw exception: org.apache.spark.sql.AnalysisException: cannot resolve 'x' given input columns: [];
org.apache.spark.sql.AnalysisException: cannot resolve 'x' given input columns: []
And this happens when I try to access column "x" from a dataframe. I know that column 'X' exists because I've downloaded some of the files from HDFS, decompressed them locally and ran some tests.
I need to decompress files one by one because I have 1.6TB of data and so I cannot decompress it at one go and access them later.
Can anyone tell me what I can do to access files which are being outputted to the worker node directory? Or maybe should I be doing it some other way?
So I managed to do it now. What I'm doing is saving the file to HDFS and then retrieving it using the SQL context through HDFS. I overwrite "out.las" in HDFS each time so that I don't take up too much space.
I have used the Hadoop API before to get to files; I don't know if it will help you here.
val filePath = "/user/me/dataForHDFS/"
val fs:FileSystem = FileSystem.get(new java.net.URI(filePath + "out.las"), sc.hadoopConfiguration)
I've not tested the below end to end, but it should give an idea of what to do afterward (the byte array needs to be sized to the file length before readFully):
val path = new Path(filePath + "out.las")
var readIn: Array[Byte] = new Array[Byte](fs.getFileStatus(path).getLen.toInt)
val fileIn: FSDataInputStream = fs.open(path)
fileIn.readFully(0, readIn)
fileIn.close()