PySpark in AWS Glue: skip bad files

I am using PySpark in AWS Glue to ETL about 100K S3 files; however, I don't have permission to read a few dozen of them.
I used the following code:
datasource0 = glueContext.create_dynamic_frame_from_options(
    "s3",
    {'paths': ["s3://mykkkk-test"],
     'recurse': True,
     'groupFiles': 'inPartition',
     'groupSize': '10485760'},
    format="json",
    transformation_ctx="datasource0")
## #type: toDF
## #args: []
## #return: df
## #inputs: [frame = datasource0]
df = datasource0.toDF()
It says
An error occurred while calling o70.toDF. java.lang.RuntimeException: java.lang.reflect.InvocationTargetException
...
Caused by: java.io.FileNotFoundException: No such file or directory
s3://mykkkk-test/1111/2222/3333.json
I don't have permission to read 3333.json, and that single file stops the entire job.
Is there a way to catch the exception, skip those files, and let the script continue handling the other files?

No, you cannot. This is because Spark assumes it can access all of the data files in the folder you specify as the source. Your best option would be to identify beforehand the list of files you have access to, move them to a different folder, and then read the data from there.
Alternatively, get the list of files you have access to and then read each file individually in a loop, as in the sketch below.
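For example, a minimal sketch of the pre-filtering approach, assuming boto3 is available in the Glue job, that the bucket is the mykkkk-test bucket from the question, and that a HeadObject call is denied for exactly the objects you cannot read (all other names here are illustrative):
import boto3
from botocore.exceptions import ClientError

s3 = boto3.client("s3")
bucket = "mykkkk-test"  # assumption: same bucket as in the question

# Keep only the objects we are actually allowed to read
readable_paths = []
paginator = s3.get_paginator("list_objects_v2")
for page in paginator.paginate(Bucket=bucket):
    for obj in page.get("Contents", []):
        try:
            s3.head_object(Bucket=bucket, Key=obj["Key"])  # raises ClientError (403) if access is denied
            readable_paths.append("s3://{}/{}".format(bucket, obj["Key"]))
        except ClientError:
            pass  # skip files we don't have permission for

# Pass only the readable files to Glue
datasource0 = glueContext.create_dynamic_frame_from_options(
    "s3",
    {'paths': readable_paths,
     'groupFiles': 'inPartition',
     'groupSize': '10485760'},
    format="json",
    transformation_ctx="datasource0")
With ~100K objects this listing pass adds a lot of HeadObject calls, so checking access per prefix rather than per object may be preferable.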

Related

Synapse Spark exception handling - Can't write to log file

I have written PySpark code to hit a REST API, extract the contents in XML format, and later write them to Parquet in a data lake container.
I am trying to add logging functionality where I write out not only errors but also updates on the actions/processes we execute.
I am comparatively new to Spark and have been relying on online articles and samples. They all explain error handling and logging through "1/0" examples and save the logs in the default folder structure (not in an ADLS account/container/folder), which does not help at all. Most of the code, written in pure Python, doesn't run as-is.
Could I get some assistance with setting up the following:
Pushing errors to a log file under a designated folder in a data lake storage account/container/folder hierarchy.
Catching REST-specific exceptions.
This is a sample of what I have written:
LogFilepath = "abfss://raw@.dfs.core.windows.net/Data/logging/data.log"
#LogFilepath2 = "adl://.azuredatalakestore.net/raw/Data/logging/data.log"
print(LogFilepath)
try:
    1/0
except Exception as e:
    print('My Error...' + str(e))
    with open(LogFilepath, "a") as f:
        f.write("An error occurred: {}\n".format(e))
I have tried both the ABFSS and ADL file paths with no luck. The log file already exists in the storage account/container/folder.

I have reproduced the above using the abfss path in the open() function, but it gave me the below error.
FileNotFoundError: [Errno 2] No such file or directory: 'abfss://synapsedata@rakeshgen2.dfs.core.windows.net/datalogs.logs'
As per this documentation, we can use open() on an ADLS file with a path like /synfs/{jobId}/mountpoint/{filename}.
For that, we first need to mount the ADLS container.
Here I have mounted it using an ADLS linked service; you can mount either with a storage account access key or a SAS token, as per your requirement.
mssparkutils.fs.mount(
    "abfss://<container_name>@<storage_account_name>.dfs.core.windows.net",
    "/mountpoint",
    {"linkedService": "<ADLS linked service name>"}
)
Now use the below code to achieve your requirement.
from datetime import datetime

currentDateAndTime = datetime.now()
jobid = mssparkutils.env.getJobId()
LogFilepath = '/synfs/' + jobid + '/synapsedata/datalogs.log'
print(LogFilepath)
try:
    1/0
except Exception as e:
    print('My Error...' + str(e))
    with open(LogFilepath, "a") as f:
        f.write("Time : {}- Error : {}\n".format(currentDateAndTime, e))
Here I am writing the date and time along with the error, and there is no need to create the log file first; the above code will create the file and append the error.
If you want to generate logs daily, you can build date-based log file names as per your requirement, for example:
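A minimal sketch of a date-based file name, reusing the mount point from the snippet above (the naming pattern itself is just an illustration):
from datetime import datetime

# One log file per day, e.g. .../datalogs_2024-01-31.log
jobid = mssparkutils.env.getJobId()  # same helper as above; available in Synapse notebooks
today = datetime.now().strftime('%Y-%m-%d')
LogFilepath = '/synfs/' + jobid + '/synapsedata/datalogs_' + today + '.log'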
My execution:
Here I have executed it twice.

ERROR AzureNativeFileSystemStore: DirectoryIsNotEmpty

I am trying to execute this code in Azure HDInsight. I have a Spark cluster that is connected to Data Lake Storage.
spark.conf.set(
"fs.azure.sas.data.spmdevsharedstorage.blob.core.windows.net",
"xxxxxxxxxxx key xxxxxxxxxxx"
)
val shared_data = "wasbs://data@spmdevsharedstorage.blob.core.windows.net/"
//Read Csv
val dfCsv = spark.read.option("inferSchema", "true").option("header", true).csv(shared_data + "/test/4G-pixel.csv")
val dfCsv_final_withcolumn = dfCsv.select($"latitude",$"longitude")
val dfCsv_final = dfCsv_final_withcolumn.withColumn("new_latitude",col("latitude")*100)
//write
dfCsv_final.coalesce(1).write.format("com.databricks.spark.csv").option("header", "true").mode("overwrite").save(shared_data + "/test/4G-pixel_edit.csv")
The code reads the CSV file fine, but when it writes the new CSV file I see the following error:
20/04/03 14:58:12 ERROR AzureNativeFileSystemStore: Encountered Storage Exception for delete on Blob: https://spmdevsharedstorage.blob.core.windows.net/data/test/4G-pixel_edit.csv/_temporary/0, Exception Details: This operation is not permitted on a non-empty directory. Error Code: DirectoryIsNotEmpty
org.apache.hadoop.fs.azure.AzureException: com.microsoft.azure.storage.StorageException: This operation is not permitted on a non-empty directory.
at org.apache.hadoop.fs.azure.AzureNativeFileSystemStore.delete(AzureNativeFileSystemStore.java:2627)
at org.apache.hadoop.fs.azure.AzureNativeFileSystemStore.delete(AzureNativeFileSystemStore.java:2637)
The new CSV file is written to the Data Lake, but the code stops. I need to avoid this error.
How can I fix it?

I faced a similar issue.
I resolved it by using the configuration below; set this to true.
--conf spark.hadoop.mapreduce.fileoutputcommitter.cleanup.skipped=true
or
spark.conf.set("spark.hadoop.mapreduce.fileoutputcommitter.cleanup.skipped","true")

Scala Spark - Overwrite parquet File on HDFS

I was trying to append a DataFrame to an existing Parquet file and found the option to set the save mode to append. But when I try to append, it throws an error that the path is not a directory.
data.coalesce(1).write.mode(SaveMode.Append).parquet("/user/root/AppendTest");
Caused by: org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.security.AccessControlException): Permission denied: user=root, access=EXECUTE, inode="/user/root/AppendTest":root:root:-rw-r--r-- (Ancestor /user/root/AppendTest is not a directory).
P.S.: When the new file was first created, it was generated inside a folder, and I then renamed it to the desired file name.
I have checked How to overwrite the output directory in Spark, but that doesn't solve my problem here. I have tried the approaches mentioned in that question (the issue described there is also different).

Spark-SQL: access file in current worker node directory

I need to read a file using spark-sql, and the file is in the current directory.
I use this command to decompress a list of files I have stored on HDFS.
val decompressCommand = Seq(laszippath, "-i", inputFileName , "-o", "out.las").!!
The file is output to the current worker node's directory, and I know this because when I execute "ls -a".!! through Scala I can see that the file is there. I then try to access it with the following command:
val dataFrame = sqlContext.read.las("out.las")
I assumed that the SQL context would try to find the file in the current directory, but it doesn't. Also, it doesn't throw an error, just a warning stating that the file could not be found (so Spark continues to run).
I attempted to add the file using sparkContext.addFile("out.las") and then access the location using val location = SparkFiles.get("out.las"), but this didn't work either.
I even ran val locationPt = "pwd".!!, then did val fullLocation = locationPt + "/out.las" and attempted to use that value, but it didn't work either.
The actual exception that gets thrown is the following:
User class threw exception: org.apache.spark.sql.AnalysisException: cannot resolve 'x' given input columns: [];
org.apache.spark.sql.AnalysisException: cannot resolve 'x' given input columns: []
This happens when I try to access column "x" from a DataFrame. I know that column "x" exists because I've downloaded some of the files from HDFS, decompressed them locally, and run some tests.
I need to decompress the files one by one because I have 1.6 TB of data, so I cannot decompress it all in one go and access the files later.
Can anyone tell me how I can access files that are being output to the worker node directory? Or should I be doing this some other way?

So I managed to do it now. What I'm doing is saving the file to HDFS and then retrieving it using the SQL context through HDFS. I overwrite "out.las" in HDFS each time so that I don't take up too much space.
I have used the Hadoop API before to get at files; I don't know if it will help you here.
import org.apache.hadoop.fs.{FSDataInputStream, FileSystem, Path}

val filePath = "/user/me/dataForHDFS/"
val fs: FileSystem = FileSystem.get(new java.net.URI(filePath + "out.las"), sc.hadoopConfiguration)
And I've not tested the below, but it should give an idea of what to do afterward: size the buffer to the file, then read it fully.
val file = new Path(filePath + "out.las")
// Allocate a buffer the size of the file, then read the whole file into it
val readIn: Array[Byte] = new Array[Byte](fs.getFileStatus(file).getLen.toInt)
val fileIn: FSDataInputStream = fs.open(file)
fileIn.readFully(0, readIn)
fileIn.close()

CSV file to redshift using talend

Windows 8.1
Talend version: 5.6
Job design:
tFileInputDelimited >> tRedshiftOutput
I am loading 1 million rows from a CSV file into Redshift. After loading about 500,000 rows (5 lakhs), I get these errors:
Exception in component tRedshiftOutput_1
org.postgresql.util.PSQLException: ERROR: /rds/bin/padb.1.0.867/data/exec/58/0: failed to map segment from shared object: Cannot allocate memory
Detail:
error: /rds/bin/padb.1.0.867/data/exec/58/0: failed to map segment from shared object: Cannot allocate memory
code: 1015
context: dlopen(/rds/bin/padb.1.0.867/data/exec/58/0,RTLD_LAZY)
query: 4234372
location: exec_plan.cpp:2213
process: padbmaster [pid=15630]
at org.postgresql.core.v3.QueryExecutorImpl.receiveErrorResponse(QueryExecutorImpl.java:2096)
at org.postgresql.core.v3.QueryExecutorImpl.processResults(QueryExecutorImpl.java:1829)
at org.postgresql.core.v3.QueryExecutorImpl.execute(QueryExecutorImpl.java:257)
at org.postgresql.jdbc2.AbstractJdbc2Statement.execute(AbstractJdbc2Statement.java:510)
at org.postgresql.jdbc2.AbstractJdbc2Statement.executeWithFlags(AbstractJdbc2Statement.java:386)
at org.postgresql.jdbc2.AbstractJdbc2Statement.executeUpdate(AbstractJdbc2Statement.java:332)
at project_1.red_mysqltest_0_1.red_mysqltest.tFileInputDelimited_1Process(red_mysqltest.java:1056)
at project_1.red_mysqltest_0_1.red_mysqltest.runJobInTOS(red_mysqltest.java:1802)
at project_1.red_mysqltest_0_1.red_mysqltest.main(red_mysqltest.java:1646)
[statistics] disconnected
How do I resolve these errors?

To use the COPY command you need to copy the CSV file(s) to S3.
To copy files to S3 you can use the tSystem component with the command below:
"aws s3 cp /home/filename.csv s3://data/"
Then you can use tRedshiftRow to run the COPY command, which copies the data from S3 into the table. If you are not using S3, just pass the path of the file directly.
"COPY tablename
FROM 's3://location/filename.csv'
credentials
'aws_access_key_id= enter_access_key;
aws_secret_access_key=enter_aws_secret_access_ket'
CSV DELIMITER ';' IGNOREHEADER 1 BLANKSASNULL EMPTYASNULL MAXERROR 10;"
For more detail, see the Redshift COPY documentation.
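Outside of Talend, the same two-step pattern (upload the CSV to S3, then run COPY against Redshift) can be sketched in Python; boto3 and psycopg2 are assumed to be installed, and the bucket, table, cluster endpoint, and credentials below are placeholders, not values from the question:
import boto3
import psycopg2

# Step 1: upload the CSV to S3 (equivalent of: aws s3 cp /home/filename.csv s3://data/)
s3 = boto3.client("s3")
s3.upload_file("/home/filename.csv", "data", "filename.csv")

# Step 2: run the COPY command on the Redshift cluster
copy_sql = """
    COPY tablename
    FROM 's3://data/filename.csv'
    CREDENTIALS 'aws_access_key_id=enter_access_key;aws_secret_access_key=enter_aws_secret_access_key'
    CSV DELIMITER ';' IGNOREHEADER 1 BLANKSASNULL EMPTYASNULL MAXERROR 10;
"""
conn = psycopg2.connect(host="my-cluster.xxxx.redshift.amazonaws.com", port=5439,
                        dbname="dev", user="awsuser", password="...")
with conn, conn.cursor() as cur:
    cur.execute(copy_sql)
COPY loads the data in bulk rather than row by row, which is why the answer recommends it over inserting directly through tRedshiftOutput.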