I am trying to write a job that reads data from S3 and writes it to a BigQuery database (using the connector). I run the same script for other tables and it works correctly, but for one of the tables the write fails.
It works on the first run, but after the initial load the incremental runs throw the null pointer exception below. I have bookmarks enabled to pick up new data added to S3 and write it to the BigQuery database.
I am already handling the new-data check: if there are files to process, the job proceeds; otherwise it aborts.
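The guard looks roughly like this; this is a minimal sketch, assuming a standard Glue PySpark job with bookmarks enabled, where the bucket/prefix, format, and transformation_ctx name are placeholders:

import sys

from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
sc = SparkContext.getOrCreate()
glue_context = GlueContext(sc)
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# Bookmarks key off transformation_ctx, so only new S3 objects come back here
source_dyf = glue_context.create_dynamic_frame.from_options(
    connection_type="s3",
    connection_options={"paths": ["s3://my-bucket/my-prefix/"]},
    format="parquet",
    transformation_ctx="s3_source",
)

if source_dyf.count() == 0:
    # Nothing new since the last bookmark: commit and stop before the BigQuery write
    job.commit()
    sys.exit(0)

df = source_dyf.toDF()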
In the job logs the DataFrame and its count print fine, so everything seems to be working, but the job fails as soon as it runs the DataFrame write command.
I am not sure what the cause is. I also tried making the nullability of the source and target match, by setting the nullable property of the source columns to True (same as the target), but it still fails.
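The nullability alignment was along these lines (a minimal sketch; df and spark are the DataFrame and session already present in the script):

from pyspark.sql.types import StructField, StructType

# Rebuild the schema with every column forced to nullable=True so it matches the target table
nullable_schema = StructType(
    [StructField(f.name, f.dataType, True) for f in df.schema.fields]
)
df = spark.createDataFrame(df.rdd, nullable_schema)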
I am unable to understand the null pointer exception that is thrown:
Error: Caused by: java.lang.NullPointerException
  at com.google.cloud.bigquery.connector.common.BigQueryClient.loadDataIntoTable(BigQueryClient.java:532)
  at com.google.cloud.spark.bigquery.BigQueryWriteHelper.loadDataToBigQuery(BigQueryWriteHelper.scala:87)
  at com.google.cloud.spark.bigquery.BigQueryWriteHelper.writeDataFrameToBigQuery(BigQueryWriteHelper.scala:66)
  ... 42 more
The BigQuery connector provided by AWS had a bug. This was resolved when I contacted the AWS team; they suggested using the previous version of the connector.
So, using the previous version of the connector resolved the issue.
All, when running a build pipeline in Azure DevOps with an ARM template, the process consistently fails when trying to deploy a dataset (or a reference to a dataset) with this error:
ARM Template deployment: Resource Group scope (AzureResourceManagerTemplateDeployment)
BadRequest: The document creation or update failed because of invalid reference 'dataset_1'.
I've tried renaming the dataset and also recreating it to see if that would help.
I then deleted the dataset_1.json file from the repo and still get the same message, so I think it's some reference to this dataset rather than the dataset itself. I've looked through all the other files for references to it, but they all look fine.
Any ideas on how to troubleshoot this?
thanks
Try this:
It looks like you created the 'myTestLinkedService' linked service and tested the connection, but haven't published it yet, and you are trying to reference that linked service in the new dataset you are creating using PowerShell.
In order to reference any Data Factory entity from PowerShell, please make sure those entities are published first. Try publishing the linked service from the portal first, and then run your PowerShell script to create the new dataset/activity.
I think I found the issue. When I went into the detailed logs, I found that in addition to this error there was an error message about an invalid SQL connection string, so I thought it might be related, since the dataset in question uses an Azure SQL Database linked service.
I adjusted the connection string and this seems to have solved the issue.
I am running into a very strange issue with Spring Boot and Spring Data: after I manually close a connection, the formerly working application seems to "forget" which schema it's using and complains about missing relations.
Here's the code snippet in question:
try (Connection connection = this.dataSource.getConnection()) {
ScriptUtils.executeSqlScript(connection, new ClassPathResource("/script.sql"));
}
This code works fine, but after it executes, the application immediately starts throwing errors like the following:
org.postgresql.util.PSQLException: ERROR: relation "some_table" does not exist
Prior to executing the code above, the application works fine (including referencing the table it later complains about). If I remove the try-with-resources block and do not close the Connection, everything also works fine, except that I've now created a resource leak. I have also tried explicitly setting the default schema (public) in the following ways:
In the JDBC URL with the currentSchema parameter
With the spring.datasource.hikari.schema parameter
With the spring.datasource.jpa.properties.hibernate.default_schema property
The last does alleviate the issue with respect to Hibernate managed classes, but the issue persists with native queries. I could, of course, make the schema explicit in those queries, but that doesn't seem to address the root issue. Why would closing a connection trigger this behavior?
My environment:
Spring Boot 2.5.1
PostgreSQL 12.7
Thanks to several users above who immediately saw what I did not. The script, adapted from an older pg_dump run, was indeed mucking with the search_path:
SELECT pg_catalog.set_config('search_path', '', false);
Removing that line, and some other unnecessary ones, resolved the problem. Big duh on my part.
We have a number of Databricks Delta tables created on ADLS Gen1, and there are also external tables built on top of each of those tables in one of the Databricks workspaces.
Similarly, I am trying to create the same sort of external tables on the same Delta-format files, but in a different workspace.
I have read-only access via a service principal on ADLS Gen1, so I can read the Delta files through Spark DataFrames, as shown below:
read_data_df = spark.read.format("delta").load('dbfs:/mnt/data/<foldername>')
I am even able to create Hive external tables, but I see the following error when reading data from the same table:
Error in SQL statement: AnalysisException: Incompatible format detected.
A transaction log for Databricks Delta was found at `dbfs:/mnt/data/<foldername>/_delta_log`,
but you are trying to read from `dbfs:/mnt/data/<foldername>` using format("hive"). You must use
'format("delta")' when reading and writing to a delta table.
To disable this check, SET spark.databricks.delta.formatCheck.enabled=false
To learn more about Delta, see https://learn.microsoft.com/azure/databricks/delta/index
;
If I create the external table 'USING DELTA', then I see a different access error:
Caused by: org.apache.hadoop.security.AccessControlException:
OPEN failed with error 0x83090aa2 (Forbidden. ACL verification failed.
Either the resource does not exist or the user is not authorized to perform the requested operation.).
failed with error 0x83090aa2 (Forbidden. ACL verification failed.
Either the resource does not exist or the user is not authorized to perform the requested operation.).
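For reference, that 'USING DELTA' table is created roughly like this (a sketch; the database and table names are placeholders, and the location matches the read above):

# Sketch only: database/table names are placeholders
spark.sql("""
    CREATE TABLE IF NOT EXISTS my_db.my_external_table
    USING DELTA
    LOCATION 'dbfs:/mnt/data/<foldername>'
""")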
Does this mean I would need full access on the underlying file system, rather than just read-only?
Thanks
Resolved after upgrading the Databricks Runtime environment to version DBR 7.3.
I am using the RunJobFlow command to spin up a Spark EMR cluster. This command sets the JobFlowRole to an IAM Role which has the policies AmazonElasticMapReduceforEC2Role and AmazonRedshiftReadOnlyAccess. The first policy contains an action to allow all s3 permissions.
When the EC2 instances spin up, they assume this IAM role, and generate temporary credentials via STS.
The first thing I do is read a table from my Redshift cluster into a Spark DataFrame using the com.databricks.spark.redshift format, using the same IAM role to unload the data from Redshift as I did for the EMR JobFlowRole.
As far as I understand, this runs an UNLOAD command on Redshift that dumps the data into the S3 bucket I specify. Spark then loads the newly unloaded data into a DataFrame. I use the recommended s3n:// protocol for the tempdir option.
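Roughly, the read looks like this (a PySpark sketch; the cluster endpoint, credentials, table, bucket, and role ARN are placeholders, and spark is the active SparkSession):

# Rough PySpark rendering of the read described above; all names below are placeholders
df = (
    spark.read.format("com.databricks.spark.redshift")
    .option("url", "jdbc:redshift://<cluster-endpoint>:5439/<db>?user=<user>&password=<password>")
    .option("dbtable", "my_schema.my_table")
    .option("tempdir", "s3n://my-bucket/redshift-temp/")
    .option("aws_iam_role", "arn:aws:iam::<account-id>:role/<jobflow-role>")
    .load()
)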
This command works great, and it always successfully loads the data into the DataFrame.
I then run some transformations and attempt to save the DataFrame in CSV format to the same S3 bucket that Redshift unloaded into.
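The write itself is nothing special; roughly (df is the transformed DataFrame and the output path is a placeholder):

# The failing write, roughly; output path is a placeholder
df.write.mode("overwrite").csv("s3n://my-bucket/output/")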
However, when I try to do this, it throws the following error
java.lang.IllegalArgumentException: AWS Access Key ID and Secret Access Key must be specified as the username or password (respectively) of a s3n URL, or by setting the fs.s3n.awsAccessKeyId or fs.s3n.awsSecretAccessKey properties (respectively)
Okay. So I don't know why this happens, but I tried to hack around it by setting the recommended Hadoop configuration parameters. I then used DefaultAWSCredentialsProviderChain to load the AWS access key ID and secret key and set them via:
spark.sparkContext.hadoopConfiguration.set("fs.s3n.awsAccessKeyId", <CREDENTIALS_ACCESS_KEY>)
spark.sparkContext.hadoopConfiguration.set("fs.s3n.awsSecretAccessKey", <CREDENTIALS_SECRET_ACCESS_KEY>)
When I run it again it throws the following error:
java.io.IOException: com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.services.s3.model.AmazonS3Exception: The AWS Access Key Id you provided does not exist in our records. (Service: Amazon S3; Status Code: 403; Error Code: InvalidAccessKeyId;
Okay. So that didn't work. I then removed the Hadoop configuration settings and hardcoded an IAM user's credentials in the S3 URL via s3n://ACCESS_KEY:SECRET_KEY#BUCKET/KEY
When I ran this it spit out the following error:
java.lang.IllegalArgumentException: Bucket name should be between 3 and 63 characters long
So it tried to create a bucket, which is definitely not what we want it to do.
I am really stuck on this one and would really appreciate any help here! It works fine when I run it locally, but completely fails on EMR.
The problem was the following:
The EC2 instances generated temporary credentials during the EMR bootstrap phase.
When I queried Redshift, I passed the aws_iam_role to the Databricks driver. The driver then re-generated temporary credentials for that same IAM role. This invalidated the credentials the EC2 instance generated.
I then tried to upload to S3 using the old credentials (the ones stored in the instance's metadata).
It failed because it was trying to use out-of-date credentials.
The solution was to remove redshift authorization via aws_iam_role and replace it with the following:
val credentials = EC2MetadataUtils.getIAMSecurityCredentials
...
.option("temporary_aws_access_key_id", credentials.get(IAM_ROLE).accessKeyId)
.option("temporary_aws_secret_access_key", credentials.get(IAM_ROLE).secretAccessKey)
.option("temporary_aws_session_token", credentials.get(IAM_ROLE).token)
On Amazon EMR, try using the prefix s3:// to refer to an object in S3.
It's a long story.
I have run the complete source for Getting Started - Creating a Batch Service
Given that the sample uses the in-memory database provided by @EnableBatchProcessing, is the DB query result expected, or will it only be available if the data is persisted permanently?
After adding some debug lines, it seems that the DB query is executed before the job runs. Is this the expected behavior?
Is there anything I'm missing here?
Thanks
Alex
You aren't missing anything. This is related to issue number 8 for that guide (https://github.com/spring-guides/gs-batch-processing/issues/8). I just created a pull request to address this issue. You can view the PR here (https://github.com/spring-guides/gs-batch-processing/pull/9) until it's merged.
UPDATE
The PR has been merged and the guide has been updated. The new version should no longer have this issue.