I want to run some PySpark jobs on Google Dataproc across different project IDs, without success so far. I'm a newbie with PySpark and Google Cloud, but I've followed this example and it runs well (as long as the BigQuery dataset is either public or belongs to my GCP project, which is ProjectA). The input parameters look like this:
bucket = sc._jsc.hadoopConfiguration().get('fs.gs.system.bucket')
projectA = sc._jsc.hadoopConfiguration().get('fs.gs.project.id')
input_directory = 'gs://{}/hadoop/tmp/bigquery/pyspark_input'.format(bucket)

conf = {
    # Input Parameters
    'mapred.bq.project.id': projectA,
    'mapred.bq.gcs.bucket': bucket,
    'mapred.bq.temp.gcs.path': input_directory,
    'mapred.bq.input.project.id': 'projectA',
    'mapred.bq.input.dataset.id': 'my_dataset',
    'mapred.bq.input.table.id': 'my_table',
}

# Load data in from BigQuery.
table_data = sc.newAPIHadoopRDD(
    'com.google.cloud.hadoop.io.bigquery.JsonTextBigQueryInputFormat',
    'org.apache.hadoop.io.LongWritable',
    'com.google.gson.JsonObject',
    conf=conf)
But what I need is to run a job against a BigQuery dataset that belongs to ProjectB (I have credentials to query it). So when I set the input parameters like this:
conf = {
    # Input Parameters
    'mapred.bq.project.id': projectA,
    'mapred.bq.gcs.bucket': bucket,
    'mapred.bq.temp.gcs.path': input_directory,
    'mapred.bq.input.project.id': 'projectB',
    'mapred.bq.input.dataset.id': 'the_datasetB',
    'mapred.bq.input.table.id': 'the_tableB',
}
and try to load data in from BigQuery, my script keeps running indefinitely. How should I set this up properly?
FYI, after running the example I mentioned before, I can see that two folders (shard-0 and shard-1) are created in Google Cloud Storage and contain the corresponding BigQuery data, but with my job only shard-0 is created and it's empty.
I talked to my co-worker Dennis and here is his suggestion:
"Hmm, not sure, it should work. They might want to test with "bq" CLI inside the master node to manually try some "bq extract" job of the projectB table into their GCS bucket since that's all the connector is doing under the hood.
If I had to guess I'd suspect they only meant their personal username has the credentials to query projectB, but the default service account of projectA might not have the query permissions. Everything inside Dataproc VMs act on behalf of the compute service account assigned to the VMs, not the end-user.
They can
gcloud compute instances describe -m
and somewhere in there it lists the service-account email address."
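Following up on that suggestion, here is a hedged sketch of the two checks; the service-account email, bucket, and GCS path below are placeholders, not values from the original post. First grant the Dataproc VMs' service account read access on projectB, then try a manual extract from the master node.

# Grant the projectA compute service account read access to projectB's BigQuery data
gcloud projects add-iam-policy-binding projectB \
    --member="serviceAccount:<project-number>-compute@developer.gserviceaccount.com" \
    --role="roles/bigquery.dataViewer"

# From the Dataproc master node: does a manual extract of the projectB table succeed?
bq extract --destination_format=NEWLINE_DELIMITED_JSON \
    'projectB:the_datasetB.the_tableB' \
    'gs://<your-bucket>/bq_manual_test/shard-*.json'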
I've been using Terraform for some time, but I'm new to Terraform Cloud. I have a piece of code that, when run locally, creates a .tf file under a folder I specify, but when I run it with the Terraform CLI against Terraform Cloud this doesn't happen. I'll show the code so it's clearer for everyone.
resource "genesyscloud_tf_export" "export" {
directory = "../Folder/"
resource_types = []
include_state_file = false
export_as_hcl = true
log_permission_errors = true
}
So basically, when I run this code with terraform apply locally, it creates a .tf file with everything I need. Where? It goes up one folder and stores the file under the folder "Folder".
But when I execute the same code on Terraform Cloud, this obviously doesn't happen. Does anyone have a workaround for this kind of problem? How can I store this file, for example in a GitHub repo, when executing GitHub Actions? Thanks in advance.
The Terraform Cloud remote execution environment has an ephemeral filesystem that is discarded after a run is complete. Any files you instruct Terraform to create there during the run will therefore be lost after the run is complete.
If you want to make use of this information after the run is complete then you will need to arrange to either store it somewhere else (using additional resources that will write the data to somewhere like Amazon S3) or export the relevant information as root module output values so you can access it via Terraform Cloud's API or UI.
I'm not familiar with genesyscloud_tf_export, but from its documentation it sounds like it will create either one or two files in the given directory:
genesyscloud.tf or genesyscloud.tf.json, depending on whether you set export_as_hcl. (You did, so I assume it'll generate genesyscloud.tf.)
terraform.tfstate if you set include_state_file. (You didn't, so I assume that file isn't important in your case.)
Based on that, I think you could use the hashicorp/local provider's local_file data source to read the generated file into memory once the MyPureCloud/genesyscloud provider has created it, like this:
resource "genesyscloud_tf_export" "export" {
directory = "../Folder"
resource_types = []
include_state_file = false
export_as_hcl = true
log_permission_errors = true
}
data "local_file" "export_config" {
filename = "${genesyscloud_tf_export.export.directory}/genesyscloud.tf"
}
You can then refer to data.local_file.export_config.content to obtain the content of the file elsewhere in your module and declare that it should be written into some other location that will persist after your run is complete.
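For example, here is a minimal sketch of both options, assuming the integrations/github provider is configured with suitable credentials; the repository and file names are hypothetical, not from the original question.

output "genesyscloud_export_hcl" {
  value = data.local_file.export_config.content
}

# Hypothetical: push the generated file into a GitHub repository so it persists after the run
resource "github_repository_file" "export" {
  repository          = "my-exports-repo"             # placeholder
  file                = "genesyscloud/genesyscloud.tf"
  content             = data.local_file.export_config.content
  overwrite_on_create = true
}

The output value covers the "read it later via Terraform Cloud's API or UI" route, while the github_repository_file resource stores the file in a repository without needing GitHub Actions at all.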
This genesyscloud_tf_export resource type seems unusual in that it modifies data on local disk and so its result presumably can't survive from one run to the next in Terraform Cloud. There might therefore be some problems on the next run if Terraform thinks that genesyscloud_tf_export.export.directory still exists but the files on disk don't, but hopefully the developers of this provider have accounted for that somehow in the provider logic.
I have a map-reduce application running on AWS EMR that writes some output to a different (AWS account) S3 bucket. I have the permissions set up and the job can write to the external bucket, but the owner is still the root of the account where the Hadoop job is running. I would like to change this to the external account that owns the bucket.
I found I can set fs.s3a.acl.default to bucket-owner-full-control, however that doesn't seem to work. This is what I am doing:
conf.set("fs.s3a.acl.default", "bucket-owner-full-control");
FileSystem fileSystem = FileSystem.get(URI.create(s3Path), conf);
FSDataOutputStream fsDataOutputStream = fileSystem.create(new Path(filePath));
PrintWriter writer = new PrintWriter(fsDataOutputStream);
writer.write(contentAsString);
writer.close();
fsDataOutputStream.close();
Any help is appreciated.
conf.set("fs.s3a.acl.default", "bucket-owner-full-control");
is the right property to set. It corresponds to this property in core-site.xml, which gives full control to the bucket owner:
<property>
  <name>fs.s3a.acl.default</name>
  <value>bucket-owner-full-control</value>
  <description>Set a canned ACL for newly created and copied objects. Value may be private,
    public-read, public-read-write, authenticated-read, log-delivery-write,
    bucket-owner-read, or bucket-owner-full-control.</description>
</property>
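If you configure this through spark-submit rather than core-site.xml, the same Hadoop property can also be passed with Spark's spark.hadoop. prefix; this is a hedged aside, not part of the original answer:

--conf spark.hadoop.fs.s3a.acl.default=bucket-owner-full-control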
BucketOwnerFullControl: Specifies that the owner of the bucket is granted Permission.FullControl. The owner of the bucket is not necessarily the same as the owner of the object.
I recommend also setting fs.s3.canned.acl to the value BucketOwnerFullControl.
For debugging, you can use the snippet below to see which parameters are actually being passed:
// Configuration is Iterable<Map.Entry<String, String>>, so this prints every effective property
for (Map.Entry<String, String> entry : conf) {
    System.out.printf("%s=%s\n", entry.getKey(), entry.getValue());
}
For testing purposes, run this command from the command line:
aws s3 cp s3://bucket/source/dummyfile.txt s3://bucket/target/dummyfile.txt --sse --acl bucket-owner-full-control
If this works from the command line, it will also work through the API.
Bonus point with Spark, useful for Spark Scala users:
For Spark to access the S3 filesystem, set the proper configuration as in the example below:
val hadoopConf = spark.sparkContext.hadoopConfiguration
hadoopConf.set("fs.s3a.fast.upload","true")
hadoopConf.set("mapreduce.fileoutputcommitter.algorithm.version","2")
hadoopConf.set("fs.s3a.server-side-encryption-algorithm", "AES256")
hadoopConf.set("fs.s3a.canned.acl","BucketOwnerFullControl")
hadoopConf.set("fs.s3a.acl.default","BucketOwnerFullControl")
If you are using EMR then you have to use the AWS team's S3 connector, with "s3://" URLs, and use their documented configuration options. They don't support the Apache one, so any option starting with "fs.s3a" isn't going to have any effect whatsoever.
As mentioned in the answer by Stevel, for EMR with PySpark use this:
sc = spark.sparkContext
hadoop_conf = sc._jsc.hadoopConfiguration()
hadoop_conf.set("fs.s3.canned.acl", "BucketOwnerFullControl")
Canned ACL: BucketOwnerFullControl
Specifies that the owner of the bucket is granted Permission.FullControl. The owner of the bucket is not necessarily the same as the owner of the object.
https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-s3-acls.html
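If you would rather set this once for the whole cluster instead of per job, EMRFS properties such as fs.s3.canned.acl can also be supplied through a configuration classification at cluster creation time. A hedged sketch (the classification name and property come from the EMR docs linked above; treat the exact JSON as an assumption to verify):

[
  {
    "Classification": "emrfs-site",
    "Properties": {
      "fs.s3.canned.acl": "BucketOwnerFullControl"
    }
  }
]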
What's the best way to create a CSV file listing images in a Google Cloud bucket to be imported into AutoML Vision?
If you want to listen for files saved to a bucket, you can use a Google Cloud Function to listen for new files and create the CSV file in another bucket.
For example, you can use this Python code as a starting point; it logs the details of a newly uploaded file:
def hello_gcs_generic(data, context):
    """Background Cloud Function to be triggered by Cloud Storage.
    This generic function logs relevant data when a file is changed.
    Args:
        data (dict): The Cloud Functions event payload.
        context (google.cloud.functions.Context): Metadata of triggering event.
    Returns:
        None; the output is written to Stackdriver Logging
    """
    print('Event ID: {}'.format(context.event_id))
    print('Event type: {}'.format(context.event_type))
    print('Bucket: {}'.format(data['bucket']))
    print('File: {}'.format(data['name']))
    print('Metageneration: {}'.format(data['metageneration']))
    print('Created: {}'.format(data['timeCreated']))
    print('Updated: {}'.format(data['updated']))
Basically, the function is listening for the storage event "google.storage.object.finalize" (which fires when a file is uploaded).
To deploy this function to the cloud, you can use this command:
gcloud functions deploy hello_gcs_generic --runtime python37 --trigger-resource [your bucket name] --trigger-event google.storage.object.finalize
or you can use the GCP console (Web UI) to deploy this function:
select "Cloud Storage" in the trigger field
select "Finalize/Create" as the event type
specify your bucket
You can even process the files with AutoML directly within a Cloud Function, as mentioned in this example.
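Building on that answer, here is a minimal sketch of a function that produces the CSV AutoML Vision expects (one gs:// URI per row, optionally followed by a label). The bucket names, file filter, and output path are assumptions, not part of the original answer.

from google.cloud import storage

def build_automl_csv(data, context):
    """Lists images in a source bucket and writes a CSV of gs:// URIs to another bucket.

    Bucket names and the output object path are placeholders; adjust them to your project.
    """
    client = storage.Client()
    source_bucket = 'my-images-bucket'      # placeholder
    output_bucket = 'my-automl-csv-bucket'  # placeholder

    rows = []
    for blob in client.list_blobs(source_bucket):
        if blob.name.lower().endswith(('.jpg', '.jpeg', '.png')):
            # AutoML Vision import rows look like: gs://bucket/path/image.jpg[,label]
            rows.append('gs://{}/{}'.format(source_bucket, blob.name))

    csv_content = '\n'.join(rows)
    client.bucket(output_bucket).blob('automl/images.csv').upload_from_string(csv_content)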
Is it possible to run a bq command that queries a dataset in project X and stores the result in another dataset in project Y, like this:
bq query --destination_table=project_Y.dataset_1.table_1 "SELECT * FROM project_X.dataset2.table_2"
What about the credentials, now that there are two projects involved?
I have only set up service account credentials for project_X using gcloud.
Yes. Queries across multiple projects are supported.
The user (or service account) issuing the query will need to have the appropriate permissions on each project (and/or dataset).
You can read more about BigQuery permissions here:
https://cloud.google.com/bigquery/docs/access-control
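As a hedged sketch of what the command could look like (the global --project_id flag picks the project that runs and is billed for the query job, --destination_table takes the project:dataset.table form, and the SELECT assumes standard SQL; adjust names to your own projects):

bq --project_id=project_X query --use_legacy_sql=false \
    --destination_table=project_Y:dataset_1.table_1 \
    'SELECT * FROM `project_X.dataset2.table_2`'

The service account then needs read access to dataset2 in project_X, write access to dataset_1 in project_Y, and permission to run jobs in whichever project is used as the billing project.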
I am using the RunJobFlow command to spin up a Spark EMR cluster. This command sets the JobFlowRole to an IAM Role which has the policies AmazonElasticMapReduceforEC2Role and AmazonRedshiftReadOnlyAccess. The first policy contains an action to allow all s3 permissions.
When the EC2 instances spin up, they assume this IAM role, and generate temporary credentials via STS.
The first thing which I do is read a table from my Redshift cluster into a Spark Dataframe using the com.databricks.spark.redshift format and using the same IAM Role to unload the data from redshift as I did for the EMR JobFlowRole.
So far as I understand, this runs an UNLOAD command on Redshift to dump into the S3 bucket I specify. Spark then loads the newly unloaded data into a Dataframe. I use the recommended s3n:// protocol for the tempdir option.
This command works great, and it always successfully loads the data into the Dataframe.
I then run some transformations and attempt to save the dataframe in the csv format to the same S3 bucket Redshift Unloaded into.
However, when I try to do this, it throws the following error:
java.lang.IllegalArgumentException: AWS Access Key ID and Secret Access Key must be specified as the username or password (respectively) of a s3n URL, or by setting the fs.s3n.awsAccessKeyId or fs.s3n.awsSecretAccessKey properties (respectively)
Okay. So I don't know why this happens, but I tried to hack around it by setting the recommended Hadoop configuration parameters. I then used DefaultAWSCredentialsProviderChain to load the AWSAccessKeyID and AWSSecretKey and set them via:
spark.sparkContext.hadoopConfiguration.set("fs.s3n.awsAccessKeyId", <CREDENTIALS_ACCESS_KEY>)
spark.sparkContext.hadoopConfiguration.set("fs.s3n.awsSecretAccessKey", <CREDENTIALS_SECRET_ACCESS_KEY>)
When I run it again it throws the following error:
java.io.IOException: com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.services.s3.model.AmazonS3Exception: The AWS Access Key Id you provided does not exist in our records. (Service: Amazon S3; Status Code: 403; Error Code: InvalidAccessKeyId;
Okay. So that didn't work. I then removed the Hadoop configuration settings and hardcoded an IAM user's credentials in the S3 URL via s3n://ACCESS_KEY:SECRET_KEY@BUCKET/KEY
When I ran this it spit out the following error:
java.lang.IllegalArgumentException: Bucket name should be between 3 and 63 characters long
So it tried to create a bucket, which is definitely not what we want it to do.
I am really stuck on this one and would really appreciate any help here! It works fine when I run it locally, but completely fails on EMR.
The problem was the following:
The EC2 instances generated temporary credentials during the EMR bootstrap phase.
When I queried Redshift, I passed the aws_iam_role to the Databricks driver. The driver then re-generated temporary credentials for that same IAM role, which invalidated the credentials the EC2 instances had generated.
I then tried to upload to S3 using the old credentials (the ones stored in the instance's metadata).
It failed because it was trying to use out-of-date credentials.
The solution was to remove redshift authorization via aws_iam_role and replace it with the following:
// EC2MetadataUtils comes from com.amazonaws.util in the AWS SDK;
// getIAMSecurityCredentials returns a map keyed by the instance-profile role name.
val credentials = EC2MetadataUtils.getIAMSecurityCredentials
...
// IAM_ROLE is the name of the IAM role attached to the EMR instances
.option("temporary_aws_access_key_id", credentials.get(IAM_ROLE).accessKeyId)
.option("temporary_aws_secret_access_key", credentials.get(IAM_ROLE).secretAccessKey)
.option("temporary_aws_session_token", credentials.get(IAM_ROLE).token)
On Amazon EMR, try using the prefix s3:// to refer to an object in S3.
It's a long story.