Airflow Dataproc serverless job creator doesn't take Python parameters - pyspark

I'm trying to set up a Dataproc Serverless batch job from Google Cloud Composer using the DataprocCreateBatchOperator operator, passing some arguments that affect the underlying Python code. However, I'm running into the following error:
error: unrecognized arguments: --run_timestamp "2022-06-17T13:22:51.800834+00:00" --temp_bucket "gs://pipeline/spark_temp_bucket/hourly/" --bucket "pipeline" --pipeline "hourly"
This is how my operator is set up:
create_batch = DataprocCreateBatchOperator(
    task_id="hourly_pipeline",
    project_id="dev",
    region="us-west1",
    batch_id="".join(random.choice(string.ascii_lowercase + string.digits + "-") for i in range(40)),
    batch={
        "environment_config": {
            "execution_config": {
                "service_account": "<service_account>",
                "subnetwork_uri": "<uri>"
            }
        },
        "pyspark_batch": {
            "main_python_file_uri": "gs://pipeline/code/pipeline_feat_creation.py",
            "args": [
                '--run_timestamp "{{ ts }}"',
                '--temp_bucket "gs://pipeline/spark_temp_bucket/hourly/"',
                '--bucket "pipeline"',
                '--pipeline "hourly"'
            ],
            "jar_file_uris": [
                "gs://spark-lib/bigquery/spark-bigquery-with-dependencies_2.12-0.25.0.jar"
            ],
        }
    }
)
Regarding the args array: I tried setting the parameters both with and without wrapping them in "". I've also already done a gcloud submit that worked, like so:
gcloud dataproc batches submit pyspark "gs://pipeline/code/pipeline_feat_creation.py" \
--batch=jskdnkajsnd-test-10 --region=us-west1 --subnet="<uri>" \
-- --run_timestamp "2020-01-01" --temp_bucket gs://pipeline/spark_temp_bucket/hourly/ --bucket pipeline --pipeline hourly

The error I was running into was that I wasn't adding an = after each parameter; I've also removed the double-quote wrapping around each value. This is how the args are now set up:
"args": [
'--run_timestamp={{ ts }}',
'--temp_bucket=gs://pipeline/spark_temp_bucket/hourly/',
'--bucket=pipeline',
'--pipeline=hourly'
]

Related

How to print arguments that I sent to an Airflow-EMR cluster?

I am executing EMR (spark-submit) through Airflow 2.0 and I am submitting steps as follows: my s3://dbook/ bucket has all the files needed for spark-submit. First I copy all the files to EMR (Copy S3 to EMR) and then execute the spark-submit command, but I am getting an error: "no module named config". I need to know what args are being sent to the EMR cluster. How can I achieve this?
SPARK_STEPS = [
    {
        'Name': 'Copy S3 to EMR',
        "ActionOnFailure": "CANCEL_AND_WAIT",
        'HadoopJarStep': {
            "Jar": "command-runner.jar",
            "Args": ['aws', 's3', 'cp', 's3://dbook/', '.', '--recursive'],
        },
    },
    {
        'Name': 'Spark-Submit Command',
        "ActionOnFailure": "CANCEL_AND_WAIT",
        'HadoopJarStep': {
            "Jar": "command-runner.jar",
            "Args": [
                'spark-submit',
                '--py-files',
                'config.zip,jobs.zip',
                'main.py'
            ],
        },
    },
]
Thanks,
Xi
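One low-tech way to see exactly what spark-submit receives is to log the arguments and the working directory at the top of main.py before anything else runs; a minimal sketch (main.py and the zip file names are taken from the step definition above, everything else is an assumption):

import os
import sys

# Show exactly what spark-submit passed to the driver.
print("argv:", sys.argv)
# List the working directory to confirm config.zip and jobs.zip were copied.
print("cwd contents:", os.listdir("."))

The printed values then show up in the step's stdout logs in the EMR console (or the cluster's S3 log URI), which usually makes a missing --py-files entry or an uncopied config.zip obvious.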

AWS EMR, Submit python pyspark script as step using terraform

I have successfully created an EMR cluster using Terraform. The Terraform documentation specifies how to submit a step to EMR as a jar:
https://www.terraform.io/docs/providers/aws/r/emr_cluster.html#step-1
step {
  action_on_failure = "TERMINATE_CLUSTER"
  name = "Setup Hadoop Debugging"
  hadoop_jar_step {
    jar = "command-runner.jar"
    args = ["state-pusher-script"]
  }
}
whereas the documentation for adding a pyspark script as a step is missing.
Does anyone have experience adding a pyspark script as an EMR step using Terraform?
A common way to do this is to copy a script from S3 and use command-runner.jar to execute the script. (I don't know that it's ideal...)
step = [
  {
    name = "Copy script"
    action_on_failure = "CONTINUE"
    hadoop_jar_step {
      jar = "command-runner.jar"
      args = ["aws", "s3", "cp", "s3://path/to/script.py", "/home/hadoop/"]
    }
  },
  {
    name = "Run script"
    action_on_failure = "CONTINUE"
    hadoop_jar_step {
      jar = "command-runner.jar"
      args = ["spark-submit", "/home/hadoop/script.py"]
    }
  },
]
Here is a working example of using hadoop_jar_step.
step = [
  {
    action_on_failure = "TERMINATE_CLUSTER"
    name = "Setup Hadoop Debugging"
    hadoop_jar_step = [
      {
        jar = "command-runner.jar"
        args = [
          "state-pusher-script"
        ]
        main_class = ""
        properties = {}
      }
    ]
  }
]
This contains a workaround for https://github.com/hashicorp/terraform-provider-aws/issues/20911
Change args to your spark-submit command.

Resource type error while trying to use CloudFormation

I tried to use the exact same example provided in the user guide mentioned below. It works from the console but fails to create the stack using the CLI.
https://docs.aws.amazon.com/AWSCloudFormation/latest/UserGuide/aws-resource-athena-namedquery.html
I got an error while trying to execute the following:
{
    "Resources": {
        "AthenaNamedQuery": {
            "Type": "AWS::Athena::NamedQuery",
            "Properties": {
                "Database": "swfnetadata",
                "Description": "A query that selects all aggregated data",
                "Name": "MostExpensiveWorkflow",
                "QueryString": "SELECT workflowname, AVG(activitytaskstarted) AS AverageWorkflow FROM swfmetadata WHERE year='17' AND GROUP BY workflowname ORDER BY AverageWorkflow DESC LIMIT 10"
            }
        }
    }
}
Is the "create-stack" parameter of cloudformation correct?
aws cloudformation create-stack --stack-name dnd --template-body file://final.json
Why am I getting a resource type error like this?
An error occurred (ValidationError) when calling the CreateStack operation: Template format error: Unrecognized resource types: [AWS::Athena::NamedQuery]
It worked when I updated my CLI version as suggested in the comment. This issue is now closed.

MongoDB to BigQuery

What is the best way to export data from MongoDB hosted on mLab to Google BigQuery?
Initially, I am trying to do a one-time load from MongoDB to BigQuery, and later on I am thinking of using Pub/Sub for a real-time data flow to BigQuery.
I need help with the first one-time load from MongoDB to BigQuery.
In my opinion, the best practice is building your own extractor. That can be done in the language of your choice, and you can extract to CSV or JSON.
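A minimal sketch of such an extractor in Python, assuming pymongo and newline-delimited JSON output (the connection string and output file name are placeholders; the database and collection names match the mongoexport example further down):

import json

from bson import json_util
from pymongo import MongoClient

# Stream documents out as newline-delimited JSON, which BigQuery can load directly.
client = MongoClient("mongodb://localhost:27017")
collection = client["sample"]["status"]

with open("export.json", "w") as out:
    for doc in collection.find():
        # json_util handles BSON types such as ObjectId and ISODate.
        out.write(json.dumps(doc, default=json_util.default) + "\n")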
But if you are looking for a fast way, and your data is not huge and can fit on one server, then I recommend using mongoexport. Let's assume you have a simple document structure such as the one below:
{
    "_id" : "tdfMXH0En5of2rZXSQ2wpzVhZ",
    "statuses" : [
        {
            "status" : "dc9e5511-466c-4146-888a-574918cc2534",
            "score" : 53.24388894
        }
    ],
    "stored_at" : ISODate("2017-04-12T07:04:23.545Z")
}
Then you need to define your BigQuery Schema (mongodb_schema.json) such as:
$ cat > mongodb_schema.json <<EOF
[
    { "name":"_id", "type": "STRING" },
    { "name":"stored_at", "type": "record", "fields": [
        { "name":"date", "type": "STRING" }
    ]},
    { "name":"statuses", "type": "record", "mode": "repeated", "fields": [
        { "name":"status", "type": "STRING" },
        { "name":"score", "type": "FLOAT" }
    ]}
]
EOF
Now the fun part starts :-) extracting the data as JSON from MongoDB. Let's assume you have a cluster with the replica set name statuses, your db is sample, and your collection is status.
mongoexport \
--host statuses/db-01:27017,db-02:27017,db-03:27017 \
-vv \
--db "sample" \
--collection "status" \
--type "json" \
--limit 100000 \
--out ~/sample.json
As you can see above, I limit the output to 100k records because I recommend you run a sample and load it to BigQuery before doing it for all your data. After running the above command you should have your sample data in sample.json, BUT there is a $date field which will cause an error when loading to BigQuery. To fix that, we can use sed to replace it with a simple field name:
# Fix Date field to make it compatible with BQ
sed -i 's/"\$date"/"date"/g' sample.json
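If you prefer to do that fix in Python rather than sed, here is a small sketch that rewrites the export line by line (the gzip/gsutil commands below assume the in-place sed edit; sample_fixed.json is a hypothetical output name used only in this sketch):

import json

# Rename the "$date" wrapper inside stored_at to a plain "date" field
# so each record matches the BigQuery schema defined earlier.
with open("sample.json") as src, open("sample_fixed.json", "w") as dst:
    for line in src:
        doc = json.loads(line)
        stored_at = doc.get("stored_at")
        if isinstance(stored_at, dict) and "$date" in stored_at:
            doc["stored_at"] = {"date": stored_at["$date"]}
        dst.write(json.dumps(doc) + "\n")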
Now you can compress, upload to Google Cloud Storage (GCS), and then load to BigQuery using the following commands:
# Compress for faster load
gzip sample.json
# Move to GCloud
gsutil mv ./sample.json.gz gs://your-bucket/sample/sample.json.gz
# Load to BQ
bq load \
--source_format=NEWLINE_DELIMITED_JSON \
--max_bad_records=999999 \
--ignore_unknown_values=true \
--encoding=UTF-8 \
--replace \
"YOUR_DATASET.mongodb_sample" \
"gs://your-bucket/sample/*.json.gz" \
"mongodb_schema.json"
If everything was okay, then go back, remove --limit 100000 from the mongoexport command, and re-run the above commands to load everything instead of the 100k sample.
ALTERNATIVE SOLUTION:
If you want more flexibility and performance is not a concern, then you can use the mongo CLI tool as well. This way you can write your extract logic in JavaScript, execute it against your data, and then send the output to BigQuery. Here is what I did for the same process, but using JavaScript to output CSV so I could load it into BigQuery more easily:
# Export Logic in JavaScript
cat > export-csv.js <<EOF
var size = 100000;
var maxCount = 1;
for (var x = 0; x < maxCount; x = x + 1) {
    var recToSkip = x * size;
    db.entities.find().skip(recToSkip).limit(size).forEach(function(record) {
        var row = record._id + "," + record.stored_at.toISOString();
        record.statuses.forEach(function (l) {
            print(row + "," + l.status + "," + l.score);
        });
    });
}
EOF
# Execute on Mongo CLI
_MONGO_HOSTS="db-01:27017,db-02:27017,db-03:27017/sample?replicaSet=statuses"
mongo --quiet \
"${_MONGO_HOSTS}" \
export-csv.js \
| split -l 500000 --filter='gzip > $FILE.csv.gz' - sample_
# Move all split files to Google Cloud Storage
gsutil -m mv ./sample_* gs://your-bucket/sample/
# Load files to BigQuery
bq load \
--source_format=CSV \
--max_bad_records=999999 \
--ignore_unknown_values=true \
--encoding=UTF-8 \
--replace \
"YOUR_DATASET.mongodb_sample" \
"gs://your-bucket/sample/sample_*.csv.gz" \
"ID,StoredDate:DATETIME,Status,Score:FLOAT"
TIP: In the above script I did a small trick by piping the output so it can be split into multiple files with the sample_ prefix. Also, during the split it gzips the output so you can load it to GCS more easily.
From a basic reading of MongoDB's documentation, it sounds like you can use mongoexport to dump your database as JSON. Once you've done that, refer to the BigQuery loading data topic for a description of how to create a table from JSON files after copying them to GCS.
You can read data from MongoDB and stream it to BigQuery. You can find an example in NodeJS here.
This is an extension of the linked example that prevents duplicated records (as long as they are still in the streaming buffer):
const { BigQuery } = require('@google-cloud/bigquery');
const bigqueryClient = new BigQuery();
...
const jsonData = // Array of documents from MongoDB
const inputRows = jsonData.map(row => ({
    insertId: row._id,
    json: row
}));
const insertOptions = {
    raw: true
};
await bigqueryClient
    .dataset(datasetId)
    .table(tableId)
    .insert(inputRows, insertOptions);
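If you would rather stay in Python, the same de-duplication idea can be sketched with the google-cloud-bigquery client, using row_ids in place of insertId (the table name is a placeholder and json_data stands in for the documents read from MongoDB):

from google.cloud import bigquery

client = bigquery.Client()
table_id = "your-project.your_dataset.mongodb_sample"  # placeholder

# Documents read from MongoDB; shown inline here only for the sketch.
json_data = [{"_id": "tdfMXH0En5of2rZXSQ2wpzVhZ", "score": 53.2}]

# row_ids plays the same role as insertId above: BigQuery uses it to
# de-duplicate rows that are still in the streaming buffer.
errors = client.insert_rows_json(
    table_id,
    json_data,
    row_ids=[str(row["_id"]) for row in json_data],
)
if errors:
    print("Insert errors:", errors)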

Can I use a replica set name to connect via mongo-connector?

I would like to know whether there is a way to replicate from one Mongo replica set to another via mongo-connector. As per the Mongo documentation, we can connect two Mongo instances via mongo-connector using a command like the example below, but I would like to pass the replica set name or use a configuration file instead of passing server:port on the command line.
Mongo Connector can replicate from one MongoDB replica set or sharded cluster to another using the Mongo DocManager. The most basic usage is like the following:
mongo-connector -m localhost:27017 -t localhost:37017 -d mongo_doc_manager
I also tried the config.json option by creating the config.json file below, but it failed.
{
    "__comment__": "Configuration options starting with '__' are disabled",
    "__comment__": "To enable them, remove the preceding '__'",
    "mainAddress": "localhost:27017",
    "oplogFile": "C:\Dev\mongodb\mongo-connector\oplog.timestamp",
    "verbosity": 2,
    "continueOnError": false,
    "logging": {
        "type": "file",
        "filename": "C:\Dev\mongodb\mongo-connector\mongo-connector.log",
        "__rotationWhen": "D",
        "__rotationInterval": 1,
        "__rotationBackups": 10,
        "__type": "syslog"
    },
    "docManagers": [
        {
            "docManager": "mongo_doc_manager",
            "targetURL": "localhost:37010",
            "__autoCommitInterval": null
        }
    ]
}
Yes, it's possible to connect to a replica set or a shard server using mongo-connector:
mongo-connector -m <mongodb server hostname>:<replica set port> \
    -t <replication endpoint URL, e.g. http://localhost:8983/solr> \
    -d <name of doc manager, e.g., solr_doc_manager>
You can also pass a connection string to mongo-connector, such as:
mongo-connector -m "mongodb://db1.example.net,db2.example.net:2500/?replicaSet=test&connectTimeoutMS=300000"
To specify a specific config file you can use:
mongo-connector -c config.json
where config.json is your config file.
I was able to resolve my issue by escaping the backslashes ('\\') in my Windows directory paths. Here is my updated config file for reference. Thanks to ShaneHarvey: Not able to use Configuration file for connecting to mongo-connector
{
    "__comment__": "Configuration options starting with '__' are disabled",
    "__comment__": "To enable them, remove the preceding '__'",
    "mainAddress": "localhost:27017",
    "oplogFile": "C:\\Dev\\mongodb\\mongo-connector\\oplog.timestamp",
    "noDump": false,
    "batchSize": -1,
    "verbosity": 2,
    "continueOnError": false,
    "logging": {
        "type": "file",
        "filename": "C:\\Dev\\mongodb\\mongo-connector\\mongo-connector.log",
        "__format": "%(asctime)s [%(levelname)s] %(name)s:%(lineno)d - %(message)s",
        "__rotationWhen": "D",
        "__rotationInterval": 1,
        "__rotationBackups": 10,
        "__type": "syslog",
        "__host": "localhost:27017"
    },
    "docManagers": [
        {
            "docManager": "mongo_doc_manager",
            "targetURL": "localhost:37017",
            "__autoCommitInterval": null
        }
    ]
}