How do I pass a "blob" to AWS SNS publish on the command line?

I am trying to publish a message to an SNS topic from the command line, which includes binary data.
How do I pass this binary data into the message-attributes field, as indicated in https://awscli.amazonaws.com/v2/documentation/api/latest/reference/sns/publish.html for --message-attributes?
This is my command:
awslocal sns publish \
--topic-arn ${AWS_TOPIC_ARN} \
--subject="Test Subject" \
--message "Data for today..." \
--message-attributes=file://sns_input.json
and my input:
{
  "ProtoData": {
    "DataType": "Binary",
    "BinaryValue": **BLOB**
  }
}
My "BLOB" data is in file://sample_proto_data.db

Related

How can we filter the Insights data for multiple specific campaigns in Facebook Marketing?

I'm trying to get Insights for an ad account filtered by multiple specific campaigns; I was able to filter with a single specific campaign.
Here is the request I used for a single campaign:
https://graph.facebook.com/v12.0/act_YOUR_ACCOUNT_ID/insights?fields=actions,reach,impressions,clicks,cpc,spend&filtering=[{field: "campaign.id",operator:"CONTAIN", value: '123456789'}]
You have 2 options:
You can build a list of campaign ids that match your specific condition and then use the IN operator instead of the CONTAIN operator, like this:
https://graph.facebook.com/v12.0/act_YOUR_ACCOUNT_ID/insights?fields=actions,reach,impressions,clicks,cpc,spend&filtering=[{field: "campaign.id",operator:"IN", value: ['id1', 'id2', 'id3']}]
You can try a batch request like this (see the Graph API batch requests documentation):
curl \
-F 'access_token=<ACCESS_TOKEN>' \
-F 'batch=[{"method":"GET","relative_url":"v12.0/act_YOUR_ACCOUNT_ID/insights?fields=impressions,spend,ad_id,adset_id&filtering=[{\"field\":\"campaign.id\",\"operator\":\"CONTAIN\",\"value\":\"123456789\"}]"},{"method":"GET","relative_url":"v12.0/act_YOUR_ACCOUNT_ID/insights?fields=impressions,spend,ad_id,adset_id&filtering=[{\"field\":\"campaign.id\",\"operator\":\"CONTAIN\",\"value\":\"222222222\"}]"}]' \
'https://graph.facebook.com'
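If you would rather script it than fight the shell quoting, here is a rough Python sketch of both options using the requests library; the access token and account id are placeholders:
import json
from urllib.parse import quote

import requests

ACCESS_TOKEN = "<ACCESS_TOKEN>"      # placeholder
ACCOUNT_ID = "act_YOUR_ACCOUNT_ID"   # placeholder

# Option 1: one insights call filtering on several campaign ids with IN.
params = {
    "access_token": ACCESS_TOKEN,
    "fields": "actions,reach,impressions,clicks,cpc,spend",
    # requests URL-encodes the JSON filter for us
    "filtering": json.dumps(
        [{"field": "campaign.id", "operator": "IN", "value": ["id1", "id2", "id3"]}]
    ),
}
print(requests.get(f"https://graph.facebook.com/v12.0/{ACCOUNT_ID}/insights", params=params).json())

# Option 2: a batch request with a separate CONTAIN filter per campaign.
batch = [
    {
        "method": "GET",
        "relative_url": f"v12.0/{ACCOUNT_ID}/insights?fields=impressions,spend,ad_id,adset_id"
        + "&filtering="
        + quote(json.dumps([{"field": "campaign.id", "operator": "CONTAIN", "value": cid}])),
    }
    for cid in ("123456789", "222222222")
]
print(
    requests.post(
        "https://graph.facebook.com",
        data={"access_token": ACCESS_TOKEN, "batch": json.dumps(batch)},
    ).json()
)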

Get run id after triggering a github workflow dispatch event

I am triggering a workflow run via GitHub's REST API, but GitHub doesn't send any data in the response body (204).
How do I get the run id of the run my request triggered?
I know about the getRunsList API, which returns the runs for a workflow id, so I could take the latest run, but that can go wrong when two requests are submitted at almost the same time.
It is not currently possible to get the run_id associated with the dispatch API call from the dispatch response itself, but there is a way to find it if you can edit your workflow file a little.
You need to dispatch the workflow with an input like this:
curl "https://api.github.com/repos/$OWNER/$REPO/actions/workflows/$WORKFLOW/dispatches" -s \
-H "Authorization: Token $TOKEN" \
-d '{
"ref":"master",
"inputs":{
"id":"12345678"
}
}'
Also edit your workflow YAML file to accept an optional input (named id here), and add a first job containing a single step whose name is the input's value (this is how we will get the id back through the API!):
name: ID Example
on:
  workflow_dispatch:
    inputs:
      id:
        description: 'run identifier'
        required: false
jobs:
  id:
    name: Workflow ID Provider
    runs-on: ubuntu-latest
    steps:
      - name: ${{github.event.inputs.id}}
        run: echo run identifier ${{ inputs.id }}
The trick here is to use name: ${{github.event.inputs.id}}
https://docs.github.com/en/actions/creating-actions/metadata-syntax-for-github-actions#inputs
Then the flow is the following:
run the dispatch API call, passing the input (named id in this case) with a random value
POST https://api.github.com/repos/$OWNER/$REPO/actions/workflows/$WORKFLOW/dispatches
in a loop get the runs that have been created since now minus 5 minutes (the delta is to avoid any issue with timings):
GET https://api.github.com/repos/$OWNER/$REPO/actions/runs?created=>$run_date_filter
in the run API response, you will get a jobs_url that you will call:
GET https://api.github.com/repos/$OWNER/$REPO/actions/runs/[RUN_ID]/jobs
the jobs API call above returns the list of jobs; since you declared the id job as the first job, it will be in first position. It also gives you the steps with each step's name. Something like this:
{
  "id": 3840520726,
  "run_id": 1321007088,
  "run_url": "https://api.github.com/repos/$OWNER/$REPO/actions/runs/1321007088",
  "run_attempt": 1,
  "node_id": "CR_kwDOEi1ZxM7k6bIW",
  "head_sha": "4687a9bb5090b0aadddb69cc335b7d9e80a1601d",
  "url": "https://api.github.com/repos/$OWNER/$REPO/actions/jobs/3840520726",
  "html_url": "https://github.com/$OWNER/$REPO/runs/3840520726",
  "status": "completed",
  "conclusion": "success",
  "started_at": "2021-10-08T15:54:40Z",
  "completed_at": "2021-10-08T15:54:43Z",
  "name": "Hello world",
  "steps": [
    {
      "name": "Set up job",
      "status": "completed",
      "conclusion": "success",
      "number": 1,
      "started_at": "2021-10-08T17:54:40.000+02:00",
      "completed_at": "2021-10-08T17:54:42.000+02:00"
    },
    {
      "name": "12345678", <=============== HERE
      "status": "completed",
      "conclusion": "success",
      "number": 2,
      "started_at": "2021-10-08T17:54:42.000+02:00",
      "completed_at": "2021-10-08T17:54:43.000+02:00"
    },
    {
      "name": "Complete job",
      "status": "completed",
      "conclusion": "success",
      "number": 3,
      "started_at": "2021-10-08T17:54:43.000+02:00",
      "completed_at": "2021-10-08T17:54:43.000+02:00"
    }
  ],
  "check_run_url": "https://api.github.com/repos/$OWNER/$REPO/check-runs/3840520726",
  "labels": [
    "ubuntu-latest"
  ],
  "runner_id": 1,
  "runner_name": "Hosted Agent",
  "runner_group_id": 2,
  "runner_group_name": "GitHub Actions"
}
The name of the id step returns your input value, so you can safely confirm that this run is the one triggered by your dispatch call.
Here is an implementation of this flow in Python; it returns the workflow run id:
import random
import string
import datetime
import requests
import time
# edit the following variables
owner = "YOUR_ORG"
repo = "YOUR_REPO"
workflow = "dispatch.yaml"
token = "YOUR_TOKEN"
authHeader = { "Authorization": f"Token {token}" }
# generate a random id
run_identifier = ''.join(random.choices(string.ascii_uppercase + string.digits, k=15))
# filter runs that were created after this date minus 5 minutes
delta_time = datetime.timedelta(minutes=5)
run_date_filter = (datetime.datetime.utcnow()-delta_time).strftime("%Y-%m-%dT%H:%M")
r = requests.post(f"https://api.github.com/repos/{owner}/{repo}/actions/workflows/{workflow}/dispatches",
                  headers=authHeader,
                  json={
                      "ref": "master",
                      "inputs": {
                          "id": run_identifier
                      }
                  })
print(f"dispatch workflow status: {r.status_code} | workflow identifier: {run_identifier}")

workflow_id = ""
while workflow_id == "":
    r = requests.get(f"https://api.github.com/repos/{owner}/{repo}/actions/runs?created=%3E{run_date_filter}",
                     headers=authHeader)
    runs = r.json()["workflow_runs"]
    if len(runs) > 0:
        for workflow in runs:
            jobs_url = workflow["jobs_url"]
            print(f"get jobs_url {jobs_url}")
            r = requests.get(jobs_url, headers=authHeader)
            jobs = r.json()["jobs"]
            if len(jobs) > 0:
                # we only take the first job, edit this if you need multiple jobs
                job = jobs[0]
                steps = job["steps"]
                if len(steps) >= 2:
                    second_step = steps[1]  # if you have positioned the run_identifier step in 1st position
                    if second_step["name"] == run_identifier:
                        workflow_id = job["run_id"]
                else:
                    print("waiting for steps to be executed...")
                    time.sleep(3)
            else:
                print("waiting for jobs to popup...")
                time.sleep(3)
    else:
        print("waiting for workflows to popup...")
        time.sleep(3)
print(f"workflow_id: {workflow_id}")
Sample output
$ python3 github_action_dispatch_runid.py
dispatch workflow status: 204 | workflow identifier: Z7YPF6DD1YP2PTM
get jobs_url https://api.github.com/repos/OWNER/REPO/actions/runs/1321463229/jobs
get jobs_url https://api.github.com/repos/OWNER/REPO/actions/runs/1321463229/jobs
get jobs_url https://api.github.com/repos/OWNER/REPO/actions/runs/1321463229/jobs
get jobs_url https://api.github.com/repos/OWNER/REPO/actions/runs/1321475221/jobs
waiting for steps to be executed...
get jobs_url https://api.github.com/repos/OWNER/REPO/actions/runs/1321463229/jobs
get jobs_url https://api.github.com/repos/OWNER/REPO/actions/runs/1321475221/jobs
waiting for steps to be executed...
get jobs_url https://api.github.com/repos/OWNER/REPO/actions/runs/1321463229/jobs
get jobs_url https://api.github.com/repos/OWNER/REPO/actions/runs/1321475221/jobs
waiting for steps to be executed...
get jobs_url https://api.github.com/repos/OWNER/REPO/actions/runs/1321463229/jobs
get jobs_url https://api.github.com/repos/OWNER/REPO/actions/runs/1321475221/jobs
get jobs_url https://api.github.com/repos/OWNER/REPO/actions/runs/1321463229/jobs
workflow_id: 1321475221
It would have been easier if there were a way to retrieve the workflow inputs via the API, but there is no way to do that at the moment.
Note that in the workflow file I use ${{ github.event.inputs.id }} because ${{ inputs.id }} doesn't work; it seems the inputs context is not evaluated when used as the step name.
Get the WORKFLOWID:
gh workflow list --repo <repo-name>
Trigger a workflow of type workflow_dispatch:
gh workflow run $WORKFLOWID --repo <repo-name>
This does not return the run-id, which is required to get the status of the execution.
Get the latest run-id, WORKFLOW_RUNID:
gh run list -w $WORKFLOWID --repo <repo> -L 1 --json databaseId | jq '.[]| .databaseId'
Get the workflow run details:
gh run view --repo <repo> $WORKFLOW_RUNID
This is the workaround we use. It is not perfect, but it should work.
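For reference, the same gh steps scripted in Python; this is just a sketch, with the repo and workflow names as placeholders, and it assumes gh is installed and authenticated:
import json
import subprocess
import time

REPO = "OWNER/REPO"          # placeholder
WORKFLOW = "dispatch.yaml"   # workflow file name or id, as shown by `gh workflow list`

def gh(*args: str) -> str:
    """Run a gh CLI command and return its stdout."""
    return subprocess.run(
        ["gh", *args], check=True, capture_output=True, text=True
    ).stdout

# Trigger the workflow_dispatch event (prints nothing useful by itself).
gh("workflow", "run", WORKFLOW, "--repo", REPO)

# Give GitHub a moment to register the new run.
time.sleep(5)

# Take the most recent run of that workflow; the race condition described
# above still applies if two dispatches land at nearly the same time.
out = gh("run", "list", "-w", WORKFLOW, "--repo", REPO, "-L", "1", "--json", "databaseId")
run_id = json.loads(out)[0]["databaseId"]

# Show the run details for that id.
print(gh("run", "view", "--repo", REPO, str(run_id)))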
Inspired by the comment above, I made a /bin/bash script which gets your $run_id.
name: ID Example
on:
  workflow_dispatch:
    inputs:
      id:
        description: 'run identifier'
        required: false
jobs:
  id:
    name: Workflow ID Provider
    runs-on: ubuntu-latest
    steps:
      - name: ${{github.event.inputs.id}}
        run: echo run identifier ${{ inputs.id }}
The script:
generates a random 8-digit number (WORKFLOW_ID)
builds a time filter from NOW and LATER (now minus 5 minutes) into DATE_FILTER
POSTs the dispatch call to trigger the workflow
GETs actions/runs in descending order and takes the first .workflow_runs[].id
keeps looping until a step name matches the random ID from the first step
echoes the run_id
TOKEN="" \
GH_USER="" \
REPO="" \
REF=""
WORKFLOW_ID=$(tr -dc '0-9' </dev/urandom | head -c 8) \
NOW=$(date +"%Y-%m-%dT%H:%M") \
LATER=$(date -d "-5 minutes" +"%Y-%m-%dT%H:%M") \
DATE_FILTER=$(echo "$NOW-$LATER") \
JSON=$(cat <<-EOF
{"ref":"$REF","inputs":{"id":"$WORKFLOW_ID"}}
EOF
) && \
curl -s \
-X POST \
-H "Accept: application/vnd.github+json" \
-H "Authorization: Bearer $TOKEN" \
"https://api.github.com/repos/$GH_USER/$REPO/actions/workflows/main.yml/dispatches" \
-d "$JSON" && \
INFO="null" \
COUNT=1 \
ATTEMPTS=10 && \
until [[ $CHECK -eq $WORKFLOW_ID ]] || [[ $COUNT -eq $ATTEMPTS ]];do
echo -e "$(( COUNT++ ))..."
INFO=$(curl -s \
-X GET \
-H "Accept: application/vnd.github+json" \
-H "Authorization: Bearer $TOKEN" \
"https://api.github.com/repos/$GH_USER/$REPO/actions/runs?created:<$DATE_FILTER" | jq -r '.workflow_runs[].id' | grep -m1 "")
CHECK=$(curl -s \
-X GET \
-H "Accept: application/vnd.github+json" \
-H "Authorization: Bearer $TOKEN" \
"https://api.github.com/repos/$GH_USER/$REPO/actions/runs/$INFO/jobs" | jq -r '.jobs[].steps[].name' | grep -o '[[:digit:]]*')
sleep 5s
done
echo "Your run_id is $CHECK"
output:
1...
2...
3...
Your run_id is 67530050
I recommend using the convictional/trigger-workflow-and-wait action:
- name: Example
  uses: convictional/trigger-workflow-and-wait@v1.6.5
  with:
    owner: my-org
    repo: other-repo
    workflow_file_name: other-workflow.yaml
    github_token: ${{ secrets.GH_TOKEN }}
    client_payload: '{"key1": "value1", "key2": "value2"}'
This takes care of waiting for the other job and returning a success or failure according to whether the other workflow succeeded or failed. It does so in a robust way that handles multiple runs being triggered at almost the same time.
The whole idea is to know which run was dispatched. When passing an id on dispatch was suggested, the expectation was that this id would appear in the response of the GET call to the actions/runs URL, so the user could identify the proper run to monitor. The injected id is not part of that response, so having to extract yet another URL to find your id is not much help: the id was needed at exactly this point to identify the run to monitor.

How can I print an Ansible vaulted variable that includes a Kubernetes secret from the CLI?

I have an Ansible group_vars directory with the following file within it:
$ cat inventory/group_vars/env1
...
...
ldap_config: !vault |
$ANSIBLE_VAULT;1.1;AES256
31636161623166323039356163363432336566356165633232643932623133643764343134613064
6563346430393264643432636434356334313065653537300a353431376264333463333238383833
31633664303532356635303336383361386165613431346565373239643431303235323132633331
3561343765383538340a373436653232326632316133623935333739323165303532353830386532
39616232633436333238396139323631633966333635393431373565643339313031393031313836
61306163333539616264353163353535366537356662333833653634393963663838303230386362
31396431636630393439306663313762313531633130326633383164393938363165333866626438
...
...
This Ansible-encrypted string encapsulates a Kubernetes secret: a base64 blob that looks something like this:
IyMKIyBIb3N0IERhdGFiYXNlCiMKIyBsb2NhbGhvc3QgaXMgdXNlZCB0byBjb25maWd1cmUgdGhlIGxvb3BiYWNrIGludGVyZmFjZQojIHdoZW4gdGhlIHN5c3RlbSBpcyBib290aW5nLiAgRG8gbm90IGNoYW5nZSB0aGlzIGVudHJ5LgojIwoxMjcuMC4wLjEJbG9jYWxob3N0CjI1NS4yNTUuMjU1LjI1NQlicm9hZGNhc3Rob3N0Cjo6MSAgICAgICAgICAgICBsb2NhbGhvc3QKIyBBZGRlZCBieSBEb2NrZXIgRGVza3RvcAojIFRvIGFsbG93IHRoZSBzYW1lIGt1YmUgY29udGV4dCB0byB3b3JrIG9uIHRoZSBob3N0IGFuZCB0aGUgY29udGFpbmVyOgoxMjcuMC4wLjEga3ViZXJuZXRlcy5kb2NrZXIuaW50ZXJuYWwKIyBFbmQgb2Ygc2VjdGlvbgo=
How can I decrypt this in a single CLI?
We can use an Ansible ad-hoc command to retrieve the variable of interest, ldap_config. To start, we'll use this ad-hoc command to retrieve the Ansible-encrypted vault string:
$ ansible -i "localhost," all \
-m debug \
-a 'msg="{{ ldap_config }}"' \
--vault-password-file=~/.vault_pass.txt \
-e@inventory/group_vars/env1
localhost | SUCCESS => {
"msg": "ABCD......."
Make note that we're:
using the debug module and having it print the variable, msg={{ ldap_config }}
giving Ansible the path to the vault password file it needs to decrypt encrypted strings
using the notation -e@<...path to file...> to pass the file with the encrypted vault variables
Now we can use Jinja2 filters to do the rest of the parsing:
$ ansible -i "localhost," all \
-m debug \
-a 'msg="{{ ldap_config | b64decode | from_yaml }}"' \
--vault-password-file=~/.vault_pass.txt \
-e@inventory/group_vars/env1
localhost | SUCCESS => {
"msg": {
"apiVersion": "v1",
"bindDN": "uid=readonly,cn=users,cn=accounts,dc=mydom,dc=com",
"bindPassword": "my secret password to ldap",
"ca": "",
"insecure": true,
"kind": "LDAPSyncConfig",
"rfc2307": {
"groupMembershipAttributes": [
"member"
],
"groupNameAttributes": [
"cn"
],
"groupUIDAttribute": "dn",
"groupsQuery": {
"baseDN": "cn=groups,cn=accounts,dc=mydom,dc=com",
"derefAliases": "never",
"filter": "(objectclass=groupOfNames)",
"scope": "sub"
},
"tolerateMemberNotFoundErrors": false,
"tolerateMemberOutOfScopeErrors": false,
"userNameAttributes": [
"uid"
],
"userUIDAttribute": "dn",
"usersQuery": {
"baseDN": "cn=users,cn=accounts,dc=mydom,dc=com",
"derefAliases": "never",
"scope": "sub"
}
},
"url": "ldap://192.168.1.10:389"
}
}
NOTE: The argument -a 'msg="{{ ldap_config | b64decode | from_yaml }}"' is what does the heavy lifting here, decoding the Base64 and parsing the YAML.
References
How to run Ansible without hosts file
https://docs.ansible.com/ansible/latest/user_guide/playbooks_filters.html#filters-for-formatting-data
Base64 Decode String in jinja
How to decrypt string with ansible-vault 2.3.0
If you need a one-liner that works with any YAML file containing inline vault variables (not only inventory files), and you are willing to install a pip package for it, there is a solution using yq, a YAML processor built on top of jq.
Prerequisite: install yq
pip install yq
Usage
You can get your result with the following command:
yq -r .ldap_config inventory/group_vars/env1 | ansible-vault decrypt
If you need to type your vault password interactively, don't forget to add the relevant option:
yq -r .ldap_config inventory/group_vars/env1 | ansible-vault --ask-vault-pass decrypt
Note: the -r option to yq is mandatory to get a raw result without the quotation marks around the value.
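If you would rather stay in Python than chain CLI tools, a rough sketch using Ansible's internal vault classes could look like this. It assumes ansible and yq are pip-installed, reuses the vault password file and variable name from the examples above, and note that VaultLib/VaultSecret are Ansible internals rather than a stable public API:
import base64
import os
import subprocess

import yaml
from ansible.constants import DEFAULT_VAULT_ID_MATCH
from ansible.parsing.vault import VaultLib, VaultSecret

# Read the vault password used by --vault-password-file above.
with open(os.path.expanduser("~/.vault_pass.txt")) as f:
    password = f.read().strip()

# Pull the !vault block for ldap_config out of the group_vars file with yq.
encrypted = subprocess.run(
    ["yq", "-r", ".ldap_config", "inventory/group_vars/env1"],
    check=True, capture_output=True, text=True,
).stdout

vault = VaultLib([(DEFAULT_VAULT_ID_MATCH, VaultSecret(password.encode()))])
decrypted = vault.decrypt(encrypted)  # bytes: the base64 blob from the question

# b64decode + YAML parse, mirroring the b64decode | from_yaml filters above.
print(yaml.safe_load(base64.b64decode(decrypted)))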

What is the format of the JSON for a Jenkins REST buildWithParameters to override the default parameters values

I am able to build a Jenkins job with its parameters' default values by sending a POST call to
http://jenkins:8080/view/Orion_phase_2/job/test_remote_api_triggerring/buildWithParameters
and I can override the default parameters "product", "suites" and "markers" by sending to this URL:
http://jenkins:8080/view/Orion_phase_2/job/test_remote_api_triggerring/buildWithParameters?product=ALL&suites=ALL&markers=ALL
But I saw examples where the parameters can be overridden by sending a JSON body with new values. I am trying to do that by sending the following JSON bodies; neither of them works for me.
{
  'product': 'ALL',
  'suites': 'ALL',
  'markers': 'ALL'
}
and
{
  "parameter": [
    {
      "name": "product",
      "value": "ALL"
    },
    {
      "name": "suites",
      "value": "ALL"
    },
    {
      "name": "markers",
      "value": "ALL"
    }
  ]
}
What JSON to send if I want to override the values of parameters "product", "suites" & "markers"?
I'll leave the original question as is and elaborate here on the various API calls for triggering parameterized builds. These are the call options that I used.
Additional documentation: https://wiki.jenkins.io/display/JENKINS/Remote+access+API
The job contains 3 parameters named: product, suites, markers
Send the parameters as URL query parameters to /buildWithParameters:
http://jenkins:8080/view/Orion_phase_2/job/test_remote_api_triggerring/buildWithParameters?product=ALL&suites=ALL&markers=ALL
Send the parameters as a JSON data/payload to /build:
http://jenkins:8080/view/Orion_phase_2/job/test_remote_api_triggerring/build
The JSON data/payload is not sent as the call's JSON body (which is what confused me), but rather as form data, like this:
json:'{
  "parameter": [
    {"name":"product", "value":"123"},
    {"name":"suites", "value":"high"},
    {"name":"markers", "value":"Hello"}
  ]
}'
And here are the CURL commands for each of the above calls:
curl -X POST -H "Jenkins-Crumb:2e11fc9...0ed4883a14a" http://jenkins:8080/view/Orion_phase_2/job/test_remote_api_triggerring/build --user "raameeil:228366f31...f655eb82058ad12d" --form json='{"parameter": [{"name":"product", "value":"123"}, {"name":"suites", "value":"high"}, {"name":"markers", "value":"Hello"}]}'
curl -X POST \
'http://jenkins:8080/view/Orion_phase_2/job/test_remote_api_triggerring/buildWithParameters?product=234&suites=333&markers=555' \
-H 'authorization: Basic c2hsb21pb...ODRlNjU1ZWI4MjAyOGFkMTJk' \
-H 'cache-control: no-cache' \
-H 'jenkins-crumb: 0bed4c7...9031c735a' \
-H 'postman-token: 0fb2ef51-...-...-...-6430e9263c3b'
What to send to Python's requests
In order to send the above calls in Python you will need to pass:
headers = the Jenkins-Crumb header
auth = a tuple of your (user_name, user_auth_token)
data = a dictionary of the form { 'json': <JSON string of {"parameter": [....]}> }
curl -v -X POST http://user:token@host:port/job/my-job/build --data-urlencode json='{"parameter": [{"name":"xx", "value":"xxx"}]}'
or use Python requests:
import requests
import json

url = "http://user:token@host:port/job/my-job/build"
payload = {"parameter": [
    {"name": "xx", "value": "xxx"},
]}
data = {'json': json.dumps(payload)}
rep = requests.post(url, data=data)

MongoDB to BigQuery

What is the best way to export data from MongoDB hosted in mlab to google bigquery?
Initially, I am trying to do a one-time load from MongoDB to BigQuery, and later on I am thinking of using Pub/Sub for a real-time data flow to BigQuery.
I need help with the first one-time load from MongoDB to BigQuery.
In my opinion, the best practice is building your own extractor. That can be done with the language of your choice and you can extract to CSV or JSON.
But if you are looking for a fast way, and your data is not huge and can fit on one server, then I recommend using mongoexport. Let's assume you have a simple document structure such as the one below:
{
  "_id" : "tdfMXH0En5of2rZXSQ2wpzVhZ",
  "statuses" : [
    {
      "status" : "dc9e5511-466c-4146-888a-574918cc2534",
      "score" : 53.24388894
    }
  ],
  "stored_at" : ISODate("2017-04-12T07:04:23.545Z")
}
Then you need to define your BigQuery Schema (mongodb_schema.json) such as:
$ cat > mongodb_schema.json <<EOF
[
{ "name":"_id", "type": "STRING" },
{ "name":"stored_at", "type": "record", "fields": [
{ "name":"date", "type": "STRING" }
]},
{ "name":"statuses", "type": "record", "mode": "repeated", "fields": [
{ "name":"status", "type": "STRING" },
{ "name":"score", "type": "FLOAT" }
]}
]
EOF
Now, the fun part starts :-) Extracting data as JSON from your MongoDB. Let's assume you have a cluster with replica set name statuses, your db is sample, and your collection is status.
mongoexport \
--host statuses/db-01:27017,db-02:27017,db-03:27017 \
-vv \
--db "sample" \
--collection "status" \
--type "json" \
--limit 100000 \
--out ~/sample.json
As you can see above, I limit the output to 100k records, because I recommend you run a sample load into BigQuery before doing it for all your data. After running the above command you should have your sample data in sample.json, BUT there is a $date field which will cause an error when loading into BigQuery. To fix that, we can use sed to replace it with a plain field name:
# Fix Date field to make it compatible with BQ
sed -i 's/"\$date"/"date"/g' sample.json
Now you can compress, upload to Google Cloud Storage (GCS), and then load to BigQuery using the following commands:
# Compress for faster load
gzip sample.json
# Move to GCloud
gsutil mv ./sample.json.gz gs://your-bucket/sample/sample.json.gz
# Load to BQ
bq load \
--source_format=NEWLINE_DELIMITED_JSON \
--max_bad_records=999999 \
--ignore_unknown_values=true \
--encoding=UTF-8 \
--replace \
"YOUR_DATASET.mongodb_sample" \
"gs://your-bucket/sample/*.json.gz" \
"mongodb_schema.json"
If everything went okay, go back and remove --limit 100000 from the mongoexport command, then re-run the above commands to load everything instead of the 100k sample.
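If you prefer to drive the load from Python instead of the bq CLI, here is a minimal sketch with the google-cloud-bigquery client that mirrors the flags and schema above; the dataset, table, and bucket names are the same placeholders:
from google.cloud import bigquery

client = bigquery.Client()

# Mirror of the bq load flags used above.
job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.NEWLINE_DELIMITED_JSON,
    max_bad_records=999999,
    ignore_unknown_values=True,
    write_disposition=bigquery.WriteDisposition.WRITE_TRUNCATE,  # --replace
    schema=[
        bigquery.SchemaField("_id", "STRING"),
        bigquery.SchemaField(
            "stored_at", "RECORD",
            fields=[bigquery.SchemaField("date", "STRING")],
        ),
        bigquery.SchemaField(
            "statuses", "RECORD", mode="REPEATED",
            fields=[
                bigquery.SchemaField("status", "STRING"),
                bigquery.SchemaField("score", "FLOAT"),
            ],
        ),
    ],
)

load_job = client.load_table_from_uri(
    "gs://your-bucket/sample/*.json.gz",
    "YOUR_DATASET.mongodb_sample",
    job_config=job_config,
)
load_job.result()  # wait for the load job to finish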
ALTERNATIVE SOLUTION:
If you want more flexibility and performance is not your concern, then you can use the mongo CLI tool as well. This way you can write your extract logic in JavaScript, execute it against your data, and then send the output to BigQuery. Here is what I did for the same process, but using JavaScript to output CSV so I could load it into BigQuery more easily:
# Export Logic in JavaScript
cat > export-csv.js <<EOF
var size = 100000;
var maxCount = 1;
for (x = 0; x < maxCount; x = x + 1) {
  var recToSkip = x * size;
  db.entities.find().skip(recToSkip).limit(size).forEach(function(record) {
    var row = record._id + "," + record.stored_at.toISOString();
    record.statuses.forEach(function (l) {
      print(row + "," + l.status + "," + l.score)
    });
  });
}
EOF
# Execute on Mongo CLI
_MONGO_HOSTS="db-01:27017,db-02:27017,db-03:27017/sample?replicaSet=statuses"
mongo --quiet \
"${_MONGO_HOSTS}" \
export-csv.js \
| split -l 500000 --filter='gzip > $FILE.csv.gz' - sample_
# Load all Splitted Files to Google Cloud Storage
gsutil -m mv ./sample_* gs://your-bucket/sample/
# Load files to BigQuery
bq load \
--source_format=CSV \
--max_bad_records=999999 \
--ignore_unknown_values=true \
--encoding=UTF-8 \
--replace \
"YOUR_DATASET.mongodb_sample" \
"gs://your-bucket/sample/sample_*.csv.gz" \
"ID,StoredDate:DATETIME,Status,Score:FLOAT"
TIP: In the above script I used a small trick: piping the output to split so that it is written to multiple files with the sample_ prefix. During the split it also gzips the output, so it is easier to load into GCS.
From a basic reading of MongoDB's documentation, it sounds like you can use mongoexport to dump your database as JSON. Once you've done that, refer to the BigQuery loading data topic for a description of how to create a table from JSON files after copying them to GCS.
You can read data from MongoDB and stream it to BigQuery. You can find an example in NodeJS here.
This is an extension of the linked example that prevents duplicated records (as long as they are still in the streaming buffer):
const { BigQuery } = require('@google-cloud/bigquery');
const bigqueryClient = new BigQuery();
...
const jsonData = // Array of documents from MongoDB
const inputRows = jsonData.map(row => ({
  insertId: row._id,
  json: row
}));
const insertOptions = {
  raw: true
};
await bigqueryClient
  .dataset(datasetId)
  .table(tableId)
  .insert(inputRows, insertOptions);