mongoimport JSON from Google Cloud Storage in an Airflow task

It seems that moving data from GCS to MongoDB is not common, since there is not much documentation on it. We have the following task, which we pass as the python_callable to a PythonOperator; this task moves data from BigQuery into GCS as JSON:
from google.cloud import bigquery

def transfer_gcs_to_mongodb(table_name):
    # connect
    client = bigquery.Client()
    bucket_name = "our-gcs-bucket"
    project_id = "ourproject"
    dataset_id = "ourdataset"
    destination_uri = f'gs://{bucket_name}/{table_name}.json'
    dataset_ref = bigquery.DatasetReference(project_id, dataset_id)
    table_ref = dataset_ref.table(table_name)
    # export as newline-delimited JSON (one document per line)
    configuration = bigquery.job.ExtractJobConfig()
    configuration.destination_format = 'NEWLINE_DELIMITED_JSON'
    extract_job = client.extract_table(
        table_ref,
        destination_uri,
        job_config=configuration,
        location="US",
    )  # API request
    extract_job.result()  # Waits for job to complete.
    print("Exported {}:{}.{} to {}".format(project_id, dataset_id, table_name, destination_uri))
This task is successfully getting data into GCS. However, we are stuck now when it comes to how to run mongoimport correctly, to get this data into MongoDB. In particular, it seems like mongoimport cannot point to the file in GCS, but rather it has to be downloaded locally first, and then imported into MongoDB.
How should this be done in Airflow? Should we write a shell script that downloads the JSON from GCS and then runs mongoimport with the correct URI and flags? Or is there another way to run mongoimport in Airflow that we are missing?

You don't need to write a shell script to download from GCS. You can simply use the GCSToLocalFilesystemOperator, then open the file and write it to MongoDB using the insert_many function of the MongoHook.
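I didn't test it, but it should be something like this (your_collection is a placeholder for the target collection name):
import json
from airflow.providers.mongo.hooks.mongo import MongoHook

# Newer provider releases may name this parameter mongo_conn_id instead of conn_id.
mongo = MongoHook(conn_id=mongo_conn_id)
with open('file.json') as f:
    # The export is newline-delimited JSON, so parse one document per line.
    file_data = [json.loads(line) for line in f if line.strip()]
mongo.insert_many('your_collection', file_data)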
This is for a pipeline of: BigQuery -> GCS -> Local File System -> MongoDB.
You can also do it in memory, as BigQuery -> GCS -> MongoDB, if you prefer.
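The in-memory variant could be sketched roughly as below. This is untested against a real deployment and rests on a few assumptions: gcs_hook is expected to behave like Airflow's GCSHook (whose download method returns the object's bytes when no filename is given), mongo_hook like MongoHook (insert_many taking the collection name and a list of documents), and the helper names parse_ndjson, chunked and gcs_to_mongo as well as the batch_size parameter are made up for illustration.

```python
import json
from typing import Any, Dict, Iterator, List


def parse_ndjson(payload: bytes) -> List[Dict[str, Any]]:
    """Parse newline-delimited JSON bytes into a list of documents."""
    # Split on b"\n" only; blank lines (e.g. a trailing newline) are skipped.
    return [json.loads(line) for line in payload.split(b"\n") if line.strip()]


def chunked(docs: List[Dict[str, Any]], size: int) -> Iterator[List[Dict[str, Any]]]:
    """Yield consecutive slices of docs with at most `size` items each."""
    for start in range(0, len(docs), size):
        yield docs[start:start + size]


def gcs_to_mongo(gcs_hook, mongo_hook, bucket_name: str, object_name: str,
                 collection: str, batch_size: int = 1000) -> int:
    """Download one NDJSON object from GCS and insert it into MongoDB in batches.

    Assumes gcs_hook.download(...) returns bytes and that
    mongo_hook.insert_many(collection, docs) inserts a list of documents.
    """
    payload = gcs_hook.download(bucket_name=bucket_name, object_name=object_name)
    docs = parse_ndjson(payload)
    for batch in chunked(docs, batch_size):
        mongo_hook.insert_many(collection, batch)
    return len(docs)
```

Inside the python_callable you would pass in real hooks, for example GCSHook() from airflow.providers.google.cloud.hooks.gcs and a MongoHook built as in the snippet above. Note that the whole object is held in memory, so this only suits exports that comfortably fit in the worker's RAM.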

Related

Best practice for importing bulk data to AWS RDS PostgreSQL database

I have a big AWS RDS database that needs to be updated with data on a periodic basis. The data is in JSON files stored in S3 buckets.
This is my current flow:
Download all the JSON files locally
Run a ruby script to parse the JSON files to generate a CSV file matching the table in the database
Connect to RDS using psql
Use \copy command to append the data to the table
I would like to switch this to an automated approach (maybe using AWS Lambda). What would be the best practice?
Approach 1:
Run a script (Ruby / JS) that parses all folders in the past period (e.g., week) and within the parsing of each file, connect to the RDS db and execute an INSERT command. I feel this is a very slow process with constant writes to the database and wouldn't be optimal.
Approach 2:
I already have a Ruby script that parses local files to generate a single CSV. I can modify it to parse the S3 folders directly and create a temporary CSV file in S3. The question is - how do I then use this temporary file to do a bulk import?
Are there any other approaches that I have missed and might be better suited for my requirement?
Thanks.
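For reference, the CSV-generation step from the current flow (done with a Ruby script in the question) might look like this in Python. It is only a sketch: the function name json_records_to_csv and the columns argument are hypothetical, and it assumes each JSON file holds either a single object or an array of objects whose keys match the table's column names.

```python
import csv
import io
import json
from typing import Iterable, List


def json_records_to_csv(json_texts: Iterable[str], columns: List[str]) -> str:
    """Convert JSON documents into one CSV string with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=columns, lineterminator="\n")
    writer.writeheader()
    for text in json_texts:
        data = json.loads(text)
        # Accept either a single object or a list of objects per file.
        records = data if isinstance(data, list) else [data]
        for record in records:
            # Keep only the table's columns; missing keys become empty fields.
            writer.writerow({col: record.get(col, "") for col in columns})
    return buffer.getvalue()
```

Because the output starts with a header row, the psql \copy call would need the HEADER option, for example WITH (FORMAT csv, HEADER true).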

Cannot read data from Azure Mongo after copying from Storage Gen2 using Azure Data factory

Source: Azure Storage Gen 2 (file with 10 json lines)
Sink: Azure Cosmos with Mongo API
I used an Azure Data Factory pipeline (Copy activity) to move the file data to a Mongo collection. The copy is successful, but when I run find({}) on my collection, it returns 0 records. When I run stats(), it shows the count as 10, which is expected. I cannot figure out what the issue is when reading these records from Robo3T to query Mongo DB.
I created a second pipeline to read the data from Mongo and write it to Azure Storage, to test whether the data really is present in Mongo. I was able to write all 10 records to storage. This proves the data is present in Mongo, but I cannot read/access it.
You won't be able to read the data in the collection directly. You must use the Mongo Shell via the Azure Portal: go to your Azure Cosmos DB resource -> Data Explorer -> Mongo Shell. If there are any specific errors, refer to the troubleshooting document.

AWS migrate data from MongoDB to DynamoDB/S3/Redshift

The issue is that migrating data from MongoDB to DynamoDB/S3/Redshift is, as I understand it, currently not possible for us via AWS DMS, as it does not support all data types. Or maybe I'm wrong.
The problem is that our Mongo objects contain non-scalar fields (arrays, maps).
So when I create a migration task via AWS DMS in table mode, it pulls the data badly. For some reason only selection rules work. Transformation rules are ignored by DMS (I tried renaming and removing).
In document mode everything is OK, but how can I run the migration with a custom script for transformation? Data stored this way still needs transformation.
We need some modifications like renaming fields, removing fields, and flattening some fields (for example, we have a map object and it should be flattened into several scalar fields).
The migration should target one of: S3, DynamoDB, Redshift.
I will be thankful for any help and suggestions.
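To make the requested transformations concrete, here is a minimal Python sketch of renaming, removing and flattening fields on a single document. It is not tied to DMS in any way: the function name transform_document and its rename, remove and sep parameters are invented for illustration, and arrays are deliberately left untouched because the question does not say how they should be handled.

```python
from collections.abc import Mapping
from typing import Any, Dict, Iterable, Optional


def transform_document(doc: Dict[str, Any],
                       rename: Optional[Dict[str, str]] = None,
                       remove: Iterable[str] = (),
                       sep: str = "_") -> Dict[str, Any]:
    """Drop, rename and flatten top-level fields of one document."""
    rename = rename or {}
    removed = set(remove)
    flat: Dict[str, Any] = {}

    def _flatten(prefix: str, value: Any) -> None:
        if isinstance(value, Mapping):
            # A map becomes several scalar fields: parent_child, parent_child_grandchild, ...
            for key, child in value.items():
                _flatten(f"{prefix}{sep}{key}", child)
        else:
            # Scalars and arrays are copied as they are.
            flat[prefix] = value

    for key, value in doc.items():
        if key in removed:
            continue
        _flatten(rename.get(key, key), value)
    return flat
```

For example, calling it with rename={"addr": "address"} on a document containing {"addr": {"city": "X"}} yields a single scalar field named address_city.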
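Use the following command to take a backup of the MongoDB database:
mongodump -h localhost:27017 -d my_db_name -o $DEST
Use the command below to sync your backup to an S3 bucket (this assumes $DEST above is ~/db_backups):
aws s3 sync ~/db_backups s3://my-bucket-name
Once your data is in S3, you can load it into Redshift using the COPY command. Note that mongodump writes BSON, which COPY does not read, so the data first needs to be in a format COPY supports, such as JSON or CSV.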

Delete redshift table from within databricks using pyspark

I connected to a Redshift system table called stv_sessions and I can read the data into a dataframe.
This stv_sessions table is a Redshift system table that holds the process IDs of all the queries that are currently running.
To terminate a running query, we can do this:
select pg_terminate_backend(pid)
While this works for me if I connect to Redshift directly (using Aginity), it gives me insufficient privilege errors when I try to run it from Databricks.
Simply put, I don't know how to run the query from a Databricks notebook.
I have tried this so far:
kill_query = "select pg_terminate_backend('12345')"
some_random_df_i_created.write.format("com.databricks.spark.redshift").option("url",redshift_url).option("dbtable","stv_sessions").option("tempdir", temp_dir_loc).option("forward_spark_s3_credentials", True).option("preactions", kill_query).mode("append").save()
Please let me know if the methodology I follow is correct.
Thank you
Databricks purposely does not bundle this driver. You need to download the official Amazon Redshift JDBC driver, upload it to Databricks, and attach the library to your cluster (v1.2.12 or lower is recommended with Databricks clusters). Then, use JDBC URLs of the form:
val jdbcUsername = "REPLACE_WITH_YOUR_USER"
val jdbcPassword = "REPLACE_WITH_YOUR_PASSWORD"
val jdbcHostname = "REPLACE_WITH_YOUR_REDSHIFT_HOST"
val jdbcPort = 5439
val jdbcDatabase = "REPLACE_WITH_DATABASE"
val jdbcUrl = s"jdbc:redshift://${jdbcHostname}:${jdbcPort}/${jdbcDatabase}?user=${jdbcUsername}&password=${jdbcPassword}"
jdbcUsername: String = REPLACE_WITH_YOUR_USER
jdbcPassword: String = REPLACE_WITH_YOUR_PASSWORD
jdbcHostname: String = REPLACE_WITH_YOUR_REDSHIFT_HOST
jdbcPort: Int = 5439
jdbcDatabase: String = REPLACE_WITH_DATABASE
jdbcUrl: String = jdbc:redshift://REPLACE_WITH_YOUR_REDSHIFT_HOST:5439/REPLACE_WITH_DATABASE?user=REPLACE_WITH_YOUR_USER&password=REPLACE_WITH_YOUR_PASSWORD
Then try putting jdbcUrl in place of your redshift_url.
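Since the question uses PySpark, the same URL can be assembled in Python. This is just a direct translation of the Scala snippet above; the function name build_redshift_jdbc_url is made up, and the values are placeholders you need to replace.

```python
def build_redshift_jdbc_url(hostname: str, database: str,
                            username: str, password: str,
                            port: int = 5439) -> str:
    """Build a Redshift JDBC URL in the same shape as the Scala example."""
    return (
        f"jdbc:redshift://{hostname}:{port}/{database}"
        f"?user={username}&password={password}"
    )


# Placeholder values, mirroring the Scala snippet.
redshift_url = build_redshift_jdbc_url(
    hostname="REPLACE_WITH_YOUR_REDSHIFT_HOST",
    database="REPLACE_WITH_DATABASE",
    username="REPLACE_WITH_YOUR_USER",
    password="REPLACE_WITH_YOUR_PASSWORD",
)
```

The resulting redshift_url can then be passed to .option("url", redshift_url) in the write call from the question.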
That may be the only reason you are getting privilege issues.
Link1:https://docs.databricks.com/_static/notebooks/redshift.html
Link2:https://docs.databricks.com/data/data-sources/aws/amazon-redshift.html#installation
Another reason could be that the Redshift-Databricks connector only uses SSL (encryption in flight), and it is possible that IAM roles have been set on your Redshift cluster to only allow some users to delete tables.
Apologies if none of this helps your case.

HBase Export/Import: Unable to find output directory

I am using HBase for my application and I am trying to export the data using org.apache.hadoop.hbase.mapreduce.Export, as directed here. The issue is that the command runs without any errors, but the specified output directory does not appear where expected. The command I used was:
$ bin/hbase org.apache.hadoop.hbase.mapreduce.Export table_name db_dump/
I found the solution, so I am answering my own question.
You must have the following two lines in hadoop-env.sh in the conf directory of Hadoop:
export HBASE_HOME=/home/sitepulsedev/hbase/hbase-0.90.4
export HADOOP_CLASSPATH=$HBASE_HOME/hbase-0.90.4.jar:$HBASE_HOME/conf:$HBASE_HOME/hbase-0.90.4-test.jar:$HBASE_HOME/lib/zookeeper-3.3.2.jar:$HBASE_HOME
Save it and restart MapReduce with ./stop-mapred.sh and ./start-mapred.sh.
Now run the following in the bin directory of Hadoop:
./hadoop jar ~/hbase/hbase-0.90.4/hbase-0.90.4.jar export your_table /export/your_table
Now you can verify the dump by running:
./hadoop fs -ls /export
Finally, you need to copy the whole thing to your local file system, for which you run:
./hadoop fs -copyToLocal /export/your_table ~/local_dump/your_table
Here are the references that helped me with export/import and the Hadoop shell commands.
Hope this one helps you out!!
As you noticed, the HBase export tool creates the backup in HDFS. If you instead want the output written to your local FS, you can use a file URI. In your example it would be something similar to:
bin/hbase org.apache.hadoop.hbase.mapreduce.Export table_name file:///tmp/db_dump/
Related to your own answer, this would also avoid going through HDFS. Just be very careful if you are running this in a cluster of servers, because each server will write the result files to its own local file system.
This is true for HBase 0.94.6 at least.
Hope this helps
I think the previous answer needs some modification:
Platform: AWS EC2,
OS: Amazon Linux
Hbase Version: 0.96.1.1
Hadoop Distribution: Cloudera CDH5.0.1
MR engine: MRv1
To export data from an HBase table to the local filesystem:
sudo -u hdfs /usr/bin/hbase org.apache.hadoop.hbase.mapreduce.Export -Dmapred.job.tracker=local "table_name" "file:///backups/"
This command will dump the data as sequence files (the output format the Export tool writes), with the number of files equal to the number of regions of that table in HBase.