Migrating a huge Bigtable database in GCP from one account to another using DataFlow - gcloud

I have a huge database stored in Bigtable in GCP. I am migrating the Bigtable data from one GCP account to another using Dataflow.
But when I created a job to export the Bigtable data as sequence files, it created 3000 sequence files in the destination bucket.
It is not feasible to create a separate Dataflow job for each of the 3000 sequence files,
so is there any way to reduce the number of sequence files, or a way to provide all 3000 sequence files at once to a single Dataflow job template in GCP?
We have two sequence files and wanted the data uploaded sequentially, one after the other (10 rows and one column), but the result actually uploaded is 5 rows and 2 columns.

The sequence files should have some sort of pattern to their naming, e.g. gs://mybucket/somefolder/output-1, gs://mybucket/somefolder/output-2, gs://mybucket/somefolder/output-3, etc.
When running the Cloud Storage SequenceFile to Bigtable Dataflow template, set the sourcePattern parameter to the prefix of that pattern, such as gs://mybucket/somefolder/output-* or gs://mybucket/somefolder/*.
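For illustration, here is a minimal Python sketch of launching that Google-provided template programmatically with a wildcard sourcePattern. The project, region, instance, table and bucket names are placeholders, and the template path and parameter names should be double-checked against the template version you actually use:

    # Launch the "Cloud Storage SequenceFile to Bigtable" template once,
    # pointing sourcePattern at all 3000 files via a wildcard.
    from googleapiclient.discovery import build  # pip install google-api-python-client

    PROJECT = "destination-project"   # placeholder: target GCP project
    REGION = "us-central1"            # placeholder: Dataflow region

    dataflow = build("dataflow", "v1b3")
    request = dataflow.projects().locations().templates().launch(
        projectId=PROJECT,
        location=REGION,
        gcsPath="gs://dataflow-templates/latest/GCS_SequenceFile_to_Cloud_Bigtable",
        body={
            "jobName": "import-sequencefiles",
            "parameters": {
                "bigtableProject": PROJECT,
                "bigtableInstanceId": "my-instance",   # placeholder
                "bigtableTableId": "my-table",         # placeholder
                # One pattern matches all the exported files, so a single job imports them.
                "sourcePattern": "gs://mybucket/somefolder/output-*",
            },
        },
    )
    print(request.execute())

The same thing can be done from the console or gcloud by supplying that sourcePattern value when creating the job from the template.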

Related

Partition data by multiple partition keys - Azure ADF

I have some data in an on-prem SQL table. The data is huge, ~100GB. It has many columns, but the two important ones are d_type and d_date.
The unique values of d_type are 1, 10 and 100, and d_date ranges from 2022-01-01 to 2022-03-30.
I want to load this data into Azure using a copy activity or a data flow, but in a partitioned fashion, like the following format:
someDir/d_type=1/
2022-01/somedata.parquet
2022-02/somedata.parquet
2022-03/somedata.parquet
someDir/d_type=10/
2022-01/somedata.parquet
2022-02/somedata.parquet
2022-03/somedata.parquet
someDir/d_type=100/
2022-01/somedata.parquet
2022-02/somedata.parquet
2022-03/somedata.parquet
I have tried with the copy activity:
The copy activity can only use one partition key.
If I partition by d_type, it creates parquet files with random bins, i.e. 1-20 (which contains only data for d_type=1), while another file could have bins 20-30 (which contains no data).
Data flows allow multiple partition keys, but I cannot use one, since I'd first have to copy the entire data from on-prem SQL to Azure and then process it (a data flow can only work with source linked services connected via an Azure IR, not a SHIR).
Anyone got tips on how to solve this?
We ended up using custom Python scripts, because the copy activity doesn't support partitioning by multiple keys and we couldn't use a data flow for the business reasons explained in the question.
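As an illustration of what such a script can look like, here is a minimal sketch that reads the table in chunks and writes Parquet partitioned by d_type and month with pyarrow. The connection string, table name and output path are placeholders, and the layout uses Hive-style key=value directories rather than the exact folder names in the question:

    import pandas as pd
    import pyarrow as pa
    import pyarrow.parquet as pq
    from sqlalchemy import create_engine  # pip install pandas pyarrow sqlalchemy pyodbc

    # Placeholder on-prem SQL Server connection.
    engine = create_engine(
        "mssql+pyodbc://user:pass@onprem-sql/mydb?driver=ODBC+Driver+17+for+SQL+Server"
    )

    for chunk in pd.read_sql("SELECT * FROM my_table", engine, chunksize=500_000):
        # Derive the month partition (e.g. 2022-01) from d_date.
        chunk["d_month"] = pd.to_datetime(chunk["d_date"]).dt.strftime("%Y-%m")
        pq.write_to_dataset(
            pa.Table.from_pandas(chunk),
            root_path="someDir",                   # local path, or a mounted/remote filesystem
            partition_cols=["d_type", "d_month"],  # -> someDir/d_type=1/d_month=2022-01/...
        )

Chunked reads keep memory bounded for the ~100GB table, and each chunk gets appended to the matching partition directories.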

ETL with Dataprep - Union Dataset

I'm a newcomer to GCP; I'm learning every day and I'm loving this platform.
I'm using GCP's Dataprep to join several CSV files (with the same column structure), transform some data and write it to BigQuery.
I created a storage bucket to put all 60 CSV files in. In Dataprep, can I define a dataset to be the union of all these files, or do I have to create a dataset for each file?
Thank you very much for your time and attention.
If you have all your files inside a directory in GCS you can import that directory as a single dataset. The process is the same as importing single files. You have to make sure, though, that the column structure is exactly the same for all the files inside the directory.
If you create a separate dataset for each file, you have more flexibility about their structure when you use the UNION page to concatenate them.
However, if your use case is just to load all the files (~60) into a single table in BigQuery without any transformation, I would suggest just using a BigQuery load job. You can use a wildcard in the Cloud Storage URI to specify the files you want. Currently, BigQuery load jobs are free of charge, so it would be a very cost-effective solution compared to using Dataprep.
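For reference, a minimal sketch of that load job using the BigQuery Python client (bucket, dataset and table names are placeholders):

    from google.cloud import bigquery  # pip install google-cloud-bigquery

    client = bigquery.Client()
    job_config = bigquery.LoadJobConfig(
        source_format=bigquery.SourceFormat.CSV,
        skip_leading_rows=1,   # assumes each CSV has a header row
        autodetect=True,       # infer the schema from the files
    )
    load_job = client.load_table_from_uri(
        "gs://my-bucket/csv-folder/*.csv",       # wildcard matches all ~60 files
        "my-project.my_dataset.my_table",
        job_config=job_config,
    )
    load_job.result()  # wait for the job to finish
    print(client.get_table("my-project.my_dataset.my_table").num_rows, "rows loaded")

The same thing can be done with the bq command-line tool or from the BigQuery UI using the wildcard URI.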

Cloud SQL: export data to CSV periodically avoiding duplicates

I want to export the data from Cloud SQL (Postgres) to a CSV file periodically (once a day, for example), and each time rows are exported they must not be exported again in the next export task.
I'm currently using a POST request to perform the export task, triggered by Cloud Scheduler. The problem here (or at least as far as I know) is that it isn't possible to export and delete (or update the rows to mark them as exported) in a single HTTP export request.
Is there any possibility to delete (or update) the rows which have been exported automatically, with any Cloud SQL parameter in the HTTP export request?
If not, I assume it should be done in a Cloud Function triggered by Pub/Sub (using Scheduler to send a message once a day to Pub/Sub), but is there an optimal way to collect the IDs of the rows retrieved by the select statement (which will be used in the export) so they can be deleted (or updated) later?
You can export and delete (or update) at the same time using RETURNING.
\copy (DELETE FROM pgbench_accounts WHERE aid<1000 RETURNING *) to foo.txt
The problem would be in the face of crashes. How can you know that foo.txt has been written and flushed to disk before the DELETE is allowed to commit? Or the reverse: foo.txt is partially (or fully) written, but a crash prevents the DELETE from committing.
Can't you make the system idempotent, so that exporting the same row more than once doesn't create problems?
You could use a setup like the following to achieve what you are looking for:
1. Create a Cloud Function that extracts the information from the database and subscribes to a Pub/Sub topic.
2. Create a Pub/Sub topic to trigger that function.
3. Create a Cloud Scheduler job that invokes the Pub/Sub trigger.
4. Run the Cloud Scheduler job.
5. Then create a trigger that activates another Cloud Function to delete all the required data from the database once the CSV has been created (a rough sketch combining steps 1 and 5 is shown after the links below).
Here I leave you some documents which could help you if you decide to follow this path.
Using Pub/Sub to trigger a Cloud Function: https://cloud.google.com/scheduler/docs/tut-pub-sub
Connecting to Cloud SQL from Cloud Functions: https://cloud.google.com/sql/docs/mysql/connect-functions
Cloud Storage tutorial: https://cloud.google.com/functions/docs/tutorials/storage
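As a rough sketch of steps 1 and 5 combined, here is one Pub/Sub-triggered function (Python runtime) that selects the rows, writes the CSV to Cloud Storage and only then commits the delete. The connection name, credentials, table, columns and bucket are placeholders:

    import csv
    import io
    import pg8000.dbapi                   # pip install pg8000 google-cloud-storage
    from google.cloud import storage

    def export_and_delete(event, context):
        conn = pg8000.dbapi.connect(
            user="postgres",
            password="secret",
            database="postgres",
            unix_sock="/cloudsql/my-project:us-central1:my-instance/.s.PGSQL.5432",
        )
        cur = conn.cursor()
        # DELETE ... RETURNING hands back the exported rows and removes them in one statement.
        cur.execute("DELETE FROM my_table WHERE exported_at IS NULL RETURNING id, col1, col2")
        rows = cur.fetchall()

        buf = io.StringIO()
        csv.writer(buf).writerows(rows)
        storage.Client().bucket("my-export-bucket").blob(
            f"exports/{context.event_id}.csv"
        ).upload_from_string(buf.getvalue(), content_type="text/csv")

        conn.commit()   # commit the delete only after the CSV upload succeeded
        cur.close()
        conn.close()

Committing after the upload reduces (but does not fully remove) the crash window mentioned by jjanes: if the function dies between the upload and the commit, the same rows would simply be exported again on the next run.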
Another method, aside from @jjanes's, would be to partition your database table by date. This would allow you to create an index on the date, making it very easy to export or delete a day's entries. With this implementation, you could also create a cron job that drops all partitions older than X days.
The documentation provided will walk you through setting up a ranged partition:
The table is partitioned into “ranges” defined by a key column or set of columns, with no overlap between the ranges of values assigned to different partitions. For example, one might partition by date ranges, or by ranges of identifiers for particular business objects.
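Sketched below (with placeholder table and column names) is what that ranged partitioning looks like on Postgres 10+, run here through psycopg2 so it can be scripted; exporting or dropping a month then only touches one partition:

    import psycopg2  # pip install psycopg2-binary

    ddl = """
    CREATE TABLE events (
        id         bigint NOT NULL,
        payload    text,
        created_at date   NOT NULL
    ) PARTITION BY RANGE (created_at);

    CREATE TABLE events_2022_01 PARTITION OF events
        FOR VALUES FROM ('2022-01-01') TO ('2022-02-01');
    CREATE TABLE events_2022_02 PARTITION OF events
        FOR VALUES FROM ('2022-02-01') TO ('2022-03-01');
    """

    with psycopg2.connect("host=127.0.0.1 dbname=postgres user=postgres") as conn:
        with conn.cursor() as cur:
            cur.execute(ddl)
            # Dropping old data later is a single cheap statement:
            # cur.execute("DROP TABLE events_2022_01")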
Thank you for all your answers. There are multiple ways of doing this, so I'm going to explain how I did it.
In the database I have included a column which contains the date when the data was inserted.
I used a cloud scheduler with the following body:
{"exportContext":{"fileType": "CSV", "csvExportOptions" :{"selectQuery" : "select \"column1\", \"column2\",... , \"column n\" from public.\"tablename\" where \"Insertion_Date\" = CURRENT_DATE - 1" },"uri": "gs://bucket/filename.csv","databases": ["postgres"]}}
This scheduler job is triggered once a day and exports only the data from the previous day.
Also, note that in the query used in Cloud Scheduler you can choose which columns to export; this way you can avoid exporting the Insertion_Date column and use it only as an auxiliary column.
Finally, the export triggered by Cloud Scheduler will automatically create the CSV file in the bucket.
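For completeness, the same body can be tested outside Cloud Scheduler by POSTing it to the Cloud SQL Admin API yourself, for example from Python (project, instance, bucket and column names are placeholders):

    import google.auth
    from google.auth.transport.requests import AuthorizedSession

    credentials, project = google.auth.default(
        scopes=["https://www.googleapis.com/auth/cloud-platform"]
    )
    session = AuthorizedSession(credentials)

    body = {
        "exportContext": {
            "fileType": "CSV",
            "csvExportOptions": {
                "selectQuery": 'select "column1", "column2" from public."tablename" '
                               'where "Insertion_Date" = CURRENT_DATE - 1'
            },
            "uri": "gs://bucket/filename.csv",
            "databases": ["postgres"],
        }
    }

    resp = session.post(
        f"https://sqladmin.googleapis.com/v1/projects/{project}/instances/my-instance/export",
        json=body,
    )
    print(resp.status_code, resp.json())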

AWS Glue, data filtering before loading into a frame, naming s3 objects

I have 3 questions, for the following context:
I'm trying to migrate my historical data from RDS PostgreSQL to S3. I have about a billion rows of data in my database.
Q1) Is there a way for me to tell an AWS Glue job which rows to load? For example, I want it to load data from a certain date onwards. There is no bookmarking feature for a PostgreSQL data source.
Q2) Once my data is processed, the Glue job automatically creates a name for the S3 output objects. I know I can specify the path in the DynamicFrame write, but can I specify the object name? If so, how? I cannot find an option for this.
Q3) I tried my Glue job on a sample table with 100 rows of data, and it automatically separated the output into 20 files with 5 rows in each of those files. How can I specify the batch size in a job?
Thanks in advance
This is a question I have also posted in AWS Glue forum as well, here is a link to that: https://forums.aws.amazon.com/thread.jspa?threadID=280743
Glue supports the pushdown predicates feature, but currently it only works with partitioned data on S3. There is a feature request to support it for JDBC connections, though.
It's not possible to specify the names of the output files. However, it looks like there is an option to rename the files afterwards (note that renaming on S3 means copying a file from one location to another, so it's a costly and non-atomic operation).
You can't really control the size of the output files. There is an option to control the number of output files using coalesce, though. Also, starting from Spark 2.2, it's possible to set the maximum number of records per file via the config spark.sql.files.maxRecordsPerFile.
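A rough PySpark sketch of the workarounds for Q1 and Q3 inside a Glue job (connection details, table and bucket names are placeholders): the date filter is pushed into a JDBC subquery, since Glue pushdown predicates don't apply here, and the output is shaped with maxRecordsPerFile and coalesce:

    from awsglue.context import GlueContext
    from pyspark.context import SparkContext

    glue_context = GlueContext(SparkContext.getOrCreate())
    spark = glue_context.spark_session

    # Q1: push the date filter into the JDBC read itself via a subquery "table".
    df = spark.read.jdbc(
        url="jdbc:postgresql://my-rds-host:5432/mydb",
        table="(SELECT * FROM history WHERE created_at >= DATE '2019-01-01') AS src",
        properties={"user": "etl", "password": "secret",
                    "driver": "org.postgresql.Driver"},
    )

    # Q3: cap records per output file (Spark >= 2.2) and/or reduce the file count.
    spark.conf.set("spark.sql.files.maxRecordsPerFile", 1_000_000)
    df.coalesce(10).write.mode("append").parquet("s3://my-bucket/history/")

Q2 still isn't addressed by this: Spark decides the part-file names, so renaming after the write is the only option.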

MongoDB into AWS Redshift

We've got a pretty big MongoDB instance with sharded collections. It's reached a point where it's becoming too expensive to rely on MongoDB's query capabilities (including the aggregation framework) for insight into the data.
I've looked around for options to make the data available and easier to consume, and have settled on two promising options:
AWS Redshift
Hadoop + Hive
We want to be able to use a SQL like syntax to analyze our data, and we want close to real time access to the data (a few minutes latency is fine, we just don't want to wait for the whole MongoDB to sync overnight).
As far as I can gather, for option 2, one can use this https://github.com/mongodb/mongo-hadoop to move data over from MongoDB to a Hadoop cluster.
I've looked high and low, but I'm struggling to find a similar solution for getting MongoDB into AWS Redshift. From looking at Amazon articles, it seems like the correct way to go about it is to use AWS Kinesis to get the data into Redshift. That said, I can't find any example of someone that did something similar, and I can't find any libraries or connectors to move data from MongoDB into a Kinesis stream. At least nothing that looks promising.
Has anyone done something like this?
I ended up coding up our own migrator using NodeJS.
I got a bit irritated with answers explaining what Redshift and MongoDB are, so I decided to take the time to share what I had to do in the end.
Timestamped data
Basically we ensure that all our MongoDB collections that we want to be migrated to tables in redshift are timestamped, and indexed according to that timestamp.
Plugins returning cursors
We then code up a plugin for each migration that we want to do from a mongo collection to a redshift table. Each plugin returns a cursor, which takes the last migrated date into account (passed to it from the migrator engine), and only returns the data that has changed since the last successful migration for that plugin.
How the cursors are used
The migrator engine then uses this cursor, and loops through each record.
It calls back to the plugin for each record, to transform the document into an array, which the migrator then uses to create a delimited line which it streams to a file on disk. We use tabs to delimit this file, as our data contained a lot of commas and pipes.
Delimited exports from S3 into a table on redshift
The migrator then uploads the delimited file onto S3, and runs the redshift copy command to load the file from S3 into a temp table, using the plugin configuration to get the name and a convention to denote it as a temporary table.
So for example, if I had a plugin configured with a table name of employees, it would create a temp table with the name of temp_employees.
Now we've got data in this temp table. And the records in this temp table get their ids from the originating MongoDB collection. This allows us to then run a delete against the target table, in our example, the employees table, where the id is present in the temp table. If any of the tables don't exist, it gets created on the fly, based on a schema provided by the plugin. And so we get to insert all the records from the temp table into the target table. This caters for both new records and updated records. We only do soft deletes on our data, so it'll be updated with an is_deleted flag in redshift.
Once this whole process is done, the migrator engine stores a timestamp for the plugin in a Redshift table, in order to keep track of when the migration last ran successfully for it. This value is then passed to the plugin the next time the engine decides it should migrate data, allowing the plugin to use the timestamp in the cursor it needs to provide to the engine.
So in summary, each plugin/migration provides the following to the engine:
A cursor, which optionally uses the last migrated date passed to it from the engine, in order to ensure that only deltas are moved across.
A transform function, which the engine uses to turn each document in the cursor into a delimited string, which gets appended to an export file.
A schema file, which is a SQL file containing the schema for the table in Redshift.
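The original migrator is NodeJS; purely to illustrate the flow described above, here is a condensed Python sketch of the same pattern. Collection, table, bucket, column and IAM role names are placeholders, and the real thing needs error handling and per-plugin configuration:

    import csv
    import datetime
    import boto3                      # pip install boto3 pymongo psycopg2-binary
    import psycopg2
    from pymongo import MongoClient

    TABLE = "employees"               # placeholder plugin/table name
    S3_BUCKET = "my-migration-bucket" # placeholder staging bucket

    mongo = MongoClient("mongodb://mongo-host:27017")["mydb"]
    redshift = psycopg2.connect("host=my-cluster.redshift.amazonaws.com port=5439 "
                                "dbname=analytics user=etl password=secret")
    cur = redshift.cursor()

    # 1. Last successful migration timestamp, kept in a small Redshift table.
    cur.execute("SELECT last_run FROM migrations WHERE plugin = %s", (TABLE,))
    row = cur.fetchone()
    last_run = row[0] if row else datetime.datetime(1970, 1, 1)

    # 2. Cursor over documents changed since then (collection indexed on updated_at).
    docs = mongo[TABLE].find({"updated_at": {"$gt": last_run}})

    # 3. Transform each document into a tab-delimited line in a local export file.
    export_path = f"/tmp/{TABLE}.tsv"
    with open(export_path, "w", newline="") as f:
        writer = csv.writer(f, delimiter="\t")
        for doc in docs:
            writer.writerow([str(doc["_id"]), doc.get("name"), doc.get("is_deleted", False)])

    # 4. Upload to S3 and COPY into a temp table named after the plugin.
    boto3.client("s3").upload_file(export_path, S3_BUCKET, f"{TABLE}.tsv")
    cur.execute(f"CREATE TEMP TABLE temp_{TABLE} (LIKE {TABLE})")
    cur.execute(
        f"COPY temp_{TABLE} FROM 's3://{S3_BUCKET}/{TABLE}.tsv' "
        "IAM_ROLE 'arn:aws:iam::123456789012:role/redshift-copy' DELIMITER '\\t'"
    )

    # 5. Delete-then-insert so both new and updated rows land in the target table.
    cur.execute(f"DELETE FROM {TABLE} WHERE id IN (SELECT id FROM temp_{TABLE})")
    cur.execute(f"INSERT INTO {TABLE} SELECT * FROM temp_{TABLE}")

    # 6. Record the new high-water mark for the next run.
    cur.execute("UPDATE migrations SET last_run = GETDATE() WHERE plugin = %s", (TABLE,))
    redshift.commit()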
Redshift is a data warehousing product and MongoDB is a NoSQL database. Clearly, they are not replacements for each other; they can co-exist and serve different purposes. The question then is how to save and update records in both places.
You can move all the MongoDB data to Redshift as a one-time activity.
Redshift is not a good fit for real-time writes. For near-real-time sync to Redshift, you should modify the program that writes into MongoDB.
Have that program also write to S3 locations; the movement from S3 to Redshift can then be done at a regular interval.
MongoDB being a document storage engine, Apache Solr and Elasticsearch can be considered possible replacements, but they do not support SQL-style querying capabilities; they basically use a different filtering mechanism. For example, for Solr, you might need to use the DisMax query parser.
In the cloud, Amazon CloudSearch or Azure Search would be compelling options to try as well.
You can now easily use AWS DMS to migrate data to Redshift, and it can also replicate ongoing changes in near real time.