Kinesis Firehose after COPY query - amazon-redshift

I have Kinesis Firehose feeding data into Redshift. My data has an IP address string field, and I use a second integer IP address column for geolocation purposes that is not in my source data. I'd like to run an update query every time the COPY finishes, to populate the integer IP column from the string IP column wherever the integer one is null.
Is there a way to run a query after a load? Or some other way to schedule a recurring query without introducing an EC2 instance just for that?
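For reference, the kind of post-load update I have in mind looks roughly like this (a sketch only; the table name events and the columns ip_string / ip_int are placeholders for my actual schema):

-- Sketch: fill the integer IP column from the dotted-quad string column,
-- only for rows the previous COPY left with a null integer value.
UPDATE events
SET ip_int = SPLIT_PART(ip_string, '.', 1)::BIGINT * 16777216
           + SPLIT_PART(ip_string, '.', 2)::BIGINT * 65536
           + SPLIT_PART(ip_string, '.', 3)::BIGINT * 256
           + SPLIT_PART(ip_string, '.', 4)::BIGINT
WHERE ip_int IS NULL;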

Related

How can I get a stable result set for my REST API without an order by clause or a cursor?

I am creating a REST API that connects to an AWS RDS instance (Aurora Serverless/Postgres) using the RDS Data API. The request flow:
Client -> API Gateway -> Lambda -> RDS Data API -> RDS
The Lambda maps the request to a sql statement which is sent to RDS via the Data API. The results (rows) of the query are sent back to the client in json format.
All my tables have an indexed primary key column. However, the value of this primary key does not always increment by 1; there are gaps. The tables can contain up to a couple of billion rows.
The solution is simple for smaller tables, but for larger tables some sort of pagination has to be implemented due to Data API limits (1 MB response size) and Lambda runtime limits (max 15 minutes). I have read about different methods to implement pagination. The only suitable method seems to be cursor-based pagination, as it's the only method that can guarantee a stable result set without using an order by clause.
However I can't use (server side) cursors because when the Lambda times out, the database connection gets closed, as well as the cursor.
I could use URL parameters containing a start and an end value to filter the primary key column of the table; however, this column does not always increment by 1, so doing something like https://website.com/table/users?start_idx=0&end_idx=100 would not necessarily return 100 records, which is undesirable.
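To make this concrete, the filter I'm describing would translate to something like the following (table and column names are just examples):

-- Sketch of the start_idx / end_idx filter from the URL above.
-- Because of gaps in the primary key, this can return fewer than 100 rows.
SELECT *
FROM users
WHERE id >= 0      -- start_idx
  AND id <= 100;   -- end_idx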
How can I best solve this in an idiomatic manner?

Cloud SQL: export data to CSV periodically avoiding duplicates

I want to export the data from Cloud SQL (Postgres) to a CSV file periodically (once a day, for example), and rows that have already been exported must not be exported again by the next export task.
I'm currently using a POST request to perform the export task via Cloud Scheduler. The problem here (at least as far as I know) is that a single HTTP export request can't both export the rows and delete them (or update them to mark them as exported).
Is there any way to delete (or update) the exported rows automatically, with some Cloud SQL parameter in the HTTP export request?
If not, I assume it should be done in a Cloud Function triggered by Pub/Sub (using Scheduler to publish once a day), but is there an optimal way to capture the IDs of the rows returned by the select statement (the one used in the export) so they can be deleted (or updated) later?
You can export and delete (or update) at the same time using RETURNING.
\copy (DELETE FROM pgbench_accounts WHERE aid<1000 RETURNING *) to foo.txt
The problem would be in the face of crashes. How can you know that foo.txt has been written and flushed to disk before the DELETE is allowed to commit? Or the reverse: foo.txt is partially (or fully) written, but a crash prevents the DELETE from committing.
Can't you make the system idempotent, so that exporting the same row more than once doesn't create problems?
You could use the following setup to achieve what you are looking for:
1. Create a Cloud Function, subscribed to a Pub/Sub topic, that extracts the information from the database.
2. Create a Pub/Sub topic to trigger that function.
3. Create a Cloud Scheduler job that publishes to the Pub/Sub topic.
4. Run the Cloud Scheduler job.
5. Then create a trigger that activates another Cloud Function to delete all the required data from the database once the CSV has been created (sketched below).
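For step 5, assuming the rows carry a date column recording when they were inserted (an assumption about your schema; the table and column names below are placeholders), the cleanup run by the second Cloud Function could be a single statement:

-- Remove everything up to and including yesterday, i.e. the rows the
-- just-finished export covered (alternatively, UPDATE them with an
-- "exported" flag instead of deleting).
DELETE FROM public."tablename"
WHERE "insertion_date" <= CURRENT_DATE - 1;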
Here I leave you some documents which could help you if you decide to follow this path.
Using Pub/Sub to trigger a Cloud Function: https://cloud.google.com/scheduler/docs/tut-pub-sub
Connecting to Cloud SQL from Cloud Functions: https://cloud.google.com/sql/docs/mysql/connect-functions
Cloud Storage Tutorial: https://cloud.google.com/functions/docs/tutorials/storage
Another method, aside from #jjanes's, would be to partition your database tables by date. This would allow you to create an index on the date, making exporting or deleting a day's entries very easy. With this implementation, you could also create a cron job that deletes all tables older than X days.
The documentation provided will walk you through setting up a range partition:
The table is partitioned into “ranges” defined by a key column or set of columns, with no overlap between the ranges of values assigned to different partitions. For example, one might partition by date ranges, or by ranges of identifiers for particular business objects.
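A minimal sketch of what that looks like in practice; the table name, columns, and date ranges below are assumptions, not your actual schema:

-- Parent table, partitioned by the insertion date
CREATE TABLE export_data (
    id             bigint,
    payload        text,
    insertion_date date NOT NULL
) PARTITION BY RANGE (insertion_date);

-- One partition per day; exporting or dropping a whole day becomes cheap
CREATE TABLE export_data_2020_01_01 PARTITION OF export_data
    FOR VALUES FROM ('2020-01-01') TO ('2020-01-02');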
Thank you for all your answers. There are multiple ways of doing this, so I'm going to explain how I did it.
In the database I have included a column which contains the date when the data was inserted.
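The column addition itself is a one-liner, something along these lines (the table name is a placeholder, and the DEFAULT is just one way to populate the column automatically):

-- Date column recording when each row was inserted
-- (the DEFAULT is an assumption; the column could also be filled by the application)
ALTER TABLE public."tablename"
    ADD COLUMN "Insertion_Date" date DEFAULT CURRENT_DATE;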
I used a Cloud Scheduler job with the following body:
{"exportContext":{"fileType": "CSV", "csvExportOptions" :{"selectQuery" : "select \"column1\", \"column2\",... , \"column n\" from public.\"tablename\" where \"Insertion_Date\" = CURRENT_DATE - 1" },"uri": "gs://bucket/filename.csv","databases": ["postgres"]}}
This scheduler is triggered once a day and exports only the previous day's data.
Also, note that in the query used in Cloud Scheduler you can choose which columns to export; this way you can avoid exporting the Insertion_Date column itself and use it only as an auxiliary.
Finally, the export automatically creates the CSV file in a bucket.

How to find out the IP address who performed DML operations on a certain table in Postgres?

Is there a way to find out the IP address from which DML operations were performed on a certain table in Postgres?
One way is to modify your code to store the IP address of the client machine in one of the columns (basically, in one of the common tables).
Ahead of time, you can set up an audit trigger which records inet_client_addr(). For example:
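A minimal sketch of such a trigger; the audit table, function, and trigger names (dml_audit, log_dml, my_table) are placeholders, and a real audit setup would typically capture more detail:

-- Audit table: records who connected from where and what kind of statement ran
CREATE TABLE dml_audit (
    logged_at   timestamptz DEFAULT now(),
    client_addr inet        DEFAULT inet_client_addr(),
    db_user     text        DEFAULT current_user,
    table_name  text,
    operation   text
);

CREATE OR REPLACE FUNCTION log_dml() RETURNS trigger AS $$
BEGIN
    INSERT INTO dml_audit (table_name, operation)
    VALUES (TG_TABLE_NAME, TG_OP);
    RETURN NULL;  -- return value is ignored for AFTER triggers
END;
$$ LANGUAGE plpgsql;

-- One statement-level trigger per table you want to watch
-- (PostgreSQL 11+; use EXECUTE PROCEDURE on older versions)
CREATE TRIGGER audit_my_table
AFTER INSERT OR UPDATE OR DELETE ON my_table
FOR EACH STATEMENT EXECUTE FUNCTION log_dml();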
Or you can use the server log file, by setting log_statement='mod' and making sure that log_line_prefix records the remote host.
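A sketch of the relevant settings, applied via ALTER SYSTEM (they can equally be set in postgresql.conf); %r in the prefix is what records the remote host and port:

ALTER SYSTEM SET log_statement = 'mod';                   -- log DDL plus INSERT/UPDATE/DELETE
ALTER SYSTEM SET log_line_prefix = '%m [%p] %r %u@%d ';   -- %r = remote host and port
SELECT pg_reload_conf();                                  -- pick up the changes without a restart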
If you are trying to do this after the fact without having first set up something like one of the above, then no.

Daily data archival from Postgres to Hive/HDFS

I am working on an IoT data pipeline and I am getting messages every second from multiple devices into a Postgres database. Postgres holds data for only two days; after two days the data is flushed, so that at any time it contains only the last two days of data. Now I need to archive data from Postgres to HDFS daily. The parameters I have are:
deviceid, timestamp, year, month, day, temperature, humidity
I want to archive it daily into HDFS and query that data using Hive queries. For that I need to create an external partitioned table in Hive using deviceid, year and month as partitions (a sketch of the DDL I have in mind follows the list of attempts below). I have tried the following options, but they are not working:
I have tried using Sqoop for data copying, but it can't create dynamic folders based on the different deviceid, year and month values, so the external Hive table can't pick up the partitions.
I used sqoop import with the --hive-import option so that data can be copied directly into the Hive table, but in this case it overwrites the existing table, and I am also not sure whether this works for a partitioned table or not.
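This is roughly the external table I have in mind; the column types, delimiter, and HDFS location are assumptions:

-- External, partitioned archive table; data files are expected under
-- /archive/iot/deviceid=<id>/year=<yyyy>/month=<mm>/
CREATE EXTERNAL TABLE IF NOT EXISTS iot_archive (
    `timestamp` TIMESTAMP,
    `day`       INT,
    temperature DOUBLE,
    humidity    DOUBLE
)
PARTITIONED BY (deviceid STRING, `year` INT, `month` INT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION '/archive/iot';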
Please suggest some solutions for the archival.
Note: I am using Azure services, so the option of Azure Data Factory is open.

Get incremental changes for RDS Postgres DB

I have a process that extracts data from an origin DB, makes changes, and inserts it into a target DB.
Currently I'm using an AWS Lambda that runs every few minutes, and I've added a timestamp column that indicates when the data was last changed, so I can filter based on it.
This process is inefficient since I need to "remember" to manually add a timestamp column to each new table. Is there a better way? Can I use a query (Postgres) or an API call (boto3) to get only the data that was changed?
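For context, the current per-table pattern looks roughly like this (table and column names are placeholders, and :last_run stands for the timestamp of the previous Lambda run):

-- The column that has to be added (and remembered) for every new table;
-- the DEFAULT only covers inserts, keeping it current on UPDATE is part of
-- the manual work.
ALTER TABLE my_table
    ADD COLUMN last_changed timestamptz NOT NULL DEFAULT now();

-- Pull only the rows changed since the previous run
SELECT *
FROM my_table
WHERE last_changed > :last_run;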