So I'm trying to load data into my Redshift database from an S3 bucket. I have a table 'Example' which has a field 'timestamp' in the format 'YYYY-MM-DD HH:MM:SS'.
I'm using the COPY command to load the data, and I can load files for a specific pattern/prefix, but I want to load only rows with a timestamp after a certain point, say greater than '2014-07-09 10:00:00'. How do I approach this?
You have two options:
Either process the file before you load it to S3 (and upload only the rows with a timestamp greater than $SOME_TIMESTAMP), or
use the COPY command to load the file into an intermediate table (it can even be a temp table, as long as you stay within the same session) and then run:
insert into YOUR_ORIGINAL_TABLE (select * from YOUR_TEMP_TABLE where timestamp > WHATEVER_YOU_NEED)
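A minimal end-to-end sketch of the second option, assuming a target table named example, comma-delimited files, and a placeholder S3 path and IAM role:

create temp table example_staging (like example);

copy example_staging
from 's3://my-bucket/my-prefix/'
iam_role 'arn:aws:iam::123456789012:role/my-redshift-role'
delimiter ',';
-- bucket, prefix, and role above are placeholders

insert into example
select * from example_staging
where "timestamp" > '2014-07-09 10:00:00';
-- "timestamp" is quoted because it is a reserved word in Redshift

Only the filtered rows land in the target table, and the temp table disappears at the end of the session.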
We receive 7 million records daily, which we append to the existing target table. The target table is partitioned by date.
We are using the DB2 LOAD command to load data from one DB2 table (stage) into another DB2 table (target):
call SYSPROC.ADMIN_CMD('LOAD FROM (SELECT * FROM stage_table )
OF CURSOR INSERT INTO target_table NONRECOVERABLE INDEXING MODE INCREMENTAL ALLOW READ ACCESS')
As per the IBM documentation, ALLOW READ ACCESS is deprecated and it is suggested to use the INGEST utility instead:
https://www.ibm.com/docs/en/db2/10.1.0?topic=functionality-fp1-allow-read-access-parameter-load-command
Questions:
How do we use the INGEST utility to load data from one DB2 table to another?
What other alternatives are there to load millions of records with improved performance?
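For reference, a minimal sketch of what the INGEST route could look like. This is an assumption on my part: INGEST reads from files or pipes rather than a cursor, so the staged rows are exported to a delimited file first, and the file path is a placeholder.

-- run from the DB2 command line processor
EXPORT TO /tmp/stage_table.del OF DEL SELECT * FROM stage_table;

INGEST FROM FILE /tmp/stage_table.del
   FORMAT DELIMITED
   INSERT INTO target_table;

INGEST commits in intervals while it runs, so the target table stays available for reads during the load.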
I need to load data with a default value column into Redshift, as outlined in the AWS docs.
Unfortunately the COPY command doesn't allow loading data with default values from a parquet file, so I need to find a different way to do that.
My table has a column that uses Redshift's GETDATE() function as its default:
LOAD_DT TIMESTAMP DEFAULT GETDATE()
If I use the COPY command and add the column names as arguments I get the error:
Column mapping option argument is not supported for PARQUET based COPY
What is a workaround for this?
Can you post a reference for Redshift not supporting default values for a Parquet COPY? I haven't heard of this restriction.
As for workarounds, I can think of two; a sketch of the first is below.
Copy the file to a temp table and then insert from this temp table into your table with the default value.
Define an external table that uses the parquet file as its source and insert from this table into the table with the default value.
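A minimal sketch of the first workaround, with hypothetical table and column names and a placeholder S3 path and IAM role; the staging table simply omits LOAD_DT, so the default fires on insert:

create temp table my_table_staging (
    id bigint,
    payload varchar(256)
);

copy my_table_staging
from 's3://my-bucket/path/data.parquet'
iam_role 'arn:aws:iam::123456789012:role/my-redshift-role'
format as parquet;
-- path and role above are placeholders

insert into my_table (id, payload)
select id, payload from my_table_staging;
-- LOAD_DT is not listed, so DEFAULT GETDATE() is applied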
I have a requirement to move data from S3 to Redshift. Currently I am using Glue for the work.
Current Requirement:
Compare the primary key of each record in the Redshift table with the incoming file; if a match is found, close the old record (update its end date from the high date to the current date) and insert the new one.
If no primary key match is found, insert the new record.
Implementation:
I have implemented it in Glue using PySpark with the following steps:
Created data frames covering three scenarios:
If a match is found, update the existing record's end date to the current date.
Insert the new record into the Redshift table where a PPK match is found.
Insert the new record into the Redshift table where no PPK match is found.
Finally, union these three data frames into one and write the result to the Redshift table.
With this approach, both the old record (which still has the high date value) and the new record (updated with the current date value) will be present.
Is there a way to delete the old record with high date value using pyspark? Please advise.
We have successfully implemented the desired functionality using AWS RDS [PostgreSQL] as the database service and Glue as the ETL service. My suggestion would be that, instead of computing the delta in Spark data frames, it is a far easier and more elegant solution to create stored procedures and call them from the Glue job.
[for example: S3 bucket -> staging table -> target table]
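A minimal SQL sketch of that staging -> target step, with hypothetical table and column names; the primary key is assumed to be record_id and an open record is assumed to carry an end_date of '9999-12-31':

-- close the currently open version of any record that arrives again
update target_table
set end_date = current_date
from staging_table
where target_table.record_id = staging_table.record_id
  and target_table.end_date = '9999-12-31';

-- insert every incoming record as the new open version
insert into target_table (record_id, payload, start_date, end_date)
select record_id, payload, current_date, '9999-12-31'
from staging_table;

Wrapped in a stored procedure (or run as two statements from the Glue job), this avoids reconciling the data frames in Spark and leaves no stale high-date duplicate behind.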
In addition, if your execution logic runs in less than 10 minutes, I would suggest using a Python shell job with external libraries such as psycopg2 / SQLAlchemy for the DB operations.
I am trying to load data from an S3 bucket into a Redshift table. The table has a source id column, and I want to store in it the name of the folder the source file came from.
I have multiple folders in the S3 bucket, each containing one file, and I load all the files into the same table with the Redshift COPY command. To identify which folder the data came from, I need to store the folder name along with the data, and I have a separate column in the table for it, source id.
Can anybody help me?
If you are using the Redshift COPY command, you have no choice but to import each folder separately (e.g. into a temp table) and then set the column manually to the name of the folder you just loaded, repeating for each folder.
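A minimal sketch of that per-folder loop, assuming a hypothetical bucket layout s3://my-bucket/folder1/, a placeholder IAM role, and a two-column file layout:

create temp table staging (
    id bigint,
    payload varchar(256)
);

-- repeat these three statements for folder1, folder2, ..., changing the prefix and the literal;
-- add format/delimiter options to the COPY to match your files
copy staging
from 's3://my-bucket/folder1/'
iam_role 'arn:aws:iam::123456789012:role/my-redshift-role';

insert into my_table (id, payload, source_id)
select id, payload, 'folder1' from staging;

truncate staging;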
Another option is to use Redshift Spectrum and create an external table that maps your folders as partitions.
First you create your base table like this:
create external table spectrum.sales_part(
salesid integer,
listid integer,
sellerid integer,
buyerid integer,
eventid integer,
dateid smallint,
qtysold smallint,
pricepaid decimal(8,2),
commission decimal(8,2),
saletime timestamp)
partitioned by (saledate date)
row format delimited
fields terminated by '|'
stored as textfile
location 's3://awssampledbuswest2/tickit/spectrum/sales_partition/'
table properties ('numRows'='172000');
Then you add partitions to it like this:
alter table spectrum.sales_part
add partition(saledate='2008-01-01')
location 's3://awssampledbuswest2/tickit/spectrum/sales_partition/saledate=2008-01/';
alter table spectrum.sales_part
add partition(saledate='2008-02-01')
location 's3://awssampledbuswest2/tickit/spectrum/sales_partition/saledate=2008-02/';
alter table spectrum.sales_part
add partition(saledate='2008-03-01')
location 's3://awssampledbuswest2/tickit/spectrum/sales_partition/saledate=2008-03/';
Once you have that set up as an external table, you can use standard SQL against it; for example, you could run your queries directly against that table, or copy it into a permanent Redshift table using CTAS.
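For example, a minimal CTAS sketch that carries the partition value along as the source identifier (the local table name is hypothetical):

create table sales_local as
select salesid, listid, sellerid, buyerid, eventid, dateid,
       qtysold, pricepaid, commission, saletime,
       saledate as source_id
from spectrum.sales_part;

Because saledate is the partition column, its value identifies the folder each row came from.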
Here is a link to the documentation
https://docs.aws.amazon.com/redshift/latest/dg/c-spectrum-external-tables.html
I don't want to use pg_dump to export the data into an SQL script, since feeding it to the Greenplum cluster is too slow when I have a large amount of data to import. So it seems using Greenplum's gpfdist is preferred. Is there any way I can do this?
Or, as an alternative, can I export a particular Postgres table's data into a CSV file containing the large objects of that table?
pg_dump will create a file that uses COPY to load the data back into a database. When loading into Greenplum, it will load through the master server, and for very large loads it will become a bottleneck. Yes, the preferred method is to use gpfdist, but you can most certainly use COPY to load data into Greenplum. It won't load at the 10+ TB per hour rate that gpfdist can achieve, but it can still achieve 1 to 2 TB per hour.
Another alternative is to use gpfdist to execute a program that gets the data: it runs the SELECT statement against PostgreSQL and makes the result available to an external table in Greenplum. I created a wrapper for this process called "gplink". You can check it out here: http://www.pivotalguru.com/?page_id=982
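Roughly, that pattern looks like the sketch below (not gplink itself, just the underlying idea): gpfdist serves a named pipe that is fed by a COPY from the source PostgreSQL database. Host names, paths, and table names are placeholders.

-- on an ETL host, outside the database:
--   mkfifo /data/pipes/lo_pipe
--   gpfdist -d /data/pipes -p 8081 &
--   psql -h pg_host -d src_db -c "COPY (select * from lo) TO STDOUT WITH CSV" > /data/pipes/lo_pipe

-- in Greenplum:
create external table ext_lo (n text, p oid)
location ('gpfdist://etl_host:8081/lo_pipe')
format 'CSV';

insert into lo_target select * from ext_lo;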
According to the Greenplum reference:
The simplest data loading method is the SQL INSERT statement...
You can use the COPY command to load the data into a table when the data
is in external text files...
You can use a pair of Greenplum utilities, gpfdist and gpload, to load external data into tables...
Nevertheless, if you want to use CSV to import the data, you can generate a CSV that includes the large object contents by joining your table against pg_largeobject. E.g.:
b=# create table lo (n text,p oid);
CREATE TABLE
b=# insert into lo values('wheel',lo_import ('/tmp/wheel.PNG'));
INSERT 0 1
b=# copy (select lo.*, pg_largeobject.pageno, pg_largeobject.data from lo join pg_largeobject on lo.p = loid) to '/tmp/lo.csv' WITH (format csv, header);
COPY 20
The generated /tmp/lo.csv will contain the name, oid, page number, and bytea data in CSV format.
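From there, a minimal sketch of pulling that CSV into Greenplum through gpfdist; the host, port, and target table are placeholders, and gpfdist is assumed to be serving the directory that holds lo.csv:

-- on the host holding the file: gpfdist -d /tmp -p 8081 &

create external table ext_lo_csv (n text, p oid, pageno int, data bytea)
location ('gpfdist://etl_host:8081/lo.csv')
format 'CSV' (header);

insert into lo_target select * from ext_lo_csv;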