Best way to sanitize some data for importing into PostgreSQL?

I have two columns: a date in YYMMDD format and a time in HHMMSS format, stored as strings like 150103 and 132244. There are close to a quarter of a billion records. What would be the best way to sanitize the data prior to importing into PostgreSQL? Is there a way to do this while importing, for instance?

Your data can be converted to timestamp with time zone using the function to_timestamp():
with example(d, t) as (
values ('150103', '132244')
)
select d, t, to_timestamp(concat(d, t), 'yymmddhh24miss')
from example;
   d    |   t    |      to_timestamp
--------+--------+------------------------
150103 | 132244 | 2015-01-03 13:22:44+01
(1 row)
You can import a file into a table with temporary columns (d, t):
create table example(d text, t text);
copy example from ....
then add a timestamp with time zone column, convert the data, and drop the redundant text columns:
alter table example add tstamp_column timestamptz;
update example
set tstamp_column = to_timestamp(concat(d, t), 'yymmddhh24miss');
alter table example drop d, drop t;
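For a quarter of a billion rows, note that the UPDATE rewrites the whole table. A leaner sketch of the same idea (the target table name and column here are assumptions for illustration) converts the data while moving it from the staging table into its final home, then drops the staging table:
create table target (tstamp timestamptz);
insert into target (tstamp)
select to_timestamp(concat(d, t), 'yymmddhh24miss')
from example;
drop table example;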

Related

How to join 2 tables on fields which have different formats?

I have 2 tables with the following structure:
Table A:
id - number
a_d - text
where A.a_d has the text format: "yyyy-mm-dd 00:00:00" (examples: 2001-08-22 00:00:00, or 2002-03-23 00:00:00)
Table B:
id - number
a_d - text
where B.a_d has the text format: "dd-month-yyyy" (example: 01-jul-2002 or 09-feb-2005)
I want to run a join query on the text fields of those tables.
select a.a_d
from A a
join B b
on a.a_d =?= b.a_d
I can't change or update the tables, just read data from them.
How can I compare these two fields if they have different formats?
Use TO_DATE to convert the text dates into bona fide dates before comparing:
SELECT a.a_d
FROM A a
INNER JOIN B b
ON a.a_d::date = TO_DATE(b.a_d, 'DD-mon-YYYY');
Note that the a_d field in table A happens to be a text timestamp which can already be directly cast to date, so we only need TO_DATE for the B table.
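As a quick sanity check with the literal values from the question, both sides produce comparable date values:
SELECT '2001-08-22 00:00:00'::date;           -- 2001-08-22
SELECT TO_DATE('01-jul-2002', 'DD-mon-YYYY'); -- 2002-07-01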
Ideally you should store your dates and timestamps in proper columns rather than text. Then, the join would be possible without costly conversions.

ERROR: invalid input syntax for type timestamp: "20-MAR-17 08.30.41.453267 AM"

I am trying to copy data spooled from Oracle into PostgreSQL in CSV format.
I am getting the error below while doing the copy:
ERROR: invalid input syntax for type timestamp: "20-MAR-17
08.30.41.453267 AM"
I tried setting the datestyle to DMY in Postgres, but it did not work. I can import the data if I first convert it to YMD format, but that would mean changing numerous fields across almost 50 TB of data.
Can someone please help me with this?
badmin=# copy downloaded_file from '/export/home/dbadmin/postgresql/TESTPGDB/scripts/FACTSET_IDS_2_V1.DOWNLOADED_FILE.csv'
with delimiter ',';
ERROR: invalid input syntax for type timestamp:
"20-MAR-17 08.30.41.453267 AM" CONTEXT: COPY downloaded_file, line 1,
column DOWNLOAD_TIME: "20-MAR-17 08.30.41.453267 AM"
Let's assume your main table has the following columns and data types:
\d downloaded_file
 Column |            Type
--------+----------------------------
id | integer
txt | text
tstamp | timestamp without time zone
Now, rather than copying directly into the table, create a temporary table with the same columns but with all text datatypes.
create temporary table downloaded_file_tmp ( id text, txt text, tstamp text);
Now, copy the contents of the file into this temp table.
The file looks like this.
$cat f.csv
1,'TEXT1','20-MAR-17 08.30.41.453267 AM'
Copying from file to temp table.
\copy downloaded_file_tmp from 'f.csv' with delimiter ',' CSV;
Copying from temp table to main table.
INSERT INTO downloaded_file
(id,
txt,
tstamp)
SELECT id :: INT,
txt,
       TO_TIMESTAMP(tstamp, 'dd-mon-yy hh.mi.ss.US AM')
FROM downloaded_file_tmp;
Notice the format specifier US, which represents microseconds (000000-999999).
knayak=# select * from downloaded_file;
 id |   txt   |           tstamp
----+---------+----------------------------
1 | 'TEXT1' | 2017-03-20 08:30:41.453267
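You can also test the format string in isolation against the problem value; note that TO_TIMESTAMP() returns timestamp with time zone, so the displayed offset depends on your session's TimeZone setting:
SELECT TO_TIMESTAMP('20-MAR-17 08.30.41.453267 AM', 'dd-mon-yy hh.mi.ss.US AM');
-- 2017-03-20 08:30:41.453267+00 (shown here with TimeZone set to UTC)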

How to query from the result of a changed column of a table in postgresql

So I have a string time column in a table, and now I want to change that column to a datetime type and then query data for selected dates.
Is there a direct way to do so? One way I could think of is
1) add a new column
2) insert values into it with converted date
3) Query using the new column
Here I am stuck at the 2nd step, the INSERT, so I need help with that:
ALTER TABLE "nds".”unacast_sample_august_2018"
ADD COLUMN new_date timestamp
-- Need correction in select statement that I don't understand
INSERT INTO "nds".”unacast_sample_august_2018” (new_date)
(SELECT new_date from_iso8601_date(substr(timestamp,1,10))
Could some one help me with correction and if possible a better way of doing it?
I tried another way to do it in a single step, but it gives an error that column new_date does not exist:
SELECT *
FROM (SELECT from_iso8601_date(substr(timestamp,1,10)) FROM "db_name"."table_name") AS new_date
WHERE new_date > from_iso8601('2018-08-26') limit 10;
and also:
SELECT new_date = (SELECT from_iso8601_date(substr(timestamp,1,10)))
FROM "db_name"."table_name"
WHERE new_date > from_iso8601('2018-08-26') limit 10;
Could someone correct these queries?
You don't need those steps; just use a USING CAST clause in your ALTER TABLE:
CREATE TABLE foobar (my_timestamp) AS
VALUES ('2018-09-20 00:00:00');
ALTER TABLE foobar
ALTER COLUMN my_timestamp TYPE timestamp USING CAST(my_timestamp AS TIMESTAMP);
If your string timestamps are in a format PostgreSQL recognizes, this should be enough.
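If the strings were in a non-standard format instead, the USING expression could call to_timestamp() rather than a plain cast; the format string below is only an illustrative assumption:
ALTER TABLE foobar
ALTER COLUMN my_timestamp TYPE timestamp
USING to_timestamp(my_timestamp, 'YYYY-MM-DD HH24:MI:SS');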
Solved as follows:
select *
from
(
SELECT from_iso8601_date(substr(timestamp,1,10)) as day,*
FROM "db"."table"
)
WHERE day > date_parse('2018-08-26', '%Y-%m-%d')
limit 10
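Note that from_iso8601_date() and date_parse() are Presto/Athena functions rather than PostgreSQL ones. A rough plain-PostgreSQL equivalent of the same query might look like this (PostgreSQL additionally requires an alias on the subquery; table and column names are taken from the question):
SELECT *
FROM (
    SELECT substr("timestamp", 1, 10)::date AS day, *
    FROM "db"."table"
) sub
WHERE day > date '2018-08-26'
LIMIT 10;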

Redshift COPY statement loading date format with two digit year (mm/dd/yy)

I have a data source file that I am loading in Redshift with a COPY command.
The file has a bunch of date columns with a two digit year format (I know, I am dealing with dinosaurs here).
Redshift recognizes the date format, but the problem is the file has values like:
06/01/79
which actually means:
2079-06-01
however Redshift interprets it as:
1979-06-01
Is there a way to tell Redshift what my threshold is for two-digit years? For example, values lower than 90 should be interpreted as 20XX.
The DATEFORMAT parameter in the COPY command does not have such an option.
-- Begin transaction
BEGIN TRANSACTION;
-- Create a temp table
CREATE TEMP TABLE my_temp (dtm_str CHAR(8));
-- Load your data into the temp table
COPY my_temp FROM 's3://my_bucket…';
-- Insert your data into the final table
INSERT INTO final_table
-- Grab the first 6 chars and concatenate to the following
SELECT CAST(LEFT(dtm_str,6)||
-- Convert the last 2 chars to an INT and compare to your threshold
CASE WHEN CAST(RIGHT(dtm_str,2) AS INT) < 85
-- Add either 1900 or 2000 to the INT, convert to CHAR
THEN CAST(CAST(RIGHT(dtm_str,2) AS INT) + 2000 AS CHAR(4))
ELSE CAST(CAST(RIGHT(dtm_str,2) AS INT) + 1900 AS CHAR(4)) END
-- Convert the final CHAR to a DATE
AS DATE) new_dtm
FROM my_temp;
COMMIT;
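As a quick check of the century logic on the sample value from the question (79 is below the 85 threshold used above, so 2000 is added):
SELECT CAST(LEFT('06/01/79',6) ||
            CAST(CAST(RIGHT('06/01/79',2) AS INT) + 2000 AS CHAR(4)) AS DATE);
-- 2079-06-01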

Extract year from date within WHERE clause

I need to include EXTRACT() function within WHERE clause as follow:
SELECT * FROM my_table WHERE EXTRACT(YEAR FROM date) = '2014';
I get a message like this:
ERROR: function pg_catalog.date_part(unknown, text) does not exist
SQL state: 42883
Here is my_table content (gid INTEGER, date DATE):
  gid  |    date
-------+-------------
     1 | 2014-12-12
     2 | 2014-12-08
     3 | 2013-17-15
I have to do it this way because the query is sent from a form on a website that includes a 'Year' field where users enter the year as a 4-digit value.
The problem is that your column is of data type text, while EXTRACT() only works for date / time types.
You should convert your column to the appropriate data type.
ALTER TABLE my_table ALTER COLUMN date TYPE date;
That's smaller (4 bytes instead of 11 for the text), faster and cleaner (disallows illegal dates and most typos).
If you have a non-standard format, add a USING clause with a conversion expression. Example:
Alter character field to date
Also, for your queries to be fast with a plain index on date, you should use sargable predicates, like:
SELECT * FROM my_table
WHERE date >= '2014-01-01'
AND date < '2015-01-01';
Or, to go with your 4-digit input for the year:
SELECT * FROM my_table
WHERE date >= to_date('2014', 'YYYY')
AND date < to_date('2015', 'YYYY');
You could also be more explicit:
to_date('2014' || '0101', 'YYYYMMDD')
Both produce the same date '2014-01-01'.
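With a plain index on the column (the index name here is hypothetical), both range queries above can use an index scan, whereas EXTRACT(YEAR FROM date) = 2014 would force a sequential scan unless you create a matching expression index:
CREATE INDEX my_table_date_idx ON my_table (date);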
Aside: date is a reserved word in standard SQL and a basic type name in Postgres. Don't use it as an identifier.
This happens because the column has a text or varchar type, as opposed to date or timestamp. This is easily reproducible:
SELECT 1 WHERE extract(year from '2014-01-01'::text)='2014';
yields this error:
ERROR: function pg_catalog.date_part(unknown, text) does not exist
LINE 1: SELECT 1 WHERE extract(year from '2014-01-01'::text)='2014';
^ HINT: No function matches the given name and argument types. You might need to add explicit type casts.
extract, or its underlying function date_part, does not exist for text-like data types, but it isn't needed anyway. Extracting the year from this date format is equivalent to taking the first 4 characters, so your query would be:
SELECT * FROM my_table WHERE left(date,4)='2014';
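And once the column is converted to a proper date type, as recommended in the previous answer, the original EXTRACT query works directly:
SELECT * FROM my_table WHERE EXTRACT(YEAR FROM date) = 2014;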