How can I remove extra characters from a column? - postgresql

I have a table with Customer/Phone/City/State/Zip/etc..
Occasionally, I'll be importing the info from a .csv file, and sometimes the zipcode is formatted like this: xxxxx-xxxx and I only need it to be a general, 5 digit zip code.
How can I delete the last 5 characters without having to do it from Excel, cell by cell (which is what I'm doing now)?
Thanks
EDIT: This is what I used after Craig's suggestion and it worked. However, some of the zip entries are Canadian postal codes, often formatted x1x-x2x. Running this deletes the last character in the field.
How could I remedy this?

You'll need to do one of these three things:
Use an ETL tool to filter the data during insert;
COPY into a TEMPORARY or UNLOGGED table, then do an INSERT INTO real_table SELECT ... that transforms the data with a suitable substring(...) call; or
Write a simple Perl/Python/whatever script that reads the CSV, transforms it as desired, and inserts the results into PostgreSQL. I'd use Python with the csv module and psycopg2's copy_from.
Such an insert into ... select might look like:
INSERT INTO real_table(col1, col2, zip)
SELECT
col1,
col2,
substring(zip from 1 for 5)
FROM temp_table;
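For the script option, and for the Canadian postal codes mentioned in the question's edit, a minimal Python sketch might look like the following. It assumes that only values shaped exactly like xxxxx-xxxx should be shortened and that everything else is left as-is. The names normalize_zip and normalize_rows, the "zip" header, and the sample rows are hypothetical; loading the result into PostgreSQL is left out.

```python
import csv
import io
import re

# ZIP+4 pattern: five digits, a hyphen, four digits (e.g. 12345-6789).
ZIP_PLUS_4 = re.compile(r"^(\d{5})-\d{4}$")


def normalize_zip(value):
    """Return the 5-digit ZIP for ZIP+4 input; leave anything else untouched."""
    value = value.strip()
    match = ZIP_PLUS_4.match(value)
    return match.group(1) if match else value


def normalize_rows(csv_text, zip_column="zip"):
    """Yield each CSV row as a dict with the ZIP column normalized."""
    # zip_column is a hypothetical header name; adjust it to the real file.
    for row in csv.DictReader(io.StringIO(csv_text)):
        row[zip_column] = normalize_zip(row[zip_column])
        yield row


# Made-up sample data, for illustration only.
sample = "customer,zip\nAlice,12345-6789\nBob,90210\nChloe,K1A-0B1\n"
for row in normalize_rows(sample):
    print(row["customer"], row["zip"])
```

Matching on the full pattern rather than blindly cutting to five characters is what keeps a value like K1A-0B1 intact.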

Related

copy columns of a csv file into postgresql table

I have a CSV file whose lines have 12, 11, 10, or 5 columns.
After creating a PostgreSQL table with 12 columns, I want to copy this CSV into the table.
I use this command:
COPY absence(champ1, champ2, num_agent, nom_prenom_agent, code_gestion, code_service, calendrier_agent, date_absence, code_absence, heure_absence, minute_absence, periode_absence)
FROM 'C:\temp\absence\absence.csv'
DELIMITER '\'
CSV
My CSV file contains 80000 lines.
Ex :
20\05\ 191\MARKEY CLAUDIE\GA0\51110\39H00\21/02/2020\1471\03\54\Matin
21\05\ 191\MARKEY CLAUDIE\GA0\51110\39H00\\8130\7H48\Formation avec repas\
30\05\ 191\MARKEY CLAUDIE\GA0\51430\39H00\\167H42\
22\9993\Temps de déplacement\98\37
When I execute the command, I get a message indicating that there is missing data for the lines with fewer than 12 fields.
Is there a trick?
copy is extremely fast and efficient, but less flexible because of that. Specifically it can't cope with files that have a different number of "columns" for each line.
You can either use a different import tool or, if you want to stick to built-in tools, copy the file into a staging table that only has a single column, then use Postgres string functions to split the lines into the columns:
create unlogged table absence_import
(
line text
);
\COPY absence_import(line) FROM 'C:\temp\absence\absence.csv' DELIMITER E'\b' CSV
E'\b' is the "backspace" character which can't really appear in a text file, so no column splitting is taking place.
Once you have imported the file, you can split each line using string_to_array() and then insert the result into the real table:
insert into absence(champ1, champ2, num_agent, nom_prenom_agent, code_gestion, code_service, calendrier_agent, date_absence, code_absence, heure_absence, minute_absence, periode_absence)
select line[1], line[2], line[3], .....
from (
select string_to_array(line, '\') as line
from absence_import
) t;
If there are non-text columns, you might need to cast the values to the target data type explicitly, e.g. line[3]::int.
You can add additional expressions to deal with missing columns, e.g. something like: coalesce(line[10], 'default value')
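If you go the "different import tool" route instead, the same split-and-pad idea can be sketched in Python before the data ever reaches COPY. This is only an illustration: split_ragged_line and EXPECTED_FIELDS are hypothetical names, and padding missing trailing fields with None is an assumption that mirrors how an out-of-range line[n] yields NULL in the SQL version.

```python
EXPECTED_FIELDS = 12  # number of columns in the target table


def split_ragged_line(line, delimiter="\\", width=EXPECTED_FIELDS, default=None):
    """Split one raw line and pad it with `default` up to `width` fields."""
    fields = line.rstrip("\n").split(delimiter)
    if len(fields) > width:
        raise ValueError(f"expected at most {width} fields, got {len(fields)}")
    # Missing trailing fields are filled in so every row has the same length.
    return fields + [default] * (width - len(fields))


# The 5-field line from the question, padded out to 12 fields.
row = split_ragged_line("22\\9993\\Temps de déplacement\\98\\37")
print(len(row), row[:5], row[5])
```

Passing a different default plays the role of the coalesce(...) expression above.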

Which delimiter to use when loading CSV data into Postgres?

I've come across a problem with loading some CSV files into my Postgres tables. I have data that looks like this:
ID,IS_ALIVE,BODY_TEXT
123,true,Hi Joe, I am looking for a new vehicle, can you help me out?
Now, the problem here is that the text in what is supposed to be the BODY_TEXT column is unstructured email data and can contain any sort of characters, and when I run the following COPY command it's failing because there are multiple , characters within the BODY_TEXT.
COPY sent from ('my_file.csv') DELIMITER ',' CSV;
How can I resolve this so that everything in the BODY_TEXT column gets loaded as-is without the load command potentially using characters within it as separators?
In addition to fixing the source file format, you can do this within PostgreSQL itself.
Load all lines from the file into a temporary table:
create temporary table t (x text);
copy t from 'foo.csv';
Then you can split each string using a regexp like:
select regexp_matches(x, '^([0-9]+),(true|false),(.*)$') from t;
regexp_matches
---------------------------------------------------------------------------
{123,true,"Hi Joe, I am looking for a new vehicle, can you help me out?"}
{456,false,"Hello, honey, there is what I want to ask you."}
(2 rows)
You can use this query to load data to your destination table:
insert into sent(id, is_alive, body_text)
select x[1], x[2], x[3]
from (
select regexp_matches(x, '^([0-9]+),(true|false),(.*)$') as x
from t) t
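The same pattern can also be applied outside the database, for example when fixing the source file. The following Python sketch reuses the regular expression from the query above; parse_line is a hypothetical helper, and converting the first two fields to int and bool is an assumption about the target column types.

```python
import re

# Same pattern as the regexp_matches call above: digits, a boolean, then the rest.
LINE_PATTERN = re.compile(r"^([0-9]+),(true|false),(.*)$")


def parse_line(line):
    """Split a raw line into (id, is_alive, body_text); return None if it does not match."""
    match = LINE_PATTERN.match(line.rstrip("\n"))
    if match is None:
        return None  # e.g. the header line
    raw_id, raw_flag, body = match.groups()
    return int(raw_id), raw_flag == "true", body


print(parse_line("123,true,Hi Joe, I am looking for a new vehicle, can you help me out?"))
```

Because only the first two commas are treated as separators, any commas inside the body text are preserved.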

Best way to prevent duplicate data on copy csv postgresql

This is more of a conceptual question because I'm planning how best to achieve our goals here.
I have a postgresql/postgis table with 5 columns. I'll be inserting/appending data into the database from a csv file every 10 minutes or so via the copy command. There will likely be some duplicate rows of data, so I'd like to copy the data from the csv file to the postgresql table but prevent any duplicate entries from getting into the table from the csv file. There are three columns, where if they are all equal, that will mean the entry is a duplicate. They are "latitude", "longitude" and "time". Should I make a composite key from all three columns? If I do that, will it just throw an error upon trying to copy the csv file into the database? I'm going to be copying the csv file automatically so I would want it to go ahead and copy the rest of the file that aren't duplicates and not copy the duplicates. Is there a way to do this?
Also, I of course want it to look for duplicates in the most efficient way. I don't need to look through the whole table (which will be quite large) for duplicates...just the past 20 minutes or so via the timestamp on the row. And I've indexed the db with the time column.
Thanks for any help!
Upsert
The Answer by Linoff is correct but can be simplified a bit with the new "UPSERT" feature in Postgres 9.5 (a.k.a. MERGE). That new feature is implemented in Postgres as INSERT ON CONFLICT syntax.
Rather than explicitly check for violation of the unique index, we can let the ON CONFLICT clause detect the violation. Then we DO NOTHING, meaning we abandon the effort to INSERT without bothering to attempt an UPDATE. So if we cannot insert, we just move on to next row.
We get the same results as Linoff’s code but lose the WHERE clause.
INSERT INTO bigtable(col1, … )
SELECT col1, …
FROM stagingtable st
ON CONFLICT idx_bigtable_col1_col2_col
DO NOTHING
;
I think I would take the following approach.
First, create an index on the three columns that you care about:
create unique index idx_bigtable_col1_col2_col3 on bigtable(col1, col2, col3);
Then, load the data into a staging table using copy. Finally, you can do:
insert into bigtable(col1, . . . )
select col1, . . .
from stagingtable st
where (col1, col2, col3) not in (select col1, col2, col3 from bigtable);
Assuming no other data modifications are going on, this should accomplish what you want. Checking for duplicates using the index should be ok from a performance perspective.
An alternative method is to emulate MySQL's "on duplicate key update" to ignore such records. Bill Karwin suggests implementing a rule in an answer to this question. The documentation for rules is here. Something similar could also be done with triggers.
The method posted by Basil Bourque was great, but there was a slight syntax error.
Based on the documentation, I modified it to the following, which works:
INSERT INTO bigtable(col1, … )
SELECT col1, …
FROM stagingtable st
ON CONFLICT (col1)
DO NOTHING
;
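For comparison, the skip-on-duplicate behaviour can also be sketched client-side in Python, before the rows are handed to COPY. This is a swapped-in illustration, not the ON CONFLICT mechanism itself: new_rows is a hypothetical helper, existing_keys is assumed to hold the keys already in the table (for instance, those from the last 20 minutes), and the sample rows are made up.

```python
def new_rows(staged_rows, existing_keys, key_columns=("latitude", "longitude", "time")):
    """Return staged rows whose key is not already known, keeping the first of each duplicate."""
    seen = set(existing_keys)
    kept = []
    for row in staged_rows:
        key = tuple(row[column] for column in key_columns)
        if key in seen:
            continue  # duplicate key: skip the row, similar in spirit to DO NOTHING
        seen.add(key)
        kept.append(row)
    return kept


# Made-up sample data, for illustration only.
existing = {("10.0", "20.0", "2020-01-01 00:00")}
staged = [
    {"latitude": "10.0", "longitude": "20.0", "time": "2020-01-01 00:00", "value": "a"},
    {"latitude": "10.0", "longitude": "20.0", "time": "2020-01-01 00:10", "value": "b"},
    {"latitude": "10.0", "longitude": "20.0", "time": "2020-01-01 00:10", "value": "c"},
]
print([row["value"] for row in new_rows(staged, existing)])
```

A unique index in the database remains the safer guard, since a client-side filter cannot see rows written concurrently by other sessions.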

Unloading from redshift to s3 with headers

I already know how to unload a file from redshift into s3 as one file. I need to know how to unload with the column headers. Can anyone please help or give me a clue?
I don't want to manually have to do it in shell or python.
As of cluster version 1.0.3945, Redshift now supports unloading data to S3 with header rows in each file i.e.
UNLOAD('select column1, column2 from mytable;')
TO 's3://bucket/prefix/'
IAM_ROLE '<role arn>'
HEADER;
Note: you can't use the HEADER option in conjunction with FIXEDWIDTH.
https://docs.aws.amazon.com/redshift/latest/dg/r_UNLOAD.html
If any of your columns are non-character, then you need to explicitly cast them as char or varchar because the UNION forces a cast.
Here is an example of the full statement that will create a file in S3 with the headers in the first row.
The output file will be a single CSV file with quotes.
This example assumes numeric values in column_1. You will need to adjust the ORDER BY clause to a numeric column to ensure the header row is in row 1 of the S3 file.
******************************************
/* Redshift export to S3 CSV single file with headers - limit 6.2GB */
UNLOAD ('
SELECT \'column_1\',\'column_2\'
UNION
SELECT
CAST(column_1 AS varchar(255)) AS column_1,
CAST(column_2 AS varchar(255)) AS column_2
FROM source_table_for_export_to_s3
ORDER BY 1 DESC
;
')
TO 's3://bucket/path/file_name_for_table_export_in_s3_' credentials
'aws_access_key_id=<key_with_no_<>_brackets>;aws_secret_access_key=<secret_access_key_with_no_<>_brackets>'
PARALLEL OFF
ESCAPE
ADDQUOTES
DELIMITER ','
ALLOWOVERWRITE
GZIP
;
****************************************
There is no direct option provided by Redshift UNLOAD.
But we can tweak queries to generate files with a header row added.
First, we will use the PARALLEL OFF option so that it creates only one file.
"By default, UNLOAD writes data in parallel to multiple files, according to the number of slices in the cluster. The default option is ON or TRUE. If PARALLEL is OFF or FALSE, UNLOAD writes to one or more data files serially, sorted absolutely according to the ORDER BY clause, if one is used. The maximum size for a data file is 6.2 GB. So, for example, if you unload 13.4 GB of data, UNLOAD creates the following three files."
To have headers in unload files we will do as below.
Suppose you have table as below
create table mytable
(
name varchar(64) default NULL,
address varchar(512) default NULL
)
Then use a select like the one below in your unload to add headers as well:
( select 'name','address') union ( select name,address from mytable )
This will add the headers name and address as the first line of your output.
Just to complement the answer: to ensure the header row comes first, you don't have to order by a specific column of data. You can enclose the UNIONed selects inside another select, add an ordinal column to them, and then in the outer select order by that column without including it in the list of selected columns.
UNLOAD ('
SELECT column_1, column_2 FROM (
SELECT 1 AS i,\'column_1\' AS column_1, \'column_2\' AS column_2
UNION ALL
SELECT 2 AS i, column_1::varchar(255), column_2::varchar(255)
FROM source_table_for_export_to_s3
) t ORDER BY i
')
TO 's3://bucket/path/file_name_for_table_export_in_s3_'
CREDENTIALS
'aws_access_key_id=...;aws_secret_access_key=...'
DELIMITER ','
PARALLEL OFF
ESCAPE
ADDQUOTES;
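When there are many columns, writing the header select by hand gets tedious. As a rough sketch of the ordinal-column approach described above, the inner query could be generated in Python. The helpers header_union_query and escape_for_unload are hypothetical, the varchar(255) cast is carried over from the example, and the isidentifier check is only a simple guard that assumes plain, unqualified names.

```python
def header_union_query(table, columns):
    """Build a SELECT that emits a header row first, then the table's rows cast to text."""
    for name in [table, *columns]:
        # Simple guard: accept only plain identifiers (no schema prefix, no quoting).
        if not name.isidentifier():
            raise ValueError(f"unsupported identifier: {name!r}")
    names = ", ".join(columns)
    header = ", ".join(f"'{c}' AS {c}" for c in columns)
    data = ", ".join(f"{c}::varchar(255)" for c in columns)
    return (
        f"SELECT {names} FROM ("
        f"SELECT 1 AS i, {header} "
        f"UNION ALL "
        f"SELECT 2 AS i, {data} FROM {table}"
        f") t ORDER BY i"
    )


def escape_for_unload(query):
    """Backslash-escape single quotes so the query can sit inside UNLOAD ('...')."""
    return query.replace("'", "\\'")


query = header_union_query("source_table_for_export_to_s3", ["column_1", "column_2"])
print(escape_for_unload(query))
```

The ordinal column i is used only for ordering, so it never appears in the unloaded file.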
Redshift now supports unload with headers. September 19–October 10, 2018 release.
The syntax for unloading with headers is -
UNLOAD ('select-statement')
TO 's3://object-path/name-prefix'
authorization
HEADER
Unfortunately, the UNLOAD command doesn't natively support this feature (see other answers for how to do it with workarounds).
I've posted a feature request on the AWS forums, so hopefully it gets added someday.
Edit: The feature has now been implemented natively in Redshift! 🎉
Try like this:
Unload VENUE with a Header:
unload ('select * from venue where venueseats > 75000')
to 's3://mybucket/unload/'
iam_role 'arn:aws:iam::0123456789012:role/MyRedshiftRole'
header
parallel off;
The following shows the contents of the output file with a header row:
venueid|venuename|venuecity|venuestate|venueseats
6|New York Giants Stadium|East Rutherford|NJ|80242
78|INVESCO Field|Denver|CO|76125
83|FedExField|Landover|MD|91704
79|Arrowhead Stadium|Kansas City|MO|79451
To make the process easier you can use a pre-built docker image to extract and include the header row.
https://github.com/openbridge/ob_redshift_unload
It will also do a few other things, but it seemed to make sense to package this in an easy to use format.
To unload a table as csv to s3 including the headers, you will simply have to do it this way
UNLOAD ('SELECT * FROM {schema}.{table}')
TO 's3://{s3_bucket}/{s3_key}/{table}/'
with credentials
'aws_access_key_id={access_key};aws_secret_access_key={secret_key}'
CSV HEADER ALLOWOVERWRITE PARALLEL OFF;

SQL Server 2000 query that omits commas in resulting rows?

I'm wondering if there is a way to query a SQL Server database and format columns to omit any commas in the data.
The reason for asking is that I have 10000+ records, and throughout the data the varchar columns contain values like 3,25% and others like 1%.
I'd prefer not to alter the data in the original table, so I'm asking whether a select with other functions would do the trick.
I have thought about selecting all the data into a temp table and stripping the commas, but that is a lot of work to do every time I run the query.
Please reply with any info, or whether this is possible.
Take a look at the REPLACE function:
SELECT REPLACE(YourColumn, ',', '')
FROM YourTable
Use SQL REPLACE:
REPLACE(YourField,',','')