I need to load an entire text file (containing only ASCII text) into the database (DB2 Express edition). The table has only two columns: EXAMPLE_TABLE (ID, TEXT). The ID column is the PK, with auto-generated values, whereas TEXT is VARCHAR(50).
Now I need to use the load/import utility to save each sentence of the text file into EXAMPLE_TABLE, that is, one row per sentence. The row ID should be auto-generated, but that is already specified at table creation time. The import utility should treat the period '.' as the delimiter (otherwise I don't know how to extract the sentences).
How can this be done in DB2?
Thanks in advance!
When using delimited files, the standard DB2 import and load utilities do not have the ability to specify a row record terminator. The LF character (or CRLF on Windows) is the only record terminator you can use.
So, you would need to pre-process your file (to either replace each period (.) with a newline or insert a newline after each period) before you can use import or load, resulting in a file with each sentence on a separate line.
You can do this with tr:
tr '.' '\n' < file > file.load
db2 "import from file.load of del insert into example_table (text)"
Keep in mind that you will probably also need to account for spaces after each period, so you don't end up with leading spaces at the beginning of each "sentence" in your table, and you may also want to account for additional whitespace (e.g. empty lines between paragraphs).
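For example, adding a sed pass to the pipeline (a sketch; adjust the patterns to your data) strips leading whitespace and drops blank lines before running the same import:
tr '.' '\n' < file | sed -e 's/^[[:space:]]*//' -e '/^$/d' > file.load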
Related
I have a CSV file whose lines have 12, 11, 10, or 5 columns.
After creating a PostgreSQL table with 12 columns, I want to copy this CSV into the table.
I use this request:
COPY absence(champ1, champ2, num_agent, nom_prenom_agent, code_gestion, code_service, calendrier_agent, date_absence, code_absence, heure_absence, minute_absence, periode_absence)
FROM 'C:\temp\absence\absence.csv'
DELIMITER '\'
CSV
My CSV file contains 80,000 lines.
Example:
20\05\ 191\MARKEY CLAUDIE\GA0\51110\39H00\21/02/2020\1471\03\54\Matin
21\05\ 191\MARKEY CLAUDIE\GA0\51110\39H00\\8130\7H48\Formation avec repas\
30\05\ 191\MARKEY CLAUDIE\GA0\51430\39H00\\167H42\
22\9993\Temps de déplacement\98\37
When I execute the request, I get a message indicating that there is missing data for the lines with fewer than 12 fields.
Is there a trick?
copy is extremely fast and efficient, but less flexible because of that. Specifically, it can't cope with files that have a different number of "columns" on each line.
You can either use a different import tool, or, if you want to stick to built-in tools, copy the file into a staging table that has only a single column, then use Postgres string functions to split the lines into the columns:
create unlogged table absence_import
(
line text
);
\COPY absence_import(line) FROM 'C:\temp\absence\absence.csv' DELIMITER E'\b' CSV
E'\b' is the "backspace" character which can't really appear in a text file, so no column splitting is taking place.
Once you have imported the file, you can split each line using string_to_array() and then insert that into the real table:
insert into absence(champ1, champ2, num_agent, nom_prenom_agent, code_gestion, code_service, calendrier_agent, date_absence, code_absence, heure_absence, minute_absence, periode_absence)
select line[1], line[2], line[3], .....
from (
select string_to_array(line, '\') as line
from absence_import
) t;
If there are non-text columns, you might need to cast the values to the target data type explicitly, e.g. line[3]::int.
You can add additional expressions to deal with missing columns, e.g. something like coalesce(line[10], 'default value').
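Putting the pieces together, a minimal sketch (the column list is shortened, and the integer cast and the default value are illustrative assumptions, not taken from the original table definition):
insert into absence(champ1, champ2, nom_prenom_agent)
select line[1]::int,                -- assuming champ1 is an integer column
       line[2],
       coalesce(line[4], 'UNKNOWN') -- fall back to a placeholder when the field is missing
from (
  select string_to_array(line, '\') as line
  from absence_import
) t;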
I've come across a problem with loading some CSV files into my Postgres tables. I have data that looks like this:
ID,IS_ALIVE,BODY_TEXT
123,true,Hi Joe, I am looking for a new vehicle, can you help me out?
Now, the problem here is that the text in what is supposed to be the BODY_TEXT column is unstructured email data and can contain any sort of characters. When I run the following COPY command, it fails because there are multiple , characters within the BODY_TEXT.
COPY sent FROM 'my_file.csv' DELIMITER ',' CSV;
How can I resolve this so that everything in the BODY_TEXT column gets loaded as-is without the load command potentially using characters within it as separators?
In addition to fixing the source file format, you can do it in PostgreSQL itself.
Load all lines from file to temporary table:
create temporary table t (x text);
copy t from 'foo.csv';
Then you can split each string using a regexp, like:
select regexp_matches(x, '^([0-9]+),(true|false),(.*)$') from t;
regexp_matches
---------------------------------------------------------------------------
{123,true,"Hi Joe, I am looking for a new vehicle, can you help me out?"}
{456,false,"Hello, honey, there is what I want to ask you."}
(2 rows)
You can use this query to load data to your destination table:
insert into sent(id, is_alive, body_text)
select x[1], x[2], x[3]
from (
select regexp_matches(x, '^([0-9]+),(true|false),(.*)$') as x
from t) t
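If id and is_alive are typed columns (say integer and boolean) rather than text, you will likely also need explicit casts, since regexp_matches() returns a text array. A sketch under that assumption:
insert into sent(id, is_alive, body_text)
select x[1]::int,      -- captured digits, cast to the assumed integer column
       x[2]::boolean,  -- captured 'true'/'false', cast to the assumed boolean column
       x[3]
from (
  select regexp_matches(x, '^([0-9]+),(true|false),(.*)$') as x
  from t) t;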
I am trying to create a Postgres external table from a CSV file which has one of its columns as an email address, and this column has multiple email addresses separated by commas. Since the delimiter for the file is a comma, when creating the external table it cannot differentiate the "," within a column from the "," between columns. The list of emails in the email column is enclosed in double quotes as well.
Is there a way to load it without changing the delimiter for the entire file?
It's standard practice with CSV to enclose fields between double quotes.
COPY... CSV will import this directly.
Example:
CREATE TABLE email(id int, emails text);
COPY email FROM STDIN CSV;
Enter data to be copied followed by a newline.
End with a backslash and a period on a line by itself.
>> 1,"abc#example.com,def#example.com"
>> \.
Result:
select * from email ;
id | emails
----+---------------------------------
1 | abc#example.com,def#example.com
(1 row)
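The same works when loading from a file rather than STDIN, e.g. with the psql meta-command (the file name is illustrative):
\copy email FROM 'emails.csv' CSV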
Let's say I have some customer data like the following saved in a text file:
|Mr |Peter |Bradley |72 Milton Rise |Keynes |MK41 2HQ |
|Mr |Kevin |Carney |43 Glen Way |Lincoln |LI2 7RD | 786 3454
I copied the aforementioned data into my customer table using the following command:
\copy customer(title, fname, lname, addressline, town, zipcode, phone) from 'customer.txt' delimiter '|'
However, as it turns out, there are some extra space characters before and after various parts of the data. What I'd like to do is call trim() before copying the data into the table - what is the best way to achieve this?
Is there a way to call trim() on every value of every row and avoid inserting unclean data in the first place?
Thanks,
I think the best way to go about this is to add a BEFORE INSERT trigger to the table you're inserting into. This way, you can write a stored procedure that will execute before every record is inserted and trim whitespace (or do any other transformations you may need) on any columns that need it. When you're done, simply remove the trigger (or leave it, which will improve data integrity if you never want that whitespace in those columns). I think explaining how to create a trigger and stored procedure in PostgreSQL is probably outside the scope of this question, but I will link to the documentation for each.
I think this is the best way because it is simpler than parsing through a text file or writing shell code to do this. This kind of sanitization is the kind of thing triggers do very well and very simply.
Creating a Trigger
Creating a Trigger Function
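For reference, a minimal sketch of such a trigger, using the column names from the customer table in the question (the function and trigger names are made up):
-- trigger function: trim leading/trailing whitespace before each row is stored
CREATE OR REPLACE FUNCTION trim_customer_fields() RETURNS trigger AS $$
BEGIN
    NEW.title       := trim(NEW.title);
    NEW.fname       := trim(NEW.fname);
    NEW.lname       := trim(NEW.lname);
    NEW.addressline := trim(NEW.addressline);
    NEW.town        := trim(NEW.town);
    NEW.zipcode     := trim(NEW.zipcode);
    NEW.phone       := trim(NEW.phone);
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;
-- attach it to the table; both COPY and \copy fire row-level triggers
CREATE TRIGGER trim_customer_before_insert
    BEFORE INSERT ON customer
    FOR EACH ROW
    EXECUTE PROCEDURE trim_customer_fields();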
I have a somewhat similar use case in one of my projects. My input files:
have the number of lines in the file as the last line;
need line numbers added to every line;
need a file_id added to every line.
I use the following piece of shell code:
FACT=$( dosql "TRUNCATE tab_raw RESTART IDENTITY;
COPY tab_raw(file_id,lnum,bnum,bname,a_day,a_month,a_year,a_time,etype,a_value)
FROM stdin WITH (DELIMITER '|', ENCODING 'latin1', NULL '');
$(sed -e '$d' -e '=' "$FILE"|sed -e 'N;s/\n/|/' -e 's/^/'$DSID'|/')
\.
VACUUM ANALYZE tab_raw;
SELECT count(*) FROM tab_raw;
" | sed -e 's/^[ ]*//' -e '/^$/d'
)
dosql is a shell function that executes psql with the proper connection info and runs everything that was given as an argument.
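The original answer does not show dosql itself; a hypothetical wrapper along these lines would fit the calls above (the connection variables are placeholders):
# hypothetical dosql: feed the SQL/COPY input it receives to psql on stdin
dosql() {
    printf '%s\n' "$1" | psql -X -q -A -t -h "$DBHOST" -U "$DBUSER" -d "$DBNAME"
}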
As a result of this operation, the $FACT variable holds the total count of inserted records (for error detection).
Later I do another dosql call:
dosql "SET work_mem TO '800MB';
SELECT tab_prepare($DSID);
VACUUM ANALYZE tab_raw;
SELECT tab_duplicates($DSID);
SELECT tab_dst($DSID);
SELECT tab_gaps($DSID);
SELECT tab($DSID);"
to analyze the data and move it from the auxiliary table into the final tables.
I need to export the content of a DB2 table to a CSV file.
I read that nochardel would prevent having the separator between each value, but that is not what happens.
Suppose I have a table
MY_TABLE
-----------------------
Field_A varchar(10)
Field_B varchar(10)
Field_C varchar(10)
I am using this command
export to myfile.csv of del modified by nochardel select * from MY_TABLE
I get this written into myfile.csv:
data1 ,data2 ,data3
but I would like no ',' separator like below
data1 data2 data3
Is there a way to do that?
You're asking how to eliminate the comma (,) in a comma separated values file? :-)
NOCHARDEL tells DB2 not to surround character fields (CHAR and VARCHAR fields) with a character-field delimiter (the default is the double quote " character).
Anyway, when exporting from DB2 using the delimited format, you have to have some kind of column delimiter. There isn't a NOCOLDEL option for delimited files.
The EXPORT utility can't write fixed-length (positional) records - you would have to do this by either:
Writing a program yourself,
Using a separate utility (IBM sells the High Performance Unload utility)
Writing an SQL statement that concatenates the individual columns into a single string.
Here's an example for the last option:
export to file.del
of del
modified by nochardel
select
cast(col1 as char(20)) ||
cast(intcol as char(10)) ||
cast(deccol as char(30))
from your_table;
This last option can be a pain, since DB2 doesn't have a sprintf() function to help format strings nicely.
Yes, there is another way of doing this. I always do the following:
Put the select statement into a file (input.sql):
select
cast(col1 as char(20)),
cast(col2 as char(10)),
cast(col3 as char(30))
from your_table;
Call the db2 CLP like this (-x suppresses column headings, -t treats ; as the statement terminator, -f reads the statement from the file, -r writes the output to a file):
db2 -x -tf input.sql -r result.txt
This will work for you, because you need to cast varchar to char. Like Ian said, casting numbers or other data types to char might bring unexpected results.
PS: I think Ian is right about the difference between CSV and fixed-length format ;-)
Use "of asc" instead of "of del". Then you can specify the fixed column locations instead of delimiting.