I would like to know how I can import my data into a table. I know about the COPY command and its HEADER option, but the file I have to import has the following format:
Line 1: header1, header2, header3,...
Line 2: vartype, vartype, vartype,...
Line 3: data1, data2,...
As you can see, I need to skip the second line too. For example:
"phonenumber","countrycode","firstname","lastname"
INTEGER,INTEGER,VARCHAR(50),VARCHAR(50)
123456789,44,"James","Bond"
5551234567,1,"Angelina","Jolie"
912345678,34,"Antonio","Banderas"
The first line contains the exact names of the table's columns. I have tried to use the INSERT INTO command but have not had good results.
I am using these two strategies for this type of problem:
1) Import all
import all rows into a temporary table where the columns have varchar type
delete the rows you do not want
insert the data into the final table, casting varchar to the desired types (a sketch is shown after the sed examples below)
2) Pre-process
delete rows from imported file
import
For your case, you can delete the 2nd line using sed, for example:
sed -i '2d' importfile.txt
This will remove the 2nd line from the file named importfile.txt. Note that the -i flag will overwrite the file in place, so use it with care.
You can use this to delete a range of lines:
sed -i '2,4d' importfile.txt
This will remove lines 2, 3 and 4 from the file.
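For the first strategy, a minimal psql sketch might look like this (the staging table, the final table contacts and its column types are assumptions based on the sample data; bigint is used for phonenumber because one sample value exceeds the integer range):
CREATE TABLE import_staging (phonenumber varchar, countrycode varchar, firstname varchar, lastname varchar);
\copy import_staging FROM 'importfile.txt' WITH (FORMAT csv, HEADER)
DELETE FROM import_staging WHERE phonenumber = 'INTEGER';  -- drop the type row
INSERT INTO contacts (phonenumber, countrycode, firstname, lastname)
SELECT phonenumber::bigint, countrycode::integer, firstname, lastname FROM import_staging;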
If you are working in a Linux shell you could always just stream in the records you want, e.g.
tail -[number of lines minus header] <file> | psql <db> -c "COPY <table> FROM STDIN CSV;"
or if your header is marked by say "#"
grep -v "^#" <file> | psql <db> -c "COPY <table> FROM STDIN CSV;"
You'll have to pre-process the file, I'm afraid. There are far too many strange formats (like this one) around for COPY to understand - it just concentrates on handling the basics. You can trim the second line out with a simple bit of sed or perl.
perl -ne 'print unless ($.==2)' source_file.txt
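Piped straight into COPY, that could look like this (the database and table names are assumptions):
perl -ne 'print unless ($.==2)' source_file.txt | psql mydb -c "COPY contacts FROM STDIN CSV HEADER;"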
Related
There is a text field in a Postgres database containing newlines. I would like to export the content of that field to a text file, preserving those newlines. However, the COPY TO command explicitly transforms those characters into the \n string. For example:
$ psql -d postgres -c "COPY (SELECT CHR(10)) TO '/tmp/out.txt';"
COPY 1
$ cat /tmp/out.txt
\n
This behaviour seems to match the short description in the documentation:
Presently, COPY TO will never emit an octal or hex-digits backslash sequence, but it does use the other sequences listed above for those control characters.
Is there any workaround to get the newline in the output? E.g. so that a command like:
$ psql -d postgres -c "COPY (SELECT 'A line' || CHR(10) || 'Another line') TO '/tmp/out.txt';"
results in something like:
A line
Another line
Update: I do not wish to obtain a CSV file. The output must not have headers, column separators or column decorators such as quotes (exactly as exemplified in the output above). The answers provided in a different question with COPY AS CSV do not fulfil this requirement.
Per my comment:
psql -d postgres -U postgres -c "COPY (SELECT CHR(10)) TO '/tmp/out.txt' WITH CSV;"
Null display is "NULL".
COPY 1
cat /tmp/out.txt
"
"
psql -d postgres -U postgres -c "COPY (SELECT 'A line' || CHR(10) || 'Another line') TO '/tmp/out.txt' WITH CSV;"
Null display is "NULL".
COPY 1
cat /tmp/out.txt
"A line
Another line"
Using the CSV format will maintain the embedded line breaks in the output. This is explained in the COPY documentation under CSV Format:
The values in each record are separated by the DELIMITER character. If the value contains the delimiter character, the QUOTE character, the NULL string, a carriage return, or line feed character, then the whole value is prefixed and suffixed by the QUOTE character, and any occurrence within the value of a QUOTE character or the ESCAPE character is preceded by the escape character. You can also use FORCE_QUOTE to force quotes when outputting non-NULL values in specific columns.
...
Note
CSV format will both recognize and produce CSV files with quoted values containing embedded carriage returns and line feeds. Thus the files are not strictly one line per table row like text-format files.
UPDATE
An alternative method that does not involve quoting, using psql:
create table line_wrap(id integer, fld_1 varchar);
insert into line_wrap values (1, 'line1
line2');
insert into line_wrap values (2, 'line3
line4');
select fld_1 from line_wrap
\g (format=unaligned tuples_only=on) out.txt
cat out.txt
line1
line2
line3
line4
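The same result can be produced non-interactively with psql's command-line switches -A (unaligned), -t (tuples only) and -o (output file); a sketch assuming the same table:
psql -d postgres -At -c "SELECT fld_1 FROM line_wrap" -o out.txt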
I need to rename files like the following, from PFSI4C.CSC.CCC.FSIContractData20211008.zip to TFSI4C.CSC.CCC.FSIContractData20211104.zip.
Every file's name should start with "T" and end with the current system date + ".zip".
I am trying to loop over files and it looks like this:
for FILENAME in PFSI4C.CSC.CCC.FSIContractData20211008; do
NEW_FILENAME_HEADER=`echo $FILENAME | awk -F "." '{ print $1"."$2"."$3 }'` # which takes PFSI4C.CSC.CCC.
NEW_FILENAME_SUFFIX=`echo $FILENAME | awk -F "[.|Data20]" '{ print "."$4 }'` # this is the part where I can't figure out how to take only "FSIContract"
NEW_FILENAME="${NEW_FILENAME_HEADER}.""${NEW_FILENAME_SUFFIX}""Data20""${DATE}".zip" # which should make "TFSI4C.CSC.CCC.FSIContractData20211104.zip"
mv $FILENAME $NEW_FILENAME
done
FYI, $DATE in our script is defined like this: DATE=$(date +%y%m%d), for example 211104.
Thanks in advance!
With Perl's rename command you could try the following code. I am using the -n option here to test it in DRY RUN mode: it only prints which file name (current) will be changed to which file name (required one); remove -n from the code once you are satisfied with the shown output. Also, the DATE variable (DATE='20211104') is a shell variable which contains the date value needed in the new file name.
rename -n 's:^.(.*)\d{8}(\.zip)$:T$1$2:; s:\.zip$:'"$DATE"'.zip:' *.zip
Output will be as follows:
rename(PFSI4C.CSC.CCC.FSIContractData20211008.zip, TFSI4C.CSC.CCC.FSIContractData20211104.zip)
Explanation of rename code:
-n: To run rename command in DRY RUN mode.
s:^.(.*)\d{8}(\.zip)$:T$1$2:;: Runs the 1st substitution in the rename code. It creates 2 capturing groups: the 1st capturing group holds everything from the 2nd character onwards up to just before the 8 digits, AND the 2nd capturing group holds the .zip at the end of the filename. The substitution replaces the match with T$1$2, as per the requirement.
s:\.zip$:'"$DATE"'.zip:: Runs the 2nd substitution in the rename code, replacing the trailing .zip with the shell variable DATE followed by .zip, as per the requirement.
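Putting it together, with the date taken from the system (assuming Perl's rename is installed and a 4-digit year is wanted):
DATE=$(date +%Y%m%d)
rename 's:^.(.*)\d{8}(\.zip)$:T$1$2:; s:\.zip$:'"$DATE"'.zip:' *.zip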
First of all you should get the current date with date +%Y%m%d (4-digit year) instead of date +%y%m%d (2-digit year). The following assumes you do that. Prepend 20 to $DATE if that is not an option.
If your file names all look like the example you show, bash substitutions can do it. First compute the length, extract the characters from the second one up to the last one before the date, prepend T, and append $DATE.zip:
len="${#FILENAME}"
NEW_FILENAME="T${FILENAME:1:$((len-13))}$DATE.zip"
But you could also use sed, which offers a bit more flexibility. For instance it can deal with trailing dates of a variable number of digits:
NEW_FILENAME=$(echo "$FILENAME" | sed 's/.\(.*[^0-9]\)\?[0-9]*\.zip/T\1'"$DATE"'.zip/')
Or, a bit more elegantly with bash (here-strings) and GNU sed or another sed that supports the -E option (for extended regular expressions):
NEW_FILENAME=$(sed -E 's/.(.*[^0-9])?[0-9]*\.zip/T\1'"$DATE"'.zip/' <<< "$FILENAME")
Assumptions:
replace first character (P in OP's sample) with T
replace last 10 characters (YYMMDD.zip) with $DATE.zip (OP has already defined $DATE)
all files contain 20YYMMDD so we don't need to worry about names containing strings like 19YYMMDD and 21YYMMDD
One idea using parameter substitutions (which also eliminates the overhead of subprocess calls to execute various echo, awk and sed commands):
DATE='211104'
FILENAME='PFSI4C.CSC.CCC.FSIContractData20211008.zip'
NEWFILENAME="T${FILENAME/?}" # prepend "T"; "/?" => remove first character
NEWFILENAME="${NEWFILENAME/??????.zip}${DATE}.zip" # remove string "??????.zip"; append "${DATE}.zip"
echo mv "${FILENAME}" "${NEWFILENAME}"
This generates:
mv PFSI4C.CSC.CCC.FSIContractData20211008.zip TFSI4C.CSC.CCC.FSIContractData20211104.zip
Once the OP is satisfied with the accuracy of the code, the echo can be removed to enable execution of the mv command.
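The same substitutions drop straight into a loop over the files (the glob pattern is an assumption; $DATE as defined above):
for FILENAME in P*.zip; do
    NEWFILENAME="T${FILENAME/?}"
    NEWFILENAME="${NEWFILENAME/??????.zip}${DATE}.zip"
    echo mv "${FILENAME}" "${NEWFILENAME}"
done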
I have a CSV file that is causing me serious headaches going into Tableau. Some of the rows in the CSV are wrapped in quotes (" ") and some are not. I would like them all to be imported without the quotes (i.e. ignore them on rows that have them).
Some data:
"1;2;Red;3"
1;2;Green;3
1;2;Blue;3
"1;2;Hello;3"
Do you have any suggestions?
If you have a bash prompt hanging around...
You can use cat to output the file contents so you can make sure you're working with the right data:
cat filename.csv
Then, pipe it through sed so you can visually check that the quotes were deleted:
cat filename.csv | sed 's/"//g'
If the output looks good, use the -i flag to edit the file in place:
sed -i 's/"// g' filename.csv
All quotes should now be missing from filename.csv
If your data has quotes in it, and you want to only strip the quotes that appear at the beginning and end of each line, you can use this instead:
sed -i 's/^"\(.*\)"$/\1/' filename.csv
It's not the most elegant way to do it in Tableau but if you cannot remove it in the source file, you could create a calculated field for the first and last column that strips the quotation marks.
Right click on the field for the first column and choose Create/Calculated Field
Use this formula: INT(REPLACE([FirstColumn],'"',''))
Name the column accordingly
Do the same for the last column
This assumes the data you provided matches the data you work on, and that these fields are integer fields (hence the INT() usage). In case they are string fields, you would want to make sure that you don't remove quotation marks that belong to the field value.
I have a large file (~4,000,000 lines) consisting of multiple blocks of data, each with an introductory ID tag, and a list of selected ID tags in a second file.
For example:
Data.txt
>ID:1000
data about this
more data
data
>ID:1001
blah blah
data
>ID:1002
foo
...
And ID_Tags.txt
>ID:1000
>ID:1002
>ID:1085
>ID:3062
...
I need a way to grab the ID tag and corresponding data from Data.txt for the data specified in ID_Tags.txt so that I wind up with a file looking like:
Select_Data.txt
>ID:1000
data about this
more data
data
>ID:1002
foo
...
I can get one block of data at a time with
sed -n '/ID:1000/,/>/p' Data.txt | head -n -1 >> Select_Data.txt
But this only does a single ID tag at a time, and I have hundreds of select ID tags. Is there a way to avoid doing this manually?
$ awk 'NR==FNR{tags[$0];next} /^>/{f=($0 in tags)} f' ID_Tags.txt Data.txt
>ID:1000
data about this
more data
data
>ID:1002
foo
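For readability, the same program can be spread over several lines with comments (equivalent to the one-liner above):
awk '
    NR==FNR { tags[$0]; next }    # 1st file: remember every wanted ID line
    /^>/    { f = ($0 in tags) }  # data file: on an ID line, set the print flag
    f                             # print lines while the flag is set
' ID_Tags.txt Data.txt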
This might work for you (GNU sed):
sed $'1i:a\ns#.*#/^&$/bb#;$ad;:b;n;/^>/ba;bb' ids_file | sed -f - data_file
This builds a sed script from the ids file and runs that script against the data file. The generated sed script looks for the ids from the ids file and prints each matching id line and the lines following it, until the next id line, where it loops back to check the id again. All other lines are deleted.
You can use the following awk script:
awk 'NR==FNR{i[$1];next} NF>1 && $1 in i{print ">"$0}' RS='>' ids.txt data.txt
Output:
>ID:1000
data about this
more data
data
>ID:1002
foo
The key to my solution is to replace the default record separator \n by > using RS='>'. Using this trick it is pretty simple to access the individual fields of the data.
Explanation
We are passing both files to awk, ids.txt and data.txt and awk will process them in order.
NR==FNR{i[$1];next} runs only while awk is parsing the first file, ids.txt. NR represents the current overall record number, and FNR the number of the record in the current file. They are equal only when parsing the first file. i[$1] adds the value of the id (without the leading > since it is the record separator) as a key to the array i. next stops further processing of the record.
$1 in i {print ">"$0} checks whether the first column of the data record - the id - is a key in our array i, and prints the record while adding the > back to the front of it.
Note that we additionally check NF>1 (meaning the record is not empty) because awk produces an empty first record, since the data file starts with the record delimiter >. The empty string is a key in the array (the ids file also starts with >), so without the check an additional > would be printed.
I am trying to export data from PostgreSQL to CSV files, but when I have newlines in text in the database, the exported data is broken across several lines, which makes the CSV file much harder to read, not to mention that most applications will fail to load it properly.
Here is how I export the data now:
PRESQL="\pset format unaligned
\pset fieldsep \",\"
\pset footer off
\o 'out.csv'
"
cat <(echo $PRESQL) $QUERYFILE | psql …
So far, so good, unless you have newlines in the text fields. Any hack that would allow me to generate a very simple-to-parse CSV file (with one record per line)?
It was a mistake to consider that a CSV can be forced to have one line per row. The RFC states clearly that fields containing newlines are to be enclosed in double quotes.
You can try the replace() or regexp_replace() functions.
The answer to the following SO question should give you an idea: How to remove carriage returns and new lines in Postgresql?
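For instance, the query (here in $QUERYFILE) could strip the newlines itself before the output is written; a sketch with a hypothetical table and column:
SELECT regexp_replace(text_col, E'[\r\n]+', ' ', 'g') AS text_col FROM my_table;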