Greenplum : Getting filenames processed via an external table - postgresql

we are processing multiple files using external table. Is there any way I can get the file name being processed in external tables and stored it in database table?
Only workaround I can find is appending the file name to every record in the flat file which isn't ideal when huge dataset and multiple files.
Can anyone help on this
Thanks

No, the file name is simply never passed from the gpfdist daemon back to Greenplum. So you have to append the file name to each line - you can use gpfdist transformation for doing so

I was struggling with this as well, here's my solution. Please note I'm not an expert in linux, so there may be a one liner solution.
So I wanted to add a filename column in front of my records.
That can be done in sed, I've created a transform.sh file, with the following content:
#/bin/sh
filename=$1
#echo $filename >> transform.txt
sed -e "s|^|$filename\v|" $filename
Please note that I was using vertical tab as a delimiter, \v. Also in the filename you could have / hence using | . In order to have the value of $filename we have to use double quites for sed.
Test it, it looks good.
./transform.sh countersamples-2016-03-02--11-51-10.csv
countersamples-2016-03-02--11-51-10.csv
timestamp
machine
category
instance
name
value
countersamples-2016-03-02--11-51-10.csv
2016-03-02 11:51:10.064
DESKTOP-4PLQKVL
Memory
% Committed Bytes In Use
74.8485488891602
This part is done, lets continue with gpfdist. We need a yaml file that can be passed to gpfdist, I named this transform.yaml
Content:
---
VERSION: 1.0.0.1
TRANSFORMATIONS:
add_filename:
TYPE: input
CONTENT: data
COMMAND: /bin/bash transform.sh %filename%
Please note that we have the %filename% value here. It seems that gpfdist prefilters the files that needs to be handled, and passes them 1 by 1 to our transform.
Lets fire up gpfdist:
gpfdist -c transform.yaml -v
Now go into greenplum and create an external table such as:
CREATE READABLE EXTERNAL TABLE "ext_transform"
(
"filename" text,
"timestamp" timestamp without time zone ,
"machine" text ,
"category" text ,
"instance" text ,
"name" text ,
"value" double precision
)
LOCATION ('gpfdist://localhost:8080/*/countersamples*.csv#transform=add_filename')
FORMAT 'TEXT'
( HEADER DELIMITER '\013' NULL AS '\\N' ESCAPE AS '\\' )
And when we select data from it:
select * from "ext_transform";
We see:
I've created 2 folders to see how it reacts if the files are not in the same folder as the transform. This way I can distinguish between the 2 files, even if their data is identical.

Related

how import csv file into Postgres with empty values?

I am trying to import one csv file into Postgres which does contain age values, however there are also some empty values, since not all ages are known.
I would like to import the columns as real, since the columns contain ages with decimals like 98.45. The empty values for people when age is not known is apparently considered as strings, however I still would like to import the ages values as numbers. So I was wondering how to import real values, even when some cells in the csv are empty and thus are considered according to Postgres as string values.
for creation I used the following code, since I am dealing with decimal values.
Create table psychosocial.age (
respnr integer Primary key,
fage real,
gage real,
hage real);
after importing csv file, I get the following error
ERROR: invalid input syntax for integer: "11455, , , "
CONTEXT: COPY age, line 2, column respnr: "11455, , , "
One problem is that you're trying to import white spaces into numeric fields. So, first you have to pre-process your csv file before importing it.
Below is an example of how you can solve it using awk. From your console execute the following command:
$ cat file.csv | awk '{sub(/^ +/,""); gsub(/, /,",")}1' | psql db -c "COPY psychosocial.age FROM STDIN WITH CSV HEADER"
In case you're wondering how to pipe commands, take a look at these answers. Here a more detailed example on how to use COPY and the STDIN.
You also have to take into account that having quotation marks on integer fields can be problematic, e.g:
"11455, , , "
This will result in an error, since postgres will parse "11455 as a single value and will try to store it in an interger field, which will obviously fail. Instead, format your csv file to be like this:
11455, , ,
or even
11455,,,
You can achieve this also using awk from your console:
$ awk '{gsub(/\"/,"")};1' file.csv

mongoimport: set type for all fields when importing CSV

I have multiple problems with importing a CSV with mongoimport that has a headerline.
Following is the case:
I have a big CSV file with the names of the fields in the first line.
I know you can set this line to use as field names with: --headerline.
I want all field types to be strings, but mongoimport sets the types automatically to what it looks like.
IDs such as 0001 will be turn into 1, which can have bad side effects.
Unfortunately, there is (as far as i know) no way of setting them as string with a single command, but by naming each field and setting it type with
--columnsHaveTypes --fields "name.string(), ... "
When I did that, the next problem appeared.
The headerline (with all field names) got imported as values in a separate document.
So basically, my questions are:
Is there a way of setting all field types as string using the --headerline command ?
Alternative, is there a way to ignore the first line ?
I had this problem when uploading 41 million record CSV file into mongodb.
./mongoimport -d testdb -c testcollection --type csv --columnsHaveTypes -f
"RECEIVEDDATE.date(2006-01-02 15:04:05)" --file location/test.csv
As above we have a command to upload file with data types called '-f' or '--fields' but when we use this command to the file that contain header line, mondodb upload first row as well i.e header lines row then its leads error 'cannot convert to datatype' or upload column names also as data set.
Unfortunately we cannot use '--headerline' command instead of '--fields'.
Here the solutions that I found for this problem.
1)Remove header column and upload using '--fields' command as above command. if you re use linux environment you can use below command to remove first row of the huge file i.e header line.it took 2-3 mints for me.(depending on the machine performance)
sed -i -e "1d" location/test.csv
2)upload the file using '--headerline' command then mongodb uploads the file with its default identified data types.Then open mongodb shell command use testdb then run javascript command that get each record and change it into specific data types.But if you have huge file this will takes time.
found this solution from stackoverflow
db.testcollection.find().forEach( function (x) {
x.RECEIVEDDATE = new Date(x.RECEIVEDDATE ); db.testcollection .save(x);});
If you wanna remove the unnecessary rows that not fit to data type use below command.
mongodb document
'--parseGrace skipRow'
I found a solution, that I am comfortable with
Basically, I wanted to use mongoimport within my Clojure Code to import a CSV file in the DB and do a lot of stuff with it automatically. Due to the above mentioned problems I had to find a workaround, to delete this wrong document.
I did following to "solve" this problem:
To set the types as I wanted, I wrote a function to read the first line, put it in a vector and then used String concatenation to set these as fields.
Turning this: id,name,age,hometown,street
into this: id.string(),name.string(),age.string() etc
Then I used the values from the vector to identify the document with
{ name : "name"
age : "age"
etc : "etc" }
and then deleted it with a simple remving.find() command.
Hope this helps any dealing with the same kind of problem.
https://docs.mongodb.com/manual/reference/program/mongoimport/#example-csv-import-types reads:
MongoDB 3.4 added support for specifying field types. Specify field names and types in the form .() using --fields, --fieldFile, or --headerline.
so your first line within the csv file should have names with types. e.g.:
name.string(), ...
and the mongoimport parameters
--columnsHaveTypes --headerline --file <filename.csv>
As to the question of how to remove the first line, you can employ pipes. mongoimport reads from STDIN if no --file option passed. E.g.:
tail -n+2 <filename.csv> | mongoimport --columnsHaveTypes --fields "name.string(), ... "

The value in "sym" file disappears when using splayed tables

I am using the following line:
`:c:/dir/ set .Q.en[`:c:/dir; tablename]
Everything is ok if I don't exit KDB, but if I do and then try to load the table using
get `dir
all the symbol columns are integer. I would really appreciate your help into understanding why this happens.
It looks like you forgot to repeat the table name on the l.h.s. of set.
Try
q)`:c:/dir/tablename/ set .Q.en[`:c:/dir; tablename]
This will correctly save table columns in c:/dir/tablename subdirectory and place the sym file alongside. Now you should be able to load both your table and the sym file by using the \l command or specifying c:/dir on the command line when you restart q
q c:/dir
or
q
q)\l c:/dir
(no backticks or leading :'s in either of those commands)
If you want to use get on this table, you will have to load sym separately:
q)load`:c:/dir/sym
q)get`:c:/dir/tablename/
(note the leading : in the path specs)
Finally, you may want to take a look at the rsave command which will save your table without you having to write tablename twice.
.Q.en takes 2 oarams - file handle and table data
Your first param isnt a hsym - should be backtick then colon then path to your db root
Also set takes 2 params - first in this case should be the path to where you want to save like dir/splayedTableName/

updating table rows based on txt file

I have been searching but so far I only found how to insert date into tables based on a csv files.
I have the following scenario:
Directory name = ticketID
Inside this directory I have a couple of files, like:
Description.txt
Summary.txt - Contains ticket header and has been imported succefully.
Progress_#.txt - this is everytime a ticket gets udpdated. I get a new file.
Solution.txt
Importing the Issue.txt was easy since this was actually a CSV.
Now my problem is with Description and Progress files.
I need to update the existing rows with the data from this files. Something on the line of
update table_ticket set table_ticket.description = Description.txt where ticket_number = directoryname
I'm using PostgreSQL and the COPY command is valid for new data and it would still fail due to the ',;/ special chars.
I wanted to do this using bash script, but it seem that it is it won't be possible:
for i in `find . -type d`
do
update table_ticket
set table_ticket.description = $i/Description.txt
where ticket_number = $i
done
Of course the above code would take into consideration connection to the database.
Anyone has a idea on how I could achieve this using shell script. Or would it be better to just make something in Java and read and update the record, although I would like to avoid this approach.
Thanks
Alex
Thanks for the answer, but I came across this:
psql -U dbuser -h dbhost db
\set content = `cat PATH/Description.txt`
update table_ticket set description = :'content' where ticketnr = TICKETNR;
Putting this into a simple script I created the following:
#!/bin/bash
for i in `find . -type d|grep ^./CS`
do
p=`echo $i|cut -b3-12 -`
echo $p
sed s/PATH/${p}/g cmd.sql > cmd.tmp.sql
ticketnr=`echo $p|cut -b5-10 -`
sed -i s/TICKETNR/${ticketnr}/g cmd.tmp.sql
cat cmd.tmp.sql
psql -U supportAdmin -h localhost supportdb -f cmd.tmp.sql
done
The downside is that it will create always a new connection, later I'll change to create a single file
But it does exactly what I was looking for, putting the contents inside a single column.
psql can't read the file in for you directly unless you intend to store it as a large object in which case you can use lo_import. See the psql command \lo_import.
Update: #AlexandreAlves points out that you can actually slurp file content in using
\set myvar = `cat somefile`
then reference it as a psql variable with :'myvar'. Handy.
While it's possible to read the file in using the shell and feed it to psql it's going to be awkward at best as the shell offers neither a native PostgreSQL database driver with parameterised query support nor any text escaping functions. You'd have to roll your own string escaping.
Even then, you need to know that the text encoding of the input file is valid for your client_encoding otherwise you'll insert garbage and/or get errors. It quickly lands up being easier to do it in a langage with proper integration with PostgreSQL like Python, Perl, Ruby or Java.
There is a way to do what you want in bash if you really must, though: use Pg's delimited dollar quoting with a randomized delimiter to help prevent SQL injection attacks. It's not perfect but it's pretty darn close. I'm writing an example now.
Given problematic file:
$ cat > difficult.txt <__END__
Shell metacharacters like: $!(){}*?"'
SQL-significant characters like "'()
__END__
and sample table:
psql -c 'CREATE TABLE testfile(filecontent text not null);'
You can:
#!/bin/bash
filetoread=$1
sep=$(printf '%04x%04x\n' $RANDOM $RANDOM)
psql <<__END__
INSERT INTO testfile(filecontent) VALUES (
\$x${sep}\$$(cat ${filetoread})\$x${sep}\$
);
__END__
This could be a little hard to read and the random string generation is bash specific, though I'm sure there are probably portable approaches.
A random tag string consisting of alphanumeric characters (I used hex for convenience) is generated and stored in seq.
psql is then invoked with a here-document tag that isn't quoted. The lack of quoting is important, as <<'__END__' would tell bash not to interpret shell metacharacters within the string, wheras plain <<__END__ allows the shell to interpret them. We need the shell to interpret metacharacters as we need to substitute sep into the here document and also need to use $(...) (equivalent to backticks) to insert the file text. The x before each substitution of seq is there because here-document tags must be valid PostgreSQL identifiers so they must start with a letter not a number. There's an escaped dollar sign at the start and end of each tag because PostgreSQL dollar quotes are of the form $taghere$quoted text$taghere$.
So when the script is invoked as bash testscript.sh difficult.txt the here document lands up expanding into something like:
INSERT INTO testfile(filecontent) VALUES (
$x0a305c82$Shell metacharacters like: $!(){}*?"'
SQL-significant characters like "'()$x0a305c82$
);
where the tags vary each time, making SQL injection exploits that rely on prematurely ending the quoting difficult.
I still advise you to use a real scripting language, but this shows that it is indeed possible.
The best thing to do is to create a temporary table, COPY those from the files in question, and then run your updates.
Your secondary option would be to create a function in a language like pl/perlu and do this in the stored procedure, but you will lose a lot of performance optimizations that you can do when you update from a temp table.

How to treat a comma as text in output?

There's 1 column that contains commas. When I output my query to csv, these commas break the csv format. What I've been doing to avoid this is a simple
replace(A."Sales Rep",',','')
Is there a better way of doing this so that I can actually get the commas in the final output without breaking the csv file?
Thanks!
You can use the COPY command to get PostgreSQL to build the CSV for you:
COPY -- copy data between a file and a table
Something like one of these:
copy your_table to 'filename' csv
copy your_table to 'filename' csv force quote *
copy your_table to stdout csv force quote *
copy your_table to stdout csv force quote * header
...
You have to be the super user to copy to a filename though. If you're inside psql, you can use the \copy command:
Performs a frontend (client) copy. This is an operation that runs an SQL COPY command, but instead of the server reading or writing the specified file, psql reads or writes the file and routes the data between the server and the local file system.
The syntax is pretty much the same:
\copy your_table to 'filename.csv' csv force quote * header
...
Quote the fields with "
a,this has a , in it,b
would become
a,"this has a, in it",b
and if the fields have BOTH a , and a ", double the quotes:
a,this has a " and , in it,b
becomes
a,"this has a "" and , in it",b