mongoimport: set type for all fields when importing CSV - mongodb

I have multiple problems with importing a CSV with mongoimport that has a headerline.
Following is the case:
I have a big CSV file with the names of the fields in the first line.
I know you can set this line to use as field names with: --headerline.
I want all field types to be strings, but mongoimport sets the types automatically based on what the values look like.
IDs such as 0001 will be turned into 1, which can have bad side effects.
Unfortunately, there is (as far as I know) no way of setting them all to string with a single option; you have to name each field and set its type with
--columnsHaveTypes --fields "name.string(), ... "
When I did that, the next problem appeared.
The headerline (with all field names) got imported as values in a separate document.
So basically, my questions are:
Is there a way of setting all field types to string while using the --headerline option?
Alternatively, is there a way to ignore the first line?

I had this problem when uploading a 41-million-record CSV file into MongoDB.
./mongoimport -d testdb -c testcollection --type csv --columnsHaveTypes -f "RECEIVEDDATE.date(2006-01-02 15:04:05)" --file location/test.csv
As above, we have the '-f' or '--fields' option to declare data types, but when we use it on a file that contains a header line, mongoimport uploads the first row (the header row) as well, which either leads to a 'cannot convert to datatype' error or imports the column names as a data row.
Unfortunately we cannot use the '--headerline' option instead of '--fields' here.
Here are the solutions I found for this problem.
1) Remove the header row and upload using the '--fields' option as in the command above. If you are on a Linux environment you can use the command below to remove the first row (the header line) of a huge file; it took 2-3 minutes for me (depending on machine performance).
sed -i -e "1d" location/test.csv
2) Upload the file using the '--headerline' option, in which case mongoimport imports the file with its automatically detected data types. Then open the mongo shell, switch to the database with 'use testdb', and run a JavaScript command that reads each record and converts it to the desired data type. If you have a huge file this will take time.
I found this solution on Stack Overflow:
db.testcollection.find().forEach(function (x) {
    x.RECEIVEDDATE = new Date(x.RECEIVEDDATE);
    db.testcollection.save(x);
});
If you want to skip the rows that do not fit the declared data types, use the option below (see the mongoimport documentation):
--parseGrace skipRow
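For illustration, a hedged version of the full command with the same hypothetical file and collection names as above:
./mongoimport -d testdb -c testcollection --type csv --columnsHaveTypes -f "RECEIVEDDATE.date(2006-01-02 15:04:05)" --parseGrace skipRow --file location/test.csv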

I found a solution that I am comfortable with.
Basically, I wanted to use mongoimport from my Clojure code to import a CSV file into the DB and do a lot of stuff with it automatically. Due to the problems mentioned above I had to find a workaround to delete the wrong document.
I did the following to "solve" this problem:
To set the types as I wanted, I wrote a function that reads the first line, puts it in a vector and then uses string concatenation to set these as fields.
Turning this: id,name,age,hometown,street
into this: id.string(),name.string(),age.string() etc
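A rough shell equivalent of that transformation (the file name data.csv is only an assumption) could be:
head -n 1 data.csv | sed 's/,/.string(),/g; s/$/.string()/'
which prints id.string(),name.string(),age.string(),hometown.string(),street.string() and can be passed straight to --fields.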
Then I used the values from the vector to identify the document with
{ name : "name",
  age : "age",
  etc : "etc" }
and then deleted it with a simple remove() call using those values as the query.
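As a sketch (the collection and field names here are assumptions), that clean-up can be run from the shell like this:
mongo testdb --eval 'db.testcollection.remove({ name : "name", age : "age" })'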
Hope this helps anyone dealing with the same kind of problem.

https://docs.mongodb.com/manual/reference/program/mongoimport/#example-csv-import-types reads:
MongoDB 3.4 added support for specifying field types. Specify field names and types in the form <colName>.<type>(<arg>) using --fields, --fieldFile, or --headerline.
so the first line of your csv file should contain the names with types, e.g.:
name.string(), ...
and the mongoimport parameters
--columnsHaveTypes --headerline --file <filename.csv>
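For illustration, a people.csv (hypothetical name) whose header row already carries the types:
id.string(),name.string(),age.string()
0001,Alice,34
0002,Bob,41
could then be imported, with every column kept as a string, using:
mongoimport --db testdb --collection people --type csv --columnsHaveTypes --headerline --file people.csv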
As to the question of how to remove the first line, you can employ pipes. mongoimport reads from STDIN if no --file option is passed. E.g.:
tail -n+2 <filename.csv> | mongoimport --columnsHaveTypes --fields "name.string(), ... "

Related

How to import a csv file into Postgres with empty values?

I am trying to import a csv file into Postgres which contains age values, but also some empty values, since not all ages are known.
I would like to import the columns as real, since they contain ages with decimals like 98.45. The empty values for people whose age is not known are apparently treated as strings, but I would still like to import the age values as numbers. So I was wondering how to import real values even when some cells in the csv are empty and are thus, according to Postgres, string values.
For creation I used the following code, since I am dealing with decimal values:
Create table psychosocial.age (
respnr integer Primary key,
fage real,
gage real,
hage real);
After importing the csv file, I get the following error:
ERROR: invalid input syntax for integer: "11455, , , "
CONTEXT: COPY age, line 2, column respnr: "11455, , , "
One problem is that you're trying to import white space into numeric fields. So first you have to pre-process your csv file before importing it.
Below is an example of how you can solve this using awk. From your console, execute the following command:
$ cat file.csv | awk '{sub(/^ +/,""); gsub(/, /,",")}1' | psql db -c "COPY psychosocial.age FROM STDIN WITH CSV HEADER"
In case you're wondering how to pipe commands, take a look at these answers. Here is a more detailed example of how to use COPY and STDIN.
You also have to take into account that having quotation marks on integer fields can be problematic, e.g:
"11455, , , "
This will result in an error, since Postgres will parse "11455 as a single value and try to store it in an integer field, which will obviously fail. Instead, format your csv file to look like this:
11455, , ,
or even
11455,,,
You can achieve this also using awk from your console:
$ awk '{gsub(/\"/,"")};1' file.csv
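Putting both clean-up steps together (file, database and table names taken from the question), a single pass before COPY might look like:
$ awk '{gsub(/"/,""); sub(/^ +/,""); gsub(/, /,",")}1' file.csv | psql db -c "COPY psychosocial.age FROM STDIN WITH CSV HEADER"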

Greenplum : Getting filenames processed via an external table

We are processing multiple files using an external table. Is there any way I can get the name of the file being processed by the external table and store it in a database table?
The only workaround I can find is appending the file name to every record in the flat file, which isn't ideal with a huge dataset and multiple files.
Can anyone help with this?
Thanks
No, the file name is simply never passed from the gpfdist daemon back to Greenplum, so you have to append the file name to each line. You can use a gpfdist transformation to do so.
I was struggling with this as well; here's my solution. Please note I'm not an expert in Linux, so there may be a one-liner solution.
So I wanted to add a filename column in front of my records.
That can be done in sed, I've created a transform.sh file, with the following content:
#!/bin/sh
filename=$1
#echo $filename >> transform.txt
sed -e "s|^|$filename\v|" $filename
Please note that I was using a vertical tab, \v, as the delimiter. Also, the filename could contain /, hence the use of | as the sed separator. In order to substitute the value of $filename we have to use double quotes around the sed expression.
Test it; it looks good. The output fields are separated by vertical tabs, so they show up on separate lines here:
./transform.sh countersamples-2016-03-02--11-51-10.csv
countersamples-2016-03-02--11-51-10.csv
timestamp
machine
category
instance
name
value
countersamples-2016-03-02--11-51-10.csv
2016-03-02 11:51:10.064
DESKTOP-4PLQKVL
Memory
% Committed Bytes In Use
74.8485488891602
This part is done; let's continue with gpfdist. We need a yaml file that can be passed to gpfdist; I named it transform.yaml.
Content:
---
VERSION: 1.0.0.1
TRANSFORMATIONS:
  add_filename:
    TYPE: input
    CONTENT: data
    COMMAND: /bin/bash transform.sh %filename%
Please note the %filename% value here. It seems that gpfdist pre-filters the files that need to be handled and passes them one by one to our transform.
Let's fire up gpfdist:
gpfdist -c transform.yaml -v
Now go into greenplum and create an external table such as:
CREATE READABLE EXTERNAL TABLE "ext_transform"
(
"filename" text,
"timestamp" timestamp without time zone ,
"machine" text ,
"category" text ,
"instance" text ,
"name" text ,
"value" double precision
)
LOCATION ('gpfdist://localhost:8080/*/countersamples*.csv#transform=add_filename')
FORMAT 'TEXT'
( HEADER DELIMITER '\013' NULL AS '\\N' ESCAPE AS '\\' )
And when we select data from it:
select * from "ext_transform";
We see the source filename in the first column of each row.
I've created 2 folders to see how it reacts when the files are not in the same folder as the transform. This way I can distinguish between the 2 files, even if their data is identical.

MongoDB - import csv

I am importing a csv from the command prompt using a mongoimport statement.
Some of the description fields in my csv contain blank lines in their values, which breaks the import into MongoDB. (Please note: when I view the same csv in Excel, it looks perfectly fine.)
There is a message on the command prompt when I run the mongoimport command, as follows:
"CSV file ends while inside quoted field".
It did not solve my problem either.
Here is an example that defines my CSV
Input CSV format
Column1,Column2,Column3,Column4
Values:
Val1,Val2,Val3
Val1,"abcdsc \n \n \n some text",Val3
Please advise how I can proceed further.
Try the --ignoreBlanks option from the mongoimport manual:
--ignoreBlanks
In csv and tsv exports, ignore empty fields. If not specified, mongoimport creates fields without values in imported documents.
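For example, with assumed database, collection and file names:
mongoimport --db testdb --collection items --type csv --headerline --ignoreBlanks --file data.csv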

Semi-Colon Delimiter in Mongoimport

I have been trying to import several csv files into MongoDB using the mongoimport tool. The thing is that, despite what the name says, in several countries csv files are saved with semi-colons instead of commas, which prevents me from using the mongoimport tool properly.
There are workarounds for this, such as changing the delimiter option in the region settings, but for several reasons I don't have access to the machine that generates these csv files, so I can't do that.
I was wondering: is there any way to import these csv files using the mongo tools, instead of having to write something to replace all the semi-colons in a file with commas? I find it pretty strange that mongo overlooks the fact that semi-colons are used in some countries.
mongoimport supports TSV, so we can replace ";" with "\t":
tr ";" "\t" < file.csv | mongoimport --type tsv ...
It looks like this is not supported; I can't find an option to specify a delimiter among the allowed arguments for 'mongoimport' on the documentation page http://docs.mongodb.org/manual/reference/program/mongoimport/#bin.mongoimport .
You can file a feature request on JIRA if it's something you'd like to see supported.

updating table rows based on txt file

I have been searching, but so far I have only found how to insert data into tables based on csv files.
I have the following scenario:
Directory name = ticketID
Inside this directory I have a couple of files, like:
Description.txt
Summary.txt - Contains the ticket header and has been imported successfully.
Progress_#.txt - Created every time a ticket gets updated; I get a new file each time.
Solution.txt
Importing the Issue.txt was easy since this was actually a CSV.
Now my problem is with the Description and Progress files.
I need to update the existing rows with the data from these files, something along the lines of:
update table_ticket set table_ticket.description = Description.txt where ticket_number = directoryname
I'm using PostgreSQL, and the COPY command only covers new data; it would also fail due to the ',;/ special characters.
I wanted to do this using a bash script, but it seems it won't be possible:
for i in `find . -type d`
do
update table_ticket
set table_ticket.description = $i/Description.txt
where ticket_number = $i
done
Of course the above code would take into consideration connection to the database.
Does anyone have an idea how I could achieve this using a shell script? Or would it be better to just write something in Java to read and update the records? I would like to avoid that approach, though.
Thanks
Alex
Thanks for the answer, but I came across this:
psql -U dbuser -h dbhost db
\set content = `cat PATH/Description.txt`
update table_ticket set description = :'content' where ticketnr = TICKETNR;
Putting this into a simple script, I created the following:
#!/bin/bash
for i in `find . -type d|grep ^./CS`
do
p=`echo $i|cut -b3-12 -`
echo $p
sed s/PATH/${p}/g cmd.sql > cmd.tmp.sql
ticketnr=`echo $p|cut -b5-10 -`
sed -i s/TICKETNR/${ticketnr}/g cmd.tmp.sql
cat cmd.tmp.sql
psql -U supportAdmin -h localhost supportdb -f cmd.tmp.sql
done
The downside is that it always creates a new connection; later I'll change it to create a single file.
But it does exactly what I was looking for: putting the contents inside a single column.
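For reference, a sketch of the same idea without the sed templating; the directory layout, column names and connection details are assumed to match the script above:
#!/bin/bash
for i in `find . -type d|grep ^./CS`
do
    p=`echo $i|cut -b3-12 -`            # directory name without the leading ./
    ticketnr=`echo $p|cut -b5-10 -`     # ticket number taken from the directory name
    # psql substitutes :'content' and :'ticketnr' as safely quoted literals
    psql -U supportAdmin -h localhost supportdb \
        -v content="$(cat $p/Description.txt)" \
        -v ticketnr="$ticketnr" <<'SQL'
update table_ticket set description = :'content' where ticketnr = :'ticketnr';
SQL
done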
psql can't read the file in for you directly unless you intend to store it as a large object in which case you can use lo_import. See the psql command \lo_import.
Update: #AlexandreAlves points out that you can actually slurp file content in using
\set myvar = `cat somefile`
then reference it as a psql variable with :'myvar'. Handy.
While it's possible to read the file in using the shell and feed it to psql, it's going to be awkward at best, as the shell offers neither a native PostgreSQL database driver with parameterised query support nor any text escaping functions. You'd have to roll your own string escaping.
Even then, you need to know that the text encoding of the input file is valid for your client_encoding, otherwise you'll insert garbage and/or get errors. It quickly ends up being easier to do it in a language with proper PostgreSQL integration like Python, Perl, Ruby or Java.
There is a way to do what you want in bash if you really must, though: use Pg's delimited dollar quoting with a randomized delimiter to help prevent SQL injection attacks. It's not perfect but it's pretty darn close. I'm writing an example now.
Given problematic file:
$ cat > difficult.txt <<'__END__'
Shell metacharacters like: $!(){}*?"'
SQL-significant characters like "'()
__END__
and sample table:
psql -c 'CREATE TABLE testfile(filecontent text not null);'
You can:
#!/bin/bash
filetoread=$1
sep=$(printf '%04x%04x\n' $RANDOM $RANDOM)
psql <<__END__
INSERT INTO testfile(filecontent) VALUES (
\$x${sep}\$$(cat ${filetoread})\$x${sep}\$
);
__END__
This could be a little hard to read and the random string generation is bash specific, though I'm sure there are probably portable approaches.
A random tag string consisting of alphanumeric characters (I used hex for convenience) is generated and stored in sep.
psql is then invoked with a here-document tag that isn't quoted. The lack of quoting is important, as <<'__END__' would tell bash not to interpret shell metacharacters within the string, whereas plain <<__END__ allows the shell to interpret them. We need the shell to interpret metacharacters, as we need to substitute sep into the here document and also need to use $(...) (equivalent to backticks) to insert the file text. The x before each substitution of sep is there because here-document tags must be valid PostgreSQL identifiers, so they must start with a letter, not a number. There's an escaped dollar sign at the start and end of each tag because PostgreSQL dollar quotes are of the form $taghere$quoted text$taghere$.
So when the script is invoked as bash testscript.sh difficult.txt, the here document ends up expanding into something like:
INSERT INTO testfile(filecontent) VALUES (
$x0a305c82$Shell metacharacters like: $!(){}*?"'
SQL-significant characters like "'()$x0a305c82$
);
where the tags vary each time, making SQL injection exploits that rely on prematurely ending the quoting difficult.
I still advise you to use a real scripting language, but this shows that it is indeed possible.
The best thing to do is to create a temporary table, COPY the data from the files in question into it, and then run your updates.
Your secondary option would be to create a function in a language like pl/perlu and do this in a stored procedure, but you will lose a lot of the performance optimizations you get when updating from a temp table.
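A rough sketch of that temp-table route, with all table, column and file names assumed:
psql -U supportAdmin -h localhost supportdb <<'SQL'
CREATE TEMP TABLE staging (ticket_number text, description text);
-- descriptions.csv: ticket_number,description pairs prepared beforehand
\copy staging FROM 'descriptions.csv' WITH (FORMAT csv)
UPDATE table_ticket t
   SET description = s.description
  FROM staging s
 WHERE t.ticket_number = s.ticket_number;
SQL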