I have a delimited text file whose delimiter is ~|^.
I need to ingest this file into MarkLogic using MLCP. I tried the ingestion in two ways.
Using MLCP without an options file:
mlcp.sh import -username admin -password admin -input_file_type delimited_text -delimiter "~|^" -document_type json -host localhost -database test -port 8052 -output_uri_prefix /test/data/ -generate_uri -output_uri_suffix .json -output_collections "Test" -input_file_path inputfile1.csv
Using MLCP with an options file:
mlcp.sh import -username admin -password admin -options_file delim.opt -document_type json -host localhost -database test -port 8052 -output_uri_prefix /test/data/ -generate_uri -output_uri_suffix .json -output_collections "Test" -input_file_path inputfile1.csv
My options file looks like this (delim.opt):
-input_file_type
delimited_text
-delimiter
"~|^"
In both cases, MLCP failed with the following error:
java.lang.IllegalArgumentException: Invalid delimiter: ~|^
Can anyone help me with how to ingest this type of CSV file into MarkLogic through MLCP?
I believe MarkLogic Content Pump does not support multi-character delimiters. MLCP uses the Apache Commons CSV library to parse delimited text, and as of today there is an open issue in that library for parsing delimited text with multi-character delimiters; see issue CSV-206.
For now you could create new delimited text files with single-character delimiters, along the lines of the sketch below. I often use sed on the command line to replace strings in files. If you go this route, be aware that you'll need to escape any occurrences of the new delimiter in the record values.
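A minimal sketch of that workaround, assuming the replacement character | never occurs in the record values (the output file name inputfile1_pipe.csv is just an illustration):
# Rewrite the multi-character delimiter ~|^ to a single pipe (assumes | does not appear in the data)
sed 's/~|^/|/g' inputfile1.csv > inputfile1_pipe.csv
mlcp.sh import -username admin -password admin -input_file_type delimited_text -delimiter "|" -document_type json -host localhost -database test -port 8052 -output_uri_prefix /test/data/ -generate_uri -output_uri_suffix .json -output_collections "Test" -input_file_path inputfile1_pipe.csv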
Related
I want to upload data into a Google Cloud SQL instance from a CSV file stored in a GCS bucket. I am using a PostgreSQL database and import the CSV files with the gcloud sql import csv command in my shell script. The issue is that some CSV files contain " characters; to handle them I want to specify " as the escape character, but the gcloud sql import csv command doesn't have any field for setting an escape character. Does anybody have any idea about that?
As per the documentation, to import CSV files into Cloud SQL for PostgreSQL, the CSV file has to follow a specific format.
We can also see that the command you're using has no parameter that fits your requirements.
As an alternative, I'd preprocess the files with a text editor or a command-line tool and strip out the conflicting characters in bulk, if possible, for example with something like the sed sketch below.
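A rough sketch of that preprocessing with sed, assuming the embedded quotes are not needed as field enclosures and can simply be dropped (the file names are placeholders):
# Strip all double quotes before handing the file to gcloud sql import csv (assumes the quotes carry no meaning in the data)
sed 's/"//g' input.csv > cleaned.csv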
I have a client who requires a database export from a SQL Server 2016 database in UTF8 no BOM. I've used PowerShell to import the raw output from the database (which is in ANSI) and output the file in UTF-8.
Now I am hearing back asking whether I could remove some 'special characters', and I saw that PowerShell has changed one of them as shown in the picture.
Is there any way PowerShell could keep the character or remove it entirely?
This might also happen with other characters in the future, our sample dataset only contains this particular character.
EDIT: The customer has a batch script which exports a SELECT query from an MS SQL Server to a CSV file. The script is as follows:
sqlcmd -S [SERVER]\[INSTANCE] -U sa -P [PASSWORD] -d [DATABASE] -I -i "C:\Path\To\Query.sql" -o "C:\Path\For\Output\Ouput.csv" -W -s"|"
The CSV is separated by a pipe.
The request was then to add double quotes as text qualifiers as well as to change the encoding to UTF-8 without BOM. The database apparently exports the file in ANSI.
I've created a PowerShell script, since I know it will automatically add the double quotes for me and I should be able to change the encoding through it.
The script goes as follows:
$file = Import-Csv -Path "C:\Path\For\Output\Ouput.csv" -Encoding "UTF7" -Delimiter "|"
$file | Export-Csv -path "C:\Path\For\Output\Ouput.csv" -delimiter "|" -Encoding "UTF8noBOM" -NoTypeInformation
The reason for the -Encoding UTF7 flag in the input step was that without it, we had problems with special letters like ß and äöü (we're in Germany, those will be frequent).
After running the file through this script, it's mostly as it should be however the example in the screenshot is a problem for the people trying to import the file into their system afterwards.
Does this help? I'll gladly provide any further information. Thank you in advance!
EDIT: Found a solution. I've edited the customer's original script which creates the export from the database and added the -u flag, making the output Unicode. It's not UTF-8 yet, but the PowerShell script can now convert the file properly, and there is no longer any need to set the import encoding to UTF7. Thanks to JosefZ for questioning my use of forced UTF7 encoding; it made me realise I was looking in the wrong place to fix this.
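For reference, the amended export command would presumably look like this, i.e. the original sqlcmd call with only the -u flag added so the output file is written as Unicode:
sqlcmd -S [SERVER]\[INSTANCE] -U sa -P [PASSWORD] -d [DATABASE] -I -u -i "C:\Path\To\Query.sql" -o "C:\Path\For\Output\Ouput.csv" -W -s"|"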
I have a database set up with UTF-8 encoding. Copying a table to CSV, where the filename contains a special character, writes the filename out to disk incorrectly.
On a Windows 10 localhost PostgreSQL installation:
copy
(select 'tønder')
to 'C:\temp\Sønderborg.csv' (FORMAT CSV, HEADER TRUE, DELIMITER ';', ENCODING 'UTF8');
This names the csv file SÃ¸nderborg.csv and not Sønderborg.csv.
Both
SHOW CLIENT_ENCODING;
SHOW SERVER_ENCODING;
return UTF8.
How can one control the encoding of the CSV filename? The encoding inside the CSV is fine; tønder is written correctly!
UPDATE
I have run the copy command from pgAdmin, DataGrip and a psql console. DataGrip uses JDBC and will only handle UTF8. All three applications write the csv filename in the wrong encoding. The only difference is that the psql console says the client encoding is WIN1252.
I don't think it's possible to change this behaviour. It looks like Postgres assumes that the filename encoding matches the server_encoding (as suggested on the mailing lists here and here). The only workaround I could find was to run the command while connected to a WIN1252-encoded database, which is probably not very helpful.
If you're trying to run this on the same machine as the server itself, then instead of using the server-side COPY, you can run psql's client-side \copy, which will respect your client_encoding when interpreting the file path:
psql -c "\copy (select 'tønder') to 'C:\temp\Sønderborg.csv' (FORMAT CSV, HEADER TRUE, DELIMITER ';', ENCODING 'UTF8')"
Note that cmd.exe (and even powershell.exe) still uses legacy DOS encodings by default, so you might need to run chcp 1252 to set the console codepage before launching psql.
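Put together, the client-side sequence might look like this (the codepage value 1252 assumes a Western European Windows locale; connection options are omitted):
chcp 1252
psql -c "\copy (select 'tønder') to 'C:\temp\Sønderborg.csv' (FORMAT CSV, HEADER TRUE, DELIMITER ';', ENCODING 'UTF8')"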
I'm facing an issue while trying to import .json files that were exported using the mongoexport command.
The generated .json files contain the character $ in some fields, such as $oid and $numberLong.
{"_id":{"$oid":"55aff0e7b3bdf92b314b6fa6"},"activated":true,"authRole":"USER","authToken":"5bdad308-4a11-4890-8c3e-82c29530f1bc","birthDate":{"$date":"2015-08-06T03:00:00.000Z"},"comercialPhone":"99999994","email":"test#mail.com","mobilePhone":"99999999","name":"Test Test","password":"$2a$10$y","validationToken":"b2cd0d71-cb47-405d-bf7f-e46e1a8706e4","version":{"$numberLong":"35"}}
However, this format is not accepted when importing the files. It seems to be the strict mode of MongoDB Extended JSON, but I'd like to generate JSON files using the shell format, which shows $oid as ObjectId.
Is there any workaround for this?
Thanks!
Using the MySQL Administrator GUI tool, I have exported some data tables, retrieved from an SQL dump file, to CSV files.
I then tried to import these CSV files into a PostgreSQL database using the postgres COPY command. I've tried entering
COPY articles FROM '[insert .csv dir here]' DELIMITERS ',' CSV;
and also the same command without the delimiters part.
I get an error saying
ERROR: invalid input syntax for integer: "id"
CONTEXT: COPY articles, line 1, column id: "id"
In conclusion, my question is: what might be causing this and how can I solve it? Could it be something to do with the way I created the CSV files, or have I made a rookie mistake elsewhere?
If your CSV files have a header row, just add the HEADER option to the COPY statement to skip that line, as per the documentation:
http://www.postgresql.org/docs/8.4/static/sql-copy.html
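A sketch of the adjusted statement, reusing the path placeholder from the question:
COPY articles FROM '[insert .csv dir here]' DELIMITER ',' CSV HEADER;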