Remove quotes for String in Clickhouse while exporting - export-to-csv

I'm trying to export data to csv from clickhouse cli.
I have a field which is string and when exported to CSV this field has quotes around it.
I want to export without the quotes but couldn't find any setting that can be set.
I went through https://clickhouse.yandex/docs/en/interfaces/formats but the Values section mentions
Strings, dates, and dates with times are output in quotes
While for JSON they have a flag that is to be set for removing quotes around Int64 and UInt64
For compatibility with JavaScript, Int64 and UInt64 integers are enclosed in double quotes by default. To remove the quotes, you can set the configuration parameter output_format_json_quote_64bit_integers to 0.
I was wondering if there is such kind of flag for strings in CSV as well.
I'm exporting using the below command
clickhouse client --multiquery --host="localhost" --port="9000" --query="SELECT field1, field2 from tableName format CSV" > /data/content.csv
I want to try removing the quotes from the shell as the last thing if nothing works.
Any help on the way I can remove the quotes while the CSV is generated would be appreciated.

Nope, there isn't. However, you can easily achieve this with arrayStringConcat.
SELECT arrayStringConcat([toString(field1), toString(field2)], ',') from tableName format TSV;
Edit
To output Nullable values as empty strings, you might need the if function.
if(isNull(field1), '', assumeNotNull(field1))
This works for any type, while assumeNotNull alone only works for String.
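If doing the concatenation in SQL is not an option, the post-processing fallback mentioned in the question could look roughly like the sketch below. It is only an illustration: the helper name strip_csv_quotes is made up here, and it assumes that no field contains the delimiter or a line break, since the unquoted output would be ambiguous otherwise.

```python
import csv
import io


def strip_csv_quotes(src, dst, delimiter=","):
    """Copy CSV rows from src to dst, writing every field without quotes.

    Hypothetical helper, not part of ClickHouse. It refuses fields that
    contain the delimiter or a line break, because dropping the quotes
    would make such rows impossible to parse back.
    """
    for row in csv.reader(src, delimiter=delimiter):
        for field in row:
            if delimiter in field or "\n" in field or "\r" in field:
                raise ValueError(f"field {field!r} needs quoting")
        dst.write(delimiter.join(row) + "\n")


if __name__ == "__main__":
    sample = io.StringIO('"abc",1\n"say ""hi""",2\n')
    out = io.StringIO()
    strip_csv_quotes(sample, out)
    print(out.getvalue(), end="")
```

In practice src would be the exported file (or sys.stdin) and dst the cleaned file; the in-memory sample is only there to keep the sketch self-contained.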

SalesForce Spark Delimiter issue

I have a Glue job in which I am reading a table from SF using SOQL:
df = (
spark.read.format("com.springml.spark.salesforce")
.option("soql", sql)
.option("queryAll", "true")
.option("sfObject", sf_table)
.option("bulk", bulk)
.option("pkChunking", pkChunking)
.option("version", "51.0")
.option("timeout", "99999999")
.option("username", login)
.option("password", password)
.load()
)
and whenever there is a combination of double-quotes and commas in the string it messes up my table schema, like so:
in source:
Column A | Column B           | Column C
000AB    | "text with, comma" | 123XX
read from SF in df:
Column A | Column B   | Column C
000AB    | "text with | comma"
Is there any option to avoid such cases where the comma is treated as a delimiter? I tried various options but nothing worked. And SOQL doesn't accept REPLACE or SUBSTRING functions; as for text manipulation functions, well, basically there aren't any.

Everything I'm suggesting here needs to be tested. I do not have the same environment, so it is difficult for me to try anything, but here is what I found.
When you check the official doc, you find that there is a field metadataConfig. The documentation of this field can be found here : https://resources.docs.salesforce.com/sfdc/pdf/bi_dev_guide_ext_data_format.pdf
On page 2, csv format, it says :
If a field value contains a control character or a new line the field value must be contained within double quotes (or your
fieldsEscapedBy value). The default control characters (fieldsDelimitedBy, fieldsEnclosedBy,
fieldsEscapedBy, or linesTerminatedBy) are comma and double quote. For example, "Director of
Operations, Western Region".
which sounds a lot like your current problem.
By default, the values are comma and double quote, so I do not understand why it is failing. But apparently your output keeps the double quotes, so maybe it only considers single quotes.
You should try to enforce the format by adding this to your code:
.option("metadataConfig", r'{"fieldsEnclosedBy": "\"", "fieldsDelimitedBy": ","}')
# Raw string, so the backslash survives into the JSON. Or something similar - I couldn't test, so you need to try it yourself
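Hand-writing JSON inside a Python string literal makes it easy to lose a backslash, because "\"" in a normal (non-raw) literal collapses to a bare double quote. A minimal sketch that builds the value with json.dumps instead is shown here. The key names come from the Salesforce document quoted above; whether the connector applies them when reading is an untested assumption.

```python
import json

# Key names come from the Salesforce external data format document quoted
# above; whether the Spark connector applies them on read is an assumption.
metadata_config = json.dumps(
    {"fieldsEnclosedBy": '"', "fieldsDelimitedBy": ","}
)

# Round-trip to confirm the string is valid JSON with the intended values.
parsed = json.loads(metadata_config)
print(metadata_config)
```

The resulting string can then be passed as .option("metadataConfig", metadata_config).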

Postgresql - import from CSV null values wrapped in double quotes

So I am trying to import some data into postgresql using the COPY command.
Here is a sample of what the data looks like:
"UNIQ_ID","SP_grd1","SACN_grd1","BIOME_grd1","Meso_grd1","DM_grd1","VEG_grd1","lcov90_alb","WMA_grd1"
"G01_00000002","199058001.00000","1.00000","6.00000","24889.00000","2.00000","381.00000","33.00000","9.00000"
"G01_00000008","*********************","1.00000","*********************","24889.00000","2.00000","*********************","34.00000","*********************"
The issue I am having is the double quotes wrapping the ********************* values, which are the nulls.
I am using the following in order to create the data table and copy the data:
CREATE TABLE bravo.G01(UNIQ_ID character varying(18), SP_grd1 double precision ,SACN_grd1 numeric,BIOME_grd1 numeric,Meso_grd1 double precision,DM_grd1 numeric,VEG_grd1 numeric,lcov90_alb numeric,WMA_grd1 numeric);
COPY bravo.g01(UNIQ_ID,SP_grd1,SACN_grd1,BIOME_grd1,Meso_grd1,DM_grd1,VEG_grd1,lcov90_alb,WMA_grd1) FROM 'F:\GreenBook-Backup\LUdatacube_20171206\CSV_Data_bravo\G01.csv' DELIMITER ',' NULL AS '*********************' CSV HEADER ;
The CREATE TABLE command works fine, but I encounter an error with the NULL AS clause. If I edit the text file and remove the double quotes, the import works fine.
I assume that, as CSVs with double quotes and null values are very common, there must be a workaround here that I am missing. I certainly don't want to go and edit each of my CSVs so that it doesn't have double quotes!

You might want to try adding the FORCE_NULL ( column_name [, ...] ) option.
As the documentation stated for FORCE_NULL:
Match the specified columns' values against the null string, even if it has been quoted, and if a match is found set the value to NULL. In the default case where the null string is empty, this converts a quoted empty string into NULL. This option is allowed only in COPY FROM, and only when using CSV format.
This option is available from Postgres 9.4 onward: https://www.postgresql.org/docs/10/static/sql-copy.html

If you're on a Unix-like platform, you could use sed to replace the null strings with something PostgreSQL will recognize automatically as null. On Windows, PowerShell exposes similar functionality.
This approach is more general if you need to perform other types of clean up on the data before loading.
The regex pattern to match your null-string is "[\*]*"
cleaning the file with sed:
[unix]>sed 's/"[\*]*"//g' test.csv > test2.csv
cleaning the file with windows powershell:
[windows-powershell]>cat test.csv | %{$_ -replace '"[\*]*"', ""} > test2.csv
Loading into PostgreSQL can then be shorter:
psql>\copy bravo.g01 FROM 'test2.csv' WITH CSV HEADER;
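The same clean-up can be sketched in Python for platforms where neither sed nor PowerShell is convenient. This is an illustration under assumptions: the function names are made up, and unlike the pattern above it requires at least one asterisk, so quoted empty strings are left alone.

```python
import re

# A quoted run of one or more asterisks, e.g. "*********************".
NULL_MARKER = re.compile(r'"\*+"')


def blank_null_markers(line):
    """Replace every quoted asterisk run with an empty, unquoted field."""
    return NULL_MARKER.sub("", line)


def clean_file(src_path, dst_path):
    """Write a copy of src_path with the null markers blanked out."""
    with open(src_path, newline="") as src, open(dst_path, "w", newline="") as dst:
        for line in src:
            dst.write(blank_null_markers(line))
```

Calling clean_file("test.csv", "test2.csv") would produce the file used by the \copy command above.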

how to select character varying data properly in postgresql

I tried to select the row whose "fileName" column is '2016-11-22-12-55-09_hyun.png'.
I tried:
select * from images where 'fileName' like '2016-11-22-12-55-09_hyun.png'
However, it returns no rows and no error message either.
How can I select this file by its filename? Thank you so much.

Single quotes denote a string literal. So in this query you aren't evaluating the column fileName, but checking whether the string 'fileName' is like the string '2016-11-22-12-55-09_hyun.png', which it of course is not. Just drop the single quotes from fileName and you should be OK. (If the column was created with a double-quoted, mixed-case name, refer to it as "fileName" in double quotes, since PostgreSQL folds unquoted identifiers to lower case.) Also note that since you aren't relying on any wildcards (although _ is itself a LIKE wildcard that matches any single character), using the like operator is pretty pointless, and you could (should) just use a plain old equality check:
select * from images where fileName = '2016-11-22-12-55-09_hyun.png'
-- No quotes -------------^--------^
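The difference between a string literal and a column reference can be reproduced in a few lines. The sketch below uses Python's built-in sqlite3 module purely as a stand-in, since it needs no server; it is not PostgreSQL, and identifier case folding in particular behaves differently there.

```python
import sqlite3

# SQLite is used here only as a convenient stand-in for PostgreSQL.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE images (fileName TEXT)")
conn.execute(
    "INSERT INTO images VALUES (?)", ("2016-11-22-12-55-09_hyun.png",)
)

# 'fileName' in single quotes is a string literal, so no row can match.
literal_rows = conn.execute(
    "SELECT * FROM images WHERE 'fileName' LIKE '2016-11-22-12-55-09_hyun.png'"
).fetchall()

# Without quotes, fileName refers to the column.
column_rows = conn.execute(
    "SELECT * FROM images WHERE fileName = ?", ("2016-11-22-12-55-09_hyun.png",)
).fetchall()

print(len(literal_rows), len(column_rows))
```

The first query compares two constants and returns nothing; the second compares the column and returns the row.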

hstore value with single quote

I asked a similar question here about hstore values with spaces, and it was solved by user Clodoaldo Neto. Now I have come across the next case: a string containing a single quote.
SELECT 'k=>"name", v=>"St. Xavier's Academy"'::hstore;
I tried it by using dollar-quoted string constant by reading http://www.postgresql.org/docs/current/static/sql-syntax-lexical.html#SQL-SYNTAX-CONSTANTS
SELECT 'k=>"name", v=>$$St. Xavier's Academy$$'::hstore;
But I couldn't get it right.
How to make postgresql hstore using strings containing single quote?
It seems like there are more such exceptions possible for this query. How to address them all at once?

You can escape the embedded single quote the same way you'd escape any other single quote inside a string literal: double it.
SELECT 'k=>"name", v=>"St. Xavier''s Academy"'::hstore;
-- ------------------------------^^
Alternatively, you could dollar quote the whole string:
SELECT $$k=>"name", v=>"St. Xavier's Academy"$$::hstore;
Whatever interface you're using to talk to PostgreSQL should be taking care of these quoting and escaping issues. If you're using manual string wrangling to build your SQL then you should be using your driver's quoting and placeholder methods.
hstore's internal parser understands double quotes around keys and values:
Double-quote keys and values that include whitespace, commas, =s or >s.
Dollar quoting is, as you noted, for SQL string literals; hstore's parser doesn't know what it means.
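To address all such cases at once, the two quoting layers can be kept apart: build the hstore text with hstore's own double-quote rules, then hand that string to the driver as a bound parameter so the SQL-level quoting is done for you. A minimal sketch of the first step follows; hstore_text is a hypothetical helper, not a library function.

```python
def hstore_text(pairs):
    """Render a dict as hstore's text form, double-quoting keys and values.

    Hypothetical helper. Backslashes and double quotes inside a key or
    value are escaped with a backslash, as hstore's text format expects.
    """
    def quote(text):
        return '"' + text.replace("\\", "\\\\").replace('"', '\\"') + '"'

    return ", ".join(f"{quote(k)}=>{quote(v)}" for k, v in pairs.items())


value = hstore_text({"k": "name", "v": "St. Xavier's Academy"})
print(value)
```

The string contains no SQL quoting at all. It is meant to be passed as a parameter, for example with a %s::hstore placeholder in psycopg2 (which driver you use is an assumption here), so the single quote never has to be escaped by hand.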

postgresql how to have COPY interpret formatted numeric fields automatically?

I have an input CSV file containing something like:
SD-32MM-1001,"100.00",4/11/2012
SD-32MM-1001,"1,000.00",4/12/2012
I was trying to COPY import that into a postgresql table(varchar,float8,date) and ran into an error:
# copy foo from '/tmp/foo.csv' with header csv;
ERROR: invalid input syntax for type double precision: "1,000.00"
Time: 1.251 ms
Aside from preprocessing the input file, is there some setting in PG that will have it read a file like the one above and convert to numeric form in COPY? Something other than COPY?
If preprocessing is required, can it be set as part of the COPY command? (Not the psql \copy)?
Thanks a lot.

An alternative to preprocessing the file is to first copy it into a temporary table as text. From there, insert into the final table using the to_number function:
select to_number('1,000.00', 'FM000,009.99')::double precision;
It's an odd CSV file that surrounds numeric values with double quotes, but leaves values like SD-32MM-1001 unquoted. In fact, I'm not sure I've ever seen a CSV file like that.
If I were in your shoes, I'd try copy against a file formatted like this.
"SD-32MM-1001",100.00,4/11/2012
"SD-32MM-1001",1000.00,4/12/2012
Note that numbers have no commas. I was able to import that file successfully with
copy test from '/fullpath/test.dat' with csv
I think your best bet is to get better formatted output from your source.
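If the source cannot be changed, the preprocessing step can be sketched with Python's csv module, which already understands the quoted "1,000.00" field. The function name and the column index are assumptions based on the sample rows in the question.

```python
import csv
import io


def strip_thousands(src, dst, numeric_columns=(1,)):
    """Copy CSV rows, removing ',' group separators from numeric columns.

    numeric_columns holds zero-based column indexes; the default matches
    the sample file, where the amount is the second field.
    """
    writer = csv.writer(dst, lineterminator="\n")
    for row in csv.reader(src):
        if not row:
            continue  # skip blank lines
        for index in numeric_columns:
            row[index] = row[index].replace(",", "")
        writer.writerow(row)


if __name__ == "__main__":
    sample = io.StringIO(
        'SD-32MM-1001,"100.00",4/11/2012\n'
        'SD-32MM-1001,"1,000.00",4/12/2012\n'
    )
    out = io.StringIO()
    strip_thousands(sample, out)
    print(out.getvalue(), end="")
```

The output no longer contains group separators, so COPY can parse the second column as float8 directly.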