escape quotes while creating a databricks table - pyspark

I have the following CSV file, delimited by commas:
col1,col2,col3
123,ABC,"DEF,EFG"
456,XYZ,"CFD,FGG"
I am creating an external table in Databricks using:
spark.sql(f'''
CREATE TABLE table_name USING CSV
OPTIONS (
  header="true",
  delimiter=",",
  inferSchema="true",
  path="/mnt/csvfile/abc.csv"
)
''')
Because of the comma inside the third column, the CSV wraps that value in quotes. How do I escape the quotes to get the output below in the table?
col1 col2 col3
123 ABC DEF,EFG
456 XYZ CFD,FGG
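A sketch of one fix, assuming the file follows standard CSV quoting: spell out the quote and escape options of Spark's CSV source so the reader strips the wrapping quotes (the table name, path, and other options are taken from the question; the quote/escape values are an assumption about the file's format):
spark.sql(f'''
CREATE TABLE table_name USING CSV
OPTIONS (
  header="true",
  delimiter=",",
  inferSchema="true",
  quote='"',
  escape='"',
  path="/mnt/csvfile/abc.csv"
)
''')
Setting escape to the quote character also makes a doubled quote ("") inside a quoted field parse as a literal quote, which is the RFC 4180 convention.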

Related

Extract strings with multiple words in Postgres 11.0

I have the following column in a Postgres table. I would like to get only the values where there are multiple words in the string.
col1
nilotinib hydrochloride
ergamisol
ergamisol
methdilazine hydrochloride
The desired output is
col1
nilotinib hydrochloride
methdilazine hydrochloride
I am using the following pattern to extract the strings, but it's not working:
SELECT regexp_match(col1, '^\w+\s+.*') from tb1;
To filter rows, use a WHERE clause in your statement:
SELECT col1
FROM tb1
WHERE col1 ~ '^\w+\s+.*';
See the string matching documentation for alternatives to your pattern. For your case, col1 ~ '\s' should be sufficient, or col1 SIMILAR TO '%[[:space:]]%'.
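For example, either of these (using the table and column names from the question) returns only the multi-word rows:
SELECT col1 FROM tb1 WHERE col1 ~ '\s';
SELECT col1 FROM tb1 WHERE col1 SIMILAR TO '%[[:space:]]%';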

Postgres CSV import - handle empty strings as integers

I have a ton of CSV files that I'm trying to import into Postgres. The CSV data is all quoted regardless of what the data type is. Here's an example:
"3971","14","34419","","","","","6/25/2010 9:07:02 PM","70.21.238.46 "
The first 4 columns are supposed to be integers. Postgres handles the cast from the string "3971" to the integer 3971 correctly, but it pukes at the empty string in the 4th column.
PG::InvalidTextRepresentation: ERROR: invalid input syntax for type integer: ""
This is the command I'm using:
copy "mytable" from '/path/to/file.csv' with delimiter ',' NULL as '' csv header
Is there a proper way to tell Postgres to treat empty strings as null?
The FORCE_NULL option to COPY does this. Since I'm working in psql and using a file that the server user can't reach, I use \copy, but the principle is the same:
create table csv_test(col1 integer, col2 integer);
cat csv_test.csv
"1",""
"","2"
\copy csv_test from '/home/aklaver/csv_test.csv' with (format 'csv', force_null (col1, col2));
COPY 2
select * from csv_test ;
 col1 | col2
------+------
    1 | NULL
 NULL |    2
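Applied to the server-side COPY from the question, the same option looks like this; the column names are hypothetical, since the question only says the first four columns are integers:
COPY "mytable" FROM '/path/to/file.csv'
WITH (FORMAT csv, HEADER, FORCE_NULL (col1, col2, col3, col4));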

How to import a CSV containing a jsonb column type

I'm trying to import data into a table with a jsonb column type, using a CSV. I've read the CSV specs, which say any column value containing double quotes needs to:
be wrapped in quotes (double quotes at beginning and end)
have its double quotes escaped with a double quote (so if you want a literal double quote, you must use 2 double quotes instead of just 1)
My csv column value for the jsonb type looks like this (shortened for brevity):
"[
{
""day"": 0,
""schedule"": [
{
""open"": ""07:00"",
""close"": ""12:00""
}
]
}
]"
Note: I opened this CSV in Notepad++ in case the editor was doing any special escaping, and all quotes are as they appear in the editor.
Now I was curious about the QUOTE and ESCAPE values in that pgAdmin error message, so here they are, copied and pasted:
QUOTE '\"'
ESCAPE '''';""
To upload via pgAdmin, do I need to use \" around each JSON token, as (possibly?) suggested by that QUOTE value in the error message?
I'm using Go's encoding/csv package to write the csv.
I can load your file into a json or jsonb typed column using:
copy j from '/tmp/foo.csv' csv;
or
copy j from '/tmp/foo.csv' with (format csv);
or their \copy equivalents.
Based on your truncated (incomplete) text-posted-as-image, it is hard to tell what you are actually doing. But if you do it right, it will work.
The easiest workaround I've found is to copy the json data into a text column in a temporary staging table.
Then issue a query that follows the pattern:
insert into mytable (...) select ..., json_txtcol::json from staging_table
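A sketch of that pattern, with hypothetical names (mytable, staging_table, and json_txtcol echo the pattern above; the other columns are stand-ins):
-- stage the raw JSON as plain text
CREATE TEMP TABLE staging_table (id int, json_txtcol text);
\copy staging_table FROM '/tmp/foo.csv' WITH (FORMAT csv)
-- cast on the way into the real table
INSERT INTO mytable (id, jsoncol)
SELECT id, json_txtcol::jsonb FROM staging_table;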
You can process it through another command before PostgreSQL receives the data, replacing the double double-quotes with an escaped double-quote.
For example:
COPY tablename(col1, col2, col3)
FROM PROGRAM $$sed 's/""/\\"/g' myfile.csv$$
DELIMITER ',' ESCAPE '\' CSV HEADER;
Here's a working example:
/tmp/input.csv contains:
Clive Dunn, "[ { ""day"": 0, ""schedule"": [{""open"": ""07:00"", ""close"": ""12:00""}]}]", 3
In psql (but should work in PgAdmin):
postgres=# CREATE TABLE test (person text, examplejson jsonb, num int);
CREATE TABLE
postgres=# COPY test (person, examplejson, num) FROM PROGRAM $$sed 's/""/\\"/g' /tmp/input.csv$$ CSV DELIMITER ',' ESCAPE '\';
COPY 1
postgres=# SELECT * FROM test;
person | examplejson | num
------------+-----------------------------------------------------------------+-----
Clive Dunn | [{"day": 0, "schedule": [{"open": "07:00", "close": "12:00"}]}] | 3
(1 row)
Disclosure: I am an EnterpriseDB (EDB) employee.

Converting DATE columns in Postgres

I have the following text file aatest.txt:
09/25/2019 | 1234.5
10/01/2018 | 6789.0
that I would like to convert into zztest.txt:
2019-09-25 | 1234.5
2018-10-01 | 6789.0
My Postgres script is:
CREATE TABLE documents (tdate TEXT, val NUMERIC);
COPY documents FROM 'aatest.txt' WITH CSV DELIMITER '|';
SELECT TO_DATE(tdate, 'mm/dd/yyyy');
COPY documents TO 'zztest.txt' WITH CSV DELIMITER '|';
However I am getting the following error message:
ERROR: column "tdate" does not exist
What am I doing wrong? Thank you!
Your SELECT has no FROM clause, so you can't reference any columns. But you need to put that SELECT into the COPY statement anyway:
CREATE TABLE documents (tdate TEXT, val NUMERIC);
COPY documents FROM 'aatest.txt' WITH CSV DELIMITER '|';
COPY (select to_char(TO_DATE(tdate, 'mm/dd/yyyy'), 'yyyy-mm-dd'), val FROM documents)
TO 'zztest.txt' WITH CSV DELIMITER '|';
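Under PostgreSQL's default ISO DateStyle, a date value already prints as yyyy-mm-dd, so the to_char wrapper can be dropped (assuming the default DateStyle):
COPY (SELECT to_date(tdate, 'mm/dd/yyyy'), val FROM documents)
TO 'zztest.txt' WITH CSV DELIMITER '|';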

Concatenate a column from multiple rows into a single formatted string

I have rows like so:
roll_no
---------
0690543
0005331
0760745
0005271
And I want a string like this:
"0690543.pdf" "0005331.pdf" "0760745.pdf" "0005271.pdf"
I have tried concat but was unable to produce this.
You can use an aggregate function like string_agg, after first adding the quotes and the .pdf extension to your column data. Use a space as the delimiter:
SELECT string_agg('"' || roll_no || '.pdf"', ' ') FROM myTable;
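As a self-contained demo (myTable is the answer's table name; the rows are the question's sample data, stored as text to preserve the leading zeros):
CREATE TABLE myTable (roll_no text);
INSERT INTO myTable VALUES ('0690543'), ('0005331'), ('0760745'), ('0005271');
SELECT string_agg('"' || roll_no || '.pdf"', ' ') FROM myTable;
-- returns: "0690543.pdf" "0005331.pdf" "0760745.pdf" "0005271.pdf"
-- (row order is not guaranteed without an ORDER BY inside the aggregate)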