I'm looking to load gzipped files from S3 that look like:
{"a": "a", "b": "a", "time": "20210210T10:10:00"}
{"a": "b", "b": "b", "time": "20210210T11:10:00"}
I created the table in Redshift beforehand:
create table stTest(
a varchar(50),
b varchar(50),
time varchar(50));
This is what I run and get:
db=# COPY stTest FROM 's3://bucket/file.gz' credentials 'aws_access_key_id=x;aws_secret_access_key=y' json 'AUTO' gzip ACCEPTINVCHARS ' ' TRUNCATECOLUMNS TRIMBLANKS;
ERROR: S3 path "AUTO" has invalid format.
DETAIL:
-----------------------------------------------
error: S3 path "AUTO" has invalid format.
code: 8001
context: Parsing S3 Bucket
query: 72165606
location: s3_utility.cpp:132
process: padbmaster [pid=4690]
-----------------------------------------------
Would love some help.
You need to specify the JSON field to Redshift column mapping. This is done with the FORMAT option and a jsonpaths file. See https://docs.aws.amazon.com/redshift/latest/dg/copy-parameters-data-format.html#copy-format for the format of the jsonpaths file.
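For illustration, a minimal sketch of what that could look like, assuming a jsonpaths file uploaded alongside the data (the s3://bucket/jsonpaths.json path is just a placeholder). The jsonpaths file maps each JSON field to a table column, in column order:
{
  "jsonpaths": [
    "$['a']",
    "$['b']",
    "$['time']"
  ]
}
The COPY then points at that file instead of 'AUTO':
COPY stTest
FROM 's3://bucket/file.gz'
credentials 'aws_access_key_id=x;aws_secret_access_key=y'
json 's3://bucket/jsonpaths.json'
gzip ACCEPTINVCHARS ' ' TRUNCATECOLUMNS TRIMBLANKS;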
I have a file test.csv with two lines:
"smth", "1", "2", "3"
"smth more", "4", "5", "6"
I'm trying to copy it into a table via the SQL shell:
CREATE TABLE test_table(col1 TEXT, col2 TEXT, col3 TEXT, col4 TEXT);
COPY test_table FROM 'E:\\PostgreSQL13\\scripts\\test.csv';
And get the following error:
invalid input syntax for type integer: """"smth"", ""1"", ""2"", ""3""""
What could be the problem here?
Thank you!
Tell COPY you are inputting a CSV, so it will use the default comma separator and recognize the double quotes (text format is the default):
COPY test_table FROM 'E:\\PostgreSQL13\\scripts\\test.csv' (FORMAT 'csv')
I am trying to move data from Snowflake to PostgreSQL, and to do so I first unload it to S3 in CSV format. Commas can appear in the text fields, so I use Snowflake's FIELD_OPTIONALLY_ENCLOSED_BY unloading option to quote the content of the problematic cells. However, when that happens together with NULL values, I can't manage to produce a valid CSV for PostgreSQL.
I created a simple table for you to understand the issue. Here it is :
CREATE OR REPLACE TABLE PUBLIC.TEST(
TEXT_FIELD VARCHAR,
NUMERIC_FIELD INT
);
INSERT INTO PUBLIC.TEST VALUES
('A', 1),
(NULL, 2),
('B', NULL),
(NULL, NULL),
('Hello, world', NULL)
;
COPY INTO @STAGE/test
FROM PUBLIC.TEST
FILE_FORMAT = (
COMPRESSION = NONE,
TYPE = CSV,
FIELD_OPTIONALLY_ENCLOSED_BY = '"'
NULL_IF = ''
)
OVERWRITE = TRUE;
From that, Snowflake will create the following CSV:
"A",1
"",2
"B",""
"",""
"Hello, world",""
But after that, it is impossible for me to copy this CSV into a PostgreSQL table as-is.
Even though the PostgreSQL documentation says, next to the NULL option:
Specifies the string that represents a null value. The default is \N (backslash-N) in text format, and an unquoted empty string in CSV format.
Not setting any COPY option on the PostgreSQL side results in a failed load. It also won't work unless we specify the quote character used, with QUOTE; here it'll be QUOTE '"'.
Therefore, during the PostgreSQL load, using:
FORMAT csv, HEADER false, QUOTE '"' will give:
DataError: invalid input syntax for integer: "" CONTEXT: COPY test, line 3, column numeric_field: ""
FORMAT csv, HEADER false, NULL '""', QUOTE '"' will give:
NotSupportedError: CSV quote character must not appear in the NULL specification
FYI, to test loading from S3 I use this command in PostgreSQL:
CREATE TABLE IF NOT EXISTS PUBLIC.TEST(
TEXT_FIELD VARCHAR,
NUMERIC_FIELD INT
);
CREATE EXTENSION IF NOT EXISTS aws_s3 CASCADE;
SELECT aws_s3.table_import_from_s3(
'PUBLIC.TEST',
'',
'(FORMAT csv, HEADER false, NULL ''""'', QUOTE ''"'')',
'bucket',
'test_0_0_0.csv',
'aws_region'
)
Thanks a lot for any ideas on what I could do to make it happen. I would love to find a solution that doesn't require modifying the CSV between Snowflake and Postgres. I think the issue is more on the Snowflake side, as it doesn't really make sense to quote null values, but PostgreSQL is not helping either.
When you set the NULL_IF value to '', you are actually telling Snowflake to convert NULLs to a blank, which then gets quoted. When you are copying out of Snowflake, the copy options are "backwards" in a sense and NULL_IF acts more like an IFNULL.
This is the code that I'd use on the Snowflake side, which will result in an unquoted empty string in your CSV file:
FILE_FORMAT = (
COMPRESSION = NONE,
TYPE = CSV,
FIELD_OPTIONALLY_ENCLOSED_BY = '"'
NULL_IF = ()
)
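If that behaves as described (I haven't run this exact unload), the file should come out roughly as:
"A",1
,2
"B",
,
"Hello, world",
which the PostgreSQL side should then accept with just FORMAT csv, HEADER false, QUOTE '"', since an unquoted empty string is already the default NULL marker in CSV mode. A sketch of the import call, with the bucket, file name and region as placeholders:
SELECT aws_s3.table_import_from_s3(
  'PUBLIC.TEST',
  '',
  '(FORMAT csv, HEADER false, QUOTE ''"'')',
  'bucket',
  'test_0_0_0.csv',
  'aws_region'
);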
I am trying to migrate data from DB2 to AWS S3 using an Amazon DMS CDC task.
I want to capture the current LSN (Log Sequence Number) and add it as a new column in all tables. I have tried the approach below and it works fine for DB2 v9.7.0.2 with Fix Pack 2, but when I try the same approach with DB2 v10.5.0.7 with Fix Pack 7 it gives the value in a different format that I don't understand.
Approach
I have added the transformation below for all tables, which adds a new column "LSN_Number":
{
  "rule-type": "transformation",
  "rule-id": "2",
  "rule-name": "2",
  "rule-target": "column",
  "object-locator": {
    "schema-name": "%",
    "table-name": "%"
  },
  "rule-action": "add-column",
  "value": "LSN_Number",
  "expression": "$AR_H_STREAM_POSITION",
  "data-type": {
    "type": "string",
    "length": 50
  }
}
The value of the LSN_Number column with DB2 v9.7.0.2 with Fix Pack 2 is as below (expected):
0000000000000000006BD1E6BD1ED46BC2
000000000000000000000006BD1ED4AAFF
000000000000000000000006BD1ED54B82
000000000000000000000006BD1ED585CD
000000000000000000000006BD1ED61E74
But the value of the LSN_Number column with DB2 v10.5.0.7 with Fix Pack 7 is as below:
010000000003AD0AD500000000360A8D38|0200000000000000000000002E3808795F
010000000003AD0AFF00000000360A8E38|0200000000000000000000002E3808F81F
010000000003AD133300000000360AF312|0200000000000000000000002E38405DB5
What is the difference between the two formats, and how can I parse a value like 010000000003AD0AD500000000360A8D38|0200000000000000000000002E3808795F with PySpark?
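As a starting point for the PySpark side, here is a minimal sketch that simply splits the pipe-separated v10.5 value into its two hex components (the DataFrame df and the column name lsn_number are assumptions, and which half corresponds to the v9.7-style LSN would still need to be confirmed against the DB2/DMS documentation):
from pyspark.sql import functions as F

# Split the v10.5-style value on the pipe into its two components.
parts = F.split(F.col("lsn_number"), r"\|")

df_parsed = (
    df
    .withColumn("lsn_part_1", parts.getItem(0))  # e.g. 010000000003AD0AD500000000360A8D38
    .withColumn("lsn_part_2", parts.getItem(1))  # e.g. 0200000000000000000000002E3808795F
)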
Currently, the data in my table looks like this:
Field name: author
Field data: in JSON form
When we run the select query
SELECT bs.author FROM books bs; it returns data like this:
"[{\"author_id\": 1, \"author_name\": \"It happend once again\", \"author_slug\": \"fiction-books\"}]"
But I need the selected data to look like this:
[
{
"author_id": 1,
"author_name": "It happend once again",
"author_slug": "fiction-books"
}
]
Database: PostgreSQL
Note: please avoid PHP code or iteration in PHP.
The answer depends on the version of PostgreSQL you are using, and also on what client you are using, but PostgreSQL has lots of built-in JSON processing functions.
https://www.postgresql.org/docs/10/functions-json.html
Your goal is also not clearly defined... If all you want to do is pretty-print the JSON, this is included:
# select jsonb_pretty('[{"author_id": 1,"author_name":"It happend once again","author_slug":"fiction-books"}]') as json;
json
-------------------------------------------------
[ +
{ +
"author_id": 1, +
"author_name": "It happend once again",+
"author_slug": "fiction-books" +
} +
]
If instead you're looking for how to populate a Postgres record set from JSON, this is also included:
# select * from json_to_recordset('[{"author_id": 1,"author_name":"It happend once again","author_slug":"fiction-books"}]')
as x(author_id text, author_name text, author_slug text);
author_id | author_name | author_slug
-----------+-----------------------+---------------
1 | It happend once again | fiction-books
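Since your author column apparently stores that JSON as text (hence the escaped quotes in your output), you would cast it before handing it to these functions. A minimal sketch, assuming the column is text/varchar and always holds a valid JSON array:
-- Pretty-print the stored JSON (cast text -> jsonb first)
SELECT jsonb_pretty(bs.author::jsonb) AS author
FROM books bs;

-- Or expand it into rows and columns
SELECT a.*
FROM books bs,
     json_to_recordset(bs.author::json)
       AS a(author_id int, author_name text, author_slug text);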
I have the data in this form:
data={'[{"info": "No", "uid": null, "links": ["";, ""], "task_id": 1, "created": "2017-02-15T09:07:09.068145", "finish_time": "2017-02-15T09:07:14.620174", "calibration": null, "user_ip": null, "timeout": null, "project_id": 1, "id": 1}]', 'uuid': u'abc:def:ghi'}
I want to store this data in the Postgres DB. I have this query:
quer1='UPDATE table_1 SET data = "%s" WHERE id = "%s" '%(data1,id)
db_session.execute(quer1)
db_session.commit()
This query does execute but doesn't store anything in the DB. The datatype of data is 'text'. I am not able to figure out where I am wrong. Please help.
Edited:
I updated my query to this:
quer1='UPDATE table_1 SET data = "%s" WHERE hitid = %s '%(data1,id)
First, never use % or str.format to insert values into your queries!!!
Assuming you are using psycopg2, your query should use the following format:
db_session.execute('UPDATE table_1 SET data = %s WHERE id = %s', (data1, id))
As #groteworld mentions, data = {1,2,'3',[4],(5),{6}} is not valid Python.
I will assume you are using a proper value for data in your actual code.
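If data1 is a Python structure rather than a plain string, one way to keep a text column happy is to serialize it first. A minimal sketch, assuming psycopg2 and that table_1 with its data and id columns exists as in your question (connection details and sample values are placeholders):
import json
import psycopg2

# Placeholder values; in your code these come from your application.
data1 = {"uuid": "abc:def:ghi", "info": "No"}
row_id = 1

conn = psycopg2.connect("dbname=mydb user=me")
with conn, conn.cursor() as cur:
    cur.execute(
        "UPDATE table_1 SET data = %s WHERE id = %s",
        (json.dumps(data1), row_id),
    )
# leaving the `with conn` block commits the transaction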