NIFI Insert CSV File into Postgres Database with date fields - postgresql

I would like to insert a CSV file into my Postgres database. I use these processors:
GetFile ->
Split (because the files are big) ->
UpdateAttribute (to add avro.schema) ->
ConvertCSVToAvro ->
PutDatabaseRecord.
If I use only string/text fields (in my Avro schema and in the Postgres table columns), the result is OK.
But when I try to format date fields, I get an error.
My raw data (CSV) is:
date_export|num_etiquette|key
07/11/2019 01:36:00|BAROMETRExxxxx|BAROMETRE-xxxxx
My Avro schema is:
{
"type":"record",
"name":"public.data_scope_gp_temp",
"fields":[
{"name":"date_export","type":{ "type": "int", "logicalType": "date"}},
{"name":"num_etiquette","type":"string"},
{"name":"cle_scope","type":"string"}
]}
My Postgres schema is:
date_export date,
num_etiquette text COLLATE pg_catalog."default",
key text COLLATE pg_catalog."default"
Any idea? Regards

You don't need UpdateAttribute or ConvertCSVToAvro to use PutDatabaseRecord. You can specify a CSVReader in PutDatabaseRecord, and your CSVReader can supply the Avro schema in the Schema Text property (don't forget to set the Schema Access Strategy to use the Schema Text property).
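For example, a sketch of what that Schema Text might look like: since the raw date_export values ("07/11/2019 01:36:00") carry a time component, one option is to declare the field as a timestamp rather than a date, with the CSVReader's Timestamp Format property set to match (dd/MM/yyyy HH:mm:ss). Both the timestamp-millis logical type and that format string are assumptions to adapt to your data, not the only way to do it:
{
  "type": "record",
  "name": "public.data_scope_gp_temp",
  "fields": [
    {"name": "date_export", "type": {"type": "long", "logicalType": "timestamp-millis"}},
    {"name": "num_etiquette", "type": "string"},
    {"name": "cle_scope", "type": "string"}
  ]
}
If the target column must stay a Postgres date, the timestamp should still be accepted there (Postgres casts timestamps to dates on assignment), but that is worth verifying against your data.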

Related

If we change the schema of a table in Postgres, will the sequence get auto-created?

Say we have a schema named "test" and a table in it as below:
CREATE TABLE IF NOT EXISTS test.details
(
id integer NOT NULL DEFAULT nextval('test.details_id_seq'::regclass),
username character varying(50) COLLATE pg_catalog."default" NOT NULL
)
As we can see, the id column uses a sequence that has already been created in this schema.
Now if we create a new schema named "check" and move the details table into it:
create schema "check";
alter table test.details set schema "check";
Will the sequence be auto-created in the check schema?
The sequence is an independent database object. If you change the schema of the table, the sequence will remain in the old schema, and the table will continue to use it. You can use ALTER SEQUENCE to change the sequence's schema as well, and the default value will continue to work (PostgreSQL stores the parsed default expression rather than the text).
All of that becomes easier if you use identity columns to generate keys: then you don't have to take care of the sequence yourself. It is created automatically and changes schema together with the table.
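A minimal sketch of both options, reusing the names from the question (the details2 table below is made up purely for illustration):
-- option 1: move the existing sequence along with the table
ALTER TABLE test.details SET SCHEMA "check";
ALTER SEQUENCE test.details_id_seq SET SCHEMA "check";

-- option 2: an identity column; its backing sequence is created automatically
-- and moves with the table (details2 is a hypothetical table)
CREATE TABLE IF NOT EXISTS test.details2
(
    id integer GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
    username character varying(50) NOT NULL
);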

How to create column jsonb on AWS-Redshift

I have a table in my demo database.
CREATE TABLE airports_data (
airport_code character(3) NOT NULL,
airport_name text NOT NULL,
city text NOT NULL,
coordinates super NOT NULL,
timezone text NOT NULL
);
In the Postgres database, the coordinates column has POINT as its data type, and the values look like this:
{"x":129.77099609375,"y":62.093299865722656}
When I copy the table to a CSV file, the data is represented like this in the CSV file:
"(129.77099609375,62.093299865722656)"
As I defined SUPER as the data type for the coordinates column in Redshift, how can I copy the POINT data from the Postgres database to a CSV file, and load that CSV file into Redshift?
In Amazon Redshift the SUPER data type is used to store semi-structured data.
This is the Amazon Redshift guide for loading and manipulating semi-structured data using the SUPER data type.
As an example, if you have:
CREATE TABLE "public"."tmp_super2"
("id" VARCHAR(255) NULL, "data1" SUPER NULL, "data2" SUPER NULL)
BACKUP Yes;
With a CSV file named somefile.csv like this:
a|{}|{}
b|{\"a\":\"Hello World\", \"b\":100}|{\"z\":\"Hello Again\"}
Then you can load it with a COPY command like this:
COPY "public"."tmp_super2"
FROM 's3://yourbucket/yourfolder/somefile.csv'
REGION 'us-west-1'
IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftRole'
DELIMITER '|' ESCAPE;
The COPY command is picky about double quotes when it is loading SUPER from CSV, hence the use of a pipe field delimiter, and escaping the double quotes.
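Once loaded, the SUPER values can be queried with Redshift's dot-notation navigation; a quick sketch against the tmp_super2 example above (nothing beyond the sample CSV's field names is assumed):
-- navigate into the SUPER columns loaded from the sample CSV
SELECT t.id,
       t.data1.a AS greeting,
       t.data1.b AS amount,
       t.data2.z AS greeting_again
FROM "public"."tmp_super2" AS t;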

How do I define the type of a field starting with a #?

I am trying to create a stream from some Kafka messages in JSON format like:
"beat": {
"name": "xxxxxxx",
"hostname": "xxxxxxxxxx",
"version": "zzzzz"
},
"log_instance": "forwarder-2",
"type": "prod",
"message": "{ ... json string.... }",
"#timestamp": "2020-06-14T23:31:33.925Z",
"input_type": "log",
"#version": "1"
}
I tried using
CREATE STREAM S (
beat STRUCT<
name VARCHAR,
hostname VARCHAR,
version VARCHAR
>,
log_instance VARCHAR,
type VARCHAR,
message VARCHAR, # for brevity - I also have this with a struct
#timestamp VARCHAR,
input_type VARCHAR,
#version VARCHAR )
WITH (KAFKA_TOPIC='some_topic', VALUE_FORMAT='JSON');
However I get an error:
Caused by: line 10:5: extraneous input '#' expecting ....
I tried quoting and a preceding underscore but no luck. I also tried creating an entry in the registry but I could not create legit Avro this way.
PS. How do I "bind" a topic to a registry schema?
Thanks.
If you're on a recent enough version of ksqlDB then simply quoting the column names with invalid characters should work:
CREATE STREAM S (
beat STRUCT<
name VARCHAR,
hostname VARCHAR,
version VARCHAR
>,
log_instance VARCHAR,
type VARCHAR,
message VARCHAR, -- for brevity - I also have this with a struct
`#timestamp` VARCHAR,
input_type VARCHAR,
`#version` VARCHAR )
WITH (KAFKA_TOPIC='some_topic', VALUE_FORMAT='JSON');
If the above doesn't work, then it's likely you're on an old version of ksqlDB. Upgrading should fix this issue.
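Once the stream exists, the same back-quoting is needed wherever those columns are referenced, for example in a push query (only the stream defined above is assumed):
-- the back-quoted names must be repeated exactly, including case
SELECT `#timestamp`, `#version`, beat->name AS beat_name
FROM S
EMIT CHANGES;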
PS. How do I "bind" a topic to a registry schema?
ksqlDB will auto-publish the JSON schema to the Schema Registry if you use the JSON_SR format, rather than just JSON. The latter only supports reading the schema from the schema registry.
If you're more asking how you register a schema in the SR for an existing topic... then you're best off looking at the SR docs. Note, ksqlDB only supports the TopicNameStrategy naming strategy. The value schema has the subject {topic-name}-value, e.g. the following registers a JSON schema for the test topic's values.
curl -X POST -H "Content-Type: application/vnd.schemaregistry.v1+json" --data '{"schema": "{\"type\":\"record\",\"name\":\"Payment\",\"namespace\":\"io.confluent.examples.clients.basicavro\",\"fields\":[{\"name\":\"id\",\"type\":\"string\"},{\"name\":\"amount\",\"type\":\"double\"}]}"}' http://localhost:8081/subjects/test-value/versions
See the SR tutorial for more info: https://docs.confluent.io/current/schema-registry/schema_registry_tutorial.html
I also tried creating an entry in the registry but I could not create legit avro this way.
Avro does not allow # in its field names. However, it looks like your data is in JSON format, which does allow #. See the curl example above on how to register a JSON schema.
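As noted above, if you would rather have ksqlDB publish the schema itself, switching the value format is enough; a minimal sketch, assuming a ksqlDB version that supports the JSON_SR format (the stream name S_SR is made up):
-- S_SR is a hypothetical name; only the columns of interest are declared here
CREATE STREAM S_SR (
  `#timestamp` VARCHAR,
  `#version` VARCHAR,
  log_instance VARCHAR )
WITH (KAFKA_TOPIC='some_topic', VALUE_FORMAT='JSON_SR');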

psql copy from csv automatically generating ids

Well consider a table created like this:
CREATE TABLE public.test
(
id integer NOT NULL DEFAULT nextval('user_id_seq'::regclass),
name text,
PRIMARY KEY (id)
)
So the table has a unique 'id' column that auto generates default values using a sequence.
Now I wish to import data from a CSV file, extending this table. However, the ids obviously need to be unique, so I wish to let the database itself generate them; the CSV file (coming from a completely different source) hence has an "empty column" for the ids:
,username
,username2
However, if I then import this CSV using psql:
\copy public."user" FROM '/home/paul/Downloads/test.csv' WITH (FORMAT csv);
The following error pops up:
ERROR: null value in column "id" violates not-null constraint
So how can I do this?
The empty column from the CSV file is interpreted as SQL NULL, and inserting that value overrides the DEFAULT and leads to the error.
You should omit the empty column from the file and use:
\copy public."user"(name) FROM '...' (FORMAT 'csv')
Then the default value will be used for id.
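If editing the file is not an option, another approach is to load the raw CSV into a throwaway staging table and copy only the name column over, letting the sequence fill in id (the staging table name below is made up for illustration):
-- "staging" is a hypothetical throwaway table
CREATE TEMPORARY TABLE staging (id text, name text);
\copy staging FROM '/home/paul/Downloads/test.csv' WITH (FORMAT csv)
INSERT INTO public."user"(name) SELECT name FROM staging;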

Copy from S3 AVRO file to Table in Redshift Results in All Null Values

I am trying to copy an AVRO file that is stored in S3 to a table I created in Redshift and I am getting all null values. However, the AVRO file does not have null values in it. I see the following error when I look at the log: "Missing newline: Unexpected character 0x79 found at location 9415"
I did some research online and the only post I could find said that values would be null if the column name case in the target table did not match the source file. I have ensured the case for the column in the target table is the same as the source file.
Here is a mock snippet from the AVRO file:
Objavro.schemaĒ{"type":"record","name":"something","fields":[{"name":"g","type":["string","null"]},{"name":"stuff","type":["string","null"]},{"name":"stuff","type":["string","null"]}
Here is the SQL code I am using in Redshift:
create table schema.table_name (g varchar(max));
copy schema.table_name
from 's3://bucket/folder/file.avro'
iam_role 'arn:aws:iam::xxxxxxxxx:role/xx-redshift-readonly'
format as avro 'auto';
I am expecting to see a table with one column called g where each row has the value stuff.