Postgres SQL: getting rid of NA while migrating data from a CSV file

I am migrating data from a CSV file into a newly created table named fortune500. The code is shown below:
CREATE TABLE "fortune500"(
"id" SERIAL,
"rank" INTEGER,
"title" VARCHAR PRIMARY KEY,
"name" VARCHAR,
"ticker" CHAR(5),
"url" VARCHAR,
"hq" VARCHAR,
"sector" VARCHAR,
"industry" VARCHAR,
"employees" INTEGER,
"revenues" INTEGER,
"revenues_change" REAL,
"profits" NUMERIC,
"profits_change" REAL,
"assets" NUMERIC,
"equity" NUMERIC
);
Then I wanted to migrate data from the CSV file using the below code:
COPY "fortune500"("rank", "title", "name", "ticker", "url", "hq", "sector", "industry", "employees",
"revenues", "revenues_change", "profits", "profits_change", "assets", "equity")
FROM 'C:\Users\Yasser A.RahmAN\Desktop\SQL for Business Analytics\fortune.csv'
DELIMITER ','
CSV HEADER;
But I got the below error message due to NA values in one of the columns.
ERROR: invalid input syntax for type real: "NA"
CONTEXT: COPY fortune500, line 12, column profits_change: "NA"
SQL state: 22P02
So how can I get rid of 'NA' values while migrating the data?

Consider using a staging table that does not have restrictive data types, and then do your transformations and insert into the final table after the data has been loaded into staging. This is known as the ELT (Extract - Load - Transform) approach. You could also use an external tool to implement an ETL process and do the transformation in that tool, before the data reaches your database.
In your case, an ELT approach would be to first create a table with all text columns, load that table, and then insert into your final table, casting the text values to the appropriate types and either filtering out the values that cannot be cast or inserting NULL, or maybe 0, where the cast can't be made - depending on your requirements. For example, you'd filter out rows where profits_change = 'NA' (or better, WHERE NOT (profits_change ~ '^-?\d+(\.\d+)?$') to check for a numeric value), or you'd insert NULL or 0:
CASE
WHEN profits_change ~ '^-?\d+(\.\d+)?$'
THEN profits_change::real
ELSE NULL -- or 0, depending on what you need
END
You'd perform this kind of validation for all fields.
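A minimal sketch of that whole flow, reusing the file path from the question (the staging table name is an assumption, and only the REAL/NUMERIC columns get the guarded cast here; guard the integer columns the same way if they can also contain 'NA'):
-- 1. Stage everything as text so COPY never fails on 'NA'
CREATE TABLE "fortune500_staging"(
"rank" TEXT, "title" TEXT, "name" TEXT, "ticker" TEXT, "url" TEXT,
"hq" TEXT, "sector" TEXT, "industry" TEXT, "employees" TEXT,
"revenues" TEXT, "revenues_change" TEXT, "profits" TEXT,
"profits_change" TEXT, "assets" TEXT, "equity" TEXT
);

COPY "fortune500_staging"
FROM 'C:\Users\Yasser A.RahmAN\Desktop\SQL for Business Analytics\fortune.csv'
DELIMITER ',' CSV HEADER;

-- 2. Transform while inserting into the typed table
INSERT INTO "fortune500"("rank", "title", "name", "ticker", "url", "hq", "sector", "industry",
"employees", "revenues", "revenues_change", "profits", "profits_change", "assets", "equity")
SELECT "rank"::integer, "title", "name", "ticker", "url", "hq", "sector", "industry",
"employees"::integer, "revenues"::integer,
CASE WHEN "revenues_change" ~ '^-?\d+(\.\d+)?$' THEN "revenues_change"::real END,
CASE WHEN "profits" ~ '^-?\d+(\.\d+)?$' THEN "profits"::numeric END,
CASE WHEN "profits_change" ~ '^-?\d+(\.\d+)?$' THEN "profits_change"::real END,
CASE WHEN "assets" ~ '^-?\d+(\.\d+)?$' THEN "assets"::numeric END,
CASE WHEN "equity" ~ '^-?\d+(\.\d+)?$' THEN "equity"::numeric END
FROM "fortune500_staging";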
Alternatively, if it's a one-off thing, just edit your CSV before importing.

Related

Unable to make generated column in postgresql for Json data

I'm trying out generated columns with Postgres 12. I need to create a table with a generated column based on JSON data; I'm going to receive the "name" field as a key there. However, while doing so I got the below error:
postgres=# create table json_tab2 (data jsonb ,
postgres(# "json_tab2.pname" text generated always as (data ->> "name" ) stored
postgres(# );
ERROR: column "name" does not exist
LINE 2: ...on_tab2.pname" text generated always as (data ->> "name" ) ...
After this, I tried to alter an existing table - because that table already has values in its JSON data for the generated column, so it should be able to identify "name" now. This time I ran the below:
postgres=# alter table json_tab add column Pname text generated always as (data ->> "name") stored
;
ERROR: column "name" does not exist
However, "name" does have a value here:
data
-------------------------------------------------
{"age": 31, "city": "New York", "name": "John"}
I'm unable to understand what I'm doing wrong here.
The right-hand side of the ->> operator should be a value. In this case, since it's a string, you need to surround it with single quotes ('):
create table json_tab2 (
data jsonb,
pname text generated always as (data ->> 'name') stored
-- Here ---------------------------------^----^
);
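A quick way to verify the generated column works (the sample row is just an illustration, matching the JSON shown in the question):
insert into json_tab2 (data)
values ('{"age": 31, "city": "New York", "name": "John"}');

select pname from json_tab2; -- returns: John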

Removing keys/values from a JSONB object in Postgresql

I am trying to adapt the Audit trigger to use JSONB instead of hstore. This function stores all inserts/updates/deletes in a separate table.
The trigger has 2 interesting fields: row_data (contains the OLD.* values) and changed_fields (contains only the modified fields/values).
I have trouble converting this part.
In the original function, we have the following code :
audit_row.row_data = hstore(OLD.*);
audit_row.changed_fields = (hstore(NEW.*) - audit_row.row_data);
In my implementation, row_data and changed_fields are of type JSONB
According to the documentation, the "-" operator matches on keys only, and I obviously need to match on both key AND value.
As an example, if the OLD value is this :
select jsonb_object('{field1,1,field2,a string,field3,TRUE}');
jsonb_object
---------------------------------------------------------
{"field1": "1", "field2": "a string", "field3": "TRUE"}
and only field2 was updated, I need to see this :
?column?
------------------------
{"field2": "a string"}
which I would get with this query :
select jsonb_object('{field1,1,field2,a string,field3,TRUE}')
#- '{field1}'
#- '{field3}';
Is there an elegant way to do this (like how it's done with hstore), or should I keep the hstore implementation and convert changed_fields to JSONB (with everything seen as text)?
I could also loop over all fields in NEW and add them to changed_fields if a match could not be found, but how can I do this inside a function?
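For reference, the "loop over all fields in NEW" idea can be written as one set query inside the trigger function rather than an explicit loop. A rough sketch (not taken from the audit trigger itself, just the general pattern), assuming row_data and changed_fields are jsonb:
audit_row.row_data = to_jsonb(OLD);
audit_row.changed_fields = (
    SELECT COALESCE(jsonb_object_agg(key, value), '{}'::jsonb)
    FROM jsonb_each(to_jsonb(NEW)) AS kv(key, value)
    -- keep only the pairs of NEW that differ from OLD (including newly added keys)
    WHERE to_jsonb(OLD) -> kv.key IS DISTINCT FROM kv.value
);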

Column value become NULL when creating Hive table from BSON file

I created a Hive (3.1.2) table from a BSON file dumped from MongoDB (4.0). After creating the table, I selected a couple of entries from it, but some of the values are null.
I tried to print the table rows from the BSON file using Python, and it printed the values correctly, so the values are not missing. Any clue on how to troubleshoot this further?
SQL to create the Hive table:
CREATE EXTERNAL TABLE `tmp_test_status`(
`id` string COMMENT 'frame_id',
`createdAt` INT,
`updatedAt` string,
`task` string)
row format serde 'com.mongodb.hadoop.hive.BSONSerDe'
with serdeproperties('mongo.columns.mapping'='{"id":"_id"}')
stored as inputformat 'com.mongodb.hadoop.mapred.BSONFileInputFormat'
outputformat 'com.mongodb.hadoop.hive.output.HiveBSONFileOutputFormat'
LOCATION
'oss://data-warehouse/hive/warehouse/data.db/tmp_test_status';
===========================================
Data I printed with the Python bson lib:
{'_id': '00003a02-280d-4e59-8483-a0143e0a3359', 'createdAt': '1557999191951', 'updatedAt': '1557999191951', 'task': 'lane', '__v': 0}
===========================================
Data I selected from the Hive table:
00003a02-280d-4e59-8483-a0143e0a3359 NULL NULL lane
093e72ae-206b-4112-ac28-5ba38f9485d0 NULL NULL lane
093ebe41-183c-47b4-ab25-93336875ae10 NULL NULL lane
093ec16b-ba1d-4ddc-90bc-9981342e8071 NULL NULL lane
I found the answer myself: BSON attribute names are case-sensitive, but Hive column names are not. If an attribute name contains upper case in the BSON file, then Hive returns NULL for that column when queried. Simply mapping the attribute names via the SerDe table properties worked for me:
with serdeproperties('mongo.columns.mapping'='{"id":"_id", "createdAt": "createdAt", "updatedAt": "updatedAt", "reLabeled1" : "reLabeled1", "isValid": "isValid"}')
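Putting that together with the original DDL, the table definition would look roughly like this (only the columns from the CREATE TABLE above are mapped here; extend the mapping for any other mixed-case attributes):
CREATE EXTERNAL TABLE `tmp_test_status`(
`id` string COMMENT 'frame_id',
`createdAt` INT,
`updatedAt` string,
`task` string)
row format serde 'com.mongodb.hadoop.hive.BSONSerDe'
with serdeproperties('mongo.columns.mapping'='{"id":"_id", "createdAt": "createdAt", "updatedAt": "updatedAt"}')
stored as inputformat 'com.mongodb.hadoop.mapred.BSONFileInputFormat'
outputformat 'com.mongodb.hadoop.hive.output.HiveBSONFileOutputFormat'
LOCATION
'oss://data-warehouse/hive/warehouse/data.db/tmp_test_status';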

insert a CSV file with polygons into PostgreSQL

I've got a CSV file including 3 columns named boundaries, category and city.
The data in every cell below the column "boundaries" consists of something like this:
{"coordinates"=>[[[-79.86938774585724, 43.206149439482836], [-79.87618446350098, 43.19090988330086], [-79.88626956939697, 43.19328385965552], [-79.88325476646423, 43.200029828720744], [-79.8932647705078, 43.20258723593195], [-79.88930583000183, 43.211150250203886], [-79.86938774585724, 43.206149439482836]]], "type"=>"Polygon"}
How can I create a table with a proper data type for the column "boundaries"?
The data you have specified is essentially JSON, so one option is to store the boundaries data in a jsonb table column.
e.g.: CREATE TABLE cities (city varchar, category varchar, boundaries jsonb);
The alternative is to parse the JSON and store the coordinates in a PostgreSQL ARRAY column, something like:
CREATE TABLE cities (
city varchar,
category varchar,
boundary_coords point ARRAY,
boundary_type varchar
)
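Either way you need a loading step. A rough sketch for the jsonb option (the staging table name and file path are placeholders; the replace() assumes the Ruby-style "=>" in your sample never appears inside the coordinate data itself):
CREATE TABLE cities_staging (
boundaries text,
category text,
city text
);

COPY cities_staging
FROM '/path/to/file.csv' -- placeholder path
DELIMITER ',' CSV HEADER;

INSERT INTO cities (city, category, boundaries)
SELECT city, category, replace(boundaries, '=>', ':')::jsonb
FROM cities_staging;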

CSV file data into a PostgreSQL table

I am trying to create a database for movielens (http://grouplens.org/datasets/movielens/). We've got movies and ratings. Movies have multiple genres. I split those out into a separate table since it's a 1:many relationship. There's a many:many relationship as well, users to movies. I need to be able to query this table multiple ways.
So I created:
CREATE TABLE genre (
genre_id serial NOT NULL,
genre_name char(20) DEFAULT NULL,
PRIMARY KEY (genre_id)
)
.
INSERT INTO genre VALUES
(1,'Action'),(2,'Adventure'),(3,'Animation'),(4,'Children\s'),(5,'Comedy'),(6,'Crime'),
(7,'Documentary'),(8,'Drama'),(9,'Fantasy'),(10,'Film-Noir'),(11,'Horror'),(12,'Musical'),
(13,'Mystery'),(14,'Romance'),(15,'Sci-Fi'),(16,'Thriller'),(17,'War'),(18,'Western');
.
CREATE TABLE movie (
movie_id int NOT NULL DEFAULT '0',
movie_name char(75) DEFAULT NULL,
movie_year smallint DEFAULT NULL,
PRIMARY KEY (movie_id)
);
.
CREATE TABLE moviegenre (
movie_id int NOT NULL DEFAULT '0',
genre_id tinyint NOT NULL DEFAULT '0',
PRIMARY KEY (movie_id, genre_id)
);
I don't know how to import my movies.csv with columns movie_id, movie_name and movie_genre. For example, the first row is (1;Toy Story (1995);Animation|Children's|Comedy)
If I INSERT manually, it should look like:
INSERT INTO moviegenre VALUES (1,3),(1,4),(1,5)
Because 3 is Animation, 4 is Children and 5 is Comedy
How can I import all data set this way?
You should first create a table that can ingest the data from the CSV file:
CREATE TABLE movies_csv (
movie_id integer,
movie_name varchar,
movie_genre varchar
);
Note that any single quotes (Children's) should be doubled (Children''s). Once the data is in this staging table you can copy the data over to the movie table, which should have the following structure:
CREATE TABLE movie (
movie_id integer, -- A primary key has implicit NOT NULL and should not have default
movie_name varchar NOT NULL, -- Movie should have a name, varchar more flexible
movie_year integer, -- Regular integer is more efficient
PRIMARY KEY (movie_id)
);
Sanitize your other tables likewise.
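Before the transformations, load the raw CSV into the staging table; a sketch, assuming the semicolon separator shown in the question's sample row and no header line (the path is a placeholder):
COPY movies_csv (movie_id, movie_name, movie_genre)
FROM '/path/to/movies.csv' -- placeholder path
DELIMITER ';' CSV;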
Now copy the data over, extracting the unadorned name and the year from the CSV name:
INSERT INTO movie (movie_id, movie_name, movie_year)
SELECT movie_id, parts[1], parts[2]::integer
FROM movies_csv, regexp_matches(movie_name, '([[:ascii:]]*)\s\(([\d]*)\)$') p(parts);
Here the regular expression says:
([[:ascii:]]*) - Capture all characters until the matches below
\s - Read past a space
\( - Read past an opening parenthesis
([\d]*) - Capture any digits
\) - Read past a closing parenthesis
$ - Match from the end of the string
So on input "Die Hard 17 (John lives forever) (2074)" it creates a string array with {'Die Hard 17 (John lives forever)', '2074'}. The scanning has to be from the end $, assuming all movie titles end with the year of publication in parentheses, in order to preserve parentheses and numbers in movie titles.
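You can check what the expression captures by running it directly; for example (query added purely for illustration):
SELECT regexp_matches(
'Die Hard 17 (John lives forever) (2074)',
'([[:ascii:]]*)\s\(([\d]*)\)$'
);
-- {"Die Hard 17 (John lives forever)",2074}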
Now you can work on the movie genres. You have to split the string on the bar | using the regexp_split_to_table() function and then join to the genre table on the genre name:
INSERT INTO moviegenre
SELECT movie_id, genre_id
FROM movies_csv, regexp_split_to_table(movie_genre, '\|') p(genre) -- escape the |
JOIN genre ON genre.genre_name = p.genre;
After all is done and dusted you can delete the movies_csv table.
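That final cleanup is simply:
DROP TABLE movies_csv;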