insert a CSV file with polygons into PostgreSQL - postgresql

I have a CSV file with three columns named boundaries, category and city.
Every cell in the "boundaries" column contains something like this:
{"coordinates"=>[[[-79.86938774585724, 43.206149439482836], [-79.87618446350098, 43.19090988330086], [-79.88626956939697, 43.19328385965552], [-79.88325476646423, 43.200029828720744], [-79.8932647705078, 43.20258723593195], [-79.88930583000183, 43.211150250203886], [-79.86938774585724, 43.206149439482836]]], "type"=>"Polygon"}
How can I create a table with a proper data type for the "boundaries" column?

The data you have shown is GeoJSON written with Ruby-style => separators instead of colons. Once each => is replaced with a colon it is valid JSON, so you could store the boundaries data in a jsonb table column,
e.g.: CREATE TABLE cities (city varchar, category varchar, boundaries jsonb);
The alternative is to parse the JSON and store the coordinates in a PostgreSQL ARRAY column, something like:
CREATE TABLE cities (
city varchar,
category varchar,
boundary_coords point ARRAY,
boundary_type varchar
);
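Either way, the => separators have to be converted before PostgreSQL will accept the value. Here is a minimal Python sketch of that preprocessing step. The function names are hypothetical, the sample cell is a shortened made-up polygon, and it assumes => never appears inside a string value:

```python
import json


def ruby_hash_to_json(cell):
    """Turn the Ruby-style hash text from a CSV cell into JSON text."""
    # Assumption: "=>" only ever appears as a key/value separator.
    return cell.replace("=>", ":")


def outer_ring_points(cell):
    """Return the (x, y) pairs of the polygon's first (outer) ring."""
    geometry = json.loads(ruby_hash_to_json(cell))
    return [(x, y) for x, y in geometry["coordinates"][0]]


def to_point_array_literal(points):
    """Format points as a PostgreSQL array literal for a point[] column."""
    return "{" + ",".join(f'"({x},{y})"' for x, y in points) + "}"


# Shortened, made-up sample in the same shape as the cells in the question.
cell = '{"coordinates"=>[[[-79.5, 43.25], [-79.75, 43.5], [-79.5, 43.25]]], "type"=>"Polygon"}'

print(ruby_hash_to_json(cell))                          # text for the jsonb column
print(to_point_array_literal(outer_ring_points(cell)))  # text for the point[] column
```

The first output is what you would load into the jsonb column; the second is one possible text form for boundary_coords.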

Related

postgres SQL: getting rid of NA while migrating data from csv file

I am migrating data from a CSV file into a newly created table named fortune500. The code is shown below:
CREATE TABLE "fortune500"(
"id" SERIAL,
"rank" INTEGER,
"title" VARCHAR PRIMARY KEY,
"name" VARCHAR,
"ticker" CHAR(5),
"url" VARCHAR,
"hq" VARCHAR,
"sector" VARCHAR,
"industry" VARCHAR,
"employees" INTEGER,
"revenues" INTEGER,
"revenues_change" REAL,
"profits" NUMERIC,
"profits_change" REAL,
"assets" NUMERIC,
"equity" NUMERIC
);
Then I wanted to migrate data from a csv file using the below code:
COPY "fortune500"("rank", "title", "name", "ticker", "url", "hq", "sector", "industry", "employees",
"revenues", "revenues_change", "profits", "profits_change", "assets", "equity")
FROM 'C:\Users\Yasser A.RahmAN\Desktop\SQL for Business Analytics\fortune.csv'
DELIMITER ','
CSV HEADER;
But I got the below error message due to NA values in one of the columns.
ERROR: invalid input syntax for type real: "NA"
CONTEXT: COPY fortune500, line 12, column profits_change: "NA"
SQL state: 22P02
So how can I get rid of 'NA' values while migrating the data?
Consider using a staging table without restrictive data types: load the raw data into it, then do your transformations and insert into the final table. This is known as the ELT (Extract - Load - Transform) approach. You could also use an external tool to implement an ETL process and do the transformation in that tool, before the data reaches your database.
In your case, the ELT approach would be to first create a table with all text columns, load that table, and then insert into your final table, casting the text values to the appropriate types. Depending on your requirements, you either filter out the values that cannot be cast, or insert NULL (or maybe 0) where the cast can't be made. For example, you'd keep only the rows WHERE profits_change <> 'NA' (or better, WHERE profits_change ~ '^-?\d+(\.\d+)?$', which checks for a numeric value, including negative numbers and single digits), or you'd insert NULL or 0:
CASE
WHEN profits_change ~ '^-?\d+(\.\d+)?$'
THEN profits_change::real
ELSE NULL -- or 0, depending on what you need
END
You'd perform this kind of validation for all fields.
Alternatively, if it's a one-off thing, just edit your CSV before importing.
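If you go the "clean it before it reaches the database" route, the same check is easy to sketch in Python. This is only an illustration: the sample rows are made up, to_real is a hypothetical helper, and the pattern is my assumption of what counts as numeric (optional minus sign, digits, optional decimal part):

```python
import csv
import io
import re

# Assumed definition of "numeric": optional sign, digits, optional fraction.
NUMERIC = re.compile(r"^-?\d+(\.\d+)?$")


def to_real(text):
    """Return a float for numeric text, or None (SQL NULL) otherwise."""
    text = text.strip()
    return float(text) if NUMERIC.match(text) else None


# Made-up rows standing in for the real fortune.csv.
sample = "title,profits_change\nAcme,12.5\nGlobex,NA\nInitech,-3\n"

rows = [
    (record["title"], to_real(record["profits_change"]))
    for record in csv.DictReader(io.StringIO(sample))
]
print(rows)  # [('Acme', 12.5), ('Globex', None), ('Initech', -3.0)]
```

A None here plays the role of the ELSE NULL branch in the CASE expression.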

Can the categories in the postgres tablefunc crosstab() function be integers?

It's all in the title. Documentation has something like this:
SELECT *
FROM crosstab('...') AS ct(row_name text, category_1 text, category_2 text);
I have two tables, lab_tests and lab_tests_results. All of the lab_tests_results rows are tied to the primary key id integer in the lab_tests table. I'm trying to make a pivot table where the lab tests (identified by an integer) are row headers and the respective results are in the table. I can't get around a syntax error at or around the integer.
Is this possible with the current set up? Am I missing something in the documentation? Or do I need to perform an inner join of sorts to make the categories strings? Or modify the lab_tests_results table to use a text identifier for the lab tests?
Thanks for the help, all. Much appreciated.
Edit: Got it figured out with the help of Dmitry. He had the data layout figured out, but I was unclear on what kind of output I needed. I was trying to get the pivot table to be based on batch_id numbers in the lab_tests_results table. Had to hammer out the base query and casting data types.
SELECT *
FROM crosstab('SELECT lab_tests_results.batch_id, lab_tests.test_name, lab_tests_results.test_result::FLOAT
FROM lab_tests_results, lab_tests
WHERE lab_tests.id=lab_tests_results.lab_test AND (lab_tests.test_name LIKE ''Test Name 1'' OR lab_tests.test_name LIKE ''Test Name 2'')
ORDER BY 1,2'
) AS final_result(batch_id VARCHAR, test_name_1 FLOAT, test_name_2 FLOAT);
This provides a pivot table from the lab_tests_results table like below:
batch_id |test_name_1 |test_name_2
---------------------------------------
batch1 | result1 | <null>
batch2 | result2 | result3
If I understand correctly your tables look something like this:
CREATE TABLE lab_tests (
id INTEGER PRIMARY KEY,
name VARCHAR(500)
);
CREATE TABLE lab_tests_results (
id INTEGER PRIMARY KEY,
lab_tests_id INTEGER REFERENCES lab_tests (id),
result TEXT
);
And your data looks something like this:
INSERT INTO lab_tests (id, name)
VALUES (1, 'test1'),
(2, 'test2');
INSERT INTO lab_tests_results (id, lab_tests_id, result)
VALUES (1,1,'result1'),
(2,1,'result2'),
(3,2,'result3'),
(4,2,'result4'),
(5,2,'result5');
First of all crosstab is part of tablefunc, you need to enable it:
CREATE EXTENSION tablefunc;
You need to run it once per database.
The final query will look like this:
SELECT *
FROM crosstab(
'SELECT lt.name::TEXT, lt.id, ltr.result
FROM lab_tests AS lt
JOIN lab_tests_results ltr ON ltr.lab_tests_id = lt.id
ORDER BY lt.name, ltr.id'
) AS ct(test_name text, result_1 text, result_2 text, result_3 text);
Explanation:
The crosstab() function takes the text of a query that should return 3 columns: (1) the row name (the group), (2) the category, (3) the value. The source query should be sorted so that rows belonging to the same group arrive together, hence the ORDER BY. The wrapping query selects everything that crosstab() returns and defines the list of output columns in the part after AS: first the row name (test_name), then the values (result_1, result_2, result_3). This query returns up to 3 results per test. If a test has more than 3 results, the extra ones are not shown; if it has fewer than 3, the remaining columns are NULL.
The result for this query is:
test_name |result_1 |result_2 |result_3
---------------------------------------
test1 |result1 |result2 |<null>
test2 |result3 |result4 |result5
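To make the "fill left to right, drop the extras, pad with nulls" behaviour concrete, here is a rough Python imitation of the one-argument crosstab(). It is a sketch of the idea only, not how tablefunc is implemented, and like the real function it relies on the input being sorted by row name:

```python
from itertools import groupby


def crosstab(rows, n_values):
    """Pivot (row_name, category, value) rows into one row per row_name.

    Values are placed left to right; extras beyond n_values are dropped
    and missing ones become None. Rows must already be sorted by row_name,
    because groupby only groups consecutive rows.
    """
    pivoted = []
    for name, group in groupby(rows, key=lambda row: row[0]):
        values = [value for _, _, value in group][:n_values]
        values += [None] * (n_values - len(values))
        pivoted.append((name, *values))
    return pivoted


# The sample data from above, already joined and sorted.
rows = [
    ("test1", 1, "result1"),
    ("test1", 1, "result2"),
    ("test2", 2, "result3"),
    ("test2", 2, "result4"),
    ("test2", 2, "result5"),
]

for row in crosstab(rows, 3):
    print(row)
```

Calling crosstab(rows, 2) instead would silently drop result5, which is the "more than 3 results" case described in the explanation.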

How to get size of PostgreSQL jsonb field?

I have a table with a jsonb field.
CREATE TABLE data.items
(
id serial NOT NULL,
datab jsonb
)
How to get size of this field in a query like this:
select id, size(datab) from data.items
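For the number of bytes used to store the value:
select id, pg_column_size(datab) from data.items;
For the number of elements, when the top-level jsonb value is an array (jsonb_array_length raises an error for objects and scalars):
select id, jsonb_array_length(datab) from data.items;
If the column uses EXTENDED storage (TOAST compression), pg_column_size reports the compressed size. To measure the uncompressed text representation, use octet_length(datab::text) instead.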

SQL Server 2012: autofill one column from another

I have a table where a user inputs name, dob, etc. and I have a User_Name column that I want automatically populated from other columns.
For example input is: Name - John Doe, DOB - 01/01/1900
I want the User_Name column to be automatically populated with johndoe01011900 (I already have the algorithm to concatenate the column parts to achieve the desired result)
I just need to know how (SQL, trigger) to have the User_Name column filled once the user finishes entering ALL target columns. What if the user skips around and does not enter the data in order? Of course, the columns that are needed are NOT NULL.
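This should do it: you can use a computed column with the following expression:
LOWER(REPLACE(Name, ' ', '')) + REPLACE(CONVERT(VARCHAR(10), DateOfBirth, 101), '/', '')
Style 101 formats the date as mm/dd/yyyy, so removing the slashes gives the mmddyyyy form from your example (style 112 would give yyyymmdd instead). Because the column is computed from the other columns, the order in which the user fills them in does not matter.
In the sample below I have used a temp table, but this works the same for regular tables.
SAMPLE:
CREATE TABLE #temp(Name VARCHAR(100)
, DateOfBirth DATE
, CalcField AS LOWER(REPLACE(Name, ' ', '')) + REPLACE(CONVERT(VARCHAR(10), DateOfBirth, 101), '/', ''));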
INSERT INTO #temp(Name
, DateOfBirth)
VALUES
('John Doe'
, '01/01/1900');
SELECT *
FROM #temp;
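If you want to sanity-check the value the column should hold for the example in the question, the same rule is quick to write out in Python. The function name is made up, and the mmddyyyy layout is taken from the johndoe01011900 example:

```python
from datetime import date


def make_user_name(name, dob):
    """Lower-case the name, drop spaces, and append the date as mmddyyyy."""
    compact_name = name.replace(" ", "").lower()
    return f"{compact_name}{dob.month:02d}{dob.day:02d}{dob.year:04d}"


print(make_user_name("John Doe", date(1900, 1, 1)))  # johndoe01011900
```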

CSV file data into a PostgreSQL table

I am trying to create a database for movielens (http://grouplens.org/datasets/movielens/). We've got movies and ratings. Movies have multiple genres. I split those out into a separate table since it's a 1:many relationship. There's a many:many relationship as well, users to movies. I need to be able to query this table multiple ways.
So I created:
CREATE TABLE genre (
genre_id serial NOT NULL,
genre_name char(20) DEFAULT NULL,
PRIMARY KEY (genre_id)
);
INSERT INTO genre VALUES
(1,'Action'),(2,'Adventure'),(3,'Animation'),(4,'Children''s'),(5,'Comedy'),(6,'Crime'),
(7,'Documentary'),(8,'Drama'),(9,'Fantasy'),(10,'Film-Noir'),(11,'Horror'),(12,'Musical'),
(13,'Mystery'),(14,'Romance'),(15,'Sci-Fi'),(16,'Thriller'),(17,'War'),(18,'Western');
CREATE TABLE movie (
movie_id int NOT NULL DEFAULT '0',
movie_name char(75) DEFAULT NULL,
movie_year smallint DEFAULT NULL,
PRIMARY KEY (movie_id)
);
CREATE TABLE moviegenre (
movie_id int NOT NULL DEFAULT '0',
genre_id smallint NOT NULL DEFAULT '0', -- PostgreSQL has no tinyint type
PRIMARY KEY (movie_id, genre_id)
);
I don't know how to import my movies.csv, which has the columns movie_id, movie_name and movie_genre. For example, the first row is (1;Toy Story (1995);Animation|Children's|Comedy).
If I INSERT manually, it should look like:
INSERT INTO moviegenre VALUES (1,3),(1,4),(1,5);
Because 3 is Animation, 4 is Children's and 5 is Comedy.
How can I import the whole data set this way?
You should first create a table that can ingest the data from the CSV file:
CREATE TABLE movies_csv (
movie_id integer,
movie_name varchar,
movie_genre varchar
);
Note that in SQL string literals, such as the ones in your INSERT INTO genre statement, any single quote (Children's) must be doubled (Children''s). Once the data is in this staging table you can copy the data over to the movie table, which should have the following structure:
CREATE TABLE movie (
movie_id integer, -- A primary key has implicit NOT NULL and should not have default
movie_name varchar NOT NULL, -- Movie should have a name, varchar more flexible
movie_year integer, -- Regular integer is more efficient
PRIMARY KEY (movie_id)
);
Sanitize your other tables likewise.
Now copy the data over, extracting the unadorned name and the year from the CSV name:
INSERT INTO movie (movie_id, movie_name, movie_year)
SELECT movie_id, parts[1], parts[2]::integer
FROM movies_csv, regexp_matches(movie_name, '^(.*)\s\((\d{4})\)$') p(parts);
Here the regular expression says:
^(.*) - Capture all characters from the start of the string up to the matches below
\s - Read past a space
\( - Read past an opening parenthesis
(\d{4}) - Capture a four-digit year
\) - Read past a closing parenthesis
$ - Anchor the match at the end of the string
So on input "Die Hard 17 (John lives forever) (2074)" it creates a string array with {'Die Hard 17 (John lives forever)', '2074'}. The match has to be anchored at the end ($), assuming all movie titles end with the year of publication in parentheses, in order to preserve parentheses and numbers in movie titles. Rows whose name does not match the pattern are skipped, because regexp_matches() returns no row for them.
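If you want to try the title/year split outside the database first, here is a small Python sketch using a pattern in the same spirit. The function name is made up, and it assumes every usable title ends with a four-digit year in parentheses:

```python
import re

# Anchored at both ends; the greedy (.*) keeps earlier parentheses in the title.
TITLE_YEAR = re.compile(r"^(.*)\s\((\d{4})\)$")


def split_title(movie_name):
    """Return (title, year) or None when the name has no trailing (yyyy)."""
    match = TITLE_YEAR.match(movie_name)
    if match is None:
        return None
    return match.group(1), int(match.group(2))


print(split_title("Toy Story (1995)"))
print(split_title("Die Hard 17 (John lives forever) (2074)"))
print(split_title("No year here"))
```

Names that return None are the ones you would want to inspect by hand before loading.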
Now you can work on the movie genres. You have to split the string on the bar | using the regexp_split_to_table() function and then join to the genre table on the genre name:
INSERT INTO moviegenre (movie_id, genre_id)
SELECT movie_id, genre_id
FROM movies_csv
CROSS JOIN LATERAL regexp_split_to_table(movie_genre, '\|') AS p(genre) -- escape the |
JOIN genre ON genre.genre_name = p.genre;
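The split-then-join step can be pictured with a few lines of Python. This is just an illustration of the logic: the dictionary holds only the three genres from the example row, and the function name is hypothetical:

```python
# Subset of the genre table from the question: name -> genre_id.
GENRE_IDS = {"Animation": 3, "Children's": 4, "Comedy": 5}


def movie_genre_rows(movie_id, movie_genre, genre_ids):
    """Split the |-separated genre string and look up each genre id."""
    rows = []
    for name in movie_genre.split("|"):
        genre_id = genre_ids.get(name)
        if genre_id is not None:  # like an inner join, unknown names are dropped
            rows.append((movie_id, genre_id))
    return rows


print(movie_genre_rows(1, "Animation|Children's|Comedy", GENRE_IDS))
# [(1, 3), (1, 4), (1, 5)]
```

That matches the manual INSERT from the question for the Toy Story row.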
After all is done and dusted you can drop the movies_csv table.