How to copy into Postgres table from csv with added column? - postgresql

I have a table in Postgres that I would like to copy into from a csv file. I usually do it like this:
\copy my_table from '/workdir/some_file.txt' with null as 'NULL' delimiter E'|' csv header;
The problem now, however, is that my_table has one extra column that I would like to fill in on copy, always with the same value 'b'. Here are my tables:
some_file.txt:
col1 | col2 | col3
0 | 0 | 1
0 | 1 | 3
my_table:
xtra_col | col1 | col2 | col3
a | 5 | 2 | 5
a | 6 | 2 | 5
a | 7 | 2 | 5
Desired my_table after copy into:
xtra_col | col1 | col2 | col3
a | 5 | 2 | 5
a | 6 | 2 | 5
a | 7 | 2 | 5
b | 0 | 0 | 1
b | 0 | 1 | 3
Is there a way to specify the constant value 'b' for column xtra_col in the copy statement? If not, how should I approach this problem?

You could set a (temporary) default value for the xtra_col:
ALTER TABLE my_table ALTER COLUMN xtra_col SET DEFAULT 'b';
COPY my_table (col1, col2, col3) FROM '/workdir/some_file.txt' WITH (FORMAT CSV, DELIMITER '|', NULL 'NULL', HEADER true);
ALTER TABLE my_table ALTER COLUMN xtra_col DROP DEFAULT;
Is there a way to avoid repeating the columns of my_table? The real my_table has 20 columns and I wouldn't want to list all of them.
If my_table has a lot of columns and you wish to avoid having to type out all the column names,
you could dynamically generate the COPY command like this:
SELECT format($$COPY my_table(%s) FROM '/workdir/some_file.txt' WITH (FORMAT CSV, DELIMITER '|', NULL 'NULL', HEADER true);$$
, string_agg(quote_ident(attname), ','))
FROM pg_attribute
WHERE attrelid = 'my_table'::regclass
AND attname != 'xtra_col'
AND attnum > 0
AND NOT attisdropped;
You could then copy and paste the generated SQL to run it.
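The same statement could also be assembled on the client before it is sent to the server. The following is a minimal Python sketch, not part of the original answer: `build_copy_sql` and `quote_ident` are hypothetical helpers, and the list of column names is assumed to be known already (for instance, fetched from pg_attribute beforehand).

```python
def quote_ident(name):
    """Double-quote an identifier, doubling any embedded double quotes.

    Unlike PostgreSQL's quote_ident(), this helper always adds quotes.
    """
    return '"' + name.replace('"', '""') + '"'


def build_copy_sql(table, columns, skip, path):
    """Return a COPY statement listing every column except `skip`."""
    col_list = ", ".join(quote_ident(c) for c in columns if c != skip)
    # Double single quotes so the path stays a valid SQL string literal.
    literal_path = "'" + path.replace("'", "''") + "'"
    return (
        f"COPY {quote_ident(table)} ({col_list}) FROM {literal_path} "
        "WITH (FORMAT CSV, DELIMITER '|', NULL 'NULL', HEADER true);"
    )


sql = build_copy_sql(
    "my_table",
    ["xtra_col", "col1", "col2", "col3"],  # assumed column list
    "xtra_col",
    "/workdir/some_file.txt",
)
print(sql)
```

The printed statement names only col1, col2 and col3, so xtra_col is left to the column default.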
Or, for totally hands-free operation, you could create a function to generate the SQL and execute it:
CREATE OR REPLACE FUNCTION test_func(filepath text, xcol text, fillval text)
RETURNS void
LANGUAGE plpgsql
AS $func$
DECLARE sql text;
BEGIN
-- %I quotes identifiers and %L quotes literals, so odd names or values cannot break the SQL
EXECUTE format($$ALTER TABLE my_table ALTER COLUMN %I SET DEFAULT %L;$$, xcol, fillval);
SELECT format($$COPY my_table(%s) FROM %L WITH (FORMAT CSV, DELIMITER '|', NULL 'NULL', HEADER true);$$
, string_agg(quote_ident(attname), ','), filepath)
INTO sql
FROM pg_attribute
WHERE attrelid = 'my_table'::regclass
AND attname != xcol -- skip the column that is filled by the default
AND attnum > 0 -- skip system columns
AND NOT attisdropped; -- skip dropped columns
EXECUTE sql;
EXECUTE format($$ALTER TABLE my_table ALTER COLUMN %I DROP DEFAULT;$$, xcol);
END;
$func$;
SELECT test_func('/workdir/some_file.txt', 'xtra_col', 'b');
This is the SQL I used to test the solution above (with test in place of my_table inside test_func):
DROP TABLE IF EXISTS test;
CREATE TABLE test (
xtra_col text
, col1 int
, col2 int
, col3 int
);
INSERT INTO test VALUES
('a', 5, 2, 5)
, ('a', 6, 2, 5)
, ('a', 7, 2, 5);
with the contents of /tmp/data being
col1 | col2 | col3
0 | 0 | 1
0 | 1 | 3
Then
SELECT test_func('/tmp/data', 'xtra_col', 'b');
SELECT * FROM test;
results in
+----------+------+------+------+
| xtra_col | col1 | col2 | col3 |
+----------+------+------+------+
| a | 5 | 2 | 5 |
| a | 6 | 2 | 5 |
| a | 7 | 2 | 5 |
| b | 0 | 0 | 1 |
| b | 0 | 1 | 3 |
+----------+------+------+------+
(5 rows)
Regarding the pg.dropped column:
The test_func call does not seem to produce the pg.dropped column, at least on the test table used above:
unutbu=# SELECT *
FROM pg_attribute
WHERE attrelid = 'test'::regclass;
+----------+----------+----------+---------------+--------+--------+----------+-------------+-----------+----------+------------+----------+------------+-----------+-------------+--------------+------------+-------------+--------------+--------+------------+---------------+
| attrelid | attname | atttypid | attstattarget | attlen | attnum | attndims | attcacheoff | atttypmod | attbyval | attstorage | attalign | attnotnull | atthasdef | attidentity | attisdropped | attislocal | attinhcount | attcollation | attacl | attoptions | attfdwoptions |
+----------+----------+----------+---------------+--------+--------+----------+-------------+-----------+----------+------------+----------+------------+-----------+-------------+--------------+------------+-------------+--------------+--------+------------+---------------+
| 53393 | tableoid | 26 | 0 | 4 | -7 | 0 | -1 | -1 | t | p | i | t | f | | f | t | 0 | 0 | | | |
| 53393 | cmax | 29 | 0 | 4 | -6 | 0 | -1 | -1 | t | p | i | t | f | | f | t | 0 | 0 | | | |
| 53393 | xmax | 28 | 0 | 4 | -5 | 0 | -1 | -1 | t | p | i | t | f | | f | t | 0 | 0 | | | |
| 53393 | cmin | 29 | 0 | 4 | -4 | 0 | -1 | -1 | t | p | i | t | f | | f | t | 0 | 0 | | | |
| 53393 | xmin | 28 | 0 | 4 | -3 | 0 | -1 | -1 | t | p | i | t | f | | f | t | 0 | 0 | | | |
| 53393 | ctid | 27 | 0 | 6 | -1 | 0 | -1 | -1 | f | p | s | t | f | | f | t | 0 | 0 | | | |
| 53393 | xtra_col | 25 | -1 | -1 | 1 | 0 | -1 | -1 | f | x | i | f | f | | f | t | 0 | 100 | | | |
| 53393 | col1 | 23 | -1 | 4 | 2 | 0 | -1 | -1 | t | p | i | f | f | | f | t | 0 | 0 | | | |
| 53393 | col2 | 23 | -1 | 4 | 3 | 0 | -1 | -1 | t | p | i | f | f | | f | t | 0 | 0 | | | |
| 53393 | col3 | 23 | -1 | 4 | 4 | 0 | -1 | -1 | t | p | i | f | f | | f | t | 0 | 0 | | | |
+----------+----------+----------+---------------+--------+--------+----------+-------------+-----------+----------+------------+----------+------------+-----------+-------------+--------------+------------+-------------+--------------+--------+------------+---------------+
(10 rows)
As far as I know, the pg.dropped column is a natural result of how PostgreSQL works when a column is dropped: the row stays in pg_attribute with attisdropped set to true and a placeholder attname containing pg.dropped. So no fix is necessary on the table itself.
Such rows keep their original positive attnum; it is the system columns (tableoid, ctid and so on, as in the listing above) that have a negative attnum.
The attnum > 0 condition in test_func therefore removes only the system columns. To keep dropped columns out of the generated list of column names, also filter on NOT attisdropped.
My experience with PostgreSQL is limited, so I might be wrong about the details.

I usually load a file into a temporary table then insert (or update) from there. In this case,
CREATE TEMP TABLE input (LIKE my_table);
ALTER TABLE input DROP xtra_col;
\copy input from 'some_file.txt' ...
INSERT INTO my_table
SELECT 'b', * FROM input;
The INSERT statement looks tidy, but that can only really be achieved when the columns you want to exclude are on either end of my_table. In your (probably simplified) example, xtra_col is at the front so we can quickly append the remaining columns using *.
If the arrangement of the CSV file columns differs from my_table much more than that, you'll need to start typing out column names.
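The staging-table pattern above can be tried end to end without a PostgreSQL server. The sketch below uses Python's built-in sqlite3 module as a stand-in, so it is an illustration of the idea rather than the exact psql workflow: the file contents are inlined as an assumed sample, and the csv module plays the role of \copy.

```python
import csv
import io
import sqlite3

# Assumed stand-in for some_file.txt: pipe-delimited with a header row.
FILE_TEXT = "col1|col2|col3\n0|0|1\n0|1|3\n"

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE my_table (xtra_col TEXT, col1 INT, col2 INT, col3 INT)")
conn.executemany(
    "INSERT INTO my_table VALUES (?, ?, ?, ?)",
    [("a", 5, 2, 5), ("a", 6, 2, 5), ("a", 7, 2, 5)],
)

# Staging table holding only the columns that exist in the file.
conn.execute("CREATE TEMP TABLE input (col1 INT, col2 INT, col3 INT)")

reader = csv.reader(io.StringIO(FILE_TEXT), delimiter="|")
next(reader)  # skip the header line
conn.executemany(
    "INSERT INTO input VALUES (?, ?, ?)",
    [tuple(int(v) for v in row) for row in reader],
)

# Prepend the constant while moving the rows into the real table.
conn.execute("INSERT INTO my_table SELECT 'b', * FROM input")

rows = conn.execute(
    "SELECT * FROM my_table ORDER BY xtra_col, col1, col2"
).fetchall()
for row in rows:
    print(row)
```

As in the answer, the `SELECT 'b', *` shortcut only lines up because xtra_col is the first column of my_table.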

Related

Find rows in relation with at least n rows in a different table without joins

I have a table as such (tbl):
+----+------+-----+
| pk | attr | val |
+----+------+-----+
| 0 | ohif | 4 |
| 1 | foha | 56 |
| 2 | slns | 2 |
| 3 | faso | 11 |
+----+------+-----+
And another table in n-to-1 relationship with tbl (tbl2):
+----+-----+
| pk | rel |
+----+-----+
| 0 | 0 |
| 1 | 1 |
| 2 | 0 |
| 3 | 2 |
| 4 | 2 |
| 5 | 3 |
| 6 | 1 |
| 7 | 2 |
+----+-----+
(tbl2.rel -> tbl.pk.)
I would like to select only the rows from tbl which are in relationship with at least n rows from tbl2.
I.e., for n = 2, I want this table:
+----+------+-----+
| pk | attr | val |
+----+------+-----+
| 0 | ohif | 4 |
| 1 | foha | 56 |
| 2 | slns | 2 |
+----+------+-----+
This is the solution I came up with:
SELECT DISTINCT ON (tbl.pk) tbl.*
FROM (
SELECT tbl.pk
FROM tbl
RIGHT OUTER JOIN tbl2 ON tbl2.rel = tbl.pk
GROUP BY tbl.pk
HAVING COUNT(tbl2.*) >= 2 -- n
) AS tbl_candidates
LEFT OUTER JOIN tbl ON tbl_candidates.pk = tbl.pk
Can it be done without selecting the candidates with a subquery and re-joining the table with itself?
I'm on Postgres 10. A standard SQL solution would be better, but a Postgres solution is acceptable.
OK, just join once, as below:
select
t1.pk,
t1.attr,
t1.val
from
tbl t1
join
tbl2 t2 on t1.pk = t2.rel
group by
t1.pk,
t1.attr,
t1.val
having(count(1)>=2) order by t1.pk;
pk | attr | val
----+------+-----
0 | ohif | 4
1 | foha | 56
2 | slns | 2
(3 rows)
Or join once and use a CTE (WITH clause), as below:
with tmp as (
select rel from tbl2 group by rel having(count(1)>=2)
)
select b.* from tmp t join tbl b on t.rel = b.pk order by b.pk;
pk | attr | val
----+------+-----
0 | ohif | 4
1 | foha | 56
2 | slns | 2
(3 rows)
Is the SQL clearer?
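Since the question asks for a version without joins, a different variant is to filter tbl with an IN subquery over the grouped tbl2. This is not one of the queries posted above, only a sketch of that alternative; it runs against an in-memory SQLite database from Python so it can be checked without a Postgres server, using the sample data from the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE tbl  (pk INTEGER PRIMARY KEY, attr TEXT, val INTEGER);
    CREATE TABLE tbl2 (pk INTEGER PRIMARY KEY, rel INTEGER REFERENCES tbl(pk));
    INSERT INTO tbl  VALUES (0,'ohif',4), (1,'foha',56), (2,'slns',2), (3,'faso',11);
    INSERT INTO tbl2 VALUES (0,0), (1,1), (2,0), (3,2), (4,2), (5,3), (6,1), (7,2);
""")

n = 2  # minimum number of related tbl2 rows
result = conn.execute(
    """
    SELECT *
    FROM tbl
    WHERE pk IN (SELECT rel FROM tbl2 GROUP BY rel HAVING COUNT(*) >= ?)
    ORDER BY pk
    """,
    (n,),
).fetchall()
print(result)
```

The subquery uses only standard SQL (GROUP BY, HAVING, IN), so the same statement should carry over to Postgres with the parameter placeholder adapted to the client library.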

Replace null by negative id number in not consecutive rows in hive

I have this table in my database:
| id | desc |
|-------------|
| 1 | A |
| 2 | B |
| NULL | C |
| 3 | D |
| NULL | D |
| NULL | E |
| 4 | F |
---------------
And I want to transform this table into one that replaces the nulls with consecutive negative ids:
| id | desc |
|-------------|
| 1 | A |
| 2 | B |
| -1 | C |
| 3 | D |
| -2 | D |
| -3 | E |
| 4 | F |
---------------
Does anyone know how I can do this in Hive?
The approach below works:
select coalesce(id,concat('-',ROW_NUMBER() OVER (partition by id))) as id,desc from database_name.table_name;
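To make the intended numbering concrete outside of Hive, here is a small Python sketch of the same transformation. It assumes the rows arrive in the order shown in the question; the query above has no ORDER BY inside the window, so the order in which the NULL rows receive -1, -2, -3 is not guaranteed there. The function name `fill_null_ids` is hypothetical.

```python
def fill_null_ids(rows):
    """Replace each missing id with the next negative number, in input order."""
    next_negative = -1
    filled = []
    for row_id, desc in rows:
        if row_id is None:
            row_id = next_negative
            next_negative -= 1
        filled.append((row_id, desc))
    return filled


# Sample rows from the question; None stands for NULL.
rows = [(1, "A"), (2, "B"), (None, "C"), (3, "D"), (None, "D"), (None, "E"), (4, "F")]
filled = fill_null_ids(rows)
for row in filled:
    print(row)
```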

Finding value difference in column pairs

I'm using SQL server 2008R2 and I have a view which returns the following:
+----+-------+-------+-------+-------+-------+-------+
| ID | col1A | col1B | col2A | col2B | col3A | col3B |
+----+-------+-------+-------+-------+-------+-------+
| 1 | 1 | 1 | 3 | 5 | 4 | 4 |
| 2 | 1 | 1 | 5 | 5 | 5 | 4 |
| 3 | 3 | 4 | 5 | 5 | 4 | 4 |
| 4 | 1 | 2 | 5 | 5 | 4 | 3 |
| 5 | 1 | 1 | 2 | 2 | 3 | 3 |
+----+-------+-------+-------+-------+-------+-------+
As you can see this view contains column pairs (col1A and col1B), (col2A and col2B), (col3A and col3B).
I need to query this view and find rows where the column pairs contain different values.
So I would be looking to return:
+----+------------+---+-----+
| ID | ColumnType | A | B |
+----+------------+---+-----+
| 1 | Col2 | 3 | 5 |
| 2 | Col3 | 5 | 4 |
| 3 | Col1 | 3 | 4 |
| 4 | Col1 | 1 | 2 |
| 4 | Col3 | 4 | 3 |
+----+------------+---+-----+
I think I need to use UNPIVOT but not sure how – appreciate any suggestions?
Since you are using SQL Server 2008+ you can use CROSS APPLY to unpivot the pair of columns and then you can easily compare the values in the A and B to return the rows that don't match:
select t.ID,
c.ColumnType,
c.A,
c.B
from [dbo].[yourview] t
cross apply
(
values
('Col1', Col1A, Col1B),
('Col2', Col2A, Col2B),
('Col3', Col3A, Col3B)
) c (ColumnType, A, B)
where c.A <> c.B;
If you have different datatypes in your columns, then you'll need to convert the data to the same type. You can do this conversion within the VALUES clause:
select t.ID,
c.ColumnType,
c.A,
c.B
from [dbo].[yourview] t
cross apply
(
values
('Col1', cast(Col1A as varchar(50)), Col1B),
('Col2', cast(Col2A as varchar(50)), Col2B),
('Col3', cast(Col3A as varchar(50)), Col3B)
) c (ColumnType, A, B)
where c.A <> c.B
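The logic of the CROSS APPLY query, turning each row into one (ColumnType, A, B) triple per pair and keeping the triples whose values differ, can be sketched in plain Python to check it against the expected output. The data is the sample view from the question; `mismatched_pairs` is a hypothetical name.

```python
COLUMNS = ["ID", "col1A", "col1B", "col2A", "col2B", "col3A", "col3B"]
DATA = [
    (1, 1, 1, 3, 5, 4, 4),
    (2, 1, 1, 5, 5, 5, 4),
    (3, 3, 4, 5, 5, 4, 4),
    (4, 1, 2, 5, 5, 4, 3),
    (5, 1, 1, 2, 2, 3, 3),
]
PAIRS = ["Col1", "Col2", "Col3"]


def mismatched_pairs(data):
    """Unpivot each row into (ID, ColumnType, A, B) and keep rows where A != B."""
    out = []
    for values in data:
        row = dict(zip(COLUMNS, values))
        for label in PAIRS:
            a = row[label.lower() + "A"]
            b = row[label.lower() + "B"]
            if a != b:
                out.append((row["ID"], label, a, b))
    return out


diffs = mismatched_pairs(DATA)
for d in diffs:
    print(d)
```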

Postgres values as columns

I am working with PostgreSQL 9.3, and I have this:
PARENT_TABLE
ID | NAME
1 | N_A
2 | N_B
3 | N_C
CHILD_TABLE
ID | PARENT_TABLE_ID | KEY | VALUE
1 | 1 | K_A | V_A
2 | 1 | K_B | V_B
3 | 1 | K_C | V_C
5 | 2 | K_A | V_D
6 | 2 | K_C | V_E
7 | 3 | K_A | V_F
8 | 3 | K_B | V_G
9 | 3 | K_C | V_H
Note that I might add K_D to the KEY values; the set of keys is completely dynamic.
What I want is a query that returns me the following:
QUERY_TABLE
ID | NAME | K_A | K_B | K_C | others K_...
1 | N_A | V_A | V_B | V_C | ...
2 | N_B | V_D | | V_E | ...
3 | N_C | V_F | V_G | V_H | ...
Is this possible? If so, how?
Since there can be values missing, you need the "safe" form of crosstab() with the column names as second parameter:
SELECT * FROM crosstab(
'SELECT p.id, p.name, c.key, c."value"
FROM parent_table p
LEFT JOIN child_table c ON c.parent_table_id = p.id
ORDER BY 1'
,$$VALUES ('K_A'::text), ('K_B'), ('K_C')$$)
AS t (id int, name text, k_a text, k_b text, k_c text); -- use actual data types
Details in this related answer:
PostgreSQL Crosstab Query
About adding "extra" columns:
Pivot on Multiple Columns using Tablefunc
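Note that crosstab() needs its output columns spelled out in the AS clause, so a new key such as K_D means editing the query. When the key set is truly dynamic, one option is to fetch the joined rows as they are and pivot them on the client. The following Python sketch shows that idea on the sample data from the question; it is an illustration with a hypothetical `pivot` helper, not something from the answer.

```python
parents = [(1, "N_A"), (2, "N_B"), (3, "N_C")]
children = [  # (id, parent_table_id, key, value)
    (1, 1, "K_A", "V_A"), (2, 1, "K_B", "V_B"), (3, 1, "K_C", "V_C"),
    (5, 2, "K_A", "V_D"), (6, 2, "K_C", "V_E"),
    (7, 3, "K_A", "V_F"), (8, 3, "K_B", "V_G"), (9, 3, "K_C", "V_H"),
]


def pivot(parents, children):
    """Return (header, rows) with one output column per distinct key."""
    keys = sorted({key for _, _, key, _ in children})
    values = {(pid, key): value for _, pid, key, value in children}
    header = ["ID", "NAME"] + keys
    # Missing (parent, key) combinations become None, like the empty cell for N_B / K_B.
    rows = [[pid, name] + [values.get((pid, k)) for k in keys] for pid, name in parents]
    return header, rows


header, table = pivot(parents, children)
print(header)
for row in table:
    print(row)
```

Adding a K_D row to `children` produces an extra K_D column without any change to the code.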

PostgreSQL group by bug on Unicode strings?

I have a very weird thing happening, where I noticed that a group by (word) wasn't always grouping by word if that word is a UTF-8 string. In the same query, I get cases where it's been grouped correctly, and cases where it hasn't. I wonder if anybody knows what's up with that?
select *,count(*) over (partition by md5(word)) as k
from (
select word,count(*) as n
from :tmpwl
group by 1
) a order by 1,2 limit 12;
/* gives:
word | n | k
------+---+---
いい | 1 | 1
くず | 1 | 1
ごみ | 1 | 1
さま | 1 | 1
さん | 1 | 1
へま | 1 | 1
まめ | 1 | 1
よく | 1 | 1
ろく | 1 | 1
ネガ | 1 | 2 -- what the heck?
ネガ | 1 | 2
パス | 1 | 1
*/
Note that the following workaround works fine:
select word,n,count(*) over (partition by md5(word)) as k
from (
select md5(word),max(word) as word,count(*) as n
from :tmpwl
group by 1
) a order by 1,2 limit 12;
/* gives:
word | n | k
------+---+---
いい | 1 | 1
くず | 1 | 1
ごみ | 1 | 1
さま | 1 | 1
さん | 1 | 1
へま | 1 | 1
まめ | 1 | 1
よく | 1 | 1
ろく | 1 | 1
ネガ | 2 | 1
パス | 1 | 1
プア | 1 | 1
*/
The version is PostgreSQL 8.2.14 (Greenplum Database 4.0.4.0 build 3 Single-Node Edition) on x86_64-unknown-linux-gnu, compiled by GCC gcc.exe (GCC) 4.1.1 compiled on Nov 30 2010 17:20:26.
The source table :tmpwl:
\d :tmpwl
Table "pg_temp_25149.pdtmp_foo706453357357532"
Column | Type | Modifiers
----------+---------+-----------
baseword | text |
word | text |
value | integer |
lexicon | text |
nalts | bigint |
Distributed by: (word)