Import column from file with additional fixed fields

Can I somehow import a column or columns from a file, where I specify one or more fields held fixed for all rows?
For example:
CREATE TABLE users(userid int PRIMARY KEY, fname text, lname text);
COPY users (userid,fname) from 'users.txt';
but where lname is assumed to be 'SMITH' for all the rows in users.txt?
My actual setting is more complex, where the field I want to supply for all rows is part of the PRIMARY KEY.
Possibly something of this nature:
COPY users (userid,fname,'smith' as lname) from 'users.txt';

Since I can't find a native solution to this in Cassandra, my solution was to perform a preparation step with Perl so the file contained all the relevant columns prior to calling COPY. This works fine, although I would prefer an answer that avoided this intermediate step.
e.g. adding a column with 'Smith' for every row to users.txt and calling:
COPY users (userid,fname,lname) from 'users.txt';

Related

Basic questions about Cloud SQL

I'm trying to populate a cloud sql database using a cloud storage bucket, but I'm getting some errors. The csv has the headers (or column names) as first row and does not have all the columns (some columns in the database can be null, so I'm loading the data I need for now).
The database is in postgresql and this is the first database in GCP I'm trying to configure and I'm a little bit confused.
Does it matter if the csv file has the column names?
Does the order of the columns in the csv file matter? (I guess it does if not all of them are present in the csv)
The PK of the table is a serial number, which I'm not including in the csv file. Do I also need to include the PK? I mean, because it's a serial number it should be "auto assigned", right?
Sorry for the noob questions and thanks in advance :)
This is all covered by the COPY documentation.
It matters in that you will have to specify the HEADER option so that the first line is skipped:
[...] on input, the first line is ignored.
The order matters, and if the CSV file does not contain all the columns in the same order as the table, you have to specify them with COPY:
COPY mytable (col12, col2, col4, ...) FROM '/dir/afile' WITH (...);
Same as above: if you omit a table column from the column list, it will be filled with its default value, which in this case is the autogenerated serial number.
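A concrete form of that command for the case described here, with a header row and the serial PK left out of the column list entirely, might look like this (the table and column names are hypothetical):
COPY mytable (col2, col4, col7)
FROM '/dir/afile.csv'
WITH (FORMAT csv, HEADER);
With psql you would typically run the same thing as \copy from the client side, so the file can live on your own machine rather than on the database server.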

I have two columns, I want the second column to have the same values as the first column

I have two columns, I want the second column to have the same values as the first column always, in PostgreSQL.
The columns are landmark_id (integer) and name (varchar); I want the name column to always hold the same values (IDs) as landmark_id.
landmark_id (integer) | name (varchar)
1                     | 1
2                     | 2
3                     | 3
I don't understand why you would want to do that, but I can think of two ways to accomplish your request. One is by using a generated column
CREATE TABLE g (
    landmark_id int,
    name varchar(100) GENERATED ALWAYS AS (landmark_id::varchar) STORED
);
and the other is by enforcing a constraint
CREATE TABLE c (
    landmark_id int,
    name varchar(100),
    CONSTRAINT equality_cc CHECK (landmark_id::varchar = name)
);
Both approaches will cause the name column to occupy disk space. The first approach will not allow you to specify the name column in INSERT or UPDATE statements. In the latter case, you will be forced to specify both columns when inserting.
You could also have used a trigger to update the second column.
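For completeness, a minimal sketch of that trigger approach could look like the following (PostgreSQL 11+ syntax; the table, function, and trigger names are made up):
CREATE TABLE t (
    landmark_id int,
    name        varchar(100)
);

CREATE FUNCTION copy_landmark_id() RETURNS trigger AS $$
BEGIN
    -- keep name in sync with landmark_id on every insert or update
    NEW.name := NEW.landmark_id::varchar;
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER sync_name
BEFORE INSERT OR UPDATE ON t
FOR EACH ROW EXECUTE FUNCTION copy_landmark_id();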
Late edit: Others suggested using a view. I agree that it's a better idea than what I wrote.
Create a view, as suggested by @jarlh in the comments. This generates the name column for you on the fly. It is usually preferable to storing essentially the same data multiple times in an actual table, where it occupies more disk space and can also get out of sync. For example:
CREATE VIEW landmarks_names AS
SELECT landmark_id,
landmark_id::text AS name
FROM landmarks;
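A query against the view then returns the derived name with no extra storage, for example:
SELECT landmark_id, name
FROM landmarks_names
WHERE name = '3';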

Is it possible to insert and replace rows with pgloader?

My use case is the following: I have data coming from a CSV file and I need to load it into a table (so far so good, nothing new here). It might happen that the same data is sent again with updated columns, in which case I would like to insert new rows and replace the existing ones in case of a duplicate key.
So my table is as follows:
CREATE TABLE codes (
    code TEXT NOT NULL,
    position_x INT,
    position_y INT,
    PRIMARY KEY (code)
);
And incoming csv file is like this:
TEST01,1,1
TEST02,1,2
TEST03,1,3
TEST04,1,4
It might happen that sometime in the future I get another csv file with:
TEST01,1,1000 <<<<< updated value
TEST05,1,5
TEST06,1,6
TEST07,1,7
Right now, what happens is this: when I run it for the first file everything is fine, but when I execute it for the second one I get an error:
2017-04-26T10:33:51.306000+01:00 ERROR Database error 23505: duplicate key value violates unique constraint "codes_pkey"
DETAIL: Key (code)=(TEST01) already exists.
I load data using:
pgloader csv.load
And my csv.load file looks like this:
LOAD CSV
FROM 'codes.csv' (code, position_x, position_y)
INTO postgresql://localhost:5432/codes?tablename=codes (code, position_x, position_y)
WITH fields optionally enclosed by '"',
fields terminated by ',';
Is what I'm trying to do possible with pgloader?
I also tried dropping the constraints for the primary key, but then I end up with duplicate entries in the table.
Thanks a lot for your help.
No, you can't. As per the reference documentation:
To work around that (load exceptions, eg PK violations), pgloader cuts the data into batches of 25000 rows
each, so that when a problem occurs it's only impacting that many rows
of data.
(the text in brackets is mine)
The best you can do is load the CSV into a table with the same structure and then merge the data with a query (EXCEPT, OUTER JOIN ... WHERE ... IS NULL, and so on), as sketched below.
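For example, on PostgreSQL 9.5 or later one way to do that merge is an upsert from a staging table (the name codes_staging below is made up, and the sketch assumes each code appears at most once per CSV):
-- staging table with the same columns but no primary key
CREATE TABLE codes_staging (LIKE codes INCLUDING DEFAULTS);

-- point the pgloader script (or COPY) at codes_staging instead of codes, then:
INSERT INTO codes (code, position_x, position_y)
SELECT code, position_x, position_y
FROM codes_staging
ON CONFLICT (code) DO UPDATE
SET position_x = EXCLUDED.position_x,
    position_y = EXCLUDED.position_y;

TRUNCATE codes_staging;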

(PostgreSQL) Copying from CSV when all the data in a "column" are references (foreign keys) to another table

I'm using the COPY command to import data into my database, and it works fine when I'm using it on tables with no triggers or references to other tables.
When there are, I get the message:
"Query returned successfully: 0 rows affected, xxxx ms execution time"
When I disable all the reference triggers (foreign keys) and UpdRel, it works, but those columns come back empty.
The tables were created with CMDB software (CMDBuild), and the reference columns have an integer type; in other words, I'm trying to import integer numbers that are the same as a "code" column from another table.
The COPY command I'm using:
COPY "TableName"("Col1","Col2","Col3"...) FROM 'C:\file.csv' DELIMITER ';' CSV;
An example of the Tables:
table1:
Name: Servers
Attributes: Code(Referenced1), Description, Operational_System, IP, Domain, etc
table2:
Name: Hard_Disk
Attributes: Code(Referenced2), Description, Size, Interface, Serial, etc
table3
Name: Servers_X_HD
Attributes: Code, Description, Server(Reference1), Hard_Disk(Reference2), Size, etc..
I have successfully imported all the Servers and HD data; I'm having trouble importing table3, which has the foreign key triggers.
I'm sorry if it's confusing, I'm not really that familiar with SQL or coding in general; I'm mostly a google-search-learner, besides a quick Java course I took.
If there is any info anyone needs to help me, I'll gladly provide it within my reach.
I've discovered the issue. There was another column on all of those tables, labelled "id", which was what was actually being used as the reference for the foreign keys.
The "code" column was just a "disguise" to make viewing in CMDBuild easier (it holds user-defined integer numbers instead of the ids randomly generated by Postgres).
By replacing the numbers in the "code" column with the ones from the "id" column, I managed to copy the info from the CSVs with no issues.
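For anyone in the same situation, here is a rough sketch of how that code-to-id translation could also be done in SQL after a staged load (all table and column names below are hypothetical, modelled loosely on the example tables above):
-- load the CSV into an unconstrained staging table first
CREATE TABLE servers_x_hd_staging (
    code integer,
    description text,
    server_code integer,    -- the user-facing "code" of the server
    hard_disk_code integer  -- the user-facing "code" of the hard disk
);

COPY servers_x_hd_staging FROM 'C:\file.csv' DELIMITER ';' CSV;

-- then translate each "code" into the internal "id" that the foreign keys expect
INSERT INTO "Servers_X_HD" ("Code", "Description", "Server", "Hard_Disk")
SELECT s.code, s.description, srv.id, hd.id
FROM servers_x_hd_staging s
JOIN "Servers" srv ON srv."Code" = s.server_code
JOIN "Hard_Disk" hd ON hd."Code" = s.hard_disk_code;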

How to use BULK INSERT when rows depend on foreign keys values?

My question is related to this one I asked on ServerFault.
Based on this, I've considered using BULK INSERT. I now understand that I have to prepare a file for each entity I want to save into the database. No matter what, I still wonder whether this BULK INSERT will avoid the memory issue on my system described in the referenced question on ServerFault.
As for the Streets table, it's quite simple! I have only two cities and five sectors to care about as the foreign keys. But then, how about the Addresses? The Addresses table is structured like this:
AddressId int not null identity(1,1) primary key
StreetNumber int null
NumberSuffix_Value int not null DEFAULT 0
StreetId int null references Streets (StreetId)
CityId int not null references Cities (CityId)
SectorId int null references Sectors (SectorId)
As I said on ServerFault, I have about 35,000 addresses to insert. Shall I memorize all the IDs? =P
And then I have the citizens to insert, who have an association with the addresses.
PersonId int not null identity(1,1) primary key
Surname nvarchar not null
FirstName nvarchar not null
IsActive bit
AddressId int null references Addresses (AddressId)
The only thing I can think of is to force the IDs to static values, but then I lose any flexibility I had with my former INSERT..SELECT strategy.
What are my options, then?
If I force the IDs to always be the same, then I have to SET IDENTITY_INSERT ON so that I can force the values into the table; this way I always have the same IDs for each of my rows, just as suggested here.
How do you BULK INSERT with foreign keys? I can't find any docs on this anywhere. =(
Thanks for your kind assistance!
EDIT
I edited this in order to include the BULK INSERT SQL instruction that finally did it for me!
I had my Excel workbook ready with the information I needed to insert. So I simply created a few supplemental worksheets and began writing formulas to "import" the data into these new sheets. I had one for each of my entities.
Streets;
Addresses;
Citizens.
As for the two other entities, it wasn't worth bulk inserting them, as I had only two cities and five sectors (city subdivisions) to insert. Once both the cities and sectors were inserted, I noted their respective IDs and began preparing my record sets for bulk insert. Using the power of Excel to compute the values and to "import" the foreign keys was a charm in itself, by the way. Afterwards, I saved each of the worksheets to a separate CSV file. My records were then ready to be bulk inserted.
USE [DatabaseName]
GO
delete from Citizens
delete from Addresses
delete from Streets
BULK INSERT Streets
FROM N'C:\SomeFolder\SomeSubfolder\Streets.csv'
WITH (
FIRSTROW = 2
, KEEPIDENTITY
, FIELDTERMINATOR = N','
, ROWTERMINATOR = N'\n'
, CODEPAGE = N'ACP'
)
GO
FIRSTROW
Indicates the row number at which to begin the insert. In my situation my CSVs contained the column headers, so the second row was the one to begin with. As an aside, one could start anywhere in the file, say at the 15th row.
KEEPIDENTITY
Allows one to bulk-insert the entity IDs specified in the file even though the table has an identity column. This option has the same effect as SET IDENTITY_INSERT my_table ON before an insert when you wish to insert rows with precise IDs.
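In other words, the manual equivalent is something along these lines (the Name column here is hypothetical):
SET IDENTITY_INSERT Streets ON;

-- the explicit StreetId from the file is kept instead of a newly generated identity value
INSERT INTO Streets (StreetId, Name) VALUES (42, N'Main Street');

SET IDENTITY_INSERT Streets OFF;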
As for the other parameters, they speak for themselves.
Now that this is explained, the same code was repeated for each of the two remaining entities to insert, Addresses and Citizens. And because KEEPIDENTITY was specified, all of my foreign keys remained intact, even though my primary keys are set as identities in SQL Server.
Only a few tweaks though: do the exact same thing marc_s said in his answer and import your data as fast as you can into a staging table with no restrictions at all. This way you'll make your life much easier, while still following good practices. =)
The basic idea is to bulk insert your data into a staging table that doesn't have any restrictions, any constraints etc. - just bulk load the data as fast as you can.
Once you have the data in the staging table, then you need to start to worry about constraints etc. when you insert the data from the staging table into the real tables.
Here, you could e.g.
insert only those rows into your real work tables that match all the criteria (and mark them as "successfully inserted" in your staging table)
handle all rows that are left in the staging table that aren't successfully inserted by some error / recovery process - whatever that could be: printing a report with all the "problem" rows, tossing them into an "error bin" or whatever - totally up to you.
Key point is: the actual BULK INSERT should be into a totally unconstrained table - just load the data as fast as you can - and only then, in a second step, start to worry about constraints, lookup data, references and things like that.
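As a rough illustration of that two-step approach, a sketch along these lines could work (the staging table and the column subset are made up, loosely following the Addresses table above; only the CityId lookup is checked here):
-- 1) load as fast as possible into an unconstrained staging table
CREATE TABLE Addresses_Staging (
    AddressId    int,
    StreetNumber int,
    StreetId     int,
    CityId       int,
    SectorId     int
);

BULK INSERT Addresses_Staging
FROM N'C:\SomeFolder\SomeSubfolder\Addresses.csv'
WITH (FIRSTROW = 2, FIELDTERMINATOR = N',', ROWTERMINATOR = N'\n');

-- 2) move only the rows whose city lookup resolves into the real table
SET IDENTITY_INSERT Addresses ON;

INSERT INTO Addresses (AddressId, StreetNumber, StreetId, CityId, SectorId)
SELECT s.AddressId, s.StreetNumber, s.StreetId, s.CityId, s.SectorId
FROM Addresses_Staging s
JOIN Cities c ON c.CityId = s.CityId;   -- rows with an unknown CityId stay behind in staging

SET IDENTITY_INSERT Addresses OFF;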