Adding a new column to a table which contains live data - tsql

I have a large table consisting of over 60 million records and I would like to add 2 new columns for data migration purposes. There are indexes on the table and some of them are large. So, by adding the 2 new columns, do I run the risk of slowing down the database whilst it adds them, and maybe timing out? Or will it just work?
I know that if I try to rearrange the columns SQL Server will ask me to drop and re-create the table, so I definitely don't want that. Is this something everyone is challenged with?

We've had the same problem with column and index changes on larger tables.
I would simply add the columns using ALTER TABLE. The column order, though nice, is irrelevant.
If the columns are NULLable then the time is reasonable. If you want to add a default value and make them NOT NULL, then this is obviously more work. However, I would consider adding them as NULL, then setting the values, then changing them to NOT NULL, making it 3 steps you can do at different times. We do this to reduce the time window we need, even if the whole process takes longer.
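A rough T-SQL sketch of that three-step approach; the table name, column names, types, and batch size are made up for illustration:
-- Step 1: add the columns as NULLable. Without a default this is a
-- metadata-only change in SQL Server, so it is quick even on 60m rows.
ALTER TABLE dbo.BigTable ADD MigrationFlag int NULL, MigrationNote varchar(50) NULL;

-- Step 2: backfill in small batches during quiet periods to limit
-- blocking and transaction log growth. Repeat until 0 rows are affected.
UPDATE TOP (10000) dbo.BigTable
SET MigrationFlag = 0, MigrationNote = ''
WHERE MigrationFlag IS NULL;

-- Step 3: once every row has a value, enforce NOT NULL.
ALTER TABLE dbo.BigTable ALTER COLUMN MigrationFlag int NOT NULL;
ALTER TABLE dbo.BigTable ALTER COLUMN MigrationNote varchar(50) NOT NULL;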

PostgreSQL slow update with index

Very simple update to reset one column in a table with approx 5 million rows:
UPDATE t_Daily
SET Price = NULL;
Price is not part of any of the indexes on that table.
Running this without indexes takes 45s.
Running this with one or more indexes takes at least 20 mins (I keep having to stop it).
I fully understand why maintaining indexes affects the performance of insert and update statements, but this update makes no changes to any indexed column, so why does it have this terrible effect on performance?
Any ideas much appreciated.
That is normal and expected: updating an index can be about ten times as expensive as updating the table itself. The table has no ordering, but the index does, so every new row version has to be entered at the correct position in each index.
If Price is not indexed, you can benefit from HOT updates, which avoid updating the indexes. To make use of that, the table has to be defined with a fillfactor under 100 so that updated rows can find room in the same block as the original rows.
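A hedged sketch of what that looks like in practice, using the table name from the question (the fillfactor value of 70 is just an example):
-- Reserve ~30% of each heap page so an updated row version can stay in
-- the same block and qualify as a HOT update.
ALTER TABLE t_Daily SET (fillfactor = 70);
-- The setting only affects pages written from now on; existing pages
-- keep their current fill until the table is rewritten.
-- Afterwards, pg_stat_user_tables shows how many updates were HOT:
SELECT n_tup_upd, n_tup_hot_upd FROM pg_stat_user_tables WHERE relname = 't_daily';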
Found some further info (thanks to Laurenz-Albe for the HOT tip).
This link https://malisper.me/postgres-heap-only-tuples/ states that
Due to MVCC, an update in Postgres consists of finding the row being updated, and inserting a new version of the row back into the database. The main downside to doing this is the need to re-add the row to every index.
So it re-writes every index entry despite only updating a column that is not in any index.

kdb+/q optimize union function

To give you a bit of background: I have a process which does a large, complex calculation which takes a while to complete. It runs on a timer. After some investigation I realised that what is causing the slowness isn't the actual calculation but the built-in q function, union.
I am trying to union two simple tables, table A and table B. A is approximately 5m rows and B is 500. Both tables have only two columns; the first column is a symbol. Table A is actually a compound primary key of a table. (Also, how do you copy directly from the console?)
n:5000000
big:([]sym:n?`4;val:n?100)
small:([]sym:500?`4;val:500?100)
\ts big union small
I tried keying both columns and upserting, join and then distinct, "big, small where not small in big" but nothing seems to work :(
Any help will be appreciated!
If you want to upsert into the big table it has to be keyed and the upsert operator should be used. For example:
n:5000000
//big ids are unique numbers from 0 to 4999999
//table is keyed with 1! operator
big:1!([]id:(neg n)?n;val:n?100)
//small ids are unique numbers: 250 from the 0-4999999 interval and 250 from the 5000000-9999999 interval
small:([]id:(-250?n),(n+-250?n);val:500?100)
If big is a global variable it is efficient to upsert into it by name:
`big upsert small
If big is local:
big: big upsert small
As a result big will have 5000250 rows, because 250 of the keys (id column) in small already exist in big.
This may not be relevant, but just a quick thought. If your big table has a column of type `sym and that column does not really show up that much throughout your program, why not cast it to string or another type? If you are doing this update process every single day, then as the data gets packed into your partitioned HDB, whenever new data is added the kdb+ process has to reassign/rewrite its sym file, and I believe this is the part that actually takes a lot of time, not the union calculation itself.
If the above is true, I'd suggest either rewriting your schema for the table to minimise the amount of rehashing (not sure if this is the right term though!) on your sym file, or, as the above person mentioned, trying to apply an attribute to your table; this may reduce the time too.

Efficient way to insert millions of rows, convert data and deal with it, on PostgreSQL+PostGIS

I have a big collection of data I want to use for user search later.
Currently I have 200 million resources (~50GB). For each, I have a latitude+longitude. The goal is to create a spatial index to be able to do spatial queries on it.
So for that, the plan is to use PostgreSQL + PostGIS.
My data are in a CSV file. I tried to use a custom function to avoid inserting duplicates, but after days of processing I gave up. I found a way to load it fast into the database: with COPY it takes less than 2 hours.
Then, I need to convert latitude+longitude into geometry format. For that I just need to do:
ST_SetSRID(ST_MakePoint(longi::double precision,lat::double precision),4326)
After some checking, I saw that for 200 million resources, I have only 50 million distinct points. So, I think the best way is to have a table "TABLE_POINTS" that will store all the points, and a table "TABLE_RESOURCES" that will store the resources with a point_key.
So I need to fill "TABLE_POINTS" and "TABLE_RESOURCES" from the temporary table "TABLE_TEMP" without keeping duplicates.
For "POINTS" I did:
INSERT INTO TABLE_POINTS (point)
SELECT DISTINCT ST_SetSRID(ST_MakePoint(longi::double precision,lat::double precision),4326)
FROM TABLE_RESOURCES
I don't remember how much time it took, but I think it was a matter of hours.
Then, to fill "RESOURCES", I tried:
INSERT INTO TABLE_RESOURCES (...,point_key)
SELECT DISTINCT ...,point_key
FROM TABLE_TEMP, TABLE_POINTS
WHERE ST_SetSRID(ST_MakePoint(longi::double precision,lat::double precision),4326) = point;
but again it takes days, and there is no way to see how far along the query is ...
Also, something important: the number of resources will continue to grow (currently about 100K are added per day), so storage should be optimized to keep fast access to the data.
So if you have any ideas for the loading or for optimizing the storage, you are welcome to share them.
Look into optimizing Postgres first (i.e. google Postgres unlogged tables, WAL and fsync options). Second, do you really need points to be unique? Maybe just have one table with resources and points combined and not worry about duplicate points, as it seems your duplicate lookup may be what's slow.
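A sketch of the unlogged-table idea for the staging step, reusing TABLE_TEMP and the longi/lat columns from the question (the file path and other columns are placeholders):
-- An unlogged table skips WAL, which makes the bulk COPY much faster,
-- at the price of the table being truncated after a crash.
CREATE UNLOGGED TABLE table_temp (
    longi text,
    lat   text
    -- plus the other CSV columns
);
COPY table_temp FROM '/path/to/resources.csv' WITH (FORMAT csv, HEADER true);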
For DISTINCT to work efficiently, you'll need a database index on those columns for which you want to eliminate duplicates (e.g. on the latitude/longitude columns, or even on the set of all columns).
So first insert all data into your temp table, then CREATE INDEX (this is usually faster than creating the index beforehand, as maintaining it during insertion is costly), and only afterwards do the INSERT INTO ... SELECT DISTINCT.
An EXPLAIN <your query> can tell you whether the SELECT DISTINCT now uses the index.
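A minimal sketch of that order of operations, reusing the TABLE_TEMP/TABLE_POINTS names from the question (the index name and column types are assumptions):
-- 1. Bulk load first, then build the index (cheaper than maintaining it
--    during the load), then refresh planner statistics.
CREATE INDEX idx_table_temp_coords ON table_temp (longi, lat);
ANALYZE table_temp;

-- 2. De-duplicate on the raw columns so the index can help, and convert
--    to geometry while inserting.
INSERT INTO table_points (point)
SELECT ST_SetSRID(ST_MakePoint(longi::double precision, lat::double precision), 4326)
FROM (SELECT DISTINCT longi, lat FROM table_temp) AS d;

-- 3. Check whether the DISTINCT step can actually use the index.
EXPLAIN SELECT DISTINCT longi, lat FROM table_temp;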

How should a table with two sets of almost duplicate column names be designed?

I have a table that has around 40 columns. The only difference in the column names is that the last 20 all start with "B" before the column name. This table is used for comparing: in other words, comparing the data in the first 20 columns to the data in the last 20 columns.
I know this is very bad design, so how should this table be redesigned, so that there are only 20 columns, yet we can still compare the data?
EDIT: if it helps, we also use this data to find a matched cohort
Also note that performance is the main concern here. By duplicating the columns, retrieving the data is extremely fast.
Thanks!
Two possible architectures and a query tip.
1) Build your table with a "Type" column, and use that to flag "primary" vs. "alternate". In your case, "A" vs. "B" might be appropriate.
2) Build a vertical partition, two identical tables (for primary and alternate data), that share a common primary key. (If Id = 42 is in one table, it must be in the other--unless "alternate" data is optional, in which case don't populate the second table.) Also optionally, have a third table that tracks all possible primary keys, along with any data that is known to always be common to both tables.
Tip: Read up on SELECT...EXCEPT and SELECT...INTERSECT. They run surprisingly quickly, and are ideal for comparing all columns and rows between two datasets for differences (EXCEPT) and matches (INTERSECT). You can use this fairly easily with either of the two structures, and it would work with your existing code as well (though it might be fussier to write the query).
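A hedged sketch of that tip against structure 2 (two identical tables); the table and column names are invented for illustration:
-- Rows whose attribute columns differ between the primary and alternate versions:
SELECT Id, Col01, Col02 /* ... Col20 */ FROM PrimaryData
EXCEPT
SELECT Id, Col01, Col02 /* ... Col20 */ FROM AlternateData;

-- Rows that match exactly in both versions:
SELECT Id, Col01, Col02 /* ... Col20 */ FROM PrimaryData
INTERSECT
SELECT Id, Col01, Col02 /* ... Col20 */ FROM AlternateData;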

How to alter Postgres table data based on its contents?

This is probably a super simple question, but I'm struggling to come up with the right keywords to find it on Google.
I have a Postgres table that has among its contents a column of type text named content_type. That stores what type of entry is stored in that row.
There are only about 5 different types, and I decided I want to change one of them to display as something else in my application (I had been directly displaying these).
It struck me as funny that my view is being dictated by my database model, so I decided to convert the types stored in my database from strings to integers, and enumerate the possible types in my application with constants that map them to their display names. That way, if I ever get the urge to change a category name again, I can do it by altering a single constant. I also have a hunch that storing integers might be somewhat more efficient than storing text in the database.
First, a quick threshold question: is this a good idea? Any feedback or anything I missed?
Second, and my main question, what's the Postgres command I could enter to make an alteration like this? I'm thinking I could start by renaming the old content_type column to old_content_type and then creating a new integer column content_type. However, what command would look at a row's old_content_type and fill in the new content_type column based off of that?
If you're finding that you need to change the display values, then yes, it's probably a good idea not to store them in a database. Integers are also more efficient to store and search, but I really wouldn't worry about it unless you've got millions of rows.
You just need to run an update to populate your new column:
update table_name
set content_type = (case when old_content_type = 'a' then 1
                         when old_content_type = 'b' then 2
                         else 3 end);
If you're on Postgres 8.4 then using an enum type instead of a plain integer might be a good idea.
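If you go the enum route instead of integers, here is a sketch; the table name and labels are made up, the labels must match the strings already stored, and it assumes the original text column has been renamed to old_content_type as the question describes:
-- Create an enum type holding the allowed labels, add the new column,
-- and populate it by casting the existing text values.
CREATE TYPE content_type_enum AS ENUM ('article', 'photo', 'video');
ALTER TABLE my_table ADD COLUMN content_type content_type_enum;
UPDATE my_table SET content_type = old_content_type::content_type_enum;
-- Once the application reads the new column, the old one can be dropped:
-- ALTER TABLE my_table DROP COLUMN old_content_type;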
Ideally you'd have these fields refer to a table containing the definitions of the types, via a foreign key constraint. This way you know that your database is clean and has no invalid values (i.e. referential integrity).
There are many ways to handle this:
Having a lookup table for each field that can contain a number of values (i.e. like an enum) is the most obvious approach, but it breaks down when you have a table that requires many attributes (see the sketch after this list).
You can use the Entity-attribute-value model, but beware that this is too easy to abuse and cause problems when things grow.
You can use, or refer to, my implementation solution PET (Parameter Enumeration Tables). This is a halfway house between 1 and 2.
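A minimal sketch of option 1 above, with invented table and column names: a lookup table of content types referenced by a foreign key.
-- The lookup table holds the definition of each type once.
CREATE TABLE content_types (
    id    integer PRIMARY KEY,
    label text NOT NULL UNIQUE
);

-- The data table stores only the id; the foreign key guarantees that no
-- row can reference a type that does not exist (referential integrity).
CREATE TABLE items (
    id              serial PRIMARY KEY,
    content_type_id integer NOT NULL REFERENCES content_types (id)
);

INSERT INTO content_types (id, label) VALUES (1, 'article'), (2, 'photo'), (3, 'video');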