I have a table that contains both metadata about the row as well as some numeric information. The metadata is much bigger (URLs and free text versus just a few numbers for the other part).
Most of my queries ignore the metadata, e.g. they are just adding up some subset of the numbers.
If I split metadata into a different table, would that make these queries meaningfully faster? My table has about 30 million rows.
Selecting all the columns of a wide table will reduce query performance:
SELECT * FROM TABLE1;
Selecting only the required columns avoids reading data you don't need:
SELECT column1, column2 FROM TABLE1;
The disadvantages of having too many columns in PostgreSQL are explained in more detail elsewhere.
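As for physically splitting the table: a rough sketch of that vertical split (table and column names here are hypothetical) keeps the numbers in a narrow table and moves the bulky metadata to a side table joined on the primary key, which can make the aggregate queries scan much less data:
-- Hypothetical names; the narrow table holds only what the summing queries need.
CREATE TABLE measurements (
    id      bigint PRIMARY KEY,
    value_a numeric,
    value_b numeric
);
CREATE TABLE measurement_metadata (
    id        bigint PRIMARY KEY REFERENCES measurements (id),
    url       text,
    free_text text
);
-- The frequent aggregates touch only the narrow table...
SELECT sum(value_a) FROM measurements WHERE value_b > 0;
-- ...and the metadata is joined in only when it is actually needed.
SELECT m.id, m.value_a, md.url
FROM   measurements m
JOIN   measurement_metadata md USING (id);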
I am writing a query in my application code to select all records from a table where a column value is contained in a CSV list. I found a suggestion that the best way to do this was using the ARRAY functionality in PostgreSQL.
I have a table price_mapping and it has a primary key of id and a column customer_id of type bigint.
I want to return all records that have a customer ID in the array I will generate from the CSV.
I tried this:
select * from price_mapping
where ARRAY[customer_id] <@ ARRAY[5,7,10]::bigint[]
(the 5,7,10 part would actually be a csv inserted by my app)
But I am not sure that is right. In the application the array could contain tens of thousands of IDs, so I want to make sure I am doing this the best-performing way.
Is this the right way in PostgreSQL to retrieve a large collection of records by a pre-defined set of column values?
Thanks
Generally this is done with the SQL-standard in operator.
select *
from price_mapping
where customer_id in (5,7,10)
I don't see any reason using ARRAY would be faster. It might be slower given it has to build arrays, though it might have been optimized.
In the past this was more optimal:
select *
from price_mapping
where customer_id = ANY(VALUES (5), (7), (10))
But new-ish versions of Postgres should optimize this for you.
Passing in tens of thousands of IDs might run up against a query size limit either in Postgres or your database driver, so you may wish to batch this a few thousand at a time.
As for the best performance, the answer is to not search for tens of thousands of IDs. Find something which relates them together, index that column, and search by that.
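If your application can bind the whole list as a single parameter, another sketch worth considering (the $1 placeholder and the CSV-as-one-text-parameter approach are assumptions about what your driver supports) is to split the CSV server-side instead of building a huge literal:
select *
from price_mapping
where customer_id = ANY (string_to_array($1, ',')::bigint[]);
-- $1 is the raw CSV string, e.g. '5,7,10'; string_to_array splits it and the
-- cast produces a bigint array, so the statement text stays small no matter
-- how many IDs you pass.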
If your data is big enough, try this:
Read your CSV using an FDW (foreign data wrapper).
If you need this connection often, you might build a materialized view from it, holding only the needed columns. Refresh it whenever a new CSV is created.
Join your table against this foreign table or materialized view, as sketched below.
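A sketch of that approach using the built-in file_fdw (the server name, file path, and table names are made up for illustration):
CREATE EXTENSION IF NOT EXISTS file_fdw;
CREATE SERVER csv_files FOREIGN DATA WRAPPER file_fdw;
-- Foreign table that reads the CSV of customer IDs straight from disk.
CREATE FOREIGN TABLE csv_customer_ids (customer_id bigint)
    SERVER csv_files
    OPTIONS (filename '/path/to/customer_ids.csv', format 'csv');
-- Optionally materialize it and refresh whenever a new CSV arrives.
CREATE MATERIALIZED VIEW wanted_customers AS
    SELECT customer_id FROM csv_customer_ids;
-- REFRESH MATERIALIZED VIEW wanted_customers;
SELECT p.*
FROM   price_mapping p
JOIN   wanted_customers w USING (customer_id);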
I have a table in PostgreSQL,
but the problem is that my rows aren't stored in the order I want.
For example, the first row of my table is '2017-05-30', and last row is '2017-02-23'.
So I want to "sort" my table by date.
I'm not asking about
SELECT * FROM MY_TABLE ORDER BY DATE;
I want to "update" my table.
How can I do this?
You can't sort a PostgreSQL table in the sense you ask.
In relational algebra, the order of the rows is unimportant, and there is no guarantee that rows in a table are stored in any specific order. There is also no way to ensure that rows are returned in a particular order unless you specify it explicitly, e.g. with an ORDER BY clause. Otherwise, you shouldn't rely on the order of the returned rows.
As pointed out in the comments, the RDBMS may rearrange the order of rows in query results for optimization purposes and so on.
You can, if you like, add a new sequence-number column populated with row_number(), indicating each row's rank with respect to your desired order (e.g. the date column).
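For example (assuming a primary key column id and a date column called date, both hypothetical):
-- Add a persistent rank column reflecting the date order.
ALTER TABLE my_table ADD COLUMN date_rank integer;
UPDATE my_table t
SET    date_rank = r.rn
FROM  (SELECT id, row_number() OVER (ORDER BY date) AS rn
       FROM my_table) AS r
WHERE  t.id = r.id;
Note that this rank goes stale as rows are inserted or dates change; ORDER BY remains the only reliable way to get rows back in a given order.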
I'm having the following Redshift performance issue:
I have a table with ~ 2 billion rows, which has ~100 varchar columns and one int8 column (intCol). The table is relatively sparse, although there are columns which have values in each row.
The following query:
select colA from tableA where intCol = '111111';
returns approximately 30 rows and runs relatively quickly (~2 mins)
However, the query:
select * from tableA where intCol = '111111';
takes an undetermined amount of time (gave up after 60 mins).
I know pruning the columns in the projection is usually better but this application needs the full row.
Questions:
Is this just a fundamentally bad thing to do in Redshift?
If not, why is this particular query taking so long? Is it related to the structure of the table somehow? Is there some Redshift knob to tweak to make it faster? I haven't yet messed with the distkey and sortkey on the table, but it's not clear that those should matter in this case.
The main reason the first query is faster is that Redshift is a columnar database. A columnar database stores table data per column, writing the same column's data into the same blocks on storage. This behavior is different from a row-based database like MySQL or PostgreSQL. Because of this, since the first query selects only the colA column, Redshift does not need to access the other columns at all, while the second query accesses all columns, causing a huge amount of disk access.
To improve the performance of the second query, you may need to set the "sortkey" to the intCol column. When a column is the sortkey, its data is stored in sorted order on disk, which reduces the cost of disk access when fetching records with a condition on that column.
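A sketch of rebuilding the table with intCol as the sort key (the DISTKEY choice here is just an assumption; pick whatever fits your join and distribution pattern):
-- Rebuild with intCol as the sort key so Redshift's zone maps can skip blocks
-- whose min/max range does not contain the filtered value.
CREATE TABLE tableA_sorted
  DISTKEY (intCol)
  SORTKEY (intCol)
AS
SELECT * FROM tableA;
-- Newer Redshift versions can also change the sort key in place, e.g.:
-- ALTER TABLE tableA ALTER COMPOUND SORTKEY (intCol);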
We store billions of rows in an Infobright table which currently has about 45 columns. We want to add 50 more columns to it. Will adding these columns bring down the performance of reads? Is creating a new table for these columns a better option? Or, since Infobright is a column-oriented database, do 50 extra columns not matter much?
Thanks!
I think "adding these columns" will not "bring down the performance of reads" that do not use the added columns.
I think "creating a new table for these columns" is not "a better option".
Since "infobright is a column oriented database additions of 50 extra columns" should have no effect on the performance of queries that do not use the added columns.
The maximum number of columns for Infobrigh6t tables is 4096. However, that is if they are only TINYINT columns. I would suggest that you do not use more than 1000 columns. The key though is ensuring that in your SQL query that you do not do a SELECT * FROM. You should SELECT CustomerID, CustomerName FROM instead for ONLY those columns necessary to resolve your needs.
I have a SQL Server 2008 R2 database. I created a table, and when trying to execute a select statement (with an order by clause) against it, I receive the error "Cannot create a row of size 8870 which is greater than the allowable maximum row size of 8060."
I am able to select the data without an order by clause, however the order by clause is important and I require it. I have tried a ROBUST PLAN option but I still received the same error.
My table has 300+ columns with data type TEXT. I have tried using varchar and nvarchar, but have had no success.
Can someone please provide some insight?
Update:
Thanks for the comments. I agree, 300+ columns in one table is not very good design. What I'm trying to do is bring Excel tabs into the database as data tables. Some tabs have 300+ columns.
I first use a CREATE statement to create a table based on the excel tab so the columns vary. Then I do various SELECT, UPDATE, INSERT, etc statements on the table after the table is created with data.
The structure of the table usually follows this pattern:
fkVersionID, RowNumber(autonumber), Field1, Field2, Field3, etc...
Is there any way to get around the 8060-byte row size limit?
You mentioned that you tried nvarchar and varchar... remember that nvarchar uses two bytes per character, but it is the only one of the two that supports Unicode text (for example, characters outside your varchar collation's code page).
varchar is a good choice if you can limit its maximum size appropriately.
The 8,060-byte row limit is still real, but with 300 columns, if on average each varchar column holds no more than about 26 bytes (8,060 / 300 ≈ 26.8), you'll be okay.
You could go riskier and declare the columns as varchar(50) but on average only use 26 characters per column... meaning one column might hold 36 characters and the next 16, and then you are okay again (as long as you never exceed the average of 26 characters per column across the 300 columns).
Obviously, with a dynamic number of fields and the potential to far exceed that limit, the approach is doomed by SQL Server's specs.
Your only other alternative is to create multiple tables and, when you access the data, use a unique key to join the appropriate records. So in your select statement, use the join across the multiple tables, and then you can handle rows of 8000 + 8000 + ... bytes.
So it is doable, but you have to work within SQL Server's rules.
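A rough sketch of that split (the column names follow the pattern from the question, and the split point is arbitrary):
-- Two halves of the wide row, sharing the same key.
CREATE TABLE ImportPart1 (
    fkVersionID int NOT NULL,
    RowNumber   int NOT NULL,
    Field1      varchar(50) NULL,
    Field2      varchar(50) NULL,
    -- ... roughly the first half of the fields ...
    CONSTRAINT PK_ImportPart1 PRIMARY KEY (fkVersionID, RowNumber)
);
CREATE TABLE ImportPart2 (
    fkVersionID int NOT NULL,
    RowNumber   int NOT NULL,
    Field151    varchar(50) NULL,
    -- ... the remaining fields ...
    CONSTRAINT PK_ImportPart2 PRIMARY KEY (fkVersionID, RowNumber)
);
-- Reassemble the full row with a join; the ORDER BY here references only the
-- narrow key columns.
SELECT p1.*, p2.*
FROM   ImportPart1 p1
JOIN   ImportPart2 p2
  ON   p2.fkVersionID = p1.fkVersionID
 AND   p2.RowNumber   = p1.RowNumber
ORDER BY p1.fkVersionID, p1.RowNumber;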
I believe you're running into this limitation:
There is no limit to the number of items in the ORDER BY clause. However, there is a limit of 8,060 bytes for the row size of intermediate worktables needed for sort operations. This limits the total size of columns specified in an ORDER BY clause.
I had a legacy app like this, it was a nightmare.
First, I broke it into multiple tables, all one-to-one. This is bad, but less bad than what you've got.
Then I changed the queries to request only the columns that were actually needed. (I can't tell if you have that option.)