I have a table in PostgreSQL,
but the problem is that my data isn't stored in the proper date order.
For example, the first row of my table is '2017-05-30', and last row is '2017-02-23'.
So I want to "sort" my table by date.
I'm not asking about
SELECT * FROM MY_TABLE ORDER BY DATE;
I want to "update" my table.
How can I do this?
You can't sort a PostgreSQL table in the sense you ask.
In relational algebra, the order of rows is unimportant, and there is no guarantee that the rows of a table are stored in any specific order. There is also no way to ensure that rows are returned in a particular order unless you request one explicitly, e.g. with an ORDER BY clause. Otherwise, you shouldn't rely on the order of the returned rows.
As pointed out in the comments, an RDBMS may rearrange the order of rows in query results for optimization purposes, among other reasons.
You can, if you like, add a new sequence-number column populated with row_number(), indicating each row's rank with respect to your desired order (e.g. the date field).
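A minimal sketch, reusing the names from the question (my_table and its date column); the seq column is a new addition to hold the rank:

-- Add a column and fill it with each row's rank by date.
ALTER TABLE my_table ADD COLUMN seq integer;

UPDATE my_table
SET seq = ranked.rn
FROM (
    SELECT ctid, row_number() OVER (ORDER BY date) AS rn
    FROM my_table
) AS ranked
WHERE my_table.ctid = ranked.ctid;

Keep in mind this records the ranking only as of the moment you run it; rows inserted or updated later won't keep it current.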
For example, if I change the value of any record, the record is updated, but it then gets pushed to the last position, much as if I had deleted it and inserted a new one.
Without an ORDER BY clause, rows will be returned in whatever order is most convenient to the database engine.
In a simple SELECT, rows will usually be returned in the order they're stored on disk, and that can change at any time if the table is updated, vacuumed, clustered, or backed up and restored. But...
If the columns you select allow the database to do an index-only scan, then rows will be returned in a different order than when a table scan is used.
If it decides to do an index scan, it will probably return the rows in index order, but if it does a bitmap index scan or a sequential scan, they will probably come back in the table's on-disk order.
If your SELECT uses a JOIN, then it will scan one of the tables first, and that will influence the order of returned rows.
If it decides to use a hash-join or a merge-join or some other type of join, then the row order will also change.
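If you want to see which of these plans the engine actually picked for a given query, EXPLAIN will tell you. A minimal sketch against a hypothetical orders table:

EXPLAIN SELECT * FROM orders WHERE customer_id = 42;
-- The plan might read "Index Scan using orders_customer_id_idx on orders"
-- or "Seq Scan on orders", depending on table size and statistics.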
It's a common beginner mistake to forget that tables are sets, SELECT results are also sets, and sets are unordered. If your code relies on the order of unordered things, it will break sooner or later.
If you want a specific order, then you must use ORDER BY.
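For example (hypothetical table and column names), adding the primary key as a tiebreaker makes the order fully deterministic even when the first sort key has duplicates:

SELECT *
FROM orders
ORDER BY order_date DESC, id;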
I am writing a query to select all records from a table where a column value is contained in a CSV list. I found a suggestion that the best way to do this was to use the ARRAY functionality in PostgreSQL.
I have a table price_mapping and it has a primary key of id and a column customer_id of type bigint.
I want to return all records that have a customer ID in the array I will generate from the CSV.
I tried this:
select * from price_mapping
where ARRAY[customer_id] <@ ARRAY[5,7,10]::bigint[]
(the 5,7,10 part would actually be a CSV list inserted by my app)
But I am not sure that is right. In the application the array could contain tens of thousands of IDs, so I want to make sure I am using the approach with the best performance.
Is this the right way in PostgreSQL to retrieve a large collection of records by a predefined set of column values?
Thanks
Generally this is done with the SQL-standard IN operator.
select *
from price_mapping
where customer_id in (5,7,10)
I don't see any reason why using ARRAY would be faster. It might be slower, given that it has to build arrays, though the planner may have optimized that away.
In the past, this form performed better:
select *
from price_mapping
where customer_id = ANY(VALUES (5), (7), (10))
But new-ish versions of Postgres should optimize this for you.
Passing in tens of thousands of IDs might run up against a query size limit either in Postgres or your database driver, so you may wish to batch this a few thousand at a time.
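One way to sidestep that limit is to bind the whole list as a single array parameter instead of inlining literals. A minimal sketch, assuming your driver can bind a bigint[] to the $1 placeholder of a prepared statement:

SELECT *
FROM price_mapping
WHERE customer_id = ANY($1::bigint[]);

For batching, the application would split the full ID list into chunks of a few thousand and execute the statement once per chunk.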
As for the best performance, the answer is not to search for tens of thousands of IDs at all. Find something that relates them together, index that column, and search by that.
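A minimal sketch of that idea, assuming a hypothetical customer_group column that ties the wanted customers together:

-- Index the relating column once...
CREATE INDEX ON price_mapping (customer_group);

-- ...then search by it instead of by thousands of individual IDs.
SELECT *
FROM price_mapping
WHERE customer_group = 'wholesale';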
If your data is big enough, try this:
Read your CSV using an FDW (foreign data wrapper).
If you need this connection often, you might build a materialized view from it, holding only the needed columns. Refresh it whenever a new CSV is created.
Join your table against this foreign table or materialized view.
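A minimal sketch of that setup with the built-in file_fdw extension; the server name, file path, and column definition are assumptions:

CREATE EXTENSION IF NOT EXISTS file_fdw;
CREATE SERVER csv_files FOREIGN DATA WRAPPER file_fdw;

-- Expose the CSV as a foreign table holding only the needed column.
CREATE FOREIGN TABLE customer_ids_csv (customer_id bigint)
    SERVER csv_files
    OPTIONS (filename '/path/to/ids.csv', format 'csv');

-- Optional cache; refresh whenever a new CSV is dropped in place.
CREATE MATERIALIZED VIEW customer_ids AS
    SELECT customer_id FROM customer_ids_csv;
-- REFRESH MATERIALIZED VIEW customer_ids;

SELECT p.*
FROM price_mapping p
JOIN customer_ids USING (customer_id);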
Does anyone know why the order of the rows changed after I made an update to the table? Is there any way to restore the previous order, or to impose another order, e.g. alphabetical?
This is the update I performed:
update t set amount = amount + 1 where account = accountNumber
After this update when I go and see the table, the order has changed
A table doesn't have a natural row order; some database systems will actually refuse your query if you don't add an ORDER BY clause at the end of your SELECT.
Why did the order change?
Because the database engine fetches your rows in whatever physical order they come from storage. Some engines, like SQL Server, can have a CLUSTERED INDEX which forces a physical order, but even then you are never really guaranteed to get your results in that precise order.
The clustered index exists mostly as an optimization. PostgreSQL has a similar CLUSTER command to change the physical order, but it's a heavy process which locks the table: http://www.postgresql.org/docs/9.1/static/sql-cluster.html
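A minimal sketch against the table from the question (the index name is an assumption); note that CLUSTER takes an exclusive lock, rewrites the whole table, and the order is not maintained for rows written afterwards:

CREATE INDEX t_account_idx ON t (account);
CLUSTER t USING t_account_idx;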
How to force an alphabetical order of the rows?
Add an ORDER BY clause in your query.
SELECT * FROM table ORDER BY column
If I had a table of products and another table of manufacturers, and I wanted the manufacturers table to have a count of products, is there a way in Postgres to say "this column equals the number of rows in this other table that meet this condition"?
EDIT: I mean that the column value should be calculated automatically. So if I have a column for the number of products that are red, I want it to always equal the number of rows returned by select * from products where color='red';, without having to run that query myself each time.
You should not store calculated values in an operational database. If it's a data warehouse, go ahead.
You can use a view to do the calculation for you.
http://sqlfiddle.com/#!15/0b744/1
You can use a materialized view to increase performance, and refresh it with a trigger on the products table.
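A minimal sketch along those lines; the manufacturers/products schema is assumed from the question, and the FILTER clause (PostgreSQL 9.4+) covers the "red products" case:

CREATE VIEW manufacturer_product_counts AS
SELECT m.id,
       count(p.id)                                AS product_count,
       count(p.id) FILTER (WHERE p.color = 'red') AS red_product_count
FROM manufacturers m
LEFT JOIN products p ON p.manufacturer_id = m.id
GROUP BY m.id;

-- Materialized variant, refreshed by a statement-level trigger on products.
CREATE MATERIALIZED VIEW manufacturer_product_counts_mv AS
    SELECT * FROM manufacturer_product_counts;

CREATE FUNCTION refresh_product_counts() RETURNS trigger AS $$
BEGIN
    REFRESH MATERIALIZED VIEW manufacturer_product_counts_mv;
    RETURN NULL;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER products_refresh_counts
AFTER INSERT OR UPDATE OR DELETE ON products
FOR EACH STATEMENT EXECUTE PROCEDURE refresh_product_counts();

Be aware the refresh recomputes the whole view on every write to products, so this trades write throughput for read convenience.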
I want to remove duplicates from a large table with about 1 million rows, growing every hour. It has no unique id and about 575 columns, most of them sparsely filled.
The table is 'like' a log table: new entries are appended every hour, without a unique timestamp.
The duplicates are only about 1-3%, but I want to remove them anyway ;) Any ideas?
I tried the ctid approach (as here), but it's very slow.
The basic idea that works generally well with PostgreSQL is to create an index on the hash of the set of columns as a whole.
Example:
CREATE INDEX index_name ON tablename (md5((tablename.*)::text));
This will work unless there are columns that don't play well with the requirement of immutability (mostly timestamp with time zone because their cast-to-text value is session-dependent).
Once this index is created, duplicates can be found quickly by self-joining with the hash, with a query looking like this:
SELECT t1.ctid, t2.ctid
FROM tablename t1 JOIN tablename t2
ON (md5((t1.*)::text) = md5((t2.*)::text))
WHERE t1.ctid > t2.ctid;
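The cleanup itself can reuse the same indexed hash. A minimal sketch that keeps, from each group of duplicates, the row with the lowest ctid:

DELETE FROM tablename t1
USING tablename t2
WHERE md5((t1.*)::text) = md5((t2.*)::text)
  AND t1.ctid > t2.ctid;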
You may also use this index to avoid duplicate rows in the future rather than periodically de-duplicating them, by making it UNIQUE (duplicate rows would be rejected at INSERT or UPDATE time).
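A sketch of that unique variant, using the same expression (it would replace the plain index above):

CREATE UNIQUE INDEX tablename_row_md5_uidx
    ON tablename (md5((tablename.*)::text));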