Performance of text column update in PostgreSQL

I would like to store some potentially large log files in a text column in Postgres. And I would like to frequently append additional content to these files using an UPDATE statement. Will this be efficient?
I'd like to do something like this:
UPDATE mytable
SET mytextlog = mytextlog || 'This will be appended'
WHERE id = 1
Will Postgres simply append the additional content without rewriting what is in the column already? Or will it read the content of the column into memory, append the additional content, and then write back the entire column? I realize this is an implementation detail. I'm just a little concerned that what I'm planning to do will be unnecessarily slow as the text being stored becomes large. Thanks in advance!

Related

What is the best way to store a large set of single-column data in PostgreSQL

I have a massive list of strings in a text file; the file is about 100GB uncompressed.
Each line of the text file is a single word (roughly 50 characters long), with no spaces or punctuation.
The table will be created and populated from this text file once; further updates to the table will not be necessary, and if it helps, the table can become read-only.
The use-case is a function which would look something like this:
/**
* Search the table and return true if the word exists, false if not.
*/
wordExists(wordToCheck: string): boolean {}
I'm looking for advice here on what would be the best way to store the data to ensure that lookups are as fast and efficient as possible.
I'm not sure whether breaking the words up into parts to try to assist in indexing would help, and I'm also not sure whether it would help to partition this list.
Anyone have any advice for me?
100GB is fairly large; assuming each word averages 50 characters in length, that works out to roughly 2 billion rows in this single-column table. This is not so large that it is beyond the capability of Postgres.
I suggest creating and populating your table, and then adding a hash index:
CREATE INDEX idx ON yourTable USING HASH (text_col);
Now any query against this table should be able to use the index for very rapid lookup. For example, to see if a word exists, use:
SELECT EXISTS (SELECT 1 FROM yourTable WHERE text_col = 'meatballs');
If you run the explain plan, you should see the hash index being used with a fast lookup time.
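Putting the pieces together, a sketch of the whole flow might look like this (the file path is a placeholder, and COPY assumes the file is readable by the server; from psql you could use \copy instead):
CREATE TABLE yourTable (
    text_col text NOT NULL
);

-- Bulk-load the word list, one word per line.
COPY yourTable (text_col) FROM '/path/to/words.txt';

-- Hash index for fast equality lookups, then refresh statistics.
CREATE INDEX idx ON yourTable USING HASH (text_col);
ANALYZE yourTable;

-- The wordExists() lookup:
EXPLAIN ANALYZE
SELECT EXISTS (SELECT 1 FROM yourTable WHERE text_col = 'meatballs');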

Load temp table data into a text file using the temp table's handle

I have created a temp table. I want to write all the data of the temp table, including field names, to a text file using the temp table's handle. What can I do?
Using the default buffer handle of the temp table (hTable:DEFAULT-BUFFER-HANDLE), you can then loop through the fields of the table.
DO i = 1 TO hBufferHandle:NUM-FIELDS:
    /* hBufferHandle:BUFFER-FIELD(i) gives the handle of field i */
    ...
END.
You would run that loop more than once: once to output the field names as headers to your text file, and then once per record of the temp table to export the values.
You will have to handle things like extents.
You will have to deal with data types, making decisions on what to do, depending on what you have in your table.
In theory it's not very complex code, and you could write a simple reusable library to do the work.
Use the documentation to find the full syntax.

Filling additional columns of an internal table with additional data via a SELECT statement? Can this be done?

SELECT matnr ersda ernam laeda
FROM mara
INTO CORRESPONDING FIELDS OF TABLE gt_mara
UP TO 100 ROWS.
At this point I have 100 entries in the itab gt_mara.
SELECT aenam vpsta pstat lvorm mtart
FROM mara
INTO CORRESPONDING FIELDS OF TABLE gt_mara
FOR ALL ENTRIES IN gt_mara
WHERE matnr = gt_mara-matnr AND
ersda = gt_mara-ersda AND
ernam = gt_mara-ernam AND
laeda = gt_mara-laeda.
At this point I have 59 entries, which makes sense. This code is buggy, because it may be modifying its own selection criteria at run time.
Anyway, what I intended was this: select the first 4 fields of the table at one point, and then select the other 5 at some other point.
Of course, this is just an example. Perhaps the second select would be done on a different table with the same key or with a different number of fields.
So can this even be done?
Are there more efficient methods to achieve this than what comes to my mind by default (redoing the complete select)?
OK, I think the essence of your question is whether you can update certain unfilled fields in an internal table directly through a second SELECT statement.
The answer is no. Your second select statement would replace the contents in table gt_mara, so you would be left with an internal table where the first 4 fields are blank, and the last 5 are filled.
The best you could do is something like this:
SELECT matnr ersda ernam laeda
FROM mara
INTO CORRESPONDING FIELDS OF TABLE gt_mara
UP TO 100 ROWS.
SELECT matnr aenam vpsta pstat lvorm mtart
FROM mara
INTO CORRESPONDING FIELDS OF TABLE gt_mara2
FOR ALL ENTRIES IN gt_mara
WHERE matnr = gt_mara-matnr AND
ersda = gt_mara-ersda AND
ernam = gt_mara-ernam AND
laeda = gt_mara-laeda.
LOOP AT gt_mara2 INTO ls_mara.
  MODIFY gt_mara FROM ls_mara TRANSPORTING aenam vpsta pstat lvorm mtart
    WHERE matnr = ls_mara-matnr.
ENDLOOP.
This is obviously quite inefficient, which is why you would always try to make the database do as much of the work for you before you bring the data back to the application server. Obviously if the data is coming from the same table selecting it all in one go is going to be your best option. In most cases even if the data is in different tables you would be better off creating a view or using a join.
In rare cases it is necessary to loop at your internal table to fill in some fields that were not available to you when you did the original select.
Either SELECT everything you need right away (which is the preferred solution if the data comes from the same table) or SELECT the additional stuff later (which is a good idea if that stuff comes from a different table that is not used for the first selection). For assembling the result set, the database usually needs to access the entire dataset anyway, so it doesn't really hurt to select some additional fields - in contrast to hitting the database again with a massive SELECT statement (if the FOR ALL ENTRIES table gets large). Also bear in mind that - depending on the kind of processing you're doing - the contents of the table might have changed in the meantime. If the database transaction (LUW) ends (which is always the case between dialog steps), you lose the database-level transaction isolation.

How to tell if record has changed in Postgres

I have a bit of an "upsert" type of question... but I want to throw it out there because it's a little bit different from any that I've read on Stack Overflow.
Basic problem.
I'm working on moving from MySQL to PostgreSQL 9.1.5 (hosted on Heroku). As part of that, I need to import multiple CSV files every day. Some of the data is sales information and is almost guaranteed to be new, so it will need to be inserted. But other parts of the data are almost guaranteed to be the same. For example, the CSV files (note plural) will have POS (point of sale) information in them. This rarely changes (and is most likely only via additions). Then there is product information. There are about 10,000 products (the vast majority will be unchanged, but it's possible to have both additions and updates).
The final item (but an important one) is that I have a requirement to be able to provide an audit trail/information for any given item. For example, if I add a new POS record, I need to be able to trace it back to the file it was found in. If I change a UPC code or description of a product, then I need to be able to trace it back to the import (and file) where the change came from.
Solution that I'm contemplating.
Since the data is provided to me via CSV, I'm working around the idea that COPY will be the best/fastest way. The structure of the data in the files is not exactly what I have in the database (i.e. the final destination), so I'm copying them into tables in a staging schema that match the CSV (note: one schema per data source). The tables in the staging schemas will have before-insert row triggers. These triggers can decide what to do with the data (insert, update or ignore).
For the tables that are most likely to contain new data, the trigger will try to insert first. If the record is already there, it will return NULL (and stop the insert into the staging table). For tables that rarely change, it will query the table and see if the record is found. If it is, then I need a way to see if any of the fields have changed (because, remember, I need to show that the record was modified by import x from file y). I obviously could just boilerplate out the code and test each column, but I was looking for something a little more "eloquent" and more maintainable than that.
In a way, what I'm doing is combining an importing system with an audit trail system. So, in researching audit trails, I reviewed a wiki.postgresql.org article on audit trails. It seems like hstore might be a nice way of getting at the changes (and being able to easily ignore some columns in the table that aren't important - e.g. "last_modified").
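As an illustration of that hstore approach, a row diff can be sketched like this (a sketch only; products and products_staging are hypothetical table names, and it assumes the hstore extension is installed):
CREATE EXTENSION IF NOT EXISTS hstore;

-- Pairs that differ between the incoming row and the stored row,
-- ignoring columns that don't matter, e.g. last_modified.
SELECT s.product_id,
       hstore(s) - hstore(p) - ARRAY['last_modified'] AS changed_fields
FROM   products_staging s
JOIN   products p USING (product_id)
WHERE  hstore(s) - hstore(p) - ARRAY['last_modified'] <> ''::hstore;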
I'm about 90% sure it will all work... I've created some testing tables etc and played around with it.
My question?
Is there a better, more maintainable way of accomplishing this task of finding the maybe 3 records out of 10K that require a change to the database? I could certainly write a Python script (or something else) that reads the file and tries to figure out what to do with each record, but that feels horribly inefficient and would lead to lots of round trips.
A few final things:
I don't have control over the input files. I would love it if they only sent me the deltas, but they don't and it's completely outside of my control or influence.
The system is growing, and new data sources are likely to be added that will greatly increase the amount of data being processed (so I'm trying to keep things efficient).
I know this is not a nice, simple SO question (like "how to sort a list in Python"), but I believe one of the great things about SO is that you can ask hard questions and people will share their thoughts on the best way to solve them.
I have lots of similar operations. What I do is COPY to temporary staging tables:
CREATE TEMP TABLE target_tmp AS
SELECT * FROM target_tbl LIMIT 0; -- only copy structure, no data
COPY target_tmp FROM '/path/to/target.csv';
For performance, run ANALYZE - temp. tables are not analyzed by autovacuum!
ANALYZE target_tmp;
Also for performance, maybe even create an index or two on the temp table, or add a primary key if the data allows for that.
ALTER TABLE target_tmp ADD CONSTRAINT target_tmp_pkey PRIMARY KEY (target_id);
You don't need the performance stuff for small imports.
Then use the full scope of SQL commands to digest the new data.
For instance, if the primary key of the target table is target_id ..
Maybe DELETE what isn't there any more?
DELETE FROM target_tbl t
WHERE NOT EXISTS (
SELECT 1 FROM target_tmp t1
WHERE t1.target_id = t.target_id
);
Then UPDATE what's already there:
UPDATE target_tbl t
SET col1 = t1.col1
FROM target_tmp t1
WHERE t.target_id = t1.target_id
To avoid empty UPDATEs, simply add:
...
AND col1 IS DISTINCT FROM t1.col1; -- repeat for relevant columns
Or, if the whole row is relevant:
...
AND t IS DISTINCT FROM t1; -- check the whole row
Then INSERT what's new:
INSERT INTO target_tbl(target_id, col1)
SELECT t1.target_id, t1.col1
FROM target_tmp t1
LEFT JOIN target_tbl t USING (target_id)
WHERE t.target_id IS NULL;
Clean up if your session goes on (temp tables are dropped at end of session automatically):
DROP TABLE target_tmp;
Or use ON COMMIT DROP or similar with CREATE TEMP TABLE.
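For example, a self-cleaning variant might look like this (a sketch using the same hypothetical names as above; ON COMMIT DROP only makes sense inside an explicit transaction):
BEGIN;

CREATE TEMP TABLE target_tmp ON COMMIT DROP AS
SELECT * FROM target_tbl LIMIT 0;

COPY target_tmp FROM '/path/to/target.csv';
-- ... digest the data as above ...

COMMIT;  -- target_tmp is dropped automatically here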
Code untested, but should work in any modern version of PostgreSQL except for typos.

How to alter Postgres table data based on its contents?

This is probably a super simple question, but I'm struggling to come up with the right keywords to find it on Google.
I have a Postgres table that has, among its contents, a column of type text named content_type, which stores what type of entry is stored in that row.
There are only about 5 different types, and I decided I want to change one of them to display as something else in my application (I had been directly displaying these).
It struck me that it's funny that my view is being dictated by my database model, and I decided I would convert the types being stored in my database as strings into integers, and enumerate the possible types in my application with constants that convert them into their display names. That way, if I ever got the urge to change any category names again, I could just change it with one alteration of a constant. I also have the hunch that storing integers might be somewhat more efficient than storing text in the database.
First, a quick threshold question of, is this a good idea? Any feedback or anything I missed?
Second, and my main question, what's the Postgres command I could enter to make an alteration like this? I'm thinking I could start by renaming the old content_type column to old_content_type and then creating a new integer column content_type. However, what command would look at a row's old_content_type and fill in the new content_type column based off of that?
If you're finding that you need to change the display values, then yes, it's probably a good idea not to store them in a database. Integers are also more efficient to store and search, but I really wouldn't worry about it unless you've got millions of rows.
You just need to run an update to populate your new column:
UPDATE table_name
SET content_type = CASE WHEN old_content_type = 'a' THEN 1
                        WHEN old_content_type = 'b' THEN 2
                        ELSE 3 END;
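For the column shuffle described in the question, the surrounding DDL would be roughly this (a sketch; table_name is the placeholder used above, and the final DROP should only happen once you have verified the new data):
ALTER TABLE table_name RENAME COLUMN content_type TO old_content_type;
ALTER TABLE table_name ADD COLUMN content_type integer;
-- ... run the UPDATE above to populate the new column ...
ALTER TABLE table_name DROP COLUMN old_content_type;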
If you're on Postgres 8.4 then using an enum type instead of a plain integer might be a good idea.
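A sketch of that variant (the enum labels here are invented; substitute your actual five types, and note the cast only works if the stored strings match the labels exactly):
CREATE TYPE content_kind AS ENUM ('article', 'photo', 'video', 'link', 'other');

ALTER TABLE table_name ADD COLUMN content_type_enum content_kind;
UPDATE table_name SET content_type_enum = old_content_type::content_kind;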
Ideally you'd have these fields referring to a table containing the definitions of each type, via a foreign key constraint. This way you know that your database is clean and has no invalid values (i.e. referential integrity).
There are many ways to handle this:
1. Having a table for each field that can contain a number of values (i.e. like an enum) is the most obvious - but it breaks down when you have a table that requires many attributes.
2. You can use the entity-attribute-value (EAV) model, but beware that this is too easy to abuse and can cause problems when things grow.
3. You can use, or refer to, my implementation solution PET (Parameter Enumeration Tables). This is a halfway house between 1 and 2.
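A minimal sketch of option 1 (names are illustrative; table_name and the integer content_type column follow the previous answer):
CREATE TABLE content_types (
    id           integer PRIMARY KEY,
    display_name text NOT NULL
);

ALTER TABLE table_name
    ADD CONSTRAINT content_type_fk
    FOREIGN KEY (content_type) REFERENCES content_types (id);
With this in place, each display name lives in one row of content_types, so renaming a category later is a single UPDATE.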