pg_column_size reports vastly different sizes for table.* than specific columns - postgresql

I have a simple example where pg_column_size is reporting vastly different values. I think it has to do with whether or not it's considering TOASTed values, but I'm not sure. Here's the setup:
CREATE TABLE foo (bar TEXT);
INSERT INTO foo (bar) VALUES (repeat('foo', 100000));
SELECT pg_column_size(bar) as col, pg_column_size(foo.*) as table FROM foo;
What I'm seeing in Postgres 9.6 is,
┌──────┬────────┐
│ col  │ table  │
├──────┼────────┤
│ 3442 │ 300028 │
└──────┴────────┘
There's nearly two orders of magnitude of difference here. Thoughts? What's the right way for me to calculate the size of the row? One idea I have is:
SELECT pg_column_size(bar), pg_column_size(foo.*) - octet_length(bar) + pg_column_size(bar) FROM foo;
Which should subtract out the post-TOAST size and add in the TOAST size.
Edit: My proposed workaround only works on character columns; e.g., it won't work on JSONB.

The first value is the compressed size of the TOASTed value, while the second value is the uncompressed size of the whole row.
SELECT 'foo'::regclass::oid;
┌───────┐
│ oid │
├───────┤
│ 36344 │
└───────┘
(1 row)
SELECT sum(length(chunk_data)) FROM pg_toast.pg_toast_36344;
┌──────┐
│ sum │
├──────┤
│ 3442 │
└──────┘
(1 row)
foo.* (or foo for that matter) is a “whole-row reference” in PostgreSQL; its data type is foo (a composite type that is created automatically when the table is created).
PostgreSQL knows that foo.bar is stored externally, so it returns its size as it is in the TOAST table (compressed), but the whole-row value foo (a composite type) isn't stored externally, so you get the total uncompressed size.
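You can see both numbers side by side by comparing pg_column_size with octet_length, which always returns the detoasted, uncompressed length (a quick check reusing the foo table from the question):
SELECT octet_length(bar) AS uncompressed,
       pg_column_size(bar) AS on_disk
FROM foo;
Here uncompressed should be 300000 (the length of repeat('foo', 100000)) and on_disk the 3442 bytes shown above.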
See the relevant piece of code from src/backend/access/heap/tuptoaster.c:
Size
toast_datum_size(Datum value)
{
    struct varlena *attr = (struct varlena *) DatumGetPointer(value);
    Size        result;

    if (VARATT_IS_EXTERNAL_ONDISK(attr))
    {
        /*
         * Attribute is stored externally - return the extsize whether
         * compressed or not. We do not count the size of the toast pointer
         * ... should we?
         */
        struct varatt_external toast_pointer;

        VARATT_EXTERNAL_GET_POINTER(toast_pointer, attr);
        result = toast_pointer.va_extsize;
    }
    [...]
    else
    {
        /*
         * Attribute is stored inline either compressed or not, just calculate
         * the size of the datum in either case.
         */
        result = VARSIZE(attr);
    }
    return result;
}
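So if the goal is a row size that counts TOASTed values at their compressed, on-disk size, one approach is to sum pg_column_size over the individual columns instead of passing the whole-row value (a sketch; a and b are hypothetical column names, and the fixed tuple-header overhead is not included):
SELECT pg_column_size(a) + pg_column_size(b) AS data_size FROM some_table;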


Polars equivalent to SQL `COUNT(DISTINCT expr,[expr...])`, or other method of checking uniqueness

When processing data, I often add a check after each step to validate that the data still has the unique key I think it does. For example, I might check that my data is still unique on (a, b). To accomplish this, I would typically check that the number of distinct combinations of columns a and b equals the total number of rows.
In polars, to get a COUNT(DISTINCT ...) I can do
(
    df
    .select(['a', 'b'])
    .unique()
    .height
)
But height does not work on LazyFrames, so I think (?) this method forces me to materialize the entire dataset. Is there a better way?
For reference, in R's data.table library I would do
mtc_dt <- data.table::as.data.table(mtcars)
stopifnot(data.table::uniqueN(mtc_dt[, .(mpg, disp)]) == nrow(mtc_dt))
To any contributors reading:
Thanks for the great package! It has sped up many of my workflows to a fraction of their previous runtime.
You can use a map function that asserts on the unique count.
This allows you to get an eager DataFrame in the middle of a query plan.
Note that we turn off projection_pushdown optimization, as the optimizer is not able to know which subset of columns we select.
import polars as pl

df = pl.DataFrame({
    "foo": [1, 2, 3],
    "bar": [None, "hello", None]
})

def unique_check(df: pl.DataFrame, subset: list[str]) -> pl.DataFrame:
    # Fail if the subset of columns has fewer distinct combinations than rows.
    assert df.select(pl.struct(subset).unique().count()).item() == df.height
    return df

out = (df.lazy()
       .map(lambda df: unique_check(df, ["foo", "bar"]), projection_pushdown=False)
       .select("bar")
       .collect()
)
print(out)
shape: (3, 1)
┌───────┐
│ bar │
│ --- │
│ str │
╞═══════╡
│ null │
│ hello │
│ null │
└───────┘
Not turning off predicate_pushdown is better, but then we must ensure the subset is selected before the map.
The answer here provides a technique that can answer this question: gather the columns together in a struct column, and then apply .n_unique() to that struct. That question uses groupby, but it will work without groupby as well.
(
    df
    .with_column(pl.struct(['a', 'b']).alias('ident'))
    ['ident'].n_unique()
)
I was able to run code more or less identical to this on a dataset I am working with, and got a sensible answer.
Note that I am not sure if this materializes the entire table before aggregating, nor if this works specifically on lazy data frames. If not, please let me know, and I will retract this answer.
If you have
df=pl.DataFrame({'a':[1,2,3],'b':[2,3,4],'c':[3,4,5]}).lazy()
and you want to see whether [a, b] is unique without returning all the data, you can lazily group by those columns and count the groups. With that, you can add a filter so that only rows with a count greater than 1 are returned. Only after those expressions are strung onto the LazyFrame do you collect; if your pair of columns is unique as you intend, the result will have 0 rows.
df \
    .groupby(['a', 'b']) \
    .agg(pl.count()) \
    .filter(pl.col('count') > 1) \
    .select('count').collect().height
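A variant in the same spirit (a sketch, assuming the struct approach from the previous answer also works lazily): compare the distinct count of the struct to the row count inside a single lazy query, so only a 1x1 boolean result is materialized:
assert df.select(pl.struct(['a', 'b']).n_unique() == pl.count()).collect().item()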

"ERROR: operator does not exist : integer = integer[]" when using "IN" operator in PostgreSQL. Why do I get this error?

What I'm trying to achieve
First I need to query table tableY to get all userids that fulfill the inner WHERE condition. Then I aggregate it into an array of userids with array_agg(userid). Then in the outer query, I need to select users from the tableX table with userids that exist inside the array I created before from tableY.
Error
I get the following error:
ERROR: operator does not exist: integer = integer[]
LINE 2: WHERE 3 IN ((
HINT: No operator matches the given name and argument types. You might need to add explicit type cast.
My query:
SELECT * FROM mydb.tableX
WHERE 3 IN ((
SELECT array_agg(userid) userids FROM
(
SELECT
DISTINCT(uc.userid), eui.firstname
FROM mydb.tableY uc
JOIN mydb.tableX eui ON uc.userid = eui.auth_userid
WHERE uc.level = 4
AND uc.subjectid = 1
AND uc.lineid = 5
GROUP BY uc.userid, eui.firstname
ORDER BY eui.firstname
) AS userids
))
Btw, I only use the "3" as a hard-coded example for now, to get the query running in the first place.
Question
Why do I get the error?
Thanks!
The array_agg is useless in this context. It only adds significant overhead and it blocks some possible optimizations.
Write just WHERE 3 IN (SELECT userid ...), as sketched below.
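Applied to the query from the question, the rewrite might look like this (a sketch; names taken verbatim from the question, with the array_agg, DISTINCT, GROUP BY and ORDER BY dropped because IN only needs the set of userids):
SELECT * FROM mydb.tableX
WHERE 3 IN (SELECT uc.userid
            FROM mydb.tableY uc
            JOIN mydb.tableX eui ON uc.userid = eui.auth_userid
            WHERE uc.level = 4
              AND uc.subjectid = 1
              AND uc.lineid = 5);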
Note - when you really need to check if some value is in an array, you should use the = ANY() operator, but that is not the case here:
postgres=# SELECT 1 WHERE 1 = ANY(ARRAY[1,2,3]);
┌──────────┐
│ ?column? │
╞══════════╡
│ 1 │
└──────────┘
(1 row)

What's the meaning of select attributeName(tableName) from tableName in PostgreSQL?

Using PostgreSQL, I see apparently strange behavior that I don't understand.
Assume to have a simple table
create table employee (
    number int primary key,
    surname varchar(20) not null,
    name varchar(20) not null);
It is well clear for me the meaning of
select name from employee
However, I obtain all the names also with
select name(employee) from employee
and I do not understand this last statement.
I'm using PostgreSQL 13 and pgAdmin 4.
I'd like to expand @Abelisto's answer with this quotation from the PostgreSQL docs:
Another special syntactical behavior associated with composite values is that we can use functional notation for extracting a field of a composite value. The simple way to explain this is that the notations field(table) and table.field are interchangeable. For example, these queries are equivalent:
SELECT c.name FROM inventory_item c WHERE c.price > 1000;
SELECT name(c) FROM inventory_item c WHERE price(c) > 1000;
...
This equivalence between functional notation and field notation makes it possible to use functions on composite types to implement “computed fields”. An application using the last query above wouldn't need to be directly aware that somefunc isn't a real column of the table.
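For example, a computed field on the employee table from the question might look like this (a sketch; full_name is a hypothetical function, not something from the question):
create function full_name(e employee) returns text
    language sql as $$ select e.name || ' ' || e.surname $$;

select full_name(employee), employee.full_name from employee;
Both notations invoke the same function, so full_name behaves like a computed column of employee.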
Just an assumption.
There are two syntactic ways in PostgreSQL to call a function that receives a row as its argument. For example:
create table t(x int, y int); insert into t values(1, 2);
create function f(a t) returns int language sql as 'select a.x+a.y';
select f(t), t.f from t;
┌───┬───┐
│ f │ f │
├───┼───┤
│ 3 │ 3 │
└───┴───┘
Probably it is implemented this way to make the syntax the same for columns as well:
select f(t), t.f, x(t), t.x from t;
┌───┬───┬───┬───┐
│ f │ f │ x │ x │
├───┼───┼───┼───┤
│ 3 │ 3 │ 1 │ 1 │
└───┴───┴───┴───┘

How to specify PostGIS geography value in a composite type literal?

I have a custom composite type:
CREATE TYPE place AS (
    name text,
    location geography(point, 4326)
);
I want to create a value of that type using a literal:
SELECT $$("name", "ST_GeogFromText('POINT(121.560800 29.901200)')")$$::place;
This fails with:
HINT: "ST" <-- parse error at position 2 within geometry
ERROR: parse error - invalid geometry
But this executes just fine:
SELECT ST_GeogFromText('POINT(121.560800 29.901200)');
I wonder what's the correct way to specify a PostGIS geography value in a composite type literal?
You are trying to push a function call, ST_GeogFromText, into a text string. This is not allowed, as it would create a possibility for SQL injection.
In the second call you need ST_GeogFromText to mark the type of the input. For the composite type you already did that in the type definition, so you can skip that part:
[local] gis#gis=# SELECT $$("name", "POINT(121.560800 29.901200)")$$::place;
┌───────────────────────────────────────────────────────────┐
│ place │
├───────────────────────────────────────────────────────────┤
│ (name,0101000020E610000032E6AE25E4635E40BB270F0BB5E63D40) │
└───────────────────────────────────────────────────────────┘
(1 row)
Time: 0,208 ms
Another option would be to use the non-literal form, which allows function calls:
[local] gis#gis=# SELECT ('name', ST_GeogFromText('POINT(121.560800 29.901200)'))::place;
┌───────────────────────────────────────────────────────────┐
│ row │
├───────────────────────────────────────────────────────────┤
│ (name,0101000020E610000032E6AE25E4635E40BB270F0BB5E63D40) │
└───────────────────────────────────────────────────────────┘
(1 row)
Time: 5,004 ms
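To check that the literal round-trips, you can read the field back as WKT (a quick sketch; ST_AsText is a standard PostGIS function):
SELECT ST_AsText((('name', 'POINT(121.560800 29.901200)')::place).location);
This should print POINT(121.5608 29.9012).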

Postgres using named column in where clause

In Postgres I'm struggling with this syntax. It works in MySQL, but I'm not sure what I'm doing wrong.
So let's say I have a JSON document. I want to select a key from that document and return the result as "text".
So my query would look like this.
SELECT member_id, data->>'username' AS username
FROM player.player
Returns this as expected.
Now let's say I want to select a name from that column, so my query would look like this.
SELECT member_id, data->>'username' AS username
FROM player.player WHERE username LIKE 'sam'
When I run the query I get this:
ERROR: column "username" does not exist
Why does it do that? The JSON value I'm selecting comes back as the text data type, since I'm using ->> on the column.
PostgreSQL follows the SQL standard, and there it is not possible to use an alias at the same query level. You should use a derived table and filter at a higher level:
postgres=# select 1 as x where x = 1;
ERROR: column "x" does not exist
LINE 1: select 1 as x where x = 1;
^
postgres=# select * from (select 1 as x) s where x = 1;
┌───┐
│ x │
╞═══╡
│ 1 │
└───┘
(1 row)
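Applied to the query from the question, the fix might look like this (a sketch with the same table and column names):
SELECT member_id, username
FROM (SELECT member_id, data->>'username' AS username
      FROM player.player) p
WHERE username LIKE 'sam';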