postgres: how to count text values (e.g. "yes") in a column - postgresql-9.4

I would like to count how many times "yes" appears in each row of column B.
Is there an efficient way of dealing with this?
I am actually aggregating data so I am not considering creating a new table using "string_to_array".

Related

Postgresql - Return column subset from cursor

I have a legacy stored procedure returning a number (row count) and a cursor with many columns; I need to retrieve a subset of the selected columns. I can think of three ways of doing it:
1) Invoke the existing procedure from the outside, and map columns to my own data structures, trimming unneeded columns;
2) Write a new stored procedure, mostly identical to the existing one but returning different columns;
3) Write a new stored procedure, invoking the old one internally and filtering columns (the referenced entities and thus the number of rows are exactly the same as the existing procedure).
Number 2 is obviously a no-go.
Number 1 is viable. As far as I know, there is little difference in the computing cost between retrieving one or more columns, in that the engine has to read full rows regardless, before filtering unrequired columns; I do have a feeling it would be heavier on the runtime invoking the procedure from the outside, as objects representing unneeded columns would exist on returning from the DB call.
I would be interested in implementing Number 3, but I would prefer to maintain the same return type as the existing function (count + refcursor) for conformity.
I think I could transfer all the rows in the cursor returned by the existing function into a temporary table as described e.g. in this question, and use it as a source for the output cursor but:
I am not sure of how the output cursor would behave with a temporary table created with a drop-on-commit clause (would the results exist reliably after the procedure has terminated? Would the temporary table be dropped as expected?);
I read that temporary tables are expensive to use, and it feels like overkill for what in the end is a filtering of columns on the same rows from a pre-computed result.
Is there a way to query the existing cursor so that it may be used as a source for the output cursor, while filtering columns?

kdb+/q optimize union function

To give you a bit of background. I have a process which does this large complex calculation which takes a while to complete. It runs on a timer. After some investigation I realise that what is causing the slowness isn't the actual calculation but the internal q function, union.
I am trying to union two simple tables, table A and table B. A is approximately 5m rows and B is 500. Both tables have only two columns. First column is a symbol. Table A is actually a compound primary key of a table. (Also, how do you copy directly from the console?)
n:5000000
big:([]n?`4;n?100)
small:([]500?`4;500?100)
\ts big union small
I tried keying both columns and upserting, join and then distinct, "big, small where not small in big" but nothing seems to work :(
Any help will be appreciated!
If you want to upsert into the big table, it has to be keyed and the upsert operator should be used. For example:
n:5000000
//big ids are unique numbers from 0 to 4999999
//table is keyed with 1! operator
big:1!([]id:(neg n)?n;val:n?100)
//small ids are unique numbers: 250 from the 0-4999999 interval and 250 from the 5000000-9999999 interval
small:([]id:(-250?n),(n+-250?n);val:500?100)
If big is a global variable, it is efficient to upsert into it by name:
`big upsert small
If big is local:
big: big upsert small
As a result, big will have 5000250 rows, because 250 of the keys (id column) in small already exist in big and the other 250 are new.
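The row-count arithmetic above can be checked with a small model. This is only a Python sketch of keyed-upsert semantics, not q: the `upsert` helper is hypothetical, a dict stands in for the keyed table, and n is scaled down from 5000000 to 5000 so it runs instantly.

```python
import random

def upsert(keyed: dict, rows: dict) -> dict:
    """Model of a keyed upsert: existing keys are overwritten, new keys are added."""
    merged = dict(keyed)
    merged.update(rows)
    return merged

n = 5000  # assumption: scaled down from the 5000000 used in the answer
rng = random.Random(0)

# big: n unique ids drawn from 0..n-1, each with a value below 100
big = {i: rng.randrange(100) for i in rng.sample(range(n), n)}

# small: 250 ids that already exist in big, plus 250 ids from n..2n-1
small_ids = rng.sample(range(n), 250) + [n + i for i in rng.sample(range(n), 250)]
small = {i: rng.randrange(100) for i in small_ids}

result = upsert(big, small)
print(len(result))  # n existing keys plus the 250 new ones
```

Only the 250 ids outside the original range add rows; the 250 overlapping ids just have their values replaced.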
This may not be relevant, but just a quick thought. If your big table has a column of type `sym and that column is not used much elsewhere in your program, why not cast it to string or another type? If you run this update every single day, then as the data accumulates in your partitioned hdb, the kdb+ process has to reassign/rewrite its sym file whenever new data is added, and I believe this is the part that actually takes a lot of time, not the union calculation itself.
If the above is true, I'd suggest either rewriting the schema for the table to minimise the amount of rehashing (not sure if this is the right term, though!) on your sym file or, as the above person mentioned, trying to assign an attribute to your table. This may reduce the time too.

100 columns vs Array of length 100

I have a table with 100+ values corresponding to each row, so I'm exploring different ways to store them.
Without any indexes, would I lose anything if I store these 100 values in an integer[] column in postgresql? As compared to storing them in separate columns.
Plus, since we can add indexes to array elements,
CREATE INDEX test_index on test ((foo[1]));
Would there be a performance difference for queries using such an index, as compared to a regular index on a column?
As far as I've read, this performance difference would come into picture in arrays with variable length elements; but I'm not sure about fixed length ones.
Don't go for the lazy way.
If you need to store 100 or more values as an array, that is OK, provided an array makes sense for your application and your data.
If you need to query for a specific element of the array, then this design is not good, regardless of performance, and you should use columns. This will also help you when you have to delete a "column" in the middle or redesign it.
Anyway, as Frank wrote in the comments, if the values are all of the same type, consider modelling them in another table (if they also have the same meaning).
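A minimal sketch of the "another table" suggestion, using Python's built-in sqlite3 as a stand-in for PostgreSQL. The table names (item, item_value), the column names, and the sample values are all assumptions made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE item (id INTEGER PRIMARY KEY);
    -- One row per (item, position) instead of 100 columns or one array
    CREATE TABLE item_value (
        item_id  INTEGER NOT NULL REFERENCES item(id),
        position INTEGER NOT NULL,
        value    INTEGER NOT NULL,
        PRIMARY KEY (item_id, position)
    );
    -- An ordinary index serves "which items have value V at position P"
    CREATE INDEX item_value_pos_val ON item_value (position, value);
""")

for item_id in (1, 2, 3):
    conn.execute("INSERT INTO item (id) VALUES (?)", (item_id,))
    conn.executemany(
        "INSERT INTO item_value (item_id, position, value) VALUES (?, ?, ?)",
        [(item_id, pos, item_id * 1000 + pos) for pos in range(1, 101)],
    )

def items_with_value_at(position, value):
    """Return the ids of items holding `value` at the given position."""
    rows = conn.execute(
        "SELECT item_id FROM item_value "
        "WHERE position = ? AND value = ? ORDER BY item_id",
        (position, value),
    ).fetchall()
    return [r[0] for r in rows]

print(items_with_value_at(7, 2007))
```

With this layout, removing a "column" in the middle becomes a DELETE on one position rather than a schema change.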

How should a table with two sets of almost duplicate column names be designed?

I have a table that has around 40 columns. The only difference in the column names is that the last 20 all start with "B" before the column name. This table is used for comparing. In other words, compare the data in the first 20 columns to the data in the last 20 columns.
I know this is very bad design, so how should this table be redesigned, so that there are only 20 columns, yet we can still compare the data?
EDIT: if it helps, we also use this data to find a matched cohort
Also note that performance is the main concern here. With the duplicated columns, retrieving the data is extremely fast.
Thanks!
Two possible architectures and a query tip.
1) Build your table with a "Type" column, and use that to flag "primary" vs. "alternate". In your case, "A" vs. "B" might be appropriate.
2) Build a vertical partition, two identical tables (for primary and alternate data), that share a common primary key. (If Id = 42 is in one table, it must be in the other--unless "alternate" data is optional, in which case don't populate the second table.) Also optionally, have a third table that tracks all possible primary keys, along with any data that is known to always be common to both tables.
Tip: Read up on SELECT...EXCEPT and SELECT...INTERSECT. They run disturbingly quickly, and are ideal for comparing all columns and rows between two datasets for differences (except) and matches (intersect). You can use this fairly easily with either of the two structures, and it would work with your existing code as well (though it might be fussier to write the query).
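As a rough illustration of architecture 2 combined with the tip, here is a sketch using Python's built-in sqlite3, which also supports EXCEPT and INTERSECT. The table names (primary_data, alternate_data), the columns, and the sample rows are assumptions, not taken from the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Two identical tables that share a common primary key
    CREATE TABLE primary_data   (id INTEGER PRIMARY KEY, age INTEGER, score INTEGER);
    CREATE TABLE alternate_data (id INTEGER PRIMARY KEY, age INTEGER, score INTEGER);
    INSERT INTO primary_data   VALUES (1, 30, 10), (2, 41, 20), (3, 55, 30);
    INSERT INTO alternate_data VALUES (1, 30, 10), (2, 41, 25), (3, 55, 30);
""")

# Rows of primary_data with no identical row in alternate_data
differences = conn.execute("""
    SELECT id, age, score FROM primary_data
    EXCEPT
    SELECT id, age, score FROM alternate_data
    ORDER BY id
""").fetchall()

# Rows that are identical in both tables
matches = conn.execute("""
    SELECT id, age, score FROM primary_data
    INTERSECT
    SELECT id, age, score FROM alternate_data
    ORDER BY id
""").fetchall()

print(differences)  # [(2, 41, 20)]
print(matches)      # [(1, 30, 10), (3, 55, 30)]
```

The same two queries work against the existing 40-column table by selecting the first 20 columns in one branch and the "B" columns in the other.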

Adding a new column to Table which contains live data

I have a large table consisting of over 60 millions records and I would like to add 2 new columns for data migration purposes. There are indexes on the table and some of them are large. So, by me adding the 2 new columns to the table, will I run the risk of slowing down the database whilst it attempts to add them and maybe time-out? Or will it just work?
I know that if I try and rearrange the columns SQL Server will ask me to drop and re-create the table, so I definitely don't want this. Is this something everyone is challenged with?
We've had the same problem with column and index changes on larger tables.
I would simply add the columns using ALTER TABLE. The column order, though nice, is irrelevant.
If the columns are NULLable, then the time is reasonable. If you want to add a default value and make them NOT NULL, then this is obviously more work. However, I would consider adding the columns as NULL, then setting them to a value, then changing them to NOT NULL, so that there are 3 steps you can do at different times. We do this to reduce the time window we need, even if the whole process takes longer.
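The staged approach might look roughly like this. It is a sketch using Python's built-in sqlite3 rather than SQL Server, so only the first two steps are executed; SQLite has no ALTER COLUMN, and the third step is shown as a comment. The table name (records), the column name (migrated), the batch size, and the `backfill` helper are all hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE records (id INTEGER PRIMARY KEY, payload TEXT)")
conn.executemany("INSERT INTO records (payload) VALUES (?)", [("row",)] * 1000)

# Step 1: add the column as NULLable, with no default to write into every row
conn.execute("ALTER TABLE records ADD COLUMN migrated INTEGER")

def backfill(batch_size=100):
    """Step 2: set a value in small batches so each transaction stays short."""
    batches = 0
    while True:
        cur = conn.execute(
            "UPDATE records SET migrated = 0 WHERE id IN ("
            "SELECT id FROM records WHERE migrated IS NULL LIMIT ?)",
            (batch_size,),
        )
        conn.commit()
        if cur.rowcount == 0:  # nothing left to fill
            break
        batches += 1
    return batches

batches_run = backfill()
remaining = conn.execute(
    "SELECT COUNT(*) FROM records WHERE migrated IS NULL"
).fetchone()[0]
print(batches_run, remaining)

# Step 3 (SQL Server, not run here), once no NULLs remain:
#   ALTER TABLE records ALTER COLUMN migrated INT NOT NULL;
```

Each step can be scheduled separately, which is the point of splitting the change up.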