Optimal approach for large bulk updates to indexed bit sets - postgresql

The environment for this question is PostgreSQL 9.6.5 on AWS RDS.
The question is about an optimal schema design and batch update strategy for a table with 300 million rows containing the following logical data model:
id: primary key, string up to 40 characters long
code: integer 1-999
year: integer year
flags: variable number (1000+) each associated with a name, new flags added over time. Ideally, a flag should be thought of as having three values: absent (null), on (true/1) and off (false/0). It is possible, at the cost of additional updates (see below), to treat a flag as a simple bit (on or off, no absent). "On" values are typically very sparse: < 1/1000.
Queries typically involve boolean expressions on the presence or absence of one or more flags (by name) with code and year occasionally involved also.
The data is updated in batch via Apache Spark, i.e., updates can be represented as flat file(s), e.g., in COPY format, or as SQL operations. Only one update is active at any one time. Updates to code and year are very infrequent. Updates to flags affect 1-5% of rows per update (3-15 million rows). It is possible for the update rows to include all flags and their values, just the "on" flags to be updated or just the flags whose values have changed. In the former case, Spark would need to query the data to get the current values of flags.
There will be a small read load during updates.
The question is about an optimal schema and associated update strategy to support the query & updates as described.
Some comments from research so far:
Using 1,000+ boolean columns would create a very efficient row representation but, in addition to some DDL complexity, would require 1,000+ indexes.
Bit strings would be great if there was a way to index individual bits. Also, they do not offer a good way to represent absent flags. Using this approach would require maintaining a lookup table between flag names and bit IDs. Merging updates, if needed, works with ||, though, given PostgreSQL's MVCC there doesn't seem to be much benefit to updating just flags as opposed to replacing an entire row.
JSONB fields offer indexing. They also offer null representation but that comes at a cost: all flags that are "off" would need to be explicitly set, which would make the fields quite large. If we ignore null representation, JSONB fields would be relatively small. To further shrink them, we could use short 1-3 character field names with a lookup table. Same comments re: merging as with bit strings.
tsvector/tsquery: have no experience with this data type but, in theory, seems to be an exact representation of a set of "on" flags by name. Must use a lookup table mapping flag names to tokens with the additional requirement to ensure there are no collisions due to stemming.

Don't store the flags in the main table.
Assuming that the main table is called data, define something like the following:
CREATE TABLE flag_names (
id smallint PRIMARY KEY,
name text NOT NULL
);
CREATE TABLE flag (
flagname_id smallint NOT NULL REFERENCES flag_names(id),
data_id text NOT NULL REFERENCES data(id),
value boolean NOT NULL,
PRIMARY KEY (flagname_id, data_id)
);
If a new flag is created, insert a new row in flag_names.
If a flag is set to TRUE or FALSE, insert or update a row in the flag table.
Join flag with data to test if a certain flag is set.

Related

Postgres hstore for time series

I am new to postgres and am experimenting with the hstore extension.Looking for some guidance. I need to support basic reporting on timeseries data for various products that we sell. I have a large amount data in the format "Timestamp, Value" for each product. This data is available in a csv fle for each product.
I am thinking of using hstore to store this data in the key value format. Assuming that all the timeseries data for a single product can be stored in a single hstore object. I need to be able to query this data by specific times, say what was the value of a product at a given time? Also need to run simple queries like retrieving the times where the product costed more than $100.
I'm planning to have a table with a product id column and an hstore column. But I am not very clear on how to make this work:
The hstore column needs to be loaded from thousands of timestamp,value records that exist in a csv. The hstore should be appended whenever we get a new csv.
The table needs to store the productId and corresponding Timeseries data.
Can you please advise if using hstore would be helpful ? If yes then how can I load data from csv as explained above. Also, if there could be any impact on the performance on inserts/updates in the hstore, as data grows please share your experiences.
I do think you should start with a simple, normalised schema first, especially since you are new to PostgreSQL. Something like:
CREATE TABLE product_data
(
product TEXT, -- I'm making an assumption about the types of your columns
time TIMESTAMP,
value DOUBLE PRECISION,
PRIMARY KEY (product, time);
);
I would definitely keep hstore and similar options in mind, if and when your data becomes large enough that efficiency is more important and simplicity. But note that all options have an efficiency tradeoff.
Do you know how much data you're going to support? Number of products, number of distinct timestamps for each product?
What other queries do you want to run? A query for the times where a single product cost more than $100 would benefit from an index on (product, value), if the product has many distinct timestamps.
Other options
hstore is most useful if you want to store a table set of arbitrary key-value pairs in a row. You could use it here, with a row for each product, and each distinct timestamp for that product being a key in the product's table. The downsides are that keys and values in hstore are text, whereas your keys are timestamps, and your values are numbers of some kind. So there will be a certain reduction in type checking, and a certain increase in type casting cost required. Another possible downside is that some queries on the hstore might not use indexes very efficiently. The above table can use simple btree indexes for range queries (say you want to pull out the values between two dates for a product). But hstore indexes are much more limited; you can use a gist or gin index on an hstore column to find all the rows that feature a certain key.
Another option (which I've played with and use experimentally for some of my databases) is arrays. Basically, each product will have an array of values, and each timestamp maps to an index in the array. This is easy if the timestamps are perfectly regular. For example, if all your products had a value every hour for every day, you could use a table like this:
CREATE TABLE product_data
(
product TEXT,
day DATE,
values DOUBLE PRECISION[], -- An array from 0 to 23.
PRIMARY KEY (product, day);
);
You can construct views and indexes to make querying this table moderate easy. (I wrote a blog post on this technique at http://ejrh.wordpress.com/2011/03/20/vector-denormalisation-in-postgresql/.)
But my advice is still: start with a simple table, then explore ways to improve efficiency when you know you're going to need them.

How to formulate index_name in SQL?

I am trying to create an index on one of my tables with an accurate label. Here is how I am trying it...expecting "sysname" to resolve to the column or table name. But after I run this command and view it in the Object Explorer, it is listed as
"[<Name of Missing Index, sysname + '_prod',>]".
How do u define index_names in a better descriptive fashion? (I am trying to add the extension "_prod" to the index_name, since INDEX of index_name already exists).
USE [AMDMetrics]
GO
CREATE NONCLUSTERED INDEX
[<Name of Missing Index, sysname + '_prod',>]
ON [SMARTSOLVE].[V_CXP_CUSTOMER_PXP] ([QXP_UDF_STRING_8], [QXP_REPORT_DATE],
[QXP_XRS_DESCRIPTION])
INCLUDE ([QXP_ID], [QXP_EXCEPTION_NO], [QXP_BASE_EXCEPTION], [QXP_CATEGORY],
[QXP_OCCURENCE_DATE], [QXP_COORD_ID], [QXP_SHORT_DESC], [QXP_ROOT_CAUSE],
[QXP_DESCRIPTION], [QXP_QEI_ID], [PXP_LOT_NUMBER], [CXP_ID], [CXP_AWARE_DATE],
[QXP_XSV_CODE], [QXP_COORD_NAME], [PXP_PRODUCT_CODE], [PXP_PRODUCT_NAME],
[QXP_ORU_NAME], [QXP_RESOLUTION_DESC], [QXP_CLOSED_DATE], [CXP_CLIENT_CODE],
[CXP_CLIENT_NAME])
I'm not 100% sure what you are trying to do, but it seems like you are trying to find a way to properly name your index (or find a good naming convention). Conventions are best when they are easy to follow, and make sense to people without having to explain it to them. A lot of different conventions fit this MO, but the one that is most common is this:
Index Type Prefix Complete Index name
-------------------------------------------------------------------
Index (not unique, non clustered) IDX_ IDX_<name>_<column>
Index (unique, non clustered) UDX_ UDX_<name>_<column>
Index (not unique, clustered) CIX_ CIX_<name>_<column>
Index (unique, clustered) CUX_ CUX_<name>_<column>
Although on a different note, I have to question why you have so many columns in your INCLUDE list....without knowing the size of those columns, there are some drawbacks to adding so many columns:
Avoid adding unnecessary columns. Adding too many index columns,
key or nonkey, can have the following performance implications:
- Fewer index rows will fit on a page. This could create I/O increases
and reduced cache efficiency.
- More disk space will be required to store the index. In particular,
adding varchar(max), nvarchar(max), varbinary(max), or xml data types
as nonkey index columns may significantly increase disk space requirements.
This is because the column values are copied into the index leaf level.
Therefore, they reside in both the index and the base table.
- Index maintenance may increase the time that it takes to perform modifications,
inserts, updates, or deletes, to the underlying table or indexed view.
You will have to determine whether the gains in query performance outweigh
the affect to performance during data modification and in additional disk
space requirements.
From here: http://msdn.microsoft.com/en-us/library/ms190806.aspx

How should a table with two sets of almost duplicate column names be designed?

I have a table that has around 40 columns. The only difference in the columns names is that the last 20 all start with "B" before the column name. This table is used for comparing. In other words, compare the data in the first 20 columns to the data in the last 20 columns.
I know this is very bad design, so how should this table be redesigned, so that there are only 20 columns, yet we can still compare the data?
EDIT: if it helps, we also use this data to find a matched cohort
Also note that performance is of main concern here. By duplicating the columns the getting of data is extremely fast.
Thanks!
Two possible architectures and a query tip.
1) Build your table with a "Type" column, and use that to flag "primary" vs. "alternate". In your case, "A" vs. "B" might be appropriate.
2) Build a vertical partition, two identical tables (for primary and alternate data), that share a common primary key. (If Id = 42 is in one table, it must be in the other--unless "alternate" data is optional, in which case don't populate the second table.) Also optionally, have a third table that tracks all possible primary keys, along with any data that is known to always be common to both tables.
Tip: Read up on SELECT...EXCEPT and SELECT...INTERSECT. They run disturbingly quickly, and are idea for comparing all columns and rows between two datasets for differences (except) and matches (intersect). You can use this fairly easily with either of the two structures, and it would work with your existing code as well (though it might be fussier to write the query).

How to query Cassandra by date range

I have a Cassandra ColumnFamily (0.6.4) that will have new entries from users. I'd like to query Cassandra for those new entries so that I can process that data in another system.
My sense was that I could use a TimeUUIDType as the key for my entry, and then query on a KeyRange that starts either with "" as the startKey, or whatever the lastStartKey was. Is this the correct method?
How does get_range_slice actually create a range? Doesn't it have to know the data type of the key? There's no declaration of the data type of the key anywhere. In the storage_conf.xml file, you declare the type of the columns, but not of the keys. Is the key assumed to be of the same type as the columns? Or does it do some magic sniffing to guess?
I've also seen reference implementations where people store TimeUUIDType in columns. However, this seems to have scale issues as this particular key would then become "hot" since every change would have to update it.
Any pointers in this case would be appreciated.
When sorting data only the column-keys are important. The data stored is of no consequence neither is the auto-generated timestamp. The CompareWith attribute is important here. If you set CompareWith as UTF8Type then the keys will be interpreted as UTF8Types. If you set the CompareWith as TimeUUIDType then the keys are automatically interpreted as timestamps. You do not have to specify the data type. Look at the SlicePredicate and SliceRange definitions on this page http://wiki.apache.org/cassandra/API This is a good place to start. Also, you might find this article useful http://www.sodeso.nl/?p=80 In the third part or so he talks about slice ranging his queries and so on.
Doug,
Writing to a single column family can sometimes create a hot spot if you are using an Order-Preserving Partitioner, but not if you are using the default Random Partitioner (unless a subset of users create vastly more data than all other users!).
If you sorted your rows by time (using an Order-Preserving Partitioner) then you are probably even more likely to create hotspots, since you will be adding rows sequentially and a single node will be responsible for each range of the keyspace.
Columns and Keys can be of any type, since the row key is just the first column.
Virtually, the cluster is a circular hash key ring, and keys get hashed by the partitioner to get distributed around the cluster.
Beware of using dates as row keys however, since even the randomization of the default randompartitioner is limited and you could end up cluttering your data.
What's more, if that date is changing, you would have to delete the previous row since you can only do inserts in C*.
Here is what we know :
A slice range is a range of columns in a row with a start value and an end value, this is used mostly for wide rows as columns are ordered. Known column names defined in the CF are indexed however so they can be retrieved specifying names.
A key slice, is a key associated with the sliced column range as returned by Cassandra
The equivalent of a where clause uses secondary indexes, you may use inequality operators there, however there must be at least ONE equals clause in your statement (also see https://issues.apache.org/jira/browse/CASSANDRA-1599).
Using a key range is ineffective with a Random Partitionner as the MD5 hash of your key doesn't keep lexical ordering.
What you want to use is a Column Family based index using a Wide Row :
CompositeType(TimeUUID | UserID)
In order for this not to become hot, add a first meaningful key ("shard key") that would split the data accross nodes such as the user type or the region.
Having more data than necessary in Cassandra is not a problem, it's how it is designed, so what you must ask yourself is "what do I need to query" and then design a Column Family for it rather than trying to fit everything in one CF like you'd do in an RDBMS.

DB2 Auto generated Column / GENERATED ALWAYS pros and cons over sequence

Earlier we were using 'GENERATED ALWAYS' for generating the values for a primary key. But now it is suggested that we should, instead of using 'GENERATED ALWAYS' , use sequence for populating the value of primary key. What do you think can be the reason of this change? It this just a matter of choice?
Earlier Code:
CREATE TABLE SCH.TAB1
(TAB_P INTEGER NOT NULL GENERATED ALWAYS AS IDENTITY (START WITH 1, INCREMENT BY 1, NO CACHE),
.
.
);
Now it is
CREATE TABLE SCH.TAB1
(TAB_P INTEGER ),
.
.
);
now while inserting, generate the value for TAB_P via sequence.
I tend to use identity columns more than sequences, but I'll compare the two for you.
Sequences can generate numbers for any purpose, while an identity column is strictly attached to a column in a table.
Since a sequence is an independent object, it can generate numbers for multiple tables (or anything else), and is not affected when any table is dropped. When a table with a identity column is dropped, there is no memory of what value was last assigned by that identity column.
A table can have only one identity column, so if you want to want to record multiple sequential numbers into different columns in the same table, sequence objects can handle that.
The most common requirement for a sequential number generator in a database is to assign a technical key to a row, which is handled well by an identity column. For more complicated number generation needs, a sequence object offers more flexibility.
This might probably be to handle ids in case there are lots of deletes on the table.
For eg: In case of identity, if your ids are
1
2
3
Now if you delete record 3, your table will have
1
2
And then if your insert a new record, the ids will be
1
2
4
As opposed to this, if you are not using an identity column and are generating the id using code, then after delete for the new insert you can calculate id as max(id) + 1, so the ids will be in order
1
2
3
I can't think of any other reason, why an identity column should not be used.
Heres something I found on the publib site:
Comparing IDENTITY columns and sequences
While there are similarities between IDENTITY columns and sequences, there are also differences. The characteristics of each can be used when designing your database and applications.
An identity column has the following characteristics:
An identity column can be defined as
part of a table only when the table
is created. Once a table is created,
you cannot alter it to add an
identity column. (However, existing
identity column characteristics might
be altered.)
An identity column
automatically generates values for a
single table.
When an identity
column is defined as GENERATED
ALWAYS, the values used are always
generated by the database manager.
Applications are not allowed to
provide their own values during the
modification of the contents of the
table.
A sequence object has the following characteristics:
A sequence object is a database
object that is not tied to any one
table.
A sequence object generates
sequential values that can be used in
any SQL or XQuery statement.
Since a sequence object can be used
by any application, there are two
expressions used to control the
retrieval of the next value in the
specified sequence and the value
generated previous to the statement
being executed. The PREVIOUS VALUE
expression returns the most recently
generated value for the specified
sequence for a previous statement
within the current session. The NEXT
VALUE expression returns the next
value for the specified sequence. The
use of these expressions allows the
same value to be used across several
SQL and XQuery statements within
several tables.
While these are not all of the characteristics of these two items, these characteristics will assist you in determining which to use depending on your database design and the applications using the database.
I don't know why anyone would EVER use an identity column rather than a sequence.
Sequences accomplish the same thing and are far more straight forward. Identity columns are much more of a pain especially when you want to do unloads and loads of the data to other environments. I not going to go into all the differences as that information can be found in the manuals but I can tell you that the DBA's have to almost always get involved anytime a user wants to migrate data from one environment to another when a table with an identity is involved because it can get confusing for the users. We have no issues when a sequence is used. We allow the users to update any schema objects so they can alter their sequences if they need to.