Is it a good idea to store attributes in an integer column and perform bitwise operations to retrieve them? - tsql

In a recent CODE Magazine article, John Petersen shows how to use bitwise operators in TSQL in order to store a list of attributes in one column of a db table.
Article here.
In his example he's using one integer column to hold how a customer wants to be contacted (email,phone,fax,mail). The query for pulling out customers that want to be contacted by email would look like this:
SELECT C.*
FROM dbo.Customers C
,(SELECT 1 AS donotcontact
,2 AS email
,4 AS phone
,8 AS fax
,16 AS mail) AS contacttypes
WHERE ( C.contactmethods & contacttypes.email <> 0 )
AND ( C.contactmethods & contacttypes.donotcontact = 0 )
Afterwards he shows how to encapsulate this into a table function.
My questions are these:
1. Is this a good idea? Any drawbacks? What problems might I run into using this approach of storing attributes versus storing them in two extra tables (Customer_ContactType, ContactType) and doing a join with the Customer table? I guess one problem might be if my attribute list gets too long. If the column is an integer then my attribute list could hold at most 32 attributes.
2. What is the performance of doing these bitwise operations in queries as you move into the tens of thousands of records? I'm guessing that it would not be any more expensive than any other comparison operation.

If you wish to filter your query based on the value of any of those bit values, then yes this is a very bad idea, and is likely to cause performance problems.
Besides, there simply isn't any need - just use the bit data type.
The reason why using bitwise operators in this way is a bad idea is that SQL Server maintains statistics on various columns in order to improve query performance - for example, if you have an email column, SQL Server can tell you roughly what percentage of values in that email column are true and select an appropriate execution plan based on that knowledge.
If however you have a flags column, SQL Server will have absolutely no idea how many records in a table match flags & 2 (email) - it doesn't maintain statistics on that kind of expression. Without this sort of information available to it, SQL Server is far more likely to choose a poor execution plan.
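As a minimal sketch of the bit-column alternative, assuming a hypothetical dbo.Customers layout (the column names here are illustrative, not from the article):
CREATE TABLE dbo.Customers
(
    CustomerId     int IDENTITY(1,1) PRIMARY KEY,
    CustomerName   nvarchar(100) NOT NULL,
    DoNotContact   bit NOT NULL DEFAULT 0,
    ContactByEmail bit NOT NULL DEFAULT 0,
    ContactByPhone bit NOT NULL DEFAULT 0,
    ContactByFax   bit NOT NULL DEFAULT 0,
    ContactByMail  bit NOT NULL DEFAULT 0
);

-- the optimizer can keep statistics on each bit column and pick a plan accordingly
SELECT C.*
FROM dbo.Customers C
WHERE C.ContactByEmail = 1
  AND C.DoNotContact = 0;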

And don't forget the maintenance problems using this technique would cause. As it is not standard, all new devs will probably be confused by the code and not know how to adjust it properly. Errors will abound and be hard to find. It is also hard to do reporting type queries from. This sort of trick stuff is almost never a good idea from a maintenance perspective. It might look cool and elegant, but all it really is - is clunky and hard to work with over time.

One major performance implication is that there will not be a lookup operator for indexes that works in this way. If you said WHERE contact_email=1 there might be an index on that column and the query would use it; if you said WHERE (contact_flags & 1)=1 then it wouldn't.
One column stores one piece of information only - it's the database way.
(Didn't see it earlier - Kragen's answer also states this point, way before mine.)
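A hedged sketch of that index point, reusing the hypothetical contact_email / contact_flags column names from the paragraph above (whether SQL Server actually chooses a seek will also depend on selectivity):
CREATE INDEX IX_Customers_contact_email ON dbo.Customers (contact_email);

-- sargable: the optimizer can seek on IX_Customers_contact_email
SELECT CustomerId FROM dbo.Customers WHERE contact_email = 1;

-- not sargable: even with an index on contact_flags, the bitwise expression prevents an index seek
SELECT CustomerId FROM dbo.Customers WHERE (contact_flags & 1) = 1;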

In opposite order: The best way to know what your performance is going to be is to profile.
This is, most definitely, an "It Depends" question. I personally would never store such things as integers. For one thing, as you mention, there's the conversion factor. For another, at some point you or some other DBA, or someone, is going to have to type:
Select CustomerName, CustomerAddress, ContactMethods, [etc]
From Customer
Where CustomerId = xxxxx
because some data has become corrupt, or because someone entered the wrong data, or something. Having to do a join and/or a function call just to get at that basic information is way more trouble than it's worth, IMO.
Others, however, will probably point to the diversity of your options, or the ability to store multiple value types (email, vs phone, vs fax, whatever) all in the same column, or some other advantage to this approach. So you would really need to look at the problem you're attempting to solve and determine which approach is the best fit.


Imagine a web form with a set of check boxes (any or all of them can be selected). I chose to save them in a comma separated list of values stored in one column of the database table.
Now, I know that the correct solution would be to create a second table and properly normalize the database. It was quicker to implement the easy solution, and I wanted to have a proof-of-concept of that application quickly and without having to spend too much time on it.
I thought the saved time and simpler code was worth it in my situation, is this a defensible design choice, or should I have normalized it from the start?
Some more context, this is a small internal application that essentially replaces an Excel file that was stored on a shared folder. I'm also asking because I'm thinking about cleaning up the program and make it more maintainable. There are some things in there I'm not entirely happy with, one of them is the topic of this question.
In addition to violating First Normal Form because of the repeating group of values stored in a single column, comma-separated lists have a lot of other more practical problems:
Can’t ensure that each value is the right data type: no way to prevent 1,2,3,banana,5
Can’t use foreign key constraints to link values to a lookup table; no way to enforce referential integrity.
Can’t enforce uniqueness: no way to prevent 1,2,3,3,3,5
Can’t delete a value from the list without fetching the whole list.
Can't store a list longer than what fits in the string column.
Hard to search for all entities with a given value in the list; you have to use an inefficient table-scan. May have to resort to regular expressions, for example in MySQL:
idlist REGEXP '[[:<:]]2[[:>:]]' or in MySQL 8.0: idlist REGEXP '\\b2\\b'
Hard to count elements in the list, or do other aggregate queries.
Hard to join the values to the lookup table they reference.
Hard to fetch the list in sorted order.
Hard to choose a separator that is guaranteed not to appear in the values.
To solve these problems, you have to write tons of application code, reinventing functionality that the RDBMS already provides much more efficiently.
Comma-separated lists are wrong enough that I made this the first chapter in my book: SQL Antipatterns, Volume 1: Avoiding the Pitfalls of Database Programming.
There are times when you need to employ denormalization, but as @OMG Ponies mentions, these are exception cases. Any non-relational “optimization” benefits one type of query at the expense of other uses of the data, so be sure you know which of your queries need to be treated so specially that they deserve denormalization.
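For reference, a minimal sketch of the normalized design those bullet points imply (generic SQL; FormEntry stands in for your existing table, and all names are illustrative):
CREATE TABLE CheckboxOption (
    option_id   int PRIMARY KEY,
    option_name varchar(50) NOT NULL UNIQUE
);

CREATE TABLE FormEntryOption (
    entry_id  int NOT NULL REFERENCES FormEntry (entry_id),
    option_id int NOT NULL REFERENCES CheckboxOption (option_id),
    PRIMARY KEY (entry_id, option_id)   -- enforces uniqueness, data type and referential integrity
);

-- "which entries selected option 2?" becomes an ordinary, indexable join
SELECT e.*
FROM FormEntry e
JOIN FormEntryOption eo ON eo.entry_id = e.entry_id
WHERE eo.option_id = 2;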
"One reason was laziness".
This rings alarm bells. The only reason you should do something like this is that you know how to do it "the right way" but you have come to the conclusion that there is a tangible reason not to do it that way.
Having said this: if the data you are choosing to store this way is data that you will never need to query by, then there may be a case for storing it in the way you have chosen.
(Some users would dispute the statement in my previous paragraph, saying that "you can never know what requirements will be added in the future". These users are either misguided or stating a religious conviction. Sometimes it is advantageous to work to the requirements you have before you.)
There are numerous questions on SO asking:
how to get a count of specific values from the comma separated list
how to get records that have only the same 2/3/etc. specific values from that comma separated list
Another problem with the comma separated list is ensuring the values are consistent - storing text means the possibility of typos...
These are all symptoms of denormalized data, and highlight why you should always model for normalized data. Denormalization can be a query optimization, to be applied when the need actually presents itself.
In general anything can be defensible if it meets the requirements of your project. This doesn't mean that people will agree with or want to defend your decision...
In general, storing data in this way is suboptimal (e.g. harder to do efficient queries) and may cause maintenance issues if you modify the items in your form. Perhaps you could have found a middle ground and used an integer representing a set of bit flags instead?
Yes, I would say that it really is that bad. It's a defensible choice, but that doesn't make it correct or good.
It breaks first normal form.
A second criticism is that putting raw input results directly into a database, without any validation or binding at all, leaves you open to SQL injection attacks.
What you're calling laziness and lack of SQL knowledge is the stuff that neophytes are made of. I'd recommend taking the time to do it properly and view it as an opportunity to learn.
Or leave it as it is and learn the painful lesson of a SQL injection attack.
I needed a multi-value column; it could be implemented as an XML field.
It can be converted to a comma-delimited list as necessary, and you can query an XML list in SQL Server using XQuery.
By being an XML field, some of the concerns can be addressed.
With CSV: Can't ensure that each value is the right data type: no way to prevent 1,2,3,banana,5
With XML: values in a tag can be forced to be the correct type
With CSV: Can't use foreign key constraints to link values to a lookup table; no way to enforce referential integrity.
With XML: still an issue
With CSV: Can't enforce uniqueness: no way to prevent 1,2,3,3,3,5
With XML: still an issue
With CSV: Can't delete a value from the list without fetching the whole list.
With XML: single items can be removed
With CSV: Hard to search for all entities with a given value in the list; you have to use an inefficient table-scan.
With XML: xml field can be indexed
With CSV: Hard to count elements in the list, or do other aggregate queries.
With XML: not particularly hard
With CSV: Hard to join the values to the lookup table they reference.
With XML: not particularly hard
With CSV: Hard to fetch the list in sorted order.
With XML: not particularly hard
With CSV: Storing integers as strings takes about twice as much space as storing binary integers.
With XML: storage is even worse than a csv
With CSV: Plus a lot of comma characters.
With XML: tags are used instead of commas
In short, using XML gets around some of the issues with a delimited list AND can be converted to a delimited list as needed.
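A hedged T-SQL sketch of that approach, assuming SQL Server's xml data type (the table name, element names and index name are illustrative):
CREATE TABLE FormEntry (
    EntryId         int PRIMARY KEY,
    SelectedOptions xml NOT NULL   -- e.g. <options><option>2</option><option>5</option></options>
);

-- an XML index is what makes searching inside the column viable
CREATE PRIMARY XML INDEX IX_FormEntry_SelectedOptions
    ON FormEntry (SelectedOptions);

-- entries whose list contains option 2
SELECT EntryId
FROM FormEntry
WHERE SelectedOptions.exist('/options/option[. = "2"]') = 1;

-- count the elements per entry
SELECT EntryId,
       SelectedOptions.value('count(/options/option)', 'int') AS OptionCount
FROM FormEntry;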
Yes, it is that bad. My view is that if you don't like using relational databases then look for an alternative that suits you better, there are lots of interesting "NOSQL" projects out there with some really advanced features.
Well, I've been using a tab-separated list of key/value pairs in an NTEXT column in SQL Server for more than 4 years now and it works. You do lose the flexibility of making queries, but on the other hand, if you have a library that persists/de-persists the key/value pairs then it's not that bad an idea.
I would probably take the middle ground: make each field in the CSV into a separate column in the database, but not worry much about normalization (at least for now). At some point, normalization might become interesting, but with all the data shoved into a single column you're gaining virtually no benefit from using a database at all. You need to separate the data into logical fields/columns/whatever you want to call them before you can manipulate it meaningfully at all.
If you have a fixed number of boolean fields, you could use an INT(1) NOT NULL (or BIT NOT NULL if it exists) or a nullable CHAR(0) for each. You could also use a SET (I forget the exact syntax).

How do I efficiently execute large queries?

Consider the following demo schema
trades:([]symbol:`$();ccy:`$();arrivalTime:`datetime$();tradeDate:`date$(); price:`float$();nominal:`float$());
marketPrices:([]sym:`$();dateTime:`datetime$();price:`float$());
usdRates:([]currency:`$();dateTime:`datetime$();fxRate:`float$());
I want to write a query that gets the price, translated into USD, at the soonest possible time after arrivalTime. My beginner way of doing this has been to create intermediate tables that do some filtering and translate column names to be consistent, and then using aj and aj0 to join them up.
In this case there would only be 2 intermediate tables. In my actual case there are necessarily 7 intermediate tables, and record counts, while not large by KDB standards, are not small either.
What is considered best practice for queries like this? It seems to me that creating all these intermediate tables is resource hungry. An alternative to the intermediate tables is to have a very complicated-looking single query. Would that actually help things? Or is this consumption of resources just the price to pay?
For joining to the next closest time after an event take a look at this question:
KDB reverse asof join (aj) ie on next quote instead of previous one
Assuming that's what you're looking for, then you should be able to perform your price calculation either before or after the join (depending on the size of your tables it may be faster to do it after). Ultimately I think you will need two (potentially modified as per above) aj's (rates to market data, market data to trades).
If that's not what you're looking for then I could give some more specifics although some sample data would be useful.
My thoughts:
The more verbose/readable your code, the better for you to debug later, and for any future readers/users of your code.
Unless absolutely necessary, I would try and avoid creating 7 copies of the same table. If you are dealing with large tables, memory could quickly become a concern. Particularly if the processing takes a long time, you could be creating large memory spikes. I try to keep to updating 1-2 variables at different stages, e.g.:
res: select from trades;
res:aj[`ccy`arrivalTime;
res;
select ccy:currency, arrivalTime:dateTime, fxRate from usdRates
]
res:update someFunc fxRate from res;
Sean beat me to it, but an aj for a time after (a reverse aj) is relatively straightforward: switch bin to binr in the k code. See the suggested answer.
I'm not sure why you need 7 intermediary tables unless you are possibly calculating cross rates? In this case I would typically join ccy1 and ccy2 with 2 ajs to the same table and take it from there.
Although it may be unavoidable in your case if you have no control over the source data, similar column names / greater consistency across schemas is generally better. e.g. sym vs symbol

How to use RLS with compound field

In Redshift we have a table (let's call it entity) which, among other columns, has two important ones: hierarchy_id & entity_timestampt. The hierarchy_id is a combination of the ids of three hierarchical dimensions (A, B, C; each one having a one-to-many relationship with the next one).
Thus: hierarchy_id == A.a_id || '-' || B.b_id || '-' || C.c_id
Additionally the table is distributed according to DISTKEY(hierarchy_id) and sorted using COMPOUND SORTKEY(hierarchy_id, entity_timestampt).
Over this table we need to generate multiple reports; some of them are fixed to the deepest level of the hierarchy, while others will be filtered by higher parts and group the results by the lower ones. However, the first layer of the hierarchy (the A dimension) is what defines our security model: users will never have access to A dimensions other than the one they belong to (this is our tenant information).
The current design proved to be useful for that matter when we were prototyping the reports in plain SQL, as we could do things like this for the deepest-level queries:
WHERE
entity.hierarchy_id = 'fixed_a_id-fixed_b_id-fixed_c_id' AND
entity.entity_timestampt BETWEEN 'start_date' AND 'end_date'
Or like this for filtering by other points of the hierarchy:
WHERE
entity.hierarchy_id LIKE 'fixed_a_id-%' AND
entity.entity_timestampt BETWEEN 'start_date' AND 'end_date'
Which would still take advantage of the DISTKEY & SORTKEY setup, even though we are filtering just for a partial path of the hierarchy.
Now we want to use QuickSight for creating and sharing those reports using the embedding capabilities. But we haven't found a way to filter the data of the analysis as we want.
We tried to use the RLS by tags for anonymous users, but we have found two problems:
How to inject the A.a_id part of the filter into the API that generates the embedding URL in a secure way (i.e. so that users can't change it), while allowing users to configure the other parts of the hierarchy, and finally combining those independent pieces in the filter without needing to generate a new URL each time users change the other parts.
(We may be able to live with this limitation, however.)
How to do partial filters, i.e. the ones that looked like LIKE 'fixed_a_id-fixed_b_id-%', since it seems RLS always uses an equals condition.
Is there any way to make QuickSight work as we want with our current table design? Or would we need to change the design?
For the latter, we have thought about keeping the three dimension ids as separate columns; that way we could add RLS for the A.a_id column and use parameters for the other ones. The problem would be for the reports that group by lower parts of the hierarchy: it is not clear how we could define the DISTKEY and SORTKEY so that the queries are properly optimized.
COMPOUND SORTKEY(hierarchy_id, entity_timestampt)
You are aware you are sorting on only the first eight bytes of hierarchy_id, and that the ability of the zone map to differentiate between blocks is based purely on the first eight bytes of the string?
I suspect you would have done a lot better to have had three separate columns.
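Purely as a hedged sketch of what "three separate columns" could look like (Redshift; the names and types are illustrative, and whether a_id is a sensible DISTKEY depends on how many distinct A values you actually have):
CREATE TABLE entity (
    a_id              varchar(36) NOT NULL,   -- tenant / security dimension
    b_id              varchar(36) NOT NULL,
    c_id              varchar(36) NOT NULL,
    entity_timestampt timestamp   NOT NULL
    -- ... remaining columns ...
)
DISTSTYLE KEY
DISTKEY (a_id)
COMPOUND SORTKEY (a_id, b_id, c_id, entity_timestampt);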
Which would still take advantage of the DISTKEY & SORTKEY setup, even though we are filtering just for a partial path of the hierarchy.
I may be wrong - I would need to check - but I think if you use operators of any kind (such as functions, or LIKE, or even addition or subtraction) on a sortkey, the zone map does not operate and you read all blocks.
Also in your case, it may be - I've not tried using it yet - if you have AQUA enabled, because you're using LIKE, your entire query is being processed by AQUA. The performance consequences of this, positive and/or negative, are completely unknown to me.
Have you been using the system tables to verify your expectations of what is going on with your queries when it comes to zone map use?
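For instance, a rough sketch of the kind of check I mean, using the STL_SCAN system table (the columns worth looking at are debatable, and 12345 is a placeholder query id):
SELECT query, segment, step, perm_table_name,
       rows_pre_filter,   -- rows the scan read before filtering
       rows               -- rows remaining after the scan's filters
FROM stl_scan
WHERE query = 12345
  AND perm_table_name NOT LIKE 'Internal%'
ORDER BY segment, step;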
the problem would be for the reports that group by lower parts of the hierarchy, it is not clear how we could define the DISTKEY and SORTKEY so that the queries are properly optimized.
You are now facing the fundamental nature of sorted column-store; the sorting you choose defines the queries you can issue and so also defines the queries you cannot issue.
You either alter your data design, in some way, so what you want becomes possible, or you can duplicate the table in question where each duplicate has different sorting orders.
The first is an art, the second has obvious costs.
As an aside, although I've never used Quicksight, my experience with all SQL generators has been that they are completely oblivious to sorting and so the SQL they issue cannot be used on Big Data (as sorting is the method by which Big Data can be handled in a timely manner).
If you do not have Big Data, you'll be fine, but the question then is why are you using Redshift?
If you do have Big Data, the only solution I know of is to create a single aggregate table per dashboard, about 100k rows, and have the given dashboard use and only use that one table. The dashboard should normally simply read the entire table, which is fine, and then you avoid the nightmare SQL it normally will produce.
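As a hedged illustration only (the grain, names and refresh strategy are all assumptions), such a per-dashboard aggregate might be built along these lines:
-- rebuilt on a schedule, before the dashboard is used
CREATE TABLE dashboard_entity_daily AS
SELECT hierarchy_id,
       DATE_TRUNC('day', entity_timestampt) AS entity_day,
       COUNT(*)                             AS entity_count
FROM entity
GROUP BY hierarchy_id, DATE_TRUNC('day', entity_timestampt);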

Should I create an index on a column with repetitive values and WHERE ... IN lookups

I have a materialized view (which is very much a table) where I need to make WHERE ... IN kinds of queries.
The column I want to query (say view_id) definitely has repetitions (each value appears 15-20 times).
The WHERE ... IN queries would also be very large, i.e. they would contain a lot of view_id values to query.
Should I go ahead and create an index on this column?
Will it give me some performance improvements?
I have another column which would help me create a multi-column (unique) index. Would this be a better option?
With questions such as these on performance, there is no substitute for testing it with your exact case. There's little harm in trying it out (even on a production system, but utilize a test system if you can!), other than perhaps slowing performance until you undo what you did. Postgres makes this kind of tinkering safe.
@Tim Biegeleisen's first comment is spot on: with your setup, your cardinality is reduced, but that doesn't mean it's not a win.
In short, try it and see. There is no better answer you will get than what your own dataset and access patterns will give you.
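A rough sketch of that experiment in Postgres (the view name is a placeholder; view_id is the column from the question):
CREATE INDEX idx_matview_view_id ON my_materialized_view (view_id);

-- compare the plan and timing with and without the index
EXPLAIN (ANALYZE, BUFFERS)
SELECT *
FROM my_materialized_view
WHERE view_id IN (101, 102, 103);   -- substitute a realistically long IN list

-- undo it if it doesn't help
DROP INDEX idx_matview_view_id;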

Can Postgres tables be optimized by the column order?

I recently had to propose a set of new Postgres tables to our DB team that will be used by an application I am writing. They failed the design because my table had fields that were listed like so:
my_table
my_table_id : PRIMARY KEY, AUTO INCREMENT INT
some_other_table_id, FOREIGN KEY INT
some_text : CHARACTER VARYING(100)
some_flag : BOOLEAN
They said that the table would not be optimal because some_text appears before some_flag, and since CHARACTER VARYING fields search slower than BOOLEANs, when doing a table scan, it is faster to have a table structure whose columns are sequenced from greatest precision to least precision; so, like this:
my_table
my_table_id : PRIMARY KEY, AUTO INCREMENT INT
some_other_table_id, FOREIGN KEY INT
some_flag : BOOLEAN
some_text : CHARACTER VARYING(100)
These DBAs come from a Sybase background and have only recently switched over as our Postgres DBAs. I am thinking that this is perhaps a Sybase optimization that doesn't apply to Postgres (I would think Postgres is smart enough to somehow not care about column sequence).
Either way, I can't find any Postgres documentation that confirms or denies this. Looking for any battle-worn Postgres DBAs to weigh in as to whether this is a valid or bogus (or conditionally valid!) claim.
Speaking from my experience with Oracle on similar issues, where there was a big change in behaviour between versions 9 and 10 (or 8 and 9) if memory serves (due to CPU overhead in finding column data within a row), I don't believe you should rely on documented behaviour for an issue like this when a practical experiment would be fairly straightforward and conclusive.
So I'd suggest that you create a test case for this. Create two tables with exactly the same data and the columns in a different order, and run repeated and varied tests. Try to implement the entire test as a single script that can be run on a development or test system and which tells you the answer. Maybe the DBAs are right, and you can say, "Hey, confirmed your thoughts on this, thanks a lot", or you might find no measurable and significant difference. In the latter case you can hand the entire test to the DBAs, and explain how you can't reproduce the problem. Let them run the tests.
Either way, someone is going to learn something, and you've got a test case you can apply to future (or past) versions.
Lastly, post here on what you found ;)
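Something along these lines would do as the test script (a sketch only; a million generated rows and md5() text are arbitrary stand-ins for your real data):
CREATE TABLE t_text_first (
    my_table_id          bigserial PRIMARY KEY,
    some_other_table_id  int NOT NULL,
    some_text            varchar(100),
    some_flag            boolean
);

CREATE TABLE t_flag_first (
    my_table_id          bigserial PRIMARY KEY,
    some_other_table_id  int NOT NULL,
    some_flag            boolean,
    some_text            varchar(100)
);

INSERT INTO t_text_first (some_other_table_id, some_text, some_flag)
SELECT g, md5(g::text), g % 2 = 0 FROM generate_series(1, 1000000) g;

INSERT INTO t_flag_first (some_other_table_id, some_flag, some_text)
SELECT g, g % 2 = 0, md5(g::text) FROM generate_series(1, 1000000) g;

-- run several times each and compare: the flag sits after the text in one table, before it in the other
EXPLAIN (ANALYZE, BUFFERS) SELECT count(*) FROM t_text_first WHERE some_flag;
EXPLAIN (ANALYZE, BUFFERS) SELECT count(*) FROM t_flag_first WHERE some_flag;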
What your DBAs are probably referring to is the access strategy for "getting to" the boolean value in a given tuple (row).
In THEIR proposed design, a system can "get to" that value by looking at byte 9.
In YOUR proposed design, the system must first inspect the LENGTH field of all varying-length columns [that come before your boolean column], before it can know the byte offset where the boolean value can be found. That is ALWAYS going to be slower than "their" way.
Their consideration is one of PHYSICAL design (and it is a correct one). Damir's answer is also correct, but it is an answer from the perspective of LOGICAL design.
If the remark by your DBAs is really intended as "criticism of a 'bad' design", then it deserves to be pointed out to them that LOGICAL design is your job (and column order doesn't matter at that level), and PHYSICAL design is their job. And if they expect you to do the PHYSICAL design (their job) as well, then there is no longer any reason for the boss to keep them employed.
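If you want to see which columns Postgres treats as fixed-length versus variable-length (the bit of physical layout this answer is talking about), the catalog already exposes it; attlen = -1 marks a variable-length column:
SELECT attnum, attname,
       format_type(atttypid, atttypmod) AS data_type,
       attlen,     -- fixed byte width, or -1 for variable-length (e.g. varchar)
       attalign
FROM pg_attribute
WHERE attrelid = 'my_table'::regclass   -- my_table is the table from the question
  AND attnum > 0
  AND NOT attisdropped
ORDER BY attnum;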
From a database design point, there is no difference between your design and what your DBA suggests -- your application should not care. In relational databases (logically) there is no such thing as order of columns; actually if order of columns matters (logically) it failed 1NF.
So, simply pass all create table scripts to your DBAs and let them implement (reorder columns) in any way they feel it is optimal on the physical level. You simply continue with the application.
Database design can not fail on order of columns -- it is simply not part of the design process.
Future users of large data banks must be protected from having to know
how the data is organized in the machine ...
... the problems treated here are those of data independence -- the
independence of application programs and terminal activities from
growth in data types and changes ...
E.F. Codd ~ 1979
Changes to the physical level ... must not require a change to an
application ...
Rule 8: Physical data independence (E.F. Codd ~ 1985)
So here we are -- 33 years later ...