Privilege Management - privileges

I have a weird problem. Suppose there is an 8-bit string that represents the privileges of some process. I have four types of privileges: A, B, C, and D. Under each category there may be sub-privileges as well. How can I represent this in just 8 bits?

If they are disjoint, just give each privilege its own unique ID and hope they never grow beyond 256. Use a lookup table to know what's what and which supercategories each privilege belongs to.
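If that lookup table lives in a database, a minimal sketch could look like this (table and column names are just placeholders):

-- The 8-bit field stores a single privilege id (0..255); this table maps
-- each id back to its category (A/B/C/D) and sub-privilege name.
CREATE TABLE privilege_lookup (
    privilege_id  smallint PRIMARY KEY CHECK (privilege_id BETWEEN 0 AND 255),
    category      char(1)  NOT NULL CHECK (category IN ('A', 'B', 'C', 'D')),
    sub_privilege text     NOT NULL
);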

Related

Anything wrong with having MANY sequences in Postgres?

I am developing an application using a virtual private database pattern in postgres.
So every user gets an id, and every row belonging to that user holds this id to keep it separated from the others. This id is also part of the primary key. In addition, every row has to have an id that is unique within the scope of the user; this id forms the other part of the primary key.
If we have to scale this across multiple servers, we can also append a third column to the PK identifying the shard the id was generated on.
My question now is how to create per-user unique ids. I came up with some options, but I am not sure about all of their implications. The two solutions that seem most promising to me are:
Creating one sequence per user:
This can be done automatically, using a trigger, every time a user is created. It is certainly transaction safe, and I think it should be quite OK in terms of performance.
What I am worried about is that this has to work for a lot of users (100k+), and I don't know how Postgres will deal with 100k+ sequences. I tried to find out how sequences are implemented, but without luck.
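A rough sketch of what such a trigger could look like (table, column and function names are made up; assumes PL/pgSQL):

CREATE OR REPLACE FUNCTION create_user_sequence() RETURNS trigger AS $$
BEGIN
    -- one dedicated sequence per newly created user
    EXECUTE format('CREATE SEQUENCE user_seq_%s', NEW.user_id);
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER trg_create_user_sequence
AFTER INSERT ON users
FOR EACH ROW EXECUTE PROCEDURE create_user_sequence();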
A counter in the user table:
Keep all users in a table with a field holding the latest id given out for this user.
When a user starts a transaction, I can lock that user's row in the user table and create a temporary sequence with the latest id from the user table as its starting value. This sequence can then be used to supply ids for new entries.
Before the transaction ends, the current value has to be written back to the user table and the lock has to be released.
If another transaction from the same user tries to insert rows concurrently, it will stall until the first transaction releases its lock on the user table.
This way I do not need thousands of sequences, and I don't think there will be frequent concurrent accesses from one user (the application has OLTP character, so there will not be long-lasting transactions). Even if that happens, it will just stall for about a second and not hurt anything.
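Boiled down, the locking part of the counter approach could look something like this (a sketch with assumed table and column names; the temp sequence is left out):

-- Takes a row lock on the user's row, so concurrent transactions of the
-- same user serialize here; RETURNING hands back the newly reserved id.
UPDATE users
SET    last_id = last_id + 1
WHERE  user_id = $1          -- the current user's id (bind parameter)
RETURNING last_id;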
The second part of my question is whether I should just use two columns (or maybe three if the shard_id comes into play) and make them a composite PK, or whether I should pack them together into one column. I think handling will be way easier with separate columns, but what does performance look like? Let's assume both values are 32-bit integers: is it better to have two int columns in an index or one bigint column?
thx for all answers,
alex
I do not think sequences would be scalable to the level you want (100k sequences). A sequence is implemented as a relation with just one row in it.
Each sequence will appear in the system catalog (pg_class), which also contains all of the tables, views, etc. Having 100k+ rows there is sure to slow the system down dramatically. The amount of memory required to hold all of the data structures associated with these sequence relations would also be large.
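You can verify this yourself, since every sequence gets its own entry in pg_class:

-- Sequences show up in the catalog with relkind = 'S':
SELECT count(*) FROM pg_class WHERE relkind = 'S';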
Your second idea might be more practical and, combined with temporary sequences, more scalable.
For your second question, I don't think a composite key would be any worse than a single column key, so I would go with whatever matches your functional needs.

Does specifying the schema name for tables affect performance?

Disclaimer: this is not for optimization, just out of curiosity.
I'm wondering if this:
SET search_path TO myscheme; -- obviously this is done once per connection
SELECT foo, bar FROM table1 WHERE [..clauses..]
is somehow faster / slower than
SELECT foo, bar FROM myscheme.table1 WHERE [..clauses..]
or if there are some other implications that could suggest specifying the schema (or not) in every query.
I've done some (admittedly very few) tests and I can't see any difference in terms of speed.
The second one is faster, but barely. SET is extremely cheap.
Generally, a schema-qualified table name has the potential to be slightly faster, since the query to the system catalog can be more specific. But you won't be able to measure that. It's just irrelevant, performance-wise.
Setting the search_path does have implications for security and convenience though. It is generally a good idea.
For instance, consider the advice in the manual on "Writing SECURITY DEFINER Functions Safely".
Are you aware that there are many ways to set the search_path?
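For example (a non-exhaustive sketch; the role, database and function names are placeholders):

SET search_path TO myscheme, public;                        -- per session
SET LOCAL search_path TO myscheme, public;                  -- per transaction
ALTER ROLE some_role SET search_path = myscheme, public;    -- per role
ALTER DATABASE some_db SET search_path = myscheme, public;  -- per database
ALTER FUNCTION f() SET search_path = myscheme, public;      -- per function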
For all practical purposes this shouldn't make any measurable difference.
If there is a measurable difference, I would guess that the system has
a search_path with very many schema entries (several hundred) and/or
an extreme number of relations (so many that the relation catalog does not fit into memory under the normal workload).
In that case you definitely have other problems than qualifying your schema names.

Obfuscating an integer column

I could have posted this to the SQL forums, but I am rather looking for an idea or best practice; that is why I have chosen this forum.
I have an integer column in SQL called Payroll Number, and it is unique per employee. We will be pulling employee information from this system via SQL views and putting it into another system, but we don't want the payroll numbers to appear there as they do in this system. Therefore, we need to transform those payroll numbers in SQL so that the views serve obfuscated, user-friendly numbers.
I spent quite a lot of time reading about encryption techniques in SQL, but they use complex algorithms to hash data and produce binary output. What I am after is less complex: obfuscating a number rather than hashing it.
For instance, a payroll number is 6 digits long (e.g. 145674); I want to be able to generate a 9-10 digit integer from it and use that in the other systems.
I had a look at XOR'ing, but I need something more robust and elegant.
How do you guys do these things? Do you write your own simple algorithm to obfuscate your integers? I need to do this at the SQL level; what do you suggest?
Thanks for your help
Regards
It is not hard to hash a value, but it is hard to hash a value, be sure of uniqueness, and have the result be a number. However, I do have a cross-database solution.
Make a new table with two columns: id (auto-generated from a random starting point) and payroll id.
Every time you need to use a user externally, insert them into this table. This will give you a unique local id you can use (internally and externally), but it is not the payroll id.
In fact, if you already have an internal id (e.g. the user id from the user table), just use that. There is no advantage to hashing this value if it is never decoded. However, you can use the auto-generated id as your random unique "hash" -- it has all the properties you need.
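A sketch of that mapping table (SQL Server flavour as one example; table and column names are placeholders):

-- The surrogate id starts at a non-obvious 9-digit seed so it does not
-- resemble a payroll number.
CREATE TABLE payroll_alias (
    alias_id   int IDENTITY(100000001, 1) PRIMARY KEY,
    payroll_id int NOT NULL UNIQUE
);

-- Issue an alias the first time a payroll number is exposed externally:
INSERT INTO payroll_alias (payroll_id) VALUES (145674);

-- Views for the other system then expose alias_id instead of payroll_id.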

Can Postgres tables be optimized by the column order?

I recently had to propose a set of new Postgres tables to our DB team that will be used by an application I am writing. They failed the design because my table had fields that were listed like so:
my_table
my_table_id : PRIMARY KEY, AUTO INCREMENT INT
some_other_table_id : FOREIGN KEY INT
some_text : CHARACTER VARYING(100)
some_flag : BOOLEAN
They said that the table would not be optimal because some_text appears before some_flag, and since CHARACTER VARYING fields are slower to search than BOOLEANs, when doing a table scan it is faster to have a table structure whose columns are sequenced from greatest precision to least precision, like this:
my_table
my_table_id : PRIMARY KEY, AUTO INCREMENT INT
some_other_table_id : FOREIGN KEY INT
some_flag : BOOLEAN
some_text : CHARACTER VARYING(100)
These DBAs come from a Sybase background and have only recently switched over as our Postgres DBAs. I am thinking that this is perhaps a Sybase optimization that doesn't apply to Postgres (I would think Postgres is smart enough not to care about column order).
Either way, I can't find any Postgres documentation that confirms or denies this. I'm looking for any battle-worn Postgres DBAs to weigh in on whether this is a valid or bogus (or conditionally valid!) claim.
Speaking from my experience with Oracle on similar issues, where (if memory serves) there was a big change in behaviour between versions 9 and 10, or 8 and 9, due to the CPU overhead of finding column data within a row, I don't believe you should rely on documented behaviour for an issue like this when a practical experiment would be fairly straightforward and conclusive.
So I'd suggest that you create a test case for this. Create two tables with exactly the same data and the columns in a different order, and run repeated and varied tests. Try to implement the entire test as a single script that can be run on a development or test system and that tells you the answer. Maybe the DBAs are right, and you can say, "Hey, confirmed your thoughts on this, thanks a lot", or you might find no measurable, significant difference. In the latter case you can hand the entire test to the DBAs and explain how you can't reproduce the problem. Let them run the tests.
Either way, someone is going to learn something, and you've got a test case you can apply to future (or past) versions.
Lastly, post here on what you found ;)
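A sketch of such a test script, with assumed names and a generated data set:

-- Two tables, identical data, columns in a different order.
CREATE TABLE t_text_first (id serial, other_id int, some_text varchar(100), some_flag boolean);
CREATE TABLE t_flag_first (id serial, other_id int, some_flag boolean, some_text varchar(100));

INSERT INTO t_text_first (other_id, some_text, some_flag)
SELECT g, md5(g::text), g % 2 = 0 FROM generate_series(1, 1000000) g;

INSERT INTO t_flag_first (other_id, some_flag, some_text)
SELECT g, g % 2 = 0, md5(g::text) FROM generate_series(1, 1000000) g;

-- Time identical scans against both and compare:
EXPLAIN ANALYZE SELECT count(*) FROM t_text_first WHERE some_flag;
EXPLAIN ANALYZE SELECT count(*) FROM t_flag_first WHERE some_flag;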
What your DBAs are probably referring to is the access strategy for "getting to" the boolean value in a given tuple (row).
In THEIR proposed design, a system can "get to" that value by looking at byte 9.
In YOUR proposed design, the system must first inspect the LENGTH field of all varying-length columns [that come before your boolean column], before it can know the byte offset where the boolean value can be found. That is ALWAYS going to be slower than "their" way.
Their consideration is one of PHYSICAL design (and it is a correct one). Damir's answer is also correct, but it is an answer from the perspective of LOGICAL design.
If the remark by your DBAs is really intended as criticism of a "bad" design, then it should be pointed out to them that LOGICAL design is your job (and column order doesn't matter at that level), while PHYSICAL design is their job. And if they expect you to do the PHYSICAL design (their job) as well, then there is no longer any reason for the boss to keep them employed.
From a database design point of view, there is no difference between your design and what your DBAs suggest -- your application should not care. In relational databases there is (logically) no such thing as column order; in fact, if the order of columns matters logically, the design fails 1NF.
So simply pass all CREATE TABLE scripts to your DBAs and let them implement (reorder columns) in whatever way they feel is optimal at the physical level. You simply continue with the application.
A database design cannot fail on column order -- it is simply not part of the design process.
Future users of large data banks must be protected from having to know
how the data is organized in the machine ...
... the problems treated here are those of data independence -- the
independence of application programs and terminal activities from
growth in data types and changes ...
E.F. Codd ~ 1979
Changes to the physical level ... must not require a change to an
application ...
Rule 8: Physical data independence (E.F. Codd ~ 1985)
So here we are -- 33 years later ...

Is it a good idea to store attributes in an integer column and perform bitwise operations to retrieve them?

In a recent CODE Magazine article, John Petersen shows how to use bitwise operators in TSQL in order to store a list of attributes in one column of a db table.
Article here.
In his example he's using one integer column to hold how a customer wants to be contacted (email, phone, fax, mail). The query for pulling out customers who want to be contacted by email would look like this:
SELECT C.*
FROM dbo.Customers C
,(SELECT 1 AS donotcontact
,2 AS email
,4 AS phone
,8 AS fax
,16 AS mail) AS contacttypes
WHERE ( C.contactmethods & contacttypes.email <> 0 )
AND ( C.contactmethods & contacttypes.donotcontact = 0 )
Afterwards he shows how to encapsulate this in to a table function.
My questions are these:
1. Is this a good idea? Any drawbacks? What problems might I run into using this approach of storing attributes versus storing them in two extra tables (Customer_ContactType, ContactType) and joining with the Customer table? I guess one problem might be if my attribute list gets too long: if the column is an integer, then my attribute list can hold at most 32 entries.
2. What is the performance of doing these bitwise operations in queries as you move into the tens of thousands of records? I'm guessing that it would not be any more expensive than any other comparison operation.
If you wish to filter your query based on the value of any of those bit values, then yes this is a very bad idea, and is likely to cause performance problems.
Besides, there simply isn't any need - just use the bit data type.
The reason why using bitwise operators in this way is a bad idea is that SQL Server maintains statistics on various columns in order to improve query performance. For example, if you have an email column, SQL Server can tell roughly what percentage of the values in that email column are true and select an appropriate execution plan based on that knowledge.
If, however, you have a flags column, SQL Server will have absolutely no idea how many records in a table match flags & 2 (email) - it doesn't maintain that sort of statistics on expressions. Without this information available to it, SQL Server is far more likely to choose a poor execution plan.
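A sketch of the bit-column alternative in T-SQL (the column names here are assumptions):

-- Individual bit columns: the optimizer keeps statistics per column,
-- and simple equality filters stay sargable.
ALTER TABLE dbo.Customers ADD
    do_not_contact bit NOT NULL DEFAULT 0,
    contact_email  bit NOT NULL DEFAULT 0,
    contact_phone  bit NOT NULL DEFAULT 0,
    contact_fax    bit NOT NULL DEFAULT 0,
    contact_mail   bit NOT NULL DEFAULT 0;

SELECT C.*
FROM dbo.Customers C
WHERE C.contact_email = 1
  AND C.do_not_contact = 0;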
And don't forget the maintenance problems this technique would cause. Since it is not standard, new developers will probably be confused by the code and not know how to adjust it properly. Errors will abound and be hard to find. It is also hard to write reporting-type queries against. This sort of trick is almost never a good idea from a maintenance perspective: it might look cool and elegant, but in practice it is just clunky and hard to work with over time.
One major performance implication is that there will not be a lookup operator for indexes that works in this way. If you said WHERE contact_email=1 there might be an index on that column and the query would use it; if you said WHERE (contact_flags & 1)=1 then it wouldn't.
** One column stores one piece of information only - it's the database way. **
(Didn't see it earlier - Kragen's answer also makes this point, well before mine.)
In opposite order: The best way to know what your performance is going to be is to profile.
This is, most definitely, an "It Depends" question. I personally would never store such things as integers. For one thing, as you mention, there's the conversion factor. For another, at some point you, some other DBA, or someone else is going to have to type:
Select CustomerName, CustomerAddress, ContactMethods, [etc]
From Customer
Where CustomerId = xxxxx
because some data has become corrupt, or because someone entered the wrong data, or something. Having to do a join and/or a function call just to get at that basic information is way more trouble than it's worth, IMO.
Others, however, will probably point to the diversity of your options, or the ability to store multiple value types (email, vs phone, vs fax, whatever) all in the same column, or some other advantage to this approach. So you would really need to look at the problem you're attempting to solve and determine which approach is the best fit.