Background: I am looking at consolidating security roles for an application. There are 20 roles that can apply to one or more of five access levels (like district, county, state). Most users have multiple roles for each access level, so I wanted to get statistics on what the most common groupings are for users by access level.
EDIT: The business users were confusing the job roles of the people who are entered into the system with the security roles of the people who use the system. There was a direct mapping between the job roles and the security roles. What the business owners are finding is that they don't want to manage security for the number of roles a user can possibly be assigned.
I created the equivalent of a flags enum for the roles (Id is the primary key of the role, and Flag is my flag value):
DECLARE @flags table (Id int, Flag int);
insert into @flags
values (1 , 1),
(2 , 2),
(4 , 4),
(5 , 8),
(6 , 16),
(7 , 32),
(8 , 64),
(9 , 128),
(10 , 256),
(11 , 512),
(12 , 1024),
(13 , 2048),
(14 , 4096),
(15 , 8192),
(16 , 16384),
(17 , 32768),
(18 , 65536),
(19 , 131072),
(20 , 262144),
(21 , 524288)
I've queried out the SUM of user roles for each of the users in the system grouped by access level and would like to query out what the most common role groupings are for the user population so that I can ensure that I don't make the new role set too general.
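The aggregation itself looks roughly like this (a sketch; UserRoleAssignments is a stand-in name for the real user-to-role mapping table, which I'm not reproducing here):
SELECT a.UserName,
       SUM(f.Flag) AS Roles,   -- bitmask identifying the exact set of roles
       a.AccessLevel
FROM UserRoleAssignments a
JOIN @flags f ON f.Id = a.RoleId
GROUP BY a.UserName, a.AccessLevel;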
UserName        Roles    AccessLevel
User1@test.edu  8192     County
User1@test.edu  262400   District
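The kind of frequency query I have in mind is something like the following sketch (UserRoleSums stands in for the aggregated result above, e.g. dumped into a temp table):
SELECT AccessLevel,
       Roles,                  -- the bitmask identifying one role grouping
       COUNT(*) AS UserCount   -- how many users have exactly this grouping
FROM UserRoleSums
GROUP BY AccessLevel, Roles
ORDER BY AccessLevel, UserCount DESC;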
Any pointers in the right direction would be good. I'm not very versed in statistics, so if the whole method I'm using is bad, let me know.
I am trying to execute a query with an ORDER BY clause and a LIMIT clause for performance. Consider the following schema.
ONE
(id, name)
(1 , a)
(2 , b)
(5 , c)
TWO
(id, name)
(3 , d)
(4 , e)
(5 , f)
I want to be able to get a list of people from tables one and two ordered by ID.
The current query I have is as follows.
WITH combined AS (
(SELECT * FROM one ORDER BY id DESC)
UNION ALL
(SELECT * FROM two ORDER BY id DESC)
)
SELECT * FROM combined ORDER BY id LIMIT 5
The output will be:
(id, name)
(1 , a)
(2 , b)
(3 , d)
(4 , e)
(5 , c)
You'll notice that the last row, "c" or "f", will change based on the order of the UNION (one UNION two versus two UNION one). That's not important, as I only care about ordering by ID.
Unfortunately, this query does a full scan of both tables because of the ORDER BY on "combined", and my tables one and two both have billions of rows.
I am looking for a query that can search both tables simultaneously, if possible. Meaning that rather than looking through all of "one" for the entries I need, it would read both tables sorted by ID and repeatedly take the minimum: if the next ID in one table is lower than the next ID in the other, the query keeps reading from that table until its next ID is higher than or equal to the other table's, then switches back.
The correct order of reading the tables, given one UNION two, would be a, b, d, e, c/f.
Do you just mean this?
WITH combined AS (
(SELECT * FROM one ORDER BY id LIMIT 5)
UNION ALL
(SELECT * FROM two ORDER BY id LIMIT 5)
)
SELECT * FROM combined ORDER BY id LIMIT 5
That will select the 5 "lowest id" rows from each table (which is the minimum you need to guarantee 5 output rows) and then pick the 5 lowest of those.
Thanks to a_horse_with_no_name's comment on Richard Huxton's answer regarding adding an index, the query runs considerably faster, from indeterminate to under one minute.
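For reference, the index being suggested amounts to a plain b-tree index on id for each table (the index names below are my own):
CREATE INDEX one_id_idx ON one (id);
CREATE INDEX two_id_idx ON two (id);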
In my case, the query was still too slow, and I came across the following solution.
Consider using results from one table to limit results from another table. The following solution, in combination with indexing by id, worked for my tables with billions of rows, but operates on the assumption that table "one" is faster than table "two" to finish the query.
WITH first AS (SELECT * FROM one ORDER BY id LIMIT 5),
     filter AS (SELECT max(id) AS id FROM first),
     second AS (SELECT * FROM two
                WHERE id <= (SELECT filter.id FROM filter)
                ORDER BY id LIMIT 5),
     combined AS (
         (SELECT * FROM first ORDER BY id LIMIT 5)
         UNION ALL
         (SELECT * FROM second ORDER BY id LIMIT 5)
     )
SELECT * FROM combined ORDER BY id LIMIT 5
By using the maximum ID from the first query as an upper bound, I can limit the scope that the database has to scan to complete the second query.
I have traffic logs from my site.
I want to sample traffic from 10% of the user base.
But each record in the database is a visit, and each customer can have many visits. Getting only 10% of traffic would be incorrect, because 20% of users may generate 80% of traffic.
Table structure is simple
user_id, page
How do I get traffic from a random 10% of customers without too many nested subqueries?
If using MySQL you can try:
/* Calculate 10% of the users, rounding up to account for values below 1 */
SET @limit = CEIL((SELECT COUNT(DISTINCT user_id) FROM TRAFFIC) / 10);
/* Prepare a statement for getting the traffic */
PREPARE STMT FROM 'SELECT *
FROM TRAFFIC T
INNER JOIN (
    SELECT DISTINCT user_id
    FROM TRAFFIC
    LIMIT ?
) U
ON T.user_id = U.user_id';
/* Execute the statement using the pre-computed limit. */
EXECUTE STMT USING @limit;
Here's a similar implementation in PostgreSQL (based on feedback):
SELECT *
FROM TRAFFIC T
INNER JOIN (
SELECT DISTINCT user_id
FROM TRAFFIC
LIMIT CEIL((SELECT COUNT(DISTINCT user_id) FROM TRAFFIC) / 10.0)
) U
ON T.user_id = U.user_id;
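One caveat (this is my reading of the requirement, not something stated above): SELECT DISTINCT user_id ... LIMIT without an ORDER BY picks an arbitrary set of users, not a random one. If the sample really needs to be random, the PostgreSQL version could shuffle the users first, roughly like this:
SELECT T.*
FROM TRAFFIC T
INNER JOIN (
    SELECT user_id
    FROM (SELECT DISTINCT user_id FROM TRAFFIC) d
    ORDER BY random()   -- shuffle the distinct users before taking 10%
    LIMIT (SELECT CEIL(COUNT(DISTINCT user_id) / 10.0)::int FROM TRAFFIC)
) U
ON T.user_id = U.user_id;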
If your users are stored in a different table (and the log table's user_id is a foreign key to that) you can use the tablesample option to get 10% of the users in a sub-select:
select *
from the_table
where user_id in (select id
from users
tablesample system (10));
If you don't have such a table, Jake's query (without the prepared statement) is probably the way to go.
I'm using SQL Server 2008 R2 on my development machine (not a server box).
I have a table with 12.5 million records. It has 126 columns, half of which are int. Most columns in most rows are NULL. I've also tested with an EAV design which seems 3-4 times faster to return the same records (but that means pivoting data to make it presentable in a table).
I have a website that paginates the data. When the user tries to go to the last page of records (last 25 records), the resulting query is something like this:
select * from (
select
A.Id, part_id as PartObjectId,
Year_formatted 'year', Make_formatted 'Make',
Model_formatted 'Model',
row_number() over ( order by A.id ) as RowNum
FROM vehicles A
) as innerQuery where innerQuery.RowNum between 775176 and 775200
... but this takes nearly 3 minutes to run. That seems excessive? Is there a better way to structure this query? In the browser front-end I'm using jqGrid to display the data. The user can navigate to the next, previous, first, or last page. They can also filter and order data (example: show all records whose Make is "Bugatti").
vehicles.Id is int and is the primary key (clustered ASC). part_id is int, Make and Model are varchar(100) and typically only contain 20 - 30 characters.
Table vehicles is updated ~100 times per day in individual transactions, and 20 - 30 users use the webpage to view, search, and edit/add vehicles 8 hours/day. It gets read from and updated a lot.
Would it be wise to shard the vehicles table into multiple tables only containing say 3 million records each? Would that have much impact on performance?
I see lots of videos and websites talking about people having tables with 100+ million rows that are read from and updated often without issue.
Note that the performance issues I observe are on my own development computer. The database has a dedicated 16GB of RAM. I'm not using SSD or even SCSI for that matter. So I know hardware would help, but 3 minutes to retrieve the last 25 records seems a bit excessive no?
Though I'm running these tests on SQL Server 2008 R2, I could also use 2012 if there is much to be gained from doing so.
Yes, there is a better way, even on older releases of MS SQL Server, but it is involved. First, this process should be done in a stored procedure. The stored procedure should take, as two of its input parameters, the page requested (@page) and the page size (the number of records per page, @pgSiz).
In the stored procedure,
Create a table variable and put into it a sorted list of the integer primary keys for all the records, with a rowNo column that is itself the indexed, integer primary key of the table variable:
Declare @PKs table
    (rowNo int identity(1,1) primary key not null,
     vehicleId int not null);

Insert Into @PKs (vehicleId)
Select Id from Vehicles
Order By --[Here put the sort criteria you want the pages sorted by]
         --[Try to only include columns that are in an index]
Then, based on which page and page size the user requested (@page, @pgSiz), the stored proc selects the actual data for that page by joining to this table variable:
Select [The data columns you want]
From @PKs p join Vehicles v
  on v.Id = p.vehicleId
Where p.rowNo between @page*@pgSiz + 1 and (@page+1)*@pgSiz
Order by p.rowNo -- if you want the page of records sorted on the server
assuming @page is 0-based. Also, the stored proc will need some input-argument validation to ensure that the @page and @pgSiz values are reasonable (so it does not page past the end of the records).
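Putting the pieces together, a sketch of the whole procedure might look like this (the procedure name and validation limits are my own; the column list follows the query in the question):
CREATE PROCEDURE GetVehiclePage
    @page  int,   -- 0-based page number
    @pgSiz int    -- records per page
AS
BEGIN
    -- basic argument validation (extend as needed, e.g. to reject pages past the end)
    IF @page < 0 OR @pgSiz < 1 OR @pgSiz > 1000
        RETURN;

    DECLARE @PKs table
        (rowNo int identity(1,1) primary key not null,
         vehicleId int not null);

    INSERT INTO @PKs (vehicleId)
    SELECT Id
    FROM Vehicles
    ORDER BY Id;   -- replace with the sort order the pages should use

    SELECT v.Id, v.part_id AS PartObjectId,
           v.Year_formatted AS [year], v.Make_formatted AS Make,
           v.Model_formatted AS Model
    FROM @PKs p
    JOIN Vehicles v ON v.Id = p.vehicleId
    WHERE p.rowNo BETWEEN @page * @pgSiz + 1 AND (@page + 1) * @pgSiz
    ORDER BY p.rowNo;
END;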
I'm a member of an MLM network, and I'm also a developer. My question is regarding the database structure to build MLM software with infinite levels. Example:
Person 1 (6,000 people in his network, but only 4 directly linked to him)
How do I store that data and query how many points his network produces?
I could possibly do it using a many-to-many relationship, but once we have a lot of users and a huge network, it costs a lot to query and loop through those records.
In any database, if each member of the "tree" has the same properties, it's best to use a self-referencing table, especially if each node has one and only one direct parent.
For example:
HR
------
ID
first_name
last_name
department_id
sal
boss_hr_id (references HR.ID)
Usually the big boss would have a NULL boss_hr_id
To query such a structure in Postgres, you can use a recursive CTE (a WITH RECURSIVE statement).
For the table above, a query like this will work:
with recursive ret(id, first_name, last_name, department_id, boss_hr_id, lev) as
(
    select hr.id, hr.first_name, hr.last_name, hr.department_id, hr.boss_hr_id, 0
    from hr
    where hr.id = **ID_OF_PERSON_YOU_ARE_QUERYING_STRUCTURE**
    union all
    select hr.id, hr.first_name, hr.last_name, hr.department_id, hr.boss_hr_id, ret.lev + 1
    from hr
    inner join ret on ret.boss_hr_id = hr.id
)
select * from ret
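For the MLM case the recursion runs in the other direction, from a person down to everyone they sponsored. Assuming a table shaped like members(id, sponsor_id, points), where sponsor_id references members.id (names are illustrative, not from the question), the points roll-up could look roughly like:
with recursive downline as
(
    select id, points
    from members
    where id = 1                                  -- the person whose network we are measuring
    union all
    select m.id, m.points
    from members m
    inner join downline d on m.sponsor_id = d.id  -- direct recruits, their recruits, and so on
)
select sum(points) as network_points
from downline
where id <> 1;                                    -- drop this line to count the person's own points too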
The app I am working on is like Flickr, but with a concept of groups. Each group consists of multiple users, and a user can do activities like upload, share, and comment within their group only.
I am thinking of creating a schema per group, to organize data under a group-name namespace in order to manage it easily and efficiently.
Will it have any adverse effect on database backup plans?
Are there any practical limits on the number of schemas per database?
When splitting identically-structured data into schemas, you need to be confident that you will never need to query it as a global whole again, because querying across schemas is as cumbersome and anti-SQL as having the data spread across different tables of the same schema.
As an example, say you have 100 groups of users, in 100 schemas named group1..group100, each with a photos table.
To get the total number of photos in your system, you'd need to do:
select sum(n) FROM
(
  select count(*) as n from group1.photos
  UNION ALL
  select count(*) as n from group2.photos
  UNION ALL
  select count(*) as n from group3.photos
  ...
  UNION ALL
  select count(*) as n from group100.photos
) as counts
This sort of query or view also needs to be rebuilt any time a group is added or removed.
This is neither easy nor efficient; it's a programmer's nightmare.
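For contrast, here is a sketch of the single-schema layout this implies, with one table keyed by group (column names are illustrative):
create table photos (
    id        bigserial primary key,
    group_id  int not null,             -- which group the photo belongs to
    owner_id  int not null,
    uploaded  timestamptz not null default now()
);

-- Global questions then become ordinary queries:
select count(*) from photos;                               -- total photos
select group_id, count(*) from photos group by group_id;   -- photos per group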