How do I replace an SSN with a 9-digit random number in SQL Server 2008 R2? - tsql

To satisfy security requirements, I need to replace SSNs with unique, random 9-digit numbers before providing the database to a developer. The SSN is a column in one table of the database, which may hold tens of thousands of rows. The numbers do not need hyphens. I am a beginner with SQL and programming in general.
I have been unable to find a solution for my specific needs; nothing seems quite right. But if you know of a thread that I have missed, please let me know.
Thanks for any help!

Here is one way.
I'm assuming that you already have a backup of the real data as this update is not reversible.
Below I've assumed your table name is Person with your ssn column named SSN.
UPDATE Person
SET SSN = CAST(LEFT(CAST(ABS(CAST(CAST(NEWID() AS BINARY(10)) AS int)) AS varchar(max)) + '00000000', 9) AS int)
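Note that NEWID()-derived values are random, not guaranteed unique; with tens of thousands of rows collisions are unlikely but possible. A quick check afterwards (same Person/SSN naming assumed above):
-- List any replacement values that collided (ideally returns no rows)
SELECT SSN, COUNT(*) AS cnt
FROM Person
GROUP BY SSN
HAVING COUNT(*) > 1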

If they do not have to be random, you could just replace them with ascending numeric values. Failing that, you'd have to generate a random number. As you may have discovered, the RAND function generates only a single value per query statement (SELECT, UPDATE, etc.); the work-around is the NEWID() function, which generates a GUID for each row produced by a query (run SELECT NEWID() FROM MyTable to see how this works). Wrap this in CHECKSUM() to generate an integer; take it modulo 1,000,000,000 to get a value within the SSN range (0 to 999,999,999); and, assuming you're storing it as a char(9), prefix it with leading zeros.
The next trick is ensuring it's unique for all values in your table. This gets tricky, and I'd do it by setting up a temp table with the values, populating it, then copying them over. Lessee now…
DECLARE @DummySSN table
(
    PrimaryKey int not null
   ,NewSSN char(9) not null
)

-- Load initial values
INSERT @DummySSN
select
    UserId
   ,right('000000000' + cast(abs(checksum(newid()))%1000000000 as varchar(9)), 9)
from Users

-- Check for dups
select NewSSN from @DummySSN group by NewSSN having count(*) > 1

-- Re-randomize any duplicated values (repeat until the check below returns no rows)
IF exists (SELECT 1 from @DummySSN group by NewSSN having count(*) > 1)
    UPDATE @DummySSN
    set NewSSN = right('000000000' + cast(abs(checksum(newid()))%1000000000 as varchar(9)), 9)
    where NewSSN in (select NewSSN from @DummySSN group by NewSSN having count(*) > 1)

-- Check for dups
select NewSSN from @DummySSN group by NewSSN having count(*) > 1
This works for a small table I have, and it should work for a large one. I don't see this turning into an infinite loop, but even so you might want to add a check to exit the loop after, say, 10 iterations.
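A sketch of that safeguard, using the @DummySSN table variable from above: wrap the re-randomizing update in a WHILE loop and bail out after 10 passes.
DECLARE @i int = 0;
WHILE @i < 10
  AND EXISTS (SELECT 1 FROM @DummySSN GROUP BY NewSSN HAVING COUNT(*) > 1)
BEGIN
    -- Re-randomize only the rows whose NewSSN collides with another row
    UPDATE @DummySSN
    SET NewSSN = right('000000000' + cast(abs(checksum(newid()))%1000000000 as varchar(9)), 9)
    WHERE NewSSN IN (SELECT NewSSN FROM @DummySSN GROUP BY NewSSN HAVING COUNT(*) > 1);
    SET @i += 1;  -- give up after 10 iterations rather than risk looping forever
END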

I've run a couple million tests of this and it seems to generate uniformly random 9-digit numbers (no leading zeros).
I cannot think of a more efficient way to do this.
SELECT CAST(FLOOR(RAND(CHECKSUM(NEWID())) * 900000000 ) + 100000000 AS BIGINT)
The test used:
;WITH Fn(N) AS
(
SELECT CAST(FLOOR(RAND(CHECKSUM(NEWID())) * 900000000 ) + 100000000 AS BIGINT)
UNION ALL
SELECT CAST(FLOOR(RAND(CHECKSUM(NEWID())) * 900000000 ) + 100000000 AS BIGINT)
FROM Fn
)
,Tester AS
(
SELECT TOP 5000000 *
FROM Fn
)
SELECT LEN(MIN(N))
,LEN(MAX(N))
,MIN(N)
,MAX(N)
FROM Tester
OPTION (MAXRECURSION 0)
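To apply it per row, the expression can go straight into an UPDATE; a sketch reusing the hypothetical Person/SSN names from the first answer (uniqueness is still not guaranteed, so re-check for duplicates afterwards):
-- NEWID() is evaluated per row, so each row gets its own seed and value
UPDATE Person
SET SSN = CAST(FLOOR(RAND(CHECKSUM(NEWID())) * 900000000) + 100000000 AS BIGINT);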

Not so fast, but easiest... I added some dots...
DECLARE @tr NVARCHAR(40)
SET @tr = CAST(ROUND((888*RAND()+111),0) AS CHAR(3)) + '.' +
          CAST(ROUND((8888*RAND()+1111),0) AS CHAR(4)) + '.' +
          CAST(ROUND((8888*RAND()+1111),0) AS CHAR(4)) + '.' +
          CAST(ROUND((88*RAND()+11),0) AS CHAR(2))
PRINT @tr

If the requirement is to obfuscate a database, then this will return the same unique value for each distinct SSN in any table, preserving referential integrity in the output without having to do a lookup and translate.
SELECT CAST(RAND(SSN)*999999999 AS INT)
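Since RAND(seed) is deterministic for a given seed, running this as an update maps each distinct SSN to the same replacement everywhere it appears; a sketch, assuming SSN is stored as (or convertible to) an integer, and noting collisions are still possible:
-- Same input SSN always yields the same output, so cross-table joins still line up
UPDATE Person
SET SSN = CAST(RAND(SSN) * 999999999 AS INT);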

Related

postgresql: How to grab an existing id at random from a table with non-sequential ids

PostgreSQL version 9.4.
I have a table with an integer column that holds a number of integers with some gaps, like the sample below. I'm trying to get an existing id from the column at random with the following query, but it occasionally returns NULL:
CREATE TABLE IF NOT EXISTS test_tbl (
    id INTEGER
);
INSERT INTO test_tbl
VALUES (10),
(13),
(14),
(16),
(18),
(20);
-------------------------------
SELECT * FROM test_tbl;
-------------------------------
SELECT COALESCE(tmp.id, 20) AS classification_id
FROM (
SELECT tt.id,
row_number() over(
ORDER BY tt.id) AS row_num
FROM test_tbl tt
) tmp
WHERE tmp.row_num = floor(random() * 10);
Please let me know where I'm going wrong.
but it returns NULL occasionally
and I must add to this that it sometimes returns more than one row, right?
In your sample data there are 6 rows, so the column row_num will have a value from 1 to 6.
This:
floor(random() * 10)
produces a random integer from 0 to 9 (random() itself returns a value from 0 up to 0.9999...).
You should use:
floor(random() * 6 + 1)::int
to get a random integer from 1 to 6.
But this would not solve the problem, because the WHERE clause is evaluated once for each row, each time against a freshly generated random number; so row_num may never match (returning nothing), or may match more than once (returning several rows).
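You can see this per-row evaluation with the sample table above:
SELECT id, floor(random() * 6 + 1)::int AS draw
FROM test_tbl;  -- run it a few times: every row gets a different draw each time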
The proper (although sometimes not the most efficient) way to get a random row is:
SELECT id FROM test_tbl ORDER BY random() LIMIT 1
Also check other links from SO, like:
quick random row selection in Postgres
You could select one row and order by random(); this way you are ensured to hit an existing row:
select id
from test_tbl
order by random()
LIMIT 1;

LEAD and LAG on a large table (1 billion rows)

I have a table T, as follows, with 1 billion records. Currently this table has no primary key or indexes.
create table T(
day_c date,
str_c varchar2(20),
comm_c varchar2(20),
src_c varchar2(20)
);
some sample data:
insert into T
select to_date('20171011','yyyymmdd') day_c,'st1' str_c,'c1' comm_c,'s1' src_c from dual
union
select to_date('20171012','yyyymmdd'),'st1','c1','s1' from dual
union
select to_date('20171013','yyyymmdd'),'st1','c1','s1' from dual
union
select to_date('20171014','yyyymmdd'),'st1','c1','s2' from dual
union
select to_date('20171015','yyyymmdd'),'st1','c1','s2' from dual
union
select to_date('20171016','yyyymmdd'),'st1','c1','s2' from dual
union
select to_date('20171017','yyyymmdd'),'st1','c1','s1' from dual
union
select to_date('20171018','yyyymmdd'),'st1','c1','s1' from dual
union
select to_date('20171019','yyyymmdd'),'st1','c1','s1' from dual
union
select to_date('20171020','yyyymmdd'),'st1','c1','s1' from dual;
The expected result is to generate the date ranges for the changes in column src_c.
I have the following code snippet which provides the desired result. However, it is slow, as the cost of running LAG and LEAD over this table is quite high.
WITH EndsMarked AS (
SELECT
day_c,str_c,comm_c,src_c,
CASE WHEN src_c= LAG(src_c,1) OVER (ORDER BY day_c)
THEN 0 ELSE 1 END AS IS_START,
CASE WHEN src_c= LEAD(src_c,1) OVER (ORDER BY day_c)
THEN 0 ELSE 1 END AS IS_END
FROM T
), GroupsNumbered AS (
SELECT
day_c,str_c,comm_c,
src_c,
IS_START,
IS_END,
COUNT(CASE WHEN IS_START = 1 THEN 1 END)
OVER (ORDER BY day_c) AS GroupNum
FROM EndsMarked
WHERE IS_START=1 OR IS_END=1
)
SELECT
str_c,comm_c,src_c,
MIN(day_c) AS GROUP_START,
MAX(day_c) AS GROUP_END
FROM GroupsNumbered
GROUP BY str_c,comm_c, src_c,GroupNum
ORDER BY groupnum;
Output:
STR_C COMM_C SRC_C GROUP_START GROUP_END
st1 c1 s1 11-OCT-17 13-OCT-17
st1 c1 s2 14-OCT-17 16-OCT-17
st1 c1 s1 17-OCT-17 20-OCT-17
Any suggestions to speed this up?
Oracle database: 12c
SGA memory: 20 GB
Total CPUs: 22
Order by day_c only, or do you need to partition by str_c and comm_c first? It seems so - in which case I am not sure your query is correct, and Sentinel's solution will need to be adjusted accordingly.
Then:
For some reason (which escapes me), it appears that the match_recognize clause (available only since Oracle 12.1) is faster than analytic functions, even when the work done seems to be the same.
In your problem, (1) you must read 1 billion rows from disk, which can't be done faster than the hardware allows (do you REALLY need to do this on all 1 billion rows, or could you archive a large portion of the table, perhaps after performing this identification of GROUP_START and GROUP_END?); and (2) you must order the data by day_c no matter what method you use, and that is time-consuming.
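On point (2), one thing worth testing (an assumption on my part, not something from the question) is a composite index covering all four columns, so Oracle can scan the rows pre-sorted within each partition instead of sorting 1 billion rows; verify with the optimizer before relying on it:
-- Hypothetical index name; column order matches the partition/order keys
CREATE INDEX t_ix ON T (str_c, comm_c, day_c, src_c);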
With that said, the tabibitosan method (see Sentinel's answer) will be faster than the start-of-group method (which is close to, but simpler than what you currently have).
The match_recognize solution, which will probably be faster than any solution based on analytic functions, looks like this:
select str_c, comm_c, src_c, group_start, group_end
from t
match_recognize(
partition by str_c, comm_c
order by day_c
measures x.src_c as src_c,
first(day_c) as group_start,
last(day_c) as group_end
pattern ( x y* )
define y as src_c = x.src_c
)
-- Add ORDER BY clause here, if needed
;
Here is a quick explanation of how this works; for developers who are not familiar with match_recognize, I provided links to a few good tutorials in a Comment below this Answer.
The match_recognize clause partitions the input rows by str_c and comm_c and orders them by day_c. So far this is exactly the same work that analytic functions do.
Then in the PATTERN and DEFINE clauses I declare and define two "classes" of rows, which will be flagged as X and Y, respectively. X is any row (there are no restrictions on it in the DEFINE clause). However, Y is restricted: it must have the same src_c as the last X row preceding it.
So, in each partition, and reading from the earliest row to the latest (within the partition), I am looking for any number of matches, where a match consists of an arbitrary row (marked X), followed by as many Y rows as possible, where Y means "same src_c as the first row in this match". So this will identify sequences of rows where the src_c did not change.
For each match that is found, the clause will output the src_c value from the X row (which is the same, really, for all the rows in that match), and the first and the last value in the day_c column for that match. That is what we need to put in the SELECT clause of the overall query.
You can eliminate one CTE by using the Tabibito-san (Traveler) method:
with Groups as (
select t.*
, row_number() over (order by day_c)
- row_number() over (partition by str_c
, comm_c
, src_c
order by day_c) GroupNum
from t
)
select str_c
, comm_c
, src_c
, min(day_c) GROUP_START
, max(day_c) GROUP_END
from Groups
group by str_c
, comm_c
, src_c
, GroupNum

How to remove duplicates in postgres (no unique id) [duplicate]

This question already has answers here:
How to delete duplicate rows without unique identifier
(10 answers)
I'm having some difficulty removing duplicate rows. I thought user_id and time_id together would act as an identifier, but there were even duplicates for those.
user_id (text), time_id(bigint), value1 (numeric)
user_id | time_id | value1
--------+---------+-------
aaa     |       1 |      3
aaa     |       1 |      3
aaa     |       2 |      4
baa     |       3 |      1
In this case how do I remove duplicates?
Since I have 16 distinct values in time_id and 15,000 distinct ones in user_id, I tried something like this, but I do not have a unique id:
DELETE FROM tablename a
USING tablename b
WHERE a.unique_id < b.unique_id
AND a.user_id = b.user_id
time_id = 1 (repeat till time_id 16)
Each table in Postgres has a few hidden system columns. One of them (ctid) is unique by definition and can be used in cases when a primary key is missing.
DELETE FROM tablename a
USING tablename b
WHERE a.ctid < b.ctid
AND a.user_id = b.user_id
AND a.time_id = b.time_id;
The problem is due to the lack of a primary key. Using hidden columns should not be a systematic method (see comments below). Once you have deleted the duplicates, you should create a primary key on (user_id, time_id), or create a new unique column for that purpose.
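For example, once the duplicates are gone:
ALTER TABLE tablename ADD PRIMARY KEY (user_id, time_id);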
Please use any advice on deletions with care, and make sure you have a way to "undo it" if needed. I think you need to add an auto-numbered column to assist in this endeavor:
alter table tablename add column is_uniq serial
Then I'd suggest using row_number() to help identify the rows you want to retain (where rn = 1) and those to be deleted (where rn > 1). Use the following as a guide:
select *
     , row_number() over (partition by user_id, time_id, value1 order by is_uniq) as rn
from tablename
I'm not sure if there are any other column(s) to use for the order by, but if there are, you can include them in the over clause as well.
Once you have the "is_uniq" column and the rn>1 rows you should be able to safely delete the unwanted rows.
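A sketch of that final delete, using the is_uniq column and the rn logic above:
DELETE FROM tablename
WHERE is_uniq IN (
    SELECT is_uniq
    FROM (
        SELECT is_uniq,
               row_number() OVER (PARTITION BY user_id, time_id, value1
                                  ORDER BY is_uniq) AS rn
        FROM tablename
    ) t
    WHERE rn > 1  -- keep the first row of each duplicate group, delete the rest
);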
If you don't want to rely on ctid (personally, I do), you can add a unique column (such as a serial) and use that for identity purposes,
CREATE TABLE lutser
( user_id text not null
, time_i integer not null
, value integer not null
);
INSERT INTO lutser(user_id,time_i,value) VALUES
('aaa', 1, 3)
,('aaa', 1, 3)
,('aaa', 2, 4)
,('baa', 3, 1)
;
SELECT * FROM lutser;
ALTER TABLE lutser
ADD COLUMN seq serial NOT NULL UNIQUE
;
SELECT * FROM lutser;
DELETE FROM lutser del
WHERE EXISTS(
SELECT * FROM lutser x
WHERE x.user_id=del.user_id
AND x.time_i=del.time_i
AND x.seq < del.seq
);
ALTER TABLE lutser
ADD PRIMARY KEY (user_id,time_i)
;
SELECT * FROM lutser;

GROUP BY clause sees all VARCHAR fields as different

I have witnessed a strange behaviour while trying to GROUP BY a VARCHAR field.
Take the following example, where I try to spot customers that have changed their name at least once in the past.
CREATE TABLE #CustomersHistory
(
Id INT IDENTITY(1,1),
CustomerId INT,
Name VARCHAR(200)
)
INSERT INTO #CustomersHistory VALUES (12, 'AAA')
INSERT INTO #CustomersHistory VALUES (12, 'AAA')
INSERT INTO #CustomersHistory VALUES (12, 'BBB')
INSERT INTO #CustomersHistory VALUES (44, '444')
SELECT ch.CustomerId, count(ch.Name) AS cnt
FROM #CustomersHistory ch
GROUP BY ch.CustomerId HAVING count(ch.Name) != 1
Which oddly yields (as if 'AAA' from first INSERT was different from the second one)
CustomerId cnt // (I was expecting)
12 3 // 2
44 1 // 1
Is this behaviour specific to T-SQL?
Why does it behave in this rather counter-intuitive way?
How is it customary to overcome this limitation?
Note: This question is very similar to GROUP BY problem with varchar, where I didn't find the answer to the why.
Side Note: Is it good practice to use HAVING count(ch.Name) != 1 instead of HAVING count(ch.Name) > 1 ?
The COUNT() function counts all rows regardless of value. I think you want COUNT(DISTINCT ch.Name), which counts only unique names.
SELECT ch.CustomerId, count(DISTINCT ch.Name) AS cnt
FROM #CustomersHistory ch
GROUP BY ch.CustomerId HAVING count(DISTINCT ch.Name) > 1
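With the sample data above, this now returns only CustomerId 12 with cnt = 2 (the distinct names 'AAA' and 'BBB'), which matches the expected result.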
For more information, take a look at the COUNT() article in Books Online.

Getting the minimum of two values in SQL

I have two variables, one is called PaidThisMonth, and the other is called OwedPast. They are both results of some subqueries in SQL. How can I select the smaller of the two and return it as a value titled PaidForPast?
The MIN function works on columns, not variables.
SQL Server 2012 and 2014 support the IIF(cond, true, false) function. Thus, for minimal selection, you can use it like:
SELECT IIF(first>second, second, first) the_minimal FROM table
While IIF is just a shorthand for writing CASE...WHEN...ELSE, it's easier to write.
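Applied to the two variables from the question (assuming they are already declared), a sketch:
SELECT IIF(@PaidThisMonth > @OwedPast, @OwedPast, @PaidThisMonth) AS PaidForPast;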
The solutions using CASE, IIF, and UDFs are adequate, but impractical when extending the problem to the general case of more than two comparison values. The generalized solution in SQL Server 2008+ utilizes a strange application of the VALUES clause:
SELECT
PaidForPast=(SELECT MIN(x) FROM (VALUES (PaidThisMonth),(OwedPast)) AS value(x))
Credit due to this website:
http://sqlblog.com/blogs/jamie_thomson/archive/2012/01/20/use-values-clause-to-get-the-maximum-value-from-some-columns-sql-server-t-sql.aspx
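A sketch of how this generalizes to three or more values; @ThirdAmount is a hypothetical extra variable:
SELECT PaidForPast = (SELECT MIN(x)
                      FROM (VALUES (@PaidThisMonth),
                                   (@OwedPast),
                                   (@ThirdAmount)) AS v(x));  -- add rows as needed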
Use Case:
Select Case When @PaidThisMonth < @OwedPast
            Then @PaidThisMonth Else @OwedPast End PaidForPast
As an inline table-valued UDF:
CREATE FUNCTION Minimum
(@Param1 Integer, @Param2 Integer)
Returns Table As
Return(Select Case When @Param1 < @Param2
                   Then @Param1 Else @Param2 End MinValue)
Usage:
Select MinValue as PaidForPast
From dbo.Minimum(@PaidThisMonth, @OwedPast)
ADDENDUM:
This is probably best when addressing only two possible values; if there are more than two, consider Craig's answer using the VALUES clause.
For SQL Server 2022+ (or MySQL or PostgreSQL 9.3+), a better way is to use the LEAST and GREATEST functions.
SELECT GREATEST(A.date0, B.date0) AS date0,
LEAST(A.date1, B.date1, B.date2) AS date1
FROM A, B
WHERE B.x = A.x
With:
GREATEST(value [, ...]) : Returns the largest (maximum-valued) argument from values provided
LEAST(value [, ...]) Returns the smallest (minimum-valued) argument from values provided
Documentation links :
MySQL http://dev.mysql.com/doc/refman/5.0/en/comparison-operators.html
Postgres https://www.postgresql.org/docs/current/functions-conditional.html
SQL Server https://learn.microsoft.com/en-us/sql/t-sql/functions/logical-functions-least-transact-sql
I just had a situation where I had to find the max of 4 complex selects within an update.
With this approach you can have as many as you like!
You can also replace the numbers with additional selects:
select max(x)
from (
select 1 as 'x' union
select 4 as 'x' union
select 3 as 'x' union
select 2 as 'x'
) a
More complex usage
-- assumes @answer was declared earlier, e.g. DECLARE @answer INT
SELECT @answer = Max(x)
from (
    select @NumberA as 'x' union
    select @NumberB as 'x' union
    select @NumberC as 'x' union
    select (
        Select Max(score) from TopScores
    ) as 'x'
) a
I'm sure a UDF has better performance.
Here is a trick if you want to calculate maximum(field, 0):
SELECT (ABS(field) + field)/2 FROM Table
which returns 0 if field is negative, and field otherwise.
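The same trick flipped gives minimum(field, 0); a sketch:
SELECT (field - ABS(field))/2 FROM Table
-- returns field if field is negative, else 0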
Use a CASE statement.
Example B in this page should be close to what you're trying to do:
http://msdn.microsoft.com/en-us/library/ms181765.aspx
Here's the code from the page:
USE AdventureWorks;
GO
SELECT ProductNumber, Name, 'Price Range' =
CASE
WHEN ListPrice = 0 THEN 'Mfg item - not for resale'
WHEN ListPrice < 50 THEN 'Under $50'
WHEN ListPrice >= 50 and ListPrice < 250 THEN 'Under $250'
WHEN ListPrice >= 250 and ListPrice < 1000 THEN 'Under $1000'
ELSE 'Over $1000'
END
FROM Production.Product
ORDER BY ProductNumber ;
GO
This works for up to 5 dates and handles nulls. Just couldn't get it to work as an Inline function.
CREATE FUNCTION dbo.MinDate(@Date1 datetime = Null,
                            @Date2 datetime = Null,
                            @Date3 datetime = Null,
                            @Date4 datetime = Null,
                            @Date5 datetime = Null)
RETURNS Datetime AS
BEGIN
    --USAGE select dbo.MinDate('20120405',null,null,'20110305',null)
    DECLARE @Output datetime;
    WITH Datelist_CTE(DT)
    AS (
        SELECT @Date1 AS DT WHERE @Date1 is not NULL UNION
        SELECT @Date2 AS DT WHERE @Date2 is not NULL UNION
        SELECT @Date3 AS DT WHERE @Date3 is not NULL UNION
        SELECT @Date4 AS DT WHERE @Date4 is not NULL UNION
        SELECT @Date5 AS DT WHERE @Date5 is not NULL
    )
    Select @Output=Min(DT) FROM Datelist_CTE;
    RETURN @Output;
END;
Building on the brilliant logic / code from mathematix and scottyc, I submit:
DECLARE @a INT, @b INT, @c INT = 0;
WHILE @c < 100
BEGIN
    SET @c += 1;
    SET @a = ROUND(RAND()*100,0)-50;
    SET @b = ROUND(RAND()*100,0)-50;
    SELECT @a AS a, @b AS b,
        @a - ( ABS(@a-@b) + (@a-@b) ) / 2 AS MINab,
        @a + ( ABS(@b-@a) + (@b-@a) ) / 2 AS MAXab,
        CASE WHEN (@a <= @b AND @a = @a - ( ABS(@a-@b) + (@a-@b) ) / 2)
               OR (@a >= @b AND @a = @a + ( ABS(@b-@a) + (@b-@a) ) / 2)
             THEN 'Success' ELSE 'Failure' END AS Status;
END;
Although the jump from scottyc's MIN function to the MAX function should have been obvious to me, it wasn't, so I've solved for it and included it here: SELECT @a + ( ABS(@b-@a) + (@b-@a) ) / 2. The randomly generated numbers, while not proof, should at least convince skeptics that both formulae are correct.
Use a temp table to insert the range of values, then select the min/max of the temp table from within a stored procedure or UDF. This is a basic construct, so feel free to revise as needed.
For example:
CREATE PROCEDURE GetMinSpeed AS
BEGIN
    CREATE TABLE #speed (Driver NVARCHAR(10), SPEED INT);

    -- Insert any number of rows you need to sort and pull from
    INSERT INTO #speed VALUES (N'Petty', 165);
    INSERT INTO #speed VALUES (N'Earnhardt', 172);
    INSERT INTO #speed VALUES (N'Patrick', 174);

    SELECT MIN(SPEED) FROM #speed;

    DROP TABLE #speed;
END
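Usage would then simply be:
EXEC GetMinSpeed;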
Select MIN(T.V) FROM (Select 1 as V UNION Select 2 as V) T
SELECT (CASE WHEN first > second THEN second ELSE first END) AS the_minimal FROM table