How to Implement a Reliable Web Page Counter?

What's a good way to implement a Web Page counter?
On the surface this is a simple problem, but it gets tricky when you have to deal with search engine crawlers and robots, multiple clicks by the same user, and refreshes.
Specifically, what is a good way to ensure links aren't just 'clicked up' by a user repeatedly clicking? IP address? Cookies? Both of these have drawbacks (IP addresses aren't necessarily unique, and cookies can be turned off).
Also, what is the best way to store the data: increment a counter directly, or store each click as a record in a log table and summarize occasionally?
Any live experience would be helpful,
+++ Rick ---

Use IP Addresses in conjunction with Sessions. Count every new session for an IP address as one hit against your counter. You can store this data in a log database if you think you'll ever need to look through it. This can be useful for calculating when your site gets the most traffic, how much traffic per day, per IP, etc.

So I played around with this a bit based on the comments here. What I came up with is incrementing a counter in a simple field. In my app I have code snippet entities with a Views property.
When a snippet is viewed, a method filters (a white list) for what should hopefully be actual browsers:
public bool LogSnippetView(string snippetId, string ipAddress, string userAgent)
{
    if (string.IsNullOrEmpty(userAgent))
        return false;

    userAgent = userAgent.ToLower();

    // white list: only count requests that look like real browsers
    if (!(userAgent.Contains("mozilla") || userAgent.StartsWith("safari") ||
          userAgent.StartsWith("blackberry") || userAgent.StartsWith("t-mobile") ||
          userAgent.StartsWith("htc") || userAgent.StartsWith("opera")))
        return false;

    this.Context.LogSnippetClick(snippetId, ipAddress);
    return true;
}
The stored procedure then uses a separate table to temporarily hold the latest views, storing the snippet ID, entered date, and IP address. Each view is logged, and when a new view comes in it's checked to see whether the same IP address has accessed this snippet within the last 2 minutes. If so, nothing is logged.
If it's a new view, the view is logged (again SnippetId, IP, Entered) and the actual Views field is updated on the Snippets table.
If it's not a new view, the table is cleaned up by removing any logged views older than 4 minutes. This should result in a minimal number of entries in the view log table at any time.
Here's the stored proc:
ALTER PROCEDURE [dbo].[LogSnippetClick]
    -- Add the parameters for the stored procedure here
    @SnippetId AS VARCHAR(MAX),
    @IpAddress AS VARCHAR(MAX)
AS
BEGIN
    SET NOCOUNT ON;

    -- don't allow updating if this IP address has already
    -- clicked on this snippet in the last 2 minutes
    SELECT Id FROM SnippetClicks
    WHERE SnippetId = @SnippetId AND IpAddress = @IpAddress AND
          DATEDIFF(minute, Entered, GETDATE()) < 2

    IF @@ROWCOUNT = 0
    BEGIN
        INSERT INTO SnippetClicks
            (SnippetId, IpAddress, Entered) VALUES
            (@SnippetId, @IpAddress, GETDATE())

        UPDATE CodeSnippets SET Views = Views + 1
        WHERE Id = @SnippetId
    END
    ELSE
    BEGIN
        -- clean up
        DELETE FROM SnippetClicks WHERE DATEDIFF(minute, Entered, GETDATE()) > 4
    END
END
This seems to work fairly well. As others mentioned this isn't perfect but it looks like it's good enough in initial testing.

If you get to use PHP, you may use sessions to track activity from particular users. In conjunction with a database, you may track activity from particular IP addresses, which you may assume are the same user.
Use timestamps to limit hits (assume no more than 1 hit per 5 seconds, for example), and to tell when new "visits" to the site occur (if the last hit was over 10 minutes ago, for example).
You may find $_SERVER[] properties that aid you in detecting bots or visitor trends (such as browser usage).
edit:
I've tracked hits & visits before, counting a page view as a hit, and +1 to visits when a new session is created. It was fairly reliable (more than reliable enough for the purposes I used it for). Browsers that don't support cookies (and thus don't support sessions) and users that have them disabled are fairly uncommon nowadays, so I wouldn't worry about it unless there is reason to be excessively accurate.

If I were you, I'd give up on my counter being accurate in the first place. Every solution (e.g. cookies, IP addresses, etc.), like you said, tends to be unreliable. So, I think your best bet is to use redundancy in your system: use cookies, "Flash-cookies" (shared objects), IP addresses (perhaps in conjunction with user-agents), and user IDs for people who are logged in.
You could implement some sort of scheme where any unknown client is given a unique ID, which gets stored (hopefully) on the client's machine and re-transmitted with every request. Then you could tie an IP address, user agent, and/or user ID (plus anything else you can think of) to every unique ID and vice-versa. The timestamp and unique ID of every click could be logged in a database table somewhere, and each click (at least, each click to your website) could be let through or denied depending on how recent the last click was for the same unique ID. This is probably reliable enough for short term click-bursts, and long-term it wouldn't matter much anyway (for the click-up problem, not the page counter).
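To make that concrete, here is a minimal JavaScript sketch of the throttling part, assuming a Node-style server that round-trips the unique ID in a cookie; the names (getOrAssignClientId, THROTTLE_MS) and the in-memory Map are illustrative, not a prescription:
// Minimal sketch: count a click only if this client ID hasn't clicked recently.
const crypto = require('crypto');

const THROTTLE_MS = 2 * 60 * 1000;    // ignore repeat clicks within 2 minutes
const lastClickByClient = new Map();  // clientId -> timestamp of last counted click

function getOrAssignClientId(cookies) {
  // Reuse the ID already stored on the client if present, otherwise mint one.
  return cookies.clientId || crypto.randomUUID();
}

function shouldCountClick(clientId, now = Date.now()) {
  const last = lastClickByClient.get(clientId);
  if (last !== undefined && now - last < THROTTLE_MS) {
    return false;                     // too recent: treat as a repeat click
  }
  lastClickByClient.set(clientId, now);
  return true;
}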
Friendly robots should have their user agent set appropriately and can be checked against a list of known robot user agents (I found one here after a simple Google search) in order to be properly identified and dealt with separately from real people.

Related

Message library

The scenario is: a user sends messages to a group of people.
I was thinking of creating one row for each specific conversation in one class. That row would contain information such as "sender name" and "receiver", and in addition a column (PFRelation) which connects this specific row to another class where all messages between the user and the receiver would be saved.
This action would happen every time the user starts a new conversation.
The benefit of this approach:
Privacy, because the only conversations saved are those between the user and the receiver group.
The downside of this approach:
We all know that Parse only provides 30 req/s for free, which means 1 min = 1,800 reqs. So every time I create a new class to keep track of a conversation, am I using a lot of requests?
I am looking for suggestions and thoughts on the ideal way to do this before I implement this messenger library.
It sounds like you have come up with something that is similar to what I have used before to implement messaging in an app with Parse as a backend. It's also important to think about how your UI will be querying for data. In general, it's most important to ensure that it is very easy and fast to read data. For most social apps, the following quote from Facebook's engineering team on Haystack is particularly relevant.
Haystack is an object store that we designed for sharing photos on
Facebook where data is written once, read often, never modified, and
rarely deleted.
The crucial piece of information here is written once, read often, never modified, and rarely deleted. No matter what approach you decide to take, keep that in mind while engineering your solution. The approach that I have used before to implement a messaging system using Parse is described below.
Overview
Each row (object) of the Message class corresponds to an individual text, picture, or video message that was posted. Each Message belongs to a Group. A Group can be as small as two Users (a private conversation) or grow as large as you like.
The RecentMessage class is the solution I came up with to deal with quickly and easily populating the UI. Each RecentMessage object corresponds to a Group that a given User belongs to. Each User in a Group will have their own RecentMessage object, which is kept up to date using beforeSave/afterSave Cloud Code triggers. Whenever a new Message is created, the afterSave trigger updates all of the RecentMessage objects that belong to the Group.
You will most likely have a table in your app which displays all of the conversations that the user is part of. This is easily achieved by querying for all of that user's RecentMessage objects which already contains all of the Group information needed to load the rest of the messages when selected and also contains the most recent message's data (hence the name) to display in the table. Alternatively, RecentMessage could contain a pointer to the most recent Message, however I decided that copying the data was a beneficial tradeoff since it streamlines future queries.
Message
group (pointer to group which message is part of)
user (pointer to user who created it)
text (string)
picture (optional file)
video (optional file)
RecentMessage
group (group pointer)
user (user pointer)
lastMessage (string containing the text of most recent Message)
lastUser (pointer to the User who posted the most recent Message)
Group
members (array of user pointers)
name or whatever other info you want
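A minimal Cloud Code sketch of that afterSave trigger, using the class and field names from this schema (illustrative, not production-ready):
Parse.Cloud.afterSave('Message', function(request) {
  var message = request.object;

  // Update every member's RecentMessage for the group this message belongs to.
  var query = new Parse.Query('RecentMessage');
  query.equalTo('group', message.get('group'));

  return query.find({ useMasterKey: true }).then(function(recents) {
    recents.forEach(function(recent) {
      recent.set('lastMessage', message.get('text'));
      recent.set('lastUser', message.get('user'));
    });
    return Parse.Object.saveAll(recents, { useMasterKey: true });
  });
});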
Security/Privacy
Security and privacy are imperative when creating messaging functionality in your app. Make sure to read through the Parse Engineering security blog posts, and take your time to let it all soak in: Part I, Part II, Part III, Part IV, Part V.
Most important in our case is Part III, which describes ACLs, or Access Control Lists. Group objects will have an ACL granting access to all of their member Users. RecentMessage objects will have a read/write ACL restricted to their owner User. Message objects will inherit the ACL of the Group to which they belong, allowing all of the Group members to read them. I recommend disabling the write ACL in the afterSave trigger so messages cannot be modified.
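A rough sketch of that ACL setup, done here in a beforeSave trigger for brevity and assuming promise-style Cloud Code (Parse Server 3.x):
Parse.Cloud.beforeSave('Message', async (request) => {
  const message = request.object;
  const group = await message.get('group').fetch({ useMasterKey: true });

  // Every member of the Group may read the message; no write access is granted,
  // so messages cannot be modified after they are posted.
  const acl = new Parse.ACL();
  for (const member of group.get('members')) {
    acl.setReadAccess(member, true);
  }
  message.setACL(acl);
});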
General Remarks
With regards to Parse and the request limit, you need to accept the fact that you will very quickly surpass the 30 req/s free tier. As a general rule of thumb, it's much better to focus on building the best possible user experience than to focus too much on scalability. By and large, issues of scalability rarely come into play because most apps fail. Not saying that to be discouraging; just something to keep in mind to prevent you from falling into the trap of over-engineering at the cost of time :)

How to optimize collection subscription in Meteor?

I'm working on a filtered live search module with Meteor.js.
Usecase & problem:
A user wants to do a search through all the users to find friends, but I cannot afford to send each user the complete users collection. The user filters the search using checkboxes. I'd like to subscribe only to the matched users. What is the best way to do it?
I guess it would be better to build the query client-side, then send it to the method to get back the desired set of users. But I wonder: when the filtering criteria change, does the new subscription erase all of the old one? Because if I do a first search which returns [usr1, usr3, usr5], and after that a search which returns [usr2, usr4], the best would be to keep the first set and simply add the new one to the client-side subscribed collection.
In addition, if I then do a third search which should return [usr1, usr3, usr2, usr4], the autorun subscription would not need to send me anything, as I already have the whole result set in my collection.
The goal is to spare processing and data transfer on the server.
I have some ideas, but I haven't coded enough of them yet to share them in an easily comprehensible way.
How would you advise me to proceed to save as much time and processing as possible?
Thank you all.
David
It depends on your application, but you'll probably send a non-empty string to a publisher which uses that string to search the users collection for matching names. For example:
Meteor.publish('usersByName', function(search) {
  check(search, String);

  // make sure the user is logged in and that search is sufficiently long
  if (!(this.userId && search.length > 2))
    return [];

  // search by case-insensitive regular expression
  var selector = {username: new RegExp(search, 'i')};

  // only publish the necessary fields
  var options = {fields: {username: 1}};

  return Meteor.users.find(selector, options);
});
Also see common mistakes for why we limit the fields.
performance
Meteor is clever enough to keep track of the current document set that each client has for each publisher. When the publisher reruns, it knows to only send the difference between the sets. So the situation you described above is already taken care of for you.
If you were subscribed for users: 1,2,3
Then you restarted the subscription for users 2,3,4
The server would send a removed message for 1 and an added message for 4.
Note this will not happen if you stopped the subscription prior to rerunning it.
To my knowledge, there isn't a way to avoid removed messages when modifying the parameters for a single subscription. I can think of two possible (but tricky) alternatives:
Accumulate the intersection of all prior search queries and use that when subscribing. For example, if a user searched for {height: 5} and then searched for {eyes: 'blue'} you could subscribe with {height: 5, eyes: 'blue'}. This may be hard to implement on the client, but it should accomplish what you want with the minimum network traffic.
Accumulate active subscriptions. Rather than modifying the existing subscription each time the user modifies the search, start a new subscription for the new set of documents, and push the subscription handle to an array. When the template is destroyed, you'll need to iterate through all of the handles and call stop() on them. This should work, but it will consume more resources (both network and server memory + CPU).
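As a sketch of the second alternative (the userSearch template name and form markup are made up for the example; usersByName is the publication shown above):
Template.userSearch.onCreated(function() {
  this.searchHandles = [];
});

Template.userSearch.events({
  'submit .search-form': function(event, template) {
    event.preventDefault();
    var search = template.find('input[name=search]').value;
    // Start an additional subscription instead of replacing the current one.
    template.searchHandles.push(Meteor.subscribe('usersByName', search));
  }
});

Template.userSearch.onDestroyed(function() {
  // Stop every accumulated subscription to release server resources.
  this.searchHandles.forEach(function(handle) { handle.stop(); });
});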
Before attempting either of these solutions, I'd recommend benchmarking the worst case scenario without using them. My main concern is that without fairly tight controls, you could end up publishing the entire users collection after successive searches.
If you want to go easy on your server, you'll want to send as little data to the client as possible. That means every document you send to the client that is NOT a friend is waste. So let's eliminate all that waste.
Collect your filters (e.g. filters = {sex: 'Male', state: 'Oregon'}). Then call a method to search based on your filters (e.g. Users.find(filters)). Additionally, you can run your own proprietary ranking algorithm to determine the % chance that a person is a friend. Maybe base it off of distance from IP address (or from phone GPS history), mutual friends, etc. This will pay dividends in efficiency in a bit. Index things like GPS coords or other highly unique attributes, and maybe try out composite indexes. But remember, more indexes means slower writes.
Now you've got a cursor with all possible friends, ranked from most likely to least likely.
Next, change your subscription to match those friends, but put a limit:20 on there. Also, only send over the fields you need. That way, if a user wants to skip this step, you only wasted sending 20 partial docs over the wire. Then, have an infinite scroll or 'load more' button the user can click. When they load more, it's an additive subscription, so it's not resending duplicate info. Discover Meteor describes this pattern in great detail, so I won't.
After a few clicks/scrolls, the user won't find any more friends (because you were smart & sorted them) so they will stop trying & move on to the next step. If you returned 200 possible friends & they stop trying after 60, you just saved 140 docs from going through the pipeline. There's your efficiency.
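A rough sketch of that additive, limit-based subscription; the possibleFriends publication and the Session keys are hypothetical names for this example:
var PAGE_SIZE = 20;
Session.setDefault('friendLimit', PAGE_SIZE);
Session.setDefault('filters', {});

Tracker.autorun(function() {
  // Re-runs when the filters or the limit change; the server only sends the
  // documents the client doesn't already have.
  Meteor.subscribe('possibleFriends', Session.get('filters'),
                   Session.get('friendLimit'));
});

// Called from the 'load more' button or the infinite-scroll trigger.
function loadMore() {
  Session.set('friendLimit', Session.get('friendLimit') + PAGE_SIZE);
}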

GWT: Pragmatic unlocking of an entity

I have a GWT (+GAE) webapp that allows users to edit Customer entities. When a user starts editing, the lockedByUser attribute is set on the Customer entity. When the user finishes editing the Customer, the lockedByUser attribute is cleared.
No Customer entity can be modified by 2 users at the same time. If a user tries to open the Customer screen which is already opened by a different user, he gets a "Customer XYZ is being modified by user ABC" message.
The question is what is the most pragmatic and robust way to handle the case where the user forcefully closes the browser and hence the lockedByUser attribute is not cleared.
My first thought is a timer on the user side that would update the lockRefreshedTime every 30 seconds or so. A different user trying to modify the Customer would then look at the lockRefreshedTime, and if the refresh happened more than, say, 35 seconds ago, it would acquire the lock by setting the lockedByUser and updating the lockRefreshedTime.
Thanks,
Matyas
FWIW, your lock with expiry approach is the one used by WebDAV (and implemented in tools like Microsoft Word, for instance).
To cope for network latency, you should renew your lock at least half-way through the lock lifetime (e.g. the lock expires after 2 minutes, and you renew it every minute).
Have a look there for much more details on how clients and servers should behave: https://www.rfc-editor.org/rfc/rfc4918#section-6 (note that, for example, they always assume failure is possible: "a client MUST NOT assume that just because the timeout has not expired, the lock still exists"; see https://www.rfc-editor.org/rfc/rfc4918#section-6.6 )
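As a language-neutral sketch of that renew-at-half-life idea (plain JavaScript rather than GWT Java, and the /lock endpoints are hypothetical):
var LOCK_LIFETIME_MS = 2 * 60 * 1000;   // lock expires after 2 minutes

function startEditing(customerId) {
  renewLock(customerId);                // acquire the lock immediately
  // Renew at half the lifetime so network latency can't let it expire.
  return setInterval(function() { renewLock(customerId); }, LOCK_LIFETIME_MS / 2);
}

function renewLock(customerId) {
  // Server side: sets lockedByUser and lockRefreshedTime for this customer.
  fetch('/api/customers/' + customerId + '/lock', { method: 'POST' });
}

function stopEditing(customerId, timerId) {
  clearInterval(timerId);
  fetch('/api/customers/' + customerId + '/lock', { method: 'DELETE' });
}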
Another approach is to have an explicit lock/unlock flow, rather than an implicit one.
Alternatively, you could allow several users to update the customer at the same time, using a "one field at a time" approach: send an RPC to update a specific field on each ValueChangeEvent for that field. Handling conflicts (another user has updated the field) is then a bit easier, or they could simply be ignored: if user A changed the customer's address from "foo" to "bar", it really means "set the field to 'bar'", not "change it from 'foo' to 'bar'". So if the actual value on the server has already been updated by user B from "foo" to "baz", that wouldn't be a problem; user A would most likely still have set the value to "bar", and whether it changed from "foo" or from "baz" doesn't really matter.
Using a per-field approach, "implicit locks" (the time it takes to edit and send the changes to the server) are much shorter, because they're reduced to a single field.
The "challenge" then is to update the form in near real-time when another user saved a change to the edited customer; or you could choose to not do that (not try to do it in near real-time).
The way to go is this:
Execute code on window close in GWT
You have to ask the user to confirm that they really want to close the window while in edit mode.
If the user really wants to exit, you can then send an unlock call.

How to implement robust pagination with a RESTful API when the resultset can change?

I'm implementing a RESTful API which exposes Orders as a resource and supports pagination through the resultset:
GET /orders?start=1&end=30
where the orders to paginate are sorted by ordered_at timestamp, descending. This is basically approach #1 from the SO question Pagination in a REST web application.
If the user requests the second page of orders (GET /orders?start=31&end=60), the server simply re-queries the orders table, sorts by ordered_at DESC again and returns the records in positions 31 to 60.
The problem I have is: what happens if the resultset changes (e.g. a new order is added) while the user is viewing the records? In the case of a new order being added, the user would see the old order #30 in first position on the second page of results (because the same order is now #31). Worse, in the case of a deletion, the user sees the old order #32 in first position on the second page (#31) and wouldn't see the old order #31 (now #30) at all.
I can't see a solution to this without somehow making the RESTful server stateful (urg) or building some pagination intelligence into each client... What are some established techniques for dealing with this?
For completeness: my back-end is implemented in Scala/Spray/Squeryl/Postgres; I'm building two front-end clients, one in backbone.js and the other in Python Django.
The way I'd do it is to assign the indices from old to new, so they never change. Then, when querying without any start parameter, return the newest page. The response should also contain an index indicating which elements are contained, so you can calculate the indices you need to request the next, older page. While this is not exactly what you want, it seems like the easiest and cleanest solution to me.
Initial request: GET /orders?count=30 returns:
{
  "start": 1039,
  "count": 30,
  ... // data
}
From this the consumer calculates that he wants to request:
Next requests: GET /orders?start=1009&count=30 which then returns:
{
  "start": 1009,
  "count": 30,
  ... // data
}
Instead of raw indices you could also return a link to the next page:
{
  "next": "/orders?start=1009&count=30"
}
This approach breaks if items get inserted or deleted in the middle. In that case you should use some auto incrementing persistent value instead of an index.
The sad truth is that all the sites I see have pagination "broken" in that sense, so there must not be an easy way to achieve that.
A quick workaround could be reversing the ordering, so the position of the items is absolute and unchanging with new additions. From your front page you can give the latest indices to ensure consistent navigation from up there.
Pros: same url gives the same results
Cons: there's no evident way to get the latest elements... Maybe you could use negative indices and redirect the result page to the absolute indices.
With a RESTful API, application state should be in the client. Here the application state should be some sort of timestamp or version number telling when you started looking at the data. On the server side, you will need some form of audit trail, which is properly server data, as it does not depend on whether there have been clients and what they have done. At the very least, it should know when the data last changed. No contradiction with REST here.
You could add a version parameter to your GET. When the client first requests a page, it normally does not send a version; the server's reply contains one. For instance, if there are links in the reply to next/other pages, those links contain &version=... The client should send the version when requesting another page.
When the server receives a request with a version, it should at least know whether the data have changed since the client started looking and, depending on what sort of audit trail you have, how they have changed. If they have not, it answers normally, transmitting the same version number. If they have, it may at least tell the client. And depending on how much it knows about how the data have changed, it may tailor the reply accordingly.
Just as an example, suppose you get a request with start, end, version, and that you know that since version was up to date, 3 rows coming before start have been deleted. You might send a redirect with start-3, end-3, new version.
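As a concrete (hypothetical) exchange illustrating that redirect:
GET /orders?start=31&end=60&version=17
302 Found
Location: /orders?start=28&end=57&version=23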
WebSockets can do this. You can use something like pusher.com to catch realtime changes to your database and pass the changes to the client. You can then bind different pusher events to work with models and collections.
Just going to throw this out there. Please feel free to tell me if it's completely wrong and why.
This approach tries to use a left_off value to page through the results without using offsets.
Say you need your results ordered by the order_at timestamp, descending.
So when I ask for the first result set, it's:
SELECT * FROM Orders ORDER BY order_at DESC LIMIT 25;
right?
This is the case when you ask for the first page (in URL terms, probably the request that doesn't have a left_off parameter). The general form of the URL would be:
yoursomething.com/orders?limit=25&left_off=$timestamp
Then, when receiving your data set, just grab the timestamp of the last viewed item, e.g. 2015-12-21 13:00:49.
Now, to request the next 25 items, go to: yoursomething.com/orders?limit=25&left_off=2015-12-21 13:00:49 (the last viewed timestamp).
In SQL you would make the same query, adding a condition that the timestamp is less than $left_off:
SELECT * FROM Orders
WHERE order_at < '2015-12-21 13:00:49'
ORDER BY order_at DESC
LIMIT 25;
You should get the next 25 items after the last seen item.
For those who see this answer: please comment on whether this approach is relevant or even possible in the first place. Thank you.

How to separate a person's identity from his personal data?

I'm writing an app whose main purpose is to keep a list of users' purchases.
I would like to ensure that even I as a developer (or anyone with full access to the database) could not figure out how much money a particular person has spent or what he has bought.
I initially came up with the following scheme:
--------------+------------+-----------
user_hash | item | price
--------------+------------+-----------
a45cd654fe810 | Strip club | 400.00
a45cd654fe810 | Ferrari | 1510800.00
54da2241211c2 | Beer | 5.00
54da2241211c2 | iPhone | 399.00
User logs in with username and password.
From the password calculate user_hash (possibly with salting etc.).
Use the hash to access users data with normal SQL-queries.
Given enough users, it should be almost impossible to tell how much
money a particular user has spent by just knowing his name.
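In code, the idea would be roughly the following (a sketch using Node's crypto; salting and key-stretching details are glossed over):
const crypto = require('crypto');

// Derive the key used to store purchases from the password, never from the name.
function userHash(password, salt) {
  return crypto.createHash('sha256')
               .update(salt + password)
               .digest('hex')
               .slice(0, 13);   // truncated like the example hashes above
}

// e.g. SELECT item, price FROM purchases WHERE user_hash = ?
const key = userHash('correct horse battery staple', 'per-user-salt');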
Is this a sensible thing to do, or am I completely foolish?
I'm afraid that if your application can link a person to their data, any developer/admin can too.
The only thing you can do is make the link harder to establish, to slow down a developer/admin, but if you make it harder to link users to data, you also make it harder for your server.
Idea based on @no's idea:
You can have a classic user/password login to your application (hashed password, or whatever), and a special "pass" used to keep your data secure. This "pass" wouldn't be stored in your database.
When your client logs in to your application, he would have to provide user/password/pass. The user/password is checked against the database, and the pass would be used to load/write data.
When you need to write data, you make a hash of your "username/pass" couple and store it as a key linking your client to your data.
When you need to load data, you make a hash of your "username/pass" couple and load every piece of data matching this hash.
This way it's impossible to make a link between your data and your user.
On the other hand (as I said in a comment to @no), beware of collisions. Plus, if your user types a bad "pass", you can't check it.
Update: For the last part, I had another idea: you can store in your database a hash of your "pass/password" couple; this way you can check whether the "pass" is okay.
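A small sketch of this scheme (illustrative only; the helper names are made up):
const crypto = require('crypto');
const sha256 = (s) => crypto.createHash('sha256').update(s).digest('hex');

// Stored in the users table: a password hash, plus a hash of pass+password so a
// mistyped "pass" can be detected without storing the pass itself.
function registrationRecord(username, password, pass) {
  return {
    username: username,
    passwordHash: sha256(password),
    passCheck: sha256(pass + '/' + password)
  };
}

// The key linking a client to their data; derived, never stored next to the username.
function dataKey(username, pass) {
  return sha256(username + '/' + pass);
}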
Create a users table with:
user_id: an identity column (auto-generated id)
username
password: make sure it's hashed!
Create a product table like in your example:
user_hash
item
price
The user_hash will be based off of user_id which never changes. Username and password are free to change as needed. When the user logs in, you compare username/password to get the user_id. You can send the user_hash back to the client for the duration of the session, or an encrypted/indirect version of the hash (could be a session ID, where the server stores the user_hash in the session).
Now you need a way to hash the user_id into user_hash and keep it protected.
1. If you do it client-side as @no suggested, the client needs to have the user_id. Big security hole (especially if it's a web app): the hash can easily be tampered with and the algorithm is freely available to the public.
2. You could have it as a function in the database. Bad idea, since the database has all the pieces to link the records.
3. For web sites or client/server apps you could have it in your server-side code. Much better, but then one developer has access to the hashing algorithm and the data.
4. Have another developer write the hashing algorithm (which you don't have access to) and stick it on another server (which you also don't have access to) as a TCP/web service. Your server-side code would then pass the user ID and get a hash back. You wouldn't have the algorithm, but you could send all the user IDs through to get all their hashes back. Not a lot of benefits over #3, though the service could have logging and such to try to minimize the risk.
If it's simply a client-database app, you only have choices #1 and 2. I would strongly suggest adding another [business] layer that is server-side, separate from the database server.
Edit:
This overlaps some of the previous points. Have 3 servers:
Authentication server: Employee A has access. Maintains user table. Has web service (with encrypted communications) that takes user/password combination. Hashes password, looks up user_id in table, generates user_hash. This way you can't simply send all user_ids and get back the hashes. You have to have the password which isn't stored anywhere and is only available during authentication process.
Main database server: Employee B has access. Only stores user_hash. No userid, no passwords. You can link the data using the user_hash, but the actual user info is somewhere else.
Website server: Employee B has access. Gets login info, passes to authentication server, gets hash back, then disposes login info. Keeps hash in session for writing/querying to the database.
So Employee A has user_id, username, password and algorithm. Employee B has user_hash and data. Unless employee B modifies the website to store the raw user/password, he has no way of linking to the real users.
Using SQL profiling, Employee A would get user_id, username and password hash (since user_hash is generated later in code). Employee B would get user_hash and data.
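The hashing step on the authentication server could be as simple as an HMAC keyed with a secret only that server holds; a sketch (names are illustrative):
const crypto = require('crypto');

// Known only to the authentication server (Employee A's box), so the main
// database server cannot recompute user_hash from user_id.
const HASH_SECRET = process.env.HASH_SECRET;

function userHashFor(userId) {
  return crypto.createHmac('sha256', HASH_SECRET)
               .update(String(userId))
               .digest('hex');
}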
Keep in mind that even without actually storing the person's identifying information anywhere, merely associating enough information all with the same key could allow you to figure out the identity of the person associated with certain information. For a simple example, you could call up the strip club and ask which customer drove a Ferrari.
For this reason, when you de-identify medical records (for use in research and such), you have to remove birthdays for people over 89 years old (because people that old are rare enough that a specific birthdate could point to a single person) and remove any geographic coding that specifies an area containing fewer than 20,000 people. (See http://privacy.med.miami.edu/glossary/xd_deidentified_health_info.htm)
AOL found out the hard way when they released search data that people can be identified just by knowing what searches are associated with an anonymous person. (See http://www.fi.muni.cz/kd/events/cikhaj-2007-jan/slides/kumpost.pdf)
The only way to ensure that the data can't be connected to the person it belongs to is to not record the identity information in the first place (make everything anonymous). Doing this, however, would most likely make your app pointless. You can make this more difficult to do, but you can't make it impossible.
Storing user data and identifying information in separate databases (and possibly on separate servers) and linking the two with an ID number is probably the closest thing that you can do. This way, you have isolated the two data sets as much as possible. You still must retain that ID number as a link between them; otherwise, you would be unable to retrieve a user's data.
In addition, I wouldn't recommend using a hashed password as a unique identifier. When a user changes their password, you would then have to go through and update all of your databases to replace the old hashed password IDs with the new ones. It is usually much easier to use a unique ID that is not based on any of the user's information (to help ensure that it will stay static).
This ends up being a social problem, not a technological problem. The best solutions will be a social solution. After hardening your systems to guard against unauthorized access (hackers, etc), you will probably get better mileage working on establishing trust with your users and implementing a system of policies and procedures regarding data security. Include specific penalties for employees who misuse customer information. Since a single breach of customer trust is enough to ruin your reputation and drive all of your users away, the temptation of misusing this data by those with "top-level" access is less than you might think (since the collapse of the company usually outweighs any gain).
The problem is that if someone already has full access to the database then it's just a matter of time before they link up the records to particular people. Somewhere in your database (or in the application itself) you will have to make the relation between the user and the items. If someone has full access, then they will have access to that mechanism.
There is absolutely no way of preventing this.
The reality is that by having full access we are in a position of trust. This means that the company managers have to trust that even though you can see the data, you will not act in any way on it. This is where little things like ethics come into play.
Now, that said, a lot of companies separate the development and production staff. The purpose is to remove Development from having direct contact with live (ie:real) data. This has a number of advantages with security and data reliability being at the top of the heap.
The only real drawback is that some developers believe they can't troubleshoot a problem without production access. However, this is simply not true.
Production staff then would be the only ones with access to the live servers. They will typically be vetted to a larger degree (criminal history and other background checks) that is commensurate with the type of data you have to protect.
The point of all this is that this is a personnel problem; and not one that can truly be solved with technical means.
UPDATE
Others here seem to be missing a very important and vital piece of the puzzle. Namely, that the data is being entered into the system for a reason. That reason is almost universally so that it can be shared. In the case of an expense report, that data is entered so that accounting can know who to pay back.
Which means that the system, at some level, will have to match users and items without the data entry person (ie: a salesperson) being logged in.
And because that data has to be tied together without all parties involved standing there to type in a security code to "release" the data, then a DBA will absolutely be able to review the query logs to figure out who is who. And very easily I might add regardless of how many hash marks you want to throw into it. Triple DES won't save you either.
At the end of the day all you've done is make development harder with absolutely zero security benefit. I can't emphasize this enough: the only way to hide data from a dba would be for either 1. that data to only be accessible by the very person who entered it or 2. for it to not exist in the first place.
Regarding option 1, if the only person who can ever access it is the person who entered it.. well, there is no point for it to be in a corporate database.
It seems like you're right on track with this, but you're just overthinking it (or I simply don't understand it).
Write a function that builds a new string based on the input (which will be their username or something else that can't change over time).
Use the returned string as a salt when building the user hash (again, I would use the user ID or username as the input for the hash builder because they won't change like the user's password or email).
Associate all user actions with the user hash.
No one with only database access can determine what the hell the user hashes mean. Even an attempt at brute forcing it by trying different seed, salt combinations will end up useless because the salt is determined as a variant of the username.
I think you've answered you own question with your initial post.
Actually, there's a way you could possibly do what you're talking about...
You could have the user type his name and password into a form that runs a purely client-side script which generates a hash based on the name and pw. That hash is used as a unique id for the user, and is sent to the server. This way the server only knows the user by hash, not by name.
For this to work, though, the hash would have to be different from the normal password hash, and the user would be required to enter their name / password an additional time before the server would have any 'memory' of what that person bought.
The server could remember what the person bought for the duration of their session and then 'forget', because the database would contain no link between the user accounts and the sensitive info.
edit
In response to those who say hashing on the client is a security risk: It's not if you do it right. It should be assumed that a hash algorithm is known or knowable. To say otherwise amounts to "security through obscurity." Hashing doesn't involve any private keys, and dynamic hashes could be used to prevent tampering.
For example, you take a hash generator like this:
http://baagoe.com/en/RandomMusings/javascript/Mash.js
// From http://baagoe.com/en/RandomMusings/javascript/
// Johannes Baagoe <baagoe@baagoe.com>, 2010
function Mash() {
  var n = 0xefc8249d;

  var mash = function(data) {
    data = data.toString();
    for (var i = 0; i < data.length; i++) {
      n += data.charCodeAt(i);
      var h = 0.02519603282416938 * n;
      n = h >>> 0;
      h -= n;
      h *= n;
      n = h >>> 0;
      h -= n;
      n += h * 0x100000000; // 2^32
    }
    return (n >>> 0) * 2.3283064365386963e-10; // 2^-32
  };

  mash.version = 'Mash 0.9';
  return mash;
}
See how n changes: each time you hash a string, you get something different.
Hash the username+password using a normal hash algo. This will be the same as the key of the 'secret' table in the database, but will match nothing else in the database.
Append the hashed pass to the username and hash it with the above algorithm.
Base-16 encode var n and append it to the original hash with a delimiter character.
This will create a unique hash (it will be different each time) which can be checked by the system against each column in the database. The system can be set up to allow a particular unique hash only once (say, once a year), preventing MITM attacks, and none of the user's information is passed across the wire. Unless I'm missing something, there is nothing insecure about this.