Why is my identifier collision rate increasing? - hash

I'm using a hash of IP + User Agent as a unique identifier for every user that visits a website. This is a simple scheme with a pretty clear pitfall: identifier collisions. Multiple individuals browse the internet with the same IP + user agent combination. Unique users identified by the same hash will be recognized as a single user. I want to know how frequently this identifier error will be made.
To calculate the frequency, I've created a two-step funnel that should theoretically convert at zero percent: publish.click > signup.complete. (Users have to sign up before they publish.) Running this funnel for 1 day gives me a conversion rate of 0.37%. That figure, I reasoned, is my unique identifier collision probability for that funnel. Looking at the raw data (a table about 10,000 rows long), I confirmed this hypothesis: 37 signups were completed by new users identified by the same hash as old users who completed publish.click during the funnel period (1 day). (I know this because hashes matched up across the funnel, while UIDs, which are assigned at signup, did not.)
I thought I had it all figured out...
But then I ran the funnel for 1 week, and the conversion rate increased to 0.78%. For 5 months, the conversion rate jumped to 1.71%.
What could be at play here? Why is my conversion (collision) rate increasing with widening experiment period?
I think it may have something to do with the fact that unique users typically only fire signup.complete once, while they may fire publish.click multiple times over the course of a period. However, I'm struggling to put this hypothesis into words.
Any help would be appreciated.

Possible explanations starting with the simplest:
The collision rate is relatively stable, but your initial measurement isn't significant because of the low volume of positives that you got. 37 isn't very many. In this case, you've got two decent data points.
The collision rate isn't very stable and changes over time as usage changes (at work, at home, using mobile, etc.). The fact that you got three data points that show an upward trend is just a coincidence. This wouldn't surprise me, as funnel conversion rates change significantly over time, especially on a weekly basis. Bots that haven't been caught could also contribute.
If you really get multiple publishes, and sign-ups are absolutely a one-time thing, then your collision rate would increase as users who only signed up and didn't publish eventually publish. That won't increase their funnel conversion, but it will provide an extra publish for somebody else to convert on. Essentially, every additional publish raises the probability that I, as a new user, am going to get confused with a previous publish event.
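To make hypothesis 3 concrete, here is a toy simulation (all the volumes and the hash-space size below are invented; the only point is that the pool of hashes that have already fired publish.click keeps growing with the window):

import random

def simulate(days, hash_space=50_000, publishes_per_day=300, signups_per_day=100):
    # Toy model: a new signup "converts" in the broken funnel if its hash was
    # already used by someone who fired publish.click earlier in the window.
    published = set()              # hashes that have published so far
    collisions = signups = 0
    for _ in range(days):
        for _ in range(publishes_per_day):
            published.add(random.randrange(hash_space))    # publishes accumulate
        for _ in range(signups_per_day):
            signups += 1
            if random.randrange(hash_space) in published:  # hash collision
                collisions += 1
    return collisions / signups

for days in (1, 7, 150):
    print(days, round(simulate(days), 4))

The longer the window, the larger the pool each new signup is compared against, so the cumulative conversion (collision) rate can only creep upwards.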
Note from OP: hypothesis 3 turned out to be the correct one.

Related

How to determine when to start a counter to ensure it never catches the previous counter

I have a problem where several events are occurring in a project. The events happen semi-concurrently: they do not start at the same time, but multiple can still be occurring at once.
Each event is a team of people working on a linear task, starting at the beginning and then working their way to the end. Their progress is based on a physical distance.
I essentially need to figure out each event's start time so that no two teams are at the same location, or passing each other, at any point.
I am trying to program this in MATLAB so that the output would be the start and end time for each event. The idea would be to optimize the total time taken for the project.
I am not sure where to begin with something like this so any advice would be greatly appreciated.
If I understand correctly, you just want to optimize the "calendar" of events with limited resources (i.e. space/teams).
This kind of problem is NP-hard, and there is no "easy" way to search for the best solution.
You have two options here:
Greedy-like algorithm: you will have your solution in a reasonable time, but it won't be the best one.
Brute-force-like algorithm: you will find the best solution, but maybe not in the time you need it.
Usually, if the number of events is low you can go for the second option; otherwise you may need to go for the first one.
No matter which one you choose, the first thing you will need to do is determine whether a solution is valid. What does this mean? It means checking, for every event, whether it collides with others in time, space, or teams.
So let's imagine the problem of building the calendar of a university. There you have to think about:
Students
Teachers
Classrooms
So for each event I have to check whether another event has the same students, teacher, or classroom at the same time. First I find the events that overlap in time with the current event, then I compare the current event against those.
Once you have that, you can write a greedy algorithm that places events one at a time, checking at each step that the new placement doesn't collide with anything already placed (see the sketch below).
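A bare-bones version of that greedy placement in Python (the Event fields, the integer time steps, and the resource sets are placeholders; a MATLAB version would follow the same structure):

from dataclasses import dataclass

@dataclass
class Event:
    duration: int
    resources: frozenset   # e.g. frozenset({"team A", "segment 3"}) -- placeholder

def collides(a_start, a, b_start, b):
    # Two events collide if they overlap in time AND share any resource.
    overlap = a_start < b_start + b.duration and b_start < a_start + a.duration
    return overlap and bool(a.resources & b.resources)

def greedy_schedule(events):
    # Place each event at the earliest start time that collides with nothing
    # already placed; returns a (start, end) pair per event.
    placed, schedule = [], []
    for ev in events:
        start = 0
        while any(collides(start, ev, s, other) for s, other in placed):
            start += 1                      # try the next time step
        placed.append((start, ev))
        schedule.append((start, start + ev.duration))
    return schedule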

Picking a check digit algorithm

I am generating random OTP-style strings that serve as a short-term identifier to link two otherwise unrelated systems (which have authentication at each end). These need to be read and re-entered by users, so in order to reduce the error rate and reduce the opportunities for forgery, I'd like to make one of the digits a check digit. At present my random string conforms to the pattern (removing I and O to avoid confusion):
^[ABCDEFGHJKLMNPQRSTUVWXYZ][0-9]{4}$
I want to append one extra decimal digit for the check. So far I've implemented this as a BLAKE2 hash (from libsodium) that's converted to decimal and truncated to 1 char. This gives only 10 possibilities for the check digit, which isn't much. My primary objective is to detect single character errors in the input.
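For reference, the whole scheme boils down to something like this (a sketch using Python's hashlib rather than libsodium; the digest size and function names are my own):

import hashlib

def check_digit(code):
    # One decimal check digit derived from a BLAKE2 hash of the code.
    digest = hashlib.blake2b(code.encode("ascii"), digest_size=8).digest()
    return str(int.from_bytes(digest, "big") % 10)

def with_check(code):
    return code + check_digit(code)    # "K3770" -> "K3770" plus one digit

def is_valid(candidate):
    return len(candidate) == 6 and check_digit(candidate[:5]) == candidate[5]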
This approach kind of works, but it seems that one digit is not enough to detect single-character errors, and undetected errors are quite easy to find; for example, K37705 and K36705 are both considered valid.
I do not have a time value baked into this OTP; instead it's purely random and I'm relying on keeping a record of the OTPs that have been generated recently for each user, which are deleted periodically, and I'm reducing opportunities for brute-forcing by rate and attempt-count limiting.
I'm guessing that BLAKE2 isn't a good choice here, but given there are only 10 possibilities for the result, I don't know that others will be better. What would be a better algorithm/approach to use?
Frame challenge
Why do you need a check digit?
It doesn't improve security, and five digits is trivial for most humans to get correct. Check it server-side and return an error message if it's wrong.
Normal TOTP tokens are commonly 6 digits, and actors such as Google have determined that people in general manage to get them correct.
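In other words, something along these lines on the server (a sketch only; the storage, the attempt limit, and the return values are all placeholders):

MAX_ATTEMPTS = 5    # assumed policy, not something from the question

def verify(user_id, submitted, issued_codes, attempts):
    # issued_codes: {user_id: set of currently valid codes}, filled at generation time
    # attempts: {user_id: consecutive failed attempts}
    if attempts.get(user_id, 0) >= MAX_ATTEMPTS:
        return "locked out, try again later"
    if submitted in issued_codes.get(user_id, set()):
        issued_codes[user_id].discard(submitted)    # one-time use
        attempts[user_id] = 0
        return "ok"
    attempts[user_id] = attempts.get(user_id, 0) + 1
    return "invalid code, please re-check what you typed"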

Rewards instead of penalty in optaplanner

So I have lectures and time periods, and some lectures need to be taught in a specific time period. How do I do that?
Does scoreHolder.addHardConstraintMatch(kcontext, 10); solve this as a hard constraint? Does the value of positive 10 ensure the constraint of courses being in a specific time period?
I'm aware of the Penalty pattern, but I don't want to make a lot of CoursePeriodPenalty objects. Ideally, I'd like to have only one CoursePeriodReward object to say that CS101 should be in the 9:00-10:00 time period.
Locking them with immovable planning entities won't work, as I suspect you still want OptaPlanner to decide the room for you - and currently OptaPlanner only supports a MovableSelectionFilter per entity, not per variable (vote for the open JIRA issue for that).
A positive hard constraint would definitely work. Your score will be harder to interpret for your users though, for example a solution with a hard score of 0 won't be feasible (either it didn't get that +10 hard points or it lost 10 hard points somewhere else).
Or you could add a new negative hard constraint type that says: if the period != desiredTimeslot, then lose 10 points.

A database to store a non-stationary distribution

I have many categorical distributions. A categorical distribution describes the probability of an event drawn from a set of k possible events. I need to be able to access the probability of an event very quickly.
One way to store a categorical distribution is in Redis using a sorted set. Each key indexes a separate distribution, each member of the sorted set is a specific event, and each score is the number of times you've seen that event. For each key (distribution) you would also store the sum of counts over all events in that distribution, so you can normalise properly.
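Concretely, with redis-py that baseline looks something like this (the key names are just for illustration):

import redis

r = redis.Redis()    # assumes a local Redis instance

def observe(dist, event, n=1):
    # Count n occurrences of `event` in the distribution stored under key `dist`.
    r.zincrby(dist, n, event)        # per-event count in the sorted set
    r.incrby(dist + ":total", n)     # normalisation constant

def probability(dist, event):
    count = r.zscore(dist, event) or 0.0
    total = int(r.get(dist + ":total") or 0)
    return count / total if total else 0.0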
The question I'd like to ask is: what is a good way to store this data if the probabilities are changing over time? I'd essentially like to be able to forget old observations - i.e. to decrement the score and normalisation constant for each key at some regular interval.
With the Redis approach above, I could run a cron job every d minutes, iterate over each distribution and decrement each member's score along with the normalisation constant. However, this seems a bit wrong (I'm sure my mum told me to never iterate over KEYS *), so I'm wondering if anyone else has solved this problem a bit more comprehensively?
I'm guessing that what feels wrong to you is some combination of:
The need to visit every distribution, every member of each ZSET, and the normalisation constant whenever the cron job runs
The way that the unconditional decrement operation will, over time, skew distributions in favor of events that happen multiple times per cron cycle
I haven't done anything like this before, but one solution comes to mind if you're able to spare more storage.
The idea is to store, at a regular interval, a timestamped queue of snapshots. Each snapshot represents the event counts in your distributions for that interval of time. When you want to expire the old probabilities in your distribution, you pop the expired snapshots off the list and decrement the ZSETs accordingly.
More concretely, you'll need to do the following (a code sketch follows these steps):
Keep track in memory of the events that occur during the interval [t_{k-1}, t_k) and how many times each occurred -- a set of (event, count) pairs. This is in addition to the (presumably) real-time updating of the ZSET scores and normalisation factors that you currently do.
At each tick tk, store the snapshot:
Create a unique key Sk to represent the snapshot at tk -- like a UUID or similar
For each event E in the snapshot, create a unique hash key q(E). Choose a key encoding that will allow you to recover the distribution (ZSET) key and event (member) key for that event.
Call HSET Sk with the event key q(E) and event count |E| to store the event data. Repeat for all events in the snapshot.
RPUSH SNAPSHOTS <timestamp>:Sk
At each expiry tick tm, expire old snapshots:
LPOP the SNAPSHOTS list, decoding the timestamp and verifying whether expired.
If not expired, LPUSH it back onto the SNAPSHOTS list and you're done until the next expiry tick. Otherwise...
Decode the snapshot key Sk
Using the results of HKEYS Sk, decode each event key q(E), get the corresponding count, and then decrement the appropriate ZSET and normalisation factor by that amount.
Repeat while expired snapshots still exist in the SNAPSHOTS list.
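Sketched with redis-py, the snapshot and expiry steps above could look roughly like this (the key layout, the field encoding, and the TTL are my own assumptions, and the per-interval counts are handed in as a plain dict):

import time
import uuid
import redis

r = redis.Redis()            # assumes a local Redis instance
SNAPSHOTS = "SNAPSHOTS"      # list of "<timestamp>:<snapshot key>" entries
TTL = 3600                   # forget observations older than an hour (arbitrary)

def record_snapshot(interval_counts):
    # interval_counts: {(dist_key, event): count} accumulated since the last tick.
    snap_key = "snap:" + uuid.uuid4().hex
    for (dist, event), count in interval_counts.items():
        # Encode distribution and event into one hash field so both can be recovered later.
        r.hset(snap_key, dist + "|" + event, count)
        # Real-time update of the distribution and its normalisation constant.
        r.zincrby(dist, count, event)
        r.incrby(dist + ":total", count)
    r.rpush(SNAPSHOTS, str(int(time.time())) + ":" + snap_key)

def expire_snapshots():
    while True:
        head = r.lpop(SNAPSHOTS)
        if head is None:
            return
        ts, snap_key = head.decode().split(":", 1)
        if time.time() - int(ts) < TTL:
            r.lpush(SNAPSHOTS, head)     # not expired yet; put it back and stop
            return
        for field, count in r.hgetall(snap_key).items():
            dist, event = field.decode().split("|", 1)
            r.zincrby(dist, -int(count), event)
            r.decrby(dist + ":total", int(count))
        r.delete(snap_key)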
The amount of extra storage required will depend on the length of the snapshot and expiry intervals and the number of distinct events that occur within each snapshot interval.
In the worst case, every distribution and event will be represented in each snapshot, so this will not help with wrongness factor #1. Optimistically, a suitably small percentage of distributions and/or events will be represented in any snapshot, and the efficiency of the expiration process will improve. But this will address wrongness factor #2 even in the worst case, since events that happened recently will not be unconditionally decremented in your distributions each time the expiration cron job runs.

How to approximate processing time?

It's common to see messages like "Installation will take 10 min approx." in desktop applications. So I wonder how I can calculate, approximately, how much time a certain process will take. Of course I won't be installing anything, but I do want to update some internal data, and depending on the user's usage this might take some time.
Is this possible in an iPhone app? How do Cocoa developers do this, and would it be the same way in an iPhone app?
Thanks in advance.
UPDATE: I want to rewrite/edit some files on disk, most of the time these files are not the same size so I cannot use timers for the first iteration and calculate the rest from that.
Is there any API that helps on calculating this?
If you have some list of things to process, each "thing" (it's usually better to measure a group of 10 or so "things") is a unit of work. Your goal is to see how long it takes to process a single group and report the estimated time to completion.
One way is to create an NSDate at the start of each group and a new one at the end (the top and bottom of your for loop) for each group. Multiply the difference in seconds by however many groups you have left (minus the one you just processed) and that should be a reasonable estimate of the time remaining.
Of course this gets more complicated if one "thing" takes a lot longer to process than another "thing" - the above approach assumes all things take the same amount of time. In this case, however, you may need to keep track of an average window (across the last n "things" or groups thereof).
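The same idea sketched in Python rather than with NSDate (the group size, the window length, and the process() call are placeholders):

import time
from collections import deque

def process_with_estimate(items, group_size=10, window=5):
    # Process items in groups and report an estimated time remaining after each group.
    groups = [items[i:i + group_size] for i in range(0, len(items), group_size)]
    recent = deque(maxlen=window)             # durations of the last few groups

    for done, group in enumerate(groups, start=1):
        start = time.monotonic()
        for item in group:
            process(item)                     # hypothetical per-item work
        recent.append(time.monotonic() - start)

        avg = sum(recent) / len(recent)       # moving average over recent groups
        print("~%.0fs remaining" % (avg * (len(groups) - done)))

Keeping only the last few group durations is what smooths out the case where some "things" take much longer than others.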
A more detailed response would require more details about your model and what work you're performing.