Taking second-by-second data and transforming it into hour-by-hour? - aggregate

So, I have a dataset that contains values sampled every second.
I would like to transform this second-by-second dataset so that it is indexed every hour, where the value at each hour is the running sum total through the day.
I haven't been able to find anything similar in my searches, so if anyone could point me toward the best method to accomplish this, it would be greatly appreciated!

It really depends on what you are using to implement this, as well as the greater context of your application, but generally a running sum is fairly simple. For instance, if you are using a language such as MATLAB, there are built-in functions that let you sum the contents of an array (as you would in Excel). Other languages have libraries or packages you can call on to do this too, and I recommend looking it up if you are using anything higher level than, say, C.
However, let's assume you want to write your own function to do this. The approach that jumps to mind is a single iteration through your data array. Say your array has n elements in it. In the loop, designate a running-sum variable and add each element to it on every iteration, for example:
my $sum = $dataArray[0];  # running sum tracker, initialized to the first value
for (my $i = 1; $i < scalar(@dataArray); $i++) {
    $sum += $dataArray[$i];
}
In the end, this loop has complexity O(n). I would then add a conditional inside the loop that records the sum into some other data structure, keyed by $i, whenever $i is a multiple of the number of samples in an hour (3600). My favorite way to do that would probably be a hash or associative array mapping $i => $sum pairs, as this lets me track exactly where each running sum was cut off. But there's no reason a plain old array can't suffice if you are willing to write the code to convert your 1:n indices into "time" and just assume they correspond to "hour 0, hour 1, hour 2 ....". A sketch of the idea follows.
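To make that concrete, here is a minimal sketch in Python rather than Perl (the function name and the assumption that samples holds exactly one value per second are mine, not from the question):

SAMPLES_PER_HOUR = 3600

def hourly_running_sums(samples):
    # Map hour number -> running sum of everything seen up to the end of that hour.
    hourly = {}
    running = 0
    for i, value in enumerate(samples, start=1):
        running += value
        if i % SAMPLES_PER_HOUR == 0:
            hourly[i // SAMPLES_PER_HOUR] = running
    return hourly

# e.g. two hours of data where every sample is 1:
print(hourly_running_sums([1] * 7200))  # {1: 3600, 2: 7200}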
WARNING: If you do this, I caution you that there is no substitute for having timestamps with your data. The sampling rate can drift due to hardware or accumulate approximation error in scripting, and this can lead to significant skew between time and data if you are not careful.

Related

Don't you get a random number after doing modulo on a hashed number?

I'm trying to understand hash tables, and from what I've seen the modulo operator is used to select which bucket a key will be placed in. I know that hash algorithms are supposed to minimize the same result for different inputs, however I don't understand how the same results for different inputs can be minimal after the modulo operation. Let's just say we have a near-perfect hash function that gives a different hashed value between 0 and 100,000, and then we take the result modulo 20 (in our example we have 20 buckets), isn't the resulting number very close to a random number between 0 and 19? Meaning roughly the probability that the final result is any of a number between 0 and 19 is about 1 in 20? If this is the case, then the original hash function doesn't seem to ensure minimal collisions because after the modulo operation we end up with something like a random number? I must be wrong, but I'm thinking that what ensures minimal collisions the most is not the original hash function but how many buckets we have.
I'm sure I'm misunderstanding this. Can someone explain?
Don't you get a random number after doing modulo on a hashed number?
It depends on the hash function.
Say you have an identity hash for numbers - h(n) = n - then if the keys being hashed are generally incrementing numbers (perhaps with an occasional omission), then after hashing they'll still generally hit successive buckets (wrapping at some point from the last bucket back to the first), with low collision rates overall. Not very random, but it works out well enough. If the keys are random, it still works out pretty well - see the discussion of random-but-repeatable hashing below. The problem is when the keys are neither roughly-incrementing nor close-to-random - then an identity hash can provide terrible collision rates. (You might think "this is a crazy bad example hash function, nobody would do this"; actually, most C++ Standard Library implementations' hash functions for integers are identity hashes.)
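A quick illustration of both behaviours (a sketch in Python, using an identity hash and an assumed bucket count of 16):

buckets = 16
incrementing = range(1000)          # roughly incrementing keys
patterned    = range(0, 16000, 16)  # keys that are all multiples of the bucket count

def spread(keys):
    # identity hash h(n) = n, then modulo to pick a bucket
    counts = [0] * buckets
    for k in keys:
        counts[k % buckets] += 1
    return counts

print(spread(incrementing))  # close to even: 62 or 63 keys per bucket
print(spread(patterned))     # pathological: every key lands in bucket 0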
On the other hand, if you have a hash function that, say, takes the address of the object being hashed, and they're all 8-byte aligned, then if you take the mod and the bucket count is also a multiple of 8, you'll only ever hash to every 8th bucket, getting 8 times more collisions than you might expect. Not very random, and it doesn't work out well. But if the number of buckets is a prime, then the addresses will tend to scatter much more randomly over the buckets, and things will work out much better. This is the reason the GNU C++ Standard Library tends to use prime numbers of buckets (Visual C++ uses power-of-two bucket counts so it can use a bitwise AND for mapping hash values to buckets, as AND takes one CPU cycle while MOD can take e.g. 30-40 cycles, depending on your exact CPU).
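You can see that effect with a small sketch (Python for brevity; the 8-byte-aligned "addresses" and the bucket counts 64 and 61 are assumptions for illustration):

import random

addresses = [8 * random.randrange(1_000_000) for _ in range(10_000)]  # 8-byte-aligned "pointers"

def used_buckets(nbuckets):
    return len({a % nbuckets for a in addresses})

print(used_buckets(64))  # only 8 of the 64 buckets are ever used (every 8th one)
print(used_buckets(61))  # prime bucket count: (almost surely) all 61 buckets get used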
When all the inputs are known at compile time, and there aren't too many of them, then it's generally possible to create a perfect hash function (the GNU gperf software is designed specifically for this), which means it will work out a number of buckets you'll need and a hash function that avoids any collisions, but the hash function may take longer to run than a general purpose function.
People often have a fanciful notion - also seen in the question - that a "perfect hash function", or at least one that has very few collisions over some large numerical hashed-to range, will provide minimal collisions in actual usage in a hash table; indeed, this Stack Overflow question is about coming to grips with the falsehood of that notion. It's just not true if there are still patterns and probabilities in the way the keys map into that large hashed-to range.
The gold standard for a general purpose high-quality hash function for runtime inputs is to have a quality that you might call "random but repeatable", even before the modulo operation, as that quality will apply to the bucket selection as well (even using the dumber and less forgiving AND bit-masking approach to bucket selection).
As you've noticed, this does mean you'll see collisions in the table. If you can exploit patterns in the keys to get fewer collisions than this random-but-repeatable quality would give you, then by all means make the most of that. If not, the beauty of hashing is that with random-but-repeatable hashing your collisions are statistically related to your load factor (the number of stored elements divided by the number of buckets).
As an example, for separate chaining: when your load factor is 1.0, 1/e (~36.8%) of buckets will tend to be empty, another 1/e (~36.8%) will have one element, 1/(2e) (~18.4%) two elements, 1/(3!e) (~6.1%) three elements, 1/(4!e) (~1.5%) four elements, 1/(5!e) (~0.3%) five, etc. The average chain length from non-empty buckets is ~1.58 no matter how many elements are in the table (i.e. whether there are 100 elements and 100 buckets, or 100 million elements and 100 million buckets), which is why we say lookup/insert/erase are O(1) constant-time operations.
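If you want to convince yourself of those fractions, a small simulation (a Python sketch, assuming uniformly random bucket choices as a stand-in for the random-but-repeatable quality) reproduces them closely:

import math, random
from collections import Counter

n = buckets = 1_000_000  # load factor 1.0
hits = Counter(random.randrange(buckets) for _ in range(n))  # element count per non-empty bucket
sizes = Counter(hits.values())
sizes[0] = buckets - len(hits)  # buckets that were never hit

for k in range(6):
    observed = sizes[k] / buckets
    predicted = 1 / (math.factorial(k) * math.e)
    print(k, round(observed, 4), round(predicted, 4))

print("average non-empty chain length:", n / len(hits))  # ~1.58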
I know that hash algorithms are supposed to minimize the same result for different inputs, however I don't understand how the same results for different inputs can be minimal after the modulo operation.
This is still true post-modulo. Minimising the same result means each post-modulo value has (about) the same number of keys mapping to it. We're particularly concerned about in-use keys stored in the table, if there's a non-uniform statistical distribution to the use of keys. With a hash function that exhibits the random-but-repeatable quality, there will be random variation in post-modulo mapping, but overall they'll be close enough to evenly balanced for most practical purposes.
Just to recap, let me address this directly:
Let's just say we have a near-perfect hash function that gives a different hashed value between 0 and 100,000, and then we take the result modulo 20 (in our example we have 20 buckets), isn't the resulting number very close to a random number between 0 and 19? Meaning roughly the probability that the final result is any of a number between 0 and 19 is about 1 in 20? If this is the case, then the original hash function doesn't seem to ensure minimal collisions because after the modulo operation we end up with something like a random number? I must be wrong, but I'm thinking that what ensures minimal collisions the most is not the original hash function but how many buckets we have.
So:
random is good: if you get something like the random-but-repeatable hash quality, then your average hash collisions will statistically be capped at low levels, and in practice you're unlikely to ever see a particularly horrible collision chain, provided you keep the load factor reasonable (e.g. <= 1.0)
that said, your "near-perfect hash function...between 0 and 100,000" may or may not be high quality, depending on whether the distribution of values has patterns in it that would produce collisions. When in doubt about such patterns, use a hash function with the random-but-repeatable quality.
What would happen if you took a random number instead of using a hash function? Then doing the modulo on it? If you call rand() twice you can get the same number - a proper hash function doesn't do that I guess, or does it? Even hash functions can output the same value for different input.
This comment shows you grappling with the desirability of randomness - hopefully with earlier parts of my answer you're now clear on this, but anyway the point is that randomness is good, but it has to be repeatable: the same key has to produce the same pre-modulo hash so the post-modulo value tells you the bucket it should be in.
As an example of random-but-repeatable, imagine you used rand() to populate a uint32_t a[256][8] array; you could then hash any 8-byte key (e.g. a double) by XORing together the random numbers selected by the key's bytes:
#include <cstdint>
#include <cstring>
uint32_t a[256][8];  // filled with rand() up front, then left unchanged
uint32_t h(double d) {
    uint8_t i[8];
    memcpy(i, &d, 8);  // reinterpret the key as 8 bytes
    return a[i[0]][0] ^ a[i[1]][1] ^ a[i[2]][2] ^ a[i[3]][3]
         ^ a[i[4]][4] ^ a[i[5]][5] ^ a[i[6]][6] ^ a[i[7]][7];
}
This would produce a near-ideal (rand() isn't a great quality pseudo-random number generator) random-but-repeatable hash, but having a hash function that needs to consult largish chunks of memory can easily be slowed down by cache misses.
Following on from what [Mureinik] said, assuming you have a perfect hash function, say your array/buckets are 75% full, then doing modulo on the hashed function will probably result in a 75% collision probability. If that's true, I thought they were much better. Though I'm only learning about how they work now.
The 75%/75% thing is correct for a high quality hash function, assuming:
closed hashing / open addressing, where collisions are handled by finding an alternative bucket, or
separate chaining, when 75% of buckets have one or more elements chained from them (which very likely means the load factor - the thing many people actually have in mind when they talk about how "full" the table is - is already significantly more than 75%; see the quick calculation below)
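As a quick sanity check on that parenthetical (a Python one-liner, assuming the Poisson approximation in which the fraction of non-empty buckets at load factor L is 1 - e^-L):

import math

# Solve 1 - e^-L = 0.75 for the load factor L at which 75% of buckets are occupied:
load = -math.log(1 - 0.75)
print(load)  # ~1.386, i.e. well above 0.75 (and above 1.0)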
Regarding "I thought they were much better." - that's actually quite ok, as evidenced by the percentages of colliding chain lengths mentioned earlier in my answer.
I think you have the right understanding of the situation.
Both the hash function and the number of buckets affect the chance of collisions. Consider, for example, the worst possible hash function - one that returns a constant value. No matter how many buckets you have, all the entries will be lumped to the same bucket, and you'd have a 100% chance of collision.
On the other hand, if you have a (near) perfect hash function, the number of buckets becomes the main factor in the chance of collision. If your hash table has only 20 buckets, the minimal chance of collision will indeed be 1 in 20 (over time). If the hash values weren't uniformly spread, you'd have a much higher chance of collision in at least one of the buckets. The more buckets you have, the lower the chance of collision. On the other hand, having too many buckets will take up more memory (even if they are empty), and ultimately reduce performance, even if there are fewer collisions.
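To see both extremes side by side, here is a small sketch (Python, with made-up string keys and 20 buckets; the constant hash stands in for the worst possible hash function, and Python's built-in hash() stands in for a reasonably good one):

from collections import Counter

keys = [f"user{n}" for n in range(1000)]  # hypothetical keys
buckets = 20

constant = Counter(0 for k in keys)                  # worst case: h(k) = 0 for every key
decent   = Counter(hash(k) % buckets for k in keys)  # a reasonable general-purpose hash

print(max(constant.values()))  # 1000: everything is lumped into one bucket
print(max(decent.values()))    # around 1000/20 = 50, give or take random variation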

Given an array, find top 'x' in O(1)

I'm gonna write the problem as I found it and I will then explain what confuses me.
"A teacher is marking his students' work from 0-10 but he only marks with an 8 or above for a certain number 'x'(x=15 for example) of the 'n' students. You are given an array with all the students' marks in random order. Find the 'x' best marks in O(1)."
We certainly have been taught hashing, but this requires me to store all the data in a hash table, which is definitely not O(1). Maybe we don't have to take the 'conversion' into account? If we do, maybe the conversion combined with the search time after will lead to a method different from hashing.
In that case, leaving O(1) aside, what is the fastest algorithm including both the conversion and the search time?
Simple: It's not possible.
O(1) can only be achieved if the input size, the number of necessary comparisons, and the output size are all constants. You may argue that x could be treated as a constant, but it still doesn't work:
You need to inspect every single input element, all n of them, as the random input order does not allow any heuristic to guess where the xth element would be, even if you had already correctly guessed the other x-1 elements in constant time.
As the problem is stated, there is no solution which can do it in the upper bounds of O(1) or O(x).
Let's just assume your instructor corrects his mistake, and gives you a revised version which correctly states O(n) as the required upper bound.
In that case your hash approach is (almost) correct. The catch of using a hash function is that you now need to account for potential collisions, which are the reason why hash maps don't work strictly in O(1), but rather only in O(1) "on average".
As you know all possible values (grades from 0-10), you can just allocate buckets with a known index. Inside each bucket you may use linked lists, as they also allow constant time insertions and linear time iteration.
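For illustration, here is a minimal sketch of that bucket idea in Python (the function name, the sample marks, and the use of plain lists in place of linked lists are my assumptions; the overall pass is O(n + x)):

def top_x_marks(marks, x):
    # One pass to bucket the marks by value (grades are known to be 0-10),
    # then walk the buckets from 10 downwards, collecting the x best marks.
    buckets = [[] for _ in range(11)]
    for m in marks:
        buckets[m].append(m)
    best = []
    for grade in range(10, -1, -1):
        for m in buckets[grade]:
            if len(best) == x:
                return best
            best.append(m)
    return best

print(top_x_marks([7, 10, 3, 8, 9, 9, 2, 10, 5], 4))  # [10, 10, 9, 9]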

Number of comparisons during closed address hashing?

Initially, all entries in the hash table are empty lists.
All elements with hash address i will be inserted into the linked list h[i]. If there is a collision during hashing of keys, the key will be added to the end of the linked list.
For the average case of successful search, do I count the comparison that checks whether h[i] is null? If it's null, it means the linked list is empty and it should return not found. Should that be 1 comparison or 0 comparisons, in terms of complexity?
Sorry for this stupid question, I'm still learning algorithm complexity.
For "big-O" complexity it just doesn't matter, as there is no such thing as "O(2N+1)" complexity (from counting element and pointer comparisons) - it simplifies to O(N), where N is the number of elements in the bucket h[i]. Alternatively, you might say the average big-O complexity across buckets is O(N) where N is size / buckets, aka load factor.
If you're not doing big-O complexity analysis, we can't really tell you what you want to count. I would point out that comparisons of pointers to nullptr are much cheaper than object comparison involving an extra level of indirection or scanning along a large object (e.g. std::string objects too long for any Short-String-Optimisation buffer), so can often be neglected.
If in doubt as to what's wanted, I'd suggest you report the comparisons as in "searching for an element that's not present involves N object value comparisons and N+1 pointer comparisons, where N is the number of elements chained from h[i]".
If you must give just one expression (for example, some computerised multiple-choice test), I'd suggest a count of element comparisons is likely the desired answer - the number of value comparisons (i.e. 0 for an empty hash bucket), as it's most common to be interested in the complexity as a function of the number of data elements.
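If it helps to see the counts concretely, here is a tiny sketch (Python, with an ordinary list standing in for the linked list chained from h[i]; the split into "pointer" and value comparisons mirrors the wording above and is only a model):

def search_bucket(bucket, key):
    pointer_comparisons = 0  # "is this node null?" checks while walking the chain
    value_comparisons = 0
    for element in bucket:
        pointer_comparisons += 1  # reached a non-null node
        value_comparisons += 1
        if element == key:
            return True, value_comparisons, pointer_comparisons
    pointer_comparisons += 1      # the final check that hit the null end of the chain
    return False, value_comparisons, pointer_comparisons

print(search_bucket([], 42))          # (False, 0, 1): empty bucket, 0 value / 1 pointer comparison
print(search_bucket([7, 9, 13], 42))  # (False, 3, 4): N value and N+1 pointer comparisons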
0 comparisons. If at h[i] you see a list of one entry and this is a hit (since you analyze successful search), this would be 1 comparison, and so on.

Hash function for an array of integers

What would be a good hash function for an array of integers?
For example, I have two arrays [1,2,3] and [1,5]. What hash function should I adopt to separate these two arrays?
I thought of adding up each element after raising it to the power of 2 but this has a large cost associated with it due to multiple multiplications. Is there any simple hashing function for this scenario?
For that particular data set, just subtract one from the second-to-last item, that will give you a perfect minimal hash, with buckets 0 and 1 produced :-)
More seriously, the choice of a good hashing function does depend a great deal on the sort of data so that should be taken into consideration. It's hard to suggest something without knowing the properties of the data you'll be storing.
I would start simply by choosing an arbitrary function such as adding all the items in the array then adding the array length to that, and reducing it modulo some value:
numbuckets = 97
bucket = len(array) % numbuckets
for index in range(len(array)):
    bucket = (bucket + array[index]) % numbuckets
Then examine the results (across a great many real data sets) to make sure there's not too many collisions. If there are, choose another function.
It's the same as with optimisation: measure, don't guess! Actually monitor the collisions and usage and act if it gets bad.
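As a sketch of what that monitoring could look like (Python; the wrapper function name and the sample data are mine, and in practice you would feed in many real data sets):

from collections import Counter

numbuckets = 97

def bucket_for(array):
    # Same arbitrary function as above: the length plus the sum of the items, reduced modulo numbuckets.
    bucket = len(array) % numbuckets
    for item in array:
        bucket = (bucket + item) % numbuckets
    return bucket

samples = [[1, 2, 3], [1, 5], [2, 2, 2], [10, 20], [5], [1, 2, 3, 4]]
counts = Counter(bucket_for(a) for a in samples)
print(counts.most_common(3))  # any bucket with a large count signals too many collisions
print(bucket_for([1, 2, 3]), bucket_for([1, 5]))  # 9 and 8: the question's two arrays do separate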

Algorithm to compare numbers within a certain distance from each other

So I have an array of numbers that look something like
1,708,234
2,802,532
11,083,432
5,098,123
5,777,111
I want to find out when two numbers are within a certain distance of each other (say 1,500,000) so I can group them into the same location and have just one UI element represent both at the zoom level I am looking at. How would one go about doing this smartly or efficiently? I'm thinking I would just start with the first entry, loop through all the elements, and if one is close to another, flag those two and put them in a dictionary of some sort. That would be my brute force method, but I'm thinking there has to be a better way.
I'm coding in obj-c btw if that makes or breaks any design decisions.
How many numbers are we dealing with here? If it's small enough:
Sort the numbers (generally n-log-n)
Run through each number, n, and compare it to its bigger neighbor, n+1, to see if it's within your range.
Repeat for n+2, n+3, until the number is no longer within your range.
Your brute force method there is O(n^2). This method brings it down to O(n + n log n), which simplifies to O(n log n).
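Here is a short sketch of that sort-then-sweep idea (Python for brevity, even though the question mentions Objective-C; the function name and the chaining of adjacent numbers into one group are my choices):

def group_within(numbers, distance=1_500_000):
    # Sort first (O(n log n)), then sweep once: a number joins the current group
    # while it is within `distance` of the previous number; otherwise a new group starts.
    groups = []
    for n in sorted(numbers):
        if groups and n - groups[-1][-1] <= distance:
            groups[-1].append(n)
        else:
            groups.append([n])
    return groups

data = [1708234, 2802532, 11083432, 5098123, 5777111]
print(group_within(data))  # [[1708234, 2802532], [5098123, 5777111], [11083432]]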