algorithm to compare numbers within a certain distance from each other - iphone

So I have an array of numbers that look something like
1,708,234
2,802,532
11,083,432
5,098,123
5,777,111
I want to find out when two numbers are within a certain distance of each other (say 1,500,000) so I can group them into the same location and have just one UI element represent both at the level of zoom I am looking at. How would one go about doing this smartly or efficiently? I'm thinking I would just start with the first entry, loop through all the elements, and if one was close to another, flag those two and put them in a dictionary of some sort. That would be my brute-force method, but I'm thinking there has to be a better way.
I'm coding in obj-c btw if that makes or breaks any design decisions.

How many numbers are we dealing with here? If it's small enough:
Sort the numbers (generally n-log-n)
Run through each number, n, and compare its bigger neighbor, n+1, to see if it's within your range.
Repeat for n+2, n+3, until the number is no longer within your range.
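Your brute-force method is O(n^2) (roughly n^2/2 comparisons). Sorting costs O(n log n), and the scan afterwards costs O(n) plus one step per close pair found, so this method is O(n log n) overall as long as each number has only a few neighbors within range.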
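The sort-then-scan steps above can be sketched as follows. This is only an illustration in Python rather than Objective-C; the function names `close_pairs` and `group_nearby` are made up for this example, and grouping by chaining neighbors together is an assumption about how the UI clustering should behave.

```python
def close_pairs(values, max_distance):
    """Return every pair of values whose difference is <= max_distance."""
    ordered = sorted(values)                      # O(n log n)
    pairs = []
    for i, v in enumerate(ordered):
        j = i + 1
        # Only look at larger neighbors until one falls out of range.
        while j < len(ordered) and ordered[j] - v <= max_distance:
            pairs.append((v, ordered[j]))
            j += 1
    return pairs


def group_nearby(values, max_distance):
    """Chain sorted values into groups (assumed clustering rule):
    a value joins the current group if it is within max_distance
    of the previous value."""
    groups = []
    for v in sorted(values):
        if groups and v - groups[-1][-1] <= max_distance:
            groups[-1].append(v)
        else:
            groups.append([v])
    return groups


numbers = [1708234, 2802532, 11083432, 5098123, 5777111]
print(close_pairs(numbers, 1500000))
print(group_nearby(numbers, 1500000))
```

With the numbers from the question, this pairs 1,708,234 with 2,802,532 and 5,098,123 with 5,777,111, and leaves 11,083,432 on its own.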

Related

Given an array, find top 'x' in O(1)

I'm gonna write the problem as I found it and I will then explain what confuses me.
"A teacher is marking his students' work from 0-10 but he only marks with an 8 or above for a certain number 'x'(x=15 for example) of the 'n' students. You are given an array with all the students' marks in random order. Find the 'x' best marks in O(1)."
We have certainly been taught hashing, but this requires me to store all the data in a hash table, which is definitely not O(1). Maybe we don't have to take the 'conversion' into account? If we do, maybe the conversion combined with the search time afterwards will lead to a method different from hashing.
In that case, leaving O(1) aside, what is the fastest algorithm including both the conversion and the search time?
Simple: It's not possible.
O(1) can only be achieved if the input size, the number of necessary comparisons, and the output size are all constants. You may argue that x could be treated as a constant, but it still doesn't work:
You need to inspect every single input element, all n of them, because the random input order does not allow any heuristic to guess where the xth element would be, even if you had already correctly guessed the other x-1 elements in constant time.
As the problem is stated, there is no solution which can do it in the upper bounds of O(1) or O(x).
Let's just assume your instructor corrects his mistake, and gives you a revised version which correctly states O(n) as the required upper bound.
In that case your hash approach is (almost) correct. The catch with using a hash function is that you now need to account for potential collisions, which are the reason why hash maps don't work strictly in O(1), but only "on average" in O(1).
As you know all possible values (grades from 0-10), you can just allocate buckets with a known index. Inside each bucket you may use linked lists, as they also allow constant time insertions and linear time iteration.
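A minimal sketch of the bucket idea, assuming the marks are integers from 0 to 10. The function name `top_marks` is hypothetical, and each bucket here holds a plain count instead of a linked list, since the marks carry no other data in this example. It runs in O(n + x), not O(1).

```python
def top_marks(marks, x):
    """Return the x highest marks (integers 0..10), highest first."""
    buckets = [0] * 11                 # one bucket per possible mark
    for m in marks:                    # single pass over all n marks
        buckets[m] += 1

    best = []
    for grade in range(10, -1, -1):    # walk buckets from 10 down to 0
        take = min(buckets[grade], x - len(best))
        best.extend([grade] * take)
        if len(best) == x:
            break
    return best


print(top_marks([3, 9, 8, 10, 5, 8, 7, 10], 3))
```

If student records had to be returned rather than bare marks, each bucket would hold a list of records instead of a count, as the answer describes.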

Finding Conditional Moments in a Markov Process

This question combines math and programming. I will first describe the general problem and then give an example that is (hopefully) simpler to understand.
General Question: Consider a Markov-chain process of N states with transition matrix Π. Each state is associated with a value x_n (n in {1,…,N}). Our goal is to find the unconditional average of the first two moments (mean and variance) along T-period paths, conditional on (i) the path starting in a subset of states, N_0, (ii) the path ending in a subset of states, N_T, and (iii) the path not going through a subset of states, N_not, in any of the periods between 1 and T-1. By saying we are interested in the unconditional average of these two moments, I basically mean the average of these two moments under the stationary distribution. To be more concrete, let me illustrate the goal of the exercise in a simple case.
Simple Example: Consider a 3-state Markov-chain process with transition matrix Π, and let the three states be denoted by A, B, and C. Each of these states is associated with some value (x_A, x_B, and x_C, respectively). We are interested in what happens along paths that satisfy the following condition: the path starts at point A, after 3 periods is at either point B or C, and between periods 1 and 3 never goes through point A again. Denote this condition by (#). So, for example, a path which we are interested in would be {A,B,B,C} with the associated values {x_A, x_B, x_B, x_C}. We are interested in the average and standard deviation along such paths. In particular, we would like to find the unconditional average of these first two moments over paths that satisfy (#).
Let me now propose a solution based on simulating the process. Since both T and N are quite large, this solution is too slow for my purpose.
Simulation Solution: Starting from some initial point simulate the process for a very long time period, and drop the first τ periods. Extract all paths along the simulation that satisfy condition (#) and compute the mean and std along each of these paths. Finally, simply take the average across these paths.
I’m hoping there is a better and more efficient way to achieve the goal. Since I want the solution to be accurate, and given the size of T and N, the simulation takes a long time.
I would love to hear your thoughts and if you know of efficient methods to achieve this goal. Please let me know if something is not clear and I'll try to clarify it.
Thank you!!!
I think I know how to do this if N_0 consists of one state, let's call that state A.
The long run probability of being in A is pi(A) and can be obtained by solving pi = pi*P, with P the transition matrix.
The other thing you need to calculate is the probability of those transient paths. You probably need to introduce a modified P, where all states i in the set N_not are absorbing (i.e. P[i,i]=1 and P[i,j]=0 for j is not i). Then starting from a vector p(0) which has a 1 in the element corresponding to state A and 0 otherwise, you can keep calculating p(n) = p(n-1)*P to get the probabilities of your transient paths.
Multiply the result of that by pi(A) to get the unconditional probability.
You can probably do something like this as well when N_0 is a set, but I don't know how you should select p(0) in that case.
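The absorbing-state idea can be sketched in plain Python as below. This is an illustration under stated assumptions, not a full solution: it only computes the probability of the qualifying paths (not the moments), it assumes a single start state, it assumes N_T and N_not are disjoint, and it takes the first step with the original P so that the start state may itself belong to N_not (as state A does in the simple example). All function names and the example matrix are made up.

```python
def step(p, P):
    """One step of the chain: row vector p times matrix P."""
    n = len(P)
    return [sum(p[i] * P[i][j] for i in range(n)) for j in range(n)]


def make_absorbing(P, forbidden):
    """Copy P, turning every forbidden state into an absorbing state."""
    n = len(P)
    Q = [row[:] for row in P]
    for i in forbidden:
        Q[i] = [1.0 if j == i else 0.0 for j in range(n)]
    return Q


def path_probability(P, start, forbidden, targets, T):
    """Probability of starting in `start`, avoiding `forbidden` in
    periods 1..T-1, and being in `targets` at period T (T >= 1)."""
    Q = make_absorbing(P, forbidden)
    p = [0.0] * len(P)
    p[start] = 1.0
    p = step(p, P)              # period 1: leave the start state with original P
    for _ in range(T - 1):      # periods 2..T: forbidden states trap their mass
        p = step(p, Q)
    return sum(p[j] for j in targets)


# Hypothetical 3-state chain with states A=0, B=1, C=2.
P = [[0.5, 0.25, 0.25],
     [0.5, 0.5, 0.0],
     [0.5, 0.0, 0.5]]
print(path_probability(P, start=0, forbidden={0}, targets={1, 2}, T=3))
```

Multiplying the result by pi(A) then gives the unconditional probability described above.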

Sieve of Eratosthenes (reducing space complexity)

I wanted to generate prime numbers between two given numbers ‘a’ and ‘b’ (b > a). What I did was store Boolean values in an array of size b-1 (that is for numbers 2 to b) and then I applied the sieve method.
Is there a better way, that reduces space complexity, if I don't need all prime numbers from 2 to b?
You need to store all primes that are smaller than or equal to the square root of b. Then, for each number between a and b, check whether it is divisible by any of these primes (a number that equals one of these primes is itself prime). So in our case the magic number is sqrt(b).
You can use segmented sieve of Eratosthenes. The basic idea is pretty simple.
In a typical sieve, we start with a large array of Booleans, all set to the same value. These represent odd numbers, starting from 3. We look at the first and see that it's true, so we add it to the list of prime numbers. Then we mark off every multiple of that number as not prime.
Now, the problem with this is that it's not very cache friendly. As we mark off the multiples of each number, we go through the entire array. Then when we reach the end, we start over from the beginning (which is no longer in the cache) and walk through the entire array again. Each time through the array, we read the entire array from main memory again.
For a segmented sieve, we do things a bit differently. We start by finding only the primes up to the square root of the limit we care about. Then we use those to mark off composites in the main array. The difference here is the order in which we mark them off. Instead of marking off all the multiples of 3, then all the multiples of 5, and so on, we start by marking off the multiples of 3 for data that will fit in the cache. Then, instead of continuing on to more data in the array, we go back and mark off the multiples of 5 for the data that fits in the cache. Then the multiples of 7, and so on.
Then, when we've marked off all the multiples in that cache-sized chunk of data, we move on to the next cache-sized chunk of data. We start over with marking off multiples of 3 in this chunk, then multiples of 5, and so on until we've marked off all the multiples in this chunk. We continue that pattern until we've marked off all the non-prime numbers in all the chunks, and we're done.
So, given N primes below the square root of the limit we care about, a naive sieve will read the entire array of Booleans N times. By contrast, a segmented sieve will only read each chunk of the data once. Once a chunk of data is read from main memory, all the processing on that chunk is done before any more data is read from main memory.
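As a rough sketch of a segmented sieve applied to the original question (primes between a and b), assuming 0 <= a <= b: only the primes up to sqrt(b) and one segment of flags are held at a time. The function names and the default `segment_size` are arbitrary choices for this example, and unlike the description above it does not skip even numbers.

```python
import math


def small_primes(limit):
    """Ordinary sieve: all primes <= limit."""
    if limit < 2:
        return []
    is_prime = bytearray([1]) * (limit + 1)
    is_prime[0] = is_prime[1] = 0
    for p in range(2, math.isqrt(limit) + 1):
        if is_prime[p]:
            count = len(range(p * p, limit + 1, p))
            is_prime[p * p::p] = bytearray(count)   # zero out multiples of p
    return [i for i, flag in enumerate(is_prime) if flag]


def primes_in_range(a, b, segment_size=32768):
    """Primes in [a, b], sieving one segment of flags at a time."""
    base = small_primes(math.isqrt(b))
    result = []
    low = max(a, 2)
    while low <= b:
        high = min(low + segment_size - 1, b)
        flags = bytearray([1]) * (high - low + 1)
        for p in base:
            if p * p > high:
                break
            # First multiple of p inside the segment, never p itself.
            start = max(p * p, ((low + p - 1) // p) * p)
            for m in range(start, high + 1, p):
                flags[m - low] = 0
        result.extend(low + i for i, f in enumerate(flags) if f)
        low = high + 1
    return result


print(primes_in_range(10, 50))
```

The memory used is roughly sqrt(b) for the base primes plus `segment_size` flags, instead of one flag for every number up to b.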
The exact speed-up this gives will depend on the ratio of the speed of cache to the speed of main memory, the size of the array you're using vs. the size of the cache, and so on. Nonetheless, it is generally pretty substantial--for example, on my particular machine, looking for the primes up to 100 million, the segmented sieve has a speed advantage of about 10:1.
One thing to keep in mind if you're using C++ is a well-known quirk of std::vector<bool>. Since C++98/03, vector<bool> has been a specialization that typically stores each Boolean as a single bit, with some proxy trickery to get bool-like behavior. The specialization is still part of the standard, so you should expect your library to provide it.
With a non-segmented sieve, it's generally a useful trade-off. Although it requires a little extra CPU time to compute masks and such to modify only a single bit at a time, it saves enough bandwidth to main memory to more than compensate.
With a segmented sieve, bandwidth to main memory isn't nearly as large a factor, so using a vector<char> generally seems to give better results (at least with the compilers and processors I have handy).
Getting optimal performance from a segmented sieve does require knowledge of the size of your processor's cache, but getting it precisely correct isn't usually critical--if you assume the size is smaller than it really is, you won't necessarily get optimal use of your cache, but you usually won't lose a lot either.

Is there any formula to calculate the no of passes that a Quick Sort algorithm will take?

While working with the Quick Sort algorithm, I wondered whether there is a formula or something similar for finding the number of passes that a particular set of values will take to be completely sorted in ascending order.
Is there any formula to calculate the no of passes that a Quick Sort algorithm will take?
Any given set of values will have a different number of operations, based on pivot value selection method, and the actual values being sorted.
So...no, unless the approximation of 'between O(N log(N)) and O(N^2)' is good enough.
That one has to qualify the average versus worst case should be enough to show that the only way to determine the number of operations is to actually run the quicksort.
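To illustrate how the count depends on the input, here is a small instrumented quicksort. It is only a sketch: it assumes a first-element pivot and counts one pivot comparison per element in each partition step, which is just one of several ways to define "number of operations". The name `quicksort_comparisons` is made up.

```python
def quicksort_comparisons(values):
    """Sort a copy of values with a first-element-pivot quicksort and
    return (sorted_list, comparison_count)."""
    comparisons = 0

    def sort(items):
        nonlocal comparisons
        if len(items) <= 1:
            return items
        pivot, rest = items[0], items[1:]
        comparisons += len(rest)       # each element is compared to the pivot
        smaller = [v for v in rest if v < pivot]
        larger = [v for v in rest if v >= pivot]
        return sort(smaller) + [pivot] + sort(larger)

    return sort(list(values)), comparisons


print(quicksort_comparisons([1, 2, 3, 4, 5]))   # already sorted: worst case here
print(quicksort_comparisons([3, 1, 2, 5, 4]))   # same values, different order
```

The same five values need 10 comparisons when already sorted but only 6 in the second order, which is why no formula on the set of values alone can give the count.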

Efficient Access of elements in Matrix in Matlab

I have an m x n matrix of integers, where m and n are fairly big numbers (m and n ~1000). I want to iterate through all of the cells and perform some operations, like reading a particular cell and assigning a value to a particular cell.
However, at least in my implementation, this is rather inefficient, as I have two for loops with Matrix(a,b) = Matrix(a,b+1) or something along these lines. Is there any other way to do this, seeing as my current implementation takes a long time to traverse about 100,000 cells and perform some operations?
Thank you
In matlab, it's almost always possible to avoid loops.
If you want to do Matrix(a,b)=Matrix(a,b+1), you should just do Matrix2=Matrix(:,2:end);
If you are more precise about what you do inside the loop, I can help you more.
Matlab uses column-major ordering of matrices in memory (unlike C). Are you sure you are iterating over the indexes in the correct order? If not, try switching them and see if performance improves.
If you can't get rid of the for loops, one possibility would be to rewrite the expensive operations in C and create a MEX file as described here.