K&R quicksort code - quicksort

I examined the quicksort code in the K&R book, and after 2 hours I still fail to understand what the first swap (swap(a, left, (left+right)/2);) achieves. I tried removing it and the sorting still works.
Can someone explain? Is it a performance issue? If so, why? This step seems arbitrary to me (that is, on some sets of numbers it will improve performance and on others it will not).
Thank you.
void qsort(int a[], int left, int right)
{
    int i, last;
    if (left >= right)
        return;
    swap(a, left, (left+right)/2);
    last = left;
    for (i = left + 1; i <= right; i++)
        if(a[i] < a[left])
            swap(a, ++last, i);
    swap(a, left, last);
    qsort(a, left, last-1);
    qsort(a, last+1, right);
}

It puts the pivot element onto the very first position of the sub-array.
Then it continues to partition the sub-array around the pivot, so that after the partitioning is done, the sub-array looks like this: [pivot, [elements < pivot], [elements >= pivot]].
After that, the pivot is simply put into proper space, so the sub-array looks like this: [[elements < pivot], pivot, [elements >= pivot]].
And after that, the recursive call is made on two sub-sub-arrays.
Quicksort will always work, regardless of which element is chosen as the pivot. The catch is that if you choose the median element every time, the time complexity will be linearithmic (O(n log n)). However, if you always choose, for example, the biggest element as the pivot, the performance degrades to quadratic (O(n^2)).
So, in essence, pivot selection is the key to the performance of quicksort, but the algorithm will work either way (and when I say work, I mean you'll get a sorted array, eventually).

The K&R implementation chooses the middle index (i.e. (left+right)/2) for the pivot. Removing this line uses the leftmost element as the pivot instead. The implementation still works, but the performance degrades when the array is already sorted.
The Wikipedia article explains this:
In the very early versions of quicksort, the leftmost element of the partition would often be chosen as the pivot element. Unfortunately, this causes worst-case behavior on already sorted arrays, which is a rather common use-case. The problem was easily solved by choosing either a random index for the pivot, choosing the middle index of the partition or (especially for longer partitions) choosing the median of the first, middle and last element of the partition for the pivot (as recommended by Sedgewick).
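To make the effect concrete, here is a rough Python port of the K&R routine that counts element comparisons on an already sorted array, once with the swap and once without it. The names kr_qsort, count_comparisons, use_middle and counter are made up for this illustration and do not come from the book.

```python
def kr_qsort(a, left, right, use_middle, counter):
    """Port of the K&R routine; counter[0] tallies element comparisons."""
    if left >= right:
        return
    if use_middle:
        # The swap in question: move the middle element to the front as pivot.
        mid = (left + right) // 2
        a[left], a[mid] = a[mid], a[left]
    last = left
    for i in range(left + 1, right + 1):
        counter[0] += 1
        if a[i] < a[left]:
            last += 1
            a[last], a[i] = a[i], a[last]
    a[left], a[last] = a[last], a[left]
    kr_qsort(a, left, last - 1, use_middle, counter)
    kr_qsort(a, last + 1, right, use_middle, counter)


def count_comparisons(data, use_middle):
    """Sort a copy of data and return (sorted copy, number of comparisons)."""
    a = list(data)
    counter = [0]
    kr_qsort(a, 0, len(a) - 1, use_middle, counter)
    return a, counter[0]


already_sorted = list(range(200))
sorted_mid, with_swap = count_comparisons(already_sorted, use_middle=True)
sorted_left, without_swap = count_comparisons(already_sorted, use_middle=False)
print(with_swap, without_swap)
```

Without the swap, every partition of a sorted array peels off only one element, so the comparison count for n elements is n(n-1)/2. With the swap, each partition of a sorted array splits it roughly in half.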

Related

What kind of sorting algorithm is this? (Code in matlab)

I am having a discussion with a teacher. He argues that the following algorithm is bubble sort, but I insist that it is not. Who is right?
clc
clear
a=[0.2 4.333 1/3 5 7]
n=length(a)
for j=n:-1:1
    for i=1:j-1
        if a(j)>a(i)
        else
            c=a(i);
            a(i)=a(j);
            a(j)=c;
            a
        end
    end
end
This doesn't look like bubble sort as I understand it. It starts from the final element, and compares it to each other element, swapping with the final element until the entire array has been run through, confirming that the largest number is at the end. Bubble sort compares numbers in adjacent pairs.
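As a rough illustration of the difference, here is a Python port of the MATLAB loops next to a textbook bubble sort. The function names max_to_end_sort and bubble_sort are invented for this sketch, and the indices are 0-based rather than 1-based.

```python
def max_to_end_sort(values):
    """Python port of the MATLAB loops (0-based indices instead of 1-based)."""
    a = list(values)
    n = len(a)
    for j in range(n - 1, -1, -1):
        for i in range(j):
            # Non-adjacent comparison: a[i] is always compared with a[j].
            if not a[j] > a[i]:
                a[i], a[j] = a[j], a[i]
        # After the inner loop, a[j] holds the largest of a[0..j].
        assert a[j] == max(a[: j + 1])
    return a


def bubble_sort(values):
    """Textbook bubble sort: only adjacent pairs are compared and swapped."""
    a = list(values)
    n = len(a)
    for end in range(n - 1, 0, -1):
        for i in range(end):
            if a[i] > a[i + 1]:
                a[i], a[i + 1] = a[i + 1], a[i]
    return a


data = [0.2, 4.333, 1 / 3, 5, 7]
print(max_to_end_sort(data))
print(bubble_sort(data))
```

Both produce a sorted list, but the first always compares against the fixed position j, while the second only ever touches neighbours.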

slicing assignment with negative index

I am having some problems with slice assignment.
As I understand it, the general syntax of slicing is l[start:stop:step].
When we use a positive step we traverse forward, and when we use a negative step we traverse backward:
l=[1,2,3,4]
l[3:1:1]=[5]
When I use the above assignment, it inserts the element 5 at index 3, like an insert operation.
But when I use
l[-3:-1:-1]=[5]
it raises a ValueError.
I am totally confused.
Please explain.
Assuming you are asking about slices in Python, the 'step' part makes the slice an extended slice.
Assigning to an extended slice is only possible if the list on the right-hand side is the same size as the extended slice.
See
https://docs.python.org/2.3/whatsnew/section-slices.html
So the confusing thing is actually that your l[3:1:1] = [5] does not raise a ValueError, even though the sizes of the left and right sides differ (0 and 1; note that both your l[3:1:1] and l[-3:-1:-1] evaluate to empty lists).
I think that can be explained by the fact that a step of 1 is no different from the original slice syntax [start:end], and may therefore be handled as a normal slice.
If your goal is inserting, just don't use the step.
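A short session along these lines shows the behaviour described above. It assumes CPython 3; the variable names m, n and error are only for the demonstration.

```python
l = [1, 2, 3, 4]
l[3:1:1] = [5]          # step 1: handled like l[3:1] = [5], an insertion
print(l)                # [1, 2, 3, 5, 4]

m = [1, 2, 3, 4]
print(m[-3:-1:-1])      # [] (walking backward from index 1 never reaches index 3)

error = ""
try:
    m[-3:-1:-1] = [5]   # extended slice of size 0, right-hand side of size 1
except ValueError as exc:
    error = str(exc)
    print(error)

n = [1, 2, 3, 4]
n[3:3] = [5]            # plain slice without a step inserts before index 3
print(n)                # [1, 2, 3, 5, 4]
```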

Insertion Sort Algorithm In place and loop variant

Part 1
I know that QuickSort can be done 'in place', but could someone explain to me how the insertion sort algorithm works 'in place'?
From my understanding:
Insertion sort starts at the first value and compares it to the next value; if that value is less than our value, they switch positions. We continue this recursively. (Short explanation)
So would you say that this is 'in place' because we don't create a new array to do this, but just compare two elements in an array?
If my understanding is wrong, could someone please explain the algorithm for insertion sort in place?
Part 2
Also, how would I use insertion sort to illustrate the idea of a loop invariant?
I know that a loop invariant is a condition that is true immediately before and after each iteration of a loop, but I'm not sure how this would relate to insertion sort.
Like Thorsten mentioned in the comments section, you have described bubble sort. Modified from Wikipedia, pseudocode for bubble sort is as follows:
procedure bubbleSort( A : list of sortable items )
    n = length(A)
    for i = 1 to n-1 inclusive do // Outer Loop
        for j = 1 to n-i inclusive do
            /* if this pair is out of order */
            if A[j] > A[j+1] then
                swap(A[j], A[j+1])
            end if
        end for
    end for
end procedure
The loop invariant in bubble sort (for the outer loop) is that, after the ith iteration of the outer loop, the last i elements of the array are the i largest elements, in sorted order and in their final positions. This is because each pass of the inner loop carries the largest element of the still-unsorted prefix to the end of that prefix through a series of adjacent swaps.
Bubble sort is, indeed, in place, since all the sorting happens within the original array itself and does not require a separate external array.
Edit- now onto insertion sort:
Pseudo code is as follows (all hail Wikipedia):
for i = 1 to length(A) - 1
    x = A[i]
    j = i
    while j > 0 and A[j-1] > x
        A[j] = A[j-1]
        j = j - 1
    end while
    A[j] = x
end for
Here, at each step, you take the next element and find the appropriate location at which to insert it into the already sorted part of the array, i.e., just after the nearest element to its left that is not larger than it. In a little more detail, the inner loop keeps shifting elements to the right until it encounters an element that is not larger than the element under consideration, at which point the element is inserted just after that one. This means that every element up to and including the current position is sorted. The outer loop ensures that this is done for all elements in the array, so by the time the outer loop completes, the array is sorted.
The loop invariant for the outer loop is that after the ith iteration, all elements up to index i are sorted. Just before the ith iteration, all elements up to index i-1 are sorted.
The insertion sort algorithm does not require an external array for sorting. More specifically, all operations are done on the array itself (except for the one variable we need to store the element that we are currently trying to insert into its appropriate location), and no external arrays are used- there is no copying from this array to another one, for example. So, the space complexity required by the algorithm (excluding, of course, the space the array itself occupies) will be O(1), as opposed to dependent on the size of the array, and the sort is in-place, much like bubble sort.
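The invariant can be checked mechanically. Below is a minimal Python sketch of the same pseudocode with the invariant written as assertions before and after each outer iteration; the assertions are there only for illustration and would be dropped in real code.

```python
def insertion_sort(a):
    """Sort the list a in place and return it."""
    for i in range(1, len(a)):
        # Invariant before iteration i: a[0..i-1] is sorted.
        assert all(a[k] <= a[k + 1] for k in range(i - 1))
        x = a[i]
        j = i
        while j > 0 and a[j - 1] > x:
            a[j] = a[j - 1]   # shift the larger element one slot right
            j -= 1
        a[j] = x
        # Invariant after iteration i: a[0..i] is sorted.
        assert all(a[k] <= a[k + 1] for k in range(i))
    return a


values = [5, 2, 4, 6, 1, 3]
result = insertion_sort(values)
print(result)
print(result is values)
```

The second print shows that the returned list is the very same object that was passed in: the only extra storage is the single variable x.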

Quick sort with middle element as pivot

My understanding of quick sort is
1. Choose a pivot element (in this case I am choosing middle element as pivot)
2. Initialize left and right pointers at extremes.
3. Find the first element to the left of the pivot which is greater than pivot.
4. Similarly find the first element to the right of the pivot which is smaller than pivot
5. Swap elements found in 3 and 4.
6. Repeat 3,4,5 unless left >= right.
7. Repeat the whole thing for left and right subarray as pivot is now placed at its place.
I am sure I am missing something here and being very stupid. But the above does not seem to work for this array:
8,7,1,2,6,9,10,2,11 pivot: 6 left pointer at 8, right pointer at 11
2,7,1,2,6,9,10,8,11 swapped 2,8 left pointer at 7, right pointer at 10
Now what? There is no element smaller than 6 on its right side.
How is 7 going to get to the right of 6?
There is no upfront division between the left and the right side. In particular, 6 is not the division. Instead, the division is the result of moving the left and right pointer closer to each other until they meet. The result might be that one side is considerably smaller than the other.
Your description of the algorithm is fine. Nowhere does it say you have to stop at the middle element. Just continue to execute it as given.
BTW: The pivot element might be moved during the sorting. Just continue to compare against 6, even if it has been moved.
Update:
There are indeed a few minor problems in your description of the algorithm. One is that either step 3 or step 4 needs to include elements that are equal to the pivot. Let's rewrite it like this:
My understanding of quick sort is
1. Choose a pivot value (in this case, choose the value of the middle element)
2. Initialize left and right pointers at extremes.
3. Starting at the left pointer and moving to the right, find the first element which is greater than or equal to the pivot value.
4. Similarly, starting at the right pointer and moving to the left, find the first element which is smaller than the pivot value
5. Swap elements found in 3 and 4.
6. Repeat 3,4,5 until the left pointer is greater than or equal to the right pointer.
7. Repeat the whole thing for the two subarrays to the left and the right of the left pointer.
pivot value: 6, left pointer at 8, right pointer at 11
8,7,1,2,6,9,10,2,11 left pointer stays at 8, right pointer moves to 2
2,7,1,2,6,9,10,8,11 swapped 2 and 8, left pointer moves to 7, right pointer moves to 2
2,2,1,7,6,9,10,8,11 swapped 2 and 7, left pointer moves to 7, right pointer moves to 1
pointers have now met / crossed, subdivide between 1 and 7 and continue with two subarrays
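For comparison, here is a minimal Python sketch of quicksort with a middle-element pivot using the Hoare-style partition from the Wikipedia article. It differs slightly from the steps listed above: both scans stop at elements equal to the pivot, and the recursion splits at the index the partition returns. The function names are illustrative only.

```python
def hoare_partition(a, lo, hi):
    """Partition a[lo..hi] around the value of the middle element.

    Returns an index p such that every element of a[lo..p] is <= every
    element of a[p+1..hi]. The pivot value may end up on either side.
    """
    pivot = a[(lo + hi) // 2]
    i = lo - 1
    j = hi + 1
    while True:
        i += 1
        while a[i] < pivot:      # left pointer skips elements below the pivot
            i += 1
        j -= 1
        while a[j] > pivot:      # right pointer skips elements above the pivot
            j -= 1
        if i >= j:               # pointers met or crossed: split here
            return j
        a[i], a[j] = a[j], a[i]


def quicksort(a, lo=0, hi=None):
    """Sort a in place; there is no fixed division at the pivot's position."""
    if hi is None:
        hi = len(a) - 1
    if lo < hi:
        p = hoare_partition(a, lo, hi)
        quicksort(a, lo, p)
        quicksort(a, p + 1, hi)


data = [8, 7, 1, 2, 6, 9, 10, 2, 11]
quicksort(data)
print(data)
```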
Quick Sort: Given an array of n elements (e.g., integers):
- If array only contains one element, return
- Else
    pick one element to use as pivot.
    Partition elements into two sub-arrays:
        Elements less than or equal to pivot
        Elements greater than pivot
    Quicksort two sub-arrays
    Return results
Let i and j be the left and right pointers; the code for one partitioning pass then looks like this:
1) While data[i] <= data[pivot]
       ++i
2) While data[j] > data[pivot]
       --j
3) If i < j
       swap data[i] and data[j]
4) While j > i, go to 1.
5) Swap data[j] and data[pivot_index]
The position of index j is where the array is split into two parts, and the same steps are then applied to each part recursively.
In the end you get a sorted array.
Your confusion arises because you think the pivot should be the landmark separating the two parts. This is not true (for a middle-element pivot)!
Lomuto's partition (pivot = rightmost element):
Left: (lo ... p-1) (note the pivot is not included)
Right: (p+1 ... high)
With the middle element as the pivot (Hoare's partition scheme), the segment is partitioned:
Left: (lo ... p)
Right: (p+1 ... high)
[https://en.wikipedia.org/wiki/Quicksort]

iPhone hard computation and caching

I have a problem. I have a database with 500k records. Each record stores the latitude, longitude, species of animal, and date of observation. I must draw a grid (15x10) over a MapKit view that shows the concentration of species in each grid cell. Each cell is a 32x32 box.
If I calculate this at run time, it is very slow.
Does anybody have an idea how to cache it? In memory or in the database?
Data structure:
Observation:
Latitude
Longitude
Date
Specie
some other unimportant data
Screen sample:
http://img6.imageshack.us/img6/7562/20091204201332.png
The opacity of each red box shows the count of species in that region.
Code that I use now:
// data -> selected from the database; all observations in the map region
for (int row = 0; row < rows; row++)
{
    for (int column = 0; column < columns; column++)
    {
        speciesPerBox=0;
        minG=boxes[row][column].longitude;
        if (column!=columns-1) {
            maxG=boxes[row][column+1].longitude;
        } else {
            maxG=buttomRight.longitude;
        }
        maxL=boxes[row][column].latitude;
        if (row!=rows-1) {
            minL=boxes[row+1][column].latitude;
        } else {
            minL=buttomRight.latitude;
        }
        for (int i=0; i<sightingCount; i++) {
            l=data[i].latitude;
            g=data[i].longitude;
            if (l>=minL&&l<maxL&&g>=minG&&g<maxG) {
                for (int j=0; j<speciesPerBox; j++) {
                    if (speciesCountArray[j]==data[i].specie) {
                        hasSpecie=YES;
                    }
                }
                if (hasSpecie==NO) {
                    speciesCountArray[speciesPerBox]=data[i].specie;
                    speciesPerBox++;
                }
                hasSpecie=NO;
            }
        }
        mapData[row][column].count = speciesPerBox;
    }
}
Since your data is static, you can pre-compute the species for each grid cell and store that in the database instead of all the location coordinates.
Since you have 15 x 10 = 150 cells, you'll end up with 150 * [num of species] records in the database, which should be a much smaller number.
Also, make sure you have indexes on the proper columns. Otherwise, your queries will have to scan every single record over and over again.
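The pre-computation could look roughly like the following Python sketch. It assumes a fixed map region divided into evenly spaced cells, and the names cell_of and species_per_cell, as well as the tuple layout of an observation, are invented for the example. Each observation is mapped to its cell by arithmetic in a single pass, instead of testing every observation against every cell.

```python
from collections import defaultdict

ROWS, COLUMNS = 10, 15  # the question's 15x10 grid


def cell_of(lat, lon, top, left, bottom, right):
    """Map a coordinate to (row, column), or None if it lies outside the map."""
    if not (bottom <= lat < top and left <= lon < right):
        return None
    row = int((top - lat) / (top - bottom) * ROWS)
    column = int((lon - left) / (right - left) * COLUMNS)
    # Clamp in case a point on the bottom edge rounds to an index past the grid.
    return min(row, ROWS - 1), min(column, COLUMNS - 1)


def species_per_cell(observations, top, left, bottom, right):
    """One pass over all observations; returns {(row, column): distinct species}."""
    seen = defaultdict(set)
    for lat, lon, species in observations:
        cell = cell_of(lat, lon, top, left, bottom, right)
        if cell is not None:
            seen[cell].add(species)
    return {cell: len(names) for cell, names in seen.items()}


observations = [
    (49.9, 10.1, "fox"),
    (49.9, 10.2, "fox"),
    (49.8, 10.1, "owl"),
    (40.1, 24.9, "owl"),
    (60.0, 10.0, "elk"),   # outside the map, ignored
]
counts = species_per_cell(observations, top=50.0, left=10.0, bottom=40.0, right=25.0)
print(counts)
```

The resulting dictionary is what would be stored (in the database or in memory) and read back when drawing the grid.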
The loop for (int i=0; i<sightingCount; i++) is killing your performance, especially the large number of if (l>=minL&&l<maxL&&g>=minG&&g<maxG) tests, where most of the sightings will be skipped.
How large is sightingCount?
First, you should use some kind of spatial optimization. A simple one: store species count lists per cell (let's call them "zones"). Larger zones waste less space, while smaller zones provide better performance, although zones that are too small will reverse the effect. So make the zone size configurable and test different sizes to find a good compromise!
When it's time to sum up the number of species in a cell for rendering, determine which zones the given cell overlaps (a rather simple and fast "rectangle overlap" test). Then you only have to check the species counts of those zones. This greatly reduces the iterations of your inner loop!
That's the idea (of most "spatial optimizations"): divide and conquer. Here you divide your space, and then you can reject a large number of irrelevant sightings early with minimal effort (the added effort is the rectangle overlap test, but each test rejects multiple sightings, whereas your current code tests every single sighting for relevance).
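A simplified Python sketch of the zone idea follows. It is a variant rather than exactly what is described above: each zone stores its raw sightings instead of species counts, so that cells which only partly overlap a zone are still counted exactly. ZONE_SIZE and all function names are assumptions made for the example.

```python
from collections import defaultdict

ZONE_SIZE = 1.0  # degrees per zone side; a tunable assumption


def zone_of(lat, lon):
    """Index of the square zone that contains the coordinate."""
    return int(lat // ZONE_SIZE), int(lon // ZONE_SIZE)


def build_zones(sightings):
    """Bucket every sighting by zone once, ahead of time."""
    zones = defaultdict(list)
    for lat, lon, species in sightings:
        zones[zone_of(lat, lon)].append((lat, lon, species))
    return zones


def species_in_cell(zones, min_lat, max_lat, min_lon, max_lon):
    """Count distinct species in a cell, visiting only the zones it overlaps."""
    first_row, first_col = zone_of(min_lat, min_lon)
    last_row, last_col = zone_of(max_lat, max_lon)
    found = set()
    for zr in range(first_row, last_row + 1):
        for zc in range(first_col, last_col + 1):
            for lat, lon, species in zones.get((zr, zc), ()):
                if min_lat <= lat < max_lat and min_lon <= lon < max_lon:
                    found.add(species)
    return len(found)


sightings = [
    (49.9, 10.1, "fox"),
    (49.5, 10.6, "owl"),
    (48.2, 10.3, "fox"),
    (49.2, 12.5, "elk"),
]
zones = build_zones(sightings)
small = species_in_cell(zones, 49.0, 50.0, 10.0, 11.0)
large = species_in_cell(zones, 48.0, 50.0, 10.0, 13.0)
print(small, large)
```

Sightings in zones that the cell does not touch are never looked at, which is where the saving comes from.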
In a second step, also apply some obvious code optimizations: e.g. minL and maxL do not change per column, so computing them can be moved to the outer loop (just before for( int column=0; ...).
As the latitudes of the grid are evenly distributed, you can even remove them from your grid cells, which saves some time in your iteration. Here is an example (spatial optimizations not included):
maxL=boxes[0][0].latitude;
minL=boxes[rows-1][0].latitude;
incL=maxL-minL;
for( int row = 0; row < rows; row++ )
{
    for( int column = 0; column < columns; column++ )
    {
        speciesPerBox=0;
        minG=boxes[row][column].longitude;
        if (column!=columns-1) {
            maxG=boxes[row][column+1].longitude;
        } else {
            maxG=buttomRight.longitude;
        }
        ...
        ...
    }
    ...
    minL = maxL; // left edge = right edge of previous step
    maxL += incL; // increment right edge
    if( maxL >= 90 ) maxL -= 90; // check your scale, i assume 90°
}
Maybe this also works for the longitude loop, but the longitudes may not be evenly distributed (i.e. "incG" is different in each step).
Note that the spatial optimization will make a huge difference; the loop optimizations make only a small (but still worthwhile) difference.
With 500k records this sounds like a job for Core Data. Preferably Core Data on a desktop. If the data isn't being updated in real time, you should process the data on heavier hardware and just use the iPhone to display it. That would massively simplify the app because you would just have to store the value for each map cell.
Even if you did want to process it on the iPhone, you should have the app process the data once and save the results. There appears to be no reason to have the app recalculate the species value of every map cell every time it wants to display a cell.
I would suggest creating an entity in Core Data to represent observations, then another entity to represent geographical squares. Set a relationship between the squares and the observations that fall within each square. Then create a calculated value of species in the square entity. You would then only have to recalculate the species value if one of the observations changed.
This is the kind of problem that object graphs were created for, even if the data is being continuously updated. Core Data would only perform the calculations needed to accommodate the small number of observation objects that changed at any given time, and it would do so in a highly optimized manner.
Edit01:
Approaching the problem from a completely different angle, within Core Data:
(1) Create an object graph of observation records such that each observation object has a reciprocal relation to the other observation objects that are closest to it geographically. This would create an object graph that would look like a flat irregular net.
(2) Create methods for the observationRecords class that (a) determine whether the record lies within the bounds of an arbitrary geographic square, (b) ask each of its related records whether they are also in the square, and (c) return its own species count and the count of all the related records.
(3) Divide your map into reasonably small squares, e.g. one second of arc square. Within each square, select one linked record and add it to a list. Choose some percentage of all records, like 1 in every 100 or 1,000, so that you cut the list down from 500k to a sublist that can be quickly searched by a brute-force predicate. Let's call the records in that list the gridflags.
(4) When the user zooms in, use brute force to find all the gridflag records within the geographical grid. Then ask each gridflag record to send messages to each of its linked records to see (a) whether they're inside the grid, (b) what their species count is and (c) what the count is for their linked records that are also within the grid. (Use a flag to make sure each record is only queried once per search to prevent runaway recursion.)
This way, you only have to find one record inside each arbitrarily sized grid cell and that record will find all the other records for you. Instead of stepping through each record to see which record goes in what cell every time, you just have to process the records in each cell and those immediately adjacent. As you zoom in, the number of records you actually query shrinks instead of remaining constant. If a grid cell has only a handful of records, you only have to query a handful of records.
This would take some effort and time to set up, but once you did, it would be rather efficient, especially when zoomed way in. For the top level, just have a preprocessed static map.
Hope I explained that well enough. It's hard to convey verbally.