Difference between range() and xrange() - range

I am reading through a tutorial on Python and its says "There is a variant xrange() which avoids the cost of building the whole list for performance sensitive cases (in Python 3000, range() will have the good performance behavior and you can forget about xrange())." Source.
I am asking about Python 2.x and not Python 3.
I'm not sure what this means. Am I to understand that range(a, b) creates a list of all the values from a to b and then iterates over them, while xrange(a, b) only creates each value as it iterates over? If this is the case then performance is only improved if code does not actually iterate over the entire list and breaks early.
Can someone comment on this?

You understood correctly.
The difference resides in the memory : when using range() the entire list will be allocated in memory whereas xrange() returns a generator (actually it returns an xrange object which acts as a generator)
The Python's documentation about xrange() is quite clear on this subject:
The advantage of xrange() over range() is minimal (since xrange() still has to create the values when asked for them) except when a very large range is used on a memory-starved machine or when all of the range’s elements are never used

In python3, range() doesn't generate the whole sequence at once, so it's also performant:
The advantage of the range type over a regular list or tuple is that a range object will always take the same (small) amount of memory, no matter the size of the range it represents (as it only stores the start, stop and step values, calculating individual items and subranges as needed).
Documentation

Related

What's the fastest way to create a C-compatible unbounded string in Ada?

I'm creating an Ada program for Windows that needs to be able to pass strings to some functions written in C. Until now I have been manipulating the strings in Ada using the Unbounded_String type, and then converting the data to an Interfaces.C.char_array before passing it to the C functions.
This works fine, only performance is a bit of an issue on slower, older computers. The C function is sometimes called repeatedly on a slightly modified version of a string, and requires the Unbounded_String to be converted to a similar char_array every time. The strings aren't modified by the C functions, so the only ever have to be converted to char_array.
I have thought of storing the strings in char_array, and converting from an Ada type each time the string is manipulated. The data is passed to C more often than it is changed, so it would improve performance. The problem with this approach is that often the length of the string will change, sometimes by a lot, and there is no way of knowing the maximum length beforehand.
The ideal solution would be to have something similar to an Unbounded_String only storing the string as a char_array. By this I mean something that is dynamically sized, allocating a new array when the old one isn't big enough and it should allow Ada Characters/Strings to be inserted (and also removed) into the array, converting only those characters to C chars.
Is there any (relatively) easy, fast way of doing this without having to implement it myself? Or is there any other quick way of manipulating C-compatible strings in Ada? Thanks in advance for any suggestions.
You don't mention how many objects you expect to have of your type, but I will assume that we are not talking about so many that you will be anywhere near exhausting your available address space.
Just encapsulate a sufficiently large char_array (say 10 times the largest expected size) in a private record, and create the needed operations to manipulate it.
If you're very unlucky, you may need to tell your compiler/run-time environment that you need an unusually large stack, but save that worry for when you actually experience it.

Write performance scala immutable collections

Quick question. I'm currently designing some database queries to extract reasonably large, but not massive datasets into memory, say approximately 10k-100k records.
So far I've been testing loading these resultsets into a scala.collection.immutable.Seq and have discovered it seems to take an incredibly long time to build the collection. Whereas if I change to a Vector or List the write into memory takes fractions of a second.
MY question is therefore why is Seq so slow in this case? If so in what cases would using Seq be more appropriate than Vector?
Thanks
It would help if you'd post the relevant snippet and which operations you call on the sequence -- immutable.Seq is represented using a List (see https://github.com/scala/scala/blob/v2.10.2/src/library/scala/collection/immutable/Seq.scala#L42). My guess is that you've been using :+ on the immutable.Seq, which under the hood appends to the end of the list by copying it (probably giving you quadratic overall performance), and when you switched to using immutable.List directly, you've been attaching to the beginning using :: (giving you linear performance).
Since Seq is just a List under the hood, you should use it when you attach to the beginning of the sequence -- the cons operator :: only creates a one node and links it to the rest of the list, which is as fast as it can get when it comes to immutable data structures. Otherwise, if you add to the end, and you insist on immutability, you should use a Vector (or the upcoming Conc lists!).
If you would like a validation of these claims, see this link where the performance of the two operations is compared using ScalaMeter -- lists are 8 times faster than vectors when you add to the beginning.
However, the most appropriate data structure should be either an ArrayBuffer or a VectorBuilder. These are mutable data structures that resize dynamically and if you build them using += you will get a reasonable performance. This is assuming that you are not storing primitives.

Merging huge sets (HashSet) in Scala

I have two huge (as in millions of entries) sets (HashSet) that have some (<10%) overlap between them. I need to merge them into one set (I don't care about maintaining the original sets).
Currently, I am adding all items of one set to the other with:
setOne ++= setTwo
This takes several minutes to complete (after several attempts at tweaking hashCode() on the members).
Any ideas how to speed things up?
You can get slightly better performance with Parallel Collections API in Scala 2.9.0+:
setOne.par ++ setTwo
or
(setOne.par /: setTwo)(_ + _)
There are a few things you might wanna try:
Use the sizeHint method to keep your sets at the expected size.
Call useSizeMap(true) on it to get better hash table resizing.
It seems to me that the latter option gives better results, though both show improvements on tests here.
Can you tell me a little more about the data inside the sets? The reason I ask is that for this kind of thing, you usually want something a bit specialized. Here's a few things that can be done:
If the data is (or can be) sorted, you can walk pointers to do a merge, similar to what's done using merge sort. This operation is pretty trivially parallelizable since you can partition one data set and then partition the second data set using binary search to find the correct boundary.
If the data is within a certain numeric range, you can instead use a bitset and just set bits whenever you encounter that number.
If one of the data sets is smaller than the other, you could put it in a hash set and loop over the other dataset quickly, checking for containment.
I have used the first strategy to create a gigantic set of about 8 million integers from about 40k smaller sets in about a second (on beefy hardware, in Scala).

Cost of isEqualToString: vs. Numerical comparisons

I'm working on a project with designing a core data system for searching and cataloguing images and documents. One of the objects in my data model is a 'key word' object. Every time I add a new key word I first want to first run though all of the existing keywords to make sure it doesn't already exist in the current context.
I've read in posts here and in a lot of my reading that doing string comparisons is a far more expensive processing than some other comparison operations. Since I could easily end up having to check many thousands of words before a new addition I'm wondering if it would be worth using some method that would represent the key word strings numerically for the purpose of this process. Possibly breaking down each character in the string into a number formed from the UTF code for each character and then storing that in an ID property for each key word.
I was wondering if anyone else thought any benefit might come from this approach or if anyone else had any better ideas.
What you might useful is a suitable hash function to convert your text strings into (probably) unique numbers. (You might still have to check for collision effects.)
Comparing intrinsic numbers in C code is a much faster for several reasons. It avoids the Objective C runtime dispatch overhead. It requires accessing less total memory. And the executable code for each comparison is usually just an instruction or 3, rather than a loop with incrementers and several decision points.

Optimizing word count

(This is rather hypothetical in nature as of right now, so I don't have too many details to offer.)
I have a flat file of random (English) words, one on each line. I need to write an efficient program to count the number of occurrences of each word. The file is big (perhaps about 1GB), but I have plenty of RAM for everything. They're stored on permanent media, so read speeds are slow, so I need to just read through it once linearly.
My two off-the-top-of-my-head ideas were to use a hash with words => no. of occurrences, or a trie with the no. of occurrences at the end node. I have enough RAM for a hash array, but I'm thinking that a trie would have as fast or faster lookups.
What approach would be best?
I think a trie with the count as the leaves could be faster.
Any decent hash table implementation will require reading the word fully, processing it using a hash function, and finally, a look-up in the table.
A trie can be implemented such that the search occurs as you are reading the word. This way, rather than doing a full look-up of the word, you could often find yourself skipping characters once you've established the unique word prefix.
For example, if you've read the characters: "torto", a trie would know that the only possible word that starts this way is tortoise.
If you can perform this inline searching faster on a word faster than the hashing algorithm can hash, you should be able to be faster.
However, this is total overkill. I rambled on since you said it was purely hypothetical, I figured you'd like a hypothetical-type of answer. Go with the most maintainable solution that performs the task in a reasonable amount of time. Micro-optimizations typically waste more time in man-hours than they save in CPU-hours.
I'd use a Dictionary object where the key is word converted to lower case and the value is the count. If the dictionary doesn't contain the word, add it with a value of 1. If it does contain the word, increment the value.
Given slow reading, it's probably not going to make any noticeable difference. The overall time will be completely dominated by the time to read the data anyway, so that's what you should work at optimizing. For the algorithm (mostly data structure, really) in memory, just use whatever happens to be most convenient in the language you find most comfortable.
A hash table is (if done right, and you said you had lots of RAM) O(1) to count a particular word, while a trie is going to be O(n) where n is the length of the word.
With a sufficiently large hash space, you'll get much better performance from a hash table than from a trie.
I think that a trie is overkill for your use case. A hash of word => # of occurrences is exactly what I would use. Even using a slow interpreted language like Perl, you can munge a 1GB file this way in just a few minutes. (I've done this before.)
I have enough RAM for a hash array, but I'm thinking that a trie would have as fast or faster lookups.
How many times will this code be run? If you're just doing it once, I'd say optimize for your time rather than your CPU's time, and just do whatever's fastest to implement (within reason). If you have a standard library function that implements a key-value interface, just use that.
If you're doing it many times, then grab a subset (or several subsets) of the data file, and benchmark your options. Without knowing more about your data set, it'd be dubious to recommend one over another.
Use Python!
Add these elements to a set data type as you go line by line, before asking whether it is in the hash table. After you know it is in the set, then add a dictionary value of 2, since you already added it to the set once before.
This will take some of the memory and computation away from asking the dictionary every single time, and instead will handle unique valued words better, at the end of the call just dump all the words that are not in the dictionary out of the set with a value of 1. (Intersect the two collections in respect to the set)
To a large extent, it depends on what you want you want to do with the data once you've captured it. See Why Use a Hash Table over a Trie (Prefix Tree)?
a simple python script:
import collections
f = file('words.txt')
counts = collections.defaultdict(int)
for line in f:
counts[line.strip()] +=1
print "\n".join("%s: %d" % (word, count) for (word, count) in counts.iteritems())