Retrieve keys from hash-table, sorted by the values, efficiently - emacs

I'm using Emacs Lisp, but have the cl package loaded, for some common lisp features.
I have a hash table containing up to 50K entries, with integer keys mapped to triplets, something like this (but in actual lisp):
{
8 => '(9 300 12)
27 => '(5 125 9)
100 => '(10 242 14)
}
The second value in the triplet is a score that has been calculated during a complex algorithm that built the hash-table. I need to collect a regular lisp list with all of the keys from the hash, ordered by the score (i.e. all keys ordered by the cadr of the value).
So for the above, I need this list:
'(27 100 8)
I'm currently doing this in two phases, which feels less efficient than it needs to be.
Is there a good way to do this?
My current solution uses maphash to collect the keys and the values into two new lists, then does a sort in the normal way, referring to the list of scores in the predicate. It feels like I could be combining the collection and the sorting together, however.
EDIT | I'm also not attached to using hash-table, though I do need constant access time for the integer keys, which are not linearly spaced.
EDIT 2| It looks like implementing a binary tree sort could work, where the labels in the tree are the scores and the values are the keys... this way I'm doing the sort as I map over the hash.
... Continues reading wikipedia page on sorting algorithms

Basically, your solution is correct: you need to collect the keys into a list:
(defun hash-table-keys (hash-table)
  (let ((keys ()))
    (maphash (lambda (k v) (push k keys)) hash-table)
    keys))
and then sort the list:
(sort (hash-table-keys hash-table)
      (lambda (k1 k2)
        (< (second (gethash k1 hash-table))
           (second (gethash k2 hash-table)))))
Combining key collection with sorting is possible: you need to collect the keys into a tree and then "flatten" the tree. However, this will only matter if you are dealing with really huge tables. Also, since the built-in sort is implemented in C while your own tree code would run as Emacs Lisp bytecode, you might find that using sort is still faster than using a tree. Consider also the development cost: you would be writing code whose value is mostly educational.
Delving deeper, collecting the keys allocates the list of keys (which you will certainly need anyway for the result) and sort operates "in-place", so the "simple way" is about as good as it gets.
The "tree" way will allocate the tree (the same memory footprint as the required list of keys) and populating and flattening it will be the same O(n*log(n)) process as the "collect+sort" way. However, keeping the tree balanced, and then flattening it "in-place" (i.e., without allocating a new list) is not a simple exercise.
The bottom line is: KISS.
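As a variation on the "collect + sort" approach above, here is a minimal sketch that pairs each score with its key during the single maphash pass, so the sort predicate does not need to call gethash. The function name my-keys-sorted-by-score is hypothetical, and whether this is actually faster on a 50K-entry table is an assumption you would need to measure.

```elisp
(defun my-keys-sorted-by-score (table)
  "Return the keys of TABLE ordered by the second element of each value.
Hypothetical helper, not part of Emacs."
  (let ((pairs '()))
    ;; One pass over the table: collect (SCORE . KEY) pairs.
    (maphash (lambda (key value)
               (push (cons (cadr value) key) pairs))
             table)
    ;; `sort' is destructive on lists, so use its return value.
    (mapcar #'cdr
            (sort pairs (lambda (a b) (< (car a) (car b)))))))

(let ((table (make-hash-table)))
  (puthash 8 '(9 300 12) table)
  (puthash 27 '(5 125 9) table)
  (puthash 100 '(10 242 14) table)
  (my-keys-sorted-by-score table))
;; => (27 100 8)
```

This allocates one extra cons per entry compared with sorting the bare key list, in exchange for a cheaper predicate.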

Related

Random access for hash table

I have an SBCL hash table where the hash keys are symbols. If the hash table was made with eq, will calling gethash give random access to the elements? I know these details are implementation specific, but so far I haven't been able to find a clear answer in the documentation.
I assume (also from the discussion in the comments) that by "give random access" you mean that the distribution of elements in the hash-table will be random and hence it will have O(1) access performance. The answer is yes, it will be. There are some degraded cases like this one (Why does `sxhash` return a constant for all structs?) when the distribution becomes skewed, but this is definitely not it. For eq comparisons the implementations will use the address of an object for hashing. In the case of SBCL, here's the actual code:
(defun eq-hash (key)
  (declare (values hash (member t nil)))
  ;; I think it would be ok to pick off SYMBOL here and use its hash slot
  ;; as far as semantics are concerned, but EQ-hash is supposed to be
  ;; the lightest-weight in terms of speed, so I'm letting everything use
  ;; address-based hashing, unlike the other standard hash-table hash functions
  ;; which try use the hash slot of certain objects.
  (values (pointer-hash key)
          (sb-vm:is-lisp-pointer (get-lisp-obj-address key))))
However, you can also opt to use an eql hash-table (which I'd recommend: using eq should be reserved only for those who know what they are doing :) ). For this case, SBCL has a special function to hash symbols: symbol-hash. I assume other implementations do something similar, since symbols are probably the most frequent type of hash-table key.
Hash tables, by design, give O(1) access and update of their elements. It's not implementation specific.
Since hashing works differently from comparing, the test of a hash table in standard CL is limited to eq, eql (the default), equal, and equalp. In practice this only means that two keys considered the same by the chosen test will have the same hash value. SBCL lets you define your own hash functions, but that is not portable.
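To make the role of the test concrete, here is a small sketch written in Emacs Lisp rather than SBCL (an assumption made only to keep the examples on this page in one language; the eql/equal distinction shown is the same in Common Lisp). The helper name my-string-lookup is hypothetical.

```elisp
(defun my-string-lookup (test)
  "Store a fresh string key in a table created with TEST, then look up
a different but `equal' string.  Hypothetical helper for illustration."
  (let ((table (make-hash-table :test test)))
    ;; `copy-sequence' guarantees two distinct string objects.
    (puthash (copy-sequence "key") 1 table)
    (gethash (copy-sequence "key") table)))

(list (my-string-lookup 'eql)
      (my-string-lookup 'equal))
;; => (nil 1)
```

Symbols do not have this problem: two occurrences of the same interned symbol are the same object, so eq and eql tables both find them.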

Emacs - represent JSON-like structures

What would be the canonical way for emacs to represent JSON-like structures, or nested hashmaps ?
I have a structure with approximately 25 top-level keys. Each key has no more than one sub-key (i.e. the value is another key/value element). Some of the final values are FIFO arrays.
I started to model this using hash-map, but it feels cumbersome. Now I just stumbled upon assoc-lists. Which would be the most appropriate in my case?
Note : I intend to replicate parinfer in elisp, this part for now, and learn elisp at the same time.
You should use assoc-lists, which are the Emacs standard way of representing a map/dictionary/table. You see them in a lot of places: auto-mode-alist, minor-mode-alist, interpreter-mode-alist, etc. hash-map is only meant for speed, when you have 1000+ entries.
There's even an official way to convert JSON to an assoc-list:
(require 'json)
(json-read-from-string "{\"foo\": {\"bar\": 5}}")
=> ((foo (bar . 5)))
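Reading a nested value back out of such an alist might look like the sketch below. It assumes Emacs 25.1 or later for alist-get and the default settings of json.el (objects as alists with symbol keys); my-json-get-in is a hypothetical name.

```elisp
(require 'json)

(defun my-json-get-in (json-string outer inner)
  "Parse JSON-STRING and return the value under OUTER, then INNER.
Hypothetical helper; OUTER and INNER are symbols."
  (let ((data (json-read-from-string json-string)))
    ;; `alist-get' compares keys with `eq' by default, which suits
    ;; the symbol keys produced by `json-read-from-string'.
    (alist-get inner (alist-get outer data))))

(my-json-get-in "{\"foo\": {\"bar\": 5}}" 'foo 'bar)
;; => 5
```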

Iterate over Emacs Lisp hash-table

How to iterate over items (keys, values) in Elisp hash-tables?
I created a hash-table (map, dictionary) using (make-hash-table) and populated it with different items. In Python I could iterate over dicts by:
for k in d # iterate over keys
for k in d.keys() # same
for k in d.values() # iterate over values
for k in d.items() # iterate over tuples (key, value)
How can I do the same in the most succinct and elegant way possible, preferably without the loop macro?
(maphash (lambda (key value) ....your code here...) hash-table)
I'm going to advertise myself a bit, so take it with a grain of salt, but here are, basically, your options:
maphash - this is the built-in iteration primitive; fundamentally, there is no other way to do it.
(loop for KEY being the hash-key of TABLE for VALUE being the hash-value of TABLE ...) is available in cl package. It will internally use maphash anyway, but it offers you some unification on top of different iterating primitives. You can use loop macro to iterate over multiple different things, and it reduces the clutter by removing the technical info from sight.
http://code.google.com/p/i-iterate/ Here's a library I'm working on to provide more versatile ways of iterating over different things and in different ways in Emacs Lisp. It is inspired by Common Lisp Iterate library, but it departed from it quite far (however, some basic principles still hold). If you were to try this library, the iteration over the hash-table would look like this: (++ (for (KEY VALUE) pairs TABLE) ...) or (++ (for KEY keys TABLE) ...) or (++ (for VALUE values TABLE) ...).
I will try to describe the pros and cons of using either cl loop or i-iterate.
Unlike loop, iterate allows iterating over multiple hash-tables at once (but you must be aware of the additional cost it incurs: the keys of the second, third etc. hash-tables must be collected into a list before iterating, this is done behind the scenes).
Iterate provides arguably more Lisp-y syntax, which is easier to format in the editor.
With iterate you have more (and potentially even more in the future) options to combine iteration with other operations.
No one else so far is using it, beside myself :) It probably still has bugs and some things may be reworked, but it is near feature-freeze and is getting ready for proper use.
Significantly more people are familiar with either the built-in iteration primitives or the cl library.
Just as an aside, the full version of the iterate on hash-tables looks like this: (for VAR pairs|keys|values TABLE &optional limit LIMIT), where LIMIT stands for the number of elements you want to look at (it will generate more efficient code than if you were to break from the loop using more general-purpose tools).
maphash is the function you want. In addition I would suggest you look at the manual (info "(elisp) Hash Tables")
Starting from 2013 there is a third-party library ht, which provides many convenient functions to operate on Elisp hash-tables.
Suppose you have a hash-table, where keys are strings and values are integers. To iterate over a hash-table and return a list, use ht-map:
(ht-map (lambda (k v) (+ (length k) v)) table)
;; return list of all values added to length of their keys
ht-each is just an alias for maphash. There are also anaphoric versions of the above 2 functions, called ht-amap and ht-aeach. Instead of accepting an anonymous function, they expose variables key and value. Here's the equivalent expression to the one above:
(ht-amap (+ (length key) value) table)
I would have preferred to put this into a comment, but my reputation rating ironically prevents me from writing this in the appropriate format...
loop is considered deprecated and so is the cl library, because it didn't adhere to the convention of prefixing all symbols by a common library prefix and thus polluted the obarray with symbols without clear library association.
Instead use cl-lib which defines the same functions and macros but names them e.g. cl-loop and cl-defun instead of loop and defun*. If you need only the macros, you can import cl-macs instead.
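Putting the answers above together, the Python-style iterations from the question can be sketched with maphash and cl-loop as follows. The helper names my-hash-keys and my-hash-pairs are hypothetical; the results are sorted in the usage example because this sketch does not rely on any particular iteration order.

```elisp
(require 'cl-lib)

(defun my-hash-keys (table)
  "Return a list of the keys of TABLE, using the built-in `maphash'."
  (let ((keys '()))
    (maphash (lambda (key _value) (push key keys)) table)
    keys))

(defun my-hash-pairs (table)
  "Return a list of (KEY . VALUE) pairs of TABLE, using `cl-loop'."
  ;; `using' binds the value alongside the key in the same clause.
  (cl-loop for key being the hash-keys of table
           using (hash-values value)
           collect (cons key value)))

(let ((table (make-hash-table :test 'equal)))
  (puthash "a" 1 table)
  (puthash "b" 2 table)
  (list (sort (my-hash-keys table) #'string<)
        (sort (my-hash-pairs table)
              (lambda (x y) (string< (car x) (car y))))))
;; => (("a" "b") (("a" . 1) ("b" . 2)))
```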

good style in lisp: cons vs list

Is it good style to use cons for pairs of things or would it be preferable to stick to lists?
like for instance questions and answers:
(list
 (cons
  "Favorite color?"
  "red")
 (cons
  "Favorite number?"
  "123")
 (cons
  "Favorite fruit?"
  "avocado"))
I mean, some things come naturally in pairs; there is no need for something that can hold more than two, so I feel like cons would be the natural choice. However, I also feel like I should be sticking to one thing (lists).
What would be the better or more accepted style?
What you have there is an association list (alist). Alist entries are, indeed, often simple conses rather than lists (though that is a matter of preference: some people use lists for alist entries too), so what you have is fine. Though, I usually prefer to use literal syntax:
'(("Favorite color?" . "red")
  ("Favorite number?" . "123")
  ("Favorite fruit?" . "avocado"))
Alists usually use a symbol as the key, because symbols are interned, and so symbol alists can be looked up using assq instead of assoc. Here's how it might look:
'((color . "red")
  (number . "123")
  (fruit . "avocado"))
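A short sketch of the two lookups, using the alists from this answer. The variable names my-answers-by-string and my-answers-by-symbol are hypothetical.

```elisp
(defconst my-answers-by-string
  '(("Favorite color?" . "red")
    ("Favorite number?" . "123")
    ("Favorite fruit?" . "avocado"))
  "Alist keyed by strings (hypothetical example data).")

(defconst my-answers-by-symbol
  '((color . "red")
    (number . "123")
    (fruit . "avocado"))
  "Alist keyed by symbols (hypothetical example data).")

(list
 ;; String keys need `assoc', which compares keys with `equal'.
 (cdr (assoc "Favorite color?" my-answers-by-string))
 ;; Symbol keys can use `assq', which compares keys with `eq'.
 (cdr (assq 'number my-answers-by-symbol)))
;; => ("red" "123")
```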
The default data-structure for such a case should be a HASH-TABLE.
An association list of cons pairs is also a possible variant and was widely used historically. It is a valid variant, because of tradition and simplicity. But you should not use it, when the number of pairs exceeds several (probably, 10 is a good threshold), because search time is linear, while in hash-table it is constant.
Using a list for this task is also possible, but will be both ugly and inefficient.
You would need to decide for yourself based upon circumstances. There isn't a universal answer. Different tasks work differently with structures. Consider the following:
It is faster to search in a hash-table for keys than it is in an alist.
It is easier to have an iterator and save its state when working with an alist: a hash-table would need to export all of its keys as an array or a list and keep a pointer into that list, while for an alist it is enough to remember the pointer into the alist to be able to restore the iterator's state and continue the iteration.
Alist vs list: they use the same number of conses for an even number of elements, given that all the elements are atoms. When using lists instead of alists you would thus have to make sure there isn't an odd number of elements (and you may discover it too late), which is bad.
But there are a lot more functions, including the built-in ones, which work on proper lists, and don't work on alists. For example, nth will error on alist, if it hits the cdr, which is not a list.
Sometimes certain macros would not function as you'd like them to with alists, for example, this:
(destructuring-bind (a b c d)
    '((100 . 200) (300 . 400))
  (format t "~&~{~s~^,~}" (list a b c d)))
will not work as you might've expected.
On the other hand, certain procedures may be "tricked" into doing something which they don't do for proper lists. For instance, when copying an alist with copy-list, only the top-level conses (the ones whose cdr is a list) will be copied anew, while the key/value pairs themselves stay shared with the original (depending upon the circumstances this may be a desired result).
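The copying behaviour can be sketched in Emacs Lisp, where cl-copy-list from cl-lib plays the role of copy-list and copy-alist additionally copies each pair. The function name my-copy-demo is hypothetical.

```elisp
(require 'cl-lib)

(defun my-copy-demo ()
  "Return what two kinds of copy see after the original alist is mutated.
Hypothetical demonstration helper."
  (let* ((original (list (cons 'a 1) (cons 'b 2)))
         ;; Copies only the top-level conses; the pairs are shared.
         (shallow (cl-copy-list original))
         ;; Copies the top-level conses and each pair.
         (deep (copy-alist original)))
    ;; Mutate the first pair of the original in place.
    (setcdr (car original) 100)
    (list (cdr (assq 'a shallow))
          (cdr (assq 'a deep)))))

(my-copy-demo)
;; => (100 1)
```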

What are circular lists good for (in Lisp or Scheme)?

I note that Scheme and Lisp (I guess) support circular lists, and I have used circular lists in C/C++ to 'simplify' the insertion and deletion of elements, but what are they good for?
Scheme ensures that they can be built and processed, but for what?
Is there a 'killer' data structure that needs to be circular or tail-circular?
Saying it supports 'circular lists' is a bit much. You can build all kinds of circular data structures in Lisp. Like in many programming languages. There is not much special about Lisp in this respect. Take your typical 'Algorithms and Datastructure' book and implement any circular data structure: graphs, rings, ... What some Lisps offer is that one can print and read circular data structures. The support for this is because in typical Lisp programming domains circular data structures are common: parsers, relational expressions, networks of words, plans, ...
It is quite common that data structures contain cycles. Real 'circular lists' are not that often used. For example think of a task scheduler which runs a task and after some time switches to the next. The list of tasks can be circular so that after the 'last' task the scheduler takes the 'first' task. In fact there is no 'last' and 'first' - it is just a circular list of tasks and the scheduler runs them without end. You could also have a list of windows in a window system and with some key command you would switch to the next window. The list of windows could be circular.
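The task-scheduler idea can be sketched in a few lines of Emacs Lisp (chosen only to keep one language across the examples on this page; the same works in Common Lisp or Scheme with the corresponding primitives). The name my-round-robin is hypothetical, and TASKS is assumed to be non-empty.

```elisp
(defun my-round-robin (tasks turns)
  "Return the tasks that would run in TURNS consecutive turns.
TASKS is a non-empty list; it is copied, so the argument is not modified."
  (let ((ring (copy-sequence tasks))
        (order '()))
    ;; Close the ring: the last cons now points back to the first.
    (setcdr (last ring) ring)
    ;; There is no "last" task any more; `pop' just moves around the ring.
    (dotimes (_ turns)
      (push (pop ring) order))
    (nreverse order)))

(my-round-robin '(a b c) 5)
;; => (a b c a b)
```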
Lists are useful when you need a cheap next operation and the size of the data structure is unknown in advance. You can always add another node to the list or remove a node from a list. Usual implementations of lists make getting the next node and adding/removing an item cheap. Getting the next element from an array is also relatively simple (increase the index, at the last index go to the first index), but adding/removing elements usually needs more expensive shift operations.
Also since it is easy to build circular data structures, one just might do it during interactive programming. If you then print a circular data structure with the built-in routines it would be a good idea if the printer can handle it, since otherwise it may print a circular list forever...
Have you ever played Monopoly?
Without playing games with counters and modulo and such, how would you represent the Monopoly board in a computer implementation of the game? A circular list is a natural.
For example a double linked list data structure is "circular" in the Scheme/LISP point of view, i.e. if you try to print the cons-structure out you get backreferences, i.e. "cycles". So it's not really about having data structures that look like "rings", any data structure where you have some kind of backpointers is "circular" from the Scheme/LISP perspective.
A "normal" LISP list is single linked, which means that a destructive mutation to remove an item from inside the list is an O(n) operation; for double linked lists it is O(1). That's the "killer feature" of double linked lists, which are "circular" in the Scheme/LISP context.
Adding and removing elements to the beginning of a list is cheap. To add or remove an element from the end of a list, you have to traverse the whole list.
With a circular list, you can have a sort of fixed-length queue.
Set up a circular list of length 5:
> (import (srfi :1 lists))
> (define q (circular-list 1 2 3 4 5))
Let's add a number to the list:
> (set-car! q 6)
Now, let's make that the last element of the list:
> (set! q (cdr q))
Display the list:
> (take q 5)
(2 3 4 5 6)
So you can view this as a queue where elements enter at the end of the list and are removed from the head.
Let's add 7 to the list:
> (set-car! q 7)
> (set! q (cdr q))
> (take q 5)
(3 4 5 6 7)
Etc...
Anyways, this is one way that I've used circular-lists.
I use this technique in an OpenGL demo which I ported from an example in the Processing book.
Ed
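The fixed-length queue above can also be sketched in Emacs Lisp, with setcar and setcdr standing in for set-car! and set-cdr!. The three helper names are hypothetical and the ring is assumed to be non-empty.

```elisp
(defun my-make-ring (&rest items)
  "Return a circular list holding ITEMS (assumed non-empty)."
  (let ((ring (copy-sequence items)))
    (setcdr (last ring) ring)
    ring))

(defun my-ring-push (ring item)
  "Overwrite the oldest element of RING with ITEM.
Return the advanced ring, in which ITEM is now the newest element."
  (setcar ring item)
  (cdr ring))

(defun my-ring-contents (ring size)
  "Return the SIZE elements of RING, oldest first, as an ordinary list."
  (let ((out '()))
    (dotimes (i size)
      (push (nth i ring) out))
    (nreverse out)))

(let ((q (my-make-ring 1 2 3 4 5)))
  (setq q (my-ring-push q 6))
  (setq q (my-ring-push q 7))
  (my-ring-contents q 5))
;; => (3 4 5 6 7)
```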
One use of circular lists is to "repeat" values when using the srfi-1 version of map. For example, to add val to each element of lst, we could write:
(map + (circular-list val) lst)
For example:
(map + (circular-list 10) (list 0 1 2 3 4 5))
returns:
(10 11 12 13 14 15)
Of course, you could do this by replacing + with (lambda (x) (+ x val)), but sometimes the above idiom can be handier. Note that this only works with the srfi-1 version of map, which can accept lists of different sizes.
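The same idiom can be sketched in Emacs Lisp with cl-mapcar from cl-lib, which also stops at the end of the shortest list. The function name my-add-to-each is hypothetical.

```elisp
(require 'cl-lib)

(defun my-add-to-each (val lst)
  "Return a new list with VAL added to each element of LST."
  (let ((repeat (list val)))
    ;; A one-element circular list: VAL repeated forever.
    (setcdr repeat repeat)
    ;; `cl-mapcar' stops when LST, the finite list, runs out.
    (cl-mapcar #'+ repeat lst)))

(my-add-to-each 10 (list 0 1 2 3 4 5))
;; => (10 11 12 13 14 15)
```

Passing only the circular list to a mapping function would loop forever, so at least one argument must be finite.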