Faster code for 'distinct' on lists

Faster code for 'distinct' on lists - scala

This question refers to code generation with the Isabelle/HOL theorem prover.
When I export code for the distinct function on lists
export_code distinct in Scala file -
I get the following code
def member[A : HOL.equal](x0: List[A], y: A): Boolean = (x0, y) match {
case (Nil, y) => false
case (x :: xs, y) => HOL.eq[A](x, y) || member[A](xs, y)
}
def distinct[A : HOL.equal](x0: List[A]): Boolean = x0 match {
case Nil => true
case x :: xs => ! (member[A](xs, x)) && distinct[A](xs)
}
This code has quadratic runtime. Is there a faster version available? I think of something like importing "~~/src/HOL/Library/Code_Char" for strings at the beginning of my theory and efficient code generation for lists is set up.
A better implementation for distinct would be to sort the list in O(n log n) and iterate over the list once. But I guess one can do better?
Anyway, is there a faster implementation for distinct and maybe other functions from Main available?

I do not know of any faster implementation in Isabelle2013's library, but you can easily do it yourself as follows:
Implement a function distinct_sorted that determines distinctness on sorted lists.
Prove that distinct_sorted indeed implements distinct on sorted lists
Prove a lemma that implements distinct via distinct_list and sorting, and declare it as the new code equation for distinct.
In summary, this looks as follows:
context linorder begin
fun distinct_sorted :: "'a list => bool" where
"distinct_sorted [] = True"
| "distinct_sorted [x] = True"
| "distinct_sorted (x#y#xs) = (x ~= y & distinct_sorted (y#xs))"
lemma distinct_sorted: "sorted xs ==> distinct_sorted xs = distinct xs"
by(induct xs rule: distinct_sorted.induct)(auto simp add: sorted_Cons)
end
lemma distinct_sort [code]: "distinct xs = distinct_sorted (sort xs)"
by(simp add: distinct_sorted)
Next, you need an efficient sorting algorithm. By default, sort uses insertion sort. If you import Multiset from HOL/Library, sort will be implemented by quicksort. If you import Efficient Mergesort from the Archive of Formal Proofs, you get merge sort.
While this can improve efficiency, there's also a snag: After the above declarations, you can execute distinct only on lists whose elements are instances of the type class linorder. As this refinement happens only inside the code generator, your definitions and theorems in Isabelle are not affected.
For example, to apply distinct to a list of lists in any code equation, you first have to define a linear order on lists: List_lexord in HOL/Library does so by picking the lexicographic order, but this requires a linear order on the elements. If you want to use string, which abbreviates char list, Char_ord defines the usual order on char. If you map characters to the character type of the target language with Code_Char, you also need the adaptation theory Code_Char_ord for the combination with Char_ord.

Related

What kind of morphism is `filter` in category theory?

In category theory, is the filter operation considered a morphism? If yes, what kind of morphism is it? Example (in Scala)
val myNums: Seq[Int] = Seq(-1, 3, -4, 2)
myNums.filter(_ > 0)
// Seq[Int] = List(3, 2) // result = subset, same type
myNums.filter(_ > -99)
// Seq[Int] = List(-1, 3, -4, 2) // result = identical than original
myNums.filter(_ > 99)
// Seq[Int] = List() // result = empty, same type

One interesting way of looking at this matter involves not picking filter as a primitive notion. There is a Haskell type class called Filterable which is aptly described as:
Like Functor, but it [includes] Maybe effects.
Formally, the class Filterable represents a functor from Kleisli Maybe to Hask.
The morphism mapping of the "functor from Kleisli Maybe to Hask" is captured by the mapMaybe method of the class, which is indeed a generalisation of the homonymous Data.Maybe function:
mapMaybe :: Filterable f => (a -> Maybe b) -> f a -> f b
The class laws are simply the appropriate functor laws (note that Just and (<=<) are, respectively, identity and composition in Kleisli Maybe):
mapMaybe Just = id
mapMaybe (g <=< f) = mapMaybe g . mapMaybe f
The class can also be expressed in terms of catMaybes...
catMaybes :: Filterable f => f (Maybe a) -> f a
... which is interdefinable with mapMaybe (cf. the analogous relationship between sequenceA and traverse)...
catMaybes = mapMaybe id
mapMaybe g = catMaybes . fmap g
... and amounts to a natural transformation between the Hask endofunctors Compose f Maybe and f.
What does all of that have to do with your question? Firstly, a functor is a morphism between categories, and a natural transformation is a morphism between functors. That being so, it is possible to talk of morphisms here in a sense that is less boring than the "morphisms in Hask" one. You won't necessarily want to do so, but in any case it is an existing vantage point.
Secondly, filter is, unsurprisingly, also a method of Filterable, its default definition being:
filter :: Filterable f => (a -> Bool) -> f a -> f a
filter p = mapMaybe $ \a -> if p a then Just a else Nothing
Or, to spell it using another cute combinator:
filter p = mapMaybe (ensure p)
That indirectly gives filter a place in this particular constellation of categorical notions.

To answer are question like this, I'd like to first understand what is the essence of filtering.
For instance, does it matter that the input is a list? Could you filter a tree? I don't see why not! You'd apply a predicate to each node of the tree and discard the ones that fail the test.
But what would be the shape of the result? Node deletion is not always defined or it's ambiguous. You could return a list. But why a list? Any data structure that supports appending would work. You also need an empty member of your data structure to start the appending process. So any unital magma would do. If you insist on associativity, you get a monoid. Looking back at the definition of filter, the result is a list, which is indeed a monoid. So we are on the right track.
So filter is just a special case of what's called Foldable: a data structure over which you can fold while accumulating the results in a monoid. In particular, you could use the predicate to either output a singleton list, if it's true; or an empty list (identity element), if it's false.
If you want a categorical answer, then a fold is an example of a catamorphism, an example of a morphism in the category of algebras. The (recursive) data structure you're folding over (a list, in the case of filter) is an initial algebra for some functor (the list functor, in this case), and your predicate is used to define an algebra for this functor.

In this answer, I will assume that you are talking about filter on Set (the situation seems messier for other datatypes).
Let's first fix what we are talking about. I will talk specifically about the following function (in Scala):
def filter[A](p: A => Boolean): Set[A] => Set[A] =
s => s filter p
When we write it down this way, we see clearly that it's a polymorphic function with type parameter A that maps predicates A => Boolean to functions that map Set[A] to other Set[A]. To make it a "morphism", we would have to find some categories first, in which this thing could be a "morphism". One might hope that it's natural transformation, and therefore a morphism in the category of endofunctors on the "default ambient category-esque structure" usually referred to as "Hask" (or "Scal"? "Scala"?). To show that it's natural, we would have to check that the following diagram commutes for every f: B => A:
- o f
Hom[A, Boolean] ---------------------> Hom[B, Boolean]
| |
| |
| |
| filter[A] | filter[B]
| |
V ??? V
Hom[Set[A], Set[A]] ---------------> Hom[Set[B], Set[B]]
however, here we fail immediately, because it's not clear what to even put on the horizontal arrow at the bottom, since the assignment A -> Hom[Set[A], Set[A]] doesn't even seem functorial (for the same reasons why A -> End[A] is not functorial, see here and also here).
The only "categorical" structure that I see here for a fixed type A is the following:
Predicates on A can be considered to be a partially ordered set with implication, that is p LEQ q if p implies q (i.e. either p(x) must be false, or q(x) must be true for all x: A).
Analogously, on functions Set[A] => Set[A], we can define a partial order with f LEQ g whenever for each set s: Set[A] it holds that f(s) is subset of g(s).
Then filter[A] would be monotonic, and therefore a functor between poset-categories. But that's somewhat boring.
Of course, for each fixed A, it (or rather its eta-expansion) is also just a function from A => Boolean to Set[A] => Set[A], so it's automatically a "morphism" in the "Hask-category". But that's even more boring.

filter can be written in terms of foldRight as:
filter p ys = foldRight(nil)( (x, xs) => if (p(x)) x::xs else xs ) ys
foldRight on lists is a map of T-algebras (where here T is the List datatype functor), so filter is a map of T-algebras.
The two algebras in question here are the initial list algebra
[nil, cons]: 1 + A x List(A) ----> List(A)
and, let's say the "filter" algebra,
[nil, f]: 1 + A x List(A) ----> List(A)
where f(x, xs) = if p(x) x::xs else xs.
Let's call filter(p, _) the unique map from the initial algebra to the filter algebra in this case (it is called fold in the general case). The fact that it is a map of algebras means that the following equations are satisfied:
filter(p, nil) = nil
filter(p, x::xs) = f(x, filter(p, xs))

Efficient implementation of Catamorphisms in Scala

For a datatype representing the natural numbers:
sealed trait Nat
case object Z extends Nat
case class S(pred: Nat) extends Nat
In Scala, here is an elementary way of implementing the corresponding catamorphism:
def cata[A](z: A)(l: Nat)(f: A => A): A = l match {
case Z => z
case S(xs) => f( cata(z)(xs)(f) )
}
However, since the recursive call to cata isn't in tail position, this can easily trigger a stack overflow.
What are alternative implementation options that will avoid this? I'd rather not go down the route of F-algebras unless the interface ultimately presented by the code can look pretty much as simple as the above.
EDIT: Looks like this might be directly relevant: Is it possible to use continuations to make foldRight tail recursive?

If you were implementing a catamorphism on lists, that would be what in Haskell we call a foldr. We know that foldr does not have a tail-recursive definition, but foldl does. So if you insist on a tail-recusive program, the right thing to do is reverse the list argument (tail-recursively, in linear time), then use a foldl in place of the foldr.
Your example uses the simpler data type of naturals (and a truly "efficient" implementation would use machine integers, but we'll agree to leave that aside). What is the reverse of one of your natural numbers? Just the number itself, because we can think of it as a list with no data in each node, so we can't tell the difference when it is reversed! And what's the equivalent of the foldl? It's the program (forgive the pseudocode)
def cata(z, a, f) = {
var x = a, y = z;
while (x != Z) {
y = f(y);
x = pred(x)
}
return y
}
Or as a Scala tail-recursion,
def cata[A](z: A)(a: Nat)(f: A => A): A = a match {
case Z => z
case S(b) => cata( f(z) )(b)(f)
}
Will that do?

Yes, this is exactly the motivating example in the paper Clowns to the left of me, jokers to the right
(Dissecting Data Structures) (updated, better, but non-free version here http://dl.acm.org/citation.cfm?id=1328474).
The basic idea is that you want to turn your recursive function into a loop, so you need to figure out a data structure that keeps track of the state of the procedure, which is
What you've calculuated so far
What you have left to do.
The type of this state depends on the structure of the type you're doing the fold over, at any point in the fold you are at some node of the tree and you need to remember the tree structure of "the rest of the tree".
The paper shows how you can calculate that state type mechanically. If you do this for Lists, you get that the state you need to keep track of is
The operation run on all the previous values.
The list of elements left to process.
Which is exactly what foldl keeps track of, so it's kind of a coincidence that foldl and foldr can be given the same type.

Why does Haskell's foldr NOT stackoverflow while the same Scala implementation does?

I am reading FP in Scala.
Exercise 3.10 says that foldRight overflows (See images below).
As far as I know , however foldr in Haskell does not.
http://www.haskell.org/haskellwiki/
-- if the list is empty, the result is the initial value z; else
-- apply f to the first element and the result of folding the rest
foldr f z [] = z
foldr f z (x:xs) = f x (foldr f z xs)
-- if the list is empty, the result is the initial value; else
-- we recurse immediately, making the new initial value the result
-- of combining the old initial value with the first element.
foldl f z [] = z
foldl f z (x:xs) = foldl f (f z x) xs
How is this different behaviour possible?
What is the difference between the two languages/compilers that cause this different behaviour?
Where does this difference come from ? The platform ? The language? The compiler?
Is it possible to write a stack-safe foldRight in Scala? If yes, how?

Haskell is lazy. The definition
foldr f z (x:xs) = f x (foldr f z xs)
tells us that the behaviour of foldr f z xs with a non-empty list xs is determined by the laziness of the combining function f.
In particular the call foldr f z (x:xs) allocates just one thunk on the heap, {foldr f z xs} (writing {...} for a thunk holding an expression ...), and calls f with two arguments - x and the thunk. What happens next, is f's responsibility.
In particular, if it's a lazy data constructor (like e.g. (:)), it will immediately be returned to the caller of the foldr call (with the constructor's two slots filled by (references to) the two values).
And if f does demand its value on the right, with minimal compiler optimizations no thunks should be created at all (or one, at the most - the current one), as the value of foldr f z xs is immediately needed and the usual stack-based evaluation can used:
foldr f z [a,b,c,....,n] ==
a `f` (b `f` (c `f` (... (n `f` z)...)))
So foldr can indeed cause SO, when used with strict combining function on extremely long input lists. But if the combining function doesn't demand right away its value on the right, or only demands a part of it, the evaluation will be suspended in a thunk, and the partial result as created by f will be immediately returned. Same with the argument on the left, but they already come as thunks, potentially, in the input list.

Haskell is lazy. So foldr allocates on the heap, not the stack. Depending on the strictness of the argument function, it may allocate a single (small) result, or a large structure.
You're still losing space, compared to a strict, tail-recursive implementation, but it doesn't look as obvious, since you've traded stack for heap.

Note that the authors here are not referring to any foldRight definition in the scala standard library, such as the one defined on List. They are referring to the definition of foldRight they gave above in section 3.4.
The scala standard library defines the foldRight in terms of foldLeft by reversing the list (which can be done in constant stack space) then calling foldLeft with the the arguments of the passed function reversed. This works for lists, but won't work for a structure which cannot be safely reversed, for example:
scala> Stream.continually(false)
res0: scala.collection.immutable.Stream[Boolean] = Stream(false, ?)
scala> res0.reverse
java.lang.OutOfMemoryError: GC overhead limit exceeded
Now lets think about what should be the result of this operation:
Stream.continually(false).foldRight(true)(_ && _)
The answer should be false, it doesn't matter how many false values are in the stream or if it is infinite, if we are going to combine them with a conjunction, the result will be false.
haskell of course gets this with no problem:
Prelude> foldr (&&) True (repeat False)
False
And that is because of two important things: haskell's foldr will traverse the stream from left to right, not right to left, and haskell is lazy by default. The first item here, that foldr actually traverses the list from left to right might surprise or confuse some people who think of a right fold as starting from the right, but the important feature of a right fold is not which end of a structure it starts on, but in which direction the associativity is. So give a list [1,2,3,4] and an op named op, a left fold is
((1 op 2) op 3) op 4)
and a right fold is
(1 op (2 op (3 op 4)))
But the order of evaluation shouldn't matter. So what the authors have done here in chapter 3 is to give you a fold which traverses the list from left to right, but because scala is by default strict, we still will not be able to traverse our stream of infinite falses, but have some patience, they will get to that in chapter 5 :) I'll give you a sneak peek, lets look at the difference between foldRight as it is defined in the standard library and as it is defined in the Foldable typeclass in scalaz:
Here's the implementation from the scala standard library:
def foldRight[B](z: B)(op: (A, B) => B): B
Here's the definition from scalaz's Foldable:
def foldRight[B](z: => B)(f: (A, => B) => B): B
The difference is that the Bs are all lazy, and now we get to fold our infinite stream again, as long as we give a function which is sufficiently lazy in its second parameter:
scala> Foldable[Stream].foldRight(Stream.continually(false),true)(_ && _)
res0: Boolean = false

One easy way to demonstrate this in Haskell is to use equational reasoning to demonstrate lazy evaluation. Let's write the find function in terms of foldr:
-- Return the first element of the list that satisfies the predicate, or `Nothing`.
find :: (a -> Bool) -> [a] -> Maybe a
find p = foldr (step p) Nothing
where step pred x next = if pred x then Just x else next
foldr :: (a -> b -> b) -> b -> [a] -> b
foldr f z [] = z
foldr f z (x:xs) = f x (foldr f z xs)
In an eager language, if you wrote find with foldr it would traverse the whole list and use O(n) space. With lazy evaluation, it stops at the first element that satisfies the predicate, and uses only O(1) space (modulo garbage collection):
find odd [0..]
== foldr (step odd) Nothing [0..]
== step odd 0 (foldr (step odd) Nothing [1..])
== if odd 0 then Just 0 else (foldr (step odd) Nothing [1..])
== if False then Just 0 else (foldr (step odd) Nothing [1..])
== foldr (step odd) Nothing [1..]
== step odd 1 (foldr (step odd) Nothing [2..])
== if odd 1 then Just 1 else (foldr (step odd) Nothing [2..])
== if True then Just 1 else (foldr (step odd) Nothing [2..])
== Just 1
This evaluation stops in a finite number of steps, in spite of the fact that the list [0..] is infinite, so we know that we're not traversing the whole list. In addition, there is an upper bound on the complexity of the expressions at each step, which translates into a constant upper bound on the memory required to evaluate this.
The key here is that the step function that we're folding with has this property: no matter what the values of x and next are, it will either:
Evaluate to Just x, without invoking the next thunk, or
Tail-call the next thunk (in effect, if not literally).

How to concisely express function iteration?

Is there a concise, idiomatic way how to express function iteration? That is, given a number n and a function f :: a -> a, I'd like to express \x -> f(...(f(x))...) where f is applied n-times.
Of course, I could make my own, recursive function for that, but I'd be interested if there is a way to express it shortly using existing tools or libraries.
So far, I have these ideas:
\n f x -> foldr (const f) x [1..n]
\n -> appEndo . mconcat . replicate n . Endo
but they all use intermediate lists, and aren't very concise.
The shortest one I found so far uses semigroups:
\n f -> appEndo . times1p (n - 1) . Endo,
but it works only for positive numbers (not for 0).
Primarily I'm focused on solutions in Haskell, but I'd be also interested in Scala solutions or even other functional languages.

Because Haskell is influenced by mathematics so much, the definition from the Wikipedia page you've linked to almost directly translates to the language.
Just check this out:
Now in Haskell:
iterateF 0 _ = id
iterateF n f = f . iterateF (n - 1) f
Pretty neat, huh?
So what is this? It's a typical recursion pattern. And how do Haskellers usually treat that? We treat that with folds! So after refactoring we end up with the following translation:
iterateF :: Int -> (a -> a) -> (a -> a)
iterateF n f = foldr (.) id (replicate n f)
or point-free, if you prefer:
iterateF :: Int -> (a -> a) -> (a -> a)
iterateF n = foldr (.) id . replicate n
As you see, there is no notion of the subject function's arguments both in the Wikipedia definition and in the solutions presented here. It is a function on another function, i.e. the subject function is being treated as a value. This is a higher level approach to a problem than implementation involving arguments of the subject function.
Now, concerning your worries about the intermediate lists. From the source code perspective this solution turns out to be very similar to a Scala solution posted by #jmcejuela, but there's a key difference that GHC optimizer throws away the intermediate list entirely, turning the function into a simple recursive loop over the subject function. I don't think it could be optimized any better.
To comfortably inspect the intermediate compiler results for yourself, I recommend to use ghc-core.

In Scala:
Function chain Seq.fill(n)(f)
See scaladoc for Function. Lazy version: Function chain Stream.fill(n)(f)

Although this is not as concise as jmcejuela's answer (which I prefer), there is another way in scala to express such a function without the Function module. It also works when n = 0.
def iterate[T](f: T=>T, n: Int) = (x: T) => (1 to n).foldLeft(x)((res, n) => f(res))
To overcome the creation of a list, one can use explicit recursion, which in reverse requires more static typing.
def iterate[T](f: T=>T, n: Int): T=>T = (x: T) => (if(n == 0) x else iterate(f, n-1)(f(x)))
There is an equivalent solution using pattern matching like the solution in Haskell:
def iterate[T](f: T=>T, n: Int): T=>T = (x: T) => n match {
case 0 => x
case _ => iterate(f, n-1)(f(x))
}
Finally, I prefer the short way of writing it in Caml, where there is no need to define the types of the variables at all.
let iterate f n x = match n with 0->x | n->iterate f (n-1) x;;
let f5 = iterate f 5 in ...

I like pigworker's/tauli's ideas the best, but since they only gave it as a comments, I'm making a CW answer out of it.
\n f x -> iterate f x !! n
or
\n f -> (!! n) . iterate f
perhaps even:
\n -> ((!! n) .) . iterate

Difference in asymptotic time of two variants of flatten

I am going through the Scala by Example document and I am having trouble with exercise 9.4.2. Here is the text:
Exercise 9.4.2 Consider the problem of writing a function flatten, which takes a list of element lists as arguments. The result of flatten should be the concatenation of all element lists into a single list. Here is an implementation of this method in terms of :\.
def flatten[A](xs: List[List[A]]): List[A] =
(xs :\ (Nil: List[A])) {(x, xs) => x ::: xs}
Consider replacing the body of flatten by
((Nil: List[A]) /: xs) ((xs, x) => xs ::: x)
What would be the difference in asymptotic complexity between the two versions of flatten?
In fact flatten is predefined together with a set of other userful function in an object
called List in the standatd Scala library. It can be accessed from user program by calling List.flatten. Note that flatten is not a method of class List – it would not make sense there, since it applies only to lists of lists, not to all lists in general.
I do not see how the asymptotic time of these two function variants are different. I'm sure it's because I am missing something fundamental about the meaning of fold left and fold right.
Here is a pdf of the document I am describing:
http://www.scala-lang.org/docu/files/ScalaByExample.pdf
I am generally finding this an excellent introduction into Scala.

Look at the implementation of concatenation ::: (p.68) (the rest of answer is masked with spoiler-tags, mouse-over to read !)
Witness that it's linear (in ::) in the size of the left argument (the list that ends up being the prefix of the result).
Assume (for the sake of the complexity analysis) that your list of lists contains n equal-sized small lists of size a fixed constant k, k<n. If you use foldLeft, you compute:
f (... (f (f a b1) b2) ...) bn
Where f is the concatenation. If you use foldRight:
f a1 (f a2 (... (f an b) ...))
With again f standing for the prefix notation of concatenation. In the second case it's easy : you add k elements at the head each time, so you do (k*n cons).
For the first case (foldLeft), in the first concatenation, the list (f a b1) is of size k. You add it on the second round to b2 to form (f (f a b1) b2) of size 2k ... You do (k+(k+k)+(3k)+... = k*sum_{i=1}^n(i) = k*n(n+1)/2 cons).
(Followup question : is this the only parameter that should be taken into account while thinking of the efficiency of that function ? Doesn't foldLeft have an advantage -not asymptotic complexity- that foldRight doesn't ?)

We Keep Coding

iphone swift flutter scala powershell matlab mongodb postgresql perl eclipse

Faster code for 'distinct' on lists - scala

Related

What kind of morphism is `filter` in category theory?

Efficient implementation of Catamorphisms in Scala

Why does Haskell's foldr NOT stackoverflow while the same Scala implementation does?

How to concisely express function iteration?

Difference in asymptotic time of two variants of flatten

Categories

Resources