Purely functional data structures for text editors - scala

What would be good purely functional data structures for text editors? I want to be able to insert single characters into the text and delete single characters from the text with acceptable efficiency, and I would like to be able to hold on to old versions, so I can undo changes with ease.
Should I just use a list of strings and reuse the lines that don't change from version to version?

I don't know whether this suggestion is "good" for sophisticated definitions of "good", but it's easy and fun. I often set an exercise to write the core of a text editor in Haskell, linking with rendering code that I provide. The data model is as follows.
First, I define what it is to be a cursor inside a list of x-elements, where the information available at the cursor has some type m. (The x will turn out to be Char or String.)
type Cursor x m = (Bwd x, m, [x])
This Bwd thing is just the backward "snoc-lists". I want to keep strong spatial intuitions, so I turn things around in my code, not in my head. The idea is that the stuff nearest the cursor is the most easily accessible. That's the spirit of The Zipper.
data Bwd x = B0 | Bwd x :< x deriving (Show, Eq)
I provide a gratuitous singleton type to act as a readable marker for the cursor...
data Here = Here deriving Show
...and I can thus say what it is to be somewhere in a String
type StringCursor = Cursor Char Here
Now, to represent a buffer of multiple lines, we need Strings above and below the line with the cursor, and a StringCursor in the middle, for the line we're currently editing.
type TextCursor = Cursor String StringCursor
This TextCursor type is all I use to represent the state of the edit buffer. It's a two layer zipper. I provide the students with code to render a viewport on the text in an ANSI-escape-enabled shell window, ensuring that the viewport contains the cursor. All they have to do is implement the code that updates the TextCursor in response to keystrokes.
handleKey :: Key -> TextCursor -> Maybe (Damage, TextCursor)
where handleKey should return Nothing if the keystroke is meaningless, but otherwise deliver Just an updated TextCursor and a "damage report", the latter being one of
data Damage
= NoChange -- use this if nothing at all happened
| PointChanged -- use this if you moved the cursor but kept the text
| LineChanged -- use this if you changed text only on the current line
| LotsChanged -- use this if you changed text off the current line
deriving (Show, Eq, Ord)
(If you're wondering what the difference is between returning Nothing and returning Just (NoChange, ...), consider whether you also want the editor to go beep.) The damage report tells the renderer how much work it needs to do to bring the displayed image up to date.
The Key type just gives a readable dataype representation to the possible keystrokes, abstracting away from the raw ANSI escape sequences. It's unremarkable.
I provide the students with a big clue about to go up and down with this data model by offering these pieces of kit:
deactivate :: Cursor x Here -> (Int, [x])
deactivate c = outward 0 c where
outward i (B0, Here, xs) = (i, xs)
outward i (xz :< x, Here, xs) = outward (i + 1) (xz, Here, x : xs)
The deactivate function is used to shift focus out of a Cursor, giving you an ordinary list, but telling you where the cursor was. The corresponding activate function attempts to place the cursor at a given position in a list:
activate :: (Int, [x]) -> Cursor x Here
activate (i, xs) = inward i (B0, Here, xs) where
inward _ c#(_, Here, []) = c -- we can go no further
inward 0 c = c -- we should go no further
inward i (xz, Here, x : xs) = inward (i - 1) (xz :< x, Here, xs) -- and on!
I offer the students a deliberately incorrect and incomplete definition of handleKey
handleKey :: Key -> TextCursor -> Maybe (Damage, TextCursor)
handleKey (CharKey c) (sz,
(cz, Here, cs),
ss)
= Just (LineChanged, (sz,
(cz, Here, c : cs),
ss))
handleKey _ _ = Nothing
which just handles ordinary character keystrokes but makes the text come out backwards. It's easy to see that the character c appears right of Here. I invite them to fix the bug and add functionality for the arrow keys, backspace, delete, return, and so on.
It may not be the most efficient representation ever, but it's purely functional and enables the code to conform concretely to our spatial intuitions about the text that's being edited.

A Vector[Vector[Char]] would probably be a good bet. It is an IndexedSeq so has decent update / prepend / update performance, unlike the List you mention. If you look at Performance Characteristics, it's the only immutable collection mentioned that has effective constant-time update.

We use a text zipper in Yi, a serious text editor implementation in Haskell.
The implementation of the immutable state types is described in the following,
http://publications.lib.chalmers.se/records/fulltext/local_94979.pdf
http://publications.lib.chalmers.se/records/fulltext/local_72549.pdf
and other papers.

Fingertrees
Ropes
scala.collection.immutable.IndexSeq

I'd suggest to use zippers in combination with Data.Sequence.Seq which is based on finger trees. So you could represent the current state as
data Cursor = Cursor { upLines :: Seq Line
, curLine :: CurLine
, downLines :: Seq Line }
This gives you O(1) complexity for moving cursor up/down a single line, and since splitAt and (><) (union) have both O(log(min(n1,n2))) complexity, you'll get O(log(L)) complexity for skipping L lines up/down.
You could have a similar zipper structure for CurLine to keep a sequence of character before, at and after the cursor.
Line could be something space-efficient, such as ByteString.

I've implemented a zipper for this purpose for my vty-ui library. You can take a look here:
https://github.com/jtdaugherty/vty-ui/blob/master/src/Graphics/Vty/Widgets/TextZipper.hs

The Clojure community is looking at RRB Trees (Relaxed Radix Balanced) as a persistent data strcuture for vectors of data that can be efficiently concatenated / sliced / inserted etc.
It allows concatenation, insert-at-index and split operations in O(log N) time.
I imagine a RRB Tree specialised for character data would be perfectly suited for large "editable" text data structures.

The possibilities that spring to mind are:
The "Text" type with a numerical index. It keeps text in a linked list of buffers (internal representation is UTF16), so while in theory its computational complexity is usually that of a linked list (e.g. indexing is O(n)), in practice its so much faster than a conventional linked list that you could probably just forget about the impact of n unless you are storing the whole of Wikipedia in your buffer. Try some experiments on 1 million character text to see if I'm right (something I haven't actually done, BTW).
A text zipper: store the text after the cursor in one text element, and the text before the cursor in another. To move the cursor transfer text from one side to the other.

Related

Return a list of characters from a mixed list

I am trying to write a Lisp function to return a list of characters (no repeats) from a list (with ints, characters, etc). I'm still a beginner with Lisp and am having trouble starting. Our prof mentioned using atom but I can't figure out what she meant. Here is the question:
"Write a lisp function that accepts a list as the input argument (the list is mixed up integers, decimals,
characters and nested lists) and creates a list including all the characters in the original list without any
duplication. Sample program output is shown below:
‘((z f) (b a 5 3.5) 6 (7) (a) c) -> (z f b a c)
‘( (n) 2 (6 h 7.8) (w f) (n) (c) n) -> (h w f c n) "
What your assignment calls “characters” are actually symbols with a name of length 1. It seems that you can just mentally replace the word “characters” with “symbols” and work with this.
An atom is anything that is not a cons—any non-empty list consists of a chain of conses. For example, symbols, numbers, strings, and nil are atoms.
A cons (actually a cons cell) is a simple datastructure that can hold two things. In a list, the first thing of each cons is some list element, and the second either a pointer to the next cons or nil. You can also have lists as list elements; then also the first thing would be a pointer to a list. This would then formally be a tree. The accessor function for the first thing of a cons is called car or first, the accessor function for the other thing is called cdr or rest. Car and cdr are a bit archaic, and mainly used when you see the cons cell as a tree node, while first and rest are more modern, and mainly used when you see the cons cell as a list chain link.
You can test whether a thing is an atom with the function atom. If it is not an atom, it is a list with at least one element.
Your assignment has a few parts:
Walk the tree to look at each element. This can be done through recursion, or through looping in one direction and recursing in the other.
Keep a list of symbols that you already found.
If the element you look at is a symbol (with a name of length 1…), then check whether it is new, if yes, add it to your list.
Finally return that list.
One useful idiom is to use push or pushnew, which put new elements at the front of the list, and at the end reverse it.

Using match with user defined types in PL Racket

The following PL code does not work under #lang pl:
Edited code according to Alexis Kings answer
(define-type BINTREE
[Leaf Number]
[Node BINTREE BINTREE])
(: retrieve-leaf : BINTREE -> Number)
(define (retrieve-leaf btree)
(match btree
[(Leaf number) number])
What i'd like to achieve is as follows:
Receive a BINTREE as input
Check whether the tree is simply a leaf
Return the leaf numerical value
This might be a basic question but how would I go about solving this?
EDIT: The above seems to work if cases is used instead of match.
Why is that?
As you've discovered, match and cases are two similar but separate
things. The first is used for general Racket values, and the second is
used for things that you defined with define-type. Unfortunately,
they don't mix well in either direction, so if you have a defined type
then you need to use cases.
As for the reason for that, it's kind of complicated... One thing is
that the pl language was made well before match was powerful enough
to deal with arbitrary values conveniently. It does now, but it cannot
be easily tweaked to do what cases does: the idea behind define-type
is to make programming simple by making it mandatory to use just
cases for such values --- there are no field accessors, no predicates
for the variants (just for the whole type), and certainly no mutation.
Still, it is possible to do anything you need with just cases. If you
read around, the core idea is to mimic disjoint union types in HM
languages like ML and Haskell, and with only cases pattern matching
available, many functions are easy to start since there's a single way
to deal with them.
match and Typed Racket got closer to being able to do these things,
but it's still not really powerful enough to do all of that --- which is
why cases will stay separate from match in the near future.
As a side note, this is in contrast to what I want --- I know that this
is often a point of confusion, so I'd love to have just match used
throughout. Maybe I'll break at some point and hack things so that
cases is also called match, and the contents of the branches would
be used to guess if you really need the real match or the cases
version. But that would really be a crude hack.
I think you're on the right track, but your match syntax isn't correct. It should look like this:
(: retrieve-leaf : BINTREE -> Number)
(define (retrieve-leaf btree)
(match btree
[(Leaf number) number]))
The match pattern clauses must be inside the match form. Additionally, number is just a binding, not a procedure, so it doesn't need to be in parens.

Having `f =` move to either => or ⇒

I edit a lot of Scala code in Vim, which means I hit f = a lot, as I head to the RHS of a case statement (or whatever else):
case PatternMatch(a, b, c) => RHS Here
But Scala supports unicode characters, which means that a lot of people will use ⇒ instead of => and that makes f ... a pain in the butt. Does anyone know how if there's a way to make f = move to the next = or ⇒, whichever comes first?
Try adding this to your vimrc:
nmap f= :call search('=\\|⇒')<CR>
That will alias (map) f= to a call to the search function that will jump to the next = or ⇒.
bundacia's answer is the one you asked for. Another option is to use digraphs. You can type ⇒ in vim by typing <C-k>=> in insert mode. You can also use this digraph with the f command (e.g. f<C-k>=>). No need for mappings etc. though it may be somewhat inconvenient to type.
I prefer this solution since it preserves other operator pending commands such as df=, cf=, etc. and allows the use of F, t, etc.

What are circular lists good for (in Lisp or Scheme)?

I note that Scheme and Lisp (I guess) support circular lists, and I have used circular lists in C/C++ to 'simplify' the insertion and deletion of elements, but what are they good for?
Scheme ensures that they can be built and processed, but for what?
Is there a 'killer' data structure that needs to be circular or tail-circular?
Saying it supports 'circular lists' is a bit much. You can build all kinds of circular data structures in Lisp. Like in many programming languages. There is not much special about Lisp in this respect. Take your typical 'Algorithms and Datastructure' book and implement any circular data structure: graphs, rings, ... What some Lisps offer is that one can print and read circular data structures. The support for this is because in typical Lisp programming domains circular data structures are common: parsers, relational expressions, networks of words, plans, ...
It is quite common that data structures contain cycles. Real 'circular lists' are not that often used. For example think of a task scheduler which runs a task and after some time switches to the next. The list of tasks can be circular so that after the 'last' task the scheduler takes the 'first' task. In fact there is no 'last' and 'first' - it is just a circular list of tasks and the scheduler runs them without end. You could also have a list of windows in a window system and with some key command you would switch to the next window. The list of windows could be circular.
Lists are useful when you need a cheap next operation and the size of the data structure is unknown in advance. You can always add another node to the list or remove a node from a list. Usual implementations of lists make getting the next node and adding/removing an item cheap. Getting the next element from an array is also relatively simple (increase the index, at the last index go to the first index), but adding/removing elements usually needs more expensive shift operations.
Also since it is easy to build circular data structures, one just might do it during interactive programming. If you then print a circular data structure with the built-in routines it would be a good idea if the printer can handle it, since otherwise it may print a circular list forever...
Have you ever played Monopoly?
Without playing games with counters and modulo and such, how would you represent the Monopoly board in a computer implementation of the game? A circular list is a natural.
For example a double linked list data structure is "circular" in the Scheme/LISP point of view, i.e. if you try to print the cons-structure out you get backreferences, i.e. "cycles". So it's not really about having data structures that look like "rings", any data structure where you have some kind of backpointers is "circular" from the Scheme/LISP perspective.
A "normal" LISP list is single linked, which means that a destructive mutation to remove an item from inside the list is an O(n) operation; for double linked lists it is O(1). That's the "killer feature" of double linked lists, which are "circular" in the Scheme/LISP context.
Adding and removing elements to the beginning of a list is cheap. To
add or remove an element from the end of a list, you have to traverse
the whole list.
With a circular list, you can have a sort of fixed-length queue.
Setup a circular list of length 5:
> (import (srfi :1 lists))
> (define q (circular-list 1 2 3 4 5))
Let's add a number to the list:
> (set-car! q 6)
Now, let's make that the last element of the list:
> (set! q (cdr q))
Display the list:
> (take q 5)
(2 3 4 5 6)
So you can view this as a queue where elements enter at the end of the list and are removed from the head.
Let's add 7 to the list:
> (set-car! q 7)
> (set! q (cdr q))
> (take q 5)
(3 4 5 6 7)
Etc...
Anyways, this is one way that I've used circular-lists.
I use this technique in an OpenGL demo which I ported from an example in the Processing book.
Ed
One use of circular lists is to "repeat" values when using the srfi-1 version of map. For example, to add val to each element of lst, we could write:
(map + (circular-list val) lst)
For example:
(map + (circular-list 10) (list 0 1 2 3 4 5))
returns:
(10 11 12 13 14 15)
Of course, you could do this by replacing + with (lambda (x) (+ x val)), but sometimes the above idiom can be handier. Note that this only works with the srfi-1 version of map, which can accept lists of different sizes.

Is there a diff algorithm that preserves line ownership

My goal is coming up with a script to track the point a line was added, even if the line is subsequently modified or moved around (both of which confuse traditional vcs 'blame' scripts. I've done some minor background research (see bottom) but didn't find anything useful. I have a concept for how to proceed but the runtime would be atrocious (there's a factorial involved).
The two missing features are tracking edited-in-place lines separate from a deletion-and-addition of that line, and tracking entire functions moved around so they're in different hunks. For those experienced with diff but unfamiliar with the terminology, a subsequence is a contiguous group of + or - lines, with a type of either delete (all -), add (all +), or replace (a combination). I need more information, on moves and edit-in-place lines, vaguely alluded to in an entry on c2: DiffAlgorithm (paragraph starts with "My favorite mode"). Does anyone know what that is? (seems to be based on Tichy, see bottom.)
Here's more info on the two missing features:
no concept of a change on a line, (a fourth type, something like edit-in-place). In this hunk, the parent of 'bc' is 'b' but 'd' is new and isn't a descendant of 'b':
a
-b
+bc
+d
The workaround for this isn't too complicated, if the position of edits is the same (just an expanded version of markup_instraline_changes but comparing edit distance on all equal-sized subsets of old and new lines.
no concept of "moving" code that preserves the ownership of the lines, e.g. this diff shouldn't alter the ownership of "line", although its position changes.
a
-line
c
+line
This could be dealt with in the same way but with much worse runtime (instead of only checking single blocks marked 'replace', you'd need to check Levenshtein distance between all added against all removed lines) and with likely false positives (some, like whitespace-only lines, aren't relevant to my problem).
Research I've done: reading about gestalt pattern matching (Ratcliff and Obershelp, used in Python's difflib) and An O(ND) Difference Algorithm and its Variations (EW Myers).
After posting the question, I found references to Tichy84 which appears to be The string-to-string correction problem with block moves (which I haven't read yet) according to Walter Tichy's paper a year later on RCS
You appear to be interested in origin tracking, the problem of tracing where a line came from.
Ideally, you'd instrument the editor to remember how things were edited, and store the edits with the text in your repository, thus solving the problem trivially, but none of us software engineers seem to be smart enough to implement this simple idea.
As a weak substitute, one can look at a sequence of source code revisions from the repository and reconstruct a "plausible" history of changes. This is what you seem to be doing by proposing the use of "diff". As you've noted, diff doesn't understand the idea of "moving" or "copying".
SD Smart Differencer tools compare source text by parsing the text according to the langauge it is in, discovering the code structures, and computing least-Levensthein differences in terms of programming language constructs (identifiers, expressions, statements, blocks, classes, ...) and abstract editing operators "insert", "delete", "copy", "move" and "rename identifier within a scope". They produce diff-like output, a little richer because they tell you line/column -> line/column with different editing operations.
Obviously the "move" and "copy" edits are the ones most interesting to you in terms of tracking specific lines (well, specific language constructs). Our experience is that code goes through lots of copy and edits, too, which I suspect won't surprise you.
These tools are in Beta, and are presently available for COBOL, Java and C#. Lots of other langauges are in the pipe, because the SmartDifferencer is built on top of a langauge-parameterized infrastructure, DMS Software Reengineering Toolkit, which has quite a number of already existing, robust langauge grammars.
I think the idea of what amount of editing a line that can be done while it remains a descendent of some previously written line is very subjective, and based on context, both things that a computer cannot work with. You'd have to specify some sort of configurable minimum similarity on lines in your program I think... The other problem is that it is entirely possible for two identical lines to be written completely independently (for example incrementing the value of some variable), and this will be be quite a common thing, so your desired algorithm won't really give truthful or useful information about a line quite often.
I would like to suggest an algorithm for this though (which makes tons of hopefully obvious assumptions by the way) so here goes:
Convert both texts to lists of lines
Copy the lists and Strip all whitespace from inside of each line
Delete blank lines from both lists
Repeat
Do a Levenshtein distance from the old to new lists ...
... keeping all intermediate data
Find all lines in the new text that were matched with old lines
Mark the line in both new/old original lists as having been matched
Delete the line from the new text (the copy)
Optional: If some matched lines are in a contiguous sequence ...
... in either original text assign them to a grouping as well!
Until there is nothing left but unmatchable lines in the new text
Group together sequences of unmatched lines in both old and new texts ...
... which are contiguous in the original text
Attribute each with the line match before and after
Run through all groups in old text
If any match before and after attributes with new text groups for each
//If they are inside the same area basically
Concatenate all the lines in both groups (separately and in order)
Include a character to represent where the line breaks are
Repeat
Do a Levenshtein distance on these concatenations
If there are any significantly similar subsequences found
//I can't really define this but basically a high proportion
//of matches throughout all lines involved on both sides
For each matched subsequence
Find suitable newline spots to delimit the subsequence
Mark these lines matched in the original text
//Warning splitting+merging of lines possible
//No 1-to-1 correspondence of lines here!
Delete the subsequence from the new text group concat
Delete also from the new text working list of lines
Until there are no significantly similar subsequences found
Optional: Regroup based on remaining unmatched lines and repeat last step
//Not sure if there's any point in trying that at the moment
Concatenate the ENTIRE list of whitespaced-removed lines in the old text
Concatenate the lines in new text also (should only be unmatched ones left)
//Newline character added in both cases
Repeat
Do Levenshtein distance on these concatenations
Match similar subsequences in the same way as earlier on
//Don't need to worry deleting from list of new lines any more though
//Similarity criteria should be a fair bit stricter here to avoid
// spurious matchings. Already matched lines in old text might have
// even higher strictness, since all of copy/edit/move would be rare
While you still have matchings
//Anything left unmatched in the old text is deleted stuff
//Anything left unmatched in the new text is newly written by the author
Print out some output to show all the comparing results!
Well, hopefully you can see the basics of what I mean with that completely untested algorithm. Find obvious matches first, and verbatim moves of chunks of decreasing size, then compare stuff that's likely to be similar, then look for anything else which is similar, but both modified and moved: probably just coincidentally similar.
Well, if you try implementing this, tell me how it works out, and what details you changed, and what kind of assignments you made to the various variables involved... I expect there will be some test cases where it works brilliantly and others where it just abyssmally fails due to some massive oversight. The idea is that most stuff will be matched before you get to the inefficient final loop, and indeed the previous one