How can I add state to a lexer in Racket? - racket

I am trying to write a lexer for the sed language in Racket (e.g. "s/find/replace/"). One problem I keep running into is that many of the tokens have no fixed form. For example, I can write the above example as "ssfindsreplaces", where the letter 's' is used as the divider instead of '/'.
I've started writing a lexer such as,
(define sed-lexer
  (lexer-srcloc
   ["\n" (token 'NEWLINE lexeme)]
   ["/" (token 'DIVIDER lexeme)]
   [(:or "s" "y" "d" "p") (token 'CMD lexeme)]
   [(:* (complement "/")) (token 'LITERAL lexeme)]))
but this fails on multiple levels:
The command can only come at the beginning (in this simplified example). After a command has been read I want to ignore the command case until a newline.
The DIVIDER token can't be set always as a slash.
I can imagine a solution to this problem could be adding states to this lexer. So for example, the lexer starts in a 'start' state where it looks for a command, then it goes to the 'divider1' state, looking for what will be the divider character. Such a feature seems to exist here http://pygments.org/docs/lexerdevelopment/. What would be the best way to go about solving this problem given the tools in the Racket ecosystem?

A lexer is simply a function that consumes an input port and returns a token. If (br-)parser-tools/lex is not enough for you, you can write it yourself (it should not be difficult).
In theory, finite state machines and regular expressions are equally expressive, so I think you can in fact use parser-tools/lex to accomplish what you want. It will just be really tedious, because you need to split cases over all possible dividers (pure regular expressions have no backreferences). I think the Pygments lexer you mentioned will have a similar problem.
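To make the hand-written, stateful approach concrete, here is a minimal sketch in Python (chosen only for brevity; the same structure carries over to a Racket function that reads from an input port). The state names, the token names, and the function lex_sed_line are all hypothetical. The sketch assumes a simplified sed in which only s and y take two divider-delimited fields, and it ignores backslash escapes.

```python
from typing import Iterator, Tuple

Token = Tuple[str, str]  # (token name, lexeme)


def lex_sed_line(line: str) -> Iterator[Token]:
    """Tokenize one simplified sed command such as 's/find/replace/g'."""
    state = "start"
    divider = ""      # learned from the character right after the command
    literal = []      # characters of the field currently being collected
    fields_left = 0   # divider-terminated fields still expected
    for ch in line:
        if state == "start":
            if ch not in "sydp":
                raise ValueError(f"unknown command: {ch!r}")
            yield ("CMD", ch)
            # Assumption: only s and y are followed by delimited fields.
            state = "divider" if ch in "sy" else "flags"
        elif state == "divider":
            divider = ch
            fields_left = 2
            yield ("DIVIDER", ch)
            state = "literal"
        elif state == "literal":
            if ch == divider:
                yield ("LITERAL", "".join(literal))
                yield ("DIVIDER", ch)
                literal = []
                fields_left -= 1
                if fields_left == 0:
                    state = "flags"
            else:
                literal.append(ch)
        else:  # state == "flags"
            yield ("FLAG", ch)
    if state in ("divider", "literal"):
        raise ValueError("unterminated command")
```

With this sketch, list(lex_sed_line("ssfindsreplaces")) treats the second s as the divider, so the command letter is only recognized in the start state and the divider is whatever character follows it.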
Another possibility is to use something more powerful than regular expressions. As the sed grammar is pretty simple, you can even parse it right away without lexing first. Here's a crappy version I wrote quickly using megaparsack, a parser combinator library:
#lang racket
(require megaparsack megaparsack/text
data/monad data/applicative)
(struct substitution (search replace flags) #:transparent)
(define substitution/p (do (char/p #\s)
[divider <- any-char/p]
[search <- (many/p (char-not/p divider))]
(char/p divider)
[replace <- (many/p (char-not/p divider))]
(char/p divider)
[flags <- (many/p (char-in/p "gIp"))]
(pure (substitution search replace flags))))
(define dummy-command/p (string/p "dummy-command"))
(define line/p (or/p substitution/p
dummy-command/p))
(define program/p (do [result <- (many/p line/p #:sep (char/p #\newline))]
eof/p
(pure result)))
(pretty-print
(parse-result!
(parse-string program/p
"s/hello/world/\ndummy-command\ns|search|replace|gp")))
#|
Result:
(list
(substitution '(#\h #\e #\l #\l #\o) '(#\w #\o #\r #\l #\d) '())
"dummy-command"
(substitution
'(#\s #\e #\a #\r #\c #\h)
'(#\r #\e #\p #\l #\a #\c #\e)
'(#\g #\p)))
|#

Related

Can a macro be used to make c[...]r combinations with any arbitrary number of car and cdr calls, such as cadaddr?

I recently discovered that all of my implementations of Scheme throw an error when I try to use (cadaddr (list 1 3 (list 5 7) 9)). Apparently, by default Scheme does not allow any car and cdr combinations in the single-function form that use more than four abbreviated car and cdr calls. I originally blamed this on Scheme's minimalism, but then I discovered that Common Lisp also shares this defect.
Can this be solved with a macro? Can we write a macro that allows an arbitrary number of a and d in its c[...]r calls and returns the expected result, while also having the Common Lisp-like compatibility with macros like setf? If not, why not? And if so, has a reason ever been given for why this is not a default feature in any Lisp that I've seen?
Such a macro is described in Let Over Lambda for Common Lisp. You must wrap your code with (with-cxrs ...) to bring them all into scope, but it walks your code to see which combinators you need. I wrote a Clojure port of it years ago for fun, though of course nobody (including me) has ever wanted to use it for real. You could port it to Scheme if you liked.
(defn cxr-impl [name]
  (when-let [op (second (re-matches #"c([ad]+)r" name))]
    `(comp ~@(map {\a `first \d `rest} op))))
(defmacro with-cxrs [& body]
  (let [symbols (remove coll? (tree-seq coll? seq body))]
    `(let [~@(for [sym symbols
                   :let [impl (cxr-impl (name sym))]
                   :when impl
                   thing [sym impl]]
               thing)]
       ~@body)))
user> (macroexpand-1 '(with-cxrs (inc (caadaaddadr x))))
(let [caadaaddadr (comp first first rest first first rest rest first rest)]
(inc (caadaaddadr x)))
https://groups.google.com/g/clojure/c/CanBrJPJ4aI/m/S7wMNqmj_Q0J
As noted in the mailing list thread, there are some bugs you'd have to work out if you wanted to use this for real.

What are limitations of reader macros in Common Lisp

I have my own Lisp interpreter in JavaScript that I have been working on for some time, and now I want to implement reader macros like in Common Lisp.
I've created Streams (almost working, except for special symbols like ,@ , ` ') but it freezes the browser for a few seconds when loading a page with included scripts (Lisp files that have 400 lines of code). This is because my Streams are based on the substring function. If I first split the input into tokens and then use a TokenStream that iterates over tokens, it works fine.
So my question is this: are string streams really something that exists in Common Lisp? Can you add reader macros that create a whole new syntax, like Python inside CL? To simplify the question: can I implement a """ macro (not sure if you can have 3 characters as a reader macro), or some other character that implements template literals inside Lisp, for instance:
(let ((foo 10) (bar 20))
{lorem ipsum ${baz} and ${foo}})
or
(let ((foo 10) (bar 20))
""lorem ipsum ${baz} and ${foo}"")
or
(let ((foo 10) (bar 20))
:"lorem ipsum ${baz} and ${foo}")
would yield the string
"lorem ipsum 10 and 20"
Is something like this possible in Common Lisp, and how hard would it be to implement #\{ or #\: as a reader macro?
The only way I can think of to have template literals in Lisp is something like this:
(let ((foo 10) (bar 20))
  (tag "lorem ipsum ${baz} and ${foo}"))
where tag is a macro that returns a string, with the ${} references as free variables. Can a reader macro also return Lisp code that is evaluated?
And another question: can you implement reader macros like this:
(list :foo:bar)
(list foo:bar)
where : is a reader macro, and if it comes before a symbol it converts the symbol to
foo.bar
and if it's inside a symbol it throws an error. I'm asking this because with token-based macros :foo:bar and foo:bar will be symbols and will not be processed by my reader macros.
And one more question: can a reader macro be defined on one line and used on the next? This will definitely only be possible with string streams and, from what I've tested, is not possible with an interpreter written in JavaScript.
There are some limitations in the sense that it is pretty hard to, for instance, intervene in the interpretation of tokens in any way short of 'implement your own token interpreter from scratch'. But, well, you could if you wanted to do just that: the problem is that your code would need to deal with numbers & things as well as the existing code does and things like floating-point parsing are pretty fiddly to get right.
But the macro functions associated with macro characters get the stream that is being read, and they are free to read as much or as little of the stream as they like, and return any kind of object (or no object, which is how comments are implemented).
I would strongly recommend reading chapters 2 & 23 of the hyperspec, and then playing with an implementation. When you play with the implementation be aware that it is just astonishingly easy to completely wedge things by mucking around with the reader. At the minimum I would suggest code like this:
(defparameter *my-readtable* (copy-readtable nil))
;;; Now muck around with *my-readtable*, *not* the default readtable
;;;
(defun experimentally-read (&key (stream *standard-input*)
                                 (readtable *my-readtable*))
  (let ((*readtable* readtable))
    (read stream)))
This gives you at least some chance to recover from catastrophe: if you can once abort experimentally-read then you are back in a position where *readtable* is something sensible.
Here is a fairly useless example which shows how much you can subvert the syntax with macro characters: a macro character definition which will cause ( ...) to be read as a string. This may not be fully debugged, and as I say I can see no use for it.
(defun mindless-parenthesized-string-reader (stream open-paren)
;; Cause parenthesized groups to be read as strings:
;; - (a b) -> "a b"
;; - (a (b c) d) -> "a (b c) d"
;; - (a \) b) -> "a ) b"
;; This serves no useful purpose that I can see. Escapes (with #\\)
;; and nested parens are dealt with.
;;
;; Real Programmers would write this with LOOP, but that was too
;; hard for me. This may well not be completely right.
(declare (ignore open-paren))
(labels ((collect-it (escaping depth accum)
(let ((char (read-char stream t nil t)))
(if escaping
(collect-it nil depth (cons char accum))
(case char
((#\\)
(collect-it t depth accum))
((#\()
(collect-it nil (1+ depth) (cons char accum)))
((#\))
(if (zerop depth)
(coerce (nreverse accum) 'string)
(collect-it nil (1- depth) (cons char accum))))
(otherwise
(collect-it nil depth (cons char accum))))))))
(collect-it nil 0 '())))
(defvar *my-readtable* (copy-readtable nil))
(set-macro-character #\( #'mindless-parenthesized-string-reader
nil *my-readtable*)
(defun test-my-rt (&optional (stream *standard-input*))
(let ((*readtable* *my-readtable*))
(read stream)))
And now
> (test-my-rt)
12
12
> (test-my-rt)
x
x
> (test-my-rt)
(a string (with some parens) and \) and the end)
"a string (with some parens) and ) and the end"

For-loop macro in Racket

This macro to implement a C-like for-loop in Lisp is mentioned on this page: https://softwareengineering.stackexchange.com/questions/124930/how-useful-are-lisp-macros
(defmacro for-loop [[sym init check change :as params] & steps]
  `(loop [~sym ~init value# nil]
     (if ~check
       (let [new-value# (do ~@steps)]
         (recur ~change new-value#))
       value#)))
So then one can use the following in code:
(for-loop [i 0 , (< i 10) , (inc i)]
(println i))
How can I convert this macro for use in Racket?
I am trying the following code:
(define-syntax (for-loop) (syntax-rules (parameterize ((sym) (init) (check) (change)) & steps)
`(loop [~sym ~init value# nil]
(if ~check
(let [new-value# (do ~@steps)]
(recur ~change new-value#))
value#))))
But it gives a "bad syntax" error.
The snippet of code you have included in your question is written in Clojure, which is one of the many dialects of Lisp. Racket, on the other hand, is descended from Scheme, which is quite a different language from Clojure! Both have macros, yes, but the syntax is going to be a bit different between the two languages.
The Racket macro system is quite powerful, but syntax-rules is actually a slightly simpler way to define macros. Fortunately, for this macro, syntax-rules will suffice. A more or less direct translation of the Clojure macro to Racket would look like this:
(define-syntax-rule (for-loop [sym init check change] steps ...)
(let loop ([sym init]
[value #f])
(if check
(let ([new-value (let () steps ...)])
(loop change new-value))
value)))
It could subsequently be invoked like this:
(for-loop [i 0 (< i 10) (add1 i)]
(println i))
There are a number of changes from the Clojure code:
The Clojure example uses ` and ~ (pronounced “quasiquote” and “unquote” respectively) to “interpolate” values into the template. The syntax-rules form performs this substitution automatically, so there is no need to explicitly perform quotation.
The Clojure example uses names that end in a hash (value# and new-value#) to prevent name conflicts, but Racket’s macro system is hygienic, so that sort of escaping is entirely unnecessary—identifiers bound within macros automatically live in their own scope by default.
The Clojure code uses loop and recur, but Racket supports tail recursion, so the translation just uses “named let”, which is really just some extremely simple sugar for an immediately invoked lambda that calls itself.
There are a few other minor syntactic differences, such as using let instead of do, using ellipses instead of & steps to mark multiple occurrences, the syntax of let, and the use of #f instead of nil to represent the absence of a value.
Finally, commas are not used in the actual use of the for-loop macro because , means something different in Racket. In Clojure, it is treated as whitespace, so it’s totally optional there, too, but in Racket, it would be a syntax error.
A full macro tutorial is well outside the scope of a single Stack Overflow post, though, so if you’re interested in learning more, take a look at the Macros section of the Racket guide.
It’s also worth noting that an ordinary programmer would not need to implement this sort of macro themselves, given that Racket already provides a set of very robust for loops and comprehensions built into the language. In truth, though, they are just defined as macros themselves—there is no special magic just because they are builtins.
Racket’s for loops do not look like traditional C-style for loops, however, because C-style for loops are extremely imperative. On the other hand, Scheme, and therefore Racket, tends to favor a functional style, which avoids mutation and often looks more declarative. Therefore, Racket’s loops attempt to describe higher-level iteration patterns, such as looping through a range of numbers or iterating through a list, rather than low-level semantics like describing how a value should be updated. Of course, if you really want something like that, Racket provides the do loop, which is almost identical to the for-loop macro defined above, albeit with some minor differences.
I want to expand on Alexis's excellent answer a bit. Here's an example usage that demonstrates what she means by do being almost identical to your for-loop:
(do ([i 0 (add1 i)])
((>= i 10) i)
(println i))
This do expression actually expands to the following code:
(let loop ([i 0])
(if (>= i 10)
i
(let ()
(println i)
(loop (add1 i)))))
The above version uses a named let, which is considered the conventional way to write loops in Scheme.
Racket also provides for comprehensions, also mentioned in Alexis's answer, which are likewise considered conventional. Here's what that would look like:
(for ([i (in-range 10)])
(println i))
(except that this doesn't actually return the final value of i).
I want to rewrite the macro from Alexis's excellent answer and Chris Jester-Young's excellent answer for people not familiar with let.
#lang racket
(define-syntax-rule (for-loop [var init check change] expr ...)
(local [(define (loop var value)
(if check
(loop change (begin expr ...))
value))]
(loop init #f)))
(for-loop [i 0 (< i 10) (add1 i)]
(println i))

Lazy reads of custom types in Racket

I'm new to Racket, and I am trying to write a function to read the lines of a file, parse each line into a struct, and return a lazy sequence of my data type. Here is a simple example of my input format (a matrix with row and column names). My actual input format also includes a header line, which I am omitting here, and consists of very large files, which is why I need the laziness.
R1 1.0 2.3 1.2
R2 1.2 3.1 3.4
Here is my latest attempt:
(struct row (key data))
(define (read-matrix in)
  (for [(line (in-lines in))]
    (let ([fields (string-split line "\t")])
      (row (first fields) (list->vector (map string->number (rest fields)))))))
I have also tried numerous other approaches including using call-with-input-file. My problem with the approach above is that if I use #lang racket it isn't lazy, and with #lang lazy string-split isn't defined. I should add that in my use case, the semantics I want is to close the port when the entire sequence has been consumed, because I can guarantee that either the whole sequence will be consumed, or the program will terminate.
So, am I on the right track? What approach should I take to solve this problem? Thanks!
I was composing this answer off-line, and came back to find you'd mostly answered it already. I'll post anyway in case the details are helpful to anyone.
If you really need #lang lazy, and want to use string-split, I think you can simply (require racket/string) to use it?
I'm not sure I understand exactly what you mean by "lazy", here. Using in-lines will not suck the entire file into memory, if that's your concern. It will process things one line at a time.
One thing you could do is define a helper function, that handles reading and parsing the line, checking for eof, and closing the input port automatically:
(struct row (key data)
#:transparent)
;; Example couple lines of input to use below.
(define text "R1 1.0 2.3 1.2\nR2 1.2 3.1 3.4")
;; read-matrix-row : input-port? -> (or/c eof row?)
;;
;; Given an input port, try to read another row.
(define (read-matrix-row in)
(match (read-line in)
[(? eof-object?)
(close-input-port in)
eof]
[line (match (string-split line " ")
[(cons key data)
(row key (list->vector (map string->number data)))])]))
You could use this function in a number of ways. One way is with in-producer:
;; Example use with in-producer:
(let ([in (open-input-string text)])
(for/list ([x (in-producer read-matrix-row eof in)])
x))
;; => (list (row "R1" '#(1.0 2.3 1.2))
;; (row "R2" '#(1.2 3.1 3.4)))
That example uses for/list to make a list. Of course, if you have a giant input file, that will yield a giant list. But you could display them one by one, or write them one by one to a file or database:
;; Example use, displaying one by one.
(let ([in (open-input-string text)])
(for ([x (in-producer read-matrix-row eof in)])
(displayln x))) ;or write to some file, for example
If instead you prefer a stream interface, it's easy to create a stream from any sequence, including in-producer:
;; If you prefer a stream interface, we can use sequence->stream to
;; transform the producer sequence into a stream:
(define (matrix-row-stream in)
(sequence->stream (in-producer read-matrix-row eof in)))
;; Example interactive use of the stream
(define stm (matrix-row-stream (open-input-string text)))
(stream-empty? stm) ;#f
(stream-first stm) ;(row "R1" '#(1.0 2.3 1.2))
(stream-empty? (stream-rest stm)) ;#f
(stream-first (stream-rest stm)) ;(row "R2" '#(1.2 3.1 3.4))
(stream-empty? (stream-rest (stream-rest stm))) ;#t
Try using the functions from SRFI-13, which is a string manipulating library also available in #lang lazy:
(require srfi/13)
And then do this:
[fields (string-tokenize line)]
Ultimately I found that the answer was to use Racket's sequence, streams, and generator libraries for this kind of thing. The generators are especially nice, allowing a simple Python-like "yield" function. These features allow lazy sequences without full-on lazy evaluation as provided by #lang lazy.
http://docs.racket-lang.org/reference/streams.html
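Since the answer above compares Racket's generators to Python's yield, here is what that model looks like on the Python side, as a rough sketch of the analogy rather than Racket code. The names Row and read_matrix are hypothetical, and the sketch assumes whitespace-separated fields as in the sample input. Each row is parsed only when the consumer asks for it, and the stream is closed once the sequence is exhausted.

```python
import io
from typing import Iterator, NamedTuple, TextIO, Tuple


class Row(NamedTuple):
    key: str
    data: Tuple[float, ...]


def read_matrix(stream: TextIO) -> Iterator[Row]:
    """Lazily yield one Row per line; close the stream when finished."""
    try:
        for line in stream:
            fields = line.split()
            if not fields:  # skip blank lines
                continue
            yield Row(fields[0], tuple(float(x) for x in fields[1:]))
    finally:
        stream.close()


# Stand-in for a large file: nothing is parsed until a row is requested.
source = io.StringIO("R1 1.0 2.3 1.2\nR2 1.2 3.1 3.4\n")
rows = read_matrix(source)
first = next(rows)      # parses only the first line
remaining = list(rows)  # consumes the rest, after which the stream is closed
```

In Racket, the corresponding tools are the racket/generator library or the in-producer and sequence->stream approach shown in the earlier answer.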

lisp code excerpt

I've been reading some Lisp code and came across this section. I didn't quite understand what it specifically does, though the whole function is supposed to count how many times the letters from a-z appear in an entered text.
(do ((i #.(char-code #\a) (1+ i)))
((> i #.(char-code #\z)))
Can anyone explain step by step what is happening? I know that it's somehow counting the letters but not quite sure how.
This Lisp code is slightly unusual, since it uses read-time evaluation. #.expr means that the expression will be evaluated only once, during read-time.
In this case a clever compiler might have guessed that the character code of a given character is known and could have removed the computation of character codes from the DO loop. The author of that code chose to do that by evaluating the expressions before the compiler sees it.
The source looks like this:
(do ((i #.(char-code #\a) (1+ i)))
((> i #.(char-code #\z)))
...)
When Lisp reads in the s-expression, we get this new code as the result (assuming a usual encoding of characters):
(do ((i 97 (1+ i)))
((> i 122))
...)
So that's a loop which counts the variable i up from 97 to 122.
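For comparison, the same iteration over the codes 97 to 122 can be sketched in Python. The loop body in the question is elided, so the counting logic and the name count_letters below are only an assumption about what the surrounding function does. Note also that Python computes ord("a") at run time, whereas #. makes the Lisp reader substitute the number before the compiler ever sees the form.

```python
from typing import Dict


def count_letters(text: str) -> Dict[str, int]:
    """Count how often each letter a-z occurs in text, ignoring case."""
    lowered = text.lower()
    counts = {}
    # Same bounds as the DO loop: from the code of 'a' up to the code of 'z'.
    for code in range(ord("a"), ord("z") + 1):
        letter = chr(code)
        counts[letter] = lowered.count(letter)
    return counts
```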
Lisp code is written as S-expressions. In typical S-expression syntax, the first element of any S-expression is treated as the operator and the rest as operands. An operand can be either an atom or another S-expression. Note that an atom is a single data object. Keeping this in mind:
char-code
(char-code #\a) - returns the character code (ASCII, on typical implementations) of a character, here 'a'.
The do syntax looks similar to the below
(do ((var1 init1 step1)
(var2 init2 step2)
...)
(end-test result)
statement1
...)
So in your example
(do ((i #.(char-code #\a) (1+ i)))
((> i #.(char-code #\z)))
)
The first s-expression operand of do is the loop initialization, the second s-expression operand is the end-test.
So this means you are simply iterating over 'a' through 'z' incrementing i by 1.
In C++ (not sure what other languages you are comfortable with), you can write
for(i='a';i<='z';i++);
The trick in the code you show is poor form. I know this because I do it all the time. The code assumes that the compiler will know the current fixnum for each character. #.(char-code #\a) eq [(or maybe eql if you are so inclined) unsigned small integer or unsigned 8 bit character with a return value of a positive fixnum].
The # is a reader macro (I'm fairly sure you know this :). Using two reader macros is not a great idea but it is fast when the compiler knows the datatype.
I have another example. Need to search for ASCII in a binary stream:
(defmacro code-char= (byte1 byte2)
  ;; Character arguments are converted to their codes at macroexpansion time.
  (flet ((maybe-char-code (x) (if (characterp x) (char-code x) x)))
    `(the boolean (= (the fixnum ,(maybe-char-code byte1))
                     (the fixnum ,(maybe-char-code byte2))))))
Declaring the return type in SBCL will probably insult the compiler, but I leave it as a sanity check (4 me not u).
(code-char= #\$ #x24)
=>
t
At least I think so. But somehow I think you might know your way around some macros ... Hmmmm... I should turn on the machine...
If you're seriously interested, there is some assembler for the 286 (8/16 bit DOS assembler) for which you can use a jump table. It works fast for the PC, I'd have to look it up...