How do I cache hash codes for an AST?

I am working on a language in F# and upon testing, I find that the runtime spends over 90% of its time comparing for equality. Because of that the language is so slow as to be unusable. During instrumentation, the GetHashCode function shows fairly high up on the list as a source of overhead. What is going on is that during method calls, I am using method bodies (Expr) along with the call arguments as keys in a dictionary and that triggers repeated traversals over the AST segments.
To improve performance I'd like to add memoization nodes in the AST.
type Expr =
    | Add of Expr * Expr
    | Lit of int
    | HashNode of int * Expr
In the above simplified example, what I would like is that the HashNode represent the hash of its Expr, so that the GetHashCode does not have to travel any deeper in the AST in order to calculate it.
That said, I am not sure how I should override the GetHashCode method. Ideally, I'd like to reuse the built-in hash function and make it ignore only the HashNode somehow, but I am not sure how to do that.
More likely, I am going to have to make my own hash function, but unfortunately I know nothing about hash functions so I am a bit lost right now.
An alternative idea that I have would be to replace nodes with unique IDs while keeping that hash function as it is, but that would introduce additional complexities into the code that I'd rather avoid unless I have to.

I needed a similar thing recently in TheGamma (GitHub) where I build a dependency graph (kind of like AST) that gets recreated very often (when you change code in editor and it gets re-parsed), but I have live previews that may take some time to calculate, so I wanted to reuse as much of the previous graph as possible.
The way I'm doing that is that I attach a "symbol" to each node. Two nodes with the same symbol are equal, which I think you could use for efficient equality testing:
type Expr =
    | Add of ExprNode * ExprNode
    | Lit of int
and ExprNode(expr:Expr, symbol:int) =
    member x.Expression = expr
    member x.Symbol = symbol
    override x.GetHashCode() = symbol
    override x.Equals(y) =
        match y with
        | :? ExprNode as y -> y.Symbol = x.Symbol
        | _ -> false
I keep a cache of nodes. The key is a code for the node kind (0 for Lit, 1 for Add, etc.) followed by the symbols of all nested nodes. For literals, I also add the number itself, which means that creating the same literal twice gives you the same node. So creating a node looks like this:
let node expr ctx =
    // Get the key from the kind of the expression
    // and symbols of all nested node in this expression
    let key =
        match expr with
        | Lit n -> [0; n]
        | Add(e1, e2) -> [1; e1.Symbol; e2.Symbol]
    // Return either a node from cache or create a new one
    match ListDictionary.tryFind key ctx with
    | Some res -> res
    | None ->
        let res = ExprNode(expr, nextId())
        ListDictionary.set key res ctx
        res
The ListDictionary module is a mutable dictionary where the key is a list of integers and nextId is the usual function to generate next ID:
open System.Collections.Generic
type ListDictionaryNode<'K, 'T> =
    { mutable Result : 'T option
      Nested : Dictionary<'K, ListDictionaryNode<'K, 'T>> }
type ListDictionary<'K, 'V> = Dictionary<'K, ListDictionaryNode<'K, 'V>>
[<CompilationRepresentation(CompilationRepresentationFlags.ModuleSuffix)>]
module ListDictionary =
    let tryFind ks dict =
        let rec loop ks node =
            match ks, node with
            | [], { Result = Some r } -> Some r
            | k::ks, { Nested = d } when d.ContainsKey k -> loop ks (d.[k])
            | _ -> None
        loop ks { Nested = dict; Result = None }
    let set ks v dict =
        let rec loop ks (dict:ListDictionary<_, _>) =
            match ks with
            | [] -> failwith "Empty key not supported"
            | k::ks ->
                if not (dict.ContainsKey k) then
                    dict.[k] <- { Nested = Dictionary<_, _>(); Result = None }
                if List.isEmpty ks then dict.[k].Result <- Some v
                else loop ks (dict.[k].Nested)
        loop ks dict
let nextId =
    // A ref cell is used because a local `let mutable` cannot be captured by a closure
    let id = ref 0
    fun () ->
        id.Value <- id.Value + 1
        id.Value
So, I guess I'm saying that you'll need to implement your own caching mechanism, but this worked quite well for me and may hint at how to do this in your case!
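To make the effect of the cache concrete, here is a minimal, self-contained sketch of the same idea. It is not the TheGamma code: the nested ListDictionary is swapped for a flat Dictionary keyed by an int list with structural hashing, and the names cache and lastId are made up for the illustration.

```fsharp
open System.Collections.Generic

type Expr =
    | Add of ExprNode * ExprNode
    | Lit of int
and ExprNode(expr: Expr, symbol: int) =
    member x.Expression = expr
    member x.Symbol = symbol
    // Hashing and equality only look at the symbol, never at the subtree
    override x.GetHashCode() = symbol
    override x.Equals(y) =
        match y with
        | :? ExprNode as y -> y.Symbol = x.Symbol
        | _ -> false

// Assumption: a flat dictionary keyed by (node kind :: child symbols)
let cache = Dictionary<int list, ExprNode>(HashIdentity.Structural)
let mutable lastId = 0

let node expr =
    let key =
        match expr with
        | Lit n -> [0; n]
        | Add(e1, e2) -> [1; e1.Symbol; e2.Symbol]
    match cache.TryGetValue key with
    | true, res -> res
    | _ ->
        lastId <- lastId + 1
        let res = ExprNode(expr, lastId)
        cache.[key] <- res
        res

// Building the same expression twice yields nodes with the same symbol
let a = node (Add(node (Lit 1), node (Lit 2)))
let b = node (Add(node (Lit 1), node (Lit 2)))
printfn "%b" (a.Symbol = b.Symbol)
```

Because every child is already a cached node, computing the key for a parent touches only the symbols of its direct children, so neither hashing nor equality ever walks the tree.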

Function generation with arbitrary signature - revisited

I am resubmitting a question asked almost a decade ago on this site link - but which is not as generic as I would like.
What I am hoping for is a way to construct a function from a list of types, where the final output type can have an arbitrary/default value (such as 0.0 for a float, or "" for a string). So, from
[float; int; float;]
I would get something that amounts to
fun (f: float) ->
    fun (i: int) ->
        0.0
I am hopeful of achieving this but have so far been unable to. It would help me a lot if I could see a sample that does the above.
The answer in the above link goes some of the way, but the example seems to know its function signature at compile time, which I won't, and also generates a compiler warning.
The scenario I have, for those that find context helpful, is that I want to be able to open a dll and one way or another identify a method which will have a given signature with argument-types limited to a known set of types (i.e. float, int). For each input parameter in this function signature I will run code to generate a 'buffer' object, which will have
a buffer of data items of the given type, i.e. [1.2; 3.2; 4.5]
a supplier of that data type (supplies may be intermittent so the receiving buffer may be empty at any one time)
a generator function that transforms data items before being dispatched. This function can be updated at any time.
a dispatch function. The dispatch target of bufferA will be bufferB, and for bufferB it will be a pub-sub thing where subscribers can subscribe to the end result of the calculation, in this case a stream of floats. Data accumulates in applicative style down the chain of buffers, until the final result is published as a new stream.
a regulator that turns the stream of data heading out to the consumer on or off. This ensures orderly function application.
The function from the dll will eventually be given to BufferA to apply to a float and pass the result on to buffer B (to pick up an int). However, while setting up the buffer infrastructure I only need a function with the correct signature, so a dummy value, such as 0.0, is fine.
For a function of a known signature I can handcraft the code that creates the necessary infrastructure, but I would like to be able to automate this, and ideally register dlls and have new calculated streams available plugin-style without rebuilding the application.
If you're willing to throw type safety out the window, you could do this:
let rec makeFunction = function
    | ["int"] -> box 0
    | ["float"] -> box 0.0
    | ["string"] -> box ""
    | "int" :: types ->
        box (fun (_ : int) -> makeFunction types)
    | "float" :: types ->
        box (fun (_ : float) -> makeFunction types)
    | "string" :: types ->
        box (fun (_ : string) -> makeFunction types)
    | _ -> failwith "Unexpected"
Here's a helper function for invoking one of these monstrosities:
let rec invokeFunction types (values : obj list) (f : obj) =
    match types, values with
    | [_], [] -> f
    | ("int" :: types'), (value :: values') ->
        let f' = f :?> (int -> obj)
        let value' = value :?> int
        invokeFunction types' values' (f' value')
    | ("float" :: types'), (value :: values') ->
        let f' = f :?> (float -> obj)
        let value' = value :?> float
        invokeFunction types' values' (f' value')
    | ("string" :: types'), (value :: values') ->
        let f' = f :?> (string -> obj)
        let value' = value :?> string
        invokeFunction types' values' (f' value')
    | _ -> failwith "Unexpected"
And here it is in action:
let types = ["int"; "float"; "string"] // int -> float -> string
let f = makeFunction types
let values = [box 1; box 2.0]
let result = invokeFunction types values f
printfn "%A" result // output: ""
Caveat: This is not something I would ever recommend in a million years, but it works.
I got 90% of what I needed from this blog by James Randall, entitled compiling and executing fsharp dynamically at runtime. I was unable to avoid concretely specifying the top level function signature, but a work-around was to generate an fsx script file containing that signature (determined from the relevant MethodInfo contained in the inspected dll), then load and run that script. James' blog/ github repository also describes loading and running functions contained in script files. Having obtained the curried function from the dll, I then apply it to default arguments to get representative functions of n-1 arity using
let p1: 'p1 = Activator.CreateInstance(typeof<'p1>) :?> 'p1
let fArity2 = fArity3 p1
Creating and running a script file is slow, of course, but I only need to do this once, when setting up the calculation stream.
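For completeness, a properly typed dummy function can possibly also be built at runtime without a script file, using FSharpType.MakeFunctionType and FSharpValue.MakeFunction from Microsoft.FSharp.Reflection. The following is only a sketch under that assumption; defaultOf, funcType and makeDummy are hypothetical helper names, and the choice of "" for string and null for other reference types is mine.

```fsharp
open System
open Microsoft.FSharp.Reflection

// Hypothetical helper: a placeholder value for the final result type
let defaultOf (t: Type) : obj =
    if t = typeof<string> then box ""
    elif t.IsValueType then Activator.CreateInstance t
    else null

// Builds the curried function type t1 -> t2 -> ... -> tn
let rec funcType (types: Type list) : Type =
    match types with
    | [] -> failwith "Empty signature"
    | [result] -> result
    | arg :: rest -> FSharpType.MakeFunctionType(arg, funcType rest)

// Builds a curried function of that type that ignores its arguments
// and finally returns the default value of the result type
let rec makeDummy (types: Type list) : obj =
    match types with
    | [] -> failwith "Empty signature"
    | [result] -> defaultOf result
    | _ :: rest ->
        FSharpValue.MakeFunction(funcType types, fun _ -> makeDummy rest)

// The cast is only here to check the result; in the plugin scenario the
// signature is not known at compile time and the value would stay an obj
let f = makeDummy [typeof<float>; typeof<int>; typeof<float>] :?> (float -> int -> float)
printfn "%A" (f 1.5 2)
```

Unlike the string-based makeFunction above, the value produced here has the real curried type (float -> int -> float in the example), so it can be passed to code that expects that signature.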

Using STArray and ignore the return of modify in Purescript

I think I'm close to what I want, though I suspect I'm not understanding how thaw / TH Region works.
Here is what I'm trying to implement (at least roughly)
modifyPerIndex :: forall t a. Foldable t => t (Tuple Int (a -> a)) -> Array a -> Array a
modifyPerIndex foldableActions array = run do
  mutableArray <- thaw array
  let actions = fromFoldable foldableActions
  foreach actions (\(Tuple index action) -> modify index action mutableArray)
  freeze mutableArray
This is sort of how I imagine updateAtIndices works. I suppose I could write modifyPerIndex to use updateAtIndices by reading in the values, applying the (a -> a) and mapping the result into a list of Tuples to be sent to updateAtIndices.
I'm curious how to do it this way though.
In the code above modify returns ST h Boolean, which I'd like to change into ST h Unit. That's where I'm lost. I get that h here is a constraint put on mutable data to stop it from leaving run, what I don't understand is how to use that.
There are a few options. But it has nothing to do with h. You don't have to "use" it for anything, and you don't have to worry about it at all.
First, the most dumb and straightforward approach - just bind the result to an ignored variable and then separately return unit:
foreach actions \(Tuple index action) -> do
  _ <- modify index action mutableArray
  pure unit
Alternatively, you can use void, which does more or less the same thing under the hood:
foreach actions \(Tuple index action) -> void $ modify index action mutableArray
But I would go straight for for_, which is the same as foreach, but works for any monad (not just ST) and ignores individual iterations' return values:
for_ actions \(Tuple index action) -> modify index action mutableArray

Using F#'s hash function inside GetHashCode() evil?

I encountered a couple of places online where code looked something like this:
[<CustomEquality;NoComparison>]
type Test =
    | Foo
    | Bar
    override x.Equals y =
        match y with
        | :? Test as y' ->
            match y' with
            | Foo -> false
            | Bar -> true // silly, I know, but not the question here
        | _ -> failwith "error" // don't do this at home
    override x.GetHashCode() = hash x
But when I run the above in FSI, the prompt does not return when I either call hash foo on an instance of Test or when I call foo.GetHashCode() directly.
let foo = Test.Foo;;
hash foo;; // no returning to the console until Ctrl-break
foo.GetHashCode();; // no return
I couldn't readily prove it, but it suggests that hash x calls GetHashCode() on the object, which means the above code is dangerous. Or is it just FSI playing up?
I thought code like the above just means "please implement custom equality, but leave the hash function as default".
I have meanwhile implemented this pattern differently, but am still wondering whether I am correct in assuming that hash just calls GetHashCode(), leading to an infinite loop.
As an aside, using equality inside FSI returns immediately, suggesting that it either does not call GetHashCode() prior to comparison, or it does something else. Update: this makes sense as in the example above x.Equals does not call GetHashCode(), and the equality operator calls into Equals, not into GetHashCode().
It's not quite as simple as the hash function being a wrapper for GetHashCode, but I can comfortably tell you that it's definitely not safe to use the implementation override x.GetHashCode() = hash x.
If you trace the hash function through, you end up here:
let rec GenericHashParamObj (iec : System.Collections.IEqualityComparer) (x: obj) : int =
    match x with
    | null -> 0
    | (:? System.Array as a) ->
        match a with
        | :? (obj[]) as oa -> GenericHashObjArray iec oa
        | :? (byte[]) as ba -> GenericHashByteArray ba
        | :? (int[]) as ba -> GenericHashInt32Array ba
        | :? (int64[]) as ba -> GenericHashInt64Array ba
        | _ -> GenericHashArbArray iec a
    | :? IStructuralEquatable as a ->
        a.GetHashCode(iec)
    | _ ->
        x.GetHashCode()
You can see here that the wild-card case calls x.GetHashCode(), hence it's very possible to find yourself in an infinite recursion.
The only case I can see where you might want to use hash inside an implementation of GetHashCode() would be when you are manually hashing some of an object's members to produce a hash code.
There is a (very old) example of using hash inside GetHashCode() in this way in Don Syme's WebLog.
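As a minimal sketch of that safe pattern (the Person type and its fields are made up for the illustration), hash is applied to a tuple of the members that take part in equality, never to x itself, so there is no way back into the type's own GetHashCode():

```fsharp
[<CustomEquality; NoComparison>]
type Person =
    { Name: string
      Age: int
      Note: string }
    // Note is deliberately ignored by both Equals and GetHashCode
    override x.Equals(y) =
        match y with
        | :? Person as p -> p.Name = x.Name && p.Age = x.Age
        | _ -> false
    // Safe: hashes a tuple of members, not the object itself
    override x.GetHashCode() = hash (x.Name, x.Age)

let p1 = { Name = "Ann"; Age = 30; Note = "a" }
let p2 = { Name = "Ann"; Age = 30; Note = "b" }
printfn "%b %b" (p1 = p2) (hash p1 = hash p2)
```

Equals and GetHashCode use the same set of members here, which keeps the usual contract that equal values produce equal hash codes.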
By the way, that's not the only thing unsafe about the code you posted.
Overrides for object.Equals absolutely must not throw exceptions. If the types do not match, they are to return false. This is clearly documented in System.Object.
Implementations of Equals must not throw exceptions; they should
always return a value. For example, if obj is null, the Equals method
should return false instead of throwing an ArgumentNullException.
(Source)
If the GetHashCode() method is overridden, then the hash operator will use that:
[The hash operator is a] generic hash function, designed to return equal hash values for items that are equal according to the = operator. By default it will use structural hashing for F# union, record and tuple types, hashing the complete contents of the type. The exact behavior of the function can be adjusted on a type-by-type basis by implementing System.Object.GetHashCode for each type.
So yes, this is a bad idea and it makes sense that it would lead to an infinite loop.

scala assignment of value vs. reference types

I thought I had a firm grasp of Scala's treatment of reference types (i.e., those derived from AnyRef), but now I am not so sure.
If I create a simple class like this
class C(var x: Int = 0) {}
and define a few instances
var a = new C
var b = new C(1)
var c = new C(2)
and then I assign
a = b
I do not get a (shallow) copy; rather, the reference to the instance originally held by a is lost forever, and a and b are essentially "aliases" for the same object. (This can be seen by looking at the addresses of these items.) This is fine and sensible. It is also clear that these are references (as opposed to values), since I can do
c = null
and this does not generate an error.
Now, suppose I do this
import scala.math.BigInt
var x = BigInt("12345678987654321")
var y = BigInt("98765432123456789")
var z = x + y
This creates three BigInts, with x, y and z, as, I suppose, references to these. In fact, I can do
z = null
and again get no error. However,
y = x
x += 1
does not cause y to change, i.e., it appears that in this case assignment did not simply create another "name" for the object referred to by x, but made a copy of it.
Why does this happen? I cannot find any mechanism (e.g., akin to the "copy constructor" of C++) that would be silently invoked by (what appears to be) straightforward reference assignment.
Any explanation would be greatly appreciated, as two days of web search has proved fruitless.
x += 1 will be expanded into x = x + 1 so it's not only assignment.
If you look at the source of BigInt, you'll see that + creates a new instance:
def + (that: BigInt): BigInt = new BigInt(this.bigInteger.add(that.bigInteger))
In fact, it uses Java's BigInteger underneath, whose add operation leaves both operands untouched.
So what happens in the end is a reference reassignment: x is pointed at the new instance produced by the immutable addition, and nothing is copied on assignment.
y = x
x += 1
BigInt is immutable, so + 1 creates a new BigInt; that is why y does not change. y still points to the previous object, while x points to the new BigInt object.
I suppose it's related to the immutability of BigInt and similar classes: you always get a new immutable object.

SML/ML Int to String conversion

I have this code:
datatype 'a Tree = Empty | LEAF of 'a | NODE of ('a Tree) list;
val iL1a = LEAF 1;
val iL1b = LEAF 2;
val iL1c = LEAF 3;
val iL2a = NODE [iL1a, iL1b, iL1c];
val iL2b = NODE [iL1b, iL1c, iL1a];
val iL3 = NODE [iL2a, iL2b, iL1a, iL1b];
val iL4 = NODE [iL1c, iL1b, iL3];
val iL5 = NODE [iL4];
fun treeToString f Node = let
    fun treeFun (Empty) = ["(:"]
      | treeFun (NODE([])) = [")"]
      | treeFun (LEAF(v)) = [f v]
      | treeFun (NODE(h::t)) = [""] @ ( treeFun (h)) @ ( treeFun (NODE(t)) )
  in
    String.concat(treeFun Node)
  end;
treeToString Int.toString iL5;
When I run my function I get the output: "32123)231)12)))".
The answer should be "((32((123)(231)12)))".
I've tried modifying my function to add ( in every place I can think but I cannot figure out where I should be adding "(". Where have I messed up?
Edit: I believe I need to use map or List.filter somewhere, but am not sure where.
It looks like your method of recursion over the tail of a list node is the problem. Instead of treeFun h appended to treeFun (NODE(t)), try using this for the NODE case:
| treeFun (NODE(items)) = ["("] @ List.concat (map treeFun items) @ [")"]
That is, map treeFun over the entire contents of the node, and surround the results with "(" and ")". That definition might be a bit too terse for you to understand what's going on, so here's a more verbose form that you might find clearer:
| treeFun (NODE(items)) =
    let val subtree_strings : string list list = map treeFun items
        val concatenated_subtrees : string list = List.concat subtree_strings
    in ["("] @ concatenated_subtrees @ [")"]
    end
subtree_strings is the result of taking all the subtrees in the given node and turning each of them into a list of strings by recursively calling treeFun on it. Since treeFun gives back a list of strings each time it's called, and we're calling it on an entire list of subtrees, the result is a corresponding list of lists of strings. So for instance, if we called map treeFun [LEAF 1, LEAF 2, LEAF 3], we'd get back [["1"], ["2"], ["3"]].
That's not the answer we want, since it's a list of lists of strings rather than a list of plain strings. We can fix that using List.concat, which takes a list of lists, and forms a single list of all the underlying items. So for instance List.concat [["1"], ["2"], ["3"]] returns ["1", "2", "3"]. Now all we have to do is put the parentheses around the result, and we're done.
Notice that this strategy works just as well for completely empty nodes as it does for nodes with one or more subtrees, so it eliminates the need for the second case of treeFun in your original definition. Generally, in ML, it's a code smell if a function of one argument doesn't have exactly one case for each constructor of the argument's type.