I'm new to Scala and don't understand some basic things.
Scala does not contain primitives. Hence int, short and other "simple" number types are objects. So, according to the JMM, they are not located on the stack and are subject to cleaning by the GC. Cleaning by the GC may be too expensive in some cases.
So I don't clearly understand why Scala is considered faster than Java (in which primitives are located on the stack).
Scala does not contain primitives. Hence int, short and other "simple" number types are objects.
That is correct.
So, according to the JMM,
The Java Memory Model is for Java. It is completely irrelevant to Scala.
they are not located on the stack and are subject to cleaning by the GC. Cleaning by the GC may be too expensive in some cases.
There is no such thing as a "stack" in Scala. The Scala Language Specification only mentions the term "stack" in very few places, and none of them have anything to do with Ints:
In section 1 Lexical Syntax, subsection 1.6 XML mode, it is said that because XML literals and Scala code can be arbitrarily nested, the parser has to use a stack data structure to keep track of the context.
In section 7 Implicits, subsection 7.2 Implicit parameters, it is said that to prevent an infinite recursion when searching for implicits, the compiler keeps a stack of "open types", which are types that it is currently searching an implicit for.
In section 6 Expressions, subsection 6.6 Function Applications, there is the following statement, specifying Proper Direct Tail Recursion:
A function application usually allocates a new frame on the program's run-time stack. However, if a local method or a final method calls itself as its last action, the call is executed using the stack-frame of the caller.
In section 6 Expressions, subsection 6.20 Return Expressions, there is the following statement about one possible implementation strategy for non-local returns from nested functions:
Returning from the method from within a nested function may be implemented by throwing and catching a scala.runtime.NonLocalReturnControl. Any exception catches between the point of return and the enclosing methods might see and catch that exception. A key comparison makes sure that this exception is only caught by the method instance which is terminated by the return.
If the return expression is itself part of an anonymous function, it is possible that the enclosing method m has already returned before the return expression is executed. In that case, the thrown scala.runtime.NonLocalReturnControl will not be caught, and will propagate up the call stack.
Of these 4 instances, the first 2 clearly do not refer to the concept of a call stack but rather to the generic computer science data structure. The 4th one is only an example of a possible implementation strategy ("Returning from the method from within a nested function may be implemented by […]"). Only the 3rd one is actually relevant, as it indeed talks about a call stack. However, it does not say anything about allocating Ints, and it explicitly leaves the door open to alternative implementations as well, by stating that "usually" function application leads to allocation of a stack frame, but doesn't have to.
So I don't clearly understand why Scala is considered faster than Java (in which primitives are located on the stack).
Actually, there is nothing in the Java Language Specification either that says that primitives are located on the stack. In fact, the Java Language Specification does not mandate the existence of a stack at all. It would be perfectly legal to implement Java without a stack.
There are exactly zero occurrences of the term "stack" in the JLS. There are a couple of mentions of the term "heap", but only in the compound term "heap pollution", which is simply a word describing a certain flaw in the type system, but does not necessarily require a heap, and does not mandate a heap.
And none of these mentions of "heap pollution" have anything to do with primitives.
Note that, when I say that the Scala Language Specification says nothing about stacks or heaps or how Ints are allocated, that is actually really important. Because the SLS doesn't say anything, implementors are allowed to do whatever they want, including making Ints primitive and allocating them on the stack.
And that is exactly what most Scala implementations do. The (now-defunct) Scala.NET implemented scala.Int as a .NET System.Int32. Scala-native implements scala.Int as a C int32_t. Scala.js implements scala.Int as an ECMAScript number. And Scala-JVM implements scala.Int as a JVM int.
If you check out the source code of scala.Int in the Scala-JVM repository (src/library/scala/Int.scala), you will find that it is actually empty! More precisely, it only contains documentation and declarations, but no definitions or implementations. Also, the class is marked final (meaning it can't be inherited from) and abstract (meaning it must be inherited from in order to provide overrides for the missing implementations), which is a contradiction.
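For reference, here is roughly what that declaration looks like (an abbreviated excerpt; the real file declares many more methods, all without bodies):

final abstract class Int private extends AnyVal {
  def +(x: Int): Int   // declared but never defined
  def *(x: Int): Int   // ditto; the compiler supplies the behaviour
  // ... dozens more declarations, none with an implementation
}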
How does this work? Well, the compiler knows what an Int is and how it works, and it simply generates the correct code for dealing with a JVM int. So, when it sees a call to scala.Int.+, it knows that instead it must generate an iadd bytecode instruction. Likewise, Scala-native will just generate the native integer addition instructions, and so on.
In other words, Ints are semantically defined as objects, but they are actually pragmatically implemented as primitives.
This is a general rule of how language specifications work: typically, they only describe what the result is that the programmer sees, but they leave it open to the implementor how to actually achieve that result. So, the SLS specifies that an Int must look as if it actually were an object, but there is nothing that says it actually has to be one.
They are handled the same way that Java handles those types, they're only boxed when strictly necessary. The details on how and when they are boxed may differ, but the compiler uses a primitive representation if it can do so. Here's what the docs say (this is just for Int, but it applies to other "primitive" types too):
Int, a 32-bit signed integer (equivalent to Java's int primitive type) is a subtype of scala.AnyVal. Instances of Int are not represented by an object in the underlying runtime system.
There is an implicit conversion from scala.Int => scala.runtime.RichInt which provides useful non-primitive operations.
https://www.scala-lang.org/api/2.13.6/scala/Int.html
The main difference, really, is that there aren't two separate types, like in Java, to represent the boxed and unboxed representations — both get the same Int type, whereas Java has int and Integer.
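A small illustration of the practical upshot on Scala-JVM (a sketch; the exact boxing points are backend-dependent):

def square(x: Int): Int = x * x       // compiles to primitive JVM int arithmetic (imul)
val xs: List[Int] = List(1, 2, 3)     // generic container: elements are boxed as java.lang.Integer
val arr: Array[Int] = Array(1, 2, 3)  // compiles to a JVM int[], no boxing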
Here are my thoughts on the question. Can anyone confirm, deny, or elaborate?
I wrote:
Scala doesn’t unify covariant List[A] with a GLB ⊤ assigned to List[Int], because as far as I can see, in subtyping “biunification” the direction of assignment matters. Thus None must have type Option[⊥] (i.e. Option[Nothing]), ditto Nil type List[Nothing], which can’t accept assignment from an Option[Int] or List[Int] respectively. So the value restriction problem originates from directionless unification, and global biunification was thought to be undecidable until the recent research linked above.
You may wish to view the context of the above comment.
ML’s value restriction disallows parametric polymorphism in cases (formerly thought to be rare, but perhaps more prevalent) where it would otherwise be sound (i.e. type safe) to allow it, especially for partial application of curried functions (which is important in functional programming), because the alternative typing solutions create a stratification between functional and imperative programming as well as break encapsulation of modular abstract types. Haskell has an analogous dual, the monomorphism restriction. OCaml relaxes the restriction in some cases. I elaborated on some of these details.
EDIT: my original intuition as expressed in the above quote (that the value restriction may be obviated by subtyping) is incorrect. The answers IMO elucidate the issue(s) well, and I’m unable to decide which of Alexey’s, Andreas’, or my answer should be selected as best. IMO they’re all worthy.
As I explained before, the need for the value restriction -- or something similar -- arises when you combine parametric polymorphism with mutable references (or certain other effects). That is completely independent from whether the language has type inference or not or whether the language also allows subtyping or not. A canonical counter example like
let r : ∀A.Ref(List(A)) = ref [] in
r := ["boo"];
head(!r) + 1
is not affected by the ability to elide the type annotation nor by the ability to add a bound to the quantified type.
Consequently, when you add references to F<: then you need to impose a value restriction to not lose soundness. Similarly, MLsub cannot get rid of the value restriction. Scala enforces a value restriction through its syntax already, since there is no way to even write the definition of a value that would have polymorphic type.
It's much simpler than that. In Scala, values can't have polymorphic types; only methods can. E.g. if you write
val id = x => x
its type isn't [A] A => A. (In fact, in Scala 2 this line doesn't even compile as written, since the parameter type of x can't be inferred; the point is that a val can never be given a polymorphic type.)
And if you take a polymorphic method e.g.
def id[A](x: A): A = x
and try to assign it to a value
val id1 = id
again the compiler will try (and in this case fail) to infer a specific A instead of creating a polymorphic value.
So the issue doesn't arise.
EDIT:
If you try to reproduce the http://mlton.org/ValueRestriction#_alternatives_to_the_value_restriction example in Scala, the problem you run into isn't the lack of let: val corresponds to it perfectly well. But you'd need something like
val f[A]: A => A = {
  var r: Option[A] = None
  { x => ... }
}
which is illegal. If you write def f[A]: A => A = ... it's legal but creates a new r on each call. In ML terms it would be like
val f: unit -> ('a -> 'a) =
  fn () =>
    let
      val r: 'a option ref = ref NONE
    in
      fn x =>
        let
          val y = !r
          val () = r := SOME x
        in
          case y of
            NONE => x
          | SOME y => y
        end
    end
val _ = f () 13
val _ = f () "foo"
which is allowed by the value restriction.
That is, Scala's rules are equivalent to only allowing lambdas as polymorphic values in ML instead of everything value restriction allows.
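For concreteness, here is a sketch of the def variant mentioned above. It compiles, but a fresh r is allocated on every call, so no state (and hence no type) leaks between instantiations:

def f[A]: A => A = {
  var r: Option[A] = None   // a new cell on every call of f
  (x: A) => {
    val y = r
    r = Some(x)
    y.getOrElse(x)
  }
}

f[Int](13)        // fine
f[String]("foo")  // also fine: a different r, so no unsoundness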
EDIT: this answer was incorrect before. I have completely rewritten the explanation below to gather my new understanding from the comments under the answers by Andreas and Alexey.
The edit history and the archived versions of this page at archive.is provide a record of my prior misunderstanding and the discussion. Another reason I chose to edit rather than delete and write a new answer is to retain the comments on this answer. IMO, this answer is still needed, because although Alexey answers the thread title correctly and most succinctly—also Andreas’ elaboration was the most helpful for me to gain understanding—I think the layman reader may require a different, more holistic (yet hopefully still generative essence) explanation in order to quickly gain some depth of understanding of the issue. Also I think the other answers obscure how convoluted a holistic explanation is, and I want naive readers to have the option to taste it. The prior elucidations I’ve found don’t state all the details in English and instead (as mathematicians tend to do for efficiency) rely on the reader to discern the details from the nuances of the symbolic programming language examples and prerequisite domain knowledge (e.g. background facts about programming language design).
The value restriction arises where we have mutation of referenced1 type parametrised objects2. The type unsafety that would result without the value restriction is demonstrated in the following MLton code example:
val r: 'a option ref = ref NONE
val r1: string option ref = r
val r2: int option ref = r
val () = r1 := SOME "foo"
val v: int = valOf (!r2)
The NONE value (which is akin to null) contained in the object referenced by r can be assigned to a reference with any concrete type for the type parameter 'a, because r has the polymorphic type 'a option ref. That would allow type unsafety because, as shown in the example above, the same object referenced by r, which has been assigned to both string option ref and int option ref, can be written (i.e. mutated) with a string value via the r1 reference and then read as an int value via the r2 reference. The value restriction generates a compiler error for the above example.
A typing complication arises to prevent3 the (re-)quantification (i.e. binding or determination) of the type parameter (aka type variable) of said reference (and the object it points to) to a type which differs when reusing an instance of said reference that was previously quantified with a different type.
Such (arguably bewildering and convoluted) cases arise for example where successive function applications (aka calls) reuse the same instance of such a reference. IOW, cases where the type parameters (pertaining to the object) for a reference are (re-)quantified each time the function is applied, yet the same instance of the reference (and the object it points to) being reused for each subsequent application (and quantification) of the function.
Tangentially, the occurrence of these is sometimes non-intuitive due to the lack of an explicit universal quantifier ∀ (since the implicit rank-1 prenex lexical-scope quantification can be dislodged from lexical evaluation order by constructions such as let or coroutines) and the arguably greater irregularity (as compared to Scala) of when unsafe cases may arise under ML’s value restriction:
Andreas wrote:
Unfortunately, ML does not usually make the quantifiers explicit in its syntax, only in its typing rules.
Reusing a referenced object is for example desired for let expressions, which, analogous to math notation, should only create and evaluate the instantiation of the substitutions once, even though they may be lexically substituted more than once within the in clause. So for example, if the function application is evaluated (regardless of whether also lexically or not) within the in clause whilst the type parameters of the substitutions are re-quantified for each application (because the instantiation of the substitutions is only lexically within the function application), then type safety can be lost if the applications aren’t all forced to quantify the offending type parameters only once (i.e. disallow the offending type parameter to be polymorphic).
The value restriction is ML’s compromise to prevent all unsafe cases while also preventing some (formerly thought to be rare) safe cases, so as to simplify the type system. The value restriction is considered a better compromise, because the early (antiquated?) experience with more complicated typing approaches that didn’t restrict any or as many safe cases caused a bifurcation between imperative and pure functional (aka applicative) programming and leaked some of the encapsulation of abstract types in ML functor modules. I cited some sources and elaborated here.

Tangentially though, I’m pondering whether the early argument against bifurcation really stands up against the fact that the value restriction isn’t required at all for call-by-name (e.g. Haskell-esque lazy evaluation when also memoized by need), because conceptually partial applications don’t form closures on already evaluated state; and call-by-name is required for modular compositional reasoning and, when combined with purity, for modular (category theory and equational reasoning) control and composition of effects. The monomorphism restriction argument against call-by-name is really about forcing type annotations, yet being explicit when optimal memoization (aka sharing) is required is arguably less onerous, given said annotation is needed for modularity and readability anyway. Call-by-value is a fine-tooth-comb level of control, so where we need that low-level control then perhaps we should accept the value restriction, because the rare cases that more complex typing would allow would be less useful in the imperative versus applicative setting. However, I don’t know if the two can be stratified/segregated in the same programming language in a smooth/elegant manner. Algebraic effects can be implemented in a CBV language such as ML and they may obviate the value restriction. IOW, if the value restriction is impinging on your code, possibly it’s because your programming language and libraries lack a suitable metamodel for handling effects.
Scala makes a syntactical restriction against all such references, which is a compromise that restricts, for example, the same and even more cases (that would be safe if not restricted) than ML’s value restriction, but is more regular in the sense that we’ll never be scratching our heads about an error message pertaining to the value restriction. In Scala, we’re never allowed to create such a reference. Thus in Scala, we can only express cases where a new instance of a reference is created when its type parameters are quantified. Note that OCaml relaxes the value restriction in some cases.
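To make that syntactic restriction concrete, here is a sketch (the commented-out line shows what cannot even be written in Scala):

// val r: Option[A] = None   // rejected: a val/var has no binder for a type parameter A
var r: Option[Int] = None    // legal, but the type parameter had to be chosen concretely up front

Since r's element type can never remain polymorphic, the unsafe aliasing in the MLton example above cannot even be expressed.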
Note that afaik neither Scala nor ML enables declaring that a reference is immutable1, although the object it points to can be declared immutable with val. Note there’s no need for the restriction for references that can’t be mutated.
The reason that mutability of the reference type1 is required in order to make the complicated typing cases arise is that if we instantiate the reference (e.g. in the substitutions clause of let) with a non-parametrised object (i.e. not None or Nil4 but instead, for example, an Option[String] or List[Int]), then the reference won’t have a polymorphic type (pertaining to the object it points to) and thus the re-quantification issue never arises. So the problematic cases are due to instantiation with a polymorphic object, then subsequently assigning a newly quantified object (i.e. mutating the reference type) in a re-quantified context, followed by dereferencing (reading) from the (object pointed to by the) reference in a subsequent re-quantified context. As aforementioned, when the re-quantified type parameters conflict, the typing complication arises and unsafe cases must be prevented/restricted.
Phew! If you understood that without reviewing linked examples, I’m impressed.
1 IMO employing the phrase “mutable references” instead of “mutability of the referenced object” and “mutability of the reference type” would be potentially more confusing, because our intention is to mutate the object’s value (and its type) which is referenced by the pointer, not to mutate the pointer itself. Some programming languages don’t even explicitly distinguish, in the case of primitive types, between disallowing mutation of the reference and of the object it points to.
2 Wherein an object may even be a function, in a programming language that allows first-class functions.
3 To prevent a segmentation fault at runtime due to accessing (read or write of) the referenced object with a presumption about its statically (i.e. at compile-time) determined type which is not the type that the object actually has.
4 Which are NONE and [] respectively in ML.
In the syntax analysis phase, an imperative compiler can build an AST out of nodes that already contain a type field that is set to null during construction, and then later, in the semantic analysis phase, fill in the types by assigning the declared/inferred types into the type fields.
How do purely functional languages handle this, where you do not have the luxury of assignment? Is the type-less AST mapped to a different kind of type-enriched AST? Does that mean I need to define two types per AST node, one for the syntax phase, and one for the semantic phase?
Are there purely functional programming tricks that help the compiler writer with this problem?
I usually rewrite a source AST (or one already lowered in several steps) into a new form, replacing each expression node with a pair (tag, expression).
Tags are unique numbers or symbols which are then used by the next pass which derives type equations from the AST. E.g., a + b will yield something like { numeric(Tag_a). numeric(Tag_b). equals(Tag_a, Tag_b). equals(Tag_e, Tag_a).}.
Then the type equations are solved (e.g., by simply running them as a Prolog program) and, if successful, all the tags (which are variables in this program) become bound to concrete types; if not, they're left as type parameters.
In the next step, our previous AST is rewritten again, this time replacing tags with all the inferred type information.
The whole process is a sequence of pure rewrites, no need to replace anything in your AST destructively. A typical compilation pipeline may take a couple of dozens of rewrites, some of them changing the AST datatype.
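A minimal sketch in Scala of this tag-and-solve pipeline (the toy language, names, and the Prolog-free solver here are illustrative, not from any particular compiler):

sealed trait Exp
case class Num(n: Int)         extends Exp
case class Var(name: String)   extends Exp
case class Add(l: Exp, r: Exp) extends Exp

sealed trait Type
case object TInt          extends Type
case class TVar(tag: Int) extends Type   // an as-yet-unsolved tag

// Pass 1: derive equations, threading a fresh-tag counter purely.
// Returns (type of e, next fresh tag, accumulated equations).
def constrain(e: Exp, env: Map[String, Type], next: Int): (Type, Int, List[(Type, Type)]) =
  e match {
    case Num(_) => (TInt, next, Nil)
    case Var(x) => (env(x), next, Nil)
    case Add(l, r) =>
      val (tl, n1, cs1) = constrain(l, env, next)
      val (tr, n2, cs2) = constrain(r, env, n1)
      val t = TVar(n2)
      // a + b: both operands and the result must be numeric
      (t, n2 + 1, (tl, TInt) :: (tr, TInt) :: (t, TInt) :: (cs1 ::: cs2))
  }

// Pass 2: solve the equations by trivial unification into a substitution.
def solve(cs: List[(Type, Type)], subst: Map[Int, Type]): Map[Int, Type] = cs match {
  case Nil => subst
  case (a, b) :: rest =>
    def walk(t: Type): Type = t match {
      case TVar(i) if subst.contains(i) => walk(subst(i))
      case other                        => other
    }
    (walk(a), walk(b)) match {
      case (x, y) if x == y => solve(rest, subst)
      case (TVar(i), t)     => solve(rest, subst + (i -> t))
      case (t, TVar(i))     => solve(rest, subst + (i -> t))
      case (x, y)           => sys.error(s"type error: $x vs $y")
    }
}

A final rewrite would then map each node's tag through the solved substitution; tags still unbound after solving are the inferred type parameters.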
There are several options to model this. You may use the same kind of nullable data fields as in your imperative case:
data Exp = Var Name (Maybe Type) | ...
parse :: String -> Maybe Exp -- types are Nothings here
typeCheck :: Exp -> Maybe Exp -- turns Nothings into Justs
or even, using a more precise type
data Exp ty = Var Name ty | ...
parse :: String -> Maybe (Exp ())
typeCheck :: Exp () -> Maybe (Exp Type)
I can't speak for how it is supposed to be done, but I did do this in F# for a C# compiler here.
The approach was basically: build an AST from the source, leaving things like type information unconstrained. So AST.fs basically is the AST, which uses strings for the type names, function names, etc.
As the AST starts to be compiled to (in this case) .NET IL, we end up with more type information (we create the types in the source - let's call these type stubs). This then gives us the information needed to create method stubs (the code may have signatures that include type stubs as well as built-in types). From here we now have enough type information to resolve any of the type names or method signatures in the code.
I store that in the file TypedAST.fs. I do this in a single pass; however, the approach may be naive.
Now that we have a fully typed AST, you could do things like compile it, fully analyze it, or whatever you like with it.
So in answer to the question "Does that mean I need to define two types per AST node, one for the syntax phase, and one for the semantic phase?", I can't say definitively that this is the case, but it is certainly what I did, and it appears to be what MS have done with Roslyn (although they have essentially decorated the original tree with type info, IIRC).
"Are there purely functional programming tricks that help the compiler writer with this problem?"
Given the ASTs are essentially mirrored in my case, it would be possible to make it generic and transform the tree, but the code may end up (more) horrendous.
i.e.
type 'ty AST =        // 'ty carries the type info ('type' itself is an F# keyword)
  | MethodInvoke of 'ty * Name * 'ty list
  | ...
As when dealing with relational databases, in functional programming it is often a good idea not to put everything in a single data structure.
In particular, there may not be a data structure that is "the AST".
Most probably, there will be data structures that represent parsed expressions. One possible way to deal with type information is to assign a unique identifier (like an integer) to each node of the tree already during parsing and have some suitable data structure (like a hash map) that associates those node-ids with types. The job of the type inference pass, then, would be just to create this map.
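A sketch of that idea (the names are illustrative):

sealed trait Type
case object TInt extends Type

final case class NodeId(value: Int)   // assigned during parsing

sealed trait Exp { def id: NodeId }
case class Num(id: NodeId, n: Int)         extends Exp
case class Add(id: NodeId, l: Exp, r: Exp) extends Exp

// The inference pass leaves the tree untouched and just builds the association:
def infer(e: Exp): Map[NodeId, Type] = e match {
  case Num(id, _)    => Map(id -> TInt)
  case Add(id, l, r) => infer(l) ++ infer(r) + (id -> TInt)
}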
Why does
List.range(0,100).contains(2)
work, while
List.range(0,100).par.contains(2)
does not?
Is this planned for the future?
The non-teleological answer is that it's because contains is defined in SeqLike but not in ParSeqLike.
If that doesn't satisfy your curiosity, you can find that SeqLike's contains is defined thus:
def contains(elem: Any): Boolean = exists (_ == elem)
So for your example you can write
List.range(0,100).par.exists(_ == 2)
ParSeqLike is missing a few other methods as well, some of which would be hard to implement efficiently (e.g. indexOfSlice) and some for less obvious reasons (e.g. combinations - maybe because that's only useful on small datasets). But if you have a parallel collection you can also use .seq to get back to the linear version and get your methods back:
List.range(0,100).par.seq.contains(2)
As for why the library designers left it out... I'm totally guessing, but maybe they wanted to reduce the number of methods for simplicity's sake, and it's nearly as easy to use exists.
This also raises the question, why is contains defined on SeqLike rather than on the granddaddy of all collections, GenTraversableOnce, where you find exists? A possible reason is that contains for Map is semantically a different method to that on Set and Seq. A Map[A,B] is a Traversable[(A,B)], so if contains were defined for Traversable, contains would need to take a tuple (A,B) argument; however Map's contains takes just an A argument. Given this, I think contains should be defined in GenSeqLike - maybe this is an oversight that will be corrected.
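For example (illustrative):

Map(1 -> "a").contains(1)          // key lookup: takes just the key type A
Seq(1 -> "a").contains(1 -> "a")   // element test: takes the whole (A, B) pair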
(I thought at first maybe parallel sequences don't have contains because searching where you intend to stop after finding your target on parallel collections is a lot less efficient than the linear version (the various threads do a lot of unnecessary work after the value is found: see this question), but that can't be right because exists is there.)
Scala collections have a bunch of readable and almost readable operators like :+ and +:, but why aren't there any human readable synonyms like append?
The mutable buffer collections in Scala (such as ArrayBuffer and ListBuffer) have the BufferLike trait, and it defines an append method.
Immutable collections do not have the BufferLike trait and hence only define the other methods that do not change the collection in place but generate a new one.
Symbolic method names allow the combination with the assignment operation =.
For instance, if you have a method ++ which creates a new collection, you can automatically use ++= to assign the new collection to some variable:
var array = Array(1,2,3)
array ++= Array(4,5,6)
// array is now Array(1,2,3,4,5,6)
This is not possible without symbolic method names.
In fact they often have human-readable synonyms:
foldLeft is equivalent to /:
foldRight is equivalent to :\
The remaining ones are addition operators, which are quite human readable as they are:
++ is equivalent to Java's addAll
:+ is append
+: is prepend
The position of the colon indicates the receiver instance (an operator ending in : is invoked on its right-hand operand).
Finally, some weird operators are legacies of other functional programming languages, such as list construction (SML) or actor messaging (Erlang).
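A few of these side by side (a quick sketch):

val xs = List(2, 3)
1 +: xs            // prepend -> List(1, 2, 3); ends in ':', so dispatched on xs
xs :+ 4            // append  -> List(2, 3, 4)
1 :: xs            // cons, inherited from ML -> List(1, 2, 3)
(0 /: xs)(_ + _)   // xs.foldLeft(0)(_ + _); the symbolic form is deprecated in recent versions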
Is it any different than any other language?
Let's take Java. What's the human readable version of +, -, * and / on int? Or, let's take String: what's the human readable version of +? Note that concat is not the same thing -- it doesn't accept non-String parameters.
Perhaps you are bothered by it because in Java -- unlike, say, C++ -- either things use exclusively non-alphabetic operators, or alphabetic operators -- with the exception of String's +.
The Scala standard library does not set out to be Java friendly. Instead, adapters are provided to convert between Java and Scala collections.
Attempting to provide a Java friendly API would not only constrain the choice of identifiers (or mandate that aliases should be provided), but also limit the way that generics and function types were used. Substantially more testing would be required to validate the design.
On the same topic, I remember some debate as to whether the 2.8 collections should implement java.util.Iterable.