Specify type of a TaggedOutput to pass through GroupByKey (as a part of CombinePerKey)

When I tried to migrate my project, which is based on Apache Beam pipelines, from Python 3.7 to 3.8, the type hint check started to fail at this place:
pcoll = (
    wrong_pcoll,
    some_pcoll_1,
    some_pcoll_2,
    some_pcoll_3,
) | beam.Flatten(pipeline=pipeline)
pcoll | beam.CombinePerKey(MyCombineFn())  # << here
with this error:
apache_beam.typehints.decorators.TypeCheckError: Input type hint violation at GroupByKey: expected Tuple[TypeVariable[K], TypeVariable[V]], got Union[TaggedOutput, Tuple[Any, Any], Tuple[Any, _MyType1], Tuple[Any, _MyType2]]
The wrong_pcoll is actually a TaggedOutput because it is received as a tagged output from one of the previous PTransforms.
The type hint check fails because the type of wrong_pcoll, TaggedOutput, becomes part of the type of pcoll (which, according to the exception, is Union[TaggedOutput, Tuple[Any, Any], Tuple[Any, _MyType1], Tuple[Any, _MyType2]]), and that type is passed to the GroupByKey used inside CombinePerKey.
So I have two questions:
Why does it work on Python 3.7 but not on 3.8?
How do I specify the type of a tagged output? I tried annotating the process() method of the PTransform that produces it with a union of all the output types it yields, but for some reason the type hint check picked the wrong one. Then I specified exactly the type I need, Tuple[Any, Any], and it worked. But that is not a real solution, since process() also yields other types, such as plain str.
As a workaround, I can pass this wrong_pcoll through a simple beam.Map with lambda x: x and .with_output_types(Tuple[Any, Any]), but that does not seem like a clean way to fix it.

I investigated similar failures recently.
Beam has some type-inferencing capabilities which rely on opcode analysis of pipeline code. Inferencing is somewhat limited and conservative. For example, when Beam attempts to infer a function's return type and encounters an opcode that it does not know, Beam infers the return type as Any. It is also sensitive to Python minor version.
Python 3.8 removed some opcodes, such as SETUP_LOOP, that Beam didn't handle previously. Therefore, type inference behavior kicked in for some portions of the code where it didn't work before. I've seen pipelines where an increased type inference on Python 3.8 exposed incorrectly-specified hints.
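To check the opcode change on your own interpreter, the standard dis module lists the opcodes a given CPython version defines. This is only a sketch of the version difference and does not reproduce Beam's inference code; the helper names has_opcode and opcode_names and the sample function are made up for illustration.

```python
import dis
import sys


def has_opcode(name: str) -> bool:
    """Return True if this interpreter's bytecode set defines `name`."""
    return name in dis.opmap


def opcode_names(func) -> list:
    """List the opcode names emitted for `func`, in order."""
    return [ins.opname for ins in dis.get_instructions(func)]


def sample(items):
    # A plain loop: on CPython <= 3.7 this compiles to a SETUP_LOOP block.
    total = 0
    for item in items:
        total += item
    return total


# SETUP_LOOP exists up to CPython 3.7 and is gone from 3.8 onwards,
# so the same source compiles to different opcodes on the two versions.
print(sys.version_info[:2], "SETUP_LOOP defined:", has_opcode("SETUP_LOOP"))
print(opcode_names(sample))
```

An analyzer that gives up on an unknown opcode would therefore stop at this loop on 3.7 and keep going on 3.8, which matches the behaviour change described above.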
You are running into a bug/limitation in Beam's type inference for multi-output DoFns tracked in https://issues.apache.org/jira/browse/BEAM-4132. There was some progress, but it's not completely addressed. As a workaround you could manually specify the hints. I think beam.Flatten().with_output_types(Tuple[str, Union[_MyType1, _MyType2]]) should work for your case.

How to get erased type at compile time?

In Scala 2, there is a function TypeApi#erasure described as follows
The erased type corresponding to this type after
all transformations from Scala to Java have been performed.
How to get erased Type for a given TypeRepr in Scala 3?

I also asked this question on the dotty Gitter, and it turned out that there is currently no API for getting erased types.
Then I created a dotty feature request in which you can find possible workarounds.

Instantiation in Minizinc

I am reading through "A Minizinc Tutorial" by Kim Marriott and it says that
the combination of variable instantiation and type is called type-inst. As you start to use Minizinc, you will undoubtedly see examples of type-inst errors.
What exactly are type-inst errors?

I believe the terminology is not often used in the MiniZinc literature these days, but for every value in MiniZinc the compiler keeps track of two things: its type (int, bool, float, etc.) and whether it is a decision variable (not known until the solver runs) or a problem parameter (must be known when rewriting the model for the solver). Together these two things are called the Type Instantiation or type-inst.
A type-inst error is an error given by the type checker of the compiler. These errors can occur in many places, such as when the declared type-inst in a declaration doesn't match its right-hand side, when the two branches of an if-then-else have different type-insts, or when the arguments of a call do not match the declared type-inst of the function declaration.
The mismatch that causes these errors can come from either side of the type-inst: either the types are incompatible (e.g. used float instead of bool), or you used a decision variable where only a problem parameter was allowed. These issues are usually caused by mistakes in the model and are usually resolved easily by changing the value used or using different language constructs.
Note that MiniZinc does allow sub-typing: you are allowed to use a bool instead of an int, and it is converted to a 0/1 value. Similarly, you can use an integer value instead of a float, and you can use a parameter in place of a variable.
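The two halves of a type-inst and the allowed coercions can be modelled in a few lines. This is a simplified Python sketch, not how the MiniZinc compiler is implemented; the TypeInst class, the fits function and the two rank tables are assumptions made for illustration, and only the three base types named above are covered.

```python
from dataclasses import dataclass

# Assumed orderings, taken from the coercions described above:
# bool may stand in for int, int for float; a parameter may stand in for a variable.
TYPE_RANK = {"bool": 0, "int": 1, "float": 2}
INST_RANK = {"par": 0, "var": 1}


@dataclass(frozen=True)
class TypeInst:
    base: str   # "bool", "int" or "float"
    inst: str   # "par" (problem parameter) or "var" (decision variable)


def fits(actual: TypeInst, expected: TypeInst) -> bool:
    """True if a value of type-inst `actual` may be used where `expected` is required."""
    return (TYPE_RANK[actual.base] <= TYPE_RANK[expected.base]
            and INST_RANK[actual.inst] <= INST_RANK[expected.inst])


# Allowed: a bool parameter where an int decision variable is expected.
print(fits(TypeInst("bool", "par"), TypeInst("int", "var")))
# Type side mismatches: a float where a bool is expected.
print(fits(TypeInst("float", "par"), TypeInst("bool", "par")))
# Inst side mismatches: a decision variable where a parameter is required.
print(fits(TypeInst("int", "var"), TypeInst("int", "par")))
```

A type-inst error corresponds to fits returning False on either component.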
The newest version of the MiniZinc Tutorial can be found with its documentation: https://www.minizinc.org/doc-latest/en/part_2_tutorial.html

Can't bind[SttpBackend[Try, Nothing]]

I want to use the sttp library with Guice (with the scalaguice wrapper) in my app. But it seems it is not so easy to correctly bind things like SttpBackend[Try, Nothing]
SttpBackend.scala
Try[_] and Try[AnyRef] produce some other errors, but I still have no idea how it should be done correctly
the error I got:
kinds of the type arguments (scala.util.Try) do not conform to the expected kinds of the type parameters (type T).
[error] scala.util.Try's type parameters do not match type T's expected parameters:
[error] class Try has one type parameter, but type T has none
[error] bind[SttpBackend[Try, Nothing]].toProvider[SttpBackendProvider]
[error] ` ^
SttpBackendProvider looks like:
def get: SttpBackend[Try, Nothing] = TryHttpURLConnectionBackend(opts)
complete example in scastie
It is interesting that scalaguice version 4.1.0 shows this error, while the latest 4.2.2 shows an error inside the library when converting Nothing to JavaType

I believe you hit two different bugs in Scala-Guice, one of which is not fixed yet (and probably not even reported yet).
To describe those issues I need to give a quick intro to how Guice and Scala-Guice work. Essentially, what Guice does is keep a mapping from a type to the factory method for an object of that type. To support some advanced features, types are mapped onto an internal "key" representation, and then for each "key" Guice builds a way to construct a corresponding object. It is also important that generics in Java are implemented using type erasure. That's why when you write something like:
bind(classOf[SttpBackend[Try, Nothing]]).toProvider(classOf[SttpBackendProvider])
in raw-Guice, the "key" actually becomes something like "com.softwaremill.sttp.SttpBackend". Luckily Guice developers have thought about this issue with generics and introduced TypeLiteral[T] so you can convey the information about generics.
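The effect of erasure on binding keys can be mimicked in Python, where typing.get_origin drops the type arguments of a parameterized generic. This is only an analogy and does not use or reproduce any Guice API; Backend, TryEffect, FutureEffect and erased_key are made-up names.

```python
from typing import Generic, TypeVar, get_origin

R = TypeVar("R")
S = TypeVar("S")


class Backend(Generic[R, S]):
    """Made-up generic type playing the role of SttpBackend."""


class TryEffect:
    """Made-up marker for a Try-based effect."""


class FutureEffect:
    """Made-up marker for a Future-based effect."""


def erased_key(hint):
    """Drop the type arguments, as erasure does: Backend[X, Y] -> Backend."""
    return get_origin(hint) or hint


hints = [Backend[TryEffect, None], Backend[FutureEffect, None]]

# Keyed by the erased class, the second binding overwrites the first.
by_erased = {erased_key(h): h for h in hints}
# Keyed by the full parameterized type (the TypeLiteral idea), both survive.
by_full = {h: h for h in hints}

print(len(by_erased), len(by_full))
```

With the erased key both bindings collapse into one entry, which is the problem TypeLiteral[T] was introduced to avoid.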
The Scala type system is richer than Java's, and it has better reflection support from the compiler. Scala-Guice exploits this to map Scala types onto those more detailed keys automatically. Unfortunately it doesn't always work perfectly.
The first issue is the result of the facts that the type SttpBackend is defined as
trait SttpBackend[R[_], -S]
so it expects its first parameter to be a type constructor; and that originally Scala-Guice used the scala.reflect.Manifest infrastructure. AFAIU such higher-kinded types are not representable as a Manifest, and this is what the error in your question really says.
Luckily Scala has added the newer scala.reflect.runtime.universe.TypeTag infrastructure to tackle this issue in a better and more consistent way, and Scala-Guice migrated to it. That's why the compiler error goes away with the newer version of Scala-Guice. Unfortunately there is another bug in Scala-Guice that makes the code fail at runtime: it does not handle the Scala type Nothing. You see, the Nothing type is a kind of fake one on the JVM. It is one of the places where the Scala type system is richer than the Java one. There is no direct mapping for Nothing in the JVM world. Luckily there is no way to create a value of type Nothing. Unfortunately you can still create a classOf[Nothing]. The Scala-to-JVM compiler handles it by using an artificial scala.runtime.Nothing$. It is not part of the public API; it is an implementation detail specific to Scala on the JVM. Anyway, this means that the Nothing type needs additional handling when it is converted into a Guice TypeLiteral, and there is none. There is such handling for Any, the cousin of Nothing, but not for Nothing (see the usage of anyType in TypeConversions.scala).
So there are really two workarounds:
Use raw Java-based syntax for Guice instead of the nice Scala-Guice one:
bind(new TypeLiteral[SttpBackend[Try, Nothing]]() {})
.toInstance(sttpBackend) // or to whatever
See online demo based on your example.
Patch the TypeConversions.scala in the Scala-Guice as in:
private[scalaguice] object TypeConversions {
  private val mirror = runtimeMirror(getClass.getClassLoader)
  private val anyType = typeOf[Any]
  private val nothingType = typeOf[Nothing] // added
  ...
  def scalaTypeToJavaType(scalaType: ScalaType): JavaType = {
    scalaType.dealias match {
      case `anyType` => classOf[java.lang.Object]
      case `nothingType` => classOf[scala.runtime.Nothing$] // added
      ...
I tried it locally and it seems to fix your example. I didn't do any extensive tests so it might have broken something else.

How to use flink fold function in scala

This is a non-working attempt at using Flink's fold with a Scala anonymous function:
val myFoldFunction = (x: Double, t: (Double, String, String)) => x + t._1
env.readFileStream(...).
  ...
  .groupBy(1)
  .fold(0.0, myFoldFunction: Function2[Double, (Double, String, String), Double])
It compiles fine, but at execution I get a "type erasure issue" (see below). Doing so in Java is fine, but of course more verbose. I like the concise and clear lambdas. How can I do that in Scala?
Caused by: org.apache.flink.api.common.functions.InvalidTypesException:
Type of TypeVariable 'R' in 'public org.apache.flink.streaming.api.scala.DataStream org.apache.flink.streaming.api.scala.DataStream.fold(java.lang.Object,scala.Function2,org.apache.flink.api.common.typeinfo.TypeInformation,scala.reflect.ClassTag)' could not be determined.
This is most likely a type erasure problem.
The type extraction currently supports types with generic variables only in cases where all variables in the return type can be deduced from the input type(s).

The problem you encountered is a bug in Flink [1]. The problem originates from Flink's TypeExtractor and the way the Scala DataStream API is implemented on top of the Java implementation. The TypeExtractor cannot generate a TypeInformation for the Scala type and thus returns a MissingTypeInformation. This missing type information is manually set after creating the StreamFold operator. However, the StreamFold operator is implemented in a way that it does not accept a MissingTypeInformation and, consequently, fails before setting the right type information.
I've opened a pull request [2] to fix this problem. It should be merged within the next two days. Once it is merged, using the latest 0.10 snapshot version should fix your problem.
[1] https://issues.apache.org/jira/browse/FLINK-2631
[2] https://github.com/apache/flink/pull/1101

How do purely functional compilers annotate the AST with type info?

In the syntax analysis phase, an imperative compiler can build an AST out of nodes that already contain a type field that is set to null during construction, and then later, in the semantic analysis phase, fill in the types by assigning the declared/inferred types into the type fields.
How do purely functional languages handle this, where you do not have the luxury of assignment? Is the type-less AST mapped to a different kind of type-enriched AST? Does that mean I need to define two types per AST node, one for the syntax phase, and one for the semantic phase?
Are there purely functional programming tricks that help the compiler writer with this problem?

I usually rewrite a source (or an already several steps lowered) AST into a new form, replacing each expression node with a pair (tag, expression).
Tags are unique numbers or symbols which are then used by the next pass, which derives type equations from the AST. E.g., a + b will yield something like { numeric(Tag_a). numeric(Tag_b). equals(Tag_a, Tag_b). equals(Tag_e, Tag_a).}, where Tag_e is the tag of the whole expression.
Then the type equations are solved (e.g., by simply running them as a Prolog program). If this succeeds, all the tags (which are variables in this program) that could be resolved are now bound to concrete types; any that could not are left as type parameters.
In the next step, our previous AST is rewritten again, this time replacing the tags with the inferred type information.
The whole process is a sequence of pure rewrites, with no need to replace anything in your AST destructively. A typical compilation pipeline may take a couple of dozen rewrites, some of them changing the AST datatype.
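The four passes above can be sketched on a toy expression language. This is a minimal illustration under stated assumptions, not the author's compiler: expressions are plain tuples, the only concrete type is "int", and the Prolog-style solver is swapped for a small union-find unifier. All function names are hypothetical.

```python
from itertools import count


def tag(expr, fresh):
    """Pass 1: wrap every node as (tag, node) with a fresh integer tag."""
    if expr[0] == "add":
        return (next(fresh), ("add", tag(expr[1], fresh), tag(expr[2], fresh)))
    return (next(fresh), expr)


def equations(tagged):
    """Pass 2: derive equations between tags (ints) and type names (strs)."""
    t, node = tagged
    if node[0] == "num":
        return [(t, "int")]
    if node[0] == "var":
        return []  # nothing is known about a free variable
    left, right = node[1], node[2]
    return [(left[0], right[0]), (t, left[0])] + equations(left) + equations(right)


def solve(eqs):
    """Pass 3: tiny union-find unifier; returns a function tag -> type or tag."""
    parent = {}

    def find(x):
        while x in parent:
            x = parent[x]
        return x

    for a, b in eqs:
        ra, rb = find(a), find(b)
        if ra == rb:
            continue
        if isinstance(ra, str) and isinstance(rb, str):
            raise TypeError(f"cannot unify {ra} with {rb}")
        if isinstance(ra, str):  # keep concrete types as roots
            ra, rb = rb, ra
        parent[ra] = rb
    return find


def annotate(tagged, resolve):
    """Pass 4: rebuild the tree with each tag replaced by its resolved type."""
    t, node = tagged
    if node[0] == "add":
        node = ("add", annotate(node[1], resolve), annotate(node[2], resolve))
    return (resolve(t), node)


source = ("add", ("var", "x"), ("num", 1))
tagged = tag(source, count())
typed = annotate(tagged, solve(equations(tagged)))
print(typed)
```

Every pass returns a new tree or a new list of equations; source and tagged are never modified. A tag that cannot be bound to a concrete type stays an integer in the output, which plays the role of a type parameter.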

There are several options to model this. You may use the same kind of nullable data fields as in your imperative case:
data Exp = Var Name (Maybe Type) | ...
parse :: String -> Maybe Exp -- types are Nothings here
typeCheck :: Exp -> Maybe Exp -- turns Nothings into Justs
or even, using a more precise type
data Exp ty = Var Name ty | ...
parse :: String -> Maybe (Exp ())
typeCheck :: Exp () -> Maybe (Exp Type)
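For readers more at home in Python, the second variant (an AST parameterized by its annotation type) might look roughly like this. It is a sketch only: the Add node, the env dictionary and the use of strings as types are assumptions added for illustration, and None plays the role of Haskell's ().

```python
from dataclasses import dataclass
from typing import Dict, Generic, TypeVar

Ty = TypeVar("Ty")


@dataclass(frozen=True)
class Var(Generic[Ty]):
    name: str
    ty: Ty


@dataclass(frozen=True)
class Add(Generic[Ty]):
    left: object   # a Var or Add carrying the same annotation type
    right: object
    ty: Ty


def type_check(exp, env: Dict[str, str]):
    """Exp[None] -> Exp[str] or None: build a new typed tree, or fail with None."""
    if isinstance(exp, Var):
        ty = env.get(exp.name)
        return None if ty is None else Var(exp.name, ty)
    left, right = type_check(exp.left, env), type_check(exp.right, env)
    if left is None or right is None or left.ty != right.ty:
        return None
    return Add(left, right, left.ty)


# The parser's output carries no types at all.
untyped = Add(Var("x", None), Var("y", None), None)
print(type_check(untyped, {"x": "int", "y": "int"}))
```

The untyped tree is left untouched; the checker returns a second tree of the same shape whose ty fields are filled in.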

I can't speak for how it is supposed to be done, but I did do this in F# for a C# compiler here
The approach was basically to build an AST from the source, leaving things like type information unconstrained. So AST.fs is basically the AST with strings for the type names, function names, etc.
As the AST starts to be compiled to (in this case) .NET IL, we end up with more type information (we create the types in the source; let's call these type-stubs). This then gives us the information needed to create method-stubs (the code may have signatures that include type-stubs as well as built-in types). From here we have enough type information to resolve any of the type names or method signatures in the code.
I store that in the file TypedAST.fs. I do this in a single pass, however the approach may be naive.
Now that we have a fully typed AST, you could do things like compile it, fully analyze it, or whatever you like with it.
So in answer to the question "Does that mean I need to define two types per AST node, one for the syntax phase, and one for the semantic phase?", I can't say definitively that this is the case, but it is certainly what I did, and it appears to be what MS have done with Roslyn (although they have essentially decorated the original tree with type info IIRC)
"Are there purely functional programming tricks that help the compiler writer with this problem?"
Given the ASTs are essentially mirrored in my case, it would be possible to make it generic and transform the tree, but the code may end up (more) horrendous.
i.e.
type AST<'ty> =
    | MethodInvoke of 'ty * Name * 'ty list
    | ....

As with relational databases, in functional programming it is often a good idea not to put everything in a single data structure.
In particular, there may not be a data structure that is "the AST".
Most probably, there will be data structures that represent parsed expressions. One possible way to deal with type information is to assign a unique identifier (like an integer) to each node of the tree already during parsing and have some suitable data structure (like a hash map) that associates those node-ids with types. The job of the type inference pass, then, would be just to create this map.
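A minimal sketch of this side-table design follows. The node kinds, the node and infer helpers and the trivial typing rule for "add" are assumptions chosen to keep the example short.

```python
from dataclasses import dataclass
from itertools import count
from typing import Dict, Tuple

_ids = count()


@dataclass(frozen=True)
class Node:
    node_id: int
    kind: str  # "num", "str" or "add"
    children: Tuple["Node", ...] = ()


def node(kind: str, *children: Node) -> Node:
    """Build a node and give it a unique id, as the parser would."""
    return Node(next(_ids), kind, children)


def infer(n: Node, types: Dict[int, str]) -> Dict[int, str]:
    """Return a new id -> type map extended with entries for `n` and its subtree."""
    for child in n.children:
        types = infer(child, types)
    if n.kind == "add":
        # Simplification: an addition takes the type of its left operand.
        ty = types[n.children[0].node_id]
    else:
        ty = {"num": "int", "str": "string"}[n.kind]
    return {**types, n.node_id: ty}


tree = node("add", node("num"), node("num"))
types = infer(tree, {})
print(types[tree.node_id], len(types))
```

The tree itself never changes; later passes look a node's type up in the map by its id.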