MapCoder issue after updating Beam to 2.35 - apache-beam

After updating Beam from 2.33 to 2.35, I started getting this error:
def estimate_size(self, unused_value, nested=False):
estimate = 4 # 4 bytes for int32 size prefix
> for key, value in unused_value.items():
E AttributeError: 'list' object has no attribute 'items' [while running 'MyGeneric ParDo Job']
../../../python3.8/site-packages/apache_beam/coders/coder_impl.py:677: AttributeError
This is a method of MapCoderImpl. I don't know Beam enough to know when it's called.
Any thoughts on what might be causing it?

Beam uses a Coder to encode and decode the elements of a PCollection. From the error message, Beam tried to use MapCoder to decode your input data. It expected a dict but received a list instead, hence the error.
Additionally, Beam uses the transform functions' type hints to infer the Coder for the output PCollection's elements. My guess is that your function has the wrong return type hint. Assuming you are implementing a DoFn's process method and yield a list in the function body, you'd see the error above if you define the function like this:
def process(self, element, **kwargs) -> List[Dict[A, B]]:
Beam sees the output element's type hint, Dict[A, B], and decides to use MapCoder. You might want to change the type hint to the one below, so that Beam could actually use ListCoder:
def process(self, element, **kwargs) -> List[List[Dict[A, B]]]:
More about the benefits of using type hints is described here.
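As an illustration, here is a minimal sketch of such a DoFn; the class name and the concrete types str/int are made up for the example, not taken from the original pipeline. With the extra outer List in the hint, the inferred element type is List[Dict[str, int]], so Beam no longer tries to apply MapCoder to the element itself:
from typing import Dict, List

import apache_beam as beam

class EmitDictListFn(beam.DoFn):  # hypothetical DoFn for illustration
    # Each yielded value is one output element of type List[Dict[str, int]].
    def process(self, element: str, **kwargs) -> List[List[Dict[str, int]]]:
        yield [{element: 1}, {element: 2}]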

Related

Numba cannot resolve function (np.digitize)

I get an error from Numba complaining that it can't resolve a function.
The minimal code to reproduce the error is:
import numba
import numpy as np
#numba.jit("float64(float64)", nopython=True)
def get_cut_tight_nom(eta):
binning_abseta = np.array([0., 10., 20.])
return np.digitize(eta, binning_abseta)
I don't understand the error message
TypingError: Failed in nopython mode pipeline (step: nopython frontend)
No implementation of function Function(<function digitize at 0x7fdb8c11dee0>) found for signature:
>>> digitize(float64, array(float64, 1d, C))
There are 2 candidate implementations:
- Of which 2 did not match due to:
Overload of function 'digitize': File: numba/np/arraymath.py: Line 3939.
With argument(s): '(float64, array(float64, 1d, C))':
No match.
During: resolving callee type: Function(<function digitize at 0x7fdb8c11dee0>)
During: typing of call at /tmp/ipykernel_309793/3917220133.py (8)
It seems it wants to resolve digitize(float64, array(float64, 1d, C)), but why doesn't a function with that signature match?
It's indeed due to the signatures. The plain NumPy np.digitize (without Numba) returns int64 values (scalar or array), not the float64 you've specified as your function's return type.
It also seems that the Numba implementation requires both arguments to always be arrays, which you'll have to specify explicitly in the signature as well.
So this for example works for me:
#numba.jit("int64[:](float64[:])", nopython=True)
def get_cut_tight_nom(eta):
binning_abseta = np.array([0., 10., 20.])
return np.digitize(eta, binning_abseta)
But do you really need the signature in this case? Numba is able to figure it out itself as well, like:
@numba.njit
def get_cut_tight_nom(eta):
    ...
A signature can still be useful if, for example, you want to explicitly cast float32 inputs to float64.
You can also inspect which signatures Numba comes up with by running the function with differently typed inputs, for example once with float32 and once with float64. That can help highlight where issues like this arise.
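A minimal sketch of that, assuming the @numba.njit version of the function above; the dispatcher's signatures attribute lists every specialization Numba has compiled:
import numba
import numpy as np

@numba.njit
def get_cut_tight_nom(eta):
    binning_abseta = np.array([0., 10., 20.])
    return np.digitize(eta, binning_abseta)

# Call once with float32 input and once with float64 input ...
get_cut_tight_nom(np.array([5.0, 15.0], dtype=np.float32))
get_cut_tight_nom(np.array([5.0, 15.0], dtype=np.float64))
# ... then inspect which signatures Numba compiled.
print(get_cut_tight_nom.signatures)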

Specify type of a TaggedOutput to pass through GroupByKey (as a part of CombinePerKey)

When I tried to migrate my project, which is based on Apache Beam pipelines, from Python 3.7 to 3.8, the type hint check started to fail at this place:
pcoll = (
    wrong_pcoll,
    some_pcoll_1,
    some_pcoll_2,
    some_pcoll_3,
) | beam.Flatten(pipeline=pipeline)
pcoll | beam.CombinePerKey(MyCombineFn()) # << here
with this error:
apache_beam.typehints.decorators.TypeCheckError: Input type hint violation at GroupByKey: expected Tuple[TypeVariable[K], TypeVariable[V]], got Union[TaggedOutput, Tuple[Any, Any], Tuple[Any, _MyType1], Tuple[Any, _MyType2]]
The wrong_pcoll is actually a TaggedOutput because it's received as a tagged output from one of the previous ptransforms.
The type hint check fails because the type of wrong_pcoll, which is a TaggedOutput, becomes part of the type of pcoll (which, according to the exception, is Union[TaggedOutput, Tuple[Any, Any], Tuple[Any, _MyType1], Tuple[Any, _MyType2]]) and is then passed to the GroupByKey used inside CombinePerKey.
So I have two questions:
Why does it work on Python 3.7 but not on 3.8?
How do I specify the type of a tagged output? I tried annotating the process() method of the PTransform that produces it with a union of all the output types it yields, but for some reason the type hint check picked the wrong one. When I strictly specified only the type I need, Tuple[Any, Any], it worked, but that is not an option since process() also yields other types, such as plain str.
As a workaround, I can pass this wrong_pcoll through a simple beam.Map with lambda x: x and .with_output_types(Tuple[Any, Any]), but that does not seem like a clean way to fix it.
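For reference, that pass-through workaround would look roughly like this (a sketch reusing wrong_pcoll from the snippet above; the step label is made up):
from typing import Any, Tuple

import apache_beam as beam

retyped_pcoll = (
    wrong_pcoll
    | "RetypeTaggedOutput" >> beam.Map(lambda x: x).with_output_types(Tuple[Any, Any])
)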
I investigated similar failures recently.
Beam has some type-inferencing capabilities which rely on opcode analysis of the pipeline code. Inference is somewhat limited and conservative; for example, when Beam attempts to infer a function's return type and encounters an opcode that it does not know, it infers the return type as Any. It is also sensitive to the Python minor version.
Python 3.8 removed some opcodes, such as SETUP_LOOP, that Beam didn't handle previously. Therefore, type inference behavior kicked in for some portions of the code where it didn't work before. I've seen pipelines where an increased type inference on Python 3.8 exposed incorrectly-specified hints.
You are running into a bug/limitation in Beam's type inference for multi-output DoFns, tracked in https://issues.apache.org/jira/browse/BEAM-4132. There has been some progress, but it's not completely addressed. As a workaround, you could manually specify the hints. I think beam.Flatten().with_output_types(Tuple[str, Union[_MyType1, _MyType2]]) should work for your case.
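Applied to the pipeline from the question, that would look roughly like this (a sketch; the PCollection names, MyCombineFn, and _MyType1/_MyType2 come from the question):
from typing import Tuple, Union

import apache_beam as beam

pcoll = (
    (wrong_pcoll, some_pcoll_1, some_pcoll_2, some_pcoll_3)
    | beam.Flatten(pipeline=pipeline).with_output_types(
        Tuple[str, Union[_MyType1, _MyType2]])
)
pcoll | beam.CombinePerKey(MyCombineFn())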

What's the difference between Dataset.map(r => xx) and DataFrame.map(r => xx) in Spark 2.0?

Somehow in Spark 2.0, I can use DataFrame.map(r => r.getAs[String]("field")) without problems.
But Dataset.map(r => r.getAs[String]("field")) gives an error that r doesn't have the getAs method.
What's the difference between r in a Dataset and r in a DataFrame, and why does r.getAs only work with DataFrame?
After doing some research on Stack Overflow, I found a helpful answer here:
Encoder error while trying to map dataframe row to updated row
Hope it helps.
Dataset has a type parameter: class Dataset[T]. T is the type of each record in the Dataset. That T might be anything (well, anything for which you can provide an implicit Encoder[T], but that's besides the point).
A map operation on a Dataset applies the provided function to each record, so the r in the map operations you showed will have the type T.
Lastly, DataFrame is actually just an alias for Dataset[Row], which means each record has the type Row. And Row has a method named getAs that takes a type parameter and a String argument, hence you can call getAs[String]("field") on any Row. For any T that doesn't have this method, the call will fail to compile.

Convert Standard.Natural to Ada.Containers.Count_Type

I instantiated the Ada.Containers.Vectors generic package like this:
package My_Vectors is new Ada.Containers.Vectors(
    Element_Type => My_Type,
    Index_Type   => Natural);
Say, I have a vector and a Standard.Natural value declared:
Foo_Vector: My_vectors.Vector;
Bar_Natural: Natural := 4;
If I call
Foo_Vector.Set_Length(Bar_Natural);
I get the following error
expected type "Ada.Containers.Count_Type"
found type "Standard.Natural"
Is there a way to cast Bar_Natural to be of Ada.Containers.Count_Type?
Sorry, I was too stupid to actually read all that my error said. I tried converting the Natural using:
Ada.Containers.Vectors.Count_Type(Bar_Natural)
Which makes zero sense!
Reading the error, it is trivial to see that Count_Type is defined in package Ada.Containers.
The correct conversion would therefore be:
Ada.Containers.Count_Type(Bar_Natural);
Giving
Foo_Vector.Set_Length(Ada.Containers.Count_Type(Bar_Natural));

How to use the Flink fold function in Scala

This is a non-working attempt at using Flink's fold with a Scala anonymous function:
val myFoldFunction = (x: Double, t:(Double,String,String)) => x + t._1
env.readFileStream(...).
...
.groupBy(1)
.fold(0.0, myFoldFunction : Function2[Double, (Double,String,String), Double])
It compiles fine, but at execution time I get a "type erasure issue" (see below). Doing the same in Java works, but is of course more verbose. I like the concise and clear lambdas. How can I do this in Scala?
Caused by: org.apache.flink.api.common.functions.InvalidTypesException:
Type of TypeVariable 'R' in 'public org.apache.flink.streaming.api.scala.DataStream org.apache.flink.streaming.api.scala.DataStream.fold(java.lang.Object,scala.Function2,org.apache.flink.api.common.typeinfo.TypeInformation,scala.reflect.ClassTag)' could not be determined.
This is most likely a type erasure problem.
The type extraction currently supports types with generic variables only in cases where all variables in the return type can be deduced from the input type(s).
The problem you encountered is a bug in Flink [1]. The problem originates from Flink's TypeExtractor and the way the Scala DataStream API is implemented on top of the Java implementation. The TypeExtractor cannot generate a TypeInformation for the Scala type and thus returns a MissingTypeInformation. This missing type information is manually set after creating the StreamFold operator. However, the StreamFold operator is implemented in a way that it does not accept a MissingTypeInformation and, consequently, fails before setting the right type information.
I've opened a pull request [2] to fix this problem. It should be merged within the next two days. By then using the latest 0.10 snapshot version, your problem should be fixed.
[1] https://issues.apache.org/jira/browse/FLINK-2631
[2] https://github.com/apache/flink/pull/1101