Read a list input using the StdIn.readf function - scala

What is the proper way to read a list input using the readf function?
The Scala Standard Library shows that the syntax is:
def readf(format: String): List[Any]
but I can't figure out what to pass as the format argument, and I always get this error:
Solution.scala:11: error: not found: value format
val arr = scala.io.StdIn.readf(format:String)
when I try to store a list of Int in arr.

The scaladoc says:
def readf(format: String): List[Any]
Reads in some structured input (from the default input), specified by a format specifier. See class java.text.MessageFormat for details of the format specification.
format: the format of the input.
returns: a list of all extracted values.
Definition Classes: StdIn
Exceptions thrown: java.io.EOFException if the end of the input stream has been reached.
So, go to your browser, search for "java.text.MessageFormat", and find it:
https://docs.oracle.com/javase/7/docs/api/java/text/MessageFormat.html
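For what it's worth, here is a minimal sketch of a working call. It is not from the answer above, and the pattern and input line are just an illustration; the "not found: value format" error in the question comes from writing format:String literally instead of passing an actual format string, which must be a MessageFormat pattern:
// Assuming the user types a line such as "1 2 3" on standard input,
// three {n,number} placeholders extract the three values.
val values: List[Any] = scala.io.StdIn.readf("{0,number} {1,number} {2,number}")

// MessageFormat's number format parses whole numbers as Long, so convert
// explicitly if a List[Int] is what you need.
val arr: List[Int] = values.map { case n: Number => n.intValue }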

MapCoder issue after updating Beam to 2.35

After updating Beam from 2.33 to 2.35, I started getting this error:
def estimate_size(self, unused_value, nested=False):
estimate = 4 # 4 bytes for int32 size prefix
> for key, value in unused_value.items():
E AttributeError: 'list' object has no attribute 'items' [while running 'MyGeneric ParDo Job']
../../../python3.8/site-packages/apache_beam/coders/coder_impl.py:677: AttributeError
This is a method of MapCoderImpl. I don't know Beam enough to know when it's called.
Any thoughts on what might be causing it?
Beam uses Coder to encode and decode a PCollection. From the error message you got, Beam tried to use MapCoder to decode your input data. It expected a dict but received a list instead, hence the error.
Additionally, Beam uses the transform functions' type hints to infer the Coder for the output PCollection's elements. My guess is that you might be using a wrong return type for your function. Assuming you are implementing a DoFn's process and you yield a list in the function body, you'd see the error above if you define the function like this:
def process(self, element, **kwargs) -> List[Dict[A, B]]:
Beam sees the output element's type hint, Dict[A, B], and decides to use MapCoder. You might want to change the type hint to the one below, so that Beam could actually use ListCoder:
def process(self, element, **kwargs) -> List[List[Dict[A, B]]]:
More about the benefits of using type hints is described here.

Scala function TypeTag[T]: use type T in function

I need to parse several JSON fields, which I'm doing with Play JSON. As parsing may fail, I need to throw a custom exception for each field.
To read a field, I use this:
val fieldData = parseField[String](json \ fieldName, "fieldName")
My parseField function:
def parseField[T](result: JsLookupResult, fieldName: String): T = {
result.asOpt[T].getOrElse(throw new IllegalArgumentException(s"""Can't access $fieldName."""))
}
However, I get an error that reads:
Error:(17, 17) No Json deserializer found for type T. Try to implement
an implicit Reads or Format for this type.
result.asOpt[T].getOrElse(throw new IllegalArgumentException(s"""Can't access $fieldName."""))
Is there a way to tell the asOpt[] to use the type in T?
I strongly suggest that you do not throw exceptions. The Play JSON API has JsSuccess and JsError types that will help you encode parsing errors.
As per the documentation
To convert a Scala object to and from JSON, we use Json.toJson[T: Writes] and Json.fromJson[T: Reads] respectively. Play JSON provides the Reads and Writes typeclasses to define how to read or write specific types. You can get these either by using Play's automatic JSON macros, or by manually defining them. You can also read JSON from a JsValue using validate, as and asOpt methods. Generally it's preferable to use validate since it returns a JsResult which may contain an error if the JSON is malformed.
See https://github.com/playframework/play-json#reading-and-writing-objects
There is also a good example on the Play Discourse forum on how the API manifests in practice.
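If you still want parseField to compile as written, the missing piece is an implicit Reads[T] in scope when asOpt[T] is called. Here is a minimal sketch (my addition, not from the answer above) that requires one via a context bound, plus a validate-based variant in the spirit of the quoted documentation:
import play.api.libs.json._

// The context bound T: Reads supplies the implicit Reads[T] that asOpt[T] needs.
def parseField[T: Reads](result: JsLookupResult, fieldName: String): T =
  result.asOpt[T].getOrElse(
    throw new IllegalArgumentException(s"Can't access $fieldName."))

// Preferable: return a JsResult and let the caller handle JsError instead of throwing.
def parseFieldSafe[T: Reads](result: JsLookupResult, fieldName: String): JsResult[T] =
  result.validate[T]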

How to extract JSON from a binary protobuf?

Consider an Apache Spark 2.2.0 Structured Stream such as:
jsonStream.printSchema()
root
|-- body: binary (nullable = true)
The data inside body is of type Protocol Buffers v2 and a nested JSON. It looks like
syntax = "proto2";
message Data {
required string data = 1;
}
message List {
repeated Data entry = 1;
}
How can I extract the data inside Spark to "further" process it?
I looked into ScalaPB, but since I run my code in Jupyter, I couldn't get the ".proto" code to be included inline. I also do not know how to convert a DataFrame to an RDD on a stream. Trying .rdd failed because of a streaming source.
Update 1: I figured out how to generate Scala files from protobuf specifications using the console tool of ScalaPB. Still, I'm not able to import them because of a "type mismatch".
tl;dr Write a user-defined function (UDF) to deserialize the binary field (of protobuf with a JSON) to JSON.
Think of the serialized body (in binary format) as a table column. Forget about Structured Streaming for a moment (and streaming Datasets).
Let me then rephrase the question to the following:
How to convert (aka cast) a value in binary to [here-your-format]?
Some formats are directly cast-able which makes converting binaries to strings as easy as follows:
$"body" cast "string"
If the string is then a JSON or unixtime you could use built-in "converters", i.e. functions like from_json or from_unixtime.
The introduction should give you a hint how to do conversions like yours.
The data inside body is of type Protocol Buffers v2 and a nested JSON.
To deal with such fields (protobuf + json) you'd have to write a Scala function to decode the "payload" to JSON and create a user-defined function (UDF) using udf:
udf(f: UDF1[_, _], returnType: DataType): UserDefinedFunction Defines a Java UDF1 instance as user-defined function (UDF). The caller must specify the output data type, and there is no automatic input type coercion. By default the returned UDF is deterministic. To change it to nondeterministic, call the API UserDefinedFunction.asNondeterministic().
Then use functions like from_json or get_json_object.
To make your case simpler, write a single-argument function that does the conversion and wrap it into a UDF using the udf function.
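A rough sketch of what that could look like, under my own assumptions: myproto.List stands for a hypothetical ScalaPB-generated class for the List message above (aliased to avoid clashing with scala.List), and jsonSchema and the "field" name are placeholders for your actual JSON structure:
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{explode, from_json, udf}
import org.apache.spark.sql.types.{StringType, StructField, StructType}
import myproto.{List => ProtoList}   // hypothetical ScalaPB-generated class

val spark: SparkSession = SparkSession.builder.getOrCreate()
import spark.implicits._

// Single-argument function wrapped into a UDF: decode the protobuf
// payload and return the JSON strings carried in its repeated field.
val protoToJson = udf { bytes: Array[Byte] =>
  ProtoList.parseFrom(bytes).entry.map(_.data)
}

// Placeholder schema for the nested JSON.
val jsonSchema = StructType(Seq(StructField("field", StringType)))

val parsed = jsonStream
  .select(explode(protoToJson($"body")) as "json")
  .select(from_json($"json", jsonSchema) as "data")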
Trying .rdd failed because of a streaming source.
Use Dataset.foreach or foreachPartition.
foreach(f: (T) ⇒ Unit): Unit Applies a function f to all rows.
foreachPartition(f: (Iterator[T]) ⇒ Unit): Unit Applies a function f to each partition of this Dataset.

Scala io.Source fromFile

I have these two lines (among all the others):
import scala.io.Source
val source = Source.fromFile(filename)
As I understand it, this is a way to read file content. I have read
http://www.scala-lang.org/api/2.12.x/scala/io/Source.html#iter:Iterator[Char]
I still do not get what Source.fromFile represents: is it one of the Type Members, or something else?
From the Scala API stated here, fromFile is a method defined on the Source companion object. It is a curried method: the first parameter list takes a single String representing the path of the file to be read, and the second parameter list takes a single implicit codec argument of type scala.io.Codec. The function returns a BufferedSource object.
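A small sketch to make that concrete (the file name and the explicit codec are just for illustration):
import scala.io.{BufferedSource, Codec, Source}

// fromFile(name: String)(implicit codec: Codec): BufferedSource
val source: BufferedSource = Source.fromFile("data.txt")(Codec.UTF8)

try {
  // BufferedSource exposes the file contents, e.g. line by line.
  source.getLines().foreach(println)
} finally {
  source.close()   // always release the underlying stream
}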

List equality using parser combinators

I've grabbed some Scala CSV parsing code from here:
Use Scala parser combinator to parse CSV files
And then I tried to write a basic test for it:
assertEquals(List(List()), CSV.parse(""))
And this fails, with message:
java.lang.AssertionError: expected: scala.collection.immutable.$colon$colon<List(List())> but was: scala.collection.immutable.$colon$colon<List(List())>
Any ideas? The output from CSV.parse is an empty List[List[String]] but it seems to have a different hashCode than List(Nil) or List(List[String]()) etc. I can't seem to find any way to compose a list which is equal to the output of CSV.parse("").
UPDATE:
Here is the failure using REPL:
scala> assertEquals(List(Nil), CSV.parse(""))
java.lang.AssertionError: expected: scala.collection.immutable.$colon$colon<List(List())> but was: scala.collection.immutable.$colon$colon<List(List())>
Edited: I tried the parser you supplied in the link:
scala> CSV.parse("")
res7: List[List[String]] = List(List(""))
So apparently, it doesn't return a List with an empty List, but a List with a List with the empty string. So your test should fail.
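So, for reference, an assertion written against what the parser actually returns should pass (same JUnit assertEquals and CSV object as in the question):
// List(List("")) == CSV.parse("") is true, so this assertion passes.
assertEquals(List(List("")), CSV.parse(""))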