Apache Flink Streaming type mismatch in flatMap function - scala

I am trying to use the streaming API of Flink 0.10.0 with Scala 2.10.4. While trying to compile this first version:
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment
import org.apache.flink.streaming.api.scala.DataStream
import org.apache.flink.streaming.api.windowing.time._

object Main {
  def main(args: Array[String]) {
    val env = StreamExecutionEnvironment.getExecutionEnvironment
    val text = env.socketTextStream("localhost", 9999)
    val words: DataStream[String] = text.flatMap[String](
      new Function[String, TraversableOnce[String]] {
        def apply(line: String): TraversableOnce[String] = line.split(" ")
      })
    env.execute("Window Stream wordcount")
  }
}
I am getting a compile-time error:
[error] found : String => TraversableOnce[String]
[error] required: org.apache.flink.api.common.functions.FlatMapFunction[String,String]
[error] new Function[String,TraversableOnce[String]] { def apply(line:String):TraversableOnce[String] = line.split(" ")})
[error] ^
And in the decompiled version of DataStream.class that I have included in the project there are flatMap overloads that accept such a type (the last one):
public <R> DataStream<R> flatMap(FlatMapFunction<T, R> flatMapper, TypeInformation<R> evidence$12, ClassTag<R> evidence$13) {
    if (flatMapper == null) {
        throw new NullPointerException("FlatMap function must not be null.");
    }
    TypeInformation outType = (TypeInformation)Predef..MODULE$.implicitly(evidence$12);
    return package..MODULE$.javaToScalaStream((org.apache.flink.streaming.api.datastream.DataStream)this.javaStream.flatMap(flatMapper).returns(outType));
}

public <R> DataStream<R> flatMap(Function2<T, Collector<R>, BoxedUnit> fun, TypeInformation<R> evidence$14, ClassTag<R> evidence$15) {
    if (fun == null) {
        throw new NullPointerException("FlatMap function must not be null.");
    }
    Function2<T, Collector<R>, BoxedUnit> cleanFun = this.clean((F)fun);
    .anon flatMapper = new /* Unavailable Anonymous Inner Class!! */;
    return this.flatMap((FlatMapFunction<T, R>)flatMapper, evidence$14, evidence$15);
}

public <R> DataStream<R> flatMap(Function1<T, TraversableOnce<R>> fun, TypeInformation<R> evidence$16, ClassTag<R> evidence$17) {
    if (fun == null) {
        throw new NullPointerException("FlatMap function must not be null.");
    }
    Function1<T, TraversableOnce<R>> cleanFun = this.clean((F)fun);
    .anon flatMapper = new /* Unavailable Anonymous Inner Class!! */;
    return this.flatMap((FlatMapFunction<T, R>)flatMapper, evidence$16, evidence$17);
}
What could be wrong here? I would be grateful if you could give some insight.
Thank you in advance.

The problem is that you are importing the Java StreamExecutionEnvironment of Flink: org.apache.flink.streaming.api.environment.StreamExecutionEnvironment.
You have to use the Scala variant of the StreamExecutionEnvironment like this: import org.apache.flink.streaming.api.scala.StreamExecutionEnvironment.
With that change, everything builds successfully.
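For reference, a minimal sketch of the corrected program with the Scala imports in place (a sketch against the Flink 0.10 Scala API; with those implicits in scope, flatMap accepts a plain Scala function literal):
import org.apache.flink.streaming.api.scala._

object Main {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment
    val text = env.socketTextStream("localhost", 9999)
    // the Scala DataStream overload takes a T => TraversableOnce[R]
    val words: DataStream[String] = text.flatMap(line => line.split(" "))
    words.print()
    env.execute("Window Stream wordcount")
  }
}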
Original answer:
The problem is that you are passing a plain Scala Function to the flatMap() method. However, flatMap() expects a FlatMapFunction, and the words have to be emitted through the Collector:
val words: DataStream[String] = text.flatMap[String](
  new FlatMapFunction[String, String] {
    override def flatMap(t: String, collector: Collector[String]): Unit =
      t.split(" ").foreach(collector.collect) // emit each word instead of discarding the array
  })

Related

How to extract value from Scala cats IO

I need to get the Array[Byte] value from ioArray, which is an IO[Array[Byte]] (IO is from the cats-effect library):
object MyTransactionInputApp extends App {

  val ioArray: IO[Array[Byte]] = generateKryoBinary()
  val i: Array[Byte] = ioArray.unsafeRunSync()
  println(i)

  def generateKryoBinaryIO(transaction: Transaction): IO[Array[Byte]] = {
    KryoSerializer
      .forAsync[IO](kryoRegistrar)
      .use { implicit kryo =>
        transaction.toBinary.liftTo[IO]
      }
  }

  def generateKryoBinary(): IO[Array[Byte]] = {
    val transaction = new Transaction(Hash(""), "", "", "", "", "")
    val ioArray = generateKryoBinaryIO(transaction)
    return ioArray
  }
}
I tried the below, but it is not working:
val i: Array[Byte] = for {
  array <- ioArray
} yield array
If you have just started working with cats-effect, I recommend reading about cats.effect.IOApp, which runs your IO.
Otherwise, simple solutions would be:
run it explicitly and get the result:
import cats.effect.unsafe.implicits.global
ioArray.unsafeRunSync()
or maybe work with Future:
import cats.effect.unsafe.implicits.global
ioArray.unsafeToFuture()
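Following the IOApp recommendation above, a minimal sketch (assuming cats-effect 3, which the cats.effect.unsafe.implicits.global import suggests) could look like this:
import cats.effect.{ExitCode, IO, IOApp}

object MyTransactionInputApp extends IOApp {
  // IOApp runs the effect itself; the Array[Byte] is used inside the IO
  def run(args: List[String]): IO[ExitCode] =
    for {
      array <- generateKryoBinary() // the IO[Array[Byte]] from your code above
      _     <- IO(println(array.length))
    } yield ExitCode.Success
}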
Could you give us more context about your application?

Registering UDF's dynamically using scala reflect not working

I am registering my UDFs dynamically using Scala reflection as shown below, and this code works fine. However, when I try to list Spark functions using spark.catalog, I don't see it. Can you please help me understand what I am missing here:
spark.catalog.listFunctions().foreach { fun =>
  if (fun.name == "ModelIdToModelYear") {
    println(fun.name)
  }
}

def registerUDF(spark: SparkSession): Unit = {
  val runtimeMirror = scala.reflect.runtime.universe.runtimeMirror(getClass.getClassLoader)
  val moduleSymbol = runtimeMirror.moduleSymbol(Class.forName("com.local.practice.udf.UdfModelIdToModelYear"))
  val targetMethod = moduleSymbol.typeSignature.members.filter { x =>
    x.isMethod && x.name.toString == "ModelIdToModelYear"
  }.head.asMethod
  val function = runtimeMirror.reflect(runtimeMirror.reflectModule(moduleSymbol).instance).reflectMethod(targetMethod)
  function(spark.udf)
}
Below is my UDF definition:
package com.local.practice.udf

import org.apache.spark.sql.expressions.UserDefinedFunction
import org.apache.spark.sql.functions.udf

//noinspection ScalaStyle
object UdfModelIdToModelYear {

  val ModelIdToModelYear: UserDefinedFunction = udf((model_id: String) => {
    val numPattern = "(\\d{2})_.+".r
    numPattern.findFirstIn(model_id).getOrElse("0").toInt
  })
}

Spark cannot serialize object inside a function?

I am using Spark 1.5.
I run the following code and get the exception java.lang.ExceptionInInitializerError:
def work() = {
  val rdd1: RDD[String] = read_rdd()
  val map: Map[String, Boolean] = read_a_map()

  object brx extends Serializable { val value = map }

  def filter(rdd: RDD[String]) = {
    rdd filter brx.value.apply // throws exception
  }

  def filter2(rdd: RDD[String]) = {
    rdd filter map // works fine
  }

  val x = filter(rdd1)  // throws exception
  val x = filter2(rdd1) // works fine
}
Why did I get java.lang.ExceptionInInitializerError?

How to read a zip containing multiple files in Apache Spark

I have a zipped file containing multiple text files.
I want to read each of the files and build a list of RDDs containing the content of each file.
val test = sc.textFile("/Volumes/work/data/kaggle/dato/test/5.zip")
will just read the entire file, but how do I iterate through each entry of the zip and then save its content in an RDD using Spark?
I am fine with Scala or Python.
A possible solution in Python using Spark:
import zipfile

archive = zipfile.ZipFile(archive_path, 'r')
file_paths = zipfile.ZipFile.namelist(archive)
for file_path in file_paths:
    urls = file_path.split("/")
    urlId = urls[-1].split('_')[0]
Apache Spark default compression support
I have written all the necessary theory in another answer, which you might want to refer to: https://stackoverflow.com/a/45958182/1549135
Read zip containing multiple files
I have followed the advice given by @Herman and used ZipInputStream. This gave me the following solution, which returns an RDD[String] of the zip content.
import java.io.{BufferedReader, InputStreamReader}
import java.util.zip.ZipInputStream

import org.apache.spark.SparkContext
import org.apache.spark.input.PortableDataStream
import org.apache.spark.rdd.RDD

implicit class ZipSparkContext(val sc: SparkContext) extends AnyVal {

  def readFile(path: String,
               minPartitions: Int = sc.defaultMinPartitions): RDD[String] = {

    if (path.endsWith(".zip")) {
      sc.binaryFiles(path, minPartitions)
        .flatMap { case (name: String, content: PortableDataStream) =>
          val zis = new ZipInputStream(content.open)
          Stream.continually(zis.getNextEntry)
            .takeWhile {
              case null => zis.close(); false
              case _    => true
            }
            .flatMap { _ =>
              val br = new BufferedReader(new InputStreamReader(zis))
              Stream.continually(br.readLine()).takeWhile(_ != null)
            }
        }
    } else {
      sc.textFile(path, minPartitions)
    }
  }
}
Simply use it by importing the implicit class and calling the readFile method on SparkContext:
import com.github.atais.spark.Implicits.ZipSparkContext
sc.readFile(path)
If you are reading binary files use sc.binaryFiles. This will return an RDD of tuples containing the file name and a PortableDataStream. You can feed the latter into a ZipInputStream.
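A minimal sketch of that idea (the path is illustrative and sc is an existing SparkContext):
import java.util.zip.ZipInputStream

val lines = sc.binaryFiles("/path/to/archives/*.zip").flatMap {
  case (name, content) => // (file name, PortableDataStream)
    val zis = new ZipInputStream(content.open())
    try {
      Iterator
        .continually(zis.getNextEntry)
        .takeWhile(_ != null)
        .flatMap(_ => scala.io.Source.fromInputStream(zis).getLines())
        .toList // materialize before the stream is closed
    } finally zis.close()
}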
Here's a working version of @Atais' solution (which needs enhancement by closing the streams):
implicit class ZipSparkContext(val sc: SparkContext) extends AnyVal {

  def readFile(path: String,
               minPartitions: Int = sc.defaultMinPartitions): RDD[String] = {

    if (path.toLowerCase.contains("zip")) {
      sc.binaryFiles(path, minPartitions)
        .flatMap { case (zipFilePath, zipContent) ⇒
          val zipInputStream = new ZipInputStream(zipContent.open())
          Stream.continually(zipInputStream.getNextEntry)
            .takeWhile(_ != null)
            .map { _ ⇒
              scala.io.Source.fromInputStream(zipInputStream, "UTF-8").getLines.mkString("\n")
            } #::: { zipInputStream.close; Stream.empty[String] }
        }
    } else {
      sc.textFile(path, minPartitions)
    }
  }
}
Then all you have to do is the following to read a zip file:
sc.readFile(path)
This reads only the first line of each zip entry. Can anyone share their insights? I am trying to read a CSV file which is zipped and create a JavaRDD for further processing.
JavaPairRDD<String, PortableDataStream> zipData =
        sc.binaryFiles("hdfs://temp.zip");

JavaRDD<Record> newRDDRecord = zipData.flatMap(
    new FlatMapFunction<Tuple2<String, PortableDataStream>, Record>() {
        public Iterator<Record> call(Tuple2<String, PortableDataStream> content) throws Exception {
            List<Record> records = new ArrayList<Record>();
            ZipInputStream zin = new ZipInputStream(content._2.open());
            ZipEntry zipEntry;
            int count = 0;
            while ((zipEntry = zin.getNextEntry()) != null) {
                count++;
                if (!zipEntry.isDirectory()) {
                    Record sd;
                    String line;
                    InputStreamReader streamReader = new InputStreamReader(zin);
                    BufferedReader bufferedReader = new BufferedReader(streamReader);
                    // note: only a single line is read per zip entry here
                    line = bufferedReader.readLine();
                    String[] fields = new CSVParser().parseLineMulti(line);
                    sd = new Record(TimeBuilder.convertStringToTimestamp(fields[0]),
                            getDefaultValue(fields[1]),
                            getDefaultValue(fields[22]));
                    records.add(sd);
                }
            }
            return records.iterator();
        }
    });
Here is another working solution, which also gives out the file name so that it can later be split and used to create separate schemas.
implicit class ZipSparkContext(val sc: SparkContext) extends AnyVal {

  def readFile(path: String,
               minPartitions: Int = sc.defaultMinPartitions): RDD[String] = {

    if (path.toLowerCase.contains("zip")) {
      sc.binaryFiles(path, minPartitions)
        .flatMap { case (zipFilePath, zipContent) ⇒
          val zipInputStream = new ZipInputStream(zipContent.open())
          Stream.continually(zipInputStream.getNextEntry)
            .takeWhile(_ != null)
            .map { x ⇒
              val filename1 = x.getName
              scala.io.Source.fromInputStream(zipInputStream, "UTF-8").getLines.mkString(s"~${filename1}\n") + s"~${filename1}"
            } #::: { zipInputStream.close; Stream.empty[String] }
        }
    } else {
      sc.textFile(path, minPartitions)
    }
  }
}
The full code is here:
https://github.com/kali786516/Spark2StructuredStreaming/blob/master/src/main/scala/com/dataframe/extraDFExamples/SparkReadZipFiles.scala

Creating serializable objects from Scala source code at runtime

To embed Scala as a "scripting language", I need to be able to compile text fragments to simple objects, such as Function0[Unit] that can be serialised to and deserialised from disk and which can be loaded into the current runtime and executed.
How would I go about this?
Say for example, my text fragment is (purely hypothetical):
Document.current.elements.headOption.foreach(_.open())
This might be wrapped into the following complete text:
package myapp.userscripts

import myapp.DSL._

object UserFunction1234 extends Function0[Unit] {
  def apply(): Unit = {
    Document.current.elements.headOption.foreach(_.open())
  }
}
What comes next? Should I use IMain to compile this code? I don't want to use the normal interpreter mode, because the compilation should be "context-free" and not accumulate requests.
What I need to get hold of from the compilation is, I guess, the binary class file? In that case, serialisation is straightforward (a byte array). How would I then load that class into the runtime and invoke the apply method?
What happens if the code compiles to multiple auxiliary classes? The example above contains a closure _.open(). How do I make sure I "package" all those auxiliary things into one object to serialize and class-load?
Note: Given that Scala 2.11 is imminent and the compiler API has probably changed, I am happy to receive hints as to how to approach this problem on Scala 2.11.
Here is one idea: use a regular Scala compiler instance. Unfortunately it seems to require the use of hard disk files both for input and output. So we use temporary files for that. The output will be zipped up in a JAR which will be stored as a byte array (that would go into the hypothetical serialization process). We need a special class loader to retrieve the class again from the extracted JAR.
The following assumes Scala 2.10.3 with the scala-compiler library on the class path:
import scala.tools.nsc
import java.io._
import scala.annotation.tailrec
Wrapping user provided code in a function class with a synthetic name that will be incremented for each new fragment:
val packageName = "myapp"
var userCount = 0

def mkFunName(): String = {
  val c = userCount
  userCount += 1
  s"Fun$c"
}

def wrapSource(source: String): (String, String) = {
  val fun = mkFunName()
  val code = s"""package $packageName
                |
                |class $fun extends Function0[Unit] {
                |  def apply(): Unit = {
                |    $source
                |  }
                |}
                |""".stripMargin
  (fun, code)
}
A function to compile a source fragment and return the byte array of the resulting jar:
/** Compiles a source code consisting of a body which is wrapped in a `Function0`
  * apply method, and returns the function's class name (without package) and the
  * raw jar file produced in the compilation.
  */
def compile(source: String): (String, Array[Byte]) = {
  val set = new nsc.Settings
  val d = File.createTempFile("temp", ".out")
  d.delete(); d.mkdir()
  set.d.value = d.getPath
  set.usejavacp.value = true
  val compiler = new nsc.Global(set)
  val f = File.createTempFile("temp", ".scala")
  val out = new BufferedOutputStream(new FileOutputStream(f))
  val (fun, code) = wrapSource(source)
  out.write(code.getBytes("UTF-8"))
  out.flush(); out.close()
  val run = new compiler.Run()
  run.compile(List(f.getPath))
  f.delete()

  val bytes = packJar(d)
  deleteDir(d)

  (fun, bytes)
}

def deleteDir(base: File): Unit = {
  base.listFiles().foreach { f =>
    if (f.isFile) f.delete()
    else deleteDir(f)
  }
  base.delete()
}
Note: Doesn't handle compiler errors yet!
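One way to surface compiler errors, as a sketch against the 2.10 compiler API (reusing set and f from compile above), is to hand a StoreReporter to Global and inspect it after the run:
import scala.tools.nsc.reporters.StoreReporter

val reporter = new StoreReporter
val compiler = new nsc.Global(set, reporter) // instead of new nsc.Global(set)
val run = new compiler.Run()
run.compile(List(f.getPath))
if (reporter.hasErrors) {
  val messages = reporter.infos.collect {
    case info if info.severity == reporter.ERROR => s"${info.pos}: ${info.msg}"
  }
  sys.error(messages.mkString("Compilation failed:\n", "\n", ""))
}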
The packJar method uses the compiler output directory and produces an in-memory jar file from it:
// cf. http://stackoverflow.com/questions/1281229
def packJar(base: File): Array[Byte] = {
  import java.util.jar._

  val mf = new Manifest
  mf.getMainAttributes.put(Attributes.Name.MANIFEST_VERSION, "1.0")
  val bs = new java.io.ByteArrayOutputStream
  val out = new JarOutputStream(bs, mf)

  def add(prefix: String, f: File): Unit = {
    val name0 = prefix + f.getName
    val name = if (f.isDirectory) name0 + "/" else name0
    val entry = new JarEntry(name)
    entry.setTime(f.lastModified())
    out.putNextEntry(entry)
    if (f.isFile) {
      val in = new BufferedInputStream(new FileInputStream(f))
      try {
        val buf = new Array[Byte](1024)
        @tailrec def loop(): Unit = {
          val count = in.read(buf)
          if (count >= 0) {
            out.write(buf, 0, count)
            loop()
          }
        }
        loop()
      } finally {
        in.close()
      }
    }
    out.closeEntry()
    if (f.isDirectory) f.listFiles.foreach(add(name, _))
  }

  base.listFiles().foreach(add("", _))
  out.close()
  bs.toByteArray
}
A utility function that takes the byte array found in deserialization and creates a map from class names to class byte code:
def unpackJar(bytes: Array[Byte]): Map[String, Array[Byte]] = {
  import java.util.jar._
  import scala.annotation.tailrec

  val in = new JarInputStream(new ByteArrayInputStream(bytes))
  val b = Map.newBuilder[String, Array[Byte]]

  @tailrec def loop(): Unit = {
    val entry = in.getNextJarEntry
    if (entry != null) {
      if (!entry.isDirectory) {
        val name = entry.getName

        // cf. http://stackoverflow.com/questions/8909743
        val bs = new ByteArrayOutputStream
        var i = 0
        while (i >= 0) {
          i = in.read()
          if (i >= 0) bs.write(i)
        }
        val bytes = bs.toByteArray
        b += mkClassName(name) -> bytes
      }
      loop()
    }
  }
  loop()
  in.close()
  b.result()
}

def mkClassName(path: String): String = {
  require(path.endsWith(".class"))
  path.substring(0, path.length - 6).replace("/", ".")
}
A suitable class loader:
class MemoryClassLoader(map: Map[String, Array[Byte]]) extends ClassLoader {
  override protected def findClass(name: String): Class[_] =
    map.get(name).map { bytes =>
      println(s"defineClass($name, ...)")
      defineClass(name, bytes, 0, bytes.length)
    } .getOrElse(super.findClass(name)) // throws exception
}
And a test case which contains additional classes (closures):
val exampleSource =
  """val xs = List("hello", "world")
    |println(xs.map(_.capitalize).mkString(" "))
    |""".stripMargin

def test(fun: String, cl: ClassLoader): Unit = {
  val clName = s"$packageName.$fun"
  println(s"Resolving class '$clName'...")
  val clazz = Class.forName(clName, true, cl)
  println("Instantiating...")
  val x = clazz.newInstance().asInstanceOf[() => Unit]
  println("Invoking 'apply':")
  x()
}

locally {
  println("Compiling...")
  val (fun, bytes) = compile(exampleSource)
  val map = unpackJar(bytes)
  println("Classes found:")
  map.keys.foreach(k => println(s"  '$k'"))
  val cl = new MemoryClassLoader(map)
  test(fun, cl) // should call `defineClass`
  test(fun, cl) // should find cached class
}