Converting Spark UDF written in SCALA to JAVA - scala

Below is my Spark UDF. Can anyone help me convert this into Java?
val customUDF = udf((array: Seq[String]) => {
  val newts = array.filter(_.nonEmpty)
  if (newts.size == 0) null
  else newts.head
})

Something like:
UDF1<WrappedArray<String>, String> my_udf = new UDF1<WrappedArray<String>, String>() {
    public String call(WrappedArray<String> arr) throws Exception {
        // pseudocode: not sure how to express the Scala filter in Java
        String[] newts = arr.filter(_.nonEmpty);
        if (newts.length == 0) {
            return null;
        } else {
            return newts[0];
        }
    }
};
spark.udf().register("my_udf", my_udf, DataTypes.StringType);

You can do it in two ways: inline using a lambda (i.e. Scala style), or by defining a method and registering it.
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.api.java.UDF1;
import org.apache.spark.sql.types.DataTypes;
import java.util.List;

public class SimpleUDF {

    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder().master("local[*]").getOrCreate();

        // Option 1: inline using a lambda
        // Note: depending on the Spark version, array columns may arrive as a
        // Scala WrappedArray rather than a java.util.List; adjust the type if needed.
        spark.sqlContext()
             .udf()
             .register("sampleUDFLambda",
                 (List<String> array) -> array.stream()
                     .filter(element -> !element.isEmpty())
                     .findFirst()
                     .orElse(null),
                 DataTypes.StringType);

        // Option 2: define a function and register it
        spark.sqlContext().udf().register("sampleUDF", sampleUdf(), DataTypes.StringType);
    }

    // Or you can define a function (static, so it can be called from main)
    private static UDF1<List<String>, String> sampleUdf() {
        return array -> array.stream()
            .filter(element -> !element.isEmpty())
            .findFirst()
            .orElse(null);
    }
}
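A quick way to sanity-check the registered UDF is from Spark SQL. A minimal sketch in Scala (the tags column and events view are made up for illustration):

import spark.implicits._

val df = Seq(Seq("", "x", "y"), Seq("", "")).toDF("tags")
df.createOrReplaceTempView("events")

// expected: "x" for the first row, null for the second
spark.sql("SELECT sampleUDFLambda(tags) AS first_non_empty FROM events").show()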

Related

Exception in thread "main" java.lang.IndexOutOfBoundsException Scala gZip and decode base64 code

I am trying to Base64-decode and then decompress (gzip) a string.
I have written one class for this and a test harness.
I am getting
Exception in thread "main" java.lang.IndexOutOfBoundsException
on the line
while ((readByte = gzipIS.read(gzipByteBuffer)) != -1) byteOS.write(
Base code:
import java.io.BufferedReader
import java.io.ByteArrayInputStream
import java.io.ByteArrayOutputStream
import java.io.IOException
import java.io.InputStreamReader
import java.io.UnsupportedEncodingException
import java.nio.charset.StandardCharsets
import java.util.zip.GZIPInputStream
import java.util.zip.GZIPOutputStream
import org.apache.commons.codec.binary.Base64
import org.apache.commons.io.IOUtils

class ZipUtilities {

  def unzipCCBRsponse(inputB64: String): String = {
    val bDecodeBase64: Array[Byte] = Base64.decodeBase64(inputB64)
    var zipInputStream: GZIPInputStream = null
    try {
      zipInputStream = new GZIPInputStream(
        new ByteArrayInputStream(bDecodeBase64, 4, bDecodeBase64.length - 4))
      val inputStreamReader: InputStreamReader =
        new InputStreamReader(zipInputStream, StandardCharsets.UTF_8)
      val bufferedReader: BufferedReader = new BufferedReader(inputStreamReader)
      val output: StringBuilder = new StringBuilder()
      var line: String = null
      while ((line = bufferedReader.readLine()) != null) output.append(line)
      println("Output String Length: " + output.length)
      bufferedReader.close()
      output.toString
    } catch {
      case e: IOException => e.printStackTrace()
    }
    null
  }

  def decodeBase64(b64EncodedString: String): String = {
    var bDecodeBase64: Array[Byte] = null
    try {
      bDecodeBase64 = Base64.decodeBase64(b64EncodedString.getBytes("ISO-8859-1"))
      new String(bDecodeBase64)
    } catch {
      case e: UnsupportedEncodingException => e.printStackTrace()
    }
    null
  }

  def decodeFromGzip(input: String): String = {
    var output: String = null
    if (input != null && !input.isEmpty) {
      try {
        var readByte: Int = 0
        val gzipByteBuffer: Array[Byte] = Array.ofDim[Byte](2048)
        val gzipIS: GZIPInputStream = new GZIPInputStream(
          new ByteArrayInputStream(Base64.decodeBase64(input)))
        val byteOS: ByteArrayOutputStream = new ByteArrayOutputStream()
        while ((readByte = gzipIS.read(gzipByteBuffer)) != -1) byteOS.write(gzipByteBuffer, 0, readByte)
        byteOS.close()
        output = new String(byteOS.toByteArray())
      } catch {
        case e: IOException => e.printStackTrace()
      }
    }
    output
  }
}
The test code is:
object ZipTest {

  def main(args: Array[String]): Unit = {
    val b64: String =
      "H4sIAAAAAAAAAO2dW1PbOBTH3/spNHkvgXYvLVPoCFlJtNiSR5IT8sRkUxeYJQmThMJ++5Xt3LjNNsdwVprNC8N4fPz/2dblXGTly9f70TX5kU9nV5PxUeNgb79B8vFw8u1qfHHUuJ1"
    val util: ZipUtilities = new ZipUtilities()
    println(util.decodeFromGzip(b64).replaceAll("\n", ""))
  }
}
In Scala, an assignment such as (readByte = gzipIS.read(gzipByteBuffer)) evaluates to Unit, and since a value of type Unit will never equal -1 (or any Int value), your while loop is effectively infinite. (The readLine loop in unzipCCBRsponse has the same problem.)
The compiler should have alerted you:
warning: comparing values of types Unit and Int using `!=' will always yield true
Note: you seldom see null or var in idiomatic Scala code; they're almost never needed.
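For what it's worth, the read loop can be written without the assignment-in-condition pattern at all. A minimal sketch of just the loop, reusing the variable names from decodeFromGzip:

val gzipIS = new GZIPInputStream(new ByteArrayInputStream(Base64.decodeBase64(input)))
val byteOS = new ByteArrayOutputStream()
val gzipByteBuffer = Array.ofDim[Byte](2048)

// continually re-evaluates read() on each step; takeWhile stops at end of stream
Iterator
  .continually(gzipIS.read(gzipByteBuffer))
  .takeWhile(_ != -1)
  .foreach(count => byteOS.write(gzipByteBuffer, 0, count))

val output = new String(byteOS.toByteArray, StandardCharsets.UTF_8)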

How to unit test gauge metrics in Flink

I have a datastream in Flink and I generate my own metrics using a gauge in a ProcessFunction.
As these metrics are important for my activity, I would like to unit test them once the flow has executed.
Unfortunately, I didn't find a way to implement a proper test reporter.
Here is a simple piece of code explaining my issue.
Two concerns with this code:
how do I trigger the gauge?
how do I get the reporter instantiated by env.execute?
Here is the sample:
import java.util.concurrent.atomic.AtomicInteger
import org.apache.flink.api.scala.metrics.ScalaGauge
import org.apache.flink.configuration.{ConfigConstants, Configuration}
import org.apache.flink.metrics.reporter.AbstractReporter
import org.apache.flink.metrics.{Gauge, Metric, MetricConfig}
import org.apache.flink.streaming.api.TimeCharacteristic
import org.apache.flink.streaming.api.functions.ProcessFunction
import org.apache.flink.streaming.api.functions.sink.SinkFunction
import org.apache.flink.streaming.api.scala.{StreamExecutionEnvironment, _}
import org.apache.flink.util.Collector
import org.scalatest.FunSuite
import org.scalatest.Matchers._
import org.scalatest.PartialFunctionValues._
import scala.collection.JavaConverters._
import scala.collection.mutable

/* Test based on Flink test example https://ci.apache.org/projects/flink/flink-docs-master/dev/stream/testing.html */
class MultiplyByTwo extends ProcessFunction[Long, Long] {

  override def processElement(data: Long, context: ProcessFunction[Long, Long]#Context, collector: Collector[Long]): Unit = {
    collector.collect(data * 2L)
  }

  val nbrCalls = new AtomicInteger(0)

  override def open(parameters: Configuration): Unit = {
    getRuntimeContext.getMetricGroup
      .addGroup("counter")
      .gauge[Int, ScalaGauge[Int]]("call", ScalaGauge[Int](() => nbrCalls.get()))
  }
}

// create a testing sink
class CollectSink extends SinkFunction[Long] {
  override def invoke(value: Long): Unit = {
    synchronized {
      CollectSink.values.add(value)
    }
  }
}

object CollectSink {
  val values: java.util.ArrayList[Long] = new java.util.ArrayList[Long]()
}

class StackOverflowTestReporter extends AbstractReporter {

  var gaugesMetrics: mutable.Map[String, String] = mutable.Map[String, String]()

  override def open(metricConfig: MetricConfig): Unit = {}

  override def close(): Unit = {}

  override def filterCharacters(s: String): String = s

  def report(): Unit = {
    gaugesMetrics = this.gauges.asScala.map(t => (metricValue(t._1), t._2))
  }

  private def metricValue(m: Metric): String = {
    m match {
      case g: Gauge[_] => g.getValue.toString
      case _ => ""
    }
  }
}

class StackOverflowTest extends FunSuite with StreamingMultipleProgramsTestBase {

  def createConfigForReporter(reporterName: String): Configuration = {
    val cfg: Configuration = new Configuration()
    cfg.setString(ConfigConstants.METRICS_REPORTER_PREFIX + reporterName + "." + ConfigConstants.METRICS_REPORTER_CLASS_SUFFIX, classOf[StackOverflowTestReporter].getName)
    cfg
  }

  test("test_metrics") {
    val env = StreamExecutionEnvironment.createLocalEnvironment(
      StreamExecutionEnvironment.getDefaultLocalParallelism,
      createConfigForReporter("reporter"))

    // configure your test environment
    env.setParallelism(1)
    env.setStreamTimeCharacteristic(TimeCharacteristic.ProcessingTime)

    // values are collected in a static variable
    CollectSink.values.clear()

    // create a stream of custom elements and apply transformations
    env.fromElements[Long](1L, 21L, 22L)
      .process(new MultiplyByTwo())
      .addSink(new CollectSink())

    // execute
    env.execute()

    // verify your results
    CollectSink.values should have length 3
    CollectSink.values should contain (2L)
    CollectSink.values should contain (42L)
    CollectSink.values should contain (44L)

    // verify gauge counter
    // pseudo code ...
    val testReporter: StackOverflowTestReporter = _ // how to get the testReporter instantiated by env?
    testReporter.gaugesMetrics should have size 1
    testReporter.gaugesMetrics should contain key "count.call"
    testReporter.gaugesMetrics.valueAt("count.call") should be equals("3")
  }
}
Solution (thanks to Chesnay Schepler):
import java.util.concurrent.atomic.AtomicInteger
import org.apache.flink.api.common.time.Time
import org.apache.flink.api.scala.metrics.ScalaGauge
import org.apache.flink.configuration.{ConfigConstants, Configuration}
import org.apache.flink.metrics.reporter.MetricReporter
import org.apache.flink.metrics.{Metric, MetricConfig, MetricGroup}
import org.apache.flink.streaming.api.functions.ProcessFunction
import org.apache.flink.streaming.api.functions.sink.SinkFunction
import org.apache.flink.streaming.api.scala.{StreamExecutionEnvironment, _}
import org.apache.flink.test.util.MiniClusterResource
import org.apache.flink.util.Collector
import org.scalatest.Matchers._
import org.scalatest.PartialFunctionValues._
import org.scalatest.{BeforeAndAfterAll, FunSuite}
import scala.collection.mutable

/* Test based on Flink test example https://ci.apache.org/projects/flink/flink-docs-master/dev/stream/testing.html */
class MultiplyByTwo extends ProcessFunction[Long, Long] {

  override def processElement(data: Long, context: ProcessFunction[Long, Long]#Context, collector: Collector[Long]): Unit = {
    nbrCalls.incrementAndGet()
    collector.collect(data * 2L)
  }

  val nbrCalls = new AtomicInteger(0)

  override def open(parameters: Configuration): Unit = {
    getRuntimeContext.getMetricGroup
      .addGroup("counter")
      .gauge[Int, ScalaGauge[Int]]("call", ScalaGauge[Int](() => nbrCalls.get()))
  }
}

// create a testing sink
class CollectSink extends SinkFunction[Long] {
  import CollectSink._

  override def invoke(value: Long): Unit = {
    synchronized {
      values.add(value)
    }
  }
}

object CollectSink {
  val values: java.util.ArrayList[Long] = new java.util.ArrayList[Long]()
}

class StackOverflowTestReporter extends MetricReporter {
  import StackOverflowTestReporter._

  override def open(metricConfig: MetricConfig): Unit = {}

  override def close(): Unit = {}

  override def notifyOfAddedMetric(metric: Metric, metricName: String, group: MetricGroup): Unit = {
    metric match {
      case gauge: ScalaGauge[_] =>
        // drop group metrics meaningless for the test; seems to be the first 6 items
        val gaugeKey = group.getScopeComponents.toSeq.drop(6).mkString(".") + "." + metricName
        gaugesMetrics(gaugeKey) = gauge.asInstanceOf[ScalaGauge[Int]]
      case _ =>
    }
  }

  override def notifyOfRemovedMetric(metric: Metric, metricName: String, group: MetricGroup): Unit = {}
}

object StackOverflowTestReporter {
  var gaugesMetrics: mutable.Map[String, ScalaGauge[Int]] = mutable.Map[String, ScalaGauge[Int]]()
}

class StackOverflowTest extends FunSuite with BeforeAndAfterAll {

  val miniClusterResource: MiniClusterResource = buildMiniClusterResource()

  override def beforeAll(): Unit = {
    CollectSink.values.clear()
    StackOverflowTestReporter.gaugesMetrics.clear()
    miniClusterResource.before()
  }

  override def afterAll(): Unit = {
    miniClusterResource.after()
  }

  def createConfigForReporter(): Configuration = {
    val cfg: Configuration = new Configuration()
    cfg.setString(ConfigConstants.METRICS_REPORTER_PREFIX + "reporter" + "." + ConfigConstants.METRICS_REPORTER_CLASS_SUFFIX, classOf[StackOverflowTestReporter].getName)
    cfg
  }

  def buildMiniClusterResource(): MiniClusterResource = new MiniClusterResource(
    new MiniClusterResource.MiniClusterResourceConfiguration(
      createConfigForReporter(), 1, 1, Time.milliseconds(50L)))

  test("test_metrics") {
    val env = StreamExecutionEnvironment.getExecutionEnvironment
    env.fromElements[Long](1L, 21L, 22L)
      .process(new MultiplyByTwo())
      .addSink(new CollectSink())
    env.execute()

    CollectSink.values should have length 3
    CollectSink.values should contain (2L)
    CollectSink.values should contain (42L)
    CollectSink.values should contain (44L)

    // verify gauge counter
    val gaugeValues = StackOverflowTestReporter.gaugesMetrics.map(t => (t._1, t._2.getValue()))
    gaugeValues should have size 1
    gaugeValues should contain ("counter.call" -> 3)
  }
}
Your best bet is to use a MiniClusterResource to explicitly start a cluster before the job, and to configure a reporter that checks for specific metrics and exposes them through static fields:
@Rule
public final MiniClusterResource clusterResource = new MiniClusterResource(
    new MiniClusterResourceConfiguration.Builder()
        .setConfiguration(getConfig())
        .build());

private static Configuration getConfig() {
    Configuration config = new Configuration();
    config.setString(
        ConfigConstants.METRICS_REPORTER_PREFIX +
            "myTestReporter." +
            ConfigConstants.METRICS_REPORTER_CLASS_SUFFIX,
        MyTestReporter.class.getName());
    return config;
}

public static class MyTestReporter implements MetricReporter {
    static volatile Gauge<?> myGauge = null;

    @Override
    public void open(MetricConfig metricConfig) {
    }

    @Override
    public void close() {
    }

    @Override
    public void notifyOfAddedMetric(Metric metric, String name, MetricGroup metricGroup) {
        if ("myMetric".equals(name)) {
            myGauge = (Gauge<?>) metric;
        }
    }

    @Override
    public void notifyOfRemovedMetric(Metric metric, String s, MetricGroup metricGroup) {
    }
}
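With a reporter like this wired in, the test body can read the captured gauge through the static field once the job has finished. A minimal sketch (the expected value "3" mirrors the three-element stream above and is otherwise an assumption):

env.execute()

// the reporter captured the gauge into a static field, so the test can read it directly
val gauge = MyTestReporter.myGauge
gauge should not be null
gauge.getValue.toString shouldBe "3"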

How to get defined Arbitrary?

I am using ScalaTest and ScalaCheck. I wrote a custom generator and an arbitrary generator as follows:
import java.time.LocalDateTime
import org.scalacheck._
import org.scalatest.PropSpec
import org.scalatest.prop.Checkers
import Gen._
import Arbitrary.arbitrary

class AuthJwtSpec extends PropSpec with Checkers {

  private val start = LocalDateTime.now.atZone(java.time.ZoneId.systemDefault()).toEpochSecond
  private val end = LocalDateTime.now.plusDays(2).atZone(java.time.ZoneId.systemDefault()).toEpochSecond
  private val pickTime = Gen.choose(start, end)

  private val authUser: Arbitrary[AuthUser] =
    Arbitrary {
      for {
        u <- arbitrary[String]
        p <- arbitrary[String]
      } yield AuthUser(u, p)
    }

  property("Generate JWT token.") {
    check(Prop.forAll(authUser, pickTime) { (r1: AuthUser, r2: Long) =>
      ???
    })
  }
}
The problem is that r1 has type Arbitrary[AuthUser], but I need an AuthUser. How do I get it?
As I see it, the problem is with the authUser field: it should be a Gen[AuthUser], not an Arbitrary[AuthUser]:
import java.time.LocalDateTime
import org.scalacheck._
import org.scalatest._
import prop._

case class AuthUser(u: String, p: String)

class AuthJwtSpec extends PropSpec with Checkers with PropertyChecks {

  private val start = LocalDateTime.now.atZone(java.time.ZoneId.systemDefault()).toEpochSecond
  private val end = LocalDateTime.now.plusDays(2).atZone(java.time.ZoneId.systemDefault()).toEpochSecond
  private val pickTime: Gen[Long] = Gen.choose(start, end)

  // authUser should be a Gen[AuthUser], not an Arbitrary[AuthUser]
  private val authUser: Gen[AuthUser] =
    for {
      u <- Arbitrary.arbitrary[String]
      p <- Arbitrary.arbitrary[String]
    } yield AuthUser(u, p)

  property("Generate JWT token.") {
    val prop = Prop.forAll(authUser, pickTime) { (user: AuthUser, time: Long) =>
      println(s"User: $user")
      println(s"Time: $time")
      // property checks must always return true or false
      true
    }
    check(prop)
  }
}
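Alternatively, you can keep the Arbitrary (for example, to use it implicitly elsewhere) and unwrap it at the call site: Arbitrary is just a wrapper around a Gen, and its .arbitrary member returns the underlying generator. A sketch (authUserArb is a hypothetical name):

private val authUserArb: Arbitrary[AuthUser] = Arbitrary(authUser)

// unwrap the Gen when calling forAll
check(Prop.forAll(authUserArb.arbitrary, pickTime) { (user: AuthUser, time: Long) =>
  true
})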

Apache Flink Streaming type mismatch in flatMap function

I am trying to use the streaming API of Flink 0.10.0 with Scala 2.10.4. While trying to compile this first version:
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment
import org.apache.flink.streaming.api.scala.DataStream
import org.apache.flink.streaming.api.windowing.time._

object Main {
  def main(args: Array[String]) {
    val env = StreamExecutionEnvironment.getExecutionEnvironment
    val text = env.socketTextStream("localhost", 9999)
    val words: DataStream[String] = text.flatMap[String](
      new Function[String, TraversableOnce[String]] {
        def apply(line: String): TraversableOnce[String] = line.split(" ")
      })
    env.execute("Window Stream wordcount")
  }
}
I am getting a compile-time error:
[error] found : String => TraversableOnce[String]
[error] required: org.apache.flink.api.common.functions.FlatMapFunction[String,String]
[error] new Function[String,TraversableOnce[String]] { def apply(line:String):TraversableOnce[String] = line.split(" ")})
[error] ^
And in the decompiled version of DataStream.class that I have included in the project, there are functions that accept such a type (the last one):
public <R> DataStream<R> flatMap(FlatMapFunction<T, R> flatMapper, TypeInformation<R> evidence$12, ClassTag<R> evidence$13) {
    if (flatMapper == null) {
        throw new NullPointerException("FlatMap function must not be null.");
    }
    TypeInformation outType = (TypeInformation)Predef..MODULE$.implicitly(evidence$12);
    return package..MODULE$.javaToScalaStream((org.apache.flink.streaming.api.datastream.DataStream)this.javaStream.flatMap(flatMapper).returns(outType));
}

public <R> DataStream<R> flatMap(Function2<T, Collector<R>, BoxedUnit> fun, TypeInformation<R> evidence$14, ClassTag<R> evidence$15) {
    if (fun == null) {
        throw new NullPointerException("FlatMap function must not be null.");
    }
    Function2<T, Collector<R>, BoxedUnit> cleanFun = this.clean((F)fun);
    .anon flatMapper = new /* Unavailable Anonymous Inner Class!! */;
    return this.flatMap((FlatMapFunction<T, R>)flatMapper, evidence$14, evidence$15);
}

public <R> DataStream<R> flatMap(Function1<T, TraversableOnce<R>> fun, TypeInformation<R> evidence$16, ClassTag<R> evidence$17) {
    if (fun == null) {
        throw new NullPointerException("FlatMap function must not be null.");
    }
    Function1<T, TraversableOnce<R>> cleanFun = this.clean((F)fun);
    .anon flatMapper = new /* Unavailable Anonymous Inner Class!! */;
    return this.flatMap((FlatMapFunction<T, R>)flatMapper, evidence$16, evidence$17);
}
What could be wrong here? I would be grateful if you could give some insight.
Thank you in advance.
The problem is that you are importing the Java StreamExecutionEnvironment of Flink: org.apache.flink.streaming.api.environment.StreamExecutionEnvironment.
You have to use the Scala variant of the StreamExecutionEnvironment instead: import org.apache.flink.streaming.api.scala.StreamExecutionEnvironment.
With that change, everything builds successfully!
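With the Scala imports in place, flatMap also accepts a plain Scala function literal, which is exactly what the Function1 overload in the decompiled listing above is for. A minimal sketch:

import org.apache.flink.streaming.api.scala._

object Main {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment
    val text = env.socketTextStream("localhost", 9999)
    // a Scala lambda returning a TraversableOnce matches the Function1 overload
    val words: DataStream[String] = text.flatMap(line => line.split(" "))
    env.execute("Window Stream wordcount")
  }
}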
Original answer:
The problem is that you are passing a plain Function to the flatMap() method, whereas flatMap() expects a FlatMapFunction.
val words: DataStream[String] = text.flatMap[String](
  new FlatMapFunction[String, String] {
    // a FlatMapFunction returns Unit, so each word must be emitted through the collector
    override def flatMap(t: String, collector: Collector[String]): Unit =
      t.split(" ").foreach(collector.collect)
  })

I am trying to write a text file using Scala in Play

I have attached my code.
Application (controller)
package controllers

import play.api._
import play.api.mvc._
import play.api.data._
import play.api.data.Forms._
import models.Task
import java.io._

object Application extends Controller {

  val taskForm = Form(
    tuple(
      "id" -> number,
      "label" -> nonEmptyText(minLength = 4),
      "add" -> nonEmptyText
    )
  )

  def index = Action {
    Redirect(routes.Application.tasks)
  }

  def tasks = Action {
    Ok(views.html.index(Task.all(), taskForm))
  }

  def showTask = Action {
    Ok(views.html.test(Task.all(), taskForm))
  }

  def newTask = Action { implicit request =>
    taskForm.bindFromRequest.fold(
      errors => BadRequest(views.html.index(Task.all(), errors)),
      {
        case (id, label, add) => {
          Task.create(id, label, add)
          Redirect(routes.Application.showTask)
        }
      }
    )
  }

  def deleteTask(id: Int) = Action {
    Task.delete(id)
    Redirect(routes.Application.showTask)
  }
}
Task (model)
package models

import anorm._
import anorm.SqlParser._
import play.api.db._
import play.api.Play.current

case class Task(id: Int, label: String, add: String)

object Task {

  val task = {
    get[Int]("id") ~
    get[String]("label") ~
    get[String]("add") map {
      case id ~ label ~ add => Task(id, label, add)
    }
  }

  def all(): List[Task] = DB.withConnection { implicit c =>
    SQL("select * from task").as(task *)
  }

  def create(id: Int, label: String, add: String) {
    DB.withConnection { implicit c =>
      SQL("insert into task (id,label,add) values ({id},{label},{add})").on(
        'id -> id,
        'label -> label,
        'add -> add
      ).executeUpdate()
    }
  }

  def delete(id: Int) {
    DB.withConnection { implicit c =>
      SQL("delete from task where id = {id}").on(
        'id -> id
      ).executeUpdate()
    }
  }
}
I have no idea where to declare the writer function. Please help me with the syntax too; I need to write the form elements into a text file. Thanks in advance.
Assuming that you want to append the text whenever a new task is added (i.e. whenever newTask is invoked by Play), you can define a helper function in object Application and use it in your newTask method:
object Application extends Controller {
  //...

  import java.io.FileWriter

  val filePath = """ path to file """

  def writingToFile(str: String) = {
    val fw = new FileWriter(filePath, true) // true: open in append mode
    try {
      fw.write(str)
    } finally {
      fw.close()
    }
  }

  def newTask = Action { implicit request =>
    taskForm.bindFromRequest.fold(
      errors => BadRequest(views.html.index(Task.all(), errors)),
      {
        case (id, label, add) => {
          /* Call the helper function to append to the file */
          writingToFile(s"id : $id, label : $label, add : $add \n")
          Task.create(id, label, add)
          Redirect(routes.Application.showTask)
        }
      }
    )
  }
  //..
}
Likewise, when other methods are invoked you can append to the file in the same fashion, as sketched below.
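For example, deletions could be logged with the same helper (a sketch reusing writingToFile from above):

def deleteTask(id: Int) = Action {
  writingToFile(s"deleted task id : $id \n")
  Task.delete(id)
  Redirect(routes.Application.showTask)
}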
Hope it helps :)