Below is my Spark UDF, can anyone help me convert this into Java?
val customUDF = udf((array: Seq[String]) => {
val newts = array.filter(_.nonEmpty)
if (newts.size == 0) null
else newts.head
})
Something like:
UDF2 my_udf = new UDF2<WrappedArray<String>, String>() {
public String call(WrappedArray<String> arr) throws Exception {
String[] newts = arr.filter(_.nonEmpty)
if (newts.length == 0) {
return null
} else { newts[0] }
}
};
spark.udf().register("my_udf", my_udf, DataTypes.StringType);
You can do it in two ways:
Inline, using a lambda (i.e. Scala style),
or by defining a method that returns the UDF and registering it.
Since your UDF takes a single array argument, the Java interface to implement is UDF1, not UDF2.
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.api.java.UDF1;
import org.apache.spark.sql.types.DataTypes;

import java.util.List;

public class SimpleUDF {

    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder().master("local[*]").getOrCreate();

        // 1. Inline, using a lambda
        spark.sqlContext()
             .udf()
             .register("sampleUDFLambda",
                       (List<String> array) -> array.stream()
                                                    .filter(element -> !element.isEmpty())
                                                    .findFirst()
                                                    .orElse(null),
                       DataTypes.StringType);

        // 2. Or define the function separately and register it
        spark.sqlContext().udf().register("sampleUDF", sampleUdf(), DataTypes.StringType);
    }

    // Returns the first non-empty element of the array, or null if there is none
    private static UDF1<List<String>, String> sampleUdf() {
        return (array) -> array.stream()
                               .filter(element -> !element.isEmpty())
                               .findFirst()
                               .orElse(null);
    }
}
I am trying to Base64-decode and decompress a gzipped string.
I have written a class for this and some test code.
I am getting
Exception in thread "main" java.lang.IndexOutOfBoundsException
on the line
while ((readByte = gzipIS.read(gzipByteBuffer)) != -1) byteOS.write(
The base code:
import java.io.BufferedReader
import java.io.ByteArrayInputStream
import java.io.ByteArrayOutputStream
import java.io.IOException
import java.io.InputStreamReader
import java.io.UnsupportedEncodingException
import java.nio.charset.StandardCharsets
import java.util.zip.GZIPInputStream
import java.util.zip.GZIPOutputStream
import org.apache.commons.codec.binary.Base64
import org.apache.commons.io.IOUtils
class ZipUtilities {
def unzipCCBRsponse(inputB64: String): String = {
val bDecodeBase64: Array[Byte] = Base64.decodeBase64(inputB64)
var zipInputStream: GZIPInputStream = null
try {
zipInputStream = new GZIPInputStream(
new ByteArrayInputStream(bDecodeBase64, 4, bDecodeBase64.length - 4))
val inputStreamReader: InputStreamReader =
new InputStreamReader(zipInputStream, StandardCharsets.UTF_8)
val bufferedReader: BufferedReader = new BufferedReader(
inputStreamReader)
val output: StringBuilder = new StringBuilder()
var line: String = null
while ((line = bufferedReader.readLine()) != null) output.append(line)
println("Output String Length: " + output.length)
bufferedReader.close()
output.toString
} catch {
case e: IOException => e.printStackTrace()
}
null
}
def decodeBase64(b64EncodedString: String): String = {
var bDecodeBase64: Array[Byte] = null
try {
bDecodeBase64 =
Base64.decodeBase64(b64EncodedString.getBytes("ISO-8859-1"))
new String(bDecodeBase64)
} catch {
case e: UnsupportedEncodingException => e.printStackTrace()
}
null
}
def decodeFromGzip(input: String): String = {
var output: String = null
if (input != null && !input.isEmpty) {
try {
var readByte: Int = 0
val gzipByteBuffer: Array[Byte] = Array.ofDim[Byte](2048)
val gzipIS: GZIPInputStream = new GZIPInputStream(
new ByteArrayInputStream(Base64.decodeBase64(input)))
val byteOS: ByteArrayOutputStream = new ByteArrayOutputStream()
while ((readByte = gzipIS.read(gzipByteBuffer)) != -1) byteOS.write(
gzipByteBuffer,
0,
readByte)
byteOS.close()
output = new String(byteOS.toByteArray())
} catch {
case e: IOException => e.printStackTrace()
}
}
output
}
}
The test code is:
object ZipTest {
def main(args: Array[String]): Unit = {
val b64: String =
"H4sIAAAAAAAAAO2dW1PbOBTH3/spNHkvgXYvLVPoCFlJtNiSR5IT8sRkUxeYJQmThMJ++5Xt3LjNNsdwVprNC8N4fPz/2dblXGTly9f70TX5kU9nV5PxUeNgb79B8vFw8u1qfHHUuJ1"
val util: ZipUtilities = new ZipUtilities()
println(util.decodeFromGzip(b64).replaceAll("\n", ""))
}
}
In Scala, an assignment such as (readByte = gzipIS.read(gzipByteBuffer)) evaluates to Unit, and since a value of type Unit will never equal -1 (or any Int value), your while loop never stops: it keeps calling byteOS.write even after read has returned -1, and write(gzipByteBuffer, 0, -1) is what throws the IndexOutOfBoundsException.
The compiler should have alerted you:
warning: comparing values of types Unit and Int using `!=' will always yield true
Note: you seldom see null or var in idiomatic Scala code; they are almost never needed.
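As an aside, here is a rough sketch of how decodeFromGzip could be written without var or null, reading the stream with Iterator.continually and returning an Option instead of a nullable String (an illustration, not a drop-in replacement):
import java.io.{ByteArrayInputStream, ByteArrayOutputStream}
import java.nio.charset.StandardCharsets
import java.util.zip.GZIPInputStream
import org.apache.commons.codec.binary.Base64

def decodeFromGzip(input: String): Option[String] =
  Option(input).filter(_.nonEmpty).map { encoded =>
    val gzipIS = new GZIPInputStream(new ByteArrayInputStream(Base64.decodeBase64(encoded)))
    try {
      val buffer = new Array[Byte](2048)
      val byteOS = new ByteArrayOutputStream()
      // read() is evaluated once per iteration; takeWhile stops cleanly when it returns -1
      Iterator
        .continually(gzipIS.read(buffer))
        .takeWhile(_ != -1)
        .foreach(count => byteOS.write(buffer, 0, count))
      new String(byteOS.toByteArray, StandardCharsets.UTF_8)
    } finally gzipIS.close()
  }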
I have a datastream in Flink and I generate my own metrics using a gauge in a ProcessFunction.
As these metrics are important for my activity, I would like to unit test them once the flow has executed.
Unfortunately, I didn't find a way to implement a proper test reporter.
Here is some simple code illustrating my issue.
Two concerns with this code:
how do I trigger the gauge
how do I get the reporter instantiated by env.execute
Here is the sample:
import java.util.concurrent.atomic.AtomicInteger
import org.apache.flink.api.scala.metrics.ScalaGauge
import org.apache.flink.configuration.{ConfigConstants, Configuration}
import org.apache.flink.metrics.reporter.AbstractReporter
import org.apache.flink.metrics.{Gauge, Metric, MetricConfig}
import org.apache.flink.streaming.api.TimeCharacteristic
import org.apache.flink.streaming.api.functions.ProcessFunction
import org.apache.flink.streaming.api.functions.sink.SinkFunction
import org.apache.flink.streaming.api.scala.{StreamExecutionEnvironment, _}
import org.apache.flink.util.Collector
import org.scalatest.FunSuite
import org.scalatest.Matchers._
import org.scalatest.PartialFunctionValues._
import scala.collection.JavaConverters._
import scala.collection.mutable
/* Test based on Flink test example https://ci.apache.org/projects/flink/flink-docs-master/dev/stream/testing.html */
class MultiplyByTwo extends ProcessFunction[Long, Long] {
override def processElement(data: Long, context: ProcessFunction[Long, Long]#Context, collector: Collector[Long]): Unit = {
collector.collect(data * 2L)
}
val nbrCalls = new AtomicInteger(0)
override def open(parameters: Configuration): Unit = {
getRuntimeContext.getMetricGroup
.addGroup("counter")
.gauge[Int, ScalaGauge[Int]]("call" , ScalaGauge[Int]( () => nbrCalls.get()))
}
}
// create a testing sink
class CollectSink extends SinkFunction[Long] {
override def invoke(value: Long): Unit = {
synchronized {
CollectSink.values.add(value)
}
}
}
object CollectSink {
val values: java.util.ArrayList[Long] = new java.util.ArrayList[Long]()
}
class StackOverflowTestReporter extends AbstractReporter {
var gaugesMetrics : mutable.Map[String, String] = mutable.Map[String, String]()
override def open(metricConfig: MetricConfig): Unit = {}
override def close(): Unit = {}
override def filterCharacters(s: String): String = s
def report(): Unit = {
gaugesMetrics = this.gauges.asScala.map(t => (metricValue(t._1), t._2))
}
private def metricValue(m: Metric): String = {
m match {
case g: Gauge[_] => g.getValue.toString
case _ => ""
}
}
}
class StackOverflowTest extends FunSuite with StreamingMultipleProgramsTestBase{
def createConfigForReporter(reporterName : String) : Configuration = {
val cfg : Configuration = new Configuration()
cfg.setString(ConfigConstants.METRICS_REPORTER_PREFIX + reporterName + "." + ConfigConstants.METRICS_REPORTER_CLASS_SUFFIX, classOf[StackOverflowTestReporter].getName)
cfg
}
test("test_metrics") {
val env = StreamExecutionEnvironment.createLocalEnvironment(
StreamExecutionEnvironment.getDefaultLocalParallelism,
createConfigForReporter("reporter"))
// configure your test environment
env.setParallelism(1)
env.setStreamTimeCharacteristic(TimeCharacteristic.ProcessingTime)
// values are collected in a static variable
CollectSink.values.clear()
// create a stream of custom elements and apply transformations
env.fromElements[Long](1L, 21L, 22L)
.process(new MultiplyByTwo())
.addSink(new CollectSink())
// execute
env.execute()
// verify your results
CollectSink.values should have length 3
CollectSink.values should contain (2L)
CollectSink.values should contain (42L)
CollectSink.values should contain (44L)
//verify gauge counter
//pseudo code ...
val testReporter : StackOverflowTestReporter = _ // how to get testReporter instantiate in env
testReporter.gaugesMetrics should have size 1
testReporter.gaugesMetrics should contain key "count.call"
testReporter.gaugesMetrics.valueAt("count.call") should be equals("3")
}
}
Solution, thanks to Chesnay Schepler:
import java.util.concurrent.atomic.AtomicInteger
import org.apache.flink.api.common.time.Time
import org.apache.flink.api.scala.metrics.ScalaGauge
import org.apache.flink.configuration.{ConfigConstants, Configuration}
import org.apache.flink.metrics.reporter.MetricReporter
import org.apache.flink.metrics.{Metric, MetricConfig, MetricGroup}
import org.apache.flink.streaming.api.functions.ProcessFunction
import org.apache.flink.streaming.api.functions.sink.SinkFunction
import org.apache.flink.streaming.api.scala.{StreamExecutionEnvironment, _}
import org.apache.flink.test.util.MiniClusterResource
import org.apache.flink.util.Collector
import org.scalatest.Matchers._
import org.scalatest.PartialFunctionValues._
import org.scalatest.{BeforeAndAfterAll, FunSuite}
import scala.collection.mutable
/* Test based on Flink test example https://ci.apache.org/projects/flink/flink-docs-master/dev/stream/testing.html */
class MultiplyByTwo extends ProcessFunction[Long, Long] {
override def processElement(data: Long, context: ProcessFunction[Long, Long]#Context, collector: Collector[Long]): Unit = {
nbrCalls.incrementAndGet()
collector.collect(data * 2L)
}
val nbrCalls = new AtomicInteger(0)
override def open(parameters: Configuration): Unit = {
getRuntimeContext.getMetricGroup
.addGroup("counter")
.gauge[Int, ScalaGauge[Int]]("call" , ScalaGauge[Int]( () => nbrCalls.get()))
}
}
// create a testing sink
class CollectSink extends SinkFunction[Long] {
import CollectSink._
override def invoke(value: Long): Unit = {
synchronized {
values.add(value)
}
}
}
object CollectSink {
val values: java.util.ArrayList[Long] = new java.util.ArrayList[Long]()
}
class StackOverflowTestReporter extends MetricReporter {
import StackOverflowTestReporter._
override def open(metricConfig: MetricConfig): Unit = {}
override def close(): Unit = {}
override def notifyOfAddedMetric(metric: Metric, metricName: String, group: MetricGroup) : Unit = {
metric match {
case gauge: ScalaGauge[_] => {
// drop the group scope components that are meaningless for the test; they seem to be the first 6 items
val gaugeKey = group.getScopeComponents.toSeq.drop(6).mkString(".") + "." + metricName
gaugesMetrics(gaugeKey) = gauge.asInstanceOf[ScalaGauge[Int]]
}
case _ =>
}
}
override def notifyOfRemovedMetric(metric: Metric, metricName: String, group: MetricGroup): Unit = {}
}
object StackOverflowTestReporter {
var gaugesMetrics : mutable.Map[String, ScalaGauge[Int]] = mutable.Map[String, ScalaGauge[Int]]()
}
class StackOverflowTest extends FunSuite with BeforeAndAfterAll{
val miniClusterResource : MiniClusterResource = buildMiniClusterResource()
override def beforeAll(): Unit = {
CollectSink.values.clear()
StackOverflowTestReporter.gaugesMetrics.clear()
miniClusterResource.before()
}
override def afterAll(): Unit = {
miniClusterResource.after()
}
def createConfigForReporter() : Configuration = {
val cfg : Configuration = new Configuration()
cfg.setString(ConfigConstants.METRICS_REPORTER_PREFIX + "reporter" + "." + ConfigConstants.METRICS_REPORTER_CLASS_SUFFIX, classOf[StackOverflowTestReporter].getName)
cfg
}
def buildMiniClusterResource() : MiniClusterResource = new MiniClusterResource(
new MiniClusterResource.MiniClusterResourceConfiguration(
createConfigForReporter(),1,1, Time.milliseconds(50L)))
test("test_metrics") {
val env = StreamExecutionEnvironment.getExecutionEnvironment
env.fromElements[Long](1L, 21L, 22L)
.process(new MultiplyByTwo())
.addSink(new CollectSink())
env.execute()
CollectSink.values should have length 3
CollectSink.values should contain (2L)
CollectSink.values should contain (42L)
CollectSink.values should contain (44L)
//verify gauge counter
val gaugeValues = StackOverflowTestReporter.gaugesMetrics.map(t => (t._1, t._2.getValue()))
gaugeValues should have size 1
gaugeValues should contain ("counter.call" -> 3)
}
}
Your best bet is to use a MiniClusterResource to explicitly start a cluster before the job, and to configure a reporter that checks for specific metrics and exposes them through static fields.
@Rule
public final MiniClusterResource clusterResource = new MiniClusterResource(
    new MiniClusterResourceConfiguration.Builder()
        .setConfiguration(getConfig())
        .build());
private static Configuration getConfig() {
Configuration config = new Configuration();
config.setString(
ConfigConstants.METRICS_REPORTER_PREFIX +
"myTestReporter." +
ConfigConstants.METRICS_REPORTER_CLASS_SUFFIX,
MyTestReporter.class.getName());
return config;
}
public static class MyTestReporter implements MetricReporter {
static volatile Gauge<?> myGauge = null;
@Override
public void open(MetricConfig metricConfig) {
}
@Override
public void close() {
}
@Override
public void notifyOfAddedMetric(Metric metric, String name, MetricGroup metricGroup) {
if ("myMetric".equals(name)) {
myGauge = (Gauge<?>) metric;
}
}
@Override
public void notifyOfRemovedMetric(Metric metric, String s, MetricGroup metricGroup) {
}
}
I am using ScalaTest and ScalaCheck. I wrote a custom generator and an Arbitrary generator as follows:
import java.time.LocalDateTime
import org.scalacheck._
import org.scalatest.PropSpec
import org.scalatest.prop.Checkers
import Gen._
import Arbitrary.arbitrary
class AuthJwtSpec extends PropSpec with Checkers {
private val start = LocalDateTime.now.atZone(java.time.ZoneId.systemDefault()).toEpochSecond
private val end = LocalDateTime.now.plusDays(2).atZone(java.time.ZoneId.systemDefault()).toEpochSecond
private val pickTime = Gen.choose(start, end)
private val authUser: Arbitrary[AuthUser] =
Arbitrary {
for {
u <- arbitrary[String]
p <- arbitrary[String]
} yield AuthUser(u, p)
}
property("Generate JWT token.") {
check(Prop.forAll(authUser, pickTime) {(r1: AuthUser, r2: Long) =>
???
})
}
}
The problem is that r1 has type Arbitrary[AuthUser], but I need an AuthUser. How do I get it?
As I see it, the problem is with the authUser field - it should be Gen[AuthUser], not Arbitrary[AuthUser]:
import java.time.LocalDateTime
import org.scalacheck._
import org.scalatest._
import prop._
case class AuthUser(u: String, p: String)
class AuthJwtSpec extends PropSpec with Checkers with PropertyChecks {
private val start = LocalDateTime.now.atZone(java.time.ZoneId.systemDefault()).toEpochSecond
private val end = LocalDateTime.now.plusDays(2).atZone(java.time.ZoneId.systemDefault()).toEpochSecond
private val pickTime: Gen[Long] = Gen.choose(start, end)
// AuthUser should be Gen[AuthUser], not Arbitary[AuthUser]
private val authUser: Gen[AuthUser] =
for {
u <- Arbitrary.arbitrary[String]
p <- Arbitrary.arbitrary[String]
} yield AuthUser(u, p)
property("Generate JWT token.") {
val prop = Prop.forAll(authUser, pickTime) {(user: AuthUser, time: Long) =>
println(s"User: $user")
println(s"Time: $time")
// Property checks must always return true or false
true
}
check(prop)
}
}
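Alternatively, if you want to keep an Arbitrary[AuthUser] (for example so that arbitrary[AuthUser] resolves implicitly elsewhere), you can unwrap its generator with .arbitrary when calling Prop.forAll. A minimal sketch:
val authUserArb: Arbitrary[AuthUser] =
  Arbitrary {
    for {
      u <- Arbitrary.arbitrary[String]
      p <- Arbitrary.arbitrary[String]
    } yield AuthUser(u, p)
  }

// Arbitrary is just a wrapper around a Gen; .arbitrary exposes the underlying Gen[AuthUser]
val prop = Prop.forAll(authUserArb.arbitrary, pickTime) { (user: AuthUser, time: Long) =>
  true // the property body must evaluate to a Boolean (or a Prop)
}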
I am trying to use the streaming API of Flink 0.10.0 with Scala 2.10.4. While trying to compile this first version:
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment
import org.apache.flink.streaming.api.scala.DataStream
import org.apache.flink.streaming.api.windowing.time._
object Main {
def main(args: Array[String]) {
val env = StreamExecutionEnvironment.getExecutionEnvironment
val text = env.socketTextStream("localhost", 9999)
val words : DataStream[String] = text.flatMap[String](
new Function[String,TraversableOnce[String]] {
def apply(line:String):TraversableOnce[String] = line.split(" ")
})
env.execute("Window Stream wordcount")
}
}
I am getting a compile-time error:
[error] found : String => TraversableOnce[String]
[error] required: org.apache.flink.api.common.functions.FlatMapFunction[String,String]
[error] new Function[String,TraversableOnce[String]] { def apply(line:String):TraversableOnce[String] = line.split(" ")})
[error] ^
And in the decompiled version of DataStream.class that I have included in the project, there are functions that accept such a type (the last one):
public <R> DataStream<R> flatMap(FlatMapFunction<T, R> flatMapper, TypeInformation<R> evidence$12, ClassTag<R> evidence$13) {
if (flatMapper == null) {
throw new NullPointerException("FlatMap function must not be null.");
}
TypeInformation outType = (TypeInformation)Predef..MODULE$.implicitly(evidence$12);
return package..MODULE$.javaToScalaStream((org.apache.flink.streaming.api.datastream.DataStream)this.javaStream.flatMap(flatMapper).returns(outType));
}
public <R> DataStream<R> flatMap(Function2<T, Collector<R>, BoxedUnit> fun, TypeInformation<R> evidence$14, ClassTag<R> evidence$15) {
if (fun == null) {
throw new NullPointerException("FlatMap function must not be null.");
}
Function2<T, Collector<R>, BoxedUnit> cleanFun = this.clean((F)fun);
.anon flatMapper = new /* Unavailable Anonymous Inner Class!! */;
return this.flatMap((FlatMapFunction<T, R>)flatMapper, evidence$14, evidence$15);
}
public <R> DataStream<R> flatMap(Function1<T, TraversableOnce<R>> fun, TypeInformation<R> evidence$16, ClassTag<R> evidence$17) {
if (fun == null) {
throw new NullPointerException("FlatMap function must not be null.");
}
Function1<T, TraversableOnce<R>> cleanFun = this.clean((F)fun);
.anon flatMapper = new /* Unavailable Anonymous Inner Class!! */;
return this.flatMap((FlatMapFunction<T, R>)flatMapper, evidence$16, evidence$17);
}
What could be wrong here? I would be grateful if you could give some insight.
Thank you in advance.
The problem is that you are importing the Java StreamExecutionEnvironment of Flink: org.apache.flink.streaming.api.environment.StreamExecutionEnvironment.
You have to use the Scala variant of the StreamExecutionEnvironment like this: import org.apache.flink.streaming.api.scala.StreamExecutionEnvironment.
With that change, everything builds successfully.
Original answer:
The problem is that you are passing a Function to the flatMap() method. However, flatMap() expects a FlatMapFunction:
val words : DataStream[String] = text.flatMap[String](
  new FlatMapFunction[String, String] {
    // split the line and emit each word into the collector
    override def flatMap(t: String, collector: Collector[String]): Unit =
      t.split(" ").foreach(collector.collect)
  })
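For completeness, once the Scala API imports are in scope (they also bring in the implicit TypeInformation evidence), a plain Scala function literal is accepted directly. A minimal sketch, assuming the same socket source:
import org.apache.flink.streaming.api.scala._

object Main {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment
    val text = env.socketTextStream("localhost", 9999)

    // the Scala DataStream overload accepts String => TraversableOnce[String]
    val words: DataStream[String] = text.flatMap(line => line.split(" "))

    words.print()
    env.execute("Window Stream wordcount")
  }
}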
I have attached my code below.
Application (controller):
package controllers
import play.api._
import play.api.mvc._
import play.api.data._
import play.api.data.Forms._
import models.Task
import java.io._
object Application extends Controller {
val taskForm = Form(
tuple(
"id" -> number,
"label" -> nonEmptyText(minLength = 4),
"add" -> nonEmptyText
)
)
def index = Action {
Redirect(routes.Application.tasks)
}
def tasks = Action {
Ok(views.html.index(Task.all(),taskForm))
}
def showTask= Action {
Ok(views.html.test(Task.all(), taskForm))
}
def newTask = Action { implicit request =>
taskForm.bindFromRequest.fold(
errors => BadRequest(views.html.index(Task.all(), errors)),
{
case(id,label,add) => {
Task.create(id,label,add)
Redirect(routes.Application.showTask)
}
}
)
}
def deleteTask(id: Int) = Action {
Task.delete(id)
Redirect(routes.Application.showTask)
}
}
Task (model):
package models
import anorm._
import anorm.SqlParser._
import play.api.db._
import play.api.Play.current
case class Task(id: Int, label: String,add:String)
object Task {
val task = {
get[Int]("id") ~
get[String]("label") ~
get[String]("add") map {
case id~label~add => Task(id, label,add)
}
}
def all(): List[Task] = DB.withConnection { implicit c =>
SQL("select * from task").as(task *)
}
def create(id:Int , label: String, add:String) {
DB.withConnection { implicit c =>
SQL("insert into task (id,label,add) values ({id},{label},{add})").on(
'id -> id ,
'label -> label ,
'add -> add
).executeUpdate()
}
}
def delete(id:Int) {
DB.withConnection { implicit c =>
SQL("delete from task where id = {id}").on(
'id -> id
).executeUpdate()
}
}
}
I have no idea where to declare the writer function. Please help me with the syntax as well; I need to write the form elements into a text file. Thanks in advance.
Assuming that you want to append the text whenever a new task is added (i.e. whenever newTask is invoked by Play), you can define a helper function in object Application and use it in your newTask method.
object Application extends Controller {
//...
import java.io.FileWriter
val filePath = """ path to file """
def writingToFile(str: String) = {
val fw = new FileWriter(filePath, true)
try {
fw.write(str)
} finally {
fw.close()
}
}
def newTask = Action { implicit request =>
taskForm.bindFromRequest.fold(
errors => BadRequest(views.html.index(Task.all(), errors)),
{
case(id,label,add) => {
/* Call the helper function to append to the file */
writingToFile(s"id : $id, label : $label, add : $add \n")
Task.create(id,label,add)
Redirect(routes.Application.showTask)
}
}
)
}
//..
}
Likewise, when other methods are invoked you can append to the file in a similar fashion, as sketched below for deleteTask.
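For example, a minimal sketch reusing the writingToFile helper above to log deletions:
def deleteTask(id: Int) = Action {
  // append a line to the same file before removing the task
  writingToFile(s"deleted task id : $id \n")
  Task.delete(id)
  Redirect(routes.Application.showTask)
}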
Hope it helps :)