Generating random/sample JSON based on a schema in Scala

I need to dynamically generate some random JSON samples compliant with a schema. That is, the input would be a schema (e.g. JSON Schema) and the output would be a JSON document that complies with it.
I'm looking for pointers. Any suggestions?

This is not a complete solution, but you can build on it from here.
Let's assume these are the domain objects we want to generate:
case class Dummy1(foo: String)
case class Dummy11(foo: Dummy1)
First we need a small type class for value generation (its definition is implied by the usage below):
trait Random[T] {
  def generate(): T
}
If we then define some instances:
object O {
  implicit def stringR: Random[String] = new Random[String] {
    override def generate(): String = "s"
  }
  implicit def intR: Random[Int] = new Random[Int] {
    override def generate(): Int = 1
  }
  implicit def tupleR[T1: Random, T2: Random]: Random[(T1, T2)] = new Random[(T1, T2)] {
    override def generate(): (T1, T2) = {
      val t1: T1 = G.random[T1]()
      val t2: T2 = G.random[T2]()
      (t1, t2)
    }
  }
}
object G {
  def random[R: Random](): R = {
    implicitly[Random[R]].generate()
  }
}
then we will be able to generate some primitive values:
import O._
val s: String = G.random[String]()
val i: Int = G.random[Int]()
val t: (Int, String) = G.random[(Int, String)]()
println("s=" + s)
println("i=" + i)
println("t=" + t)
Now, to jump to custom types, we need to add
def randomX[R: Random, T](f: R => T): Random[T] = {
  val value: Random[R] = implicitly[Random[R]]
  new Random[T] {
    override def generate(): T = f.apply(value.generate())
  }
}
to our G object.
Now we can
import O._
val d1: Dummy1 = G.randomX(Dummy1.apply).generate()
println("d1=" + d1)
and with some extra effort even
import O._
implicit val d1Gen: Random[Dummy1] = G.randomX(Dummy1.apply)
val d11: Dummy11 = G.randomX(Dummy11.apply).generate()
println("d11=" + d11)
Now you need to extend it to all the primitives you have, add a real implementation of randomness, and support classes with more than one field (see the sketch below), and you are ready to go.
You may even make some fancy library out of it.
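For example, supporting classes with two fields could be a matter of adding a two-argument variant of randomX (a sketch; randomX2 and Dummy2 are illustrative names, not part of the code above):
case class Dummy2(foo: String, bar: Int)
def randomX2[R1: Random, R2: Random, T](f: (R1, R2) => T): Random[T] = {
  val r1: Random[R1] = implicitly[Random[R1]]
  val r2: Random[R2] = implicitly[Random[R2]]
  new Random[T] {
    // Generate each field independently, then combine them via the constructor.
    override def generate(): T = f.apply(r1.generate(), r2.generate())
  }
}
// With the instances from O in scope:
// val d2: Dummy2 = randomX2(Dummy2.apply).generate()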

Related

Generic UDAF in Spark 3.0 using Aggregator

Spark 3.0 has deprecated UserDefinedAggregateFunction and I was trying to rewrite my UDAF using Aggregator. Basic usage of Aggregator is simple; however, I struggle with a more generic version of the function.
I will try to explain my problem with this example, an implementation of collect_set. It's not my actual case, but it is easier to explain the problem with:
class CollectSetDemoAgg(name: String) extends Aggregator[Row, Set[Int], Set[Int]] {
  override def zero = Set.empty
  override def reduce(b: Set[Int], a: Row) = b + a.getInt(a.fieldIndex(name))
  override def merge(b1: Set[Int], b2: Set[Int]) = b1 ++ b2
  override def finish(reduction: Set[Int]) = reduction
  override def bufferEncoder = Encoders.kryo[Set[Int]]
  override def outputEncoder = ExpressionEncoder()
}
// using it:
df.agg(new CollectSetDemoAgg("rank").toColumn as "result").show()
I prefer .toColumn vs .udf.register, but it's not the point here.
Problem:
I cannot make a universal version of this Aggregator; it will only work with integers.
I've attempted:
class CollectSetDemo(name: String) extends Aggregator[Row, Set[Any], Set[Any]]
It crashes with the error:
No Encoder found for Any
- array element class: "java.lang.Object"
- root class: "scala.collection.immutable.Set"
java.lang.UnsupportedOperationException: No Encoder found for Any
- array element class: "java.lang.Object"
- root class: "scala.collection.immutable.Set"
at org.apache.spark.sql.catalyst.ScalaReflection$.$anonfun$serializerFor$1(ScalaReflection.scala:567)
I could not go with CollectSetDemo[T], because I was not able to write a proper outputEncoder. Also, when using udaf, I can only work with Spark data types, columns, etc.
I have not found a nice way to solve the situation, but I was able to work around it somewhat. The code was partially borrowed from RowEncoder:
import scala.reflect.ClassTag
import scala.reflect.runtime.universe.typeTag
import org.apache.spark.sql.{Encoders, Row}
import org.apache.spark.sql.catalyst.ScalaReflection
import org.apache.spark.sql.catalyst.encoders.ExpressionEncoder
import org.apache.spark.sql.expressions.Aggregator
import org.apache.spark.sql.types._
class CollectSetDemoAgg(name: String, fieldType: DataType) extends Aggregator[Row, Set[Any], Any] {
  override def zero = Set.empty
  override def reduce(b: Set[Any], a: Row) = b + a.get(a.fieldIndex(name))
  override def merge(b1: Set[Any], b2: Set[Any]) = b1 ++ b2
  override def finish(reduction: Set[Any]) = reduction.toSeq
  override def bufferEncoder = Encoders.kryo[Set[Any]]
  // now
  override def outputEncoder = {
    val mirror = ScalaReflection.mirror
    val tt = fieldType match {
      case ArrayType(LongType, _) => typeTag[Seq[Long]]
      case ArrayType(IntegerType, _) => typeTag[Seq[Int]]
      case ArrayType(StringType, _) => typeTag[Seq[String]]
      // .. etc etc
      case _ => throw new RuntimeException(s"Could not create encoder for ${name} column (${fieldType})")
    }
    val tpe = tt.in(mirror).tpe
    val cls = mirror.runtimeClass(tpe)
    val serializer = ScalaReflection.serializerForType(tpe)
    val deserializer = ScalaReflection.deserializerForType(tpe)
    new ExpressionEncoder[Any](serializer, deserializer, ClassTag[Any](cls))
  }
}
One thing I had to add was a result data type parameter in the aggregator. The usage then changed to:
df.agg(new CollectSetDemoAgg("rank", new ArrayType(IntegerType, true)).toColumn as "result").show()
I really don't like how it turned out, but it works. I also welcome any suggestions on how to improve it.
A modification of @Ramunas' answer with generics:
import scala.reflect.ClassTag
import scala.reflect.runtime.universe.{TypeTag, typeTag}
import org.apache.spark.sql.{Encoders, Row}
import org.apache.spark.sql.catalyst.ScalaReflection.{deserializerForType, mirror, serializerForType}
import org.apache.spark.sql.catalyst.encoders.ExpressionEncoder
import org.apache.spark.sql.expressions.Aggregator
class CollectSetDemoAgg[T: TypeTag](name: String) extends Aggregator[Row, Set[T], Seq[T]] {
  override def zero = Set.empty
  override def reduce(b: Set[T], a: Row) = b + a.getAs[T](a.fieldIndex(name))
  override def merge(b1: Set[T], b2: Set[T]) = b1 ++ b2
  override def finish(reduction: Set[T]) = reduction.toSeq
  override def bufferEncoder = Encoders.kryo[Set[T]]
  override def outputEncoder = {
    val tt = typeTag[Seq[T]]
    val tpe = tt.in(mirror).tpe
    val cls = mirror.runtimeClass(tpe)
    val serializer = serializerForType(tpe)
    val deserializer = deserializerForType(tpe)
    new ExpressionEncoder[Seq[T]](serializer, deserializer, ClassTag[Seq[T]](cls))
  }
}
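Usage mirrors the original non-generic version, with the element type now passed explicitly (a sketch, assuming the rank column holds integers):
df.agg(new CollectSetDemoAgg[Int]("rank").toColumn as "result").show()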

In Scala, how to deal with heterogeneous list of the same parameterized type

I have an array of Any (in real life it's a Spark Row, but it's sufficient to isolate the problem):
object Row {
  val buffer: Array[Any] = Array(42, 21, true)
}
And I want to apply some operations to its elements. So I've defined a simple ADT that describes a compute operation on a type A:
trait Op[A] {
  def cast(a: Any): A = a.asInstanceOf[A]
  def compute(a: A): A
}
case object Count extends Op[Int] {
  override def compute(a: Int): Int = a + 1
}
case object Exist extends Op[Boolean] {
  override def compute(a: Boolean): Boolean = a
}
Given that I have a list of all the operations and I know which operation to apply to each element, let's use these operations.
object GenericsOp {
  import Row._
  val ops = Seq(Count, Exist)
  def compute() = {
    buffer(0) = ops(0).compute(ops(0).cast(buffer(0)))
    buffer(1) = ops(0).compute(ops(0).cast(buffer(1)))
    buffer(2) = ops(1).compute(ops(1).cast(buffer(2)))
  }
}
By design, for a given op, the types are aligned between cast and compute. But unfortunately this code does not compile. The error is
Type mismatch, expected: _$1, actual: AnyVal
Is there a way to make it work?
I've found a workaround by using an abstract type member instead of a type parameter.
object AbstractOp extends App {
  import Row._
  trait Op {
    type A
    def compute(a: A): A
  }
  case object Count extends Op {
    type A = Int
    override def compute(a: Int): Int = a + 1
  }
  case object Exist extends Op {
    type A = Boolean
    override def compute(a: Boolean): Boolean = a
  }
  val ops = Seq(Count, Exist)
  def compute() = {
    val op0 = ops(0)
    val op1 = ops(1)
    buffer(0) = ops(0).compute(buffer(0).asInstanceOf[op0.A])
    buffer(1) = ops(0).compute(buffer(1).asInstanceOf[op0.A])
    buffer(2) = ops(1).compute(buffer(2).asInstanceOf[op1.A])
  }
}
Is there a better way?
It seems that your code can be simplified by making Op[A] extend Any => A:
trait Op[A] extends (Any => A) {
  def cast(a: Any): A = a.asInstanceOf[A]
  def compute(a: A): A
  def apply(a: Any): A = compute(cast(a))
}
case object Count extends Op[Int] {
  override def compute(a: Int): Int = a + 1
}
case object Exist extends Op[Boolean] {
  override def compute(a: Boolean): Boolean = a
}
object AbstractOp {
  val buffer: Array[Any] = Array(42, 21, true)
  val ops: Array[Op[_]] = Array(Count, Count, Exist)
  def main(args: Array[String]): Unit = {
    for (i <- 0 until buffer.size) {
      buffer(i) = ops(i)(buffer(i))
    }
    println(buffer.mkString("[", ",", "]"))
  }
}
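For the buffer above this prints [43,22,true]: Count increments the two integers, and Exist passes the Boolean through.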
Since it's asInstanceOf everywhere anyway, it does not make the code any less safe than what you had previously.
Update
If you cannot change the Op interface, then invoking cast and compute is a bit more cumbersome, but still possible:
trait Op[A] {
  def cast(a: Any): A = a.asInstanceOf[A]
  def compute(a: A): A
}
case object Count extends Op[Int] {
  override def compute(a: Int): Int = a + 1
}
case object Exist extends Op[Boolean] {
  override def compute(a: Boolean): Boolean = a
}
object AbstractOp {
  val buffer: Array[Any] = Array(42, 21, true)
  val ops: Array[Op[_]] = Array(Count, Count, Exist)
  def main(args: Array[String]): Unit = {
    for (i <- 0 until buffer.size) {
      buffer(i) = ops(i) match {
        case op: Op[t] => op.compute(op.cast(buffer(i)))
      }
    }
    println(buffer.mkString("[", ",", "]"))
  }
}
Note the ops(i) match { case op: Op[t] => ... } part with a type parameter in the pattern: this allows us to make sure that cast returns a t that is accepted by compute.
As a more general solution than Andrey Tyukin's, you can define the method outside Op, so it works even if Op can't be modified:
def apply[A](op: Op[A], x: Any) = op.compute(op.cast(x))
buffer(0) = apply(ops(0), buffer(0))
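Put together as a runnable sketch (reusing Row, Op, Count and Exist from above; ExternalApply is just an illustrative wrapper name):
object ExternalApply {
  import Row._
  // A is bound per call, so op.cast and op.compute are guaranteed to agree.
  def apply[A](op: Op[A], x: Any): A = op.compute(op.cast(x))
  val ops: Array[Op[_]] = Array(Count, Count, Exist)
  def main(args: Array[String]): Unit = {
    for (i <- buffer.indices) {
      buffer(i) = apply(ops(i), buffer(i))
    }
    println(buffer.mkString("[", ",", "]")) // [43,22,true]
  }
}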

Using a double value in a Fractional[T] method

I have the following function, which generates a uniformly distributed value between two bounds:
def Uniform(x: Bounded[Double], n: Int): Bounded[Double] = {
  val y: Double = (x.upper - x.lower) * scala.util.Random.nextDouble() + x.lower
  Bounded(y, x.bounds)
}
and Bounded is defined as follows:
trait Bounded[T] {
  val underlying: T
  val bounds: (T, T)
  def lower: T = bounds._1
  def upper: T = bounds._2
  override def toString = underlying.toString + " <- [" + lower.toString + "," + upper.toString + "]"
}
object Bounded {
  def apply[T: Numeric](x: T, _bounds: (T, T)): Bounded[T] = new Bounded[T] {
    override val underlying: T = x
    override val bounds: (T, T) = _bounds
  }
}
However, I want Uniform to work for all Fractional[T] values, so I wanted to add a context bound:
def Uniform[T: Fractional](x: Bounded[T], n: Int): Bounded[T] = {
  import Numeric.Implicits._
  val y: T = (x.upper - x.lower) * scala.util.Random.nextDouble().asInstanceOf[T] + x.lower
  Bounded(y, x.bounds)
}
This works fine when calling Uniform[Double] on a Bounded[Double], but the other instantiations are impossible: they throw a ClassCastException at runtime because the Double cannot be cast to T. Is there a way to solve this?
I'd suggest defining a new type class that characterizes types that you can get random instances of:
import scala.util.Random
trait GetRandom[A] {
  def next(): A
}
object GetRandom {
  def instance[A](a: => A): GetRandom[A] = new GetRandom[A] {
    def next(): A = a
  }
  implicit val doubleRandom: GetRandom[Double] = instance(Random.nextDouble())
  implicit val floatRandom: GetRandom[Float] = instance(Random.nextFloat())
  // Define any other instances here
}
Now you can write Uniform like this:
def Uniform[T: Fractional: GetRandom](x: Bounded[T], n: Int): Bounded[T] = {
  import Numeric.Implicits._
  val y: T = (x.upper - x.lower) * implicitly[GetRandom[T]].next() + x.lower
  Bounded(y, x.bounds)
}
And use it like this:
scala> Uniform[Double](Bounded(2, (0, 4)), 1)
res15: Bounded[Double] = 1.5325899033654382 <- [0.0,4.0]
scala> Uniform[Float](Bounded(2, (0, 4)), 1)
res16: Bounded[Float] = 0.06786823 <- [0.0,4.0]
There are libraries like rng that provide a similar type class for you, but they tend to be focused on purely functional ways to work with random numbers, so if you want something simpler you're probably best off writing your own.

Avoid explicit type parameters on map operations

I have a Span[A] data type that tracks a minimum and maximum value of type A. Because of this, I require A to have a Scalaz Order instance. Here's what the implementation looks like:
trait Span[A] {
  val min: A
  val max: A
}
object Span {
  def apply[A: Order](id: A): Span[A] = new Span[A] {
    override val min = id
    override val max = id
  }
  def apply[A: Order](a: A, b: A): Span[A] = {
    val swap = implicitly[Order[A]].greaterThan(a, b)
    new Span[A] {
      override val min = if (swap) b else a
      override val max = if (swap) a else b
    }
  }
  implicit def orderSpanSemigroup[A: Order]: Semigroup[Span[A]] = new Semigroup[Span[A]] {
    override def append(f1: Span[A], f2: => Span[A]): Span[A] = {
      val ord = implicitly[Order[A]]
      Span(ord.min(f1.min, f2.min), ord.max(f1.max, f2.max))
    }
  }
}
The apply method seems to work as expected, as I can do this:
val a = Span(1) // or Span.apply(1)
It breaks down if I try to map over this value using a functor, for example, an Option[Int]:
val b = 1.some map Span.apply
// could not find implicit value for evidence parameter of type scalaz.Order[A]
However, I can fix the error by using an explicit type parameter:
val c = 1.some map Span.apply[Int]
Why is this happening? Is there a way to avoid the explicit type parameter? I wonder if it's related to this issue, as I originally ran into the problem while trying to use my own implicit Order instance. Of course, it also fails on Int inputs, so maybe it's just a limitation of parameterized methods.
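Note: a plain lambda seems to compile as well (a sketch on my part, presumably because the parameter type is already known when Span.apply and its Order[Int] evidence are resolved):
val d = 1.some map (x => Span(x))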

Scala's Spire framework: I am unable to operate on a group

I am trying to use Spire, a math framework, but I get an error message:
import spire.algebra._
import spire.implicits._
trait AbGroup[A] extends Group[A]
final class Rationnel_Quadratique(val n1: Int = 2)(val coef: (Int, Int)) {
  override def toString = {
    coef match {
      case (c, i) =>
        s"$c + $i√$n"
    }
  }
  def a() = coef._1
  def b() = coef._2
  def n() = n1
}
object Rationnel_Quadratique {
  def apply(coef: (Int, Int), n: Int = 2) = {
    new Rationnel_Quadratique(n)(coef)
  }
}
object AbGroup {
  implicit object RQAbGroup extends AbGroup[Rationnel_Quadratique] {
    def +(a: Rationnel_Quadratique, b: Rationnel_Quadratique): Rationnel_Quadratique = Rationnel_Quadratique(coef = (a.a() + b.a(), a.b() + b.b()))
    def inverse(a: Rationnel_Quadratique): Rationnel_Quadratique = Rationnel_Quadratique((-a.a(), -a.b()))
    def id: Rationnel_Quadratique = Rationnel_Quadratique((0, 0))
  }
}
object euler66_2 extends App {
  val c = Rationnel_Quadratique((1, 2))
  val d = Rationnel_Quadratique((3, 4))
  val e = c + d
  println(e)
}
The program is expected to add 1+2√2 and 3+4√2, but instead I get this error:
could not find implicit value for evidence parameter of type spire.algebra.AdditiveSemigroup[Rationnel_Quadratique]
val e = c + d
^
I think there is something essential I have missed (usage of implicits?).
It looks like you are not using Spire correctly.
Spire already has an AbGroup type, so you should be using that instead of redefining your own. Here's an example using a simple type I created called X.
import spire.implicits._
import spire.algebra._
case class X(n: BigInt)
object X {
  implicit object XAbGroup extends AbGroup[X] {
    def id: X = X(BigInt(0))
    def op(lhs: X, rhs: X): X = X(lhs.n + rhs.n)
    def inverse(lhs: X): X = X(-lhs.n)
  }
}
def test(a: X, b: X): X = a |+| b
Note that with groups (as well as semigroups and monoids) you'd use |+| rather than +. To get plus, you'll want to define something with an AdditiveSemigroup (e.g. Semiring, or Ring, or Field or something).
You'll also use .inverse and |-| instead of unary and binary -, if that makes sense.
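For example, if you want + itself, a minimal sketch (reusing the X type above; xAdditive is an assumed name, and this is an illustration rather than the only way) would be to provide an additive instance:
// Sketch: an AdditiveSemigroup instance makes + available for X
// through the syntax imported from spire.implicits._.
implicit val xAdditive: AdditiveSemigroup[X] = new AdditiveSemigroup[X] {
  def plus(lhs: X, rhs: X): X = X(lhs.n + rhs.n)
}
def testPlus(a: X, b: X): X = a + b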
Looking at your code, I am also not sure your actual number type is right. What will happen if I want to add two numbers with different values for n?
Anyway, hope this clears things up for you a bit.
EDIT: Since it seems like you're also getting hung up on Scala syntax, let me try to sketch a few designs that might work. First, there's always a more general solution:
import spire.implicits._
import spire.algebra._
import spire.math._
case class RQ(m: Map[Natural, SafeLong]) {
  override def toString: String = m.map {
    case (k, v) => if (k == 1) s"$v" else s"$v√$k"
  }.mkString(" + ")
}
object RQ {
  // Radical and the _2 value are defined in the second sketch below.
  implicit def abgroup[R <: Radical](implicit r: R): AbGroup[RQ] =
    new AbGroup[RQ] {
      def id: RQ = RQ(Map.empty)
      def op(lhs: RQ, rhs: RQ): RQ = RQ(lhs.m + rhs.m)
      def inverse(lhs: RQ): RQ = RQ(-lhs.m)
    }
}
object Test {
  def main(args: Array[String]): Unit = {
    implicit val radical = _2
    val x = RQ(Map(Natural(1) -> 1, Natural(2) -> 2))
    val y = RQ(Map(Natural(1) -> 3, Natural(2) -> 4))
    println(x)
    println(y)
    println(x |+| y)
  }
}
This allows you to add different roots together without problem, at the cost of some indirection. You could also stick more closely to your design with something like this:
import spire.implicits._
import spire.algebra._
abstract class Radical(val n: Int) { override def toString: String = n.toString }
case object _2 extends Radical(2)
case object _3 extends Radical(3)
case class RQ[R <: Radical](a: Int, b: Int)(implicit r: R) {
  override def toString: String = s"$a + $b√$r"
}
object RQ {
  implicit def abgroup[R <: Radical](implicit r: R): AbGroup[RQ[R]] =
    new AbGroup[RQ[R]] {
      def id: RQ[R] = RQ[R](0, 0)
      def op(lhs: RQ[R], rhs: RQ[R]): RQ[R] = RQ[R](lhs.a + rhs.a, lhs.b + rhs.b)
      def inverse(lhs: RQ[R]): RQ[R] = RQ[R](-lhs.a, -lhs.b)
    }
}
object Test {
  def main(args: Array[String]): Unit = {
    implicit val radical = _2
    val x = RQ[_2.type](1, 2)
    val y = RQ[_2.type](3, 4)
    println(x)
    println(y)
    println(x |+| y)
  }
}
This approach creates a fake type to represent whatever radical you are using (e.g. √2) and parameterizes RQ on that type. This way you can be sure that no one will try to do additions that are invalid.
Hopefully one of these approaches will work for you.