Scala BitSet for Long value? - scala

I have a large dataset and the rowId is Long type.
It need to indicate whether one row is in some special set.
I found the BitSet more efficient for memory and network transport, but unlucky that the max value BitSet can hold is Int.
Is there some alternate way? (using boolean[]?)
Thanks

The nature of BitSet makes it so that they can easily broken down and merged together to make network transport easy depending on what you are trying to achieve exactly.
You can always build your own however:
object BitSet extends BitSetFactory[BitSet] {
val empty: BitSet = immutable.BitSet.empty
def newBuilder = immutable.BitSet.newBuilder
implicit def canBuildFrom: CanBuildFrom[BitSet, Long, BitSet] = bitsetCanBuildFrom
}
trait BitSetFactory[Coll <: BitSet with BitSetLike[Coll]] {
def empty: Coll
def newBuilder: Builder[Long, Coll]
def apply(elems: Long*): Coll = (empty /: elems) (_ + _)
def bitsetCanBuildFrom = new CanBuildFrom[Coll, Long, Coll] {
def apply(from: Coll) = newBuilder
def apply() = newBuilder
}
}
trait BitSetLike[+This <: BitSetLike[This] with SortedSet[Long]] extends SortedSetLike[Long, This] { self =>
etc...

Related

Understanding Mixed Context Bounds of Seq[AnyVal] and Seq[String]

Suppose I have some function that should take a sequence of Ints or a sequence of Strings.
My attempt:
object Example extends App {
import scala.util.Random
val rand: Random.type = scala.util.Random
// raw data
val x = Seq(1, 2, 3, 4, 5).map(e => e + rand.nextDouble())
val y = Seq("chc", "asas")
def f1[T <: AnyVal](seq: Seq[T]) = {
println(seq(0))
}
// this works fine as expected
f1(x)
// how can i combine
f1(y)
}
How can I add this to also work with strings?
If I change the method signature to:
def f1[T <: AnyVal:String](seq: Seq[T])
But this won't work.
Is there a way to impose my required constraint on the types elegantly?
Note the difference between an upper bound
A <: C
and a context bound
A : C
so type parameter clause [T <: AnyVal : String] does not make much sense. Also types such as String are rarely (or never) used as context bounds.
Here is a typeclass approach
trait EitherStringOrAnyVal[T]
object EitherStringOrAnyVal {
implicit val str: EitherStringOrAnyVal[String] = new EitherStringOrAnyVal[String] {}
implicit def aval[T <: AnyVal]: EitherStringOrAnyVal[T] = new EitherStringOrAnyVal[T] {}
}
def f1[T: EitherStringOrAnyVal](seq: Seq[T]): Unit = {
println(seq(0))
}
f1(Seq(1)) // ok
f1(Seq("a")) // ok
f1(Seq(Seq(1))) // nok
or generelized type constraints approach
object Foo {
private def impl[T](seq: Seq[T]): Unit = {
println(seq(0))
}
def f1[T](seq: Seq[T])(implicit ev: T =:= String): Unit = impl(seq)
def f1[T <: AnyVal](seq: Seq[T]): Unit = impl(seq)
}
import Foo._
f1(Seq(1)) // ok
f1(Seq("a")) // ok
f1(Seq(Seq(1))) // nok
I feel like you should write a separate function for both, but there are other ways to do it:
In Dotty, you can just use union types, which is what I'd recommend here:
Edit: As per Alexey Romanov’s suggestion, replaced Seq[AnyVal | String] with Seq[AnyVal | String], which was wrong,
def foo(seq: Seq[AnyVal] | Seq[String]): Unit = {
println(seq)
}
Scastie
If you're not using Dotty, you can still do this in Scala 2 with a couple reusable implicit defs
class Or[A, B]
implicit def orA[A, B](implicit ev: A): Or[A, B] = new Or
implicit def orB[A, B](implicit ev: B): Or[A, B] = new Or
def foo[T](seq: Seq[T])(implicit ev: Or[T <:< AnyVal, T =:= String]): Unit =
println(seq)
foo(Seq("baz", "waldo"))
foo(Seq(23, 34))
foo(Seq(List(), List())) //This last one fails
Scastie
You can also do it with a typeclass, as Luis Miguel Mejía Suárez suggested, but I wouldn't recommend it for such a trivial task since you still have to define 2 functions for each type, along with a trait, 2 implicit objects, and one function that can use instances of that trait. You probably don't want this pattern to handle just 2 different types.
sealed trait DoSomethingToIntOrString[T] {
def doSomething(t: Seq[T]): Unit
}
implicit object DoSomethingToAnyVal extends DoSomethingToAnyValOrString[AnyVal] {
def doSomething(is: Seq[AnyVal]): Unit = println(is)
}
implicit object DoSomethingToString extends DoSomethingToIntOrString[String] {
def doSomething(ss: Seq[String]): Unit = println(ss)
}
def foo[T](t: Seq[T])(implicit dsis: DoSomethingToIntOrString[T]): Unit = {
dsis.doSomething(t)
}
foo(Seq("foo", "bar"))
foo(Seq(1, 2))
In Scastie
def f1( seq: Seq[Any] ): Unit =
println( seq( 0 ) )
This works, because the least common supertype of Int and String (the 'lowest' type that is a supertype of both) would be Any, since Int is a subtype of AnyVal (i.e. 'the primitives) and String is a subtype of AnyRef (i.e. `the heap objects). f1 does not depends in any way on the contents of the list and is not polymorphic in its return type (Unit), so we don't need a type parameter at all.

Scala "belongsTo" function

I am often in the situation of doing this type of filtering:
allPeople
.filter(person => englishPeopleIds contains person.id)
and it would make my life easier and my code more readable if there was some sort of "belongsTo" function to do the following:
allPeople
.filter(_.id belongsTo englishPeopleIds)
belongsTo function would have this kind of behaviour (but would be a method of the element):
def belongsTo[T](element: T, list: List[T]): Boolean = list contains element
Do you know if such function is already implemented in Scala?
Perhaps define a right-associative operator on a sequential collections like so
implicit class ElementOf[A](l: Seq[A]) {
def ∈: (a: A): Boolean = l contains a
}
using mathematical symbol element of ∈ which falls under Scala operator characters, and then call-site becomes
allPeople filter (_.id ∈: englishPeopleIds)
To make it work also for Set and Map as well try defining in terms of Iterable#exists like so
implicit class ElementOf[A](cc: Iterable[A]) {
def ∈: (a: A): Boolean = cc exists (_ == a)
}
No, but you could use an implicit class to do the same:
implicit class BelongsTo[T](private val t: T) extends AnyVal {
def belongsTo(s: Seq[T]): Boolean = s.contains(t)
def belongsTo(s: Set[T]): Boolean = s(t)
}
In Dotty, you can do this:
extension [T](t: T):
def belongsTo(s: Seq[T]): Boolean = s.contains(t)
def belongsTo(s: Set[T]): Boolean = s(t)
Honestly, though, it doesn't seem worth it.

How can I dynamically construct a Scala class without knowing types in advance?

I want to build a simple library in which a developer can define a Scala class that represents command line arguments (to keep it simple, just a single set of required arguments -- no flags or optional arguments). I'd like the library to parse the command line arguments and return an instance of the class. The user of the library would just do something like this:
case class FooArgs(fluxType: String, capacitorCount: Int)
def main(args: Array[String]) {
val argsObject: FooArgs = ArgParser.parse(args).as[FooArgs]
// do real stuff
}
The parser should throw runtime errors if the provided arguments do not match the expected types (e.g. if someone passes the string "bar" at a position where an Int is expected).
How can I dynamically build a FooArgs without knowing its shape in advance? Since FooArgs can have any arity or types, I don't know how to iterate over the command line arguments, cast or convert them to the expected types, and then use the result to construct a FooArgs. Basically, I want to do something along these lines:
// ** notional code - does not compile **
def parse[T](args: Seq[String], klass: Class[T]): T = {
val expectedTypes = klass.getDeclaredFields.map(_.getGenericType)
val typedArgs = args.zip(expectedTypes).map({
case (arg, String) => arg
case (arg, Int) => arg.toInt
case (arg, unknownType) =>
throw new RuntimeException(s"Unsupported type $unknownType")
})
(klass.getConstructor(typedArgs).newInstance _).tupled(typedArgs)
}
Any suggestions on how I can achieve something like this?
When you want to abstract over case class (or Tuple) shape, the standard approach is to get the HList representation of the case class with the help of shapeless library. HList keeps track in its type signature of the types of its elements and their amount. Then you can implement the algorithm you want recursively on the HList. Shapeless also provides a number of helpful transformations of HLists in shapeless.ops.hlist.
For this problem, first we need to define an auxiliary typeclass to parse an argument of some type from String:
trait Read[T] {
def apply(str: String): T
}
object Read {
def make[T](f: String => T): Read[T] = new Read[T] {
def apply(str: String) = f(str)
}
implicit val string: Read[String] = make(identity)
implicit val int: Read[Int] = make(_.toInt)
}
You can define more instances of this typeclass, if you need to support other argument types than String or Int.
Then we can define the actual typeclass that parses a sequence of arguments into some type:
// This is needed, because there seems to be a conflict between
// HList's :: and the standard Scala's ::
import shapeless.{:: => :::, _}
trait ParseArgs[T] {
def apply(args: List[String]): T
}
object ParseArgs {
// Base of the recursion on HList
implicit val hnil: ParseArgs[HNil] = new ParseArgs[HNil] {
def apply(args: List[String]) =
if (args.nonEmpty) sys.error("too many args")
else HNil
}
// A single recursion step on HList
implicit def hlist[T, H <: HList](
implicit read: Read[T], parseRest: ParseArgs[H]
): ParseArgs[T ::: H] = new ParseArgs[T ::: H] {
def apply(args: List[String]) = args match {
case first :: rest => read(first) :: parseRest(rest)
case Nil => sys.error("too few args")
}
}
// The implementation for any case class, based on its HList representation
implicit def caseClass[C, H <: HList](
implicit gen: Generic.Aux[C, H], parse: ParseArgs[H]
): ParseArgs[C] = new ParseArgs[C] {
def apply(args: List[String]) = gen.from(parse(args))
}
}
And lastly we can define some API, that uses this typeclass. For example:
case class ArgParser(args: List[String]) {
def to[C](implicit parseArgs: ParseArgs[C]): C = parseArgs(args)
}
object ArgParser {
def parse(args: Array[String]): ArgParser = ArgParser(args.toList)
}
And a simple test:
scala> ArgParser.parse(Array("flux", "10")).to[FooArgs]
res0: FooArgs = FooArgs(flux,10)
There is a great guide on using shapeless for solving similar problems, which you may find helpful: The Type Astronaut’s Guide to Shapeless

How to maintain a property about trees using a Scala Collections TreeMap (or similar)

Since the scala RedBlack tree is no longer available in the scala collections, I'm having trouble figuring out how to maintain properties about tree in the nodes on a scala tree, as I need direct access to the tree rather than the encaspulated interface provided by the standard collections.
For example, I want a tree that maintains on each node a field x such that the function f holds for each node n such that
f(null) = 0
f(n) = n.x + f(n.left) + f(n.right)
Can I efficiently use a Scala standard library TreeMap for this or will I have to implement my own tree (currently doing this)?
Unfortunately, I don't think you will be able to do this using TreeMap. You might want to just copy the red black tree from scala collections and add a summary field.
Alternatively, you might want to use one of the tree-based collections that allow specifying a "measure". E.g.
https://github.com/Sciss/FingerTree
Or, you could use this horrible hack to add some summary data to the nodes of e.g. a TreeSet. Note the namespace scala.collection.immutable so you can access RedBlackTree from inside the utility.
package scala.collection.immutable
trait Summary[E, S] {
def apply(v: E): S
def empty: S
def combine(a: S, b: S): S
def combine3(a: S, b: S, c: S) = combine(combine(a, b), c)
}
object TreeSetSummarizer {
private[this] val treeSetAccessor = classOf[scala.collection.immutable.TreeSet[_]].getDeclaredField("tree")
treeSetAccessor.setAccessible(true)
private def tree[K](set: TreeSet[K]): AnyRef = treeSetAccessor.get(set) match {
case t: RedBlackTree.Tree[K, Unit] ⇒ t
case _ ⇒ "null"
}
private type JFunction[T, R] = java.util.function.Function[T, R]
def apply[K, S](implicit s: Summary[K, S]): (TreeSet[K] ⇒ S) = new TreeSetSummarizer[K, S]
}
class TreeSetSummarizer[K, S](implicit summary: Summary[K, S]) extends (TreeSet[K] ⇒ S) {
import TreeSetSummarizer._
// this should be a guava cache using weak keys to prevent memory leaks
private val memo = new java.util.IdentityHashMap[AnyRef, S]()
private val f: JFunction[AnyRef, S] = new JFunction[AnyRef, S] {
def apply(t: AnyRef): S = t match {
case t: RedBlackTree.Tree[K, Unit] ⇒ summary.combine3(apply0(t.left), summary.apply(t.key), apply0(t.right))
case _ ⇒ summary.empty
}
}
private def apply0(set: AnyRef): S =
memo.computeIfAbsent(set, f)
def apply(set: TreeSet[K]): S =
apply0(tree(set))
}
Here is how you would use it
import scala.collection.immutable._
// create a large TreeSet
val largeSet = TreeSet(0 until 10000: _*)
// define your summary
implicit val s = new Summary[Int, Long] {
def empty = 0L
def apply(x: Int) = x
def combine(a: Long, b: Long) = a + b
}
// define your summarizer. You need to keep the summarizer instance for
// having a performance benefit, since it internally stores summaries for
// tree nodes in an identity hash map
val summary = TreeSetSummarizer.apply[Int, Long]
// summarize something
println(summary(largeSet))
// summarize a modified set. This is fast because the summaries for tree
// nodes are being cached.
val largeSet1 = largeSet - 5000
println(summary(largeSet1))
Note that given the reflection and hashing overhead, this is probably best used for more computationally intensive summaries than a simple sum.
Update: I have now written a small library persistentsummary to define persistent summaries for existing scala collections. This should do exactly what you want.

How to shapeless case classes with attributes and typeclasses?

I am currently implementing a library to serialize and deserialize to and from XML-RPC messages. It's almost done but now I am trying to remove the boilerplate of my current asProduct method using Shapeless. My current code:
trait Serializer[T] {
def serialize(value: T): NodeSeq
}
trait Deserializer[T] {
type Deserialized[T] = Validation[AnyErrors, T]
type AnyErrors = NonEmptyList[AnyError]
def deserialize(from: NodeSeq): Deserialized[T]
}
trait Datatype[T] extends Serializer[T] with Deserializer[T]
// Example of asProduct, there are 20 more methods like this, from arity 1 to 22
def asProduct2[S, T1: Datatype, T2: Datatype](apply: (T1, T2) => S)(unapply: S => Product2[T1, T2]) = new Datatype[S] {
override def serialize(value: S): NodeSeq = {
val params = unapply(value)
val b = toXmlrpc(params._1) ++ toXmlrpc(params._2)
b.theSeq
}
// Using scalaz
override def deserialize(from: NodeSeq): Deserialized[S] = (
fromXmlrpc[T1](from(0)) |#| fromXmlrpc[T2](from(1))
) {apply}
}
My goal is to allow the user of my library to serialize/deserialize case classes without force him to write boilerplate code. Currently, you have to declare the case class and an implicit val using the aforementioned asProduct method to have a Datatype instance in context. This implicit is used in the following code:
def toXmlrpc[T](datatype: T)(implicit serializer: Serializer[T]): NodeSeq =
serializer.serialize(datatype)
def fromXmlrpc[T](value: NodeSeq)(implicit deserializer: Deserializer[T]): Deserialized[T] =
deserializer.deserialize(value)
This is the classic strategy of serializing and deserializing using type classes.
At this moment, I have grasped how to convert from case classes to HList via Generic or LabelledGeneric. The problem is once I have this conversion done how I can call the methods fromXmlrpc and toXmlrpc as in the asProduct2 example. I don't have any information about the types of the attributes in the case class and, therefore, the compiler cannot find any implicit that satisfy fromXmlrpc and toXmlrpc. I need a way to constrain that all the elements of a HList have an implicit Datatype in context.
As I am a beginner with Shapeless, I would like to know what's the best way of getting this functionality. I have some insights but I definitely have no idea of how to get it done using Shapeless. The ideal would be to have a way to get the type from a given attribute of the case class and pass this type explicitly to fromXmlrpc and toXmlrpc. I imagine that this is not how it can be done.
First, you need to write generic serializers for HList. That is, you need to specify how to serialize H :: T and HNil:
implicit def hconsDatatype[H, T <: HList](implicit hd: Datatype[H],
td: Datatype[T]): Datatype[H :: T] =
new Datatype[H :: T] {
override def serialize(value: H :: T): NodeSeq = value match {
case h :: t =>
val sh = hd.serialize(h)
val st = td.serialize(t)
(sh ++ st).theSeq
}
override def deserialize(from: NodeSeq): Deserialized[H :: T] =
(hd.deserialize(from.head) |#| td.deserialize(from.tail)) {
(h, t) => h :: t
}
}
implicit val hnilDatatype: Datatype[HNil] =
new Datatype[HNil] {
override def serialize(value: HNil): NodeSeq = NodeSeq()
override def deserialize(from: NodeSeq): Deserialized[HNil] =
Success(HNil)
}
Then you can define a generic serializer for any type which can be deconstructed via Generic:
implicit def genericDatatype[T, R](implicit gen: Generic.Aux[T, R],
rd: Lazy[Datatype[R]]): Datatype[T] =
new Datatype[T] {
override def serialize(value: T): NodeSeq =
rd.value.serialize(gen.to(value))
override def deserialize(from: NodeSeq): Deserialized[T] =
rd.value.deserialize(from).map(rd.from)
}
Note that I had to use Lazy because otherwise this code would break the implicit resolution process if you have nested case classes. If you get "diverging implicit expansion" errors, you could try adding Lazy to implicit parameters in hconsDatatype and hnilDatatype as well.
This works because Generic.Aux[T, R] links the arbitrary product-like type T and a HList type R. For example, for this case class
case class A(x: Int, y: String)
shapeless will generate a Generic instance of type
Generic.Aux[A, Int :: String :: HNil]
Consequently, you can delegate the serialization to recursively defined Datatypes for HList, converting the data to HList with Generic first. Deserialization works similarly but in reverse - first the serialized form is read to HList and then this HList is converted to the actual data with Generic.
It is possible that I made several mistakes in the usage of NodeSeq API above but I guess it conveys the general idea.
If you want to use LabelledGeneric, the code would become slightly more complex, and even more so if you want to handle sealed trait hierarchies which are represented with Coproducts.
I'm using shapeless to provide generic serialization mechanism in my library, picopickle. I'm not aware of any other library which does this with shapeless. You can try and find some examples how shapeless could be used in this library, but the code there is somewhat complex. There is also an example among shapeless examples, namely S-expressions.
Vladimir's answer is great and should be the accepted one, but it's also possible to do this a little more nicely with Shapeless's TypeClass machinery. Given the following setup:
import scala.xml.NodeSeq
import scalaz._, Scalaz._
trait Serializer[T] {
def serialize(value: T): NodeSeq
}
trait Deserializer[T] {
type Deserialized[T] = Validation[AnyErrors, T]
type AnyError = Throwable
type AnyErrors = NonEmptyList[AnyError]
def deserialize(from: NodeSeq): Deserialized[T]
}
trait Datatype[T] extends Serializer[T] with Deserializer[T]
We can write this:
import shapeless._
object Datatype extends ProductTypeClassCompanion[Datatype] {
object typeClass extends ProductTypeClass[Datatype] {
def emptyProduct: Datatype[HNil] = new Datatype[HNil] {
def serialize(value: HNil): NodeSeq = Nil
def deserialize(from: NodeSeq): Deserialized[HNil] = HNil.successNel
}
def product[H, T <: HList](
dh: Datatype[H],
dt: Datatype[T]
): Datatype[H :: T] = new Datatype[H :: T] {
def serialize(value: H :: T): NodeSeq =
dh.serialize(value.head) ++ dt.serialize(value.tail)
def deserialize(from: NodeSeq): Deserialized[H :: T] =
(dh.deserialize(from.head) |#| dt.deserialize(from.tail))(_ :: _)
}
def project[F, G](
instance: => Datatype[G],
to: F => G,
from: G => F
): Datatype[F] = new Datatype[F] {
def serialize(value: F): NodeSeq = instance.serialize(to(value))
def deserialize(nodes: NodeSeq): Deserialized[F] =
instance.deserialize(nodes).map(from)
}
}
}
Be sure to define these all together so they'll be properly companioned.
Then if we have a case class:
case class Foo(bar: String, baz: String)
And instances for the types of the case class members (in this case just String):
implicit object DatatypeString extends Datatype[String] {
def serialize(value: String) = <s>{value}</s>
def deserialize(from: NodeSeq) = from match {
case <s>{value}</s> => value.text.successNel
case _ => new RuntimeException("Bad string XML").failureNel
}
}
We automatically get a derived instance for Foo:
scala> case class Foo(bar: String, baz: String)
defined class Foo
scala> val fooDatatype = implicitly[Datatype[Foo]]
fooDatatype: Datatype[Foo] = Datatype$typeClass$$anon$3#2e84026b
scala> val xml = fooDatatype.serialize(Foo("AAA", "zzz"))
xml: scala.xml.NodeSeq = NodeSeq(<s>AAA</s>, <s>zzz</s>)
scala> fooDatatype.deserialize(xml)
res1: fooDatatype.Deserialized[Foo] = Success(Foo(AAA,zzz))
This works about the same as Vladimir's solution, but it lets Shapeless abstract some of the boring boilerplate of type class instance derivation so you don't have to get your hands dirty with Generic.