Pattern Matching for CSV processing

Pattern Matching for CSV processing - scala

I have a csv file with three columns and I want to get the third column into an Iterator. I want to filter out the headers by using the trytoDouble method in combination of the Pattern Matching.
def trytoDouble(s: String): Option[Double] = {
try {
Some(s.toDouble)
} catch {
case e: Exception => None
}
}
val asdf = Source.fromFile("my.csv").getLines().map(_.split(",").map(_.trim).map(utils.trytoDouble(_))).map{
_ match {
case Array(a, b, Some(c: Double)) => c
}
}
results in
An exception or error caused a run to abort: [Lscala.Option;#2b4a2ec7 (of class [Lscala.Option;)
scala.MatchError: [Lscala.Option;#2b4a2ec7 (of class [Lscala.Option;)
what did I do wrong?

Try using an extractor like StringDouble below. If the unapply returns Some then it matches, if it returns None then it doesn't match.
object StringDouble {
def unapply(str: String): Option[Double] = Try(str.toDouble).toOption
}
val doubles =
Source.fromFile("my.csv").getLines().map { line =>
line.split(",").map(_.trim)
}.map {
case Array(_, _, StringDouble(d)) => d
}

It will always give scala.MatchError
scala> val url = "/home/knoldus/data/moviedataset.csv"
url: String = /home/knoldus/data/moviedataset.csv
scala> val asdf1 = Source.fromFile(url).getLines().map(_.split(",").map(_.trim).map(trytoDouble(_))).toList
asdf1: List[Array[Option[Double]]] = List(Array(Some(1.0), None, Some(1993.0), Some(3.9), Some(4568.0)), Array(Some(2.0), None, Some(1932.0), Some(3.5), Some(4388.0)), Array(Some(3.0), None, Some(1921.0), Some(3.2), Some(9062.0)), Array(Some(4.0), None, Some(1991.0), Some(2.8), Some(6150.0)), Array(Some(5.0), None, Some(1963.0), Some(2.8), Some(5126.0)), ....
As it will return non-empty iterator so, to see the result i have converted it toList.
If you notice return type is List[Array[Option[Double]]], and you are trying to match with Array of tuple3 but it always returns Array[Option[Double]].
Therefore, it will always throw error.

Try This:
val asdf = Array("1,2,3","1,b,c").map(_.split(",").map(_.trim).map(trytoDouble(_))).map{
_(2) match {
case x:Option[Double] => x
}
}

having only one match case is always breaking code.
See the REPL example,
scala> def trytoDouble(s: String): Option[Double] = {
| try {
| Some(s.toDouble)
| } catch {
| case e: Exception => None
| }
| }
when there's no Double in your CSV,
scala> List("a,b,c").map(_.split(",").map(_.trim).map(trytoDouble(_)))
res1: List[Array[Option[Double]]] = List(Array(None, None, None))
When you match above result (Array(None, None, None)) with Array(a, b, Some(c: Double)) always going to fail,
scala> List("a,b,c").map(_.split(",").map(_.trim).map(trytoDouble(_))).map { data =>
| data match {
| case Array(a, b, Some(c: Double)) => c
| }
| }
scala.MatchError: [Lscala.Option;#22604c7e (of class [Lscala.Option;)
at .$anonfun$res4$4(<console>:15)
at .$anonfun$res4$4$adapted(<console>:13)
at scala.collection.immutable.List.map(List.scala:283)
... 33 elided
And when there Double,
scala> List("a,b,100").map(_.split(",").map(_.trim).map(trytoDouble(_))).map { data => data match { case Array(a, b, Some(c: Double)) => c }}
res5: List[Double] = List(100.0)
You basically need to check for Some(c) or None.
BUT, if I understand you you are to extract third field as Double which can be done this way,
scala> List("a,b,100", "100, 200, p").map(_.split(",")).map { case Array(a, b, c) => trytoDouble(c)}.filter(_.nonEmpty)
res14: List[Option[Double]] = List(Some(100.0))

The other answers are great, but I just noticed that an alternative of minimal code change is to simply write
val asdf = Source.fromFile(fileName).getLines().map(_.split(",").map(_.trim).map( utils.trytoDouble(_))).flatMap{
_ match {
case Array(a, b, c) => c
}
}
or slightly more efficient (since we are only interested in the last column anyway):
val asdf = Source.fromFile(fileName).getLines().map(_.split(",")).flatMap{
_ match {
case Array(a, b, c) => trytoDouble(c.trim)
}
}
Important to notice is here that flatMap will remove None objects.

Related

Log warnings in pattern match without returning Any

I have an iterable of arrays that I am trying to turn into case classes, and I'm mapping over them to do so. In the event of an array being non-convertable to a case class, I want to log a warning and proceed with the mapping. However, when I implement the warning, the return type changes from Iterable[MyCaseClass] to Iterable[Any] which is not what I want. E.g.:
case class MyCaseClass(s1: String, s2: String)
object MyCaseClass {
def apply(sa: Array[String]) = new MyCaseClass(sa(0), sa(1))
}
val arrayIterable: Iterable[Array[String]] = Iterable(Array("a", "b"), Array("a", "b", "c"))
def badReturnType(): Iterable[Any] = { // Iterable[Any] is undesireable
arrayIterable map {
case sa: Array[String] if sa.length == 2 => MyCaseClass(sa)
case _ => println("something bad happened!") // but warnings are good
}
}
def desiredReturnType(): Iterable[MyCaseClass] = { // Iterable[MyCaseClass] is desireable
arrayIterable map {
case sa: Array[String] if sa.length == 2 => MyCaseClass(sa)
// but no warnings if things go wrong!
}
}
I want to write a function that meets the following criteria:
maps over the Iterable, converting each element to a MyCaseClass
log warnings when I get an array that cant be converted to a MyCaseClass
after logging the warning, the array passed into the match criteria is ignored/discarded
the return type should be Iterable[MyCaseClass].
How can I meet these conditions?

Consider using List instead of Array, and try wrapping in Option in combination with flatMap
l flatMap {
case e if e.length == 2 => Some(MyCaseClass(e))
case e => println(s"$e is wrong length"); None
}
Another approach is partitionMap
val (lefts, rights) = l.partitionMap {
case e if e.size == 2 => Right(MyCaseClass(e))
case e => Left(s"$e is wrong length")
}
lefts.foreach(println)
rights

You can do something like this:
final case class MyCaseClass(s1: String, s2: String)
def parse(input: Array[String]): Either[String, MyCaseClass] = input match {
case Array(s1, s2) => Right(MyCaseClass(s1, s2))
case _ => Left(s"Bad input: ${input.mkString("[", ", ", "]")}")
}
def logErrors(validated: Either[String, _]): Unit = validated match {
case Left(error) => println(error)
case Right(_) => ()
}
def validateData(data: IterableOnce[Array[String]]): List[MyCaseClass] =
data
.iterator
.map(parse)
.tapEach(logErrors)
.collect {
case Right(value) => value
}.toList
Which you can use like this:
val arrayIterable = Iterable(Array("a", "b"), Array("a", "b", "c"))
validateData(arrayIterable)
// Bad input: [a, b, c]
// res14: List[MyCaseClass] = List(MyCaseClass("a", "b"))

Troubles with collections type in pattern matching in Scala

I tried to do collection matching in Scala without using scala.reflect.ClassTag
case class Foo(name: String)
case class Bar(id: Int)
case class Items(items: Vector[AnyRef])
val foo = Vector(Foo("a"), Foo("b"), Foo("c"))
val bar = Vector(Bar(1), Bar(2), Bar(3))
val fc = Items(foo)
val bc = Items(bar)
we can not do this:
fc match {
case Items(x) if x.isInstanceOf[Vector[Foo]]
}
because:
Warning: non-variable type argument Foo in type scala.collection.immutable.Vector[Foo] (the underlying of Vector[Foo]) is unchecked since it is eliminated by erasure
and this:
fc match {
case Items(x: Vector[Foo]) =>
}
but we can do this:
fc match {
case Items(x#(_: Foo) +: _) => ...
case Items(x#(_: Bar) +: _) => ...
}
bc match {
case Items(x#(_: Foo) +: _) => ...
case Items(x#(_: Bar) +: _) => ...
}
As you can see, we are check - is collection Foo + vector or Bar + vector.
And here we are have some problems:
We can do Vector(Foo("1"), Bar(2)), and this is will be match with Foo.
We are still need "val result = x.asInstanceOf[Vector[Bar]]" class casting for result extraction
Is there are some more beautiful way?
Like this:
fc match {
case Items(x: Vector[Foo]) => // result is type of Vector[Foo] already
}

What you're doing here is fundamentally just kind of unpleasant, so I'm not sure making it possible to do it in a beautiful way is a good thing, but for what it's worth, Shapeless's TypeCase is a little nicer:
case class Foo(name: String)
case class Bar(id: Int)
case class Items(items: Vector[AnyRef])
val foo = Vector(Foo("a"), Foo("b"), Foo("c"))
val bar = Vector(Bar(1), Bar(2), Bar(3))
val fc = Items(foo)
val bc = Items(bar)
val FooVector = shapeless.TypeCase[Vector[Foo]]
val BarVector = shapeless.TypeCase[Vector[Bar]]
And then:
scala> fc match {
| case Items(FooVector(items)) => items
| case _ => Vector.empty
| }
res0: Vector[Foo] = Vector(Foo(a), Foo(b), Foo(c))
scala> bc match {
| case Items(FooVector(items)) => items
| case _ => Vector.empty
| }
res1: Vector[Foo] = Vector()
Note that while ClassTag instances can also be used in this way, they don't do what you want:
scala> val FooVector = implicitly[scala.reflect.ClassTag[Vector[Foo]]]
FooVector: scala.reflect.ClassTag[Vector[Foo]] = scala.collection.immutable.Vector
scala> fc match {
| case Items(FooVector(items)) => items
| case _ => Vector.empty
| }
res2: Vector[Foo] = Vector(Foo(a), Foo(b), Foo(c))
scala> bc match {
| case Items(FooVector(items)) => items
| case _ => Vector.empty
| }
res3: Vector[Foo] = Vector(Bar(1), Bar(2), Bar(3))
…which will of course throw ClassCastExceptions if you try to use res3.
This really isn't a nice thing to do, though—inspecting types at runtime undermines parametricity, makes your code less robust, etc. Type erasure is a good thing, and the only problem with type erasure on the JVM is that it's not more complete.

If you want something that is simple using implicit conversions. then try this!
implicit def VectorConversionI(items: Items): Vector[AnyRef] = items match { case x#Items(v) => v }
Example:
val fcVertor: Vector[AnyRef] = fc // Vector(Foo(a), Foo(b), Foo(c))
val bcVertor: Vector[AnyRef] = bc // Vector(Bar(1), Bar(2), Bar(3))

Need to convert a Seq[Option[A]] to Option[Seq[A]]

USE CASE
I have a list of files that can might have a valid mime type or not.
In my code, I represent this using an Option.
I need to convert a Seq[Option[T]] to Option[Seq[T]] so that I do not process the list if some of the files are invalid.
ERROR
This is the error in the implementation below:
found : (Option[Seq[A]], Option[A]) => Option[Seq[A]]
[error] required: (Option[Any], Option[Any]) => Option[Any]
[error] s.fold(init)(liftOptionItem[A])
IMPLEMENTATION
def liftOptionItem[A](acc: Option[Seq[A]], itemOption: Option[A]): Option[Seq[A]] = {
{
acc match {
case None => None
case Some(items) =>
itemOption match {
case None => None
case Some(item) => Some(items ++ Seq(item))
}
}
}
}
def liftOption[A](s: Seq[Option[A]]): Option[Seq[A]] = {
s.fold(Some(Seq()))(liftOptionItem[A])
}
This implementation returns Option[Any] instead of Option[Seq[A] as the type of the liftOptionItem[A] does not fit in.

If you use TypeLevel Cats:
import cats.implicits._
List(Option(1), Option(2), Option(3)).traverse(identity)
Returns:
Option[List[Int]] = Some(List(1, 2, 3))
You have to use List so use a toList first:
Seq(Option(1), Option(2), Option(3)).toList.traverse(identity).map(_.toSeq)

using scalaz:
import scalaz._
import Sclaza._
val x:List[Option[Int]] = List(Option(1))
x.sequence[Option, Int] //returns Some(List(1))
val y:List[Option[Int]] = List(None, Option(1))
y.sequence[Option, Int] // returns None

If you dont want to use functional libraries like cats or Scalaz you could use a foldLeft
def seqToOpt[A](seq: Seq[Option[A]]): Option[Seq[A]] =
seq.foldLeft(Option(Seq.empty[A])){
(res, opt) =>
for {
seq <- res
v <- opt
} yield seq :+ v
}

Tail-recursive Solution: It returns None if any one of the seq element is None.
def seqToOption[T](s: Seq[Option[T]]): Option[Seq[T]] = {
#tailrec
def seqToOptionHelper(s: Seq[Option[T]], accum: Seq[T] = Seq[T]()): Option[Seq[T]] = {
s match {
case Some(head) :: Nil => Option(head +: accum)
case Some(head) :: tail => seqToOptionHelper(tail, head +: accum)
case _ => None
}
}
seqToOptionHelper(s)
}

Dealing with None in case statements is the reason for returning the Option[Seq[Any]] type in stead of Option[Seq[A]] type. We need to make the function
liftOptionItem[A] to return Option[Seq[Any]] type. And the compilation error can be fixed with the following changes in both the functions.(Because fold does not go in any particular order, there are constraints on the start value and thus return value , the foldLeft is used in stead of fold.)
def liftOptionItem[A](acc: Option[Seq[Any]], itemOption: Option[A]): Option[Seq[Any]] = {
{
acc match {
case None => Some(Nil)
case Some(items)=>
itemOption match {
case None => Some(items ++ Seq("None"))
case Some(item) => Some(items ++ Seq(item))
}
}
}
}
def liftOption[A](s: Seq[Option[A]]): Option[Seq[Any]] = {
s.foldLeft(Option(Seq[Any]()))(liftOptionItem[A])
}
Now, code compiles.
In Scala REPL:
scala> val list1 = Seq(None,Some(21),None,Some(0),Some(43),None)
list1: Seq[Option[Int]] = List(None, Some(21), None, Some(0), Some(43), None)
scala> liftOption(list1)
res2: Option[Seq[Any]] = Some(List(None, 21, None, 0, 43, None))
scala> val list2 = Seq(None,Some("String1"),None,Some("String2"),Some("String3"),None)
list2: Seq[Option[String]] = List(None, Some(String1), None, Some(String2), Some(String3), None)
scala> liftOption(list2)
res3: Option[Seq[Any]] = Some(List(None, String1, None, String2, String3, None))

There is no really "beautiful" way to make this with out scalaz or cats.
But you can try something like this.
def seqToOpt[A](seq: Seq[Option[A]]): Option[Seq[A]] = {
val flatten = seq.flatten
if (flatten.isEmpty) None
else Some(flatten)
}

Scala: How to simplify nested pattern matching statements

I am writing a Hive UDF in Scala (because I want to learn scala). To do this, I have to override three functions: evaluate, initialize and getDisplayString.
In the initialize function I have to:
Receive an array of ObjectInspector and return an ObjectInspector
Check if the array is null
Check if the array has the correct size
Check if the array contains the object of the correct type
To do this, I am using pattern matching and came up with the following function:
override def initialize(genericInspectors: Array[ObjectInspector]): ObjectInspector = genericInspectors match {
case null => throw new UDFArgumentException(functionNameString + ": ObjectInspector is null!")
case _ if genericInspectors.length != 1 => throw new UDFArgumentException(functionNameString + ": requires exactly one argument.")
case _ => {
listInspector = genericInspectors(0) match {
case concreteInspector: ListObjectInspector => concreteInspector
case _ => throw new UDFArgumentException(functionNameString + ": requires an input array.")
}
PrimitiveObjectInspectorFactory.getPrimitiveWritableObjectInspector(listInspector.getListElementObjectInspector.asInstanceOf[PrimitiveObjectInspector].getPrimitiveCategory)
}
}
Nevertheless, I have the impression that the function could be made more legible and, in general, prettier since I don't like to have code with too many levels of indentation.
Is there an idiomatic Scala way to improve the code above?

It's typical for patterns to include other patterns. The type of x here is String.
scala> val xs: Array[Any] = Array("x")
xs: Array[Any] = Array(x)
scala> xs match {
| case null => ???
| case Array(x: String) => x
| case _ => ???
| }
res0: String = x
The idiom for "any number of args" is "sequence pattern", which matches arbitrary args:
scala> val xs: Array[Any] = Array("x")
xs: Array[Any] = Array(x)
scala> xs match { case Array(x: String) => x case Array(_*) => ??? }
res2: String = x
scala> val xs: Array[Any] = Array(42)
xs: Array[Any] = Array(42)
scala> xs match { case Array(x: String) => x case Array(_*) => ??? }
scala.NotImplementedError: an implementation is missing
at scala.Predef$.$qmark$qmark$qmark(Predef.scala:230)
... 32 elided
scala> Array("x","y") match { case Array(x: String) => x case Array(_*) => ??? }
scala.NotImplementedError: an implementation is missing
at scala.Predef$.$qmark$qmark$qmark(Predef.scala:230)
... 32 elided
This answer should not be construed as advocating matching your way back to type safety.

Chaining Scalaz validation functions: Function1[A,Validation[E,B]]

I'm trying to write some code to make it easy to chain functions that return Scalaz Validation types. One method I am trying to write is analogous to Validation.flatMap (Short circuit that validation) which I will call andPipe. The other is analogous to |#| on ApplicativeBuilder (accumulating errors) except it only returns the final Success type, which I will call andPass
Suppose I have functions:
def allDigits: (String) => ValidationNEL[String, String]
def maxSizeOfTen: (String) => ValidationNEL[String, String]
def toInt: (String) => ValidationNEL[String, Int]
As an example, I would like to first pass the input String to both allDigits and maxSizeOf10. If there are failures, it should short circuit by not calling the toInt function and return either or both failures that occurred. If Successful, I would like to pass the Success value to the toInt function. From there, it would either Succeed with the output value being an Int, or it would fail returning only the validation failure from toInt.
def intInput: (String) => ValidationNEL[String,Int] = (allDigits andPass maxSizeOfTen) andPipe toInt
Is there a way to do this without my add-on implementation below?
Here is my Implementation:
trait ValidationFuncPimp[E,A,B] {
val f: (A) => Validation[E, B]
/** If this validation passes, pass to f2, otherwise fail without accumulating. */
def andPipe[C](f2: (B) => Validation[E,C]): (A) => Validation[E,C] = (a: A) => {
f(a) match {
case Success(x) => f2(x)
case Failure(x) => Failure(x)
}
}
/** Run this validation and the other validation, Success only if both are successful. Fail accumulating errors. */
def andPass[D](f2: (A) => Validation[E,D])(implicit S: Semigroup[E]): (A) => Validation[E,D] = (a:A) => {
(f(a), f2(a)) match {
case (Success(x), Success(y)) => Success(y)
case (Failure(x), Success(y)) => Failure(x)
case (Success(x), Failure(y)) => Failure(y)
case (Failure(x), Failure(y)) => Failure(S.append(x, y))
}
}
}
implicit def toValidationFuncPimp[E,A,B](valFunc : (A) => Validation[E,B]): ValidationFuncPimp[E,A,B] = {
new ValidationFuncPimp[E,A,B] {
val f = valFunc
}
}

I'm not claiming that this answer is necessarily any better than drstevens's, but it takes a slightly different approach and wouldn't fit in a comment there.
First for our validation methods (note that I've changed the type of toInt a bit, for reasons I'll explain below):
import scalaz._, Scalaz._
def allDigits: (String) => ValidationNEL[String, String] =
s => if (s.forall(_.isDigit)) s.successNel else "Not all digits".failNel
def maxSizeOfTen: (String) => ValidationNEL[String, String] =
s => if (s.size <= 10) s.successNel else "Too big".failNel
def toInt(s: String) = try(s.toInt.right) catch {
case _: NumberFormatException => NonEmptyList("Still not an integer").left
}
I'll define a type alias for the sake of convenience:
type ErrorsOr[+A] = NonEmptyList[String] \/ A
Now we've just got a couple of Kleisli arrows:
val validator = Kleisli[ErrorsOr, String, String](
allDigits.flatMap(x => maxSizeOfTen.map(x *> _)) andThen (_.disjunction)
)
val integerizer = Kleisli[ErrorsOr, String, Int](toInt)
Which we can compose:
val together = validator >>> integerizer
And use like this:
scala> together("aaa")
res0: ErrorsOr[Int] = -\/(NonEmptyList(Not all digits))
scala> together("12345678900")
res1: ErrorsOr[Int] = -\/(NonEmptyList(Too big))
scala> together("12345678900a")
res2: ErrorsOr[Int] = -\/(NonEmptyList(Not all digits, Too big))
scala> together("123456789")
res3: ErrorsOr[Int] = \/-(123456789)
Using flatMap on something that isn't monadic makes me a little uncomfortable, and combining our two ValidationNEL methods into a Kleisli arrow in the \/ monad—which also serves as an appropriate model for our string-to-integer conversion—feels a little cleaner to me.

This is relatively concise with little "added code". It is still sort of wonky though because it ignores the successful result of applying allDigits.
scala> val validated = for {
| x <- allDigits
| y <- maxSizeOfTen
| } yield x *> y
validated: String => scalaz.Validation[scalaz.NonEmptyList[String],String] = <function1>
scala> val validatedToInt = (str: String) => validated(str) flatMap(toInt)
validatedToInt: String => scalaz.Validation[scalaz.NonEmptyList[String],Int] = <function1>
scala> validatedToInt("10")
res25: scalaz.Validation[scalaz.NonEmptyList[String],Int] = Success(10)
Alternatively you could keep both of the outputs of allDigits and maxSizeOfTen.
val validated2 = for {
x <- allDigits
y <- maxSizeOfTen
} yield x <|*|> y
I'm curious if someone else could come up with a better way to combine these. It's not really composition...
val validatedToInt = (str: String) => validated2(str) flatMap(_ => toInt(str))
Both validated and validated2 accumulate failures as shown below:
scala> def allDigits: (String) => ValidationNEL[String, String] = _ => failure(NonEmptyList("All Digits Fail"))
allDigits: String => scalaz.Scalaz.ValidationNEL[String,String]
scala> def maxSizeOfTen: (String) => ValidationNEL[String, String] = _ => failure(NonEmptyList("max > 10"))
maxSizeOfTen: String => scalaz.Scalaz.ValidationNEL[String,String]
scala> val validated = for {
| x <- allDigits
| y <- maxSizeOfTen
| } yield x *> y
validated: String => scalaz.Validation[scalaz.NonEmptyList[String],String] = <function1>
scala> val validated2 = for {
| x <- allDigits
| y <- maxSizeOfTen
| } yield x <|*|> y
validated2: String => scalaz.Validation[scalaz.NonEmptyList[String],(String, String)] = <function1>
scala> validated("ten")
res1: scalaz.Validation[scalaz.NonEmptyList[String],String] = Failure(NonEmptyList(All Digits Fail, max > 10))
scala> validated2("ten")
res3: scalaz.Validation[scalaz.NonEmptyList[String],(String, String)] = Failure(NonEmptyList(All Digits Fail, max > 10))

Use ApplicativeBuilder with the first two, so that the errors accumulate,
then flatMap toInt, so toInt only gets called if the first two succeed.
val validInt: String => ValidationNEL[String, Int] =
for {
validStr <- (allDigits |#| maxSizeOfTen)((x,_) => x);
i <- toInt
} yield(i)

We Keep Coding

iphone swift flutter scala powershell matlab mongodb postgresql perl eclipse

Pattern Matching for CSV processing - scala

Try This: val asdf = Array("1,2,3","1,b,c").map(_.split(",").map(_.trim).map(trytoDouble(_))).map{ _(2) match { case x:Option[Double] => x } }

Related

Log warnings in pattern match without returning Any

Troubles with collections type in pattern matching in Scala

Need to convert a Seq[Option[A]] to Option[Seq[A]]

Scala: How to simplify nested pattern matching statements

Chaining Scalaz validation functions: Function1[A,Validation[E,B]]

Categories

Resources