Scala parsing string sequence

Scala parsing string sequence - scala

I wish to parse a sequence of string into individual tokens. Right now it only parses the first word.
class SimpleRegexParser extends RegexParsers{
def word: Parser[String] = """[a-z]+""".r ^^ { _.toString }
}
object SimpleRegexParserMain extends SimpleRegexParser{
def main(args: Array[String]) = {
println(parse(word, "johnny has a little lamb"))
}
}
Right now I am getting : [1.7] parsed: johnny
How can I parse the whole string into individual tokens, so that this works for variable string length.
Any pointers to make this work are welcome. Please tell me how I can make it work in scala.

Okay. I found the answer to this.
I had to change the definition of word.
Here is the updated definition:
class SimpleRegexParser extends RegexParsers{
def word: Parser[String] = rep("""[a-z]+""".r) ^^ { _.toString }
}
Previous expression was working for single word, I now have a repetition of word rep().
Here is the result:
[1.25] parsed: List(johnny, has, a, little, lamb)

Related

What is mean method Apply on Scala

according to this link example
and then im create code like this :
object objPersamaan{
def main(args : Array[String]){
val x = objPersamaan("Budi ")
println(x)
}
def apply(pX:String) = pX + "Belajar"
def unapply(pZ:Int) : Option[Int] = if(pZ%2==0) Some(pZ/2) else None
}
i can't still dont understand, how can objPersamaan("Budi ") would had value Budi Belajar? is that method apply like constructor with parameter? or what? can anyone, explain more detail why the process from method apply? thanks!

In Scala, you can omit the method name apply. So calling objPersamaan("Budi ") is the same as objPersamaan.apply("Budi "), which then concatenates the String Belajar to the given parameter.
When you have an object and call it without a method name, just your arguments, the copmiler expands this call with MyObject.apply(...). This comes from the fact, that you often use objects where you want to apply something to (e.g. in factories), so Scala built this feature directly into the language.
case class MyObj(foo: String, bar: String)
object MyObj {
def apply(x: Int): MyObj = MyObj(x.toBinaryString, x.toHexString)
}
// ...
MyObj(42)
// same as
MyObj.apply(42)

Is there a good way in Scala to interpret the types of values in a CSV

Suppose I'm given a CSV with the following values:
0, 1.00, Hello
3, 2.13, World
.
.
.
Is there a good method or library that could automatically detect the best type to classify a given column as? In this case (Int, Float, String).
For more context, I'm attempting to extend a CSV parsing library to allow it to report histogram like data on the CSV that is passed in. The idea is to make it very easy to add certain validation tasks into this framework so as to figure out deficiencies or irregularities in a CSV data dump.
Intially I thought to write something which a user could supply a config file that specified the types, but for cases when the CSV column sets are very large, or just for ease of use, I'd like to attempt to automatically detect the types instead of having a user have to write them out.

One answer might be:
def parse(s:String): Any = Try(s.toInt) orElse(Try(s.toDouble)) getOrElse(s)
Then you can use pattern-matching to do whatever you want with it.
You could, of course, first do regular-expression tests on the string to see which type you have. But I'm fairly sure just brute-forcing the parse for each format, as above, will be faster.

Consider parser combinators; inferred types are reported via a list of case classes,
import scala.util.parsing.combinator._
trait CSVType
case class LiteralStr extends CSVType
case class Float extends CSVType
case class Integer extends CSVType
case class Bool extends CSVType
case class NA extends CSVType // Not Available
class CSV extends JavaTokenParsers {
def row: Parser[List[CSVType]] = repsep(value, ",")
def value: Parser[CSVType] =
floatingPointNumber ^^ { f => if (f.toDouble.toInt == f.toDouble) Integer()
else Float() } |
"NA" ^^ { na => NA() } |
("true" | "false") ^^ { b => Bool() } |
stringLiteral ^^ { s => LiteralStr() }
}
object ParseExpr extends CSV with App {
println("in: "+ args(0))
println(parseAll(row, args(0)))
}
Hence
scala> val s = """1.23,2,true,NA,"hello" """
s: String = "1.23,2,true,NA,"hello" "
scala> ParseExpr.main(Array(s))
in: 1.23,2,true,NA,"hello"
[1.24] parsed: List(Float(), Integer(), Bool(), NA(), LiteralStr())
Note that combinators include the parsing of types such as numerics, boolean and strings. In addition, custom types are defined by the parser, for instance NA. See JavaTokenParsers trait for definitions used here.
Each case class may include additional logic to report typing in a most convenient way.

Parsing sentences using Scala parser combinator

I just started playing with parser combinators in Scala, but got stuck on a parser to parse sentences such as "I like Scala." (words end on a whitespace or a period (.)).
I started with the following implementation:
package example
import scala.util.parsing.combinator._
object Example extends RegexParsers {
override def skipWhitespace = false
def character: Parser[String] = """\w""".r
def word: Parser[String] =
rep(character) <~ (whiteSpace | guard(literal("."))) ^^ (_.mkString(""))
def sentence: Parser[List[String]] = rep(word) <~ "."
}
object Test extends App {
val result = Example.parseAll(Example.sentence, "I like Scala.")
println(result)
}
The idea behind using guard() is to have a period demarcate word endings, but not consume it so that sentences can. However, the parser gets stuck (adding log() reveals that it is repeatedly trying the word and character parser).
If I change the word and sentence definitions as follows, it parses the sentence, but the grammar description doesn't look right and will not work if I try to add parser for paragraph (rep(sentence)) etc.
def word: Parser[String] =
rep(character) <~ (whiteSpace | literal(".")) ^^ (_.mkString(""))
def sentence: Parser[List[String]] = rep(word) <~ opt(".")
Any ideas what may be going on here?

However, the parser gets stuck (adding log() reveals that it is repeatedly trying the word and character parser).
The rep combinator corresponds to a * in perl-style regex notation. This means it matches zero or more characters. I think you want it to match one or more characters. Changing that to a rep1 (corresponding to + in perl-style regex notation) should fix the problem.
However, your definition still seems a little verbose to me. Why are you parsing individual characters instead of just using \w+ as the pattern for a word? Here's how I'd write it:
object Example extends RegexParsers {
override def skipWhitespace = false
def word: Parser[String] = """\w+""".r
def sentence: Parser[List[String]] = rep1sep(word, whiteSpace) <~ "."
}
Notice that I use rep1sep to parse a non-empty list of words separated by whitespace. There's a repsep combinator as well, but I think you'd want at least one word per sentence.

Generic synchronisation design

We are building some sync functionality using two-way json requests and this algorithm. All good and we have it running in prototype mode. Now I am trying to genericise the code, as we will be synching for several tables in the app. It would be cool to be able to define a class as "extends Synchable" and get the additional attributes and sync processing methods with a few specialisations/overrides. I have got this far:
abstract class Synchable [T<:Synchable[T]] (val ruid: String, val lastSyncTime: String, val isDeleted:Int) {
def contentEquals(Target: T): Boolean
def updateWith(target: T)
def insert
def selectSince(clientLastSyncTime: String): List[T]
def findByRuid(ruid: String): Option[T]
implicit val validator: Reads[T]
def process(clientLastSyncTime: String, updateRowList: List[JsObject]) = {
for (syncRow <- updateRowList) {
val validatedSyncRow = syncRow.validate[Synchable]
validatedSyncRow.fold(
valid = { result => // valid row
findByRuid(result.ruid) match { //- do we know about it?
case Some(knownRow) => knownRow.updateWith(result)
case None => result.insert
}
}... invalid, etc
I am new to Scala and know I am probably missing things - WIP!
Any pointers or suggestions on this approach would be much appreciated.

Some quick ones:
Those _ parameters you pass in and then immediately assign to vals: why not do it in one hit? e.g.
abstract class Synchable( val ruid: String = "", val lastSyncTime: String = "", val isDeleted: Int = 0) {
which saves you a line and is clearer in intent as well I think.
I'm not sure about your defaulting of Strings to "" - unless there's a good reason (and there often is), I think using something like ruid:Option[String] = None is more explicit and lets you do all sorts of nice monad-y things like fold, map, flatMap etc.
Looking pretty cool otherwise - the only other thing you might want to do is strengthen the typing with a bit of this.type magic so you'll prevent incorrect usage at compile-time. With your current abstract class, nothing prevents me from doing:
class SynchableCat extends Synchable { ... }
class SynchableDog extends Synchable { ... }
val cat = new SynchableCat
val dog = new SynchableDog
cat.updateWith(dog) // This won't end well
But if you just change your abstract method signatures to things like this:
def updateWith(target: this.type)
Then the change ripples down through the subclasses, narrowing down the types, and the compiler will omit a (relatively clear) error if I try the above update operation.

Strongly typed access to csv in scala?

I would like to access csv files in scala in a strongly typed manner. For example, as I read each line of the csv, it is automatically parsed and represented as a tuple with the appropriate types. I could specify the types beforehand in some sort of schema that is passed to the parser. Are there any libraries that exist for doing this? If not, how could I go about implementing this functionality on my own?

product-collections appears to be a good fit for your requirements:
scala> val data = CsvParser[String,Int,Double].parseFile("sample.csv")
data: com.github.marklister.collections.immutable.CollSeq3[String,Int,Double] =
CollSeq((Jan,10,22.33),
(Feb,20,44.2),
(Mar,25,55.1))
product-collections uses opencsv under the hood.
A CollSeq3 is an IndexedSeq[Product3[T1,T2,T3]] and also a Product3[Seq[T1],Seq[T2],Seq[T3]] with a little sugar. I am the author of product-collections.
Here's a link to the io page of the scaladoc
Product3 is essentially a tuple of arity 3.

If your content has double-quotes to enclose other double quotes, commas and newlines, I would definitely use a library like opencsv that deals properly with special characters. Typically you end up with Iterator[Array[String]]. Then you use Iterator.map or collect to transform each Array[String] into your tuples dealing with type conversions errors there. If you need to do process the input without loading all in memory, you then keep working with the iterator, otherwise you can convert to a Vector or List and close the input stream.
So it may look like this:
val reader = new CSVReader(new FileReader(filename))
val iter = reader.iterator()
val typed = iter collect {
case Array(double, int, string) => (double.toDouble, int.toInt, string)
}
// do more work with typed
// close reader in a finally block
Depending on how you need to deal with errors, you can return Left for errors and Right for success tuples to separate the errors from the correct rows. Also, I sometimes wrap of all this using scala-arm for closing resources. So my data maybe wrapped into the resource.ManagedResource monad so that I can use input coming from multiple files.
Finally, although you want to work with tuples, I have found that it is usually clearer to have a case class that is appropriate for the problem and then write a method that creates that case class object from an Array[String].

You can use kantan.csv, which is designed with precisely that purpose in mind.
Imagine you have the following input:
1,Foo,2.0
2,Bar,false
Using kantan.csv, you could write the following code to parse it:
import kantan.csv.ops._
new File("path/to/csv").asUnsafeCsvRows[(Int, String, Either[Float, Boolean])](',', false)
And you'd get an iterator where each entry is of type (Int, String, Either[Float, Boolean]). Note the bit where the last column in your CSV can be of more than one type, but this is conveniently handled with Either.
This is all done in an entirely type safe way, no reflection involved, validated at compile time.
Depending on how far down the rabbit hole you're willing to go, there's also a shapeless module for automated case class and sum type derivation, as well as support for scalaz and cats types and type classes.
Full disclosure: I'm the author of kantan.csv.

I've created a strongly-typed CSV helper for Scala, called object-csv. It is not a fully fledged framework, but it can be adjusted easily. With it you can do this:
val peopleFromCSV = readCSV[Person](fileName)
Where Person is case class, defined like this:
case class Person (name: String, age: Int, salary: Double, isNice:Boolean = false)
Read more about it in GitHub, or in my blog post about it.

Edit: as pointed out in a comment, kantan.csv (see other answer) is probably the best as of the time I made this edit (2020-09-03).
This is made more complicated than it ought to because of the nontrivial quoting rules for CSV. You probably should start with an existing CSV parser, e.g. OpenCSV or one of the projects called scala-csv. (There are at least three.)
Then you end up with some sort of collection of collections of strings. If you don't need to read massive CSV files quickly, you can just try to parse each line into each of your types and take the first one that doesn't throw an exception. For example,
import scala.util._
case class Person(first: String, last: String, age: Int) {}
object Person {
def fromCSV(xs: Seq[String]) = Try(xs match {
case s0 +: s1 +: s2 +: more => new Person(s0, s1, s2.toInt)
})
}
If you do need to parse them fairly quickly and you don't know what might be there, you should probably use some sort of matching (e.g. regexes) on the individual items. Either way, if there's any chance of error you probably want to use Try or Option or somesuch to package errors.

I built my own idea to strongly typecast the final product, more than the reading stage itself..which as pointed out might be better handled as stage one with something like Apache CSV, and stage 2 could be what i've done. Here's the code you are welcome to it. The idea is to typecast the CSVReader[T] with type T .. upon construction, you must supply the reader with a Factor object of Type[T] as well. The idea here is that the class itself (or in my example a helper object) decides the construction detail and thus decouples this from the actual reading. You could use Implicit objects to pass the helper around but I've not done that here. The only downside is that each row of the CSV must be of the same class type, but you could expand this concept as needed.
class CsvReader/**
* #param fname
* #param hasHeader : ignore header row
* #param delim : "\t" , etc
*/
[T] ( factory:CsvFactory[T], fname:String, delim:String) {
private val f = Source.fromFile(fname)
private var lines = f.getLines //iterator
private var fileClosed = false
if (lines.hasNext) lines = lines.dropWhile(_.trim.isEmpty) //skip white space
def hasNext = (if (fileClosed) false else lines.hasNext)
lines = lines.drop(1) //drop header , assumed to exist
/**
* also closes the file
* #return the line
*/
def nextRow ():String = { //public version
val ans = lines.next
if (ans.isEmpty) throw new Exception("Error in CSV, reading past end "+fname)
if (lines.hasNext) lines = lines.dropWhile(_.trim.isEmpty) else close()
ans
}
//def nextObj[T](factory:CsvFactory[T]): T = past version
def nextObj(): T = { //public version
val s = nextRow()
val a = s.split(delim)
factory makeObj a
}
def allObj() : Seq[T] = {
val ans = scala.collection.mutable.Buffer[T]()
while (hasNext) ans+=nextObj()
ans.toList
}
def close() = {
f.close;
fileClosed = true
}
} //class
next the example Helper Factory and example "Main"
trait CsvFactory[T] { //handles all serial controls (in and out)
def makeObj(a:Seq[String]):T //for reading
def makeRow(obj:T):Seq[String]//the factory basically just passes this duty
def header:Seq[String] //must define headers for writing
}
/**
* Each class implements this as needed, so the object can be serialized by the writer
*/
case class TestRecord(val name:String, val addr:String, val zip:Int) {
def toRow():Seq[String] = List(name,addr,zip.toString) //handle conversion to CSV
}
object TestFactory extends CsvFactory[TestRecord] {
def makeObj (a:Seq[String]):TestRecord = new TestRecord(a(0),a(1),a(2).toDouble.toInt)
def header = List("name","addr","zip")
def makeRow(o:TestRecord):Seq[String] = {
o.toRow.map(_.toUpperCase())
}
}
object CsvSerial {
def main(args: Array[String]): Unit = {
val whereami = System.getProperty("user.dir")
println("Begin CSV test in "+whereami)
val reader = new CsvReader(TestFactory,"TestCsv.txt","\t")
val all = reader.allObj() //read the CSV info a file
sd.p(all)
reader.close
val writer = new CsvWriter(TestFactory,"TestOut.txt", "\t")
for (x<-all) writer.printObj(x)
writer.close
} //main
}
Example CSV (tab seperated.. might need to repair if you copy from an editor)
Name Addr Zip "Sanders, Dante R." 4823 Nibh Av. 60797.00 "Decker, Caryn G." 994-2552 Ac Rd. 70755.00 "Wilkerson, Jolene Z." 3613 Ultrices. St. 62168.00 "Gonzales, Elizabeth W." "P.O. Box 409, 2319 Cursus. Rd." 72909.00 "Rodriguez, Abbot O." Ap #541-9695 Fusce Street 23495.00 "Larson, Martin L." 113-3963 Cras Av. 36008.00 "Cannon, Zia U." 549-2083 Libero Avenue 91524.00 "Cook, Amena B." Ap
#668-5982 Massa Ave 69205.00
And finally the writer (notice the factory methods require this as well with "makerow"
import java.io._
class CsvWriter[T] (factory:CsvFactory[T], fname:String, delim:String, append:Boolean = false) {
private val out = new PrintWriter(new BufferedWriter(new FileWriter(fname,append)));
if (!append) out.println(factory.header mkString delim )
def flush() = out.flush()
def println(s:String) = out.println(s)
def printObj(obj:T) = println( factory makeRow(obj) mkString(delim) )
def printAll(objects:Seq[T]) = objects.foreach(printObj(_))
def close() = out.close
}

If you know the the # and types of fields, maybe like this?:
case class Friend(id: Int, name: String) // 1, Fred
val friends = scala.io.Source.fromFile("friends.csv").getLines.map { line =>
val fields = line.split(',')
Friend(fields(0).toInt, fields(1))
}

We Keep Coding

iphone swift flutter scala powershell matlab mongodb postgresql perl eclipse

Scala parsing string sequence - scala

Related

What is mean method Apply on Scala

Is there a good way in Scala to interpret the types of values in a CSV

Parsing sentences using Scala parser combinator

Generic synchronisation design

Strongly typed access to csv in scala?

Categories

Resources