Customised string in spark scala - scala

I have a string like "debug#compile". My end goal is to convert the first letter of each word to uppercase, so that I get "Debug#Compile", where 'D' and 'C' are converted to uppercase.
My logic:
1) I have to split the string on the delimiters. They will be special characters, so I have to check for them every time.
2) After that I would convert each word's first letter to uppercase using map and then join the words again.
I am trying my best but am not able to design the code for this. Can anyone help me with this? Even hints would help!
Below is my code:
object WordCase {
def main(args: Array[String]) {
val s="siddhesh#kalgaonkar"
var b=""
val delimeters= Array("#","_")
if(delimeters(0)=="#")
{
b=s.split(delimeters(0).toString).map(_.capitalize).mkString(delimeters(0).toString())
}
else if(delimeters(0)=="_")
{
b=s.split(delimeters(0).toString).map(_.capitalize).mkString(delimeters(0).toString())
}
else{
println("Non-Standard String")
}
println(b)
}
}
My code capitalizes the first letter of every word on the basis of a constant delimiter and then has to merge the words again. For the first case, i.e. "#", it capitalizes the first letter of every word, but it fails for the second case, i.e. "_". Am I making any silly mistake in the looping?

scala> val s="siddhesh#kalgaonkar"
scala> val specialChar = (s.split("[a-zA-Z0-9]") filterNot Seq("").contains).mkString
scala> s.replaceAll("[^a-zA-Z]+"," ").split(" ").map(_.capitalize).mkString(",").replaceAll(",",specialChar)
res41: String = Siddhesh#Kalgaonkar
You can handle other special characters in the same way:
scala> val s="siddhesh_kalgaonkar"
s: String = siddhesh_kalgaonkar
scala> val specialChar = (s.split("[a-zA-Z0-9]") filterNot Seq("").contains).mkString
specialChar: String = _
scala> s.replaceAll("[^a-zA-Z]+"," ").split(" ").map(_.capitalize).mkString(",").replaceAll(",",specialChar)
res42: String = Siddhesh_Kalgaonkar

I solved it the easy way:
object WordCase {
def main(args: Array[String]) {
val s = "siddhesh_kalgaonkar"
var b = s.replaceAll("[^a-zA-Z]+", " ").split(" ").map(_.capitalize).mkString(" ") //Replacing delimiters with space and capitalizing first letter of each word
val c=b.indexOf(" ") //Getting index of space
val d=s.charAt(c).toString // Getting delimiter character from the original string
val output=b.replace(" ", d) //Now replacing space with the delimiter character in the modified string i.e 'b'
println(output)
}
}
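As a possible alternative to swapping the delimiter out and back in, the sketch below capitalizes each run of letters in place, so whatever sits between the words is never touched. It assumes that a "word" is a run of ASCII letters, as in the regexes above; the names CapitalizeWords and capitalizeWords are made up for this illustration.

```scala
import scala.util.matching.Regex

object CapitalizeWords {
  // Assumption: a "word" is a run of ASCII letters; anything else is a delimiter.
  private val word: Regex = "[a-zA-Z]+".r

  // Replace every word with its capitalized form, leaving delimiters as they are.
  // quoteReplacement keeps the replacement text from being read as a regex template.
  def capitalizeWords(s: String): String =
    word.replaceAllIn(s, m => Regex.quoteReplacement(m.matched.capitalize))
}
```

For example, CapitalizeWords.capitalizeWords("debug#compile") gives "Debug#Compile", and a string with mixed delimiters such as "a#b_c" keeps each delimiter where it was.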

Related

Extracting a Word after delimiter in scala

I want to extract a word from a String in Scala
val text = "This ball is from Rock. He is good boy"
I want to extract "Rock" from the string.
I tried:
val op = text.substring(4)
text is not a fixed-length string. I just want to pick the first word after "from".
This doesn't give the right word. Can anyone suggest a fix?
This does what you want:
text.drop(text.indexOfSlice("from ")+5).takeWhile(_.isLetter)
or more generally
val delim = "from "
text.drop(text.indexOfSlice(delim)+delim.length).takeWhile(_.isLetter)
The indexOfSlice finds the position of the delimiter and the drop removes everything up to the end of the delimiter. The takeWhile takes all the letters for the word and stops at the first non-letter character (in this case ".").
Note that this is case sensitive so it will not find "From ", it will only work with "from ". If more complex testing is required then use split to convert to separate words and check each word in turn.
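One thing the one-liner does not cover is a missing delimiter: indexOfSlice then returns -1 and drop still removes a few characters, so some unrelated word comes back. A minimal sketch that returns an Option instead is shown here; the names WordAfterDelimiter and wordAfter are hypothetical, and it keeps the same case-sensitive behaviour.

```scala
object WordAfterDelimiter {
  // Returns the run of letters that directly follows `delim`,
  // or None when `delim` is absent or is not followed by a letter.
  def wordAfter(text: String, delim: String): Option[String] = {
    val start = text.indexOf(delim)
    if (start < 0) None
    else Some(text.drop(start + delim.length).takeWhile(_.isLetter)).filter(_.nonEmpty)
  }
}
```

With the text from the question, wordAfter(text, "from ") yields Some("Rock"), while wordAfter(text, "From ") yields None because the match is case sensitive.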
This is because you are telling Scala to return everything from index 4 to the end of the string.
In your case you first want to split the string into words, which can be done using the split function, and then access the word you want.
Note: split gives you an array of strings, and array indices begin at 0 in Scala, so "Rock" would be at index 4.
This piece of code should work for you. The basic idea is to find the index of the word that immediately follows a given word (in this case "from").
val text = "This ball is from Rock. He is good boy"
val splitText = text.split("\\W+")
println(splitText(4))
Based on your comment below, I would write something like this:
import scala.util.control.Breaks.{break, breakable}
object Question extends App {
val text = "This ball is from Rock. He is from good boy"
val splitText = text.split("\\W+")
val index = getIndex(splitText, "from")
println(index)
if (index != -1) {
println(splitText(index))
}
def getIndex(arrSplit: Array[String], subString: String): Int = {
var output = -1
var index = 0
breakable {
for(item <- arrSplit) {
if(item.equalsIgnoreCase(subString)) {
output = index + 1
break
}
index = index + 1
}
}
output
}
}
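The same lookup can probably be written without breakable and mutable counters. The sketch below is one way to do it, assuming the same "\\W+" tokenization as above; WordAfterMarker and wordAfterMarker are names invented for this example. It also returns None when the marker is the last word, a case in which indexing the array directly would throw.

```scala
object WordAfterMarker {
  // Walk over adjacent word pairs and return the word that follows the first
  // case-insensitive occurrence of `marker`, if there is one.
  def wordAfterMarker(text: String, marker: String): Option[String] =
    text.split("\\W+").sliding(2).collectFirst {
      case Array(word, next) if word.equalsIgnoreCase(marker) => next
    }
}
```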
I hope this is what you are expecting:
object Test2 extends App {
val text = "This ball is from Rock. He is good boy"
private val words = text.split(" ")
private val wordToTrack = "from"
val indexOfFrom = words.indexOf(wordToTrack);
val expectedText = words.filterNot{
case word if indexOfFrom < words.size & words.contains(wordToTrack) =>
word == words(indexOfFrom + 1)
case _ => false
}.mkString(" ")
print(expectedText)
}
words.contains(wordToTrack) guards against the case where the tracked word ("from" in this example) is missing from the input text.
I have used a partial function together with filterNot to get the desired result.
You probably want something more general so that you can extract a word from a sentence if that word is present in the input, without having to hard-code offsets:
def extractWordFromString(input: String, word: String): Option[String] = {
val wordLength = word.length
val start = input.indexOfSlice(word)
if (start == -1) None else Some(input.slice(start, start + wordLength))
}
Executing extractWordFromString(text, "Rock") will give you an option containing the target word from input if it was found, and an empty option otherwise. That way you can handle the case where the word you were searching for was not found.

Parse CSV and add only matching rows to List functionally in Scala

I am reading a CSV file in Scala.
Person is a case class:
case class Person(name, address)
def getData(path:String,existingName) : List[Person] = {
Source.fromFile("my_file_path").getLines.drop(1).map(l => {
val data = l.split("|", -1).map(_.trim).toList
val personName = data(0)
if(personName.equalsIgnoreCase(existingName)) {
val address=data(1)
Person(personName,address)
//here I want to add to list
}
else
Nil
///here return empty list of zero length
}).toList()
}
I want to achieve this functionally in scala.
Here's the basic approach to what I think you're trying to do.
case class Person(name:String, address:String)
def getData(path:String, existingName:String) :List[Person] = {
val recordPattern = raw"\s*(?i)($existingName)\s*\|\s*(.*)".r.unanchored
io.Source.fromFile(path).getLines.drop(1).collect {
case recordPattern(name,addr) => Person(name, addr.trim)
}.toList
}
This doesn't close the file reader or report the error if the file can't be opened, which you really should do, but we'll leave that for a different day.
update: added file closing and error handling via Using (Scala 2.13)
import scala.util.{Using, Try}
case class Person(name:String, address:String)
def getData(path:String, existingName:String) :Try[List[Person]] =
Using(io.Source.fromFile(path)){ file =>
val recordPattern = raw"\s*(?i)($existingName)\s*\|\s*([^|]*)".r
file.getLines.drop(1).collect {
case recordPattern(name,addr) => Person(name, addr.trim)
}.toList
}
updated update
OK. Here's a version that:
reports the error if the file can't be opened
closes the file after it's been opened and read
ignores unwanted spaces and quote marks
is pre-2.13 compiler friendly
import scala.util.Try
case class Person(name:String, address:String)
def getData(path:String, existingName:String) :List[Person] = {
val recordPattern =
raw"""[\s"]*(?i)($existingName)["\s]*\|[\s"]*([^"|]*).*""".r
val file = Try(io.Source.fromFile(path))
val res = file.fold(
err => {println(err); List.empty[Person]},
_.getLines.drop(1).collect {
case recordPattern(name,addr) => Person(name, addr.trim)
}.toList)
file.map(_.close)
res
}
And here's how the regex works:
[\s"]*           there might be spaces or quote marks
(?i)             matching is case-insensitive
($existingName)  match and capture this string (1st capture group)
["\s]*           there might be spaces or quote marks
\|               there will be a bar character
[\s"]*           there might be spaces or quote marks
([^"|]*)         match and capture everything that isn't quote or bar
.*               ignore anything that might come thereafter
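To see the pieces of that pattern working together without a file, here is a small sketch that applies it to individual lines. The name "alice" and the sample lines are made up for the demonstration, and the pattern is written in the form the breakdown describes, ending in ([^"|]*) followed by .*.

```scala
object RecordPatternDemo {
  // Hypothetical search name, used only for this demonstration.
  val existingName = "alice"

  // Same structure as the breakdown: optional spaces/quotes, the name (case-insensitive),
  // a bar, optional spaces/quotes, the address, then anything else on the line.
  val recordPattern = raw"""[\s"]*(?i)($existingName)["\s]*\|[\s"]*([^"|]*).*""".r

  // Returns (name, address) when the line matches, None otherwise.
  def parse(line: String): Option[(String, String)] = line match {
    case recordPattern(name, addr) => Some((name, addr.trim))
    case _                         => None
  }
}
```

A line such as ` "Alice" | "12 Main St" | extra` parses to ("Alice", "12 Main St"), while a line for a different name returns None.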
You were not very clear about what the problem with your approach was, but this should do the trick (very close to what you have):
import scala.io.Source
def getData(path: String, existingName: String): List[Person] = {
val source = Source.fromFile(path) // open the file passed in by the caller
val lst = source.getLines.drop(1).flatMap(l => {
val data = l.split("\\|", -1).map(_.trim).toList // split takes a regex, so the bar must be escaped
val personName = data.head
if (personName.equalsIgnoreCase(existingName)) {
val address = data(1)
Option(Person(personName, address))
}
else
Option.empty
}).toList
source.close()
lst
}
We read the file line by line. For each line we extract personName from the first CSV field; if it is the one we are looking for we return a Person wrapped in an Option, otherwise an empty Option (Option.empty). Using flatMap discards the empty options (just to avoid using Nil).

finding a substring in a text column that start and end with a specific string

I'm trying to scan a text DataFrame column and retrieve a string that starts with a specific string and ends with a specific string. I tried to use substring with instr but couldn't get it working.
What you could do is use a regex and pattern matching to achieve this:
# def getText(startsWith: String, endsWith: String)(text: String): (String, String) = {
val rr = s"($startsWith(.+?)$endsWith)".r
text match {
case rr(all, partial) => (all, partial)
case _ => ("", "")
}
}
defined function getText
# getText("1", "2")("1hdfjhsdf2")
res5: (String, String) = ("1hdfjhsdf2", "hdfjhsdf")
this should do what you want
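Note that a Regex used in a match expression has to match the whole input, so getText only succeeds when the text itself begins and ends with the two markers. If the markers can sit in the middle of a longer value, a variant along these lines may be closer to what is needed. It is a sketch: SubstringBetween and between are invented names, and the markers are quoted with Pattern.quote on the assumption that they are literal strings rather than regexes.

```scala
import java.util.regex.Pattern

object SubstringBetween {
  // Finds the first, shortest span that starts with `startsWith` and ends with `endsWith`.
  // Returns (whole span, inner part), or None when no such span exists.
  def between(startsWith: String, endsWith: String)(text: String): Option[(String, String)] = {
    val rr = s"${Pattern.quote(startsWith)}(.+?)${Pattern.quote(endsWith)}".r
    rr.findFirstMatchIn(text).map(m => (m.matched, m.group(1)))
  }
}
```

For instance, between("1", "2")("xx1hdfjhsdf2yy") returns Some(("1hdfjhsdf2", "hdfjhsdf")), and markers such as "[" and "]" work because they are quoted.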

Addition of numbers recursively in Scala

In this Scala code I'm trying to analyze a string that contains a sum (such as 12+3+5) and return the result (20). I'm using regex to extract the first digit and parse the trail to be added recursively. My issue is that since the regex returns a String, I cannot add up the numbers. Any ideas?
object TestRecursive extends App {
val plus = """(\w*)\+(\w*)""".r
println(parse("12+3+5"))
def parse(str: String) : String = str match {
// sum
case plus(head, trail) => parse(head) + parse(trail)
case _ => str
}
}
You might want to use the parser combinators for an application like this.
"""(\w*)\+(\w*)""".r also matches "+" or "23+" or "4 +5" // but captures it only in the first group.
what you could do might be
scala> val numbers = "[+-]?\\d+"
numbers: String = [+-]?\d+
scala> numbers.r.findAllIn("1+2-3+42").map(_.toInt).reduce(_ + _)
res4: Int = 42
scala> numbers.r.findAllIn("12+3+5").map(_.toInt).reduce(_ + _)
res5: Int = 20
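If the recursive shape from the question is to be kept, one option is to have parse return an Int and convert each number as it is peeled off. The sketch below assumes the input contains only non-negative integers joined by "+"; anything else makes toInt throw. RecursiveSum is a name chosen for this example.

```scala
object RecursiveSum {
  // First number, a plus sign, then the rest of the expression.
  private val plus = """(\d+)\+(.+)""".r

  // Convert the head to Int and recurse on the tail; a bare number ends the recursion.
  def parse(str: String): Int = str match {
    case plus(head, tail) => head.toInt + parse(tail)
    case _                => str.toInt // assumption: what is left is a plain number
  }
}
```

RecursiveSum.parse("12+3+5") evaluates as 12 + parse("3+5"), then 3 + parse("5"), giving 20.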

Accessing Scala Parser regular expression match data

I'm wondering if it's possible to get the MatchData generated from the matching regular expression in the grammar below.
object DateParser extends JavaTokenParsers {
....
val dateLiteral = """(\d{4}[-/])?(\d\d[-/])?(\d\d)""".r ^^ {
... get MatchData
}
}
One option of course is to perform the match again inside the block, but since the RegexParser has already performed the match I'm hoping that it passes the MatchData to the block, or stores it?
Here is the implicit definition that converts your Regex into a Parser:
/** A parser that matches a regex string */
implicit def regex(r: Regex): Parser[String] = new Parser[String] {
def apply(in: Input) = {
val source = in.source
val offset = in.offset
val start = handleWhiteSpace(source, offset)
(r findPrefixMatchOf (source.subSequence(start, source.length))) match {
case Some(matched) =>
Success(source.subSequence(start, start + matched.end).toString,
in.drop(start + matched.end - offset))
case None =>
Failure("string matching regex `"+r+"' expected but `"+in.first+"' found", in.drop(start - offset))
}
}
}
Just adapt it:
object X extends RegexParsers {
/** A parser that matches a regex string and returns the Match */
def regexMatch(r: Regex): Parser[Regex.Match] = new Parser[Regex.Match] {
def apply(in: Input) = {
val source = in.source
val offset = in.offset
val start = handleWhiteSpace(source, offset)
(r findPrefixMatchOf (source.subSequence(start, source.length))) match {
case Some(matched) =>
Success(matched,
in.drop(start + matched.end - offset))
case None =>
Failure("string matching regex `"+r+"' expected but `"+in.first+"' found", in.drop(start - offset))
}
}
}
val t = regexMatch("""(\d\d)/(\d\d)/(\d\d\d\d)""".r) ^^ { case m => (m.group(1), m.group(2), m.group(3)) }
}
Example:
scala> X.parseAll(X.t, "23/03/1971")
res8: X.ParseResult[(String, String, String)] = [1.11] parsed: (23,03,1971)
No, you can't do this. If you look at the definition of the Parser used when you convert a regex to a Parser, it throws away all context and just returns the full matched string:
http://lampsvn.epfl.ch/trac/scala/browser/scala/tags/R_2_7_7_final/src/library/scala/util/parsing/combinator/RegexParsers.scala?view=markup#L55
You have a couple of other options, though:
break up your parser into several smaller parsers (for the tokens you actually want to extract)
define a custom parser that extracts the values you want and returns a domain object instead of a string
The first would look like
val separator = "-" | "/"
val year = ("""\d{4}"""r) <~ separator
val month = ("""\d\d"""r) <~ separator
val day = """\d\d"""r
val date = ((year?) ~ (month?) ~ day) map {
case year ~ month ~ day =>
(year.getOrElse("2009"), month.getOrElse("11"), day)
}
The <~ means "require these two tokens together, but only give me the result of the first one".
The ~ means "require these two tokens together and tie them together in a pattern-matchable ~ object".
The ? means that the parser is optional and will return an Option.
The .getOrElse bit provides a default value for when the parser didn't define a value.
When a Regex is used in a RegexParsers instance, the implicit def regex(Regex): Parser[String] in RegexParsers is used to apply that Regex to the input. The Match instance yielded upon successful application of the RE at the current input is used to construct a Success in the regex() method, but only its "end" value is used, so any captured sub-matches are discarded by the time that method returns.
As it stands (in the 2.7 source I looked at), you're out of luck, I believe.
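What is lost can be seen with the standard library alone. The sketch below calls findPrefixMatchOf, the same Regex method the regex() parser uses, on the date pattern from the question and reads the capture groups from the resulting Match. PrefixMatchDemo and groups are names made up for this illustration; it is not part of the parser combinator library.

```scala
import scala.util.matching.Regex

object PrefixMatchDemo {
  // The date pattern from the question: optional year, optional month, day.
  val dateLiteral: Regex = """(\d{4}[-/])?(\d\d[-/])?(\d\d)""".r

  // Returns every capture group of a prefix match; a group that did not
  // take part in the match is reported as None (Match.group returns null for it).
  def groups(input: String): Option[List[Option[String]]] =
    dateLiteral.findPrefixMatchOf(input).map { m =>
      (1 to m.groupCount).map(i => Option(m.group(i))).toList
    }
}
```

For "2009-11-23 rest" the groups are "2009-", "11-" and "23"; for "23" only the last group is present. This is the information a parser returning Regex.Match would pass on to the ^^ block.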
I ran into a similar issue using scala 2.8.1 and trying to parse input of the form "name:value" using the RegexParsers class:
package scalucene.query
import scala.util.matching.Regex
import scala.util.parsing.combinator._
object QueryParser extends RegexParsers {
override def skipWhitespace = false
private def quoted = regex(new Regex("\"[^\"]+"))
private def colon = regex(new Regex(":"))
private def word = regex(new Regex("\\w+"))
private def fielded = (regex(new Regex("[^:]+")) <~ colon) ~ word
private def term = (fielded | word | quoted)
def parseItem(str: String) = parse(term, str)
}
It seems that you can grab the matched groups after parsing like this:
QueryParser.parseItem("nameExample:valueExample") match {
case QueryParser.Success(result:scala.util.parsing.combinator.Parsers$$tilde, _) => {
println("Name: " + result.productElement(0) + " value: " + result.productElement(1))
}
}