Functionally transform this String into a List of objects - scala

I have a String in this csv format :
//> lines : String = a1 , 2 , 10
//| a2 , 2 , 5
//| a3 , 8 , 4
//| a4 , 5 , 8
//| a5 , 7 , 5
//| a6 , 6 , 4
//| a8 , 4 , 9
I would like to convert this String into a List of objects, where each line of the String becomes one entry in the List. I can see how to do this imperatively:
split the String into lines, split each line into its csv tokens, then loop over the lines, create a new object for each one and add it to a List. But I'm trying to think about this functionally and I'm not sure where to start. Any pointers?

Let's assume you're starting with an iterator producing one String for each line. The Source class can do this if you're loading from a file, or you can use val lines = input.split("\n") if you're already starting with everything in a single String.
This also works with List, Seq, etc.; an Iterator isn't a prerequisite.
So you map over the input to parse each line:
val lines = input split "\n"
val output = lines map { line => parse(line) }
or (in point-free style)
val output = lines map parse
All you need is the parse method, and a type that lines should be parsed to. Case classes are a good bet here:
case class Line(id: String, num1: Int, num2: Int)
So, on to parse. I'm going to wrap the results in a Try so you can capture errors:
import scala.util.Try
def parse(line: String): Try[Line] = Try {
  // split on commas and trim whitespace from each token
  line.split(",").map(_.trim) match {
    // String.split produces an Array, so we pattern-match on an Array of 3 elems
    case Array(id, n1, n2) =>
      // if toInt fails it'll throw an Exception to be wrapped in the Try
      Line(id, n1.toInt, n2.toInt)
    // report the offending line itself (an Array has no useful toString)
    case _ => throw new RuntimeException("Invalid line: " + line)
  }
}
Put it all together and you end up with output being a CC[Try[Line]], where CC is the collection-type of lines (e.g. Iterator, Seq, etc.)
You can then isolate the errors:
val (goodLines, badLines) = output.partition(_.isSuccess)
Or if you simply want to strip out the intermediate Trys and discard the errors:
// requires: import scala.util.Success
val goodLines: Seq[Line] = output.collect { case Success(line) => line }.toSeq
ALL TOGETHER
import scala.util.{Try, Success}
case class Line(id: String, num1: Int, num2: Int)
def parse(line: String): Try[Line] = Try {
  // trim each token, since the input has spaces around the commas
  line.split(",").map(_.trim) match {
    case Array(id, n1, n2) => Line(id, n1.toInt, n2.toInt)
    case _ => throw new RuntimeException("Invalid line: " + line)
  }
}
val lines = input split "\n"
val output = lines map parse
val (successes, failures) = output.partition(_.isSuccess)
val goodLines = successes collect { case Success(line) => line }
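As a rough end-to-end check of the approach above, here is a minimal self-contained sketch. The sample `input` (including its two deliberately bad lines) and the wrapping `CsvLines` object are assumptions added for illustration; instead of discarding failures it keeps their messages.

```scala
import scala.util.{Try, Success, Failure}

object CsvLines {
  final case class Line(id: String, num1: Int, num2: Int)

  // Parse one csv line; any exception (bad arity, bad number) ends up in the Try.
  def parse(line: String): Try[Line] = Try {
    line.split(",").map(_.trim) match {
      case Array(id, n1, n2) => Line(id, n1.toInt, n2.toInt)
      case _ => throw new IllegalArgumentException(s"Invalid line: $line")
    }
  }

  // Hypothetical sample data: two good lines, one with the wrong arity,
  // and one with a non-numeric field.
  val input: String = "a1 , 2 , 10\na2 , 2 , 5\nbroken line\na3 , x , 4"

  val parsed: List[Try[Line]] = input.split("\n").toList.map(parse)

  // Keep the successfully parsed lines...
  val goodLines: List[Line] = parsed.collect { case Success(l) => l }
  // ...and the error messages of the ones that failed.
  val errors: List[String] = parsed.collect { case Failure(e) => e.getMessage }
}
```

Collecting on `Success` and `Failure` separately avoids having to unwrap the two halves returned by `partition`.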

Not sure if this is the exact output you want since there wasn't a sample output provided. Should be able to get what you want from this though.
scala> val lines: String = """a1,2,10
| a2,2,5
| a3,8,4
| a4,5,8
| a5,7,5
| a6,6,4
| a8,4,9"""
lines: String =
a1,2,10
a2,2,5
a3,8,4
a4,5,8
a5,7,5
a6,6,4
a8,4,9
scala> case class Line(s: String, s2: String, s3: String)
defined class Line
scala> lines.split("\n").map(line => line.split(",")).map(split => Line(split(0), split(1), split(2)))
res0: Array[Line] = Array(Line(a1,2,10), Line(a2,2,5), Line(a3,8,4), Line(a4,5,8), Line(a5,7,5), Line(a6,6,4), Line(a8,4,9))

Related

Customised string in spark scala

I have a string like "debug#compile". Now, my end goal is to convert first letter of each word to uppercase. So, at last I should get "Debug#Compile" where 'D' and 'C' are converted to uppercase.
My logic:
1) I have to split the string on its delimiters, which are special characters, so I have to check for them every time.
2) After that I would convert each word's first letter to upper case and then, using map, join the words again.
I am trying my best but am not able to design the code for this. Can anyone help me with this? Even hints would help!
Below is my code:
object WordCase {
def main(args: Array[String]) {
val s="siddhesh#kalgaonkar"
var b=""
val delimeters= Array("#","_")
if(delimeters(0)=="#")
{
b=s.split(delimeters(0).toString).map(_.capitalize).mkString(delimeters(0).toString())
}
else if(delimeters(0)=="_")
{
b=s.split(delimeters(0).toString).map(_.capitalize).mkString(delimeters(0).toString())
}
else{
println("Non-Standard String")
}
println(b)
}
}
My code splits on a fixed delimiter, capitalizes the first letter of every word, and then merges the words again. For the first case, i.e. "#", it capitalizes the first letter of every word, but it fails for the second case, i.e. "_". Am I making a silly mistake in the branching?
scala> val s="siddhesh#kalgaonkar"
scala> val specialChar = (s.split("[a-zA-Z0-9]") filterNot Seq("").contains).mkString
scala> s.replaceAll("[^a-zA-Z]+"," ").split(" ").map(_.capitalize).mkString(",").replaceAll(",",specialChar)
res41: String = Siddhesh#Kalgaonkar
You can handle other special characters in the same way:
scala> val s="siddhesh_kalgaonkar"
s: String = siddhesh_kalgaonkar
scala> val specialChar = (s.split("[a-zA-Z0-9]") filterNot Seq("").contains).mkString
specialChar: String = _
scala> s.replaceAll("[^a-zA-Z]+"," ").split(" ").map(_.capitalize).mkString(",").replaceAll(",",specialChar)
res42: String = Siddhesh_Kalgaonkar
I solved it the easy way:
object WordCase {
def main(args: Array[String]) {
val s = "siddhesh_kalgaonkar"
var b = s.replaceAll("[^a-zA-Z]+", " ").split(" ").map(_.capitalize).mkString(" ") //Replacing delimiters with space and capitalizing first letter of each word
val c=b.indexOf(" ") //Getting index of space
val d=s.charAt(c).toString // Getting delimiter character from the original string
val output=b.replace(" ", d) //Now replacing space with the delimiter character in the modified string i.e 'b'
println(output)
}
}
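Both approaches above recover the delimiter from the input and put it back afterwards, which only works when the string contains a single kind of delimiter. A possible alternative, sketched here under the assumption that a "word" is any run of ASCII letters and digits, is to rewrite the words in place and leave everything else untouched. The object and method names are made up for this example.

```scala
object CapitalizeWords {
  // Assumption: a word is a run of ASCII letters/digits; everything else is a delimiter.
  private val Word = "[A-Za-z0-9]+".r

  // Replace each word by its capitalized form; delimiters are never touched,
  // so any mix of them is preserved as-is.
  def capitalizeWords(s: String): String =
    Word.replaceAllIn(s, m => m.matched.capitalize)
}
```

For example, `CapitalizeWords.capitalizeWords("debug#compile")` gives `Debug#Compile`. Because the matched words contain only letters and digits, the replacement text never needs escaping for `$` or `\`.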

Iterate and trim string based on condition in spark Scala

I have dataframe 'regexDf' like below
id,regex
1,(.*)text1(.*)text2(.*)text3(.*)text4(.*)|(.*)text2(.*)text5(.*)text6(.*)
2,(.*)text1(.*)text5(.*)text6(.*)|(.*)text2(.*)
If the length of the regex exceeds some maximum, for example 50, then I want to remove the last text token from each '|'-separated part of the regex string for that id. In the above dataframe, the regex for id 1 is longer than 50, so the last tokens 'text4(.*)' and 'text6(.*)' should be removed from their respective parts. After that the regex for id 1 is still longer than 50, so the last tokens 'text3(.*)' and 'text5(.*)' should be removed as well. The final dataframe will be
id,regex
1,(.*)text1(.*)text2(.*)|(.*)text2(.*)
2,(.*)text1(.*)text5(.*)text6(.*)|(.*)text2(.*)
I am able to trim the last tokens using the following code
val reducedStr = regex.split("\\|").foldLeft(List[String]()) { // "|" must be escaped: split takes a regex
(regexStr,eachRegex) => {
regexStr :+ eachRegex.replaceAll("\\(\\.\\*\\)\\w+\\(\\.\\*\\)$", "\\(\\.\\*\\)")
}
}.mkString("|")
I tried using a while loop to check the length and trim the text tokens on each iteration, but it is not working. I also want to avoid var and while loops. Is it possible to achieve this without a while loop?
val optimizeRegexString = udf((regex: String) => {
if(regex.length >= 50) {
var len = regex.length;
var resultStr: String = ""
while(len >= maxLength) {
val reducedStr = regex.split("|").foldLeft(List[String]()) {
(regexStr,eachRegex) => {
regexStr :+ eachRegex
.replaceAll("\\(\\.\\*\\)\\w+\\(\\.\\*\\)$", "\\(\\.\\*\\)")
}
}.mkString("|")
len = reducedStr.length
resultStr = reducedStr
}
resultStr
} else {
regex
}
})
regexDf.withColumn("optimizedRegex", optimizeRegexString(col("regex")))
Following SathiyanS's and Pasha's suggestions, I changed the recursive method into a function value.
def optimizeRegex(regexDf: DataFrame): DataFrame = {
val shrinkString= (s: String) => {
if (s.length > 50) {
val extractedString: String = shrinkString(s.split("\\|")
.map(s => s.substring(0, s.lastIndexOf("text"))).mkString("|"))
extractedString
}
else s
}
def shrinkUdf = udf((regex: String) => shrinkString(regex))
regexDf.withColumn("regexString", shrinkUdf(col("regex")))
}
Now I am getting the compile error "recursive value shrinkString needs type":
Error:(145, 39) recursive value shrinkString needs type
val extractedString: String = shrinkString(s.split("\\|")
.map(s => s.substring(0, s.lastIndexOf("text"))).mkString("|"));
Recursion:
def shrink(s: String): String = {
if (s.length > 50)
shrink(s.split("\\|").map(s => s.substring(0, s.lastIndexOf("text"))).mkString("|"))
else s
}
It looks like the problem is in how the function is called, so here is some additional information.
It can be called as a static function (a method on an object):
object ShrinkContainer {
def shrink(s: String): String = {
if (s.length > 50)
shrink(s.split("\\|").map(s => s.substring(0, s.lastIndexOf("text"))).mkString("|"))
else s
}
}
Link with dataframe:
def shrinkUdf = udf((regex: String) => ShrinkContainer.shrink(regex))
df.withColumn("regex", shrinkUdf(col("regex"))).show(truncate = false)
Drawbacks: this is only a basic example of the approach. Edge cases (for example a regex that does not contain "text", or one with very many parts separated by "|", say 100) have to be handled by the author of the question, to avoid exceptions or runaway recursion.
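One way those edge cases might be handled is sketched below. It is an illustration rather than a drop-in replacement: the `SafeShrink` object, the `dropLastToken` helper and the `MaxLength` constant are names invented here, and it keeps the assumption that every token starts with the literal "text".

```scala
import scala.annotation.tailrec

object SafeShrink {
  val MaxLength = 50 // limit taken from the question

  // Cut a part just before its last "text" token; leave it unchanged if there is none
  // (substring(0, -1) would otherwise throw).
  def dropLastToken(part: String): String = {
    val i = part.lastIndexOf("text")
    if (i >= 0) part.substring(0, i) else part
  }

  // Shrink until the string fits, or until shrinking no longer changes anything.
  @tailrec
  def shrink(s: String): String =
    if (s.length <= MaxLength) s
    else {
      val next = s.split("\\|").map(dropLastToken).mkString("|")
      if (next == s) s else shrink(next)
    }
}
```

The `next == s` check is what guarantees termination: an input that is too long but cannot be shortened any further is returned as it is.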
This is how I would do it.
First, a function for removing the last token from a regex:
def deleteLastToken(s: String): String =
s.replaceFirst("""[^)]+\(\.\*\)$""", "")
Then, a function that shortens the entire regex string by deleting the last token from all the |-separated fields:
def shorten(r: String) = {
val items = r.split("[|]").toSeq
val shortenedItems = items.map(deleteLastToken)
shortenedItems.mkString("|")
}
Then, for a given input regex string, create the stream of all the shortened strings you get by applying the shorten function repeatedly. This is an infinite stream, but it's lazily evaluated, so only as few elements as required will be actually computed:
val regex = "(.*)text1(.*)text2(.*)text3(.*)text4(.*)|(.*)text2(.*)text5(.*)text6(.*)"
val allShortened = Stream.iterate(regex)(shorten)
Finally, you can treat allShortened as any other sequence. For solving our problem, you can drop all elements while they don't satisfy the length requirement, and then keep only the first one of the remaining ones:
val result = allShortened.dropWhile(_.length > 50).head
You can see all the intermediate values by printing some elements of allShortened:
allShortened.take(10).foreach(println)
// Prints:
// (.*)text1(.*)text2(.*)text3(.*)text4(.*)|(.*)text2(.*)text5(.*)text6(.*)
// (.*)text1(.*)text2(.*)text3(.*)|(.*)text2(.*)text5(.*)
// (.*)text1(.*)text2(.*)|(.*)text2(.*)
// (.*)text1(.*)|(.*)
// (.*)|(.*)
// (.*)|(.*)
// (.*)|(.*)
// (.*)|(.*)
// (.*)|(.*)
// (.*)|(.*)
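As the printed values show, the sequence eventually stops changing, so `dropWhile(_.length > 50)` would never finish if the limit were smaller than that final value. The sketch below is one possible variant that also stops at such a fixed point. It uses `Iterator.iterate` instead of `Stream` (which is deprecated since Scala 2.13 in favour of `LazyList`); the `ShortenWithIterator` object and `shortenTo` method are names assumed for this example.

```scala
object ShortenWithIterator {
  def deleteLastToken(s: String): String =
    s.replaceFirst("""[^)]+\(\.\*\)$""", "")

  def shorten(r: String): String =
    r.split("[|]").map(deleteLastToken).mkString("|")

  // Walk the lazily generated sequence r, shorten(r), shorten(shorten(r)), ...
  // pairwise, and return the first value that is short enough or that can no
  // longer be shortened (current == next).
  def shortenTo(maxLength: Int)(regex: String): String =
    Iterator
      .iterate(regex)(shorten)
      .sliding(2)
      .collectFirst {
        case Seq(current, next) if current.length <= maxLength || current == next => current
      }
      .get // safe: one of the two conditions is always reached eventually
}
```

With the question's limit, `ShortenWithIterator.shortenTo(50)(regex)` gives the same result as the `dropWhile` version; with an unreachable limit it returns the shortest form it can produce.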
Just to add to @pasha701's answer, here is a solution that works in Spark.
val df = sc.parallelize(Seq((1,"(.*)text1(.*)text2(.*)text3(.*)text4(.*)|(.*)text2(.*)text5(.*)text6(.*)"),(2,"(.*)text1(.*)text5(.*)text6(.*)|(.*)text2(.*)"))).toDF("ID", "regex")
df.show(truncate = false)
//prints
+---+------------------------------------------------------------------------+
|ID |regex                                                                   |
+---+------------------------------------------------------------------------+
|1  |(.*)text1(.*)text2(.*)text3(.*)text4(.*)|(.*)text2(.*)text5(.*)text6(.*)|
|2  |(.*)text1(.*)text5(.*)text6(.*)|(.*)text2(.*)                           |
+---+------------------------------------------------------------------------+
Now you can use @pasha701's shrink function in a udf:
val shrink: String => String = (s: String) => if (s.length > 50) shrink(s.split("\\|").map(s => s.substring(0,s.lastIndexOf("text"))).mkString("|")) else s
def shrinkUdf = udf((regex: String) => shrink(regex))
df.withColumn("regex", shrinkUdf(col("regex"))).show(truncate = false)
//prints
+---+---------------------------------------------+
|ID |regex                                        |
+---+---------------------------------------------+
|1  |(.*)text1(.*)text2(.*)|(.*)text2(.*)         |
|2  |(.*)text1(.*)text5(.*)text6(.*)|(.*)text2(.*)|
+---+---------------------------------------------+

Inverting the case of characters within a defined region [Scala]

I'm trying to write a function that inverts the case of any alphabetic chars within a defined region (from the cursor position to the marker position) however I'm struggling.
I have a feeling that something very similar to this would work, but I can't get my head around it.
def invertCase() {
this.getString.map(c => if(c.isLower) c.toUpper else c.toLower)
}
I need to invert the case of the alphabetic characters within a defined region, which (as far as I'm aware) I am doing by calling this.getString (getString gets the buffer and converts it to a string).
So by doing this.getString I believe I am selecting the region which needs to have its alphabetic characters inverted, yet the code following it doesn't do what I want it to.
Any pointers?
Thank you!
EDIT: the buffer is of type StringBuilder if that changes anything
You can use splitAt and map to invert part of the string as follows:
def invertBetween(start: Int, end: Int, str: String) = {
val (a, bc) = str.splitAt(start)
val (b, c) = bc.splitAt(end - start)
a + b.map(c => if (c.isUpper) c.toLower else c.toUpper) + c
}
Example:
invertBetween(3, 10, "Hello, World, Foo, BAR, baz")
res10: String = HelLO, wORld, Foo, BAR, baz
                   ^^^^^^^
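Since the question's buffer is a StringBuilder, the same idea can also be applied in place rather than by building a new String. This is only a sketch: the `InvertRegion` object and the `invertCase(buffer, start, end)` signature are assumptions (the original `invertCase()` takes no arguments and reads the cursor and marker from the surrounding class), and `end` is treated as exclusive, as in `invertBetween`.

```scala
object InvertRegion {
  // Invert the case of every letter in buffer between the two positions.
  // The positions may be given in either order and are clamped to the buffer bounds.
  def invertCase(buffer: StringBuilder, start: Int, end: Int): Unit = {
    val lo = math.max(0, math.min(start, end))
    val hi = math.min(buffer.length, math.max(start, end))
    for (i <- lo until hi) {
      val c = buffer(i)
      if (c.isLetter) buffer(i) = if (c.isUpper) c.toLower else c.toUpper
    }
  }
}
```

Note that `map` on a String returns a new String and leaves the original untouched, which is one likely reason the attempt in the question appears to do nothing: its result is discarded.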
Here is a solution using collect with zipWithIndex
scala> val start = 4
start: Int = 4
scala> val end = 9
end: Int = 9
scala> val result = ("Your own String value.").zipWithIndex.collect {
case e if(e._2 >= start && e._2 <= end) => if (e._1.isLower) e._1.toUpper else e._1.toLower
case e => e._1
}.mkString("")
result: String = Your OWN string value.

Addition of numbers recursively in Scala

In this Scala code I'm trying to analyze a string that contains a sum (such as 12+3+5) and return the result (20). I'm using regex to extract the first digit and parse the trail to be added recursively. My issue is that since the regex returns a String, I cannot add up the numbers. Any ideas?
object TestRecursive extends App {
val plus = """(\w*)\+(\w*)""".r
println(parse("12+3+5"))
def parse(str: String) : String = str match {
// sum
case plus(head, trail) => parse(head) + parse(trail)
case _ => str
}
}
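You might want to use parser combinators for an application like this.
Note that """(\w*)\+(\w*)""".r also accepts degenerate input such as "+" or "23+" (where at most the first group captures anything), because \w* can match the empty string; searched for inside a longer string it would also find the "+5" in "4 +5".
What you could do instead is: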
scala> val numbers = "[+-]?\\d+"
numbers: String = [+-]?\d+
scala> numbers.r.findAllIn("1+2-3+42").map(_.toInt).reduce(_ + _)
res4: Int = 42
scala> numbers.r.findAllIn("12+3+5").map(_.toInt).reduce(_ + _)
res5: Int = 20
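If the recursive structure from the question is to be kept, one option is to let `parse` return an Int and convert at the leaves. The following is a minimal sketch that assumes the input consists only of non-negative integers separated by "+"; the `RecursiveSum` object and the changed pattern are illustrative, not taken from the question.

```scala
object RecursiveSum {
  // First group: the leading number. Second group: everything after the first "+".
  private val Plus = """(\d+)\+(.*)""".r

  def parse(str: String): Int = str match {
    case Plus(head, tail) => head.toInt + parse(tail) // add the head to the sum of the rest
    case _                => str.toInt                // base case: a single number
  }
}
```

Here `RecursiveSum.parse("12+3+5")` evaluates as 12 + (3 + 5). Anything that is not a number in the base case makes `toInt` throw a NumberFormatException, so malformed input fails loudly instead of being concatenated.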

In Scala, how to find an element in CSV by a pair of key values?

For example, from the following file:
Name,Surname,E-mail
John,Smith,john.smith#hotmail.com
Nancy,Smith,nancy.smith#gmail.com
Jane,Doe,jane.doe#aol.com
John,Doe,john.doe#yahoo.com
how do I get e-mail address of John Doe?
I currently use the following code, but it can only filter on one key field:
val src = Source.fromFile(file)
val iter = src.getLines().drop(1).map(_.split(","))
var quote = ""
iter.find( _(1) == "Doe" ) foreach (a => println(a(2)))
src.close()
I've tried writing "iter.find( _(0) == "John" && _(1) == "Doe" )", but this raises an error saying that only one parameter is expected (enclosing the condition into extra pair of parentheses does not help).
The underscore as a placeholder for a parameter to a lambda doesn't work the way that you think.
a => println(a)
// is equivalent to
println(_)
(a,b) => a + b
// is equivalent to
_ + _
a => a + a
// is not equivalent to
_ + _
That is, the first underscore means the first parameter and the second one means the second parameter and so on. So that's the reason for the error that you're seeing -- you're using two underscores but have only one parameter. The fix is to use the explicit version:
iter.find( a=> a(0) == "John" && a(1) == "Doe" )
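A small self-contained illustration of that rule, using the rows from the question as an in-memory list (the `PlaceholderDemo` object and its value names are assumptions made for this sketch):

```scala
object PlaceholderDemo {
  // Rows from the question's file, header already dropped.
  val rows: List[Array[String]] = List(
    "John,Smith,john.smith#hotmail.com",
    "Nancy,Smith,nancy.smith#gmail.com",
    "Jane,Doe,jane.doe#aol.com",
    "John,Doe,john.doe#yahoo.com"
  ).map(_.split(","))

  // Two underscores stand for two different parameters.
  val add: (Int, Int) => Int = _ + _

  // Using one parameter twice requires naming it.
  val twice: Int => Int = a => a + a

  // The same applies to the predicate: name the row, then test both fields.
  val johnDoe: Option[String] =
    rows.find(a => a(0) == "John" && a(1) == "Doe").map(_(2))

  // An alternative that avoids indexing altogether: match on the array's shape.
  val johnDoeByPattern: Option[String] =
    rows.collectFirst { case Array("John", "Doe", email) => email }
}
```

Both `johnDoe` and `johnDoeByPattern` yield an Option, so a missing person shows up as `None` rather than as an exception.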
You can use Regex:
scala> def getRegex(v1: String, v2: String) = (v1 + "," + v2 +",(\\S+)").r
getRegex: (v1: String,v2: String)scala.util.matching.Regex
scala> val src = """John,Smith,john.smith#hotmail.com
| Nancy,Smith,nancy.smith#gmail.com
| Jane,Doe,jane.doe#aol.com
| John,Doe,john.doe#yahoo.com
| """
src: java.lang.String =
John,Smith,john.smith#hotmail.com
Nancy,Smith,nancy.smith#gmail.com
Jane,Doe,jane.doe#aol.com
John,Doe,john.doe#yahoo.com
scala> val MAIL = getRegex("John","Doe")
MAIL: scala.util.matching.Regex = John,Doe,(\S+)
scala> val itr = src.lines
itr: Iterator[String] = non-empty iterator
scala> for(MAIL(address) <- itr) println(address)
john.doe#yahoo.com
scala>
You could also do a pattern match on the result of split in a for comprehension.
val firstName = "John"
val surName = "Doe"
val emails = for {
Array(`firstName`, `surName`, email) <-
src.getLines().drop(1) map { _ split ',' }
} yield { email }
println(emails.mkString(","))
Note the backticks in the pattern: they mean we match on the value of firstName instead of introducing a new variable that matches anything and shadows the val firstName.
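The difference can be seen side by side in the following sketch, which uses an in-memory list instead of a file; the `BacktickDemo` object and the names `shadowed` and `matched` are made up for the example.

```scala
object BacktickDemo {
  val firstName = "John"

  val rows: List[Array[String]] = List(
    "Jane,Doe,jane.doe#aol.com",
    "John,Doe,john.doe#yahoo.com"
  ).map(_.split(","))

  // Without backticks, firstName is a fresh pattern variable: it binds to
  // whatever is in that position, so both rows match.
  val shadowed: List[String] =
    rows.collect { case Array(firstName, "Doe", email) => email }

  // With backticks, the pattern compares against the existing val,
  // so only John Doe's row matches.
  val matched: List[String] =
    rows.collect { case Array(`firstName`, "Doe", email) => email }
}
```

An identifier that starts with an upper-case letter (for example `val FirstName`) is treated as a stable identifier in patterns too, which is another common way to get the same effect without backticks.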