Prepend the matched regex pattern to each split and map the splits to case classes - Scala

I want to split a string of the following form:
val str = "X|blnk_1|blnk_2|blnk_3|blnk_4|time1|time2|blnk_5|blnk_6|blnk_7|blnk_8| |Z01|Str1|01|001|NE]|[HEX1|HEX2]|[NA|001:1000|123:456|[00]|]|Z01|Str2|02|002|NE]|[HEX3|HEX4]|[NA|002:1001|234:456|[01]|]|Z02|02|z2|Str|Str|"
This string always starts with X, and the split positions are Z01, Z02, ..., Z0D:
str.split("""\|Z0[1|2|3|4|5|6|7|8|9|A|B|C|D]{1}\|""") foreach println
The markers Z01, ..., Z0D can appear in any order in the string.
The split gives this result:
X|blnk_1|blnk_2|blnk_3|blnk_4|time1|time2|blnk_5|blnk_6|blnk_7|blnk_8|
Str1|01|001|NE]|[HEX1|HEX2]|[NA|001:1000|123:456|[00]|]
Str2|02|002|NE]|[HEX3|HEX4]|[NA|002:1001|234:456|[01]|]
02|z2|Str|Str|
However, I want to map X, Z01, ... to case classes. Since the segments come in no fixed order, there is no way to tell which case class a given split should map to (the length of the individual splits can't be used either).
I expect my split to produce the following output:
X|blnk_1|blnk_2|blnk_3|blnk_4|time1|time2|blnk_5|blnk_6|blnk_7|blnk_8|
Z01|[Str1|01|001|NE]|[HEX1|HEX2]|[NA|001:1000|123:456|[00]|]|
Z01|[Str2|02|002|NE]|[HEX3|HEX4]|[NA|002:1001|234:456|[01]|]|
Z02|02|z2|Str|Str|
so that each result can be mapped to a case class with the help of the prepended pattern.
For example:
case class X( ....)
case class Z01(val1: String, val2: String, val3: String)
case class Z02(val1: Int, val2: String, val3: String,val4:String)
.................
X|blnk_1|blnk_2|blnk_3|blnk_4|time1|time2|blnk_5|blnk_6|blnk_7|blnk_8| maps to case class X
and
Z01|[Str1|01|001|NE]|[HEX1|HEX2]|[NA|001:1000|123:456|[00]|]| maps to case class Z01
In the end, the result should be ordered and grouped by type, so that each group can be taken as an array of the corresponding case class:
X
Array[Z01]
Array[Z02]
......
......

As an alternative, you could match the values instead of splitting:
(?:Z0[1-9A-D]|^X).*?(?=\|Z0[1-9A-D]|$)
This matches:
(?:Z0[1-9A-D]|^X) A non-capturing group that matches either Z0 followed by one character from the character class, or X at the start of the line (^)
.*? Match any character zero or more times, non-greedy
(?=\|Z0[1-9A-D]|$) Positive lookahead asserting that what follows is either a pipe |, Z0 and a character from the character class, or the end of the line $
For example:
val re = """(?:Z0[1-9A-D]|^X).*?(?=\|Z0[1-9A-D]|$)""".r
val str = "X|blnk_1|blnk_2|blnk_3|blnk_4|time1|time2|blnk_5|blnk_6|blnk_7|blnk_8| |Z01|Str1|01|001|NE]|[HEX1|HEX2]|[NA|001:1000|123:456|[00]|]|Z01|Str2|02|002|NE]|[HEX3|HEX4]|[NA|002:1001|234:456|[01]|]|Z02|02|z2|Str|Str|"
re.findAllIn(str) foreach println
That will result in:
X|blnk_1|blnk_2|blnk_3|blnk_4|time1|time2|blnk_5|blnk_6|blnk_7|blnk_8|
Z01|Str1|01|001|NE]|[HEX1|HEX2]|[NA|001:1000|123:456|[00]|]
Z01|Str2|02|002|NE]|[HEX3|HEX4]|[NA|002:1001|234:456|[01]|]
Z02|02|z2|Str|Str|
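Since every match now starts with its own tag, the matches can be grouped by that tag before building the case classes. The following is only a sketch: Segment, segmentRe and parseSegments are hypothetical names, the input is a shortened stand-in for the real string, and the real X, Z01, Z02 case classes would take the place of Segment.

```scala
// Hypothetical holder for one matched segment; real code would build X, Z01, Z02, ... instead.
final case class Segment(tag: String, body: String)

// Same pattern as in the answer above.
val segmentRe = """(?:Z0[1-9A-D]|^X).*?(?=\|Z0[1-9A-D]|$)""".r

def parseSegments(str: String): Map[String, List[Segment]] =
  segmentRe.findAllIn(str).toList
    .map { s =>
      val tag = s.takeWhile(_ != '|') // "X", "Z01", "Z02", ...
      Segment(tag, s.drop(tag.length).stripPrefix("|"))
    }
    .groupBy(_.tag) // one list per tag, in order of appearance

// Shortened, made-up input in the same shape as the original string.
val sample = "X|blnk_1|time1| |Z01|Str1|01|Z01|Str2|02|Z02|02|z2|Str|"
val grouped = parseSegments(sample)
```

From there, something like grouped.getOrElse("Z01", Nil) gives all Z01 segments, and each body can be parsed into the matching case class.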

How about this idea?
val x = str.split("""\|Z0[1-9A-D]\|""") // the actual string splits
val y = """\|Z0[1-9A-D]\|""".r.findAllIn(str).toArray // the delimiters, in order of appearance
// Drop the first split (the X segment), then zip each remaining split with its delimiter and concatenate them.
val final_data = x.tail.zip(y).map { case (split, delim) => delim + split }.toList
/*
List(|Z01|Str1|01|001|NE]|[HEX1|HEX2]|[NA|001:1000|123:456|[00]|], |Z01|Str2|02|002|NE]|[HEX3|HEX4]|[NA|002:1001|234:456|[01]|], |Z02|02|z2|Str|Str|) */
The leading | of each entry in final_data can then be removed, for example with stripPrefix("|").
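Putting the split, the delimiter list and the prefix removal together could look roughly like this. It is a sketch on a shortened, made-up input; delim, bodies, tags and tagged are illustrative names, and it assumes two delimiters never share a pipe (as in "|Z01|Z02|").

```scala
// Shortened, made-up input in the same shape as the original string.
val input = "X|a|b| |Z01|Str1|01|Z01|Str2|02|Z02|02|z2|"

val delim  = """\|Z0[1-9A-D]\|"""
val bodies = input.split(delim)               // text between the delimiters
val tags   = delim.r.findAllIn(input).toList  // "|Z01|", "|Z01|", "|Z02|"

// Skip the X segment, glue each delimiter back on, and drop its leading pipe.
val tagged = tags.zip(bodies.tail).map { case (t, b) => t.stripPrefix("|") + b }
```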

Related

Scala: split two words which aren't separated

I have a corpus with words like applefruit, which aren't separated by any separator, and I would like to separate them. Since this can be a non-linear problem, I would like to pass a custom dictionary and split only when a word from the dictionary is a substring of a word in the corpus.
If my dictionary has only apple and the corpus has 3 words, applefruit, applebananafruit and bananafruit, the output should look like: apple, fruit, apple, bananafruit, bananafruit.
Notice I am not splitting bananafruit; the goal is to make the process faster by splitting only on the text provided in the dictionary. I am using Scala 2.x.
You can use regular expressions with split:
scala> "foobarfoobazfoofoobatbat".split("(?<=foo)|(?=foo)")
res27: Array[String] = Array(foo, bar, foo, baz, foo, foo, batbat)
Or if your dictionary (and/or strings to split) has more than one word ...
val rx = wordList.map { w => s"(?<=$w)|(?=$w)" }.mkString("|")
val result: List[String] = toSplit.flatMap(_.split(rx))
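For the example from the question, the dictionary version could be run like this. The values of wordList and toSplit are taken from the question; wrapping each word in Pattern.quote is my own addition, on the assumption that dictionary entries might contain regex metacharacters.

```scala
import java.util.regex.Pattern

val wordList = List("apple")
val toSplit  = List("applefruit", "applebananafruit", "bananafruit")

// Split right after and right before each dictionary word.
val rx = wordList.map { w =>
  val q = Pattern.quote(w) // treat the word literally, not as a regex
  s"(?<=$q)|(?=$q)"
}.mkString("|")

val result: List[String] = toSplit.flatMap(_.split(rx))
```

On Java 8 or later, where a zero-width match at the start of the string does not produce a leading empty string, bananafruit stays in one piece and the other two words are cut around apple.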
You could do a regex find and replace on the following pattern:
(?=apple)|(?<=apple)
and then replace with comma surrounded by spaces on both sides. We could try:
val input = "bananaapplefruit"
val output = input.replaceAll("(?=apple)|(?<=apple)", " , ")
println(output) // banana , apple , fruit

Scala: Convert a string to string array with and without split given that all special characters except "(" an ")" are allowed

I have a string
val a = "((x1,x2),(y1,y2),(z1,z2))"
I want to parse this into a Scala array
val arr = Array(("x1","x2"),("y1","y2"),("z1","z2"))
Is there a way of doing this directly with an expr() equivalent?
If not, how would one do this using split?
Note: x1, x2, x3 etc. are strings and can contain special characters, so the key would be to use the () delimiters to parse the data.
Code I munged from Dici and Bogdan Vakulenko
val x2 = a.getString(1).trim.split("[()]").grouped(2).map(x => x(0).trim).toArray
val x3 = x2.drop(1) // the first group is always empty, because the input starts with "(("
val jmap = new java.util.HashMap[String, String]()
for (i <- x3) {
  val index = i.lastIndexOf(",") // split on the last comma, so the first value may contain commas
  val fv = i.slice(0, index)
  val lv = i.substring(index + 1).trim
  jmap.put(fv, lv)
}
This is still susceptible to "," in the second string.
Actually, I think regexes are the most convenient way to solve this.
val a = "((x1,x2),(y1,y2),(z1,z2))"
val regex = "(\\((\\w+),(\\w+)\\))".r
println(
  regex.findAllMatchIn(a)
    .map(matcher => (matcher.group(2), matcher.group(3)))
    .toList
)
Note that I made some assumptions about the format:
no whitespaces in the string (the regex could easily be updated to fix this if needed)
always tuples of two elements, never more
empty string not valid as a tuple element
only alphanumeric characters allowed (this also would be easy to fix)
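To illustrate how the first and last assumptions could be relaxed, here is a sketch of a looser pattern. It is not the regex from the answer: pairRe and the sample string are made up, and it assumes that elements contain neither parentheses nor commas, which the question does not fully guarantee.

```scala
// Made-up sample with whitespace and non-alphanumeric characters.
val a = "(( x 1 , x-2 ),(y.1,y2),(z1, z_2))"

// Each element is any run of characters other than '(', ')' and ',';
// whitespace around the elements is skipped.
val pairRe = """\(\s*([^(),]+?)\s*,\s*([^(),]+?)\s*\)""".r

val arr: Array[(String, String)] =
  pairRe.findAllMatchIn(a).map(m => (m.group(1), m.group(2))).toArray
```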
val a = "((x1,x2),(y1,y2),(z1,z2))"
a.replaceAll("[\\(\\) ]","")
  .split(",")
  .grouped(2) // non-overlapping pairs; sliding(2) would also yield (x2,y1) and (y2,z1)
  .map(x => (x(0), x(1)))
  .toArray

How to split and remove empty spaces before getting the result

I have this String:
val str = "9617 / 20634"
After the split I want to parse only the 2 values, 9617 and 20634, in order to calculate a percentage.
So instead of trimming after the split, can I do it before?
It is easier to remove spaces after the split than before. Unless you are expected to trim it before, here is a simpler way of doing it.
val Array(x, y) = "9617 / 20634".split("/").map(_.trim.toFloat)
val p = x / y * 100
Values are converted to Float to prevent integer division leading to 0.
The val Array(x,y) = ... statement is just another way of calling unapplySeq of the Array object.
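Because val Array(x, y) = ... throws a MatchError when the split does not yield exactly two parts, a more defensive variant can use the same Array pattern inside a match. This is only a sketch; percentage is a hypothetical helper name, and returning None for a zero denominator is my own choice.

```scala
import scala.util.Try

// Returns None for malformed input instead of throwing.
def percentage(s: String): Option[Float] =
  s.split("/").map(_.trim) match {
    case Array(a, b) => // same Array.unapplySeq pattern as above
      for {
        x <- Try(a.toFloat).toOption
        y <- Try(b.toFloat).toOption
        if y != 0f
      } yield x / y * 100
    case _ => None
  }
```

For "9617 / 20634" this yields a Some holding roughly 46.6, while inputs such as "9617" or "a / b" give None.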
For a more scalable solution, one might use regexps:
scala> val r = """(\d+) */ *(\d+)""".r
r: scala.util.matching.Regex = (\d+) */ *(\d+)
scala> "111 / 222" match {
case r(a, b) => println(s"Got $a, $b")
case _ =>
}
Got 111, 222
This is useful if you need different usage patterns.
Here you define a correspondence between the values in your input (111, 222) and the match groups ((\d+), (\d+)).

Splitting String using first entry

Say I have a list:
val l = List("and", "or", "up to")
I want to check if any of the entries in l is present in a string. If one is present, the string should be split using the first entry of l that is found. If not, the entire string is returned.
So for example, let's say our string is: "1.5 litres of milk or 2 apples" should return List("1.5 litres of milk", "2 apples").
On the other hand a string like "1.5 litres of milk along with 2 apples" should return List("1.5 litres of milk along with 2 apples").
Presuming (from your comment) that "caseLine" contains the text to be split, you can try:
l.find(caseLine.contains).map(caseLine.split)
.map(_.toList.map(_.trim))
.getOrElse(caseLine :: Nil)
Find returns the first element of l found in caseLine, as an Option, which is then mapped to produce caseLine split, then trimmed, and finally the getOrElse returns the original caseLine if find returned None. The results have all been converted to List[String] for consistent typing of the results.
Try this:
def splitOnFirstMatch(splitters: List[String], s: String): List[String] = {
  splitters.find(s.contains) match {
    case Some(x) => s.split(x).map(_.trim).toList
    case None => List(s)
  }
}
I first find the first string in splitters that is contained in s, then return s split on that word. If no splitter is found, I return a list containing just the input string.
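One thing to watch for: contains also finds "or" inside words like "for", and split treats its argument as a regex. If the splitters should only match whole words, a variant along these lines might help. It is a sketch, with splitOnWholeWord as a hypothetical name, and the whole-word requirement is my assumption, not something the question states.

```scala
import java.util.regex.Pattern

def splitOnWholeWord(splitters: List[String], s: String): List[String] =
  splitters
    .map(w => ("\\b" + Pattern.quote(w) + "\\b").r) // literal word with word boundaries
    .find(_.findFirstIn(s).isDefined)                // first splitter that occurs in s
    .map(re => re.split(s).map(_.trim).toList)
    .getOrElse(List(s))
```

With List("and", "or", "up to"), the string "1.5 litres of milk or 2 apples" is split into two parts, while "milk for 2 apples" is returned unchanged.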

Scala Sub string combinations with delimiter

My input string is
element1-element2-element3-element4a|element4b-element5-element6a|element6b
All the elements (sub strings) are separated by - and for some of the elements there will be alternatives separated by | (pipe).
A valid output string contains the elements separated by - (dash) only, with exactly one alternative chosen wherever elements are separated by |.
The list of all valid combinations of output strings has to be returned.
Output:
element1-element2-element3-element4a-element5-element6a
element1-element2-element3-element4b-element5-element6a
element1-element2-element3-element4a-element5-element6b
element1-element2-element3-element4b-element5-element6b
This can be done using a while loop and string functions, but that gets complex.
(I'm a traditional Java programmer.)
Can this be implemented more efficiently using Scala features?
Note: the input can contain any number of elements and pipes.
This seems to fit the bill.
def getCombinations(input: String) = {
  val group = """(\w|\|)+""".r // Match groups of letters and pipes
  val word = """\w+""".r // Match groups of letters in between pipes
  val groups = group.findAllIn(input).map(word.findAllIn(_).toVector).toList
  // Use fold to construct a 'tree' of vectors, appending each possible entry in a
  // pipe-separated group to each previous prefix. We're using vectors because
  // the append time is O(1) rather than O(n).
  val tree = groups match {
    case (x :: tail) => {
      val head = x.map(Vector(_)) // split each element in the head into its own node
      tail.foldLeft(head) { (acc, elems) =>
        for (elem <- elems; xs <- acc) yield (xs :+ elem)
      }
    }
    case _ => Nil // Handle the case of 0 inputs
  }
  tree.map(_.mkString("-")) // Combine each of our trees back into a dash-separated string
}
I haven't tested this with extensive input, but the runtime complexity shouldn't be too bad. Introducing an 'Or' pipe causes the output to grow, but that's due to the nature of the problem.
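The same fold can also be written with plain split instead of regexes. This is a shorter sketch of the idea, not a replacement for the function above; combinations is a hypothetical name, and it assumes elements never contain - or | themselves.

```scala
def combinations(input: String): List[String] =
  input.split("-").toList
    .map(_.split("\\|").toList)              // alternatives for each position
    .foldLeft(List(List.empty[String])) { (acc, alts) =>
      // extend every prefix built so far with every alternative at this position
      for (alt <- alts; prefix <- acc) yield prefix :+ alt
    }
    .map(_.mkString("-"))

val combos = combinations(
  "element1-element2-element3-element4a|element4b-element5-element6a|element6b")
```

For the input from the question this produces the four expected strings, and the output size is the product of the number of alternatives at each position.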