Scala: split two words that aren't separated - scala

I have a corpus with words like applefruit that are not separated by any delimiter, and I would like to split them. As this can be a non-linear problem, I would like to pass a custom dictionary and split only when a word from the dictionary is a substring of a word in the corpus.
For example, if my dictionary contains only apple and the corpus contains the 3 words applefruit, applebananafruit, bananafruit, the output should look like apple, fruit, apple, bananafruit, bananafruit.
Notice that I am not splitting bananafruit; the goal is to make the process faster by splitting only on the text provided in the dictionary. I am using Scala 2.x.

You can use regular expressions with split:
scala> "foobarfoobazfoofoobatbat".split("(?<=foo)|(?=foo)")
res27: Array[String] = Array(foo, bar, foo, baz, foo, foo, batbat)
Or if your dictionary (and/or strings to split) has more than one word ...
val rx = wordList.map { w => s"(?<=$w)|(?=$w)" }.mkString("|")
val result: List[String] = toSplit.flatMap(_.split(rx))
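As a minimal self-contained sketch of the multi-word variant applied to the question's data: the values given to wordList and toSplit here are only the example from the question, and wrapping each word in Pattern.quote is an added precaution (not part of the answer above) in case a dictionary word contains regex metacharacters.

```scala
import java.util.regex.Pattern

// Example data taken from the question; replace with your own dictionary and corpus.
val wordList = List("apple")
val toSplit  = List("applefruit", "applebananafruit", "bananafruit")

// For every dictionary word, split right after it (lookbehind) or right before it (lookahead).
// Pattern.quote makes the word match literally even if it contains characters like '.' or '+'.
val rx = wordList.map { w =>
  val q = Pattern.quote(w)
  s"(?<=$q)|(?=$q)"
}.mkString("|")

val result: List[String] = toSplit.flatMap(_.split(rx))
// List(apple, fruit, apple, bananafruit, bananafruit)
```

Because both alternatives are zero-width assertions, nothing is consumed by the split, so the dictionary words themselves are kept in the output.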

You could do a regex find and replace on the following pattern:
(?=apple)|(?<=apple)
and then replace with comma surrounded by spaces on both sides. We could try:
val input = "bananaapplefruit"
val output = input.replaceAll("(?=apple)|(?<=apple)", " , ")
println(output) // banana , apple , fruit

Related

How can I split a string into a list using Scala

I have a string str = "one,two,(three,four), five"
I want to split this string into a list, like this:
List[String] = ("one", "two", "(three, four)", "five")
I have no idea how to do this.
Thanks.
We can try matching on the pattern \(.*?\)|[^, ]+:
val str = "one,two,(three,four), five"
val re = """\(.*?\)|[^, ]+""".r
for(m <- re.findAllIn(str)) println(m)
This prints:
one
two
(three,four)
five
This regex pattern eagerly first tries to find a (...) term. That failing, it matches any content other than comma or space, to consume one CSV term at a time. This trick avoids the problem of matching across commas inside (...).
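Since the question asks for a List[String] rather than printed lines, the same matcher can be collected with toList. This is a small sketch reusing the pattern above; the name parts is just illustrative.

```scala
val str = "one,two,(three,four), five"

// First alternative: a parenthesised group, matched lazily up to the first ')'.
// Second alternative: a run of characters that are neither commas nor spaces.
val re = """\(.*?\)|[^, ]+""".r

val parts: List[String] = re.findAllIn(str).toList
// List(one, two, (three,four), five)
```

Note that this does not handle nested parentheses, because the lazy match stops at the first closing parenthesis.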

Concatenate elements on a list separated by a character [duplicate]

How do I "join" an iterable of strings by another string in Scala?
val thestrings = Array("a","b","c")
val joined = ???
println(joined)
I want this code to output a,b,c (join the elements by ",").
How about mkString ?
thestrings.mkString(",")
A variant exists in which you can specify a prefix and suffix too.
The same result can be implemented with foldLeft, which is much more verbose, but perhaps worth looking at for education's sake.
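A short sketch of the three options mentioned in this answer, side by side. The foldLeft version is only one possible way to write it and is shown for comparison, not as a recommendation; the names bracketed and folded are illustrative.

```scala
val thestrings = Array("a", "b", "c")

// Plain join with a separator.
val joined = thestrings.mkString(",")                // a,b,c

// Variant with a prefix and a suffix: mkString(start, sep, end).
val bracketed = thestrings.mkString("[", ",", "]")   // [a,b,c]

// Hand-rolled equivalent with foldLeft: start from the first element (or "" if empty)
// and append ",<element>" for each remaining one.
val folded = thestrings.drop(1).foldLeft(thestrings.headOption.getOrElse("")) {
  (acc, s) => acc + "," + s
}                                                     // a,b,c
```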

Map letter to word in which it appears

I am a newbie to Scala and I am trying to write a function that takes an input string and returns a map of letters to the words in which they appear.
For example, given the input string "this is demo",
I would like the output map ['t'->["this"],'h'->["this"],'i'->["this","is"]... and so on.
I can write this code in traditional way, but how can I write this code in a functional way by using scala constructs like map, groupby, flatmap etc?
"this is demo"
.split(" ")
.flatMap(w => w.map(c => c -> w))
.groupMap(_._1)(_._2)
// HashMap(e -> Array(demo), s -> Array(this, is), t -> Array(this), m -> Array(demo), i -> Array(this, is), h -> Array(this), o -> Array(demo), d -> Array(demo))
The first step is to get an array of tuples recording, for each character, which word it comes from. This can be achieved by splitting the sentence into words and, for each character of each word, producing a tuple of the character and its word (.map(w => w.map(c => c -> w))). Since this gives us an array of arrays, we use flatMap instead to flatten them into a one-level array of tuples (producing Array((t,this), (h,this), (i,this), ...)).
The second step is to group these tuples by character and map the grouped values to the associated words. This can be achieved with groupMap: it groups the tuples by their first part (the character) and maps the grouped tuples to their second part (the word). If you're using an earlier Scala version (before 2.13), you'll have to replace groupMap with a combination of groupBy and mapValues: .groupBy(_._1).mapValues(_.map(_._2)).
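For Scala versions without groupMap, the two steps can be sketched as below. This variant uses groupBy followed by a plain map over the entries instead of mapValues, which is an assumption made here so that the same snippet compiles on both 2.12 and 2.13; the name letterToWords is illustrative.

```scala
val letterToWords: Map[Char, List[String]] =
  "this is demo"
    .split(" ")
    .toList
    // Step 1: one (character, word) tuple per character of each word.
    .flatMap(w => w.map(c => c -> w))
    // Step 2: group by character, then keep only the word from each tuple.
    .groupBy(_._1)
    .map { case (c, pairs) => c -> pairs.map(_._2) }

// letterToWords('i') is List(this, is)
```

A word that contains the same letter twice would be listed twice for that letter; adding .distinct to the grouped words would remove such repeats if that matters.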
Here is another solution. You can replace the rough tokenizer I suggest with a proper parser, such as Stanford CoreNLP Simple.
def tokenizeText(s: String): Array[String] = {
  s.toLowerCase().split("[\\W]+")
}
val text:String = "Here you have your text. Set several sentences if you like."
val words = tokenizeText(text)
val letters = words.mkString("").toSet.mkString("").split("")
val twoDArray:Array[Array[String]] = letters.map(l => words.filter(w => w.contains(l)))
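The snippet above yields an array of word arrays without the letters attached. One way to get back to the map the question asks for, sketched here with the same rough tokenizer, is to pair every distinct character with the words that contain it. The call to distinct on the words and the name letterToWords are additions made for this sketch.

```scala
def tokenizeText(s: String): Array[String] =
  s.toLowerCase.split("[\\W]+")

val text = "Here you have your text. Set several sentences if you like."

// Distinct words, so that a repeated word such as "you" is listed only once.
val words = tokenizeText(text).distinct

// For each distinct character in the text, keep the words that contain it.
val letterToWords: Map[Char, Array[String]] =
  words.mkString.distinct
    .map(c => c -> words.filter(_.contains(c.toString)))
    .toMap
```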

Prepend the matched regex pattern to each split and map the splits to case classes

I want to split the following string, which takes the form
val str = "X|blnk_1|blnk_2|blnk_3|blnk_4|time1|time2|blnk_5|blnk_6|blnk_7|blnk_8| |Z01|Str1|01|001|NE]|[HEX1|HEX2]|[NA|001:1000|123:456|[00]|]|Z01|Str2|02|002|NE]|[HEX3|HEX4]|[NA|002:1001|234:456|[01]|]|Z02|02|z2|Str|Str|"
This string always starts with X, and the split positions are Z01, Z02, ..., Z0D
str.split("""\|Z0[1|2|3|4|5|6|7|8|9|A|B|C|D]{1}\|""") foreach println
The markers Z01, ..., Z0D can appear in the string in any order.
The split gives this result:
X|blnk_1|blnk_2|blnk_3|blnk_4|time1|time2|blnk_5|blnk_6|blnk_7|blnk_8|
Str1|01|001|NE]|[HEX1|HEX2]|[NA|001:1000|123:456|[00]|]
Str2|02|002|NE]|[HEX3|HEX4]|[NA|002:1001|234:456|[01]|]
02|z2|Str|Str|
However, I want to map X, Z01, ... to case classes. Since there is no ordering, there is no way of identifying which case class each split should map to (I can't use the length of the individual splits).
I expect my split to have the following output:
X|blnk_1|blnk_2|blnk_3|blnk_4|time1|time2|blnk_5|blnk_6|blnk_7|blnk_8|
Z01|[Str1|01|001|NE]|[HEX1|HEX2]|[NA|001:1000|123:456|[00]|]|
Z01|[Str2|02|002|NE]|[HEX3|HEX4]|[NA|002:1001|234:456|[01]|]|
Z02|02|z2|Str|Str|
so that the result can be mapped to a case class with the help of the prepended pattern.
For example:
case class X( ....)
case class Z01(val1: String, val2: String, val3: String)
case class Z02(val1: Int, val2: String, val3: String,val4:String)
.................
X|blnk_1|blnk_2|blnk_3|blnk_4|time1|time2|blnk_5|blnk_6|blnk_7|blnk_8| maps to case class X
and
Z01|[Str1|01|001|NE]|[HEX1|HEX2]|[NA|001:1000|123:456|[00]|]| maps to case class Z01
and in the end the result should be ordered into groups of the same kind, each taken as an array of a particular case class.
X
Array[Z01]
Array[Z02]
......
......
As an alternative it might be an option to get your values by matching them:
(?:Z0[1-9A-D]|^X).*?(?=\|Z0[1-9A-D]|$)
This matches:
(?:Z0[1-9A-D]|^X) In a non-capturing group, match Z0 followed by one character from the character class, or (|) match X at the start ^ of the line
.*? Match any character zero or more times, non-greedy
(?=\|Z0[1-9A-D]|$) Positive lookahead which asserts that what follows is a pipe | followed by Z0 and a character from the character class, or (|) the end of the line $
For example:
val re = """(?:Z0[1-9A-D]|^X).*?(?=\|Z0[1-9A-D]|$)""".r
val str = "X|blnk_1|blnk_2|blnk_3|blnk_4|time1|time2|blnk_5|blnk_6|blnk_7|blnk_8| |Z01|Str1|01|001|NE]|[HEX1|HEX2]|[NA|001:1000|123:456|[00]|]|Z01|Str2|02|002|NE]|[HEX3|HEX4]|[NA|002:1001|234:456|[01]|]|Z02|02|z2|Str|Str|"
re.findAllIn(str) foreach println
That will result in:
X|blnk_1|blnk_2|blnk_3|blnk_4|time1|time2|blnk_5|blnk_6|blnk_7|blnk_8|
Z01|Str1|01|001|NE]|[HEX1|HEX2]|[NA|001:1000|123:456|[00]|]
Z01|Str2|02|002|NE]|[HEX3|HEX4]|[NA|002:1001|234:456|[01]|]
Z02|02|z2|Str|Str|
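Once every match starts with its marker, the records can be grouped by that marker. The following is only a sketch under assumptions: the input is a shortened, made-up string in the same shape as the question's, and Record is a hypothetical generic case class standing in for the real X, Z01, Z02 classes, whose fields the question does not spell out.

```scala
// Hypothetical stand-in for the real case classes: the marker plus its remaining columns.
final case class Record(tag: String, fields: List[String])

val re = """(?:Z0[1-9A-D]|^X).*?(?=\|Z0[1-9A-D]|$)""".r

// Shortened, made-up input with the same layout as the string in the question.
val str = "X|a|b| |Z01|s1|01|Z01|s2|02|Z02|02|z2|"

val records: List[Record] = re.findAllIn(str).toList.map { s =>
  val cols = s.split("\\|").toList
  // The first column is the marker; blank columns are dropped.
  Record(cols.head, cols.tail.map(_.trim).filter(_.nonEmpty))
}

// Group records of the same kind together, e.g. all Z01 records in one list.
val byTag: Map[String, List[Record]] = records.groupBy(_.tag)
```

From byTag, each group can then be converted to its own case class by pattern matching on the tag and parsing the fields as needed.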
How about this idea?
val x = str.split("""\|Z0[1|2|3|4|5|6|7|8|9|A|B|C|D]{1}\|""") // actual string splits
val y = """\|Z0[1|2|3|4|5|6|7|8|9|A|B|C|D]{1}\|""".r.findAllIn(str).toArray // delimiters Array
val final_data = x.slice(1, x.size).zip(y).map(x => x._2+x._1).toList // taking actual splits except first one .... and then zipping and concatenating with delimiters like below.
/*
List(|Z01|Str1|01|001|NE]|[HEX1|HEX2]|[NA|001:1000|123:456|[00]|], |Z01|Str2|02|002|NE]|[HEX3|HEX4]|[NA|002:1001|234:456|[01]|], |Z02|02|z2|Str|Str|) */
The leading | of each element in final_data can be removed with substring(1).
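A different technique that avoids zipping the delimiters back on is to split on the pipe only, using a lookahead for the marker so that the marker stays at the start of the next piece. This is a sketch on a shortened, made-up input in the same shape as the question's string, and it swaps the original character class for the range [1-9A-D].

```scala
// Shortened, made-up input with the same layout as the string in the question.
val str = "X|a|b| |Z01|s1|01|Z01|s2|02|Z02|02|z2|"

// Split on a '|' only when it is immediately followed by a marker such as "Z01|".
// The lookahead (?=...) is zero-width, so the marker is not consumed by the split.
val parts: List[String] = str.split("""\|(?=Z0[1-9A-D]\|)""").toList
// List(X|a|b| , Z01|s1|01, Z01|s2|02, Z02|02|z2|)
```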

How do I escape tilde character in scala?

Given that I have a file that looks like this
CS~84~Jimmys Bistro~Jimmys
...
using tilde (~) as a delimiter, how can I split it?
val company = dataset.map(k=>k.split(""\~"")).map(
k => Company(k(0).trim, k(1).toInt, k(2).trim, k(3).trim)
The above doesn't work.
Hmmm, I don't see where it needs to be escaped.
scala> val str = """CS~84~Jimmys Bistro~Jimmys"""
str: String = CS~84~Jimmys Bistro~Jimmys
scala> str.split('~')
res15: Array[String] = Array(CS, 84, Jimmys Bistro, Jimmys)
And the array elements don't need to be trimmed unless you know that errant spaces can be part of the input.
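Put together with the mapping from the question, this could look like the sketch below. The field names of Company are hypothetical (the question only shows the constructor call), and dataset is assumed here to be a plain List of lines rather than whatever collection the asker actually uses.

```scala
// Hypothetical field names; only the field types are implied by the question's code.
final case class Company(code: String, id: Int, name: String, shortName: String)

// Assumed stand-in for the file contents: one line per record.
val dataset = List("CS~84~Jimmys Bistro~Jimmys")

val companies: List[Company] =
  dataset
    .map(_.split('~'))   // '~' has no special meaning, so no escaping is needed
    .map(k => Company(k(0).trim, k(1).trim.toInt, k(2).trim, k(3).trim))
// List(Company(CS,84,Jimmys Bistro,Jimmys))
```

Lines with fewer than four fields, or a non-numeric second field, would throw here; real input may need validation before the conversion.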