I have a String:
val s1 = "dog#$&cat#$&cow#$&snak"
val s2 = s1.split()
How do I split this string into words?
For a precise split, use the pattern #\\$& to match all three delimiter characters. The dollar sign is a regex metacharacter and has to be escaped with a backslash, and that backslash itself has to be escaped inside a Scala string literal.
val s1= "dog#$&cat#$&cow#$&snak"
val s2= s1.split("#\\$&")
Output
s2: Array[String] = Array(dog, cat, cow, snak)
A broader pattern is \\W+, which matches one or more consecutive non-word characters (anything other than a letter, digit, or underscore).
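To compare the two patterns side by side, here is a minimal sketch. The object name SplitWords and the Pattern.quote variant are additions for illustration and are not part of the original answer; Pattern.quote is one way to avoid escaping the delimiter by hand.

```scala
import java.util.regex.Pattern

object SplitWords {
  val s1 = "dog#$&cat#$&cow#$&snak"

  // Precise: split only on the exact three-character delimiter "#$&".
  val precise: Array[String] = s1.split("#\\$&")

  // Same delimiter, but quoted as a literal so no manual escaping is needed.
  val quoted: Array[String] = s1.split(Pattern.quote("#$&"))

  // Broad: split on any run of non-word characters.
  val broad: Array[String] = s1.split("\\W+")
}
```

All three calls should give Array(dog, cat, cow, snak) for this input. Note that with \\W+, an input that starts with a delimiter produces an empty first element.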
I have a corpus containing words such as applefruit that are not separated by any delimiter, and I would like to split them. Since this can be a non-linear problem, I would like to pass a custom dictionary and split only where a word from the dictionary is a substring of a word in the corpus.
If my dictionary has only apple and the corpus has three words (applefruit, applebananafruit, bananafruit), the output should look like: apple, fruit; apple, bananafruit; bananafruit.
Notice that I am not splitting bananafruit. The goal is to make the process faster by splitting only on the text provided in the dictionary. I am using Scala 2.x.
You can use regular expressions with split:
scala> "foobarfoobazfoofoobatbat".split("(?<=foo)|(?=foo)")
res27: Array[String] = Array(foo, bar, foo, baz, foo, foo, batbat)
Or, if your dictionary (and/or the list of strings to split) has more than one entry:
val rx = wordList.map { w => s"(?<=$w)|(?=$w)" }.mkString("|")
val result: List[String] = toSplit.flatMap(_.split(rx))
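The snippet above leaves wordList and toSplit undefined. A self-contained sketch using the data from the question might look like this. The object name, the use of Pattern.quote (in case a dictionary word contains regex metacharacters), and the filter for empty pieces are assumptions added here, not part of the original answer.

```scala
import java.util.regex.Pattern

object DictionarySplit {
  // Dictionary and corpus taken from the question's example.
  val wordList: List[String] = List("apple")
  val toSplit: List[String] = List("applefruit", "applebananafruit", "bananafruit")

  // Zero-width split points immediately before and after each dictionary word.
  val rx: String = wordList.map { w =>
    val q = Pattern.quote(w) // treat the word as a literal, not as a regex
    s"(?<=$q)|(?=$q)"
  }.mkString("|")

  // Split one corpus word, dropping any empty pieces.
  def splitWord(s: String): List[String] = s.split(rx).toList.filter(_.nonEmpty)

  // One list of pieces per corpus word.
  val result: List[List[String]] = toSplit.map(splitWord)
}
```

Here result keeps the pieces grouped per corpus word; use flatMap instead of map to get a single flat list as in the original snippet.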
You could do a regex find and replace on the following pattern:
(?=apple)|(?<=apple)
and replace each match with a comma surrounded by spaces. For example:
val input = "bananaapplefruit"
val output = input.replaceAll("(?=apple)|(?<=apple)", " , ")
println(output) // banana , apple , fruit
I need to replace all space characters with %20. I wrote this in Scala:
strToConvert.map(c => if (Character.isSpaceChar(c)) "%20" else c).mkString
Is there any better way to do this in Scala?
[Edit]
Let's assume replaceAll is not available and we'd like to implement an algorithm similar to the replaceAll method.
You can use String.replaceAll(what_to_replace, with_what).
For example, to replace every single space with %20:
scala> val input = "this is my http request          execute me"
input: String = this is my http request          execute me
scala> input.replaceAll(" ", "%20")
res1: String = this%20is%20my%20http%20request%20%20%20%20%20%20%20%20%20%20execute%20me
Or use the \\s regex, which matches a single whitespace character:
scala> input.replaceAll("\\s", "%20")
res2: String = this%20is%20my%20http%20request%20%20%20%20%20%20%20%20%20%20execute%20me
If you want each run of whitespace replaced by a single %20, use \\s+, which matches a sequence of one or more whitespace characters:
scala> input.replaceAll("\\s+", "%20")
res3: String = this%20is%20my%20http%20request%20execute%20me
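For the edited question, where replaceAll is assumed to be unavailable, one possible approach is to walk the string once and append to a StringBuilder. This is only a sketch: the object and method names are hypothetical, and it keeps the Character.isSpaceChar test from the question, which does not treat tabs or newlines as spaces.

```scala
object SpaceEncoder {
  // Replaces every space character with "%20" in a single pass, without replaceAll.
  def encodeSpaces(s: String): String = {
    val sb = new StringBuilder
    s.foreach { c =>
      if (Character.isSpaceChar(c)) sb.append("%20") // same test as in the question
      else sb.append(c)
    }
    sb.toString
  }
}
```

Compared with the map/mkString version in the question, this avoids building an intermediate collection of mixed String and Char elements.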
I am writing a regex in Scala:
val regex = "^foo.*$".r
This works, but if I want to do
var x = "foo"
val regex = s"""^$x.*$""".r
we have a problem, because $ is ambiguous. Is it possible to use string interpolation and still write a regex?
I can do something like
val x = "foo"
val regex = ("^" + x + ".*$").r
but I would rather avoid concatenating with +.
You can use $$ to have a literal $ in an interpolated string.
You should use the raw interpolator rather than s when enclosing a string in triple quotes, because the s interpolator re-enables escape sequences that you might expect to be taken literally inside triple quotes. It makes no difference in your specific case, but it's good to keep in mind.
So:
val regex = raw"""^$x.*$$""".r
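As a quick check of what the interpolated pattern expands to, here is a small sketch. The object name and the anchored helper are hypothetical; the helper additionally assumes you want the interpolated value treated as literal text, which the original question does not ask for, and uses Regex.quote for that.

```scala
import scala.util.matching.Regex

object InterpolatedRegex {
  val x = "foo"

  // $$ produces a literal $ (the end anchor); raw leaves backslashes untouched.
  val regex: Regex = raw"""^$x.*$$""".r

  // Assumption: quote the prefix so characters such as '.' are matched literally.
  def anchored(prefix: String): Regex = raw"""^${Regex.quote(prefix)}.*$$""".r
}
```

Here InterpolatedRegex.regex.regex is the string ^foo.*$, the same pattern as the non-interpolated version in the question.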
Using %s with format should also work:
var x = "foo"
val regex = """^%s.*$""".format(x).r
In the off chance that you need a literal %s to appear in the resulting pattern, just do
val regex = """^%s.*%s$""".format(x, "%s").r
I tried the following code in the Scala REPL:
"ASD-ASD.KZ".split('.')
res7: Array[String] = Array(ASD-ASD, KZ)
"ASD-ASD.KZ".split(".")
res8: Array[String] = Array()
Why do these function calls return different results?
The two calls resolve to different overloads, and those overloads behave very differently.
The split function is overloaded; this is the implementation from the Scala source code:
/** For every line in this string:
Strip a leading prefix consisting of blanks or control characters
followed by | from the line.
*/
def stripMargin: String = stripMargin('|')
private def escape(ch: Char): String = "\\Q" + ch + "\\E"
@throws(classOf[java.util.regex.PatternSyntaxException])
def split(separator: Char): Array[String] = toString.split(escape(separator))
@throws(classOf[java.util.regex.PatternSyntaxException])
def split(separators: Array[Char]): Array[String] = {
  val re = separators.foldLeft("[")(_+escape(_)) + "]"
  toString.split(re)
}
So when you call split() with a Char, the character is quoted with \Q...\E and you split on that specific character:
scala> "ASD-ASD.KZ".split('.')
res0: Array[String] = Array(ASD-ASD, KZ)
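When you call split() with a String, the argument is treated as a regex. An unescaped . matches any character, so every character becomes a delimiter, every resulting piece is empty, and split drops the trailing empty strings, which leaves an empty array. To get the same result as the Char version using double quotes, you need to do: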
scala> "ASD-ASD.KZ".split("\\.")
res2: Array[String] = Array(ASD-ASD, KZ)
Where:
The first \ escapes the second backslash inside the Scala string literal, so the regex engine receives \.
The second \ escapes the dot, which is otherwise a regex metacharacter, so that it is matched as a literal character
. is the character to split the string by
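The difference between the overloads can be seen side by side in a short sketch. The object name is hypothetical, and the Pattern.quote variant is an addition that mirrors what the Char overload does internally with \Q...\E.

```scala
import java.util.regex.Pattern

object SplitOnDot {
  val s = "ASD-ASD.KZ"

  // Char overload: the separator is quoted, so '.' is a literal dot.
  val byChar: Array[String] = s.split('.')

  // String overload: "." is a regex that matches every character.
  val byRegexDot: Array[String] = s.split(".")

  // String overload with the dot escaped by hand.
  val byEscaped: Array[String] = s.split("\\.")

  // String overload with the dot quoted as a literal.
  val byQuoted: Array[String] = s.split(Pattern.quote("."))
}
```

byChar, byEscaped, and byQuoted all give Array(ASD-ASD, KZ), while byRegexDot is empty.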
Given that I have a file that looks like this:
CS~84~Jimmys Bistro~Jimmys
...
using the tilde (~) as a delimiter, how can I split it?
val company = dataset.map(k=>k.split(""\~"")).map(
k => Company(k(0).trim, k(1).toInt, k(2).trim, k(3).trim)
The above doesn't work.
Hmmm, I don't see where it needs to be escaped.
scala> val str = """CS~84~Jimmys Bistro~Jimmys"""
str: String = CS~84~Jimmys Bistro~Jimmys
scala> str.split('~')
res15: Array[String] = Array(CS, 84, Jimmys Bistro, Jimmys)
And the array elements don't need to be trimmed unless you know that errant spaces can be part of the input.
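Putting this together with the Company mapping from the question, a minimal sketch could look like the following. The field names of Company are assumptions (the question only shows the constructor call), and returning an Option for malformed lines is a design choice made here, not something the original code does.

```scala
import scala.util.Try

object TildeSplit {
  // Hypothetical field names; only the arity and the Int in second position come from the question.
  final case class Company(code: String, id: Int, name: String, shortName: String)

  // Parses one tilde-delimited line; returns None if the line does not have
  // exactly four fields or the second field is not a valid Int.
  def parse(line: String): Option[Company] =
    line.split('~') match {
      case Array(code, id, name, shortName) =>
        Try(id.trim.toInt).toOption.map(n => Company(code.trim, n, name.trim, shortName.trim))
      case _ => None
    }
}
```

With a collection of lines, something like dataset.flatMap(TildeSplit.parse) would keep only the lines that parse cleanly, assuming dataset supports flatMap over Option results as standard Scala collections do.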