How do I replace whitespace with underscores and encode values in a Scala array / list - scala

I have a Spark Scala DataFrame with a column "Name".
I have extracted the values of that column into a Scala Array[String]:
org_name: Array[String] = Array(SARATOGA SENIOR HIGH SCHOOL)
I want to replace whitespace with _ and encode the value as UTF-8 (any encoding is fine as long as it replaces special characters with something else),
so that any special characters are removed. Later I want to use these values in a file path.
var org_name = orgsFlatDF.rdd.collect
.map( _.getString(2))
This is how I am extracting the values. I haven't found any method I can use to do that; replace and replaceAll don't work on an array.
I tried this:
org_name.replace("\\s", "")
That didn't work.
Expected output: SARATOGA_SENIOR_HIGH_SCHOOL
If the name is "new $ high school", it should be converted to new_$_high_school and then encoded to new_%24_high_school.

There are a couple of issues with what you are asking.
Java/Scala Arrays don't have a replace method. Even if they did have a replace method, would they replace the values they hold or the characters in a String they hold?
Let's assume the line org_name.replace("\\s", "") didn't compile and that org_name is indeed an Array[String] holding one element.
scala> val org_name=Array("SARATOGA SENIOR HIGH SCHOOL")
val org_name: Array[String] = Array(SARATOGA SENIOR HIGH SCHOOL)
scala> org_name(0).replace(" ","_")
val res15: String = SARATOGA_SENIOR_HIGH_SCHOOL
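replace("\\s","_") wouldn't work because replace takes a literal string, not a regular expression, and "\\s" is the two-character string \s. In a string literal, "\\" represents a single \; that escaping is the only way to tell a literal backslash apart from escape codes like \n or \t. To match whitespace as a regex, use replaceAll("\\s", "_") instead.
PS: to transform all the strings in the array, use org_name.map(_.replace(" ","_")); this gives you back another array.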
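The question also asks for the encoding step, which the answer above does not cover. A minimal sketch, assuming that URL-style percent-encoding via java.net.URLEncoder is acceptable for the file path (PathSafeNames and toPathSafe are hypothetical names used only for illustration). It differs from the answer in one way: the regex \s+ collapses a run of whitespace into a single underscore.

```scala
import java.net.URLEncoder

object PathSafeNames {
  // Replace each run of whitespace with one underscore, then percent-encode
  // whatever is not a letter, a digit, or one of . - * _
  def toPathSafe(name: String): String =
    URLEncoder.encode(name.trim.replaceAll("\\s+", "_"), "UTF-8")
}

val orgNames = Array("SARATOGA SENIOR HIGH SCHOOL", "new $ high school")
// map applies the transformation to every element and returns a new array
val safeNames = orgNames.map(PathSafeNames.toPathSafe)
safeNames.foreach(println)
```

URLEncoder leaves underscores untouched, which is why the replacement is done before encoding rather than after.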

Related

How to replace pound sign £ in scala

In the sales column I have values with a pound sign, such as £1200. The DataFrame in Scala cannot read it as a number; please help. I want the column value as a double, 1200. I am using the method below, but it's not working.
def getRemovedDollarValue = udf(
(actualSales: String) => {
val actualSalesDouble = actualSales
.replace(",", "")
.replace("$", "")
.replace("\\u00A3","")
.replace("\\U00A3","")
.replaceAll("\\s", "_").trim().toDouble
java.lang.Double.parseDouble(actualSalesDouble.toString)
}
)
You need to write .replace("\u00A3","") instead of the escaped .replace("\\u00A3","").
But I prefer just .replace("£", ""); it is more readable.
I think the proposed solutions and comments all work but don't address why your code isn't working.
From the Pattern docs:
Thus the strings "\u2014" and "\\u2014", while not equal, compile into the same pattern, which matches the character with hexadecimal value 0x2014.
replace and replaceAll both replace all occurrences in a String, but only replaceAll takes a regular expression. You're passing "\\u00A3", which works as a regex pattern but, because of the added backslash, is not a Unicode escape in a string literal. As already suggested, either use replace with a Unicode escape or the actual symbol, or switch to replaceAll.
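The two working variants described above can be sketched as plain functions, leaving out the Spark udf wrapper. SalesParser and its method names are hypothetical, and the input is assumed to be a currency symbol followed by digits with optional thousands separators.

```scala
object SalesParser {
  // Literal replacement: "\u00A3" is the pound sign itself, "$" is a plain dollar sign
  def parseSales(raw: String): Double =
    raw.replace("\u00A3", "").replace("$", "").replace(",", "").trim.toDouble

  // Regex replacement: the pattern text \u00A3 is interpreted by the regex engine,
  // and the dollar sign has to be escaped because it is a regex metacharacter
  def parseSalesRegex(raw: String): Double =
    raw.replaceAll("\\u00A3", "").replaceAll("\\$", "").replaceAll(",", "").trim.toDouble
}

println(SalesParser.parseSales("\u00A3" + "1,200"))
println(SalesParser.parseSalesRegex("\u00A3" + "1,200"))
```

Both variants throw a NumberFormatException on input that is not numeric after stripping, so real data may need a guard such as scala.util.Try.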

How to calculate number of occurrence of a character at beginning in a List of String using Scala

I am new to Scala and I want to count the strings in a list that start with a particular letter.
For example:
val test1 : List[String] = List("zero","zebra","zenith","tiger","mosquito")
I have defined the list above and want to count all strings that start with "z".
I tried the code below:
scala> test1.count(s=> s.charAt(0) == "z")
res7: Int = 0
It gives 0 as the result. I am not sure what I am doing wrong. Please suggest.
Character values are delimited by single quotes; double quotes are reserved for strings. charAt(0) returns a Char, and a Char never equals the String "z", so the count is 0:
val test : List[String] = List("zero","zebra","zenith","tiger","mosquito")
test.count(_.charAt(0) == 'z') // 3: Int
You can simply use filter and take the length of the result:
println(test1.filter(_.startsWith("z")).length)
If you want to ignore case, you can add .toLowerCase:
println(test1.filter(_.toLowerCase.startsWith("z")).length)
I hope the answer is helpful
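The two answers can be combined into one helper. This is a sketch only: PrefixCount and countStartingWith are hypothetical names. It uses startsWith rather than charAt(0) so that an empty string in the list does not throw, and count so that no intermediate list is built.

```scala
object PrefixCount {
  // Count the strings that begin with the given prefix, optionally ignoring case
  def countStartingWith(words: List[String], prefix: String, ignoreCase: Boolean = false): Int =
    if (ignoreCase) words.count(_.toLowerCase.startsWith(prefix.toLowerCase))
    else words.count(_.startsWith(prefix))
}

val animals = List("zero", "zebra", "zenith", "tiger", "mosquito")
println(PrefixCount.countStartingWith(animals, "z")) // prints 3
```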

remove pipe delimiter from data using spark

I am new to Spark. I am using Scala to read a pipe-delimited file and save it to HDFS without the pipe delimiters; for that I have written this code:
object WordCount {
def main(args: Array[String])
{
val textfile = sc.textFile("/user/cloudera/xxxx/xxxx")
val word = textfile.map( l => l.split("|"))
word.saveAsTextFile("/user/cloudera/xxxxx/Sparktest")
}
}
When I execute it I don't get any errors, but in HDFS I get the data below:
[Ljava.lang.String;@10ed847f
[Ljava.lang.String;@4316ebe
[Ljava.lang.String;@495d7e18
[Ljava.lang.String;@19017f49
[Ljava.lang.String;@314b9e72
[Ljava.lang.String;@5b8f67a6
[Ljava.lang.String;@23ddf240
[Ljava.lang.String;@404b5a25
[Ljava.lang.String;@130b541d
[Ljava.lang.String;@4cbf45af
[Ljava.lang.String;@21780b86
[Ljava.lang.String;@503c9b94
[Ljava.lang.String;@3b0a3ab3
I don't know what I am doing wrong.
Please help.
That's because you are splitting each string into an Array of Strings, and saving an array writes its default toString. To save as a text file, you'll need to join the array back into a string, e.g. with mkString(",") if you wish to concatenate with a comma. But I don't see any purpose in that.
If you want to replace the pipe separator with a comma, you can use _.replaceAll("\\|",",") instead and save it:
val word = textfile.map(_.replaceAll("\\|",",").replaceFirst(",","").trim)
word.saveAsTextFile("/user/cloudera/xxxxx/Sparktest")
PS: You can replace the comma with anything you want, e.g. a whitespace, a word, etc.
So why does the pipe need to be escaped?
String.split expects a regular expression argument. An unescaped | is parsed as a regex alternation meaning "empty string or empty string", which isn't what you mean.
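The escaping point can be tried without a Spark cluster. A minimal sketch on plain strings, where PipeSplit and its method names are hypothetical; the same functions could be passed to textfile.map.

```scala
object PipeSplit {
  // Escaped pipe: the regex matches a literal '|'
  def toCsv(line: String): String = line.split("\\|").mkString(",")

  // Char overload of split: the delimiter is taken literally, no regex involved
  def toCsvCharSplit(line: String): String = line.split('|').mkString(",")

  // No split at all: String.replace works on literal text, so no escaping is needed
  def toCsvViaReplace(line: String): String = line.replace("|", ",")
}

println(PipeSplit.toCsv("1|alice|NY"))
println(PipeSplit.toCsvViaReplace("1|alice|NY"))
```

One difference to keep in mind: split drops trailing empty fields, so a line ending in a pipe loses its last delimiter with the split-based versions but keeps it with replace.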

How to strip everything except digits from a string in Scala (quick one liners)

This is driving me nuts... there must be a way to strip out all non-digit characters (or perform other simple filtering) in a String.
Example: I want to turn a phone number ("+72 (93) 2342-7772" or "+1 310-777-2341") into a simple numeric String (not an Int), such as "729323427772" or "13107772341".
I tried "[\\d]+".r.findAllIn(phoneNumber), which returns an iterator of matches that I would then have to recombine into a String somehow... seems horribly wasteful.
I also came up with: phoneNumber.filter("0123456789".contains(_)) but that becomes tedious for other situations. For instance, removing all punctuation... I'm really after something that works with a regular expression so it has wider application than just filtering out digits.
Anyone have a fancy Scala one-liner for this that is more direct?
You can use filter, treating the string as a character sequence and testing the character with isDigit:
"+72 (93) 2342-7772".filter(_.isDigit) // res0: String = 729323427772
You can use replaceAll with a regex:
"+72 (93) 2342-7772".replaceAll("[^0-9]", "") // res1: String = 729323427772
Another approach: define the collection of valid characters, in this case
val d = '0' to '9'
and then, for val a = "+72 (93) 2342-7772", filter on collection membership with any of these:
for (c <- a if d.contains(c)) yield c
a.filter(d.contains)
a.collect{ case c if d.contains(c) => c }
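The asker wanted something regex-based that generalizes beyond digits. A sketch under that assumption, where StripChars and its method names are hypothetical: the same replaceAll call works for any character class, and the findAllIn attempt from the question can be finished with mkString.

```scala
object StripChars {
  // \D matches any non-digit character
  def digitsOnly(s: String): String = s.replaceAll("\\D", "")

  // \p{Punct} matches ASCII punctuation
  def withoutPunctuation(s: String): String = s.replaceAll("\\p{Punct}", "")

  // Completing the approach from the question: join the matched digit runs
  def digitsViaFindAll(s: String): String = "\\d+".r.findAllIn(s).mkString
}

println(StripChars.digitsOnly("+1 310-777-2341"))
println(StripChars.withoutPunctuation("Hello, world!"))
```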

Implement Scala-style String Interpolation In Scala

I want to implement a Scala-style string interpolation in Scala. Here is an example,
val str = "hello ${var1} world ${var2}"
At runtime I want to replace "${var1}" and "${var2}" with some runtime strings. However, when trying to use Regex.replaceAllIn(target: CharSequence, replacer: (Match) ⇒ String), I ran into the following problem:
import scala.util.matching.Regex
val placeholder = new Regex("""(\$\{\w+\})""")
placeholder.replaceAllIn(str, m => s"A${m.matched}B")
java.lang.IllegalArgumentException: No group with name {var1}
at java.util.regex.Matcher.appendReplacement(Matcher.java:800)
at scala.util.matching.Regex$Replacement$class.replace(Regex.scala:722)
at scala.util.matching.Regex$MatchIterator$$anon$1.replace(Regex.scala:700)
at scala.util.matching.Regex$$anonfun$replaceAllIn$1.apply(Regex.scala:410)
at scala.util.matching.Regex$$anonfun$replaceAllIn$1.apply(Regex.scala:410)
at scala.collection.Iterator$class.foreach(Iterator.scala:743)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1174)
at scala.util.matching.Regex.replaceAllIn(Regex.scala:410)
... 32 elided
However, when I removed '$' from the regular expression, it worked:
val placeholder = new Regex("""(\{\w+\})""")
placeholder.replaceAllIn(str, m => s"A${m.matched}B")
res2: String = hello $A{var1}B world $A{var2}B
So my question is whether this is a bug in Scala's Regex. If so, are there other elegant ways to achieve the same goal (other than brute-force replaceAllLiterally on all placeholders)?
$ is treated specially in the replacement string. This is described in the documentation of replaceAllIn:
In the replacement String, a dollar sign ($) followed by a number will be interpreted as a reference to a group in the matched pattern, with numbers 1 through 9 corresponding to the first nine groups, and 0 standing for the whole match. Any other character is an error. The backslash (\) character will be interpreted as an escape character and can be used to escape the dollar sign. Use Regex.quoteReplacement to escape these characters.
(Actually, that doesn't mention named group references, so I guess it's only sort of documented.)
Anyway, the takeaway here is that you need to escape the $ characters in the replacement string if you don't want them to be treated as references.
new scala.util.matching.Regex("""(\$\{\w+\})""")
.replaceAllIn("hello ${var1} world ${var2}", m => s"A\\${m.matched}B")
// "hello A${var1}B world A${var2}B"
It's hard to tell what behavior you're expecting. The issue is that ${m.matched} expands to "${var1}" (and "${var2}") in the replacement string, where $ is a special character meaning "insert the group named {var1} here".
For example:
scala> placeholder.replaceAllIn(str, m => "$1")
res0: String = hello ${var1} world ${var2}
It replaces the match with the first capturing group (which here is the whole match).
It's hard to tell exactly what you're doing, but you could escape any $ like so:
scala> placeholder.replaceAllIn(str, m => s"${m.matched.replace("$","\\$")}")
res1: String = hello ${var1} world ${var2}
If what you really want is to evaluate var1/var2 as variables in the local scope of the method, that's not possible. In fact, the s"Hello, $name" form is converted into new StringContext("Hello, ", "").s(name) at compile time.
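Local variables cannot be looked up by name at runtime, but a map of values can. A minimal sketch, assuming the runtime strings are supplied as a Map (TemplateFill and fill are hypothetical names). It uses Regex.quoteReplacement, the helper named in the documentation quoted above, so that a $ or \ inside a substituted value is not read as a group reference.

```scala
import scala.util.matching.Regex

object TemplateFill {
  // Capture only the variable name between ${ and }
  private val placeholder = new Regex("""\$\{(\w+)\}""")

  // Look each placeholder up in vars; unknown names are left in the text unchanged
  def fill(template: String, vars: Map[String, String]): String =
    placeholder.replaceAllIn(template, m =>
      Regex.quoteReplacement(vars.getOrElse(m.group(1), m.matched)))
}

// A plain (non-interpolated) string literal, so ${var1} stays as literal text
val template = "hello ${var1} world ${var2}"
println(TemplateFill.fill(template, Map("var1" -> "big", "var2" -> "$5")))
```

Quoting the whole replacement, rather than escaping $ by hand, also covers backslashes in the substituted values.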