Replacing special characters in scala - scala

I am working on a scala and want to replace special characters from my dataframe replaceAll doesn't seem to work, is there any other way ?
My code is this:
val specialchar = dataframe.select(column).replaceAll("[^A-za-z]+","")

You can provide the allowed characters in regex .
Try following
val badDF = Seq(("7369", "SMI_)(TH" , "2010-12-17", "800.00"), ("7499", "AL#;__#$LEN","2011-02-20", "1600.00")).toDF("empno", "ename","hire_date", "sal")
val cleanedDF = badDF.select(badDF.columns.map(c => regexp_replace(badDF(c), """[^A-Z a-z 0-9]""", "").alias(c)): _*)
cleanedDF.show
ename contains special characters. above regex will only allow Capital/Small a-z characters and 0-9 digits. All other characters will be removed.

Related

How do i replace whitespace with underscore and encode values in scala array / list

I have a spark scala dataframe which has column "Name"
I have extracted the values of that column in to scala array[string]
org_name: Array[String] = Array(SARATOGA SENIOR HIGH SCHOOL)
I want to replace whitespaces with _ and encode that value in to utf-8 (any encoding is fine as long as it replaces special chars with something else)
so if there are any special chars those will be removed. later i want to use those in file path .
var org_name = orgsFlatDF.rdd.collect
.map( _.getString(2))
This is how i am extracting those vals ^^. I haven't found any method which I can use to do that. Replace or replaceall doesn't work on array
I tried this :
org_name.replace("\\s", "")
That didn't work .
Expected output : SARATOGA_SENIOR_HIGH_SCHOOL
if name is : new $ high school it should gets converted to new_$_high_school then encoded to new_%24_high_school
There are a couple of issues with what you are asking.
Java/Scala Arrays don't have a replace method. Even if they did have a replace method, would they replace the values they hold or the characters in a String they hold?
Let's assume this line org_name.replace("\\s", "") didn't compiled and org_name is indeed a an Array[String] holding one element.
scala> val org_name=Array("SARATOGA SENIOR HIGH SCHOOL")
val org_name: Array[String] = Array(SARATOGA SENIOR HIGH SCHOOL)
scala> org_name(0).replace(" ","_")
val res15: String = SARATOGA_SENIOR_HIGH_SCHOOL
replace("\\s","_") wouldn't work because it represents a \s string. "\" represents \. That's only way you'd be able to define strings containing other escape codes like \n or \t.
PS: to transform all the string in the array use org_name.map(_.replace(" ","_")), this gives you back another another array.

How to replace pound sign £ in scala

In sales column i have values with pound sign £1200. It is not readable by Data frame in scala, please help me for the same. i want column value in double, 1200. I am using below method but its not working.
def getRemovedDollarValue = udf(
(actualSales: String) => {
val actualSalesDouble = actualSales
.replace(",", "")
.replace("$", "")
.replace("\\u00A3","")
.replace("\\U00A3","")
.replaceAll("\\s", "_").trim().toDouble
java.lang.Double.parseDouble(actualSalesDouble.toString)
}
)
You need write: .replace("\u00A3","") instead of escaping .replace("\\u00A3","").
But I prefer just: .replace("£", "") - it is more readable.
I think the proposed solutions and comments all work but don't address the confusion behind why your code isn't working.
From the Pattern docs:
Thus the strings "\u2014" and "\\u2014", while not equal, compile into the same pattern, which matches the character with hexadecimal value 0x2014.
replace and replaceAll are both replacing all occurrences in a String, but only replaceAll is taking in a regular expression. You're passing in "\\u00A3" which will work as a pattern, but not a unicode literal due to the added backslash. As already suggested, either use replace with a unicode literal or the actual symbol, or change to replaceAll.

Created unicode & unicode without whitespace generators in ScalaCheck

During testing we want to qualify unicode characters, sometimes with wide ranges and sometimes more narrow. I've created a few specific generators:
// Generate a wide varying of Unicode strings with all legal characters (21-40 characters):
val latinUnicodeCharacter = Gen.choose('\u0041', '\u01B5').filter(Character.isDefined)
// Generate latin Unicode strings with all legal characters (21-40 characters):
val latinUnicodeGenerator: Gen[String] = Gen.chooseNum(21, 40).flatMap { n =>
Gen.sequence[String, Char](List.fill(n)(latinUnicodeCharacter))
}
// Generate latin unicode strings without whitespace (21-40 characters): !! COMES UP SHORT...
val latinUnicodeGeneratorNoWhitespace: Gen[String] = Gen.chooseNum(21, 40).flatMap { n =>
Gen.sequence[String, Char](List.fill(n)(latinUnicodeCharacter)).map(_.replaceAll("[\\p{Z}\\p{C}]", ""))
}
The latinUnicodeCharacter generator picks from characters ranging from standard latin ("A," "B," etc.) up to higher order latin character (Germanic/Nordic and others). This is good for testing latin-based character input for, say, names.
The latinUnicodeGenerator creates strings of 21-40 characters in length. These strings include horizontal space (not just a space character but other "horizontal space").
The final example, latinUnicodeGeneratorNoWhitespace, is used for say email addresses. We want the latin characters but we don't want spaces, control codes, and the like. The problem: Because I'm mapping the final result String and filtering out the control characters, the String shrinks and I end up with a total length that is less than 21 characters (sometimes).
So the question is: How can I implement latinUnicodeGeneratorNoWhitespace but do it inside the generator in such a way that I always get 21-40 character strings?
You could do this by putting together a sequence of your non-whitespace characters, another of whitespace, and then picking from either only the non-whitespace, or from both together:
import org.scalacheck.Gen
val myChars = ('A' to 'Z') ++ ('a' to 'z')
val ws = Seq(' ', '\t')
val myCharsGenNoWhitespace: Gen[String] = Gen.chooseNum(21, 40).flatMap { n =>
Gen.buildableOfN[String, Char](n, Gen.oneOf(myChars))
}
val myCharsGen: Gen[String] = Gen.chooseNum(21, 40).flatMap { n =>
Gen.buildableOfN[String, Char](n, Gen.oneOf(myChars ++ ws))
}
I would suggest considering what you're really testing for, though—the more you restrict the test cases, the less you're checking about how your program will behave on unexpected inputs.

How to strip everything except digits from a string in Scala (quick one liners)

This is driving me nuts... there must be a way to strip out all non-digit characters (or perform other simple filtering) in a String.
Example: I want to turn a phone number ("+72 (93) 2342-7772" or "+1 310-777-2341") into a simple numeric String (not an Int), such as "729323427772" or "13107772341".
I tried "[\\d]+".r.findAllIn(phoneNumber) which returns an Iteratee and then I would have to recombine them into a String somehow... seems horribly wasteful.
I also came up with: phoneNumber.filter("0123456789".contains(_)) but that becomes tedious for other situations. For instance, removing all punctuation... I'm really after something that works with a regular expression so it has wider application than just filtering out digits.
Anyone have a fancy Scala one-liner for this that is more direct?
You can use filter, treating the string as a character sequence and testing the character with isDigit:
"+72 (93) 2342-7772".filter(_.isDigit) // res0: String = 729323427772
You can use replaceAll and Regex.
"+72 (93) 2342-7772".replaceAll("[^0-9]", "") // res1: String = 729323427772
Another approach, define the collection of valid characters, in this case
val d = '0' to '9'
and so for val a = "+72 (93) 2342-7772", filter on collection inclusion for instance with either of these,
for (c <- a if d.contains(c)) yield c
a.filter(d.contains)
a.collect{ case c if d.contains(c) => c }

Play framework, Display special characters

Hell, I am new to play framework and Scala. I don't know how to display special characters like #, -, "" as text.
Help!
If I got your question right, you are asking how to escape special characters in scala. It's pretty much same as other languages. using escape character \
val str = " \\\" " //output str: \"
or follow the below special syntax to store in string value with exactly what is inside the triple quotes.
val str = """ \" """ //output str: \"