Difference between passing "*" and '*' arguments to split [duplicate]

This question already has answers here:
Scala: Splitting with Double Quotes ("") vs Single Quotes ('')
(4 answers)
Closed 2 years ago.
I am working with Spark in Scala and found that string splitting behaves differently from Python. As an example, here is my code:
for (line <- lines) {
  var fields = line.split("|")
  if (fields.length > 1) {
    movieNames += (fields(0).toInt -> fields(1))
  }
}
This gives me an error, but when I change it to
for (line <- lines) {
  var fields = line.split('|')
  if (fields.length > 1) {
    movieNames += (fields(0).toInt -> fields(1))
  }
}
it works. So what is the difference between "|" and '|' at the logic level?

The docs are your friend.
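line.split("|") calls the overload of split that takes a String, and that overload treats the String as a regex.
In a regex, an unescaped | is the alternation operator rather than a literal pipe. On its own it matches the empty string, so the line gets split between characters instead of at the pipe. To split on a literal pipe with this overload, escape it: line.split("\\|").
Whereas line.split('|') calls the overload that takes a Char, and that overload simply splits the line every time it finds that character.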
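A small sketch comparing the String overload, the Char overload, and the String overload with the pipe escaped. The sample line is made up for illustration, and the exact pieces produced by the unescaped regex call can differ slightly between JVM versions, so only the two literal-pipe results are spelled out in the comments.

```scala
// Hypothetical sample record in the "id|title" shape used by the question.
val line = "1|Toy Story (1995)"

// String overload: "|" is compiled as a regex (an empty alternation),
// so the split happens between characters, not at the pipe.
val byRegex: Array[String] = line.split("|")

// Char overload: a plain split on the literal character.
val byChar: Array[String] = line.split('|')      // Array("1", "Toy Story (1995)")

// String overload with the pipe escaped: also a literal split.
val byEscaped: Array[String] = line.split("\\|") // Array("1", "Toy Story (1995)")
```

With the unescaped call, fields(1) ends up being a single character rather than the movie title, which is why the loop in the question misbehaves.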

Related

if statement never returns true in scala when comparing strings

I'm reading text from a text file in Scala. I'm having difficulties with if statements.
for (line <- Source.fromFile(filename).getLines) {
  if (line.length>7) {
    println("b1 >" + line(7)+ "< " + line(0).getType)
    if(line(7)=="#") {
      println("hashtag")
    }
  }
}
Below are 2 lines from my text file. The first line has 4 spaces followed by many hashtags. The second line is 4 spaces followed by 1 hashtag (the 4 spaces keep getting deleted by Stack Overflow).
##################################################################################################################################################
#
Below is the output I receive:
//| b1 >#< 12
//| b1 > < 12
Question 1) Why is getType returning 12? This is the strangest data type I've ever heard of.
Question 2) (possibly answered by Q1) Why does the if(line(7)=="#") statement never return true?

To answer your questions in reverse order:
Question 2. Because line is a String, line(7) is a Char which is never equal to a String. You want to compare it with '#' instead.
Question 1. For the same reason, line(0) is a Char, so this calls the Char.getType method, which
Returns a value indicating a character's general category.
(not that you can find it from Scala's own documentation). The value 12 is the category of a space separator, and line(0) is a space here. You probably wanted getClass instead.
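Both points can be checked with a short sketch. The sample line below is hypothetical (four spaces followed by hash characters, like the file in the question), and the value names are chosen only for illustration.

```scala
// Hypothetical line: four spaces followed by hash characters.
val line = "    ####"

val first: Char = line(0)   // indexing a String yields a Char
val eighth: Char = line(7)

// Compare a Char with a Char literal (single quotes).
val isHash: Boolean = eighth == '#'

// Or convert the Char to a String before comparing with a String literal.
val isHashAsString: Boolean = eighth.toString == "#"

// getType reports the Unicode general category; a space is a space separator.
val category: Int = first.getType
val isSpaceSeparator: Boolean = category == Character.SPACE_SEPARATOR
```

Here category is 12, the same number printed in the question's output.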

Apply a text-preprocessing function to a dataframe column in scala spark

I want to create a function to handle the text preprocessing in a problem I am facing with text data. I am familiar with Python and pandas dataframes, and my usual approach is to write a function and then use the pandas apply method to apply it to all the elements in a column. However, I don't know where to begin to accomplish this in Scala.
So, I created two functions to handle the replacements. The problem is that I don't know how to put more than one replace inside this method. I need to make about 20 replacements for three separate dataframes so to solve it with this method it would take me 60 lines of code. Is there a way to do all the replacements inside a single function and then apply it to all the elements in a dataframe column in scala?
def removeSpecials: String => String = _.replaceAll("$", " ")
def removeSpecials2: String => String = _.replaceAll("?", " ")
val udf_removeSpecials = udf(removeSpecials)
val udf_removeSpecials2 = udf(removeSpecials2)
val consolidated2 = consolidated.withColumn("product_description", udf_removeSpecials($"product_description"))
val consolidated3 = consolidated2.withColumn("product_description", udf_removeSpecials2($"product_description"))
consolidated3.show()

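Well, you can simply chain each replacement after the previous one. Keep in mind that replaceAll takes a regex, and both $ and ? are regex metacharacters, so they have to be escaped:
def removeSpecials: String => String = _.replaceAll("\\$", " ").replaceAll("\\?", " ")
But in this case, where the replacement string is the same, it would be better to use a single regular expression to avoid multiple replaceAll calls.
def removeSpecials: String => String = _.replaceAll("\\$|\\?", " ")
Note that \\ is used as the escape sequence: the Scala string literal "\\$" produces the regex \$, which matches a literal dollar sign.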
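For around 20 replacements, escaping each character by hand gets error-prone. A possible sketch, assuming every special character should become a single space: build the alternation from a list and let java.util.regex.Pattern.quote do the escaping. The list contents and the names specials and specialsRegex are placeholders, not taken from the question.

```scala
import java.util.regex.Pattern

// Hypothetical list of strings to blank out; extend as needed.
val specials: Seq[String] = Seq("$", "?", "*", "+")

// Pattern.quote turns each entry into a literal, so no manual escaping is needed.
val specialsRegex: String = specials.map(s => Pattern.quote(s)).mkString("|")

// One function performing every replacement in a single pass.
def removeSpecials(text: String): String = text.replaceAll(specialsRegex, " ")
```

The function can then be wrapped in a single udf and applied with withColumn as in the question. Spark's built-in regexp_replace column function may also do the job without a UDF.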

PowerShell Substring Exception [duplicate]

This question already has answers here:
$string.Substring Index/Length exception
(3 answers)
Closed 5 years ago.
I'm trying to use the Substring() function in PowerShell. This is the example:
$string = "example string"
$temp = $string.Substring(5,($string.Length))
In this example I'm trying to get part of $string, from the 5th char, until the end. I'm using the .Length property to get the last index of $string.
The problem is that I'm getting this exception :
Exception calling "Substring" with "2" argument(s) because of .Length property.
What can I do to get part of $string until the last char?

Passing $string.Length as the second argument asks for a substring that is as long as $string itself, which is not possible when you start at index 5: the start index plus the length would run past the end of the string.
Specify $string.Length - 5 or, much simpler, omit the 2nd argument:
$string = "example string"
$temp = $string.Substring(5,($string.Length - 5))
$temp = $string.Substring(5) # much simpler
This Tip of the Week helps to get to grips with PowerShell string manipulation.

Read only lines that start with specific regular expression

I want to read only the lines whose beginning matches a specific regular expression.
val rawData = spark.read.textFile(file.path).filter(f => f.nonEmpty && f.length > 1 && f.startsWith("("))
is what I did until now.
Now I found out that I have entries starting with:
(W);27536- or (W) 28325- (5 digits after the separator)
I only want to read lines that start with (W);1234- (4 digits after the separator)
The regular expression that would catch this look like: \(\D\)(;|\s)\d{4} for a boolean return or \(\D\)(;|\s)\d{4}-.* for a string match return
My problem now is that I don't know how to include the regular expression in my read.textFile command.
f.startsWith only works with strings
f.matches also only works with strings
I also tried using http://www.scala-lang.org/api/2.12.3/scala/util/matching/Regex.html but this returns a string and not a boolean, which I can not use in the filter function
Any help would be appreciated.

Other answers are over-thinking this. Just use matches
val lineRegex = """\(\D\)(;|\s)\d{4}-.*"""
val ns = List("(W);1234-something",
  "(W);12345-something",
  "(W);2345-something",
  "(W);23456-something",
  "(W);3456-something",
  "",
  "1")
ns.filter(f => f.matches(lineRegex))
results in
List("(W);1234-something", "(W);2345-something", "(W);3456-something")

I found the answer to my question.
The command needs to look like this.
val lineregex = """\(\D\)(;|\s)\d{4}-.*""".r
val rawData = spark.read.textFile(file.path)
.filter(f => f.nonEmpty && f.length > 1 && lineregex.unapplySeq(f).isDefined )

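You can try to find a match of the Regex using the findFirstMatchIn method, which returns an Option[Match]. Note that it finds a match anywhere in the line, so begin the pattern with ^ if the match has to be at the start: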
spark.read.textFile(file.path).filter { line =>
line.nonEmpty &&
line.length > 1 &&
"regex".r.findFirstMatchIn(line).isDefined
}
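Another option, sketched here under the assumption that only the start of each line matters, is Regex.findPrefixOf, which only looks at the beginning of the input. The sample list stands in for the text file, and the names linePrefix and hasFourDigitPrefix are made up for this example.

```scala
import scala.util.matching.Regex

// Pattern from the question, without the trailing ".*": findPrefixOf only
// looks at the beginning of the input, so no "^" anchor is needed.
val linePrefix: Regex = """\(\D\)(;|\s)\d{4}-""".r

// Boolean predicate suitable for passing to filter.
def hasFourDigitPrefix(line: String): Boolean =
  linePrefix.findPrefixOf(line).isDefined

// Hypothetical sample lines standing in for the text file.
val sample = List("(W);1234-a", "(W);27536-b", "(W) 2345-c", "(W) 28325-d", "", "x (W);1234-e")

// Keeps only "(W);1234-a" and "(W) 2345-c".
val kept = sample.filter(hasFourDigitPrefix)
```

The same predicate could be passed to the filter call on the Dataset returned by spark.read.textFile. Compiling the regex once outside the lambda also avoids rebuilding it for every line.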

Extracting specific data from a string with regex using Powershell

I'm returning some data like this in PowerShell:
1)Open;#1
2)Open;#1;#Close;#2;#pending;#6
3)Closed;#5
But I want an output like this :
1)1 Open
2)
1 Open
2 Close
6 pending
3)
5 Closed
The code:
$lookupitem = $lookupList.Items
$CMRSItems = $list.Items | where {$_['ID'] -le 5}
$CMRSItems | ForEach-Object {
$realval = $_['EventType']
Write-Host "RefNumber: " $_['RefID']
Write-Host $realval
}
Any help would be appreciated as my powershell isn't that good.

Without regular expressions, you could do something like the following:
Ignore everything up to the first ')' character
Split the string on the ';' character
foreach pair of the split string
the state is the first part (ignore potentially leading '#')
the number is the second part (ignore leading '#')
Or you could do it using the .NET System.Text.RegularExpressions.Regex class with the following regular expression:
(?:#?(?<state>[a-zA-Z]+);#(?<number>\d);?)
The Matches method returns a MatchCollection. Each Match in it exposes a Groups collection that contains two named groups, state and number respectively.
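To check the pattern itself, here is a rough Scala sketch rather than PowerShell, since the regex logic is the same on both platforms. It uses positional groups instead of named ones and \d+ so that multi-digit numbers would also match. Both choices, and the names pairPattern and parsePairs, are assumptions made for this illustration.

```scala
// Same idea as the pattern above, written with positional groups:
// group 1 is the state, group 2 is the number (one or more digits here).
val pairPattern = """#?([a-zA-Z]+);#(\d+)""".r

// Returns the (number, state) pairs found in one raw value.
def parsePairs(raw: String): List[(Int, String)] =
  pairPattern.findAllMatchIn(raw).map(m => (m.group(2).toInt, m.group(1))).toList

// Second sample value from the question.
val parsed = parsePairs("Open;#1;#Close;#2;#pending;#6")
// parsed == List((1, "Open"), (2, "Close"), (6, "pending"))
```

Each pair can then be printed as "number state" to get the output layout requested in the question.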