Read only lines that start with specific regular expression

I want to read only the lines that start with a specific regular expression.
val rawData = spark.read.textFile(file.path).filter(f => f.nonEmpty && f.length > 1 && f.startsWith("("))
is what I have so far.
Now I have found out that I have entries starting with:
(W);27536- or (W) 28325- (5 digits after the separator)
I only want to read lines that start with (W);1234- (4 digits after the separator)
The regular expression that would catch this looks like: \(\D\)(;|\s)\d{4} for a boolean return or \(\D\)(;|\s)\d{4}-.* for a string match return
My problem now is that I don't know how to include the regular expression in my read.textFile command.
f.startsWith only works with strings
f.matches also only works with strings
I also tried using http://www.scala-lang.org/api/2.12.3/scala/util/matching/Regex.html but this returns a string and not a boolean, so I cannot use it in the filter function
Any help would be appreciated.

Other answers are over-thinking this. Just use matches, which takes the regular expression as a String and returns true only if the whole line matches:
val lineRegex = """\(\D\)(;|\s)\d{4}-.*"""
val ns = List ("(W);1234-something",
"(W);12345-something",
"(W);2345-something",
"(W);23456-something",
"(W);3456-something",
"",
"1" )
ns.filter(f=> f.matches(lineRegex))
results in
List("(W);1234-something", "(W);2345-something", "(W);3456-something")

I found the answer to my question.
The command needs to look like this.
val lineregex = """\(\D\)(;|\s)\d{4}-.*""".r
val rawData = spark.read.textFile(file.path)
.filter(f => f.nonEmpty && f.length > 1 && lineregex.unapplySeq(f).isDefined )

You can try to find a match of the Regex using the findFirstMatchIn method, which returns an Option[Match]:
spark.read.textFile(file.path).filter { line =>
line.nonEmpty &&
line.length > 1 &&
"regex".r.findFirstMatchIn(line).isDefined
}
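The answers above differ in how the pattern is anchored: matches and unapplySeq require the whole line to match, while findFirstMatchIn accepts a match anywhere in the line. A minimal sketch of a third option, findPrefixOf, which only requires the start of the line to match, is shown here on a plain list rather than a Spark Dataset. The names prefixRegex, startsWithPattern and sample are made up for illustration.

```scala
// Pattern from the question, without the trailing ".*" because findPrefixOf
// only requires the start of the line to match.
val prefixRegex = """\(\D\)(;|\s)\d{4}-""".r

// Hypothetical helper: true when the line begins with the pattern.
def startsWithPattern(line: String): Boolean =
  prefixRegex.findPrefixOf(line).isDefined

// Hypothetical sample lines.
val sample = List(
  "(W);1234-something",
  "(W) 2345-something",
  "(W);12345-something",
  "x (W);3456-something",
  ""
)

// Keeps only the two lines that start with "(W)", a separator and exactly 4 digits.
val kept = sample.filter(startsWithPattern)

// findFirstMatchIn is unanchored, so it also accepts the line starting with "x ".
val unanchored = sample.filter(l => prefixRegex.findFirstMatchIn(l).isDefined)
```

Building the Regex once outside the filter, instead of calling .r for every line, also avoids recompiling the pattern repeatedly. The same helper should be usable inside the filter from the question, for example filter(line => startsWithPattern(line)).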

Related

How to filter dataframe columns that start with something and end with something

I have this piece of code currently that works as intended
val rules_list = df.columns.filter(_.startsWith("rule")).toList
However this is including some columns that I don't want. How would I add a second filter to this so that the total filter is "columns that start with 'rule' and end with any integer value"
So it should return "rule_1" in the list of columns but not "rule_1_modified"
Thanks and have a great day!

You can simply add a regular expression to your filter:
val rules_list = df.columns.filter(c => c.startsWith("rule") && c.matches("^.*\\d$")).toList
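Because String.matches has to match the whole name, both conditions can probably be folded into one pattern. A small sketch on a plain array follows; the column names are invented stand-ins for df.columns.

```scala
// Hypothetical column names standing in for df.columns.
val columns = Array("id", "rule_1", "rule_1_modified", "rule_23", "ruler")

// String.matches must match the whole name, so no ^ or $ is needed:
// the name has to start with "rule" and end with a digit.
val rulesList = columns.filter(_.matches("rule.*\\d")).toList
```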

You can use Python's re module like this:
import re
columns = df.columns
# Keep only names that start with "rule" and end with a digit.
rules_list = [name for name in columns if re.fullmatch(r"rule.*\d", name)]
print(rules_list)

Apply a text-preprocessing function to a dataframe column in scala spark

I want to create a function to handle text preprocessing for a problem involving text data. I am familiar with Python and pandas DataFrames, and my usual approach is to write a function and then use the pandas apply method to apply it to every element in a column. However, I don't know where to begin to accomplish this in Scala.
So, I created two functions to handle the replacements. The problem is that I don't know how to put more than one replace inside this method. I need to make about 20 replacements for three separate dataframes so to solve it with this method it would take me 60 lines of code. Is there a way to do all the replacements inside a single function and then apply it to all the elements in a dataframe column in scala?
def removeSpecials: String => String = _.replaceAll("$", " ")
def removeSpecials2: String => String = _.replaceAll("?", " ")
val udf_removeSpecials = udf(removeSpecials)
val udf_removeSpecials2 = udf(removeSpecials2)
val consolidated2 = consolidated.withColumn("product_description", udf_removeSpecials($"product_description"))
val consolidated3 = consolidated2.withColumn("product_description", udf_removeSpecials2($"product_description"))
consolidated3.show()

You can simply chain the replacements one after another. Because replaceAll treats its first argument as a regular expression, and $ and ? are regex metacharacters, use replace for literal text:
def removeSpecials: String => String = _.replace("$", " ").replace("?", " ")
But in this case, where the replacement character is the same, it would be better to use a single regular expression and avoid multiple calls.
def removeSpecials: String => String = _.replaceAll("\\$|\\?", " ")
Note that \\ escapes the regex metacharacters $ and ? so that they are matched literally.
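When the replacements differ from one another, one possible approach is to keep them in a list and fold over it, so that adding a replacement means adding one pair rather than another function. This is only a sketch: the names replacements and cleanText and the "&" entry are made up for illustration.

```scala
// Hypothetical list of (literal text to find, replacement) pairs.
val replacements: Seq[(String, String)] = Seq(
  "$" -> " ",
  "?" -> " ",
  "&" -> " and "
)

// String.replace treats its arguments literally, so no regex escaping is needed.
// foldLeft applies each replacement in turn to the result of the previous one.
def cleanText(text: String): String =
  replacements.foldLeft(text) { case (acc, (target, replacement)) =>
    acc.replace(target, replacement)
  }
```

The function could then presumably be wrapped with udf and applied with withColumn in the same way as removeSpecials in the question, and reused across all three dataframes.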

How to calculate number of occurrence of a character at beginning in a List of String using Scala

I am new to Scala and I want to count the strings in a list that start with a particular character.
For example:
val test1 : List[String] = List("zero","zebra","zenith","tiger","mosquito")
I have defined the above list of strings and I want to count all strings that start with "z".
I tried the code below:
scala> test1.count(s=> s.charAt(0) == "z")
res7: Int = 0
It gives me 0 as the result. I am not sure what I am doing wrong. Please suggest.

Character values are delimited by single quotes. Double quotes are reserved for strings:
val test : List[String] = List("zero","zebra","zenith","tiger","mosquito")
test.count(_.charAt(0) == 'z') // 3: Int

You can simply use filter and take the length of the resulting list:
println(test1.filter(_.startsWith("z")).length)
If you want to ignore case (uppercase or lowercase), you can add .toLowerCase:
println(test1.filter(_.toLowerCase.startsWith("z")).length)
I hope the answer is helpful
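One caveat with charAt(0) is that it throws on an empty string. A small sketch of a variant that tolerates empty strings and counts directly, without building an intermediate list, follows; the list words is invented for illustration.

```scala
// Hypothetical input that includes an empty string and a capitalized word.
val words: List[String] = List("zero", "Zebra", "zenith", "tiger", "", "mosquito")

// headOption returns None for "", where charAt(0) would throw.
val countZ = words.count(_.headOption.contains('z'))

// Case-insensitive variant.
val countZIgnoreCase = words.count(_.headOption.exists(_.toLower == 'z'))
```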

Scala program to sort the characters of each word in a string alphabetically

I am learning Scala and have been trying to create a program that rearranges the characters of each word in a string into alphabetical order. For example, if the string is "Where are you", the program should change it to "Eehrw aer ouy". I searched Google and found some examples, but I am not able to write an error-free program. I think I am far from having a working program.
def main(args:Array[String]){
val r = "Where are you"
val newstr = r.map(x=>(x,_) match {
case ' ' = ' '
case y => {
val newchar = (x.toByte).toChar
if newchar.toByte.toChar > (newchar + 1).toByte.toChar
x = newchar
else
x
}
})
}

The tricky part is restoring the original capitalization. Add punctuation to the mix and it turns into a fun little challenge.
val str = "Where, aRe yoU?"
val sortedLowerCase = str.toLowerCase.split("(?=\\W)").map(_.sorted).mkString
val capsAt = str.indices.filter(str(_).isUpper)
capsAt.foldLeft(sortedLowerCase)((s,x) => s.patch(x,Seq(s(x).toUpper),1))
// res0: String = Eehrw, aEr ouY?
Time spent studying the Standard Library will be richly rewarded.

r.split(" ").map(word => word.toLowerCase.sorted)
To keep the capital letters, instead of .toLowerCase.sorted, use .sortWith and implement the comparison function according to how you want to sort the characters.

Let me expand on Ren's answer:
Compare based on lowercase, and then capitalize only if the word contains an uppercase letter:
r.split(" ").map(word => word.sortWith(_.toLower < _.toLower))
.map(x => if (x.exists(_.isUpper)) x.toLowerCase.capitalize else x )
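Putting the pieces from these answers together, a complete version might look like the sketch below. The helper name sortWords is made up, and it assumes words are separated by single spaces and contain no punctuation.

```scala
// Hypothetical helper: sorts the letters of each space-separated word and
// capitalizes the result when the original word contained an uppercase letter.
def sortWords(sentence: String): String =
  sentence
    .split(" ")
    .map { word =>
      val sorted = word.toLowerCase.sorted
      if (word.exists(_.isUpper)) sorted.capitalize else sorted
    }
    .mkString(" ")
```

Calling sortWords("Where are you") yields "Eehrw aer ouy", the output asked for in the question.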

find line number in an unstructured file in scala

Hi guys, I am parsing an unstructured file for some keywords, but I can't seem to easily find the line numbers of the results I am getting.
val filePath:String = "myfile"
val myfile = sc.textFile(filePath);
var ora_temp = myfile.filter(line => line.contains("MyPattern")).collect
ora_temp.length
However, I not only want to find the lines that contain MyPattern; I want a tuple of (matching line, line number).
Thanks in advance,

You can use zipWithIndex, as eliasah pointed out in a comment (probably the most succinct way is the direct tuple accessor syntax), or use pattern matching in the filter like so. Note that the index is a Long that starts at 0:
val matchingLineAndLineNumberTuples = sc.textFile("myfile").zipWithIndex().filter({
case (line, lineNumber) => line.contains("MyPattern")
}).collect
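Since zipWithIndex counts from 0, a 1-based line number needs an offset. The idea can be sketched on an ordinary Scala list, where zipWithIndex behaves similarly but yields an Int index. The sample lines are invented and stand in for the file contents.

```scala
// Hypothetical in-memory lines standing in for the file contents.
val lines = List("first line", "has MyPattern here", "nothing", "MyPattern again")

// zipWithIndex numbers from 0, so add 1 for human-readable line numbers.
// collect filters and transforms in a single pass.
val matches: List[(String, Int)] =
  lines.zipWithIndex.collect {
    case (line, index) if line.contains("MyPattern") => (line, index + 1)
  }
```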
