Scala "does not contain" with String split - scala

I am trying Scala for the very first time and am not sure how to filter based on "does not contain".
I have the query below:
.filter(_.get[Option[String]]("status") map(_ split "," contains "Pending") getOrElse(false))
But I want to do something like this:
.filter(_.get[Option[String]]("status") map(_ split "," does not contain "Pending") getOrElse(false))
Can someone please help?

You can use exists and forall to simplify this. These functions return true if a condition is true for any or all elements of a collection.
So the pattern
Option[String].map(???).getOrElse(false)
can be
Option[String].exists(???)
And the condition
!(a split "," contains "Pending")
can be
a.split(",").forall(_ != "Pending")
Applying both of these to the original code gives
.filter(_.get[Option[String]]("status").exists(_.split(",").forall(_ != "Pending")))
But I would recommend a local function to clarify this code:
def notPending(s: String) = s.split(",").forall(_ != "Pending")
.filter(_.get[Option[String]]("status").exists(notPending))
This reads as "take all values where the status option exists and the status is not pending".
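To see the behaviour without the surrounding query, here is a minimal sketch that applies the same predicate to a plain list. The `rows` list and the `StatusFilter` object are hypothetical stand-ins for whatever `_.get[Option[String]]("status")` returns in the original code.

```scala
object StatusFilter {
  // True when none of the comma-separated parts equals "Pending".
  def notPending(status: String): Boolean =
    status.split(",").forall(_ != "Pending")

  // Hypothetical stand-in for the rows in the question: each row is just an optional status.
  val rows: List[Option[String]] = List(
    Some("Pending,Active"),
    Some("Active,Closed"),
    None,
    Some("Pending")
  )

  // Option.exists is false for None, which matches map(...).getOrElse(false).
  val kept: List[Option[String]] = rows.filter(_.exists(notPending))

  def main(args: Array[String]): Unit =
    println(kept) // List(Some(Active,Closed))
}
```

Rows with no status are dropped here, just as they are by `getOrElse(false)`. If missing statuses should be kept instead, `Option.forall` would be the call to reach for.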

Figured out the solution;
.filter(_.get[Option[String]]("status") map(a => !(a split "," contains "Pending")) getOrElse(false))
Thanks.

Related

How to filter dataframe columns that start with something and end with something

I have this piece of code currently that works as intended
val rules_list = df.columns.filter(_.startsWith("rule")).toList
However, this includes some columns that I don't want. How would I add a second filter so that the total filter is "columns that start with 'rule' and end with any integer value"?
So it should return "rule_1" in the list of columns but not "rule_1_modified"
Thanks and have a great day!
You can simply add a regular expression to your filter:
val rules_list = df.columns.filter(c => c.startsWith("rule") && c.matches("^.*\\d$")).toList
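The filter can be tried without Spark on a plain array of names. This is only a sketch: the `columns` array and the `RuleColumns` object are made up for illustration and stand in for `df.columns`.

```scala
object RuleColumns {
  // Hypothetical column names standing in for df.columns.
  val columns: Array[String] =
    Array("id", "rule_1", "rule_1_modified", "rule_22", "ruler")

  // String.matches must match the whole name, so no explicit ^ or $ is needed.
  val rulesList: List[String] =
    columns.filter(c => c.startsWith("rule") && c.matches(".*\\d")).toList

  def main(args: Array[String]): Unit =
    println(rulesList) // List(rule_1, rule_22)
}
```

Because `matches` is anchored at both ends, the two conditions could also be folded into a single pattern such as `"rule.*\\d"`.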
You can use Python's re module like this:
import re
columns = df.columns
# fullmatch keeps only names that start with "rule" and end with a digit,
# so "rule_1" is kept and "rule_1_modified" is dropped.
rules_list = [name for name in columns if re.fullmatch(r"rule.*\d", name)]
print(rules_list)

Spark/Scala group similar words and count

I am trying to group and count words in an rdd such that if a word ends with s/ly it is counted as the same word.
hi
yes
love
know
hi
knows
loves
lovely
Expected output:
hi 2
yes 1
love 3
know 2
This is what I currently have:
data.map(word => (word, 1)).reduceByKey((a, b) => a + b).collect
Any help is appreciated regarding adding s/ly condition.
It seems that you want to count the stems of the words in your input list. In computational linguistics, the process of finding the stem of a word is called stemming. If your goal is only to handle s and ly at the end of words, you can remove those suffixes in a map step and then count what remains. Removing s and ly blindly has side effects, though: a word that simply ends with s, such as "is", would be counted as "i". A better solution is to use an existing stemmer such as Porter or the one available in Stanford CoreNLP.
listRdd.mapToPair(t -> new Tuple2<>(t.replaceAll("(ly|s)$", ""), 1))
.reduceByKey((a,b) -> a+b).collect()
The second solution, which also handles other suffixes, is to use a stemmer:
listRdd.mapToPair(t -> {
Stemmer stemmer = new Stemmer();
return new Tuple2<>(stemmer.stem(t), 1);
}).reduceByKey((a,b) -> a+b).collect();
The Stemmer above can be replaced with any stemmer implementation.
For more information about stemmers and lemmatizers, you can use https://nlp.stanford.edu/IR-book/html/htmledition/stemming-and-lemmatization-1.html
If you want to group together words that finish with 's' or 'ly' here is how I would do it:
data
.map(word => (if (word.endsWith("s") || word.endsWith("ly")) "s/ly-words" else word, 1))
.reduceByKey(_+_)
.collect
If you want to separate 'ly' words from 's' words from the rest:
data
.map(word => (if (word.endsWith("s")) "s-words" else if (word.endsWith("ly")) "ly-words" else word, 1))
.reduceByKey(_+_)
.collect
If you want to count words that end with 'ly' or 's' as if they did not end with it (eg. 'love', 'lovely', 'loves' are counted as being 'love'):
data
.map(word => (if (word.endsWith("s")) word.dropRight(1) else if (word.endsWith("ly")) word.dropRight(2) else word, 1))
.reduceByKey(_+_)
.collect
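The suffix-stripping idea can be checked on an ordinary Scala list before running it on an RDD. The sketch below is an assumption-laden stand-in: it uses `groupBy` on a local collection in place of `map` plus `reduceByKey`, and the `SuffixCount` object and `stem` helper are names invented for this example.

```scala
object SuffixCount {
  // The sample words from the question.
  val words: List[String] =
    List("hi", "yes", "love", "know", "hi", "knows", "loves", "lovely")

  // Strip a trailing "ly" or "s". This is a naive rule, so "yes" becomes "ye".
  def stem(word: String): String = word.replaceAll("(ly|s)$", "")

  // groupBy plus size plays the role of map + reduceByKey on an RDD.
  val counts: Map[String, Int] =
    words.groupBy(stem).map { case (key, group) => key -> group.size }

  def main(args: Array[String]): Unit =
    counts.toList.sortBy(_._1).foreach { case (w, n) => println(s"$w $n") }
}
```

On the sample input this yields hi 2, know 2, love 3 and ye 1. The last entry is the side effect of blind suffix removal: "yes" is reduced to "ye", which is one reason a real stemmer may be preferable.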

AHK: Remove duplicates from list, and add duplicate-count to respective list items

I would like an AutoHotkey code that will remove duplicates from a list, while also adding a duplicate-count to respective list items, i.e. “x 2.”
Here is an example list:
myList =
(
apple
banana
apple
apple pie
banana
apple
)
Here is the desired result list:
myList =
(
apple x 3
banana x 2
apple pie
)
I am a novice in AHK, and code in general. I found many good codes for removing duplicates, but none to count them as indicated above. My own approach to the solution may be rather rudimentary: it is to place unique items (“apple pie” above) into a variable, place duplicate items (all instances of “banana, apple”) into a separate variable, count/condense the like duplicates, and then combine the two variables together for the “desired result list.” However, my own code will not work properly due to problems with substrings. Rather than dilute this question with my code, it may be best to start with a more experienced, concise approach. Thank you for your help.
MsgBox % CountList( MyList )
CountList( _list ) {
l := StrSplit(_list,"`n"), out:="`n"
for i, a in l {
c:=0
for j, b in l
c := (a = b) ? c + 1 : c
if !(InStr(out, "`n" a "`t"))
out .= a (c > 1 ? "`t x " c "`n" : "`n")
}
return Trim(out, "`n")
}

Read only lines that start with specific regular expression

I want to read only lines that start with a specific regular expression.
val rawData = spark.read.textFile(file.path).filter(f => f.nonEmpty && f.length > 1 && f.startsWith("("))
is what I did until now.
Now I found out that I have entries starting with:
(W);27536- or (W) 28325- (5 digits after separator)
I only want to read lines that start with (W);1234- (4 digits after separator)
The regular expression that would catch this looks like: \(\D\)(;|\s)\d{4} for a boolean return or \(\D\)(;|\s)\d{4}-.* for a string match return
My problem now is that I don't know how to include the regular expression in my read.textFile command.
f.startsWith only works with strings
f.matches also only works with strings
I also tried using http://www.scala-lang.org/api/2.12.3/scala/util/matching/Regex.html but this returns a string and not a boolean, which I can not use in the filter function
Any help would be appreciated.
Other answers are over-thinking this. Just use matches:
val lineRegex = """\(\D\)(;|\s)\d{4}-.*"""
val ns = List ("(W);1234-something",
"(W);12345-something",
"(W);2345-something",
"(W);23456-something",
"(W);3456-something",
"",
"1" )
ns.filter(f=> f.matches(lineRegex))
results in
List("(W);1234-something", "(W);2345-something", "(W);3456-something")
I found the answer to my question.
The command needs to look like this.
val lineregex = """\(\D\)(;|\s)\d{4}-.*""".r
val rawData = spark.read.textFile(file.path)
.filter(f => f.nonEmpty && f.length > 1 && lineregex.unapplySeq(f).isDefined )
You can try to find a match of the Regex using the findFirstMatchIn method, which returns an Option[Match]:
spark.read.textFile(file.path).filter { line =>
line.nonEmpty &&
line.length > 1 &&
"regex".r.findFirstMatchIn(line).isDefined
}
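One detail worth keeping in mind when choosing between these approaches: `String.matches` and `unapplySeq` require the whole line to match, while `findFirstMatchIn` accepts a match anywhere in the line. With the shorter pattern from the question (the one without the trailing `-`), an unanchored search would also accept lines with five digits. The sketch below shows a whole-line check with a regex compiled once; the `LineFilter` object and the sample `lines` are assumptions made for illustration, not part of the original Spark job.

```scala
object LineFilter {
  // Same pattern as in the question, compiled once as a scala.util.matching.Regex.
  val lineRegex = """\(\D\)(;|\s)\d{4}-.*""".r

  // Hypothetical sample lines standing in for the text file.
  val lines: List[String] = List(
    "(W);1234-something",
    "(W) 2345-something",
    "(W);12345-something",
    "",
    "1"
  )

  // Whole-line match via the underlying java.util.regex.Pattern.
  val kept: List[String] =
    lines.filter(l => lineRegex.pattern.matcher(l).matches())

  def main(args: Array[String]): Unit =
    println(kept) // List((W);1234-something, (W) 2345-something)
}
```

The same predicate should drop into the `filter` call on the Dataset, and the `nonEmpty` and `length > 1` checks become redundant because an empty or one-character line cannot match the pattern.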

How can I replace empty elements in an array with "OTHER"?

My list (@degree) is built from a SQL command. The NVL command in the SQL isn't working, neither are tests such as:
if (@degree[$i] == "")
if (@degree[$i] == " ")
if (@degree[$i] == '')
if (@degree[$i] == -1)
if (@degree[$i] == 0)
if (@degree[$i] == ())
if (@degree[$i] == undef)
$i is a counter variable in a for loop. Basically it goes through and grabs unique degrees from a table and ends up creating ("AFA", "AS", "AAS", "", "BS"). The list is not always this long, and the empty element is not always in that position 3.
Can anyone help?
I want to either test during the for loop, or after the loop completes for where this empty element is and then replace it with the word, "OTHER".
Thanks for anything
-Ken
First of all, the ith element in an array is $degree[$i], not @degree[$i]. Second, "==" is for numerical comparisons - use "eq" for lexical comparisons. Third of all, try if (defined($degree[$i]))
Everything that Paul said. And, if you need an example:
my @degree = ('AFA', 'AS', 'AAS', '', 'BS');
$_ ||= 'OTHER' for @degree;
print join ' ' => @degree; # prints 'AFA AS AAS OTHER BS'
If it's actually a NULL in the database, try COALESCE:
SELECT COALESCE(column, 'no value') AS column FROM whatever ...
That's the SQL-standard way to do it.