Scala spark: Efficient check if condition is matched anywhere? - scala

What I want is roughly equivalent to
df.where(<condition>).count() != 0
But I'm pretty sure it's not quite smart enough to stop once it finds any such violation. I would expect some sort of aggregator to be able to do this, but I haven't found one. I could do it with a max and some sort of conversion, but again I don't think it would necessarily know to quit (since max is not specific to booleans, I'm not sure it understands that no value is larger than true).
More specifically, I want to check whether a column contains only a single distinct value. Right now my best idea is to grab the first value and compare everything against it.

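I would try this option; it should be much faster, since it only needs one matching row:
df.where(<condition>).head(1).nonEmpty
Note that nonEmpty (not isEmpty) is the equivalent of count() != 0.
You can also define your condition on a row together with Scala's exists (which stops at the first occurrence of true within a partition):
df.mapPartitions(rows => if (rows.exists(row => <condition>)) Iterator(1) else Iterator.empty).head(1).nonEmpty
This variant needs an implicit Encoder[Int] in scope, for example from the implicits of your SparkSession.
In the end, you should benchmark the alternatives.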
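The difference between exists and a full count can be seen without Spark. The following is a minimal plain-Scala sketch, not Spark code: the object name, the helper names and the counter are made up for illustration, and an ordinary Iterator stands in for the rows of a single partition.

```scala
object ExistsShortCircuit {
  /** Returns (result, number of elements the predicate was evaluated on). */
  def existsWithCount(data: Iterator[Int], p: Int => Boolean): (Boolean, Int) = {
    var inspected = 0
    val found = data.exists { x => inspected += 1; p(x) }
    (found, inspected)
  }

  /** The same check written as "count the matches", which walks the whole input. */
  def countWithCount(data: Iterator[Int], p: Int => Boolean): (Boolean, Int) = {
    var inspected = 0
    val matches = data.count { x => inspected += 1; p(x) }
    (matches != 0, inspected)
  }

  def main(args: Array[String]): Unit = {
    println(existsWithCount(Iterator.range(1, 1001), _ > 3)) // (true,4)
    println(countWithCount(Iterator.range(1, 1001), _ > 3))  // (true,1000)
  }
}
```

Both calls agree on the answer, but exists stops after the fourth element while the count-based version inspects all 1000.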

Related

Scala - finding a specific key in an array of tuples

So far I have an array of tuples that is filled with key,value pairs (keys are ints and values are strings).
val tuple_array = new Array[(K,V)](100)
I want to find a specific key in this array. So far I have tried:
tuple_array.find()
but this requires me to enter a key,value pair. (I think). I just want to search this array and see if the key exists at all, and if it does, return either 1 or true (I haven't decided yet). I could just loop through the array but I was going for a faster runtime.
How would I go about searching for this?
find requires you to pass a predicate: a function that returns true when the condition is fulfilled. You can use it like this:
tuple_array.find { tuple =>
  tuple._1 == searched_key
}
It doesn't require you to pass a tuple.
Since this is an array, you have to go through the whole array in the worst case (O(n)). There is no asymptotically faster way unless your array is sorted and allows a binary search (which isn't part of the interface, as you never know whether an arbitrary array is sorted). Whether you do this by iterating manually or through find (or collectFirst) doesn't affect the speed much.
but this requires me to enter a key,value pair. (I think).
No it doesn't, check the docs, you can just do:
tuple_array.find(_._1 == theKeyYouWant).map(_._2)
That returns an Option[V] with the value associated with the key if it was present. You then may just do an isDefined to return true if the key existed.
could just loop through the array but I was going for a faster runtime.
Well find just loops.
You may want to use a Map[K, V] instead of an Array[(K, V)] and just use contains
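The options mentioned in these answers can be put side by side. This is only a sketch: the object name, the helper names and the sample data are hypothetical, and it assumes a fully populated array (the new Array[(K,V)](100) from the question starts out filled with null entries, which these helpers do not handle).

```scala
object KeyLookup {
  // Hypothetical sample data: Int keys and String values, as in the question.
  val tupleArray: Array[(Int, String)] = Array(1 -> "one", 2 -> "two", 3 -> "three")

  /** Linear scan: true if any pair has the given key. Stops at the first hit. */
  def hasKey(pairs: Array[(Int, String)], key: Int): Boolean =
    pairs.exists(_._1 == key)

  /** Linear scan returning the value of the first pair with the given key, if any. */
  def valueFor(pairs: Array[(Int, String)], key: Int): Option[String] =
    pairs.collectFirst { case (k, v) if k == key => v }

  def main(args: Array[String]): Unit = {
    println(hasKey(tupleArray, 2))   // true
    println(valueFor(tupleArray, 4)) // None

    // Building a Map costs one pass up front; later lookups do not rescan the array.
    val byKey: Map[Int, String] = tupleArray.toMap
    println(byKey.contains(3))       // true
  }
}
```

If only a yes/no answer is needed, exists says so directly and avoids the intermediate Option that find(...).isDefined creates.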
Also, as personal advice, it seems you are pretty new to the language; I would recommend you to pick a course or tutorial. Scala is way more than syntax.

Split scala treeset from ordered object

My use case is very simple and looks like caching, so maybe something like Guava would be useful, but I use Scala and prefer not to pull in Guava if I don't need to.
case class AAA(index: Double) extends Ordered[AAA] {
  override def compare(that: AAA): Int = index.compare(that.index)
}
var aaaSet = mutable.TreeSet[AAA]()
AAAs mostly arrive in the set in increasing order, but the index value might be lower than what already exists. What I need is a simple function that removes elements lower than a certain index (a Double). This does not need to be exact, as long as nothing above the index gets deleted. I can do this with O(log(n)) complexity, but since I can always start at the bottom of the set (the head), I think it can be done more efficiently. Obviously I quickly end up with caching libraries, but these indexes are not time-based and I need up to millions of these sets in my program (hence the wish to go faster than O(log(n))).
Some help and direction to possible solutions are much appreciated. Even if it means that O(log(n)) means best performance.
Even though it is not really the solution I was looking for I think this will be an ok solution:
aaaSet = aaaSet.dropWhile(aa => aa.index < 1.3)
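Note that dropWhile returns a new set, which is why the result has to be assigned back to aaaSet. A hedged alternative is to remove elements in place, starting from the smallest one, as the question suggests. The sketch below is an assumption-laden illustration, not a benchmarked solution: PruneBelow and pruneBelow are made-up names, and each individual removal from the tree still costs O(log(n)).

```scala
import scala.collection.mutable

final case class AAA(index: Double) extends Ordered[AAA] {
  override def compare(that: AAA): Int = index.compare(that.index)
}

object PruneBelow {
  /** Removes, in place, every element whose index is below `threshold`.
    * Walks up from the smallest element and stops at the first one that stays. */
  def pruneBelow(set: mutable.TreeSet[AAA], threshold: Double): Unit =
    while (set.nonEmpty && set.head.index < threshold) set -= set.head

  def main(args: Array[String]): Unit = {
    val aaaSet = mutable.TreeSet(AAA(0.5), AAA(1.1), AAA(1.3), AAA(2.0))
    pruneBelow(aaaSet, 1.3)
    println(aaaSet.toList.map(_.index)) // List(1.3, 2.0)
  }
}
```

When nothing is below the threshold, the loop exits after looking at the head only, so the common case of an already-pruned set stays cheap.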

if (Option.nonEmpty) vs Option.foreach

I want to perform some logic if the value of an option is set.
Coming from a java background, I used:
if (opt.nonEmpty) {
  //something
}
Going a little further into scala, I can write that as:
opt.foreach(o => {
  //something
})
Which one is better? The "foreach" one sounds more "idiomatic" and less Java, but it is less readable - "foreach" applied to a single value sounds weird.
Your example is not complete and you don't use minimal syntax. Just compare these two versions:
if (opt.nonEmpty) {
  val o = opt.get
  // ...
}
// vs
opt foreach {
  o => // ...
}
and
if (opt.nonEmpty)
  doSomething(opt.get)
// vs
opt foreach doSomething
In both versions there is more syntactic overhead in the if solution, but I agree that foreach on an Option can be confusing if you think of it only as an optional value.
Instead, foreach describes that you want to perform some sort of side effect, which makes a lot of sense if you think of Option as a monad and foreach as just a method that operates on its contents. Using foreach furthermore has the great advantage that it makes refactoring easier: you can change the type to a List or any other monad and you will not get any compiler errors (thanks to Scala's great collections library you are not constrained to operations that work only on monads; a lot of methods are defined on a lot of types).
foreach does make sense, if you think of Option as being like a List, but with a maximum of one element.
A neater style, IMO, is to use a for-comprehension:
for (o <- opt) {
  something(o)
}
foreach makes sense if you consider Option to be a list that can contain at most a single value. This also leads to a correct intuition about many other methods that are available to Option.
I can think of at least one important reason you might want to prefer foreach in this case: it removes possible run-time errors. With the nonEmpty approach, you'll at one point have to do a get*, which can crash your program spectacularly if you by accident forget to check for emptiness one time.
If you completely erase get from your mind to avoid bugs of that kind, a side effect is that you also have less use for nonEmpty! And you'll start to enjoy foreach and let the compiler take care of what should happen if the Option happens to be empty.
You'll see this concept cropping up in other places. You would never do
if (age.nonEmpty)
  Some(age.get >= 18)
else
  None
What you'll rather see is
age.map(_ >= 18)
The principle is that you want to avoid having to write code that handles the failure case – you want to offload that burden to the compiler. Many will tell you that you should never use get, however careful you think you are about pre-checking. So that makes things easier.
* Unless you actually just want to know whether or not the Option contains a value and don't really care for the value, in which case nonEmpty is just fine. In that case it serves as a sort of toBoolean.
Things I didn't find in the other answers:
When using if, I prefer if (opt isDefined) to if (opt nonEmpty) as the former is less collection-like and makes it more clear we're dealing with an option, but that may be a matter of taste.
if and foreach are different in the sense that if is an expression that returns a value, while foreach returns Unit. So in a certain way using foreach is even more Java-like than if, as Java has a foreach loop and Java's if is not an expression. Using a for-comprehension is more Scala-like.
You can also use pattern matching (but this is also less idiomatic).
You can use the fold method, which takes two functions so you can evaluate one expression in the Some case and another in the None case. You may need to explicitly specify the expected type because of how type inference works (as shown here). So in my opinion it may sometimes still be clearer to use either pattern matching or val result = if (opt isDefined) expression1 else expression2.
If you don't need a return value and really have no need to handle the not-defined case, you can use foreach.
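The alternatives listed above can be compared in one place. This is a small sketch with made-up names (OptionStyles, describeMatch, describeFold, isAdult); it only uses standard Option methods and is meant as an illustration, not a style ruling.

```scala
object OptionStyles {
  /** Pattern matching: both cases are spelled out. */
  def describeMatch(opt: Option[Int]): String = opt match {
    case Some(n) => s"value $n"
    case None    => "empty"
  }

  /** fold: the first argument list is the None case, the second handles Some. */
  def describeFold(opt: Option[Int]): String =
    opt.fold("empty")(n => s"value $n")

  /** map keeps the result inside the Option, so no get is needed. */
  def isAdult(age: Option[Int]): Option[Boolean] = age.map(_ >= 18)

  def main(args: Array[String]): Unit = {
    println(describeMatch(Some(3))) // value 3
    println(describeFold(None))     // empty
    println(isAdult(Some(21)))      // Some(true)

    // foreach and the for-comprehension both run the body only for Some.
    Some(1).foreach(o => println(s"foreach saw $o"))
    for (o <- Option.empty[Int]) println(s"never printed: $o")
  }
}
```

None of these variants calls get, so there is no path on which an empty Option can throw.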

Need an explanation for a confusing way the AND boolean works

I am tutoring someone in basic search and sorts. In insertion sort I iterate negatively when I have a value that is greater than the one previous to it in numerical terms. Now of course this approach can cause issues because there is a check which calls for array[-1] which does not exist.
As underlined in bold below, adding the and x > 0 boolean prevents the index issue.
My question is how is this the case? Wouldn't the call for array[-1] still be made to ensure the validity of both booleans?
the_list = [10,2,4,3,5,7,8,9,6]
for x in range(1,len(the_list)):
    value = the_list[x]
    while value < the_list[x-1] **and x > 0**:
        the_list[x] = the_list[x-1]
        x=x-1
    the_list[x] = value
print the_list
I'm not sure I completely understand the question, and I don't know what programming language this is, but most modern programming languages use so-called short-circuit Boolean evaluation by default so that the logical expression isn't evaluated further once the outcome is known.
You can use that to guard against an out-of-range index, like this:
while x > 0 and value < the_list[x-1]:
but the check of x's range must come before the use.
An AND operation returns true if and only if both arguments are true, so if one argument is false there is no point in checking the others: the final value is already known at that point. Evaluation usually goes from left to right, and in that case your condition does perform the lookup the_list[x-1] before it tests x > 0. The code looks like Python, where that lookup does not crash: a negative index counts from the end of the list, so the_list[-1] is simply the last element. The comparison is evaluated against that element, and the x > 0 test then ends the loop. In most other popular languages the same condition would fail with an out-of-range error, so it is safer to test x > 0 first and let short-circuit evaluation skip the lookup.
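For comparison with the other examples on this page, here is the same guarded loop sketched in Scala, where a(-1) would throw an ArrayIndexOutOfBoundsException. The object and method names are made up for illustration. Scala's && short-circuits like Python's and, so putting x > 0 first means the array lookup is skipped once x reaches 0.

```scala
object GuardedInsertionSort {
  /** Insertion sort on a copy of the input; the input array is left untouched. */
  def sort(input: Array[Int]): Array[Int] = {
    val a = input.clone()
    for (i <- 1 until a.length) {
      val value = a(i)
      var x = i
      // The range check comes first: when x == 0, a(x - 1) is never evaluated.
      while (x > 0 && value < a(x - 1)) {
        a(x) = a(x - 1) // shift the larger element one slot to the right
        x -= 1
      }
      a(x) = value
    }
    a
  }

  def main(args: Array[String]): Unit =
    println(sort(Array(10, 2, 4, 3, 5, 7, 8, 9, 6)).mkString(", "))
    // 2, 3, 4, 5, 6, 7, 8, 9, 10
}
```

Swapping the two operands of && in this version would make the first pass fail as soon as an element has to move to index 0.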

head and tail calls on empty list bringing an exception

I'm following a tutorial. (Real World Haskell)
And I have one beginner question about head and tail called on empty lists: In GHCi it returns exception.
Intuitively I think I would say they both should return an empty list. Could you correct me? Why not? (As far as I remember, in OzML left or right of an empty list returns nil.)
I surely have not yet covered this topic in the tutorial, but isn't it a source of bugs (if providing no arguments)?
I mean, if I ever pass a function a list of arguments which may be optional, reading them with head may lead to a bug?
I just know the GHCi behaviour, I don't know what happens when compiled.
Intuitively I think would say they both should return an empty list. Could you correct me ? Why not ?
Well - head is [a] -> a. It returns the single, first element; no list.
And when there is no first element like in an empty list? Well what to return? You can't create a value of type a from nothing, so all that remains is undefined - an error.
And tail? Tail basically is a list without its first element - i.e. one item shorter than the original one. You can't uphold these laws when there is no first element.
When you take one apple out of a box, you can't have the same box afterwards (which is what would happen if tail [] == []). The behaviour has to be undefined too.
This leads to the following conclusion:
I surely have not yet covered this topic in the tutorial, but isn't it a source of bugs? I mean, if I ever pass a function a list of arguments which may be optional, reading them with head may lead to a bug?
Yes, it is a source of bugs, because it allows you to write flawed code: code that is basically trying to read a value that doesn't exist. So: **Don't ever use head/tail.** Use pattern matching.
sum [] = 0
sum (x:xs) = x + sum xs
The compiler can guarantee that all possible cases are covered, values are always defined and it's much cleaner to read.