Filtering XML elements with null attributes - scala

I'm trying to extract attributes from a regular XML structure; it seems natural to first exclude the elements for which a particular attribute is missing.
I don't know why the following doesn't work (see answer for why I ever got the idea to test vs. null):
val test = <top><el attr="1"></el><el></el><el attr="2"></el></top>
test.child.filter(_ \ "@attr" != null).map(_ \ "@attr")
// ArrayBuffer(1, NodeSeq(), 2)
Why is the middle element still there after the filter?
I've confirmed it's not operator precedence:
test.child.filter(x => (x \ "@attr") != null).map(_ \ "@attr")
// ArrayBuffer(1, NodeSeq(), 2)
Alternatively (assuming this is optimized internally), how could I exclude the NodeSeq() elements after the map step?

Just figured this out. The expression inside filter wasn't returning null, but NodeSeq(), so the following works:
test.child.filter(_ \ "@attr" != scala.xml.NodeSeq.Empty).map(_ \ "@attr")
// ArrayBuffer(1, 2)
I followed this Q&A to discover how to create the NodeSeq() object by hand.
I discovered my problem ultimately derived from crossing my own wires. I initially had been using the following:
test.child.map(_.attributes("attr"))
// ArrayBuffer(1, null, 2)
Which is where I got the idea to test vs. null originally. Of course, if I had stuck with that, my initial approach would have worked:
test.child.filter(_.attributes("attr") != null).map(_ \ "@attr")
// ArrayBuffer(1, 2)

Related

Getting unexpected behavior with multiple OR conditions

Here is my code:
df.where((F.col("A") != F.col("B")) | \
(F.col("A").isNotNull()) | \
(F.col("C") == F.col("D"))).show()
When I do this, I do see instances that contradict some of the conditions above. Now, when I structure the code like this, it runs successfully:
df.where((F.col("A") != F.col("B")))\
.where((F.col("A").isNotNull()))\
.where((F.col("C") == F.col("D")))
The first snippet uses | to combine the three conditions. However, | checks whether any of the conditions evaluates to true rather than all of them.
Chaining where clauses, on the other hand, is equivalent to combining the conditions with and (&).
Hence, the two snippets are not equivalent and produce different results.
For equivalence, your first snippet would become
df.where((F.col("A") != F.col("B")) & \
(F.col("A").isNotNull()) & \
(F.col("C") == F.col("D"))).show()
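The difference between the two forms can be checked without a Spark session. The following is only a minimal plain-Python sketch: it assumes a list of dicts stands in for df, the row values and helper names are made up for illustration, and it ignores Spark's null semantics (where a comparison against null yields null rather than False).

```python
# Hypothetical rows standing in for the DataFrame; the values are made up.
rows = [
    {"A": 1, "B": 2, "C": 5, "D": 5},  # all three conditions hold
    {"A": 1, "B": 1, "C": 5, "D": 5},  # A == B, so the first condition fails
    {"A": 3, "B": 4, "C": 6, "D": 7},  # C != D, so the third condition fails
]

def a_differs_from_b(r):
    return r["A"] != r["B"]

def a_is_not_null(r):
    return r["A"] is not None

def c_equals_d(r):
    return r["C"] == r["D"]

conds = [a_differs_from_b, a_is_not_null, c_equals_d]

# OR: keep a row if at least one condition holds (like combining with |).
any_match = [r for r in rows if any(c(r) for c in conds)]

# AND: keep a row only if every condition holds (like combining with &).
all_match = [r for r in rows if all(c(r) for c in conds)]

# Chained filters: each step narrows the previous result (like .where().where()).
chained = rows
for c in conds:
    chained = [r for r in chained if c(r)]

print(len(any_match), len(all_match), len(chained))  # prints: 3 1 1
```

The OR version keeps every row because A is never null here, while the AND version and the chained version both keep only the first row.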

Display words whose length is more than 8

From the list below, I am trying to get the words that contain r and are longer than 8 characters, and convert them to uppercase.
val names=List("sachinramesh","rahuldravid","viratkohli","mayank")
I tried the following, but it does not return anything; it throws an error.
names.map(s =>s.toUpperCase.contains("r").size(8)
Can someone tell me how to resolve this issue?
Regards,
Kumar
If you are doing a combination of filter and map, think about using the collect method, which does both in one call. This is how to do what is described in the question:
names.collect{
case s if s.lengthCompare(8) > 0 && s.contains('r') =>
s.toUpperCase
}
collect works like filter because it only returns values that match a case statement. It works like map because you can make changes to the matching values before returning them.
You can try this:
names.filter(str => str.contains('r') && str.length > 8) // str contains an `r` and length > 8
.map(_.toUpperCase) // map the result to uppercase
The names.filter(...).map(...) approach solves the problem but requires iterating through the list twice. For a solution that goes through the list only once, consider @Tim's suggestion regarding collect, or a lazy Iterator approach like so:
names
.iterator
.filter(_.size > 8)
.filter(_.contains('r'))
.map(_.toUpperCase)
.toList
You can also try this:
val result = for (x <- names if x.contains('r') && x.length > 8) yield x.toUpperCase
result.foreach(println)
cheers

PySpark RDD Filter with "not in" for multiple values

I have an RDD that looks like this:
myRDD:
[[u'16/12/2006', u'17:24:00'],
[u'16/12/2006', u'?'],
[u'16/12/2006', u'']]
I want to exclude the records with '?' or '' in it.
The following code works for filtering one value at a time, but is there a way to filter out records containing '?' or '' in one go, to get back the following?
[u'16/12/2006', u'17:24:00']
The code below works only for one item at a time; how can I extend it to multiple items?
myRDD.filter(lambda x: '?' not in x)
I would like help writing something like:
myRDD.filter(lambda x: '?' not in x && '' not in x)
Try this,
myRDD.filter(lambda x: ('?' not in x) & ('' not in x))
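Because the predicate passed to filter is ordinary Python, it can be tried on a plain list first. The sketch below assumes the three sample records from the question and uses the built-in filter as a stand-in for RDD.filter; the names records, bad_values, and is_clean are illustrative only. It shows the condition written with Python's logical and, plus a set-based variant that may be easier to extend to more unwanted values.

```python
# Sample records from the question; a plain list stands in for the RDD.
records = [
    ["16/12/2006", "17:24:00"],
    ["16/12/2006", "?"],
    ["16/12/2006", ""],
]

# Logical `and` binds more loosely than `not in`, so no extra parentheses
# are needed; with bitwise `&` the parentheses are required.
kept_with_and = list(filter(lambda x: "?" not in x and "" not in x, records))

# Variant for many unwanted values: keep a record only if it contains none of them.
bad_values = {"?", ""}

def is_clean(record):
    return bad_values.isdisjoint(record)

kept_with_set = list(filter(is_clean, records))

print(kept_with_and)  # prints: [['16/12/2006', '17:24:00']]
```

The same lambda or is_clean function could be passed to myRDD.filter unchanged, assuming each record is a list of strings as shown in the question.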

scala - Find a pair in a list only with first element value

Suppose that we have a list like: val list = List((1,'o'), (3,'t'), (10, 't'), (7, 's')).
Then I want to find a pair whose first element is 10, ignoring what the second element is.
How can I find the pair or the index of the pair?
I tried list.indexOf((10,_)), list.indexOf((10,???)) and so on. However, these attempts do not work.
Any suggestions are welcome :)
Use indexWhere to find the index:
list.indexWhere(_._1 == 10)
If you want the pair you can use find:
list.find(_._1 == 10)
Note that find returns an Option because it may not find any element. If you want to return a default value you can use getOrElse, otherwise you need to handle the not-found case:
list.find(_._1 == 10).getOrElse(/* default value */)

Erlang - Mnesia - equivalent to "select distinct id from Table"

Hi, is there a way to make a distinct select request to Mnesia?
I could copy the content of one table to ets, and since ets is a hash table it could work. But I thought there might be a more elegant solution to this problem.
Thank you.
I'm not sure if this is what you had in mind, but you could use QLC's {unique, true} option (See QLC documentation for more info).
I created a mnesia table, called test, with bag semantics. Each row consists of the table name, a Key and a Value, so my rows looked like:
1. test, 1, 1
2. test, 2, 1
3. test, 2, 2
4. test, 3, 1
5. test, 3, 2
6. test, 3, 3
... etc.
Then this simple module illustrates my approach. Notice that you have to include the qlc library and that, in my example, I am selecting distinct Keys.
-module(test).
-export([select_distinct/0]).
-include_lib("stdlib/include/qlc.hrl").
select_distinct()->
QH = qlc:q( [K || {_TName, K, _V} <- mnesia:table(test)], {unique, true}),
F = fun() -> qlc:eval(QH) end,
{atomic, Result} = mnesia:transaction(F),
Result.
Compiling and running
> c("/home/jim/test", [{outdir, "/home/jim/"}]).
> test:select_distinct().
> [4,1,2,3,5]
If you wanted sorted output then use the following version of the QH = ... line above
QH = qlc:sort(qlc:q( [K || {_TName, K, _V} <- mnesia:table(test)], {unique, true})),
If you wanted to select distinct values, the following would work:
QH = qlc:sort(qlc:q( [V || {_TName, _K, V} <- mnesia:table(test)], {unique, true})),
Again, the code is just to illustrate an approach
For keys you can get a list of unique keys using:
mnesia:all_keys(Table).
From my tests, for bags it yields a list of unique keys.