Split Set into multiple Sets Scala - scala

I have some Set[String] and a number devider: Int. I need to split the set arbitrary by pieces each of which has size devider. Examples:
1.
Set: "a", "bc", "ds", "fee", "s"
devider: 2
result:
Set1: "a", "bc"
Set2: "ds", "fee"
Set3: "s"
2.
Set: "a", "bc", "ds", "fee", "s", "ff"
devider: 3
result:
Set1: "a", "bc", "ds"
Set2: "fee", "s", "ff"
3.
Set: "a", "bc", "ds"
devider: 4
result:
Set1: "a", "bc", "ds"
What is the idiomatic way to do it in Scala?

You probably want something like:
Set("a", "bc", "ds", "fee", "s").grouped(2).toSet
The problem is that a Set, by definition, has no order, so there's no telling which elements will be grouped together.
Set( "a", "bc", "ds", "fee", "s").grouped(2).toSet
//res0: Set[Set[String]] = Set(Set(s, bc), Set(a, ds), Set(fee))
To get them grouped in a particular fashion you'll need to change the Set to one of the ordered collections, order the elements as required, group them, and transition everything back to Sets.

This is possible only if it is a List like:
val pn=List("a", "bc", "ds", "fee", "s").grouped(2).toSet
println(pn)

Related

How to Replace Variable with Level in tbl_summary

A reviewer asked that rather than have both genders listed in the table, to just include one. So, Gender would be replaced with Female and the proportions of the gender that was female would be under each Treatment.
library(gtsummary)
d<-tibble::tribble(
~Gender, ~Treatment,
"Male", "A",
"Male", "B",
"Female", "A",
"Male", "C",
"Female", "B",
"Female", "C")
d %>% tbl_summary(by = Treatment)
One way to do this would be to remove the first row of table_body and only keep the second row of table_body this will only keep the information on Female. This matches your table you provided in the comments.
library(gtsummary)
d<-tibble::tribble(
~Gender, ~Treatment,
"Male", "A",
"Male", "B",
"Female", "A",
"Male", "C",
"Female", "B",
"Female", "C")
t1 <- d %>% filter(Gender == "Female") %>% tbl_summary(by = Treatment)
t1$table_body <- t1$table_body[2,]
t1

Spark (Scala) filter array of structs without explode

I have a dataframe with a key and a column with an array of structs in a dataframe column. Each row contains a column a looks something like this:
[
{"id" : 1, "someProperty" : "xxx", "someOtherProperty" : "1", "propertyToFilterOn" : 1},
{"id" : 2, "someProperty" : "yyy", "someOtherProperty" : "223", "propertyToFilterOn" : 0},
{"id" : 3, "someProperty" : "zzz", "someOtherProperty" : "345", "propertyToFilterOn" : 1}
]
Now I would like to do two things:
Filter on "propertyToFilterOn" = 1
Apply some logic on other
properties - for example concatenate
So that the result is:
[
{"id" : 1, "newProperty" : "xxx_1"},
{"id" : 3, "newProperty" : "zzz_345"}
]
I know how to do it with explode but explode also requires groupBy on the key when putting it back together. But as this is a streaming Dataframe I would also have to put a watermark on it which I am trying to avoid.
Is there any other way to achieve this without using explode? I am sure there is some Scala magic that can achieve this!
Thanks!
With spark 2.4+ came many higher order functions for arrays. (see https://docs.databricks.com/spark/2.x/spark-sql/language-manual/functions.html)
val dataframe = Seq(
("a", 1, "xxx", "1", 1),
("a", 2, "yyy", "223", 0),
("a", 3, "zzz", "345", 1)
).toDF( "grouping_key", "id" , "someProperty" , "someOtherProperty", "propertyToFilterOn" )
.groupBy("grouping_key")
.agg(collect_list(struct("id" , "someProperty" , "someOtherProperty", "propertyToFilterOn")).as("your_array"))
dataframe.select("your_array").show(false)
+----------------------------------------------------+
|your_array |
+----------------------------------------------------+
|[[1, xxx, 1, 1], [2, yyy, 223, 0], [3, zzz, 345, 1]]|
+----------------------------------------------------+
You can filter elements within an array using the array filter higher order function like this:
val filteredDataframe = dataframe.select(expr("filter(your_array, your_struct -> your_struct.propertyToFilterOn == 1)").as("filtered_arrays"))
filteredDataframe.show(false)
+----------------------------------+
|filtered_arrays |
+----------------------------------+
|[[1, xxx, 1, 1], [3, zzz, 345, 1]]|
+----------------------------------+
for the "other logic" your talking about you should be able to use the transform higher order array function like so:
val tranformedDataframe = filteredDataframe
.select(expr("transform(filtered_arrays, your_struct -> struct(concat(your_struct.someProperty, '_', your_struct.someOtherProperty))"))
but there are issues with returning structs from the transform function as described in this post:
http://mail-archives.apache.org/mod_mbox/spark-user/201811.mbox/%3CCALZs8eBgWqntiPGU8N=ENW2Qvu8XJMhnViKy-225ktW+_c0czA#mail.gmail.com%3E
so you are best using the dataset api for the transform like so:
case class YourStruct(id:String, someProperty: String, someOtherProperty: String)
case class YourArray(filtered_arrays: Seq[YourStruct])
case class YourNewStruct(id:String, newProperty: String)
val transformedDataset = filteredDataframe.as[YourArray].map(_.filtered_arrays.map(ys => YourNewStruct(ys.id, ys.someProperty + "_" + ys.someOtherProperty)))
val transformedDataset.show(false)
+--------------------------+
|value |
+--------------------------+
|[[1, xxx_1], [3, zzz_345]]|
+--------------------------+

How to use $in operator with pair of attributes

Consider this collection
/* 1 */
{
"key" : 1,
"b" : 2,
"c" : 3
}
/* 2 */
{
"key" : 2,
"b" : 5,
"c" : 4
}
/* 3 */
{
"key" : 3,
"b" : 7,
"c" : 9
}
/* 4 */
{
"key" : 4,
"b" : 7,
"c" : 4
}
/* 5 */
{
"key" : 5,
"b" : 2,
"c" : 9
}
I want to use the $in operator and write a query to return the document such (b, c) IN ((2, 3), (7, 9)). It means "return all rows where b is 2 and c is 3 at the same time, OR b is 7 and с is 9 at the same time."
How can I use $in operator to use multiple attribute values.
If I use the following query
db.getCollection('test').find({
$and:[
{b:{$in:[2,7]}},
{c:{$in:[3,9]}}
]
})
then I get following results
(2,3)
(7,9)
(2,9) --> This is unwanted record.
IN SQL world it is possible
SELECT *
FROM demo
WHERE (b, c) IN ((2, 3), (7, 9))
What is the equivalent in Mongo DB?
If I get it right, the thing you are doing is getting all the pairs (2,3),(2,9),(7,3),(7,9).
But you want to match those one by one, so your valid pairs should be (2, 3), (7, 9). To satisfy this, you can match b and c one by one, pair them and "or" them after.
db.getCollection('test').find({
$or: [
{$and: [ {b : 2}, {c : 3} ]},
{$and: [ {b : 7}, {c : 9} ]}
]
})

reactivex repeated skip between

Suppose I have the following stream of data:
1, 2, 3, a, 5, 6, b, 7, 8, a, 10, 11, b, 12, 13, ...
I want to filter everything between 'a' and 'b' (inclusive) no matter how many times they appear. So the result of the above would be:
1, 2, 3, 7, 8, 12, 13, ...
How can I do this with ReactiveX?
Use scan with initial value b to turn
1, 2, 3, a, 5, 6, b, 7, 8, a, 10, 11, b, 12, 13, ...
into
b, 1, 2, 3, a, a, a, b, 7, 8, a, a, a, b, 12, 13, ...
and then filter out a and b to get
1, 2, 3, 7, 8, 12, 13, ...
In pseudo code
values.scan('b', (s, v) -> if (v == 'a' || v == 'b' || s != 'a') v else s).
filter(v -> v != 'a' && v != 'b');
OK. I'm posting this in case anyone else needs an answer to it. A slightly different setup than I described above just to make it easier to understand.
List<String> values = new List<string>()
{
"1", "2", "3",
"a", "5", "6", "b",
"8", "9", "10", "11",
"a", "13", "14", "b",
"16", "17", "18", "19",
"a", "21", "22", "b",
"24"
};
var aa =
// Create an array of CSV strings split on the terminal sigil value
String.Join(",", values.ToArray())
.Split(new String[] { "b," }, StringSplitOptions.None)
// Create the observable from this array of CSV strings
.ToObservable()
// Now create an Observable from each element, splitting it up again
// It is no longer a CSV string but the original elements up to each terminal value
.Select(s => s.Split(',').ToObservable()
// From each value in each observable take those elements
// up to the initial sigil
.TakeWhile(s1 => !s1.Equals("a")))
// Concat the output of each individual Observable - in order
// SelectMany won't work here since it could interleave the
// output of the different Observables created above.
.Concat();
aa.Subscribe(s => Console.WriteLine(s));
This prints out:
1
2
3
8
9
10
11
16
17
18
19
24
It is a bit more convoluted than I wanted but it works.
Edit 6/3/17:
I actually found a cleaner solution for my case.
List<String> values = new List<string>()
{
"1", "2", "3",
"a", "5", "6", "b",
"8", "9", "10", "11",
"a", "13", "14", "b",
"16", "17", "18", "19",
"a", "21", "22", "b",
"24"
};
string lazyABPattern = #"a.*?b";
Regex abRegex = new Regex(lazyABPattern);
var bb = values.ToObservable()
.Aggregate((s1, s2) => s1 + "," + s2)
.Select(s => abRegex.Replace(s, ""))
.Select(s => s.Split(',').ToObservable())
.Concat();
bb.Subscribe(s => Console.WriteLine(s));
The code is simpler which makes it easier to follow (even though it uses regexes).
The problem here is that it still isn't really a general solution to the problem of removing 'repeated regions' of a data stream. It relies on converting the stream to a single string, operating on the string, then converting it back to some other form. If anyone has any ideas on how to approach this in a general way I would love to hear about it.

Mongodb aggregation framework to count distinct array items

I have some docs like:
{ tags: { first_cat: ["a", "b", "c"], second_cat : ["1","2","3"]}}
{ tags: { first_cat: ["d", "b", "a"], second_cat : ["1"]}}
I need something like this:
{ first_cat: [{"a" : 2}, {"b" : 2}, {"c" : 1}, {"d" : 1}], second_cat: [{"1" : 2, "2": 1, "3":1}] }
With m/r it's quite easy to do (but slow), is it possibile to get a similar result with aggregation framework?
You can not do this with the Aggregation Framework as there is no way to convert arbitrary values "a" to a key { "a": 2 }. You will need to redesign your schema.