I created an RDD with the following format using Scala:
Array[(String, (Array[String], Array[String]))]
How can I get the list of the first arrays (Array[1]) from this RDD?
The data for the first line is:
// Array[(String, (Array[String], Array[String]))]
Array(
(
966515171418,
(
Array(4579848447, 4579848453, 2015-07-29 03:27:28, 44, 1, 1, 966515171418, 966515183263, 420500052424347, 0, 52643, 9, 5067, 5084, 2, 1, 0, 0),
Array(4579866236, 4579866226, 2015-07-29 04:16:22, 37, 1, 1, 966515171418, 966515183264, 420500052424347, 0, 3083, 9, 5072, 5084, 2, 1, 0, 0)
)
)
)
Assuming you have something like this (just paste into a spark-shell):
val a = Array(
("966515171418",
(Array("4579848447", "4579848453", "2015-07-29 03:27:28", "44", "1", "1", "966515171418", "966515183263", "420500052424347", "0", "52643", "9", "5067", "5084", "2", "1", "0", "0"),
Array("4579866236", "4579866226", "2015-07-29 04:16:22", "37", "1", "1", "966515171418", "966515183264", "420500052424347", "0", "3083", "9", "5072", "5084", "2", "1", "0", "0")))
)
val rdd = sc.makeRDD(a)
Then you get the first array using:
scala> rdd.first._2._1
res9: Array[String] = Array(4579848447, 4579848453, 2015-07-29 03:27:28, 44, 1, 1, 966515171418, 966515183263, 420500052424347, 0, 52643, 9, 5067, 5084, 2, 1, 0, 0)
which means the first row (which is a Tuple2), then the 2nd element of the tuple (which is again a Tuple2), then the 1st element.
Using pattern matching
scala> rdd.first match { case (_, (array1, _)) => array1 }
res30: Array[String] = Array(4579848447, 4579848453, 2015-07-29 03:27:28, 44, 1, 1, 966515171418, 966515183263, 420500052424347, 0, 52643, 9, 5067, 5084, 2, 1, 0, 0)
If you want to get it for all rows, just use map():
scala> rdd.map(_._2._1).collect()
which puts the results of all rows into an array.
Another option is to use pattern matching in map():
scala> rdd.map { case (_, (array1, _)) => array1 }.collect()
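And if you want one flat list of values instead of an array of arrays, flatMap() is an option; a small sketch against the same rdd:
// collect() on the map() above yields Array[Array[String]];
// flatMap flattens the per-row arrays into a single Array[String]:
scala> rdd.flatMap(_._2._1).collect()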
My JSON file looks like:
{
"product1" : {
"data1" : [0, 0, 0, ...],
"data2" : [0, 0, 0, ...]
},
"product2" : {
"data1" : [0, 0, 0, ...],
"data2" : [0, 0, 0, ...]
}
...
}
My case classes are:
case class Product(code: String, detail: Detail)
case class Detail(data1: List[Int], data2: List[Int])
But circe fails to parse it:
decode[Product](MyJson)
It returns a failure (Left).
Is there a more suitable case class? Or can circe derive a Decoder for a Map[String, Detail] mapping?
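Decoding the top-level object as a Map is probably the simplest route. Here is a minimal sketch, assuming circe-generic is on the classpath and MyJson is the JSON string from the question:
import io.circe.generic.auto._   // derives Decoder[Detail] automatically
import io.circe.parser.decode

case class Detail(data1: List[Int], data2: List[Int])
case class Product(code: String, detail: Detail)

// circe already provides Decoder[Map[String, V]] given a Decoder[V],
// so the top-level keys ("product1", "product2", ...) become map keys:
val parsed = decode[Map[String, Detail]](MyJson)

// If Product values are still wanted, convert the entries afterwards:
val products = parsed.map(_.map { case (code, detail) => Product(code, detail) }.toList)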
I have a simple dictionary:
var countOfR = ["R0": 0, "R1": 0, "R2": 0, "R3": 0, "R4": 0, "R5": 0, "R6": 0]
I need to check this dictionary against multiple conditions. For example, the next statement works perfectly:
for index in countOfR {
    if index == ("R0", 2) || index == ("R1", 2) || index == ("R2", 2) || index == ("R3", 2) || index == ("R4", 2) || index == ("R5", 2) || index == ("R6", 2) {
        type = "P"
    }
}
This will find one "pair". But next I need to check for "two pairs" - "PP". It's terrible to write something like this:
if index == ("R0",2) && index == ("R1",2) || index == ("R0",2) && index == ("R2",2) || index == ("R0",2) && index == ("R3",2) || index == ("R0",2) && index == ("R4",2) || index == ("R0",2) && index == ("R5",2) || index == ("R0",2) && index == ("R6",2) || ...
and so on... I also need to search for "pair and trine", "three pairs" and many others.
For better understanding:
["R0": 1, "R1": 2, "R2": 1, "R3": 1, "R4": 0, "R5": 1, "R6": 0] is "P",
["R0": 1, "R1": 0, "R2": 0, "R3": 1, "R4": 0, "R5": 2, "R6": 0] is "P" too,
["R0": 1, "R1": 0, "R2": 2, "R3": 1, "R4": 0, "R5": 2, "R6": 0] is "PP"
How can I solve this task? Please give me some advice!
Your "something like this" makes no sense, as its condition can never be true: index (or whatever you name it) cannot be equal to two different tuples at the same time.
I guess you just need to count the number of entries whose value is 2.
You can write something like this:
func getType(_ countOfR: [String: Int]) -> String {
    let pairs = countOfR.filter { $0.value == 2 }.count
    let trines = countOfR.filter { $0.value == 3 }.count
    let type = String(repeating: "P", count: pairs) + String(repeating: "T", count: trines)
    return type
}
print(getType(["R0": 1, "R1": 2, "R2": 1, "R3": 1, "R4": 0, "R5": 1, "R6": 0]))
//->P
print(getType(["R0": 1, "R1": 0, "R2": 0, "R3": 1, "R4": 0, "R5": 2, "R6": 0]))
//->P
print(getType(["R0": 1, "R1": 0, "R2": 2, "R3": 1, "R4": 0, "R5": 2, "R6": 0]))
//->PP
Create a dictionary with your conditions, and filter the main dictionary by comparing against the condition dictionary. The type is then created based on the filtered result count.
var condition = ["R0":2,"R1":2,"R2":2,"R3":2,"R4":2,"R5":2,"R6":2]
let test1 = ["R0": 1, "R1": 2, "R2": 1, "R3": 1, "R4": 0, "R5": 1, "R6": 0]// is "P",
let count1 = test1.filter { index in condition.contains(where: { $0 == index }) }.count
let type1 = String(repeating: "P", count: count1)
print(type1)//P
let test2 = ["R0": 1, "R1": 0, "R2": 0, "R3": 1, "R4": 0, "R5": 2, "R6": 0]// is "P" too,
let count2 = test2.filter { index in condition.contains(where: { $0 == index }) }.count
let type2 = String(repeating: "P", count: count2)
print(type2)//P
let test3 = ["R0": 1, "R1": 0, "R2": 2, "R3": 1, "R4": 0, "R5": 2, "R6": 0]// is "PP"
let count3 = test3.filter { index in condition.contains(where: { $0 == index }) }.count
let type3 = String(repeating: "P", count: count3)
print(type3)//PP
I have a dataframe with a key and a column containing an array of structs. Each row's array column looks something like this:
[
{"id" : 1, "someProperty" : "xxx", "someOtherProperty" : "1", "propertyToFilterOn" : 1},
{"id" : 2, "someProperty" : "yyy", "someOtherProperty" : "223", "propertyToFilterOn" : 0},
{"id" : 3, "someProperty" : "zzz", "someOtherProperty" : "345", "propertyToFilterOn" : 1}
]
Now I would like to do two things:
Filter on "propertyToFilterOn" = 1
Apply some logic to the other properties - for example, concatenate them
So that the result is:
[
{"id" : 1, "newProperty" : "xxx_1"},
{"id" : 3, "newProperty" : "zzz_345"}
]
I know how to do it with explode, but explode also requires a groupBy on the key when putting things back together. And since this is a streaming DataFrame, I would also have to put a watermark on it, which I am trying to avoid.
Is there any other way to achieve this without using explode? I am sure there is some Scala magic that can achieve this!
Thanks!
Spark 2.4+ introduced many higher-order functions for arrays (see https://docs.databricks.com/spark/2.x/spark-sql/language-manual/functions.html).
val dataframe = Seq(
("a", 1, "xxx", "1", 1),
("a", 2, "yyy", "223", 0),
("a", 3, "zzz", "345", 1)
).toDF( "grouping_key", "id" , "someProperty" , "someOtherProperty", "propertyToFilterOn" )
.groupBy("grouping_key")
.agg(collect_list(struct("id" , "someProperty" , "someOtherProperty", "propertyToFilterOn")).as("your_array"))
dataframe.select("your_array").show(false)
+----------------------------------------------------+
|your_array |
+----------------------------------------------------+
|[[1, xxx, 1, 1], [2, yyy, 223, 0], [3, zzz, 345, 1]]|
+----------------------------------------------------+
You can filter elements within an array using the filter higher-order function like this:
val filteredDataframe = dataframe.select(expr("filter(your_array, your_struct -> your_struct.propertyToFilterOn == 1)").as("filtered_arrays"))
filteredDataframe.show(false)
+----------------------------------+
|filtered_arrays |
+----------------------------------+
|[[1, xxx, 1, 1], [3, zzz, 345, 1]]|
+----------------------------------+
for the "other logic" your talking about you should be able to use the transform higher order array function like so:
val transformedDataframe = filteredDataframe
.select(expr("transform(filtered_arrays, your_struct -> struct(concat(your_struct.someProperty, '_', your_struct.someOtherProperty)))"))
But there are issues with returning structs from the transform function, as described in this post:
http://mail-archives.apache.org/mod_mbox/spark-user/201811.mbox/%3CCALZs8eBgWqntiPGU8N=ENW2Qvu8XJMhnViKy-225ktW+_c0czA#mail.gmail.com%3E
So you are best off using the Dataset API for the transform, like so:
case class YourStruct(id:String, someProperty: String, someOtherProperty: String)
case class YourArray(filtered_arrays: Seq[YourStruct])
case class YourNewStruct(id:String, newProperty: String)
val transformedDataset = filteredDataframe.as[YourArray].map(_.filtered_arrays.map(ys => YourNewStruct(ys.id, ys.someProperty + "_" + ys.someOtherProperty)))
transformedDataset.show(false)
+--------------------------+
|value |
+--------------------------+
|[[1, xxx_1], [3, zzz_345]]|
+--------------------------+
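As an aside, if you would rather stay with SQL expressions end to end, the named_struct function may sidestep the field-naming issue mentioned above. A sketch, assuming the same filteredDataframe (the alias new_structs is just an illustrative name):
// named_struct takes alternating field names and values,
// so the resulting struct fields are named explicitly:
val sqlTransformed = filteredDataframe.select(
  expr("transform(filtered_arrays, s -> named_struct('id', s.id, 'newProperty', concat(s.someProperty, '_', s.someOtherProperty)))")
    .as("new_structs"))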
Consider this collection
/* 1 */
{
"key" : 1,
"b" : 2,
"c" : 3
}
/* 2 */
{
"key" : 2,
"b" : 5,
"c" : 4
}
/* 3 */
{
"key" : 3,
"b" : 7,
"c" : 9
}
/* 4 */
{
"key" : 4,
"b" : 7,
"c" : 4
}
/* 5 */
{
"key" : 5,
"b" : 2,
"c" : 9
}
I want to use the $in operator and write a query to return the documents such that (b, c) IN ((2, 3), (7, 9)). It means "return all rows where b is 2 and c is 3 at the same time, OR b is 7 and c is 9 at the same time."
How can I use the $in operator with multiple attribute values?
If I use the following query
db.getCollection('test').find({
$and:[
{b:{$in:[2,7]}},
{c:{$in:[3,9]}}
]
})
then I get the following results:
(2,3)
(7,9)
(2,9) --> This is an unwanted record.
In the SQL world it is possible:
SELECT *
FROM demo
WHERE (b, c) IN ((2, 3), (7, 9))
What is the equivalent in MongoDB?
If I get it right, what your query is doing is matching all the pairs (2,3), (2,9), (7,3), (7,9).
But you want to match them one by one, so your valid pairs should be (2, 3) and (7, 9). To satisfy this, you can match b and c one by one, pair them, and "or" them after:
db.getCollection('test').find({
$or: [
{$and: [ {b : 2}, {c : 3} ]},
{$and: [ {b : 7}, {c : 9} ]}
]
})
Suppose I have the following stream of data:
1, 2, 3, a, 5, 6, b, 7, 8, a, 10, 11, b, 12, 13, ...
I want to filter everything between 'a' and 'b' (inclusive) no matter how many times they appear. So the result of the above would be:
1, 2, 3, 7, 8, 12, 13, ...
How can I do this with ReactiveX?
Use scan with initial value b to turn
1, 2, 3, a, 5, 6, b, 7, 8, a, 10, 11, b, 12, 13, ...
into
b, 1, 2, 3, a, a, a, b, 7, 8, a, a, a, b, 12, 13, ...
and then filter out a and b to get
1, 2, 3, 7, 8, 12, 13, ...
In pseudo code
values.scan('b', (s, v) -> if (v == 'a' || v == 'b' || s != 'a') v else s).
filter(v -> v != 'a' && v != 'b');
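To make that concrete, here is the same state machine run over a plain Scala list; scanLeft plays the role of scan, and the corresponding Rx operators on an Observable behave analogously:
val values = List("1", "2", "3", "a", "5", "6", "b", "7", "8", "a", "10", "11", "b", "12", "13")

val result = values
  .scanLeft("b") { (s, v) => if (v == "a" || v == "b" || s != "a") v else s }
  .filter(v => v != "a" && v != "b")
// the seed "b" is emitted first by scanLeft and removed by the filter
// result: List(1, 2, 3, 7, 8, 12, 13)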
OK. I'm posting this in case anyone else needs an answer to it. A slightly different setup than I described above just to make it easier to understand.
List<String> values = new List<string>()
{
"1", "2", "3",
"a", "5", "6", "b",
"8", "9", "10", "11",
"a", "13", "14", "b",
"16", "17", "18", "19",
"a", "21", "22", "b",
"24"
};
var aa =
// Create an array of CSV strings split on the terminal sigil value
String.Join(",", values.ToArray())
.Split(new String[] { "b," }, StringSplitOptions.None)
// Create the observable from this array of CSV strings
.ToObservable()
// Now create an Observable from each element, splitting it up again
// It is no longer a CSV string but the original elements up to each terminal value
.Select(s => s.Split(',').ToObservable()
// From each value in each observable take those elements
// up to the initial sigil
.TakeWhile(s1 => !s1.Equals("a")))
// Concat the output of each individual Observable - in order
// SelectMany won't work here since it could interleave the
// output of the different Observables created above.
.Concat();
aa.Subscribe(s => Console.WriteLine(s));
This prints out:
1
2
3
8
9
10
11
16
17
18
19
24
It is a bit more convoluted than I wanted but it works.
Edit 6/3/17:
I actually found a cleaner solution for my case.
List<String> values = new List<string>()
{
"1", "2", "3",
"a", "5", "6", "b",
"8", "9", "10", "11",
"a", "13", "14", "b",
"16", "17", "18", "19",
"a", "21", "22", "b",
"24"
};
string lazyABPattern = @"a.*?b";
Regex abRegex = new Regex(lazyABPattern);
var bb = values.ToObservable()
.Aggregate((s1, s2) => s1 + "," + s2)
.Select(s => abRegex.Replace(s, ""))
// Removing "a...b" leaves empty slots between commas, so drop them when splitting
.Select(s => s.Split(new[] { ',' }, StringSplitOptions.RemoveEmptyEntries).ToObservable())
.Concat();
bb.Subscribe(s => Console.WriteLine(s));
The code is simpler which makes it easier to follow (even though it uses regexes).
The problem here is that it still isn't really a general solution to the problem of removing 'repeated regions' of a data stream. It relies on converting the stream to a single string, operating on the string, then converting it back to some other form. If anyone has any ideas on how to approach this in a general way I would love to hear about it.