Distinct() command used with skip() and limit() - mongodb

I have those items in my MongoDB collection:
{x: 1, y: 60, z:100}
{x: 1, y: 60, z:100}
{x: 1, y: 60, z:100}
{x: 2, y: 60, z:100}
{x: 2, y: 60, z:100}
{x: 3, y: 60, z:100}
{x: 4, y: 60, z:100}
{x: 4, y: 60, z:100}
{x: 5, y: 60, z:100}
{x: 6, y: 60, z:100}
{x: 6, y: 60, z:100}
{x: 6, y: 60, z:100}
{x: 7, y: 60, z:100}
{x: 7, y: 60, z:100}
I want to query the distinct values of x (i.e. [1, 2, 3, 4, 5, 6, 7]), but I only want a part of them (similar to what we can obtain with skip(a) and limit(b)).
How do I do that with the MongoDB Java driver (or with Spring Data MongoDB, if possible)?

In the mongo shell this is simple with the aggregation framework:
db.collection.aggregate([{$group:{_id:'$x'}}, {$skip:3}, {$limit:5}])
For Java, see: use aggregation framework in java
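What that pipeline computes can be sketched in plain Scala on the sample x values from the question (assuming ascending order; on the server you would add a $sort stage to get a predictable order):

```scala
// x values from the sample documents in the question
val xs = List(1, 1, 1, 2, 2, 3, 4, 4, 5, 6, 6, 6, 7, 7)

// equivalent of $group on x, then sort, $skip: 3, $limit: 5
val page = xs.distinct.sorted.slice(3, 3 + 5)
// page == List(4, 5, 6, 7) -- only four distinct values remain after skipping 3
```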

Depending on your use case, you may find this approach to be more performant than aggregation. Here's a mongo shell example function.
function getDistinctValues(skip, limit) {
    var q = {x: {$gt: MinKey()}}; // query
    var s = {x: 1}; // sort key
    var results = [];
    for (var i = 0; i < skip; i++) {
        var result = db.test.find(q).limit(1).sort(s).toArray()[0];
        if (!result) {
            return results;
        }
        q.x.$gt = result.x;
    }
    for (var i = 0; i < limit; i++) {
        var result = db.test.find(q).limit(1).sort(s).toArray()[0];
        if (!result) {
            break;
        }
        results.push(result.x);
        q.x.$gt = result.x;
    }
    return results;
}
We are basically just finding the values one at a time, using the query and sort to skip past values we have already seen. You can easily improve on this by adding more arguments to make the function more flexible. Also, creating an index on the property you want to find distinct values for will improve performance.
A less obvious improvement is to skip the "skip" phase altogether and specify a value to continue from. Here's a mongo shell example function.
function getDistinctValues(limit, lastValue) {
    var q = {x: {$gt: lastValue === undefined ? MinKey() : lastValue}}; // query
    var s = {x: 1}; // sort key
    var results = [];
    for (var i = 0; i < limit; i++) {
        var result = db.test.find(q).limit(1).sort(s).toArray()[0];
        if (!result) {
            break;
        }
        results.push(result.x);
        q.x.$gt = result.x;
    }
    return results;
}
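The continue-from-lastValue idea is not MongoDB-specific; a plain-Scala sketch of the same pagination logic (names and data are illustrative, not from the driver):

```scala
// Page through distinct values, resuming after lastValue instead of skipping.
def nextDistinctPage(xs: List[Int], lastValue: Option[Int], limit: Int): List[Int] = {
  val sortedDistinct = xs.distinct.sorted
  // drop everything already returned, mirroring the {$gt: lastValue} query
  lastValue.fold(sortedDistinct)(lv => sortedDistinct.dropWhile(_ <= lv)).take(limit)
}

val xs = List(1, 1, 1, 2, 2, 3, 4, 4, 5, 6, 6, 6, 7, 7)
val first = nextDistinctPage(xs, None, 3)     // List(1, 2, 3)
val second = nextDistinctPage(xs, Some(3), 3) // List(4, 5, 6)
```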
If you do decide to go with the aggregation technique, make sure you add a $sort stage after the $group stage. Otherwise your results will not show up in a predictable order.

Related

reduce from PySpark RDD returns tuple

data = data.withColumn('n', F.lit(10))
result = data.select('n').rdd.reduce(lambda x, y: x + y)
print(result)
The above code returns the following output, but I was expecting a single value: the sum.
output:
(10, 10, 10, 10, 10, 10, 10, 10, .....)

What will be the result for the dropWhile in scala

I am using dropWhile in Scala; below is my problem.
Problem:
val list = List(87, 44, 5, 4, 200)
list.dropWhile(_ < 100) should be(/*result*/)
My Answer:
val list = List(87, 44, 5, 4, 200)
list.dropWhile(_ < 100) should be(List(44,5,4,200))
As per the documentation, dropWhile will continually drop elements until the predicate is no longer satisfied.
In my list the first element satisfies the predicate, so I removed the first element from the list.
val list = List(87, 44, 5, 4, 200)
list.dropWhile(_ < 100) should be(/*result*/)
I am expecting a result of List(44,5,4,200)
But it is not.
You are kind of going in the wrong direction. The head of the list is 87. The next element is 44, etc. dropWhile will continue to drop elements from the list until it hits that 200. If you initialize the list with more elements to the right of the 200, say
val list = List(87, 44, 5, 4, 200, 54, 60)
Then list.dropWhile(_ < 100) will return List(200, 54, 60).
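A quick runnable check of both cases, using the question's data:

```scala
val list = List(87, 44, 5, 4, 200)
// 87, 44, 5, 4 all satisfy the predicate and are dropped; 200 stops it
println(list.dropWhile(_ < 100)) // List(200)

// elements to the right of 200 survive, even those below 100
val longer = List(87, 44, 5, 4, 200, 54, 60)
println(longer.dropWhile(_ < 100)) // List(200, 54, 60)
```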

In scala, can a variable changed in a for-loop be used outside the loop? [closed]

Closed. This question is not reproducible or was caused by typos. It is not currently accepting answers.
Closed 5 years ago.
I'm updating a variable within a for-loop in Scala that I need outside the loop. I tested the code below and got the message "ERROR: undefined". The variable is not empty; inside the loop the values are returned. Thank you.
val example = List(0, 0, 1, 0.7, 10, 2, 5, 7, 4, 1, -9, 0, 0, 0, 0, 3, 3, 0, 0, 0, -80, -6.6, -1, 0)
var b = scala.collection.mutable.MutableList.empty[Double]
var b_val: Double = 0
for (i <- 1 to 24) {
  if (example(i) != 0) { b_val = b_val + example(i) } else { b_val = 0 }
  b += b_val
}
println(b)
You're getting an error because example only has 24 items. Therefore example(i) will throw an IndexOutOfBoundsException when i is 24.
You are trying to add b_val, which is a Double, to an AnyVal, which won't work.
So to make the addition work you need to cast the elements of the List to Double (or define example as List[Double]). Also, your iteration won't work: you are iterating from 1 to 24, which will blow up because list indices run from 0 to 23.
val example = List(0, 0, 1, 0.7, 10, 2, 5, 7, 4, 1, -9, 0, 0, 0, 0, 3, 3, 0, 0, 0, -80, -6.6, -1, 0)
var b = scala.collection.mutable.MutableList.empty[Double]
var b_val: Double = 0
for (i <- 0 until example.length) {
  if (example(i) != 0) {
    b_val = b_val + example(i).asInstanceOf[Double]
  } else {
    b_val = 0
  }
  b += b_val
}
println(b)
Result is : MutableList(0.0, 0.0, 1.0, 1.7, 11.7, 13.7, 18.7, 25.7, 29.7, 30.7, 21.7, 0.0, 0.0, 0.0, 0.0, 3.0, 6.0, 0.0, 0.0, 0.0, -80.0, -86.6, -87.6, 0.0)
But only you know what you are trying to achieve. A small refactor I would make, doing it in a more Scala-like way:
val example = List(0, 0, 1, 0.7, 10, 2, 5, 7, 4, 1, -9, 0, 0, 0, 0, 3, 3, 0, 0, 0, -80, -6.6, -1, 0)
var b = scala.collection.mutable.MutableList.empty[Double]
var b_val: Double = 0
example.foreach { elem =>
  if (elem != 0) {
    b_val = b_val + elem.asInstanceOf[Double]
  } else {
    b_val = 0
  }
  b += b_val
}
println(b)
I think, you are looking for something like this:
val b = example.scanLeft(0.0) {
  case (_, 0) => 0
  case (l, r) => l + r
}
If you are going to be writing Scala code anyway, learn to do it the Scala way. There is no point otherwise.
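Note that scanLeft also emits the initial accumulator, so its result has one extra leading element compared to the loop version; drop it with tail if you need an exact match. A quick check on a short variant of the data:

```scala
val sample = List(0.0, 1.0, 2.0, 0.0, 3.0)
val sums = sample.scanLeft(0.0) {
  case (_, 0) => 0 // a zero in the list resets the running sum
  case (l, r) => l + r
}
// sums == List(0.0, 0.0, 1.0, 3.0, 0.0, 3.0); the leading 0.0 is the seed
```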

One hot encoding in RDD in scala

I have a user data from movielense ml-100K dataset.
Sample rows are -
1|24|M|technician|85711
2|53|F|other|94043
3|23|M|writer|32067
4|24|M|technician|43537
5|33|F|other|15213
I have read data as RDD as follows-
scala> val user_data = sc.textFile("/home/user/Documents/movielense/ml-100k/u.user").map(x=>x.split('|'))
user_data: org.apache.spark.rdd.RDD[Array[String]] = MapPartitionsRDD[5] at map at <console>:29
scala> user_data.take(5)
res0: Array[Array[String]] = Array(Array(1, 24, M, technician, 85711), Array(2, 53, F, other, 94043), Array(3, 23, M, writer, 32067), Array(4, 24, M, technician, 43537), Array(5, 33, F, other, 15213))
# encode distinct profession with zipWithIndex -
scala> val indexed_profession = user_data.map(x=>x(3)).distinct().sortBy[String](x=>x).zipWithIndex()
indexed_profession: org.apache.spark.rdd.RDD[(String, Long)] = ZippedWithIndexRDD[18] at zipWithIndex at <console>:31
scala> indexed_profession.collect()
res1: Array[(String, Long)] = Array((administrator,0), (artist,1), (doctor,2), (educator,3), (engineer,4), (entertainment,5), (executive,6), (healthcare,7), (homemaker,8), (lawyer,9), (librarian,10), (marketing,11), (none,12), (other,13), (programmer,14), (retired,15), (salesman,16), (scientist,17), (student,18), (technician,19), (writer,20))
I want to do one hot encoding for Occupation column.
Expected output is -
userId Age Gender Occupation Zipcodes technician other writer
1 24 M technician 85711 1 0 0
2 53 F other 94043 0 1 0
3 23 M writer 32067 0 0 1
4 24 M technician 43537 1 0 0
5 33 F other 15213 0 1 0
How do I achieve this on an RDD in Scala? I want to perform the operation on the RDD without converting it to a DataFrame.
Any help is appreciated, thanks.
I did this in the following way:
1) Read user data -
scala> val user_data = sc.textFile("/home/user/Documents/movielense/ml-100k/u.user").map(x=>x.split('|'))
2) show 5 rows of data-
scala> user_data.take(5)
res0: Array[Array[String]] = Array(Array(1, 24, M, technician, 85711), Array(2, 53, F, other, 94043), Array(3, 23, M, writer, 32067), Array(4, 24, M, technician, 43537), Array(5, 33, F, other, 15213))
3) Create map of profession by indexing-
scala> val indexed_profession = user_data.map(x=>x(3)).distinct().sortBy[String](x=>x).zipWithIndex().collectAsMap()
scala> indexed_profession
res35: scala.collection.Map[String,Long] = Map(scientist -> 17, writer -> 20, doctor -> 2, healthcare -> 7, administrator -> 0, educator -> 3, homemaker -> 8, none -> 12, artist -> 1, salesman -> 16, executive -> 6, programmer -> 14, engineer -> 4, librarian -> 10, technician -> 19, retired -> 15, entertainment -> 5, marketing -> 11, student -> 18, lawyer -> 9, other -> 13)
4) Create an encode function which does one-hot encoding of the profession, looking the index up in the indexed_profession map from step 3 -
scala> def encode(x: String) = {
     |   val encodeArray = Array.fill(21)(0)
     |   encodeArray(indexed_profession(x).toInt) = 1
     |   encodeArray
     | }
5) Apply encode function to user data -
scala> val encode_user_data = user_data.map{ x => (x(0),x(1),x(2),x(3),x(4),encode(x(3)))}
6) show encoded data -
scala> encode_user_data.take(6)
res71: Array[(String, String, String, String, String, Array[Int])] = Array(
  (1,24,M,technician,85711,Array(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0)),
  (2,53,F,other,94043,Array(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0)),
  (3,23,M,writer,32067,Array(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1)),
  (4,24,M,technician,43537,Array(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0)),
  (5,33,F,other,15213,Array(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0)),
  (6,42,M,executive,98101,Array(0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0)))
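The index-map-plus-array logic above can be verified without a Spark cluster on a plain List; a sketch with the first three sample rows (the real code applies the same steps per RDD element):

```scala
val rows = List(
  Array("1", "24", "M", "technician", "85711"),
  Array("2", "53", "F", "other", "94043"),
  Array("3", "23", "M", "writer", "32067")
)

// profession -> index, as in step 3 (distinct + sort + zipWithIndex)
val index = rows.map(_(3)).distinct.sorted.zipWithIndex.toMap

def encode(p: String): Array[Int] = {
  val a = Array.fill(index.size)(0)
  a(index(p)) = 1 // set the slot for this profession
  a
}

val encoded = rows.map(r => (r(0), r(3), encode(r(3)).toList))
// sorted order here is other, technician, writer, so
// encoded.head == ("1", "technician", List(0, 1, 0))
```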
[My solution is for DataFrames] The snippet below should help in converting categorical columns to one-hot. You have to create a catMap object with column names as keys and lists of categories as values.
var OutputDf = df
for (cat <- catMap.keys) {
  val categories = catMap(cat)
  for (oneHotVal <- categories) {
    OutputDf = OutputDf.withColumn(oneHotVal,
      when(lower(OutputDf(cat)) === oneHotVal, 1).otherwise(0))
  }
}
OutputDf

scalaquery : Dynamic BatchInsert

The FirstExample in the scalaquery-examples project provides an example of batch insert with the following syntax:
Coffees.insertAll(
  ("Colombian", 101, 7.99, 0, 0),
  ("French_Roast", 49, 8.99, 0, 0),
  ("Espresso", 150, 9.99, 0, 0),
  ("Colombian_Decaf", 101, 8.99, 0, 0),
  ("French_Roast_Decaf", 49, 9.99, 0, 0)
)
How is it possible to pass a dynamically constructed List of tuples to the insertAll method, given that for this example the function definition is:
def insertAll(values: (String, Int, Double, Int, Int)*)(implicit session: org.scalaquery.session.Session): Option[Int]
You can expand your List into variable-length arguments like this:
insertAll(tuplesList.toSeq:_*)
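The :_* ascription is plain Scala varargs expansion, nothing ScalaQuery-specific. A minimal sketch with a stand-in that has the same varargs shape as insertAll (a List is already a Seq, so toSeq is optional):

```scala
// stand-in with the same repeated-parameter shape as insertAll
def insertAll(values: (String, Int, Double)*): Int = values.size

val tuplesList = List(
  ("Colombian", 101, 7.99),
  ("French_Roast", 49, 8.99)
)

val inserted = insertAll(tuplesList: _*) // expands the list into varargs
// inserted == 2
```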