how to find length of RDD in Spark [duplicate] - scala

This question already has answers here:
How to find spark RDD/Dataframe size?
(3 answers)
Closed 5 years ago.
How can I find the length of the below RDD?
var mark = sc.parallelize(List(1,2,3,4,5,6))
scala> mark.map(l => l.length).collect
<console>:27: error: value length is not a member of Int
mark.map(l => l.length).collect

First you should clarify what you want exactly. In your example you are running a map function, so it looks like you are trying to get the length of each of the fields of the RDD, not the RDD size.
sc.textFile loads everything as Strings, so you can call the length method on each of the fields. parallelize, on the other hand, keeps the elements as Ints because your list is made of integers.
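For example, a quick sketch of the difference (the HDFS path here is just a placeholder):
// textFile gives an RDD[String], so .length works on each element
val fromFile = sc.textFile("hdfs://path/to/some/file.txt")
fromFile.map(line => line.length).collect
// parallelize(List(1, 2, 3, ...)) gives an RDD[Int], which has no .length
val fromList = sc.parallelize(List(1, 2, 3, 4, 5, 6))
fromList.map(n => n.toString.length).collect // works after converting to String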
If you want the size of an RDD, you should run count on the RDD, not on each field:
mark.count()
This will return 6.
If you want the size of each element, you can convert them to String first, though it looks like an odd requirement. It would be something like this:
mark.map(l => l.toString.length).collect

Related

How to remove duplicate elements from an array in Swift 5? [duplicate]

This question already has answers here:
Removing duplicate elements from an array in Swift
(49 answers)
How to merge two custom type arrays and remove duplicate custom type items in Swift?
(2 answers)
Remove objects with duplicate properties from Swift array
(16 answers)
Closed 3 years ago.
I have an array of custom objects, a.k.a. users, and some of them are duplicates.
How can I make sure there is only one of each element? No duplicates.
Also, what's the most efficient way?
var users: [UserModel] = [UserModel]()
Most efficient way, if you don't care about maintaining the original order in the array (note that UserModel needs to conform to Hashable for this to work):
let uniqueUsers = Array(Set(users))
Maybe instead of using an array, you want to use a Set? https://developer.apple.com/documentation/swift/set

How to print last n lines of a dstream in spark streaming?

Spark Streaming's DStream print() displays the first 10 lines, like:
val fileDstream = ssc.textFileStream("hdfs://localhost:9000/abc.txt")
fileDstream.print()
Is there a way to get the last n lines, considering that the text file is large in size and unsorted?
If you just want the last element, you could do:
fileDstream.foreachRDD { rdd =>
  rdd.collect().last
}
However, this has the problem of collecting all data to the driver.
Is your data sorted? If so, you could reverse the sort and take the first. Alternatively, a hacky implementation might involve a mapPartitionsWithIndex that returns an empty iterator for all partitions except for the last. For the last partition, you would filter all elements except for the last element in your iterator. This should leave one element, which is your last element.
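A rough sketch of that mapPartitionsWithIndex idea, assuming an RDD[String] called rdd (it only yields the single last element, and it assumes the last partition is non-empty):
val lastIdx = rdd.getNumPartitions - 1
val lastLine = rdd.mapPartitionsWithIndex { (idx, iter) =>
  if (idx == lastIdx) {
    // walk the iterator and keep only its final element
    var last: Option[String] = None
    while (iter.hasNext) last = Some(iter.next())
    last.iterator
  } else Iterator.empty
}.collect().headOption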
Or you can also try:
fileDstream.foreachRDD { rdd =>
  rdd.top(10)(reverseOrdering)
}
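Note that reverseOrdering is not defined anywhere above; top takes an Ordering, so you have to supply one yourself. Assuming the DStream contains Strings, something like:
fileDstream.foreachRDD { rdd =>
  // top(10) with a reversed Ordering returns the 10 smallest lines by string
  // ordering, not the last 10 lines by position in the file
  rdd.top(10)(Ordering[String].reverse).foreach(println)
}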

How to loop through tuple in scala [duplicate]

This question already has answers here:
Scala: How to convert tuple elements to lists
(5 answers)
Closed 5 years ago.
I have a tuple in scala
val captainStuff = ("One", "Two", "Three", "Four", "Five")
How can I iterate through it with a for loop? It's easy to loop through a list or a map, but how do I loop through a tuple?
Thanks!!
You can convert it to an iterator like:
val captainStuff = ("One", "Two", "Three", "Four", "Five")
captainStuff.productIterator.foreach(x => {
  println(x)
})
This question is a duplicate btw:
Scala: How to convert tuple elements to lists
How can I iterate through it with a for loop? It's easy to loop through a list or a map, but how do I loop through a tuple?
Lists and maps are collections. Tuples are not. Iterating (aka "looping through") really only makes sense for collections, which tuples aren't.
Tuples are product types. They are a way of grouping multiple values of different types together into a single structure. Considering that the fields of a tuple may have different types, how exactly would you iterate over it? What would be the type of your element variable?
If you are familiar with other languages, you may be familiar with the concept of records (e.g. RECORD in Pascal or struct in C). Tuples are kind of like them, except the fields don't have names. How do you iterate over a record in Pascal or a struct in C? You don't, it makes no sense.
In fact, you can think of an object as a record. Again, how do you iterate over the fields of an object? You don't, it makes no sense.
Note #1: Yes, sometimes it does make sense to iterate over the fields of an object, iff you are doing reflective metaprogramming.
Note #2: In Scala, tuples inherit from Product, which has a non-typesafe productIterator method that gives you an Iterator[Any] which allows you to iterate over a tuple, but without type-safety. Just don't do it.
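For illustration, here is the untyped iterator next to ordinary (type-safe) positional access:
val captainStuff = ("One", "Two", "Three", "Four", "Five")
// productIterator: every element comes back as Any
val untyped: Iterator[Any] = captainStuff.productIterator
// positional access: each field keeps its own static type
val first: String = captainStuff._1
val last: String = captainStuff._5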
tl;dr: tuples are not collections. You simply don't iterate over them. Period. If you think you have to, you're doing something wrong, i.e. you shouldn't have a tuple but maybe an array or a list instead.

How can I apply a filter in MongoDB find operation to check empty string? [duplicate]

This question already has answers here:
Test empty string in mongodb and pymongo
(4 answers)
Closed 5 years ago.
I am working on an application built in PHP using MongoDB as a database.
Data is organized as BSON documents in a collection in MongoDB.
I need to retrieve only those documents where a field containing a string value is non-empty. I searched for functions equivalent to PHP's empty and strlen but did not find any relevant results.
Try this (it excludes documents where someField is missing, null, or an empty string):
$cursor = $collection->find(array("someField" => array('$nin' => array(null, ""))));

Reduction of a RDD to the collection of its values [duplicate]

This question already has an answer here:
ways to replace groupByKey in apache Spark
(1 answer)
Closed 6 years ago.
I have an RDD of (key, value) pairs where the value is a case class. I need to reduce this RDD to (key, ArrayBuffer(values)). Based on the comments below, the typical way is using the groupByKey method; however, I wanted to know if I can do this with reduceByKey, as it is a more efficient way according to this article.
// Assuming pairRdd is the RDD that contains the (key, value) pairs, then
val groupedPairRDD = pairRdd.groupByKey
The output groupedPairRDD is your expected output. It contains the collection of values for each key.
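If you do want to avoid groupByKey, here is a rough sketch using aggregateByKey instead (pairRdd and the value type Value are assumed names; the gain over groupByKey is mainly that buffers are merged map-side before the shuffle):
import scala.collection.mutable.ArrayBuffer

val grouped = pairRdd.aggregateByKey(ArrayBuffer.empty[Value])(
  (buf, v) => buf += v,   // add a value to the partition-local buffer
  (b1, b2) => b1 ++= b2   // merge buffers coming from different partitions
)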