Spark Streaming's DStream print() displays the first 10 elements, like this:
val fileDstream = ssc.textFileStream("hdfs://localhost:9000/abc.txt")
fileDstream.print()
Is there a way to get the last n lines, considering that the text file is large and unsorted?
If you just want the last element, you could simply do:
fileDstream.foreachRDD { rdd =>
  val data = rdd.collect()            // brings the whole RDD to the driver
  if (data.nonEmpty) println(data.last)
}
However, this has the problem of collecting all data to the driver.
Is your data sorted? If so, you could reverse the sort and take the first element. Alternatively, a hacky implementation might use mapPartitionsWithIndex, returning an empty iterator for every partition except the last. For the last partition, you would drop all elements except the last one in the iterator, leaving a single element: your last element.
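Something like this (purely illustrative; it assumes the DStream carries Strings and that partition order follows file order):
fileDstream.foreachRDD { rdd =>
  val lastPartition = rdd.getNumPartitions - 1
  val lastElement = rdd.mapPartitionsWithIndex { (idx, iter) =>
    if (idx == lastPartition)
      iter.foldLeft(Option.empty[String])((_, e) => Some(e)).iterator  // keep only the last element of the last partition
    else
      Iterator.empty
  }.collect()
  lastElement.foreach(println)
}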
Or you can also try:
fileDstream.foreachRDD { rdd =>
  // top(n) with a reversed Ordering returns the n lowest elements instead of the highest
  rdd.top(10)(Ordering[String].reverse).foreach(println)
}
I understand that take(n) will return n elements of an RDD, but how does Spark decide which partitions to pull those elements from, and which elements are chosen?
Does it maintain indexes internally on the driver?
In the take(n) method of RDD, Spark starts scanning for elements from the first partition. If there are not enough elements in it, Spark increases the number of partitions to scan. As for which elements are taken, that is determined by the following line:
val res = sc.runJob(this, (it: Iterator[T]) => it.take(left).toArray, p)
The take(n) method of Scala's Iterator "Selects first n values of this iterator" (per the Scaladoc), so elements are selected from the front of the iterator.
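A small illustration of this behaviour (the data and partition count here are just an example, not from the original answer):
val rdd = sc.parallelize(1 to 100, numSlices = 4)
rdd.take(5)   // Array(1, 2, 3, 4, 5): served entirely from the front of the first partition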
I would like to create an RDD which contains String elements. Alongside these elements I would like a number indicating the index of each element. However, I do not want this number to change if I remove elements, as I want the number to remain the original index (thus preserving it). It is also important that the order is preserved in this RDD.
If I use zipWithIndex and thereafter remove some elements, will the indexes change? Which function/structure can I use to have unchanged indexes? I was thinking of creating a Pair RDD, however my input data does not contain indexes.
Answering rather than deleting: my problem was easily solved by zipWithIndex, which fulfilled all my requirements.
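A quick sketch of why zipWithIndex works here (the data is illustrative): the index is attached to each element when zipWithIndex runs, so removing elements afterwards does not renumber the rest.
val rdd = sc.parallelize(Seq("a", "b", "c", "d"))
val indexed = rdd.zipWithIndex()                          // (a,0), (b,1), (c,2), (d,3)
val filtered = indexed.filter { case (s, _) => s != "b" } // drop an element
filtered.collect()                                        // Array((a,0), (c,2), (d,3)) -- original indexes kept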
I am new to Spark and Scala. I need to order my result count tuples, which look like (course, count), in descending order. I tried the following:
val results = ratings.countByValue()
val sortedResults = results.toSeq.sortBy(_._2)
But it isn't working as I need: this sorts the results by count in ascending order, but I need descending order. Can anybody please help me?
The results currently come out like this:
(History, 12100),
(Music, 13200),
(Drama, 143000)
But I need to display them like this:
(Drama, 143000),
(Music, 13200),
(History, 12100)
thanks
You have almost done it! You just need to sort in descending order. Note that countByValue() returns a Map on the driver, not an RDD, so sort the resulting sequence with the ordering reversed:
val results = ratings.countByValue()
val sortedResults = results.toSeq.sortBy(-_._2)
// Just to display the sorted results
println(sortedResults.toList)
(If you keep the counts as an RDD instead, RDD's sortBy() takes an ascending flag that defaults to true, so you would pass false for descending order.)
You can use:
.sortWith(_._2 > _._2)
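For example, applied to the question's countByValue result (just a sketch):
val sortedResults = ratings.countByValue().toSeq.sortWith(_._2 > _._2)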
Most of the time, pulling the results onto the driver like this (countByValue plus toSeq) is not a good idea, because the driver needs to hold everything in memory and you might run out of memory on larger data sets. I guess this is OK for an intro to Spark.
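A sketch of keeping the work on the cluster instead (assuming ratings is an RDD of course names; only the rows you actually take reach the driver):
val sortedOnCluster = ratings
  .map(course => (course, 1L))          // count each course
  .reduceByKey(_ + _)
  .sortBy(_._2, ascending = false)      // descending by count
sortedOnCluster.take(10).foreach(println)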
For example, if someRDD is a pair RDD and the value is comparable, you can do this:
someRDD.sortBy(_._2, ascending = false)
Note: the second argument is the ascending flag; passing false gives descending order.
In Scala I have a list of tuples List[(String, String)]. So now from this list I want to find how many times each unique tuple appears in the list.
One way to do this would be to apply groupBy(x => x) and then find the length of each group. But my data set is quite large and it's taking a lot of time.
So is there any better way of doing this?
I would do the counting manually, using a Map. Iterate over your collection/list and build a count map as you go. Keys in the count map are the unique items from the original collection/list, and values are the number of occurrences of the key. If the item being processed is already in the count map, increase its value by 1; if not, add it with value 1. You can use getOrElse:
count(currentItem) = count.getOrElse(currentItem, 0) + 1
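A minimal, self-contained sketch of this approach (the names are illustrative):
import scala.collection.mutable

def countOccurrences(items: List[(String, String)]): mutable.Map[(String, String), Int] = {
  val count = mutable.Map.empty[(String, String), Int]
  for (item <- items) {
    count(item) = count.getOrElse(item, 0) + 1   // increment, starting from 0 for unseen items
  }
  count
}
// countOccurrences(List(("a", "b"), ("a", "b"), ("c", "d")))
// => Map(("a","b") -> 2, ("c","d") -> 1)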
This should work faster than groupBy followed by a length check, and it will also require less memory.
For other suggestions, check also this discussion.
I am running a Spark job, and at some point I want to connect to an Elasticsearch server to get some data and add it to an RDD. The code I am using looks like this:
input.mapPartitions(records => {
  val elcon = new ElasticSearchConnection
  val client: TransportClient = elcon.openConnection()
  val newRecs = records.flatMap(record => {
    val response = client.prepareGet("index", "indexType",
      record.id.toString).execute().actionGet()
    val newRec = processRec(record, response)
    newRec
  }) // end of flatMap
  client.close()
  newRecs
}) // end of mapPartitions
My problem is that the client.close() call runs before the flatMap operation has finished (records is an Iterator, so the flatMap is evaluated lazily), which of course results in an Exception. The code works if I move the creation and closing of the connection inside the flatMap, but that would generate a huge number of connections. Is it possible to make sure that client.close() is called after the flatMap operation has finished?
Making a blocking call for each item in your RDD to fetch the corresponding Elasticsearch document is causing the problem. It is generally advised to avoid blocking calls.
There is an alternative approach using Elasticsearch-Hadoop's Spark support:
Read the Elasticsearch index/type as another RDD and join it with your RDD.
Include the right version of the ES-Hadoop dependency.
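Something along these lines for sbt (the artifact name and version here are an assumption; pick the ones matching your Spark, Scala, and Elasticsearch versions):
// build.sbt (illustrative only: verify the artifact/version for your setup)
libraryDependencies += "org.elasticsearch" %% "elasticsearch-spark-20" % "7.10.2"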
import org.elasticsearch.spark._

val esRdd = sc.esRDD("index/indexType") // returns a pair RDD of (_id, Map of all key/value pairs for all fields)
val inputPairs = input.map(record => (record.id.toString, record)) // convert your RDD of records to a pair RDD of (id, record), since we want to join on the id
val processed = inputPairs.join(esRdd).map(rec => processResponse(rec._2._1, rec._2._2)) // join the two RDDs on the id; the join yields (id, (inputRddRecord, esRddRecord))
Hope this helps.
PS: It will still not alleviate the problem of the lack of co-location (i.e. each document with a given _id may come from a different shard of the index). A better approach would have been to achieve co-location of the input RDD and the ES index's documents at the time of creating the ES index.