I have the following movies data, shown below.
I should get the count of movies in each year, e.g. (2002,2) and (2004,1):
Littlefield, John (I) x House 2002
Houdyshell, Jayne demon State 2004
Houdyshell, Jayne mall in Manhattan 2002
val data = sc.textFile("..line to file")
val dataSplit = data.map(line => { val d = line.split("\t"); (d(0), d(1), d(2)) })
What I am unable to understand is this: when I use dataSplit.take(2).foreach(println), I see that d(0) is the first two columns, "Littlefield, John (I)", which are the first name and last name; d(1) is the movie name, such as "x House"; and d(2) is the year. How can I get the count of movies in each year?
Use reduceByKey on the mapped tuples, like this:
val dataSplit = data
.map(line => {var d = line.split("\t"); (d(2), 1)}) // (2002, 1)
.reduceByKey((a, b) => a + b)
// .collect() gives the result: Array((2004,1), (2002,2))
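As an aside, if you only need the counts locally on the driver, countByKey is a shorter alternative (a sketch assuming the same tab-separated layout); it is an action and returns a Scala Map rather than an RDD:

// countByKey is an action: it counts how many times each key occurs and
// returns a plain Scala Map[String, Long] on the driver.
val countsByYear = data
  .map(line => (line.split("\t")(2), 1))
  .countByKey()   // e.g. Map(2002 -> 2, 2004 -> 1)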
I did a MapReduce in Scala that counts the terms in book titles. I want to output both the term and the count, but I only get the count using:
println("max term :" +wordCount.reduce( (a,b)=> ("max", a._2 max b._2))._2)
I was wondering how to also include the term.
Thank you.
Example:
("The", 5)
("Of", 8)
("is", 10)
…
My current code gives me the maximum count, but I don't know how to include the corresponding term.
Initial code:
val inputPR2Q1 = sc.textFile("/root/pagecounts-20160101-000000")
val titlecolumn = inputPR2Q1.map(line => line.split(" ")(1))
val wordCount = titlecolumn.flatMap(line => line.split("_")).map(word => (word,1)).reduceByKey(_ + _);
Here I take a file containing book titles along with other data. I extract the book titles alone and do a MapReduce to count each term in the titles.
Use .sortBy with ascending = false and take(1) on the RDD:
sc.textFile("/root/pagecounts-20160101-000000").
map(line => line.split(" ")(1)).
flatMap(line => line.split("_")).
map(word => (word,1)).
reduceByKey(_ + _).
sortBy(_._2,ascending=false).
take(1)
I would suggest you take a look at the Scaladoc.
You can just use sortBy.
val (maxTerm, count) = wordCount.sortBy(_._2, ascending = false).take(1).head
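If you would rather keep your original reduce-based approach instead of sorting the whole RDD, a minimal sketch that keeps the term alongside the count:

// reduce keeps whichever (term, count) pair has the larger count,
// so the winning term is preserved along with its count.
val (maxTerm, maxCount) = wordCount.reduce((a, b) => if (a._2 >= b._2) a else b)
println("max term: " + maxTerm + " (" + maxCount + ")")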
I am trying to write a program that counts the words for each country in a text file using the RDD approach.
Sample Data:
India, It is having 1.5 Billion population
India, It is prospering in IT and manufacturing
India, It has lot of natural mineral resources
US, It's global economic hub
US, It outsources IT work to India
US, It's global economic hub
US, It's global economic hub
For example, for "India" - How many times all words count like how many times "It" is repeating?
The result should look like this:
India, (It,3), (is,2)
...and so on, and the same for US.
Since I am using a Databricks notebook, the Spark session and context are already available; please find my approach below.
val textRdd:RDD[String] = sc.textFile("/FileStore/tables/Data1")
val Rdd2 = textRdd.map(rec => rec.split(","))
val Rdd3 = Rdd2.map(rec => (rec(0),rec(1).split(" "))).collect()
def func(str1:String, arr1:Array[String]):(String,String) = {
return (str1,arr1(_))
}
Note: Data1 contains the data mentioned above.
Can anyone please help with the above?
For each (country, word) pair, the count can be performed, and the result can then be grouped by country:
// such format: ((India,is),2)
val countryWordCountRDD = textRdd
.map(rec => rec.split(","))
.flatMap(r => r.last.trim.split(" ").map(w => (r.head, w)))
.map((_, 1))
.reduceByKey((a, b) => a + b)
val result = countryWordCountRDD.map({ case ((country, word), counter) => (country, (word, counter)) })
.groupByKey()
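To print the result in roughly the format asked for in the question, a small sketch (note that collect() brings everything to the driver, so this is only suitable for small data):

// groupByKey yields (country, Iterable[(word, count)]); materialising the values
// as a List makes the output readable, e.g. (India,List((It,3), (is,2), ...))
result
  .mapValues(_.toList)
  .collect()
  .foreach(println)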
I am working on a Spark project in the Eclipse IDE using Scala, and I would like some help with this MapReduce problem.
Map function:
remove the columns 'sport' and 'bourse'
delete any row that contains 'NULL'
add a new column 'duration of cycle'; its value depends on the student's cycle: Licence (3 years), Master (3 years), Ingeniorat (5 years) and Doctorate (3 years)
Reducer:
add up all the students according to year, cycle and speciality.
my input is
matricule,dateins,cycle,specialite,bourse,sport
0000000001,1999-11-22,Master,IC,Non,Non
0000000002,2014-02-01,Null,IC,Null,Oui
0000000003,2006-09-07,Null,Null,Oui,Oui
0000000004,2008-12-11,Master,IC,Oui,Oui
0000000005,2006-06-07,Master,SI,Non,Oui
0000000006,1996-11-16,Ingeniorat,SI,Null,Null
and so on.
This is the code I'm starting with. I have removed the columns 'sport' and 'bourse' and extracted the year:
val sc = new SparkContext(conf)
val x = sc.textFile("/home/amel/one")
val re = x.map(_.split(",")).foreach(r => println(r(1).dropRight(6), r(2),r(3)))
This is the result I got:
(2000,Licence,Isil)
(2001,Master,SSI)
The result I want is:
year cycle duration speciality Nbr-students
(2000,Licence,3 years,Isil,400)
(2001,Master,3 years,SSI,120)
// I want the column 'Nbr-students' to be the number of students from each year according to their cycle and speciality.
I'm assuming you just want the year; if you want the full date instead, change cols(1).split("-")(0) to just cols(1).
First I have faked some data using your sample data:
val x = sc.parallelize(Array(
"001,2000-12-22,License,Isil,no,yes",
"002,2001-11-30,Master,SSI,no,no",
"003,2001-11-30,Master,SSI,no,no",
"004,2001-11-30,Master,SSI,no,no",
"005,2000-12-22,License,Isil,no,yes"
))
Next I have done some RDD transformations. First I remove and create the necessary columns, and then I add a count of 1 to each row. Finally, I reduceByKey to count all of the rows with the same information:
val re = x.map(row => {
val cols = row.split(",")
val cycle = cols(2)
val years = cycle match {
case "License" => "3 years"
case "Master" => "3 years"
case "Ingeniorat" => "5 years"
case "Doctorate" => "3 years"
case _ => "other"
}
(cols(1).split("-")(0) + "," + years + "," + cycle + "," + cols(3), 1)
}).reduceByKey(_ + _)
re.collect.foreach(println)
(2000,3 years,License,Isil,2)
(2001,3 years,Master,SSI,3)
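As a variation (a sketch only, using the same fake data), you could keep the key as a tuple instead of a comma-joined string, so that year, duration, cycle and speciality stay as separate fields in the result:

// Assumption: same fake input as above. Using a Tuple4 key instead of a
// comma-joined String keeps the fields separate, so they do not need re-parsing.
val re2 = x.map { row =>
  val cols = row.split(",")
  val cycle = cols(2)
  val years = cycle match {
    case "License" | "Master" | "Doctorate" => "3 years"
    case "Ingeniorat"                       => "5 years"
    case _                                  => "other"
  }
  ((cols(1).split("-")(0), years, cycle, cols(3)), 1)
}.reduceByKey(_ + _)

re2.collect.foreach(println)   // e.g. ((2000,3 years,License,Isil),2)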
I have an RDD in pyspark of the form (key, other things), where "other things" is a list of fields. I would like to get another RDD that uses a second key from the list of fields. For example, if my initial RDD is:
(User1, 1990 4 2 green...)
(User1, 1990 2 2 green...)
(User2, 1994 3 8 blue...)
(User1, 1987 3 4 blue...)
I would like to get (User1, [(1990, x), (1987, y)]), (User2, [(1994, z)])
where x, y, z would be an aggregation of the other fields, e.g. x is the count of how many rows I have with User1 and 1990 (two in this case), and I get a list with one tuple per year.
I am looking at the key value functions from:
https://www.oreilly.com/library/view/learning-spark/9781449359034/ch04.html
But I don't seem to find anything that will do the aggregation twice: once for user and once for year. My initial attempt was with combineByKey(), but I got stuck on getting a list from the values.
Any help would be appreciated!
You can do the following using groupByKey:
# sample rdd
l = [("User1", "1990"),
     ("User1", "1990"),
     ("User2", "1994"),
     ("User1", "1987")]
rd = sc.parallelize(l)

# returns a list of (year, count) tuples
def f(l):
    dd = {}
    for i in l:
        if i not in dd:
            dd[i] = 1
        else:
            dd[i] += 1
    return list(dd.items())

# using groupByKey and applying the function on x[1] (an iterable of years)
rd1 = rd.groupByKey().map(lambda x: (x[0], f(x[1]))).collect()
[('User1', [('1990', 2), ('1987', 1)]), ('User2', [('1994', 1)])]
I have a homework assignment where I must write a MapReduce program in Scala to find, for each word in the file, which word follows it most often.
For example, for the word "basketball", the word "is" comes next 5 times, "has" 2 times, and "court" 1 time.
In a text file this might show up as:
basketball is..... (this sequence happens 5 times)
basketball has..... (this sequence happens 2 times)
basketball court.... (this sequence happens 1 time)
I am having a hard time conceptually figuring out how to do this.
The idea I have had, but have not been able to implement successfully, is:
Iterate through each word; if the word is "basketball", take the next word and add it to a map. Reduce by key, and sort from highest to lowest.
Unfortunately I do not know how to take the word that follows the current one in a list of words.
For example, I would like to do something like this:
val lines = spark.textFile("basketball_words_only.txt") // process lines in file
// split into individual words
val words = lines.flatMap(line => line.split(" "))
var listBuff = new ListBuffer[String]() // a list Buffer to hold each following word
val it = Iterator(words)
while (it.hasNext) {
listBuff += it.next().next() // <-- this is what I would like to do
}
val follows = listBuff.map(word => (word, 1))
val count = follows.reduceByKey((x, y) => x + y) // another issue as I cannot reduceByKey with a listBuffer
val sort = count.sortBy(_._2,false,1)
val result2 = sort.collect()
for (i <- 0 to result2.length - 1) {
printf("%s follows %d times\n", result1(2)._1, result2(i)._2);
}
Any help would be appreciated. If I am over thinking this I am open to different ideas and suggestions.
Here's one way to do it using MLlib's sliding function:
import org.apache.spark.mllib.rdd.RDDFunctions._
val resRDD = textFile.
flatMap(_.split("""[\s,.;:!?]+""")).
sliding(2).
map{ case Array(x, y) => ((x, y), 1) }.
reduceByKey(_ + _).
map{ case ((x, y), c) => (x, y, c) }.
sortBy( z => (z._1, z._3, z._2), false )
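resRDD contains every (word, follower, count) triple, sorted. To get the single most frequent follower for each word (which is what the assignment asks for), a small follow-up sketch on top of resRDD could look like this (ties are broken arbitrarily here):

// Key each bigram by its first word, then keep the (follower, count) pair
// with the highest count for that word.
val topFollower = resRDD
  .map { case (word, follower, count) => (word, (follower, count)) }
  .reduceByKey((a, b) => if (a._2 >= b._2) a else b)

topFollower.collect().foreach { case (word, (follower, count)) =>
  println(s"$follower follows $word $count times")
}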