I am working on a Spark project in the Eclipse IDE using Scala.
I would like some help with this MapReduce problem.
Map function:
Remove the columns 'sport' and 'bourse'.
Delete any row that contains 'NULL'.
Add a new column, cycle duration, whose value depends on the student's cycle: Licence (3 years), Master (3 years), Ingeniorat (5 years) and Doctorat (3 years).
Reducer:
Add up all the students by year, cycle and speciality.
My input is:
matricule,dateins,cycle,specialite,bourse,sport
0000000001,1999-11-22,Master,IC,Non,Non
0000000002,2014-02-01,Null,IC,Null,Oui
0000000003,2006-09-07,Null,Null,Oui,Oui
0000000004,2008-12-11,Master,IC,Oui,Oui
0000000005,2006-06-07,Master,SI,Non,Oui
0000000006,1996-11-16,Ingeniorat,SI,Null,Null
and so on.
This is the code I'm starting with. I have removed the columns 'sport' and 'bourse' and extracted the year:
val sc = new SparkContext(conf)
val x = sc.textFile("/home/amel/one")
val re = x.map(_.split(",")).foreach(r => println(r(1).dropRight(6), r(2),r(3)))
This is the result I got:
(2000,Licence,Isil)
(2001,Master,SSI)
The result I want is:
year cycle duration speciality Nbr-students
(2000,Licence,3 years,Isil,400)
(2001,Master,3 years,SSI,120)
I want the column 'Nbr-students' to be the number of students in each year, grouped by cycle and speciality.
I'm assuming you just want the year; if you want the full date instead, change cols(1).split("-")(0) to just cols(1).
First I have faked some data using your sample data:
val x = sc.parallelize(Array(
"001,2000-12-22,License,Isil,no,yes",
"002,2001-11-30,Master,SSI,no,no",
"003,2001-11-30,Master,SSI,no,no",
"004,2001-11-30,Master,SSI,no,no",
"005,2000-12-22,License,Isil,no,yes"
))
Next I have done some RDD transformations. First I remove and create the necessary columns, and then I add a count of 1 to each row. Finally, I reduceByKey to count all of the rows with the same information:
val re = x.map(row => {
  val cols = row.split(",")
  val cycle = cols(2)
  val years = cycle match {
    case "License" => "3 years"
    case "Master" => "3 years"
    case "Ingeniorat" => "5 years"
    case "Doctorate" => "3 years"
    case _ => "other"
  }
  (cols(1).split("-")(0) + "," + years + "," + cycle + "," + cols(3), 1)
}).reduceByKey(_ + _)
re.collect.foreach(println)
(2000,3 years,License,Isil,2)
(2001,3 years,Master,SSI,3)
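If you want the output in exactly the requested column order, with a real tuple rather than a comma-joined string as the key, a small variation on the same faked data could look like this (a sketch; the cycle spellings follow the ones used above):
val re2 = x.map(row => {
  val cols = row.split(",")
  val cycle = cols(2)
  val years = cycle match {
    case "License" | "Master" | "Doctorate" => "3 years"
    case "Ingeniorat" => "5 years"
    case _ => "other"
  }
  // key by (year, cycle, duration, speciality), one count per student
  ((cols(1).split("-")(0), cycle, years, cols(3)), 1)
}).reduceByKey(_ + _)
  .map { case ((year, cycle, years, spec), n) => (year, cycle, years, spec, n) }
re2.collect.foreach(println)
// (2000,License,3 years,Isil,2)
// (2001,Master,3 years,SSI,3)   (printed order may vary)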
Related
I have a small problem. I would like to delete any row that contains 'NULL'.
This is my input file:
matricule,dateins,cycle,specialite,bourse,sport
0000000001,1999-11-22,Master,IC,Non,Non
0000000002,2014-02-01,Null,IC,Null,Oui
0000000003,2006-09-07,Null,Null,Oui,Oui
0000000004,2008-12-11,Master,IC,Oui,Oui
0000000005,2006-06-07,Master,SI,Non,Oui
I did a lot of research and found a function called drop(any), which basically drops any row that contains a NULL value. I tried using it in the code below but it won't work.
val x = sc.textFile("/home/amel/one")
val re = x.map(row => {
  val cols = row.split(",")
  val cycle = cols(2)
  val years = cycle match {
    case "License" => "3 years"
    case "Master" => "3 years"
    case "Ingeniorat" => "5 years"
    case "Doctorate" => "3 years"
    case _ => "other"
  }
  (cols(1).split("-")(0) + "," + years + "," + cycle + "," + cols(3), 1)
}).reduceByKey(_ + _)
re.collect.foreach(println)
This is the current result of my code:
(1999,3 years,Master,IC,57)
(2013,NULL,Doctorat,SI,44)
(2013,NULL,Licence,IC,73)
(2009,5 years,Ingeniorat,Null,58)
(2011,3 years,Master,Null,61)
(2003,5 years,Ingeniorat,Null,65)
(2019,NULL,Doctorat,SI,80)
However, I want the result to be like this:
(1999, 3 years, Master, IC)
I.e., any row that contains 'NULL' should be removed.
Similar to, but not a duplicate of, the following question on SO: Filter spark DataFrame on string contains
You can filter this RDD when you read it in.
val x = sc.textFile("/home/amel/one").filter(!_.toLowerCase.contains("null"))
I have a data set of crimes that happened from 2001 up till now. I want to calculate no_of_crimes per year. The code which I have tried is:
val inp = SparkConfig.spark.sparkContext.textFile("file:\\C:\\Users\\M1047320\\Desktop\\Crimes_-_2001_to_present.csv")
val header = inp.first()
val data = inp.filter( line => line(0) != header(0))
val splitRDD = data.map(line => {
  val temp = line.split(",(?![^\\(\\[]*[\\]\\)])")
  (temp(0), temp(1), temp(2), temp(3), temp(4), temp(5),
   temp(6), temp(7), temp(8), temp(9), temp(10), temp(11),
   temp(12), temp(13), temp(14), temp(15), temp(16), temp(17))
})
val crimesPerYear = splitRDD.map( line => (line._18,1)).reduceByKey(_+_)// line._18 represents year column
crimesPerYear.take(20).foreach(println)
The expected result is:
(2001,54)
(2002,100)
(2003,24)
and so on.
But I am getting results like this:
(1175860,1)
(1176964,4)
(1178665,123)
(1171273,3)
(1938926,1)
(1141621,8)
(1136278,2)
I am totally confused about what I am doing wrong. Why are the years summing up? Please help me.
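One thing worth checking (a hedged sketch, not a confirmed fix): line(0) != header(0) compares only the first character of each line with the first character of the header, so the header row is not reliably removed and data rows can be dropped by accident. It also helps to print a few parsed rows to confirm which index actually holds the year before counting:
val inp = SparkConfig.spark.sparkContext.textFile("file:\\C:\\Users\\M1047320\\Desktop\\Crimes_-_2001_to_present.csv")
val header = inp.first()
// compare whole lines, not single characters, to drop the header row
val data = inp.filter(line => line != header)
val splitRDD = data.map(_.split(",(?![^\\(\\[]*[\\]\\)])"))
// sanity check: print a few parsed rows with their column indices
splitRDD.take(3).foreach(cols => println(cols.zipWithIndex.mkString(" | ")))
// assuming the year really is at index 17 (adjust after the check above)
val crimesPerYear = splitRDD.map(cols => (cols(17), 1)).reduceByKey(_ + _)
crimesPerYear.take(20).foreach(println)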
I have a homework assignment where I must write a MapReduce program in Scala to find, for each word in a file, the word that most often follows it.
For example, for the word "basketball", the word "is" comes next 5 times, "has" 2 times, and "court" 1 time.
In a text file this might show up as:
basketball is..... (this sequence happens 5 times)
basketball has..... (this sequence happens 2 times)
basketball court.... (this sequence happens 1 time)
I am having a hard time conceptually figuring out how to do this.
The idea I have had, but have not been able to implement successfully, is:
Iterate through each word; if the word is "basketball", take the next word and add it to a map. Reduce by key, and sort from highest to lowest.
Unfortunately, I do not know how to take the next word from within a list of words.
For example, I would like to do something like this:
val lines = spark.textFile("basketball_words_only.txt") // process lines in file
// split into individual words
val words = lines.flatMap(line => line.split(" "))
var listBuff = new ListBuffer[String]() // a list Buffer to hold each following word
val it = Iterator(words)
while (it.hasNext) {
listBuff += it.next().next() // <-- this is what I would like to do
}
val follows = listBuff.map(word => (word, 1))
val count = follows.reduceByKey((x, y) => x + y) // another issue as I cannot reduceByKey with a listBuffer
val sort = count.sortBy(_._2,false,1)
val result2 = sort.collect()
for (i <- 0 to result2.length - 1) {
  printf("%s follows %d times\n", result2(i)._1, result2(i)._2)
}
Any help would be appreciated. If I am overthinking this, I am open to different ideas and suggestions.
Here's one way to do it using MLlib's sliding function:
import org.apache.spark.mllib.rdd.RDDFunctions._
val resRDD = textFile.
flatMap(_.split("""[\s,.;:!?]+""")).
sliding(2).
map{ case Array(x, y) => ((x, y), 1) }.
reduceByKey(_ + _).
map{ case ((x, y), c) => (x, y, c) }.
sortBy( z => (z._1, z._3, z._2), false )
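This counts every (word, follower) pair; if you then want only the most frequent follower for each word, as the assignment asks, one way to continue from resRDD is a sketch like the following:
// for each first word, keep the (follower, count) pair with the highest count
val topFollower = resRDD
  .map { case (first, second, count) => (first, (second, count)) }
  .reduceByKey { (a, b) => if (a._2 >= b._2) a else b }
topFollower.collect().foreach { case (first, (second, count)) =>
  printf("%s is most often followed by %s (%d times)\n", first, second, count)
}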
I have a relatively simple problem.
I have a large Spark RDD[String] (containing JSON). In my use case I want to concatenate groups of N strings together into a new RDD[String], so that it will have a size of oldRDD.size/N.
Pseudo example:
val oldRDD : RDD[String] = ['{"id": 1}', '{"id": 2}', '{"id": 3}', '{"id": 4}']
val newRDD : RDD[String] = someTransformation(oldRDD, ",", 2)
newRDD = ['{"id": 1},{"id": 2}','{"id": 3},{"id": 4}']
val anotherRDD : RDD[String] = someTransformation(oldRDD, ",", 3)
anotherRDD = ['{"id": 1},{"id": 2},{"id": 3}','{"id": 4}']
I already looked for a similar case, but couldn't find anything.
Thanks!
Here you have to use the zipWithIndex function and then calculate each element's group.
For example, index = 3 and group size n = 2 gives 3 / 2 = 1 (integer division), i.e. the second group (0-based index 1).
val n = 3
val newRDD1 = oldRDD.zipWithIndex() // creates tuples (element, index)
  // map to tuple (group, content)
  .map(x => (x._2 / n, x._1))
  // merge
  .reduceByKey(_ + ", " + _)
  // remove key
  .map(x => x._2)
One note: the order used by zipWithIndex is Spark's internal order. It may not be meaningful for your business logic, so check whether that order is acceptable in your case. If not, sort the RDD first and then use zipWithIndex.
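Applied to the pseudo example from the question with a group size of 2, a usage sketch (the grouping depends on the element order noted above):
val oldRDD = sc.parallelize(Seq("""{"id": 1}""", """{"id": 2}""", """{"id": 3}""", """{"id": 4}"""))
val n = 2 // group size
val newRDD = oldRDD.zipWithIndex()
  .map(x => (x._2 / n, x._1))
  .reduceByKey(_ + "," + _) // plain "," to match the separator in the question
  .map(x => x._2)
newRDD.collect().foreach(println)
// {"id": 1},{"id": 2}
// {"id": 3},{"id": 4}   (printed order may vary)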
I have a file with thousands of documents, and I want to read every document and replace each word in the document with a (word, value) pair, where all values start at zero (0).
Docs file:
Problems installing software weak tech support.
great idea executed perfect.
A perfect solution for anyone who listens to the radio.
...
and I have a score_file that contains word scores, e.g.:
idea 1
software 1
weak 1
who 1
perfect 1
price -1
...
The output should follow this pattern:
(Problems,0) (installing,1) (software,1) (develop,2) (weak,1) (tech,1) (support,0).
(great,1) (idea,1) (executed,2) (perfect,1).
(perfect,1) (solution,1) (for,0) (anyone,1) (who,1) (listens,1) (to,0) (the,0) (radio,0).
If a word of the document occurs in the score_file, then the values of the left word, the word itself, and the right word in the document are each increased by that word's score (1 or -1).
I've tried:
val Words_file = sc.textFile("score_file.txt")
val Words_List = sc.broadcast(Words_file.map({ (line) =>
  val Array(a,b) = line.split(" ").map(_.trim)
  (a,b.toInt)
}).collect().toMap)
val Docs_file = sc.textFile("docs_file.txt")
val tokens = Docs_file.map(line => (line.split(" ").map(word => Words_List.value.getOrElse(word, 0)).reduce(_ + _), line.split(" ").filter(Words_List.value.contains(_) == false).mkString(" ")))
val out_Docs = tokens.map(s => if (s._2.length > 3) {s._1 + "," + s._2})
But it scores every document as a whole, not its words; how can I generate my desired output?
It is kind of hard to read your code; you seem to use a mix of CamelCase and underscores, with vals sometimes starting with an uppercase letter and sometimes not.
I'm not sure I completely got the question, but to get an output where each word in a given line is replaced by itself and the number coming from the other file, this might work:
val sc = new SparkContext()
val wordsFile = sc.textFile("words_file.txt")
val words = sc.broadcast(wordsFile.map(line => {
  val Array(a, b) = line.split(" ").map(_.trim)
  (a, b.toInt)
}).collect().toMap)
val docs = sc.textFile("docs_file.txt")
val tokens = docs.map(line => {
  line.split(" ")
    .map(token => s"(${token}, ${words.value.getOrElse(token, 0)})")
    .mkString(" ")
})
tokens ends up being an RDD[String] just like the input, preserving the lines (documents). I reformatted the code a bit to make it more readable.
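For instance, with the scores from the question, the first document line would come out roughly as follows (note that punctuation stays attached to its token, and words missing from the score file default to 0):
(Problems, 0) (installing, 0) (software, 1) (weak, 1) (tech, 0) (support., 0)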