I have a small problem. I would like to delete any row that contains 'NULL'.
This is my input file:
matricule,dateins,cycle,specialite,bourse,sport
0000000001,1999-11-22,Master,IC,Non,Non
0000000002,2014-02-01,Null,IC,Null,Oui
0000000003,2006-09-07,Null,Null,Oui,Oui
0000000004,2008-12-11,Master,IC,Oui,Oui
0000000005,2006-06-07,Master,SI,Non,Oui
I did a lot of research and found a function called drop(any), which basically drops any row that contains a NULL value. I tried using it in the code below, but it won't work:
val x = sc.textFile("/home/amel/one")
val re = x.map(row => {
  val cols = row.split(",")
  val cycle = cols(2)
  val years = cycle match {
    case "License" => "3 years"
    case "Master" => "3 years"
    case "Ingeniorat" => "5 years"
    case "Doctorate" => "3 years"
    case _ => "other"
  }
  (cols(1).split("-")(0) + "," + years + "," + cycle + "," + cols(3), 1)
}).reduceByKey(_ + _)
re.collect.foreach(println)
This is the current result of my code:
(1999,3 years,Master,IC,57)
(2013,NULL,Doctorat,SI,44)
(2013,NULL,Licence,IC,73)
(2009,5 years,Ingeniorat,Null,58)
(2011,3 years,Master,Null,61)
(2003,5 years,Ingeniorat,Null,65)
(2019,NULL,Doctorat,SI,80)
However, I want the result to be like this:
(1999, 3 years, Master, IC)
I.e., any row that contains 'NULL' should be removed.
Similar to, but not a duplicate of, the following question on SO: Filter spark DataFrame on string contains
You can filter this RDD when you read it in.
val x = sc.textFile("/home/amel/one").filter(!_.toLowerCase.contains("null"))
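With the sample rows from the question, that filter drops the lines containing Null before the map/reduce runs, so no NULL keys can appear in the output. A quick check (a sketch using parallelize instead of your real file; note that contains("null") would also drop a row where "null" happened to occur inside a longer value):
val sample = sc.parallelize(Array(
  "0000000001,1999-11-22,Master,IC,Non,Non",
  "0000000002,2014-02-01,Null,IC,Null,Oui",
  "0000000003,2006-09-07,Null,Null,Oui,Oui"
))
sample.filter(!_.toLowerCase.contains("null")).collect.foreach(println)
// prints only: 0000000001,1999-11-22,Master,IC,Non,Non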
Related
I am working on a Spark project in the Eclipse IDE, using Scala. I would like some help with this MapReduce problem.
Map function:
remove the columns 'sport' and 'bourse'
delete any row that contains 'NULL'
add a new column, 'duration cycle', whose value depends on the student's cycle: Licence (3 years), Master (3 years), Ingeniorat (5 years) and Doctorate (3 years)
Reducer:
add up all the students by year, cycle and speciality.
My input is:
matricule,dateins,cycle,specialite,bourse,sport
0000000001,1999-11-22,Master,IC,Non,Non
0000000002,2014-02-01,Null,IC,Null,Oui
0000000003,2006-09-07,Null,Null,Oui,Oui
0000000004,2008-12-11,Master,IC,Oui,Oui
0000000005,2006-06-07,Master,SI,Non,Oui
0000000006,1996-11-16,Ingeniorat,SI,Null,Null
and so on.
This is the code I'm starting with. I have removed the columns 'sport' and 'bourse' and extracted the year:
val sc = new SparkContext(conf)
val x = sc.textFile("/home/amel/one")
val re = x.map(_.split(",")).foreach(r => println(r(1).dropRight(6), r(2),r(3)))
This is the result I got:
(2000,Licence,Isil)
(2001,Master,SSI)
The result I want is:
year cycle duration speciality Nbr-students
(2000,Licence,3 years,Isil,400)
(2001,Master,3 years,SSI,120)
// I want the column 'Nbr-students' to be the number of students from each year according to their cycle and speciality.
I'm assuming you just want the year; if you want the full date instead, change cols(1).split("-")(0) to just cols(1).
First I have faked some data using your sample data:
val x = sc.parallelize(Array(
  "001,2000-12-22,License,Isil,no,yes",
  "002,2001-11-30,Master,SSI,no,no",
  "003,2001-11-30,Master,SSI,no,no",
  "004,2001-11-30,Master,SSI,no,no",
  "005,2000-12-22,License,Isil,no,yes"
))
Next I have done some RDD transformations. First I remove and create the necessary columns, and then I add a count of 1 to each row. Finally, I reduceByKey to count all of the rows with the same information:
val re = x.map(row => {
  val cols = row.split(",")
  val cycle = cols(2)
  val years = cycle match {
    case "License" => "3 years"
    case "Master" => "3 years"
    case "Ingeniorat" => "5 years"
    case "Doctorate" => "3 years"
    case _ => "other"
  }
  (cols(1).split("-")(0) + "," + years + "," + cycle + "," + cols(3), 1)
}).reduceByKey(_ + _)
re.collect.foreach(println)
(2000,3 years,License,Isil,2)
(2001,3 years,Master,SSI,3)
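To also satisfy the "delete any row that has 'NULL'" requirement, the same case-insensitive filter from the answer further up can be applied before the map (a sketch; it assumes "null" never appears inside a legitimate value):
// drop NULL rows first, then run the same map/reduceByKey pipeline on `cleaned`
val cleaned = x.filter(!_.toLowerCase.contains("null"))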
I'm learning Scala and am curious how to optimize this code. What I have is an RDD loaded from Spark; it's a tab-delimited dataset. I want to combine the first column with the second column and append the result as a new column at the end of the dataset, with a "-" separating the two.
For example:
column1\tcolumn2\tcolumn3
becomes
column1\tcolumn2\tcolumn3\tcolumn1-column2
val f = sc.textFile("path/to/dataset")
f.map(line =>
  if (line.split("\t").length > 1)
    line.split("\t") :+ line.split("\t")(0) + "-" + line.split("\t")(1)
  else
    Array[String]()
).map(a => a.mkString("\t"))
  .saveAsTextFile("output/path")
Try:
f.map { line =>
  val cols = line.split("\t")
  if (cols.length > 1) line + "\t" + cols(0) + "-" + cols(1)
  else line
}
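This splits each line only once and reuses the original line instead of rebuilding it. Chained with the save step (paths are the question's placeholders), the whole pipeline might look like:
val f = sc.textFile("path/to/dataset")
f.map { line =>
  val cols = line.split("\t")  // split once per line
  if (cols.length > 1) line + "\t" + cols(0) + "-" + cols(1)
  else line                    // short lines pass through unchanged
}.saveAsTextFile("output/path")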
I have a file whose lines contain items separated by ",", for example:
2 1,3
3 2,5,7
5 4
Now I want to flatMap this file into an RDD like this:
2 1
2 3
3 2
3 5
3 7
5 4
I wonder how to write this function in Scala:
val pairs = lines.flatMap { line =>
  val a = line.split(" ")(0)
  val partb = line.split(" ")(1)
  for (b <- partb.split(",")) {
    yield a + " " + b
  }
}
Is this correct?
Thank you for clarifying your code example. In your case, the only problem is the location of your yield keyword. Move it to before the curly braces, like so:
for (b <- partb.split(",")) yield {
  a + " " + b
}
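Applied to the snippet from the question, the corrected version reads:
val pairs = lines.flatMap { line =>
  val a = line.split(" ")(0)
  val partb = line.split(" ")(1)
  for (b <- partb.split(",")) yield {
    a + " " + b
  }
}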
You need to put yield before the block that produces the value:
yield { a }
The way you are doing it now is a for loop, not a for comprehension; the compiler will complain about the yield keyword, and even if it compiled, it would return Unit.
val pairs = lines.flatMap { line =>
  for (a <- line.split(",")) yield {
    a
  }
}
In addition to relocating yield so the comprehension delivers a collection, as already shown, consider this possible refactoring, where we extract the first two entries from split:
val pairs = lines.flatMap { line =>
  val Array(a, partb, _*) = line.split(" ")
  for (b <- partb.split(","))
    yield a + " " + b
}
and yet more concise is
val pairs = lines.flatMap { line =>
  val Array(a, tail @ _*) = line.split(" |,")
  for (t <- tail) yield s"$a $t"
}
where we split by either " " or ",", extract the head and the tail, and then apply string interpolation to produce the desired result.
I have a list of nodes (String) that I want to convert into something like the following:
create X ({name:"A"}),({name:"B"}),({name:"B"}),({name:"C"}),({name:"D"}),({name:"F"})
Using a fold I get everything with an extra "," at the end. I can remove it with a substring on the final String, but I was wondering if there is a better/more functional way of doing this in Scala?
val nodes = List("A", "B", "B", "C", "D", "F")
val str = nodes.map( x => "({name:\"" + x + "\"}),").foldLeft("create X ")( (acc, curr) => acc + curr )
println(str)
//create X ({name:"A"}),({name:"B"}),({name:"B"}),({name:"C"}),({name:"D"}),({name:"F"}),
Solution 1
You could use the mkString function, which won't append the separator at the end.
In this case you first map each element to the corresponding String and then use mkString to put the ',' in between.
Since the "create X " prefix is static, you can just prepend it to the result.
val str = "create X " + nodes.map("({name:\"" + _ + "\"})").mkString(",")
Solution 2
Another way to see this: since you append exactly one ',' too many, you can simply remove it.
val str = nodes.foldLeft("create X ")((acc, x) => acc + "({name:\"" + x + "\"}),").init
init takes all elements of a collection except the last (a String is treated as a collection of Chars here).
So when nodes is non-empty, you remove the trailing ','. When it is empty, you only have "create X " and therefore remove the trailing white-space, which might not be needed anyway.
Solution 1 and 2 are not equivalent when nodes is empty. Solution 1 would keep the white-space.
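A quick check with an empty list illustrates the difference:
val empty = List.empty[String]
val s1 = "create X " + empty.map("({name:\"" + _ + "\"})").mkString(",")
// s1 == "create X "  (trailing space kept)
val s2 = empty.foldLeft("create X ")((acc, x) => acc + "({name:\"" + x + "\"}),").init
// s2 == "create X"   (init removes the trailing space)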
Joining a bunch of things, splicing something "in between" each of the things, isn't a map-shaped problem. So adding the comma in the map call doesn't really "fit".
I generally do this sort of thing by inserting the comma before each item during the fold; the fold can test whether the accumulator is "empty" and not insert a comma.
For this particular case (string joining) it's so common that there's already a library function for it: mkString.
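For completeness, a minimal sketch of the fold-with-test approach described above: join the items first, adding a comma only when the accumulator already has content, then prepend the static prefix.
val joined = nodes.foldLeft("") { (acc, x) =>
  val item = "({name:\"" + x + "\"})"
  if (acc.isEmpty) item else acc + "," + item
}
val str = "create X " + joined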
Move "," from map(which applies to all) to fold/reduce
val str = "create X " + nodes.map( x => "({name:\"" + x + "\"})").reduceLeftOption( _ +","+ _ ).getOrElse("")
Relevant questions
This question is quite relevant, but is 2 years old: In memory OLAP engine in Java
Background
I would like to create a pivot-table-like matrix from a given tabular dataset, in memory,
e.g. an age by marital status count (rows are ages, columns are marital statuses).
The input: a List of People, each with an age and some Boolean property (e.g. married).
The desired output: a count of People, by age (row) and isMarried (column).
What I've tried (Scala)
case class Person(age: Int, isMarried: Boolean)
...
val people: List[Person] = ...
val peopleByAge = people.groupBy(_.age) //only by age
val peopleByMaritalStatus = people.groupBy(_.isMarried) //only by marital status
I managed to do it the naive way: first group by age, then map each group to counts by marital status, then foldRight to build the running totals:
TreeMap(peopleByAge.toSeq: _*).map { x =>
  val age = x._1
  val rows = x._2
  val numMarried = rows.count(_.isMarried)
  val numNotMarried = rows.length - numMarried
  (age, numMarried, numNotMarried)
}.foldRight(List[FinalResult]()) { (row, list) =>
  val cumMarried = row._2 +
    (if (list.isEmpty) 0 else list.last.cumMarried)
  val cumNotMarried = row._3 +
    (if (list.isEmpty) 0 else list.last.cumNotMarried)
  list :+ new FinalResult(row._1, row._2, row._3, cumMarried, cumNotMarried)
}.reverse
I don't like the above code, it's not efficient, hard to read, and I'm sure there is a better way.
The question(s)
How do I groupBy "both"? And how do I count each subgroup, e.g.:
How many people are exactly 30 years old and married?
Another question, is how do I do a running total, to answer the question:
How many people above 30 are married?
Edit:
Thank you for all the great answers.
Just to clarify, I would like the output to include a "table" with the following columns:
Age (ascending)
Num Married
Num Not Married
Running Total Married
Running Total Not Married
I want not only to answer those specific queries, but to produce a report that allows answering all questions of this type.
Here is an option that is a little more verbose, but does this in a generic fashion instead of using strict data types. You could of course use generics to make this nicer, but I think you get the idea.
/** Creates a new pivot structure by finding correlated values
  * and performing an operation on these values.
  *
  * @param accuOp  the accumulator function (e.g. sum, max, etc.)
  * @param xCol    the "x" axis column
  * @param yCol    the "y" axis column
  * @param accuCol the column to collect and perform accuOp on
  * @return a new Pivot instance that has been transformed with the accuOp function
  */
def doPivot(accuOp: List[String] => String)(xCol: String, yCol: String, accuCol: String) = {
  // create list of indexes that correlate to x, y, accuCol
  val colsIdx = List(xCol, yCol, accuCol).map(headers.getOrElse(_, 1))
  // group by x and y, sending the resulting collection of
  // accumulated values to the accuOp function for post-processing
  val data = body.groupBy(row => {
    (row(colsIdx(0)), row(colsIdx(1)))
  }).map(g => {
    (g._1, accuOp(g._2.map(_(colsIdx(2)))))
  }).toMap
  // get distinct axis values
  val xAxis = data.map(g => g._1._1).toList.distinct
  val yAxis = data.map(g => g._1._2).toList.distinct
  // create result matrix
  val newRows = yAxis.map(y => {
    xAxis.map(x => {
      data.getOrElse((x, y), "")
    })
  })
  // collect it with axis labels for results
  Pivot(List((yCol + "/" + xCol) +: xAxis) :::
    newRows.zip(yAxis).map(x => x._2 +: x._1))
}
My Pivot type is pretty basic:
class Pivot(val rows: List[List[String]]) {
  val headers = rows.head.zipWithIndex.toMap
  val body = rows.tail
  ...
}
And to test it, you could do something like this:
val marriedP = Pivot(
  List(
    List("Name", "Age", "Married"),
    List("Bill", "42", "TRUE"),
    List("Heloise", "47", "TRUE"),
    List("Thelma", "34", "FALSE"),
    List("Bridget", "47", "TRUE"),
    List("Robert", "42", "FALSE"),
    List("Eddie", "42", "TRUE")
  )
)

def accum(values: List[String]) = {
  values.map(x => 1).sum.toString
}

println(marriedP + "\n")
println(marriedP.doPivot(accum)("Age", "Married", "Married"))
Which yields:
Name Age Married
Bill 42 TRUE
Heloise 47 TRUE
Thelma 34 FALSE
Bridget 47 TRUE
Robert 42 FALSE
Eddie 42 TRUE
Married/Age 47 42 34
TRUE 2 2
FALSE 1 1
The nice thing is that you can use currying to pass in any function for the values, as you would in a traditional Excel pivot table.
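For instance, a hypothetical accumulator that takes the (lexicographic) maximum of the collected values instead of counting them could be passed the same way:
// hypothetical alternative accumulator: lexicographic max of the collected Strings
def maxAccum(values: List[String]) = values.max
println(marriedP.doPivot(maxAccum)("Age", "Married", "Name"))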
More can be found here: https://github.com/vinsonizer/pivotfun
You can
val groups = people.groupBy(p => (p.age, p.isMarried))
and then
val thirty_and_married = groups((30, true)).length
val over_thirty_and_married_count =
groups.filterKeys(k => k._1 > 30 && k._2).map(_._2.length).sum
I think it would be better to use the count method on Lists directly
For question 1
people.count { p => p.age == 30 && p.isMarried }
For question 2
people.count { p => p.age > 30 && p.isMarried }
If you also want the actual groups of people who satisfy those predicates, use filter:
people.filter { p => p.age > 30 && p.isMarried }
You could probably optimise these by doing the traversal only once, but is that a requirement?
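If a single traversal is really needed, a fold can maintain both counters at once (a sketch; probably not worth the loss of clarity here):
// one pass over `people`, counting both predicates simultaneously
val (at30Married, over30Married) = people.foldLeft((0, 0)) { case ((a, o), p) =>
  (if (p.age == 30 && p.isMarried) a + 1 else a,
   if (p.age > 30 && p.isMarried) o + 1 else o)
}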
You can group using a tuple:
val res1 = people.groupBy(p => (p.age, p.isMarried)) //or
val res2 = people.groupBy(p => (p.age, p.isMarried)).mapValues(_.size) // if you don't care about the People instances
You can answer both questions like this:
res2.getOrElse((30, true), 0)
res2.filter{case (k, _) => k._1 > 30 && k._2}.values.sum
res2.filterKeys(k => k._1 > 30 && k._2).values.sum // nicer with filterKeys from Rex Kerr's answer
You could answer both questions with the count method on List:
people.count(p => p.age == 30 && p.isMarried)
people.count(p => p.age > 30 && p.isMarried)
Or using filter and size:
people.filter(p => p.age == 30 && p.isMarried).size
people.filter(p => p.age > 30 && p.isMarried).size
Edit:
slightly cleaner version of your code:
TreeMap(peopleByAge.toSeq: _*).map { case (age, ps) =>
  val (married, notMarried) = ps.partition(_.isMarried)
  (age, married.size, notMarried.size)
}.foldLeft(List[FinalResult]()) { case (acc, (age, married, notMarried)) =>
  def prevValue(f: FinalResult => Int) = acc.headOption.map(f).getOrElse(0)
  new FinalResult(age, married, notMarried,
    prevValue(_.cumMarried) + married, prevValue(_.cumNotMarried) + notMarried) :: acc
}.reverse