I am trying to implement a PageRank algorithm on the Reddit May2015 dataset, but I can't manage to extract the subreddits referenced in the comments.
One column contains the name of the subreddit and the other contains a comment posted in that subreddit that references another subreddit.
subreddit|body
videos|"Tagged you as ""...
Quebec|Ok, c'est quoi le...
pokemon|Sorry to hear abo...
videos|Not sure what the...
ClashOfClans|Your submission, ...
realtech|Original /r/techn...
guns|Welp, those basta...
IAmA|If you are very i...
WTF|If you go on /r/w...
Fitness|Your submission h...
gifs|Hi! Take a look a...
Coachella|Yeah. If you go /...
What I did is this:
val df = spark.read
  .format("csv")
  .option("header", "true")
  .load("path\\May2015.csv")

val df1 = df.filter(df("body").contains("/r/")).select("subreddit", "body")
val lines = df1.rdd

val links = lines.map { s =>
  val x = s(1).toString.split(" ")
  val b = x.filter(_.startsWith("/r/")).toList
  val t = b(0)
  (s(0), t)
}.distinct().groupByKey().cache()
var ranks = links.mapValues(v => 0.25)

for (i <- 1 to iters) {
  val contribs = links.join(ranks).values.flatMap { case (urls, rank) =>
    val size = urls.size
    urls.map(url => (url, rank / size))
  }
  ranks = contribs.reduceByKey(_ + _).mapValues(0.15 + 0.85 * _)
}
The problem is that the output is always:
(subreddit, CompactBuffer())
while what I want is:
(subreddit, anothersubreddit)
I managed to solve this, but now I am getting another error:
> type mismatch;
>  found   : org.apache.spark.rdd.RDD[(String, Double)]
>  required: org.apache.spark.rdd.RDD[(Any, Double)]
> Note: (String, Double) <: (Any, Double), but class RDD is invariant in type T.
> You may wish to define T as +T instead. (SLS 4.5)
>     ranks = contribs.reduceByKey(_ + _).mapValues(0.15 + 0.85 * _)
The problem probably lies here:
val links = lines.map { s =>
  val x = s(1).toString.split(" ")
  val b = x.filter(_.startsWith("/r/")).toList
  val t = b(0)
  (s(0), t)
  ...
You need to avoid having the first element of the tuple typed as Any here. If you expect s(0) to be a String, you can use an explicit cast like s(0).asInstanceOf[String], or the Row accessors s.getAs[String](0) or s.getString(0).
So, the version that solves the compile error may be as follows:
val links = lines.map { s =>
  val x = s.getString(1).split(" ")
  val b = x.filter(_.startsWith("/r/")).toList
  val t = b(0)
  (s.getString(0), t)
  ...
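For completeness, here is a minimal sketch of the whole loop with the typed keys (a sketch of the code as posted, assuming iters is defined; the point is that ranks now stays RDD[(String, Double)] throughout):

val links = lines.map { s =>
  val refs = s.getString(1).split(" ").filter(_.startsWith("/r/"))
  (s.getString(0), refs.head)            // (subreddit, first referenced subreddit)
}.distinct().groupByKey().cache()

var ranks = links.mapValues(_ => 0.25)   // RDD[(String, Double)]

for (_ <- 1 to iters) {
  val contribs = links.join(ranks).values.flatMap { case (urls, rank) =>
    urls.map(url => (url, rank / urls.size))
  }
  ranks = contribs.reduceByKey(_ + _).mapValues(0.15 + 0.85 * _)
}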
Related
I have a List[MyObject], with MyObject containing the fields field1, field2 and field3.
I'm looking for an efficient way of doing:
Tuple3(_.map(_.field1), _.map(_.field2), _.map(_.field3))
In Java I would do something like:
List<Field1Type> f1 = new ArrayList<>();
List<Field2Type> f2 = new ArrayList<>();
List<Field3Type> f3 = new ArrayList<>();

for (MyObject mo : myObjects) {
    f1.add(mo.getField1());
    f2.add(mo.getField2());
    f3.add(mo.getField3());
}
I would like something more functional since I'm in Scala, but I can't put my finger on it.
Get 2/3 sub-groups with unzip/unzip3
Assuming the starting point:
val objects: Seq[MyObject] = ???
You can unzip to get all 3 sub-groups:
val (firsts, seconds, thirds) =
  objects.unzip3((o: MyObject) => (o.f1, o.f2, o.f3))
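For illustration, a minimal self-contained sketch (the concrete field types here are assumptions, not the asker's actual model):

case class MyObject(f1: String, f2: Int, f3: Double)

val objects = Seq(MyObject("a", 1, 1.0), MyObject("b", 2, 2.0))

// unzip3 takes its "as triple" conversion as an implicit, which can also be supplied explicitly:
val (firsts, seconds, thirds) =
  objects.unzip3((o: MyObject) => (o.f1, o.f2, o.f3))
// firsts:  List("a", "b")
// seconds: List(1, 2)
// thirds:  List(1.0, 2.0)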
What if I have more than 3 relevant sub-groups?
If you really need more sub-groups you would have to implement your own unzipN; however, instead of working with a Tuple22 I would personally use an adapter:
case class MyObjectsProjection(private val objs: Seq[MyObject]) {
  lazy val f1s: Seq[String] =
    objs.map(_.f1)

  lazy val f2s: Seq[String] =
    objs.map(_.f2)

  ...

  lazy val f22s: Seq[String] =
    objs.map(_.f3)
}
val objects: Seq[MyObject] = ???
val objsProjection = MyObjectsProjection(objects)

objsProjection.f1s
objsProjection.f2s
...
objsProjection.f22s
Notes:
Change MyObjectsProjection according to your needs.
This is from a Scala 2.12/2.11 vanilla perspective.
The following will decompose your objects into three lists:
case class MyObject[T,S,R](f1: T, f2: S, f3: R)
val myObjects: Seq[MyObject[Int, Double, String]] = ???
val (l1, l2, l3) = myObjects.foldLeft((List.empty[Int], List.empty[Double], List.empty[String])) {
  (acc, nxt) => (nxt.f1 :: acc._1, nxt.f2 :: acc._2, nxt.f3 :: acc._3)
}
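One caveat worth noting: because foldLeft prepends with ::, each list comes out in reverse input order. A small sketch (with made-up values) of how to restore the order if it matters:

val myObjects = Seq(MyObject(1, 1.0, "a"), MyObject(2, 2.0, "b"))

val (l1, l2, l3) = myObjects.foldLeft((List.empty[Int], List.empty[Double], List.empty[String])) {
  (acc, nxt) => (nxt.f1 :: acc._1, nxt.f2 :: acc._2, nxt.f3 :: acc._3)
}
// l1 == List(2, 1); use l1.reverse (likewise l2, l3) if the input order must be preserved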
We have a sequence of tuples, Seq[(department, title)], in depTitleSeq, from which we would like to extract Set(department) and Set(title). We are looking for the best way to do this; so far what we could come up with is:
val depTitleSeq = getDepTitleTupleSeq()

var departmentSeq = ArrayBuffer[String]()
var titleSeq = ArrayBuffer[String]()

for (depTitle <- depTitleSeq) yield {
  departmentSeq += depTitle._1
  titleSeq += depTitle._2
}

val depSet = departmentSeq.toSet
val titleSet = titleSeq.toSet
Fairly new to Scala; I'm sure there are better and more efficient ways to achieve this. If you could point us in the right direction it would be of great help.
If you have two Seqs of data that you want combined into a Seq of tuples, you can zip them together.
If you have a Seq of tuples and you want the elements separated, then you can unzip them.
val (departmentSeq, titleSeq) = getDepTitleTupleSeq().unzip
val depSet: Set[String] = departmentSeq.toSet
val titleSet: Set[String] = titleSeq.toSet
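For the zip direction mentioned above, a quick sketch with hypothetical data:

val depts = Seq("x", "y")
val titles = Seq("a", "b")
val pairs = depts.zip(titles)   // Seq(("x","a"), ("y","b"))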
val depTitleSeq = Seq(("x","a"),("y","b"))
val depSet = depTitleSeq.map(_._1).toSet
val titleSet = depTitleSeq.map(_._2).toSet
In Scala REPL:
scala> val depTitleSeq = Seq(("x","a"),("y","b"))
depTitleSeq: Seq[(String, String)] = List((x,a), (y,b))
scala> val depSet = depTitleSeq.map(_._1).toSet
depSet: scala.collection.immutable.Set[String] = Set(x, y)
scala> val titleSet = depTitleSeq.map(_._2).toSet
titleSet: scala.collection.immutable.Set[String] = Set(a, b)
You can use foldLeft to achieve this:
val result: (Set[String], Set[String]) = depTitleSeq.foldLeft((Set[String](), Set[String]())) { (a, b) => (a._1 + b._1, a._2 + b._2) }
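Applied to the sample data from the REPL session above, this would give:

val depTitleSeq = Seq(("x", "a"), ("y", "b"))
val result = depTitleSeq.foldLeft((Set[String](), Set[String]())) { (a, b) => (a._1 + b._1, a._2 + b._2) }
// result: (Set[String], Set[String]) = (Set(x, y), Set(a, b))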
For the following fold invocation we can see that the types of each return value have been indicated:
(Note: the first "three" lines of the snippet below are actually all on a single line, line 59, in the real code.)
val (nRows, dfsOut, dfOut): (Int, DataFrameMap, DataFrame) =
  (1 to nLevels).foldLeft((0, dfsIn, dfIn)) {
    case ((nRowsPrior, dfsPrior, dfPrior), level) =>
      ..
      (nnRows, dfs, dfOut1) // These return values are verified as correctly
                            // matching the listed return types
  }
But we have the following error:
Error:(59, 10) recursive value x$3 needs type
val (nRows, dfsOut, dfOut): (Int,DataFrameMap, DataFrame) = (1 to nLevels).foldLeft((0, dfsIn, dfIn)) { case ((nRowsPrior, dfsPrior, dfPrior), level) =>
Column 10 indicates the first entry nRows which is set as follows:
val nnRows = cntAccum.value.toInt
That is definitely an Int, so it is unclear what the root issue is.
(FYI, there is another similarly titled question - recursive value x$5 needs type - but that question was doing strange things with the output parameters, whereas mine is a straightforward value assignment.)
Here is an MCVE that does not have any dependencies:
trait DataFrameMap
trait DataFrame
val dfsIn: DataFrameMap = ???
val dfIn: DataFrame = ???
val nLevels: Int = 0
val (_, _) = (1, 2)
val (_, _) = (3, 4)
val (nRows, dfsOut, dfOut): (Int, DataFrameMap, DataFrame) =
  (1 to nLevels).foldLeft((0, dfsIn, dfIn)) {
    case ((nRowsPrior, dfsPrior, dfPrior), level) =>
      val nnRows: Int = nRows
      val dfs: DataFrameMap = ???
      val dfOut1: DataFrame = ???
      (nnRows, dfs, dfOut1)
  }
It reproduces the error message exactly:
error: recursive value x$3 needs type
val (nRows, dfsOut, dfOut): (Int,DataFrameMap, DataFrame) =
^
You must have used nRows, dfsOut or dfOut somewhere inside the body of foldLeft. This here compiles just fine:
trait DataFrameMap
trait DataFrame
val dfsIn: DataFrameMap = ???
val dfIn: DataFrame = ???
val nLevels: Int = 0
val (_, _) = (1, 2)
val (_, _) = (3, 4)
val (nRows, dfsOut, dfOut): (Int, DataFrameMap, DataFrame) =
  (1 to nLevels).foldLeft((0, dfsIn, dfIn)) {
    case ((nRowsPrior, dfsPrior, dfPrior), level) =>
      val nnRows: Int = ???
      val dfs: DataFrameMap = ???
      val dfOut1: DataFrame = ???
      (nnRows, dfs, dfOut1)
  }
Fun fact: the x$3 does not refer to dfOut (third component of the tuple), but rather to the entire tuple (nRows, dfsOut, dfOut) itself. This is why I had to add two (_, _) = ...'s before the val (nRows, dfsOut, dfOut) definition to get x$3 instead of x$1.
The problem was inside a print statement: the outer foldLeft value was being referenced by accident instead of the inner loop value:
info(s"$tag: ** ROWS ** ${props(OutputTag)} at ${props(OutputPath)} count=$nRows")
The $nRows is the outer-scoped variable; this causes the recursion. The intention had been to reference $nnRows.
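For clarity, a sketch of the corrected statement (same tag and props as in the original line; only the interpolated variable changes):

// reference the fold-local value, not the outer destructured result
info(s"$tag: ** ROWS ** ${props(OutputTag)} at ${props(OutputPath)} count=$nnRows")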
How to convert one var to two var List?
Below is my input variable:
val input="[level:1,var1:name,var2:id][level:1,var1:name1,var2:id1][level:2,var1:add1,var2:city]"
I want my result should be:
val first= List(List("name","name1"),List("add1"))
val second= List(List("id","id1"),List("city"))
First of all, the input is not valid JSON.
val input="[level:1,var1:name,var2:id][level:1,var1:name1,var2:id1][level:2,var1:add1,var2:city]"
You have to turn it into a valid JSON RDD (as you are going to use Apache Spark):
val validJsonRdd = sc.parallelize(Seq(input)).flatMap(x => x.replace(",", "\",\"").replace(":", "\":\"").replace("[", "{\"").replace("]", "\"}").replace("}{", "}&{").split("&"))
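For reference, after those chained replaces and the split on "&", each element of validJsonRdd should look roughly like this (traced from the input above):

// {"level":"1","var1":"name","var2":"id"}
// {"level":"1","var1":"name1","var2":"id1"}
// {"level":"2","var1":"add1","var2":"city"}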
Once you have a valid JSON RDD, you can easily convert it to a dataframe and then apply the logic you have:
import org.apache.spark.sql.functions._
val df = spark.read.json(validJsonRdd)
  .groupBy("level")
  .agg(collect_list("var1").as("var1"), collect_list("var2").as("var2"))
  .select(collect_list("var1").as("var1"), collect_list("var2").as("var2"))
You should get the desired output in the dataframe as
+------------------------------------------------+--------------------------------------------+
|var1 |var2 |
+------------------------------------------------+--------------------------------------------+
|[WrappedArray(name1, name2), WrappedArray(add1)]|[WrappedArray(id1, id2), WrappedArray(city)]|
+------------------------------------------------+--------------------------------------------+
And you can convert the arrays to lists if required.
To get the values as in the question, you can do the following
val rdd = df.collect().map(row => (row(0).asInstanceOf[Seq[Seq[String]]], row(1).asInstanceOf[Seq[Seq[String]]]))
val first = rdd(0)._1.map(x => x.toList).toList
//first: List[List[String]] = List(List(name1, name2), List(add1))
val second = rdd(0)._2.map(x => x.toList).toList
//second: List[List[String]] = List(List(id1, id2), List(city))
I hope the answer is helpful
reduceByKey is the important function for achieving your required output. For more detail, see a step-by-step explanation of reduceByKey.
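In short (a hedged illustration using the pairs built below), for each level key the reduce function concatenates both lists of the two values pairwise:

// ("1", (List("name1"), List("id1"))) and ("1", (List("name2"), List("id2")))
// reduce to ("1", (List("name1", "name2"), List("id1", "id2")))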
You can do the following
val input="[level:1,var1:name1,var2:id1][level:1,var1:name2,var2:id2][level:2,var1:add1,var2:city]"
val groupedrdd = sc.parallelize(Seq(input)).flatMap(_.split("]\\[").map(x => {
  val values = x.replace("[", "").replace("]", "").split(",").map(y => y.split(":")(1))
  (values(0), (List(values(1)), List(values(2))))
})).reduceByKey((x, y) => (x._1 ::: y._1, x._2 ::: y._2))
val first = groupedrdd.map(x => x._2._1).collect().toList
//first: List[List[String]] = List(List(add1), List(name1, name2))
val second = groupedrdd.map(x => x._2._2).collect().toList
//second: List[List[String]] = List(List(city), List(id1, id2))
I'm trying to load several input files into a single dataframe:
val inputs = List[String]("input1.txt", "input2.txt", "input3.txt")
val dataFrames = for (
  i <- inputs;
  df <- sc.textFile(i).toDF()
) yield { df }
val inputDataFrame = unionAll(dataFrames, sqlContext)
// union of all given DataFrames
private def unionAll(dataFrames: Seq[DataFrame], sqlContext: SQLContext): DataFrame = dataFrames match {
  case Nil => sqlContext.emptyDataFrame
  case head :: Nil => head
  case head :: tail => head.unionAll(unionAll(tail, sqlContext))
}
The compiler says:
Error:(40, 8) type mismatch;
found : org.apache.spark.rdd.RDD[org.apache.spark.sql.Row]
required: scala.collection.GenTraversableOnce[?]
df <- sc.textFile(i).toDF()
^
Any idea?
First, SQLContext.read.text(...) accepts multiple filename arguments, so you can simply do:
val inputs = List[String]("input1.txt", "input2.txt", "input3.txt")
val inputDataFrame = sqlContext.read.text(inputs: _*)
Or:
val inputDataFrame = sqlContext.read.text("input1.txt", "input2.txt", "input3.txt")
As for your code - when you write:
val dataFrames = for (
  i <- inputs;
  df <- sc.textFile(i).toDF()
) yield df
It is translated into:
inputs.flatMap(i => sc.textFile(i).toDF().map(df => df))
This can't compile, because flatMap expects a function that returns a GenTraversableOnce[?], while the supplied function returns an RDD[Row] (see the signature of DataFrame.map). In other words, when you write df <- sc.textFile(i).toDF() you're actually taking each row in the dataframe and yielding a new RDD with those rows, which isn't what you intended.
What you were trying to do is simpler:
val dataFrames = for (
  i <- inputs
) yield sc.textFile(i).toDF()
But, as mentioned at the beginning, the recommended approach is using sqlContext.read.text.
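If you still want to keep an explicit list of frames and your own union helper, a minimal sketch combining the two (reusing the unionAll from the question):

val dataFrames = inputs.map(i => sc.textFile(i).toDF())
val inputDataFrame = unionAll(dataFrames, sqlContext)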