How can I loop through a Spark data frame?
I have a data frame that consists of:
time, id, direction
10, 4, True //here 4 enters --> (4,)
20, 5, True //here 5 enters --> (4,5)
34, 5, False //here 5 leaves --> (4,)
67, 6, True //here 6 enters --> (4,6)
78, 6, False //here 6 leaves --> (4,)
99, 4, False //here 4 leaves --> ()
It is sorted by time, and now I would like to step through and accumulate the valid ids. The ids enter on direction==True and exit on direction==False,
so the resulting RDD should look like this:
time, valid_ids
(10, (4,))
(20, (4,5))
(34, (4,))
(67, (4,6))
(78, (4,))
(99, ())
I know that this will not parallelize, but the df is not that big. So how could this be done in Spark/Scala?
If the data is small ("but the df is not that big"), I'd just collect it and process it using Scala collections. If the types are as shown below:
df.printSchema
root
|-- time: integer (nullable = false)
|-- id: integer (nullable = false)
|-- direction: boolean (nullable = false)
you can collect:
val data = df.as[(Int, Int, Boolean)].collect.toSeq
and scanLeft:
val result = data.scanLeft((-1, Set[Int]())) {
  case ((_, acc), (time, value, true))  => (time, acc + value)
  case ((_, acc), (time, value, false)) => (time, acc - value)
}.tail
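If you then need the result back in Spark (the question mentions an RDD), here is a minimal sketch, assuming a SparkSession named spark is in scope:

import spark.implicits._

// back to an RDD, as asked for in the question
val resultRDD = spark.sparkContext.parallelize(result)

// or to a DataFrame (the Set is converted to a Seq so a standard encoder applies)
val resultDF = result.map { case (t, ids) => (t, ids.toSeq) }.toDF("time", "valid_ids")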
Using var is not recommended for Scala developers, but I am still posting an answer that uses var:
var collectArray = Array.empty[Int]
df.rdd.collect().map { row =>
  // when direction is true the id enters; when false, that specific id leaves
  if (row(2).toString.equalsIgnoreCase("true"))
    collectArray = collectArray :+ row(1).asInstanceOf[Int]
  else
    collectArray = collectArray.filterNot(_ == row(1).asInstanceOf[Int])
  (row(0), collectArray.toList)
}
This should give you the result:
(10,List(4))
(20,List(4, 5))
(34,List(4))
(67,List(4, 6))
(78,List(4))
(99,List())
Suppose the name of the respective data frame is someDF, then do:
val df1 = someDF.rdd.collect.iterator

while (df1.hasNext) {
  println(df1.next)
}
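As a side note, if all you want to do is print every row, you could also skip the explicit iterator (just a sketch):

someDF.collect().foreach(println)
// or, staying with the RDD as above
someDF.rdd.collect().foreach(println)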
Related
I am trying to read a huge file (data in comma separated values)
The input file contains millions of rows in the below format. The file is also close to 8GB.
Ex:
1, 22, Begin, session1
1, 33, End, session1
2, 20, Begin, session1
2, 30, End, session1
1, 30, Begin, session2
1, 50, End, session2
3, 90, Begin, session1
4, 10, Begin, session1
3, 100, End, session1
4, 20, End, session1
3, 200, OPEN, session2
The first value is the RECORDID, the second is its WEIGHT, the third is the TRANSACTION_STATUS, and the fourth is the SESSIONID.
For each RECORDID, I have to calculate the average WEIGHT over all of its sessions, where a session's weight is the difference between its END and BEGIN weights.
If there is a session id without END, it should be ignored.
Example Output:
RECORD ID => 1, WEIGHTS => (33-22) = 11, (50-30)=20 => Average 15.5
RECORD ID => 2, WEIGHTS => (30-20) = 10 => Average 10.0
RECORD ID => 3, WEIGHTS => (100-90) = 10 => Average 10.0
RECORD ID => 4, WEIGHTS => (20-10) = 10 => Average 10.0
Final output:
1, 15.5
2, 10.0
3, 10.0
4, 10.0
I started to code like below:
case class Users(recordid: Int, weight: Int, transaction_status: String, sessionid: String)
val userList = List[Users]()
val in = new BufferedReader(new InputStreamReader(new FileInputStream("/Users/Desktop/sessionfile.txt")))
Iterator continually in.readLine takeWhile (_ != null) foreach(println)
As my input file is big, I used an InputStreamReader and an Iterator to read records from it. But I am a bit confused here, because I previously did this activity on a Spark Dataset, where I created objects of my case class to represent the data as a Dataset[Users] and as a DataFrame with Spark SQL.
In this case, the code should be written in plain Scala, without using Spark or SQL.
Could anyone let me know an efficient way to achieve this in plain Scala code? Any help is really appreciated.
Here's one way to go about it.
val beginRE = raw"\s*(\d+)\s*,\s*(\d+)\s*,\s*Begin.*".r
val endRE = raw"\s*(\d+)\s*,\s*(\d+)\s*,\s*End.*".r
util.Using(io.Source.fromFile("./inFile.csv")) {
  _.getLines().foldLeft(Map[String, (Long, Int, Int)]()) {
    case (acc, beginRE(recID, wght)) => // Begin record
      val (rt, cnt, _) = acc.getOrElse(recID, (0L, 0, 0))
      acc + (recID -> (rt, cnt, wght.toInt))
    case (acc, endRE(recID, wght)) => // End record
      val (rt, cnt, bgn) = acc.getOrElse(recID, (0L, 0, -1))
      if (bgn < 0) {
        println(s"orphan End record: '$recID,$wght,...'")
        acc
      } else
        acc + (recID -> (rt + wght.toInt - bgn, cnt + 1, -1))
    case (acc, rec) => // bad record
      println(s"bad record: $rec")
      acc
  }
}.map(_.map { case (k, (rt, cnt, _)) => k -> rt / cnt.toDouble })
//res0: Try[Map[String,Double]] =
// Success(Map(1 -> 15.5, 2 -> 10.0, 3 -> 10.0, 4 -> 10.0))
If there are multiple Begin records, only the last one counts; the rest are ignored.
If there are multiple End records, only the first one counts; the rest are reported as "orphaned".
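To print the final output in the requested "id, average" format, here is a small usage sketch, assuming the whole Using expression above is bound to a val named result:

// result: scala.util.Try[Map[String, Double]] (the value of the expression above)
result.foreach { averages =>
  averages.toSeq.sortBy(_._1).foreach { case (recID, avg) =>
    println(s"$recID, $avg")  // e.g. "1, 15.5"
  }
}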
Say I have a big array of some structure (Int in the example below, for simplicity) and I want to filter this array and take the first n elements. How can I do it?
Example:
val outerVar = 22
def filterFunction(a: Int): Boolean = {
  if (true /* some condition goes here */) return false
  if (a > 12) return true
  if (a > 750) return false
  if (a == 42) return true
  if (a == outerVar) return true
  // etc.: conditions that can use the outer scope
  false
}
val n = 42
val bigArray = Array(1, 2, 3, 4, 5)
val result = bigArray.filter(element => filterFunction(element))
//.limit(n) (something like this)
// how do I stop filling result once it contains n elements?
I believe your predicate filterFunction is not going to do its work, since it always returns false.
Let's consider a toy example where we have an Array[Int] and we need to apply filter on it with predicate filterFunction so that the evaluation stops once n elements have been fetched:
scala> :paste
// Entering paste mode (ctrl-D to finish)
val array = Array(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
val filterFunction = (a: Int) => a > 5
val getLazy = (n: Int) => array.view.filter(filterFunction).take(n).toArray
getLazy(2)
// Exiting paste mode, now interpreting.
array: Array[Int] = Array(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
filterFunction: Int => Boolean = <function1>
getLazy: Int => Array[Int] = <function1>
res0: Array[Int] = Array(6, 7)
array.view.filter(filterFunction).take(n) becomes a lazy expression (which is not evaluated right away), and toArray actually runs the computation.
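To see the laziness in action, here is a small sketch (the mutable counter is only for illustration) that counts how many times the predicate is actually invoked:

var calls = 0
val countingFilter = (a: Int) => { calls += 1; a > 5 }

val firstTwo = array.view.filter(countingFilter).take(2).toArray
// firstTwo: Array(6, 7)
// calls: 7, because evaluation stopped as soon as two matching elements were found,
// whereas a strict filter would have invoked the predicate on all 10 elements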
I have a DataFrame with around 400 columns, and I want to drop 100 of them as per my requirement.
So I have created a Scala List of 100 column names.
I then want to iterate through a for loop to actually drop a column in each iteration.
Below is the code.
final val dropList: List[String] = List("Col1","Col2",...."Col100")

def drpColsfunc(inputDF: DataFrame): DataFrame = {
  for (i <- 0 to dropList.length - 1) {
    val returnDF = inputDF.drop(dropList(i))
  }
  return returnDF
}
val test_df = drpColsfunc(input_dataframe)
test_df.show(5)
If you just want to do nothing more complex than dropping several named columns, as opposed to selecting them by a particular condition, you can simply do the following:
df.drop("colA", "colB", "colC")
Answer:
import org.apache.spark.sql.Column

val colsToRemove = Seq("colA", "colB", "colC", etc)

val filteredDF = df.select(
  df.columns
    .filter(colName => !colsToRemove.contains(colName))
    .map(colName => new Column(colName)): _*
)
This should work fine:

val dropList: List[String]
val df: DataFrame

val test_df = df.drop(dropList: _*)
You can just do,
def dropColumns(inputDF: DataFrame, dropList: List[String]): DataFrame =
dropList.foldLeft(inputDF)((df, col) => df.drop(col))
It will return the DataFrame without the columns passed in dropList.
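For instance, a small usage sketch (input_dataframe and the column names are just the placeholders from the question):

// hypothetical usage with the placeholder names from the question
val test_df = dropColumns(input_dataframe, List("Col1", "Col2", "Col100"))
test_df.show(5)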
As an example (of what's happening behind the scenes), let me put it this way.
scala> val list = List(0, 1, 2, 3, 4, 5, 6, 7)
list: List[Int] = List(0, 1, 2, 3, 4, 5, 6, 7)
scala> val removeThese = List(0, 2, 3)
removeThese: List[Int] = List(0, 2, 3)
scala> removeThese.foldLeft(list)((l, r) => l.filterNot(_ == r))
res2: List[Int] = List(1, 4, 5, 6, 7)
The returned list (in our case, think of it as your DataFrame) is the result of the latest filter. After each fold step, that latest result is passed on to the next application of the function (l, r) => l.filterNot(_ == r).
You can use the drop operation to drop multiple columns. If you have the column names you need to drop in a list, you can pass it using :_* after the list variable, and it will drop all the columns in that list.
Scala:
val df = Seq(("One","Two","Three"),("One","Two","Three"),("One","Two","Three")).toDF("Name","Name1","Name2")
val columnstoDrop = List("Name","Name1")
val df1 = df.drop(columnstoDrop:_*)
Python:
In Python you can use the * operator to do the same thing.
data = [("One", "Two","Three"), ("One", "Two","Three"), ("One", "Two","Three")]
columns = ["Name","Name1","Name2"]
df = spark.sparkContext.parallelize(data).toDF(columns)
columnstoDrop = ["Name","Name1"]
df1 = df.drop(*columnstoDrop)
Now in df1 you would get the DataFrame with only one column, i.e. Name2.
I am rather new to Spark and Scala... I have a graph: Graph[Int, String] and I'd like to attach to its vertices some properties I have in a DataFrame.
What I need to do is, for each vertex, to find the average value in the neighbourhood for each property. This is my approach so far, but I don't understand how to correctly map the Row I get from the join of the two data frames:
val res = graph.collectNeighbors(EdgeDirection.Either)
.toDF("ID", "neighbours")
.join(aDataFrameWithProperties, "ID")
.map{x => // this is where I am lost
}
I don't think my approach is right, because I join the properties of each vertex with the array of its neighbours, but I still don't know the values of the properties for the neighbours...
EDIT
Some data to help understand what I want to accomplish... say you build the graph as in this answer to how to create EdgeRDD from data frame in Spark
val sqlc : SQLContext = ???
case class Person(id: Long, country: String, age: Int)
val testPeople = Seq(
Person(1, "Romania" , 15),
Person(2, "New Zealand", 30),
Person(3, "Romania" , 17),
Person(4, "Iceland" , 20),
Person(5, "Romania" , 40),
Person(6, "Romania" , 44),
Person(7, "Romania" , 45),
Person(8, "Iceland" , 21),
Person(9, "Iceland" , 22)
)
val people = sqlc.createDataFrame(testPeople)
val peopleR = people
.withColumnRenamed("id" , "idR")
.withColumnRenamed("country", "countryR")
.withColumnRenamed("age" , "ageR")
import org.apache.spark.sql.functions._
val relations = people.join(peopleR,
(people("id") < peopleR("idR")) &&
(people("country") === peopleR("countryR")) &&
(abs(people("age") - peopleR("ageR")) < 5))
import org.apache.spark.graphx._
val edges = EdgeRDD.fromEdges(relations.map(row => Edge(
row.getAs[Long]("id"), row.getAs[Long]("idR"), ())))
val users = VertexRDD.apply(people.map(row => (row.getAs[Int]("id").toLong, row.getAs[Int]("id").toInt)))
val graph = Graph(users, edges)
Then you have a data frame like:
case class Person(id:Long, gender:Int, income:Int)
val properties = Seq(
Person(1, 0, 321),
Person(2, 1, 212),
Person(3, 0, 212),
Person(4, 0, 122),
Person(5, 1, 898),
Person(6, 1, 212),
Person(7, 1, 22),
Person(8, 0, 8),
Person(9, 0, 212)
)
val people = sqlc.createDataFrame(properties)
I'd like to compute, for each vertex, the average gender and the average income of its neighbours, returned as a DataFrame.
Generally speaking, you should use graph operators instead of converting everything to a DataFrame, but something like this should do the trick:
import org.apache.spark.sql.functions.{explode, avg}
val statsDF = graph.collectNeighbors(EdgeDirection.Either)
  .toDF("ID", "neighbours")
  // Flatten neighbours column
  .withColumn("neighbour", explode($"neighbours"))
  // and extract neighbour id
  .select($"ID".alias("this_id"), $"neighbour._1".alias("other_id"))
  // join with people
  .join(people, people("ID") === $"other_id")
  .groupBy($"this_id")
  .agg(avg($"gender"), avg($"income"))
What if, instead of an average, I'd like to count, say, the number of neighbours with the same gender as myself, and then find the average over all connections?
To do this you would need two separate joins: one on this_id and one on other_id. Next you can simply aggregate with the following expression:
avg((this_gender === other_gender).cast("integer"))
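Here is a rough sketch of those two joins, reusing the exploded neighbour pairs from the snippet above (the column aliases are only illustrative):

val pairs = graph.collectNeighbors(EdgeDirection.Either)
  .toDF("ID", "neighbours")
  .withColumn("neighbour", explode($"neighbours"))
  .select($"ID".alias("this_id"), $"neighbour._1".alias("other_id"))

// one join to pick up this vertex's gender, one for the neighbour's gender
val withGenders = pairs
  .join(people.select($"id".alias("this_id"), $"gender".alias("this_gender")), Seq("this_id"))
  .join(people.select($"id".alias("other_id"), $"gender".alias("other_gender")), Seq("other_id"))

val sameGenderStats = withGenders
  .groupBy($"this_id")
  .agg(avg(($"this_gender" === $"other_gender").cast("integer")).alias("frac_same_gender"))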
Regarding graph operators, there are a few operations you can use. For starters, you can use a join operation to add properties to the vertices:
val properties: RDD[(VertexId, (Int, Int))] = sc.parallelize(Seq(
(1L, (0, 321)), (2L, (1, 212)), (3L, (0, 212)),
(4L, (0, 122)), (5L, (1, 898)), (6L, (1, 212)),
(7L, (1, 22)), (8L, (0, 8)), (9L, (0, 212))
))
val graphWithProperties = graph
.outerJoinVertices(properties)((_, _, prop) => prop)
// For simplicity this assumes no missing values
.mapVertices((_, props) => props.get)
Next, we can aggregate messages to create a new VertexRDD:
val neighboursAggregated = graphWithProperties
  .aggregateMessages[(Int, (Int, Int))](
    triplet => {
      triplet.sendToDst((1, triplet.srcAttr))
      triplet.sendToSrc((1, triplet.dstAttr))
    },
    { case ((cnt1, (gender1, inc1)), (cnt2, (gender2, inc2))) =>
      (cnt1 + cnt2, (gender1 + gender2, inc1 + inc2)) }
  )
Finally, we can replace the existing properties:
graphWithProperties.outerJoinVertices(neighboursAggregated)(
  (_, oldProps, newProps) => newProps match {
    case Some((cnt, (gender, inc))) => Some((
      if (oldProps._1 == 1) gender.toDouble / cnt
      else 1 - gender.toDouble / cnt,
      inc.toDouble / cnt
    ))
    case _ => None
  })
If you're interested only in the values, you can pass all required values in aggregateMessages and omit the second outerJoinVertices.
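For example, a minimal sketch that derives the per-vertex averages straight from the aggregated (count, (gender sum, income sum)) values, without the second join:

val neighbourAverages = neighboursAggregated.map {
  case (id, (cnt, (genderSum, incomeSum))) =>
    (id, (genderSum.toDouble / cnt, incomeSum.toDouble / cnt))
}
// RDD[(VertexId, (Double, Double))]: average gender and income of each vertex's neighbours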
My dataset is an RDD[Array[String]] with more than 140 columns. How can I select a subset of columns without hard-coding the column numbers (.map(x => (x(0),x(3),x(6)...)))?
This is what I've tried so far (with success):
val peopleTups = people.map(x => x.split(",")).map(i => (i(0),i(1)))
However, I need more than a few columns, and would like to avoid hard-coding them.
This is what I've tried so far (that I think would be better, but has failed):
// Attempt 1
val colIndices = List(0, 3, 6, 10, 13)
val peopleTups = people.map(x => x.split(",")).map(i => i(colIndices))
// Error output from attempt 1:
<console>:28: error: type mismatch;
found : List[Int]
required: Int
val peopleTups = people.map(x => x.split(",")).map(i => i(colIndices))
// Attempt 2
colIndices map peopleTups.lift
// Attempt 3
colIndices map peopleTups
// Attempt 4
colIndices.map(index => peopleTups.apply(index))
I found this question and tried it, but because I'm looking at an RDD instead of an array, it didn't work: How can I select a non-sequential subset elements from an array using Scala and Spark?
You should map over the RDD instead of the indices.
val list = List.fill(2)(Array.range(1, 6))
// List(Array(1, 2, 3, 4, 5), Array(1, 2, 3, 4, 5))
val rdd = sc.parallelize(list) // RDD[Array[Int]]
val indices = Array(0, 2, 3)
val selectedColumns = rdd.map(array => indices.map(array)) // RDD[Array[Int]]
selectedColumns.collect()
// Array[Array[Int]] = Array(Array(1, 3, 4), Array(1, 3, 4))
What about this?
val data = sc.parallelize(List("a,b,c,d,e", "f,g,h,i,j"))
val indices = List(0,3,4)
data.map(_.split(",")).map(ss => indices.map(ss(_))).collect
This should give:
res1: Array[List[String]] = Array(List(a, d, e), List(f, i, j))