How to get value of previous row in scala apache rdd[row]? - scala

I need to get value from previous or next row while Im iterating through RDD[Row]
(10,1,string1)
(11,1,string2)
(21,1,string3)
(22,1,string4)
I need to sum strings for rows where difference between 1st value is not higher than 3. 2nd value is ID. So the result should be:
(1, string1string2)
(1, string3string4)
I tried use groupBy, reduce, partitioning but still I can't achieve what I want.
I'm trying to make something like this(I know it's not proper way):
rows.groupBy(row => {
row(1)
}).map(rowList => {
rowList.reduce((acc, next) => {
diff = next(0) - acc(0)
if(diff <= 3){
val strings = acc(2) + next(2)
(acc(1), strings)
}else{
//create new group to aggregatre strings
(acc(1), acc(2))
}
})
})
I wonder if my idea is proper to solve this problem.
Looking for help!

I think you can use sqlContext to Solve your problem by using lag function
Create RDD:
val rdd = sc.parallelize(List(
(10, 1, "string1"),
(11, 1, "string2"),
(21, 1, "string3"),
(22, 1, "string4"))
)
Create DataFrame:
val df = rdd.map(rec => (rec._1.toInt, rec._2.toInt, rec._3.toInt)).toDF("a", "b", "c")
Register your Dataframe:
df.registerTempTable("df")
Query the result:
val res = sqlContext.sql("""
SELECT CASE WHEN l < 3 THEN ROW_NUMBER() OVER (ORDER BY b) - 1
ELSE ROW_NUMBER() OVER (ORDER BY b)
END m, b, c
FROM (
SELECT b,
(a - CASE WHEN lag(a, 1) OVER (ORDER BY a) is not null
THEN lag(a, 1) OVER (ORDER BY a)
ELSE 0
END) l, c
FROM df) A
""")
Show the Results:
res.show
I Hope this will Help.

Related

Spark DataFrames Scala - jump to next group during a loop

I have a dataframe as below and it records the quarter and date during which same incident occurs to which IDs.
I would like to mark an ID and date if the incident happen at two consecutive quarters. And this is how I did it.
val arrArray = dtf.collect.map(x => (x(0).toString, x(1).toString, x(2).toString.toInt))
if (arrArray.length > 0) {
var bufBAQDate = ArrayBuffer[Any]()
for (a <- 1 to arrArray.length - 1) {
val (strBAQ1, strDate1, douTime1) = arrArray(a - 1)
val (strBAQ2, strDate2, douTime2) = arrArray(a)
if (douTime2 - douTime1 == 15 && strBAQ1 == strBAQ2 && strDate1 == strDate2) {
bufBAQDate = (strBAQ2, strDate2) +: bufBAQDate
//println(strBAQ2+" "+strDate2+" "+douTime2)
}
}
val vecBAQDate = bufBAQDate.distinct.toVector.reverse
Is there a better way of doing it? As the same insident can happen many times to one ID during a single day, it is better to jump to the next ID and/or date once an ID and a date is marked. I dont want to create nested loops to filter dataframe.
Note that you current solution misses 20210103 as 1400 - 1345 = 55
I think this does the trick
val windowSpec = Window.partitionBy("ID")
.orderBy("datetime_seconds")
val consecutiveIncidents = dtf.withColumn("raw_datetime", concat($"Date", $"Time"))
.withColumn("datetime", to_timestamp($"raw_datetime", "yyyyMMddHHmm"))
.withColumn("datetime_seconds", $"datetime".cast(LongType))
.withColumn("time_delta", ($"datetime_seconds" - lag($"datetime_seconds", 1).over(windowSpec)) / 60)
.filter($"time_delta" === lit(15))
.select("ID", "Date")
.distinct()
.collect()
.map { case Row(id, date) => (id, date) }
.toList
Basically - convert the datetimes to timestamps, then look for records with the same ID and consecutive times, with their times separated by 15 minutes.
This is done by using lag over a window grouped by ID and ordered by the time.
In order to calculate the time difference The timestamp is converted to unix epoch seconds.
If you don't want to count day-crossing incidients, you can add the date to the groupyBy clause of the window

combining slick queries into single query

Using Slick 3.1, how do I combine multiple Queries into a single query for the same type? This is not a join or a union, but combining query "segments" to create a single query request. These "segments" can be any individually valid query.
val query = TableQuery[SomeThingValid]
// build up pieces of the query in various parts of the application logic
val q1 = query.filter(_.value > 10)
val q2 = query.filter(_.value < 40)
val q3 = query.sortBy(_.date.desc)
val q4 = query.take(5)
// how to combine these into a single query ?
val finalQ = ??? q1 q2 q3 q4 ???
// in order to run in a single request
val result = DB.connection.run(finalQ.result)
EDIT:
the expected sql should be something like:
SELECT * FROM "SomeThingValid" WHERE "SomeThingValid"."value" > 10 AND "SomeThingValid"."valid" < 40 ORDER BY "MemberFeedItem"."date" DESC LIMIT 5
val q1 = query.filter(_.value > 10)
val q2 = q1.filter(_.value < 40)
val q3 = q2.sortBy(_.date.desc)
val q4 = q3.take(5)
I think you should do something like the above (and pass around Querys) but if you insist on passing around query "segments", something like this could work:
type QuerySegment = Query[SomeThingValid, SomeThingValid, Seq] => Query[SomeThingValid, SomeThingValid, Seq]
val q1: QuerySegment = _.filter(_.value > 10)
val q2: QuerySegment = _.filter(_.value < 40)
val q3: QuerySegment = _.sortBy(_.date.desc)
val q4: QuerySegment = _.take(5)
val finalQ = Function.chain(Seq(q1, q2, q3, q4))(query)
I've used this "pattern" with slick2.0
val query = TableQuery[SomeThingValid]
val flag1, flag3 = false
val flag2, flag4 = true
val bottomFilteredQuery = if(flag1) query.filter(_.value > 10) else query
val topFilteredQuery = if(flag2) bottomFilteredQuery.filter(_.value < 40) else bottomFilteredQuery
val sortedQuery = if(flag3) topFilteredQuery.soryBy(_.date.desc) else topFilteredQuery
val finalQ = if(flag4) sortedQuery.take(5) else sortedQuery
It's Just a worth remark to mention here from the slick essential book, it seems that you might need to avoid combining multiple queries in one single query.
Combining actions to sequence queries is a powerful feature of Slick.
However, you may be able to reduce multiple queries into a single
database query. If you can do that, you’re probably better off doing
it.
I think, it should work. But I didn't test it yet.
val users = TableQuery[Users]
val filter1: Query[Users, User, Seq] = users.filter(condition1)
val filter2: Query[Users, User, Seq] = users.filter(condition2)
(filter1 ++ filter2).result.headOption

Null value Left Outer Join in Spark

I have two RDD and got the left join
left join p1.leftOuterJoin(p2) the result is like:
Array[((String, String), (Int, Option[Int]))] =
Array(((1001-150329-002-0-04624,5060567),(1,None)), ((1002-141105-008-0-01934,10145500),(1,None)), ((1013-150324-009-0-02270,15750046),(1,None)), ((1005-150814-005-0-05885,5060656),(1,Some(1))), ((1009-150318-004-0-02537,5060583),(1,None)))
I want to replace all None with 0 and get a clean data set like:
Array(((1001-150329-002-0-04624,5060567),0), ((1002-141105-008-0-01934,10145500),0), ((1013-150324-009-0-02270,15750046),0), ((1005-150814-005-0-05885,5060656),1)), ((1009-150318-004-0-02537,5060583),0))
Basically replace all (1,None) with 0 and (1,Some(1)) with 1
If you are looking for the value of the Option when Some or 0 when None, I would implement it with .map and .getOrElse:
a.map { case (k, (_, o)) => (k, o.getOrElse(0)) }
The result match the expected one:
Array(
((1001-150329-002-0-04624,5060567),0),
((1002-141105-008-0-01934,10145500),0),
((1013-150324-009-0-02270,15750046),0),
((1005-150814-005-0-05885,5060656),1)),
((1009-150318-004-0-02537,5060583),0))

How to make aggregations with slick

I want to force slick to create queries like
select max(price) from coffees where ...
But slick's documentation doesn't help
val q = Coffees.map(_.price) //this is query Query[Coffees.type, ...]
val q1 = q.min // this is Column[Option[Double]]
val q2 = q.max
val q3 = q.sum
val q4 = q.avg
Because those q1-q4 aren't queries, I can't get the results but can use them inside other queries.
This statement
for {
coffee <- Coffees
} yield coffee.price.max
generates right query but is deprecated (generates warning: " method max in class ColumnExtensionMethods is deprecated: Use Query.max instead").
How to generate such query without warnings?
Another issue is to aggregate with group by:
"select name, max(price) from coffees group by name"
Tried to solve it with
for {
coffee <- Coffees
} yield (coffee.name, coffee.price.max)).groupBy(x => x._1)
which generates
select x2.x3, x2.x3, x2.x4 from (select x5."COF_NAME" as x3, max(x5."PRICE") as x4 from "coffees" x5) x2 group by x2.x3
which causes obvious db error
column "x5.COF_NAME" must appear in the GROUP BY clause or be used in an aggregate function
How to generate such query?
As far as I can tell is the first one simply
Query(Coffees.map(_.price).max).first
And the second one
val maxQuery = Coffees
.groupBy { _.name }
.map { case (name, c) =>
name -> c.map(_.price).max
}
maxQuery.list
or
val maxQuery = for {
(name, c) <- Coffees groupBy (_.name)
} yield name -> c.map(_.price).max
maxQuery.list

scalaquery retrieve values

I have few tables, lets say 2 for simplicity. I can create them in this way,
...
val tableA = new Table[(Int,Int)]("tableA"){
def a = column[Int]("a")
def b = column[Int]("b")
}
val tableB = new Table[(Int,Int)]("tableB"){
def a = column[Int]("a")
def b = column[Int]("b")
}
Im going to have a query to retrieve value 'a' from tableA and value 'a' from tableB as a list inside the results from 'a'
my result should be:
List[(a,List(b))]
so far i came upto this point in query,
def createSecondItr(b1:NamedColumn[Int]) = for(
b2 <- tableB if b1 === b1.b
) yield b2.a
val q1 = for (
a1 <- tableA
listB = createSecondItr(a1.b)
) yield (a1.a , listB)
i didn't test the code so there might be errors in the code. My problem is I cannot retrieve data from the results.
to understand the question, take trains and classes of it. you search the trains after 12pm and you need to have a result set where the train name and the classes which the train have as a list inside the train's result.
I don't think you can do this directly in ScalaQuery. What I would do is to do a normal join and then manipulate the result accordingly:
import scala.collection.mutable.{HashMap, Set, MultiMap}
def list2multimap[A, B](list: List[(A, B)]) =
list.foldLeft(new HashMap[A, Set[B]] with MultiMap[A, B]){(acc, pair) => acc.addBinding(pair._1, pair._2)}
val q = for (
a <- tableA
b <- tableB
if (a.b === b.b)
) yield (a.a, b.a)
list2multimap(q.list)
The list2multimap is taken from https://stackoverflow.com/a/7210191/66686
The code is written without assistance of an IDE, compiler or similar. Consider the debugging free training :-)