Scala: exclude some Dates

I have a list of dates that I want to ignore:
private val excludeDates = List(
  new DateTime("2015-07-17"),
  new DateTime("2015-07-20"),
  new DateTime("2015-07-23")
)
But I always need to display four dates, excluding the dates in my blacklist and the weekends. With the following code, my counter is still consumed when I hit an ignored date, which makes sense. So how can I keep jumping to the next date until I have 4 dates that are neither in my blacklist nor on a weekend? Maybe with a while loop, but I don't know how to fit one into my Scala code:
1 to 4 map { endDate.minusDays(_) } diff excludeDates filter {
  _.getDayOfWeek() match {
    case DateTimeConstants.SUNDAY | DateTimeConstants.SATURDAY => false
    case _ => true
  }
}

You could use a Stream:
val blacklist = excludeDates.toSet
Stream.from(1)
  .map(endDate.minusDays(_))
  .filter(dt => !blacklist.contains(dt))
  .take(4)
  .toList

In a quick and rough way, I would do it like this:
val upperLimit = 4 + excludeDates.length
(1 to upperLimit).map(endDate.minusDays).filter(d => !excludeDates.contains(d)).take(4)
In short, you go backward from the end date by at most the number of dates you need plus the size of the excluded list, then you filter the sequence, keeping the dates that are not in the excluded list, and finally you pick only the dates you need with .take(n).
Hope it helps :)

Related

How to get the number of lines from an RDD which contain any digits

The lines of the document are as follows:
I am 12 year old.
I go to school.
I am playing.
Its 4 pm.
Two lines of the document contain numbers. I want to count how many lines in the document contain a number.
This is to be implemented in Scala Spark.
val lineswithnum = linesRdd.filter(line => line.contains("[^0-9]")).count()
I expect the output to be 2, but I am getting 0.
You can use the exists method:
val lineswithnum = linesRdd.filter(line => line.exists(_.isDigit)).count()
In line with your original approach and not discounting the other answer(s):
val textFileLines = sc.textFile("/FileStore/tables/so99.txt")
val linesWithNumCollect = textFileLines.filter(_.matches(".*[0-9].*")).count
The .* is added on both sides so the pattern matches a digit anywhere within the line.

How to check that each element in a List is before the next element

I have a list containing a number of elements of the LocalDateTime type, as follows.
List(2017-06-25T00:00, 2017-06-25T00:05:13, 2017-06-25T00:11:11, 2017-06-25T00:17:39, 2017-06-25T00:24:44, 2017-06-25T00:32:33, 2017-06-25T00:41:11, 2017-06-25T01:01:03)
I want to check that each element of the list is before the next element; for example, that the first value of the list is before the second value. Likewise, I want to check this for all elements of the list. Can anyone help me?
This will test whether a List of well-formatted date strings is in chronological order.
import java.time.LocalDateTime

val dates: List[String] = List( "2017-06-25T00:00"
                              , "2017-06-25T00:05:13"
                              // etc.
                              , "2017-06-25T01:01:03"
                              )

dates.iterator
  .map(LocalDateTime.parse)
  .sliding(2)
  .forall(x => x(0) isBefore x(1)) // returns true/false
The .iterator is included so that the date-strings can be parsed lazily, i.e. if one of the first dates is out of order, then the subsequent dates in the list don't have to be parsed at all.
Here is a method for checking that each date string in the sequence is earlier than the next (it assumes that you have a method called isBefore that compares two of these Strings):
import scala.annotation.tailrec

@tailrec
def isMonotonouslyIncreasing(seq: Seq[String]): Boolean = {
  if (seq.size <= 1) true
  else if (isBefore(seq(0), seq(1))) isMonotonouslyIncreasing(seq.tail)
  else false
}

How can I split the data based on time in netcdf through SciSpark?

val data: RDD[SciTensor] = ...
data.map(y => {
  val time = y("time")
})
How do we get the units of time and the long name of precip in SciSpark? The ncdump results are shown below:
time:units = "hours since 1800-01-01 00:00:0.0".
float precip(time, lat, lon) ;
precip:long_name = "Average Monthly Rate of Precipitation"
Thank you for asking the question.
SciSpark is working towards preserving units.
To answer your original question about splitting by time: looking at your variable, you can definitely split by time if the time dimension is greater than 1. Otherwise, you don't really need to.
The precip variable array has dimensions time, lat, and lon.
If you want to split by each time epoch, you can access the sub-array at each time index like so:
val array = y("time")()
val time1 = array(0)
val time2 = array(1)
val time3 = array(2)
etc.
If you want to extract the sub-arrays in each time dimension and have the RDD be a collection of these sub-arrays, you can do that like so:
data.map(y => {
  val timeArray = y("time")()
  val timeLength = timeArray.shape(0)
  (0 until timeLength).map(i => timeArray(i))
})
This will give you an RDD of type RDD[Iterable[AbstractTensor]]
The Iterable[AbstractTensor] corresponds to the original array which you have split by time.
You can go further and call flatMap to get an RDD of type RDD[AbstractTensor], like so:
data.flatMap(y => {
  val timeArray = y("time")()
  val timeLength = timeArray.shape(0)
  (0 until timeLength).map(i => timeArray(i))
})
Make sure you are using the latest version of SciSpark; some of the indexing functionality was introduced only recently.

Calculate sums of even/odd pairs on Hadoop?

I want to create a parallel scanLeft (which computes prefix sums for an associative operator) for Hadoop (Scalding in particular; see below for how this is done).
Given a sequence of numbers in an HDFS file (one per line), I want to calculate a new sequence with the sums of consecutive even/odd pairs. For example:
input sequence:
0,1,2,3,4,5,6,7,8,9,10
output sequence:
0+1, 2+3, 4+5, 6+7, 8+9, 10
i.e.
1,5,9,13,17,10
I think that in order to do this, I need to write InputFormat and InputSplits classes for Hadoop, but I don't know how to do this.
See section 3.3 here. Below is an example algorithm in Scala:
// for simplicity assume input length is a power of 2
def scanadd(input: IndexedSeq[Int]): IndexedSeq[Int] =
  if (input.length == 1)
    input
  else {
    // calculate a new collapsed sequence which is the sum of sequential even/odd pairs
    val collapsed = IndexedSeq.tabulate(input.length / 2)(i => input(2 * i) + input(2 * i + 1))
    // recursively scan the collapsed values
    val scancollapse = scanadd(collapsed)
    // now we can use the scan of the collapsed seq to calculate the full sequence
    val output = IndexedSeq.tabulate(input.length) { i =>
      if (i % 2 == 1) scancollapse(i / 2)      // odd index: a collapsed prefix sum ends exactly here
      else if (i == 0) input(0)                // the first element is its own prefix sum
      else scancollapse(i / 2 - 1) + input(i)  // even index: collapsed prefix just before it, plus the current value
    }
    output
  }
I understand that this might need a fair bit of optimization to work nicely with Hadoop; translating it directly would, I think, lead to pretty inefficient Hadoop code. For example, you obviously can't use an IndexedSeq in Hadoop. I would appreciate any specific problems you see. I think it can probably be made to work well, though.
That seems superfluous. Did you mean code like this?
val vv = (0 to 1000000).grouped(2).toVector
vv.par.foldLeft((0L, 0L, false)) { (a, v) =>
  if (a._3) (a._1, a._2 + v.sum, !a._3) else (a._1 + v.sum, a._2, !a._3)
}
This was the best tutorial I found for writing an InputFormat and RecordReader. I ended up reading the whole split as one ArrayWritable record.

How to groupBy groupBy?

I need to map through a List[(A,B,C)] to produce an HTML report. Specifically, a
List[(Schedule,GameResult,Team)]
Schedule contains a gameDate property that I need to group by to get a
Map[JodaTime, List(Schedule,GameResult,Team)]
which I use to display gameDate table row headers. Easy enough:
val data = repo.games.findAllByDate(fooDate).groupBy(_._1.gameDate)
Now the tricky bit (for me) is: how do I further refine the grouping to enable mapping through the game results as pairs? To clarify, each GameResult is one team's "version" of the game (i.e. score, location, etc.), sharing a common Schedule gameID with the opponent team.
Basically, I need to display a game result outcome on one row as:
3 London Dragons vs. Paris Frogs 2
Grouping on gameDate lets me do something like:
data.map { case (date, games) =>
  // game date row headers
  <tr><td>{date.toString("MMMM dd, yyyy")}</td></tr>
  // print out game result data rows
  games.map { case (schedule, result, team) =>
    ...
    // BUT the (result, team) slice is ungrouped; it needs grouping by Schedule gameID
  }
}
In the old version of the existing application (PHP) I used to
for($x = 0; $x < $this->gameCnt; $x = $x + 2) {...}
but I'd prefer to refer to variable names and not the come-back-later-wtf-is-that-inducing:
games._._2(rowCnt).total
games._._3(rowCnt).name
games._._1(rowCnt).location
games._._2(rowCnt + 1).total
games._._3(rowCnt + 1).name
Maybe zip, or doubling up with for(t1 <- data; t2 <- data) yield(?), or something else entirely will do the trick. Regardless, there's a concise solution, it's just not coming to me right now...
Maybe I'm misunderstanding your requirements, but it seems to me that all you need is an additional groupBy:
repo.games.findAllByDate(fooDate).groupBy(_._1.gameDate).mapValues(_.groupBy(_._1.gameID))
The result will be of type:
Map[JodaTime, Map[GameId, List[(Schedule,GameResult,Team)]]]
(where GameId is the type of the return type of Schedule.gameId)
Update: if you want the results as pairs, then pattern matching is your friend, as shown by Arjan. This would give us:
val byDate = repo.games.findAllByDate(fooDate).groupBy(_._1.gameDate)
val data = byDate.mapValues(_.groupBy(_._1.gameID).values.map { case List((sa, ra, ta), (sb, rb, tb)) => (sa, (ta, ra), (tb, rb)) })
This time the result is of type:
Map[JodaTime, Iterable[ (Schedule,(Team,GameResult),(Team,GameResult))]]
Note that this will throw a MatchError if there are not exactly 2 entries with the same gameId. In real code you will definitely want to check for this case.
OK, a solution from Régis Jean-Gilles:
val data = repo.games.findAllByDate(fooDate).groupBy(_._1.gameDate).mapValues(_.groupBy(_._1.gameID))
You said it was not correct; maybe you just didn't use it the right way?
Every List in the result is a pair of games with the same GameId.
You could produce HTML like this:
data.map { case (date, games) =>
  // game date row headers
  <tr><td>{date.toString("MMMM dd, yyyy")}</td></tr>
  // print out game result data rows
  games.map { case (gameId, List((schedule1, result1, team1), (schedule2, result2, team2))) =>
    ...
  }
}
And since you don't need the gameId, you can return just the paired games:
val data = repo.games.findAllByDate(fooDate).groupBy(_._1.gameDate).mapValues(_.groupBy(_._1.gameID).values)
The type of the result is now:
Map[JodaTime, Iterable[List[(Schedule,GameResult,Team)]]]
Every list is again a pair of two games with the same GameId.