Scala, Spark, Geotrellis Rdd CRS reprojection - scala

I load a set of point from a CSV file to a RDD:
case class yieldrow(Elevation:Double,DryYield:Double)
val points :RDD[PointFeature[yieldrow]] = lines.map { line =>
val fields = line.split(",")
val point = Point(fields(1).toDouble,fields(0).toDouble)
Feature(point, yieldrow(fields(4).toDouble,fields(20)))
}
Then get:
points: org.apache.spark.rdd.RDD[geotrellis.vector.PointFeature[yieldrow]]
Now need to reproject from EPSG:4326 to EPSG:3270
So I create the CRS from and to:
val crsFrom : geotrellis.proj4.CRS = geotrellis.proj4.CRS.fromName("EPSG:4326")
val crsTo : geotrellis.proj4.CRS = geotrellis.proj4.CRS.fromEpsgCode(32720)
But I can not create the transform and also i don not know:
Hot to apply a transform to a single point:
val pt = Point( -64.9772376007928, -33.6408083223936)
How to use the mapGeom method of Feature to make a CRS transformation ?
points.map(_.mapGeom(?????))
points.map(feature => feature.mapGeom(????))
How to use ReprojectPointFeature(pointfeature) ?
The documentation not have basic code samples.
Any help will be appreciate

I'll start from the last question:
Indeed to perform a reproject on a PointFeature you can use ReprojectPointFeature implict case class. To use it just be sure that you have import geotrellis.vector._ in reproject function call scope.
import geotrellis.vector._
points.map(_.reproject(crsFrom, crsTo))
The same import works for a Point too:
import geotrellis.vector._
pt.reproject(crsFrom, crsTo)
points.map(_.mapGeom(_.reproject(crsFrom, crsTo)))

Related

Is there a simple way to convert Option[Task[T]] to Task[Option[T]]?

While using monix.eval.Task or zio.Task, is there a simple way to convert Option of Task to Task of Option?
If you want a pure ZIO solution, you can use .foreach with identity:
val fx: Option[UIO[Int]] = Option(Task.effectTotal(42))
val res: UIO[Option[Int]] = ZIO.foreach(fx)(identity)
If you're also using cats, the method you're looking for is called .sequence.
import cats.implicits.toTraverseOps
import zio.interop.catz._
import zio.{Task, UIO}
val fx: Option[UIO[Int]] = Option(Task.effectTotal(42))
val res: UIO[Option[Int]] = fx.sequence
The other way around is not possible as one would need to materialize the Task in order to be able to lift it into an Option[T].

Applying multiple map functions to streaming database results in Play 2.6

I have a large query that seems to be a prime candidate for streaming results.
I would like to make a call to a function, which returns an object which I can apply additional map transformations on, and then ultimately convert the entire result into a list. This is because the conversions will results in a set of objects much smaller than the results in the database and there are many different transformations that must take place sequentially. Processing each result at a time will save me significant memory.
For example, if the results from the database were a stream (though the correct thing is likely an AkkaStream or an Iteratee), then I could do something like:
def outer(converter1[String, Int}, converter2[Int,Double]) {
val sqlIterator = getSqlIterator()
val mappedIterator1 = sqlIterator.map(x => converter1(x.bigColumn))
val mappedIterator2 = sqlIterator.map(x => converter2(x))
val retVal = mappedIterator.toList
retVal
}
def getSqlIterator() {
val selectedObjects = SQL( """SELECT * FROM table""").map { x =>
val id = x[Long]("id")
val tinyColumn = x[String]("tiny_column")
val bigColumn = x[String]("big_column")
NewObject(id, tinyColumn, bigColumn)
}
val transformed = UNKNOWN_FUNCTION(selectedObjects)
transformed
}
Most of the documentation appears to provide the mechanism to apply a "reduce" function to the results, rather than a "map" function, but the resulting mapped functions will be much smaller, saving me significant memory. What should I do for UNKNOWN_FUNCTION?
The following is a simple example of using Anorm's Akka Streams support to read the values from a single column of type String, applying two transformations to each element, and placing the results in a Seq. I'll leave it as an exercise for you to retrieve the values from multiple columns at a time, if that's what you need.
import akka.actor.ActorSystem
import akka.stream.ActorMaterializer
import akka.stream.scaladsl.Sink
import anorm._
import scala.collection.immutable.Seq
import scala.concurrent.Future
implicit val system = ActorSystem("MySystem")
implicit val materializer = ActorMaterializer()
implicit val ec = system.dispatcher
val convertStringToInt: String => Int = ???
val convertIntToDouble: Int => Double = ???
val result: Future[Seq[Double]] =
AkkaStream.source(SQL"SELECT big_column FROM table", SqlParser.scalar[String])
.map(convertStringToInt)
.map(convertIntToDouble)
.runWith(Sink.seq[Double])

Scala - Remove header from Pair RDD

I am new in Scala and want to remove header from data. I have below data
recordid,income
1,50000000
2,50070000
3,50450000
5,50920000
and I am using below code to read
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.SparkConf
object PAN {
def main(args: Array[String]) {
case class income(recordid : Int, income : Int)
val sc = new SparkContext(new SparkConf().setAppName("income").setMaster("local[2]"))
val income_data = sc.textFile("file:///home/user/Documents/income_info.txt").map(_.split(","))
val income_recs = income_data.map(r => (r(0).toInt, income(r(0).toInt, r(1).toInt)))
}
}
I want to remove header from pair RDD but not getting how.
Thanks.
===============================Edit=========================================
I was playing with below code
val header = income_data.first()
val a = income_data.filter(row => row != header)
a.foreach { println }
but it return below output
[Ljava.lang.String;#1657737
[Ljava.lang.String;#75c5d3
[Ljava.lang.String;#ed63f
[Ljava.lang.String;#13f04a
[Ljava.lang.String;#1048c5d
You technique to remove the header by filtering it out will work fine. The problem is how you are trying to print the array.
Arrays in Scala do not override toString so when you try to print one it uses the default string representation which is just the name and hashcode and usually not very useful.
If you want to print an array, turn it into a string first using the mkString method on string, or use foreach(println)
a.foreach {array => println(array.mkString("[",", ","]")}
or
a.foreach {array => array.foreach{println}}
Will both print out the elements of your array so you can see what they contain.
Keep in mind that when working with Spark, printing inside transformation and actions only works in local mode. Once you move to the cluster, the work will be done on remote executors so you won't be able to see and console output from them.
val income_data = sc.textFile("file:///home/user/Documents/income_info.txt")
income_data.collect().drop(1)
When you create an RDD it will return RDD[String] , then when you collect() on top of it it will return Array[String], drop(number of elements) is a function on top of Array to remove those many rows from RDD.

Spark read multiple directories into multiple dataframes

I have a directory structure on S3 looking like this:
foo
|-base
|-2017
|-01
|-04
|-part1.orc, part2.orc ....
|-A
|-2017
|-01
|-04
|-part1.orc, part2.orc ....
|-B
|-2017
|-01
|-04
|-part1.orc, part2.orc ....
Meaning that for directory foo I have multiple output tables, base, A, B, etc in a given path based on the timestamp of a job.
I'd like to left join them all, based on a timestamp and the master directory, in this case foo. This would mean reading in each output table base, A, B, etc into new separate input tables on which a left join can be applied. All with the base table as starting point
Something like this (not working code!)
val dfs: Seq[DataFrame] = spark.read.orc("foo/*/2017/01/04/*")
val base: DataFrame = spark.read.orc("foo/base/2017/01/04/*")
val result = dfs.foldLeft(base)((l, r) => l.join(r, 'id, "left"))
Can someone point me in the right direction on how to get that sequence of DataFrames? It might even be worth considering the reads as lazy, or sequential, thus only reading the A or B table when the join is applied to reduce memory requirements.
Note: the directory structure is not final, meaning it can change if that fits the solution.
From what I understand Spark uses the underlying Hadoop API to read in data file. So the inherited behavior is to read everything you specify into one single RDD/DataFrame.
To achieve what you want, you can first get a list of directories with:
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{ FileSystem, Path }
val path = "foo/"
val hadoopConf = new Configuration()
val fs = FileSystem.get(hadoopConf)
val paths: Array[String] = fs.listStatus(new Path(path)).
filter(_.isDirectory).
map(_.getPath.toString)
Then load them into separated dataframes:
val dfs: Array[DataFrame] = paths.
map(path => spark.read.orc(path + "/2017/01/04/*"))
Here's a straight-forward solution to what (I think) you're trying to do, with no use of extra features like Hive or build-in partitioning abilities:
import spark.implicits._
// load base
val baseDF = spark.read.orc("foo/base/2017/01/04").as("base")
// create or use existing Hadoop FileSystem - this should use the actual config and path
val fs = FileSystem.get(new URI("."), new Configuration())
// find all other subfolders under foo/
val otherFolderPaths = fs.listStatus(new Path("foo/"), new PathFilter {
override def accept(path: Path): Boolean = path.getName != "base"
}).map(_.getPath)
// use foldLeft to join all, using the DF aliases to find the right "id" column
val result = otherFolderPaths.foldLeft(baseDF) { (df, path) =>
df.join(spark.read.orc(s"$path/2017/01/04").as(path.getName), $"base.id" === $"${path.getName}.id" , "left") }

Tokenization by Stanford parser is slow?

Quession Summary: tokenization by stanford parser is slow on my local machine, but unreasonably much much faster on spark. Why?
I'm using stanford coreNLP tool to tokenize sentences.
My script in Scala is like this:
import java.util.Properties
import scala.collection.JavaConversions._
import scala.collection.immutable.ListMap
import scala.io.Source
import edu.stanford.nlp.ling.CoreAnnotations.TokensAnnotation
import edu.stanford.nlp.ling.CoreLabel
import edu.stanford.nlp.pipeline.Annotation
import edu.stanford.nlp.pipeline.StanfordCoreNLP
val properties = new Properties()
val coreNLP = new StanfordCoreNLP(properties)
def tokenize(s: String) = {
properties.setProperty("annotators", "tokenize")
val annotation = new Annotation(s)
coreNLP.annotate(annotation)
annotation.get(classOf[TokensAnnotation]).map(_.value.toString)
}
tokenize("Here is my sentence.")
One call of tokenize function takes roughly (at least) 0.1 sec.
This is very very slow because I have 3 million sentences.
(3M * 0.1sec = 300K sec = 5000H)
As an alternative approach, I have applied the tokenizer on Spark.
(with four worker machines.)
import java.util.List
import java.util.Properties
import scala.collection.JavaConversions._
import edu.stanford.nlp.ling.CoreAnnotations.TokensAnnotation
import edu.stanford.nlp.ling.CoreLabel
import edu.stanford.nlp.pipeline.Annotation
import edu.stanford.nlp.pipeline.StanfordCoreNLP
val file = sc.textFile("hdfs:///myfiles")
def tokenize(s: String) = {
val properties = new Properties()
properties.setProperty("annotators", "tokenize")
val coreNLP = new StanfordCoreNLP(properties)
val annotation = new Annotation(s)
coreNLP.annotate(annotation)
annotation.get(classOf[TokensAnnotation]).map(_.toString)
}
def normalizeToken(t: String) = {
val ts = t.toLowerCase
val num = "[0-9]+[,0-9]*".r
ts match {
case num() => "NUMBER"
case _ => ts
}
}
val tokens = file.map(tokenize(_))
val tokenList = tokens.flatMap(_.map(normalizeToken))
val wordCount = tokenList.map((_,1)).reduceByKey(_ + _).sortBy(_._2, false)
wordCount.saveAsTextFile("wordcount")
This scripts finishes tokenization and word count of 3 million sentences just in 5 minites!
And results seems reasonable.
Why this is so first? Or, why the first scala script is so slow?
The problem with your first approach is that you set the annotators property after you initialize the StanfordCoreNLP object. Therefore CoreNLP is initialized with the list of default annotators which include the part-of-speech tagger and the parser which are orders of magnitude slower than the tokenizer.
To fix this, simply move the line
properties.setProperty("annotators", "tokenize")
before the line
val coreNLP = new StanfordCoreNLP(properties)
This should be even slightly faster than your second approach as you don't have to reinitialize CoreNLP for each sentence.