RasterFrames extracting location information problem - scala

Is there a way to extract/query latitude, longitude and elevation data from a tif file using RasterFrames (http://rasterframes.io/)?
Following the documentation, I loaded a tif file with loadRF from the following site: https://visibleearth.nasa.gov/view.php?id=73934. However, all I can see is generic information, and I don't know which RasterFunction to use to extract position, elevation, or any other relevant information. I have tried everything I could find in the API.
I also tried to extract temperature information from the following source: http://worldclim.org/version2
All I get is a tile column of DoubleUserDefinedNoDataArrayTile and the boundary (extent or crs).
RasterStack in R can extract this information according to this blog: https://www.benjaminbell.co.uk/2018/01/extracting-data-and-making-climate-maps.html
I need a more granular DataFrame, such as lat, lon, temperature (or whatever data is embedded in the tif file).
Is this possible with RasterFrames or GeoTrellis?

Long story short: yes, it is possible (at least with GeoTrellis). I suppose it is also possible with RasterFrames, but it will require some time to figure out how to extract this data. I can't give a more detailed answer, since I would need to know more about the dataset and about the pipeline you want to apply.
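As a rough GeoTrellis-only sketch (not part of the original answer; the path and file are hypothetical, and it assumes a single-band GeoTIFF in a geographic CRS): read the tiff, then walk the grid, converting each cell to map coordinates with RasterExtent.gridToMap and reading its value with Tile.getDouble.

import geotrellis.raster._
import geotrellis.raster.io.geotiff.SinglebandGeoTiff

object TiffToPoints extends App {
  // Hypothetical local path to one WorldClim temperature band (single-band GeoTIFF).
  val tiff = SinglebandGeoTiff("/path/to/wc2.0_bio_10m_01.tif")
  val raster = tiff.raster // Raster[Tile]: the tile plus its Extent
  val re = RasterExtent(raster.extent, raster.tile.cols, raster.tile.rows)

  // One (x, y, value) triple per cell; x/y are in the tiff's CRS (lon/lat for EPSG:4326 data).
  val rows = for {
    row <- 0 until raster.tile.rows
    col <- 0 until raster.tile.cols
  } yield {
    val (x, y) = re.gridToMap(col, row) // cell-centre map coordinates
    (x, y, raster.tile.getDouble(col, row))
  }

  rows.take(5).foreach(println)
}

From there, the (x, y, value) triples can be turned into the granular DataFrame you describe (e.g. with toDF("lon", "lat", "temperature") on a Spark Dataset).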

Currently you have to do it with a UDF and the relevant GeoTrellis method.
We have a ticket to implement this as a first-class function, but in the meantime, this is the long form:
import org.apache.spark.sql._
import org.apache.spark.sql.functions._
import org.locationtech.rasterframes._
import org.locationtech.rasterframes.datasource.raster._
import org.locationtech.rasterframes.encoders.CatalystSerializer._
import geotrellis.raster._
import geotrellis.vector.Extent
import org.locationtech.jts.geom.Point

object ValueAtPoint extends App {
  implicit val spark = SparkSession.builder()
    .master("local[*]").appName("RasterFrames")
    .withKryoSerialization.getOrCreate().withRasterFrames
  spark.sparkContext.setLogLevel("ERROR")
  import spark.implicits._

  val example = "https://raw.githubusercontent.com/locationtech/rasterframes/develop/core/src/test/resources/LC08_B7_Memphis_COG.tiff"
  val rf = spark.read.raster.from(example).load()

  // Point expressed in the raster's CRS.
  val point = st_makePoint(766770.000, 3883995.000)

  // UDF that rebuilds a GeoTrellis Raster from the tile + extent and samples it at the point.
  val rf_value_at_point = udf((extentEnc: Row, tile: Tile, point: Point) => {
    val extent = extentEnc.to[Extent]
    Raster(tile, extent).getDoubleValueAtPoint(point)
  })

  rf.where(st_intersects(rf_geometry($"proj_raster"), point))
    .select(rf_value_at_point(rf_extent($"proj_raster"), rf_tile($"proj_raster"), point) as "value")
    .show(false)

  spark.stop()
}

Related

Read specific column from Parquet without using Spark

I am trying to read Parquet files without using Apache Spark, and I am able to do it, but I am finding it hard to read specific columns. I am not able to find any good resource on Google, as almost all the posts are about reading the parquet file using Spark. Below is my code:
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.avro.generic.GenericRecord
import org.apache.parquet.hadoop.ParquetReader
import org.apache.parquet.avro.AvroParquetReader

object parquetToJson {
  def main(args: Array[String]): Unit = {
    //case class Customer(key: Int, name: String, sellAmount: Double, profit: Double, state: String)
    val parquetFilePath = new Path("data/parquet/Customer/")
    val reader = AvroParquetReader.builder[GenericRecord](parquetFilePath).build() //.asInstanceOf[ParquetReader[GenericRecord]]
    val iter = Iterator.continually(reader.read).takeWhile(_ != null)
    val list = iter.toList
    list.foreach(record => println(record))
  }
}
The commented-out case class represents the schema of my file, and right now the above code reads all the columns from the file. I want to read specific columns only.
If you just want to read specific columns, then you need to set a read schema on the configuration that the ParquetReader builder accepts (this is also known as a projection).
In your case you should be able to call .withConf(conf) on the AvroParquetReader builder class, and in the conf you pass in, invoke conf.set(ReadSupport.PARQUET_READ_SCHEMA, schema), where schema is an Avro schema in String form.
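A minimal sketch of that approach (the projection schema below is hypothetical, keeping only the key and name fields of the Customer record; AvroReadSupport.setRequestedProjection is the Avro-specific variant if your parquet-avro version expects that instead):

import org.apache.avro.Schema
import org.apache.avro.generic.GenericRecord
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.parquet.avro.{AvroParquetReader, AvroReadSupport}
import org.apache.parquet.hadoop.api.ReadSupport

object parquetProjection {
  def main(args: Array[String]): Unit = {
    // Hypothetical projection: only key and name out of the Customer schema above.
    val projection =
      """{"type":"record","name":"Customer","fields":[
        |{"name":"key","type":"int"},
        |{"name":"name","type":"string"}]}""".stripMargin

    val conf = new Configuration()
    conf.set(ReadSupport.PARQUET_READ_SCHEMA, projection)
    // Avro-specific equivalent, if needed:
    // AvroReadSupport.setRequestedProjection(conf, new Schema.Parser().parse(projection))

    val reader = AvroParquetReader.builder[GenericRecord](new Path("data/parquet/Customer/"))
      .withConf(conf)
      .build()

    // Only the projected columns are materialized into each GenericRecord.
    Iterator.continually(reader.read).takeWhile(_ != null)
      .foreach(record => println(s"${record.get("key")} ${record.get("name")}"))
    reader.close()
  }
}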

Save Spark StandardScaler for later use in Scala

I am still using Spark 1.6 and have trained a StandardScaler that I would like to save and reuse on future datasets.
Using the supplied examples I could transform the data successfully but I can't find a way to save the trained normaliser.
Is there any way in which the trained normaliser can be saved?
Assuming that you have created the scalerModel:
import org.apache.spark.ml.feature.StandardScalerModel

// save the fitted model
scalerModel.write.save("path/folder/")

// ...and later load it back
val loadedModel = StandardScalerModel.load("path/folder/")
The StandardScalerModel class has a save method. After calling the fit method on a StandardScaler, the returned object is a StandardScalerModel: API Docs
e.g. similar to the supplied example:
import org.apache.spark.ml.feature.{StandardScaler, StandardScalerModel}

val dataFrame = spark.read.format("libsvm").load("data/mllib/sample_libsvm_data.txt")

val scaler = new StandardScaler()
  .setInputCol("features")
  .setOutputCol("scaledFeatures")
  .setWithStd(true)
  .setWithMean(false)

// Compute summary statistics by fitting the StandardScaler.
val scalerModel = scaler.fit(dataFrame)

scalerModel.write.overwrite().save("/path/to/the/file")
val sameModel = StandardScalerModel.load("/path/to/the/file")

How can I call a UDF in a UDF?

Hopefully, my title is the correct description of what I am trying to accomplish. I have weather data that is aggregated by week, with each row being one week, and this data is sorted by time. I then have a mathematical expression that I evaluate using this weather data in a Spark UDF. The expressions are evaluated using dynamically generated code that is injected back into the JVM; I want to eventually replace this with a Scala macro, but for now it uses Janino and SimpleCompiler to compile the code and reload the class back in.
Sometimes in these model strings there are variables and functions. The variables are easy to put in, since they can be string-replaced in the generated code, and the functions are for the most part easy too, because if their names map to an existing static function then it will just execute that when the model is evaluated. For instance, an exponent maps to Math.pow in scala.math.
So my issue specifically is implementing a lag and lead function for this analysis. Spark has these two functions built in, but they live in the DataFrame layer above, while this function would be called inside of a UDF, so I am having trouble referencing that data from the top level.
So I have this code
import org.apache.spark.sql.expressions.{Window, WindowSpec}
import org.apache.spark.sql.functions.{lag => slag, udf}
import org.apache.spark.sql.hive.HiveContext
import org.apache.spark.sql.{Column, DataFrame}
import org.apache.spark.{SparkConf, SparkContext}

object Functions {
  val conf: SparkConf = new SparkConf().setAppName("Blah").setMaster("local[*]")
  val ctx: SparkContext = new SparkContext(conf)
  val hctx: HiveContext = new HiveContext(ctx)
  import hctx.implicits._

  def lag(x: Double, window: Int): Double = {
    x
  }

  def lag(c: Column, window: Int = 1)(implicit windowSpec: WindowSpec): Column = {
    slag(c, window).over(windowSpec).as(c.toString() + "_lag")
  }

  def main(args: Array[String]): Unit = {
    val funcUdf = udf((f: Column) => lag(f))
    val data: DataFrame = ctx.parallelize(Seq(0, 1, 2, 3, 4, 5)).toDF("value")
    implicit val spec: WindowSpec = Window.orderBy($"value")
    data.select(funcUdf($"value")).show()
  }
}
Is there a way to accomplish this? This code doesn't work because of a forward reference. Is there some way to do it, or do I have to compute lag windows ahead of time and pass them all around?
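For what it's worth, a minimal sketch of the "compute the lag ahead of time" route mentioned above (not an answer from the thread; it uses SparkSession rather than HiveContext, and the udf is a hypothetical stand-in for the generated model code):

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{lag, udf}

object PrecomputedLag {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("PrecomputedLag").master("local[*]").getOrCreate()
    import spark.implicits._

    val data = Seq(0, 1, 2, 3, 4, 5).toDF("value")
    val spec = Window.orderBy($"value")

    // Window functions run at the DataFrame level, so materialize the lagged column first...
    val withLag = data.withColumn("value_lag", lag($"value", 1).over(spec))

    // ...and the UDF then receives plain values; no Column or window machinery inside it.
    val model = udf((current: Int, lagged: Int) => current + lagged)

    withLag.na.fill(0, Seq("value_lag"))
      .select(model($"value", $"value_lag") as "result")
      .show()
  }
}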

How to create a graph from a CSV file using Graph.fromEdgeTuples in Spark Scala

I'm new to Spark and Scala, and I'm trying to carry out a simple task of creating a graph from data in a text file.
From the documentation
https://spark.apache.org/docs/0.9.0/api/graphx/index.html#org.apache.spark.graphx.Graph$#fromEdges[VD,ED]%28RDD[Edge[ED]],VD%29%28ClassTag[VD],ClassTag[ED]%29:Graph[VD,ED]
I can see that I can create a graph from tuples of vertices.
My simple text file looks like this, where each number is a vertex:
v1 v3
v2 v1
v3 v4
v4
v5 v3
When I read the data from the file
val myVertices = myData.map(line=>line.split(" "))
I get an RDD[Array[String]].
My questions are:
If this is the right way to approach the problem, how do I turn the RDD[Array[String]] into the correct format, which according to the documentation is RDD[(VertexId, VertexId)]? (Also, VertexId has to be of type Long, and I am working with strings.)
Is there an alternative, easier way in which I can construct a graph from a similar structure of csv file?
Any suggestion would be very welcome. Thanks!
There are many ways to create a graph from a text file.
This code creates a graph using the Graph.fromEdgeTuples method:
import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import org.apache.spark.graphx.GraphLoader
import scala.util.MurmurHash
import org.apache.spark.graphx.Graph
import org.apache.spark.rdd.RDD
import org.apache.spark.graphx.VertexId

object GraphFromFile {
  def main(args: Array[String]) {
    // create SparkContext
    val sparkConf = new SparkConf().setAppName("GraphFromFile").setMaster("local[*]")
    val sc = new SparkContext(sparkConf)

    // read your file
    /* suppose your data is like
       v1 v3
       v2 v1
       v3 v4
       v4 v2
       v5 v3
    */
    val file = sc.textFile("src/main/resources/textFile1.csv")

    // create edge RDD of type RDD[(VertexId, VertexId)]
    val edgesRDD: RDD[(VertexId, VertexId)] = file.map(line => line.split(" "))
      .map(line =>
        (MurmurHash.stringHash(line(0).toString), MurmurHash.stringHash(line(1).toString)))

    // create a graph
    val graph = Graph.fromEdgeTuples(edgesRDD, 1)

    // you can see your graph
    graph.triplets.collect.foreach(println)
  }
}
MurmurHash.stringHash is used because the file contains vertices in the form of Strings. If they were of a numeric type, it would not be required.
First of all, you should read and understand the GraphX programming guide: https://spark.apache.org/docs/1.1.0/graphx-programming-guide.html
Next, you need to determine what kind of Edge and Vertex you will represent in your graph. Given that you appear to have nothing to attach to your vertices and edges, it looks like you need something like:
type MyVertex = (Long,Unit)
If you find you do have something, like a String, to attach to each vertex, then replace Unit by String and, in the following, replace the () placeholder by an appropriate String.
Now you need an array (or other Seq) of vertices which you then convert to an RDD--something like this:
val vertices: Seq[MyVertex] = Array((1L, ()), (2L, ()), (3L, ()))
val rddVertices: RDD[(VertexId, Unit)] = sc.parallelize(vertices)
where sc is your instance of SparkContext. And your vertices and edges are read from your CSV file and suitably converted to longs. I won't detail that code but it's simple enough, especially if you change the format of the CSV file to remove the "v" prefix from each vertex id.
Similarly, you have to create the edges that you want:
type MyEdge = Edge[Unit]
val edge1 = new MyEdge(1L, 2L, ())
val edge2 = new MyEdge(2L, 3L, ())
val edges = Array(edge1, edge2)
val rddEdges = sc.parallelize(edges)
Finally, you create your graph:
val graph = Graph(rddVertices,rddEdges)
I have similar code in my own application which I have tried to massage into what you need, but I can't guarantee that this will be perfect. But it should get you started.
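For completeness, a rough sketch (hypothetical, not from the answer above) of the parsing step it skips, assuming the file layout from the question (two space-separated ids per edge line, isolated vertices on their own line) and stripping the "v" prefix to obtain Long ids:

import org.apache.spark.graphx.{Edge, Graph, VertexId}
import org.apache.spark.rdd.RDD

// sc is the SparkContext, as in the other answer.
val lines = sc.textFile("src/main/resources/textFile1.csv")
val tokens = lines.map(_.split(" ").map(_.stripPrefix("v").toLong))

// Every id that appears anywhere becomes a vertex; only two-token lines become edges.
val rddVertices: RDD[(VertexId, Unit)] = tokens.flatMap(_.map(id => (id, ()))).distinct()
val rddEdges: RDD[Edge[Unit]] = tokens.filter(_.length == 2).map(a => Edge(a(0), a(1), ()))

val graph = Graph(rddVertices, rddEdges)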
You can use a good hash function to convert the string values into Longs.
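For instance, a trivial sketch using the non-deprecated MurmurHash3:

import scala.util.hashing.MurmurHash3

// Derive a stable Long VertexId from a string label such as "v1".
def vertexId(label: String): Long = MurmurHash3.stringHash(label).toLong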
If your file were in edge-list format, e.g.
v1 v3
v2 v1
v3 v4
v5 v3
then you could simply use the following, which works out the vertices from the endpoints of the edges:
import org.apache.spark.graphx._
val graph = GraphLoader.edgeListFile(sc, "so_test.txt")
However, as it stands, that 'v4' on its own means that edgeListFile throws an exception.

getOrElse method not being found in Scala Spark

I am attempting to follow an example in Sandy Ryza's book Advanced Analytics with Spark, coding in IntelliJ. Below I seem to have imported all the right libraries, so why is it not recognizing getOrElse?
Error:(84, 28) value getOrElse is not a member of org.apache.spark.rdd.RDD[String]
bArtistAlias.value.getOrElse(artistID, artistID)
^
Code:
import org.apache.spark.rdd.RDD
import org.apache.spark.rdd._
import org.apache.spark.rdd.PairRDDFunctions
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.SparkConf
import org.apache.spark.mllib.recommendation._

val trainData = rawUserArtistData.map { line =>
  val Array(userID, artistID, count) = line.split(' ').map(_.toInt)
  val finalArtistID = bArtistAlias.value.getOrElse(artistID, artistID)
  Rating(userID, finalArtistID, count)
}.cache()
I can only make an assumption, as the code listed is missing pieces, but my guess is that bArtistAlias is supposed to be a Map that SHOULD be broadcast, but isn't.
I went and found the piece of code in Sandy's book and it corroborates my guess. So, you seem to be missing this piece:
val bArtistAlias = sc.broadcast(artistAlias)
Without seeing the full code I am not even sure what you did, but it looks like you broadcast an RDD[String], hence the error. This would not work anyway, as you cannot use another RDD inside of an RDD operation.
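For reference, a minimal sketch of the missing step, assuming rawArtistAlias is an RDD[String] of tab-separated "badID<TAB>goodID" lines as in the book (details may differ from your copy):

// Build a plain Scala Map on the driver, then broadcast it.
val artistAlias: scala.collection.Map[Int, Int] = rawArtistAlias.flatMap { line =>
  val tokens = line.split('\t')
  if (tokens(0).isEmpty) None else Some((tokens(0).toInt, tokens(1).toInt))
}.collectAsMap()

val bArtistAlias = sc.broadcast(artistAlias) // Broadcast[Map[Int, Int]], not an RDD

With bArtistAlias.value now being a Map rather than an RDD[String], the getOrElse call in your trainData block compiles as expected.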