I have a DataFrame tableDS.
In Scala I am able to remove duplicates over the primary keys using the following:
import org.apache.spark.sql.expressions.Window.partitionBy
import org.apache.spark.sql.functions.row_number
val window = partitionBy(primaryKeySeq.map(k => tableDS(k)): _*).orderBy(tableDS(mergeCol).desc)
tableDS.withColumn("rn", row_number.over(window)).where($"rn" === 1).drop("rn")
I need to write the same thing in Python, where primaryKeySeq is a list. I tried the first statement like this:
from pyspark.sql.window import Window
import pyspark.sql.functions as func
window = Window.partitionBy(primaryKeySeq).orderBy(tableDS[bdtVersionColumnName].desc())
tableDS1=tableDS.withColumn("rn",rank().over(window))
This does not give me the correct result.
It got solved. Here is the final conversion:
from pyspark.sql.window import Window
from pyspark.sql.functions import col, row_number
window = Window.partitionBy(primaryKeySeq).orderBy(tableDS[bdtVersionColumnName].desc())
tableDS1 = tableDS.withColumn("rn", row_number().over(window)).where(col("rn") == 1).drop("rn")
I have created a PySpark udf by doing the following:
from urllib.parse import urljoin, urlparse
import unicodedata
from pyspark.sql.functions import col, udf, count, substring
from pyspark.sql.types import StringType
decode_udf = udf(lambda val: urljoin(unicodedata.normalize('NFKC',val), urlparse(unicodedata.normalize('NFKC',val)).path), StringType())
For reference, the code above takes a url like this:
https://www.dagens.dk/udland/steve-irwins-soen-taet-paa-miste-livet-ny-video-viser-flugt-fra-kaempe-krokodille?utm_medium=Social&utm_source=Facebook#Echobox=1644308898
and transforms it into
https://www.dagens.dk/udland/steve-irwins-soen-taet-paa-miste-livet-ny-video-viser-flugt-fra-kaempe-krokodille
How can I convert this into Scala? I have tried many ways to replicate the code but have been unsuccessful. Thanks in advance.
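For reference, one possible translation is sketched below (an assumption, not tested against your data): java.text.Normalizer provides NFKC normalization and java.net.URI handles the parsing, which together mimic what the urljoin/urlparse combination does here, keeping the scheme, host and path while dropping the query string and fragment.
import java.net.URI
import java.text.Normalizer
import org.apache.spark.sql.functions.udf

// Sketch: NFKC-normalize the URL, then rebuild it from scheme, authority and
// path only, which drops the ?query string and the #fragment.
val decodeUdf = udf { (raw: String) =>
  val normalized = Normalizer.normalize(raw, Normalizer.Form.NFKC)
  val uri = new URI(normalized)
  new URI(uri.getScheme, uri.getAuthority, uri.getPath, null, null).toString
}

// Usage (the "url" column name is just an example):
// df.withColumn("clean_url", decodeUdf(col("url")))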
Is there a way to extract/query latitude, longitude and elevation data from a tif file using RasterFrames (http://rasterframes.io/)?
Following the documentation, I loaded a tif file with loadRF from the following site: https://visibleearth.nasa.gov/view.php?id=73934, however all I can see is generic information, and I don't know which raster function to use in order to extract position and elevation or any other relevant information. I tried everything I could find in the API.
I also tried to extract temperature information from the following source: http://worldclim.org/version2
All I get is a tile column with DoubleUserDefinedNoDataArrayTile values and the boundary (extent or crs).
RasterStack in R can extract this information, according to this blog: https://www.benjaminbell.co.uk/2018/01/extracting-data-and-making-climate-maps.html
I need a more granular DataFrame, such as lat, lon, temperature (or whatever data is embedded in the tif file).
Is this possible with RasterFrames or GeoTrellis?
Long story short: yes, it is possible (at least with GeoTrellis). It is also possible with RasterFrames, I suppose, but it will require some time to figure out how to extract this data. I can't give a more detailed answer since I would need to know more about the dataset and about the pipeline you want to build and apply.
Currently you have to do it with a UDF and the relevant GeoTrellis method.
We have a ticket to implement this as a first-class function, but in the meantime, this is the long form:
import org.apache.spark.sql._
import org.apache.spark.sql.functions._
import org.locationtech.rasterframes._
import org.locationtech.rasterframes.datasource.raster._
import org.locationtech.rasterframes.encoders.CatalystSerializer._
import geotrellis.raster._
import geotrellis.vector.Extent
import org.locationtech.jts.geom.Point
object ValueAtPoint extends App {
  implicit val spark = SparkSession.builder()
    .master("local[*]").appName("RasterFrames")
    .withKryoSerialization.getOrCreate().withRasterFrames
  spark.sparkContext.setLogLevel("ERROR")
  import spark.implicits._

  val example = "https://raw.githubusercontent.com/locationtech/rasterframes/develop/core/src/test/resources/LC08_B7_Memphis_COG.tiff"
  val rf = spark.read.raster.from(example).load()
  val point = st_makePoint(766770.000, 3883995.000)

  // UDF that reads the cell value of the tile at the given point
  val rf_value_at_point = udf((extentEnc: Row, tile: Tile, point: Point) => {
    val extent = extentEnc.to[Extent]
    Raster(tile, extent).getDoubleValueAtPoint(point)
  })

  rf.where(st_intersects(rf_geometry($"proj_raster"), point))
    .select(rf_value_at_point(rf_extent($"proj_raster"), rf_tile($"proj_raster"), point) as "value")
    .show(false)

  spark.stop()
}
I am trying to convert an RDD to a DataFrame in Scala as follows:
val posts = spark.textFile("~/allPosts/part-02064.xml.gz")
import org.apache.spark.SparkContext._
import org.apache.spark._
import org.apache.spark.rdd._
import org.apache.spark.SparkContext._
val sqlContext = new org.apache.spark.sql.SQLContext(spark)
import sqlContext.implicits._
posts.map(identity).toDF()
When I do this I get the following error.
java.lang.NoSuchMethodError: org.apache.spark.sql.SQLContext$implicits$.stringRddToDataFrameHolder(Lorg/apache/spark/rdd/RDD;)Lorg/apache/spark/sql/DataFrameHolder;
I can't for the life of me figure out what I'm doing wrong.
You need to define a schema to convert an RDD to a DataFrame, either by reflection or programmatically.
One very important point about DataFrames: a DataFrame is an RDD with a schema. In your case, define a case class and map the values of the file to that class, as in the sketch below. Hope it helps.
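A minimal sketch of the reflection approach (the Post case class and its single rawXml field are made up for illustration; replace them with the fields you actually need from each line):
// Hypothetical schema: one string field per line; adjust to your real structure.
case class Post(rawXml: String)

// With the same sqlContext.implicits._ import as above, mapping the RDD[String]
// through the case class lets toDF() derive the schema by reflection.
val postsDF = posts.map(line => Post(line)).toDF()
postsDF.printSchema()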
I am reading a directory of files using the following code:
val data = sc.textFile("/mySource/dir1/*")
Now my data RDD contains all rows of all files in the directory (right?)
I now want to add a column to each row with the source file's name. How can I do that?
The other option I tried is wholeTextFiles, but I keep getting out-of-memory exceptions.
5 servers, 24 cores, 24 GB each (executor-cores 5, executor-memory 5G)
any ideas?
You can use this code. I have tested it with Spark 1.4 and 1.5.
It gets the file name from the input split and adds it to each line via the iterator, using mapPartitionsWithInputSplit on the NewHadoopRDD.
import org.apache.hadoop.mapreduce.lib.input.{FileSplit, TextInputFormat}
import org.apache.spark.rdd.{NewHadoopRDD}
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.hadoop.io.LongWritable
import org.apache.hadoop.io.Text
val sc = new SparkContext(new SparkConf().setMaster("local"))
val fc = classOf[TextInputFormat]
val kc = classOf[LongWritable]
val vc = classOf[Text]
val path :String = "file:///home/user/test"
val text = sc.newAPIHadoopFile(path, fc ,kc, vc, sc.hadoopConfiguration)
val linesWithFileNames = text.asInstanceOf[NewHadoopRDD[LongWritable, Text]]
  .mapPartitionsWithInputSplit((inputSplit, iterator) => {
    val file = inputSplit.asInstanceOf[FileSplit]
    iterator.map(tup => (file.getPath, tup._2))
  })
linesWithFileNames.foreach(println)
I think it's pretty late to answer this question, but I found an easy way to do what you were looking for:
Step 0: from pyspark.sql import functions as F
Step 1: Create the DataFrame from the RDD as usual; call it df.
Step 2: Use input_file_name():
df.withColumn("INPUT_FILE", F.input_file_name())
This will add a column to your DataFrame with the source file name.
Attempting to follow an example in Sandy Ryza's book Advanced Analytics with Spark, coding in IntelliJ. Below I seem to have imported all the right libraries, so why is it not recognizing getOrElse?
Error:(84, 28) value getOrElse is not a member of org.apache.spark.rdd.RDD[String]
bArtistAlias.value.getOrElse(artistID, artistID)
^
Code:
import org.apache.spark.rdd.RDD
import org.apache.spark.rdd._
import org.apache.spark.rdd.PairRDDFunctions
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.SparkConf
import org.apache.spark.mllib.recommendation._
val trainData = rawUserArtistData.map { line =>
  val Array(userID, artistID, count) = line.split(' ').map(_.toInt)
  val finalArtistID = bArtistAlias.value.getOrElse(artistID, artistID)
  Rating(userID, finalArtistID, count)
}.cache()
I can only make an assumption, as the listed code is missing pieces, but my guess is that bArtistAlias is supposed to be a Map that SHOULD be broadcast, but isn't.
I went and found the piece of code in Sandy's book and it corroborates my guess. So, you seem to be missing this piece:
val bArtistAlias = sc.broadcast(artistAlias)
Without the rest of the code I'm not even sure what you did, but it looks like you broadcast an RDD[String], hence the error. That would not work anyway, as you cannot use one RDD inside another RDD's transformation.
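For reference, a minimal sketch of the intended setup, assuming rawArtistAlias is the RDD[String] read from the alias file as in the book (the names and parsing here are illustrative):
// Build an ordinary Map[Int, Int] on the driver; this is NOT an RDD.
val artistAlias: Map[Int, Int] = rawArtistAlias.flatMap { line =>
  val Array(artist, alias) = line.split('\t')
  if (artist.isEmpty) None else Some((artist.toInt, alias.toInt))
}.collect().toMap

// Broadcasting the Map makes bArtistAlias.value a plain Map on every executor,
// so Map.getOrElse is available inside the map closure shown in the question.
val bArtistAlias = sc.broadcast(artistAlias)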