How to Pickle a spaCy Model for Use in a PySpark Function

I am running a spaCy matcher model, defined as follows:
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType, ArrayType
import spacy
from spacy.matcher import Matcher
nlp = spacy.load("en_core_web_sm")
def spacy_matcher(text):
    doc = nlp(text)
    matcher = Matcher(nlp.vocab)
    matcher.add("NounChunks", None, [{"POS": "NOUN", "OP": "+"}])
    matches = matcher(doc)
    spans = [doc[start:end] for _, start, end in matches]
    return [spacy.util.filter_spans(spans)]

matcher2 = udf(spacy_matcher, ArrayType(StringType()))
When I try to apply this udf to a new column:
test = reviews.withColumn('chunk',matcher2('SENTENCE'))
test.show()
I get a pickling error:
NotImplementedError: [E112] Pickling a span is not supported, because spans are only views of the parent Doc and can't exist on their own. A pickled span would always have to include its Doc and Vocab, which has practically no advantage over pickling the parent Doc directly. So instead of pickling the span, pickle the Doc it belongs to or use Span.as_doc to convert the span to a standalone Doc object.
I haven't done much with pickling and am unsure how to handle this. I do want to keep the spans, because those are what define my chunks. Any idea how to pickle this properly?
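One way to sidestep the error (a sketch, not from the original post) is to never return Span objects from the UDF: convert each matched span to its text inside the function, so only plain Python strings, which match the declared ArrayType(StringType()), ever get pickled. This keeps the spaCy 2.x matcher.add signature used in the question.

from pyspark.sql.functions import udf
from pyspark.sql.types import StringType, ArrayType
import spacy
from spacy.matcher import Matcher
from spacy.util import filter_spans

nlp = spacy.load("en_core_web_sm")

def spacy_matcher(text):
    doc = nlp(text)
    matcher = Matcher(nlp.vocab)
    matcher.add("NounChunks", None, [{"POS": "NOUN", "OP": "+"}])
    spans = [doc[start:end] for _, start, end in matcher(doc)]
    # Return span texts rather than Span objects, so nothing that references
    # the parent Doc has to be pickled.
    return [span.text for span in filter_spans(spans)]

matcher2 = udf(spacy_matcher, ArrayType(StringType()))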

Related

How can I iterate over JSON files in Code Repositories and incrementally append to a dataset

I have imported a dataset with 100,000 raw JSON files of about 100 GB through Data Connection into Foundry. I want to use the Python Transforms raw file access transformation to read the files and flatten arrays of structs and structs into a dataframe, as an incremental update to the dataframe.
I want to use something like the example below from the documentation, but for *.json files, and also convert it into an incremental update using the @incremental() decorator.
>>> import csv
>>> from pyspark.sql import Row
>>> from transforms.api import transform, Input, Output
>>>
>>> @transform(
...     processed=Output('/examples/hair_eye_color_processed'),
...     hair_eye_color=Input('/examples/students_hair_eye_color_csv'),
... )
... def example_computation(hair_eye_color, processed):
...
...     def process_file(file_status):
...         with hair_eye_color.filesystem().open(file_status.path) as f:
...             r = csv.reader(f)
...
...             # Construct a pyspark.Row from our header row
...             header = next(r)
...             MyRow = Row(*header)
...
...             for row in csv.reader(f):
...                 yield MyRow(*row)
...
...     files_df = hair_eye_color.filesystem().files('**/*.csv')
...     processed_df = files_df.rdd.flatMap(process_file).toDF()
...     processed.write_dataframe(processed_df)
With the help of @Jeremy David Gamet I was able to develop the code to get the dataset I want.
from transforms.api import transform, Input, Output
from pyspark import *
import json

@transform(
    out=Output('foundry/outputdataset'),
    inpt=Input('foundry/inputdataset'),
)
def update_set(ctx, inpt, out):
    spark = ctx.spark_session
    sc = spark.sparkContext

    filesystem = list(inpt.filesystem().ls())
    file_dates = []
    for files in filesystem:
        with inpt.filesystem().open(files.path, 'r', encoding='utf-8-sig') as fi:
            data = json.load(fi)
            file_dates.append(data)

    json_object = json.dumps(file_dates)
    df_2 = spark.read.option("multiline", "true").json(sc.parallelize([json_object]))
    df_2 = df_2.drop_duplicates()
    # flatten the array columns; `flatten` is the helper from the linked
    # "Flatten array column" answer (its code is not reproduced here)
    df_2 = flatten(df_2)
    out.write_dataframe(df_2)
The above code works for a few files, but since there are over 100,000 files I am hitting the following error:
Connection To Driver Lost
This error indicates that connection to the driver was lost unexpectedly, which is often caused by the driver being terminated due to running out of memory. Common reasons for driver out-of-memory (OOM) errors include functions that materialize data to the driver such as .collect(), broadcasted joins, and using Pandas dataframes.
Any way around this?
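One way around the driver OOM (a sketch, not part of the original exchange; the paths are the placeholders from the question, each file is assumed to hold a single JSON document, and the flatten step is omitted) is to reuse the documentation's process_file/flatMap pattern so that the JSON is parsed on the executors instead of json.load-ing all 100,000 files on the driver:

import json
from transforms.api import transform, Input, Output

@transform(
    out=Output('foundry/outputdataset'),
    inpt=Input('foundry/inputdataset'),
)
def update_set(ctx, inpt, out):
    spark = ctx.spark_session

    def process_file(file_status):
        # Runs on the executors: each task opens and parses one file, so the
        # driver never holds the raw JSON for the whole dataset.
        with inpt.filesystem().open(file_status.path, 'r', encoding='utf-8-sig') as fi:
            yield json.dumps(json.load(fi))

    files_df = inpt.filesystem().files('**/*.json')
    json_rdd = files_df.rdd.flatMap(process_file)
    df = spark.read.json(json_rdd)
    out.write_dataframe(df.drop_duplicates())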
I have given an example of how this can be done dynamically as an answer to another question.
Here is the link to that code answer: How to union multiple dynamic inputs in Palantir Foundry? and a copy of the same code:
from transforms.api import Input, Output, transform
from pyspark.sql import functions as F
import json
import logging


def transform_generator():
    transforms = []
    transf_dict = {## enter your dynamic mappings here ##}

    for value in transf_dict:
        @transform(
            out=Output(' path to your output here '.format(val=value)),
            inpt=Input(" path to input here ".format(val=value)),
        )
        def update_set(ctx, inpt, out):
            spark = ctx.spark_session
            sc = spark.sparkContext

            filesystem = list(inpt.filesystem().ls())
            file_dates = []
            for files in filesystem:
                with inpt.filesystem().open(files.path) as fi:
                    data = json.load(fi)
                    file_dates.append(data)
            logging.info('info logs:')
            logging.info(file_dates)
            json_object = json.dumps(file_dates)
            df_2 = spark.read.option("multiline", "true").json(sc.parallelize([json_object]))
            df_2 = df_2.withColumn('upload_date', F.current_date())
            df_2 = df_2.drop_duplicates()
            out.write_dataframe(df_2)

        transforms.append(update_set)

    return transforms


TRANSFORMS = transform_generator()
Please let me know if there is anything I can clarify.

Best practice to define implicit/explicit encoding in dataframe column value extraction without RDD

I am trying to get column data into a collection without the RDD map API (doing it the pure DataFrame way):
object CommonObject {
  def doSomething(...) {
    .......
    val releaseDate = tableDF.where(tableDF("item") <=> "releaseDate")
      .select("value")
      .map(r => r.getString(0))
      .collect
      .toList
      .head
  }
}
This is all good, except that Spark 2.3 suggests
No implicits found for parameter evidence$6: Encoder[String]
between map and collect:
    map(r => r.getString(0))(...).collect
I understand I need to add
import spark.implicits._
before the process; however, that requires a SparkSession instance. It's pretty annoying, especially when there is no SparkSession instance in the method. As a Spark newbie, how do I nicely resolve the implicit encoding parameter in this context?
You can always add a call to SparkSession.builder.getOrCreate() inside your method. Spark will find the already existing SparkSession and won't create a new one, so there is no performance impact. Then you can import the implicits, which will work for all case classes. This is the easiest way to add encoding. Alternatively, an explicit encoder can be supplied using the Encoders class.
val spark = SparkSession.builder
  .appName("name")
  .master("local[2]")
  .getOrCreate()

import spark.implicits._
The other way is to get the SparkSession from the DataFrame: dataframe.sparkSession
def dummy(df: DataFrame) = {
  val spark = df.sparkSession
  import spark.implicits._
}
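As a sketch of the "explicit encoder" alternative mentioned above (the helper name releaseDateOf is just illustrative), the encoder can be passed directly to map via Encoders.STRING, so no implicits import is needed at all:

import org.apache.spark.sql.{DataFrame, Encoders}

def releaseDateOf(tableDF: DataFrame): String =
  tableDF.where(tableDF("item") <=> "releaseDate")
    .select("value")
    .map(r => r.getString(0))(Encoders.STRING) // encoder supplied explicitly
    .collect()
    .head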

RasterFrames extracting location information problem

Is there a way to extract/query latitude, longitude and elevation data from a tif file using RasterFrames (http://rasterframes.io/)?
Following the documentation, I loaded (loadRF) a tif file from the following site: https://visibleearth.nasa.gov/view.php?id=73934, however all I can see is generic information, and I don't know which RasterFunction to use in order to extract position, elevation or any other relevant information. I tried everything I could find in the API.
I also tried to extract temperature information using the following source: http://worldclim.org/version2
All I get is a tile column with DoubleUserDefinedNoDataArrayTile and the boundary (extent or crs).
RasterStack in R can extract this information according to this blog: https://www.benjaminbell.co.uk/2018/01/extracting-data-and-making-climate-maps.html
I need a more granular DataFrame such as lat,lon,temperature(or whatever data is embedded into the tif file).
Is this possible with RasterFrames or GeoTrellis?
Long story short: yes, it is possible (at least with GeoTrellis). It is also possible with RasterFrames, I suppose, but it will require some time to figure out how to extract this data. I can't answer in more detail, since I would need to know more about the dataset and about the pipeline you want to perform and apply.
Currently you have to do it with a UDF and the relevant GeoTrellis method.
We have a ticket to implement this as a first-class function, but in the meantime, this is the long form:
import org.apache.spark.sql._
import org.apache.spark.sql.functions._
import org.locationtech.rasterframes._
import org.locationtech.rasterframes.datasource.raster._
import org.locationtech.rasterframes.encoders.CatalystSerializer._
import geotrellis.raster._
import geotrellis.vector.Extent
import org.locationtech.jts.geom.Point
object ValueAtPoint extends App {
  implicit val spark = SparkSession.builder()
    .master("local[*]").appName("RasterFrames")
    .withKryoSerialization.getOrCreate().withRasterFrames

  spark.sparkContext.setLogLevel("ERROR")
  import spark.implicits._

  val example = "https://raw.githubusercontent.com/locationtech/rasterframes/develop/core/src/test/resources/LC08_B7_Memphis_COG.tiff"
  val rf = spark.read.raster.from(example).load()
  val point = st_makePoint(766770.000, 3883995.000)

  val rf_value_at_point = udf((extentEnc: Row, tile: Tile, point: Point) => {
    val extent = extentEnc.to[Extent]
    Raster(tile, extent).getDoubleValueAtPoint(point)
  })

  rf.where(st_intersects(rf_geometry($"proj_raster"), point))
    .select(rf_value_at_point(rf_extent($"proj_raster"), rf_tile($"proj_raster"), point) as "value")
    .show(false)

  spark.stop()
}

Spark Catalyst flatMapGroupsWithState: Group State with sorted collection

I am trying to keep a sorted collection in the state of my groups, and I get an error from Catalyst which I think concerns default instance creation for the collection.
Below is a simplified pipeline that demonstrates the error:
package com.example

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._
import org.apache.spark.sql.streaming.{GroupState, GroupStateTimeout, OutputMode, Trigger}

import scala.collection.immutable.TreeMap

case class Event(
  key: String
)

case class KeyState(
  prop: TreeMap[Long, String]
)

object CatalystIssue {

  def updateState(k: String, vs: Iterator[Event],
                  state: GroupState[KeyState]): Iterator[Event] = vs

  def main(args: Array[String]) {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("CatalystIssue")
      .getOrCreate()

    import spark.implicits._

    val df = spark.readStream.format("rate")
      .load()
      .select(lit("a").as("key"))
      .as[Event]
      .groupByKey(_.key)
      .flatMapGroupsWithState(OutputMode.Append(),
        GroupStateTimeout.NoTimeout())(updateState)

    val query = df.writeStream.format("console")
      .trigger(Trigger.ProcessingTime("30 seconds")).start()

    query.awaitTermination()
  }
}
Which produces the error:
ERROR org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator - failed to compile: org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 53, Column 106: No applicable constructor/method found for zero actual parameters; candidates are: "public scala.collection.mutable.Builder scala.collection.generic.SortedMapFactory.newBuilder(scala.math.Ordering)"
This might be because sorted maps are not supported as a dataframe attribute type, although that is not my intention here; I would have thought the KeyState would be opaque to Spark, since you don't actually access it like a dataframe attribute.
While not very attractive, one option might be to serialize the sorted map into a byte array that is an attribute of the KeyState, i.e.
case class KeyState(
  prop: Array[Byte]
)
If Java serialization were used, would that preserve the internal tree structure of the TreeMap, so that at least it would not have to be rebuilt? Are there any alternative serialization technologies that would preserve the structure?
It seems useful to be able to keep some sorted collections in the group state, especially as the computation is supposed to be primarily in memory. Is there something about the way Spark works that makes this fundamentally unworkable?
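A minimal sketch of the byte-array workaround described above, assuming plain Java serialization (toBytes/fromBytes are hypothetical helper names):

import java.io.{ByteArrayInputStream, ByteArrayOutputStream, ObjectInputStream, ObjectOutputStream}
import scala.collection.immutable.TreeMap

case class KeyState(prop: Array[Byte])

// KeyState now only carries Array[Byte] (Catalyst's BinaryType), so no
// encoder for TreeMap itself is needed.
def toBytes(m: TreeMap[Long, String]): Array[Byte] = {
  val bos = new ByteArrayOutputStream()
  val oos = new ObjectOutputStream(bos)
  try oos.writeObject(m) finally oos.close()
  bos.toByteArray
}

def fromBytes(bytes: Array[Byte]): TreeMap[Long, String] = {
  val ois = new ObjectInputStream(new ByteArrayInputStream(bytes))
  try ois.readObject().asInstanceOf[TreeMap[Long, String]] finally ois.close()
}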

Issue with VectorUDT when using Spark ML

I am writing a UDAF to be applied to a Spark DataFrame column of type Vector (spark.ml.linalg.Vector). I rely on the spark.ml.linalg package so that I do not have to go back and forth between DataFrame and RDD.
Inside the UDAF, I have to specify a data type for the input, buffer, and output schemas:
def inputSchema = new StructType().add("features", new VectorUDT())

def bufferSchema: StructType =
  StructType(StructField("list_of_similarities", ArrayType(new VectorUDT(), true), true) :: Nil)

override def dataType: DataType = ArrayType(DoubleType, true)
VectorUDT is what I would use with spark.mllib.linalg.Vector:
https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/linalg/Vectors.scala
However, when I try to import it from spark.ml instead: import org.apache.spark.ml.linalg.VectorUDT
I get a runtime error (no errors during the build):
class VectorUDT in package linalg cannot be accessed in package org.apache.spark.ml.linalg
Is this expected? Can you suggest a workaround?
I am using Spark 2.0.0.
In Spark 2.0.0, the proper way to go is to use org.apache.spark.ml.linalg.SQLDataTypes.VectorType instead of VectorUDT. It was introduced in this issue.
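For example, the schemas from the question could be written as follows (a sketch; only the vector type changes):

import org.apache.spark.ml.linalg.SQLDataTypes.VectorType
import org.apache.spark.sql.types._

// Same UDAF schemas as in the question, using the public VectorType alias
// instead of the inaccessible ml VectorUDT constructor.
def inputSchema: StructType = new StructType().add("features", VectorType)

def bufferSchema: StructType =
  StructType(StructField("list_of_similarities", ArrayType(VectorType, true), true) :: Nil)

override def dataType: DataType = ArrayType(DoubleType, true)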