How to remove "Missing transform attribute error"? - pyspark

I am writing code in Palantir Foundry using PySpark and I have an error which I am unable to figure out.
The error is:
A TransformInput object does not have an attribute withColumn.
Please check the spelling and/or the datatype of the object.
My code, for reference:
import pyspark.sql.functions as F
import pyspark.sql.types as T
from pyspark.sql.functions import when
from transforms.api import configure, transform, Input, Output

@transform(
    result=Output('Output_data_file_location'),
    first_input=Input('Input_file1'),
    second_input=Input('Input_file2'),
)
def function_temp(first_input, second_input, result):
    from pyspark.sql.functions import monotonically_increasing_id
    res = ncbs.withColumn("id", monotonically_increasing_id())
    # Recode type
    res = res.withColumn("old_col_type", F.when(
        (F.col("col_type") == 'left') | (F.col("col_type") == 'right'), 'turn'
    ).when(
        (F.col("col_type") == 'up') | (F.col("col_type") == 'down'), 'straight'
    ))
    res = res.withColumnRenamed("old_col_type", "t_old_col_type") \
        .withColumnRenamed("old_col2_type", "t_old_col2_type")
    res = res.filter(res.col_type == 'straight')
    res = res.join(second_input,  # eqNullSafe is like an equal sign but includes null in join
        (res.col1.eqNullSafe(second_input.pre_col1)) &
        (res.col2.eqNullSafe(second_input.pre_col2)),
        how='left') \
        .drop(*["pre_col1", "pre_col2"]).withColumnRenamed("temp_result", "final_answer")
    result.write_dataframe(res)
Can anyone help me with the error? Thanks in advance.

The error you are receiving explains it pretty well: you are calling .withColumn() on an object that is not a regular Spark DataFrame but a TransformInput object. You need to call the .dataframe() method to access the DataFrame.
See the documentation for reference.
In addition, you should move the monotonically_increasing_id import to the top of the file, since Foundry's transforms logic-level versioning only works when imports happen at module level, according to the documentation.
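For illustration only, here is a minimal sketch of what the corrected transform skeleton could look like, assuming the withColumn chain was meant to run on first_input (the ncbs variable in the question is not defined in the snippet):
from transforms.api import transform, Input, Output
import pyspark.sql.functions as F
from pyspark.sql.functions import monotonically_increasing_id  # module-level import

@transform(
    result=Output('Output_data_file_location'),
    first_input=Input('Input_file1'),
    second_input=Input('Input_file2'),
)
def function_temp(first_input, second_input, result):
    # Convert the TransformInput objects into regular Spark DataFrames first.
    ncbs = first_input.dataframe()        # assumption: ncbs was meant to be this input
    second_df = second_input.dataframe()

    res = ncbs.withColumn("id", monotonically_increasing_id())
    # ... the rest of the withColumn / filter / join logic goes here,
    # joining against second_df rather than the raw TransformInput ...
    result.write_dataframe(res)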

Related

Add error handling decorator in addition to pandas_udf

I'd like to create a decorator that handles errors inside of a pandas_udf. I've made a few attempts with no luck, so I wanted to see if anyone has been successful in doing this.
Below is some initial code I've tried, but it fails. In this example, I'm trying to decorate the function pandas_divide with both pandas_udf and a new decorator that detects errors, return_code.
I'm not sure if my idea is possible, given that pandas UDFs require us to define a single return data type (whereas the idea of wrapping the call in a safe wrapper would allow either the output of the function or an exception to be returned in the column). I tried researching whether I could define a new PySpark data type that is the union of one data type, an exception and None, but did not have any luck - is this possible?
I was also thinking of using a closure to get this functionality, but closures are new to me, so I'm still looking into this.
from pyspark.sql import types as T
from pyspark.sql import functions as F
from pyspark.sql.functions import pandas_udf, PandasUDFType
# create dataframe for testing
df = spark.range(0, 10).withColumn('id', (F.col('id') / 10).cast('integer')).withColumn('v', F.rand())
columns = ['id', 'v']
vals = [(1, 2), (2, 0), (3, 0)]
new_rows_df = spark.createDataFrame(vals, columns)
df = df.union(new_rows_df)
df.cache()
df.count()
display(df)
class ReturnCode:
    def __init__(self):
        self.pass1 = 'PASS'
        self.fail1 = 'FAIL'

    def __call__(self, fn, *args, **kwargs):
        def inner_func(*args, **kwargs):
            try:
                output = fn(*args, **kwargs)
                return_code = self.pass1
            except Exception as ex:
                output = f"{ex}"
                return_code = self.fail1
            return (return_code, output)
        return inner_func

return_code = ReturnCode()
@pandas_udf(T.StructType([
    T.StructField('return_code', T.StringType()),
    T.StructField('value', T.IntegerType()),
]))
@return_code
def pandas_divide(v):
    if v == 0:
        raise ValueError("division by zero")
    return 1 / v

# pandas_divide(0)[0]
df = df.withColumn('pandas_divide', pandas_divide(F.col('v')))
df.show()
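For what it's worth, a minimal sketch of one way to get a (return_code, value) struct back is to handle the error inside the UDF body itself rather than through an outer decorator. This is not the asker's decorator approach; the names safe_divide and result_schema are assumptions, and it assumes Spark 3.x, where a pandas UDF with a StructType return type returns a pandas.DataFrame:
import pandas as pd
from pyspark.sql import functions as F, types as T
from pyspark.sql.functions import pandas_udf

# Struct carrying an error flag alongside the computed value.
result_schema = T.StructType([
    T.StructField('return_code', T.StringType()),
    T.StructField('value', T.DoubleType()),
])

@pandas_udf(result_schema)
def safe_divide(v: pd.Series) -> pd.DataFrame:
    codes, values = [], []
    for x in v:
        try:
            if x == 0:
                raise ValueError("division by zero")
            values.append(1 / x)
            codes.append('PASS')
        except Exception:
            values.append(None)
            codes.append('FAIL')
    return pd.DataFrame({'return_code': codes, 'value': values})

df = df.withColumn('pandas_divide', safe_divide(F.col('v')))
df.select('v', 'pandas_divide.return_code', 'pandas_divide.value').show()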

Using JsonPath from Scala to extract a full field : value list

I'm trying to get a fully qualified set of path : value pairs from a JSON document.
i.e. given
{"a":"b", "c":{"d":3}}
I'd like
a :: "b"
c.d :: 3
or something spiritually similar. There appears to be a Java library which claims to do exactly that:
import $ivy.`com.jayway.jsonpath:json-path:2.6.0`
import com.jayway.jsonpath.Configuration
import com.jayway.jsonpath.Option
import com.jayway.jsonpath.JsonPath._
val conf = com.jayway.jsonpath.Configuration.defaultConfiguration();
val pathList = using(conf).parse("""{"a":"b", "c":{"d":3}}""")
val arg = pathList.read("$..id")
I get this error:
java.lang.ClassCastException: net.minidev.json.JSONArray cannot be cast to scala.runtime.Nothing$
at repl.MdocSession$App.<init>(json test.worksheet.sc:38)
at repl.MdocSession$.app(json test.worksheet.sc:3)
Any ideas out there?
val arg = pathList.read[net.minidev.json.JSONArray]("$..*")
Needed an explicit type parameter: read is generic, and without one the Scala compiler infers Nothing, so the returned JSONArray cannot be cast to scala.runtime.Nothing$. Supplying the expected type fixes the ClassCastException.

How can PySpark remember something in memory, like class attributes in MapReduce?

I have a table with 2 columns: image_url, comment. The same image may have many comments, and the data are sorted by image_url within the files.
I need to crawl each image and convert it to binary. This takes a long time, so for the same image I want to do it only once.
In MapReduce, I can remember the last row and its result in memory.
class Mapper:
    def __init__(self):
        self.image_url = None
        self.image_bin = None

    def run(self, image_url, comment):
        if image_url != self.image_url:
            self.image_url = image_url
            self.image_bin = process(image_url)
        return self.image_url, self.image_bin, comment
How can I do it in PySpark? Either an RDD or a DataFrame approach is OK.
I would advise you to simply process a grouped version of your dataframe, something like this:
from pyspark.sql import functions as F
# Assuming df is your dataframe
df = df.groupBy("image_url").agg(F.collect_list("comment").alias("comments"))
df = df.withColumn("image_bin", process(F.col("image_url")))
df.select(
"image_url",
"image_bin",
F.explode("comments").alias("comment"),
).show()
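One caveat: for the snippet above to run, process has to be something Spark can apply to a column, which in plain PySpark usually means a UDF. A minimal sketch, where the downloader body is an assumption and not part of the original answer:
import urllib.request
from pyspark.sql import functions as F, types as T

@F.udf(returnType=T.BinaryType())
def process(url):
    # Hypothetical downloader: called once per row of the grouped dataframe,
    # i.e. once per distinct image_url, and returns the raw image bytes.
    with urllib.request.urlopen(url) as resp:
        return resp.read()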
I found that mapPartitions works. The code looks like this:
def do_cover_partition(partitionData):
    # Rows arrive sorted by cover_url within each partition, so the downloaded
    # binary can be reused for consecutive rows that share the same URL.
    last_url = None
    last_bin = None
    for row in partitionData:
        data = row.asDict()
        print(data)
        if data['cover_url'] != last_url:
            last_url = data['cover_url']
            last_bin = url2bin(last_url)
        print(data['comment'])
        data['frames'] = last_bin
        yield data

columns = ["id", "cover_url", "comment", "frames"]
df = df.rdd.mapPartitions(do_cover_partition).map(lambda x: [x[c] for c in columns]).toDF(columns)

Scala, Spark, Geotrellis RDD CRS reprojection

I load a set of points from a CSV file into an RDD:
case class yieldrow(Elevation: Double, DryYield: Double)

val points: RDD[PointFeature[yieldrow]] = lines.map { line =>
  val fields = line.split(",")
  val point = Point(fields(1).toDouble, fields(0).toDouble)
  Feature(point, yieldrow(fields(4).toDouble, fields(20).toDouble))
}
Then I get:
points: org.apache.spark.rdd.RDD[geotrellis.vector.PointFeature[yieldrow]]
Now I need to reproject from EPSG:4326 to EPSG:32720,
so I create the source and target CRS:
val crsFrom : geotrellis.proj4.CRS = geotrellis.proj4.CRS.fromName("EPSG:4326")
val crsTo : geotrellis.proj4.CRS = geotrellis.proj4.CRS.fromEpsgCode(32720)
But I cannot create the transform, and I also do not know:
How to apply a transform to a single point:
val pt = Point(-64.9772376007928, -33.6408083223936)
How to use the mapGeom method of Feature to make a CRS transformation?
points.map(_.mapGeom(?????))
points.map(feature => feature.mapGeom(????))
How to use ReprojectPointFeature(pointfeature)?
The documentation does not have basic code samples.
Any help will be appreciated.
I'll start from the last question:
Indeed, to perform a reproject on a PointFeature you can use the ReprojectPointFeature implicit case class. To use it, just be sure that you have import geotrellis.vector._ in scope at the reproject call site.
import geotrellis.vector._
points.map(_.reproject(crsFrom, crsTo))
The same import works for a Point too:
import geotrellis.vector._
pt.reproject(crsFrom, crsTo)
And mapGeom can be used the same way:
points.map(_.mapGeom(_.reproject(crsFrom, crsTo)))

Reading Basic File Attributes in Scala?

I'm trying to get basic file attributes using Scala, and my reference is this Java question:
Determine file creation date in Java
and this piece of code I'm trying to rewrite in Scala:
static void getAttributes(String pathStr) throws IOException {
    Path p = Paths.get(pathStr);
    BasicFileAttributes view
        = Files.getFileAttributeView(p, BasicFileAttributeView.class)
               .readAttributes();
    System.out.println(view.creationTime() + " is the same as " + view.lastModifiedTime());
}
The thing I just can't figure out is this line of code... I don't understand how to pass a class in this way using Scala, or why Java insists on this in the first place instead of taking an actual constructed object as the parameter. Can someone please help me write this line of code so that it works? I must be using the wrong syntax:
val attr = Files.readAttributes(f,Class[BasicFileAttributeView])
Try this:
def attrs(pathStr: String) =
  Files.getFileAttributeView(
    Paths.get(pathStr),
    classOf[BasicFileAttributeView] // corrected: pass the view class with classOf
  ).readAttributes
Get the file creation date in Scala from BasicFileAttributes:
// option 1,
import java.nio.file.{Files, Paths}
import java.nio.file.attribute.BasicFileAttributes
val pathStr = "/tmp/test.sql"
Files.readAttributes(Paths.get(pathStr), classOf[BasicFileAttributes]).creationTime
res3: java.nio.file.attribute.FileTime = 2018-03-06T00:25:52Z
// option 2,
import java.nio.file.{Files, Paths}
import java.nio.file.attribute.BasicFileAttributeView
val pathStr = "/tmp/test.sql"
{
  Files
    .getFileAttributeView(Paths.get(pathStr), classOf[BasicFileAttributeView])
    .readAttributes.creationTime
}
res20: java.nio.file.attribute.FileTime = 2018-03-07T19:00:19Z