Geomesa Pyspark AnalysisException: Undefined function: st_transform - scala

I am trying to get the area of a bunch of polygons. I am able to use st_area and st_geomFromText, but I get an undefined function error when trying to use st_transform. I need to transform from 4326 to 3857 (or whatever will give me acres).
geomesa version: I had 2.4.2, but now I have 3.4.
error:
AnalysisException: Undefined function: st_transform. This function is neither a built-in/temporary function, nor a persistent function that is qualified as spark_catalog.default.st_transform.; line 4 pos 2
code:
%scala
import spark.implicits._
import org.apache.spark.sql.functions._
import org.locationtech.geomesa.spark.jts._
spark.withJTS
import org.locationtech.geomesa.spark.geotools._
%sql
select
a.user_id
, Sum(st_area(st_geomFromText(a.polygon))) --this works
, Sum(st_transform(st_geomFromText(a.polygon), 'EPSG:4326','EPSG:3857')) --this does not work
from core_table a
group by a.user_id
I've tried changing the fromCRS and toCRS strings: lowercase, different EPSG codes, with and without quotes, etc. I've also tried wrapping the st_transform() in a SUM(), but nothing works.
I've also tried something similar to this.

The geometric functions are in a different package because they require GeoTools, not just JTS. I believe you need to call org.apache.spark.sql.SQLTypes#init or, equivalently, org.locationtech.geomesa.spark.GeometricDistanceFunctions#registerFunctions to make them available.
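Once st_transform is registered, the units question is simple arithmetic: st_area over EPSG:3857 geometry returns square metres, and dividing by 4046.8564224 gives acres. For intuition, here's a minimal pure-Python sketch of the 4326 → 3857 reprojection itself (the helper name is mine, not a GeoMesa API):

```python
import math

R = 6378137.0  # Web Mercator sphere radius in metres (EPSG:3857)

def lonlat_to_webmercator(lon, lat):
    # EPSG:4326 degrees -> EPSG:3857 metres, the reprojection st_transform performs
    x = math.radians(lon) * R
    y = R * math.log(math.tan(math.pi / 4 + math.radians(lat) / 2))
    return x, y

SQM_PER_ACRE = 4046.8564224  # exact: one international acre in square metres

print(round(lonlat_to_webmercator(180.0, 0.0)[0], 2))  # 20037508.34
```

Note that EPSG:3857 inflates areas away from the equator, so for acreage an equal-area CRS may be a better target than Web Mercator.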

Related

In PySpark, why does the add_months function only work when given an integer, and not a Column of integers?

I'm experimenting with PySpark, and the following has me stumped. The documentation for the add_months function says it can take a Column as its second argument, but my simple toy examples are failing. Is this an error? Or am I missing some fundamental understanding of how to read the docs and/or the source code?
To recreate, start with a simple DataFrame containing a column of dates as strings:
import pyspark.sql.functions as F
dates = ["2020-01-01", "2020-02-01"]
df = spark.createDataFrame(zip(dates), ["date"])
df.show()
Result:
+----------+
| date|
+----------+
|2020-01-01|
|2020-02-01|
+----------+
The following code works. It adds 1 month. Note that I am passing an integer for the second argument.
df.withColumn(
"date_plus_one",
F.add_months(
F.col("date"),
1,
)
).show()
Result:
+----------+-------------+
| date|date_plus_one|
+----------+-------------+
|2020-01-01| 2020-02-01|
|2020-02-01| 2020-03-01|
+----------+-------------+
However, this version does not work. Note that I am passing in a literal Column of integers.
df.withColumn(
"date_plus_one",
F.add_months(
F.col("date"),
F.lit(1) # <-- only difference
)
).show()
The error I receive is: "Column is not iterable."
According to the documentation for add_months, the second argument should be able to receive either a ColumnOrName, or an int.
In fact, the source code will even convert an integer into a column of integers before passing along to the next function:
def add_months(start: "ColumnOrName", months: Union["ColumnOrName", int]) -> Column:
months = lit(months) if isinstance(months, int) else months
return _invoke_function_over_columns("add_months", start, months)
Though, this is where my ability to read the source code stops.
I'm confused why my second attempt results in an error (because the function should be able to receive a column), and particularly the error, "Column is not iterable." (I understand that Columns in PySpark are generally not iterable because they are spread across multiple RDDs, which is why I'm not supposed to write code myself to do things when a pyspark function exists to do it for me, like in this case.)
Note that I'm getting similar errors for similar functions, like date_sub().
I would like to understand why the function doesn't seem to take the arguments that its signature says it can take.
I was curious to understand the discrepancy when I first stumbled upon this. It turns out the documentation we usually refer to is for the latest version!
So, I went through the installed spark package to find the source code for the version I was using (3.1.3) and, voila! It contained the older version of the function, which is strongly typed and asks for an integer as its second input. An easy way to bypass it is to use the SQL expression within expr, which readily accepts columns.
A little research made things clearer (and, IMO, this should've been highlighted in the documentation itself). I read through the blames/commits on Spark's GitHub repo and found this commit, tagged to this Jira issue, which updated the function signatures for all date-calculation functions. The updated versions were rolled out with Spark 3.3, meaning all previous versions have the older versions of the aforementioned functions and will continue to accept only integers.
Here's how you can find your version's source for these functions.
When you run the function with a bad input, it'll raise an exception, and the traceback will show the location of your functions.py file at the top. You can then follow that path and search for the function's source, which in my case was the following:
def add_months(start, months):
"""
Returns the date that is `months` months after `start`
.. versionadded:: 1.5.0
Examples
--------
>>> df = spark.createDataFrame([('2015-04-08',)], ['dt'])
>>> df.select(add_months(df.dt, 1).alias('next_month')).collect()
[Row(next_month=datetime.date(2015, 5, 8))]
"""
sc = SparkContext._active_spark_context
return Column(sc._jvm.functions.add_months(_to_java_column(start), months))
The aforementioned assumes the second input will always be an integer.
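For comparison, here's a rough pure-Python stand-in for what add_months computes (a hypothetical sketch, not Spark's implementation), which makes the older integer-only signature easy to experiment with locally. It assumes Spark's documented day-clamping behaviour:

```python
import calendar
from datetime import date

def add_months(start, months):
    # hypothetical pure-Python stand-in for Spark's add_months:
    # advance `months` months, clamping the day to the target month's last day
    y, m = divmod(start.month - 1 + months, 12)
    year, month = start.year + y, m + 1
    day = min(start.day, calendar.monthrange(year, month)[1])
    return date(year, month, day)

print(add_months(date(2015, 4, 8), 1))   # 2015-05-08, matching the docstring example
print(add_months(date(2020, 1, 31), 1))  # 2020-02-29 (day clamped to month end)
```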

pyspark round function throws "Invalid argument, not a string or column" error

I have an ELT process in which, after the dataset has been created, the code below is executed to count the rows and determine the number of partitions.
provider_out = get_provider(spark)
numofpartitions = round(provider_out.count()/10000000)
This numofpartitions variable is used later to partition the data equally when writing to the destination, as shown below.
provider_out.repartition(numofpartitions).write.mode("overwrite").parquet(dest_path)
I'm running into a problem when the numofpartitions variable gets calculated: it throws the error "Invalid argument, not a string or column: 45.0838586 of type <class 'float'>. For column literals, use 'lit', 'array', 'struct' or 'create_map' function." I know this can occur when the result gets passed as an object, but it's a single float value that is being rounded.
Any idea why this may be happening?
It seems you've imported the pyspark sql functions without an alias. Are they being imported as from pyspark.sql.functions import *? If yes, the round() from pyspark's sql functions is being called instead of Python's built-in round(). Tip: it's good practice to import pyspark sql functions with an alias.
pyspark's round() requires a column (a column name, col(), or lit()) to process and thus throws an error when given a bare Python number.
See below test.
import pyspark.sql.functions as func
round(data1_rdd.count()/10)
# 0
func.round(data1_rdd.count()/10) # you inadvertently called this
# TypeError: Invalid argument, not a string or column: 0.2 of type <class 'float'>. For column literals, use 'lit', 'array', 'struct' or 'create_map' function.
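The shadowing itself is plain Python behaviour and easy to reproduce without Spark. This sketch uses a dummy stand-in for pyspark's round to show why the wildcard import breaks the builtin and why an explicit reference (or an alias) fixes it:

```python
import builtins

def round(col):
    # dummy stand-in for pyspark.sql.functions.round, which rejects bare numbers;
    # `from pyspark.sql.functions import *` binds the name just like this def does
    if isinstance(col, (int, float)):
        raise TypeError(f"Invalid argument, not a string or column: {col}")
    return col

try:
    round(45.0838586)  # resolves to the shadowing function, not the builtin
except TypeError as e:
    print(e)

# calling the builtin explicitly (or importing with an alias) avoids the clash
print(builtins.round(45.0838586))  # 45
```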

How do I calculate a simple one-sample t-statistic in Scala-Spark in an AWS EMR cluster?

I'm a data scientist and still relatively new to Scala. I'm trying to understand the Scala documentation and run a t-test from any existing package. I am looking for sample Scala code on a dummy data set that will work and insight into understanding how to understand the documentation.
I'm working in an EMR Notebook (basically Jupyter notebook) in an AWS EMR cluster environment. I tried referring to this documentation but apparently I am not able to understand it: https://commons.apache.org/proper/commons-math/javadocs/api-3.6/org/apache/commons/math3/stat/inference/TTest.html#TTest()
Here's what I've tried, using multiple import statements for two different packages that have t-test functions. I have multiple lines for the math3.stat.inference package since I'm not entirely certain of the differences between them and wanted to make sure this part wasn't the problem.
import org.apache.commons.math3.stat.inference
import org.apache.commons.math3.stat.inference._ // not sure if this means: import all classes/methods/functions
import org.apache.commons.math3.stat.inference.TTest._
import org.apache.commons.math3.stat.inference.TTest
import org.apache.spark.mllib.stat.test
No errors there.
import org.apache.asdf
Returns an error, as expected.
The documentation for math3.stat.inference says there is a TTest() constructor and then shows a bunch of methods. How does this tell me how to use these functions/methods/classes? I see the following method does what I'm looking for:
t(double m, double mu, double v, double n)
Computes t test statistic for 1-sample t-test.
but I don't know how to use it. Here are just several things I've tried:
inference.t
inference.StudentTTest
test.student
test.TTest
TTest.t
etc.
But I get errors like the following:
An error was encountered:
<console>:42: error: object t is not a member of package org.apache.spark.mllib.stat.test
test.t
An error was encountered:
<console>:42: error: object TTest is not a member of package org.apache.spark.mllib.stat.test
test.TTest
...etc.
So how do I fix these issues/calculate a simple, one-sample t-statistic in Scala with a Spark kernel? Any instructions/guidance on how to understand the documentation will be helpful for the long-term as well.
The formula for the one-sample t statistic is straightforward to implement as a UDF (user-defined function).
UDFs are how we write custom functions to apply to each row of a DataFrame. I assume you are okay with generating the aggregated values using the standard groupBy and agg functions.
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.functions.udf
import org.apache.spark.sql.expressions.UserDefinedFunction
import spark.implicits._ // needed for Seq#toDF
val data = Seq((310, 40, 300.0, 18.5), (310, 41, 320.0, 14.5)).toDF("mu", "sample_size", "sample_mean", "sample_sd")
+---+-----------+-----------+---------+
| mu|sample_size|sample_mean|sample_sd|
+---+-----------+-----------+---------+
|310| 40| 300.0| 18.5|
|310| 41| 320.0| 14.5|
+---+-----------+-----------+---------+
val testStatisticUdf: UserDefinedFunction = udf {
(sample_mean: Double, mu:Double, sample_sd:Double, sample_size: Int) =>
(sample_mean - mu) / (sample_sd / math.sqrt(sample_size.toDouble))
}
val result = data.withColumn("testStatistic", testStatisticUdf(col("sample_mean"), col("mu"), col("sample_sd"), col("sample_size")))
+---+-----------+-----------+---------+-------------------+
| mu|sample_size|sample_mean|sample_sd| testStatistic|
+---+-----------+-----------+---------+-------------------+
|310| 40| 300.0| 18.5|-3.4186785515333833|
|310| 41| 320.0| 14.5| 4.4159477499536886|
+---+-----------+-----------+---------+-------------------+
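You can sanity-check the UDF against a plain implementation of the same formula, t = (x̄ − μ) / (s / √n); the values reproduce the testStatistic column above:

```python
import math

def one_sample_t(sample_mean, mu, sample_sd, sample_size):
    # one-sample t statistic: t = (x_bar - mu) / (s / sqrt(n))
    return (sample_mean - mu) / (sample_sd / math.sqrt(sample_size))

print(one_sample_t(300.0, 310, 18.5, 40))  # ≈ -3.41868, matching the first row
print(one_sample_t(320.0, 310, 14.5, 41))  # ≈ 4.41595, matching the second row
```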

How does $ symbol working when selecting columns from DataFrame?

When selecting columns from a DataFrame, one can use $"columnname", col("columnname"), or just "columnname".
My question is how the $ symbol (which returns a ColumnName) works. I understand I need to import sqlContext.implicits._ to use $ in df.select.
I don't see a $ method on the SQLImplicits class either; I can only see one method named symbolToColumn(scala.Symbol s).
Can someone explain more on this?
It comes from the StringToColumn implicit class defined in SQLImplicits (which the implicits object extends). Its $ method is added to StringContext, which is the same mechanism behind the f, s, and other string interpolators in Scala.

Why can you import a package *after* using its content in a function?

I'm on MATLAB R2014b and have a question that I will illustrate with the following example.
An MWE can be made as follows, or you can download it as a .zip file here.
Create a package folder +test on your path with four function files in it:
+test
a.m
b.m
c.m
d.m
Content of a.m:
function a
disp 'Hello World!'
Content of b.m:
function b
a
If you run b from the command line, you will have to import the test package first (import test.*) or run test.b.
Running b will result in an error, since the scope of function b doesn't contain function a. We must import it before it can be used. For this I've created c.m:
function c
import test.*
a
Now running c works fine.
Now my question. If I change c.m to (saved in d.m):
function d
a
import test.*
I.e. the import command is issued after the call to package function a. Running d still works just fine, as if the position of the import command in d.m does not matter. The import appears to have occurred before the call to function a, which in d.m happens on the line before the import.
Why does this happen? Is this the intended behaviour, and what are its uses? How, and in what order, does MATLAB read a .m file and process it? And, more off-topic but in general: how is importing packages handled in different languages compared to MATLAB; does the order of commands matter?
My preemptive conclusion based on the comments: It is probably best practice to only use the import function at or near the beginning of MATLAB code. This makes clearly visible the imported content is available throughout the entire element (e.g. function). It also prevents the incorrect assumption that before the import, the content is not yet available or refers to a different thing with the same name.
MATLAB performs static code analysis prior to evaluating a function in order to determine the variables/functions used by that function. Evaluation of the import statements is part of this static code analysis. This is by design: if you import a package and then use its functions, MATLAB needs to know this during the static code analysis. As a result, regardless of where you put the import statement within your function, it has the same effect as if it were at the beginning of the function.
You can easily test this by looking at the output of import which will list all of the current imported packages.
+test/a.m
function a(x)
disp(import)
import test.*
end
test.a()
% test.*
This is why the documentation states to not put an import statement within a conditional.
Do not use import in conditional statements inside a function. MATLAB preprocesses the import statement before evaluating the variables in the conditional statements.
function a(x)
disp(import)
if x
import test.*
else
import othertest.*
end
end
test.a()
% test.*
% othertest.*
The only way to avoid this behavior is to allow the static code analyzer to determine (without a doubt) that an import statement won't be executed. We can do this by having our conditional statement be simply a logical value.
function a()
disp(import)
if true
import test.*
else
import othertest.*
end
end
test.a()
% test.*
As for importing compared to other languages, it really depends on the language. In Python, for example, you must place the import before accessing the module's contents. In my experience this is the typical case, but I'm sure there are exceptions; every language is different.
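A quick Python sketch of that contrast: names are resolved when a line executes, so using a module before its import statement has run fails at run time rather than being hoisted as in MATLAB:

```python
def use_math():
    # refers to the global name `math`, which only exists after the import runs
    return math.sqrt(4)

try:
    use_math()  # called before `import math` below has executed
except NameError as e:
    print("before import:", e)

import math  # unlike MATLAB, this only takes effect from this point on

print("after import:", use_math())  # after import: 2.0
```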