I have PySpark dataframes with a couple of columns, one of them being a GPS location (in WKT format). What is the easiest way to pick only the rows that are inside some polygon? Does it scale when there are ~1B rows?
I'm using Azure Databricks, and if the solution exists in Python that would be even better, but Scala and SQL are also fine.
Edit: Alex Ott's answer - Mosaic - works and I find it easy to use.
Databricks Labs includes the Mosaic project, a library for processing geospatial data that is heavily optimized for Databricks.
This library provides the st_contains & st_intersects functions (see the docs) that can be used to find rows that are inside your polygons or other objects. These functions are available in all supported languages - Scala, SQL, Python, R. For example, in SQL:
SELECT st_contains("POLYGON ((30 10, 40 40, 20 40, 10 20, 30 10))",
"POINT (25 15)")
Another suggestion, from an AI-generated answer: you can use the ST_Contains function. Note that ST_Contains and ST_GeomFromText are not built into Spark SQL; they come from a geospatial extension such as Apache Sedona, which must be installed and registered on the cluster.
import pyspark.sql.functions as F

df.withColumn("is_inside", F.expr("ST_Contains(ST_GeomFromText('POLYGON((0 0, 0 1, 1 1, 1 0, 0 0))'), ST_GeomFromText(gps))")).where("is_inside").show()
We are using structured streaming to perform aggregations on real time data. I'm creating a configurable Spark job that is given a configuration and uses it to group rows across tumbling windows and performs aggregations. I know how to do this with the functional interface.
Here is a code fragment using the functional interface
var valStream = sparkSession.sql(config.aggSelect)                      //<- 1
  .withWatermark("eventTime", "15 minutes")                             //<- 2
  .groupBy(window($"eventTime", "1 minute"), $"aggCol1", $"aggCol2")    //<- 3
  .agg(count($"aggCol2").as("myAgg2Count"))
Line 1 executes a SQL string that comes from the configuration. I would like to move lines 2 & 3 into the SQL syntax so that the grouping and aggregations are specified in the configuration.
Does anyone out there know how to specify this in Spark SQL?
withWatermark does not have a corresponding SQL syntax; you have to use the DataFrame API for that part.
For aggregation, you can do something like
select count(aggcol2) as myAgg2Count
from xxx
group by window(eventTime, "1 minute"), aggCol1, aggCol2
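To make that concrete, here is a hedged PySpark sketch of the split (the question's fragment is Scala, but the same shape works with the Scala API): the watermark goes through the DataFrame API, and the windowed grouping/aggregation is plain SQL over a temp view. spark is the SparkSession and config.agg_select is a placeholder standing in for the question's configuration object.
# Watermark has no SQL equivalent, so apply it via the DataFrame API first.
selected = (
    spark.sql(config.agg_select)
         .withWatermark("eventTime", "15 minutes")
)
selected.createOrReplaceTempView("events")

# The grouping/aggregation part can now come from configuration as a SQL string.
val_stream = spark.sql("""
    SELECT window(eventTime, '1 minute') AS time_window,
           aggCol1,
           aggCol2,
           count(aggCol2) AS myAgg2Count
    FROM events
    GROUP BY window(eventTime, '1 minute'), aggCol1, aggCol2
""")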
We have a lot of sensors, like energy meters, and want to store the data using KairosDB. Previously we used a simple SQL store, where each sensor has its own table and each measurement is one row.
A measurement is published to the system by the sensor as a JSON object:
{
  "time": 12363453,
  "volt": 238.33,
  "ampere": 9.3,
  "watts": 29.0
}
So, for example for two sensors on two devices, we have this in our DB:
Sensor A1234:
id, time, volt, ampere, watts
12, <unix-ts>, 238.33, 9.3, 29.0
13, <unix-ts>, 238.21, 9.1, 23.8
...
Sensor B5678:
id, time, volt, ampere, watts
75, <unix-ts>, 230.12, 3.9, 19.5
76, <unix-ts>, 234.65, 2.8, 24.5
...
Now we're investigating the best way to store the same information using KairosDB instead.
Our goal is to answer some "questions" like:
- the latest volt, watt, ampere
- what was the average ampere between x (start timestamp) and y (end timestamp)
- what was the sum watts over all (or a subset) sensors between two dates
and so on.
So, what would be the best approach for choosing metric names and/or tags?
Should we use the sensor-names for the metric names (without tags):
sensors.energy.a1234.volt=238.33
sensors.energy.a1234.ampere=9.3
sensors.energy.a1234.watts=29.0
sensors.energy.b5678.volt=230.12
sensors.energy.b5678.ampere=3.9
sensors.energy.b5678.watts=19.5
Or should we use tags on the same metrics for all sensors:
sensors.energy.volt=238.33, tag: sensor=a1234
sensors.energy.ampere=9.3, tag: sensor=a1234
sensors.energy.watts=29.0, tag: sensor=a1234
sensors.energy.volt=230.12, tag: sensor=b5678
sensors.energy.ampere=3.9, tag: sensor=b5678
sensors.energy.watts=19.5, tag: sensor=b5678
Or are we going about this the wrong way entirely?
Is there a difference regarding the query performance?
It depends on your usage and on the tag cardinality.
If you need to group by sensor, it's useful to have it as a tag.
If you always query sensors individually, you get simpler queries by putting it in the metric name.
If the tag cardinality (the number of time series for a metric) is too high, it affects query performance (I would say at hundreds of thousands and above).
There are ways to mitigate this by using an indexing/search plugin such as the KairosDB Solr search (https://github.com/kairosdb/kairos-solr).
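To make the tag-based layout concrete, here is a minimal, untested sketch against KairosDB's REST API. It assumes KairosDB is reachable at http://localhost:8080; metric names and the sensor tag follow the second layout from the question.
import time
import requests

KAIROS = "http://localhost:8080"

# Write one measurement from sensor a1234 as three data points sharing a tag.
now_ms = int(time.time() * 1000)
datapoints = [
    {"name": "sensors.energy.volt",   "datapoints": [[now_ms, 238.33]], "tags": {"sensor": "a1234"}},
    {"name": "sensors.energy.ampere", "datapoints": [[now_ms, 9.3]],    "tags": {"sensor": "a1234"}},
    {"name": "sensors.energy.watts",  "datapoints": [[now_ms, 29.0]],   "tags": {"sensor": "a1234"}},
]
requests.post(f"{KAIROS}/api/v1/datapoints", json=datapoints).raise_for_status()

# Query: sum of watts over the last hour, grouped per sensor via the tag.
query = {
    "start_relative": {"value": 1, "unit": "hours"},
    "metrics": [{
        "name": "sensors.energy.watts",
        "group_by": [{"name": "tag", "tags": ["sensor"]}],
        "aggregators": [{"name": "sum", "sampling": {"value": 1, "unit": "hours"}}],
    }],
}
print(requests.post(f"{KAIROS}/api/v1/datapoints/query", json=query).json())
With the sensor as a tag, the "sum over all (or a subset of) sensors" queries stay within a single metric; with sensor names baked into the metric name you would have to query each metric separately and combine the results yourself.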
I hope this helps.
Does anyone have a nice, neat and stable way to achieve the equivalent of:
pandas.read_sql(sql, con, chunksize=None)
and/or
pandas.read_sql_table(table_name, con, schema=None, chunksize=None)
connected to Redshift with SQLAlchemy & psycopg2, directly into a Dask DataFrame?
The solution should be able to handle large amounts of data.
You might consider the read_sql_table function in dask.dataframe.
http://dask.pydata.org/en/latest/dataframe-api.html#dask.dataframe.read_sql_table
>>> df = dd.read_sql_table('accounts', 'sqlite:///path/to/bank.db',
... npartitions=10, index_col='id') # doctest: +SKIP
This relies on the pandas.read_sql_table function internally, so it operates under the same restrictions, except that you are now asked to provide a number of partitions and an index column.
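For Redshift specifically, a hedged sketch (host, credentials, table and index column below are placeholders; the redshift+psycopg2 URI needs the sqlalchemy-redshift package installed, though a plain postgresql+psycopg2 URI may also work):
import dask.dataframe as dd

uri = "redshift+psycopg2://user:password@my-cluster.example.com:5439/mydb"

df = dd.read_sql_table(
    "my_table",          # table to read
    uri,                 # SQLAlchemy connection string
    index_col="id",      # indexed (ideally numeric) column used to split the reads
    npartitions=50,      # each partition becomes one query against Redshift
)

print(df.head())         # only reads the first partition
Each partition is loaded lazily with pandas.read_sql under the hood, so the full table never has to fit in memory on one machine.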
I have a table in Redshift that contains some concatenated ids:
Product_id, options_id
1, 2
5, 5;9;7
52, 4;5;8;11
I want to split every row of my table like this:
Product_id, options_id
1, 2
5, 5
5, 9
5, 7
52, 4
52, 5
52, 8
52, 11
In the Redshift documentation I found a similar function, split_part, but with that function I must specify which part I want to get. For example, given:
Product_id, options_id
5, 5;9;7
split_part(options_id, ';', 2) will return 9.
Any help please?
Thanks.
So, the problem here is to take one row and split it into multiple rows. That's not too hard in PostgreSQL -- you could use the unnest() function.
However, Amazon Redshift does not implement every function available in PostgreSQL, and unnest() is unsupported.
While it is possible to write a User Defined Function in Redshift, the function can only return one value, not several rows.
A good option is to iterate through each part, extracting each in turn as a row. See the workaround in Error while using regexp_split_to_table (Amazon Redshift) for a clever implementation (but still something of a hack). This is a similar concept to Expanding JSON arrays to rows with SQL on RedShift.
The bottom line is that you can come up with some hacks that will work to a limited degree, but the best option is to clean the data before loading it into Amazon Redshift. At the moment, Redshift is optimized for extremely fast querying over massive amounts of data, but it is not fully-featured in terms of data manipulation. That will probably change in the future (just as User-Defined Functions were not originally available), but for now we have to work within its current functionality.
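To make that linked workaround concrete, here is a hedged sketch run from Python with psycopg2 (the connection parameters and the table name my_table are placeholders; the numbers CTE assumes no options_id has more than 10 parts):
import psycopg2

conn = psycopg2.connect("host=my-cluster.example.com port=5439 dbname=mydb user=me password=secret")

# Join a small numbers relation against the table and pick out the n-th part of
# options_id with SPLIT_PART; REGEXP_COUNT limits n to the number of parts per row.
sql = """
WITH numbers AS (
    SELECT 1 AS n UNION ALL SELECT 2 UNION ALL SELECT 3 UNION ALL SELECT 4 UNION ALL SELECT 5
    UNION ALL SELECT 6 UNION ALL SELECT 7 UNION ALL SELECT 8 UNION ALL SELECT 9 UNION ALL SELECT 10
)
SELECT t.product_id,
       SPLIT_PART(t.options_id, ';', numbers.n) AS option_id
FROM my_table t
JOIN numbers
  ON numbers.n <= REGEXP_COUNT(t.options_id, ';') + 1
ORDER BY t.product_id, numbers.n
"""

with conn, conn.cursor() as cur:
    cur.execute(sql)
    for product_id, option_id in cur.fetchall():
        print(product_id, option_id)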
Stealing from this answer Split column into multiple rows in Postgres
select product_id, p.option
from product_options po,
unnest(string_to_array(po.options_id, ';')) p(option)
sqlfiddle
I'm thinking of switching to Cassandra from my current SQL-esque solution (SimpleDB), mainly due to speed, cost and the built-in caching feature of Cassandra. However, I'm stuck on the idea of indexing. I've gathered that in Cassandra you have to manually create indexes in order to execute complex queries. But what if you have data like the following, a row with a simple supercolumn:
row1 {value1="5", value2="7", value3="9"}
And you need to execute dynamic queries like "give me all the rows with value1 between x and y and value2 between z and q", etc. Is this possible? Or if you have queries like this, is it a bad idea to use Cassandra?
Cassandra 0.7.x contains secondary indexes that let you run queries like the one above.
The following blog post describes the concept:
http://www.riptano.com/blog/whats-new-cassandra-07-secondary-indexes
Secondary indices were introduced in 0.7. However, to use an indexed_slice_query, you need at least one equals expression. For example, you can do value1 = x and value2 < y, but not two range expressions together.
See Cassandra API
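As a rough sketch of what that looks like with the Thrift-era Python client pycassa (keyspace, column family and host below are placeholders, and a secondary index is assumed to exist on value1):
import pycassa
from pycassa.index import create_index_expression, create_index_clause, GT

pool = pycassa.ConnectionPool('MyKeyspace', ['localhost:9160'])
cf = pycassa.ColumnFamily(pool, 'MyColumnFamily')

# At least one equality expression is required; additional expressions can be ranges.
eq_expr = create_index_expression('value1', '5')          # value1 = 5 (must hit the index)
range_expr = create_index_expression('value2', '7', GT)   # value2 > 7
clause = create_index_clause([eq_expr, range_expr], count=100)

for key, columns in cf.get_indexed_slices(clause):
    print("%s: %s" % (key, columns))
Two range expressions with no equality (value1 between x and y and value2 between z and q, as in the question) is exactly what this API will refuse.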