How to get geospatial POINT using SparkSQL - postgresql

I'm converting a process from PostgreSQL over to Apache Spark on Databricks.
The PostgreSQL process uses the following SQL expression to get the point on a map from an X and a Y value: ST_Transform(ST_SetSrid(ST_MakePoint(x, y), 4326), 3857)
Does anyone know how I can achieve the same logic in Spark SQL on Databricks?

To achieve this you need to use a library such as Apache Sedona, GeoMesa, or something else. Sedona, for example, has an ST_Transform function, and it may well have the rest as well.
The only thing you need to take care of, if you're using pure SQL, is that on Databricks you will need to:
1) install the Sedona libraries using an init script, so the libraries are there before Spark starts;
2) set the Spark configuration parameters, as described in the following pull request.
Update, June 2022: people at Databricks have developed the Mosaic library, which is heavily optimized for geospatial analysis on Databricks and is compatible with the standard ST_ functions.
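As an illustration, here is a minimal PySpark sketch of the same expression using Apache Sedona. It assumes the Sedona jars and Python package are already installed on the cluster; the registration call and the exact ST_Transform signature vary between Sedona versions, and the sample coordinates, view name and column names are made up.

from pyspark.sql import SparkSession
from sedona.register import SedonaRegistrator  # older releases; newer ones expose sedona.spark.SedonaContext instead

spark = SparkSession.builder.getOrCreate()
SedonaRegistrator.registerAll(spark)  # registers the ST_ functions for Spark SQL

df = spark.createDataFrame([(2.3522, 48.8566)], ["x", "y"])
df.createOrReplaceTempView("coords")

# Roughly equivalent to ST_Transform(ST_SetSrid(ST_MakePoint(x, y), 4326), 3857) in PostGIS
spark.sql("""
    SELECT ST_Transform(ST_SetSRID(ST_Point(x, y), 4326), 'epsg:4326', 'epsg:3857') AS geom
    FROM coords
""").show(truncate=False)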

Related

REGEXP_EXTRACT unknown function on Tableau 10.0.1

I'm using the REGEXP_EXTRACT function in Tableau, trying to extract numbers from a string. My line of code:
INT(REGEXP_EXTRACT([Name], '([0-9]+)'))
My colleague can use it and I can't; I'm getting an "unknown function" error. We are both using the same version of Tableau, 10.0.1, yet on mine the function is for some reason unknown. Do I need to install some drivers or something to get it working? By the way, none of the REGEXP_ functions work on my machine.
As you are using Amazon Redshift, you cannot use Tableau's built-in regex functions. This is because the regex functions are not currently supported in Tableau for Redshift; find out more here.
In order to get around this you can:
1) create a Tableau data extract from your Redshift data source and schedule it to refresh as required, or
2) access them via the raw SQL pass-through functions (see here).

MongoDB data pipeline to Redshift using Apache-Spark

As my employer makes the big jump to MongoDB, Redshift and Spark, I am trying to be proactive and get hands-on with each of these technologies. Could you please refer me to any resources that will be helpful in performing this task:
"Creating a data pipeline using Apache Spark to move data from MongoDB to Redshift"
So far I have been able to download a dev version of MongoDB and create a test Redshift instance. How do I go about setting up the rest of the process and getting my feet wet?
I understand that to create the data pipeline using Apache Spark, one has to code in Scala, Python or Java. I have a solid understanding of SQL, so feel free to suggest which of Scala, Python or Java would be easiest for me to learn.
My background is in data warehousing, traditional ETL (Informatica, Datastage etc.).
Thank you in advance :)
A really good approach may be to use AWS Database Migration Service (DMS):
http://docs.aws.amazon.com/dms/latest/userguide/CHAP_Source.MongoDB.html
You can specify MongoDB as the source endpoint and Redshift as the target endpoint.
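If you do want to use Spark for the pipeline itself, a rough PySpark sketch could look like the one below. It assumes the MongoDB Spark connector and a Redshift JDBC driver are on the classpath; the format name, connection URIs, table names and columns are hypothetical and depend on your connector versions.

from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("mongo-to-redshift")
         .config("spark.mongodb.read.connection.uri", "mongodb://dev-host/mydb.orders")  # key used by connector 10.x
         .getOrCreate())

# Read the MongoDB collection into a DataFrame (the format name varies by connector version).
orders = spark.read.format("mongodb").load()

# Light ETL step: pick and rename the columns Redshift should receive.
cleaned = orders.selectExpr("_id AS order_id", "customer_id", "amount", "created_at")

# Write to Redshift over plain JDBC (an S3-staging connector is usually faster for large loads).
(cleaned.write
    .format("jdbc")
    .option("url", "jdbc:redshift://test-cluster.example.us-east-1.redshift.amazonaws.com:5439/dev")
    .option("dbtable", "public.orders")
    .option("user", "awsuser")
    .option("password", "...")
    .option("driver", "com.amazon.redshift.jdbc42.Driver")
    .mode("append")
    .save())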

From PostgreSQL to Cassandra - Aggregation functions not supported

I need your advice, please. I have an application that runs on PostgreSQL but takes too long to bring back data.
I would like to use Cassandra, but I noticed that CQL does not support aggregation.
Would that be possible with Hadoop, or am I going completely the wrong way?
Also, all the dates are stored as epoch timestamps, and CQL can't convert them.
What would be the best approach for converting an application that runs on PostgreSQL to Cassandra?
Thank you for any suggestions.
Cassandra introduced aggregate functions in 2.2 with CASSANDRA-4914. The documentation for using the standard (built in) functions is here and for creating custom aggregate functions is here.
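As an illustration, here is a minimal Python sketch using the built-in aggregates through the DataStax cassandra-driver package; it assumes Cassandra 2.2 or later, and the keyspace, table and column names are made up.

from cassandra.cluster import Cluster

cluster = Cluster(["127.0.0.1"])
session = cluster.connect("my_keyspace")

# count, min, max, sum and avg are built in from 2.2 onwards.
# Restricting by partition key keeps the aggregate from scanning the whole table.
row = session.execute(
    "SELECT count(*), min(amount), avg(amount) FROM orders WHERE customer_id = %s",
    ["c-123"],
).one()
print(row)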

Using Apache Spark as a backend for web application [closed]

We have terabytes of data stored in HDFS, comprising customer data and behavioral information. Business analysts want to perform slicing and dicing of this data using filters.
These filters are similar to Spark RDD filters. Some examples of the filter are:
age > 18 and age < 35, date between 10-02-2015 and 20-02-2015, gender = male, country in (UK, US, India), etc. We want to integrate this filter functionality into our JSF (or Play) based web application.
Analysts would like to experiment by applying/removing filters, and verifying if the count of the final filtered data is as desired. This is a repeated exercise, and the maximum number of people using this web application could be around 100.
We are planning to use Scala as a programming language for implementing the filters. The web application would initialize a single SparkContext at the load of the server, and every filter would reuse the same SparkContext.
Is Spark good for this use case of interactive querying through a web application? Also, is the idea of sharing a single SparkContext a work-around approach? The other alternative we have is Apache Hive with the Tez engine, using the ORC compressed file format and querying via JDBC/Thrift. Is this option better than Spark for the given job?
It's not the best use case for Spark, but it is completely possible. The latency can be high, though.
You might want to check out Spark Jobserver, which should offer most of your required features. You can also get an SQL view over your data using Spark's JDBC Thrift server.
In general I'd advise using Spark SQL for this; it already handles a lot of the things you might be interested in.
Another option would be to use Databricks Cloud, but it's not publicly available yet.
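To make that concrete, here is a rough sketch of a web endpoint that reuses one SparkSession for every request. It assumes PySpark and Flask are installed; the HDFS path, column names and filter syntax are hypothetical, and in a real application the predicate would have to be validated rather than passed through verbatim.

from flask import Flask, request, jsonify
from pyspark.sql import SparkSession

app = Flask(__name__)

# One SparkSession per web-server process, created at startup and reused by every request.
spark = SparkSession.builder.appName("analyst-filters").getOrCreate()
spark.read.parquet("hdfs:///data/customers").createOrReplaceTempView("customers")
spark.sql("CACHE TABLE customers")  # keep the working set in memory to lower latency

@app.route("/count")
def count():
    # e.g. GET /count?where=age > 18 AND age < 35 AND country IN ('UK','US','India')
    predicate = request.args.get("where", "1=1")
    n = spark.sql(f"SELECT count(*) AS n FROM customers WHERE {predicate}").first()["n"]
    return jsonify(count=n)

if __name__ == "__main__":
    app.run(port=8080)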
Analysts would like to experiment by applying/removing filters, and verifying if the count of the final filtered data is as desired. This is a repeated exercise, and the maximum number of people using this web application could be around 100.
Apache Zeppelin provides a framework for interactively ingesting and visualizing data (via web application) using apache spark as the back end. Here is a video demonstrating the features.
Also, the idea of sharing a single SparkContext, is this a work-around approach?
It looks like that project uses a single SparkContext for low-latency query jobs.
I'd like to know which solution you chose in the end.
I have two suggestions:
Following the Zeppelin idea from @quickinsights, there is also the interactive notebook Jupyter, which is well established now. It was primarily designed for Python, but specialized kernels can be installed; I tried using Apache Toree a couple of months ago. The basic installation is simple:
pip install jupyter
pip install toree
jupyter toree install
but at the time I had to do a couple of low-level tweaks to make it work (such as editing /usr/local/share/jupyter/kernels/toree/kernel.json). But it worked, and I could use a Spark cluster from a Scala notebook. Check this tutorial; it fits what I remember.
Most (all?) docs on Spark talk about running apps with spark-submit or using spark-shell for interactive usage (sorry, but the Spark/Scala shell is so disappointing...). They never talk about using Spark in an interactive app, such as a web app. It is possible (I tried), but there are indeed some issues to check, such as sharing the SparkContext as you mentioned, and also some issues about managing dependencies. You can check the two getting-started prototypes I made for using Spark in a Spring web app. They are in Java, but I would strongly recommend using Scala. I did not work with this long enough to learn a lot, but I can say that it is possible, and it works well (tried on a 12-node cluster with the app running on an edge node).
Just remember that the Spark driver, i.e. where the code with the RDDs runs, should be physically on the same cluster as the Spark nodes: there is a lot of communication between the driver and the workers.
Apache Livy enables programmatic, fault-tolerant, multi-tenant submission of Spark jobs from web/mobile apps (no Spark client needed). So, multiple users can interact with your Spark cluster concurrently.
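For a flavour of what that looks like, here is a small Python sketch against Livy's REST API. The Livy host, HDFS path and filter are hypothetical; the /sessions and /statements endpoints are the ones Livy documents for interactive use.

import time
import requests

LIVY = "http://livy-host:8998"

# Open a shared PySpark session on the cluster.
session = requests.post(f"{LIVY}/sessions", json={"kind": "pyspark"}).json()
session_url = f"{LIVY}/sessions/{session['id']}"

# Wait until the session is idle before sending work to it.
while requests.get(session_url).json()["state"] != "idle":
    time.sleep(2)

# Submit a filter-and-count statement, then poll for its result.
code = "df = spark.read.parquet('hdfs:///data/customers'); print(df.filter('age > 18 AND age < 35').count())"
stmt = requests.post(f"{session_url}/statements", json={"code": code}).json()
while True:
    result = requests.get(f"{session_url}/statements/{stmt['id']}").json()
    if result["state"] == "available":
        print(result["output"])
        break
    time.sleep(2)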
We had a similar problem at our company. We have ~2-2.5 TB of data in the form of logs and had some basic analytics to do on that data.
We used the following:
Apache Flink for streaming data from the source to HDFS via Hive.
Zeppelin configured on top of HDFS.
A SQL interface for joins, and a JDBC connection to HDFS via Hive.
Spark for offline batch processing of the data.
You can use Flink + Hive-HDFS.
Filters can be applied via SQL (yes, everything is supported in the latest releases).
Zeppelin can automate the task of report generation, and it has a nice filtering feature that works without actually modifying the SQL queries, using the ${sql-variable} feature.
Check it out. I am sure you'll find your answer:)
Thanks.

Integrate PMML to MongoDB

I have built a supervised learning model in R and exported the model/decision rules in PMML format. I was hoping I could link the PMML straightforwardly to MongoDB using something like the JPMML library (as JPMML integrates well with PostgreSQL).
However, it seems the only way to link MongoDB to my PMML XML file is to use Cascading Pattern through Hadoop. Since my dataset isn't large (<50 GB), I don't really need Hadoop.
Has anyone used PMML with MongoDB before in a way that doesn't involve going down the Hadoop route? Many thanks.
Basically, you have two options here:
Convert the PMML file to something that you can execute inside MongoDB.
Deploy the PMML file "natively" to some outside service and connect MongoDB to it.
50 GB is still quite a lot of data, so option #1 is clearly preferable in terms of the ease of setup and the speed of execution. Is it possible to write a Java user-defined function (UDF) for MongoDB? If so, then it would be possible to run the JPMML library inside MongoDB. Otherwise, you might see if it would be possible to convert your PMML model to SQL script. For example, the latest versions of KNIME software (2.11.1 and newer) contain a "PMML to SQL" conversion node.
If you fall back to option #2, then the following technical article might provide some inspiration to you: Applying predictive models to database data: the REST web service approach.
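For the second option, a small batch scorer that runs outside the database is often enough. The sketch below pairs pymongo with the pypmml scoring package (a PMML evaluator for Python, not JPMML itself); the connection string, collection, field names and output handling are hypothetical.

from pymongo import MongoClient
from pypmml import Model

model = Model.load("decision_rules.pmml")  # the PMML file exported from R
coll = MongoClient("mongodb://localhost:27017")["mydb"]["customers"]

# Pull only the fields the model needs, score each document, and write the result back.
for doc in coll.find({}, {"_id": 1, "age": 1, "income": 1}):
    features = {"age": doc.get("age"), "income": doc.get("income")}
    score = model.predict(features)  # returns a dict of the model's output fields
    coll.update_one({"_id": doc["_id"]}, {"$set": {"score": score}})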