Integrate PMML with MongoDB

I have built a supervised learning model in R and exported the model/decision rules in PMML format. I was hoping I could link the PMML straightforwardly to MongoDB using something like the JPMML library (as JPMML integrates well with PostgreSQL).
However, it seems the only way to link MongoDB to my PMML XML file is to use Cascading Pattern through Hadoop. Since my dataset isn't large (<50 GB), I don't really need Hadoop.
Has anyone used PMML with MongoDB in a way that doesn't involve going down the Hadoop route? Many thanks.

Basically, you have two options here:
1. Convert the PMML file to something that you can execute inside MongoDB.
2. Deploy the PMML file "natively" to some outside service and connect MongoDB to it.
50 GB is still quite a lot of data, so option #1 is clearly preferable in terms of ease of setup and speed of execution. Is it possible to write a Java user-defined function (UDF) for MongoDB? If so, then it would be possible to run the JPMML library inside MongoDB. Otherwise, you might see whether it is possible to convert your PMML model to an SQL script. For example, the latest versions of the KNIME software (2.11.1 and newer) contain a "PMML to SQL" conversion node.
If you fall back to option #2, then the following technical article might provide some inspiration to you: Applying predictive models to database data: the REST web service approach.
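For a concrete picture of option #2 without Hadoop, here is a hedged Python sketch that pulls documents from MongoDB with pymongo and scores them against the exported PMML file. It assumes the third-party pypmml package (a JPMML-based service would fill the same role on the JVM), and all database, collection, and field names are illustrative placeholders.

```python
# Hedged sketch of option #2: score MongoDB documents against the exported
# PMML model in an outside process. Assumes the third-party 'pypmml' package
# (Model.load / predict); database, collection, and field names are
# illustrative placeholders.
from pymongo import MongoClient
from pypmml import Model

model = Model.load("decision_rules.pmml")                 # PMML exported from R
coll = MongoClient("mongodb://localhost:27017")["mydb"]["observations"]

for doc in coll.find():
    # Map document fields to the model's input fields; the names must match
    # the PMML data dictionary.
    features = {name: doc.get(name) for name in model.inputNames}
    result = model.predict(features)                      # e.g. {'predicted_y': ...}
    # Store the score back next to the source document.
    coll.update_one({"_id": doc["_id"]}, {"$set": {"score": result}})
```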

Related

How to get geospatial POINT using SparkSQL

I'm converting a process from PostgreSQL over to Databricks / Apache Spark.
The PostgreSQL process uses the following SQL function to get the point on a map from an X and Y value: ST_Transform(ST_SetSrid(ST_MakePoint(x, y), 4326), 3857)
Does anyone know how I can achieve this same logic in Spark SQL on Databricks?
To achieve this you need to use a library such as Apache Sedona, GeoMesa, or something similar. Sedona, for example, has the ST_Transform function, and it may have the rest as well.
The only thing you need to take care of is that, if you're using pure SQL on Databricks, you will need to:
install the Sedona libraries using an init script, so the libraries are there before Spark starts
set the Spark configuration parameters, as described in the following pull request
Update, June 2022: people at Databricks have developed the Mosaic library, which is heavily optimized for geospatial analysis on Databricks and is compatible with the standard ST_ functions.
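As a rough illustration of the Sedona route, here is a hedged PySpark sketch. The registration call and the ST_Transform signature vary between Sedona versions, and my_table, x, and y are placeholders for your own table and columns.

```python
# Hedged sketch of the Sedona route in PySpark. Assumes the apache-sedona
# package and its Spark jars are installed (e.g. via an init script on
# Databricks); the registration API and ST_Transform arity vary by version.
from pyspark.sql import SparkSession
from sedona.register import SedonaRegistrator

spark = SparkSession.builder.appName("geo-reproject").getOrCreate()
SedonaRegistrator.registerAll(spark)  # exposes ST_* functions to Spark SQL

# my_table, x, y are illustrative names; my_table must be a registered table
# or temp view. In Sedona, ST_Transform(geom, 'epsg:4326', 'epsg:3857')
# reprojects lon/lat coordinates to Web Mercator.
projected = spark.sql("""
    SELECT ST_Transform(ST_SetSRID(ST_Point(x, y), 4326),
                        'epsg:4326', 'epsg:3857') AS geom_3857
    FROM my_table
""")
projected.show()
```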

What's the fastest way to put RDF data (specifically DBPedia dumps) into Postgres?

I'm looking to put RDF data from DBPedia Turtle (.ttl) files into Postgres. I don't really care how the data is modelled in Postgres as long as it is a complete mapping (it would also be nice if there were sensible indexes); I just want to get the data into Postgres so that I can transform it with SQL from there.
I tried using this StackOverflow solution that leverages Python and sqlalchemy, but it seems to be much too slow (it would take days, if not longer, at the pace I observed on my machine).
I expected there might be some kind of ODBC/JDBC-level tool for this type of connection. I did the same thing with Neo4j in less than an hour using a plugin Neo4j provides.
Thanks to anyone who can provide help.
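For what it's worth, a hedged sketch of one approach: parse the Turtle with rdflib and bulk-load the triples with COPY instead of row-by-row INSERTs, which is where the sqlalchemy-based approach usually loses time. File, table, and connection details are illustrative, and a full DBPedia dump would still need to be split or streamed in chunks.

```python
# Hedged sketch: load a Turtle file into a plain (subject, predicate, object)
# table using COPY, avoiding per-row INSERT overhead. File, table, and
# connection details are illustrative placeholders.
import io
import psycopg2
from rdflib import Graph

conn = psycopg2.connect("dbname=dbpedia user=postgres")
cur = conn.cursor()
cur.execute("CREATE TABLE IF NOT EXISTS triples (subject text, predicate text, object text)")

g = Graph()
g.parse("dbpedia_chunk.ttl", format="turtle")  # rdflib holds the parsed graph in memory

buf = io.StringIO()
for s, p, o in g:
    # Tab-separated rows; neutralize characters COPY treats specially.
    fields = (str(t).replace("\\", "\\\\").replace("\t", " ").replace("\n", " ")
              for t in (s, p, o))
    buf.write("\t".join(fields) + "\n")
buf.seek(0)

cur.copy_from(buf, "triples", columns=("subject", "predicate", "object"))
cur.execute("CREATE INDEX IF NOT EXISTS idx_triples_sp ON triples (subject, predicate)")
conn.commit()
```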

MongoDB data pipeline to Redshift using Apache-Spark

As my employer makes the big jump to MongoDB, Redshift, and Spark, I am trying to be proactive and get hands-on with each of these technologies. Could you please refer me to any resources that would be helpful in performing this task:
"Creating a data pipeline using Apache Spark to move data from MongoDB to RedShift"
So far I have been able to download a dev version of MongoDB and create a test Redshift instance. How do I go about setting up the rest of the process and getting my feet wet?
I understand that to create the data pipeline using Apache Spark, one has to code in either Scala, Python, or Java. I have a solid understanding of SQL, so feel free to suggest which of Scala, Python, or Java would be easiest for me to learn.
My background is in data warehousing and traditional ETL (Informatica, DataStage, etc.).
Thank you in advance :)
A really good approach may be to use AWS Database Migration Service (DMS):
http://docs.aws.amazon.com/dms/latest/userguide/CHAP_Source.MongoDB.html
You can specify MongoDB as the source endpoint and Redshift as the target endpoint.
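If you do want to get your feet wet with the Spark route from the question, here is a hedged PySpark sketch. It assumes the MongoDB Spark connector and a Redshift JDBC driver are on the classpath; the connector's format name and config keys differ between connector versions, and all URIs, table names, and credentials below are placeholders.

```python
# Hedged sketch: read a MongoDB collection with the MongoDB Spark connector
# and append it to a Redshift table over JDBC. Format name and config keys
# vary by connector version ("mongo" vs "mongodb"); all names are placeholders.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("mongo-to-redshift")
         .config("spark.mongodb.input.uri", "mongodb://localhost:27017/mydb.mycollection")
         .getOrCreate())

# Read the collection into a DataFrame (the connector infers the schema).
df = spark.read.format("mongo").load()

# Optional: flatten/clean with Spark SQL before loading.
df.createOrReplaceTempView("src")
clean = spark.sql("SELECT * FROM src")  # put your SQL transformations here

# Write to Redshift via JDBC; for large volumes the spark-redshift connector
# with an S3 tempdir is usually preferred over plain JDBC.
(clean.write.format("jdbc")
 .option("url", "jdbc:redshift://my-cluster.example.com:5439/dev")
 .option("dbtable", "public.my_table")
 .option("user", "my_user")
 .option("password", "my_password")
 .mode("append")
 .save())
```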

Data virtualization with a SQL Server DB using MarkLogic

I would like to use data from a SQL Server database in MarkLogic without moving it physically. I have read about data virtualization in MarkLogic but cannot find any example or documentation explaining how to go about it. Please point me to any reference that may help me.
I have already tried reading data using MLSAM. Is this the only way, and is this virtualization?
MarkLogic introduced the concept of Views to allow data visualization tools to connect to MarkLogic through ODBC, executing SQL against MarkLogic. These Views are fed from XML content within MarkLogic through range indexes. So I think that is the other way around from what you are looking for. In general, MarkLogic needs the data inside its own databases in order to index it.
MLSAM can be a way to pull such data in, executing SQL statements from within XQuery against external sources (unlike xdmp:sql, which runs against the Views inside MarkLogic). Tools like RecordLoader, XQsync, and XMLSh might be worth looking at as well. See
http://developer.marklogic.com/code
HTH!

HBase and Elasticsearch integration like the MongoDB river

I am fairly new to both Elasticsearch and HBase, but for a research project I would like to combine the two. My research project mainly involves searching through a large collection of documents (doc, pdf, msg, etc.) and extracting named entities from them through MapReduce jobs running on the documents stored in HBase.
Does anyone know if there is something similar to the MongoDB river plugin for HBase? Or can anyone point me to some documentation about integrating Elasticsearch and HBase? I have looked on the internet for documentation, but unfortunately without any luck.
Kind regards,
Martijn
I don't know of any Elasticsearch-HBase integrations, but there are a few Solr and HBase integrations that you can use, like Lily and SolBase.
Tell me what you think about this: https://github.com/posix4e/Elasticsearch-HBase-River. It uses HBase log shipping to reliably handle updates and deletes from HBase into an Elasticsearch cluster. It could easily be extended to do n-regionserver-to-m-Elasticsearch-server replication.
You can use the Phoenix JDBC driver + the Elasticsearch JDBC river, as shown here: http://lessc0de.github.io/connecting_hbase_to_elasticsearch.html
I don't know of any packaged solutions, but as long as your MapReduce job preps the data in the right way, it should be fairly easy to write a simple batch job in the programming language of your choice that reads from HBase and submits to Elasticsearch, along the lines of the sketch below.
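For example, a hedged Python sketch of such a batch job, assuming the happybase and elasticsearch client libraries plus a running HBase Thrift server; table, column family, index, and host names are illustrative.

```python
# Hedged sketch: scan an HBase table over Thrift with happybase and bulk-index
# the rows into Elasticsearch. Table, column family, index, and host names are
# illustrative placeholders.
import happybase
from elasticsearch import Elasticsearch
from elasticsearch.helpers import bulk

hbase = happybase.Connection("hbase-thrift-host")   # requires the HBase Thrift server
table = hbase.table("documents")
es = Elasticsearch(["http://es-host:9200"])

def actions():
    # scan() yields (row_key, {b"cf:qualifier": value, ...}) pairs.
    for row_key, data in table.scan(columns=[b"meta", b"entities"]):
        yield {
            "_index": "documents",
            "_id": row_key.decode("utf-8"),
            "_source": {
                key.decode("utf-8").replace(":", "_"): value.decode("utf-8", "replace")
                for key, value in data.items()
            },
        }

bulk(es, actions())  # sends the documents in batched bulk requests
```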
Take a look at this page (three years later): http://lessc0de.github.io/connecting_hbase_to_elasticsearch.html