Apache Beam Python SDK - apache-beam

So I have a Python file with code that calls a REST API to extract data from a URL and load it into a SQL database. The code uses Python packages such as graphql to extract the data and SQLAlchemy to insert the data into the database. I'm trying to integrate this code into Beam, but I have no clue how to do so. Do I have to generate the data first and then use the CSV output for my pipeline, or can I just put all of this into a Beam pipeline and extract the CSV by executing the Apache Beam code? Any help is extremely appreciated, thank you for reading.
I am not going to share any code; I'm just here to understand how to tackle this problem so that I can look for solutions myself!

Related

How to get geospatial POINT using SparkSQL

I'm converting a process from PostgreSQL over to Databricks Apache Spark.
The PostgreSQL process uses the following SQL expression to get the point on a map from an X and Y value: ST_Transform(ST_SetSrid(ST_MakePoint(x, y),4326),3857)
Does anyone know how I can achieve this same logic in SparkSQL on Databricks?
To achieve this you need to use a library such as Apache Sedona or GeoMesa. Sedona, for example, has an ST_Transform function, and may have the rest as well.
The only thing you need to take care of is that if you're using pure SQL on Databricks, you will need to:
install the Sedona libraries using an init script, so the libraries are in place before Spark starts
set Spark configuration parameters, as described in the following pull request
Update, June 2022: people at Databricks have developed the Mosaic library, which is heavily optimized for geospatial analysis on Databricks and is compatible with the standard ST_ functions.
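For intuition about what that 4326-to-3857 step actually computes: it is the spherical Web Mercator projection, which is simple enough to sketch in plain Python. This is only an illustration of the math; in practice the Sedona/GeoMesa/Mosaic functions above handle SRIDs and datum details for you.

```python
import math

R = 6378137.0  # WGS84 equatorial radius in metres, as used by EPSG:3857


def lonlat_to_webmercator(lon, lat):
    """Project EPSG:4326 coordinates (lon/lat in degrees) to EPSG:3857 metres."""
    x = R * math.radians(lon)
    y = R * math.log(math.tan(math.pi / 4 + math.radians(lat) / 2))
    return x, y
```

The origin maps to the origin, and longitude 180° maps to roughly 20,037,508 m, the familiar Web Mercator world half-width.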

MongoDB data pipeline to Redshift using Apache-Spark

As my employer makes the big jump to MongoDB, Redshift and Spark, I am trying to be proactive and get hands-on with each of these technologies. Could you please refer me to any resources that will be helpful in performing this task:
"Creating a data pipeline using Apache Spark to move data from MongoDB to Redshift"
So far, I have been able to download a dev version of MongoDB and create a test Redshift instance. How do I go about setting up the rest of the process and getting my feet wet?
I understand that to create the data pipeline using Apache Spark, one has to code in Scala, Python or Java. I have a solid understanding of SQL, so feel free to suggest which of Scala, Python or Java would be easiest for me to learn.
My background is in data warehousing and traditional ETL (Informatica, DataStage etc.).
Thank you in advance :)
A really good approach may be to use AWS Database Migration Service (DMS):
http://docs.aws.amazon.com/dms/latest/userguide/CHAP_Source.MongoDB.html
You can specify MongoDB as a source endpoint and Redshift as the target endpoint.
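Whichever route you take (DMS or a hand-built Spark job), the core transformation is the same: MongoDB documents are nested, while Redshift tables are flat, so documents must be flattened into rows with column-safe names. A minimal Python sketch of that step, with an invented example document:

```python
def flatten(doc, parent_key="", sep="_"):
    """Flatten a nested MongoDB-style document into a flat dict whose
    keys can serve as Redshift column names."""
    items = {}
    for key, value in doc.items():
        new_key = f"{parent_key}{sep}{key}" if parent_key else key
        if isinstance(value, dict):
            # Recurse into sub-documents, prefixing keys with the parent path.
            items.update(flatten(value, new_key, sep))
        else:
            items[new_key] = value
    return items


doc = {"_id": "abc", "user": {"name": "Ada", "address": {"city": "London"}}}
# flatten(doc) -> {"_id": "abc", "user_name": "Ada", "user_address_city": "London"}
```

In a Spark job this would run per-document inside a map step; given the SQL background mentioned above, Python (PySpark) is usually the gentlest of the three language options.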

import data from Postgres to Cassandra

I need to import data from Postgres to Cassandra using open source technologies only.
Can anyone please outline the steps I need to take.
As per instructions, I have to refrain from using DataStax software, as it comes with a license.
Steps I have already tried:
Export one table from Postgres in CSV format and import it into HDFS (using Sqoop). {If I take this approach, do I need to use MapReduce after this?}
Tried to import the CSV file into Cassandra using cqlsh; however, I got this error:
Cassandra: Unable to import null value from csv
I am trying several methods, but am unable to find a solid approach of attack.
Can any of you please provide the steps required for the whole process? I believe there are many people who have already done this.
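A hedged note on that null-value error: it typically means the exported CSV contains empty fields, which cqlsh's COPY command does not treat as nulls by default. One common workaround is to pre-clean the file, replacing empty fields with an explicit token, and then tell COPY about it (e.g. `COPY mytable FROM 'out.csv' WITH NULL='NULL';` — table name and token here are placeholders). A small Python sketch of the pre-cleaning step:

```python
import csv


def clean_csv(src_path, dst_path, null_token="NULL"):
    """Rewrite a CSV file, replacing empty fields with an explicit token
    so cqlsh's COPY ... WITH NULL='NULL' can recognize them as nulls."""
    with open(src_path, newline="") as src, open(dst_path, "w", newline="") as dst:
        writer = csv.writer(dst)
        for row in csv.reader(src):
            writer.writerow([field if field != "" else null_token for field in row])
```

For a one-table Postgres-to-Cassandra copy, a plain export/clean/COPY sequence like this is enough; Sqoop, HDFS and MapReduce are only needed if you actually want the data staged in Hadoop.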

How to do an ultra-fast batch import into OrientDB from CSV files?

We are evaluating graph databases to store our networked communication data and have zeroed in on Neo4j and OrientDB.
Is there a batch importer tool or script for OrientDB similar to what Neo4j has? I was able to import a CSV file with 150M relationships and 18M nodes in under 25 minutes with Neo4j. Reading the documentation on the OrientDB site, it looks like I need to use the ETL feature, modifying a JSON file to be able to do the import. Is there no other simpler and faster way to do the import from CSV files?
Using OrientDB ETL is pretty easy. Look at: http://orientdb.com/docs/last/Import-from-CSV-to-a-Graph.html. Just create your JSON with the ETL steps and it's done.
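For a rough idea of what that JSON looks like (see the linked page for the authoritative format; the file path, class name and database URL below are placeholders):

```json
{
  "source": { "file": { "path": "/tmp/nodes.csv" } },
  "extractor": { "csv": {} },
  "transformers": [
    { "vertex": { "class": "Node" } }
  ],
  "loader": {
    "orientdb": {
      "dbURL": "plocal:/tmp/networkdb",
      "dbType": "graph",
      "classes": [ { "name": "Node", "extends": "V" } ]
    }
  }
}
```

You then run it with the `oetl` script shipped with OrientDB, passing this file as the argument.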

Integrate PMML to MongoDB

I have built a supervised learning model in R and exported the model/decision rules in PMML format. I was hoping I could link the PMML straightforwardly to MongoDB using something like the JPMML library (as JPMML integrates well with PostgreSQL).
However, it seems the only way to link MongoDB to my PMML XML file is to use Cascading Pattern through Hadoop. Since my dataset isn't large (<50 GB), I don't really need Hadoop.
Has anyone used PMML with MongoDB before in a way that doesn't involve going down the Hadoop route? Many thanks.
Basically, you have two options here:
Convert the PMML file to something that you can execute inside MongoDB.
Deploy the PMML file "natively" to some outside service and connect MongoDB to it.
50 GB is still quite a lot of data, so option #1 is clearly preferable in terms of the ease of setup and the speed of execution. Is it possible to write a Java user-defined function (UDF) for MongoDB? If so, then it would be possible to run the JPMML library inside MongoDB. Otherwise, you might see if it would be possible to convert your PMML model to SQL script. For example, the latest versions of KNIME software (2.11.1 and newer) contain a "PMML to SQL" conversion node.
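To make option #1 concrete: converting the PMML means translating its decision rules into something executable close to the data. As a toy illustration only (the features, thresholds and labels below are invented, not taken from the asker's model), a PMML TreeModel boils down to nested conditionals like this:

```python
def score(record):
    """Toy decision tree of the kind a PMML TreeModel encodes.
    Features, thresholds and labels are invented for illustration."""
    if record["age"] < 30:
        return "low_risk" if record["income"] > 50000 else "medium_risk"
    return "high_risk" if record["income"] < 20000 else "low_risk"
```

A function like this can be applied per document in a MongoDB client-side loop, or mechanically translated into a SQL CASE expression, which is essentially what the KNIME "PMML to SQL" node mentioned above produces.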
If you fall back to option #2, then the following technical article might provide some inspiration to you: Applying predictive models to database data: the REST web service approach.