I want to integrate an OrientDB database with Spark so that I can fetch records and run queries through Spark.
Can someone advise me?
Take a look at the OrientDB-Spark plugin: https://github.com/orientechnologies/spark-orientdb.
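In case a concrete starting point helps, here is a minimal PySpark sketch of reading an OrientDB class through that connector. The data source name and option keys below are assumptions to verify against the plugin's README for the version you install.

```python
# Sketch: load an OrientDB document class into a Spark DataFrame via the
# spark-orientdb connector. Format name and option keys are assumptions to
# check against the plugin's README; the connector jar must be on the classpath.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("orientdb-spark-example").getOrCreate()

df = (spark.read
      .format("org.apache.spark.orientdb.documents")  # assumed data source name
      .option("dburl", "remote:localhost/mydb")        # assumed option keys
      .option("user", "admin")
      .option("password", "admin")
      .option("class", "Person")                       # OrientDB class to load
      .load())

df.createOrReplaceTempView("person")
spark.sql("SELECT name FROM person WHERE age > 30").show()
```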
So I have a Python file with code that calls a REST API to extract data from a URL and load it into a SQL database. The code uses Python packages such as a GraphQL client to extract the data and SQLAlchemy to insert the data into the database. I'm trying to integrate this code into the Beam API, but I have no clue how to do so. Do I have to generate the data first and then use the CSV output for my pipeline, or can I just put all of this into a Beam pipeline and extract the CSV by executing the Apache Beam code? Any help is extremely appreciated, thank you for reading.
I am not going to share any code; I'm just here to understand how to tackle this problem so that I can look for solutions myself!
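For what it's worth, a minimal sketch of the second option, running the extraction and load steps inside the pipeline itself, could look like the following. `fetch_records` and `WriteToSql` are hypothetical stand-ins for the existing GraphQL and SQLAlchemy code, and the seed URL is a placeholder.

```python
# Sketch: run the API extraction and the database load inside a Beam pipeline,
# and also emit the records as CSV. fetch_records()/WriteToSql are placeholders
# for the existing GraphQL + SQLAlchemy logic.
import apache_beam as beam


def fetch_records(url):
    """Stand-in for the existing GraphQL/REST extraction code."""
    return [{"id": 1, "value": "a"}, {"id": 2, "value": "b"}]


class WriteToSql(beam.DoFn):
    """Stand-in DoFn that would hand each record to SQLAlchemy."""
    def process(self, record):
        # session.add(MyModel(**record)); session.commit()  # existing load code
        yield record  # pass records through so they can also be written as CSV


with beam.Pipeline() as p:
    (p
     | "Seed" >> beam.Create(["https://example.com/graphql"])  # placeholder URL
     | "Fetch" >> beam.FlatMap(fetch_records)                  # extraction runs inside the pipeline
     | "Load" >> beam.ParDo(WriteToSql())
     | "ToCsv" >> beam.Map(lambda r: f'{r["id"]},{r["value"]}')
     | "Write" >> beam.io.WriteToText("output", file_name_suffix=".csv"))
```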
I'm currently developing a Glue ETL script in PySpark that needs to query my Glue Data Catalog's partitions and join that information with other Glue tables programmatically.
At the moment, I'm able to do this in Athena using SELECT * FROM db_name.table_name$partitions JOIN table_name2 ON ..., but it looks like this doesn't work with Spark SQL. The closest thing I've been able to find is SHOW PARTITIONS db_name.table_name, which doesn't seem to cut it.
Does anyone know an easy way I can leverage Glue ETL / Boto3 (Glue API) / PySpark to query my partition information in a SQL-like manner?
For the time being, the only workaround seems to be the get_partitions() method in Boto3, but that looks like considerably more complex work on my end. I already have my Athena queries to get the information I need, so if there's a way to replicate getting my tables' partitions in a similar way using SQL, that'd be amazing. Please let me know, thank you!
For those interested, an alternative workaround I've been able to find but still need to test is the Athena API via the Boto3 client. I may also use AWS Wrangler's Athena integration to retrieve a DataFrame.
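If anyone ends up going the Boto3 route, here is a rough sketch of pulling the partition values from the Glue Data Catalog and joining them to another table in Spark SQL. Database, table, and partition key names are placeholders.

```python
# Sketch: list partitions via the Glue API (Boto3), turn them into a Spark
# DataFrame, and join them with other Glue tables in Spark SQL.
import boto3
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
glue = boto3.client("glue")

rows = []
paginator = glue.get_paginator("get_partitions")
for page in paginator.paginate(DatabaseName="db_name", TableName="table_name"):
    for part in page["Partitions"]:
        # Values are ordered like the table's partition keys, e.g. [year, month, day]
        rows.append(tuple(part["Values"]))

partitions_df = spark.createDataFrame(rows, ["year", "month", "day"])  # assumed partition keys
partitions_df.createOrReplaceTempView("table_name_partitions")

spark.sql("""
    SELECT p.*, t2.some_column
    FROM table_name_partitions p
    JOIN db_name.table_name2 t2
      ON p.year = t2.year AND p.month = t2.month
""").show()
```

The AWS Wrangler route mentioned above would be roughly `awswrangler.athena.read_sql_query(sql, database="db_name")`, which runs the existing Athena query and returns a pandas DataFrame.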
As my employer makes the big jump to MongoDB, Redshift, and Spark, I am trying to be proactive and get hands-on with each of these technologies. Could you please refer me to any resources that would be helpful in performing this task:
"Creating a data pipeline using Apache Spark to move data from MongoDB to RedShift"
So far, I have been able to download a dev version of MongoDB and create a test Redshift instance. How do I go about setting up the rest of the process and getting my feet wet?
I understand that to create the data pipeline using Apache Spark, one has to code in Scala, Python, or Java. I have a solid understanding of SQL, so feel free to suggest which of Scala, Python, or Java would be easiest for me to learn.
My background is in data warehousing and traditional ETL (Informatica, DataStage, etc.).
Thank you in advance :)
A really good approach may be to use AWS Database Migration Service (DMS):
http://docs.aws.amazon.com/dms/latest/userguide/CHAP_Source.MongoDB.html
You can specify MongoDB as the source endpoint and Redshift as the target endpoint.
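If you do want to prototype the Spark pipeline described in the question instead, a minimal PySpark sketch could look like the following. The connector format name, connection URIs, and JDBC URL are assumptions to verify against the MongoDB Spark connector and Redshift JDBC driver documentation for the versions you use.

```python
# Sketch: read a MongoDB collection with the MongoDB Spark connector and write
# it to Redshift over JDBC. Connector/driver jars must be on the classpath;
# URIs, table names, and credentials below are placeholders.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("mongo-to-redshift")
         .config("spark.mongodb.input.uri", "mongodb://localhost:27017/mydb.mycollection")
         .getOrCreate())

# Read the MongoDB collection into a DataFrame
df = spark.read.format("com.mongodb.spark.sql.DefaultSource").load()

# SQL-style transformations, which should feel familiar from a warehousing background
df.createOrReplaceTempView("src")
out = spark.sql("SELECT name, amount FROM src WHERE amount > 0")

# Write to Redshift over JDBC (fine for small volumes; for bulk loads the usual
# pattern is staging files to S3 and running a Redshift COPY)
(out.write
    .format("jdbc")
    .option("url", "jdbc:redshift://my-cluster.example.us-east-1.redshift.amazonaws.com:5439/dev")
    .option("dbtable", "public.my_table")
    .option("user", "awsuser")
    .option("password", "secret")
    .mode("append")
    .save())
```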
I would like to know if there is a way to output to MongoDB in Luigi. I see in the documentation that they support files (local FS, HDFS), S3, and PostgreSQL, but not MongoDB. If not, could someone explain to me why not? Maybe it is a bad idea to have it? I would like to store the data in a database because then I can explore it by querying it. However, I am using MongoDB and I would not like to install another database. I do not need a relational database, as I am using the database only to store and query (NoSQL) without relationships, so the best option is MongoDB.
Basically I need a task that reads the data and saves it in the database. Then the next task takes this output and processes the data.
Any recommendation, suggestion or clarification is more than welcome. Thanks!
You can try using mortar-luigi.
Check out this link for MongoDB tasks and this example.
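If mortar-luigi doesn't fit, rolling your own target is also fairly lightweight, since a Luigi Target only needs an `exists()` method. Here is a minimal sketch backed by pymongo; the class names, the "collection is non-empty" completeness check, and the marker file are illustrative choices, not an official Luigi API. Newer Luigi releases also ship a `luigi.contrib.mongodb` module that may be worth checking.

```python
# Sketch: a custom Luigi target backed by pymongo, a task that loads data into
# MongoDB, and a downstream task that reads it back. Names and the "done when
# the collection is non-empty" check are illustrative.
import luigi
from pymongo import MongoClient


class MongoCollectionTarget(luigi.Target):
    """Treats the output as complete when the collection contains documents."""
    def __init__(self, uri, db_name, collection):
        self.collection = MongoClient(uri)[db_name][collection]

    def exists(self):
        return self.collection.count_documents({}) > 0


class LoadData(luigi.Task):
    def output(self):
        return MongoCollectionTarget("mongodb://localhost:27017", "mydb", "records")

    def run(self):
        docs = [{"name": "a", "value": 1}, {"name": "b", "value": 2}]  # stand-in data
        self.output().collection.insert_many(docs)


class ProcessData(luigi.Task):
    def requires(self):
        return LoadData()

    def output(self):
        return luigi.LocalTarget("processed.marker")  # simple marker for the second step

    def run(self):
        total = sum(doc["value"] for doc in self.input().collection.find())
        with self.output().open("w") as f:
            f.write(str(total))
```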
I am kinda new to both Elasticsearch and HBase, but for a research project I would like to combine the two. My research project mainly involves searching through a large collection of documents (doc, pdf, msg, etc.) and extracting named entities from them through MapReduce jobs running on the documents stored in HBase.
Does anyone know if there is something similar to the MongoDB river plugin for HBase? Or can someone point me to some documentation about integrating Elasticsearch and HBase? I have looked on the internet for documentation but unfortunately without any luck.
Kind regards,
Martijn
I don't know of any Elasticsearch-HBase integrations, but there are a few Solr and HBase integrations that you can use, like Lily and SolBase.
Tell me what you think about this: https://github.com/posix4e/Elasticsearch-HBase-River. It uses HBase log shipping to reliably handle updates and deletes from HBase into an Elasticsearch cluster. It could easily be extended to do n-regionserver-to-m-Elasticsearch-server replication.
You can use the Phoenix JDBC driver + the ES JDBC river, as shown here: http://lessc0de.github.io/connecting_hbase_to_elasticsearch.html
I don't know of any packaged solutions, but as long as your MapReduce job preps the data in the right way, it should be fairly easy to write a simple batch job in the programming language of your choice that reads from HBase and submits to Elasticsearch.
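For a sense of how small that batch job can be, here is a rough Python sketch that scans rows out of HBase with happybase and bulk-indexes them into Elasticsearch. The table name, column family, and index name are placeholders.

```python
# Sketch: scan an HBase table via the Thrift gateway (happybase) and bulk-index
# the rows into Elasticsearch. Table, column family, and index names are placeholders.
import happybase
from elasticsearch import Elasticsearch, helpers

hbase = happybase.Connection("hbase-host")   # HBase Thrift server must be running
table = hbase.table("documents")
es = Elasticsearch(["http://localhost:9200"])


def actions():
    for row_key, data in table.scan(columns=[b"cf:text", b"cf:entities"]):
        yield {
            "_index": "documents",
            "_id": row_key.decode(),
            "_source": {
                "text": data.get(b"cf:text", b"").decode("utf-8", "replace"),
                "entities": data.get(b"cf:entities", b"").decode("utf-8", "replace"),
            },
        }


helpers.bulk(es, actions())
```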
Take a look at this page (3 years later): http://lessc0de.github.io/connecting_hbase_to_elasticsearch.html