How can I access Couchbase from PySpark or Scala?

I'm new to working with NoSQL databases. I have Spark 1.6.0 on my cluster and I need to get a document from a Couchbase bucket, perform some operations on it, and load it back.
I know the IP, port, bucket name and bucket password. Unfortunately, I'm out of ideas on how to access this database using PySpark. If that's impossible, how can I do it using Scala?
I also need to perform a similar operation with HBase.
Many thanks for any suggestions and useful URLs.
Best regards,
Vladimir.

To access Couchbase from Python, you need to use the Couchbase Python SDK.
Start here: https://docs.couchbase.com/python-sdk/2.5/start-using-sdk.html
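A minimal sketch of that get/transform/upsert flow with the 2.x Python SDK (pip install couchbase), assuming the bucket-password style login described in the question; the host, bucket name, password and document key below are placeholders:

```python
# Hedged sketch using the Couchbase Python SDK 2.x.
# Host, bucket name, password and document key are placeholders.
from couchbase.bucket import Bucket

# Connect straight to a bucket with its password (the style the question describes)
bucket = Bucket('couchbase://10.0.0.1/my-bucket', password='bucket-password')

# Fetch a document, change it, and write it back
result = bucket.get('some-doc-id')
doc = result.value              # the stored JSON as a Python dict
doc['processed'] = True         # whatever transformation you need
bucket.upsert('some-doc-id', doc)
```

As far as I know, the official couchbase-spark-connector is Scala/Java only, so from PySpark one common pattern is to parallelise your document keys and run this kind of get/transform/upsert logic inside a mapPartitions call.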

Related

AWS glue with MongoDB Atlas

I've tried multiple things to connect AWS Glue to MongoDB Atlas. Has anyone been successful in doing so? If so, could you please help me with the steps?
The AWS documentation claims that it should work with any compatible MongoDB connection, but it doesn't.
I am facing a similar issue. I checked with the AWS support team, and it seems they have a huge backlog of similar requests from customers asking for the ability to connect to MongoDB Atlas. Unfortunately, they don't have an ETA for this.
You can either opt to migrate to AWS DocumentDB and then use Glue to crawl your data store, or you will probably have to find some other way to get your data from Atlas into a layer that Glue supports properly.
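One such workaround, sketched below with placeholder connection string, database, collection, bucket and key names, is to export the Atlas collections to S3 as JSON lines with pymongo and point a Glue crawler at the bucket:

```python
# Hedged sketch: dump an Atlas collection to S3 as JSON lines so Glue can crawl it.
# Connection string, database/collection, bucket and key are all placeholders.
import boto3
from pymongo import MongoClient
from bson import json_util

client = MongoClient('mongodb+srv://user:pass@cluster0.example.mongodb.net/mydb')
collection = client['mydb']['mycollection']

# json_util handles BSON types (ObjectId, dates) that plain json.dumps cannot
lines = '\n'.join(json_util.dumps(doc) for doc in collection.find())

s3 = boto3.client('s3')
s3.put_object(Bucket='my-glue-staging-bucket',
              Key='mongodb/mycollection/dump.json',
              Body=lines.encode('utf-8'))
```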

Easy way to create services on HDP cluster

Our company has an HDP 2.6.4 cluster and an outsourced/offshore team that handles various ops tasks. Due to quite strict data access policies, they cannot have access to quite a few datasets. However, we do need them to be able to monitor 24/7 and (ideally) execute different jobs.
As someone on the big data team, I'm in a position to enable them to do so, but without giving them access to the data. I'm not sure what HDP 2.6 has to offer here, but I do know there are tools that let devs build all kinds of API endpoints, which could then be mapped to different ETL jobs, shell scripts, etc.
Would this be an optimal approach from an architectural standpoint?
I was thinking of getting us something like DreamFactory, but open source and something I can run on premises. Any ideas?
Cheers!
DreamFactory can generate REST APIs for a multitude of databases, among them MySQL, Microsoft SQL Server, Oracle, PostgreSQL, and MongoDB. Along with the API, DreamFactory will also auto-generate an extensive set of interactive Swagger documentation for your API. Are you looking to connect databases?

MongoLab and Elasticsearch

My Mongo database is hosted at MongoLab. I'd like to use Elasticsearch as a full-text search engine on top of my DB.
As I understand it, MongoDB needs to run as a replica set, but I don't have any control over how the database runs. I'm currently using the 500 MB free plan.
On top of that, I'm using the Scala Play framework.
Was anyone successful with those technologies and services?
Update:
In the end I stopped using MongoDB and went straight for an Elasticsearch solution.
I found a nice cloud host providing a 500 MB free plan: http://facetflow.com/
It was very useful for my development.
I didn't find any satisfying Scala library for ES, so I'm using Dispatch and making direct HTTP requests to the ES instance.
I hope that someone will find this useful.
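For anyone not on Scala, the same "skip the client library and hit the REST API directly" idea is easy to sketch in Python with requests; the host, index and documents below are placeholders, and the endpoint layout assumes a reasonably recent Elasticsearch:

```python
# Hedged sketch of talking to Elasticsearch over plain HTTP, no client library.
# Host, index name and documents are placeholders; endpoints assume a recent ES.
import requests

ES = 'http://localhost:9200'

# Index a document
requests.put(f'{ES}/articles/_doc/1',
             json={'title': 'Hello', 'body': 'full text search with ES'})

# Run a full-text match query
resp = requests.post(f'{ES}/articles/_search',
                     json={'query': {'match': {'body': 'search'}}})
print(resp.json()['hits']['hits'])
```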
Just a quick note ... MongoHQ has oplog support with their MongoDB Elastic Deployments ... that could help you with using Elasticsearch and the MongoDB river.
http://blog.mongohq.com/elastic-deployments-now-with-oplog-access/
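To illustrate what oplog access enables (the river plugin does this inside Elasticsearch itself), a minimal do-it-yourself tailer might look like the following; hosts, database and collection names are placeholders, and only plain inserts are handled to keep it short:

```python
# Hedged sketch of a "river"-style sync: tail the replica-set oplog with pymongo
# and mirror newly inserted documents into Elasticsearch.
# Hosts, database and collection names are placeholders; only inserts are handled.
import pymongo
from elasticsearch import Elasticsearch

mongo = pymongo.MongoClient('mongodb://user:pass@mongo-host:27017')
oplog = mongo.local.oplog.rs
es = Elasticsearch(['http://es-host:9200'])

# Start tailing from the newest oplog entry onwards
last_ts = oplog.find().sort('$natural', pymongo.DESCENDING).limit(1).next()['ts']
cursor = oplog.find({'ts': {'$gt': last_ts}},
                    cursor_type=pymongo.CursorType.TAILABLE_AWAIT,
                    oplog_replay=True)

for op in cursor:
    if op['ns'] == 'mydb.articles' and op['op'] == 'i':   # inserts only
        doc = dict(op['o'])
        doc_id = str(doc.pop('_id'))     # ObjectId isn't JSON-serialisable
        es.index(index='articles', id=doc_id, body=doc)
```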
I haven't looked into this too deeply, but you might want to check out Searchly (http://www.searchly.com/features/). The features mention:
Built-in crawler for crawling web pages and databases. (Currently MongoDB)
If you try this out, please let me know how it goes. I will do the same.
Update:
I haven't tried Searchly, but I was able to start a MongoDB instance in replica mode on OpenShift.
I also have an Elasticsearch server running on the same OpenShift "gear".
Now I need time to try connecting those two together, and then the fun will start :-)

How to replicate MySQL database to Cloud SQL Database

I have read that you can replicate a Cloud SQL database to MySQL. Instead, I want to replicate from a MySQL database (that the business uses to keep inventory) to Cloud SQL so it can have up-to-date inventory levels for use on a web site.
Is it possible to replicate MySQL to Cloud SQL? If so, how do I configure that?
This is something that is not yet possible in CloudSQL.
I'm using DBSync to do it, and it's working fine.
http://dbconvert.com/mysql.php
The Sync version does what you want.
It works well with App Engine and Cloud SQL. You must authorize external connections first.
This is a rather old question, but it might be worth noting that this now seems possible by configuring external masters.
The high-level steps are:
Create a dump of the data from the master and upload the file to a storage bucket (a sketch of this step follows below)
Create a master instance in Cloud SQL
Set up a replica of that instance, using the external master's IP, username and password; also provide the dump file location
Set up additional replicas if needed
Voilà!
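As a rough illustration of the first step only (the hostnames, credentials, database, file and bucket names here are all placeholders), you could take the dump with mysqldump and push it to Cloud Storage with the google-cloud-storage client:

```python
# Hedged sketch of step 1: dump the external master and upload it to a GCS bucket.
# Hostnames, credentials, database, file and bucket names are placeholders.
import subprocess
from google.cloud import storage

DUMP_FILE = 'inventory-dump.sql'

# Flags commonly used when seeding replication: consistent snapshot + binlog position
subprocess.run(
    ['mysqldump', '-h', 'mysql.internal.example.com', '-u', 'repl_user',
     '-psecret', '--databases', 'inventory',
     '--single-transaction', '--master-data=1',
     '--result-file', DUMP_FILE],
    check=True)

client = storage.Client()
client.bucket('my-cloudsql-import-bucket').blob(DUMP_FILE).upload_from_filename(DUMP_FILE)
```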

How to continuously write mongodb data into a running hdinsight cluster

I want to keep a Windows Azure HDInsight cluster always running so that I can periodically write updates from my master data store (which is MongoDB) and have it process map-reduce jobs on demand.
How can I periodically sync data from MongoDB with the HDInsight service? I'm trying to avoid uploading all the data whenever a new query is submitted, which can happen at any time, and instead have it somehow pre-warmed.
Is that possible on HDInsight? Is it even possible with Hadoop?
Thanks,
It is certainly possible to have that data pushed from Mongo into Hadoop.
Unfortunately HDInsight does not support HBase (yet), otherwise you could use something like ZeroWing, a solution from Stripe that reads the MongoDB oplog (which Mongo uses for replication) and writes it out to HBase.
Another solution might be to write out documents from your Mongo to Azure Blob storage. This means you wouldn't have to have the cluster up all the time, but you would still be able to use it to do periodic map-reduce analytics against the files in the storage vault.
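A rough sketch of that export, using pymongo and the azure-storage-blob (v12) client to land a collection in Blob storage as JSON lines; the connection string, container and blob names are placeholders:

```python
# Hedged sketch: export a MongoDB collection to Azure Blob storage as JSON lines,
# so an HDInsight job can read it later. All names and connection strings are placeholders.
from azure.storage.blob import BlobServiceClient
from pymongo import MongoClient
from bson import json_util

mongo = MongoClient('mongodb://mongo-host:27017')
docs = mongo['mydb']['events'].find()
payload = '\n'.join(json_util.dumps(d) for d in docs)   # json_util handles BSON types

blob_service = BlobServiceClient.from_connection_string('<azure-storage-connection-string>')
blob = blob_service.get_blob_client(container='hdinsight-input', blob='events/dump.json')
blob.upload_blob(payload, overwrite=True)
```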
Your best method is undoubtedly to use the Mongo Hadoop connector. This can be installed in HDInsight, but it's a bit fiddly. I've blogged a method here.