Migrate data from NoSQL to an RDBMS

We have existing data in HBase and we want to move to AWS Aurora (MySQL). We need to keep using the existing data, so we have to somehow load the NoSQL data into Aurora.
It's not a very big database, just a few tables.
Are there any best practices/tools to migrate data from NoSQL to a relational DB? I have seen a lot of questions on the internet that ask about the reverse (DB -> NoSQL), but my requirement is a bit different and I can't find any helpful information.
Can someone please help? Where do I even start?

One simple way to do this without writing too much custom code would be to use the Spark-HBase Connector (SHC) from Hortonworks to read data from an HBase table into a Spark DataFrame and to write that DataFrame into a MySQL table. The key challenge is getting SHC to work, because in my experience it's extremely version-sensitive. So the trick is to correctly coordinate your versions of Spark, HBase, and SHC (and finding the right combination is trickier than you may think).
However, if you manage to get all the dependencies right, then doing the above is a matter of a few lines of code in a Jupyter notebook or the PySpark shell. You could run this on YARN to parallelize the workload in case it's large. Should work. Give it a try.
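A minimal PySpark sketch of what that might look like, assuming SHC and the MySQL JDBC driver are on the Spark classpath; the HBase catalog, column names, and Aurora endpoint below are placeholders, not your actual schema:

```python
# A minimal sketch, assuming SHC and the MySQL JDBC driver are on the classpath.
# The HBase catalog, column names, and Aurora endpoint are placeholders.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("hbase-to-aurora").getOrCreate()

# SHC needs a catalog mapping HBase column families/qualifiers to DataFrame columns.
catalog = """{
  "table": {"namespace": "default", "name": "my_hbase_table"},
  "rowkey": "key",
  "columns": {
    "id":   {"cf": "rowkey", "col": "key",  "type": "string"},
    "name": {"cf": "cf1",    "col": "name", "type": "string"},
    "age":  {"cf": "cf1",    "col": "age",  "type": "int"}
  }
}"""

# Read the HBase table into a DataFrame through SHC.
df = (spark.read
      .options(catalog=catalog)
      .format("org.apache.spark.sql.execution.datasources.hbase")
      .load())

# Write the DataFrame to Aurora MySQL over JDBC.
(df.write
   .format("jdbc")
   .option("url", "jdbc:mysql://my-aurora-endpoint:3306/mydb")
   .option("dbtable", "my_mysql_table")
   .option("user", "admin")
   .option("password", "secret")
   .option("driver", "com.mysql.cj.jdbc.Driver")
   .mode("append")
   .save())
```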

Related

SQL vs PySpark/Spark SQL

Could someone please help me understand why we need to use PySpark or SparkSQL etc. if the source and target of my data are the same DB?
For example, let's say I need to load data into table X in a Postgres DB from tables X and Y. Would it not be simpler and faster to just do it in Postgres instead of using SparkSQL or PySpark?
I understand the need for these solutions if the data is from multiple sources, but if it is from the same source, do I need to use PySpark?
You can use Spark when you want to do heavy data transformations; it makes the data easier to load and process thanks to distributed processing.
It totally depends on how large the data is and how you want to transform it.
Using Postgres will be a good idea if the data is relatively small and no transformation is required.
It is not necessary to use PySpark. Both PySpark and SparkSQL have their value in managing/manipulating large volumes of data (a few hundred GBs, TBs, or PBs) in a distributed computing setup. If this is your case, use PySpark; it will be more efficient for loading, manipulating, and processing/shaping the data before inserting it into another table.
Thank you all for the feedback. I think I will use Glue PySpark if the source and destination are different. Otherwise I will use a Glue Python shell job with a JDBC connection and have one session do the tasks without bringing the data into DataFrames.
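For the second case (same source and destination), a rough sketch of that single-connection idea, assuming a Postgres driver such as psycopg2 is available in the job environment; table and column names are made up for illustration:

```python
# A rough sketch, assuming psycopg2 (or a similar Postgres driver) is available.
# Table and column names are made up; the point is that the transform runs
# entirely inside Postgres, with no DataFrames involved.
import psycopg2

conn = psycopg2.connect(
    host="my-postgres-host",   # placeholder connection details
    dbname="mydb",
    user="etl_user",
    password="secret",
)

with conn, conn.cursor() as cur:
    # One set-based statement does the load; no data leaves the database.
    cur.execute("""
        INSERT INTO table_x (id, total)
        SELECT y.id, SUM(y.amount)
        FROM   table_y AS y
        GROUP  BY y.id
    """)

conn.close()
```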

What's the fastest way to put RDF data (specifically DBPedia dumps) into Postgres?

I'm looking to put RDF data from DBPedia Turtle (.ttl) files into Postgres. I don't really care how the data is modelled in Postgres as long as it is a complete mapping (it would also be nice if there were sensible indexes); I just want to get the data into Postgres, and then I can transform it with SQL from there.
I tried using this StackOverflow solution that leverages Python and sqlalchemy, but it seems to be much too slow (would take days if not more at the pace I observed on my machine).
I expected there might have been some kind of ODBC/JDBC-level tool for this type of connection. I did the same thing with Neo4j in less than an hour using a plugin Neo4j provides.
Thanks to anyone that can provide help.
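For what it's worth, here is a rough, unbenchmarked sketch of a bulk-load alternative to the row-by-row approach described above: parse the Turtle with rdflib and stream the triples into a simple (subject, predicate, object) table via Postgres COPY. The file name, table layout, and escaping are assumptions, and full DBPedia dumps would likely need chunked processing rather than one in-memory graph:

```python
# A rough sketch: parse Turtle with rdflib, then bulk-load with Postgres COPY
# instead of row-by-row inserts. File name, table layout, and escaping are
# assumptions; full DBPedia dumps would likely need to be processed in chunks.
import io
import psycopg2
from rdflib import Graph

g = Graph()
g.parse("dbpedia_subset.ttl", format="turtle")   # placeholder file

# Serialize triples as tab-separated text for COPY, escaping delimiter characters.
buf = io.StringIO()
for s, p, o in g:
    row = "\t".join(
        str(t).replace("\\", "\\\\").replace("\t", " ").replace("\n", " ")
        for t in (s, p, o)
    )
    buf.write(row + "\n")
buf.seek(0)

conn = psycopg2.connect(host="localhost", dbname="rdf", user="postgres", password="secret")
with conn, conn.cursor() as cur:
    cur.execute("""
        CREATE TABLE IF NOT EXISTS triples (
            subject   text,
            predicate text,
            object    text
        )
    """)
    cur.copy_from(buf, "triples", columns=("subject", "predicate", "object"))
    # Index after loading; it is usually much faster than indexing during inserts.
    cur.execute("CREATE INDEX IF NOT EXISTS triples_s_idx ON triples (subject)")
    cur.execute("CREATE INDEX IF NOT EXISTS triples_p_idx ON triples (predicate)")
conn.close()
```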

How do I efficiently migrate the BigQuery Tables to On-Prem Postgres?

I need to migrate the tables from the BigQuery to the on-prem Postgres database.
How can I efficiently achieve that?
Some thoughts that come to mind:
I will use Google APIs to export the data from the tables
Store it locally
And finally, import to Postgres
But I am not sure if that can be done for a huge amount of data in TBs. Also, how can I automate this process? Can I use Jenkins for that?
Exporting the data from BigQuery, storing it, and importing it into PostgreSQL is a good approach. Here are two other alternatives you can consider:
1) There's a PostgreSQL wrapper for BigQuery that allows querying BigQuery directly. Depending on your scenario this might be the easiest way to transfer the data, although for TBs it might not be the best approach. This suggestion was made by #David in this SO question.
2) Using Dataflow. You can create an ETL process using Apache Beam to make the transfer. Take a look at this how-to for transferring data from BigQuery to Cloud SQL. You would need to adapt it for an on-prem PostgreSQL, but the idea is the same.
Here's another SO answer that gives more context on this approach.
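A hedged sketch of the export -> store -> import route endorsed above, using the BigQuery and Cloud Storage Python clients plus Postgres COPY. The project, dataset, bucket, and connection details are placeholders; a TB-scale table will export as many sharded files, hence the wildcard in the destination URI:

```python
# A hedged sketch of export -> store -> import. Project, dataset, bucket, and
# connection details are placeholders; large tables export as many CSV shards.
import gzip
from google.cloud import bigquery, storage
import psycopg2

bq = bigquery.Client()
gcs = storage.Client()

# 1) Export the BigQuery table to GCS as gzipped, sharded CSV.
job_config = bigquery.ExtractJobConfig()
job_config.compression = "GZIP"
extract_job = bq.extract_table(
    "my_project.my_dataset.my_table",
    "gs://my-export-bucket/my_table/part-*.csv.gz",
    job_config=job_config,
)
extract_job.result()  # wait for the export to finish

# 2) Download the exported shards locally.
bucket = gcs.bucket("my-export-bucket")
blobs = list(bucket.list_blobs(prefix="my_table/"))
local_files = []
for blob in blobs:
    local_path = blob.name.replace("/", "_")
    blob.download_to_filename(local_path)
    local_files.append(local_path)

# 3) COPY each shard into the on-prem Postgres table.
conn = psycopg2.connect(host="onprem-host", dbname="mydb", user="etl", password="secret")
with conn, conn.cursor() as cur:
    for local_path in local_files:
        with gzip.open(local_path, "rt") as f:
            cur.copy_expert("COPY my_table FROM STDIN WITH CSV HEADER", f)
conn.close()
```

A scheduler like Jenkins (or cron, or Airflow) can then run this script on whatever cadence you need, which covers the automation part of the question.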

What value does Postgres adapter for spark/hadoop add?

I am not an HDFS nerd, but coming from a traditional RDBMS background, I am just scratching the surface with newer technologies like Hadoop and Spark. Now I am looking at my options when it comes to SQL querying on Spark data.
What I realized is that Spark inherently supports SQL querying. Then I came across this link:
https://www.enterprisedb.com/news/enterprisedb-announces-new-apache-spark-connecter-speed-postgres-big-data-processing
which I am trying to make some sense of. If I am understanding it correctly, data is still stored in HDFS but the Postgres connector is used as a query engine? If so, in the presence of an existing querying framework, what new value does this Postgres connector add?
Or am I misunderstanding what it actually does?
I think you are misunderstanding.
They allude to the concept of a Foreign Data Wrapper.
"... They allow PostgreSQL queries to include structured or unstructured data, from multiple sources such as Postgres and NoSQL databases, as well as HDFS, as if they were in a single database. ...
"
This sounds to me like the Oracle Big Data Appliance approach. From Postgres you can look at the world of data processing it logically as though it is all Postgres, but underwater the HDFS data is accessed using Spark query engine invoked by the Postgres Query engine, but you need not concern yourself with that is the likely premise. We are in the domain of Virtualization. You can combine Big Data and Postgres data on the fly.
There is no such thing as Spark data as it is not a database as such barring some Spark fomatted data that is not compatible with Hive.
The value will be invariably be stated that you need not learn Big Data etc. Whether that is true remains to be seen.
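As a loose illustration of the Foreign Data Wrapper idea (not a definitive hdfs_fdw recipe: the extension name, the option names used here, and the server details are assumptions; check the wrapper's docs for your version), HDFS/Hive-resident data is exposed to Postgres as a foreign table and can be joined with ordinary Postgres tables:

```python
# A loose illustration of the FDW idea, not a definitive hdfs_fdw recipe.
# Option names (host, port, client_type, dbname, table_name) and all server
# details below are assumptions for illustration only.
import psycopg2

conn = psycopg2.connect(host="localhost", dbname="mydb", user="postgres", password="secret")
with conn, conn.cursor() as cur:
    cur.execute("CREATE EXTENSION IF NOT EXISTS hdfs_fdw")
    cur.execute("""
        CREATE SERVER IF NOT EXISTS hadoop_server
            FOREIGN DATA WRAPPER hdfs_fdw
            OPTIONS (host 'thrift-server-host', port '10000', client_type 'spark')
    """)
    cur.execute("CREATE USER MAPPING IF NOT EXISTS FOR postgres SERVER hadoop_server")
    # Hive/Spark-managed data in HDFS appears to Postgres as an ordinary table.
    cur.execute("""
        CREATE FOREIGN TABLE IF NOT EXISTS weblogs_hdfs (
            user_id int,
            url     text
        ) SERVER hadoop_server OPTIONS (dbname 'default', table_name 'weblogs')
    """)
    # A single query now spans a local Postgres table (users, hypothetical here)
    # and data that physically lives in HDFS.
    cur.execute("""
        SELECT u.name, count(*) AS hits
        FROM   users u
        JOIN   weblogs_hdfs w ON w.user_id = u.id
        GROUP  BY u.name
    """)
    print(cur.fetchall())
conn.close()
```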

HBase or Mongo for an Analytics DB if already using Hadoop?

I currently have a Hadoop cluster where I store tons of logs over which I run pig scripts for calculating aggregated analytics. I also have a Mongo cluster where I store production data.
I've recently been put in a position where I need to do a lot of one-off analytics queries, or enable others to do them. These queries frequently need to use both production data and log data together, so whatever I go with, I'd like to have everything in one place. My log data is in json and about 10x the size of my prod data. Here are the pros/cons of Mongo and HBase I'm seeing:
Mongo Pros/ HBase Cons:
Since log data is in JSON, I can get it into Mongo pretty easily, and I can do this in real time as it comes in through something like FluentD.
Most people I work with already have experience writing Mongo queries from needing to work with prod data, so getting an analytics db up on Mongo would be very simple for everyone to use.
I know much less about HBase than Mongo.
No idea how easy/difficult it would be to get data in JSON or from Mongo into HBase. I imagine this isn't so bad, but I don't see much documentation.
HBase Pros/Mongo Cons:
My log data is much bigger than my prod data, so storing it in both hadoop and mongo would be way more expensive than storing my prod data in both hadoop and mongo.
I can build HBase on top of my already running Hadoop cluster and fit my prod data in there without adding many extra machines. If I went with Mongo, I'd need a whole new Mongo cluster.
I could use Phoenix on top of HBase to allow a simple SQL syntax for accessing all our data, but I'm not sure how unwieldy this would be for multi-level document-based data.
I know very little about Hbase currently, and I wouldn't consider myself a Mongo expert, so I'm probably missing a lot.
So, what am I missing, and which is right for my situation?
First of all, you should use something you can already handle. Therefore, MongoDB seems a good choice, especially since the data is already in JSON format.
On the other hand, I have used HBase for quite a while and the read performance is amazing even with a lot of rows, though I really don't know whether there is any good, fast integration of MongoDB with Hadoop.
HBase is the Hadoop database, so it is predestined to work together with Hadoop.
If the logs can be indexed (in the HBase rowkey) by:
producing_program_identifier, timestamp, ...
then HBase could work quite well for this query pattern.
But if you decide on HBase, use the Phoenix framework (see the sketch below); it will save you time by giving you familiar interfaces like JDBC and SQL-like queries. It also provides simple aggregation functions (count, avg, max, min), which may be sufficient.
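A small, hypothetical sketch of that Phoenix route using the phoenixdb Python driver against the Phoenix Query Server; the LOGS table and its columns are made up, but the point is that a rowkey leading with (program identifier, timestamp) keeps such queries on a narrow range of HBase rows while still looking like plain SQL:

```python
# A hypothetical sketch using the phoenixdb driver against the Phoenix Query
# Server. The LOGS table and its columns are made up for illustration.
import phoenixdb

conn = phoenixdb.connect("http://phoenix-queryserver-host:8765/", autocommit=True)
cur = conn.cursor()

# Aggregate one program's logs over a time window with familiar SQL.
cur.execute(
    """
    SELECT PROGRAM_ID, COUNT(*) AS events, AVG(DURATION_MS) AS avg_ms
    FROM   LOGS
    WHERE  PROGRAM_ID = ? AND TS BETWEEN ? AND ?
    GROUP  BY PROGRAM_ID
    """,
    ("billing-service", 1700000000000, 1700086400000),
)
print(cur.fetchall())
conn.close()
```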
From what you're saying, it seems a MongoDB-based solution would work best for you.
HBase is extremely versatile and you can get it to serve both your prod needs and your analytics needs; however, the general-purpose SQL capabilities (in Phoenix, Cloudera's Impala and others) are in their infancy, and the standard HBase way to get high query performance (designing the data structure for reads) will take a lot of effort (especially since you don't have experience with HBase).
By the way, it may make sense for you to use map/reduce to pre-aggregate the data and then load it into MongoDB, thus utilizing your current setup better rather than changing it either way.