I was trying to import data from MySQL into HDFS.
I was able to do it with Sqoop, but the data could also be loaded by fetching it from a REST API.
My question is: when should I use a REST API to load data into HDFS instead of Sqoop?
Please explain the differences, with some use cases.
You could use Sqoop to pull data from MySQL into HBase, then put a REST API over HBase (on Hadoop)... That would not be much different from a REST API over MySQL.
Basically, you're comparing two different things. Hadoop is not meant to replace traditional databases or N-tier user-facing applications; it is simply a more distributed, fault-tolerant place to store large amounts of data.
And you typically wouldn't use a REST API to talk to a database and then put those values into Hadoop, because that wouldn't be distributed: all of the database results would go through a single process.
Sqoop (SQL <=> Hadoop) is basically used for loading data from an RDBMS into HDFS.
It is a direct connection to the database, and if privileges are not defined properly for the user accessing the database from Sqoop, you can even append/modify/delete data in tables through the sqoop eval command.
With a REST web service API, by contrast, we can fetch data from various databases (NoSQL or RDBMS) that are connected internally via the service code.
For example, consider calling a getUsersData RESTful web service with a curl command: the service is specifically designed only to provide user data and does not allow you to append/modify/update any part of the database, regardless of whether it is an RDBMS or NoSQL store.
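To make the contrast concrete, here is a minimal sketch of the REST-based path, assuming a hypothetical read-only getUsersData endpoint and the `hdfs` WebHDFS client package from PyPI; host names and paths are placeholders:

```python
# Call a read-only REST endpoint and land the result in HDFS.
# The getUsersData URL is the hypothetical service from above; the `hdfs`
# package (pip install hdfs) is an assumption, as is the NameNode address.
import json

import requests
from hdfs import InsecureClient

# Hypothetical read-only REST endpoint; it can be backed by any RDBMS/NoSQL store.
resp = requests.get("https://api.example.com/getUsersData", timeout=30)
resp.raise_for_status()
users = resp.json()

# Write the payload into HDFS as newline-delimited JSON via WebHDFS.
client = InsecureClient("http://namenode-host:9870", user="hdfs")
with client.write("/data/users/users.json", encoding="utf-8", overwrite=True) as writer:
    for user in users:
        writer.write(json.dumps(user) + "\n")
```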
I am currently contemplating how to expose the data present in Redshift tables in a meaningful and consistent way through a REST API.
The way I want it to work is that the caller calls the API and then we do some kind of dynamic querying on the tables. I am worried about latency, as the queries could range from simple to very complicated. Since Redshift requires connecting to the database as a client, some of the approaches we could take are:
Create a Lambda function connecting to Redshift, which is invoked through API Gateway.
Use OData to create RESTful APIs. However, I don't think Redshift supports OData out of the box.
I am leaning towards OData since it has advanced filtering options along with pagination.
I am seeking advice: will OData be enough, and if yes, how exactly does one integrate OData with Redshift?
Any other advice/approaches are welcome too.
Thanks!
Let me go over the different options:
Redshift Data API
The Redshift Data API lets you invoke queries and get their results in an asynchronous manner.
You can use the API directly from the front-end, or you can put it behind API Gateway.
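As a rough illustration of that flow, here is a sketch using boto3; the cluster name, database, user and SQL are placeholders:

```python
# Submit a statement through the Redshift Data API, poll for completion,
# then fetch the result set.
import time

import boto3

client = boto3.client("redshift-data")

# The call returns immediately with a statement Id.
stmt = client.execute_statement(
    ClusterIdentifier="my-cluster",   # placeholder
    Database="analytics",             # placeholder
    DbUser="api_readonly",            # placeholder
    Sql="SELECT country, count(*) FROM users GROUP BY country",
)

# Poll until the statement finishes.
while True:
    desc = client.describe_statement(Id=stmt["Id"])
    if desc["Status"] in ("FINISHED", "FAILED", "ABORTED"):
        break
    time.sleep(1)

if desc["Status"] == "FINISHED":
    result = client.get_statement_result(Id=stmt["Id"])
    for record in result["Records"]:
        print(record)
```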
Lambda
If you trust your users and can set up proper authentication, you can simply invoke a Lambda directly from the front-end and pass it some SQL to run, or have it generate SQL based on the parameters. You can potentially swap this with Athena using federated query. Optionally you can add API Gateway in front for some additional features like rate limiting and different forms of authentication. Keep in mind that both Lambda and API Gateway have limits on the amount of data returned and on execution time.
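A minimal sketch of such a Lambda handler, assuming psycopg2 is packaged as a layer, connection details come from environment variables, and the `country` parameter is just an example of a request parameter:

```python
# Build a parameterized query from a request parameter and run it against Redshift.
import os

import psycopg2


def handler(event, context):
    # Hypothetical parameter passed from API Gateway / the front-end.
    country = event.get("country", "US")

    conn = psycopg2.connect(
        host=os.environ["REDSHIFT_HOST"],
        port=5439,
        dbname=os.environ["REDSHIFT_DB"],
        user=os.environ["REDSHIFT_USER"],
        password=os.environ["REDSHIFT_PASSWORD"],
    )
    try:
        with conn.cursor() as cur:
            # Parameterize instead of concatenating user input into the SQL.
            cur.execute(
                "SELECT id, name FROM users WHERE country = %s LIMIT 100",
                (country,),
            )
            rows = cur.fetchall()
    finally:
        conn.close()

    # Keep the payload small: Lambda/API Gateway responses have size limits.
    return {"rows": [list(r) for r in rows]}
```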
For long-running queries I would suggest that the Lambda, API Gateway, or even the front-end itself invoke an AWS Glue Python Shell job, which uses an UNLOAD query to drop the results in S3. The front-end can then poll for when the job is done.
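A rough sketch of that pattern, with placeholder bucket, IAM role and query; the job simply runs UNLOAD and lets Redshift write the result files to S3:

```python
# Run an UNLOAD so Redshift writes the result set straight to S3; the
# front-end polls for the files afterwards. All identifiers are placeholders.
import os

import psycopg2

UNLOAD_SQL = """
UNLOAD ('SELECT * FROM big_fact_table')
TO 's3://my-results-bucket/exports/run-42/'
IAM_ROLE 'arn:aws:iam::123456789012:role/redshift-unload-role'
FORMAT AS PARQUET;
"""

conn = psycopg2.connect(
    host=os.environ["REDSHIFT_HOST"],
    port=5439,
    dbname=os.environ["REDSHIFT_DB"],
    user=os.environ["REDSHIFT_USER"],
    password=os.environ["REDSHIFT_PASSWORD"],
)
conn.autocommit = True
with conn.cursor() as cur:
    cur.execute(UNLOAD_SQL)  # returns once Redshift has finished writing to S3
conn.close()
```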
If you only have a few types of queries, then you can build a proper REST API.
Instead of Lambda, you can also use Amazon Athena Federated Query, which you can actually query directly from the front-end.
OData Implementation
There are third-party OData implementations for Redshift; just google them. With a front-end library that consumes OData (I used Kendo UI in the past) you can potentially make a working, feature-rich front-end in days. The main concern with this option is that the tool costs may be over your budget. Of course, the hours you spend building things are also a cost, but it really depends on what your actual requirements are.
So how to choose?
Depending on your requirements, I would suggest simply going through the options and selecting one based on cost, time to implement, performance, reliability, and security.
How about Redshift performance?
This is the most difficult part about Redshift and on-demand queries. On Redshift you don't have indexes, data can be compressed, and the data is stored in a columnar fashion. All of these can make Redshift slower than your average relational database for a random query.
However, if you make sure that your table is sorted, its distribution style matches your queries, and your queries use the columnar storage to their advantage (not all columns are requested), then it can be faster.
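For illustration only, here is a hypothetical table whose distribution and sort keys match its main query; the names and DDL below are not from the question, just an example of the idea:

```python
# Hypothetical table: rows are distributed on the join/group key and sorted
# on the column most queries filter by.
DDL = """
CREATE TABLE page_views (
    user_id     BIGINT,
    event_date  DATE,
    url         VARCHAR(2048),
    duration_ms INTEGER
)
DISTSTYLE KEY
DISTKEY (user_id)        -- co-locate rows that are joined/grouped on user_id
SORTKEY (event_date);    -- range filters on event_date can skip blocks
"""

# A query that benefits: it touches few columns (columnar storage helps) and
# filters on the sort key, so Redshift scans far less data.
QUERY = """
SELECT user_id, count(*) AS views
FROM page_views
WHERE event_date BETWEEN '2023-01-01' AND '2023-01-31'
GROUP BY user_id;
"""
```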
Another thing to keep in mind is that Redshift doesn't handle concurrency well. I believe by default there can only be 8 concurrent queries; you can increase that, but you definitely wouldn't want to go beyond 20.
If your users can wait for their queries (I've seen bad queries go over 2 hours, and I'm sure you can make them take even longer), then Redshift is fine. If not, you could try putting Postgres in front of Redshift by using external tables, and then use regular indexes in front of it to speed things up.
I have a problem where I need to store users' address data, which can come from different vendors in different formats. Once I have the data I need to do some cleaning and wrangling and run a de-duplication process to get clean, structured data. Once the data is clean, I may have to pick different address attributes from different vendors based on some complex logic which is not defined yet. My questions are:
1) Which database should I use, i.e. something from the NoSQL family (document/key-value stores, DynamoDB, etc.) or an RDBMS/MPP database like Redshift or Azure Data Warehouse?
2) A NoSQL DB like MongoDB provides schema flexibility, but at the same time the querying and de-duplication process is not something built into these databases.
If anyone can guide me on this I shall be very thankful.
Thanks
Atul
I am not an HDFS nerd, but coming from a traditional RDBMS background, I am scratching the surface with newer technologies like Hadoop and Spark. Now, I was looking at my options when it comes to SQL querying on Spark data.
What I realized is that Spark inherently supports SQL querying. Then I came across this link:
https://www.enterprisedb.com/news/enterprisedb-announces-new-apache-spark-connecter-speed-postgres-big-data-processing
I am trying to make some sense of it. If I am understanding it correctly, data is still stored in HDFS, but the Postgres connector is used as a query engine? If so, in the presence of an existing querying framework, what new value does this Postgres connector add?
Or am I misunderstanding what it actually does?
I think you are misunderstanding.
They allude to the concept of a Foreign Data Wrapper.
"... They allow PostgreSQL queries to include structured or unstructured data, from multiple sources such as Postgres and NoSQL databases, as well as HDFS, as if they were in a single database. ...
"
This sounds to me like the Oracle Big Data Appliance approach. From Postgres you can look at the world of data logically as though it were all Postgres, but under the hood the HDFS data is accessed by the Spark query engine, invoked by the Postgres query engine; the likely premise is that you need not concern yourself with that. We are in the domain of data virtualization. You can combine Big Data and Postgres data on the fly.
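As a rough sketch of what that looks like on the Postgres side: the CREATE EXTENSION / CREATE SERVER / CREATE FOREIGN TABLE pattern is standard PostgreSQL, but the hdfs_fdw option names below are written from memory and should be checked against the extension's documentation (a USER MAPPING is usually required as well):

```python
# Foreign data wrapper setup for exposing HDFS-resident data inside Postgres.
# Extension name and OPTIONS are illustrative for EnterpriseDB's hdfs_fdw.
FDW_SETUP = """
CREATE EXTENSION hdfs_fdw;

CREATE SERVER hdfs_server
    FOREIGN DATA WRAPPER hdfs_fdw
    OPTIONS (host 'hive-or-spark-thrift-host', port '10000', client_type 'spark');

CREATE FOREIGN TABLE hdfs_weblogs (
    user_id    BIGINT,
    url        TEXT,
    visited_at TIMESTAMP
)
SERVER hdfs_server
OPTIONS (dbname 'default', table_name 'weblogs');
"""

# Once the foreign table exists, HDFS data can be joined with local Postgres
# tables as if everything lived in one database.
QUERY = """
SELECT u.email, count(*) AS visits
FROM local_users u
JOIN hdfs_weblogs w ON w.user_id = u.id
GROUP BY u.email;
"""
```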
There is no such thing as "Spark data", as Spark is not a database as such, barring some Spark-formatted data that is not compatible with Hive.
The value will invariably be stated as "you need not learn Big Data tooling", etc. Whether that is true remains to be seen.
I'm new to Spark. I'm a web developer, not familiar with big data.
Let's say I have a portal website. Users' behavior and actions are stored in 5 sharded MongoDB clusters.
How do I analyze them with Spark?
Or can Spark get the data from any database directly (Postgres/MongoDB/MySQL/...)?
I ask because most websites use a relational DB as the back-end database.
Should I export all the data in the website's databases into HBase?
I store all the user logs in PostgreSQL; is it practical to export the data into HBase or other Spark-preferred stores?
It seems this would create a lot of duplicated data if I copy the data into a new database.
Does my big data setup need any other framework besides Spark?
For analyzing the data in the website's databases,
I don't see the reason I would need HDFS, Mesos, ...
How can I make Spark workers access the data in the PostgreSQL databases?
I only know how to read data from a text file,
and I have seen some code about how to load data from hdfs://.
But I don't have an HDFS system now; should I set up HDFS for my purposes?
Spark is a distributed compute engine, so it expects to have files accessible from all nodes. Here are some choices you might consider:
There seems to be a Spark-MongoDB connector. This post explains how to get it working.
Export the data out of MongoDB into Hadoop, and then use Spark to process the files. For this, you need to have a Hadoop cluster running.
If you are on Amazon, you can put the files in S3 and access them from Spark.
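For the PostgreSQL part of the question, Spark can also read a table directly over JDBC, without exporting it anywhere first. A minimal PySpark sketch, with placeholder connection details and an assumed driver version:

```python
# Read a PostgreSQL table into a Spark DataFrame over JDBC.
# The PostgreSQL JDBC driver must be on the Spark classpath
# (here pulled via spark.jars.packages; the version is an assumption).
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("user-log-analysis")
    .config("spark.jars.packages", "org.postgresql:postgresql:42.7.3")
    .getOrCreate()
)

user_logs = (
    spark.read.format("jdbc")
    .option("url", "jdbc:postgresql://db-host:5432/portal")  # placeholder DSN
    .option("dbtable", "user_logs")                          # placeholder table
    .option("user", "readonly")
    .option("password", "secret")
    .option("driver", "org.postgresql.Driver")
    .load()
)

# From here it is a normal Spark DataFrame: aggregate, join, write out, etc.
user_logs.groupBy("action").count().show()
```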
We need a process in place to pull data from the Hadoop Distributed File System (HDFS) into a relational DB (PostgreSQL) on a regular basis. We will need to transfer several million records per hour, and I am looking for the best industry-standard way to move data out of HDFS. Does anyone have any suggestions?
The idea is for a web app to interact with PostgreSQL, which will hold the aggregated data.
Sqoop is built for the purpose of moving data between relational data stores and Hadoop. Specifically, you want sqoop-export.
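For illustration, a rough sketch of a sqoop-export invocation; it is wrapped in Python only so it can be scheduled from an existing pipeline, the same flags work directly on the command line, and the connection string, table, and HDFS path are placeholders:

```python
# Push an HDFS directory into a PostgreSQL table with sqoop export.
import subprocess

cmd = [
    "sqoop", "export",
    "--connect", "jdbc:postgresql://db-host:5432/analytics",  # placeholder
    "--username", "loader",
    "--password-file", "/user/loader/.pg_password",  # keeps the password off the command line
    "--table", "hourly_aggregates",                  # target table must already exist
    "--export-dir", "/data/aggregates/2023-01-01-00",  # HDFS directory to export
    "--input-fields-terminated-by", ",",
    "--num-mappers", "8",                            # parallel export tasks
]

subprocess.run(cmd, check=True)
```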