What are the options to integrate external data into Cloudera Impala?

My use case is that I have an API for accessing SAP ERP data, and I would like to integrate both the metadata (list of tables) and the actual data, so that Impala could join HDFS data with the data coming from my API. Ideally this would work much like PostgreSQL FDWs. I would like to hear opinions or even existing solutions for this problem.
All opinions, comments or thoughts would be highly appreciated.
Thanks a lot, David

Related

Structured data in Confluence

Is there a way to use structured data for Confluence pages, or even to decouple the data from the view and access only the data via an API?
Are there any solutions without using any plugins?
Thank you and kind regards

MongoDB + Google BigQuery - Normalize Data and Import to BQ

I've done quite a bit of searching, but haven't been able to find anything within this community that fits my problem.
I have a MongoDB collection that I would like to normalize and upload to Google BigQuery. Unfortunately, I don't even know where to start with this project.
What would be the best approach to normalize the data? From there, what is recommended when it comes to loading that data to BQ?
I realize I'm not giving much detail here... but any help would be appreciated. Please let me know if I can provide any additional information.
If you're using Python, an easy way is to read the collection in chunks and use pandas' to_gbq method. It's easy and quite fast to implement, but it would be better to have more details.
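A minimal sketch of that approach, assuming a local MongoDB instance with pymongo and pandas-gbq installed; the database, collection, and table names are placeholders:

```python
# Read a MongoDB collection in chunks and append each chunk to BigQuery.
# Assumes default GCP credentials are available; all names below are illustrative.
import pandas as pd
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
collection = client["mydb"]["mycollection"]

CHUNK_SIZE = 10_000
cursor = collection.find({}, {"_id": 0})  # drop the ObjectId for simplicity

chunk = []
for doc in cursor:
    chunk.append(doc)
    if len(chunk) == CHUNK_SIZE:
        # json_normalize flattens nested documents; "_" keeps column names BQ-friendly
        pd.json_normalize(chunk, sep="_").to_gbq(
            "my_dataset.my_table", project_id="my-project", if_exists="append"
        )
        chunk = []

if chunk:  # flush the remainder
    pd.json_normalize(chunk, sep="_").to_gbq(
        "my_dataset.my_table", project_id="my-project", if_exists="append"
    )
```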
In addition to the answer provided by SirJ, you have multiple options to load data into BigQuery, including loading from Cloud Storage, from your local machine, via Dataflow, and more, as mentioned here. Cloud Storage supports data in multiple formats such as CSV, JSON, Avro, Parquet and more. You also have various ways to load the data: the Web UI, the command line, the API, or the client libraries, which support C#, Go, Java, Node.js, PHP, Python and Ruby.
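For example, loading a newline-delimited JSON export from Cloud Storage with the Python client library could look roughly like this (bucket, dataset, and table names are placeholders):

```python
# Load a newline-delimited JSON file from Cloud Storage into BigQuery.
# Assumes google-cloud-bigquery is installed and default credentials are set up;
# the bucket, dataset, and table names are placeholders.
from google.cloud import bigquery

client = bigquery.Client()

job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.NEWLINE_DELIMITED_JSON,
    autodetect=True,  # let BigQuery infer the schema
)

load_job = client.load_table_from_uri(
    "gs://my-bucket/exports/collection.json",
    "my_dataset.my_table",
    job_config=job_config,
)
load_job.result()  # wait for the load job to finish
print(client.get_table("my_dataset.my_table").num_rows, "rows loaded")
```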

Creating Metadata Catalog in Marklogic

I am trying to combine data from multiple sources such as RDBMSs, XML files, and web services using MarkLogic. From the MarkLogic documentation on Metadata Catalog (https://www.marklogic.com/solutions/metadata-catalog/), Data Virtualization (https://www.marklogic.com/solutions/data-virtualization/) and Data Unification, this seems very well possible. But I am not able to find any documentation describing how exactly to go about it or which tools to use to achieve this.
Looking for some pointers.
As the second image in the data-virtualization link shows, you need to ingest all data into MarkLogic databases. MarkLogic can then sit in between as the single entry point for end-user applications that need access to that data.
The first link describes the capabilities of MarkLogic for holding all kinds of data. It partly does so by storing the data as-is, partly by extracting text and metadata for searching, and partly by conversion (if your needs go beyond what the original format allows).
MarkLogic provides the general-purpose MarkLogic Content Pump (MLCP) tool for this purpose. It allows ingesting zipped or unzipped files, and applying transformations if necessary. If you need to retrieve your data from a different database, you might need a bit more work to get it out. http://developer.marklogic.com holds tutorials, blogs, and tools that should help you get going. Searching the MarkLogic Mailing List through http://marklogic.markmail.org/ can provide answers as well.
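As a rough illustration of pushing a single document in, here is a minimal sketch against MarkLogic's REST API, assuming a REST app server on port 8000 with digest authentication; the host, credentials, and document URI are placeholders:

```python
# Insert one JSON document into MarkLogic via its REST API.
# Assumes a REST app server on port 8000 with digest authentication;
# host, credentials, collection, and the document URI are placeholders.
import requests
from requests.auth import HTTPDigestAuth

doc = {"source": "webservice-x", "customer": {"id": 42, "name": "ACME"}}

resp = requests.put(
    "http://localhost:8000/v1/documents",
    params={"uri": "/ingest/webservice-x/42.json", "collection": "webservice-x"},
    json=doc,
    auth=HTTPDigestAuth("admin", "admin"),
)
resp.raise_for_status()  # 201 on create, 204 on update
```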
HTH!
Combining a lot of data is a very broad topic. Can you describe a couple types of data you'd like to integrate, and what services or queries you would like to build on that data?

Is it viable to use Hadoop with MongoDB as Database rather than HDFS

I am researching Hadoop with MongoDB as the database rather than HDFS, so I need some guidance in terms of performance and usability.
My scenario
My data is
Tweets from Twitter
Facebook news feed
I can get the data from the Twitter and Facebook APIs. In order to process it with Hadoop, I need to store it.
So my question is: is it viable (or beneficial) to use Hadoop along with MongoDB to store social networking data like Twitter feeds, Facebook posts, etc.? Or is it better to go with HDFS and store the data in files? Any expert guidance will be appreciated. Thanks
It is totally viable to do that, but it mainly depends on your needs: basically, what do you want to do once you have the data?
That said, MongoDB is definitely a good option. It is good at storing unstructured, deeply nested documents, like the JSON in your case. You don't have to worry too much about nesting and relations in your data, nor about the schema. Schema-less storage is certainly a compelling reason to go with MongoDB.
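For instance, a tweet's nested JSON can be stored as-is and queried with dot notation, no upfront schema required. A minimal pymongo sketch, assuming a local MongoDB instance; database, collection, and field names are illustrative:

```python
# Store a nested tweet document as-is in MongoDB, with no schema defined upfront.
# Assumes a local MongoDB instance; all names and fields below are illustrative.
from pymongo import MongoClient

tweets = MongoClient("mongodb://localhost:27017")["social"]["tweets"]

tweets.insert_one({
    "id": 1234567890,
    "text": "Hello Hadoop",
    "user": {"id": 42, "screen_name": "example_user"},
    "entities": {"hashtags": ["hadoop", "mongodb"]},
})

# Query straight into the nested structure with dot notation.
for doc in tweets.find({"entities.hashtags": "hadoop"}):
    print(doc["user"]["screen_name"], doc["text"])
```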
On the other hand, I find HDFS more suitable for flat files, where you just have to pick the normalized data and start processing.
But these are just my thoughts; others might have a different opinion. My final suggestion would be to analyze your use case well and then decide on your store.
HTH

Relational database or NoSQL database

I'm about to implement a system which will need to receive a lot of calls per day and save them. It also needs to serve that information to internet users (something like a call center or a 911 dispatch system).
I have two questions:
1) I'm deciding between SQL Server, MongoDB, and Cassandra.
2) If it's SQL Server, I'm deciding whether to use an ORM like NHibernate or Entity Framework.
Any suggestions will be appreciated.
Thanks in advance. Daniel
I can't tell without more detailed requirements (e.g., volume). I'll bet any of them will "work". You should pick the one you know best, implement it, and get some data.
You don't need an ORM layer if you don't have objects, especially those with complex relationships.
Why don't you do both? For every write or update you make in Cassandra, fire a JMS message. The consumer of that JMS message can read from Cassandra and update your RDBMS (MySQL, Oracle, etc.).
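JMS is a Java messaging API; as a rough Python analogue of the same pattern, here is a sketch of a consumer that mirrors change events into an RDBMS, using RabbitMQ (pika) and SQLite. The broker, queue name, message format, and table layout are all assumptions:

```python
# Consume change events published on each Cassandra write and mirror them into
# an RDBMS. The answer above describes this with JMS (Java); this is a rough
# Python analogue using RabbitMQ (pika) and SQLite. Broker, queue name,
# message format, and table layout are assumptions for illustration only.
import json
import sqlite3

import pika

db = sqlite3.connect("calls.db")
db.execute("CREATE TABLE IF NOT EXISTS calls (id TEXT PRIMARY KEY, payload TEXT)")

def on_message(channel, method, properties, body):
    event = json.loads(body)  # e.g. {"id": "...", "payload": {...}}
    db.execute(
        "INSERT OR REPLACE INTO calls (id, payload) VALUES (?, ?)",
        (event["id"], json.dumps(event["payload"])),
    )
    db.commit()
    channel.basic_ack(delivery_tag=method.delivery_tag)

connection = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
channel = connection.channel()
channel.queue_declare(queue="call-updates", durable=True)
channel.basic_consume(queue="call-updates", on_message_callback=on_message)
channel.start_consuming()
```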