I've done quite a bit of searching, but haven't been able to find anything within this community that fits my problem.
I have a MongoDB collection that I would like to normalize and upload to Google BigQuery. Unfortunately, I don't even know where to start with this project.
What would be the best approach to normalize the data? From there, what is recommended when it comes to loading that data to BQ?
I realize I'm not giving much detail here... but any help would be appreciated. Please let me know if I can provide any additional information.
If you're using Python, an easy way is to read the collection in chunks and use pandas' to_gbq method. It's easy and quite fast to implement, but it would be better to have more details about your data.
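For illustration, here is a minimal sketch of that approach, assuming pymongo and pandas-gbq are installed; the connection string, database, collection, dataset, and project names are placeholders you would replace with your own.

```python
import pandas as pd
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")  # hypothetical connection string
collection = client["mydb"]["mycollection"]        # hypothetical database/collection

CHUNK_SIZE = 10_000
cursor = collection.find({}, {"_id": 0})  # drop the ObjectId so rows serialize cleanly

chunk = []
for doc in cursor:
    chunk.append(doc)
    if len(chunk) == CHUNK_SIZE:
        # json_normalize flattens nested documents into columns like "user.name",
        # which is one simple way to "normalize" before loading.
        pd.json_normalize(chunk).to_gbq(
            "my_dataset.my_table",        # hypothetical destination table
            project_id="my-gcp-project",  # hypothetical project
            if_exists="append",
        )
        chunk = []

if chunk:  # flush the final partial chunk
    pd.json_normalize(chunk).to_gbq(
        "my_dataset.my_table", project_id="my-gcp-project", if_exists="append"
    )
```

How you flatten nested documents (and whether you split them into separate tables) depends on how normalized you want the BigQuery schema to be.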
In addition to the answer provided by SirJ, you have multiple options to load data into BigQuery, including loading it from Cloud Storage, from your local machine, via Dataflow, and more, as mentioned here. Cloud Storage supports data in multiple formats such as CSV, JSON, Avro, Parquet and more. You also have various options to load the data: the Web UI, the command line, the API, or the client libraries, which support C#, Go, Java, Node.js, PHP, Python and Ruby.
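As a hedged sketch of the client-library route, assuming your documents have already been exported to Cloud Storage as newline-delimited JSON (the project, bucket, dataset, and table names below are placeholders):

```python
from google.cloud import bigquery

client = bigquery.Client(project="my-gcp-project")  # hypothetical project

job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.NEWLINE_DELIMITED_JSON,
    autodetect=True,  # let BigQuery infer the schema from the JSON
    write_disposition=bigquery.WriteDisposition.WRITE_APPEND,
)

load_job = client.load_table_from_uri(
    "gs://my-bucket/mongo_export/*.json",  # hypothetical Cloud Storage path
    "my_dataset.my_table",                 # hypothetical destination table
    job_config=job_config,
)
load_job.result()  # wait for the load job to finish
print(client.get_table("my_dataset.my_table").num_rows, "rows loaded")
```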
I've been using JetBrains DataGrip recently since I was able to get the whole suite for free. It's pretty nice, but I didn't notice any way to read GridFS with it. It seems like it should be common enough to have some sort of support, but I couldn't find any information online, and it's not immediately obvious from inside DataGrip.
We have created a feature request to implement GridFS support:
https://youtrack.jetbrains.com/issue/DBE-17458
What is the proper way to cache API results using Hive?
The current way I plan to implement it is using the request URL as the key and the returned data as the body.
Is there a more production-friendly way to do this? I can't find a tutorial, since most tutorials either abstract this away behind another package that handles it for them, or use a different package.
To cache REST API data, you can use Hive, which is a NoSQL database that is easy to use and faster to read from than shared_preferences and sqflite.
For more details, you can check this repo to understand it better:
https://github.com/shashiben/Anime-details
And you can read this article: https://medium.com/flutter-community/flutter-cache-with-hive-410c3283280c
The code is written cleanly and is structured using the Stacked architecture. Hope this answer is helpful to you.
I need to visualize some data from a PostgreSQL database in Kibana. I also have Elasticsearch installed, just in case. So how do I visualize data from PostgreSQL in Kibana? Of course, I don't need the whole database, only the data returned by a custom SQL query.
Also, I want it to be as simple as possible, I wouldn't like to use libraries I really don't need to use.
Kibana was built with Elasticsearch in mind.
Having used it quite a lot in a startup I worked for, I can tell you that even the front-end query DSL (built on Lucene) will only work with Elasticsearch (or might need some serious tweaks).
I would advise you to push your data into Elasticsearch, and work with Kibana the way it was designed to be used :)
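As a minimal sketch of that approach, assuming psycopg2 and the official elasticsearch Python client are installed (the connection details, SQL query, and index name are placeholders):

```python
import psycopg2
from elasticsearch import Elasticsearch, helpers

pg = psycopg2.connect("dbname=mydb user=me password=secret host=localhost")  # hypothetical DSN
es = Elasticsearch("http://localhost:9200")

cur = pg.cursor()
cur.execute("SELECT id, name, created_at FROM my_table")  # your custom SQL query
columns = [col[0] for col in cur.description]

def actions():
    # Turn each row into a bulk-index action for Elasticsearch.
    for row in cur:
        doc = dict(zip(columns, row))
        yield {"_index": "my_pg_data", "_id": doc["id"], "_source": doc}

helpers.bulk(es, actions())
```

Once the documents are indexed, you can point a Kibana index pattern (data view) at my_pg_data and build your visualizations there; you would re-run or schedule the script to keep the index in sync with Postgres.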
I am trying to combine data from multiple sources like RDBMS, XML files, and web services using MarkLogic. From the MarkLogic documentation on the Metadata Catalog (https://www.marklogic.com/solutions/metadata-catalog/), Data Virtualization (https://www.marklogic.com/solutions/data-virtualization/) and Data Unification, this appears to be very well possible. But I am not able to find any documentation describing how exactly to go about it or which tools to use to achieve this.
Looking for some pointers.
As the second image in the data-virtualization link shows, you need to ingest all data into MarkLogic databases. MarkLogic can then be put in between to become the single entry point for end user applications that need access to that data.
The first link describes the capabilities of MarkLogic to hold all kinds of data. It partly does so by storing them as-is, partly by extracting text and metadata for searching, and partly by conversion (if your needs go beyond what the original format allows).
MarkLogic provides the general-purpose MarkLogic Content Pump (MLCP) tool for this purpose. It allows ingesting zipped or unzipped files, and applying transformations if necessary. If you need to retrieve your data from a different database, you might need a bit more work to get it out. http://developer.marklogic.com holds tutorials, blogs, and tools that should help you get going. Searching the MarkLogic mailing list through http://marklogic.markmail.org/ can provide answers as well.
HTH!
Combining a lot of data is a very broad topic. Can you describe a couple types of data you'd like to integrate, and what services or queries you would like to build on that data?
I am doing research on Hadoop with MongoDB as the database instead of HDFS, so I need some guidance in terms of performance and usability.
My scenario
My data is
Tweets from Twitter
Facebook News Feed
I can get the data from the Twitter and Facebook APIs. To process it with Hadoop, I need to store it first.
So my question is: is it viable (or beneficial) to use Hadoop along with MongoDB to store social networking data like Twitter feeds, Facebook posts, etc.? Or is it better to go with HDFS and store the data in files? Any expert guidance will be appreciated. Thanks
It is totally viable to do that, but it mainly depends on your needs. Basically: what do you want to do once you have the data?
That said, MongoDB is definitely a good option. It is good at storing unstructured, deeply nested documents, like JSON in your case. You don't have to worry too much about nesting and relations in your data. You don't have to worry about the schema as well. Schema-less storage is certainly a compelling reason to go with MongoDB.
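To make the schema-less point concrete, here is a minimal sketch using pymongo; the connection string, database, collection, and the tweet fields themselves are illustrative placeholders rather than the exact Twitter API shape:

```python
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")  # hypothetical connection string
tweets = client["social"]["tweets"]                # hypothetical database/collection

# A raw tweet-like document can be stored as-is, nesting and all,
# without designing a schema up front.
tweet = {
    "id_str": "1234567890",
    "text": "Hello world",
    "user": {"screen_name": "someone", "followers_count": 42},
    "entities": {"hashtags": [{"text": "hadoop"}]},
}
tweets.insert_one(tweet)

# Nested fields remain queryable with dot notation.
print(tweets.find_one({"user.screen_name": "someone"}))
```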
On the other hand, I find HDFS more suitable for flat files, where you just have to pick the normalized data and start processing.
But these are just my thoughts; others might have a different opinion. My final suggestion would be to analyze your use case well and then decide on your store.
HTH