I am considering to store a stream of IoT data into the CloudantDB (it looks like a native choice with IoT and Node-Red) I'd like to get some reporting / visualization on top of the data.
Apparently the Bluemix integrated choice is the dashDB warehouse where I could import the data and build reports I need. As a downsize - the dashDB seems to be priced monthly per instance (where for the monthly fee looks reasonable while in production). I just hesitate to pay for a whole month just to try it out.
So - is there other option to get the Cloudant data for for reporting / visualization? (ok, I can still build an ETL to push the data into an SQL dataabase) Is there something available out-of-box?
Note:
For long term stats I could still define indexes or aggregate objects, but it's not that flexible
Related
I have two sources:
A csv that will be uploaded to a cloud storage service, probably GCP Cloud Storage.
The output of a scrapping process done with Python.
When a user updates 1) (the cloud stored file) an event should be triggered to execute 2) (the scrapping process) and then some transformation should take place in order to merge these two sources into one in a JSON format. Finally, the content of this JSON file should be stored in a DB of easy access and low cost. The files the user will update are of max 5MB and the updates will take place once weekly.
From what I've read, I can use GCP Cloud Functions to accomplish this whole process or I can use Dataflow too. I've even considered using both. I've also thought of using MongoDB to store the JSON objects of the two sources final merge.
Why should I use Cloud Functions, Dataflow or both? What are your thoughts on the DB? I'm open to different approaches. Thanks.
Regarding de use of Cloud Functions and Dataflow. In your case I will go for Cloud Functions as you don't have a big volume of data. Dataflow is more complex, more expensive and you will have to use Apache Beam. If you are confortable with python and having into consideration your scenario I will choose Cloud Functions. Easy, convenient...
To trigger a Cloud Functions when Cloud Storage object is updated you will have to configure the triggers. Pretty easy.
https://cloud.google.com/functions/docs/calling/storage
Regarding the DB. MongoDB is a good option but if you wanth something quick an inexpensive consider DataStore
As a managed service it will make your life easy with a lot of native integrations. Also it has a very interesting free tier.
I would like to automatically stream data from an external PostgreSQL database into a Google Cloud Platform BigQuery database in my GCP account. So far, I have seen that one can query external databases (MySQL or PostgreSQL) with the EXTERNAL_QUERY() function, e.g.:
https://cloud.google.com/bigquery/docs/cloud-sql-federated-queries
But for that to work, the database has to be in GCP Cloud SQL. I tried to see what options are there for streaming from the external PostgreSQL into a Cloud SQL PostgreSQL database, but I could only find information about replicating it in a one time copy, not streaming:
https://cloud.google.com/sql/docs/mysql/replication/replication-from-external
The reason why I want this streaming into BigQuery is that I am using Google Data Studio to create reports from the external PostgreSQL, which works great, but GDS can only accept SQL query parameters if it comes from a Google BigQuery database. E.g. if we have a table with 1M entries, and we want a Google Data Studio parameter to be added by the user, this will turn into a:
SELECT * from table WHERE id=#parameter;
which means that the query will be faster, and won't hit the 100K records limit in Google Data Studio.
What's the best way of creating a connection between an external PostgreSQL (read-only access) and Google BigQuery so that when querying via BigQuery, one gets the same live results as querying the external PostgreSQL?
Perhaps you missed the options stated on the google cloud user guide?
https://cloud.google.com/sql/docs/mysql/replication/replication-from-external#setup-replication
Notice in this section, it says:
"When you set up your replication settings, you can also decide whether the Cloud SQL replica should stay in-sync with the source database server after the initial import is complete. A replica that should stay in-sync is online. A replica that is only updated once, is offline."
I suspect online mode is what you are looking for.
What you are looking for will require some architecture design based on your needs and some coding. There isn't a feature to automatically sync your PostgreSQL database with BigQuery (apart from the EXTERNAL_QUERY() functionality that has some limitations - 1 connection per db - performance - total of connections - etc).
In case you are not looking for the data in real time, what you can do is with Airflow for instance, have a DAG to connect to all your DBs once per day (using KubernetesPodOperator for instance), extract the data (from past day) and loading it into BQ. A typical ETL process, but in this case more EL(T). You can run this process more often if you cannot wait one day for the previous day of data.
On the other hand, if streaming is what you are looking for, then I can think on a Dataflow Job. I guess you can connect using a JDBC connector.
In addition, depending on how you have your pipeline structure, it might be easier to implement (but harder to maintain) if at the same moment you write to your PostgreSQL DB, you also stream your data into BigQuery.
Not sure if you have tried this already, but instead of adding a parameter, if you add a dropdown filter based on a dimension, Data Studio will push that down to the underlying Postgres db in this form:
SELECT * from table WHERE id=$filter_value;
This should achieve the same results you want without going through BigQuery.
I am using the delta lake oss version 0.8.0.
Let's assume we calculated aggregated data and cubes using the raw data and saved the results in a gold table using delta lake.
My question is, is there a well known way to access these gold table data and deliver them to a web dashboard for example?
In my understanding, you need a running spark session to query a delta table.
So one possible solution could be to write a web api, which executes these spark queries.
Also you could write the gold results in a database like postgres to access it, but that seems just duplicating the data.
Is there a known best practice solution?
The real answer depends on your requirements regarding latency, number of requests per second, amount of data, deployment options (cloud/on-prem, where data located - HDFS/S3/...), etc. Possible approaches are:
Have the Spark running in the local mode inside your application - it may require a lot of memory, etc.
Run Thrift JDBC/ODBC server as a separate process, and access data via JDBC/ODBC
Read data directly using the Delta Standalone Reader library for JVM, or via delta-rs library that works with Rust/Python/Ruby
Just going through the concepts of Business inteligence for Relational Databases. There Present lots of Tools for Relational DB's.
I want to know is there any tool which is used to do BI for NOSQL(MongoDB) and if yes then which is more powerful.
I have heared about Nucleaon BI. But dont know how powerful it is and advantages above other tools
There are currently 3 major BI platforms for MongoDB ecosystem.
Jaspersoft :
The only BI server that can connect directly to MongoDB, leveraging the aggregation framework APIs, so that you can report on and analyze data in MongoDB without having to move the data through ETL to a relational database.
Pentaho :
Increase Data Value – With Pentaho, MongoDB data can be accessed, blended, visualized and reported in combination with any other data source for increased insight and operational analytics. Reduce Complexity – Reporting on data stored in MongoDB is simplified, increasing developer productivity with Pentaho’s automatic document sampling, drag and drop interface and schema generation. Accelerate Data Access and Querying– With no impact on throughput, this integration builds on the features and capabilities in MongoDB, such as the Aggregation Framework, Replication and Tag Sets.
JSON Analytics :
Native JSON handling – no mapping to dimensions and measures means very short up-and-running times and no changes when the structure of the data changes. Contrary to previous-generation BI tools, JSON Studio was built from the ground up for JSON and MongoDB and is not based on a connector that tries to map JSON data into columns.
Native usage of MongoDB’s aggregation framework under an easy to use UI means very fast response times, for the first time accessible to all types of users.
HTTP Gateway with parameters means power users can design reports and graphs that can be used by any user, used for building dashboards and used from within other applications.
Rich d3 visualization and exploratory analytics gives power users the perfect platform to understand and work with data.
Low cost.
Nucleon BI is also in the picture but not so popular.
I have used Jaspersoft and found it great for BI and reporting.
I have a cloud native app based on azure. The app uses azure table storage.
Due to a fantastic opportunity I have decided to also provide the app on-premises. So I have to replace the NoSql data provider... my question is: Which solution is more alike Azure Table Storage? Mongo? Raven? you name it!
What I intend is to migrate the code effortlessly, like migrating from SQL Azure to Sql Server 2012... no code change needed... but I know that theres no equivalent for table storage... so I intend to find the one that will reduce my TTM as much as possible...
MongoDB and Table Storage are not exactly swappable replacements for each other. One is key/value, the other is document. I compared the two in this answer.
There's no getting around the fact that Table Storage is Storage-as-a-Service and you only pay for quantity of data (plus a very small per-transaction cost), whereas to work with MongoDB, you'd either have to host it in your own VMs (which gives you plenty of storage room, but at the expense of VMs) or work with a hoster (such as MongoLab, which offers 500MB for free currently). Regardless, you'd have do do some code changes to work with MongoDB over Table Storage.
I'm not sure if there exists a key/value store equivalent to Table Storage that's locally-installable. No matter what you pick, you'll have modifications on your Azure-side solution if you swap out Table Storage.
Is it possible, for your on-premises solution, to provide a MongoDB backend that stays relatively simple? That is: Stick with a single index to substitute for rowkey, and then store your table entities as documents (avoiding sub-documents)? This would keep your data layout very similar. At that point, you could use things like Aggregation Framework for a bit of data processing, and not damage the overall layout style/schema of your data.
MongoDB would give you a consistent storage framework that you could use in-cloud and on-premises, and has good support for Windows Azure.