Data profiling on complete data in DataPrep - google-cloud-dataprep

I need to do data profiling on the complete dataset in Dataprep. I also want to confirm whether data profiling in Dataprep is done on sample data or on the complete data that we have.

Profiling in the UI is only done over the sample data which is loaded in the transformer interface.
During job execution, profiling is executed across the entire dataset.
See Overview of Profiling or Overview of Visual Profiling in the Dataprep documentation.

Related

design question: best way to aggregate data from several microservices and show in UI

We have a scenario where we need to aggregate data from several services and show it in a UI. When an agent logs in, we need to show the cases assigned to that agent. Case information needs to be aggregated from several microservices. There would be around 1K cases assigned to an agent at a time, and all of them need to be shown to the agent so that he can sort them based on certain case data.
What would be the best approach to show data in this scenario? Should we make API calls to several services for each case, then aggregate and show the results? Or are there better approaches to achieve this?
No. You certainly should not call multiple APIs to aggregate data at runtime. Even if you call the APIs in parallel, the latency will be huge.
You need to pre-aggregate the case details and cache them in a distributed caching system (e.g. Redis or memcached), using a streaming platform (e.g. Kafka) to feed the updates. Also store the pre-aggregated case details in a persistent database. Basically, it's a kind of materialized view.
Caching will enable you to serve the case details to the user without any noticeable latency, and streaming will keep the cache and DB aggregations updated in near real time. Storing the materialized view in a database saves you from keeping everything in memory: you can use an LRU cache, so only the recently used data stays in the cache. If you need to show case data that is not in the cache, you read it from the database and store it in the cache for future requests.
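Roughly, the streaming side of that could look like the sketch below (the case-updates topic, the Redis key layout, and the agentId header are just illustrative assumptions, not a prescribed schema), using the plain Kafka consumer and Jedis clients:

```scala
import java.time.Duration
import java.util.{Collections, Properties}
import scala.jdk.CollectionConverters._
import org.apache.kafka.clients.consumer.KafkaConsumer
import redis.clients.jedis.Jedis

object CaseViewUpdater {
  def main(args: Array[String]): Unit = {
    val props = new Properties()
    props.put("bootstrap.servers", "localhost:9092")
    props.put("group.id", "case-view-updater")
    props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")
    props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")

    val consumer = new KafkaConsumer[String, String](props)
    consumer.subscribe(Collections.singletonList("case-updates")) // hypothetical topic of case-change events

    val redis = new Jedis("localhost", 6379)

    while (true) {
      val records = consumer.poll(Duration.ofMillis(500))
      for (record <- records.asScala) {
        val caseId   = record.key()    // e.g. "42"
        val caseJson = record.value()  // pre-aggregated case document, already joined across services

        // Keep the materialized view fresh: overwrite the cached case document.
        // (The same document should also be upserted into the persistent store.)
        redis.set(s"case:$caseId", caseJson)

        // Maintain a per-agent index so the UI can fetch all ~1K cases for one agent in one call.
        // Assumes the producer sets an "agentId" header; purely illustrative.
        Option(record.headers().lastHeader("agentId")).foreach { h =>
          redis.sadd(s"agent:${new String(h.value())}:cases", caseId)
        }
      }
    }
  }
}
```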
I recommend you read these two Martin Kleppmann articles, here and here.

Visualize Real Time Data from MongoDB

Any suggestions as to what tools I should use to visualize data aggregated into MongoDB on my desktop? I am trying to visualize stock data that I've pulled from an API and would like to stream said data into MongoDB in real time. Any suggestions? Is Power BI a good go-to? Kinda lost here...
You can use the MongoDB BI Connector with Power BI's on-premises data gateway. But it won't give you real-time data; you will have to refresh your dataset periodically.
If you want real-time data visualization, you should try Power BI's real-time streaming functionality, using either the API option or PubNub.
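If you go the API route, you create a streaming dataset in Power BI and it gives you a push URL; you then POST JSON rows to it. A minimal sketch (the push URL and the row fields are placeholders that must match whatever schema you define when creating the dataset):

```scala
import java.net.URI
import java.net.http.{HttpClient, HttpRequest, HttpResponse}

object PowerBiPush {
  private val client = HttpClient.newHttpClient()

  /**
   * Push one stock tick to a Power BI streaming dataset.
   * `pushUrl` is the "Push URL" shown when you create the streaming dataset;
   * the fields in the JSON body must match the columns you defined there.
   */
  def pushTick(pushUrl: String, symbol: String, price: Double, timestamp: String): Unit = {
    val body = s"""[{"symbol": "$symbol", "price": $price, "timestamp": "$timestamp"}]"""
    val request = HttpRequest.newBuilder()
      .uri(URI.create(pushUrl))
      .header("Content-Type", "application/json")
      .POST(HttpRequest.BodyPublishers.ofString(body))
      .build()
    val response = client.send(request, HttpResponse.BodyHandlers.ofString())
    if (response.statusCode() / 100 != 2)
      println(s"Push failed: ${response.statusCode()} ${response.body()}")
  }
}
```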

Build internal dashboard analytics on top of mongodb

I want to build an internal dashboard to show the key metrics of a startup.
All data is stored in a MongoDB database on MongoLab (SaaS on top of AWS).
Queries that aggregate data from all documents take 1-10 minutes.
What is the best practice to cache such data and make it immediately available?
Should I run a worker thread once a day and store the result somewhere?
I want to build an internal dashboard to show the key metrics of a startup. All data is stored in a MongoDB database on MongoLab (SaaS on top of AWS). Queries that aggregate data from all documents take 1-10 minutes.
Generally users aren't happy to wait on the order of minutes to interact with dashboard metrics, so it is common to pre-aggregate into suitable formats to support more real-time interaction.
For example, with time series data you might want to present summaries with different granularity for charts (minute, hour, day, week, ...). The MongoDB manual includes some examples of Pre-Aggregated Reports using time-based metrics from web server access logs.
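A minimal sketch of that pre-aggregation pattern with the MongoDB Scala driver (collection and field names are made up for illustration; the idea is one document per page per day whose counters are bumped on every hit, so dashboard reads become a single document lookup):

```scala
import org.mongodb.scala._
import org.mongodb.scala.model.Filters.equal
import org.mongodb.scala.model.Updates.{combine, inc}
import org.mongodb.scala.model.UpdateOptions
import scala.concurrent.Await
import scala.concurrent.duration._

object PreAggregatedHits {
  private val client = MongoClient("mongodb://localhost:27017")
  private val stats  = client.getDatabase("analytics").getCollection("hits_per_day")

  /** Record one page hit against a pre-aggregated per-day document with hourly and per-minute buckets. */
  def recordHit(page: String, day: String, hour: Int, minute: Int): Unit = {
    val result = stats.updateOne(
      equal("_id", s"$page/$day"),             // one document per page per day
      combine(
        inc("total", 1),
        inc(s"hourly.$hour", 1),               // e.g. hourly.14
        inc(s"minute.$hour.$minute", 1)        // e.g. minute.14.5
      ),
      UpdateOptions().upsert(true)             // create the document on the first hit of the day
    )
    Await.result(result.toFuture(), 5.seconds) // block for simplicity in this sketch
  }
}
```

The dashboard then reads one small, already-aggregated document per chart instead of scanning every raw document at query time.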
What is the best practice to cache such data and make it immediately available?
Should I run a worker thread once a day and store the result somewhere?
How and when to pre-aggregate will depend on the source and interdependency of your metrics as well as your use case requirements. Since use cases vary wildly, the only general best practice that comes to mind is that a startup probably doesn't want to be investing too much developer time in building their own dashboard tool (unless that's a core part of the product offering).
There are plenty of dashboard frameworks and reporting/BI tools available in both open source and commercial flavours.

How can I integrate Apache Spark with the Play Framework to display predictions in real time?

I'm doing some testing with Apache Spark for my final project in college. I have a data set that I use to generate a decision tree and make some predictions on new data.
In the future, I plan to put this project into production, where I would generate a decision tree (batch processing), receive new data through a web interface or a mobile application, predict the class of that entry, and return the result to the user instantly. I would also keep storing these new entries so that, after a while, I can generate a new decision tree (batch processing) and repeat this process continuously.
Although Apache Spark is aimed at batch processing, there is a streaming API that allows you to receive real-time data. In my application this data will only be scored against a model built in a batch process (a decision tree), and since the prediction itself is quite fast, the user can get the answer quickly.
My question is what are the best ways to integrate Apache Spark with a web application (plan to use the Play Framework scala version)?
One of the issues you will run into with Spark is that it takes some time to start up and build a SparkContext. If you want to do Spark queries via web calls, it will not be practical to fire up spark-submit every time. Instead, you will want to turn your driver application (these terms will make more sense later) into an RPC server.
In my application I am embedding a web server (http4s) so I can do XmlHttpRequests in JavaScript to directly query my application, which will return JSON objects.
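Stripped down, the idea looks roughly like the sketch below (shown with the JDK's built-in HttpServer instead of http4s, and with a placeholder dataset path and query, just to illustrate the long-lived driver pattern):

```scala
import java.net.InetSocketAddress
import com.sun.net.httpserver.{HttpExchange, HttpServer}
import org.apache.spark.sql.SparkSession

object DriverRpcServer {
  def main(args: Array[String]): Unit = {
    // Build the SparkSession once, at startup, so web requests don't pay the start-up cost.
    val spark = SparkSession.builder()
      .appName("driver-rpc")
      .master("local[*]")                       // or point at your cluster
      .getOrCreate()

    // Placeholder data: load and cache whatever your queries need up front.
    val df = spark.read.parquet("/data/cases.parquet").cache()
    df.count()                                   // force the cache to be populated

    val server = HttpServer.create(new InetSocketAddress(8080), 0)
    server.createContext("/count", (exchange: HttpExchange) => {
      // Run a fast, pre-cached Spark query per request and return JSON.
      val n = df.count()
      val body = s"""{"count": $n}""".getBytes("UTF-8")
      exchange.getResponseHeaders.add("Content-Type", "application/json")
      exchange.sendResponseHeaders(200, body.length.toLong)
      exchange.getResponseBody.write(body)
      exchange.close()
    })
    server.start()
  }
}
```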
Spark is a fast, large-scale data processing platform. The key here is large-scale data. In most cases, the time to process that data will not be sufficiently fast to meet the expectations of your average web app user. It is far better practice to perform the processing offline and write the results of your Spark processing to, e.g., a database. Your web app can then efficiently retrieve those results by querying that database.
That being said, spark-jobserver provides a REST API for submitting Spark jobs.
Spark (< v1.6) uses Akka underneath. So does Play. You should be able to write a Spark action as an actor that communicates with a receiving actor in the Play system (that you also write).
You can let Akka worry about de/serialization, which will work as long as both systems have the same class definitions on their classpaths.
If you want to go further than that, you can write Akka Streams code that tees the data stream to your Play application.
Check this link out: you run Spark in local mode (on your web server), and the offline ML model is saved in S3 so you can load it from the web app, cache it just once, and keep a SparkContext running in local mode continuously.
https://commitlogs.com/2017/02/18/serve-spark-ml-model-using-play-framework-and-s3/
Another approach is to use Livy (REST API calls to Spark):
https://index.scala-lang.org/luqmansahaf/play-livy-module/play-livy/1.0?target=_2.11
The S3 option is the way forward, I guess; if the batch model changes, you need to refresh the website's cached model, which means a few minutes of downtime.
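A minimal sketch of that S3/local-mode setup (the S3 path is a placeholder, you also need the appropriate hadoop-aws/s3a configuration and credentials, and the features/prediction column names assume a standard Spark ML pipeline):

```scala
import org.apache.spark.ml.PipelineModel
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.sql.SparkSession

object ModelServing {
  // One local-mode SparkSession for the whole web app (start with the Play app, stop on shutdown).
  lazy val spark: SparkSession = SparkSession.builder()
    .appName("play-model-serving")
    .master("local[*]")
    .getOrCreate()

  // Load the offline-trained model from S3 once and keep it in memory;
  // re-load this reference when a new batch model is published.
  lazy val model: PipelineModel =
    PipelineModel.load("s3a://my-bucket/models/decision-tree-latest") // placeholder path

  /** Score one feature vector; a real app would build the DataFrame from the request payload. */
  def predict(features: Array[Double]): Double = {
    import spark.implicits._
    val df = Seq(Vectors.dense(features)).map(Tuple1.apply).toDF("features")
    model.transform(df).select("prediction").head().getDouble(0)
  }
}
```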
Look into these links:
https://github.com/openforce/spark-mllib-scala-play/blob/master/app/modules/SparkUtil.scala
https://github.com/openforce/spark-mllib-scala-play
Thanks
Sri

Using DB or XML Parsing for an iPhone App?

For a product catalogue app on iPhone, which approach is more efficient: using a SQLite DB, or directly parsing online XML without a DB?
Small amounts of data can be loaded as XML directly into memory, so XML would do just fine. For a large amount of data, a database would be a better option, though it costs some speed simply because it needs to read/write the data to storage. With iPhone apps and other mobile phone apps, the difference between memory and storage tends to be very small. Unfortunately, for an app to understand an XML file, it must load the XML into a DOM model, which eats up additional memory of about the size of the XML. Thus XML is not suitable for large amounts of data (or huge records).
If you have up to 50 products, the balance is in favor of XML. Over 50, and you're better off with SQLite.
An added bonus of XML is that you need to explicitly save back to storage to update your changes. With databases, any updates to the data tend to be written directly, so with a database it is a bit harder to undo any errors. However, with XML your changes will be lost if your application crashes. Personally, I prefer to only update data explicitly on my command, so I would prefer XML (but not for large amounts of data).
Add your products to SQLite and, at every launch, asynchronously update only the changed or newly added products in the DB.
Render your view from the data in the DB.