Deploy a predictive model created in R to a production server

I have created a random forest predictive model using R and saved it on my local machine. Now I want to deploy it to a server running Hive and use this model to predict against the full data set in my Hive data warehouse. How do I deploy this model, and how do I run it against the full Hive data set?
Can someone help me and share code to run predictions with the model I created against the full Hive data set?
I am also wondering, if I have 10 million rows, what kind of batch size should I use? Even a batch size of 100,000 would be too time consuming. How are others deploying to production and predicting over large data sets?
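One hedged sketch of the batched-scoring pattern the question describes, written in Python purely for illustration: the host, table, column names, and batch size are all hypothetical, and it assumes the forest can be loaded where the job runs (e.g. re-trained with scikit-learn); an R-only .rds model would instead be scored from R with the same chunked loop over a Hive ODBC/JDBC connection.

```python
import joblib
import pandas as pd
from pyhive import hive  # assumes PyHive is installed where the job runs

BATCH_SIZE = 100_000  # tune against memory and scoring speed

# Assumption: the forest was made loadable here (e.g. re-trained with
# scikit-learn and saved with joblib). An R-only .rds model would instead be
# scored from R using the same chunked loop.
model = joblib.load("random_forest.joblib")

COLUMNS = ["id", "feature_1", "feature_2", "feature_3"]  # hypothetical schema
conn = hive.Connection(host="hive-server", port=10000, database="warehouse")
cursor = conn.cursor()
cursor.execute("SELECT {} FROM transactions".format(", ".join(COLUMNS)))

while True:
    rows = cursor.fetchmany(BATCH_SIZE)  # stream the table in fixed-size batches
    if not rows:
        break
    batch = pd.DataFrame(rows, columns=COLUMNS)
    batch["prediction"] = model.predict(batch.drop(columns=["id"]))
    # append scores to a local file (or insert them into a Hive staging table)
    batch[["id", "prediction"]].to_csv("predictions.csv", mode="a", header=False, index=False)
```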

Related

Spark data pipeline initial load impact on production DB

I want to write a Spark pipeline to perform aggregation on my production DB data and then write the results back to the DB. My goal is for the pipeline to perform the aggregation without impacting the production DB while it runs, meaning I don't want users to experience lag or the DB to incur heavy IOPS while the aggregation is performed. For example, an equivalent aggregation query run directly as SQL would take a long time and also use up the RDS IOPS, resulting in users not being able to get data - that is what I am trying to avoid. A few questions:
How is data loaded into Spark (AWS Glue) in general? Is there query load on prod DB?
Is there a difference in using a custom SQL query vs custom Spark code to filter items initially (initial loading of data, e.g. load 30 days sales data)? For example, does using custom SQL query end up performing a query on the prod DB, resulting in large load on prod DB?
When writing data back to DB, does that incur load on DB as well?
I'm using a PostgreSQL database in case this matters.
How is data loaded into Spark (AWS Glue) in general? Is there query load on prod DB?
By default there will be a single partition in Glue into which the whole table is read. But you can do parallel reads by specifying the JDBC partitioning options (a partition column with lower/upper bounds and a number of partitions), and make sure to choose a column that will not hurt DB performance.
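For illustration, a minimal PySpark sketch of such a partitioned JDBC read (a Glue job's Spark session accepts the same options); the URL, table, bounds, and partition count are hypothetical.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("sales-aggregation").getOrCreate()

orders = (
    spark.read.format("jdbc")
    .option("url", "jdbc:postgresql://prod-db:5432/sales")
    .option("dbtable", "orders")
    .option("user", "readonly_user")
    .option("password", "secret")
    # split the read into 8 parallel range queries on an indexed numeric column
    .option("partitionColumn", "order_id")
    .option("lowerBound", "1")
    .option("upperBound", "10000000")
    .option("numPartitions", "8")
    .load()
)
```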
Is there a difference in using a custom SQL query vs custom Spark code to filter items initially (initial loading of data, e.g. load 30 days sales data)?
Yes, when you pass a query instead of a table you will only be reading its result from the DB, reducing the large network and IO transfer. This means you are delegating the calculation of the result to the DB engine; see the sketch below for how you can do it.
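A hedged sketch of that pushdown, reusing the Spark session from the read above: the filtering query is passed in place of the table name so the DB engine does the filtering and only the 30-day result set leaves the database. Table and column names are hypothetical.

```python
recent_sales = (
    spark.read.format("jdbc")
    .option("url", "jdbc:postgresql://prod-db:5432/sales")
    # Spark wraps this subquery, so only 30 days of rows travel over the network
    .option(
        "dbtable",
        "(SELECT * FROM orders WHERE order_date >= now() - interval '30 days') AS recent_orders",
    )
    .option("user", "readonly_user")
    .option("password", "secret")
    .load()
)
```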
For example, does using custom SQL query end up performing a query on the prod DB, resulting in large load on prod DB?
Yes, depending on the table size and query complexity this might affect DB performance; if you have a read replica then you can simply point the reads at that instead.
When writing data back to DB, does that incur load on DB as well?
Yes, it depends on how you are writing the result back to the DB. A moderate number of partitions is best, i.e. not too many and not too few, since each partition opens its own connection and writes in parallel.
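As a sketch of such a write with a bounded number of partitions (assuming `aggregated_df` is the DataFrame produced by the aggregation step; the target table name is hypothetical):

```python
(
    aggregated_df.repartition(4)              # 4 concurrent writers; tune for the DB
    .write.format("jdbc")
    .option("url", "jdbc:postgresql://prod-db:5432/sales")
    .option("dbtable", "daily_sales_summary")
    .option("user", "writer_user")
    .option("password", "secret")
    .option("batchsize", "10000")             # rows per JDBC insert batch
    .mode("append")
    .save()
)
```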

How to export data from firestore in a cost-effective manner for data analysis?

I am looking for a way to get the data in my company's Firestore database without it costing too many reads. It is a huge database and, currently, exporting it regularly is not an option due to cost. I need a reasonable sample size, if not all the content in the database, to run some analytical work using either Python or R.
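To frame the cost question, here is a minimal, hedged sketch of pulling a capped sample with the Firestore Python client; the collection name and sample size are hypothetical, and each returned document still counts as one read, so the limit bounds the spend.

```python
import pandas as pd
from google.cloud import firestore

SAMPLE_SIZE = 10_000  # hypothetical cap; each returned document costs one read

client = firestore.Client()  # authenticates via GOOGLE_APPLICATION_CREDENTIALS
docs = (
    client.collection("transactions")  # hypothetical collection name
    .limit(SAMPLE_SIZE)
    .stream()
)

sample = pd.DataFrame([doc.to_dict() for doc in docs])
sample.to_csv("firestore_sample.csv", index=False)
```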

How to access gold table in delta lake for web dashboards and other?

I am using the delta lake oss version 0.8.0.
Let's assume we calculated aggregated data and cubes using the raw data and saved the results in a gold table using delta lake.
My question is, is there a well known way to access these gold table data and deliver them to a web dashboard for example?
In my understanding, you need a running Spark session to query a Delta table.
So one possible solution could be to write a web API which executes these Spark queries.
You could also write the gold results to a database like Postgres to access them, but that seems like just duplicating the data.
Is there a known best practice solution?
The real answer depends on your requirements regarding latency, number of requests per second, amount of data, deployment options (cloud/on-prem, where the data is located - HDFS/S3/...), etc. Possible approaches are:
Have Spark running in local mode inside your application - it may require a lot of memory, etc.
Run Thrift JDBC/ODBC server as a separate process, and access data via JDBC/ODBC
Read the data directly using the Delta Standalone Reader library for the JVM, or via the delta-rs library that works with Rust/Python/Ruby (see the sketch below)
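For example, a minimal sketch of the last option using the delta-rs Python bindings (`pip install deltalake`); the table path is hypothetical and S3 credentials are assumed to come from the environment.

```python
from deltalake import DeltaTable

# open the gold table without any Spark session
gold = DeltaTable("s3://my-bucket/gold/sales_cube")
df = gold.to_pandas()  # small aggregated gold tables usually fit in memory

# a web API handler can now serve slices of `df` to the dashboard
print(df.head())
```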

Row based database or Column based database

We are working on an audit system where auditors are given access to transactions processed in the last quarter. Auditors perform various analyses on the data to find invalid/erroneous transactions that have some exceptions.
Generally, these analyses require the data to be presented on charts to view the outliers, or sometimes duplication detection is done based on multiple columns.
Sometimes the exception detection algorithms are pretty involved and require multiple processing steps using stored procedures.
Please note that the analysis rarely involves aggregation over huge numbers of rows.
Occasionally, they may change some data if they find it missing or incorrect.
We are evaluating row-based stores (SQL & NoSQL databases) and column stores (like data warehouse systems).
Is this a use case for a data warehouse or a row-based store, like NoSQL or some RDBMS?
In short, requirements are:
- Occasional updates
- Mostly read queries over the last 3 months of data
- Reading data may require several massaging steps, like creating a temp table in step 1, joining with another table in step 2, deleting some rows, etc.
Thanks
For your task, it does not really matter how the data is stored. You need to think instead about how to create a solid dimensional model, how to populate it with data properly, and what reporting tools to use.
To give you an example, here are a couple of common setups I've used in my projects:
Microsoft stack setup:
SQL Server for data storage
SSIS for data ETL (or write your own stored procedures if you know what you are doing)
Publish the dimensional model on the same SQL Server. If your data set is large (over a billion records), use SSAS Tabular instead
Power Pivot or Power BI for interactive reporting, or SSRS for paginated reports.
Open-source setup:
PostgreSQL for data storage
Use stored procedures and/or Python to process data (a small sketch follows this list)
Publish the dimensional model to another PostgreSQL database. If your data is large, publish the dimensional model to Redshift or another columnar database
Use Tableau or Power BI for interactive reporting, or build your own reporting interface.
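As a small, hedged illustration of the "stored procedures and/or Python" step in the open-source setup: this sketch reads the last quarter's transactions from the storage database, derives a simple fact table, and publishes it to the reporting database. Connection strings, schemas, and table names are hypothetical.

```python
import pandas as pd
from sqlalchemy import create_engine

# hypothetical connection strings for the storage DB and the reporting DB
source = create_engine("postgresql://etl_user:secret@storage-db/audit")
target = create_engine("postgresql://etl_user:secret@reporting-db/marts")

# pull last quarter's transactions and derive a simple fact table
tx = pd.read_sql(
    "SELECT account_id, posted_at::date AS posted_date, amount, status "
    "FROM transactions WHERE posted_at >= now() - interval '3 months'",
    source,
)
fact = (
    tx.groupby(["account_id", "posted_date", "status"], as_index=False)
      .agg(txn_count=("amount", "size"), total_amount=("amount", "sum"))
)
fact.to_sql("fact_transactions", target, schema="reporting", if_exists="replace", index=False)
```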
I think a NoSQL database is the wrong choice here because the audit will require highly structured data.

Adding new data to the neo4j graph database

I am importing a huge dataset of about 46K nodes into Neo4j using the import option. This dataset is dynamic, i.e. new entries keep getting added to it now and then, so re-running the entire import each time is a waste of resources. I tried using the Neo4j REST client for Python to send queries that create the new data points, but as the number of new data points increases, the time taken exceeds that of importing the 46K nodes. So is there any alternative way to add these data points, or do I have to redo the entire import?
First of all - 46k is rather tiny.
The easiest way to import data into Neo4j is using LOAD CSV together with PERIODIC COMMIT. http://neo4j.com/developer/guide-import-csv/ contains all the details.
Be sure to have indexes in place so the records that need to be changed by an incremental update can be found quickly.
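A hedged sketch of such an incremental load run through the official Neo4j Python driver; the label, property, CSV file, and credentials are hypothetical, and the `USING PERIODIC COMMIT` / `CREATE INDEX ON` syntax matches older Neo4j versions (newer releases use `CALL { ... } IN TRANSACTIONS` and `CREATE INDEX FOR ...`).

```python
from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

# run once, ahead of the loads, so the MERGE lookups below stay fast
INDEX_QUERY = "CREATE INDEX ON :Point(id)"

LOAD_QUERY = """
USING PERIODIC COMMIT 1000
LOAD CSV WITH HEADERS FROM 'file:///new_points.csv' AS row
MERGE (p:Point {id: row.id})   // only creates nodes that do not already exist
SET p.name = row.name
"""

with driver.session() as session:
    session.run(INDEX_QUERY)
    session.run(LOAD_QUERY)
driver.close()
```

With the index in place, each incremental CSV only touches the new or changed nodes instead of re-importing everything.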