Incrementally updating/adding data on HDFS - postgresql

In my application there are 4 tables and each table is having more than 1 million data.
currently my java based reporting engine joins all the tables and get the data to show in reports.
Now i want to introduce Hadoop using sqoop. I have installed hadoop 2.2 and sqoop 1.9.
I have done a small POC to import the data in hdfs. problem is that, every time it creates new data file.
My need is :
there would be a scheduler which will run once in day, and It will:
Pick the data from all four tables and load in hdfs using sqoop.
PIG will do some transformation and joining in data and will prepare the concrete de normalized data.
Sqoop will again export this data in a separate eporting table.
I have few questions around this:
Do i need to import whole data from DB to HDFS on every sqoop import call ?
in the master table some data is updated and some data in new so how can i handle that if i merge the data while loading in HDFS.
at the time of export do i need to export whole data again to reporting table. If Yes, how would i do that.
Please help me out in this case...
Please suggest me the better solution if you have..

Sqoop support incremental and delta imports. Check the Sqoop documentation here for more details.

Related

Incrementally loading into a Synapse table using Spark

I am creating a data warehouse using Azure Data Factory to extract data from a MySQL table and saving it in parquet format in an ADLS Gen 2 filesystem. From there, I use Synapse notebooks to process and load data into destination tables.
The initial load is fairly easy using spark.write.saveAsTable('orders') however, I am running into some issues doing incremental load following the intial load. In particular, I have not been able to find a way to reliably insert/update information into an existing Synapse table.
Since Spark does not allow DML operations on a table, I have resorted to reading the current table into a Spark DataFrame and inserting/updating records in that DataFrame. However, when I try to save that DataFrame using spark.write.saveAsTable('orders', mode='overwrite', format='parquet'), I run into a Cannot overwrite table 'orders' that is also being read from error.
A solution indicated by this suggests creating a temporary table and then inserting using that but that still resorts in the above error.
Another solution in this post suggests to write the data into a temporary table, drop the target table, and then rename the table but upon doing this, Spark gives me a FileNotFound errors regarding metadata.
I know Delta Tables can fix this issue pretty reliably but our company is not yet ready to move over to DataBricks.
All suggestions are greatly appreciated.

Lazy loading Azure Data Explorer data into a Databricks workspace

My company is implementing Azure Data Explorer (ADX) as a backend. They also want to use Databricks for Data Science projects including data exploration. I'm in charge of connecting Databricks to ADX.
I first tried Azure Kusto package.
from azure.kusto.data import KustoClient, KustoConnectionStringBuilder
from azure.kusto.data.exceptions import KustoServiceError
from azure.kusto.data.helpers import dataframe_from_result_table
import pandas as pd
...
df = dataframe_from_result_table(RESPONSE.primary_results[0])
Full steps here
Functionlly this works well.
But it loses completely the lazy loading feature of both ADX and Databricks-Spark.
I thought that because df is a just a Pandas dataframe, also if I try to convert this to a hive table it persists the data, which is not required, as we need fresh online data we don't want a local copy.
The next thing I've tried was to have this loaded in a spark dataframe. I've tried the following code (after installing the relevant libraries)
df = spark.read. \
format("com.microsoft.kusto.spark.datasource"). \
option("kustoCluster", KUSTO_URI). \
option("kustoDatabase",KUSTO_DATABASE). \
option("kustoQuery", "some_table_in_adx"). \
option("kustoAadAppId",CLIENT_ID). \
option("kustoAadAppSecret",CLIENT_SECRET). \
option("kustoAadAuthorityID", AAD_TENANT_ID).load()
which again loads the data into the spark dataframe without any issue.
However, performance wise it's far far away from the direct query in ADX. A count in a table of 600 thousands records is subseconds in ADX while it's more than 20 seconds in the Databricks notebook on a DS3_V2 14GB 4 cores.
Before even to consider a SaveAsTable or CreateOrReplaceTempView I wonder why I'm experiencing this performance issue. So my questions are :
Does this connection use the lazy loading (I know doing a count is not the right way to check that)?
If not is there any way to have lazy loading instea of loading the full table in dataframes before doing operations
what would happen if I create a hive table from the spark dataframe, will it copy the data or still have a virtual table pointing to ADX
Thanks for your help
For the Python part -
Pandas is not Spark dataframe therefore it's not lazy computed, to utilize these together you may use Spark parallelize.
For the Spark ADX connector -
This is indeed lazy loading. It is not evaluated until some evaluation method was requested - like the count.
If the count was done by Spark syntax i.e spark.read.kusto...count() then it would cause all the data to be first brought into spark and then operate count on it - so 20 seconds sounds legit, to compare with a query simply change the value of "kustoQuery" to "some_table_in_adx | count" which will lead to a count on the ADX side - uploading to spark just the final int result.
The connector offers simple Kusto query or distributed export command via the readMode option with ForceSingleMode to perform a simple Kusto query from driver node - as explained in the docs
As Hive table operated over dataframes - once you create a dataframe from Kusto read - it will also be lazy evaluated with table operations.

Extract data from MongoDB with Sqoop to write on HDFS?

I am concerning about extracting data from MongoDB where my application transact most of the data from MongoDB.
I have worked on sqoop to extract data and found RDBMS gel up with HDFS via sqoop. However, no clear direction found to extract data from NoSQL DB with sqoop to dump it over HDFS for big chunk of data processing?
Please share your suggestions and investigations.
I have extracted static information and data transactions from MySQL. Simply, used sqoop to store data in HDFS and processed the data. Now, I have some live transactions of 1million unique emailIDs per day which data modelled into MongoDB. I need to move data from mongoDB to HDFS for processing/ETL. How can I achieve this goal using Sqoop. I know I can schedule my task but what should be the best approach to take out data from mongoDB via sqoop.
Consider 5DN cluster with 2TB size. Data size varies from 1GB ~ 2GB in peak hours.
Sqoop is applied to import data only from relational databases. There are other ways to get data from mongo to Hadoop.
eg: https://docs.mongodb.com/ecosystem/tools/hadoop/
Or else you can use any data flow management tools like Nifi or Streamsets and get data from mongo in realtime.

What is the common practice to store users data and analysis it with Spark/hadoop?

I'm new to spark. I'm used to a Web developer, not familiar to big data.
That's say I have a portal website. user's behavior and action will store in 5 sharded mongoDB clusters.
How to I analyze it with spark ?
Or Spark can get the data from any databases directly (postgres/mongoDB/mysql/....)
Because most website may use Relational DB as back-end database.
Should I export whole data in the website's databases into HBase ?
I stored all the users log in postgreSQL, is it practical to export data into HBase or other Spark preffered databases ?
It seems it will make lots of duplicated data if I copy the data to a new database.
Does my big data model need other framework excepts Spark ?
For analyze the data in the website's databases,
I don't see the reasons that I need HDFS, Mesos, ...
How to make Spark workers can access the data in PostgreSQL databases ?
I only know how to read data from text file,
and saw some codes about how to load data from HDFS://
But I don't have HDFS system now , should I create one HDFS for my purpose ?
Spark is a distributed compute engine; so it expects to have files accessible from all nodes. Here are some choices you might consider
There seems to be Spark - MongoDB connector. This post explains how to get it working
Export the data out of MongoDB into Hadoop. And then use Spark to process the files. For this , you need to have a Hadoop cluster running
If you are on Amazon, you can put the files in S3 store and access from Spark

HDFS to PostgreSQL

We need a process in place to pull data from Hadoop Distributed File System (HDFS) to a relational DB (PostgreSQL) on a regular basis. We will need to transfer several million records per hour and I am looking for the best industry standards to move data out of HDFS. Does any one have any suggestions?
The idea is for a web app to interact with PostgreSQL which will have aggregated data.
Sqoop is built for the purpose of moving data between relational data stores and Hadoop. Specifically, you want sqoop-export.