Group by In HBase

I know almost nothing about HBase, so apologies for the basic questions.
Imagine I have a table of 100 billion rows with 10 int columns, one datetime column, and one string column.
Does HBase allow querying this table and grouping the result by a key (even a composite key)?
If so, does it have to run a map/reduce job to do it?
How do you feed it the query?
Can HBase in general perform real-time-like queries on a table?

Data aggregation in HBase intersects with the need for "real-time analytics". HBase is not built for this type of functionality, but there is a lot of demand for it, so a number of ways to do it exist or are being developed.
1) Register the HBase table as an external table in Hive and do the aggregations there. Data will be accessed via the HBase API, which is not that efficient. "Configuring Hive with HBase" is a discussion of how this can be done.
It is the most powerful way to group HBase data. It does imply running MR jobs, but they are run by Hive, not by HBase.
2) You can write your own MR job that works with the HBase data sitting in HFiles in HDFS. This is the most efficient way, but it is not simple, and the data you process will be somewhat stale. It is the most efficient because the data is not transferred via the HBase API; instead it is accessed directly from HDFS in a sequential manner.
3) The next version of HBase will contain coprocessors, which will be able to do aggregations inside specific regions. You can think of them as a kind of stored procedure from the RDBMS world.
4) An in-memory, inter-region MR job, parallelized within one node, is also planned for future HBase releases. It will enable somewhat more advanced analytical processing than coprocessors.

FAST RANDOM READS = PRE-PREPARED data sitting in HBase!
Use HBase for what it is...
1. A place to store a lot of data.
2. A place from which you can do super fast reads.
3. A place where SQL is not going to do you any good (use Java).
Although you can read data from HBase and compute all sorts of aggregates in Java data structures before you return the aggregated result, it's best to leave the computation to MapReduce. From your questions, it seems that you want the source data for the computation to sit in HBase. If so, the route to take is to use HBase as the data source for a MapReduce job, do the computations there, and return the aggregated data. But then again, why read from HBase to run a MapReduce job? Just leave the data sitting in HDFS/Hive tables, run MapReduce jobs on it, and THEN load the results into HBase tables "pre-prepared" so that you can do super fast random reads from them.
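The "aggregates in Java data structures" option mentioned above could look roughly like the sketch below, written in Scala for brevity. The `Row` case class, its fields, and the `ClientSideGroupBy` object are hypothetical stand-ins for rows that have already been fetched from an HBase scan; no HBase API is shown.

```scala
// Hypothetical shape of a row already read from an HBase scan (not an HBase type).
final case class Row(region: String, day: String, amount: Long)

object ClientSideGroupBy {
  // Group by a composite key (region, day) and sum `amount` per group,
  // consuming the rows one at a time so the input itself is never held in memory.
  def sumByKey(rows: Iterator[Row]): Map[(String, String), Long] =
    rows.foldLeft(Map.empty[(String, String), Long]) { (acc, r) =>
      val key = (r.region, r.day)
      acc.updated(key, acc.getOrElse(key, 0L) + r.amount)
    }
}
```

This only works while the number of distinct keys fits in memory on the client and a single process can scan the input in acceptable time, which is why the advice above is to leave large aggregations to MapReduce.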

Once you have the preaggregated data in HBase, you can use Crux http://github.com/sonalgoyal/crux to further drill, slice and dice your HBase data. Crux supports composite and simple keys, with advanced filters and group by.

Related

Is BigQuery suitable for frequent updates of partial data?

I'm on GCP, I have a use case where I want to ingest large-volume events streaming from remote machines.
To compose a final event - I need to ingest and "combine" event of type X, with events of types Y and Z.
event type X schema:
SrcPort
ProcessID
event type Y schema:
DstPort
ProcessID
event type Z schema:
ProcessID
ProcessName
I'm currently using Cloud SQL (PostgreSQL) to store most of my relational data.
I'm wondering whether I should use BigQuery for this use case, since I'm expecting a large volume of these kinds of events, and I may want to run analyses on this data in the future.
I'm also wondering about how to model these events.
What I care about is the "JOIN" between these events, So the "JOIN"ed event will be:
SrcPort, SrcProcessID, SrcProcessName, DstPort, DstProcessID, DstProcessName
When the "final event" is complete, I want to publish it to PubSub.
I can create a de-normalized table and just update partially upon event (how is BigQuery doing in terms of update performance?), and then publish to pubsub when complete.
Or, I can store these as raw events in separate "tables", and then JOIN periodically complete events, then publish to pubsub.
I'm not sure how good PostgreSQL is in terms of storing and handling a large volume of events.
The thing that attracted me with BigQuery is the comfort of handling large volume with ease.

If you already have this in Postgres, I advise you to see BigQuery as a complementary system that stores a duplicate of the data for analysis purposes.
BigQuery offers you different ways to reduce costs and improve query performance:
Read about Partitioning and Clustering. With these in mind, you "scan" only the partitions you need in order to perform the "event completion".
You can use scheduled queries to run MERGE statements periodically and maintain a materialized table (you can schedule this as often as you want).
You can use Materialized Views for some of these situations.

BigQuery works well with bulk imports and frequent inserts such as HTTP logging. Inserting into BigQuery in batches of ~100 or ~1000 rows every few seconds works well.
Your idea of creating a final view will definitely help. Storing data in BigQuery is cheaper than processing it, so it won't hurt to keep a raw copy of your data.
How you model or structure your events is up to you.

Saving JDBC db data as shared state Spark

I have an MSSQL table as a data source, and I would like to save a processing offset in the form of a timestamp (it is one of the table's columns), so that it is possible to resume processing from the latest offset. I would like to save it as some kind of shared state between Spark sessions. I have researched shared state in Spark sessions, but I did not find a way to store this offset there. Is it possible to use existing Spark constructs to perform this task?

As far as I know, there is no official built-in feature for passing data between sessions in Spark. As an alternative, I would consider the following options/suggestions:
First, the offset column must be an indexed field in MSSQL so that it can be queried quickly.
If an in-memory system (e.g. Redis, Apache Ignite) is already installed and used by your project, I would store the offset there.
I wouldn't use a message queue system such as Kafka, because once you consume the message you would need to resend it, so that wouldn't make sense.
As a solution, I would prefer to save it in the filesystem or in Hive, even though Hive adds extra overhead given that the table will hold only one value. With the filesystem, the performance would of course be much better.
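A minimal sketch of the filesystem option, assuming the offset is a single `java.sql.Timestamp` kept in a small text file on a local path that every Spark session's driver can reach. The `OffsetStore` object and the file layout are hypothetical; for HDFS, the Hadoop `FileSystem` API would be needed instead of `java.nio`.

```scala
import java.nio.charset.StandardCharsets.UTF_8
import java.nio.file.{Files, Path, StandardCopyOption}
import java.sql.Timestamp

object OffsetStore {
  // Read the last saved offset, or None if no offset file exists yet.
  def load(file: Path): Option[Timestamp] =
    if (Files.exists(file))
      Some(Timestamp.valueOf(new String(Files.readAllBytes(file), UTF_8).trim))
    else None

  // Write to a temporary file first, then move it over the previous offset file.
  def save(file: Path, offset: Timestamp): Unit = {
    val tmp = file.resolveSibling(file.getFileName.toString + ".tmp")
    Files.write(tmp, offset.toString.getBytes(UTF_8))
    Files.move(tmp, file, StandardCopyOption.REPLACE_EXISTING)
  }
}
```

A session would call `OffsetStore.load` before reading, use the value in the predicate on the timestamp column of its JDBC query, and call `OffsetStore.save` with the largest timestamp it processed once the batch has succeeded.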
Let me know if further information is needed

Can join operations be delegated to the database when using Spark SQL?

I am not an expert of Spark SQL API, nor of the underlying RDD one.
But, knowing of the Catalyst optimization engine, I would expect Spark to try and minimize in-memory effort.
This is my situation:
I have, let's say, two tables:
TABLE GenericOperation (ID, CommonFields...)
TABLE SpecificOperation (OperationID, SpecificFields...)
They are both quite huge (~500M, not big data, but unfeasible to have as a whole in memory in a standard application server)
That said, suppose I have to retrieve using Spark (part of a larger use case) all the SpecificOperation instances that match some particular condition on fields that belong to GenericOperation.
This is the code that I am using:
val gOps = spark.read.jdbc(db.connection, "GenericOperation", db.properties)
val sOps = spark.read.jdbc(db.connection, "SpecificOperation", db.properties)
val joined = sOps.join(gOps).where("ID = OperationID")
joined.where("CommonField= 'SomeValue'").select("SpecificField").show()
The problem is that when I run the above, I can see from SQL Profiler that Spark does not execute the join on the database, but rather retrieves all the OperationID values from SpecificOperation, and then, I assume, runs the whole merge in memory. Since no filter is applicable on SpecificOperation, such a retrieval would bring far too much data to the end system.
Is it possible to write the above so that the join is delegated directly to the DBMS?
Or does it depend on some magic configuration of Spark that I am not aware of?
Of course, I could simply hardcode the join as a subquery when retrieving, but that's not feasible in my case: statements have to be created at runtime starting from simple building blocks. Hence, I need to implement this starting from two spark.sql.DataFrame instances that are already built.
As a side note, I am running this with Spark 2.3.0 for Scala 2.11, against a SQL Server 2016 database instance.

Is it possible to write the above so that the join is delegated directly to the DBMS? Or does it depend on some magic configuration of Spark that I am not aware of?
Excluding statically generated queries (see "In Apache Spark 2.0.0, is it possible to fetch a query from an external database (rather than grab the whole table)?"), Spark doesn't support join pushdown. Only predicates and column selection can be delegated to the source.
There is no magic configuration or code that could support this type of processing.
In general, if the server can handle the join, the data is usually not large enough to benefit from Spark.
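The "statically generated queries" exception refers to passing a parenthesized subquery where a table name is expected, so that the database evaluates the join itself. A hedged sketch of composing such a subquery at runtime from simple building blocks follows. `JdbcSource` and `pushedDownJoin` are hypothetical helpers, not Spark API; the table and column names come from the question. It works from table descriptions, not from DataFrames that are already built, so it does not remove the limitation described above.

```scala
// Hypothetical description of a JDBC table: a building block known at runtime.
final case class JdbcSource(table: String, alias: String)

object JoinPushdown {
  // Build a derived-table expression that the database evaluates itself.
  // The result can be used wherever a JDBC table name is expected.
  def pushedDownJoin(
      left: JdbcSource,
      right: JdbcSource,
      on: String,
      where: Option[String],
      columns: Seq[String]): String = {
    val filter = where.map(w => s" WHERE $w").getOrElse("")
    s"(SELECT ${columns.mkString(", ")} " +
      s"FROM ${left.table} ${left.alias} " +
      s"JOIN ${right.table} ${right.alias} ON $on$filter) AS pushed"
  }
}

// Usage with the names from the question (db.connection and db.properties as there):
//   val table = JoinPushdown.pushedDownJoin(
//     JdbcSource("SpecificOperation", "s"), JdbcSource("GenericOperation", "g"),
//     on = "g.ID = s.OperationID",
//     where = Some("g.CommonField = 'SomeValue'"),
//     columns = Seq("s.SpecificField"))
//   spark.read.jdbc(db.connection, table, db.properties).show()
```

Because the SQL is assembled by string concatenation, the building blocks must come from trusted code rather than from user input.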

Is Hive on Tez with ORC really faster than Spark SQL for ETL?

I have little experience with Hive and am currently learning Spark with Scala. I am curious to know whether Hive on Tez is really faster than Spark SQL. I searched many forums for test results, but they compare older versions of Spark, and most of them were written in 2015. The main points are summarized below:
ORC will do the same as Parquet in Spark
The Tez engine will give performance comparable to the Spark engine
Joins are better/faster in Hive than in Spark
I feel like Hortonworks supports Hive more than Spark, and Cloudera vice versa.
Initially I thought Spark would be faster than anything else because of its in-memory execution. After reading some articles, I understood that Hive is also being improved with new concepts like Tez, ORC, LLAP, etc.
We are currently running PL/SQL on Oracle and migrating to big data because volumes are increasing. My requirements are ETL-style batch processing; the data involved in each weekly batch run is described below. The data will grow considerably soon.
Input/lookup data are in csv/text format and are loaded into tables
Two input tables which have 5 million rows and 30 columns
30 lookup tables are used to generate each column of the output table, which contains around 10 million rows and 220 columns.
Multiple joins are involved, both inner and left outer, since many lookup tables are used.
Please advise which of the methods below I should choose for better performance, readability, and ease of making minor column updates in future production deployments.
Method 1:
Hive on Tez with ORC tables
Python UDF thru TRANSFORM option
Joins with performance tuning like map join
Method 2:
SparkSQL with Parquet format which is converting from text/csv
Scala for UDF
I hope multiple inner and left outer joins can be performed in Spark

One way to implement a solution to your problem is as follows.
Spark looks like a good option to me for loading the data into the tables. You can read the tables from the Hive metastore, perform the incremental updates using some kind of window functions, and register the results in Hive. Since the data is populated from various lookup tables during ingestion, you can write that logic programmatically in Scala.
But at the end of the day, there needs to be a query engine that is very easy to use. Since your Spark program registers the tables with Hive, you can use Hive for that.
Hive supports three execution engines:
Spark
Tez
MapReduce
Tez is mature; Spark is evolving, with various commits from Facebook and the community.
Business users can understand Hive very easily as a query engine, as it is much more mature in the industry.
In short, use Spark for the daily data processing and register the results with Hive.
Create the business users in Hive.

HBase or Mongo for an Analytics DB if already using Hadoop?

I currently have a Hadoop cluster where I store tons of logs over which I run pig scripts for calculating aggregated analytics. I also have a Mongo cluster where I store production data.
I've recently been put in a position where I need to do a lot of one-off analytics queries, or enable others to do them. These queries frequently need to use both production data and log data together, so whatever I go with, I'd like to have everything in one place. My log data is in json and about 10x the size of my prod data. Here are the pros/cons of Mongo and HBase I'm seeing:
Mongo Pros/ HBase Cons:
Since log data is in JSON, I can get it into Mongo pretty easily, and I can do this in real time as it comes in through something like FluentD.
Most people I work with already have experience writing Mongo queries from needing to work with prod data, so getting an analytics db up on Mongo would be very simple for everyone to use.
I know much less about Hbase than Mongo.
No idea how easy/difficult it would be to get data in JSON or from Mongo into Hbase. I imagine this isn't so bad, but I don't see much documentation.
HBase Pros/Mongo Cons:
My log data is much bigger than my prod data, so storing it in both hadoop and mongo would be way more expensive than storing my prod data in both hadoop and mongo.
I can build HBase on top of my already running Hadoop cluster and fit my prod data in there without adding many extra machines. If I went with Mongo, I'd need a whole new Mongo cluster.
I could use Phoenix on top of HBase to allow a simple SQL syntax for accessing all our data, but I'm not sure how unwieldy this would be for multi-level document-based data.
I know very little about Hbase currently, and I wouldn't consider myself a Mongo expert, so I'm probably missing a lot.
So, what am I missing, and which is right for my situation?

First of all, you should use something you can already handle. Therefore, MongoDB seems a good choice, especially since the data is already in JSON format.
On the other hand, I used HBase for quite a while, and the read performance is amazing even with a lot of rows. I also don't know whether there is any good, fast integration of MongoDB with Hadoop.
HBase is the Hadoop database, so it is well suited to working together with Hadoop.
If the logs could be indexed (in the HBase rowkey) by:
producing_program_identifier, timestamp, ...
HBase could work quite well for this query pattern.
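To illustrate the rowkey idea, here is a rough sketch of a composite key built from the program identifier and a timestamp. The `LogRowKey` object, the `|` separator, and the 13-digit zero padding are assumptions made for the example; program identifiers are assumed not to contain the separator.

```scala
object LogRowKey {
  // Zero-pad the epoch-millisecond timestamp to a fixed width so that
  // lexicographic order of the keys matches chronological order.
  def rowKey(program: String, epochMillis: Long): String = {
    require(epochMillis >= 0, "timestamp must be non-negative")
    f"$program%s|$epochMillis%013d"
  }

  // Start (inclusive) and stop (exclusive) keys for one program and a time window.
  def scanRange(program: String, fromMillis: Long, untilMillis: Long): (String, String) =
    (rowKey(program, fromMillis), rowKey(program, untilMillis))
}
```

With keys laid out this way, all log lines of one program in a time window sit next to each other, so a single range scan between the two keys returns them without touching other programs' rows. Real HBase rowkeys are byte arrays, so the strings would still need to be converted (for example with the HBase client's `Bytes.toBytes`).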
But if you decide on HBase, use the Phoenix framework. It will save you time by offering familiar interfaces like JDBC and SQL-like queries. It also provides simple aggregation functions (count, avg, max, min), which may be sufficient.

From what you're saying, it seems a MongoDB-based solution would work best for you.
HBase is extremely versatile, and you can get it to serve both your prod needs and your analytics needs. However, the general-purpose SQL capabilities (in Phoenix, Cloudera's Impala and others) are in their infancy, and the standard HBase way to get high query performance (designing the data structure for reads) will take a lot of effort (especially since you don't have experience with HBase).
By the way, it may make sense for you to use map/reduce to pre-aggregate the data and then load it into MongoDB, thus making better use of your current setup rather than changing it either way.
