I have an MSSQL table as a data source and I would like to save some kind of the processing offset in the form of the timestamp (it is one of the table's columns). So it would be possible to process the data from the latest offset. I would like to save as some kind of shared state between Spark sessions. I have researched shared state in Spark session, however, I did not find the way to store this offset in the shared state. So is it possible to use existing Spark constructs to perform this task?
As far as I know there is no official built-in feature supporting passing data between sessions in Spark. As alternative I would consider the following options/suggestions:
First the offset column must be an indexed field in MSSQL in order to be able to query it fast.
If there is already an in-memory (i.e Redis, Apache Ignite) system installed and used by your project I would store there the offset.
I wouldn't use a message queue system such as Kafka because once you consume one message you will need to resend it therefore that would't make sense.
As solution I would prefer to save it in the filesystem or in Hive even if it would add extra overhead since you will have only one value in that table. In the case of the filesystem of course the performance would be much better.
Let me know if further information is needed
Related
I am using the delta lake oss version 0.8.0.
Let's assume we calculated aggregated data and cubes using the raw data and saved the results in a gold table using delta lake.
My question is, is there a well known way to access these gold table data and deliver them to a web dashboard for example?
In my understanding, you need a running spark session to query a delta table.
So one possible solution could be to write a web api, which executes these spark queries.
Also you could write the gold results in a database like postgres to access it, but that seems just duplicating the data.
Is there a known best practice solution?
The real answer depends on your requirements regarding latency, number of requests per second, amount of data, deployment options (cloud/on-prem, where data located - HDFS/S3/...), etc. Possible approaches are:
Have the Spark running in the local mode inside your application - it may require a lot of memory, etc.
Run Thrift JDBC/ODBC server as a separate process, and access data via JDBC/ODBC
Read data directly using the Delta Standalone Reader library for JVM, or via delta-rs library that works with Rust/Python/Ruby
In regards to: https://cloud.google.com/bigquery/streaming-data-into-bigquery#dataavailability
What's the best way to programatically determine if a table's data is available after streaming?
I am getting unexpected results trying to fetch Rows and TotalRows with the following apis: Jobs.Query, Jobs.GetQueryResults, Tables.Get, Tabledata.List
Thanks.
You can tell whether data is flushed on the table by executing the Tables.Get() API and looking at the streamingBuffer.oldestEntryTime value. This can be considered a high-water mark of data that has been flushed out of the buffer.
Any data before this timestamp should be available for copy, export, and list operations.
Also, I should clarify that data in the table is available for query immediately after streaming. It only is unavailable to table copy, export, and tabledata.list() operations. Yes, this is confusing, but yes, we're also working on addressing the problem.
For tables that haven't been streamed to before or recently, there is a warmup period where new streaming data won't show up.
See https://cloud.google.com/bigquery/streaming-data-into-bigquery#dataavailability for more information.
So a little background - we have a large number of data sources ranging from RDBMS's to S3 files. We would like to synchronize and integrate this data with other various data warehouses, databases, etc.
At first, this seemed like the canonical model for Kafka. We would like to stream the data changes through Kafka to the data output sources. In our test case we are capturing the changes with Oracle Golden Gate and successfully pushing the changes to a Kafka queue. However, pushing these changes through to the data output source has proven challenging.
I realize that this would work very well if we were just adding new data to the Kafka topics and queues. We could cache the changes and write the changes to the various data output sources. However this is not the case. We will be updating, deleting, modifying partitions, etc. The logic for handling this seems to be much more complicated.
We tried using staging tables and joins to update/delete the data but I feel that would become quite unwieldy quickly.
This comes to my question - are there any different approaches we could go about handling these operations? Or should we totally move in a different direction?
Any suggestions/help is much appreciated. Thank you!
There are 3 approaches you can take:
Full dump load
Incremental dump load
Binlog replication
Full dump load
Periodically, dump your RDBMS data source table into a file, and load that into the datawarehouse, replacing the previous version. This approach is mostly useful for small tables, but is very simple to implement, and supports updates and deletes to the data easily.
Incremental dump load
Periodically, get the records that changed since your last query, and send them to be loaded to the data warehouse. Something along the lines of
SELECT *
FROM my_table
WHERE last_update > #{last_import}
This approach is slightly more complex to implement, because you have to maintain the state ("last_import" in the snippet above), and it does not support deletes. It can be extended to support deletes, but that makes it more complicated. Another disadvantage of this approach that it requires your tables to have a last_update column.
Binlog replication
Write a program that continuously listens to the binlog of your RDBMS and sends these updates to be loaded to an intermediate table in the data warehouse, containing the updated values of the row, and whether it is a delete operation or update/create. Then write a query that periodically consolidates these updates to create a table that mirrors the original table. The idea behind this consolidation process is to select, for each id, the last (most advanced) version as seen in all the updates, or in the previous version of the consolidated table.
This approach is slightly more complex to implement, but allows achieving high performance even on large tables and supports updates and deletes.
Kafka is relevant to this approach in that it can be used as a pipeline for the row updates between the binlog listener and the loading to the data warehouse intermediate table.
You can read more about these different replication approaches in this blog post.
Disclosure: I work in Alooma (a co-worker wrote the blog post linked above, and we provide data-pipelines as a service, solving problems like this).
I have a legacy system that is capable of inserting updating data from its database to remote RDBMS (using jdbc driver) in real time. I cannot change the code since I don't have it.
We are thinking of moving this data to nosql data source like cassandra.
I am thinking of deploying postgres in the middle and pushing it to cassandra or writing it to flat file. Since there are frequent updates I will have to store the data in two database. Is there any ETL process which can listen to sql queries (insert,update,delete) and forward it to different source?
One option would be to use bottled water to capture changes in postgresql and a create a consumer that would apply those changes to e.g. cassandra.
I almost do not know anything about HBase. Sorry for basic questions.
Imagine I have a table of 100 billion rows with 10 int, one datetime, and one string column.
Does HBase allow querying this table and Group the result based on key (even a composite key)?
If so, does it have to run a map/reduce job to it?
How do you feed it the query?
Can HBase in general perform real-time like queries on a table?
Data aggregation in HBase intersects with the "real time analytics" need. While HBase is not built for this type of functionality there is a lot of need for it. So the number of ways to do so is / will be developed.
1) Register HBase table as an external table in Hive and do aggregations. Data will be accessed via HBase API what is not that efficient. Configuring Hive with Hbase this is a discussion about how it can be done.
It is the most powerful way to group by HBase data. It does imply running MR jobs but by Hive, not by HBase.
2) You can write you own MR job working with HBase data sitting in HFiles in the HDFS. It will be most efficient way, but not simple and data you processed would be somewhat stale. It is most efficient since data will not be transferred via HBase API - instead it will be accesses right from HDFS in sequential manner.
3) Next version of HBase will contain coprocessors which would be able to aggregations inside specific regions. You can assume them to be a kind of stored procedures in the RDBMS world.
4) In memory, Inter-region MR job which will be parralelized in one node is also planned in the future HBase releases. It will enable somewhat more advanced analytical processing then coprocessors.
FAST RANDOM READS = PREPREPARED data sitting in HBase!
Use Hbase for what it is...
1. A place to store a lot of data.
2. A place from which you can do super fast reads.
3. A place where SQL is not gonna do you any good (use java).
Although you can read data from HBase and do all sorts of aggregates right in Java data structures before you return your aggregated result, its best to leave the computation to mapreduce. From your questions, it seems as if you want the source data for computation to sit in HBase. If this is the case, the route you want to take is have HBase as the source data for a mapreduce job. Do computations on that and return the aggregated data. But then again, why would you read from Hbase to run a mapreduce job? Just leave the data sitting HDFS/ Hive tables and run mapreduce jobs on them THEN load the data to Hbase tables "pre-prepared" so that you can do super fast random reads from it.
Once you have the preaggregated data in HBase, you can use Crux http://github.com/sonalgoyal/crux to further drill, slice and dice your HBase data. Crux supports composite and simple keys, with advanced filters and group by.