I have a use case where I want to capture data changes (inserts/updates) in HBase tables that are being populated via Kafka.
I have tried the following approach, but it doesn't seem to work: HBase change data capture.
Is there any other way I can achieve this?
I have a use case where I am using Spring Batch and writing to three different data sources based on the job parameters. This mechanism works absolutely fine; the only problem is the metadata. Spring Batch uses the default DataSource to write its metadata, so whenever I run a job, the transactional data goes to the correct database, but the batch metadata always goes to the default one.
Is it possible to selectively write the metadata to the respective database as well, based on the job parameters?
#michaelMinella, #MahmoudBenHassine, can you please help?
This is my first experience with Druid.
I have a local Druid setup on my machine.
Now I'd like to run some query performance tests. My test data is a large local JSON file of 1.2 GB.
The idea was to load it into Druid and run the required SQL queries. The file is parsed and processed successfully (I'm using the Druid web-based UI to submit an ingestion task).
The problem I ran into is the datasource size. It doesn't make sense that 1.2 GB of raw JSON data results in a 35 MB datasource. Is there any limitation in a locally running Druid setup? I suspect the test data is only partially processed, but unfortunately I couldn't find any relevant config to change. I would appreciate it if someone could shed light on this.
Thanks in advance
With Druid, 80-90 percent compression is expected. I have seen a 2 GB CSV file reduced to a 200 MB Druid datasource.
Can you query the row count to make sure all the data was ingested? Also, please disable the approximate HyperLogLog algorithm to get an exact count: Druid SQL will switch to exact distinct counts if you set "useApproximateCountDistinct" to "false", either through the query context or through broker configuration (see http://druid.io/docs/latest/querying/sql.html).
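For example, here is a minimal sketch of checking the row count through the Druid SQL HTTP endpoint with that context flag set (it assumes the default local quickstart setup with the router on localhost:8888 and a datasource named my_datasource; adjust both to your environment):

    # Sketch: verify the ingested row count via the Druid SQL HTTP API.
    import json
    import requests

    DRUID_SQL_URL = "http://localhost:8888/druid/v2/sql/"  # default quickstart router

    payload = {
        "query": "SELECT COUNT(*) AS row_count FROM my_datasource",
        # Force exact (non-HyperLogLog) counts for COUNT(DISTINCT ...) queries.
        "context": {"useApproximateCountDistinct": "false"},
    }

    response = requests.post(DRUID_SQL_URL, json=payload)
    response.raise_for_status()
    print(json.dumps(response.json(), indent=2))

If the returned row_count is noticeably smaller than the number of records in the source file, some records were probably skipped during ingestion.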
You can also check the logs for exceptions and error messages. If Druid has a problem ingesting a particular JSON record, it skips that record.
I have an MSSQL table as a data source, and I would like to save some kind of processing offset in the form of a timestamp (it is one of the table's columns), so that it would be possible to process the data from the latest offset onward. I would like to keep this offset as some kind of shared state between Spark sessions. I have researched shared state in Spark sessions, but I did not find a way to store this offset there. So is it possible to use existing Spark constructs to perform this task?
As far as I know, there is no official built-in feature in Spark for passing data between sessions. As alternatives, I would consider the following options/suggestions:
First, the offset column should be an indexed field in MSSQL so that it can be queried fast.
If an in-memory store (e.g. Redis, Apache Ignite) is already installed and used by your project, I would store the offset there.
I wouldn't use a message queue system such as Kafka, because once you consume a message you would need to re-send it, so that wouldn't make sense.
As a solution, I would prefer to save it in the filesystem or in Hive, even if that adds extra overhead, since you will only have one value in that table. In the case of the filesystem, of course, the performance would be much better. A rough sketch of this option is at the end of this answer.
Let me know if further information is needed.
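Here is the sketch of the filesystem/Hive option in PySpark (the offset path, JDBC connection details, table and column names are made-up assumptions for illustration):

    # Sketch: persist the processing offset between Spark sessions.
    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    OFFSET_PATH = "/tmp/my_job/offset"  # could also be an HDFS/S3 path or a Hive table

    spark = SparkSession.builder.appName("offset-demo").getOrCreate()

    # Read the last saved offset, falling back to a minimal default on the first run.
    try:
        last_offset = spark.read.parquet(OFFSET_PATH).first()["last_ts"]
    except Exception:
        last_offset = "1900-01-01 00:00:00"

    # Pull only the rows newer than the saved offset; the filter is pushed down
    # to MSSQL, which is why the timestamp column should be indexed.
    df = (spark.read.format("jdbc")
          .option("url", "jdbc:sqlserver://host:1433;databaseName=mydb")
          .option("dbtable", f"(SELECT * FROM my_table WHERE ts > '{last_offset}') AS src")
          .option("user", "user")
          .option("password", "password")
          .load())

    # ... process df ...

    # Persist the new offset for the next session.
    new_offset = df.agg(F.max("ts").alias("last_ts"))
    if new_offset.first()["last_ts"] is not None:
        new_offset.write.mode("overwrite").parquet(OFFSET_PATH)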
I would like to know whether there is any solution for the following.
Problem:
I have a Cassandra database that continuously stores large-scale data from other sources. Application data is saved in PostgreSQL. For functionality reasons, I want to be able to query all the data from PostgreSQL, so I would like to keep the Cassandra data consistently synced to the PostgreSQL database as data arrives in Cassandra.
Is it possible?
Please suggest an approach.
I would like to save Cassandra data consistently to the PostgreSQL database based on data coming into Cassandra.
There are no special utilities for this. You will need to build your own service that gathers the data from Cassandra, processes it, and writes the results to PostgreSQL.
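A very rough sketch of what such a service might look like in Python (the keyspace, table and column names, connection details, and the timestamp-based polling condition are all illustrative assumptions, not a definitive design):

    # Sketch: poll Cassandra for new rows and upsert them into PostgreSQL.
    import psycopg2
    from cassandra.cluster import Cluster

    cassandra = Cluster(["127.0.0.1"]).connect("my_keyspace")
    pg = psycopg2.connect(host="127.0.0.1", dbname="mydb", user="user", password="password")

    def sync_once(last_seen_ts):
        """Copy rows newer than last_seen_ts from Cassandra to PostgreSQL."""
        # ALLOW FILTERING is only for illustration; a real service would query by
        # partition/clustering keys or use a change feed instead of a full scan.
        rows = cassandra.execute(
            "SELECT id, payload, created_at FROM events "
            "WHERE created_at > %s ALLOW FILTERING",
            (last_seen_ts,),
        )
        max_ts = last_seen_ts
        with pg, pg.cursor() as cur:
            for row in rows:
                cur.execute(
                    "INSERT INTO events (id, payload, created_at) VALUES (%s, %s, %s) "
                    "ON CONFLICT (id) DO UPDATE SET payload = EXCLUDED.payload",
                    (row.id, row.payload, row.created_at),
                )
                max_ts = max(max_ts, row.created_at)
        return max_ts

The service would call sync_once in a loop (or on a schedule), keeping the returned timestamp as the next starting point.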
Sure. You can use Change Data Capture (CDC) to copy the data from Cassandra to PostgreSQL as and when there is a change in the Cassandra data. One option is to use Kafka Connect with appropriate connectors.
I have a Spark job that reads from an Oracle table into a DataFrame. The way the JDBC read seems to work is to pull an entire table in at once, so I constructed a spark-submit job to work in batch mode. Whenever I have data that needs to be manipulated, I put it in a table and run the spark-submit job.
However, I would like this to be more event driven... essentially I want it so that any time data is moved into this table, it is run through Spark; that way events in a UI can drive these insertions while Spark is just running. I was thinking about using a Spark streaming context simply to have it watching and operating on the table all the time, but with a long wait between batches. This way I can use the results (which are also partly written back to Oracle) to trigger deletion of the rows already read, so that no data is processed more than once.
Is this a bad idea? Will it work? It seems more elegant than using a cron job.
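For reference, the kind of long-interval polling loop I have in mind looks roughly like this (the JDBC URL, staging/results table names, and the cleanup step are simplified assumptions):

    # Sketch of the polling-style loop described above (plain batch reads, not Structured Streaming).
    import time
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("oracle-poller").getOrCreate()

    JDBC_URL = "jdbc:oracle:thin:@//dbhost:1521/service"
    PROPS = {"user": "user", "password": "password", "driver": "oracle.jdbc.OracleDriver"}

    while True:
        # Pull whatever is currently sitting in the staging table.
        df = spark.read.jdbc(JDBC_URL, "staging_table", properties=PROPS)

        if df.head(1):  # only do work when new rows have arrived
            result = df  # ... transformations go here ...
            # Write results back to Oracle; a downstream step can then delete the
            # processed rows from staging_table so they are not read again.
            result.write.jdbc(JDBC_URL, "results_table", mode="append", properties=PROPS)

        time.sleep(300)  # long wait between polls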