How to write integration tests that depend on Druid? (Scala)

I am coding an application that generates reports from a Druid database. My integration tests need to read data from that database.
My current approach involves creating synthetic data for each test. However, I am unable to remove the created data from the database (either by deleting entries or by dropping the schema entirely). I tried disabling the segment and firing a kill task, but I still get the data back in query results.
I think that either I am completely wrong with my approach or there is a way to delete information from the database that I haven't been able to find.

You can do this with one of the two approaches below.
Approach 1:
Disable the segment (set used=0)
Fire a kill task for that segment
Set up the load and drop rules
See http://druid.io/docs/latest/ingestion/tasks.html (look for "destroying segments")
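The two calls above can be sketched as follows. This is a minimal sketch, assuming the standard coordinator/overlord HTTP endpoints from the Druid docs; the base URLs, datasource name, and interval below are hypothetical placeholders, and the actual HTTP requests (DELETE to the coordinator, POST of the JSON payload to the overlord at /druid/indexer/v1/task) are left to your HTTP client.

```python
import json

def disable_segment_url(coordinator, datasource, segment_id):
    """URL for the coordinator call that disables (used=0) one segment.
    Issued as an HTTP DELETE."""
    return (f"{coordinator}/druid/coordinator/v1/datasources/"
            f"{datasource}/segments/{segment_id}")

def kill_task(datasource, interval):
    """Payload for a kill task that permanently removes disabled segments
    within the given interval (e.g. '2019-01-01/2019-02-01')."""
    return {"type": "kill", "dataSource": datasource, "interval": interval}

# Hypothetical datasource/interval; POST this JSON to the overlord.
payload = json.dumps(kill_task("test_reports", "2019-01-01/2019-02-01"))
```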
Approach 2 (prefer this for running integration tests before setting up production):
Stop the coordinator node and delete all entries in the druid_segments table in the metadata store
Stop the historical node and delete everything inside the directory pointed to by druid.segmentCache.locations on the historical node
Start the coordinator and historical nodes again
Remember: this will delete everything from the Druid cluster.
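The wipe steps above can be sketched like this. This is an illustrative sketch only: it uses sqlite3 as a stand-in for whatever metadata store you actually run, and the function names are my own; in a real cluster you would run the DELETE against your metadata database and clear the cache directory configured in druid.segmentCache.locations, with the relevant nodes stopped.

```python
import os
import shutil
import sqlite3

def wipe_metadata_segments(conn):
    """Delete every row from druid_segments in the metadata store.
    (Run only while the coordinator is stopped.)"""
    conn.execute("DELETE FROM druid_segments")
    conn.commit()

def wipe_segment_cache(cache_dir):
    """Remove everything under druid.segmentCache.locations.
    (Run only while the historical node is stopped.)"""
    for entry in os.listdir(cache_dir):
        path = os.path.join(cache_dir, entry)
        if os.path.isdir(path):
            shutil.rmtree(path)
        else:
            os.remove(path)
```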

In the end I worked around the issue by inserting data into Druid with ids specific to each unit test and querying only for those ids.
Not very elegant, since one misbehaving test can (potentially) mess with the results of another test.

Related

Is there an equivalent of pg_backend_pid in Cassandra?

I am starting to use Cassandra and I need to work with several sessions without creating different roles. I am trying to implement a record that saves the session ID on each modification (i.e., an audit log). This was previously implemented in PostgreSQL, where I learned about triggers, so I am adapting to Cassandra's triggers. So far I cannot find a way to track a CQL session/connection that doesn't involve an external process, but that rules out the use of triggers.
Cassandra can enable or disable tracing with the TRACING command, which creates traces for all the queries in that session. A more useful approach is nodetool settraceprobability, where you can set the percentage of queries that are traced.
All those traces are kept in a separate keyspace; for 3.x this is system_traces, and the traces are kept with a time-to-live (TTL) of 24 hours.
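The settraceprobability mechanism is just per-query probabilistic sampling; a minimal sketch of how such a sampling decision works (not Cassandra's actual implementation, and the function name is my own):

```python
import random

def should_trace(probability, rng=random):
    """Decide whether to trace one query, given a sampling probability
    between 0.0 (trace nothing) and 1.0 (trace everything)."""
    return rng.random() < probability
```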

Quartz JDBC Job Store - Maintenance/Cleanup

I am currently in the process of setting up Quartz in a load-balanced environment using the JDBC job store, and I am wondering how everyone manages the Quartz job store DB.
For me, Quartz (2.2.0) will be deployed as part of a versioned application, with multiple versions potentially existing on the same server at the same time. I am using the notation XXScheduler_v1 to ensure multiple schedulers play nicely together. My code is working fine, with the Quartz tables being populated with the triggers/jobs/etc. as appropriate.
One thing I have noticed, though, is that no database cleanup occurs when the application is undeployed: the job/scheduler data stays in the Quartz database even though there is no longer an active scheduler.
This is less than ideal, and I can imagine that with my model the database would grow larger than it needs to over time. Am I missing how to hook up some clean-up process? Or does Quartz expect us to do the DB cleanup manually?
Cheers!
I ran into this issue once, and here is what I did to rectify it. It should work, but in case it does not, we will have a backup of the tables, so you have nothing to lose by trying it.
Take an SQL dump of the following tables using the method described in: Taking backup of single table
a) QRTZ_CRON_TRIGGERS
b) QRTZ_SIMPLE_TRIGGERS
c) QRTZ_TRIGGERS
d) QRTZ_JOB_DETAILS
Delete data from the above tables in this sequence:
delete from QRTZ_CRON_TRIGGERS;
delete from QRTZ_SIMPLE_TRIGGERS;
delete from QRTZ_TRIGGERS;
delete from QRTZ_JOB_DETAILS;
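Running the deletes in exactly this order, and inside one transaction, can be sketched as below. This uses sqlite3 as a stand-in for your actual Quartz database, and the function name is my own; the point is that child trigger tables are emptied before QRTZ_JOB_DETAILS, so foreign-key constraints are never violated mid-way.

```python
import sqlite3

# Child tables first, QRTZ_JOB_DETAILS last.
CLEANUP_ORDER = [
    "QRTZ_CRON_TRIGGERS",
    "QRTZ_SIMPLE_TRIGGERS",
    "QRTZ_TRIGGERS",
    "QRTZ_JOB_DETAILS",
]

def clean_quartz_tables(conn):
    """Run the deletes as one transaction: either all four tables
    are emptied or none are."""
    with conn:
        for table in CLEANUP_ORDER:
            conn.execute(f"DELETE FROM {table}")
```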
Restart your app, which will then freshly insert all the deleted tasks and related entries into the above tables (provided your app has its logic right).
This is like starting your app with all tasks being scheduled for the first time, so keep in mind that the tasks will behave as if they were freshly inserted.
NOTE: If this does not work, restore the backup you took of the tables and debug more closely. As of now, I have not seen this method fail.
Quartz definitely does not do any DB cleanup when you undeploy the application or shut down the scheduler. You would have to run some cleanup code during application shutdown (i.e., build some sort of startup servlet or context listener that does the cleanup in its destroy() lifecycle event).
You're not missing anything.
However, these Quartz tables are no different from any application DB objects in your data model. You add an Employees table, and in a later version you don't need it anymore. Who is responsible for deleting the old table? Only you. If you have a DBA, you might roll it onto the DBA ;).
This kind of maintenance would typically be done using an uninstall script / wizard, upgrade script / wizard, or during the first startup of the application in its new version.
On a side note, typically different applications use different databases, or different schemas for the least, thus reducing inter-dependencies.
To fully clean Quartz Scheduler's internal data, you need more SQL:
delete from QRTZ_CRON_TRIGGERS;
delete from QRTZ_SIMPLE_TRIGGERS;
delete from QRTZ_TRIGGERS;
delete from QRTZ_JOB_DETAILS;
delete from QRTZ_FIRED_TRIGGERS;
delete from QRTZ_LOCKS;
delete from QRTZ_SCHEDULER_STATE;

Lock table to handle concurrency in Entity Framework 3.5

We have a web service that accepts an XML file for any faults that occur on a vehicle. The web service then uses EF 3.5 to load these files into a hyper-normalized database. Typically an XML file is processed in 10-20 seconds. There are two concurrency scenarios that I need to handle:
Different vehicles sending XML files at the same time: This isn't a problem. EF's default optimistic concurrency ensures that I am able to store all these files in the same tables as their data is mutually exclusive.
Same vehicle sending multiple files at the same time: This creates a problem as my system tries to write same or similar data to the database simultaneously. And this isn't rare.
We needed a solution for point 2.
To solve this I introduced a lock table. Basically, I insert a concatenated vehicle id and fault timestamp (which is the same for the multiple files a vehicle sends for the same fault) into this table when I start writing to the DB, and I delete the record once I am done. However, quite often both files try to insert this row simultaneously; in such cases, one file succeeds while the other throws a duplicate-key exception that propagates to the caller of the web service.
What's the best way to handle such scenarios? I wouldn't like to rollback anything from the db as there are many tables involved for a single file.
And what solution do you expect? Your current approach with a lock table is exactly what you need. If the exception is thrown because of a duplicate, you can either wait and try again later, or return a typed fault to the client and let it upload the file later. Both solutions are ugly, but that is what your application currently offers.
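The "wait and retry" option can be sketched as below. This is a sketch with assumed names (file_locks table, lock_key column) using sqlite3 in place of your actual database; the key idea is catching the duplicate-key error and backing off instead of surfacing it to the web-service caller.

```python
import sqlite3
import time

def acquire_vehicle_lock(conn, lock_key, retries=3, delay=0.05):
    """Try to insert the lock row; on a duplicate-key error, back off
    and retry instead of propagating the exception."""
    for _ in range(retries):
        try:
            with conn:
                conn.execute(
                    "INSERT INTO file_locks (lock_key) VALUES (?)",
                    (lock_key,))
            return True
        except sqlite3.IntegrityError:  # duplicate key: another file holds it
            time.sleep(delay)
    return False

def release_vehicle_lock(conn, lock_key):
    """Remove the lock row once the file has been fully written."""
    with conn:
        conn.execute("DELETE FROM file_locks WHERE lock_key = ?", (lock_key,))
```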
A better solution would be to replace the current web service with one where the call only adds a job to a queue, and a background process works through those jobs, ensuring that two files for the same car are never processed concurrently. This would also offer much better throughput control in peak situations. The disadvantage is that you must implement some notification that the file has been processed, because processing will no longer happen online.
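The queue-per-vehicle idea can be sketched as below. This is an in-memory sketch with invented names, not a production queue: the web-service handler would call enqueue and return immediately, while a background worker drains one vehicle's jobs strictly in order, so two files for the same vehicle never run concurrently.

```python
from collections import defaultdict, deque

class VehicleJobQueue:
    """Serialize file-processing jobs per vehicle id."""

    def __init__(self):
        self._queues = defaultdict(deque)

    def enqueue(self, vehicle_id, job):
        """Called by the web service: just record the job and return."""
        self._queues[vehicle_id].append(job)

    def drain(self, vehicle_id, process):
        """Called by the background worker: run this vehicle's jobs
        one at a time, in arrival order."""
        queue = self._queues[vehicle_id]
        while queue:
            process(queue.popleft())
```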

What happens to my dataset in case of unexpected failure?

I know this has been asked here before, but my question is slightly different. Since the DataSet was designed with the disconnected principle in mind, what feature was provided to handle unexpected termination of the application, say a power failure, a Windows hang, or a system exception leading to a restart? Say the user has entered some 100 rows and they exist only in the DataSet; usually the DataSet is flushed to the database at application close or at timed intervals.
In the old days, programming with VB 6.0, all interaction took place directly with the database, so each successful transaction committed itself automatically. How can that be done using DataSets?
DataSets are never meant for direct access to the database; they are a disconnected model only. There is no intent for them to be able to recover from machine failures.
If you want to work live against the database you need to use DataReaders and issue DbCommands against the database live for changes. This of course will increase your load on the database server though.
You have to balance the two for most applications. If you know a user just entered vital data as a new row, execute an insert command to the database, and put a copy in your local cached DataSet. Then your local queries can run against the disconnected data, and inserts are stored immediately.
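The write-through pattern described above can be sketched as below. This is a sketch, not the DataSet API: it uses sqlite3 and a plain list as a stand-in for the disconnected cache, and the class and table names are invented. Vital rows go to the database immediately and are mirrored locally so disconnected queries still see them.

```python
import sqlite3

class WriteThroughStore:
    """Insert vital rows into the database immediately, and mirror
    them in a local cache that stands in for the DataSet."""

    def __init__(self, conn):
        self.conn = conn
        self.cache = []

    def insert_row(self, name):
        # Execute the insert command against the live database first...
        with self.conn:
            self.conn.execute("INSERT INTO rows (name) VALUES (?)", (name,))
        # ...then keep a copy in the local, disconnected cache.
        self.cache.append({"name": name})
```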
A DataSet can be serialized very easily, so you could implement your own regular backup to disk by using serialization of the DataSet to the filesystem. This will give you some protection, but you will have to write your own code to check for any data that your application may have saved to disk previously and so on...
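The serialize-to-disk backup idea can be sketched as below. This is a sketch of the general technique, assuming JSON-serializable data and invented function names; a real DataSet would use its own serialization, but the startup check for a leftover snapshot is the same.

```python
import json
import os

def snapshot(dataset, path):
    """Serialize the in-memory data to disk as a crash backup."""
    with open(path, "w") as f:
        json.dump(dataset, f)

def restore_if_present(path):
    """On startup, load any snapshot a previous run left behind,
    or return None if there is nothing to recover."""
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f)
    return None
```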
You could also ignore DataSets and use SqlDataReaders and SqlCommands for the same sort of 'direct access to the database' you are describing.

Database last updated?

I'm working with SQL 2000 and I need to determine which of these databases are actually being used.
Is there an SQL script I can use to tell me the last time a database was updated? Read? Etc.?
I Googled it, but came up empty.
Edit: the following targets the issue of finding, post facto, the last access date. As for figuring out who is using which databases, this can be definitively monitored with the right filters in the SQL Profiler. Beware, however, that profiler traces can get quite big (and hence slow/hard to analyze) when the filters are not adequate.
Changes to the database schema, i.e. addition of table, columns, triggers and other such objects typically leaves "dated" tracks in the system tables/views (can provide more detail about that if need be).
However, and unless the data itself includes timestamps of sorts, there are typically very few sure-fire ways of knowing when data was changed, unless the recovery model involves keeping all such changes to the Log. In that case you need some tools to "decompile" the log data...
With regards to detecting "read" activity... A tough one. There may be some computer-forensic like tricks, but again, no easy solution I'm afraid (beyond the ability to see in server activity the very last query for all still active connections; obviously a very transient thing ;-) )
I typically run the profiler if I suspect the database is actually used. If there is no activity, then simply set it to read-only or offline.
You can use a transaction log reader to check when data in a database was last modified.
With SQL 2000, I do not know of a way to know when the data was read.
What you can do is to put a trigger on the login to the database and track when the login is successful and track associated variables to find out who / what application is using the DB.
If your database is fully logged, create a new transaction log backup and check its size. The log backup will have a small fixed length when no changes were made to the database since the previous transaction log backup, and it will be larger when there were changes.
This is not a very exact method, but it can be easily checked, and might work for you.