Automatically generate test data for a DB from a schema? - postgresql

I have a discussion DB and I need a large amount of test data, in samples of different sizes. Please see the ready SELECT, JOIN and CREATE queries; scroll down in the link.
How can I automatically generate test data for the DB?
How can I generate test data in samples of different sizes?
Is there a ready-made tool?

Here are a couple of suggestions for free tools that generate test data:
Databene Benerator: supports many JDBC-capable database brands, uses an XML format compatible with DbUnit, GPL license.
Super Smack: originally a load-test tool for MySQL, it also supports PostgreSQL and it includes a generator of mock data.
A current version of Super Smack appears to be available here
I asked a similar question here on StackOverflow in February, and the two choices above seemed like the best options.

I know this question is super dated, but I was looking for the answer to this exact question today and I came across this:
http://wiki.postgresql.org/wiki/Sample_Databases
Out of the options listed (including built-in tools like pgbench), pgFoundry has several compelling options that work perfectly for the test cases I am working on.
I thought it might help someone like me, so there it is.

I'm not sure how to automatically generate data and insert it into the database (I'm sure you could pull it off with a Python script or something), but if you're just looking for endless blabbering to stick into a DB, this should be helpful.
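For what it's worth, if plausible filler rows are enough, PostgreSQL itself can generate them with generate_series() and random(). A rough sketch (the table and column names below are made up; change the series bound to get different sample sizes):

```sql
-- Hypothetical "posts" table filled with random filler rows (PostgreSQL).
CREATE TABLE posts (
    id      serial PRIMARY KEY,
    author  integer,
    body    text,
    created timestamptz
);

INSERT INTO posts (author, body, created)
SELECT
    (random() * 1000)::int,                             -- pretend author id
    md5(random()::text) || ' ' || md5(random()::text),  -- filler text
    now() - random() * interval '365 days'              -- spread over the last year
FROM generate_series(1, 100000);                        -- sample size: 100k rows
```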

I'm not a Postgres person, but in many other DBs I've used, a simple mechanism for generating large quantities of test data is a cross join.
Here's a nice blog post on it (SQL Server specific though).
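To give a concrete Postgres-flavoured illustration (names are placeholders), cross-joining a small seed table with a generated series multiplies the row count very quickly:

```sql
-- 1,000 seed rows x 1,000 generated numbers => 1,000,000 test rows.
CREATE TABLE seed_names (name text);
INSERT INTO seed_names
SELECT 'user_' || n FROM generate_series(1, 1000) AS n;

CREATE TABLE test_users AS
SELECT
    row_number() OVER () AS id,        -- synthetic primary key
    s.name || '_' || g.n AS username   -- keeps the combinations distinct
FROM seed_names AS s
CROSS JOIN generate_series(1, 1000) AS g(n);
```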

Generating MIS reports and dashboards using open-source technologies

I am in need of your suggestions for the scenario below:
One of our clients has 8 Postgres DB servers used for OLTP and now wants to generate MIS reports/dashboards integrating all the data across those servers.
- There are around 100 reports to be generated
- There would be around 50k rows added to each of these databases
- The reports are to be generated once every month
- They are running their entire setup on bare metal
- They don't want to use Hadoop/Spark, since they think the maintenance overhead will be higher
- They want to use open-source tech to accomplish this task
With all that said, one approach would be to write scripts to bring aggregated data into one server
and then manually code the reports with frontend JavaScript.
Is there any better approach using ETL tools like Talend, Pentaho, etc.?
Which ETL tool would be best suited for this?
Would the community edition of any ETL tool suffice for the above requirements?
I know for a fact that the commercial offering of any of these ETL tools will not be in the budget.
Could you please let me know your views on this?
Thanks in advance
Deepak
Sure, yes. I have done similar things successfully a dozen times.
My suggestion is to use Pentaho Data Integration (or Talend) to collect the data in one place and then filter, aggregate and format it. The data volume is not an issue as long as you have a decent server.
For the reports, I suggest producing them with Pentaho Report Designer so that they can be sent by mail (with Pentaho DI) or distributed with a Pentaho BI server.
You can also build a JavaScript front end with Pentaho CDE.
All of these tools are mature, robust, easy to use, have a community edition, and are well supported by the community.

How to deploy/version a database with CruiseControl.NET?

Hi, I have configured the basics of CruiseControl to make releases and automated NUnit tests using just MSBuild. Now I'm wondering: is it possible to deploy/version databases with this?
I'm a beginner with CCNet, so if possible, some suggestions or tutorials (if there are any) would help. Also, if someone knows a free tool for database deployment/versioning, let me know; I will be grateful.
Thanks in advance
Hugh
It isn't free, but SQL Source Control from Red Gate can do what you're looking for, assuming it's a SQL Server database. It has a command-line interface that you can use in CCNet tasks. The easy approach of just migrating up is... easy: the changes are applied to your database schema/data. There was an issue with v2.x of the tool that they've overcome in v3: if you renamed a table column, it would delete the column and create a new one with the right name. Obviously that's quite a big problem if you've got data you want to keep, so with v3 there's the concept of migrations, which allows you to specify alter scripts so that instead of dropping the column you can script the change non-destructively.
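For illustration, a non-destructive rename in a hand-written migration script (SQL Server, all names made up) could look roughly like this:

```sql
-- Option 1: rename in place, keeping the data.
EXEC sp_rename 'dbo.Customers.Surname', 'LastName', 'COLUMN';

-- Option 2: the manual equivalent, useful if the data needs transforming on the way.
ALTER TABLE dbo.Customers ADD LastName nvarchar(100) NULL;
UPDATE dbo.Customers SET LastName = Surname;
ALTER TABLE dbo.Customers DROP COLUMN Surname;
```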
As far as I know, at this time, they don't have anything that allows you to roll back your version.
Otherwise you could take a look at database migration tools; there seemed to be some promise for these in .NET at least. There is also this post that lists some other tools (again for .NET), and then there's this search https://stackoverflow.com/search?q=database+migration+tool which is not restricted to any language but covers database migrations in general.
If you're still looking for ways to version and migrate databases, one such tool is dbdeploy.net. I've hosted it on GitHub after forking it and doing some work. The latest version is fully up to date and has some interesting features (contributed by someone who also uses it and sent a pull request).

How to implement a search system in a database for an iPhone application

This is pretty wide question, but I'm hoping to get a push in the right direction (technologies and methodology).
OK, I have an iPhone app (which I am developing) that works with a web service (C#) through HTTP requests. The web service connects to the underlying database, extracts the necessary data depending on the request, and feeds it back to the application.
Now I need to implement a search system in the app. The user searches for some words, and I need to provide the most relevant results. The search must be performed on different tables in the database, and each table can be searched in a number of columns. For example, when searching the people table I need to search the first name, last name, company, and other fields. Other tables have other important columns.
I have so many questions that I don't even know where to start.
How do I write my SQL queries for the search while still keeping them fast enough? Do I need to make some extra tables with indexed content somehow?
How should I add a relevance factor to the results so I can ultimately return only the most relevant ones? For example, if a user searches for Smith, maybe there is a person named Smith or even a company; those should be displayed before any other content that merely has Smith somewhere in its description.
I know the question is a little vague/wide but I can explain more if somebody desires.
Thank you
This kind of depends on which language/RDBMS you are using on your server. You might check out various DB search solutions like Sphinx, which will do all of that indexing for you and provide a simple search API. Sphinx, for example, allows you to prioritize columns, define character mappings (ß->s, ä->a), etc.
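If the backing database happens to be PostgreSQL (the question doesn't say), much of this can also be done with the built-in full-text search. A rough sketch with a made-up people table, weighting the name columns above the description so that a person actually named Smith ranks first:

```sql
ALTER TABLE people ADD COLUMN search_vec tsvector;

UPDATE people SET search_vec =
    setweight(to_tsvector('simple', coalesce(first_name,  '')), 'A') ||
    setweight(to_tsvector('simple', coalesce(last_name,   '')), 'A') ||
    setweight(to_tsvector('simple', coalesce(company,     '')), 'B') ||
    setweight(to_tsvector('simple', coalesce(description, '')), 'D');

CREATE INDEX people_search_idx ON people USING gin(search_vec);

-- Query, most relevant first.
SELECT id, first_name, last_name,
       ts_rank(search_vec, query) AS rank
FROM people, to_tsquery('simple', 'smith') AS query
WHERE search_vec @@ query
ORDER BY rank DESC
LIMIT 20;
```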
In the end I have decided to use Lucene. It's a wonderful piece of technology, and even though I had some doubts in the beginning, after reading three quarters of the book "Lucene in Action" it was clear to me that it had everything I needed (and much more).
I know it's not a fully functional search system (with all the elements needed), but merely a library handling the core of a search system. It will need some work to integrate it with my application/web service/database. I will let you know how it goes :)
Thanks for your input!

How to implement long-term statistics and a short-term log?

We are developing a larger database web application with Perl Catalyst and PostgreSQL under Linux. Users can log in and upload and download data files (scientific measurements).
I wonder how to implement a logging/statistics system.
We need to view general access trends, and want to analyze traffic caused by certain users/IPs and get access numbers for certain files or topics. I was thinking about something like RRDtool to implement this, or writing the totals to another database table. It would be nice to get some visual graphs from the access data :-)
Additionally, we need to analyze the activity of the last few days in detail. If problems or attacks occur, they must be understood and undone. IMO this needs an action log in a database table.
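Something roughly like this is what I have in mind (just a sketch; all names are placeholders):

```sql
CREATE TABLE action_log (
    id        bigserial PRIMARY KEY,
    logged_at timestamptz NOT NULL DEFAULT now(),
    user_id   integer,
    ip        inet,
    action    text NOT NULL,   -- e.g. 'login', 'upload', 'download'
    file_id   integer,         -- the file/topic involved, if any
    details   text
);

-- Daily totals that could be folded into a long-term statistics table
-- (or fed to RRDtool) before the detailed rows older than ~7 days are purged.
SELECT date_trunc('day', logged_at) AS day, action, count(*)
FROM action_log
GROUP BY 1, 2
ORDER BY 1, 2;
```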
Can you give me some inspiration on how to implement these things? I would love to use the same system for both logging and long-term statistics. Maybe we can aggregate the log data after a period of, e.g., 7 days. It's not that I have no idea how to do it, but I'd like to hear some opinions from somebody else.
Hints about useful CPAN modules are appreciated. We know and already use Log4perl, but that is a bit too detailed to store for ~7 days...
Actually, I think you answered your own question: RRDtool is pretty good for the long term. I use it for half-hourly automatic meter readings for a communal boiler system, with a 3-year window. Nice graphs too.
However, I'm assuming that all this runs under a web server and that the uploads and downloads generate (for example) Apache log file entries; in that case you have a great many options with this: http://httpd.apache.org/docs/current/mod/mod_log_config.html.
This would mean that you could use Webalizer for 'routine' reports and roll your own for the detail, maybe starting from: http://search.cpan.org/~ulpfr/Logfile-0.302/Logfile.pod
Hope that's a little bit helpful, it's a broad, broad question though.

How to collect statistics about data usage in an enterprise application?

For optimization purposes, I would like to collect statistics about data usage in an enterprise Java application. In practice, I would like to know which database tables, and moreover which individual records, are accessed most frequently.
What I thought I could do is write an aspect that records all data access and asynchronously writes the results to a database, but I feel that I would be reinventing the wheel by doing so. So, are there any existing open-source frameworks that already tackle this problem, or is it somehow possible to squeeze this information directly from MySQL?
This might be useful - have you seen the UserTableMonitoring project?
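If a rough picture taken over a short sampling window is enough, one low-tech possibility (a sketch, assuming MySQL 5.1 or later) is to route the general query log to a table and aggregate it afterwards; note that the general log records every statement and has real overhead, so it shouldn't stay enabled in production:

```sql
SET GLOBAL log_output  = 'TABLE';
SET GLOBAL general_log = 'ON';

-- ... let the application run for a while, then:
SET GLOBAL general_log = 'OFF';

-- Most frequently issued statements (a proxy for hot tables/records).
SELECT COUNT(*) AS hits, argument
FROM mysql.general_log
WHERE command_type = 'Query'
GROUP BY argument
ORDER BY hits DESC
LIMIT 50;
```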