Insert Performance Benchmarks [closed]

Closed. This question needs to be more focused. It is not currently accepting answers.
Closed 7 years ago.
It would be extremely useful to have some idea of expected performance benchmarks for inserts in a PostgreSQL database. Typically the answers one gets on this are vague, and in many ways rightly so: every database is different, it depends on the number of indexes and columns, hardware makes a big difference, DB tuning makes a big difference, and so on. My goal is to know the general guidelines of insert performance, at roughly the level where an experienced SQL developer's intuition says "this seems slow, I should try to optimize this".
Let me illustrate. Someone could ask: how much does it cost to buy a house? We answer: expensive! And there are many factors that go into the price, such as the size of the house and its location in the country. BUT the person asking the question might think $20,000 is a lot of money, so houses must cost about that much. Saying it's expensive and there are a lot of variables obviously doesn't help the person asking the question much. It would be MUCH more helpful for someone to say that, in general, the "normal" cost of houses ranges from $100K to $1M, the average middle-class family can afford a house between $200K and $500K, and a normal cost is about $100 per square foot.
All that to say, I'm looking for ballpark performance benchmarks on inserts for the following factors:
Inserting 1000, 10000, 100000 rows into average table size of 15 columns.
Rough effect of every additional 5 columns added to the table
Rough effect of each index on the table
Effect of special types of indexes
Any other ideas that people have
I'm fine with gut feel answers on these if you are an experienced postgresql performance tuner.

You cannot get a meaningful figure here for the list of conditions you specified, because you do not even list the types of conditions that would have a profound effect on the speed of the INSERT command:
Hardware capabilities:
CPU speed + number of cores
storage speed
memory speed and size
Cluster architecture, in case the batch is huge and can cross over
Execution scenario:
text batch, with pre-generated inserts one-by-one
direct stream-based insert
insert via a specific driver, like an ORM
In addition, the insert speed can be:
maintained (consistent or average) speed
single-operation speed, i.e. for a single batch execution
You can always find a combination of these criteria so bad that you would struggle to do 100 inserts a second; at the other extreme, it is possible to exceed 1 million inserts a second with a properly set up environment and execution plan.
So you will find the speed of your implementation somewhere in between, but given the known conditions, the speed will be 42 :)
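To see how much the execution scenario alone moves the number, here is a minimal timing sketch. It is an assumption-heavy stand-in: it uses Python's built-in sqlite3 with an in-memory database rather than PostgreSQL, so the absolute figures say nothing about PostgreSQL itself. The function name time_inserts and the 15-integer-column table are made up for illustration; the point is only the relative gap between one-by-one inserts with a commit each and a single batched transaction.

```python
import sqlite3
import time


def time_inserts(n_rows, n_cols=15, batched=True):
    """Insert n_rows into a fresh in-memory SQLite table.

    Returns (rows_in_table, rows_per_second). SQLite is only a stand-in
    here; the numbers do not transfer to PostgreSQL.
    """
    conn = sqlite3.connect(":memory:")
    cols = ", ".join(f"c{i} INTEGER" for i in range(n_cols))
    conn.execute(f"CREATE TABLE t ({cols})")
    placeholders = ", ".join("?" for _ in range(n_cols))
    sql = f"INSERT INTO t VALUES ({placeholders})"
    rows = [tuple(range(i, i + n_cols)) for i in range(n_rows)]

    start = time.perf_counter()
    if batched:
        # One transaction and one executemany call for the whole batch.
        with conn:
            conn.executemany(sql, rows)
    else:
        # One statement and one commit per row.
        for row in rows:
            conn.execute(sql, row)
            conn.commit()
    elapsed = time.perf_counter() - start

    count = conn.execute("SELECT COUNT(*) FROM t").fetchone()[0]
    conn.close()
    rate = n_rows / elapsed if elapsed > 0 else float("inf")
    return count, rate


if __name__ == "__main__":
    for n in (1000, 10000, 100000):
        _, rate = time_inserts(n, batched=True)
        print(f"batched   {n:>6} rows: {rate:,.0f} rows/s")
    _, rate = time_inserts(1000, batched=False)
    print(f"one-by-one  1000 rows: {rate:,.0f} rows/s")
```

Running the same shape of test against your own PostgreSQL instance, with your own driver, table, and indexes, is the only way to get a figure that means anything for your setup.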

Related

is kdb fast solely due to processing in memory

I've heard people say quite a few times that kdb deals with millions of rows in almost no time. Why is it that fast? Is it solely because the data is all organized in memory?
Also, are there alternatives to it? Do any of the big database vendors provide in-memory databases?

A quick Google search came up with the answer:
Many operations are more efficient with a column-oriented approach. In particular, operations that need to access a sequence of values from a particular column are much faster. If all the values in a column have the same size (which is true, by design, in kdb), things get even better. This type of access pattern is typical of the applications for which q and kdb are used.
To make this concrete, let's examine a column of 64-bit, floating point numbers:
q).Q.w[] `used
108464j
q)t: ([] f: 1000000 ? 1.0)
q).Q.w[] `used
8497328j
q)
As you can see, the memory needed to hold one million 8-byte values is only a little over 8MB. That's because the data are being stored sequentially in an array. To clarify, let's create another table:
q)u: update g: 1000000 ? 5.0 from t
q).Q.w[] `used
16885952j
q)
Both t and u are sharing the column f. If q organized its data in rows, the memory usage would have gone up another 8MB. Another way to confirm this is to take a look at k.h.
Now let's see what happens when we write the table to disk:
q)`:t/ set t
`:t/
q)\ls -l t
"total 15632"
"-rw-r--r-- 1 kdbfaq staff 8000016 May 29 19:57 f"
q)
16 bytes of overhead. Clearly, all of the numbers are being stored sequentially on disk. Efficiency is about avoiding unnecessary work, and here we see that q does exactly what needs to be done when reading and writing a column - no more, no less.
OK, so this approach is space efficient. How does this data layout translate into speed?
If we ask q to sum all 1 million numbers, having the entire list packed tightly together in memory is a tremendous advantage over a row-oriented organization, because we'll encounter fewer misses at every stage of the memory hierarchy. Avoiding cache misses and page faults is essential to getting performance out of your machine.
Moreover, doing math on a long list of numbers that are all together in memory is a problem that modern CPU instruction sets have special features to handle, including instructions to prefetch array elements that will be needed in the near future. Although those features were originally created to improve PC multimedia performance, they turned out to be great for statistics as well. In addition, the same synergy of locality and CPU features enables column-oriented systems to perform linear searches (e.g., in where clauses on unindexed columns) faster than indexed searches (with their attendant branch prediction failures) up to astonishing row counts.
Source: http://www.kdbfaq.com/kdb-faq/tag/why-kdb-fast
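For readers without a q session, the same layout idea can be sketched in Python. This is an illustration of column-oriented storage in general, not of kdb's internals: the dict-of-arrays "table" and the names t, u, column_bytes and shares_f are assumptions made for the example, and Python's interpreter overhead means it will not show kdb-like speed.

```python
from array import array

N = 1_000_000

# Column store: each column is one contiguous buffer of fixed-size values.
t = {"f": array("d", (float(i) for i in range(N)))}

# Payload size of the column: N values of 8 bytes each
# (this excludes the small header of the array object itself).
column_bytes = len(t["f"]) * t["f"].itemsize

# Adding a column builds a new table that reuses the existing column
# buffer instead of copying it, like t and u sharing f in the quote.
u = dict(t, g=array("d", (float(i) * 0.5 for i in range(N))))
shares_f = u["f"] is t["f"]

# Summing a column is a single pass over one tightly packed buffer.
total_f = sum(t["f"])

if __name__ == "__main__":
    print(column_bytes, shares_f, total_f)
```

The 8,000,000-byte payload lines up with the 8000016-byte file in the quote once its 16 bytes of overhead are subtracted.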

As for speed, the memory thing does play a big part, but there are several other things: fast reads from disk for the hdb, splaying, etc. From personal experience I can say you can get pretty good speeds from C++, provided you want to write that much code. With kdb you get all that and some more.
Another thing about speed is the speed of coding. The learning curve is steep, but once you get it, complex problems can be coded in minutes.
For alternatives, you can look at OneTick, or search for in-memory databases.

kdb is fast but really expensive. Plus, it's a pain to learn Q. There are a few alternatives such as DolphinDB, Quasardb, etc.

How to calculate the future database size in Mongo?

I'm using MongoDB and we are really happy with this DB. But recently our client asked us for the database size in the future.
We know how to calculate this in a typical relational database, but we don't have much production experience with this NoSQL database.
Things that we know:
db.namecollections.stats() gives us important information such as size (documents), avgObjSize (documents), storageSize and totalIndexSize.
With the size and totalIndexSize we can calculate the total size for the collection only, but the big question here is:
Why is there a difference between the collection size and storageSize?
How can one use these figures to estimate the future database size?

MongoDB pads documents a bit so that they can grow a bit without having to be moved to the end of the collection on disk (an expensive operation).
Also, MongoDB pre-allocates data files: it creates the next one and fills it with zeros before it is needed, to boost speed.
You can pass the --noprealloc flag to mongod to prevent that from happening.
If you want more info you can look here
In regards to your question about calculating disk space 5 years out, if you can figure out an equation for the growth of your data, make some assumptions about what your average document size will be, and how many / what kinds of indexes you will have, you might be able to come up with something.
Having worked for a bank also, my suggestion would be to come up with an insane upper bound and then quadruple it. Money is cheap inside a bank, calculation mistakes are not.
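A rough version of that equation might look like the sketch below. Everything in it is an assumption to be replaced with your own measurements: compound monthly growth in document count, a constant average document size, and storage and index overheads that stay at the ratios observed today (storageSize / size and totalIndexSize / size from stats()). The function and parameter names are made up for illustration.

```python
def project_size_bytes(doc_count, avg_obj_size, storage_ratio, index_ratio,
                       monthly_growth_rate, months, safety_factor=1.0):
    """Project on-disk size in bytes after `months`.

    storage_ratio: observed storageSize / size (padding, pre-allocation).
    index_ratio:   observed totalIndexSize / size.
    Assumes both ratios and avg_obj_size stay constant over time.
    """
    future_docs = doc_count * (1 + monthly_growth_rate) ** months
    data_bytes = future_docs * avg_obj_size
    storage_bytes = data_bytes * storage_ratio
    index_bytes = data_bytes * index_ratio
    return (storage_bytes + index_bytes) * safety_factor


if __name__ == "__main__":
    # Hypothetical inputs: 1M documents of 500 bytes, 5% monthly growth,
    # five years out, with the upper bound quadrupled as suggested above.
    estimate = project_size_bytes(
        doc_count=1_000_000, avg_obj_size=500,
        storage_ratio=1.5, index_ratio=0.2,
        monthly_growth_rate=0.05, months=60, safety_factor=4,
    )
    print(f"{estimate / 1024 ** 3:.1f} GiB")
```

Re-running stats() every few months and comparing against the projection is a cheap way to check whether the growth assumption still holds.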

Which is the best clustering algorithm to find outliers? [closed]

Closed. This question needs to be more focused. It is not currently accepting answers.
Closed 7 years ago.
Basically I have some hourly and daily data like
Day 1
Hours,Measure
(1,21)
(2,22)
(3,27)
(4,24)
Day 2
hours,measure
(1,23)
(2,26)
(3,29)
(4,20)
Now I want to find outliers in the data by considering the hourly variations as well as the daily variations, using bivariate analysis on hour and measure.
Which clustering algorithm is best suited to finding outliers in this scenario?

One 'good' piece of advice (:P) I can give you is that, based on my experience, it is NOT a good idea to treat time the same way as spatial features. So beware of solutions that do this. You can probably start by searching the literature on outlier detection for time-series data.

You really should use a different representation for your data.
Why don't you use an actual outlier detection method, if you want to detect outliers?
Other than that, just read through some literature. k-means for example is known to have problems with outliers. DBSCAN on the other hand is designed to be used on data with "Noise" (the N in DBSCAN), which essentially are outliers.
Still, the way you are representing your data will make none of these work very well.

You should use a time-series-based outlier detection method because of the nature of your data (it has its own seasonality, trend, autocorrelation, etc.). Time series outliers come in different kinds (additive outliers, AO; innovational outliers, IO; etc.) and it's somewhat complicated, but there are packages that make them easy to detect.
Download the latest build of R from http://cran.r-project.org/. Install the packages "forecast" & "TSA".
Use the auto.arima function of the forecast package to derive the best model fit for your data, and pass those variables along with your data to the detectAO and detectIO functions of TSA. These functions will report any outliers present in the data together with their time indexes.
R is also easy to integrate with other applications, or you can simply run it as a batch job. Hope that helps.
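If R is not an option, the general idea of respecting the hourly pattern can be tried with something much simpler. The sketch below is not a port of auto.arima or detectAO/detectIO: it swaps in a plain per-hour baseline, comparing each reading with the median for the same hour across days and scoring it with the median absolute deviation (the "modified z-score"). The function name, the 3.5 threshold, and the sample readings are assumptions for illustration, and it ignores trend and autocorrelation entirely.

```python
from statistics import median


def hourly_outliers(readings, threshold=3.5):
    """Flag readings that deviate strongly from the same hour on other days.

    readings: list of (day, hour, measure) tuples.
    Returns the flagged tuples in input order.
    """
    by_hour = {}
    for _day, hour, value in readings:
        by_hour.setdefault(hour, []).append(value)

    flagged = []
    for day, hour, value in readings:
        values = by_hour[hour]
        med = median(values)
        mad = median(abs(v - med) for v in values)
        if mad == 0:
            continue  # no spread at this hour, so no score can be computed
        # 0.6745 rescales the MAD so the score is comparable to a z-score.
        score = 0.6745 * (value - med) / mad
        if abs(score) > threshold:
            flagged.append((day, hour, value))
    return flagged


if __name__ == "__main__":
    # Made-up readings for hours 1 and 2 over seven days.
    hour1 = [21, 23, 22, 21, 23, 22, 40]
    hour2 = [22, 26, 24, 23, 25, 24, 24]
    data = [(d + 1, 1, v) for d, v in enumerate(hour1)]
    data += [(d + 1, 2, v) for d, v in enumerate(hour2)]
    print(hourly_outliers(data))
```

With only two days of data per hour, as in the question, neither this nor a model-based method has enough history to say what is normal.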

Can an artificial neural network predict the outcome of sports games? [closed]

Closed. This question is opinion-based. It is not currently accepting answers.
Closed 3 years ago.
I was trying to find something original and fun to do with artificial neural networks (ANNs) as a personal/learning project and I thought it would be cool if I could predict the results of sports games (especially NHL games).
I'm pretty sure it would be easy to evolve an ANN that can predict which team is most likely to win (usually the team with the better record). However, what I would like to do is create an ANN that would tell how likely the outcome is, similar to bookmaker odds.
Is this something an ANN can do? If so, what kind of success can I expect? I know I can't beat the bookmaker (at least not with a software solution). I want to do this as a recreational project/challenge to myself. I don't expect to bet money on sports games with this project.

Way back in the days of the IBM XT I played with a shareware ANN program to try to improve my chances on the British football (soccer) pools. This is a form of betting where you try to predict which football matches will result in draws. I assigned each team a number, then looked back through past results and from them generated a single digit for the result. From memory it was 0 for a home win, 1 for an away win and 2 for a draw. Each result went on a single line in a training file. I would then run the training file through the program and generate the ANN settings. I would then look up the following Saturday's matches and feed them into the ANN, then look for matches predicted as draws.
As the weeks went on my predictions of draws definitely did become more and more accurate. However ...
1) The XT was so slow that by Christmas it was taking 24 hours to generate the ANN settings from the training data. I really had better things to do with my precious (and expensive) PC.
2) Although it was better at predicting draws it wasn't predicting enough to actually win any money. Looking back I suppose the program had just worked out that Manchester United would always beat Sheffield United. This was more football knowledge than I had but not enough to win any money.
3) Entering the results into the training data and then generating the forthcoming matches data was taking me ages and to be honest sport bores me rigid.
So I gave up and didn't become a millionaire.
These days, however, PCs are much faster and much of the training data could be scraped from the web. I still doubt it is a route to a fortune, but it's certainly an interesting project.
Ian

A reply above stated:
I know that if the bookmakers odds could be beaten by an ANN,
bookmakers would already be using one to fix their odds.
Bookmakers don't set the line based on their analysis of the teams - they set it based on their analysis of the betting public's opinion of the teams. An ideal line for the bookie is where he has exactly the same amount bet on each side of the line - then he is guaranteed a profit = the 'juice' on the losers' bets. They move the line as game approaches to try to keep that 50/50 split. Bookie may think Home team -5 is accurate line based on game analysis, but if he expects that will draw 2x $$ on the Home team he will not set the line at -5 - he will set at -7 or -8 - to where he expects to draw equal $$ for both -5 and +5 bets.
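The arithmetic behind the balanced book can be checked with a few lines. This sketch assumes the common pricing where a bettor risks 110 to win 100 on either side; that pricing, the function name, and the stake amounts are assumptions for illustration, not figures from the answer.

```python
def bookie_profit(stake_home, stake_away, winner, risk=110, win=100):
    """Bookmaker's profit given the total staked on each side.

    Winners get their stake back plus stake * win / risk;
    losers forfeit their stake.
    """
    if winner == "home":
        winning_stake, losing_stake = stake_home, stake_away
    else:
        winning_stake, losing_stake = stake_away, stake_home
    paid_to_winners = winning_stake * win / risk
    return losing_stake - paid_to_winners


if __name__ == "__main__":
    # Balanced book: the same profit whichever side wins.
    print(bookie_profit(1100, 1100, "home"), bookie_profit(1100, 1100, "away"))
    # Twice as much money on the home side: the result now depends on the game.
    print(bookie_profit(2200, 1100, "home"), bookie_profit(2200, 1100, "away"))
```

With equal money on both sides the bookmaker keeps 100 either way; with double the money on the home team the same game becomes a 900 loss or a 1200 gain, which is the exposure that moving the line is meant to remove.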

ANNs are really good at pattern matching and prediction, so yes, odds are you could build an ANN that does what you want.
You'll need more than just team win/loss ratio to make it really effective however. Feed it stats for the players, too. For real effectiveness, try to include game-flow information... like which players are on the line for each play (for football, for example).
Ultimately, the biggest problem you'll run into (aside from the whole "writing the ANN" issue) is getting the data you need to feed it.

I've done some stock market predictions with an AI and my conclusion is that it is not very hard to make an AI that gets good results with the historical data.
Making winning transactions in the future is a different ballgame.

I have just worked on this very problem (predicting English Premier League games) for the past 10 days, and ended up with very similar results using 3 different methods: SVM, Logistic Regression, and NN.
LR and NN will give probabilities. SVM outputs 0/1 (but it can be tweaked to give probabilities too; I haven't tried that yet).
I needed a "massive" (by my standards at least) feature set though (almost 300) and a good chunk of data (13 years worth).
Re. data, I got it from the web, simply.
Conclusion: I can just about match the bookies in terms of accuracy (predicting victories in my case). If I add the pre-match odds to the feature set, I get the exact same accuracy as the bookies (as expected), but no better (surely meaning my feature set is summarized in the bookies' odds, and they have a little extra knowledge on top).
I'm sure there is a way to get better accuracy, either by improving the algos, or more likely by having extremely granular data (as in which players play which games, for how many minutes, and a lot of player-level historical stats, so as to build bottom-up models of team performance).
But bottom line is I can testify NNs work quite well for that purpose. SVM is slightly better though, in my limited experience.
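To illustrate what "will give probabilities" means for logistic regression, here is a from-scratch sketch. It is not the setup described in this answer (which used almost 300 features and 13 years of data, presumably with a library): the single feature, the eight toy samples, and all function names are invented for the example, and the training loop is plain batch gradient descent.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def predict_proba(weights, bias, x):
    """Probability of a home win for feature vector x."""
    return sigmoid(bias + sum(w * xi for w, xi in zip(weights, x)))


def train_logistic(features, labels, lr=0.5, epochs=2000):
    """Fit weights and bias by batch gradient descent on the log loss."""
    weights = [0.0] * len(features[0])
    bias = 0.0
    n = len(features)
    for _ in range(epochs):
        grad_w = [0.0] * len(weights)
        grad_b = 0.0
        for x, y in zip(features, labels):
            err = predict_proba(weights, bias, x) - y
            for j, xi in enumerate(x):
                grad_w[j] += err * xi
            grad_b += err
        weights = [w - lr * g / n for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / n
    return weights, bias


# Toy data (invented): feature = home win rate minus away win rate,
# label = 1 if the home team won. Deliberately not perfectly separable.
X = [[-0.4], [-0.3], [-0.2], [-0.1], [0.1], [0.2], [0.3], [0.4]]
y = [0, 0, 0, 1, 0, 1, 1, 1]

if __name__ == "__main__":
    w, b = train_logistic(X, y)
    for diff in (-0.4, 0.0, 0.4):
        print(diff, round(predict_proba(w, b, [diff]), 3))
```

The output is a number between 0 and 1 per match rather than a hard win/lose label, which is the form the original question asked for.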

I think it's indeed all about data, but there's no end to what you could feed it in order to be more accurate: winning/losing streaks, players' biorhythms, the mood of the players' girlfriends before the game, minor/major injuries they suffered in the recent past, events outside sport that are bothering the players, etc.
But I don't think you can accurately predict which team is more likely to win, it would be just a more-or-less educated guess.

In my opinion and experience, because of the excessively large number of factors in play, designing and training the ANN will be unreasonably complex and time-consuming. ANNs are good at pattern matching, and game prediction takes much deductive reasoning rather than mere pattern matching.
But if you want to enjoy learning neural networks, it will be a good adventure. If you are successful, you might want to host your code somewhere for others to see and learn!
For game prediction, it would be much easier and faster with decision trees or a rules engine and so on. This will be no easy task either, but it will be another interesting activity.

My belief is that the unpredictability of an event is due to lack of information and understanding...If you have all the knowledge, then yes it could be done. Or, the more knowledge you have, the better it can be done.
So in theory, the answer is yes.
However, in practice, you can get a PhD and have a whole career working on this question and you still may not succeed.

Is LOC correct parameter for project estimation? [closed]

As it currently stands, this question is not a good fit for our Q&A format. We expect answers to be supported by facts, references, or expertise, but this question will likely solicit debate, arguments, polling, or extended discussion. If you feel that this question can be improved and possibly reopened, visit the help center for guidance.
Closed 11 years ago.
Is LOC the correct parameter for project estimation?
There are many scenarios where a single line of code takes much more time because of its complexity.
Other than LOC, what parameter would you suggest for project estimation?
People talk about the function points of a program; does that refer to use-case-related information?
I am trying to find a solid basis for estimating full software development, covering analysis, design, test case preparation, and coding. Please suggest one.

Steve McConnell in Rapid Development (Microsoft Press, 1996):
Because different programming languages produce such different bangs for a given number of lines of code, much of the software industry is moving toward a measure called "function points" to estimate program sizes. A function point is a synthetic measure of program size that is based on a weighted sum of the number of inputs, outputs, inquiries, and files. Function points are useful because they allow you to think about program size in a language-independent way.
Google "Function Point" for more information.
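The "weighted sum" in the quote can be sketched as below. The weights are placeholders chosen for illustration, loosely in the range of commonly cited average-complexity weights; they are an assumption, not values taken from the quoted book, and a real count would also grade each item by complexity. The function name and the sample counts are likewise made up.

```python
# Illustrative weights per item type (assumed, not from the quoted source).
ILLUSTRATIVE_WEIGHTS = {"inputs": 4, "outputs": 5, "inquiries": 4, "files": 10}


def unadjusted_function_points(counts, weights=ILLUSTRATIVE_WEIGHTS):
    """Weighted sum of the number of inputs, outputs, inquiries and files."""
    return sum(weight * counts.get(kind, 0) for kind, weight in weights.items())


if __name__ == "__main__":
    # Hypothetical program: 20 inputs, 15 outputs, 10 inquiries, 5 files.
    counts = {"inputs": 20, "outputs": 15, "inquiries": 10, "files": 5}
    print(unadjusted_function_points(counts))
```

Nothing in the calculation refers to a programming language, which is the property the quote highlights.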

Seeing as developers are likely to* spend most of their time trying to test changes, lines-of-code is never a good indicator of size of a problem.
Let's suppose you have an existing large application - changing a single line of code may seem trivial, but the test planning and execution could take weeks.
Likewise, adding a relatively large amount of code in a single limited-scope module which is easily testable might be only a few days.
* they should do, at least. If they're spending more time writing code than testing it, it is probably full of bugs. And I mean BEFORE it reaches your dedicated QA team.

Only if you use it in the inverse.
-- Edit
But no. It isn't. It's a mostly useless measure, and generally harmful. As you note, less code is almost always better.
Other things to check? Well, what are you trying to measure? What result do you want to see from a change in the things that you would be checking? What sort of decisions will you be making on the basis of these changes?

LOC is one proxy measure for measuring the problem size.
A LOC estimate can be used, and a LOC count is relatively cheap to measure from historical projects. But LOC can be problematic if used for anything other than a proxy for problem size, as already pointed out by other answers.
Problem size is rather constant given the requirements. From a size estimate you can go to effort, schedule and cost estimates, depending on your planning drivers such as cost or schedule. From historical data you can find how problem size translates to effort and how other planning drivers further influence the outcome. So you need to measure size and effort against the other parameters and keep fine-tuning your estimation process. There are some LOC-to-effort measures available in the literature, but they are unlikely to be very accurate for your domain, the technology you are using, and the team you have.
Other proxies for problem size are function points and story points. My experience with function points is that they are rarely worth the effort. On the other hand, story points in agile methods work very well since they are deliberately abstract (thus avoiding a lot of problems with LOC) and measured on a sprint-by-sprint basis, with instant feedback into following sprints.

No, it isn't. The reason is simple: if you produce a new line of code during your development, are you one step closer to a solution? If you estimated 1000 lines of code to complete a task, are you now 0.1% complete with that task?
Lines of code can be used as a metric but only in the negative sense: for a greater number of lines of code, it is reasonable to assume that you have a greater number of bugs. Based on historical data, there is generally a linear correlation between lines of code and bug count.
Here are some useful and measurable factors that are worth considering:
Hours of labor.
Dollars spent: this is a good one because it strongly reinforces the reality that you'd rather find bugs at the developer's desktop than in the hands of a tester or customer.
Milestones met: is the system available for the customers on the right date?
Requirements completed: this can be a funny one - what if you discover a new customer need during the project?
In short, lines of code is very nearly the worst possible metric you could ever use.

The only way to get any reasonable estimate on project duration is to COMPLETELY implement and deliver some subset of the final requirements. Then you can estimate the remaining requirements by comparing their complexity against the completed work.
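One way to make that comparison concrete is sketched below. It assumes each requirement has been given a relative complexity score (story points would do) and that effort scales linearly with that score; both are assumptions, as are the function name and the sample numbers.

```python
def estimate_remaining_hours(completed, remaining_points):
    """Extrapolate remaining effort from fully delivered work.

    completed:        list of (complexity_points, actual_hours) for
                      requirements that are completely implemented.
    remaining_points: list of complexity points for the rest.
    """
    total_points = sum(points for points, _hours in completed)
    total_hours = sum(hours for _points, hours in completed)
    if total_points == 0:
        raise ValueError("need at least one completed requirement with points")
    hours_per_point = total_hours / total_points
    return hours_per_point * sum(remaining_points)


if __name__ == "__main__":
    # Hypothetical project: three delivered requirements, four still open.
    done = [(3, 12), (5, 22), (2, 6)]
    todo = [8, 5, 3, 4]
    print(estimate_remaining_hours(done, todo))
```

The estimate improves as more requirements are delivered, because the hours-per-point figure is then based on more of the actual project.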