Jun 16th, 2006, 06:00 PM
Optimizing Derived Values from a Database
A colleague of mine and I got into a discussion about calculating derived values from data located in a database. In our case, we have an average rating we would like to calculate from a table of ratings. To implement it, I decided to write a service method that would eventually delegate to a DAO method that would be one SQL statement that would pull out the average rating. So in essence, it's one SQL statement to do this. He said that as the number of ratings go up, the cost of calculating the average rating "on the fly" would be higher and higher. Not only that, he said that at some point we are going to need to move the data out of the database and archive it, and at some point the data's not going to be there to make the calculation. He proposed that we add an "average rating field" to the object that is being rated. This in turn would be a database field that would hold the derived value. This value will be updated everytime a user creates a new rating and then it would be saved to the database.
I understand why he would want to do that because it is faster since you're calculating the average as you go and not all at once. But the problem I have with it is that it seems so unnatural. Adding a derived value as a field in the database doesn't quite fit well with my programming model. And it seems more prone to inaccuracy. What if one of the ratings change? Plus, it seems more of an optimization, and I don't want to optimize something that doesn't need optimization. Can anyone comment and help find a medium between the two approaches given the concerns? Are his concerns valid?
Jun 16th, 2006, 08:06 PM
Storing derived fields in the database like your colleague mentioned is a common practice. Obviously its done for optimization reasons. I cannot see anything wrong with it.
Can you be more specific why you think it is 'unnatural'? Or, more importantly, why you think it could lead to the wrong result?
Jun 16th, 2006, 09:30 PM
IMHO, there isn't a clean cut right/wrong for either approach. It's really case by case. E.g., in this particular case, since calculating average is generally an O(n) operation, whether a derived field is worth it would depend on the nature of your system. I would ask how many ratings there are in the table, and how frequently this average value is requested.
Jun 17th, 2006, 11:05 AM
I see. I definitely don't have the experience to know whether it's common practice or not so I'm glad you let me know that. Also, I guess my feeling on it being "unnatural" is subjective.
Originally Posted by too_many_details
As far as leading to the wrong result, I was thinking that the average could be inaccurate if at some point you had to change some of the historical rating values. But as long as you update the average after you change the rating values which is what you are supposed to do anyway then it shouldn't be a problem. I guess I wanted to see how other people viewed/solved this problem, and it looks I got my answer.
Jun 17th, 2006, 11:07 AM
The average would be requested pretty often, and there could be up to 8,000 to 10,000 rating values. That definitely will be slow to calculate on the fly.
Originally Posted by manifoldronin
Jun 18th, 2006, 01:14 AM
Oh, for 10K values, I don't really think there is any significant difference between SELECT DERIVED_AVG and SELECT AVG(VALUE). The communication overhead to make either call would dominate.
If you are really scraping up the last bit of the juice, you can always cache the average on the Java side. That'll save you the trip to the DB, which like I said, is definitely going to dominate at 10K level.
Jun 18th, 2006, 02:54 AM
This issue is very similar to one I had recently about keeping a running total of accounts a relationship manager has (thats my specific domain).
The options were:
1. Call a stored proc that does a realtime count for the specified relationship manager.
2. Keep a count in the relationship manager table and either:
(a) update the value via a trigger on the account table (so when there is an update, delete or insert to an account have a look at the relationship manager and update the count).
(b) if having an accurate count at all times is not critical (it is not for us) then have a job run every night to update the counts.
There are also Oracle-specific features such as materialised views that could be used.
Looks like 2(b) is the current favourite in our team. Since the responsibilityof keeping the count upto date is down to the db, the java boys treat the values as just another field in the db.
PS.There are some rules I have to work under, namely no SQL in Javacode and we don't use an ORM. Every db access is via stored procs.