
I am building a forum and I want to count the posts submitted by each user. Should I use COUNT(*) WHERE user_id = $user_id, or would it be faster to keep a record of how many posts each user has, update it each time they make a post, and read it back with a simple SELECT?

How much of a performance difference would this make? Would there be any difference between using InnoDB and MyISAM storage engines for this?

  • Use COUNT(*); the limited optimization you might get by doing something else isn't worth the time -- it's pretty much guaranteed not to be your bottleneck. Commented Aug 14, 2011 at 1:11
  • @Kerry: I can't agree. This kind of optimization is trivial to implement, but it gives a huge performance improvement. Commented Aug 14, 2011 at 1:14
  • @zerkms: If the database is properly designed with proper indexes, COUNT(*) is the correct approach. Keeping some cache somewhere else is not "trivial": it is wasteful. Commented Aug 14, 2011 at 1:37
  • @zerkms: using a separate count implies a longer transaction. If the volumes are so high that COUNT(*) is inefficient, then maintaining an extra value is a concurrency issue. Using a cache or whatever is added complexity. COUNT(*) is simple and efficient enough for the average mickey-mouse database. The key here is the WHERE clause on the count: we aren't counting a whole table with billions of rows.... Commented Aug 14, 2011 at 9:27
  • @zerkms: "if"... "if"... "if"... No facts there... Commented Aug 14, 2011 at 9:42

7 Answers


If you keep a record of how many posts a user has made, it will definitely be faster.

If you have an index on the user_id column of the posts table, you will get decent query speeds as well. But it will start to hurt once your posts table grows large enough. If you are planning to scale, then I would definitely recommend keeping a record of each user's post count in a dedicated column.
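
A minimal sketch of the two options, assuming a posts table with a user_id column and a users table (the names are assumptions, not from the question):

    -- option 1: index user_id so the COUNT(*) query stays fast
    CREATE INDEX idx_posts_user_id ON posts (user_id);

    -- option 2: a dedicated, denormalized counter column
    ALTER TABLE users ADD COLUMN post_count INT NOT NULL DEFAULT 0;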


Storing precalculated values is a common, simple, and very effective optimization.

So just add a column holding the number of posts each user has made, and maintain it with triggers or from your application.
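
A minimal trigger-based sketch, assuming a users.post_count column like the one above (hypothetical names, not from the answer):

    -- keep users.post_count in sync with the posts table
    CREATE TRIGGER posts_after_insert AFTER INSERT ON posts FOR EACH ROW
        UPDATE users SET post_count = post_count + 1 WHERE id = NEW.user_id;

    CREATE TRIGGER posts_after_delete AFTER DELETE ON posts FOR EACH ROW
        UPDATE users SET post_count = post_count - 1 WHERE id = OLD.user_id;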

The performance difference is:

  • With COUNT(*) you will always pay for an index lookup plus counting the matching results.
  • With the additional field you'll pay for an index lookup plus returning a single number that is already computed.

And there will be no significant difference between MyISAM and InnoDB in this case.

Comments

What about the longer transaction and extra writes for the pre-calculation?
@gbn: one user can obviously make only one post at a given moment, so there are no locking issues on that user's row.
@gbn: and since reading is the most frequent operation, I think the increased transaction length and extra writes are worth it for the ability to retrieve the precalculated post count in "constant" time.
It's not just locking as you know: you say COUNT isn't scalable and you should maintain an extra column. This means extra writes to be logged (WAL etc). If volumes are enough that COUNT is bad, then the extra writes matter. COUNT is a read load and a filtered count will be fairly efficient. You've also denormalised and added complexity based on a not yet quantified need...
@gbn: can you select the top 10 commenters quickly with your approach?

Store the post count. This seems to be a scalability question, regardless of the storage engine. Would you recalculate the count each time the user submits a post, or would you run a job to take care of this load somewhere outside of the webserver sphere? What is your post volume? What kind of load can your server(s) handle? I really don't think the storage engine will be the point of failure. I say store the value.
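
For the batch-job variant, a sketch of a periodic recount that rebuilds every counter from the source of truth (the table and column names are assumed):

    -- run off-peak, e.g. from cron: recompute all counters in one statement
    UPDATE users u
    SET u.post_count = (SELECT COUNT(*) FROM posts p WHERE p.user_id = u.id);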


If you have the proper index on user_id, then COUNT(user_id) is trivial.

It's also the correct approach, semantically.
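
A quick way to verify this, assuming the index is in place (the table name and value are illustrative):

    -- the plan should show "ref" access with "Using index":
    -- the count is answered from the index alone, without touching table rows
    EXPLAIN SELECT COUNT(user_id) FROM posts WHERE user_id = 42;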

Comments

Could you create a table with 100M rows, 40M of which belong to one user, and then compare COUNT(*) against retrieving a single ready-to-use int from the users table? (Surprise: the index will not be used in that case, and you will get close to a full scan over the 100M rows. Yeah, great advice!!) Btw, if user_id is not defined as NOT NULL, COUNT(user_id) will be slower than COUNT(*).
@zerkms how can you make a blanket statement that an index will not be used? And the use of an index is not affected by whether the column accepts nulls (null is just a value). +1 for Tomalak. An index scan over 40M rows is trivial. No way the optimizer skips an index on a single-column query to perform a table scan.
@BalamBalam: you don't get it. 40% selectivity will prevent the index from being used, no doubt about it; that is how the optimizer works. An index scan over 40M rows is not trivial; it will take seconds to perform. I suggest you run some experiments and read about how the MySQL optimizer works. And I did not say that NOT NULL affects index usage; I said that COUNT(col) performance varies depending on whether col can accept NULL values (read about this too, before discussing performance).
do you think it likely that the table will be dominated by one user? have you considered how long it would take a user to make 40M posts, if they do one a minute?
@andrew cooke: it doesn't matter; they said that a 40M scan is trivial, which it is not.

This is really one of those 'trade-off' questions.

Realistically, if your 'Posts' table has an index on the 'UserID' column and you truly only want to return the number of posts per user, then a query based on this column should perform perfectly well.

If you had another table, 'UserPosts' for example, then yes, it would be quicker to query that table; but the real question is whether your 'Posts' table is really so large that you can't just query it for this count. The trade-off between the two approaches is obviously this:

1) With a separate audit table, there is overhead when adding or updating a post.
2) Without a separate audit table, there is overhead in querying the table directly.

My gut instinct is always to design a system that records the data in a sensibly normalised fashion. I NEVER create tables just because it might be quicker to GET some data for reporting purposes. I would only create them if the need arose and it was essential to incorporate them at that point.

At the end of the day, I think that unless your 'Posts' table is ridiculously large (i.e. more than a few million records), there should be no problem in querying it for a per-user count, presuming it is indexed correctly, i.e. with an index on the 'UserID' column.

If you're using this information purely for display purposes (i.e. "user jonny has posted 73 times"), then it's easy enough to fetch the figure from the DB once, cache it, and update the cache when or if a change is detected.


Performance on posting, or performance on counting? From a data-purist perspective, a recorded count is not the same as an actual count. You can watch the front door of an auditorium, adding the people that come in and subtracting those that leave, but what if some sneak in the back door? What if you bulk-delete a problem topic? And if you record the count, then every post is slowed down by having to calculate and store it.

For me, data integrity is everything, and I will COUNT(*) every time. I just ran a test on a table with 31 million rows: a COUNT(*) on an indexed column where the value matched 424,887 rows took 1.4 seconds on my P4 2 GB development machine (I intentionally underpower my development server so I get punished for slow queries; on the production 8-core 16 GB server that count takes less than 0.1 seconds).

You can never fully guard your data against unexpected changes or errors in your program logic. COUNT(*) is the count, and it is fast. If COUNT(*) is slow, you are going to have performance issues in other queries as well.

Comments

"You can never guard your data from unexpected changes or errors in your program logic" --- triggers on insert/delete will never break integrity. 0.1 seconds is terribly slow for information that is queried often: it means that a mere 10+ requests per second would create more load than your server can serve. 10 requests per second on an 8-core server, you must be kidding?
OK @zerkms. I agree a trigger should not break, and I agree the developer should not mess up or forget a trigger, but it adds a possible point of failure. It depends on how the app is going to be used. If you want to optimize for insert and delete, then COUNT(*) may be the appropriate design. If the count is a primary query, then the solution you propose is optimal. But a single user is not going to post 40M times, so I would not design for that. A forum is going to have input from many users, none of them dominant, or it would not really be a forum.
You mentioned a forum, great. How would you output the number of messages a user has posted? That is something almost every forum shows. Suppose it is a regular thread, with the post message in the central part and the user's info (including their message count) on the left, and there are 50 posts from different users on a page. Do you propose performing 50 additional queries to get the message count for every user?
@zerkms Cool. I would COUNT(*) on the master table for the total, and it might outperform a SUM(usercount) on a summary table, as COUNT is more efficient than SUM. Stack Overflow has 1.9 million posts, and I would COUNT(*) and then cache that result for 1-10 seconds, as the user does not really care if the count is 10 seconds stale. For the count by user I would COUNT(*) with a GROUP BY on the user: one query, not 50 (a sketch follows these comments).
True story: a commercial app I supported used a trigger to facilitate a summary report. On an app and SQL upgrade, the trigger broke in such a way that it still produced a result, but not formatted the same, so it did not match the data. We did not figure this out until it had been in production for 3 days. The problem was that the report was required for federal compliance. I knew how to fix the data and write SQL to get the right report, but due to the nature of the data the report had to come out of the app. The trigger produced the report in 2 seconds rather than 4. We would not have cared if the report took 2 hours.
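
For reference, the single-query version suggested above might look like this (table and column names are assumptions):

    -- one round trip for all posters on the page
    SELECT user_id, COUNT(*) AS message_count
    FROM posts
    WHERE user_id IN (101, 102, 103)  -- the distinct poster ids on the page
    GROUP BY user_id;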

there's a whole pile of trade-offs here, so no-one can give you the right answer. but here's an approach no-one else has mentioned:

you could use the "select where" query, but cache the result in a higher layer (memcache for example). so your code would look like:

key = 'article-count-%s' % user_id
count = memcache.get(key)
if count is None:
    # cache miss: fall back to the normalized source of truth
    count = database.execute('select count(*) from posts where user_id = %s', (user_id,))
    memcache.set(key, count)

and when a user makes a new post, you would also need:

memcache.delete('article-count-%s' % user_id)

this will work best when the article count is used often, but updated rarely. it combines the advantage of efficient caching with the advantage of a normalized database. but it is not a good solution if the article count is needed only rarely (in which case, is optimisation necessary at all?). another unsuitable case is when someone's article count is needed often, but it is almost always a different person.

a further advantage of an approach like this is that you don't need to add the caching now. you can use the simplest database design and, if it turns out to be important to cache this data, add the caching later (without needing to change your schema).

more generally: you don't need to cache in your database. you could also put a cache "around" your database. something I have done with Java is to use caching at the iBATIS level, for example.

