In this post I will try to explain what exactly is happening in the Thumbs app in SciPy Central.
Firstly, the Thumbs app is located in the /scipy_central/thumbs/ directory. It handles voting on objects. Right now, Revision and Comment objects are available for voting, although Comment objects can only be up-voted.
Broadly, the Thumbs app does the following:
1. Create a thumb object
2. Get/update reputation
3. Get/update user profile reputation
4. Set the Wilson confidence score
We also changed the way Revision objects are ordered: they are now ordered by Wilson score, whereas before they were ordered by creation date.
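A minimal sketch of the ordering score may help here. I am assuming the standard lower bound of the Wilson score interval, the formula commonly used for ranking by up/down votes (the function name and signature are mine, not the app's actual code):

```python
import math

def wilson_lower_bound(ups, downs, z=1.96):
    """Lower bound of the Wilson score interval at 95% confidence
    (z = 1.96). Objects with more reliable positive voting get a
    higher score, so sorting by it ranks well-voted objects first."""
    n = ups + downs
    if n == 0:
        return 0.0
    phat = ups / n
    return (phat + z * z / (2 * n)
            - z * math.sqrt((phat * (1 - phat) + z * z / (4 * n)) / n)) \
           / (1 + z * z / n)
```

Note that with the same up/down ratio, more total votes give a higher lower bound, which is exactly why it behaves better than a raw up-minus-down count for ordering.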
I considered two typical ways of implementing Thumbs, taking the above points into account.
Looking at the site from the front-end, the following are the areas where thumb-object data comes into play:
1. When we open a Submission (Snippet, Link, Package..) page
2. When we open a User profile page
3. When we make a search query
When we open a submission page, we need the following:
1. The total reputation the object has received
2. The submit-vote form
When we open a user profile page, we need to list all the objects that user created, which also includes the total reputation of each object (that is what we are interested in here).
When we make a search query, results are ordered by the Wilson score confidence level. These levels change on each vote submission.
Now, the above actions can be implemented in two ways. They are described below, along with some supporting reasons.
As "Thumb" objects are generic in nature and can be found with a reverse-generic relationship for any attached objects. We can find the above parameters reputation, wilson-score in two ways
1. calculate reputation/ wilson score every time they are needed to show on page
2. calculate reputation/ wilson score only when a vote is made and store in database.
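To make the two options concrete, here is a minimal framework-free sketch (the class and attribute names are hypothetical stand-ins, not SciPy Central's actual models):

```python
class Thumb:
    """A single vote: +1 (up) or -1 (down)."""
    def __init__(self, value):
        self.value = value

class Submission:
    def __init__(self):
        self.thumbs = []        # stand-in for the reverse generic relation
        self.reputation = 0     # stored field, used only by option 2

    # Option 1: scan every attached thumb each time a page needs it.
    def reputation_dynamic(self):
        return sum(t.value for t in self.thumbs)

    # Option 2: pay a small fixed cost at vote time, then reads are O(1).
    def add_vote(self, value):
        self.thumbs.append(Thumb(value))
        self.reputation += value
```

With option 1 the cost of `reputation_dynamic()` grows with the number of thumbs an object has; with option 2 every read is a cheap field lookup, at the price of extra work on each vote.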
At first I implemented the thumbs app the first way! This implementation completely removes admin maintenance overhead. For instance, if moderators need to manually moderate a user's vote, the total reputation and Wilson score come out right automatically!
However, with this implementation we calculate the reputation of objects each time it is needed, based on the thumb objects they have received. This is fine as long as we have 10-20 thumb objects, considering the performance of our servers and the parallelism and performance the database (PostgreSQL) gives us.
What if a submission object gets 1000 thumb objects? I don't think that is uncommon these days. We can find many questions on Stack Overflow with 500 reputation (say 650 up-votes and 150 down-votes; this may not be exact, but we can assume as much). So counting the total reputation each time, by querying all ~1000 objects, is not a good bet.
Consider a search query that returns 30 objects: calculating the reputation for each object dynamically before display would be really slow, especially if those objects each have 500 reputation!
It would also be bad if a user has lots of contributions on the site: opening their profile page would take lots of database queries!
What the second implementation does is create reputation and Wilson-score fields. Since these only change when a thumb object is created or changed, i.e., when someone votes or a moderator makes changes, we can simply update those fields when a vote is made. The rest of the time, we just query those small integer fields.
This might sound good enough (it did to me, at least), but the following points should be noted.
On every vote a user makes, we additionally do the following:
1. Calculate the new reputation and update the field
2. Calculate the new Wilson score and update the field
3. Calculate the user profile reputation and update the field
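The per-vote update steps above can be sketched in plain Python as follows (the names are hypothetical; the real code would do this against Django model fields):

```python
import math

def wilson_score(ups, downs, z=1.96):
    """Lower bound of the Wilson score interval (assumed formula)."""
    n = ups + downs
    if n == 0:
        return 0.0
    p = ups / n
    return (p + z * z / (2 * n)
            - z * math.sqrt((p * (1 - p) + z * z / (4 * n)) / n)) \
           / (1 + z * z / n)

def submit_vote(obj, author_profile, value):
    """value is +1 (up) or -1 (down)."""
    obj.thumbs.append(value)                 # create the thumb object
    obj.reputation += value                  # 1. new reputation from the previous value
    ups = sum(1 for v in obj.thumbs if v > 0)
    downs = len(obj.thumbs) - ups
    obj.wilson = wilson_score(ups, downs)    # 2. new Wilson score, stored
    author_profile.reputation += value       # 3. object author's profile reputation
```

Each vote pays this small fixed cost once, so page requests and search queries never have to touch the thumb objects at all.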
Considering that voting is less frequent than search queries or page requests (opening a submission page), we can probably accept this overhead, since a lot of performance is gained in the latter.
You can probably see that my opinion favours the second implementation :) However, this implementation requires quite a bit of work on the maintenance side. For instance, if a moderator manually deletes a vote (sets it to None) made by a user, the reputation and Wilson-score fields have to be updated. We also need to consider cases where lots of other people are voting on the same object while we are moderating it. We can compensate here with some custom admin actions that update those fields!
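One way to keep the stored fields honest after moderation is a full recompute from the surviving thumbs, which a custom admin action could call per selected object. A hypothetical sketch (names are mine, not the app's):

```python
import math

def wilson_lower(ups, downs, z=1.96):
    """Lower bound of the Wilson score interval (assumed formula)."""
    n = ups + downs
    if n == 0:
        return 0.0
    p = ups / n
    return (p + z * z / (2 * n)
            - z * math.sqrt((p * (1 - p) + z * z / (4 * n)) / n)) \
           / (1 + z * z / n)

def recompute_stored_fields(obj):
    """Rebuild reputation and Wilson score from scratch after a
    moderator voids votes (here, voided votes are represented as None)."""
    votes = [v for v in obj.thumbs if v is not None]
    obj.reputation = sum(votes)
    ups = sum(1 for v in votes if v > 0)
    obj.wilson = wilson_lower(ups, len(votes) - ups)
```

Recomputing from scratch is slower than the incremental update at vote time, but it is only triggered by moderation, so the cost is acceptable and it cannot drift out of sync.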
Coming to the testing of the two situations described above, we need to look at two parts in each:
1. Time taken to vote
2. Time taken to query or open submission page
Here we are only considering the time taken to display the reputation value.
I generated 1000 User objects and 1000 Submission (Link type) objects to test these situations. To be precise, 998 objects were created, since SQLite can't handle more than 1000 inserts at once! (Deleting all objects at once also becomes easy this way.)
Submit Vote (Case A)
(Finding reputation dynamically each time it is required)
Each of the 1000 users voted on 50 randomly selected Submission objects (out of 1000). I logged about 49,950 thumb objects along with the time elapsed for each vote.
The total time accounts only for the execution of the following steps:
1. Create the thumb object
2. Calculate the reputation (dynamically)
3. Calculate the Wilson score
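The per-iteration timings reported below can be collected with a small harness like this (a sketch of the method, not the actual test script; `sum(range(1000))` stands in for the real vote steps):

```python
import time

def time_call(fn, *args, **kwargs):
    """Return (result, elapsed_seconds) for a single call."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start

# Log the elapsed time of each simulated vote for later plotting.
timings = []
for i in range(100):
    _, elapsed = time_call(sum, range(1000))
    timings.append(elapsed)
```

Plotting the iteration number against `timings` gives exactly the kind of execution-time distribution shown in the figures below.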
On average, the execution time was between 0.1 and 0.2 seconds (increasing slowly). This increase is probably due to the growing number of votes. While monitoring the iterations, I noticed each revision object's total reputation was around 20-40 (I don't know the total number of votes).
The plot below covers 20,000 iterations; the x-axis represents the iteration number and the y-axis the execution time at that iteration.
|Case A - submit_vote execution time distribution|
Submit Vote (Case B)
(Finding reputation only when a vote is made)
The total time accounts only for the execution of the following steps:
1. Create the thumb object
2. Calculate the reputation based on the previous value and store it
3. Calculate the Wilson score and store it
4. Calculate the user profile reputation and store it
On average, the execution time was around 0.2-0.35 seconds, with each object's total reputation roughly between -20 and 40. The plot over 20,000 iterations is shown below; the x-axis represents the iteration number, while the y-axis represents the execution time at that iteration.
|Case B - submit_vote execution time distribution|
Here we should note an interesting trend in the plots. In case B, the average execution time over 20,000 iterations was roughly constant, while in case A it was slowly increasing. This suggests that with very many votes, case A will behave considerably worse than case B!
However, the average execution time in case B is higher by 0.1 to 0.15 seconds. This can be explained by the fact that we additionally calculate the user profile reputation on top of the other functionality. This information is also stored in the database, which probably adds a bit of slowness too.
Page Request (Case A)
Here, the 1000 revision objects previously voted on randomly according to case A are iterated over to measure the execution time of calculating the reputation.
The trend looks quite surprising at first: the execution-time distribution seems constant on average, while I was expecting more frequent spikes. This can be explained by the total number of votes per object being roughly constant. However, there are a few sudden spikes in the plot where the number of thumb objects for an object happens to be higher. For instance, if more objects had much higher reputation than the rest, we would see more spikes.
|Page-request case A - execution time calculation|
Page Request (Case B)
This case obviously takes near-zero execution time (technically not 0, but very close to it). All we do here is query the saved reputation field.
|Page Request Case B - execution time calculation|
Search Query (Case A)
Here we assume we query some objects and display them on a page, so we also need to calculate the reputation. I assumed a search query returns 20 objects each time and measured the time taken to calculate the reputation for all of them.
For each iteration, the 20 objects are iterated over and timed together. The x-axis represents the iteration number; the y-axis represents the execution time.
Again, a surprising trend, as I was expecting spikes. However, it takes nearly 1.5 seconds to calculate!
|Search Query - Case A - execution time to calculate reputation|
Search Query (Case B)
A similar setup as in case A above, but we only return the saved reputation field rather than calculating it dynamically. The trend is quite convincing, though there are some unexpected spikes in the middle.
On average, each iteration of 20 objects took 0.001 s, 0.0005 s, or even 0 s. This is considerably less!
Since these iterations run so fast, I would use a larger dataset next time; in case A, even 200 iterations took quite a lot of time.
The source code used for testing can be found at https://gist.github.com/ksurya/6473160