Rating systems

wordman

The EC feature with the greatest potential that went wasted was the rating system. A lot of this waste was due to the site just not giving the user the power to exploit it, but some of it was due to the way the rating system worked. I've mentioned the usability issue in another thread, so this thread is for talking about the mechanics of rating systems. Such systems have been built in a lot of ways. None of these is perfect, but each has its advantages. In this thread, please discuss the merits of various rating systems to see what might work best for this site.


It's worth mentioning the flaws in the system the EC used. It used a numeric system allowing users to vote from zero to ten, then reported the average of the votes cast. This had the following problems:

  • New entries received the lowest possible score. Combined with the way records were presented, this tended to hide new items from readers.
  • No standard was offered for what votes meant, so one person's 8 vote would mean something completely different from someone else's 8 vote. Some reviewers would almost never give a 9, while others would hand them out like candy.
  • People could vote on their own entries. Since the number of votes was usually low, this tended to skew the results.
  • No weight was given to how many votes were cast, so an item with a single 10 vote would rank above an item with seven 10 votes and one 9 vote.
This post is followed by several others suggesting other voting systems that try to correct these problems. Respond or offer your own. Note that the idea here is not to make an "unbreakable" system. This would be impossible, as people can just create dozens of virtual users to mess with a vote system. Instead, the idea is just to build a system that gives more use to the user than the EC system did.
 
Thumbs


In this system, users are offered only three choices: thumbs up, thumbs down or uncommitted. Scores for new items start at zero. A thumbs up vote gives the item +1, a thumbs down -1 and an uncommitted +0. The total is then divided by the number of votes cast, yielding a final score of -1.0 to 1.0, with 0 as the average. You cannot vote on your own item.
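

As a rough sketch of the scoring arithmetic (Python; the function and argument names are mine, just to illustrate the averaging):

def thumbs_score(up, down, uncommitted):
    # Thumbs up counts +1, thumbs down -1, uncommitted 0; the total is
    # divided by the number of votes cast (uncommitted votes still dilute it).
    votes = up + down + uncommitted
    if votes == 0:
        return 0.0  # new items start at the center of the -1.0 to 1.0 range
    return (up - down) / votes

# The "total of the vote points" variation described under Variations below
# would simply return (up - down) instead of averaging.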


Pros:

  • Items start in the center of the voting range. Good items will bubble up, the bad ones down, and the average or ignored will stay average.
  • Provides reasonable guarantee that people will have a similar notion of what the different votes mean.
Cons:

  • Binary good/bad doesn't allow room for nuanced votes.
  • Doesn't solve problem of fewer votes skewing the result.
Variations:

  • Number range could change from -1.0 to 1.0 to anything else (0 to 10 with five being the starting value, etc.)
  • Rather than average the score, the final score could just be the total of the vote points. This would keep the average centered at zero, but allow theoretically infinite range. This would mean that items with more votes could (potentially) have much higher scores, which may be desirable.
 
Percentile


This system focuses more on unifying what the votes mean. Rather than using a numeric system, a voter would be asked to choose one of the following:

  • Greatest idea ever
  • Will definitely use this in my campaign as is
  • Considering using in my campaign, perhaps with some of my own changes
  • Interesting idea, but probably won't use it in my campaign
  • Doesn't move me one way or the other
  • Would not use in my campaign
  • Fundamentally flawed
Internally, a number is assigned to each of these votes, probably +3, +2 and so on down to -3. New records start with a score of zero. Users cannot vote on their own items. The total of the scores is divided by the number of votes cast to produce an average, but this average is not displayed.


Instead, all of the items are internally arranged on a line based on their average score. Items with a low number of votes are shifted down the line by reducing their average. For example, items with fewer than three votes might have two subtracted from their average, and items with fewer than five might have one subtracted (or some similar variation). The line is then scaled proportionately, with the item with the lowest score becoming 0% and the item with the highest score becoming 100%. This percentage is displayed as the vote on the item.


Mathematically, the score is found like so:


min = lowest average
max = highest average
range = max - min
item score = (item average - min) * 100 / range


The end result works a bit like a percentile on a standardized test, showing how the item scored in relation to the other items. The highest scoring item would show 100%.
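

A sketch of the whole calculation (Python; the low-vote penalties follow the example thresholds above, and all the names are illustrative):

def penalized_average(total_points, vote_count):
    # Average of one item's votes (-3 .. +3), shifted down when few votes exist.
    average = total_points / vote_count if vote_count else 0.0
    if vote_count < 3:
        average -= 2  # example penalty for fewer than three votes
    elif vote_count < 5:
        average -= 1  # example penalty for fewer than five votes
    return average

def percentile_scores(items):
    # items maps item id -> (total_points, vote_count); returns id -> 0..100 score.
    averages = {i: penalized_average(t, n) for i, (t, n) in items.items()}
    lo, hi = min(averages.values()), max(averages.values())
    span = (hi - lo) or 1.0  # avoid dividing by zero if every item ties
    return {i: (a - lo) * 100 / span for i, a in averages.items()}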


Pros:

  • Items start near the center of the voting range.
  • Provides reasonable guarantee that people will have a similar notion of what the different votes mean.
  • Items with fewer votes do not score as well.
  • The scores of the items change based on the quality of other items, generating a sort of "quality race" that should keep content evolving.
Cons:

  • Voting on a single item requires recalculation of all votes, which means this might need to be done once per night.
  • Complicated
  • The scores of the items change based on the quality of other items
 
Is this because of an overabundance of perfect scores? I've found most ratings systems online quickly become worthless when everybody votes 10.
That's kind of the idea behind the "thumbs" system: since many people tend to vote "all or nothing", just make everyone do so. In aggregate, you should get a decent representation of how good the thing is.
 
In the meantime, while we're getting a proper submission system going, I recommend that people add a poll to their submission post if they want some kind of quantifiable feedback.


The dialogue for adding polls is directly below the editing screen for a new thread post.


-S
 
I mentioned editors in the other thread. I think this is a good way to preserve quality - anything altogether rubbish can be immediately rejected. Anything that doesn't meet the standards of English usage can be politely turned down. Anything that is nonsense or inappropriate can also be shot down.


Of course, the Editors would need to be charged to use their power wisely. Anything legible and legitimate should be approved, even if we disagree with its form, style or manner.


Did I say we? I meant they. Yeah.
 
I mentioned editors in the other thread. I think this is a good way to preserve quality - anything altogether rubbish can be immediately rejected. Anything that doesn't meet the standards of English usage can be politely turned down. Anything that is nonsense or inappropriate can also be shot down.
Agreed. However, I think there should still be a set of guidelines posted so that people will have an idea of what constitutes "acceptable" and moderators will have a set of standards to judge submissions by.


-S
 
Carrying on this conversation in two separate threads seems unnecessarily confusing, at least for me. Might I request we confine my Editors idea to the Submissions Guidelines thread, with those handy submission guidelines you posted?
 
Having multiple rating indexes per submission might work, e.g.:

  • Overall (derived from all other scores)
  • Reasoning (inc. progression positioning)
  • Mechanics
  • Description (would inc. Spelling/Grammar)
 
A good measurement of rating.


I like the idea of the thumbs up/down system. I especially like it because there is no room for interpretation, and therefore no room for overinflation of scores. I'll sacrifice a bit of nuance for that.


The pitfall here is that people will abstain from voting - or vote neutral - when they really should be giving thumbs down. What I mean is: everyone likes to give something a good score, but few people like giving a low score. In a 1-10 system you can give a 3 or 4 and still feel OK about it. In a thumbs up/down system, if you do not like the idea, your only option is the lowest possible score.


But I say we try it anyway. :-)


I believe the interesting numbers in such a system are:


- The number of votes cast.


- The minimum number of votes needed to score at all.


- The number of ups/downs/neutrals.


- The percentage of ups/downs/neutrals.


- The total added value of all votes.


- The average vote value.


- The average vote, not counting the top and bottom 5%.


- The median vote.


To make this useful we want:


- A final score that everyone can understand/relate to.


- A score that represents the consensus, not the margin of opinion.


- A score that can show not only (lack of) popularity but also shows when the votes are really divided.


My suggestion:


- Don't calculate a score before a minimum number of votes has been cast. I suggest 10.


- Show ups/downs/neutrals both number and percent.


- Show total number of votes.


- Show score.


To calculate the score I would ignore the top 5% and the bottom 5% of votes (this is to take away extremes, instead focusing on the consensus) and average the rest.


Example:


Billstorn makes a new item called the Buckler of Swashing.


It receives 19 votes. The score would show:


Score: 0.58
Thumbs up: 63% (12 votes)
Neutral: 26% (5 votes)
Thumbs down: 11% (2 votes)
Total votes: 19


This score is calculated after removing 0.95 (5% of 19) votes from the top and from the bottom. Without this removal the score would be 0.53.


Another example:


Score: -0.86
Thumbs up: 4% (1)
Neutral: 13% (3)
Thumbs down: 83% (19)
Total votes: 23


Without removal this score would be -0.78.


How I removed 5% of the votes, using the second example:

5% of 23 = 1.15 votes, giving:

Up: 0
Neutral: 2.85
Down: 17.85
Total: 20.7

Average: (0*1 + 2.85*0 + 17.85*-1) / 20.7 = -0.86
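

A sketch of that trimming step (Python; the +1/0/-1 weights, the 5% cut and the 10-vote minimum follow the post above; everything else is illustrative):

def trimmed_thumbs_score(up, neutral, down, trim=0.05, min_votes=10):
    # Average thumbs score after discarding the top and bottom `trim` fraction
    # of votes; the cut may remove fractional votes, as in the examples above.
    total = up + neutral + down
    if total < min_votes:
        return None  # no score until enough votes have been cast
    cut = trim * total
    # Trim from the top: eat into the +1 votes first, then the neutrals.
    kept_up = max(up - cut, 0)
    kept_neutral = neutral - max(cut - up, 0)
    # Trim from the bottom: eat into the -1 votes first, then the neutrals.
    kept_down = max(down - cut, 0)
    kept_neutral -= max(cut - down, 0)
    kept_total = kept_up + kept_neutral + kept_down
    return (kept_up - kept_down) / kept_total

# Reproduces the worked examples above:
# trimmed_thumbs_score(12, 5, 2)  -> roughly 0.58
# trimmed_thumbs_score(1, 3, 19)  -> roughly -0.86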


I believe this system encapsulates all three points of usefulness, as stated above.


What do you think of it?
 
Re: A good measurement of rating.

Relic said:
What do you think of it?
Seems reasonable. The only quibble I have is over the minimum number of votes. On the old EC, very few items ever received 10 votes, so if the threshold were set to that, the vast majority of items would have no score.


Maybe this is good. Maybe not.
 
Maybe have a query set up to find entries needing votes (<10).


Also maybe have the status rewards include entries rated as well as forum posts, to encourage participation.
 
It's time to revive this thread, since we're near the point where ratings come into play.


I've got the SQL that will calculate things like "most rated items" or "highest-rated items".  I want a final consensus from people on how this will work here.
 
ashenphoenix said:
Maybe have a query set up to find entries needing votes (<10).
Also maybe have the status rewards include entries rated as well as forum posts, to encourage participation.
I like both of these ideas.  However, I fear that a rewards system would encourage people to review things poorly in a badge-collecting effort.
 
I'd agree with the badge collecting thing, except it seems to work in the forums, and vaguely on old EC.


It would acknowledge the work of people like Joe.


I just started thinking about another level. IMDB has a "Was this review useful to you?" function, where people rate reviews. I know that it's probably not worth the effort to program, but I'm just throwing it in there.


Old EC used to have "top raters" (I'm guessing it was based on how accurate their ratings were). Instead I'm suggesting a "top reviewers" function. The original submitter rates whether or not the review was helpful to them, making reviewing and suggesting an interactive and constructive process.


EC's ratings helped us decide whether a given custom rule was worthwhile. Reviews helped the original creator improve the item, unless they were from ThatGuy or MasterOfThatGuy (TG consistently rated items 1 and gave crap feedback; MOTG gave a little feedback, but consistently chased TG around giving 10s to counter TG's 1s).


Maybe have reviewers gain status on whether their feedback was constructive.


This seems like another level of complexity that people may not want though.
 
ashenphoenix said:
Old EC used to have "top raters" (I'm guessing it was based on how accurate their ratings were)
Actually, I think it was purely statistical. A "top rater" was either someone who rated a lot of items, or rated a lot of things highly. I forget which. In either case, it was fairly useless information, as it did not pertain to the actual submissions at all.


-S
 
ashenphoenix said:
EC's rating's helped us decide whether a given custom rule was worthwhile.
A simple three-state attribute of the Rating instance ("Helpful", "Useless", "Neutral") can easily be instituted.  This would not affect the overall rating of the ITEM (because people would then mark all their low ratings as Useless to inflate their items), but we could then easily put together a "Most Useful Raters" sort of report.
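

As a sketch of how such a report could be assembled from those three-state flags (Python; the data shape and names are assumptions, not the site's actual schema):

from collections import Counter

def most_useful_raters(ratings, top=10):
    # ratings is an iterable of (rater, flag) pairs, where flag is "Helpful",
    # "Useless" or "Neutral" as assigned by the submission author.
    tally = Counter()
    for rater, flag in ratings:
        if flag == "Helpful":
            tally[rater] += 1
        elif flag == "Useless":
            tally[rater] -= 1
        # "Neutral" leaves the rater's tally unchanged
    return tally.most_common(top)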


The downside here is that people have yet another way to hold grudges against each other, which makes me wonder if it's not better to hide the submitter's name until the useful/useless/neutral rating is given (and to not let them change the feedback).  Then again, this will only come into play if people turn out to be real assholes.
 
Regardless of what statistical rating system is put in place, I think the ability to add (and respond to) comments is critical.


If coding such a thing is a pain in the ass, perhaps it could somehow integrate with phpBB's mojo.


-S
 
Stillborn said:
Regardless of what statistical rating system is put in place, I think the ability to add (and respond to) comments is critical.
If coding such a thing is a pain in the ass, perhaps it could somehow integrate with phpBB's mojo.


-S
Dude, it's me.  This is virtually done already.  The slowdown is that it's tough to test without a lot of bogus user accounts, which is why we still have text fields for the usernames in charm submissions.  Once ratings have been tested to work, it's time to move to a live database connection for users.
 
I can make a bunch of bogus accounts for testing and delete them when you're done, if you'd like.


-S
 
Stillborn said:
I can make a bunch of bogus accounts for testing and delete them when you're done, if you'd like.
-S
That'd be good.  Please give them clearly bogus names, and PM me the password?  One needs to be in Auditors, one in Scholars, and maybe two other non-privileged?
 
Preliminary Conclusions:


- Thumbs ratings work better when many people vote


- Nuanced ratings work better when few people vote


- More effort to vote means fewer participants doing so


- Identifying abusive raters is important


- Some users are selective about the quality of things they give ratings for (e.g. people who only rate crap, or people who only reward good items but ignore bad ones)


My Goals:


- A scalable, pluggable ratings system that can be attached with minimal dependencies to any sort of submission


- A system that encourages many voters but does not break with few


- A way to reward people who rate well - hopefully such people would be somehow recognized in the forums as well


Mechanism:


- Users may rate any submission.  Each user may rate each submission once.  Users may not rate their own submissions.


- Ratings consist of two Yes/No/Abstain rankings (Was this submission good, by the rater, and Was this feedback useful, by the submission author)


- The submission author, the rater, or a moderator may delete a rating.


- Ratings come with a text field for the rater to give feedback, and another for the sub author to reply.


- The default views for submissions will be ordered by "newest" and/or "least rated" items, encouraging people to explore new submissions first.  These views will lead to the "best rated" views.


- There will be a special view that displays "most useful raters" and "least useful raters".  This may be accompanied by an "average rating given" or "average rating received" metric on the user's own view.
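

A minimal sketch of the rating record this mechanism implies (Python; the field names and types are assumptions about how it might be stored, not the actual schema):

from dataclasses import dataclass
from enum import Enum

class Rank(Enum):
    YES = 1
    NO = -1
    ABSTAIN = 0

@dataclass
class Rating:
    submission_id: int
    rater_id: int
    good: Rank                   # "Was this submission good?" - set by the rater
    useful: Rank = Rank.ABSTAIN  # "Was this feedback useful?" - set by the submission author
    feedback: str = ""           # rater's comment on the submission
    reply: str = ""              # submission author's reply to the feedback

def can_rate(rater_id, author_id, existing_ratings):
    # One rating per user per submission, and no rating your own submission.
    return rater_id != author_id and all(r.rater_id != rater_id for r in existing_ratings)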
 
memesis said:
- Users may rate any submission.  Each user may rate each submission once.  Users may not rate their own submissions.
- Ratings consist of two Yes/No/Abstain rankings (Was this submission good, by the rater, and Was this feedback useful, by the submission author)


- The submission author, the rater, or a moderator may delete a rating.


- Ratings come with a text field for the rater to give feedback, and another for the sub author to reply.


- The default views for submissions will be ordered by "newest" and/or "least rated" items, encouraging people to explore new submissions first.  These views will lead to the "best rated" views.


- There will be a special view that displays "most useful raters" and "least useful raters".  This may be accompanied by an "average rating given" or "average rating received" metric on the user's own view.
This has been implemented.  The "most useful raters" view is not yet public, but will be soon with the new database in place.
 
memesis said:
- Some users are selective about the quality of things they give ratings for (e.g. people who only rate crap, or people who only reward good items but ignore bad ones)
I've noticed this problem as well. There are several users who have never given a rating other than +1 to anything. That would be great if everything in the database were awesome, but there are obviously some stinkers in there as well.


Hopefully the "Unrated" screen will encourage more people to rate everything, and not just the stuff they like.


-S
 
Actually, the problem I've had is that sometimes I've found a submission I suspect might be good, but I can't appreciate it, since I

  1. don't like the theme of it, or
  2. don't feel certain about any rules it leans on, and whether it upsets game balance significantly.


... which makes me a bit reluctant to pass judgement.
 
