Thomas Cool

June 2000, report TC-2000-06-a

http://thomascool.eu

JEL A00

**Summary**

This paper explores the links between matching, ranking, rating, Elo rating, logit modeling, Rasch modeling, Item-Response Theory (IRT), voting, aggregation, and the like.

These issues arise e.g. when testing students, or rating web pages or scientific
papers.

The analysis is at a very basic level. The main results are *Mathematica* notebooks and a package.

The original inspiration led to some study of Logit modeling, Rasch modeling, Item-Response Theory (IRT), etcetera.

The result of this study is (i) a *Mathematica* package to deal
with some basics, and (ii) a set of notebooks to discuss those basics.

So the results are limited. But my impression is that the whole gives a nice discussion of aspects of testing - matching, ranking and rating - and that others could benefit from reading this. This is also the only conclusion that we will reach - and thus there is no section with the title 'conclusion'.

For us, however, these two meanings of 'match' are a bit confusing,
and we should avoid the confusion. We will use the expression 'find the
best *combination*' for matching in the sense of pairing up. For such
combining, we would tend to determine the distance between the objects
per criterion, and aggregate these distances.

In ranking, we would aggregate the scores and only then compare these aggregates, e.g. to create classifications like Pass / Fail. Ranking would be for an ordinal scale only. If we have an interval scale, so that only the difference between variables has objective meaning, then the *ranking* turns into a *rating*.

Ranking and rating can be done deterministically or with an element of randomness. When player 1 wins against player 2, it is possible that this result is deterministic. For example, if the game is 'weight', then player 1 or player 2 is heavier, and this result will be the same in repeated trials. However, in some matches there is only a probability to win. But even with winning probabilities we still can define a 'distance measure'.

It is important to see that there are always criteria. Even if we organise pairwise duels, as in chess, the comparison of the objects (players) still relies on criteria. The criterion for winning in chess is to take the opponent's King. It may be an enormous task to develop such criteria further, and hence we may skip such development and regard only the outcomes of such contests. But we should be aware that this is only a simplification.

A classic example of testing is where the criteria are exam questions. People who do an exam can be seen as being in a contest with the questions. They can also be seen as being in pairwise contests to do better on the exam. This insight links 'testing with criteria' to 'pairwise matches'.

We should be aware of at least three points of uncertainty: (1) The criteria might only be an approximation to the real objective of the test. (2) The way of aggregation might also be subject to discussion. (3) And, more in general, the scores need not be certain but can have a stochastic component. Testing quickly becomes statistical testing.

One possible type of testing is voting. Voting normally gives an ordinal scale which indicates that the object higher on the list beats the object lower on the list. This uses *certainty*. Alternatively, there is only the *probability* of winning. We still could use an ordinal scale to express such a likelihood of winning (such as "*A* is likelier to win than *B*").

An important approach is Item Response Theory (IRT). A test consists of subjects answering to items (criteria) on the test. Both subjects and items have a rating. The rating of a subject is interpreted as competence, the rating of an item (criterion) is interpreted as the ease of the question. Then the probability of a proper response depends upon the difference between these ratings.

Another basic idea of testing is the prediction of winning. If we have three persons and we know the winning probabilities in a match between the first two persons, then we would like to make a prediction on the winning probabilities for matches with the third person. To make this prediction, we could use criteria scores on the abilities of the three persons.

Ratings have been used for IQ, sport games, bets or gambling, Social Science Citation Index, etcetera. Once you grow aware of it, it is everywhere.

We will meet the following issues:

- ranking in general - such as utility theory, where very different
inputs are aggregated into one index

- voting theory and the voting paradoxes

- error correction: adjusting an existing score to actual outcomes

- the Elo rating system used in chess - is it a random walk?

- the Item Response Theory (IRT) of psychometrics (e.g. the Rasch model),
used for example for school grade point averages and test banks

- Logit modeling in general, in econometrics

- and issues of measurement (log-interval scale).

There are some interconnections that at first may be surprising. For
example, students doing a test, 'vote' for the answers.

(a) In voting the interest is in the winning answer.

(b) In testing, the issue is rather whether the student belongs
to the winning group - so testing might be seen as inverse voting.

There is also a link to neural networks - where a neuron fires when a threshold is reached.

The recent paper by Rafiei & Mendelzon (2000) looks into the rating of internet pages. I have various doubts about their approach, but the issue is obviously linked.

This discussion and programming has been guided as much by my own intuition as by this literature. Since there were no *Mathematica* programs available, I had the luxury that I could proceed anyhow. If I re-invented a wheel somewhere, that would not matter, since at least I created something novel: the *Mathematica* programs.

Proceeding like this, eventually various ideas came up, and some insights appeared more important to me than others. You may be more at home in the literature and therefore be a better judge of whether these points are really worth anything. But they are:

(a) Clarification of the relationship between the various links above - and in particular the link between testing and voting and the link between testing on criteria and pairwise matches.

(b) A heuristic estimation procedure: translate Item Response (IR) matrices first into pairwise matches, and then estimate from the probabilities found. This could be regarded as a condition for consistency.

(c) My conjecture is that IR can be collapsed into matches (Borda) - but some matches cannot be collapsed into IR (some pairwise comparisons).

(d) Clarification, for sorted tests, of the relationship between correlation of scores and winning probabilities. Regard two subjects who do a series of questions. Let 1 have probability *p* of answering correctly, and let 2 have probability *q* of answering correctly. Let *p > q*, and let there even be the dependence that if 2 answers correctly, then 1 surely will too. This would be a likely outcome for sorted tests, i.e. tests with items of increasing difficulty. I find that the correlation here is (1 - *p*) / (1 - *q*).
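Under the stated dependence, the joint distribution is P(both correct) = *q*. One reading under which the ratio (1 - *p*) / (1 - *q*) emerges exactly is as Cov(X1, X2) / Var(X2), the regression coefficient of subject 1's score on subject 2's. A Python sketch of that check (my own illustration, not from the paper's *Mathematica* code; the function name is mine):

```python
def ratio_check(p, q):
    """Bernoulli scores X1, X2 with P(X1=1)=p, P(X2=1)=q and
    P(X1=1, X2=1)=q (if 2 is correct, 1 surely is; requires q <= p).
    Returns Cov(X1, X2) / Var(X2)."""
    cov = q - p * q          # E[X1 X2] - E[X1] E[X2] = q - p*q
    var2 = q * (1.0 - q)     # variance of a Bernoulli(q) score
    return cov / var2

# The ratio simplifies algebraically to (1 - p) / (1 - q):
print(ratio_check(0.8, 0.5))  # 0.4 = (1 - 0.8) / (1 - 0.5)
```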

(e) Clarification that the Rasch model is equivalent to a direct relation of the probabilities, so that Logistic transformations are actually superfluous - this is what I call the 'inverse' approach. Clarification that the Rasch model has multiplicative odds that allow one to construct the *whole* match matrix. This might make the model less attractive.

(f) Enhancing clarity on the *probability model* of the matches (in relation to voting cycles) - it is very likely that the multinomial model (which includes deadlocks) is best.

(g) Clarification to psychometricians that taking budget shares as probabilities likely comes at the cost of neglecting econometrics (income elasticities, marginal utility, etc.). On the other hand, such models for budget shares could be used, purely as mathematical models, to design new models, in order to escape from the restrictions of the Rasch model.

The Logistic function appears to be a much-used function to determine the probability of answering correctly. The simplest model doing so is the Rasch model, i.e. the Logistic with only a slope parameter. The more competent the student, the likelier it is that the proper answer is given. The easier the question, idem. Since there is no obvious 'zero' value, we get an interval scale, and the difference in ratings becomes the variable that determines the probability of a correct answer. Hence the variable *x* for Logistic[*x*] will be the difference in ratings of the subject and the item. Actually, since the rating scale makes no real distinction between subjects and items, the difference in ratings between subjects can also be used to find the probability of winning a pairwise contest. The Rasch model is consistent, in that the sum of the probabilities of winning and losing is 1.
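A minimal sketch of this use of the Logistic function (in Python rather than the paper's *Mathematica*; the function names are my own):

```python
import math

def logistic(x):
    """The Logistic function, mapping a rating difference to (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

def p_correct(subject_rating, item_rating):
    """Rasch model: probability of a correct answer, determined by
    the difference between subject rating and item rating."""
    return logistic(subject_rating - item_rating)

def p_win(rating1, rating2):
    """The same scale compares two subjects: probability that
    subject 1 beats subject 2 in a pairwise contest."""
    return logistic(rating1 - rating2)

# Equal ratings give a 50% chance; consistency means that the
# probabilities of winning and losing sum to 1.
print(p_correct(0.0, 0.0))                      # 0.5
print(p_win(1.3, 0.4) + p_win(0.4, 1.3))        # 1.0
```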

When students are graded, they don't get a rating but a grade on the scale from 0 to 10, which effectively gives the percentage of correct answers, i.e. a percentage on a scale from 0 to 100. Indeed, such percentages, winning probabilities, are even more informative than a rating. We thus could do without an explicit rating for a certain set of models. See the notebook on the 'inverse' approach.

It helps to understand the odds, and their meaning for winning probabilities. It appears that the Rasch model implies that the probabilities in a match matrix can be determined from one row only (multiplicative odds) - i.e. from the winning probabilities of one person only.
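In odds form the Rasch winning probability between i and j is exp(r_i - r_j), so odds multiply: odds(i, j) = odds(i, k) · odds(k, j). A hedged Python sketch of rebuilding the whole match matrix from the first row alone (my own illustration of the multiplicative-odds property, not the paper's routine):

```python
def match_matrix_from_row(row):
    """row[j] is the probability that person 0 beats person j
    (with row[0] = 0.5). Under multiplicative odds,
    odds(i, j) = odds(0, j) / odds(0, i), which rebuilds the
    full match matrix of winning probabilities."""
    def odds(p):
        return p / (1.0 - p)
    n = len(row)
    return [[odds(row[j]) / (odds(row[j]) + odds(row[i]))
             for j in range(n)]
            for i in range(n)]

# Person 0 beats 1 with 0.75 and beats 2 with 0.9; the model then
# fixes the probability of 1 beating 2 as well.
m = match_matrix_from_row([0.5, 0.75, 0.9])
print(m[1][2])   # 0.75: odds 9 against odds 3
```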

Hence, the Rasch model has an easy interpretation via the inverse approach, and it has easy multiplicative odds. These can be thought to be an advantage, but also a disadvantage. Searching for alternative specifications, however, we discover that these are difficult to find - see the discussion on formal conditions below.

The area of investigation is rather complex. Some papers or book sections appear less clear than one would wish. It is advised to remain critical. *Mathematica* appears to be useful in that we can quickly test statements.

We have to consider the definition of Item Response matrices, and random creation of these.

We can see IR as a match, and have routines for transformation of the IR matrices into match matrices. And we can quickly estimate a Rasch model for a match.
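One plausible form of such a transformation (a Python sketch under my own assumptions; the paper's *Mathematica* routines may differ): for each pair of subjects, look at the items on which they disagree, and take each subject's share of those 'duels won' as the pairwise winning probability.

```python
def ir_to_match(ir):
    """ir[i][k] is 1 if subject i answered item k correctly, else 0.
    Returns a match matrix m with m[i][j] the estimated probability
    that i beats j, from the items on which they disagree."""
    n = len(ir)
    m = [[0.5] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            wins_i = sum(1 for a, b in zip(ir[i], ir[j]) if a > b)
            wins_j = sum(1 for a, b in zip(ir[i], ir[j]) if a < b)
            total = wins_i + wins_j
            if total > 0:               # ties give no information
                m[i][j] = wins_i / total
                m[j][i] = wins_j / total
    return m

# Three subjects, four items:
m = ir_to_match([[1, 1, 1, 0],
                 [1, 0, 0, 0],
                 [0, 1, 0, 1]])
print(m[0][1])   # 1.0: subject 0 wins both disagreements with 1
```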

The above has covered sufficient ground to start considering the various steps of a systematic development of the Item Response model for the probability to win. Perhaps it is more instructive to first regard estimation before proceeding. Presently we use a heuristic estimation, in which the IR matrix is first transformed into a match and estimated as above (Rasch model). After this practical example of estimation, we may better appreciate the formal conditions involved.

An example is the Elo rating in chess. Note that we should be aware of the 'inverse' approach. See this longer discussion and an estimation. What seems elegant at first, appears rather complex and not without its problems. Chess rating adjustment might well be a random walk.
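For reference, the standard Elo adjustment (Elo 1978) sketched in Python; the factor K = 32 and the 400-point logistic base are conventional choices from chess practice, not taken from this paper:

```python
def elo_expected(r_a, r_b):
    """Expected score of player A against player B."""
    return 1.0 / (1.0 + 10.0 ** ((r_b - r_a) / 400.0))

def elo_update(r_a, r_b, score_a, k=32.0):
    """Adjust A's rating toward the actual outcome (1 win, 0.5 draw,
    0 loss). It is this repeated error-correction step that can make
    the rating behave like a random walk."""
    return r_a + k * (score_a - elo_expected(r_a, r_b))

# Two equally rated players; the winner gains half of K.
print(elo_update(1500.0, 1500.0, 1.0))   # 1516.0
```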

In economics, we can use budget shares and income elasticities. Is it useful to turn these budget shares into probabilities?

For science: Could we have a rating for scientific output?

This discussion leaves some PM points:

Correlation comes in as a direct measure for the closeness of subjects.

BeginDigits: the sample distribution from large populations and arbitrary distributions. With so many items and people tested, is this not a good prior instead of Laplace's uniform distribution?

For a conclusion, we refer to the introduction. This discussion has been explorative, and not conclusive.

Cool, Th. (1999), "The Economics Pack, User Guide", published by the author

Cool, Th. (2000), "Definition and Reality in the General Theory of Political Economy", Samuel van Houten Genootschap (in particular on my solution to Arrow's problem in voting)

Elo, A.E. (1978), "The Rating of Chess Players, Past and Present", Arco Publishing, Inc., New York.

Freeman, J. A. (1994), "Simulating neural networks with *Mathematica*",
Addison-Wesley

Hambleton et al., "Item response theory: principles and applications"

Nunnally & Bernstein (1994), "Psychometric theory", McGraw Hill

Rafiei, D., & A. Mendelzon (2000), "What is this Page Known for? Computing Web Page Reputations", WWW9 International Conference, http://www9.org/w9cdrom/index.html

Rasch, G. (1977), "Specific Objectivity: An Attempt at Formalizing the Request for Generality and Validity of Scientific Statements" - memo 18 at http://www.rasch.org/

Theil, H. (1971), "Principles of Econometrics", Wiley

http://www.vks.is/skak/olymp96r.html

http://www.rasch.org/

http://thomascool.eu
(my own site)