EuRoC C2S2 Benchmarking Rules

Re: EuRoC C2S2 Benchmarking Rules

Post by Christian Ott » Mon 21. Mar 2016, 11:13


Thank you for sharing your opinions about the evaluation process.

The maximum number of 75 points refers only to the difficulty points. According to the rules described in the instructions of the deliverable it is indeed possible to have an achievement value higher than the target, which possibly can lead to more than 75 awarded points.

For a specific example, let us consider the case {B=0, T=1, D=10} and an achievement of A=1.1. Then the formula gives P = 10 * 1.1 / 1.0 = 11 points.
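The arithmetic in this example can be reproduced with a short sketch. It assumes the points for one metric are computed as P = D * (A - B) / (T - B), which matches the numbers above but is not quoted from the deliverable template; the function name `metric_points` is hypothetical.

```python
def metric_points(baseline, target, difficulty, achievement):
    """Points for one metric, assuming P = D * (A - B) / (T - B)."""
    if target == baseline:
        raise ValueError("target must differ from baseline")
    improvement = (achievement - baseline) / (target - baseline)
    # No cap at 1.0 is applied: the thread states that overperformance counts.
    return difficulty * improvement


# The case from the post: {B=0, T=1, D=10} with A=1.1.
points = metric_points(baseline=0.0, target=1.0, difficulty=10.0, achievement=1.1)
print(round(points, 6))  # 11.0
```

Meeting the target exactly (A=1.0) would give the full 10 difficulty points under the same assumption.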

Hope this clarifies the situation.

Christian Ott
Posts: 1
Joined: Mon 21. Mar 2016, 11:01



Re: EuRoC C2S2 Benchmarking Rules

Post by fabrizio.caccavale » Fri 25. Mar 2016, 16:07

Dear All.

Let me add some more elements to the explanations by Christian.

1) The difficulty coefficients are capped at 75 in total, to be distributed among all objectives/metrics (the experts may choose not to distribute all 75 points). How these coefficients are distributed will depend on the experts' judgement of each objective. It is very unlikely that all 75 points will be assigned to a single metric, unless (a) they judge all the other metrics to be trivial/unambitious, and (b) the only metric with non-null difficulty points is judged extremely difficult/ambitious (such that it deserves all 75 points).

2) The improvement is not capped at 100%. Thus, "overperforming" is possible when achievement > target (or achievement < target, in the case target < baseline).
When deciding on the scoring system we discussed this at length, and we decided to let teams overperform for a number of reasons. Of course, setting a trivial target in order to overperform is not a good idea, since this would lead the experts to assign very low difficulty coefficients.
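As a rough illustration of the two overperformance cases, the sketch below assumes the uncapped improvement is (A - B) / (T - B); this form is inferred from the example earlier in the thread, not taken from the template. The function name and the numbers in the second case are made up.

```python
def improvement(baseline, target, achievement):
    """Uncapped improvement ratio, assuming (A - B) / (T - B)."""
    if target == baseline:
        raise ValueError("target must differ from baseline")
    return (achievement - baseline) / (target - baseline)


# Target above baseline: an achievement above the target overperforms.
print(improvement(baseline=0.0, target=1.0, achievement=1.1))   # 1.1

# Target below baseline (lower is better): an achievement below the
# target overperforms. Hypothetical values.
print(improvement(baseline=10.0, target=5.0, achievement=4.0))  # 1.2
```

In both cases the ratio exceeds 1, so the awarded points would exceed the difficulty coefficient of that metric.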

The template we provided was, in my view, quite clear about the above issues, since capping of the achievement/improvement was never mentioned (the formulas for the improvement simply reported the cases A<B and A>=B), while it was clearly stated that at most 75 points are to be distributed across all objectives. However, if it was not clear enough, please accept my apologies.

3) Not only is the splitting of the 75 difficulty points a very difficult task; the whole evaluation of Stages II and III is a quite innovative, and thus complex, task. But developing innovative approaches to comparative performance evaluation is a major objective of EuRoC. That is why we are going to ask all challengers for help in determining the difficulty points (more on this in the coming week).

4) The general rules and procedures for evaluation were presented at the Benchmarking workshop in June 2015 and described in the document "Benchmarking Rules and Evaluation Procedures" (sent to all challengers). Moreover, I sent a clarification regarding the coolness factor in December 2015. The text of this e-mail is reported below:

"Coolness Factor:
Please note the coolness of your freestyle makes another 15 points, which you can earn in addition to the points you earn through the quantifiable objectives.
With regard to coolness factor: As Prof. Behnke points out, this criterion is hard to quantify and the evaluation will be very subjective.
The coolness factor will be judged solely by the independent experts on a scale from 0 to 15 points, using the videos you will shoot at the challenge host's facilities, the reports prepared by the Challenge Host and the Challenger Team, and the interviews with the Challenger Team.
As this criterion is the same for all teams, you don’t need to include it in the quantifiable objectives."

I hope this helps to clarify all issues. Please feel free to ask if you need further clarification.

Please don't forget to send your revised deliverable for the quantifiable objectives of the free-style task by the evening of Monday, March 28. If you don't, we will assume that you are happy with the last version you sent us.

Best regards.


nunolau wrote: We would also like to have the complete specification of the Freestyle Evaluation Process.
We were surprised when we were informed that teams can actually have Achievements that are greater than one in a single metric, if they surpass the Target. In fact, we think this is a bad idea.
Now, it is not clear if the evaluation based on the quantifiable objectives/metrics will be saturated at 75 points or not.
Can a team theoretically achieve the 75 points from a single metric by obtaining an Achievement significantly higher than its Target? We were obliged to define at least 3 metrics, hence we find it strange that teams can get the maximum evaluation from a single metric, unless the maximum is unbounded (which is also strange, in our opinion).
The splitting of the 75 difficulty points is, in any case, a very difficult task, but we feel it is much more difficult if the Achievements can be greater than one.
We would also like to know how the coolness factor evaluation is merged into the final result and whether there are any additional evaluation factors (interview?, others?).
Could you provide a specific example on how the evaluation will be performed when a team obtains Achievements with a classification higher than 1?

Posts: 8
Joined: Tue 15. Jul 2014, 11:16

Re: EuRoC C2S2 Benchmarking Rules

Post by nunolau » Sat 26. Mar 2016, 22:17

Dear Fabrizio and Christian,

Thanks for the new clarifications on the free-style evaluation process.

We appreciate your effort in this difficult task of finding the best way to evaluate free-style and to make it as clear and fair as possible.
We have also been actively involved, as can be seen from previous messages on the EuRoC Forum, in helping to improve the Challenge 2
evaluation. We will, of course, follow the evaluation methodology that you select and try to get the best score for our team.

However, robotics task performance evaluation is still a very open
research topic, and thus open for discussion. We would therefore like to explain why we believe that allowing extra points for
achievements higher than 1 and, consequently, having an unbounded score for the product difficulty/achievement may not be a good idea.

The rules require every team to state at least 3 metrics and specify that the sum of all difficulty points should be 75
or less (less in case the targets seem trivial, unambitious, etc.).

It is true that nothing in the provided information rules out achievements higher than 1, but we thought that achieving the Targets would
actually be the objective of the teams, and not surpassing the targets they have themselves specified. The word “target” used in the
evaluation rules clearly points in that direction.

With achievements higher than 1, a single misjudgment on the difficulty of a single metric can severely skew the evaluation.

With achievements higher than 1 and unbounded products, in theory, a team can get almost all of its points from a single metric,
if it has a really high product difficulty/achievement (which may be the result of a misjudgment of the difficulty of the target).
We believe this is not good for the evaluation process.

Achievements higher than 1 also make the task of splitting the difficulty points much harder, as experts will not only have to assess
the difficulty of achieving the target but also the possibility of surpassing it, possibly several times over.

If the product difficulty/achievement is not bounded, then the weight attributed to the coolness factor, which we thought would be 16.7%
(15/(75+15)), will be undetermined and different for each team. Summing several evaluation factors of which some are bounded (coolness)
and others are unbounded (product difficulty/achievement) also seems strange.
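The dependence of the coolness weight on the metric score can be sketched as follows. This assumes the final score is a plain sum of the metric points and the coolness points, which the thread suggests but does not spell out; the function name and the metric totals above 75 are hypothetical.

```python
COOLNESS_MAX = 15.0  # the coolness factor is judged on a scale from 0 to 15


def coolness_weight(metric_points, coolness_max=COOLNESS_MAX):
    """Share of the coolness maximum in (metric points + coolness maximum)."""
    return coolness_max / (metric_points + coolness_max)


# 75 is the bounded case; 100 and 150 are made-up overperformance totals.
for metric_points in (75.0, 100.0, 150.0):
    print(metric_points, f"{coolness_weight(metric_points):.1%}")
# 75.0 16.7%
# 100.0 13.0%
# 150.0 9.1%
```

Under this assumption the coolness share shrinks as a team's metric points grow beyond 75, so it differs from team to team.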

We thought that making the difficulty points saturate at 75 called for a relative evaluation of the difficulty of the metrics of each team
(independently of the others), where metrics judged to be more difficult would be awarded more points, those judged easier would
be awarded fewer points, and the sum would not exceed 75 (only in cases of trivial or less ambitious metrics would
this value be less than 75). The standard case would be that every team would have 75 points.
With achievements saturating at 1, if a team meets all its targets it will have 75 points (or fewer if not all points have been distributed),
and if one of the metrics is not fully achieved the team will have fewer points.
However, if achievements higher than 1 are permitted, then, we think that it would also make more sense to have an absolute difficulty
scale, as in this case the final score may depend on how much the achievements will surpass 1 and this may be very different for different
teams. Of course, experts may attribute 75 points to the most difficult set of metrics provided by one of the teams and then normalize
the difficulty points assigned for other teams based on that value, but this is a quite elaborate reading of the provided documents.
Is this what you had in mind?

Finally, we believe that you are doing very good work towards an optimal free-style evaluation, which is indeed a quite challenging
and difficult task. Although it is still not clear to us, given the previous points, it may be that there are other, more relevant advantages
to having achievements higher than 1, and that some of the concerns we raised here are not that important.

We would like to finish by stating that we completely trust your final decision on the evaluation procedure.

Best regards,
Posts: 3
Joined: Thu 29. Oct 2015, 11:54

