Guidebook Benchmarks

Standards

5 Is the Number

  • Each test case should be run 5 times, and averaged (see the sketch after this list).
    • 5 may seem somewhat arbitrary because it is.
    • We needed a standard, so we selected one.
    • 5 tests can be easily run by hand, and averaged by hand.
      • Automation scripts would create many additional files that must then be maintained to ensure completeness.
    • For many examples, more than 5 tests will not change the results significantly.1)2)
  • Only display the average.
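
The following is a minimal Python sketch of the 5-run convention, not the guide's actual harness; the file names input.txt and solution.py and the invocation are assumptions for illustration only.

  import subprocess
  import time

  RUNS = 5
  timings = []
  for _ in range(RUNS):
      with open("input.txt", "rb") as f:
          start = time.perf_counter()
          subprocess.run(["python3", "solution.py"], stdin=f,
                         stdout=subprocess.DEVNULL, check=True)
          timings.append(time.perf_counter() - start)

  # Only the average is displayed, in seconds, to 3 decimal places.
  print(f"{sum(timings) / RUNS:.3f}")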

Present Results in Seconds

  • Using one unit everywhere leaves little chance of misreading a result.
  • Stick to the convention of displaying 3 places after the decimal (see the formatting sketch after this list).
  • Wherever possible, avoid presenting results that would be time limit exceeded.
    • 1s is a good cutoff for many purposes, 2s for some others.
    • For a single column, a - will suffice.
    • Entire rows can be omitted when warranted.
    • Ignore this rule if it will mislead the reader.
      • E.g., if the last row is 10^5 at 0.030s, one would likely expect 10^6 to be 0.300s if the trend otherwise appears linear. If 10^6 actually takes 2 seconds (for reasons we cannot control), this is important information for the reader.
    • Ignore this rule if insufficient data would otherwise be presented.
      • A table with a single row is not typically useful. A table with a row at 0s, a row at almost 0s, and no other rows is also likely to be misleading.
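
A small sketch of the display convention in Python; the 1-second cutoff below is one of the suggested values above, not a hard rule, and the helper name is hypothetical.

  def format_cell(seconds, limit=1.0):
      """Render one results-table cell: 3 decimal places, or - past the limit."""
      if seconds is None or seconds > limit:
          return "-"  # time limit exceeded, or a deliberately omitted result
      return f"{seconds:.3f}"

  format_cell(0.0304)  # -> "0.030"
  format_cell(2.41)    # -> "-"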

Make a Change -> Rerun All Tests

  • If you make any changes to a test file, please rerun all associated tests.
    • To make this possible, put up a new files: page for any new benchmarks created.
    • Ensure that all associated files are included on this files: page, and are not links to other areas.
      • This duplication violates terseness, but it is important to guarantee that tests rely on only one page.
      • Otherwise (when many benchmarks refer to a single file location) it is impossible to know what to update while keeping all results consistent.
  • If you do not believe that you have changed the results, prove it rather than assuming that nothing changed.
  • In order for this guide to be usable, benchmarks must be implicitly worthy of trust.

Make All Efforts to Only Test Your Target

  • Keep all test cases as simple as possible.
  • Consider your tests carefully, and make efforts to only test the desired language feature.
    • For our standard input benchmarks we timed our entire test using the bash time builtin.
      • In all cases we stored the results of a read, but we did not maintain them between iterations.
      • E.g., we did not want to consider the cost of list growth outside of stdin.readlines(), where storing an entire file is unavoidable.
    • For our standard output benchmarks we decided to read the same data as our standard input benchmarks to maintain consistency, but did not want to account for the cost of reads.
      • Instead of the time builtin we opted for the standard Python3 timeit library to isolate only writes (see the sketch after this list).
  • Some level of judgment on the part of the author at the time is necessary to ensure that these efforts are made correctly.
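
As an illustration of the output benchmarks described above, the following Python sketch reads the test data once, outside the timed region, and uses timeit to measure only the writes. The file name input.txt and the line-by-line write strategy are assumptions for illustration; the input benchmarks instead wrap the entire run in the bash time builtin.

  import sys
  import timeit

  # Reads are paid for here, before any timing starts, mirroring the
  # standard input benchmarks without charging the writes for them.
  with open("input.txt") as f:
      lines = f.readlines()

  def write_all():
      for line in lines:
          sys.stdout.write(line)

  # 5 repeats of a single pass each; only the average is reported, in seconds.
  timings = timeit.repeat(write_all, number=1, repeat=5)
  print(f"{sum(timings) / len(timings):.3f}", file=sys.stderr)
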
1)
In our own tests this appears to hold true.
2)
If your own tests for a particular example vary wildly, it is possible that your source/tests should be adjusted to compensate and home in on what you are trying to test.