Guidebook Benchmarks

Standards

5 Is the Number

  • Each test case should be run 5 times, and averaged.
    • 5 may seem somewhat arbitrary because it is.
    • We needed a standard, so we selected one.
    • 5 tests can be easily run by hand, and averaged by hand.
      • Automation scripts would create many additional files that must be maintained to ensure completeness.
    • For many examples, > 5 tests will not change the results significantly.1)2)
  • Only display the average (a minimal sketch follows this list).
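
A minimal sketch of the standard, using placeholder numbers (the run times shown are hypothetical): record five runs, average them, and report only the mean, in seconds.

  runs = [1.412, 1.398, 1.407, 1.421, 1.403]   # five recorded wall-clock times, in seconds (hypothetical values)
  average = sum(runs) / len(runs)
  print("average: {:.3f} s".format(average))   # display only the average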

Present Results in Seconds

Make a Change -> Rerun All Tests

  • If you make any changes to a test file, please rerun all associated tests.
    • To make this possible, put up a new files: page for any new benchmark you create.
    • Ensure that all associated files are included on that files: page, and are not links to other areas.
      • This duplication violates terseness, but it is important to guarantee that each set of tests relies on only one page.
      • Otherwise (when many benchmarks refer to a single file location) it is impossible to know what to update while keeping all results consistent.
  • If you do not believe your change affects the results, prove it by rerunning the tests rather than assuming nothing has changed.
  • For this guide to be usable, benchmarks must be implicitly trustworthy.

Make All Efforts to Only Test Your Target

  • Keep all test cases as simple as possible.
  • Consider your tests carefully, and make efforts to only test the desired language feature.
    • For our standard input benchmarks we timed our entire test using the bash time builtin.
      • In all cases we stored the results of a read, but we did not maintain them between iterations.
      • E.g., we did not want to consider the cost of list growth, except in stdin.readlines(), where storing an entire file is unavoidable.
    • For our standard output benchmarks we decided to read the same data as our standard input benchmarks to maintain consistency, but did not want to account for the cost of the reads.
      • Instead of the time builtin, we opted for the standard Python3 timeit library to isolate only the writes (see the sketch after this list).
  • Some level of judgment on the part of the author is necessary to ensure that these efforts are made correctly.
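
As a hedged illustration of the write-isolation approach described above (a minimal sketch, not the actual benchmark file; the names and structure are assumptions), timeit can time only the writes while the read happens once, outside the measured region, and the five results are averaged per the standard above:

  import sys
  import timeit

  # Sketch only: read all input up front so the read cost is excluded
  # from the timed region (the list growth here is unavoidable).
  lines = sys.stdin.readlines()

  def write_all():
      # The work under test: write every stored line back to stdout.
      for line in lines:
          sys.stdout.write(line)

  # timeit.repeat returns one total time (in seconds) per repetition;
  # per the 5-run standard, average them and display only the mean.
  results = timeit.repeat(write_all, repeat=5, number=1)
  print("average: {:.3f} s".format(sum(results) / len(results)), file=sys.stderr)

Something along the lines of python3 write_bench.py < input.txt > /dev/null (the file name is hypothetical) would run it without mixing the timing line into the benchmark's own output, since the average is printed to stderr.
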
1) In our own tests this appears to hold true.
2) If your own tests for a particular example vary wildly, it is possible that your source/tests should be adjusted to compensate and home in on what you are trying to test.