Sunday, May 18, 2014

Auto-test Topcoder problems with multiple correct answers

Topcoder has been improving its decade-old contest system a bit: allowing larger constraints, making it possible to set different time/memory limits per problem, etc. But the greatest change of them all, the one that actually brings problem variety rather than Fake Difficulty, is that some problems now allow multiple correct answers.

The first SRM to have these problems was SRM 617. The division 2 easy and division 1 medium allow multiple correct answers. They were interesting problems that would not have worked quite as well without this ability.

In SRM 617 div2 easy, you have to return two composite numbers that add up to `n`, where `n` is at least 20 and at most 1000. It turns out there is a very cool and easy solution: return `(4, n-4)` if `n` is even, else `(9, n-9)`. This works because for even `n`, `n-4` is an even number of at least 16, hence composite, and for odd `n`, `n-9` is an even number of at least 12, also composite. But if this problem had appeared in old Topcoder, how would we have led coders to this solution? We would have had to specify tons of things, like requiring the first integer to be the smallest composite that makes the result valid. But then there would be odd cases in which `(4, n-4)` is a valid answer, and the intended solution would become more complicated (not appropriate for div2 easy). Even if we did all that, the problem wouldn't be very nice, because the example cases' results would give the solution away too easily...

However, the bad thing about these new problems is that this is Topcoder, and many of us are used to Arena plugins that generate automatic testers based on the example test cases. If we used our old Arena plugins on these problems and our solution's correct results didn't exactly match the expected outputs, the automated tester would tell us that the results are wrong, even when they are actually right. We need some changes.

So after SRM 617 I made plans to work as hard as possible on making my Greed tester setup work around this. This is the story.

Identifying those problems

The first thing to do is to identify the problems that allow multiple correct answers for a test case. This requires help from the Topcoder Arena developers. During SRM 617 I told the admins to remember that the Arena plugin API needs a way to detect this. Then I waited, and waited. Then I figured that there is a thread I know that seems to be monitored closely by the Arena developers. Making a post in that thread seems to be the only way I have to contact them, and so I did. I really wish the plugin API was documented at all, and that each change was announced and explained and all new methods were known. But this is not the case. Even after receiving confirmation that this feature was in the API, I had to dig through the Jar looking for something that might be it:

problem.getComponent().isCustomChecker();
// where problem is com.topcoder.client.contestant.ProblemComponentModel

So now I just needed to add this feature to Greed, so that any template can identify these problems and then we can do the magic.

I made a pull request; it adds a Problem.HasCustomChecker field that you can use in templates: ${if Problem.HasCustomChecker}...${end}. The pull request is still not merged, so for now you will have to do some magic to compile Greed with it. If you want...

Modifying the tester templates

We also need to modify the tester so that it can make use of this information. But how?

My idea of how to allow automated testers in these problems consists of two parts:

  • My tester already supports test results in which it is not known whether the returned answer is correct, and it can report them as such (using a ? symbol). Testing is still useful because you get the results of all examples in a single run, and you can detect any example that times out or has a runtime error. So the first idea is to make the generated test code report every test case this way when the problem has this flexibility.
  • However, that doesn't help when the coder also wants or needs to detect wrong answers. The only workaround is to let the user code a checker method for the returned output.

In many problems, it is easier to check whether an output is correct than to solve the problem. Take the problem described above: it is easy to check that the returned value is a pair of positive integers that add up to `n` and are both composite. So we can make a tester.

In other problems, it may not be so easy to know whether a result is correct without actually solving the problem. But if you already know one correct result for that input, it is still easy to write a checker. For example, take a problem that asks for the shortest sequence satisfying some conditions: if the provided example result is a sequence of 3 elements, then any returned sequence with more than 3 elements is wrong, and the checker can report as much by making use of that known result, as in the sketch below.
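
To make this concrete, here is a minimal sketch of such a checker for a hypothetical shortest-sequence problem (the class name and the partial-verification rule are made up for this illustration):

class ShortestSequenceProblem:
    def greedCheckResult(self, testInput, returnedAnswer, expectedAnswer):
        # Sketch: we may not be able to verify the sequence conditions fully,
        # but if we know one correct answer, anything longer must be wrong.
        if expectedAnswer is not None and len(returnedAnswer) > len(expectedAnswer):
            return -1   # longer than a known correct sequence: wrong
        return 0        # otherwise we cannot tell: unknown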

My idea is to have a checker method in the problem class. It takes the input, the returned output, and the example's expected output (if it exists). If the checker method returns 0, it is not known whether the provided answer is correct; if it returns -1, we know it is wrong; and if it returns 1, we know it is correct. Initially this method returns 0 in all cases, but the user can modify it.

I implemented this in Python; it looks like this:

class SilverbachConjecture:
    def solve(self, n):
        if n % 2 == 0:
            return (4, n - 4)
        else:
            return (9, n - 9)

# CUT begin
#-------------------------------------------------------------------------------
    def greedCheckResult(self, testInput, returnedAnswer, expectedAnswer):
        (n,) = testInput
        # You can implement a custom checker.
        # Return value: -1 is WA, 0 is unknown and 1 is AC.
        # If expectedAnswer is not None, it is a known correct answer.
        if len(returnedAnswer) != 2:
            return -1
        (a, b) = returnedAnswer
        if (a + b != n) or (a <= 0) or (b <= 0):
            return -1

        def isComposite(x):
            p = 2
            while p * p <= x:
                if x % p == 0:
                    return True
                p += 1
            return False

        return 1 if isComposite(a) and isComposite(b) else -1
# CUT end
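
For a quick sanity check you can call the checker by hand (this is not part of the generated file, just a demonstration):

s = SilverbachConjecture()
print(s.greedCheckResult((20,), s.solve(20), None))   # 1: (4, 16) are both composite
print(s.greedCheckResult((21,), (10, 11), None))      # -1: 11 is prime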

The trick is that the generic tester is a separate file, but I updated it a bit to detect whether a greedCheckResult method exists; if it does, the tester uses it for the required logic.
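
The detection itself is tiny; a minimal sketch, equivalent to what the full run_testcase code below does:

if hasattr(instance, 'greedCheckResult'):
    check = instance.greedCheckResult(testIn, result, expected)   # -1, 0 or 1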

I am not sure how to do this in C++ yet, but I should try something tomorrow.

Besides modifying the generic tester, I also modified my Python template. This is the part that makes use of the HasCustomChecker field:

# ${Contest.Name} - ${ClassName} (${Problem.Score} points) by vexorian :

${<if Problem.Description.Modulo}
MOD = ${Problem.Description.Modulo}

${<end}
class ${ClassName}:
 def ${Method.Name}(self, ${Method.Params}):
    return ${Method.ReturnType;zeroval}
${<if Problem.HasCustomChecker}


${CutBegin}
#-------------------------------------------------------------------------------
 def greedCheckResult(self, testInput, returnedAnswer, expectedAnswer):
    (${Method.Params},) = testInput
    # You can implement a custom checker.
    # Return value: -1 is WA, 0 is unknown and 1 is AC.
    # If expectedAnswer is not None, it is a known correct answer.
    return 0

${CutEnd}
${<end}

${CutBegin}
${<TestCode}
${CutEnd}

The relevant changes to the generic tester affect only the run_testcase function:

def run_testcase( problemClass, methodName, testInExpected, caseNo, totalCase, testTimeOut, finalLines, compactMode ):
    (testIn, expected) = testInExpected # so that it compiles in python3
    if compactMode != ONLY_REPORT: 
        sys.stdout.write(TEST_COLOR + "Test " + str(caseNo) + COLOR_RESET + ": " + prettyStr(testIn)+"\n")
    startTime = time.time()
    instance = problemClass()
    exception = None
    try:
        result = getattr(instance, methodName)(*testIn)
    except Exception as e:
        ln = None
        for x in traceback.extract_tb(sys.exc_info()[2]):
            if x[0] == problemClass.__name__+'.py':
                ln = x[1] 
        if ln is None:
            exceptionShort = '%s, (unknown line)'%( type(e).__name__ )
        else:
            exceptionShort = '%s, line: %d'%( type(e).__name__, ln )
        exception = traceback.format_exc()
        
    elapsed = time.time() - startTime   # in sec
    if compactMode != ONLY_REPORT:
        sys.stdout.write("Time: %.02f seconds.\n" % elapsed)

    knownAnswer = (expected is not None)
    if exception is not None:
        if compactMode != ONLY_REPORT:
            sys.stdout.write("RUNTIME ERROR: \n")
            sys.stdout.write(exception)
        correct = False
    else:
        if hasattr(instance, 'greedCheckResult'):
            check = instance.greedCheckResult(testIn, result, expected)
            correct = (check >= 0)
            knownAnswer = (check != 0)
        else:
            correct = expected is None or tc_equal(expected, result)
        if compactMode != ONLY_REPORT:
            if not correct or not knownAnswer:
                sys.stdout.write("Desired answer:\n")
                sys.stdout.write("\t%s\n" % prettyStr(expected) )
            sys.stdout.write("Your answer:\n")
            sys.stdout.write("\t%s\n" % prettyStr(result) )
    
    res = '?'                    # unknown: answer could not be verified
    if exception is not None:
        res = 'E'                # runtime error
    elif not correct:
        res = 'X'                # wrong answer
    elif elapsed > testTimeOut:
        res = 'T'                # exceeded the time limit
    elif knownAnswer:
        res = '+'                # correct
    
    # final lines:
    s = ( TEST_COLOR + "t" + prettyCase(caseNo, totalCase) + COLOR_RESET + ": " + colorTestResult(res) )
    s += (" (%.2fs)" % elapsed)
    if res in ('?', 'X'):
        s += (" (" + prettyStr( result)+ ")" )
    elif res == 'E':
        s += (" " + exceptionShort)
    finalLines.append(s)

    if compactMode != ONLY_REPORT:
        sys.stdout.write(" %s\n" % colorTestResult(res) )
        sys.stdout.write( BAR_COLOR + ('='*BAR_LENGTH) + COLOR_RESET + "\n")
    return res

I'll eventually make a more formal release of this update once the field is added to the official Greed and I figure out a clean way to do this in C++.

Sure, it is possible that the checker you implement has a bug and thus reports your results as correct when they are not, so you submit and fail system tests. But it is better than nothing.

Victory (for now):
