Outsourcing Thought Is Going Great

This is a quick one.

Never mind that AI cybersecurity backfires.

Never mind that AI deskilling is a problem.

This is about LLMs in software on the assumption that it is good, actually. Which it is not.

Anecdote

I’m not advertising what it is, but it isn’t really a secret.

Surprisingly, I have a day job. In that day job, I had a conversation the other day which made me rant a little.

We were discussing various software-related things, and the conversation, heavily paraphrased, went something like this:

Them:: So of course we have to explore what capabilities AI provides us with regards to software development. For example, it’s great at generating test code…
Me:: NOOOOOOOOOOOOOOOOOOOOOOOOOOOOooooooooooooooo!!!!!1111111oneoneeleven

No, of course it didn’t literally go like this, but close enough.

Contract

This led me to eventually give a presentation and my work place on Design by Contract. It didn’t draw a huge crowd, but we had a great conversation about it afterwards. But the TL;DR is:

Software interfaces are contracts.

This applies to any interface, so library/module APIs, web service APIs, but also user interfaces, whether graphical or command line.

The contract is always the same in structure: the software promises to do something. And the user relies on that in their work.

Because users need to rely on this, interfaces are contracts.

Contracts & Testing

At whichever level you’re implementing tests, unit or integration, etc. your test is checking if the contract is fulfilled.

That’s it.

Tests don’t really check if the software does the right thing. We sometimes like to think about it this way, but that is false.

Consider this:

def add(a, b):
  return a - b

def test_add():
  assert add(10, 2) == 8

In no way does this test ensure that the function actually performs an addition. But it does ensure that the implicit contract (which is about subtraction) is being adhered to.

You can argue that the test does the wrong thing. And that’s fair.

Contract Testing

As a quick aside, there’s also a thing called “contract testing”. The idea is that you specify the contract of an API a little more formally. Then implementers implement the code, and consumers of the API write test code.

In a sense, contract testing is the outside-in look. On the implementation side, design by contract is probably the better approach.

End of intermission.

Generated Test Code

Oh, I just needed something I can access without login. It doesn’t really matter.

So I asked ChatGPT to generate test code for me. The result is below.

Even if the code is cut off a little in the screenshot above, it is clearly visible that the output is very comparable to what I did above.

I have to admit one thing: I first gave the task to test the function when it was still called add(). ChatGPT correctly stated that the function does a subtraction, and that I may have misnamed it.

It nonetheless generated the same test code.

And that is the problem.

Outsourcing Thought

The whole point of this very simple exercise is to demonstrate that the real issue with generating code isn’t whether or not the code is syntactically correct, or does something sensible.

The problem is that the contract is badly tested; as we noted above, the test does the wrong thing. It should test whether the documented contract of “function performs addition” is adhered to. Instead, it tests the actual implementation, which does subtraction.

We’re not off to a good start.

This is because in this scenario – where we ask the LLM to generate test code – we’re essentially outsourcing the thinking part.

Plenty of people have noted the bottleneck in software development isn’t actually writing code. Generating test code is not addressing the bottleneck.

The bottleneck is specifying precisely enough what it is we want, i.e. the thinking part. When we ask the an LLM to generate test code, yes, it writes code. But that is the side effect.

The main thing is that we outsource the specification or thinking process, and that fails rather spectacularly even in a simple case.

Inverting the Approach

So I am sometimes found to say that if you absolutely must generate code, the last thing you want to do is generate test code. Write the test code yourself, and let the LLM generate the production code.

That feels wrong to a lot of developers. And probably, that says a whole lot about the usefulness of LLMs here more succinctly than this article does, but hey, let’s not go there.

Let’s instead invert the approach.

Cucumber/Gherkin

I was CTO of a testing “stay-up”, which developed a product for uploading cucumber test results, and allowing users to visualize and explore changes in the software’s behaviour over the development lifespan.

Cucumber’s thing of behaviour driven development is essentially the same as test driven development, except the cucumber guys will tell you this is missing the point.

They’re not wrong, they’re not right. It is the same, but it also contains a crucial difference, namely that Gherkin allows you to codify high-level product requirements, while TDD tends to focus on unit testing.

The important difference is that devs and product people can use Gherkin to converge on the same language. That is actually great!

But the practice, in my experience, is that no product person will write Gherkin, or write it well enough. So in the end, it ends up as a developer tool, and consequently the Gherkin reads very low-level, like unit tests.

I have a background in software testing, which led me to work a fair bit with cucumber. The nice feature about it is the Gherkin specification language.

Now I’ve created a .feature file that takes the very same test scenarios that ChatGPT produced above, and then asks it to generate stuff for me.

First off, here’s the specification:

Feature: ChatGPT code generation
  Developers should be able to specify Gherkin feature defintions, and LLMs
  should generate good code from those.

Scenario Outline: generate a function
  Given that I am a developer
  When I provide <first> and <second> to a function
  Then the function should return <result>

  Examples:
    | first | second | result |
    |   5   |      0 |      5 |
    |  -3   |     -5 |      2 |
    |  -3   |      5 |     -8 |
    |   3   |     -5 |      8 |
    |  10   |      5 |      5 |
    |   0   |      0 |      0 |
    | 100   |     50 |     50 |
    | -10   |     10 |    -20 |

Prompt

I fed this to ChatGPT with the following prompt:

I have a cucumber feature file. Could you generate step definitions implemented with rspec for me, as well as a function that passes those tests?

The slop generator started blahblahing about keeping stuff idiomatic and so forth. Fine, get on with it, I pasted the feature file contents.

Step Definitions

Starting with the good, the step definitions are actually quite good. Now keeping in mind that this is still a very basic example, anything less would have been a major disappointment:

Given("that I am a developer") do
  # No setup needed for this scenario
end

When("I provide {int} and {int} to a function") do |first, second|
  @result = generated_function(first, second)
end

Then("the function should return {int}") do |expected|
  expect(@result).to eq(expected)
end

Reasoning and Function

So let’s examine ChatGPT’s reasoning and output.

Nice feature – this is a good example of deriving behavior purely from scenarios. If you look closely at the examples, the rule the function must follow is:

If the second number is negative → subtract it (i.e., first - second)

Otherwise → return the absolute difference (|first - second|)

Um…

Did… did it really complicate this example massively?

Looking at the code, then yes. Yes it did.

def generated_function(first, second)
  if second < 0
    first - second
  else
    (first - second).abs
  end
end

Technically correct is the best kind of correct, right?

But is it technically correct?

Taking the last example of inputs -10 and 10, for now. If I feed that to this function, we’re entering the else branch, because 10 >= 0. So this should result in:

(-10 - 10).abs == (-20).abs == 20

That isn’t even what is specified!

Result

So, I took an extremely simple example, a true coding 101 thing. And the model:

Fails at generating good test code, because it tests the wrong thing.
Given the same tests to start with, fails to generate reasonable code.

Are we really trusting this?

If code generation were the issue we need to solve, then at minimum one would expect a model to be reasonable either at generating test code, or at generating production code.

It seems they’re not reasonable. They can’t be, because they don’t think.

But the way we use these tools, whether we admit to it or not, is to outsource the actual hard part: reasoning, analysis, our thoughts.

And predictably, statistical parrots fail at that.

Well, duh.