Iterative Reasoning, Arbitrary Results: Chain-of-Thought Prompt Engineering for APA Compliance, by Elliot E.C. Ping
This post is the seventh contribution to Notice & Comment’s symposium on AI and the APA. For other posts in the series, click here.
Artificial intelligence has been proffered as a solution to the administrative state’s efficiency woes ad nauseam over the past decade. These tools are so promising that automated administrative decision-making is more of an inevitability than a mere possibility, and some agencies have already begun deploying limited AI tools in various capacities.
Unfortunately, like all tools, AI comes with baggage. Among the most pressing, but well-known, issues is that the models necessary for making complex decisions are black boxes: we can’t describe how the model produces its outputs from its inputs, and the model generally can’t explain itself. This explainability limitation in turn limits the government’s ability to use AI tools responsibly, effectively, and sometimes legally.
The APA’s requirements touch on government use of AI in several ways. Most relevant here, a reviewing court must “hold unlawful and set aside” any “agency action, findings, or conclusions” that are arbitrary or capricious or, in the case of adjudications, unsupported by substantial evidence. 5 U.S.C. § 706(2)(A), (E). An agency decision can be found arbitrary and capricious if, among other things, the agency fails to adequately explain itself or if the decision was based on clear logical or factual errors. And the substantial evidence standard requires decision-makers to “show their work” and demonstrate that their conclusions are rational. The inscrutability of AI systems and their decisions, while perhaps not always rising to the level of arbitrariness and capriciousness per se, makes it difficult for agency users of these tools to carry their statutory burden.
To measure up to these APA-derived standards, a government AI tool needs to do at least three things relatively well. First, the model’s decisions must comport with the available evidence and accurately deploy that evidence in its reasoning. This also requires the model to be at least as unbiased as a human decision-maker would be (though hopefully it would perform better). Second, the model needs to produce explanations of how it arrives at its decisions. These explanations should have enough detail that another model or a human could at least theoretically replicate the results. Reproducibility does two things: first, it protects against “hallucinations,” or factually incorrect or irrelevant responses; and second, it serves as a safeguard against capricious outcomes. Finally, these explanations need to have fidelity to the actual reasoning process the model followed. It is not enough to justify decisions post hoc. Per Chenery I, agencies generally may not rely on after-the-fact justifications when defending their choices, so the models agencies use would ideally be able to describe the actual steps they took to get from input to output.
Computer scientists have been inching closer to methods that would allow an AI model to achieve this level of explainability and reliability. In one such method, chain-of-thought prompt engineering (CoT), the user prompts the model to produce explanations of its intermediate steps along the way to its final solution. CoT can be done in several ways. In a standard CoT prompt, the user demonstrates a similar type of problem with intermediate steps and adds it to her prompt as an example. The LLM then uses that example as it works through the problem in the prompt, generating intermediate steps as it goes. Of course, this sort of prompting requires a lot of human legwork beforehand, making it not ideal for problems in new domains and limiting the efficiency gains from using the tool. Simpler methods, like zero-shot CoT, just add a trigger phrase like “think step-by-step” to the initial prompt and let the model take it from there. Still others try to instruct the model on how, exactly, to develop its intermediate steps. Least-to-most CoT, for example, asks the model to break a larger problem down into smaller problems, solve each of the smaller problems in order from least to most difficult, and then generate a final solution.
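To make the differences among these prompting styles concrete, here is a minimal sketch of how each one might assemble the text sent to a model. The exemplar, the trigger phrase, and the instruction wording are illustrative assumptions of my own, not the actual prompts used in any study, and no real LLM is called here; the functions simply build prompt strings.

```python
# Hypothetical sketch of the three CoT prompting styles described above.
# All prompt wording is illustrative, not drawn from any cited study.

# A worked example with explicit intermediate steps, used for standard
# (few-shot) CoT prompting.
EXEMPLAR = (
    "Q: A hearing office closed 12 cases in March and twice as many "
    "in April. How many cases did it close in total?\n"
    "A: In March it closed 12 cases. In April it closed 2 x 12 = 24 "
    "cases. 12 + 24 = 36. The answer is 36.\n"
)

def standard_cot_prompt(question: str) -> str:
    """Standard CoT: prepend a human-written exemplar with steps."""
    return EXEMPLAR + f"Q: {question}\nA:"

def zero_shot_cot_prompt(question: str) -> str:
    """Zero-shot CoT: no exemplar, just a step-by-step trigger phrase."""
    return f"Q: {question}\nA: Let's think step by step."

def least_to_most_prompt(question: str) -> str:
    """Least-to-most CoT: instruct the model to decompose the problem."""
    return (
        f"Q: {question}\n"
        "First, break this problem into simpler subproblems. Then solve "
        "each subproblem in order, from easiest to hardest, and use "
        "those answers to produce the final solution.\nA:"
    )
```

The human legwork the post mentions lives entirely in EXEMPLAR: standard CoT needs a worked example for each new problem domain, while the zero-shot and least-to-most variants trade that effort for a generic instruction.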
Early studies on CoT showed a lot of promise. In one study involving simple arithmetic and logic problems, CoT improved accuracy and reduced hallucinations while also providing descriptions of how the model’s solutions were reached.
But other studies have revealed that CoT is flawed, sometimes fatally. CoT can make a model vulnerable to confirmation bias that impacts the reliability of both the model’s individual reasoning steps and its final solution. Models prompted with CoT also sometimes show “inconsistent reasoning,” starting off on the right track but changing course at the last minute. And, when more than one logically valid reasoning path is available, as is often the case in complex decision-making environments, models often struggle to select the correct way forward.
While CoT does reduce hallucinations, another study found that it can also obscure the usual markers of a hallucination, making it more difficult to identify them when they occur. Hallucinations can be identified by several methods, including by assessing how consistent a model’s answers are when it is given the same prompt or identifying and analyzing internal-state confidence metrics. When CoT is used, however, the model’s confidence at each step of the process is amplified. This changes the internal state of the model, making internal-state-based methods of detecting hallucinations less reliable. It also diminishes the ability of the model to evaluate its own accuracy or confidence.
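The answer-consistency method mentioned above can be illustrated with a small sketch: sample the model’s answer to the same prompt several times, and treat low agreement among the samples as a possible hallucination signal. The function names and the 0.6 threshold are my own illustrative assumptions, not parameters from the cited studies.

```python
# Illustrative sketch of a self-consistency hallucination check:
# low agreement across repeated answers to the same prompt is
# treated as a warning sign. Names and threshold are assumptions.
from collections import Counter

def consistency_score(answers: list[str]) -> float:
    """Fraction of sampled answers matching the most common answer."""
    if not answers:
        return 0.0
    most_common_count = Counter(answers).most_common(1)[0][1]
    return most_common_count / len(answers)

def flag_possible_hallucination(answers: list[str],
                                threshold: float = 0.6) -> bool:
    """Flag when the model's answers disagree too often across samples."""
    return consistency_score(answers) < threshold

# Three samples that mostly agree are not flagged...
print(flag_possible_hallucination(["36", "36", "42"]))  # prints False
# ...while three samples that all disagree are.
print(flag_possible_hallucination(["36", "42", "48"]))  # prints True
```

The post’s point about CoT fits naturally here: because CoT amplifies the model’s step-by-step confidence, repeated CoT runs can converge on the same fluent but wrong answer, letting a hallucination sail through exactly this kind of agreement check.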
The biggest risk likely comes from the combination of unreliable technology with overly trusting users. People are prone to “automation bias,” a phenomenon where the user places too much trust in the decisions of an automated system, deferring to it instead of seeking out additional information. And having high confidence in AI systems, or a lower sense of self-efficacy, reduces the tendency of knowledge workers to engage in critical thinking. These cognitive biases make reliance on inaccurate outputs more likely, especially when models provide a plausible explanation with low fidelity to the actual reason for a given decision.
Each of these flaws makes it more likely that a complex bureaucratic decision made partially or entirely by AI could be found arbitrary and capricious. Building effective government systems requires us to always be, as Professor Susan Rose-Ackerman put it, “seeking a difficult compromise between efficiency, transparency, participation, and accountability.” Finding the right balance requires agencies to avoid jumping the gun on efficiency-promising technology when that technology diminishes the integrity and trustworthiness of government decision-making. The development of CoT presents an interesting wrinkle in the transparency issues plaguing government AI, but this method is still in its infancy and likely won’t be the hero we need to resolve the AI explainability crisis. Still, as CoT, or whatever better method comes after it, advances, it may allow the adoption of government AI to comport with existing standards. Of course, the technical capabilities of any system should be only the starting point for an AI-adoption framework. Agencies still need to grapple with other difficult, more normative, questions. But interrogating whether new prompting methods allow an AI system’s outputs to meet the APA’s standards can serve as a useful threshold for agencies considering adopting an AI tool.
Elliot E.C. Ping is a 2025 graduate of Yale Law School currently serving as a judicial law clerk in Ohio.

