Notice & Comment

AI, Taxi Drivers, and Administrative Law, by Cary Coglianese

This post is the second contribution to Notice & Comment’s symposium on AI and the APA. For other posts in the series, click here.

I suspect we all know the type of person who is willing to offer a confident answer to just about any question. The answer need not always be correct, or even particularly sensible, but the person expressing it displays unalloyed fervor and unshakable certitude. This type of person can tell you exactly what is wrong with sports teams, Washington politics, and the culture at large. They will opine with just as much conviction about how to fix the potholes in city streets as to how to achieve world peace.

I have encountered this type of person from time to time when I have been a passenger in a taxicab. Along with driving me to my destination, some of my taxi drivers over the years have also been willing to share with me their diagnoses of all that ails the world and what government must or must not do in response. 

The most widely known forms of AI can, if not used thoughtfully, produce answers that are not entirely dissimilar from those offered by overconfident taxi drivers. These AI-produced answers sound convincing because of a combination of a degree of plausibility and an air of authority in expression. Agency officials need to be on their guard against over-reliance on untested and insufficiently thoughtful uses of AI. Even in the age of exploding reliance on AI tools like ChatGPT—which garners over 700 million users per week—courts will continue to demand the type of reason-giving long expected of agencies to ensure that adjudicatory and regulatory decisions are made on the basis of sound evidence and analysis—and not on the basis of the policy equivalent of taxi driver opinions.

I hasten to note, of course, that I do not mean to suggest that all taxi drivers are overly opinionated. Nor do I mean to deny that opinionated people can occupy all sorts of other jobs and walks of life. I have encountered colleagues with advanced degrees who are more than willing to venture confident answers to many tough questions well beyond their specialized domains of expertise. 

When encountering such people, no matter what they do for a living, it is not difficult to see through their bluster. We know their “type.” If they start opining about the decision a football coach or player ought to have made in a pivotal game over the weekend, we call them a Monday morning quarterback. When they opine on how to cut the budget deficit or reduce crime, we might admire their self-assurance, but we can also easily spot and dismiss their claims as mere speculation. 

Few of us presumably want the heads of administrative agencies to rely on this kind of bluster to make actual decisions with real consequences for society and the economy. Administrative law reflects this intuition. The U.S. Administrative Procedure Act (APA) directs courts to reject administrative decisions that are based more on hunches or overly confident opinions than on evidence and careful analysis. 

The canonical description of the APA’s arbitrary and capricious test holds that courts should set aside an agency’s decision if it has, among other things, “entirely failed to consider an important aspect of the problem, offered an explanation for its decision that runs counter to the evidence before the agency, or is so implausible that it could not be ascribed to a difference in view or the product of agency expertise.” The APA, one might say, does not countenance administrative decision-making that is based on little more than a taxi driver’s say-so. It demands reasoning based on an assessment of different policy alternatives and sound expertise.

What does this mean for AI? AI can powerfully perform many tasks. But large language models such as ChatGPT, Claude, Gemini, and Llama—along with other forms of “general-purpose AI”—can also respond with assured-sounding answers that contain errors (hallucinations) or that tell us what we want to hear (sycophancy). 

Here is a banal but striking illustration of AI’s overconfidence: A simple search for information about college football bowl games on New Year’s Day 2026 resulted in the AI tool, Gemini, reporting to users matter-of-factly that Indiana University had defeated the University of Alabama in the Rose Bowl—but it did so before the game even began. Of course, Indiana eventually won by a wide margin. But Gemini did the same thing for the Sugar Bowl—stating prior to the game’s start that Georgia had defeated Ole Miss without any apparent “awareness” that the game had yet to begin. Once that game was actually played, Georgia lost.

These models can respond just as matter-of-factly to policy-related questions that they are not capable of answering, at least not in a manner that by itself would satisfy the arbitrary-and-capricious standard of review. I can ask ChatGPT, for example, if the Environmental Protection Agency (EPA) should tighten its national ambient air quality standard (NAAQS) for ozone. It will give me an answer. If prompted further, it will even tell me exactly where the agency should set the standard, lowering it from 70 parts per billion (ppb) to 60 ppb. It will also express its recommendation succinctly and clearly—and with confidence—with answers like these: 

  • “Given the current scientific record, EPA’s regulatory judgment should be exercised to lower the ozone standard.” 
  • “EPA should set a revised primary ozone NAAQS at 60 ppb.”

Similar results can be obtained after prompting other large language models about EPA’s air quality standards. Because much has already been written about EPA’s standard for ozone, these models can draw on that existing text on the internet to supply not only answers to policy prompts, but also plausible-sounding reasons purporting to support their policy recommendations. 

This raises a question for administrative law. If I were the EPA Administrator, could I lawfully cite these responses from ChatGPT as the basis, on their own, for me to issue a rule lowering the ozone standard to 60 ppb? 

The short answer is “no.” Even putting aside that the Clean Air Act requires specific procedures for establishing ambient air quality standards (such as a consultation with an advisory committee), EPA officials would still need to do more if they wanted a lowered standard to withstand judicial scrutiny under the arbitrary-and-capricious test. Officials would need to prepare a thorough agency record. That record would need to demonstrate that officials had reviewed evidence on the health effects of ozone, given careful consideration to the effects of alternative ozone standards (including retaining the current one), and developed a reasoned account of the agency’s policy judgment for selecting the standard that it did. In short, agency officials would need to act much like they already do, rather than simply rely on an AI-generated recommendation.

This is not because AI can lawfully play no role in the rulemaking process. On the contrary, some current large language models and natural language processing tools might be quite helpful to administrators in performing various internal tasks. These tools might help in drafting or editing emails and other documents (think: Grammarly), sifting through large volumes of public comments, or performing other discrete administrative tasks. Throughout the administrative process, these tools can perform a wide array of routine tasks of a kind that might ordinarily be assigned to junior staff members or interns. 

Furthermore, as I have explained elsewhere, there is almost certainly nothing inherently impermissible under administrative law about an agency relying on a well-trained and validated AI tool to perform other, more fundamental tasks—even, in theory, to help make policy decisions. As one court put it in a pre-AI era decision, it simply must be shown that the “ultimate responsibility for the policy decision remains with the agency rather than the computer.” 

Such a showing could rest ultimately on the validation of an AI tool and its use. To pass muster, agencies that rely on AI “need to validate that the algorithm performs as intended and that it achieves the justified objectives.” When an AI tool can be validated to produce reliable answers—that is, if a digital algorithm can be shown to perform better than a human one—then the use of that tool can be sufficiently justified under prevailing administrative law principles. Indeed, the deliberate failure to rely on such a validated-as-superior AI tool would almost certainly prove to be arbitrary and capricious. 

What constitutes sufficient validation, of course, will vary. When an AI tool is designed to perform or assist with specific, repeatable tasks—such as with “traditional” or “narrow” AI—it is possible to rely on experience to validate just how well it performs intended tasks. This performance can be compared to real-world benchmarks or to decisions made by humans. 

But when it comes to the many sui generis decisions that administrators must make, the opinions offered by a general-purpose AI tool like ChatGPT will need a different type of validation. Agencies will need to show that these opinions are more than just the digital equivalent of overconfident expressions by a taxi driver. And at least at the present time, that will not be possible without providing the same kind of regulatory impact analysis that agencies already perform.

Tools like ChatGPT do not draw upon a comparative assessment of the consequences of different policy options. Rather, their responses are generated by complex algorithms that draw on patterns in vast quantities of existing text to predict strings of words that have a high probability of responding to a user’s question. Those probabilistically responsive words are often correct with respect to many of the varied questions and tasks that users pose, whether that is how to hard-boil an egg or how to draft a business letter.

But even though large language models can generate correct and highly useful responses to many questions asked of them, this does not mean that these answers are always correct. Nor does it mean that these answers are sufficient on their own to withstand courts’ expectations under the arbitrary-and-capricious standard. When it comes to the kind of policy questions that government administrators must answer when making consequential decisions that affect individuals and society, such as the setting of a nationwide air quality standard, answers to these one-of-a-kind questions simply cannot be validated by showing how often they are correct. The EPA, after all, basically sets just one national standard per pollutant. Moreover, validating an answer to a policy question like where to set a standard depends not on probabilistic choices about words but on judgments about real-world consequences.

The work of the Trump Administration’s so-called Department of Government Efficiency (DOGE) reveals the dangers of over-relying on confident-sounding results from general-purpose language-based AI tools. In an ostensible effort to eliminate unnecessary government spending in the Department of Veterans Affairs’ (VA) health care system, DOGE staff essentially just asked a large language model to identify government contracts that were not “directly supporting patient care.” These contracts were then treated as candidates for cancellation under the theory that the services they covered were either not needed or could be handled in-house by hospital staff. Even though the tool that DOGE used could provide instant and extensive responses with utter confidence, the tool fundamentally lacked the ability to make nuanced, context-specific judgments about medical care and the management of large health care facilities. Its mere reading of words in contracts did nothing to weigh the advantages or disadvantages of outsourcing different services.

As a result, DOGE’s AI tool apparently identified as worthy of cancellation contracts that provided critical safety equipment and other valuable health support services. The results were sketchy enough that even the DOGE employee who developed the code to assess the VA contracts would later concede that his AI tool made mistakes: “I would never recommend someone run my code and do what it says,” he reportedly said in an interview after leaving the government. “It’s like that ‘Office’ episode where Steve Carell drives into the lake because Google Maps says drive into the lake.” Administrative officials should heed this former DOGE employee’s simple advice: “Do not drive into the lake.” 

To put the point more generally, officials should not read more into the results of general-purpose AI tools than they can truly bear. Fortunately, courts applying the arbitrary-and-capricious standard are unlikely to allow them to do so. When faced with policy questions—such as whether or how to set an environmental standard, or whether to rescind such a standard—administrative officials will need to validate any answer that a general-purpose AI tool like ChatGPT might provide. To validate answers to these kinds of policy questions, officials will need to perform basically the same kind of analysis they have long needed to conduct to satisfy the arbitrary and capricious test: that is, some kind of regulatory impact analysis. 

Agencies will not be able to rely solely on today’s most ubiquitous forms of AI—namely, those based on ChatGPT and similar large language models—to avoid their obligation under the APA’s arbitrary and capricious standard to understand the problems they seek to solve, assess alternative solutions against legally relevant criteria, and make some kind of forecast about how these alternatives would change outcomes in the world. Administrators’ forecasts need to be about tangible outcomes, not about plausible-sounding words in sentences, however confidently they might be expressed. 

In the end, notwithstanding the high certainty with which the results of today’s AI may be expressed, administrative decisions under the APA must be grounded in more than just the digital equivalent of opinions expressed by even the most well-read taxi drivers.

Cary Coglianese is the Edward B. Shils Professor of Law and Professor of Political Science at the University of Pennsylvania, where he is the founder and director of the Penn Program on Regulation.