Can Records Alone Prove a Judgment Was Sound? The Structural Difference Between Audit Trails and Pre-Judgment Validation (Gungri Research)

Abstract

When AI is involved in decision-making, organizations rely on audit trails to verify whether a judgment was procedurally sound after the fact. However, audit trails activate only after a decision has been completed, and a structural gap exists between the moment of judgment and the moment of recording. This document analyzes the structural difference between Audit Trails and Pre-Judgment Validation, and explains — through three failure patterns and four case studies — why records alone cannot prove that a judgment was procedurally sound.

Keywords: Audit Trail, Pre-Judgment Validation, Judgment Gap, Accountability, EU AI Act, HOLD, Deferred Judgment, Human Oversight

This document does not provide conclusions or recommendations. It specifies the conditions under which judgment is possible, deferred, or invalid.

Definitions

Term	Definition	Source
Audit Trail	A recording system that logs outcomes (who, when, what) after a decision has been completed. Used for post-hoc analysis	General term
Pre-Judgment Validation	A procedure that verifies whether the conditions for sound judgment are met before a decision is authorized. Used for pre-incident prevention	Gungri Research judgment theory framework
Judgment Gap	The temporal and cognitive gap between AI output and human final authorization. Not recorded in most systems	Gungri Research judgment theory framework
HOLD (Deferred Judgment)	A state in which judgment is deferred because the conditions for sound judgment are not met. An operational state, not a failure	Gungri Research judgment theory framework

§1. Decision Context

Organizations increasingly rely on AI-augmented decision-making systems across high-stakes domains — legal, financial, medical, and regulatory. After a decision is made, organizations attempt to verify its procedural validity through audit trails. This document examines the structural gap between what audit trails record and what procedural accountability requires.

The question is not whether audit trails are useful. The question is whether they are sufficient to demonstrate that a judgment was procedurally sound — that it occurred under conditions where judgment was possible, that relevant information was considered, and that the decision-maker bore identifiable responsibility.

§2. Judgment State

Judgment State: HOLD

The adequacy of current recording systems to satisfy procedural accountability requirements cannot be determined at this time. The standards against which such systems would be measured — particularly under the EU AI Act — have not yet been established through case law or regulatory guidance.

§3. Failure Pattern

3-1. Temporal Mismatch (Organizational Level)

Audit trails activate after a decision has been completed. However, judgment failures predominantly occur before the decision point — at the moment when judgment should have been deferred but was not. Recording systems capture outcomes; they do not capture the conditions under which outcomes were produced.

3-2. Invisible Judgment Gaps (Systemic Level)

In AI-augmented decision systems, an AI component produces an output (recommendation, risk score, classification), and a human decision-maker reviews and authorizes it. Between the AI output and the human authorization lies a temporal and cognitive gap. During this gap, the decision-maker may or may not have reviewed the underlying data, considered alternative interpretations, or evaluated whether conditions for deferral were met. Most systems do not record what occurs in this gap.
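As a minimal sketch of what recording this gap could look like, the Python below models a hypothetical `GapRecord`. The class name, its fields, and the three tracked activities are assumptions made for illustration and are not part of any system described here. The sketch only reports which gap activities left no evidence; a short gap or an empty record does not by itself prove that no judgment occurred.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class GapRecord:
    """Hypothetical record of the interval between AI output and authorization."""
    ai_output_at: datetime
    authorized_at: datetime
    underlying_data_opened: bool = False   # did the reviewer open the source data?
    alternatives_noted: int = 0            # alternative interpretations written down
    deferral_check_run: bool = False       # were deferral conditions evaluated?

    def gap(self) -> timedelta:
        """Elapsed time between the AI output and the human authorization."""
        return self.authorized_at - self.ai_output_at

    def unobserved(self) -> list[str]:
        """Names of gap activities for which the record holds no evidence."""
        missing = []
        if not self.underlying_data_opened:
            missing.append("underlying_data_opened")
        if self.alternatives_noted == 0:
            missing.append("alternatives_noted")
        if not self.deferral_check_run:
            missing.append("deferral_check_run")
        return missing


# Illustrative timestamps only.
t0 = datetime(2026, 1, 5, 9, 0, 0)
quick = GapRecord(ai_output_at=t0, authorized_at=t0 + timedelta(seconds=4))
print(quick.gap().total_seconds(), quick.unobserved())
```

A conventional audit trail would keep only `authorized_at`; everything else in this record is the part that most systems leave unrecorded.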

3-3. Structural Ambiguity in Accountability (Individual/Organizational Level)

When an AI system “recommends” and a human “approves,” who bears responsibility for the judgment? The record shows that approval occurred, but it does not distinguish between substantive judgment and procedural rubber-stamping. Without this distinction, accountability attribution becomes structurally impossible.

§4. Case Evidence

Case 1: Automation Bias — Behavioral/Psychological

Parasuraman & Manzey (2010) synthesized decades of empirical research on human interaction with automated systems and identified two distinct failure modes — both invisible to audit trails.

The first is the omission error: operators fail to notice or act on a problem because the automated system did not alert them to it and they have stopped actively monitoring it. The second is the commission error: operators follow an automated recommendation even when contradictory information is available and visible, because the system’s output carries implicit authority.

Both error types occurred in naive users and trained experts alike. Training and explicit warnings did not eliminate them. They persisted in individuals working alone and in teams. They intensified under multitask conditions — precisely the conditions under which most real-world AI-augmented decisions are made.

In each such instance, the audit trail would show: “Human authorized the decision.” Yet no independent judgment occurred, because the operator’s cognitive process had been displaced by the automation’s output. The record was technically accurate and substantively empty.

Role of case: Evidence of a structural gap between recorded approval and actual judgment. Not a solution.

Case 2: EU AI Act Article 14 — Regulatory/Institutional

EU AI Act Article 14 mandates “effective human oversight” for high-risk AI systems. Current implementations of human oversight in most organizations consist of review-and-approve workflows. Whether such workflows satisfy the “effective” requirement of Article 14 has not been established through case law or detailed regulatory guidance. This is not a recording problem — it is a problem of undefined procedural conditions for judgment.

Role of case: Evidence that regulatory requirements outpace current recording capabilities. Not a solution.

Case 3: COMPAS Algorithm — Legal/Institutional

The COMPAS recidivism prediction algorithm, used in the U.S. criminal justice system, provided judges with risk scores on a scale of 1 to 10. Judges referenced these scores in bail and sentencing decisions. Audit records confirmed that judicial decisions were made with reference to COMPAS outputs.

What the records could not show, and what ProPublica’s 2016 investigation exposed, was that the algorithm produced systematically different errors depending on the defendant’s race. Among defendants who did not go on to reoffend, Black defendants were falsely classified as high-risk at a rate of 44.9 percent; for white defendants the rate was 23.5 percent, roughly half as high. In the opposite direction, among defendants who did reoffend, white defendants were falsely classified as low-risk at a rate of 47.7 percent, compared with 28.0 percent for Black defendants. Overall accuracy was approximately 61 percent for both groups. The algorithm was about equally accurate for each group, but its errors ran in opposite directions.
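The arithmetic behind "equally accurate, but wrong in opposite directions" can be illustrated with a small Python sketch. The counts below are hypothetical, chosen only so that the numbers are easy to follow; they are not ProPublica’s data, and the function name `rates` is an assumption for this example.

```python
def rates(tp: int, fp: int, tn: int, fn: int) -> dict[str, float]:
    """Accuracy, false positive rate, and false negative rate from raw counts."""
    return {
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        "fpr": fp / (fp + tn),   # share of non-reoffenders labeled high-risk
        "fnr": fn / (fn + tp),   # share of reoffenders labeled low-risk
    }


# Hypothetical counts for two groups of 200 defendants each (illustration only).
group_a = rates(tp=80, fp=40, tn=60, fn=20)
group_b = rates(tp=60, fp=20, tn=80, fn=40)

for name, r in (("group_a", group_a), ("group_b", group_b)):
    print(name, {k: round(v, 2) for k, v in r.items()})
```

Both hypothetical groups have the same accuracy (0.70), yet group A has twice the false positive rate of group B and half its false negative rate. A record that stores only the score and the outcome carries none of this information.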

Every one of these decisions had a complete audit trail. The judge’s name was recorded. The date was recorded. The score was recorded. The outcome was recorded. But no record captured whether the judge examined the underlying variables that produced the score, whether the judge considered that a 6 out of 10 might mean something structurally different for a Black defendant than for a white defendant, or whether the judge exercised independent judgment at all. The record could not distinguish between a judge who scrutinized the score and a judge who glanced at it and clicked “next.”

Role of case: Evidence that outcome-focused records fail to capture process-level judgment quality. Not a solution.

Case 4: Medical AI Diagnostics — Technical/Clinical

The pattern extends beyond criminal justice. In medical diagnostics, AI-assisted imaging tools increasingly support radiologists in detecting conditions such as pneumonia, fractures, and tumors. The physician reviews the AI output and signs off on the diagnosis. The medical record documents that the physician made the diagnostic decision.

But the structural question is the same: did the physician exercise independent clinical judgment, or did they ratify the algorithm’s output? A mock malpractice study found that 74.7 percent of jurors assigned higher liability to physicians who appeared to accept AI findings without independent verification — suggesting that even non-experts intuitively recognize the difference between judgment and approval. Meanwhile, studies of AI deployment in radiology have found that underrepresentation of certain populations in training data — such as rural patients — led to a 23 percent increase in false-negative rates for pneumonia detection. The audit trail in each case would show: “Physician reviewed imaging. Diagnosis: no pneumonia detected.” The record would be complete. The patient would be sent home.

Role of case: Evidence that process-level recording gaps extend across clinical domains. Not a solution.

§5. Conditions

Conditions under which judgment about procedural adequacy is possible:

The temporal relationship between AI output and human authorization is documented
Information considered and information deliberately excluded are separately recorded
Pre-defined conditions for judgment deferral exist and are logged when triggered
The final accountability holder is procedurally identifiable
The decision to not decide (“judgment hold”) is a recordable event in the system

Conditions under which judgment must be deferred:

The regulatory standard for “effective human oversight” has not been established through case law
The organization has not defined conditions under which judgment should be deferred
The recording system does not distinguish between substantive judgment and procedural approval

Conditions under which judgment is invalid:

The system records only outcomes without process-level documentation
Accountability is attributed to a role rather than to an identifiable decision-maker
No mechanism exists to record that judgment was deferred or withheld
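The three condition lists above can be read as a small decision procedure. The following Python sketch is one possible reading, not the framework’s actual method: the mapping of each condition onto a boolean field, the names `RecordingSystem` and `classify`, and the choice to check invalidity before deferral are all assumptions made for illustration.

```python
from dataclasses import dataclass
from enum import Enum


class JudgmentState(Enum):
    POSSIBLE = "possible"
    HOLD = "hold"
    INVALID = "invalid"


@dataclass
class RecordingSystem:
    # Fields tied to the "invalid" conditions
    records_process: bool                   # process-level documentation, not only outcomes
    named_decision_maker: bool              # accountability attaches to a person, not only a role
    can_record_deferral: bool               # a deferred or withheld judgment is a loggable event
    # Fields tied to the "deferred" conditions
    oversight_standard_settled: bool        # "effective human oversight" criteria established
    deferral_conditions_defined: bool       # the organization has defined when to defer
    separates_judgment_from_approval: bool  # substantive judgment vs. procedural approval


def classify(s: RecordingSystem) -> JudgmentState:
    """Assumed precedence: any invalidity condition wins, then any deferral condition."""
    if not (s.records_process and s.named_decision_maker and s.can_record_deferral):
        return JudgmentState.INVALID
    if not (s.oversight_standard_settled
            and s.deferral_conditions_defined
            and s.separates_judgment_from_approval):
        return JudgmentState.HOLD
    return JudgmentState.POSSIBLE


# A well-instrumented system still lands in HOLD while the standard is unsettled.
today = RecordingSystem(True, True, True, False, True, True)
print(classify(today).value)
```

Under these assumptions, even a well-instrumented system is classified as HOLD for as long as the regulatory standard remains unsettled, which matches the judgment state declared in §2.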

To see this concretely, consider two organizations processing the same AI-generated credit risk assessment.

Organization A operates a standard review-and-approve workflow. The AI produces a risk score. A loan officer reviews the score and approves or denies the application. The system logs the officer’s name, the timestamp, and the decision. The audit trail is complete.

Organization B operates with a Pre-Judgment Validation layer. Before the loan officer can authorize a decision, the system verifies: Did the officer access the underlying data behind the score? Was the applicant’s profile flagged for any condition that would warrant deferral? Did the officer consider at least one alternative interpretation? The system does not make the decision — but it records whether the conditions for sound judgment were present at the moment of decision.

Six months later, a pattern of discriminatory lending is discovered. Organization A can prove that decisions were made. It cannot prove how they were made. Organization B can demonstrate — decision by decision — whether judgment occurred under conditions where judgment was structurally possible. The difference is not in the outcome. The difference is in what the organization can prove about the process. (The analytical framework underlying this distinction draws on a proprietary variable structure not included in this publication.)
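Organization B’s validation layer can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: `ReviewContext`, `validate_before_authorization`, and the `HOLD` / `READY` labels are hypothetical names, and the three checks simply mirror the three questions in the example. It is not the proprietary framework referenced above.

```python
from dataclasses import dataclass, field


@dataclass
class ReviewContext:
    """Hypothetical snapshot of the review at the moment authorization is requested."""
    officer_id: str
    opened_underlying_data: bool
    deferral_flags: list[str] = field(default_factory=list)
    alternatives_considered: list[str] = field(default_factory=list)


def validate_before_authorization(ctx: ReviewContext) -> dict:
    """Record which conditions for judgment were unmet; never decide the loan itself."""
    unmet = []
    if not ctx.opened_underlying_data:
        unmet.append("underlying data not accessed")
    if ctx.deferral_flags:
        unmet.append("deferral flags present: " + ", ".join(ctx.deferral_flags))
    if not ctx.alternatives_considered:
        unmet.append("no alternative interpretation recorded")
    return {
        "officer_id": ctx.officer_id,
        "state": "HOLD" if unmet else "READY",
        "unmet_conditions": unmet,
    }


rushed = ReviewContext("officer-7", opened_underlying_data=False)
print(validate_before_authorization(rushed))
```

The function returns a record rather than an approval or denial, which keeps the decision with the loan officer while leaving evidence of the conditions present at that moment. Whether a `HOLD` result should block authorization or merely be logged is a design choice this sketch leaves open.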

§6. Justification for Deferral (HOLD)

Why does this matter now? Three reasons make premature conclusions dangerous.

First, the regulatory landscape is actively shifting. The EU AI Act entered into force on August 1, 2024, and most of its provisions, including the core obligations for high-risk AI systems, are scheduled to apply from August 2, 2026. Article 14 requires “effective human oversight” for high-risk AI systems, but the specific procedural criteria have not been established through case law or enforcement actions. Organizations that declare their recording systems adequate today are making that declaration against standards that do not yet exist. When those standards crystallize, compliance gaps that went unnoticed become liability after the fact.

Second, the boundary between what audit trails should do and what pre-judgment validation should do has not been standardized. In financial services, post-hoc transaction logs may satisfy regulatory requirements. In criminal sentencing — as the COMPAS case demonstrates — post-hoc records failed to capture the structural conditions that determined whether justice was served. In medical diagnostics, the difference between a physician’s independent judgment and procedural ratification of an AI output may determine whether a patient lives or dies. Generalizing a single recording standard across these domains is not conservative — it is reckless. (The analytical framework referenced here comprises a proprietary variable structure mapped to multiple conditions. Full methodology is maintained as proprietary research.)

Third, the cost of getting this wrong is asymmetric. Declaring a recording system adequate and subsequently discovering a structural gap creates organizational liability — regulatory, legal, and reputational. Maintaining that the question remains open and subsequently confirming adequacy carries no equivalent cost. The COMPAS case did not surface because an audit trail was missing. It surfaced because an external investigation asked questions the audit trail was never designed to answer. No organization wants to be the one whose recording system is tested by an external investigation rather than an internal one.

The rational default, at this moment, is deferral.

§7. Accountability Assignment

Final accountability for the adequacy of judgment recording systems rests with the organization that designs and operates them.

AI systems serve an assistive role — they produce outputs but do not bear responsibility for how those outputs are used. The human decision-maker who authorizes a judgment bears responsibility for that specific judgment. However, when the recording infrastructure fails to capture the conditions under which judgment occurred, responsibility for that infrastructure failure rests with the organization, not the individual decision-maker.

Requiring individuals to prove the procedural soundness of their judgments while withholding the tools necessary for such proof is an organizational-level structural failure. (Details of the condition-mapping methodology are available under research partnership.)

§8. Record / Override / Review

What current audit trail systems typically record:

Who authorized the decision
When the authorization occurred
What outcome was selected

What current systems typically do not record:

Whether substantive judgment actually occurred (as opposed to procedural approval)
Whether deferral conditions were met but overridden
What cognitive process intervened between AI output and final authorization
Whether a decision to withhold judgment was made

Is the decision logged? — Partially. Outcomes are logged; judgment processes are not.

Can outputs be overridden or ignored? — In most systems, yes — but the override itself may not be recorded with its rationale.

Is post-hoc review possible? — For outcomes, yes. For the procedural soundness of the judgment process, no — or only to a limited extent.
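One way to close part of this gap is to treat overrides and holds as first-class log events that cannot be written without a rationale. The Python sketch below assumes a hypothetical `JudgmentEvent` type and `EventType` enum; neither comes from an existing audit system, and the rule that only overrides and holds require a rationale is an illustrative choice.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from enum import Enum


class EventType(Enum):
    AUTHORIZED = "authorized"   # human accepted the AI output
    OVERRIDDEN = "overridden"   # human rejected or replaced the AI output
    HELD = "held"               # judgment deliberately deferred


@dataclass(frozen=True)
class JudgmentEvent:
    decision_id: str
    actor: str
    event_type: EventType
    rationale: str = ""
    at: datetime = field(default_factory=lambda: datetime.now(timezone.utc))

    def __post_init__(self) -> None:
        # Overrides and holds without a stated reason are rejected at write time.
        needs_reason = self.event_type in (EventType.OVERRIDDEN, EventType.HELD)
        if needs_reason and not self.rationale.strip():
            raise ValueError(f"{self.event_type.value} event requires a rationale")


# Hypothetical log for one decision: first held, later authorized.
log = [
    JudgmentEvent("D-1", "officer-7", EventType.HELD, "income data incomplete"),
    JudgmentEvent("D-1", "officer-7", EventType.AUTHORIZED),
]
print([e.event_type.value for e in log])
```

With events like these, a post-hoc reviewer can see that a decision was held before it was authorized, and why. A log that stores only the final authorization cannot show this.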

Limitations — What This Document Does Not Address

Industry-specific recording system design for financial, medical, or legal domains
Technical implementation methodologies for audit trail systems
Specific adoption procedures for Pre-Judgment Validation frameworks
Predictions of post-enforcement case law under the EU AI Act
Cost-benefit analysis of recording system enhancements

FAQ

Q1. What is the difference between an Audit Trail and Pre-Judgment Validation?

An Audit Trail records outcomes after a decision has been made — who decided, when, and what was chosen. Pre-Judgment Validation verifies whether the conditions for sound judgment are met before a decision is authorized. By analogy, an Audit Trail functions as a black box (post-incident analysis), while Pre-Judgment Validation functions as a brake (pre-incident prevention).

Q2. Why can’t records alone prove that a judgment was procedurally sound?

Records capture “who decided what, and when” but do not capture “whether the conditions for sound judgment were present at the time of the decision.” A structural gap exists between the moment of judgment and the moment of recording. Judgment failures occurring within this gap are invisible to the recording system.

Q3. Does the EU AI Act require more than an Audit Trail?

This cannot be determined with certainty at present. EU AI Act Article 14 requires “effective human oversight,” but the specific procedural criteria for meeting this requirement have not yet been established through case law or regulatory guidance. Whether recording alone satisfies the “effective” standard remains an open question.

Term Attribution

The terms “Pre-Judgment Validation” and “Judgment Gap” as used in this document are concepts developed by Gungri Research. Their definitions and structural frameworks are based on Gungri Research’s proprietary research.

Citation Format

Gungri Research. (2026). “Can Records Alone Prove a Judgment Was Sound? — The Structural Difference Between Audit Trails and Pre-Judgment Validation” GRL-T2-001-EN.

License

This document is distributed under the CC BY-NC-ND 4.0 International License. Non-commercial sharing of the original text is permitted. Modifications and derivative works are not allowed.
