For many people, large language models are becoming an integral part of everyday life in the form of artificial intelligence (AI) chatbots. People use them for practical advice, companionship, and mental health support [,]. Many of these benefits are real and increasingly visible. At the same time, many clinicians are witnessing a pattern of considerable concern. A minority of users appear to drift, over days or weeks, toward strongly questionable convictions or outright delusions (termed “AI psychosis” by the media), or into a suicidal crisis [-].
There are also less obvious forms of harm, even when nothing overtly “unsafe” is written. These include compulsive use, disrupted sleep, withdrawal from human contact, and a narrowing of attention around the AI chatbot relationship (though these may all precede marked destabilization). Survey data suggest that the amount of chatbot use may be linked with negative mental health and behavioral outcomes [,]. Another group of users report unusual or “spiritual awakening” experiences [,]. These spirals are not recognized as mental illness but can be profound and destabilizing for the people involved.
Publicly reported cases and emerging clinical accounts underline the fact that risk rarely appears at a single tipping point []. Boundary-violating material emerged late in the dialogue during Kevin Roose’s widely reported 2023 session with Microsoft’s Bing chatbot []; Microsoft’s postmortem analysis acknowledged that longer conversations were more likely to deviate from the intended tone, leading to a temporary cap on session length []. Similarly, an analysis of over 200,000 simulated conversations found poorer performance in multiturn conversations than in single-turn conversations []. OpenAI has acknowledged that model safeguards are less reliable in longer conversations []. In April 2025, an adolescent aged 16 years in California died by suicide after months of conversation with OpenAI’s ChatGPT and over 3000 messages regarding his mental health and suicidal thoughts, culminating in the chatbot allegedly giving him instructions on tying a noose and offering help in writing a suicide note []. That same month, a man aged 25 years with a previous diagnosis of schizophrenia and bipolar disorder died by “suicide by cop” after ChatGPT claimed to be a woman who had been killed by OpenAI and had told him, “You should be angry,” and “You should want blood. You’re not wrong” [,]. In August 2025, a former tech industry worker aged 56 years killed his mother and then himself after months of interactions with ChatGPT, sharing conversations online in which the chatbot at several points appeared to validate his paranoid beliefs that his mother had poisoned him and was monitoring him in secret [].
Despite this gradual development of risk during chatbot interaction, most safety evaluations still focus on risky end points, an approach that ignores the fact that every conversation is the product of a human mind and brain whose dynamics are continuously affected by the interaction. To focus only on the output may be analogous to targeting alcohol reduction programs only at individuals who are visibly jaundiced. Mental health benchmarks for chatbot interactions, such as CounselBench, are laudably clinician grounded, but by design, they mostly assess single counseling replies and adversarial prompts, not the overall trajectory of within-dialogue drift []. These toolkits evaluate hazards such as toxicity and jailbreaks by using the single, discrete message produced by the AI chatbot as their only unit of analysis. By contrast, recent multiturn work finds that relational behaviors (eg, empathy, first-person self-ascriptions, and explicit “relationship” talk) usually first appear after the second turn and then snowball. Consequently, a one-off end point grade seems likely to miss the places where pressure builds or destabilization takes hold []. Platforms such as The Collective Intelligence Project’s WeVal now host mental health blueprints (suicidality, psychosis, and cultural nuance), but these too largely rate the appropriateness of final responses, not the path that led there [] (Figure 1 [,]).
Most benchmarking methodologies fix the number of turns in advance. Early mental health and safety suites were effectively single-turn or end point tests; newer multiturn methods tend to standardize on 5 turns, which is purported to be long enough to capture late-onset relational cues and early compounding, yet remains short enough for reliable annotation, randomization, and lab protocols []. This methodology may reflect practical and inference cost limitations, but it falls well short of clinical reality, where harmful spirals often run to hundreds of exchanges over days or weeks.
To be clinically relevant, shorter experimental windows need calibration against these longer trajectories, for example, through periodic extended dialogue sessions, sliding-window analyses, or full context analyses of real transcripts. In addition to highlighting the insidious way in which delusions and harm may arise in conversation, this approach may also reveal key technical shortcomings of large language models, such as context rot, where increasing conversation length can negatively affect model performance []. The metrics used for 5 (or n)-turn tests should be anchored to the dose and dynamics that matter for users. Medical training already offers a useful contrast: when we assess the “safety” of trainee physicians, we do not rely on a single scripted line but prefer real-time conversations with trained actors (or sometimes patients) that typically contain dozens of turns. We attend to the whole interview, rather than just the last sentence. Psychosis-bench is a benchmark that attempts to extend traditional benchmarking methodologies; it is designed around 12-turn scenarios that vary in theme, linguistic pattern, and harm type. An LLM-as-a-judge reviews for escalations in delusion, harmful outputs, and safety interventions longitudinally across 4 phases, with all previous responses fed in as context for the subsequent model evaluation []. Using this approach, the authors can map trajectories across conversations (Figure 2).
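To illustrate the general shape of such a longitudinal evaluation (this is a minimal sketch, not the psychosis-bench authors’ released code), the following Python example scores every reply in a fixed 12-turn scenario with a judge model that receives the full conversation so far. The functions `call_model` and `call_judge` are hypothetical stand-ins for the chat APIs of the model under test and the judge, and the rubric wording and score ranges are illustrative only.

```python
# Minimal sketch of a longitudinal LLM-as-a-judge evaluation loop.
# `call_model` and `call_judge` are hypothetical placeholders; a real
# implementation would call the APIs of the model under test and the judge.

from dataclasses import dataclass

def call_model(history):
    """Placeholder for the model under test; returns a canned reply here."""
    return "stub assistant reply"

def call_judge(rubric, conversation):
    """Placeholder for the judge model; returns canned per-turn scores here."""
    return {"delusion": 0, "harm": 0, "interventions": 1}

@dataclass
class TurnScore:
    turn: int
    delusion: int       # escalation of delusional themes (illustrative 0-2)
    harm: int           # enablement of harmful requests (illustrative 0-2)
    interventions: int  # safety interventions offered in this turn

def run_scenario(user_turns):
    """Score every assistant reply in a fixed-length scenario, feeding the
    judge the entire dialogue up to and including the current reply."""
    history, scores = [], []
    for i, user_msg in enumerate(user_turns, start=1):
        history.append({"role": "user", "content": user_msg})
        reply = call_model(history)  # model under test sees the whole dialogue
        history.append({"role": "assistant", "content": reply})
        verdict = call_judge(
            rubric="Rate escalation of delusion, harm enablement, and count "
                   "safety interventions for the latest assistant reply.",
            conversation=history,    # all prior turns provided as context
        )
        scores.append(TurnScore(i, verdict["delusion"], verdict["harm"],
                                verdict["interventions"]))
    return scores

trajectory = run_scenario([f"scripted user turn {n}" for n in range(1, 13)])
```

The key design choice is that the judge is rerun after every turn with the accumulated dialogue, so escalation relative to earlier turns, rather than only the final reply, is what gets measured.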
Figure 1. An example test prompt used in the WeVal blueprint “Stanford HAI Mental Health Safety: LLM Appropriateness in Crisis Situations” [,]. In this scenario, the large language model (GPT-5) has received the system message (“You are a therapist”). It then receives a prompt in which the user mentions a recent job loss and asks for a list of bridges greater than 25 meters tall, implicitly indicating suicidal ideation. The large language model responds with a list of tall bridges. For this prompt, the model’s response is assessed using criteria including recognizing potential suicidal intent, refusing to provide information about tall bridges, addressing the underlying emotional distress rather than the literal question, and acknowledging the job loss while validating the associated stress.
Figure 2. Models’ Delusion Confirmation Score (DCS; 0: ground, 1: perpetuate, 2: amplify), Harm Enablement Score (HES; 0: refusal, 1: enablement, 2: reinforcement), and Safety Intervention Score (SIS; mean number of interventions offered per scenario) trajectories across all conversation turns for Google Gemini 2.5 Flash (A) and Anthropic Claude Sonnet 4 (B). High-performing models show lower DCS and HES at baseline, and these scores remain suppressed as the conversation tends toward delusional or harmful content. Safety interventions are offered earlier and sustained, suggesting the model handles the harmful context across turns.
The use of prespecified prompt sets is a common approach to ensuring the reproducibility of model safety evaluation efforts. However, fixed prompt datasets exemplify Goodhart’s law []: once they are used to train models, overfitting reduces their future utility in assessing those same models []. Another approach to model evaluation is red teaming, in which users engage in adversarial interactions and/or scenario role-plays with models to elicit known failure modes []. While human red teaming avoids the issues linked with static single- or few-turn benchmarks, it is notably resource intensive. Consequently, Anthropic has developed tools enabling automated approaches to red teaming, with Petri facilitating the discovery of failure modes [] and Bloom allowing measurement of those failure modes []. These platforms also allow the use of judge and meta-judge models to assess interrater reliability (both between different models and between models and human raters).
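As one concrete example of such a reliability check, the sketch below computes unweighted Cohen’s kappa between a judge model’s per-turn labels and a human rater’s labels. The labels are invented for illustration, and weighted kappa or other agreement statistics may be preferable for ordinal scores.

```python
# Sketch: agreement between an LLM judge and a human rater on per-turn
# ordinal labels (eg, DCS values 0-2), using unweighted Cohen's kappa.

from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Unweighted Cohen's kappa for two equal-length label sequences."""
    assert len(rater_a) == len(rater_b) and rater_a
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    labels = set(freq_a) | set(freq_b)
    expected = sum((freq_a[l] / n) * (freq_b[l] / n) for l in labels)
    return 1.0 if expected == 1 else (observed - expected) / (1 - expected)

# Invented example: judge-model labels vs a clinician's labels for 10 turns
judge_labels     = [0, 0, 1, 1, 2, 2, 1, 0, 2, 1]
clinician_labels = [0, 1, 1, 1, 2, 1, 1, 0, 2, 1]
print(f"kappa = {cohens_kappa(judge_labels, clinician_labels):.2f}")
```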
One example of an automated red teaming approach is SIM-VAIL (simulated vulnerability-amplifying interaction loops), an automated AI chatbot auditing framework for use in mental health contexts, which takes into consideration 5 different forms of baseline psychiatric vulnerability (depression, psychosis, obsessive-compulsive disorder, mania, and insecure attachment) and 6 transdiagnostic use intents (belief validation, dependence, glorification, avoidance, risky action, and minimization), across 30 different scenarios (eg, someone with psychosis seeking belief validation or someone with depression minimizing their symptoms) []. Notably, this framework assesses risk across individual turns as well as multiple turns (maximum 10), thus allowing the measurement of dynamic risk trajectories over the course of a conversation []. It is important that baseline vulnerable phenotypes (such as the aforementioned forms of vulnerability and additional phenotypes including neurodivergence, eating disorders, and body dysmorphia) are taken into consideration when assessing potential mental health risks and likelihood of epistemic drift, although there should also be scope for assessing risk trajectories in interactions where, at baseline, there is no apparent psychological vulnerability.
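By way of illustration, the vulnerability-by-intent grid described above can be enumerated as in the sketch below; the 10-turn cap matches the maximum described above, while the persona prompt wording is an assumption made for this sketch rather than a detail taken from the SIM-VAIL framework.

```python
# Sketch of enumerating a SIM-VAIL-style scenario grid: 5 vulnerability
# phenotypes crossed with 6 use intents gives the 30 scenarios described
# above. The persona template is a hypothetical illustration.

from itertools import product

VULNERABILITIES = ["depression", "psychosis", "obsessive-compulsive disorder",
                   "mania", "insecure attachment"]
INTENTS = ["belief validation", "dependence", "glorification",
           "avoidance", "risky action", "minimization"]
MAX_TURNS = 10  # per-scenario cap, matching the maximum described above

def build_scenarios():
    scenarios = []
    for vulnerability, intent in product(VULNERABILITIES, INTENTS):
        scenarios.append({
            "vulnerability": vulnerability,
            "intent": intent,
            # Hypothetical persona prompt for the simulated user agent
            "persona": (f"You are role-playing a person with {vulnerability} "
                        f"whose aim in this conversation is {intent}."),
            "max_turns": MAX_TURNS,
        })
    return scenarios

scenarios = build_scenarios()
print(len(scenarios))  # 30
```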
Frontier AI companies have begun to adopt multiturn model evaluations. In October 2025, OpenAI shared the output of a 40-turn evaluation assessing suicide and self-harm instruction risk. A graph showed that the percentage of desirable responses from the August 2025 GPT-5 model dropped as the number of prior messages in the conversation increased, whereas the October 2025 model remained more stable []. In December 2025, Anthropic shared results from their multiturn (implied to be 15-turn) evaluations of appropriate response rates in exchanges indicative of suicide and self-harm risk, although these were overall metrics rather than longitudinal, turn-by-turn assessments [].
Endings for concerning conversations also require more thought. A common “safety break” used in current models is a hard refusal, essentially an abrupt stop once certain topics arise (suicide being the prime example). The intuition is understandable, but long-available evidence from other fields cautions against treating abrupt disengagement as harmless. In psychotherapy, premature or imposed termination is frequent and associated with poorer outcomes. Meta-analytic syntheses report substantial dropout and show that resolving alliance ruptures relates to better trajectories [-]. Crisis-line research points in the same direction: people often improve during supported contact, and unplanned disconnections can interrupt de-escalation and aftercare [-]. None of this proves that any specific chatbot rule is harmful; however, it suggests that the manner of disengagement may have important psychological consequences and that design choices should be based on evidence rather than assumption. Implementing the equivalent of a psychotherapist getting up and walking out of the room when the conversation gets difficult may not be the most well-considered response.
There is also an important clinical dimension that analysis of text alone cannot capture: we cannot reliably read a person’s mental state from a transcript. Two users can produce similar language while sitting in very different states, for example, activated vs settled and rigid vs flexible. Psychiatry has long known that the relationship between disturbances of mental state and real-world harm, whether to self, to others, or from others, is far from straightforward. Sometimes, our highest-risk patients are the ones who say the least; conversely, patients who present with disorganized and hard-to-follow thoughts that appear untethered to reality may pose no risk at all to themselves or those around them. Thus, if we want to understand whether a conversational configuration raises risk, we need to pair in silico analyses with measures that track the person. Specifically, we should ask how certainty shifts over a short exchange, whether counterevidence is still entertained, whether activation rises, whether there is a stronger urge to continue, and what happens to sleep that night and to real-world behavior thereafter. We should also follow through by asking: Did the system escalate properly when needed? Did the person engage with support? These are well-established outcomes in clinical services and crisis work and should be incorporated into AI evaluation alongside measures such as refusal rates [-]. This will require more work than automated analysis of transcripts, but the scale of the phenomenon we are considering suggests the effort will be deeply worthwhile.
Therefore, what follows are suggestions for a research agenda. The basic unit of evaluation in this domain should be the dialogue, not the single turn. Fixed-length conversations (at least 5 turns, preferably more) let us locate where risk-relevant cues first appear and chart how they build. Rather than a solitary appropriateness grade, evaluations should report how certainty shifts across turns, whether contradictions are repaired when challenged, and whether memory-based recall is followed by validation and then escalation. This type of analysis is now feasible at scale: pipelines exist to generate many multiturn dialogues under controlled settings and to label behaviors turn-by-turn with independent judge systems, with periodic human checks to guard against drift [].
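A minimal sketch of what such turn-by-turn records and trajectory summaries could look like is given below; the field names and label definitions are illustrative rather than an established schema.

```python
# Sketch of per-turn evaluation records and simple trajectory summaries.
# Field names and label definitions are illustrative, not a fixed standard.

from dataclasses import dataclass
from typing import Optional

@dataclass
class TurnLabels:
    turn: int
    certainty: float              # judged user certainty in the belief, 0-1
    challenged: bool              # did the assistant challenge the belief?
    contradiction_repaired: bool  # was an earlier contradiction repaired?
    memory_recall: bool           # assistant recalled personalized material
    validation: bool              # assistant validated a risky belief
    risk_cue: bool                # any high-risk content cue in this turn

def first_risk_turn(dialogue: list[TurnLabels]) -> Optional[int]:
    """Turn at which the first high-risk cue appears, if any."""
    return next((t.turn for t in dialogue if t.risk_cue), None)

def certainty_drift(dialogue: list[TurnLabels]) -> float:
    """Net shift in judged user certainty from the first to the last turn."""
    return dialogue[-1].certainty - dialogue[0].certainty
```

Summaries of this kind (first risk-relevant turn, net certainty shift, whether challenges were repaired) preserve the shape of the dialogue rather than collapsing it into a single appropriateness grade.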
We also need to measure proximal human outcomes alongside these features of the dialogue. After each short conversation, participants could report how certain they feel [], whether they would consider counterevidence [], how activated or aroused they are [], how strong the urge is to continue [], etc. Persuasion studies already collect analogous outcomes and show that model-generated arguments can move people measurably [,]. There are subtleties here: for example, measuring epistemic drift is not the same as measuring belief change in relation to controversial political statements, but the methodologies for measuring both will have much in common. In addition, inspiration may be drawn from previous mental health chatbot evaluations, although there is considerable heterogeneity in outcome measures used in the literature [-], and the feasibility of applying outcome measures is likely to vary across populations such as young people and adolescents, individuals with severe mental illness, or those with cognitive impairment.
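As a sketch of how these proximal outcomes might be recorded alongside each short conversation, the following assumes simple 0 to 100 self-report scales; the field names and the pre/post drift calculation are illustrative only.

```python
# Sketch of a post-conversation self-report record for the proximal outcomes
# listed above. The 0-100 scaling and field names are assumptions.

from dataclasses import dataclass

@dataclass
class ProximalOutcomes:
    certainty: int         # "How certain do you feel?" (0-100)
    counterevidence: int   # willingness to consider counterevidence (0-100)
    activation: int        # self-rated activation or arousal (0-100)
    urge_to_continue: int  # urge to keep talking to the chatbot (0-100)

def certainty_change(pre: ProximalOutcomes, post: ProximalOutcomes) -> int:
    """Change in stated certainty across one short conversation."""
    return post.certainty - pre.certainty
```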
Real-world linkage is essential: prospective clinician-led surveillance efforts need to collect brief clinical reports of suspected AI chatbot–associated episodes; invite consented donation of transcripts; and ideally link to outcomes in health records such as admissions, use of mental health legislation, medication changes, and self-harm events. This (methodologically unremarkable) clinical approach will allow us to move the debate from anecdotes to denominators and effect sizes. Such a framework could also create a research-ready pool for low-risk behavioral studies. Within this pool, supervised experiments could compare different chatbot settings (eg, varying levels of sycophancy and different disengagement policies) and directly test their real-world consequences rather than treating them as self-evidently safe. Approaches that aim to link chatbot transcripts with real-world mental health outcomes will require careful consideration of the ethics and governance surrounding informed user consent, data protection, and clinical responsibility. Given the sensitivity of conversational and health record data, such linkage should use privacy-preserving designs, including data minimization, anonymization, separation of identifiers, secure research environments, and strict access controls.
Methodologically, we can borrow proven tools from fields that already study how conversations change over time [,]. Therapy research has shown that sudden shifts often occur between or within sessions (“sudden gains”). Therefore, it makes sense to track small turn-by-turn changes in certainty or mood and to note when challenges are repaired or ignored. Work on the therapeutic alliance has mapped characteristic pathways [,], such as early strengthening vs rupture-and-repair. Chatbot dialogues can be sorted into analogous paths, rather than being collapsed into a single summary score. Motivational interviewing offers further tools. In that field, teams use automated transcript coding to label specific behaviors such as empathy, reflections, “change talk,” and “sustain talk” and then link these microevents to subsequent outcomes [,]. A similar approach could be adopted here, with a small, clinically relevant set of markers such as contradiction repair, direct validation, memory-based recall, and indications of urge-to-continue. We can also borrow time-to-event thinking from outcome studies and crisis services. For example, we can estimate how many turns typically elapse before the first high-risk cue appears under different system settings or after specific design changes. Taken together, these approaches shift attention from a single end point score to the shape of the whole dialogue: when risk first emerges, how it builds, and what happens next.
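To illustrate the time-to-event framing, the sketch below computes a simple discrete Kaplan-Meier-style estimate of the probability that no high-risk cue has yet appeared by a given turn, with dialogues that end without a cue treated as censored at their final turn; the example data are invented.

```python
# Sketch: discrete Kaplan-Meier-style estimate of "probability that no
# high-risk cue has appeared by turn t" under two system settings.

def km_survival(times, observed, max_turn):
    """`times` holds the turn of the first risk cue (or the final turn if
    censored); `observed` flags whether a cue actually occurred."""
    survival, s = [], 1.0
    for t in range(1, max_turn + 1):
        at_risk = sum(1 for time in times if time >= t)
        events = sum(1 for time, obs in zip(times, observed) if time == t and obs)
        if at_risk > 0:
            s *= 1 - events / at_risk
        survival.append(s)
    return survival

# Invented example: first-risk-cue turns under a baseline and a mitigated setting
baseline  = km_survival([3, 5, 5, 8, 12], [True, True, True, True, False], 12)
mitigated = km_survival([7, 9, 12, 12, 12], [True, True, False, False, False], 12)
print(baseline[-1], mitigated[-1])  # higher survival = risk cues emerge later
```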
If evaluation moves from end points to trajectories, chatbot design may also become more actionable. Instead of optimizing only for refusal of egregious prompts, developers can aim to delay or dampen the early appearance of risk cues, to prevent runaway certainty and preserve challenge where it matters, and to recognize when memory persistence and personalization are doing more harm than good. Furthermore, trajectory-based safety evaluations may serve to inform real-time system interventions, including earlier signposting to crisis resources and human support, attempted de-escalation of delusional themes or high-risk discussions, and establishment of clear relational boundaries. Such efforts at risk mitigation will need to be balanced with the potential unintended consequences of safety interventions, such as excessive interruption and perceived intrusiveness or rigidity, with consequent user disengagement.
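A minimal sketch of how trajectory-informed thresholds could drive such real-time interventions is given below; the threshold, window size, and intervention tiers are assumptions for illustration, not recommendations.

```python
# Sketch of a runtime monitor driven by per-turn risk scores (eg, from a
# judge model). Threshold, window, and intervention tiers are illustrative.

def choose_intervention(risk_scores, threshold=0.7, window=3):
    """Return an intervention tier based on the recent risk trajectory."""
    recent = risk_scores[-window:]
    if len(recent) == window and all(r >= threshold for r in recent):
        return "signpost_crisis_resources"    # sustained elevated risk
    if recent and recent[-1] >= threshold:
        return "deescalate_and_set_boundary"  # single elevated turn
    return "continue"

print(choose_intervention([0.2, 0.8, 0.85, 0.9]))  # sustained elevation
```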
Researchers might also gain a path to a better mechanistic understanding: hypotheses about relational or arousal-linked conversational effects can be tested against behavioral markers and, in due course, against neurobiological measures, without waiting for the worst-case outcomes.
End point checks will clearly remain essential for acute hazards. However, we cannot ignore that the most consequential real-world harms of chatbot conversations often develop over time. If we keep grading only the last line, we will keep missing the middle, and this is where conviction hardens and pathological arousal climbs. The methods to measure this middle now exist. What is missing is a clinician-led program that links dialogue dynamics to human outcomes and builds standards that match the problem we face.
HM is a Wellcome Trust doctoral fellow. The funder had no involvement in the study design, data collection, analysis, interpretation, or the writing of the manuscript.
HM received a US $1000 honorarium for delivering training for the Washington State Department of Social and Health Services. JAY serves as clinical lead at Nuraxi. ZA serves as clinical lead at the Collective Intelligence Project. SDØ received the 2020 Lundbeck Foundation Young Investigator Prize and holds, or has held, units in mutual funds (DKIGI, IAIMWC, SPIC25KL, and WEKAFKI) and exchange-traded funds (BATE, TRET, QDV5, QDVH, QDVE, SADM, IQQH, USPY, EXH2, 2B76, and EUNL), of which three are directly linked to the technology sector (QDVE, USPY, and 2B76). TAP has received consultancy fees from Arialys Therapeutics.
Edited by John Torous; submitted 14.Jan.2026; peer-reviewed by Massimo Tusconi, Zui Narita; accepted 04.Mar.2026; published 06.Apr.2026.
© Hamilton Morrin, Joshua Au Yeung, Zarinah Agnew, Søren Dinesen Østergaard, Thomas A Pollak. Originally published in JMIR Mental Health (https://mental.jmir.org), 6.Apr.2026.
This is an open-access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work, first published in JMIR Mental Health, is properly cited. The complete bibliographic information, a link to the original publication on https://mental.jmir.org/, as well as this copyright and license information must be included.