Part 2: Education & Workforce Transformation
Navigating the Jagged Frontier: How Humans and Generative AI Must Learn to Collaborate
François Candelon
Partner, Seven2 — Executive Fellow, D³ Institute at Harvard Business School
Introduction
The rapid spread of generative AI across organizations has triggered an urgent question for business leaders: how should humans work with AI?
The most common answer, "keep humans in the loop," sounds reassuring. Yet it is also incomplete. Human–AI collaboration is not a single, uniform practice. It varies dramatically depending on both the nature of the task and the way individuals engage with the technology.
Large-scale experiments that I co-led with researchers from Harvard Business School, MIT Sloan, the Wharton School, and Warwick Business School show that when organizations misunderstand this relationship, the consequences extend far beyond missed productivity gains. Poorly designed human–AI collaboration can distort judgment, reduce the quality of decisions, and gradually erode the professional expertise on which competitive advantage depends.
This article synthesizes findings from a multi-phase research program examining how professionals interact with generative AI in real work environments.
In the first phase, we conducted a controlled experiment involving 750 consultants at Boston Consulting Group using GPT-4 to complete realistic consulting tasks. The results revealed what we call the "jagged capability frontier," the uneven and shifting boundary between tasks where AI enhances performance and those where it undermines it.
In the second phase, we analyzed nearly 5,000 human–AI interactions from the same experiment. This analysis uncovered three distinct collaboration modes: Cyborgs, Centaurs, and Self-Automators, each associated with different performance outcomes and patterns of skill development.
Together, these findings offer a practical framework for understanding how generative AI is reshaping professional work. The central lesson is clear: workforce transformation is not simply about deploying powerful tools. It requires rethinking how cognitive labor is divided between humans and machines, and giving leaders a basis for making those decisions. The pages that follow present the evidence, the framework, and a set of questions organizations can use to act on them.
The Jagged Capability Frontier
Our research began with a simple but critical question: where should generative AI actually be used in high-skill work?
The answer proved less obvious than many assume.
When 750 BCG consultants completed consulting-style tasks using GPT-4, the results showed a striking divide depending on whether tasks fell inside or outside the model's capability frontier.
For tasks inside the frontier, such as a product innovation exercise that required generating ideas, designing a focus group, and drafting a press release, participants using GPT-4 outperformed the control group by 40 percent. Generative AI significantly improved performance across the board, including among participants who had previously ranked as lower performers. In this context, the technology functioned as a powerful equalizer.
However, the experiment also revealed a critical warning.
A second task was intentionally designed to fall outside GPT-4's frontier: a complex business problem requiring the synthesis of financial data and qualitative interview notes. On this task, participants using the model performed 23 percent worse than those working without it.
In other words, AI did not merely fail to help. It actively reduced performance.
Interviews with participants revealed why. GPT-4 produced answers that were fluent, confident, and persuasive even when they were incorrect. In many cases, these responses overrode the users' own analytical reasoning, leading professionals to accept flawed conclusions supported by convincing explanations.
The boundary between what AI can and cannot reliably do is therefore neither stable nor predictable. We describe it as jagged. Unlike traditional software, generative AI models evolve continuously through updates and retraining. Capabilities that appear reliable at one moment may degrade later. For example, after a routine update, GPT-4's accuracy in identifying prime numbers reportedly fell from 98 percent to just 2 percent. Improvements in one area may unintentionally weaken performance elsewhere.
For organizations, the implication is significant. The AI capability frontier cannot be mapped once and trusted indefinitely. Instead, companies must continually reassess where AI enhances or undermines performance.
A final insight emerged from the creative tasks in our experiment. While individuals using GPT-4 produced higher-quality outputs, groups of AI-assisted participants generated 41 percent less diversity of ideas than those working without the tool. AI improved individual productivity but also encouraged convergence, producing similar outputs across users. Organizations may gain efficiency while simultaneously risking reduced collective creativity.
Three Modes of Human–AI Collaboration
If the first phase of our research revealed where AI helps or harms, the second phase examined how professionals actually collaborate with the technology.
Analyzing nearly 5,000 recorded interactions between participants and GPT-4 revealed three distinct behavioral patterns.
Approximately 60 percent of participants adopted what we call the Cyborg model. These individuals engaged in continuous dialogue with the AI throughout the problem-solving process. They broke problems into smaller components, assigned roles to the model, challenged outputs, and iteratively refined results through repeated exchanges. For Cyborgs, the boundary between human reasoning and machine assistance became deliberately fluid. Rather than treating AI as a tool used at discrete moments, they integrated it into a collaborative thinking process.
About 14 percent of participants followed a different strategy: the Centaur model. Centaurs used AI selectively for well-defined subtasks while maintaining firm human control over the overall problem-solving framework. They might rely on AI to gather background information, explore unfamiliar industries, or test elements of their analysis, but they retained responsibility for structuring the problem and synthesizing the final recommendation. In this approach, AI functions as a targeted analytical instrument rather than a collaborative partner.
The remaining 27 percent of participants adopted what we describe as Self-Automation. These professionals delegated entire workflows to AI with minimal iteration. They provided instructions and data and then accepted the generated outputs with little critical engagement. The resulting work often appeared polished and efficient, but it lacked the analytical depth produced by genuine cognitive effort. The AI was effectively doing the work for professionals rather than with them.
Importantly, these collaboration patterns emerged spontaneously. All participants had access to the same tools and received identical instructions. Yet their instinctive approaches to using AI produced fundamentally different interaction styles. A useful way to understand the distinction is through two questions: who decides what needs to be done, and who determines how it gets done? Cyborgs let humans drive the "what" but grant AI significant influence over "how." Centaurs retain human control over both. Self-Automators cede control of both to the machine. This finding suggests that the impact of generative AI depends not only on the technology itself but also on human behavior and organizational culture.
Performance and the Expertise Paradox
The three collaboration modes produced striking differences in performance.
Centaurs achieved the highest analytical accuracy, outperforming both Cyborgs and Self-Automators in identifying the correct recommendation. By maintaining control over the reasoning process and critically evaluating AI inputs, they were less vulnerable to persuasive but incorrect responses.
Both Cyborgs and Centaurs also produced more persuasive deliverables than Self-Automators, reflecting the quality improvements that emerge when humans remain actively engaged in the work.
Yet even the Cyborg model carried risks. Despite their sophisticated prompting strategies, Cyborgs were sometimes persuaded by AI-generated justifications for incorrect conclusions. This highlights a fundamental asymmetry in human–AI collaboration: the fluency of AI can easily be mistaken for competence.
Perhaps the most important difference among the three modes, however, concerns skill development. Cyborgs developed new capabilities in working with AI, learning how to prompt effectively, challenge outputs, and guide the system toward better results. We describe this process as "newskilling." Centaurs strengthened their traditional expertise by using AI to accelerate research and refine analytical frameworks, a form of "upskilling" that deepens existing domain knowledge. Self-Automators, by contrast, developed neither. Because they delegated the entire cognitive process to AI, they missed opportunities to build both domain expertise and AI fluency. We describe this outcome as "no-skilling."
This pattern presents a significant long-term risk. In our experiment, more than a quarter of highly trained professionals, who knew their performance was being evaluated, defaulted to Self-Automation behavior. At organizational scale, such patterns could gradually erode the expertise that companies rely on for complex decision-making and innovation. The question, then, is not whether organizations should use generative AI, but how they can use it without inadvertently dismantling the human capabilities that make it valuable in the first place.
From Findings to Practice: A Decision Framework
Our research points to a set of practical questions that leaders and teams can use to structure human–AI collaboration. They emerge directly from the experimental evidence and apply across industries, functions, and models. Every AI deployment involves a series of design choices, many of which organizations currently make by default rather than deliberately. Making those choices explicit is the first step toward managing them well.
The first question is whether a task falls inside or outside the AI's capability frontier. In our experiment, this was the most important factor in determining whether AI improved performance or reduced it.
For structured tasks with clear quality benchmarks, such as drafting, summarization, or first-pass analysis, AI is likely operating inside its frontier. For tasks that require integrating ambiguous signals, exercising judgment across incomplete data, or making decisions with significant downstream consequences, it may not be. The discipline is to test, not assume. Before standardizing a workflow, organizations should compare AI-assisted and unassisted performance across major task categories.
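To make the "test, not assume" discipline concrete, imagine scoring matched samples of AI-assisted and unassisted work before standardizing a workflow. The sketch below is illustrative only; it is not part of the study's methodology, and the function name, quality scores, and effect-size threshold are assumptions that a real evaluation would replace with blinded grading and proper statistics.

```python
# Illustrative sketch only: the scores, threshold, and function name below
# are hypothetical, not the evaluation protocol used in the research.
from statistics import mean, stdev

def frontier_check(assisted_scores, unassisted_scores, min_effect=0.3):
    """Compare AI-assisted vs. unassisted quality scores for one task category.

    Returns a rough verdict on whether the task appears to sit inside the
    model's capability frontier (assisted work scores meaningfully higher),
    outside it (assisted work scores meaningfully lower), or at its edge.
    """
    diff = mean(assisted_scores) - mean(unassisted_scores)
    # Pooled standard deviation gives a simple Cohen's-d-style effect size.
    pooled_sd = ((stdev(assisted_scores) ** 2 + stdev(unassisted_scores) ** 2) / 2) ** 0.5
    effect = diff / pooled_sd if pooled_sd else 0.0
    if effect >= min_effect:
        return "inside frontier: AI assistance improves quality", effect
    if effect <= -min_effect:
        return "outside frontier: AI assistance degrades quality", effect
    return "edge of frontier: re-test before standardizing the workflow", effect

# Hypothetical quality ratings (1-10) from blinded reviewers.
drafting_assisted = [8, 7, 9, 8, 8, 9, 7, 8]
drafting_unassisted = [6, 7, 6, 5, 7, 6, 6, 7]
print(frontier_check(drafting_assisted, drafting_unassisted))
```

Because the frontier shifts with every model update, such a check is only useful if it is repeated; a task category that tests inside the frontier today may not stay there.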
The second question is which collaboration mode fits the task. Our data suggests a clear matching logic. When accuracy matters most, such as in high-stakes analysis, financial recommendations, or regulatory assessments, Centaur behavior produces the best outcomes. The human controls the reasoning architecture; AI provides targeted support. When speed and iterative exploration matter most, such as in creative development, scenario generation, or early-stage research, Cyborg behavior is more effective. The human and AI work together in a continuous loop. Self-Automation should be reserved for genuinely routine work, such as formatting, transcription, or standardized reporting, where the cost of an undetected error is low and skill development is not at stake.
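Teams that want to make this matching logic explicit can even write it down as a default rule. The sketch below is a hypothetical illustration of the framework described above, not an instrument from the research; the task attributes, labels, and thresholds are assumptions each organization would need to adapt.

```python
# Hypothetical encoding of the matching logic described above; the field
# names and rules are illustrative assumptions, not a tool from the study.
from dataclasses import dataclass

@dataclass
class Task:
    accuracy_critical: bool  # e.g., financial recommendations, regulatory work
    exploratory: bool        # e.g., creative development, scenario generation
    routine: bool            # e.g., formatting, standardized reporting
    error_cost: str          # "low", "medium", or "high"

def recommend_mode(task: Task) -> str:
    """Map a task profile to a default human-AI collaboration mode."""
    if task.routine and task.error_cost == "low":
        return "Self-Automation: delegate, with spot checks"
    if task.accuracy_critical or task.error_cost == "high":
        return "Centaur: human owns the reasoning, AI handles bounded subtasks"
    if task.exploratory:
        return "Cyborg: continuous human-AI iteration"
    return "Centaur: default to human control when the profile is unclear"

print(recommend_mode(Task(accuracy_critical=True, exploratory=False,
                          routine=False, error_cost="high")))
print(recommend_mode(Task(accuracy_critical=False, exploratory=True,
                          routine=False, error_cost="medium")))
```

The value of such a rule lies less in the code than in the conversation it forces: someone must state in advance how accuracy-critical a task is, and how costly an undetected error would be, before a default mode is chosen.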
The third question is where human judgment is most at risk. Our findings identify a specific danger zone: tasks that appear to fall inside the frontier but actually sit at its edge. In these cases, AI produces outputs that are fluent and confident enough to suppress critical evaluation, yet the underlying reasoning may be flawed. Leaders should treat persuasive AI output on complex or ambiguous tasks as a hypothesis to be tested, not a conclusion to be accepted. Teams working on high-stakes problems should be explicitly instructed to challenge AI recommendations rather than validate them.
The fourth question is what capabilities must be preserved. Every decision to delegate a task to AI is also a decision about which skills the organization continues to develop. If junior professionals never learn to structure a problem because AI does it for them, the organization loses not just a skill but the judgment that comes from practicing it. Leaders should identify the core analytical and creative competencies that define their competitive position and ensure that these are exercised by humans, not outsourced to machines, even when AI could perform the task faster.
These four questions have direct implications for how teams and workflows are organized. In practice, they mean that the same team may need to operate in different modes on different days: Centaur-style oversight for a client-facing financial analysis, Cyborg collaboration for a strategy brainstorm, and delegated automation for the preparation of a routine status report. This requires managers who understand the framework and can make mode-switching a conscious part of how work is assigned and reviewed. It also means rethinking performance evaluation. Traditional metrics that reward speed and volume will inadvertently encourage Self-Automation. Organizations serious about preserving expertise must find ways to measure and reward the quality of human engagement in AI-assisted work, not just its output.
Conclusion: What Is at Stake
Generative AI is not simply another productivity tool. It reshapes the relationship between human expertise and machine capability. Our research shows that this transformation can unfold in very different ways: some collaboration modes amplify human judgment and foster new skills, while others quietly replace human reasoning and weaken the expertise on which organizations depend.
Organizations that misunderstand these dynamics face a specific risk: not that AI will fail outright, but that it will succeed just enough to mask a gradual decline in institutional judgment. A company whose professionals routinely default to Self-Automation will produce work that looks efficient in the near term. But over time, it will find itself unable to evaluate the AI's outputs, unable to detect its errors, and unable to exercise the independent reasoning that complex business decisions require. The erosion will be invisible until it is consequential.
The organizations that distinguish themselves will treat human–AI collaboration not as a deployment question but as a design discipline. They will map the capability frontier continuously, not once. They will match collaboration modes to task requirements deliberately, not by default. They will monitor for automation complacency with the same rigor they apply to financial risk. And they will invest in professionals who are both fluent in AI and deep in their domain, because the most productive and resilient forms of collaboration require both.
Workforce transformation in the age of generative AI is, at its core, a leadership challenge. The technology is powerful, widely available, and improving rapidly. What remains scarce, and decisive, is the organizational judgment to use it well.
Co-investigators on the underlying research include Katherine C. Kellogg (MIT Sloan), Hila Lifshitz-Assaf (Warwick Business School / Harvard LISH), Steven Randazzo (Warwick Business School / Harvard LISH), Fabrizio Dell'Acqua (Harvard Business School), Edward McFowland III (Harvard Business School), Ethan Mollick (Wharton), Karim Lakhani (Harvard Business School), Lisa Krayer (BCG), and Saran Rajendran (BCG).
© 2026 François Candelon. All rights reserved.