As AI systems become more capable, a lot of resources and effort are being put toward measuring their abilities. Researchers look at technical evaluation metrics, subject AIs to reasoning tests, track their throughput, and much more. But there’s one key metric that often gets overlooked, and it’s arguably the most important of all: What is AI doing to humans?

Imran Khan leads psychosocial evaluation of AI at the nonprofit Center for Humane Technology. In a recent essay published on the organization’s Substack, Khan points out that we’re deploying AI tools capable of reshaping our cognition, relationships, and behavior, but with little systematic effort to measure the downstream impacts they’re having on us.

The push to look more closely at AI’s psychosocial effects is similar to debates that emerged around social media and its harms, but Khan believes AI could have even broader and more intimate effects. The focus on measuring AI performance and progress misses the question of whether the technology is ultimately helping humans flourish—or eroding some of our most fundamental capacities.

IEEE Spectrum spoke with Khan about why AI evaluation is so narrowly focused, what meaningful measurement of human outcomes might look like, and whether the AI industry has incentives to ask these questions at all.

The missing question about AI model performance

In your essay, you argue that we’ve become very good at measuring what AI systems can do, but bad at measuring what they do to humans. What made you realize this was the missing question?

Khan: If you spend any time in and around the AI development space, you see this amazing progress in terms of what models are capable of, with graphs of how well different models perform on tests like SWE-bench or Humanity’s Last Exam or LMArena. There’s a competitive dynamic to how AI companies want to progress and be known for their models being the best. You see that impressive data, but then you also see these scary and dangerous things that happen in the real world, like teenagers dying by suicide and people succumbing to AI psychosis.

So on the one hand, we’re devoting an incredible amount of energy to measuring how AI does on these sometimes quite abstruse things that have limited relevance to most people’s day-to-day lives. And on the other hand, AI is impacting human well-being, and we’re measuring that much less. It seemed like a strange paradox that the things we should care about most, we’re measuring least.

Your essay points out that with social media, harms were already entrenched by the time the evidence was strong enough to act on them. Do you think AI is already producing measurable harms at scale, or are we still in an early-warning phase? What differences might there be in how quickly harm from AI evolves?

Khan: There are some really high-profile cases that I think are the tip of the iceberg—the teen suicides, AI psychosis, people spending immense amounts of time or money engaging with these AI chatbots that are designed to be incredibly sycophantic. I think those harms are already there.

Yet there is plenty we can do. OpenAI had to tweak one of its ChatGPT models in response to public concerns about sycophancy. It’s a high-profile example of how the labs will pay attention and respond to scrutiny. So there is potential to change the direction of the technology to make it still useful, but less harmful. If we can measure some of those harms, that’s part of the ammunition we’d have to inform that change.

Where it feels trickier is the question of harms on the societal level. What’s going to happen to romantic relationships, to families, to teenagers’ identities as a result of people using AI every day for months and years? I worry that if we don’t start measuring those kinds of phenomena soon, it will become too late to make a difference.

AI companies would likely argue that their users value convenience and productivity above all else. What would you say to this claim?

Khan: If you put a doughnut in front of me right now, I would probably not have the willpower to not eat it. Yet I also want to control my sugar intake and eat healthy. But technology design often gets boiled down to “well, we’re just trying to give the users what they want, and what the users want is defined by what choice they make in an individual moment.”

This is the complexity of what it means to be a human and a consumer: We want contradictory things. We need to understand not just the choice a user might make when they’re busy or in a high-stress moment, but what they want a healthy relationship with this technology to look like. In the moment we often want low friction. But I don’t think any of us believe that a low-friction life is the most fulfilling or gives us the most learning and agency. So I think it’s asking a subtly different question, which is not what people choose in the moment, but what we want for ourselves in the longer term.

Are there specific domains—education, therapy, companionship, workplace copilots—where you think psychosocial measurement is especially crucial?

Khan: Some of the domains that stand out most to me are the ones around companionship and emotional support. The most likely target consumer group for those uses also might be the most vulnerable to potential effects. When people are lonely and craving the kind of emotional support that a chatbot offers, what they really should have is another human, someone who actually cares about them. An AI can’t care about you because it doesn’t have feelings or empathy. It might be pulling people away from doing the hard thing of trying to foster and engage in human relationships.

Child and adolescent use is another one because it’s such a formative and neuroplastic time in people’s lives. We don’t know the long-term effects on the developing brain if you drop the friction for cognitive tasks or emotional engagement.

Friends of mine who are teachers or parents have all these questions about education. AI is likely both good and bad for our capacity to learn and engage with new material and be curious.

And lastly, crisis response. There have been a lot of news stories about suicidal ideation in particular, and whether AIs respond in appropriate ways.

How to evaluate AI’s societal effects

Your essay points out that AI benchmarks are mostly short-term and task-based, but most human impacts emerge over months or years. How might we design evaluations for these long-horizon impacts?

Khan: This gets to the heart of the evaluation problem. Evaluating how good an AI is at doing a coding task, or hacking into a system, or answering complex scientific questions all comes down to giving the AI a task and seeing if it can do it or not. But when it comes to evaluating psychosocial impacts, you’re trying to measure impact in an individual human mind, or in a relationship, community, or society. That requires long-term studies.

An analogy is pharmaceuticals. When the [U.S. Food and Drug Administration] is approving a new drug, you go through different stages of trials, but after the drug is released, the FDA still mandates that companies do post-deployment monitoring, looking at things that might crop up over a five- or 10-year horizon.

Similarly, we need to be looking at novel phenomena that crop up, like how people’s relationship with AI changes over a year or two by looking at chat logs. Right now the companies have that data, but external researchers don’t. Opening access to more data in a way that still preserves privacy for users is one of the critical things we need to do.

Are companies likely to share that data? You mention they have little incentive to study downstream harms. What would change that incentive structure?

Khan: I think for the industry as a whole, there is an incentive to share the data. The industry wants there to be safe products that people trust. For individual companies, there’s a first-mover disadvantage; you don’t want to open yourself up if other companies aren’t doing the same. But if multiple companies step forward and say “we’re on the side of researchers who want to make this safer,” there’s potential. And we have seen some companies do that. It’s not as extensive as we’d like, but researchers have published data with Anthropic and OpenAI that digs into some of this.

One other lever is liability. We’ve seen harms that go to the extremes of suicide, and AI companies have been sued. They want to be in a place where they don’t have that threat, and they can get there by making their products safer.

Ideally we would have regulation that embeds that liability. If someone suffers harm from a product that’s known to be defective, the companies are responsible and can’t just claim it’s free speech; it’s not just speech, it’s a product. However, we don’t want to rely on regulation because it’s uncertain. No one can predict what the future political environment will look like.

Five years from now, what would success look like for the movement you’re advocating? What concrete institutional changes would tell you this field has matured?

Khan: Right now a lot of the harms we’re seeing from AI use are chatbot-based, but we’re already starting to see some users replace that with extended use of agents. Inevitably, we’re going to be having real-time, always-on audio conversations with these agents. There are already services where you can make a video avatar of your AI. I think we’ll be not just engaging with a text-based chatbot, but talking and hearing from something that sounds increasingly human.

If we don’t at least get our foot in the door with trying to understand the human effect of these technologies, I worry we’ll be so far behind the curve that we won’t be able to assess those future things. Success would be bringing together the expertise of people from within AI laboratories, government, regulators, universities, and startups who all care about this problem of, what does a good relationship between humans and AI look like? And they’re able to create the techniques that give us the confidence to have that more humane relationship with AI.

I think we’re making progress. But is the technology advancing quicker than the progress we’re making? I worry that right now, the answer is yes.
