Inside the “Virtual Lab” where AIs and humans collaborate
Can’t find an expert for your research project? There’s an AI for that.
By Kristin Houser
A “Virtual Lab” developed at Stanford University promises to revolutionize the role of AI in science—and potentially help us find solutions to the world’s biggest problems in the process.
Team effort
In 2024, Google DeepMind’s Demis Hassabis (an AI expert) and John Jumper (a chemist) earned the Nobel Prize in Chemistry (well, half the Nobel Prize) for leading the development of AlphaFold 2, an AI that accurately predicts protein structures, solving one of the biggest problems in biology.
Some experts believe interdisciplinary research like this, where scientists from different fields collaborate on a single project, will be necessary to overcome many major problems facing society today—such as climate change, food insecurity, and the spread of infectious diseases—that are just too complex for scientists within a single discipline to solve.
For that to happen, though, we must first overcome the barriers to interdisciplinary research, including the challenge of simply connecting experts from different disciplines. If you’re Google DeepMind, you can hire experts in AI, biology, bioinformatics, and more to work on your research project (as it did for AlphaFold 2), but if you’re, say, a professor at a small university, finding collaborators from a range of disciplines can be much harder, especially if funding is scarce.
However, just as it solved biology’s protein-folding problem, AI could be the solution to this one, too.
“The Virtual Lab represents a shift from AI as a tool to AI as a partner for science research.”
Swanson et al. (2024)
In 2022, OpenAI unveiled ChatGPT, an AI chatbot built on a large language model (LLM), a type of AI capable of processing and generating “natural” language, the kind humans use to communicate with one another. LLMs gain this capability by training on huge troves of text, and the one supporting ChatGPT was trained on an enormous amount of text drawn largely from the internet. From this, it learned to predict which word is most likely to come next in a sentence based on the words that precede it, and it uses that ability to generate responses to user prompts.
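To make the next-word idea concrete, here is a toy sketch in Python. It is not how ChatGPT works internally (real LLMs use neural networks trained on tokens, not word counts), and the tiny corpus and function names are invented for illustration. It only shows the basic principle of predicting the next word from what came before.

```python
from __future__ import annotations

from collections import Counter, defaultdict


def train_bigram(text: str) -> dict[str, Counter]:
    """Count, for each word, which words follow it in the training text."""
    words = text.lower().split()
    following: dict[str, Counter] = defaultdict(Counter)
    for current, nxt in zip(words, words[1:]):
        following[current][nxt] += 1
    return following


def predict_next(model: dict[str, Counter], word: str) -> str | None:
    """Return the most frequent follower of `word`, or None if it was never seen."""
    followers = model.get(word.lower())
    if not followers:
        return None
    return followers.most_common(1)[0][0]


# Invented mini-corpus, purely for illustration.
corpus = (
    "the lab designs nanobodies . "
    "the lab tests nanobodies . "
    "the lab designs experiments ."
)
model = train_bigram(corpus)
print(predict_next(model, "lab"))  # "designs" follows "lab" twice, "tests" once
```

A model this small only looks one word back. Part of what makes LLMs powerful is that they condition on long stretches of preceding text.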
Though not the first LLM, the model supporting ChatGPT was the most powerful to date, and it gave researchers at Stanford University and the Chan Zuckerberg Biohub the idea to create a Virtual Lab where powerful LLMs playing the roles of different experts, such as “chemist” or “biologist,” could collaborate on research projects under the direction of an actual scientist.
“Previous work applying AI to science has generally treated AI methods as tools used by human researchers.… [T]he Virtual Lab represents a shift from AI as a tool to AI as a partner for science research,” the researchers write in a paper shared on the preprint server bioRxiv.
Inside the Virtual Lab
For the paper, which has yet to be peer reviewed, the Stanford team demonstrated how its Virtual Lab could be used to solve a real-world problem: the need for new COVID-19 treatments.
The first step in using the Virtual Lab is for a human scientist to define the AI agent that will serve as the project’s Principal Investigator (PI). Specifically, the human tells the agent its title, area of expertise, goal, and role in the project. For the COVID-19 demonstration, for example, the team prompted OpenAI’s GPT-4o LLM with the following agent definition:
You are a Principal Investigator. Your expertise is in applying artificial intelligence to biomedical research. Your goal is to perform research in your area of expertise that maximizes the scientific impact of the work. Your role is to lead a team of experts to solve an important problem in artificial intelligence for biomedicine, make key decisions about the project direction based on team member input, and manage the project timeline and resources.
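The four-part agent definition lends itself to a simple template. The sketch below is an illustration, not the Stanford team’s code: the `AgentSpec` class and its `system_prompt` method are hypothetical names, and the sentence pattern is inferred from the prompt quoted above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AgentSpec:
    """Hypothetical container for the four fields the article describes."""

    title: str
    expertise: str
    goal: str
    role: str

    def system_prompt(self) -> str:
        # Sentence pattern inferred from the quoted PI prompt (an assumption).
        return (
            f"You are a {self.title}. "
            f"Your expertise is in {self.expertise}. "
            f"Your goal is to {self.goal}. "
            f"Your role is to {self.role}."
        )


pi = AgentSpec(
    title="Principal Investigator",
    expertise="applying artificial intelligence to biomedical research",
    goal=(
        "perform research in your area of expertise that maximizes "
        "the scientific impact of the work"
    ),
    role=(
        "lead a team of experts to solve an important problem in artificial "
        "intelligence for biomedicine, make key decisions about the project "
        "direction based on team member input, and manage the project "
        "timeline and resources"
    ),
)
print(pi.system_prompt())
```

Filling in the same four fields with different values would yield other agents in the same way.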
The scientist then issues a new prompt to the PI agent that reveals the goal for the project: “You are working on a research project to use machine learning to develop antibodies or nanobodies for the newest variant of the SARS-CoV-2 spike protein that also, ideally, have activity against other circulating minor variants and past variants.”
This prompt also directs the PI agent to select a team of three “expert” agents to help with the project. The PI agent should define these agents the same way it was defined (with a title, goal, area of expertise, and role in the project). For the COVID-19 demonstration, the PI opted to collaborate with an immunologist, a machine learning specialist, and a computational biologist.
Once the team is defined, the scientist sets up meetings for the AI agents, allowing them to discuss the project with one another. For each meeting, the scientist must specify an agenda and the number of rounds of discussion between the agents, but the prompt can include other data, too, such as specific questions the agents must answer by the end of the meeting or the text of scientific papers that might provide useful context.
Meetings can include a combination of agents (a team meeting) or just one agent (an individual meeting). They can also include an agent playing the role of Scientific Critic (SC) to help counteract any “hallucinations” or inaccuracies in the agents’ responses. For the COVID-19 demonstration, the researchers created this agent using GPT-4o and the following prompt:
You are a Scientific Critic. Your expertise is in providing critical feedback for scientific research. Your goal is to ensure that proposed research projects and implementations are rigorous, detailed, feasible, and scientifically sound. Your role is to provide critical feedback to identify and correct all errors and demand that scientific answers that are maximally complete and detailed but simple and not overly complex.
At the end of each round of discussion in a meeting, the SC generates a critique of the expert agents’ responses. The PI then asks follow-up questions for the next round based on the discussion so far. The process repeats for however many rounds the scientist set in the agenda—the Stanford team usually opted for three rounds, leading to meetings lasting about 5-10 minutes.
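The round structure described above can be sketched as a short loop. This is a minimal illustration under stated assumptions, not the team’s implementation: `run_team_meeting` and `placeholder_model` are hypothetical names, the model call is a stub so the code runs offline, and skipping the PI’s follow-up after the last round is a guess about how the final round ends.

```python
from typing import Callable, List, Tuple

# Hypothetical stand-in for a chat-model call: (speaker, conversation so far) -> reply.
AskFn = Callable[[str, str], str]


def run_team_meeting(
    agenda: str, experts: List[str], ask: AskFn, rounds: int = 3
) -> List[Tuple[str, str]]:
    """Run a fixed number of discussion rounds and return the transcript."""
    transcript: List[Tuple[str, str]] = [("Human", agenda)]

    def history() -> str:
        return "\n".join(f"{who}: {text}" for who, text in transcript)

    for round_no in range(1, rounds + 1):
        # Each expert responds in turn, seeing everything said so far.
        for expert in experts:
            transcript.append((expert, ask(expert, history())))
        # The critic reviews the responses at the end of the round.
        transcript.append(("Scientific Critic", ask("Scientific Critic", history())))
        # The PI asks follow-up questions only if another round follows (assumption).
        if round_no < rounds:
            transcript.append(
                ("Principal Investigator", ask("Principal Investigator", history()))
            )
    return transcript


def placeholder_model(speaker: str, conversation: str) -> str:
    """Toy reply so the sketch runs offline; a real setup would call an LLM here."""
    return f"[{speaker} responds]"


transcript = run_team_meeting(
    agenda="Decide whether to design antibodies or nanobodies.",
    experts=["Immunologist", "Machine Learning Specialist", "Computational Biologist"],
    ask=placeholder_model,
    rounds=3,
)
for who, text in transcript:
    print(f"{who}: {text}")
```

Passing the full history to every speaker is the simplest design choice. It lets each agent react to earlier comments, at the cost of longer prompts as the meeting goes on.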
The PI or the single agent (depending on the type of meeting) then generates a final summary of the meeting, including a recommendation based on the agenda. To improve the final output of these meetings, a scientist could opt to run the exact same meeting, with the same agenda and agents, multiple times to generate different meeting summaries. They can then have the PI merge the summaries into a single comprehensive answer.
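Running the same meeting several times and merging the results might look something like the sketch below. Everything here is hypothetical: the function names, the wording of the merge prompt, and the stub that stands in for a full meeting. The article does not say how the team phrased its merge request.

```python
from typing import Callable, List


def run_repeated(run_meeting: Callable[[str], str], agenda: str, n_runs: int) -> List[str]:
    """Run the same meeting n_runs times and collect each run's final summary."""
    return [run_meeting(agenda) for _ in range(n_runs)]


def build_merge_prompt(agenda: str, summaries: List[str]) -> str:
    """Assemble a prompt asking the PI agent to merge the summaries (wording is an assumption)."""
    numbered = "\n\n".join(
        f"Summary {i}:\n{summary}" for i, summary in enumerate(summaries, start=1)
    )
    return (
        f"Agenda: {agenda}\n\n"
        f"Below are {len(summaries)} summaries from separate runs of the same meeting.\n\n"
        f"{numbered}\n\n"
        "Merge them into a single comprehensive answer to the agenda."
    )


def placeholder_meeting(agenda: str) -> str:
    """Stub for a full meeting; a real one would return the PI's LLM-written summary."""
    return f"(summary of a meeting on: {agenda})"


agenda = "Select computational tools for nanobody design."
summaries = run_repeated(placeholder_meeting, agenda, n_runs=3)
merge_prompt = build_merge_prompt(agenda, summaries)
print(merge_prompt)
```

Because LLM output varies from run to run, real summaries would differ from one another, which is what makes merging them worthwhile.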
Study co-author James Zou, an associate professor of biomedical data science at Stanford University, told Freethink that having multiple agents in the meetings, each representing the perspective of a different kind of expert, produces better results than if the team had simply opted to have one AI or multiple general “scientist” AIs work on a project.
“In the Virtual Lab group meetings, all the agents contribute to the discussion and sometimes disagree with each other,” he explains. “These iterative discussions and debates are very helpful for the Virtual Lab to refine its reasoning and approach. The critic agent is especially helpful in this regard for providing constructive feedback.”
After the first meeting between the agents, the PI recommended that the team focus on modifying existing nanobodies, rather than developing new ones or working on antibodies: “Modifying existing nanobodies allows us to leverage established data, providing a quicker and more reliable path to developing broad-spectrum candidates.” It also identified four specific nanobodies for the team to modify and provided justification for its choices.
In a follow-up meeting, the agents were tasked with selecting appropriate computational tools to use to modify the nanobodies. Meetings were then held to write code that would allow the appropriate AI agents to use the tools. The Computational Biologist agent, for example, wrote a script for Rosetta so that it could use the software to predict how well a nanobody would bind to the coronavirus’ spike protein (strong binding is essential for a nanobody treatment to work).
Using these tools and additional meetings, the AIs determined the various ways their nanobodies could mutate, selected promising mutations, and tested their binding abilities computationally. Once the team had identified 92 particularly promising designs (23 for each of the four starting nanobodies), the Stanford team actually created the mutant nanobodies in the lab in order to validate the AIs’ findings.
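The selection step, keeping a fixed number of top candidates for each starting nanobody, can be illustrated with a short ranking sketch. The article does not describe the team’s actual ranking criteria, so the scores below are random placeholders, the nanobody names are invented, and treating a lower score as better is an assumption.

```python
import random
from typing import Dict, List, Tuple


def select_top_designs(
    scores: Dict[str, List[Tuple[str, float]]], per_parent: int
) -> Dict[str, List[str]]:
    """Keep the `per_parent` best-scoring variants for each starting nanobody.

    Assumes a lower score means stronger predicted binding.
    """
    selected: Dict[str, List[str]] = {}
    for parent, variants in scores.items():
        ranked = sorted(variants, key=lambda item: item[1])
        selected[parent] = [name for name, _ in ranked[:per_parent]]
    return selected


# Synthetic data: 100 made-up variants per parent with random placeholder scores.
rng = random.Random(0)
parents = ["nanobody_1", "nanobody_2", "nanobody_3", "nanobody_4"]
scores = {
    parent: [(f"{parent}_variant_{i}", rng.uniform(-50.0, 0.0)) for i in range(100)]
    for parent in parents
}

chosen = select_top_designs(scores, per_parent=23)
total = sum(len(names) for names in chosen.values())
print(total)  # 4 parents x 23 designs each = 92
```

Selecting per parent rather than across the whole pool guarantees every starting nanobody is represented equally, which matches the 23-per-nanobody split reported above.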
“We are particularly excited that two of the new nanobodies designed by the Virtual Lab show promising binding to the recent JN.1 variant of SARS-CoV-2, while retaining binding to the original Wuhan variant of the virus,” Zou told Freethink. “It’s quite rare to see good binders across such diverse variants.”
The big picture
The Virtual Lab’s agents might be more accessible than human experts, but they have their own limitations.
One is that LLMs only “know” whatever is included in their training data, and if that data doesn’t include the latest information (the stuff a true expert in a field would know), the AI won’t be able to contribute it to the project. In the COVID-19 demonstration, for example, the agents recommended using the tool AlphaFold-Multimer rather than AlphaFold 3 because the cut-off date for GPT-4o’s training data was prior to the release of the newer tool.
Another is that the AI agents can be prone to giving vague answers, so the human scientist might have to tweak their prompts multiple times in order to get useful responses. In the demo, the agents resisted choosing between nanobodies and antibodies, suggesting the team try both, until the meeting prompt was revised to insist they choose only one.
A third is that, while the Virtual Lab can lead to valuable discoveries, research often requires real-world experimentation. In the case of the COVID-19 demo, validation was just the first step: Zou told Freethink that the researchers will need to perform more experimental analyses of the nanobodies, which is the “more time consuming and expensive” part of the development process, to determine their potential use in combatting the coronavirus.
If a scientist isn’t capable of doing that real-world experimentation on his or her own—or enticing others to join a project at that point—the research could hit a brick wall.
Despite these limitations, the Virtual Lab is already making waves in the scientific community—Eric Topol, founder and director of the Scripps Research Translational Institute, called it “creative and mind blowing.”
“We’re just figuring out use of 1 #AI agent,” he tweeted. “This team took it to the next level. Imagine the frequent autonomous brainstorming lab meetings between the 5 agents!”
As LLMs and other computational tools advance, the Virtual Lab will likely become even more capable. This could reduce the need for such painstaking prompt engineering and even allow more of the experimentation to be done computationally.
Because the framework is so versatile, Zou believes scientists could use the Virtual Lab for a wide range of interdisciplinary research projects—creating whatever collaborators they need to help them find solutions to some of the world’s biggest problems.
“Previously,” he says, “it might be difficult to find the relevant experts to join the team; now, with the Virtual Lab, some of the missing expertise can be provided by AI scientists.”