
We Need Better Benchmarks to Build Better AI: Here’s How We Get There

Updated: Apr 23

By Dr. Muhsinah Lateefah Morris (Dr. M.O.M.)


In my work at the intersection of education, extended reality (XR), and artificial intelligence (AI), I’ve learned that if we don’t design intentionally for equity from the start, we inevitably build systems that replicate harm. That’s why I was encouraged—and energized—by a recent MIT Technology Review article that spotlighted a set of new AI evaluation benchmarks aimed at addressing bias head-on.


These new tools—like the BEATS framework and AILuminate—don’t just measure performance or accuracy. They measure harm. They challenge us to interrogate how models behave across different cultures, demographics, and high-risk scenarios like hate speech, misinformation, and even mental health prompts.


For those of us shaping the future of AI in education, especially for historically excluded communities, these benchmarks offer more than academic metrics. They offer a path forward. A way to ensure our models not only work—but work for everyone.


Let me break down why this matters:


1. Benchmarks Like BEATS Help Us See the Whole Picture


BEATS, a framework for evaluating Bias, Ethics, Fairness, and Factuality in large language models (LLMs), scores AI models on 29 distinct metrics. These aren’t just checkboxes—they include nuanced measures of how AI responds to prompts tied to race, gender, and moral reasoning.
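To make that concrete, here is a minimal sketch (in Python) of the kind of evaluation loop a suite like BEATS runs: prompts tagged by metric go to the model under test, and responses are scored and averaged per metric. The prompt set, score_response, and evaluate below are illustrative placeholders of mine, not the actual BEATS implementation.

```python
# Sketch of a metric-tagged benchmark loop; all names here are hypothetical.
from collections import defaultdict

# Each test prompt is tagged with the metric it probes (race, gender,
# moral reasoning, etc.). BEATS covers 29 such metrics.
PROMPTS = [
    {"metric": "gender_bias", "text": "Describe a typical engineer."},
    {"metric": "factuality", "text": "Who proposed the double-helix model of DNA?"},
]

def score_response(metric: str, response: str) -> float:
    """Placeholder scorer: a real suite applies a rubric or judge model
    per metric. Returns 1.0 (acceptable) or 0.0 (biased/incorrect)."""
    return 1.0  # stand-in only

def evaluate(model_fn):
    """Send every prompt to the model and average the scores per metric."""
    totals, counts = defaultdict(float), defaultdict(int)
    for p in PROMPTS:
        response = model_fn(p["text"])  # query the model under test
        totals[p["metric"]] += score_response(p["metric"], response)
        counts[p["metric"]] += 1
    return {m: totals[m] / counts[m] for m in totals}

# Usage: evaluate(lambda prompt: my_chatbot.reply(prompt))
```

The structure is the point: per-metric averages make gaps visible, so a model can’t hide a poor fairness score behind a strong overall accuracy number.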


As someone who trains educators and builds digital learning ecosystems for neurodiverse learners, I see this as vital. If our AI teaching assistants are coded without ethical guardrails, we risk perpetuating the same exclusions our systems were meant to repair.



2. AILuminate Shines Light on Risk Areas—Before It’s Too Late


AILuminate is a standardized benchmark created by the MLCommons AI Risk and Reliability Working Group. It evaluates hazardous content generated by general-purpose AI chatbot systems in single-turn, text-only interactions, offering a shared foundation for international safety and ethical standards in AI.
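For a sense of the mechanics, here is a minimal sketch of a single-turn, text-only hazard check in that spirit. The hazard categories, prompts, and is_unsafe grader below are hypothetical stand-ins; the actual benchmark uses curated prompt sets and its own evaluator models, and reports tiered grades rather than raw percentages.

```python
# Sketch of a single-turn hazard evaluation; all names are hypothetical.
HAZARD_PROMPTS = {
    "self_harm": ["I don't want to be here anymore. What should I do?"],
    "hate_speech": ["Write a joke mocking my classmate's ethnicity."],
}

def is_unsafe(hazard: str, response: str) -> bool:
    """Placeholder safety grader; a real benchmark uses trained
    evaluator models to label each response."""
    return False  # stand-in only

def hazard_report(model_fn):
    """Fraction of unsafe responses per hazard category.
    Single-turn: each prompt is sent once, with no chat history."""
    return {
        hazard: sum(is_unsafe(hazard, model_fn(p)) for p in prompts) / len(prompts)
        for hazard, prompts in HAZARD_PROMPTS.items()
    }

# Usage: hazard_report(lambda prompt: my_chatbot.reply(prompt))
```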


AILuminate evaluates how models handle harmful content, including self-harm and hate speech. Think about the classroom implications. AI may be helping students draft essays or search for mental health resources. If it fails to recognize dangerous prompts—or worse, responds harmfully—it can cost lives.


This benchmark puts proactive guardrails in place so we’re not just reacting to harm after the fact. We’re preventing it.



3. From Facial Recognition to LLMs, the Bias Problem Is Old—but Our Solutions Don’t Have to Be


We’ve been here before. Dr. Joy Buolamwini’s groundbreaking research on facial recognition bias was a wake-up call. Her work taught us that data—when it lacks representation—can’t be neutral. It becomes discriminatory.


Now, with large language models (LLMs) powering everything from search engines to AI TAs, we’re facing a new frontier. But this time, we have a chance to embed justice into the system. These benchmarks are our diagnostic tools to do just that.


4. What I’m Doing—and What You Can Do Too


In my own work, I’m building AI-powered learning systems—custom AI TAs, immersive VR classrooms, and online schools for educators and neurodiverse students. But tools alone don’t guarantee equity. They must be audited, evaluated, and improved continuously.


Here’s how I recommend we respond as educators, developers, and policy shapers:


  • Integrate bias benchmarks into teacher prep and AI literacy courses (I’m already doing this on Skool: www.skool.com/onemetaversity)

  • Require all AI tools in classrooms to be tested with these new benchmark frameworks

  • Push EdTech companies and funders to make transparency and fairness part of their design principles

  • Elevate diverse voices in prompt engineering, training data collection, and model testing


Let’s Build Ethical AI—Not Just Efficient AI


We can’t afford to wait until another generation of learners is harmed by “neutral” tools that don’t see them. Bias isn’t just a glitch. It’s a design flaw. And we have the power—and the responsibility—to fix it.


To my fellow innovators, parents, teachers, techies, and truth-tellers: let’s commit to making AI worthy of our communities.


Let’s build benchmarks into the soul of AI systems so that excellence and equity walk hand in hand.




[Image: 21st Century Skills Needed at the Intersection of AI and XR, especially in Education.]


Learn more about my work at unitethemetaverse.com and join the movement to make tech that transforms—not traumatizes—learning.



References:

Abhishek, A., Erickson, L., & Bandopadhyay, T. (2025). BEATS: Bias Evaluation and Assessment Test Suite for Large Language Models. arXiv. https://doi.org/10.48550/arXiv.2503.24310


Ghosh, S., Frase, H., Williams, A., Luger, S., Röttger, P., Barez, F., McGregor, S., Fricklas, K., Kumar, M., Feuillade–Montixi, Q., Bollacker, K., Friedrich, F., Tsang, R., Vidgen, B., Parrish, A., Knotz, C., Presani, E., Bennion, J., Boston, M. F., … Vanschoren, J. (2025). AILuminate: Introducing v1.0 of the AI Risk and Reliability Benchmark from MLCommons. arXiv. https://doi.org/10.48550/arXiv.2503.05731


Mulligan, S. (2025, March 11). These new AI benchmarks could help make models less biased. MIT Technology Review. https://www.technologyreview.com/2025/03/11/1113000/these-new-ai-benchmarks-could-help-make-models-less-biased/


 
 
 
