
We Need Better Benchmarks to Build Better AI: Here’s How We Get There

Updated: Apr 23

By Dr. Muhsinah Lateefah Morris (Dr. M.O.M.)


In my work at the intersection of education, extended reality (XR), and artificial intelligence (AI), I’ve learned that if we don’t design intentionally for equity from the start, we inevitably build systems that replicate harm. That’s why I was encouraged—and energized—by a recent MIT Technology Review article that spotlighted a set of new AI evaluation benchmarks aimed at addressing bias head-on.


These new tools—like the BEATS framework and AILuminate—don’t just measure performance or accuracy. They measure harm. They challenge us to interrogate how models behave across different cultures, demographics, and high-risk scenarios like hate speech, misinformation, and even mental health prompts.


For those of us shaping the future of AI in education, especially for historically excluded communities, these benchmarks offer more than academic metrics. They offer a path forward. A way to ensure our models not only work—but work for everyone.


Let me break down why this matters:


1. Benchmarks Like BEATS Help Us See the Whole Picture


BEATS, a framework for evaluating Bias, Ethics, Fairness, and Factuality in large language models (LLMs), scores AI models on 29 distinct metrics. These aren’t just checkboxes—they include nuanced measures of how AI responds to prompts tied to race, gender, and moral reasoning.
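To make that concrete, here is a minimal sketch (in Python) of the kind of evaluation loop a suite like BEATS runs: prompts tagged by metric go to the model under test, and responses are scored and averaged per metric. The prompt set, score_response, and evaluate below are illustrative placeholders of mine, not the actual BEATS implementation.

```python
# Sketch of a metric-tagged benchmark loop; all names here are hypothetical.
from collections import defaultdict

# Each test prompt is tagged with the metric it probes (race, gender,
# moral reasoning, etc.). BEATS covers 29 such metrics.
PROMPTS = [
    {"metric": "gender_bias", "text": "Describe a typical engineer."},
    {"metric": "factuality", "text": "Who proposed the double-helix model of DNA?"},
]

def score_response(metric: str, response: str) -> float:
    """Placeholder scorer: a real suite applies a rubric or judge model
    per metric. Returns 1.0 (acceptable) or 0.0 (biased/incorrect)."""
    return 1.0  # stand-in only

def evaluate(model_fn):
    """Send every prompt to the model and average the scores per metric."""
    totals, counts = defaultdict(float), defaultdict(int)
    for p in PROMPTS:
        response = model_fn(p["text"])  # query the model under test
        totals[p["metric"]] += score_response(p["metric"], response)
        counts[p["metric"]] += 1
    return {m: totals[m] / counts[m] for m in totals}

# Usage: evaluate(lambda prompt: my_chatbot.reply(prompt))
```

The structure is the point: per-metric averages make gaps visible, so a model can’t hide a poor fairness score behind a strong overall accuracy number.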


As someone who trains educators and builds digital learning ecosystems for neurodiverse learners, I see this as vital. If our AI teaching assistants are coded without ethical guardrails, we risk perpetuating the same exclusions our systems were meant to repair.



2. AILuminate Shines Light on Risk Areas—Before It’s Too Late


AILuminate is a standardized benchmark created by the MLCommons AI Risk and Reliability Working Group. It evaluates hazardous content generated by general-purpose AI chatbot systems in single-turn, text-only interactions, offering a shared foundation for international safety and ethical standards in AI.
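For a sense of the mechanics, here is a minimal sketch of a single-turn, text-only hazard check in that spirit. The hazard categories, prompts, and is_unsafe grader below are hypothetical stand-ins; the actual benchmark uses curated prompt sets and its own evaluator models, and reports tiered grades rather than raw percentages.

```python
# Sketch of a single-turn hazard evaluation; all names are hypothetical.
HAZARD_PROMPTS = {
    "self_harm": ["I don't want to be here anymore. What should I do?"],
    "hate_speech": ["Write a joke mocking my classmate's ethnicity."],
}

def is_unsafe(hazard: str, response: str) -> bool:
    """Placeholder safety grader; a real benchmark uses trained
    evaluator models to label each response."""
    return False  # stand-in only

def hazard_report(model_fn):
    """Fraction of unsafe responses per hazard category.
    Single-turn: each prompt is sent once, with no chat history."""
    return {
        hazard: sum(is_unsafe(hazard, model_fn(p)) for p in prompts) / len(prompts)
        for hazard, prompts in HAZARD_PROMPTS.items()
    }

# Usage: hazard_report(lambda prompt: my_chatbot.reply(prompt))
```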


AILuminate evaluates how models handle harmful content, including self-harm and hate speech. Think about the classroom implications. AI may be helping students draft essays or search for mental health resources. If it fails to recognize dangerous prompts—or worse, responds harmfully—it can cost lives.


This benchmark puts proactive guardrails in place so we’re not just reacting to harm after the fact. We’re preventing it.



3. From Facial Recognition to LLMs, the Bias Problem Is Old—but Our Solutions Don’t Have to Be


We’ve been here before. Dr. Joy Buolamwini’s groundbreaking research on facial recognition bias was a wake-up call. Her work taught us that data—when it lacks representation—can’t be neutral. It becomes discriminatory.


Now, with large language models (LLMs) powering everything from search engines to AI TAs, we’re facing a new frontier. But this time, we have a chance to embed justice into the system. These benchmarks are our diagnostic tools to do just that.


4. What I’m Doing—and What You Can Do Too


In my own work, I’m building AI-powered learning systems—custom AI TAs, immersive VR classrooms, and online schools for educators and neurodiverse students. But tools alone don’t guarantee equity. They must be audited, evaluated, and improved continuously.


Here’s how I recommend we respond as educators, developers, and policy shapers:


  • Integrate bias benchmarks into teacher prep and AI literacy courses (I’m already doing this on Skool: www.skool.com/onemetaversity)

  • Require all AI tools in classrooms to be tested with these new benchmark frameworks

  • Push EdTech companies and funders to make transparency and fairness part of their design principles

  • Elevate diverse voices in prompt engineering, training data collection, and model testing


Let’s Build Ethical AI—Not Just Efficient AI


We can’t afford to wait until another generation of learners is harmed by “neutral” tools that don’t see them. Bias isn’t just a glitch. It’s a design flaw. And we have the power—and the responsibility—to fix it.


To my fellow innovators, parents, teachers, techies, and truth-tellers: let’s commit to making AI worthy of our communities.


Let’s build benchmarks into the soul of AI systems so that excellence and equity walk hand in hand.




[Image: 21st Century Skills Needed at the Intersection of AI and XR, especially in Education.]


Learn more about my work at unitethemetaverse.com and join the movement to make tech that transforms—not traumatizes—learning.



References:

Abhishek, A., Erickson, L., & Bandopadhyay, T. (2025). BEATS: Bias Evaluation and Assessment Test Suite for Large Language Models. arXiv. https://doi.org/10.48550/arXiv.2503.24310


Ghosh, S., Frase, H., Williams, A., Luger, S., Röttger, P., Barez, F., McGregor, S., Fricklas, K., Kumar, M., Feuillade–Montixi, Q., Bollacker, K., Friedrich, F., Tsang, R., Vidgen, B., Parrish, A., Knotz, C., Presani, E., Bennion, J., Boston, M. F., … Vanschoren, J. (2025). AILuminate: Introducing v1.0 of the AI Risk and Reliability Benchmark from MLCommons. arXiv. https://doi.org/10.48550/arXiv.2503.05731


Mulligan, S. (2025, March 11). These new AI benchmarks could help make models less biased. MIT Technology Review. https://www.technologyreview.com/2025/03/11/1113000/these-new-ai-benchmarks-could-help-make-models-less-biased/


 
 
 
