AI Meeting Intelligence System: Automated Transcription, Speaker Diarization, and Retrieval-Augmented Summarization over Meeting Audio
Keywords:
Automatic Speech Recognition, Speaker Diarization, Retrieval-Augmented Generation, Large Language Models, Meeting Summarization, Knowledge Graph, FastAPI, Next.js, ChromaDB, MongoDB, Whisper, pyannote.audio
Abstract
Meetings drive most organizational decision-making, yet the knowledge they produce rarely survives in usable form. Notes are scattered, action items go untracked, and recordings sit unwatched. This paper presents an AI Meeting Intelligence System that addresses this problem end to end. The system accepts a recorded audio or video file, runs transcription through a locally deployed Whisper model, applies WhisperX and pyannote.audio for speaker attribution, and passes the labeled transcript to a large language model—either Google Gemini 2.5 Flash or Groq-hosted Llama 3.3—to produce structured summaries, action item lists, and sentiment assessments. A Retrieval-Augmented Generation (RAG) module stores chunk-level embeddings in ChromaDB so users can query any past meeting in plain English. Each meeting also yields an entity-relationship graph, extracted by the LLM and rendered interactively in the browser via Cytoscape.js. The backend is FastAPI with asynchronous MongoDB persistence; the frontend is Next.js 14 in TypeScript. Evaluation on real recordings shows that a 30-minute meeting is fully processed in roughly four minutes on a standard laptop, with Word Error Rates competitive with published Whisper benchmarks.
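The speaker-attribution step described above can be illustrated with a minimal sketch: ASR yields timestamped transcript segments, diarization yields timestamped speaker turns, and each segment is labeled with the speaker whose turn overlaps it most. This is a simplified stand-in for what WhisperX and pyannote.audio do (their alignment is word-level and far more precise); the data shapes and the `assign_speakers` helper here are illustrative assumptions, not the paper's API.

```python
# Sketch only: label each ASR segment with the diarization speaker whose
# turn overlaps it most in time. Data shapes are assumed for illustration;
# WhisperX/pyannote.audio perform a much finer word-level alignment.

def overlap(a_start, a_end, b_start, b_end):
    """Length (in seconds) of the intersection of two time intervals."""
    return max(0.0, min(a_end, b_end) - max(a_start, b_start))

def assign_speakers(segments, turns):
    """segments: [{'start', 'end', 'text'}]; turns: [{'start', 'end', 'speaker'}].

    Returns the segments with a 'speaker' field attached; segments that
    overlap no diarization turn are labeled 'UNKNOWN'.
    """
    labeled = []
    for seg in segments:
        scores = [
            (overlap(seg["start"], seg["end"], t["start"], t["end"]), t["speaker"])
            for t in turns
        ]
        best_score, speaker = max(scores, default=(0.0, "UNKNOWN"))
        labeled.append({**seg, "speaker": speaker if best_score > 0 else "UNKNOWN"})
    return labeled
```

In a real pipeline the overlap heuristic breaks down on cross-talk and very short utterances, which is why the system delegates alignment to WhisperX rather than a segment-level rule like this one.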
License

This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.


