AI Meeting Intelligence System: Automated Transcription, Speaker Diarization, and Retrieval-Augmented Summarization over Meeting Audio
Keywords:
Automatic Speech Recognition, Speaker Diarization, Retrieval-Augmented Generation, Large Language Models, Meeting Summarization, Knowledge Graph, FastAPI, Next.js, ChromaDB, MongoDB, Whisper, pyannote.audio
Abstract
Meetings drive most organizational decision-making, yet the knowledge they produce rarely survives in usable form. Notes are scattered, action items go untracked, and recordings sit unwatched. This paper presents an AI Meeting Intelligence System that addresses this problem end to end. The system accepts a recorded audio or video file, runs transcription through a locally deployed Whisper model, applies WhisperX and pyannote.audio for speaker attribution, and passes the labeled transcript to a large language model—either Google Gemini 2.5 Flash or Groq-hosted Llama 3.3—to produce structured summaries, action item lists, and sentiment assessments. A Retrieval-Augmented Generation (RAG) module stores chunk-level embeddings in ChromaDB so users can query any past meeting in plain English. Each meeting also yields an entity-relationship graph, extracted by the LLM and rendered interactively in the browser via Cytoscape.js. The backend is FastAPI with asynchronous MongoDB persistence; the frontend is Next.js 14 in TypeScript. Evaluation on real recordings shows that a 30-minute meeting is fully processed in roughly four minutes on a standard laptop, with Word Error Rates competitive with published Whisper benchmarks.
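The speaker-attribution step described above can be illustrated with a minimal sketch: ASR yields timestamped transcript segments, diarization yields timestamped speaker turns, and each segment is labeled with the speaker whose turn overlaps it most. This is a simplified stand-in for what WhisperX and pyannote.audio do (their alignment is word-level and far more precise); the data shapes and the `assign_speakers` helper here are illustrative assumptions, not the paper's API.

```python
# Sketch only: label each ASR segment with the diarization speaker whose
# turn overlaps it most in time. Data shapes are assumed for illustration;
# WhisperX/pyannote.audio perform a much finer word-level alignment.

def overlap(a_start, a_end, b_start, b_end):
    """Length (in seconds) of the intersection of two time intervals."""
    return max(0.0, min(a_end, b_end) - max(a_start, b_start))

def assign_speakers(segments, turns):
    """segments: [{'start', 'end', 'text'}]; turns: [{'start', 'end', 'speaker'}].

    Returns the segments with a 'speaker' field attached; segments that
    overlap no diarization turn are labeled 'UNKNOWN'.
    """
    labeled = []
    for seg in segments:
        scores = [
            (overlap(seg["start"], seg["end"], t["start"], t["end"]), t["speaker"])
            for t in turns
        ]
        best_score, speaker = max(scores, default=(0.0, "UNKNOWN"))
        labeled.append({**seg, "speaker": speaker if best_score > 0 else "UNKNOWN"})
    return labeled
```

In a real pipeline the overlap heuristic breaks down on cross-talk and very short utterances, which is why the system delegates alignment to WhisperX rather than a segment-level rule like this one.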
License

This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.


