Overview
Goal of the Proof-of-Concept
Build a real-time conversational experience with a MetaHuman in Unreal Engine, where a user can talk to the character through a microphone. The MetaHuman:
Listens to the user (voice input).
Thinks / responds through an LLM chatbot (API).
Talks back with real-time lip-sync and basic animations.
Can perform simple actions in the 3D world (“Pick up the ball → walks to ball → picks up ball”).
The PoC does not need to be production-ready, but must demonstrate that the entire loop works technically.
Functional Scope
- Conversation Loop (voice → AI → voice)
The user presses a push-to-talk button (or holds a key).
Microphone audio is recorded and converted to text (Speech-to-Text).
The text is sent via an API to a chatbot:
Must-have: OpenAI chatbot (e.g., GPT-4.1 or GPT-4o).
Nice-to-have: Mistral model, hosted on Scaleway (private deployment).
The chatbot returns a text-based response.
The response is converted to audio (Text-to-Speech).
The MetaHuman plays this audio with real-time lip-sync.
During speech, the MetaHuman performs an appropriate idle/gesture animation (hands, head movement, etc.).
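The steps above form one full round trip. A minimal sketch of that round trip as a single backend function, assuming Python; the three helper callables (transcribe, ask_llm, synthesize) and the wav-bytes exchange format are placeholders for whichever STT/LLM/TTS services the PoC ends up using, not decisions.

```python
# One round trip of the conversation loop: recorded audio in, spoken reply out.
# The three callables are placeholders for whichever STT/LLM/TTS services are chosen.
from typing import Callable

def conversation_turn(
    recorded_wav: bytes,
    history: list[dict],
    transcribe: Callable[[bytes], str],     # STT: wav bytes -> user text
    ask_llm: Callable[[list[dict]], dict],  # LLM: chat history -> {"reply": ..., "action": ...}
    synthesize: Callable[[str], bytes],     # TTS: reply text -> wav bytes
) -> tuple[bytes, dict]:
    """Run one voice -> AI -> voice turn and return (reply audio, structured LLM output)."""
    user_text = transcribe(recorded_wav)
    history.append({"role": "user", "content": user_text})

    llm_output = ask_llm(history)
    history.append({"role": "assistant", "content": llm_output["reply"]})

    reply_audio = synthesize(llm_output["reply"])
    return reply_audio, llm_output  # Unreal plays the audio and interprets llm_output["action"]
```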
- Action Commands in the World (“pick up the ball”)
The chatbot output is not only used as conversational text but also analyzed for intents:
Examples:
“Pick up the ball”
“Go to the table”
When an intent is recognized, the system triggers an Unreal event:
The MetaHuman navigates to a target object (e.g., ball).
Starts a simple “pickup” animation.
For the PoC, a single interaction is sufficient:
A ball lies in the scene.
On command, the MetaHuman walks to it and “picks” it up (visually believable; physical accuracy not required).
Technical Architecture (Overview)
Unreal Engine + MetaHuman
One MetaHuman character in a small demo scene.
Audio output + animation (lip-sync, gestures).
Blueprint/C++ code for:
Starting/stopping microphone recording.
API calls to backend/LLM service.
Triggering animations & world actions.
Voice Pipeline
Input: Microphone → Unreal → (optional) local/remote STT service.
STT (Speech-to-Text):
e.g., Whisper (local or via API) or the OpenAI Realtime API as an integrated speech pipeline.
Output (TTS):
Text-to-Speech engine returning audio (wav/ogg).
Audio is played back in Unreal and linked to MetaHuman lip-sync (Audio-to-Curve or standard MetaHuman lip-sync tools).
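One possible concrete filling of the STT and TTS ends of this pipeline, sketched with the OpenAI Python SDK; the model names (whisper-1, tts-1), the voice, the wav response format, and the exact response-handling calls are assumptions and could equally be swapped for a local Whisper install or another TTS engine.

```python
# Sketch of the STT and TTS ends of the voice pipeline with the OpenAI Python SDK
# (pip install openai). Model and voice choices are placeholders, not decisions.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def transcribe(wav_path: str) -> str:
    """Speech-to-Text: send a recorded wav file to Whisper and return the transcript."""
    with open(wav_path, "rb") as audio_file:
        result = client.audio.transcriptions.create(model="whisper-1", file=audio_file)
    return result.text

def synthesize(text: str, out_path: str = "reply.wav") -> str:
    """Text-to-Speech: turn the chatbot reply into a wav file Unreal can play back."""
    response = client.audio.speech.create(
        model="tts-1",
        voice="alloy",
        input=text,
        response_format="wav",
    )
    with open(out_path, "wb") as out_file:
        out_file.write(response.read())  # write the raw wav bytes to disk
    return out_path
```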
Chatbot / LLM Layer
API endpoint receiving text + context and returning an answer.
Variant A (fastest): Direct OpenAI API (chat completion or Realtime API).
Variant B (nice-to-have): Self-hosted Mistral model on Scaleway with a simple REST endpoint.
Prompting:
MetaHuman role (personality, tone-of-voice).
Instructions to mark intents, e.g., JSON:
```json
{
  "reply": "Of course, I'll pick up the ball for you.",
  "action": "PICK_UP_BALL"
}
```
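A sketch of how this reply/action contract could be requested from the OpenAI Chat Completions API using JSON mode; the system prompt wording, the gpt-4o model choice, and the action vocabulary are placeholders.

```python
# Sketch: ask the LLM to answer in character and flag intents as structured JSON.
# The system prompt, model name, and action vocabulary are placeholders.
import json
from openai import OpenAI

client = OpenAI()

SYSTEM_PROMPT = (
    "You are a friendly MetaHuman guide in a small demo scene. "
    "Always answer with a JSON object containing exactly two keys: "
    '"reply" (what you say out loud) and "action" '
    '(one of "PICK_UP_BALL", "GO_TO_TABLE", or "NONE").'
)

def ask_llm(history: list[dict]) -> dict:
    """Send the chat history and return the parsed {'reply': ..., 'action': ...} dict."""
    completion = client.chat.completions.create(
        model="gpt-4o",
        response_format={"type": "json_object"},  # JSON mode keeps the output machine-parsable
        messages=[{"role": "system", "content": SYSTEM_PROMPT}, *history],
    )
    return json.loads(completion.choices[0].message.content)
```

Calling ask_llm([{"role": "user", "content": "Pick up the ball"}]) should then yield a dict like the JSON example above.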
Intent / Action Interpreter
Small logic component (backend or Unreal) that:
Parses the chatbot response (e.g., JSON).
Checks if an action/command is present.
For a known action, triggers the appropriate Unreal event:
PICK_UP_BALL → NavMesh pathfinding → pickup animation.
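A minimal sketch of this interpreter in Python (it could just as well live in Unreal C++/Blueprint); the known-action set and the fallback behavior for malformed or unknown output are assumptions.

```python
# Sketch of the intent/action interpreter: parse the LLM's JSON reply, keep only
# actions the PoC knows about, and degrade gracefully when the contract is broken.
import json

KNOWN_ACTIONS = {"PICK_UP_BALL", "GO_TO_TABLE"}  # placeholder action vocabulary

def interpret(raw_llm_output: str) -> tuple[str, str | None]:
    """Return (spoken_reply, action); action is None when nothing should be triggered."""
    try:
        parsed = json.loads(raw_llm_output)
    except json.JSONDecodeError:
        # The model ignored the JSON contract: speak the raw text, trigger nothing.
        return raw_llm_output, None

    reply = parsed.get("reply", "")
    action = parsed.get("action")
    return reply, action if action in KNOWN_ACTIONS else None
```

On the Unreal side, the returned action string then maps onto an event (PICK_UP_BALL → NavMesh pathfinding → pickup animation).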
Step-by-Step Implementation Plan
Phase 1 — Basic Unreal + MetaHuman Setup
Create a new Unreal project.
Import MetaHuman and make it a controllable character.
Create a simple demo scene with:
One MetaHuman.
One object: ball (clearly labeled / Blueprint class).
Phase 2 — Audio In & Out in Unreal
Integrate microphone input:
Implement push-to-talk (keybind + UI indicator).
Prepare audio buffer for transmission.
Playback of externally generated TTS audio:
Play audio files received via HTTP response in Unreal.
Verify lip-sync integration (phoneme or amplitude-based).
Phase 3 — STT + Chatbot Integration
Choose and integrate an STT service:
Send Unreal audio to backend (HTTP/WebSocket).
Receive text result.
Chatbot integration:
Send STT output + context to LLM API.
Receive answer.
Display in debug UI (speech bubbles) before adding TTS.
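One way to wire this phase together is a small HTTP backend that Unreal posts the recorded clip to and that returns the transcript plus the LLM reply for the debug UI. The sketch below assumes FastAPI, a /talk route, and a hypothetical poc_pipeline module collecting the transcribe/ask_llm helpers sketched earlier; none of these names are fixed.

```python
# Sketch of a minimal Phase 3 backend (pip install fastapi uvicorn python-multipart).
# Unreal POSTs the recorded clip; the backend returns the transcript and the LLM reply
# so both can be shown in the debug UI before TTS is added in Phase 4.
import tempfile

from fastapi import FastAPI, File, UploadFile

from poc_pipeline import ask_llm, transcribe  # hypothetical module collecting the earlier helper sketches

app = FastAPI()
history: list[dict] = []  # single-session chat history; enough for a PoC

@app.post("/talk")
async def talk(audio: UploadFile = File(...)):
    # Persist the uploaded clip so the file-based transcribe() helper can read it.
    with tempfile.NamedTemporaryFile(suffix=".wav", delete=False) as tmp:
        tmp.write(await audio.read())
        wav_path = tmp.name

    user_text = transcribe(wav_path)  # STT
    history.append({"role": "user", "content": user_text})

    llm_output = ask_llm(history)  # {"reply": ..., "action": ...}
    history.append({"role": "assistant", "content": llm_output["reply"]})

    # Phase 3 returns text only; Phase 4 layers the TTS audio on top of this response.
    return {"transcript": user_text, **llm_output}
```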
Phase 4 — TTS + Lip-Sync
Integrate TTS:
Chatbot text → TTS → audio file/stream.
Send audio back to Unreal.
MetaHuman lip-sync:
Connect audio to lip-sync system (standard or custom anim blueprint).
Test with short and long sentences.
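If the standard lip-sync tooling is not in place yet, the amplitude-based fallback mentioned in Phase 2 boils down to a normalized loudness envelope of the TTS audio. A sketch that computes such a curve offline, assuming 16-bit PCM mono wav input and a 30 Hz curve rate (both assumptions); in the PoC the curve would drive a mouth-open parameter in the anim blueprint.

```python
# Sketch: turn a TTS wav into a coarse "mouth open" curve for amplitude-based lip-sync
# testing. Assumes 16-bit PCM mono wav and a 30 Hz curve; the proper path is the
# standard MetaHuman / Audio-to-Curve tooling, this is only a fallback/debug aid.
import wave

import numpy as np

def mouth_open_curve(wav_path: str, curve_hz: int = 30) -> list[float]:
    """Return a 0..1 loudness envelope sampled at curve_hz, one value per anim-curve key."""
    with wave.open(wav_path, "rb") as wav:
        sample_rate = wav.getframerate()
        samples = np.frombuffer(wav.readframes(wav.getnframes()), dtype=np.int16)

    window = max(1, sample_rate // curve_hz)  # samples per curve key
    curve = []
    for start in range(0, len(samples), window):
        chunk = samples[start:start + window].astype(np.float64)
        rms = float(np.sqrt(np.mean(chunk ** 2))) if chunk.size else 0.0
        curve.append(rms)

    peak = max(curve, default=0.0) or 1.0  # normalize so the anim curve stays in 0..1
    return [value / peak for value in curve]
```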
Phase 5 — Action Commands (“Pick up the ball”)
Expand prompting for structured output (text + action).
Build intent parser (backend or Unreal).
Build Unreal logic:
AI controller or simple MoveTo.
Animation sequence for picking up a ball.
Test case:
User says: “Pick up the ball.”
STT → LLM → output with action: PICK_UP_BALL.
Unreal moves MetaHuman to ball and performs pickup animation.
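The non-Unreal half of this test case can be automated. A pytest-style sketch against the interpreter from the earlier section; the intent_interpreter module name and the canned LLM output are hypothetical.

```python
# Sketch of the Phase 5 test case at the parsing level (the navigation and animation
# are verified manually in Unreal). The intent_interpreter module and the canned
# LLM output are hypothetical; run with pytest.
from intent_interpreter import interpret

def test_pick_up_ball_intent():
    canned_output = '{"reply": "Of course, I\'ll pick up the ball for you.", "action": "PICK_UP_BALL"}'
    reply, action = interpret(canned_output)
    assert action == "PICK_UP_BALL"
    assert reply  # a non-empty spoken reply is needed for TTS

def test_unknown_action_is_ignored():
    reply, action = interpret('{"reply": "Hmm.", "action": "DO_A_BACKFLIP"}')
    assert action is None  # unknown actions must never trigger Unreal events
```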
Phase 6 — Polish & Demo Quality
Add basic idle/gesture animations during speech.
Simple UI:
“Listening…” / “Thinking…” / “Speaking…” indicators.
Optional subtitles.
Logging & fallback:
Log all user utterances + AI responses.
Simple error handling (no connection, timeouts, etc.).
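A sketch of the logging and fallback behavior, written here as a Python client of the backend; in the PoC the same pattern would sit in the Unreal Blueprint/C++ HTTP call. The timeout value and the fallback line are placeholders.

```python
# Sketch of the logging and fallback behavior around the backend call; the timeout
# value and the fallback line are placeholders (pip install requests).
import logging

import requests

logging.basicConfig(filename="conversation.log", level=logging.INFO)
FALLBACK_REPLY = {"reply": "Sorry, I didn't catch that. Could you try again?", "action": "NONE"}

def safe_talk(backend_url: str, wav_path: str) -> dict:
    """Call the backend, log the exchange, and degrade gracefully on errors."""
    try:
        with open(wav_path, "rb") as audio:
            response = requests.post(backend_url, files={"audio": audio}, timeout=15)
        response.raise_for_status()
        result = response.json()
    except requests.RequestException as error:  # no connection, timeout, HTTP error
        logging.error("Backend call failed: %s", error)
        return FALLBACK_REPLY

    logging.info("user: %s | reply: %s | action: %s",
                 result.get("transcript"), result.get("reply"), result.get("action"))
    return result
```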
Acceptance Criteria for the PoC
The PoC is successful if a user can:
Use a button/key to speak to the MetaHuman via the microphone.
Receive a coherent, intelligible spoken response.
The MetaHuman must:
Show visible and believable lip-sync.
Display simple idle/gesture animations while speaking.
On the command “Pick up the ball” (or similar):
Walk to the ball.
Perform a clear pickup action.
The complete loop (voice → AI → voice + action) must run stably for several minutes.
Nice-to-Haves (Outside First Iteration)
Hosting the LLM on Scaleway (Mistral) via private deployment.
Multi-language support (NL/EN) with automatic language detection.
More complex world interactions (placing objects, pointing, opening UI).
Emotion-driven animations (happy/angry/quizzical depending on AI response).
Members
Manager: Joshiya Mitsunaga, Malek chrifi alaoui, Max Zoutman, Ritwik Sinha
Developer: Joshiya Mitsunaga, Malek chrifi alaoui, Max Zoutman, Ritwik Sinha
Reporter: Joshiya Mitsunaga, Malek chrifi alaoui, Max Zoutman, Ritwik Sinha
Tester: Joshiya Mitsunaga, Malek chrifi alaoui, Max Zoutman, Ritwik Sinha