Overview

Goal of the Proof-of-Concept

Build a real-time conversational experience with a MetaHuman in Unreal Engine, where a user can talk to the character through a microphone. The MetaHuman:

Listens to the user (voice input).

Thinks / responds through an LLM chatbot (API).

Talks back with real-time lip-sync and basic animations.

Can perform simple actions in the 3D world (“Pick up the ball” → walks to the ball → picks it up).

The PoC does not need to be production-ready, but must demonstrate that the entire loop works technically.

Functional Scope

  1. Conversation Loop (voice → AI → voice)

The user presses a push-to-talk button (or holds a key).

Microphone audio is recorded and converted to text (Speech-to-Text).

The text is sent via an API to a chatbot:

Must-have: OpenAI chatbot (e.g., GPT-4.1 or GPT-4o).

Nice-to-have: Mistral model, hosted on Scaleway (private deployment).

The chatbot returns a text-based response.

The response is converted to audio (Text-to-Speech).

The MetaHuman plays this audio with real-time lip-sync.

During speech, the MetaHuman performs an appropriate idle/gesture animation (hands, head movement, etc.).

  2. Action Commands in the World (“Pick up the ball”)

The chatbot output is not only spoken back as a reply but also analyzed for intents:

Examples:

“Pick up the ball”

“Go to the table”

When an intent is recognized, the system triggers an Unreal event:

The MetaHuman navigates to a target object (e.g., ball).

Starts a simple “pickup” animation.

For the PoC, a single interaction is sufficient:

A ball lies in the scene.

On command, the MetaHuman walks to it and “picks” it up (visually believable; physical accuracy not required).

Technical Architecture (Overview)

Unreal Engine + MetaHuman

One MetaHuman character in a small demo scene.

Audio output + animation (lip-sync, gestures).

Blueprint/C++ code for:

Starting/stopping microphone recording.

API calls to backend/LLM service.

Triggering animations & world actions.

Voice Pipeline

Input: Microphone → Unreal → (optional) local/remote STT service.

STT (Speech-to-Text):

e.g., Whisper (local or via API) or the OpenAI Realtime API as an integrated pipeline.

Output (TTS):

Text-to-Speech engine returning audio (wav/ogg).

Audio is played back in Unreal and linked to MetaHuman lip-sync (Audio-to-Curve or standard MetaHuman lip-sync tools).
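
As a concrete illustration of the two conversion steps, the sketch below assumes a Python backend and the OpenAI SDK; the model names (whisper-1, tts-1) and the voice are examples rather than project decisions, and a locally hosted Whisper would slot into the same place. Unreal only ever sees the audio file that comes back, which keeps the engine side independent of the STT/TTS vendor.

# Minimal sketch of the STT and TTS ends of the voice pipeline (Python backend assumed).
# Model and voice names are examples, not project decisions.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def speech_to_text(wav_path: str) -> str:
    # Transcribe the microphone recording that Unreal sent over.
    with open(wav_path, "rb") as audio_file:
        result = client.audio.transcriptions.create(
            model="whisper-1",
            file=audio_file,
        )
    return result.text

def text_to_speech(text: str, out_path: str = "reply.wav") -> str:
    # Convert the chatbot reply to a WAV file that Unreal can play back.
    response = client.audio.speech.create(
        model="tts-1",
        voice="alloy",
        input=text,
        response_format="wav",
    )
    response.write_to_file(out_path)
    return out_path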

Chatbot / LLM Layer

API endpoint receiving text + context and returning an answer.

Variant A (fastest): Direct OpenAI API (Chat Completions or Realtime API).

Variant B (nice-to-have): Self-hosted Mistral model on Scaleway with a simple REST endpoint.

Prompting:

MetaHuman role (personality, tone of voice).

Instructions to mark intents, e.g., JSON:

{
  "reply": "Of course, I'll pick up the ball for you.",
  "action": "PICK_UP_BALL"
}
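
The sketch below shows how a backend could request exactly that shape from the model, assuming the OpenAI Chat Completions API with JSON output; the persona text and the action list (PICK_UP_BALL, GO_TO_TABLE, NONE) are placeholders to be refined during the PoC.

# Sketch: ask the chatbot for a spoken reply plus an optional action marker.
# The persona text and the action list are placeholders, not final content.
import json

from openai import OpenAI

client = OpenAI()

SYSTEM_PROMPT = (
    "You are a friendly MetaHuman guide in a small 3D demo scene. "
    "Always answer with a JSON object containing two keys: "
    "'reply' (what you say out loud) and 'action' "
    "(one of PICK_UP_BALL, GO_TO_TABLE, or NONE)."
)

def ask_chatbot(user_text: str) -> dict:
    completion = client.chat.completions.create(
        model="gpt-4o",
        response_format={"type": "json_object"},
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": user_text},
        ],
    )
    return json.loads(completion.choices[0].message.content)

# ask_chatbot("Pick up the ball") is expected to yield something like
# {"reply": "Of course, I'll pick up the ball for you.", "action": "PICK_UP_BALL"}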

Intent / Action Interpreter

Small logic component (backend or Unreal) that:

Parses the chatbot response (e.g., JSON).

Checks if an action/command is present.

For a known action, triggers the appropriate Unreal event:

PICK_UP_BALL → Navmesh pathfinding → pickup animation.
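
A minimal sketch of that interpreter in the backend variant, assuming the JSON contract above; trigger_unreal_event is a hypothetical hook, standing in for whatever HTTP/WebSocket message actually reaches Unreal. Unknown or missing actions fall through to plain speech, so a malformed model response never blocks the conversation loop.

# Sketch of the intent/action interpreter (backend variant).
# trigger_unreal_event is a hypothetical hook; in practice this would be a
# message over the existing HTTP/WebSocket connection back to Unreal.
import json

KNOWN_ACTIONS = {"PICK_UP_BALL", "GO_TO_TABLE"}

def trigger_unreal_event(action: str) -> None:
    print(f"[intent] sending '{action}' to Unreal")  # placeholder transport

def handle_chatbot_response(raw_response: str) -> str:
    """Return the text to speak; fire a world action if one was marked."""
    try:
        payload = json.loads(raw_response)
    except json.JSONDecodeError:
        # The model answered in plain text; just speak it, trigger no action.
        return raw_response

    action = payload.get("action", "NONE")
    if action in KNOWN_ACTIONS:
        trigger_unreal_event(action)  # e.g. PICK_UP_BALL -> navmesh MoveTo + pickup anim
    return payload.get("reply", "")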

Step-by-Step Implementation Plan

Phase 1 — Basic Unreal + MetaHuman Setup

Create a new Unreal project.

Import a MetaHuman and set it up as a controllable character.

Create a simple demo scene with:

One MetaHuman.

One object: ball (clearly labeled / Blueprint class).

Phase 2 — Audio In & Out in Unreal

Integrate microphone input:

Implement push-to-talk (keybind + UI indicator).

Prepare audio buffer for transmission.

Audio playback of external TTS audio:

Play audio files received via HTTP in Unreal.

Verify lip-sync integration (phoneme or amplitude-based).

Phase 3 — STT + Chatbot Integration

Choose and integrate an STT service:

Send Unreal audio to backend (HTTP/WebSocket).

Receive text result.

Chatbot integration:

Send STT output + context to LLM API.

Receive answer.

Display responses in a debug UI (speech bubbles) before adding TTS.
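
To make the hand-off concrete, the sketch below assumes a small FastAPI service that receives the recorded audio from Unreal and reuses the speech_to_text and ask_chatbot helpers sketched earlier; the /converse route and field names are illustrative.

# Sketch of the Phase 3 backend: audio in, transcript + chatbot answer out.
# Assumes FastAPI and the speech_to_text / ask_chatbot helpers sketched above.
import tempfile

from fastapi import FastAPI, UploadFile

app = FastAPI()

@app.post("/converse")
async def converse(audio: UploadFile):
    # 1. Persist the uploaded microphone recording to a temporary WAV file.
    with tempfile.NamedTemporaryFile(suffix=".wav", delete=False) as tmp:
        tmp.write(await audio.read())
        wav_path = tmp.name

    # 2. STT: recording -> text.
    user_text = speech_to_text(wav_path)

    # 3. LLM: text (+ context) -> structured answer (reply + optional action).
    answer = ask_chatbot(user_text)

    # Unreal can show these fields in the debug UI before TTS is wired up.
    return {
        "transcript": user_text,
        "reply": answer.get("reply", ""),
        "action": answer.get("action", "NONE"),
    }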

Phase 4 — TTS + Lip-Sync

Integrate TTS:

Chatbot text → TTS → audio file/stream.

Send audio back to Unreal.

MetaHuman lip-sync:

Connect audio to lip-sync system (standard or custom anim blueprint).

Test with short and long sentences.
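
One possible shape for this step, assuming the same hypothetical FastAPI backend and the text_to_speech helper from the voice-pipeline sketch: a /speak route that returns the synthesized WAV so Unreal can play it and drive lip-sync from the same audio.

# Sketch: Phase 4 extension of the backend, returning TTS audio to Unreal.
# Reuses the text_to_speech helper from the voice-pipeline sketch; shown as a
# separate route here for brevity, but it would live in the same service.
from fastapi import FastAPI
from fastapi.responses import FileResponse
from pydantic import BaseModel

app = FastAPI()

class SpeakRequest(BaseModel):
    text: str  # chatbot reply to synthesize

@app.post("/speak")
def speak(request: SpeakRequest):
    wav_path = text_to_speech(request.text)  # writes e.g. reply.wav
    # Unreal downloads this WAV, plays it on the MetaHuman, and feeds the same
    # audio into the lip-sync system (phoneme- or amplitude-based).
    return FileResponse(wav_path, media_type="audio/wav")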

Phase 5 — Action Commands (“Pick up the ball”)

Expand prompting for structured output (text + action).

Build intent parser (backend or Unreal).

Build Unreal logic:

AI controller or simple MoveTo.

Animation sequence for picking up a ball.

Test case:

User says: “Pick up the ball.”

STT → LLM → output with action: PICK_UP_BALL.

Unreal moves MetaHuman to ball and performs pickup animation.
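
A small smoke-test sketch for this case, reusing the hypothetical ask_chatbot and handle_chatbot_response helpers from earlier sections; it calls the live model, so it checks the structured contract rather than being a deterministic unit test, and it stops where Unreal takes over.

# Sketch: quick smoke test of the Phase 5 loop up to the point where Unreal
# takes over. Reuses ask_chatbot / handle_chatbot_response from the sketches above.
import json

def test_pick_up_ball_intent():
    answer = ask_chatbot("Pick up the ball")        # stands in for the STT output
    assert answer.get("action") == "PICK_UP_BALL"   # structured intent is present
    spoken = handle_chatbot_response(json.dumps(answer))
    assert spoken                                   # there is something for TTS to say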

Phase 6 — Polish & Demo Quality

Add basic idle/gesture animations during speech.

Simple UI:

“Listening…” / “Thinking…” / “Speaking…” indicators.

Optional subtitles.

Logging & fallback:

Log all user utterances + AI responses.

Simple error handling (no connection, timeouts, etc.).
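
A minimal sketch of that logging and fallback behaviour on the client side of the backend call, assuming the service is reached over HTTP with the requests library; the endpoint URL, the 15-second timeout, and the fallback sentence are placeholder values.

# Sketch: log every exchange and fall back gracefully on network errors.
# Endpoint URL, timeout, and fallback sentence are placeholder values.
import logging

import requests

logging.basicConfig(filename="conversation.log", level=logging.INFO)
log = logging.getLogger("poc")

FALLBACK_REPLY = "Sorry, I didn't catch that. Could you try again?"

def safe_converse(backend_url: str, wav_path: str) -> dict:
    try:
        with open(wav_path, "rb") as f:
            response = requests.post(backend_url, files={"audio": f}, timeout=15)
        response.raise_for_status()
        payload = response.json()
    except (requests.ConnectionError, requests.Timeout, requests.HTTPError) as exc:
        log.error("backend call failed: %s", exc)
        return {"reply": FALLBACK_REPLY, "action": "NONE"}

    log.info("user said: %s", payload.get("transcript", ""))
    log.info("ai replied: %s", payload.get("reply", ""))
    return payload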

Acceptance Criteria for the PoC

The PoC is successful if a user can:

Use a button/key to speak to the MetaHuman via the microphone.

Receive a coherent, intelligible spoken response.

The MetaHuman must:

Show visible and believable lip-sync.

Display simple idle/gesture animations while speaking.

On the command “Pick up the ball” (or similar):

Walk to the ball.

Perform a clear pickup action.

The complete loop (voice → AI → voice + action) must run stably for several minutes.

Nice-to-Haves (Outside First Iteration)

Hosting the LLM on Scaleway (Mistral) via private deployment.

Multi-language support (NL/EN) with automatic language detection.

More complex world interactions (placing objects, pointing, opening UI).

Emotion-driven animations (happy/angry/quizzical depending on AI response).

Issue tracking

Tracker  Open  Closed  Total
Bug      0     0       0
Feature  12    6       18
Support  0     0       0

Time tracking

  • Estimated time: 38:00 hours
  • Spent time: 0:00 hours
