System Requirements / Experiments

I built the aforementioned system and tested each of its components to some extent, but that testing was largely anecdotal. For that reason, I now want to assess the system's abilities more rigorously, with metrics.

System: User-NPC-Interaction

Knowledge Assessment

We should be able to ask the NPC questions about various things in the world and get correct answers. Questions of interest might concern the city an NPC lives in, the history of the world, and so on. First, it is necessary to evaluate the recall of the knowledge-base RAG system: is the information we expect to be retrieved from the knowledge base actually being pulled out and inserted into the prompt chain when we converse with the NPC? Second, is that information then making it into the response the NPC gives?
We should also be able to ask NPCs about entities that have been affected by a mission and confirm, through conversation with those NPC(s), that the updated world state (e.g. entities represented by knowledge tags) is correctly retrieved and reflects the outcome of that mission.
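
The retrieval-recall check described above could be scored along the following lines. This is a minimal sketch, not the system's actual evaluation code: it assumes each test question has a hand-labelled list of knowledge entries that should be retrieved, and the function names and entry IDs are hypothetical.

```python
def retrieval_recall(expected_ids, retrieved_ids):
    """Fraction of the expected knowledge entries that were actually retrieved."""
    expected = set(expected_ids)
    if not expected:
        raise ValueError("expected_ids must not be empty")
    return len(expected & set(retrieved_ids)) / len(expected)


def mean_retrieval_recall(cases):
    """Average recall over a list of (expected_ids, retrieved_ids) pairs."""
    scores = [retrieval_recall(expected, retrieved) for expected, retrieved in cases]
    return sum(scores) / len(scores)


# Hypothetical test cases; the IDs stand in for knowledge-base entries.
cases = [
    (["city_history", "city_ruler"], ["city_history", "city_ruler", "tavern_menu"]),
    (["world_war"], ["city_history"]),
]
print(mean_retrieval_recall(cases))  # 0.5
```

The same scoring can be applied a second time to the NPC's response, by labelling which expected entries are actually reflected in the reply, to separate retrieval failures from generation failures.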

Long Term Memory Assessment

As a reminder, long-term memories include both NPC observations and reflections. We need to assess the system's ability to store and retrieve relevant memories. These observations (entity did such-and-such) and reflections can concern pre-game-initialization events as well as events that occur during play.

We should be able to talk to an NPC about their past. This means that an NPC should be well aware of their backstory (as written in the game designer suite) and be able to talk about it. Here we assess the agent's ability to retrieve relevant memories as well as to generate a response that uses those memories.
NPCs should recall their past conversations with the user.
Mission outcomes (narratives) are converted into observations and reflections for each companion on the mission. We should be able to ask the companion what happened on the mission and get them to share how they felt about its results. Did the companion grow? Did they learn something? Can they recount the events? Does the companion’s account of the mission fall fully or only partially within the actual narrative outcome of the mission?

NPC Objectives Assessment

Depending on the current game state, various NPC objectives may be available. NPC objectives must be both correctly classified as completed and incorporated into an NPC’s response when expected.

Are NPC objectives completed when they should be? Over the course of a playthrough we track all dialogue exchanges as well as npc_objective completion for each exchange. From these, the confusion matrix, recall, precision, and F1-score are calculated.
It’s important to verify that NPCs talk about what we expect them to talk about. When just chatting with an NPC, do their responses make sense? Do NPCs function like chatbots when expected? Do they attempt to further their own objectives or fall into “AI-assistant-mode”? Do the NPCs bring up available NPC objectives as would be expected, or do they ignore them or railroad the conversation toward the objectives? For this experiment we will simply converse with various NPCs who have been given (1) random-ish knowledge they can converse about, (2) personal, non-story-based motivations, and (3) available npc_objectives with prompt injections. NPC responses will be assessed by passing the input prompt chain and the NPC response to GPT-4.
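
The objective-completion metrics mentioned above (confusion matrix, precision, recall, F1) can be computed as in the sketch below. It assumes that each dialogue exchange has a hand-labelled ground truth for whether the objective should have been marked complete, alongside the system's own classification; the function name and the sample labels are illustrative only.

```python
def objective_metrics(expected, predicted):
    """Confusion-matrix counts and derived scores for objective completion.

    expected:  list of bools, True if the objective should count as completed
    predicted: list of bools, True if the system marked it as completed
    """
    if len(expected) != len(predicted):
        raise ValueError("expected and predicted must have the same length")
    pairs = list(zip(expected, predicted))
    tp = sum(1 for e, p in pairs if e and p)          # correctly completed
    fp = sum(1 for e, p in pairs if not e and p)      # completed too eagerly
    fn = sum(1 for e, p in pairs if e and not p)      # completion missed
    tn = sum(1 for e, p in pairs if not e and not p)  # correctly left open
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0
    return {"tp": tp, "fp": fp, "fn": fn, "tn": tn,
            "precision": precision, "recall": recall, "f1": f1}


# Hypothetical labels for five dialogue exchanges.
expected = [True, True, False, False, True]
predicted = [True, False, False, True, True]
metrics = objective_metrics(expected, predicted)
print(metrics["tp"], metrics["fp"], metrics["fn"], metrics["tn"])  # 2 1 1 1
```

Reporting the raw counts next to the derived scores makes it easier to see whether errors come mostly from objectives being completed too eagerly (false positives) or from completions being missed (false negatives).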

System: Missions

The second major system that we need to assess is the mission pipeline.

Knowledge Assessment

First, we want to assess whether all relevant knowledge is being extracted from the knowledge available to the companion(s) sent on the mission.

Long Term Memory Assessment

Are relevant long-term memories being retrieved from the memory stores of the companion(s)?

Mission Outcome Assessment

Naive Mission Outcome - Is the selected mission outcome the one we expect? The naive mission outcome is the outcome expected when no relevant NPC objectives have been completed and the NPC has no long-term memories beyond those they were initialized with.
Informed Mission Outcome - Is the resulting mission outcome the informed mission outcome? The informed mission outcome is the outcome expected for a given companion when all relevant NPC objectives have been completed.
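
One simple way to score the two checks above is per-condition accuracy: how often the selected outcome matches the expected one under the naive and informed setups. The sketch below assumes each trial is recorded as a (condition, expected outcome, selected outcome) triple; the outcome labels and the function name are hypothetical.

```python
from collections import defaultdict


def outcome_accuracy(trials):
    """Share of trials, per condition, where the selected outcome matched the expected one.

    trials: iterable of (condition, expected_outcome, selected_outcome) tuples
    """
    hits = defaultdict(int)
    totals = defaultdict(int)
    for condition, expected, selected in trials:
        totals[condition] += 1
        hits[condition] += int(expected == selected)
    return {condition: hits[condition] / totals[condition] for condition in totals}


# Hypothetical trial log; outcome labels are placeholders.
trials = [
    ("naive", "failure", "failure"),
    ("naive", "failure", "partial_success"),
    ("informed", "success", "success"),
    ("informed", "success", "success"),
]
print(outcome_accuracy(trials))  # {'naive': 0.5, 'informed': 1.0}
```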

Mission Narrative

Content - do the result and content of the mission narrative make sense? Does anything in the narrative explicitly contradict the state of the game?
Content - do relevant companion memories and/or knowledge make their way into the mission narrative, or does the LLM use them purely to determine the mission outcome?

Writing Ability

What is the general writing ability of gpt-3.5-turbo?

Knowledge Extraction

Is all pertinent knowledge being extracted from the narrative and assigned to the correct entity/knowledge_tag?

