RAG testing: Development

Evaluation (DEVELOPMENT, ground truth is available)

Cost per question and expected answer pair: ~$0.03 ($17.10 → $17.12)

This cost covers the following DeepEval metrics (a usage sketch follows the list):

  • ==Answer Relevancy== (how relevant the generated answer is to the question)
  • Faithfulness (whether the generated answer aligns with the retrieved context)
  • Contextual Relevancy (whether the retrieved context is relevant to the question)
  • ==Contextual Recall== (whether the retrieved context covers the expected answer, i.e. the ground truth)
  • Contextual Precision (whether more relevant context chunks are ranked higher than less relevant ones)
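
As a rough illustration, here is a minimal sketch of running these five metrics on a single test case with DeepEval's `evaluate` helper. The question, answers, and retrieval context are made-up placeholders, and the default LLM judge (an OpenAI API key) is assumed:

```python
# Minimal sketch: scoring one question / expected-answer pair with the five
# DeepEval metrics listed above. All example strings are placeholders.
from deepeval import evaluate
from deepeval.metrics import (
    AnswerRelevancyMetric,
    FaithfulnessMetric,
    ContextualRelevancyMetric,
    ContextualRecallMetric,
    ContextualPrecisionMetric,
)
from deepeval.test_case import LLMTestCase

test_case = LLMTestCase(
    input="What is the refund window?",                                 # question
    actual_output="Refunds are accepted within 30 days.",               # generated answer
    expected_output="Customers can request a refund within 30 days.",   # ground truth
    retrieval_context=["Our policy allows refunds within 30 days of purchase."],
)

metrics = [
    AnswerRelevancyMetric(),
    FaithfulnessMetric(),
    ContextualRelevancyMetric(),
    ContextualRecallMetric(),
    ContextualPrecisionMetric(),
]

evaluate(test_cases=[test_case], metrics=metrics)
```

Each metric produces a 0–1 score plus an LLM-generated reason; those judge calls are what the per-pair cost above pays for.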

Cost for correctness metric: ~$0.01 ($17.12 → $17.13)

  • ==Correctness== (compares the expected answer with the generated answer)
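
One common way to score correctness in DeepEval is a custom `GEval` metric that compares the expected and generated answers, as sketched below; the criteria wording and example strings are illustrative assumptions, not necessarily the exact prompt used here:

```python
# Sketch of a correctness metric built with DeepEval's GEval.
from deepeval.metrics import GEval
from deepeval.test_case import LLMTestCase, LLMTestCaseParams

correctness_metric = GEval(
    name="Correctness",
    # Illustrative criteria, assumed for this sketch.
    criteria="Determine whether the actual output is factually consistent with the expected output.",
    evaluation_params=[
        LLMTestCaseParams.EXPECTED_OUTPUT,
        LLMTestCaseParams.ACTUAL_OUTPUT,
    ],
)

test_case = LLMTestCase(
    input="What is the refund window?",
    actual_output="Refunds are accepted within 30 days.",
    expected_output="Customers can request a refund within 30 days.",
)

correctness_metric.measure(test_case)
print(correctness_metric.score, correctness_metric.reason)
```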

How to read metrics

The retriever is probably the component we need to optimise the most, so let's focus on the contextual metrics.

==The IDEAL situation is one where all context metrics (relevancy, recall, and precision) are high; precision matters less once relevancy and recall are already high.==

The following are some of the possibilities:

  • High Contextual Precision with low-to-mid Contextual Recall and Contextual Relevancy
    • This means the context contains useful data, but not everything retrieved is valuable for answering the given question.
    • Why does this happen?
      • All the data required to answer the question is in the context, but the context also contains additional data that is not required.
        • Solution: Context compression methods (see the sketch after this list)
      • Irrelevant context is present and valuable context is missing.
        • Solution: Improve the retrieval process by improving chunking strategies, extractors, query decomposition, etc.
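
As a sketch of what a context compression step could look like (assuming a sentence-transformers embedding model, which is not necessarily what this project uses), the snippet below drops retrieved chunks whose similarity to the question falls below a threshold before they are passed to the generator:

```python
# Simple context compression: keep only retrieved chunks that are
# sufficiently similar to the question. Model name and threshold are
# illustrative assumptions.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

def compress_context(question: str, chunks: list[str], threshold: float = 0.4) -> list[str]:
    """Return only the chunks whose cosine similarity to the question exceeds the threshold."""
    q_emb = model.encode(question, convert_to_tensor=True)
    c_emb = model.encode(chunks, convert_to_tensor=True)
    scores = util.cos_sim(q_emb, c_emb)[0]  # one similarity score per chunk
    return [chunk for chunk, score in zip(chunks, scores) if float(score) >= threshold]

# Example: the off-topic chunk is filtered out before reaching the LLM.
chunks = [
    "Refunds are accepted within 30 days of purchase.",
    "Our office is closed on public holidays.",
]
print(compress_context("What is the refund window?", chunks))
```

Libraries such as LangChain offer similar behaviour out of the box (contextual compression retrievers), but the underlying idea is the same filtering step shown here.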