RAG testing: Development
Evaluation (DEVELOPMENT, ground truth is available)
Cost per question and expected answer pair: ~$0.03 ($17.10 → $17.12)
This cost covers the following DeepEval metrics (a setup sketch for the full metric suite follows below):
- ==Answer Relevancy== (how relevant the answer is to the question)
- Faithfulness (checks whether the generated answer aligns with the retrieved context)
- Contextual Relevancy (whether the retrieved context is relevant to the question)
- ==Contextual Recall== (whether the retrieved context covers the expected answer, i.e. the ground truth)
- Contextual Precision (checks whether more relevant context chunks are ranked higher than less relevant ones)
Cost for the correctness metric: ~$0.01 ($17.12 → $17.13)
- ==Correctness== (compares the expected answer to the generated answer)
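Taken together that is roughly $0.03 + $0.01 ≈ $0.04 per question/expected-answer pair, so a 100-question development set should cost in the order of $4 to evaluate. Below is a minimal sketch of how these metrics could be wired up with DeepEval; the class names follow DeepEval's documented API, but the thresholds, model defaults and example test case are illustrative assumptions, not our actual configuration.

```python
# Minimal DeepEval setup sketch for the metrics listed above.
# Assumes deepeval is installed and an LLM API key is configured;
# thresholds and the example test case are illustrative only.
from deepeval import evaluate
from deepeval.test_case import LLMTestCase, LLMTestCaseParams
from deepeval.metrics import (
    AnswerRelevancyMetric,
    FaithfulnessMetric,
    ContextualRelevancyMetric,
    ContextualRecallMetric,
    ContextualPrecisionMetric,
    GEval,
)

# One test case = question + generated answer + retrieved context + ground truth.
test_case = LLMTestCase(
    input="What is the refund window for damaged items?",                      # question
    actual_output="You can request a refund within 30 days.",                  # RAG answer
    expected_output="Refunds for damaged items are accepted within 30 days.",  # ground truth
    retrieval_context=["Damaged items may be returned within 30 days of delivery."],
)

# The five retrieval/generation metrics billed at ~$0.03 per pair.
rag_metrics = [
    AnswerRelevancyMetric(threshold=0.7),
    FaithfulnessMetric(threshold=0.7),
    ContextualRelevancyMetric(threshold=0.7),
    ContextualRecallMetric(threshold=0.7),
    ContextualPrecisionMetric(threshold=0.7),
]

# Correctness (~$0.01 extra) as a GEval metric comparing expected vs. generated answer.
correctness = GEval(
    name="Correctness",
    criteria="Does the actual output factually match the expected answer?",
    evaluation_params=[LLMTestCaseParams.ACTUAL_OUTPUT, LLMTestCaseParams.EXPECTED_OUTPUT],
)

evaluate(test_cases=[test_case], metrics=rag_metrics + [correctness])
```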
How to read metrics
The retriever is probably what we need to optimise the most, so let’s focus on the context metrics.
==The IDEAL situation is one where all context metrics (relevancy, recall and precision) are high (precision is irrelevant when relevancy and recall are already high).==
The following are some of the possibilities:
- High Context Precision, with low-to-mid Context Recall and Context Relevancy
    - This means the context contains useful data, but not all of the retrieved data is valuable for answering the given question.
    - Why does this happen:
        - All of the data required to answer the question is in the context, but the context also contains additional data that is not required.
            - Solution: Context compression methods (see the sketch after this list)
        - Irrelevant context is present and valuable context is missing.
            - Solution: Improve the retrieval process by improving chunking strategies, extractors, query decomposition, etc.
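For the context-compression solution above, one possible approach is to wrap the existing retriever in a compression retriever that trims each retrieved chunk down to its query-relevant parts. This is a sketch only: LangChain is an assumption here (the notes above don't prescribe a framework), the class names reflect its documented API but may differ across versions, and the documents and query are illustrative.

```python
# Context-compression sketch assuming a LangChain-based retrieval pipeline.
from langchain.retrievers import ContextualCompressionRetriever
from langchain.retrievers.document_compressors import LLMChainExtractor
from langchain_community.vectorstores import FAISS
from langchain_openai import ChatOpenAI, OpenAIEmbeddings

# Toy base retriever standing in for the existing vector-store retriever.
vectorstore = FAISS.from_texts(
    [
        "Damaged items may be returned within 30 days of delivery.",
        "Shipping is free for orders over $50.",
    ],
    OpenAIEmbeddings(),
)
base_retriever = vectorstore.as_retriever(search_kwargs={"k": 2})

# LLMChainExtractor strips each retrieved chunk down to the sentences relevant
# to the query, which targets low Contextual Relevancy while preserving Recall.
compressor = LLMChainExtractor.from_llm(ChatOpenAI(model="gpt-4o-mini", temperature=0))
compression_retriever = ContextualCompressionRetriever(
    base_compressor=compressor,
    base_retriever=base_retriever,
)

compressed_docs = compression_retriever.invoke("What is the refund window for damaged items?")
```

The trade-off is an extra LLM call per retrieved chunk, which adds latency and cost to every query, so it is worth checking the Contextual Relevancy and Precision scores before and after adopting it.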