RAG testing: Development

Evaluation (DEVELOPMENT, ground truth is available)

Cost per question and expected answer pair: ~$0.03 ($17.10 → $17.12)

This cost covers the following DeepEval metrics (a usage sketch follows the list):

  • ==Answer Relevancy== (how relevant the generated answer is to the question)
  • Faithfulness (whether the generated answer aligns with the retrieved context)
  • Contextual Relevancy (whether the retrieved context is relevant to the question)
  • ==Contextual Recall== (whether the retrieved context covers the expected answer, i.e. the ground truth)
  • Contextual Precision (whether more relevant context chunks are ranked higher than less relevant ones)
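
As a rough illustration, here is a minimal sketch of running these five metrics on a single test case with DeepEval's `evaluate` helper. The question, answers, and retrieval context are made-up placeholders, and the default LLM judge (an OpenAI API key) is assumed:

```python
# Minimal sketch: scoring one question / expected-answer pair with the five
# DeepEval metrics listed above. All example strings are placeholders.
from deepeval import evaluate
from deepeval.metrics import (
    AnswerRelevancyMetric,
    FaithfulnessMetric,
    ContextualRelevancyMetric,
    ContextualRecallMetric,
    ContextualPrecisionMetric,
)
from deepeval.test_case import LLMTestCase

test_case = LLMTestCase(
    input="What is the refund window?",                                 # question
    actual_output="Refunds are accepted within 30 days.",               # generated answer
    expected_output="Customers can request a refund within 30 days.",   # ground truth
    retrieval_context=["Our policy allows refunds within 30 days of purchase."],
)

metrics = [
    AnswerRelevancyMetric(),
    FaithfulnessMetric(),
    ContextualRelevancyMetric(),
    ContextualRecallMetric(),
    ContextualPrecisionMetric(),
]

evaluate(test_cases=[test_case], metrics=metrics)
```

Each metric produces a 0–1 score plus an LLM-generated reason; those judge calls are what the per-pair cost above pays for.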

Cost for correctness metric: ~$0.01 ($17.12 → $17.13)

  • ==Correctness== (compares the expected answer with the generated answer)
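
One common way to score correctness in DeepEval is a custom `GEval` metric that compares the expected and generated answers, as sketched below; the criteria wording and example strings are illustrative assumptions, not necessarily the exact prompt used here:

```python
# Sketch of a correctness metric built with DeepEval's GEval.
from deepeval.metrics import GEval
from deepeval.test_case import LLMTestCase, LLMTestCaseParams

correctness_metric = GEval(
    name="Correctness",
    # Illustrative criteria, assumed for this sketch.
    criteria="Determine whether the actual output is factually consistent with the expected output.",
    evaluation_params=[
        LLMTestCaseParams.EXPECTED_OUTPUT,
        LLMTestCaseParams.ACTUAL_OUTPUT,
    ],
)

test_case = LLMTestCase(
    input="What is the refund window?",
    actual_output="Refunds are accepted within 30 days.",
    expected_output="Customers can request a refund within 30 days.",
)

correctness_metric.measure(test_case)
print(correctness_metric.score, correctness_metric.reason)
```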

How to read metrics

The retriever is probably the component we need to optimise the most, so let's focus on the contextual metrics.

==The IDEAL situation is one where all context metrics (relevancy, recall, and precision) are high; precision matters less once relevancy and recall are already high.==

The following are some of the possibilities:

  • High Contextual Precision with low-to-mid Contextual Recall and Contextual Relevancy
    • This means the context contains useful data, but not everything retrieved is valuable for answering the given question.
    • Why does this happen?
      • All the data required to answer the question is in the context, but the context also contains additional data that is not required.
        • Solution: Context compression methods (see the sketch after this list)
      • Irrelevant context is present and valuable context is missing.
        • Solution: Improve the retrieval process by improving chunking strategies, extractors, query decomposition, etc.
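
As a sketch of what a context compression step could look like (assuming a sentence-transformers embedding model, which is not necessarily what this project uses), the snippet below drops retrieved chunks whose similarity to the question falls below a threshold before they are passed to the generator:

```python
# Simple context compression: keep only retrieved chunks that are
# sufficiently similar to the question. Model name and threshold are
# illustrative assumptions.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

def compress_context(question: str, chunks: list[str], threshold: float = 0.4) -> list[str]:
    """Return only the chunks whose cosine similarity to the question exceeds the threshold."""
    q_emb = model.encode(question, convert_to_tensor=True)
    c_emb = model.encode(chunks, convert_to_tensor=True)
    scores = util.cos_sim(q_emb, c_emb)[0]  # one similarity score per chunk
    return [chunk for chunk, score in zip(chunks, scores) if float(score) >= threshold]

# Example: the off-topic chunk is filtered out before reaching the LLM.
chunks = [
    "Refunds are accepted within 30 days of purchase.",
    "Our office is closed on public holidays.",
]
print(compress_context("What is the refund window?", chunks))
```

Libraries such as LangChain offer similar behaviour out of the box (contextual compression retrievers), but the underlying idea is the same filtering step shown here.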