# Tutorial: Evaluating RAG Pipelines

- **Level**: Intermediate
- **Time to complete**: 15 minutes
- **Components Used**: `InMemoryDocumentStore`, `InMemoryEmbeddingRetriever`, `ChatPromptBuilder`, `OpenAIChatGenerator`, `DocumentMRREvaluator`, `FaithfulnessEvaluator`, `SASEvaluator`
- **Prerequisites**: You need an API key from an active OpenAI account, because this tutorial uses OpenAI's gpt-4o-mini model: https://platform.openai.com/api-keys
- **Goal**: After completing this tutorial, you'll have learned how to evaluate your RAG pipelines with both the model-based and the statistical metrics available in Haystack. You'll also see which other evaluation frameworks are integrated with Haystack.

> This tutorial uses the latest version of Haystack 2.x (`haystack-ai`). For more information on Haystack 2.0, read the [Haystack 2.0 announcement](https://haystack.deepset.ai/blog/haystack-2-release) or visit the [Haystack Documentation](https://docs.haystack.deepset.ai/docs/intro).

## Overview

In this tutorial, you will learn how to evaluate Haystack pipelines, in particular Retrieval-Augmented Generation ([RAG](https://www.deepset.ai/blog/llms-retrieval-augmentation)) pipelines.
1. You will first build a pipeline that answers medical questions based on PubMed data.
2. You will build an evaluation pipeline that makes use of some metrics like Document MRR and Answer Faithfulness.
3. You will run your RAG pipeline and evaluate the output with your evaluation pipeline.

Haystack provides a wide range of [`Evaluators`](https://docs.haystack.deepset.ai/docs/evaluators) which can perform 2 types of evaluations:
- [Model-Based evaluation](https://docs.haystack.deepset.ai/docs/model-based-evaluation)
- [Statistical evaluation](https://docs.haystack.deepset.ai/docs/statistical-evaluation)

We will use some of these evaluation techniques in this tutorial to evaluate a RAG pipeline that is designed to answer questions on PubMed data.

> 🧑‍🍳 As well as Haystack's own evaluation metrics, you can also integrate with a number of evaluation frameworks. See the integrations and examples below 👇
> - [Evaluate with DeepEval](https://colab.research.google.com/github/deepset-ai/haystack-cookbook/blob/main/notebooks/rag_eval_deep_eval.ipynb)
> - [Evaluate with RAGAS](https://colab.research.google.com/github/deepset-ai/haystack-cookbook/blob/main/notebooks/rag_eval_ragas.ipynb)
> - [Evaluate with UpTrain](https://colab.research.google.com/github/deepset-ai/haystack-cookbook/blob/main/notebooks/rag_eval_uptrain.ipynb)

### Evaluating RAG Pipelines
RAG pipelines ultimately consist of at least 2 steps:
- Retrieval
- Generation

To evaluate a full RAG pipeline, we have to evaluate each of these steps in isolation, as well as the pipeline as a whole. While retrieval can in some cases be evaluated with statistical metrics that require labels, doing the same for the generation step is not straightforward. Instead, we often rely on model-based metrics to evaluate the generation step, where an LLM is used as the 'evaluator'.

![Steps of RAG](https://raw.githubusercontent.com/deepset-ai/haystack-tutorials/main/tutorials/img/tutorial35_rag.png)

#### 📺 Code Along

<iframe width="560" height="315" src="https://www.youtube.com/embed/5PrzXaZ0-qk?si=lgBSfHatbV2i59J-" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" referrerpolicy="strict-origin-when-cross-origin" allowfullscreen></iframe>


## Preparing the Colab Environment

- [Enable GPU Runtime in Colab](https://docs.haystack.deepset.ai/docs/enabling-gpu-acceleration)
- [Set logging level to INFO](https://docs.haystack.deepset.ai/docs/setting-the-log-level)

## Installing Haystack

Install Haystack, [datasets](https://pypi.org/project/datasets/) and `sentence-transformers` with `pip`:

In [None]:
%%bash

pip install haystack-ai
pip install "datasets>=2.6.1"
pip install "sentence-transformers>=3.0.0"

### Enabling Telemetry

Knowing you're using this tutorial helps us decide where to invest our efforts to build a better product, but you can always opt out by commenting out the following line. See [Telemetry](https://docs.haystack.deepset.ai/docs/enabling-telemetry) for more details.

In [2]:
from haystack.telemetry import tutorial_running

tutorial_running(35)



## Create the RAG Pipeline to Evaluate

To evaluate a RAG pipeline, we need a RAG pipeline to start with. So, we will start by creating a question answering pipeline.

> 💡 For a complete tutorial on creating Retrieval-Augmented Generation pipelines, check out the [Creating Your First QA Pipeline with Retrieval-Augmentation Tutorial](https://haystack.deepset.ai/tutorials/27_first_rag_pipeline)

For this tutorial, we will be using [a labeled PubMed dataset](https://huggingface.co/datasets/vblagoje/PubMedQA_instruction/viewer/default/train?row=0) with questions, contexts and answers. This way, we can use the contexts as Documents, and we also have the required labeled data that we need for some of the evaluation metrics we will be using.

First, let's fetch the prepared dataset and extract `all_documents`, `all_questions` and `all_ground_truth_answers`:

> ℹ️ The dataset is quite large. We're using the first 1000 rows in this example, but you can increase this if you want to.


In [3]:
from datasets import load_dataset
from haystack import Document

dataset = load_dataset("vblagoje/PubMedQA_instruction", split="train")
dataset = dataset.select(range(1000))
all_documents = [Document(content=doc["context"]) for doc in dataset]
all_questions = [doc["instruction"] for doc in dataset]
all_ground_truth_answers = [doc["response"] for doc in dataset]



Next, let's build a simple indexing pipeline and write the documents in `all_documents` into a DocumentStore. Here, we're using the `InMemoryDocumentStore`.

> `InMemoryDocumentStore` is the simplest DocumentStore to get started with. It requires no external dependencies and it's a good option for smaller projects and debugging. But it doesn't scale up so well to larger Document collections, so it's not a good choice for production systems. To learn more about the different types of external databases that Haystack supports, see [DocumentStore Integrations](https://haystack.deepset.ai/integrations?type=Document+Store).

In [4]:
from typing import List
from haystack import Pipeline
from haystack.components.embedders import SentenceTransformersDocumentEmbedder
from haystack.components.writers import DocumentWriter
from haystack.document_stores.in_memory import InMemoryDocumentStore
from haystack.document_stores.types import DuplicatePolicy

document_store = InMemoryDocumentStore()

document_embedder = SentenceTransformersDocumentEmbedder(model="sentence-transformers/all-MiniLM-L6-v2")
document_writer = DocumentWriter(document_store=document_store, policy=DuplicatePolicy.SKIP)

indexing = Pipeline()
indexing.add_component(instance=document_embedder, name="document_embedder")
indexing.add_component(instance=document_writer, name="document_writer")

indexing.connect("document_embedder.documents", "document_writer.documents")

indexing.run({"document_embedder": {"documents": all_documents}})



{'document_writer': {'documents_written': 1000}}

Now that we have our data ready, we can create a simple RAG pipeline.

In this example, we'll be using:
- [`InMemoryEmbeddingRetriever`](https://docs.haystack.deepset.ai/docs/inmemoryembeddingretriever) to retrieve the documents most relevant to the query.
- [`OpenAIChatGenerator`](https://docs.haystack.deepset.ai/docs/OpenAIChatGenerator) to generate answers to queries. You can replace `OpenAIChatGenerator` in your pipeline with another `ChatGenerator`. Check out the full list of generators [here](https://docs.haystack.deepset.ai/docs/generators).
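
Conceptually, an embedding retriever ranks documents by how similar their embeddings are to the query embedding and keeps the best `top_k`. The following is a minimal sketch of that idea in plain Python, not Haystack's implementation: `cosine_similarity`, `top_k` and the toy vectors are illustrative names and data, and the similarity function actually used by the document store may differ.

```python
import math
from typing import List, Tuple


def cosine_similarity(a: List[float], b: List[float]) -> float:
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query: List[float], doc_embeddings: List[List[float]], k: int = 3) -> List[Tuple[int, float]]:
    """Return (document index, similarity) pairs for the k most similar documents."""
    scored = [(i, cosine_similarity(query, emb)) for i, emb in enumerate(doc_embeddings)]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


# Toy 2-dimensional "embeddings"; real sentence embeddings have hundreds of dimensions.
toy_docs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0]]
toy_query = [1.0, 0.1]

ranked = top_k(toy_query, toy_docs, k=3)
print([index for index, _ in ranked])  # [0, 2, 1]
```

The order of the returned list matters for evaluation: rank-based metrics such as Mean Reciprocal Rank reward a retriever for placing the relevant document near the top, not just for including it.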

In [8]:
import os
from getpass import getpass
from haystack.components.builders import AnswerBuilder, ChatPromptBuilder
from haystack.dataclasses import ChatMessage
from haystack.components.embedders import SentenceTransformersTextEmbedder
from haystack.components.generators.chat import OpenAIChatGenerator
from haystack.components.retrievers.in_memory import InMemoryEmbeddingRetriever

if "OPENAI_API_KEY" not in os.environ:
    os.environ["OPENAI_API_KEY"] = getpass("Enter OpenAI API key:")

template = [
    ChatMessage.from_user(
        """
        You have to answer the following question based on the given context information only.

        Context:
        {% for document in documents %}
            {{ document.content }}
        {% endfor %}

        Question: {{question}}
        Answer:
        """
    )
]

rag_pipeline = Pipeline()
rag_pipeline.add_component(
    "query_embedder", SentenceTransformersTextEmbedder(model="sentence-transformers/all-MiniLM-L6-v2")
)
rag_pipeline.add_component("retriever", InMemoryEmbeddingRetriever(document_store, top_k=3))
rag_pipeline.add_component("prompt_builder", ChatPromptBuilder(template=template))
rag_pipeline.add_component("generator", OpenAIChatGenerator(model="gpt-4o-mini"))
rag_pipeline.add_component("answer_builder", AnswerBuilder())

rag_pipeline.connect("query_embedder", "retriever.query_embedding")
rag_pipeline.connect("retriever", "prompt_builder.documents")
rag_pipeline.connect("prompt_builder.prompt", "generator.messages")
rag_pipeline.connect("generator.replies", "answer_builder.replies")
rag_pipeline.connect("retriever", "answer_builder.documents")

<haystack.core.pipeline.pipeline.Pipeline object at 0x3951745e0>
🚅 Components
  - query_embedder: SentenceTransformersTextEmbedder
  - retriever: InMemoryEmbeddingRetriever
  - prompt_builder: ChatPromptBuilder
  - generator: OpenAIChatGenerator
  - answer_builder: AnswerBuilder
🛤️ Connections
  - query_embedder.embedding -> retriever.query_embedding (List[float])
  - retriever.documents -> prompt_builder.documents (List[Document])
  - retriever.documents -> answer_builder.documents (List[Document])
  - prompt_builder.prompt -> generator.messages (List[ChatMessage])
  - generator.replies -> answer_builder.replies (List[ChatMessage])

### Asking a Question

When asking a question, use the `run()` method of the pipeline. Make sure to provide the question to all components that require it as input. In this case these are the `query_embedder`, the `prompt_builder` and the `answer_builder`.

In [9]:
question = "Do high levels of procalcitonin in the early phase after pediatric liver transplantation indicate poor postoperative outcome?"

response = rag_pipeline.run(
    {
        "query_embedder": {"text": question},
        "prompt_builder": {"question": question},
        "answer_builder": {"query": question},
    }
)
print(response["answer_builder"]["answers"][0].data)



Yes, high levels of procalcitonin in the early phase after pediatric liver transplantation indicate poor postoperative outcomes. Patients with elevated PCT levels on postoperative day 2 had higher International Normalized Ratio values on postoperative day 5, suffered more often from primary graft non-function, and had longer stays in the pediatric intensive care unit and on mechanical ventilation.
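
Because the same question has to be passed to three components, it can help to build the input dictionary in one place. This is a small convenience sketch: `build_rag_inputs` is a hypothetical helper introduced here for illustration, and the component names are assumed to match those added to `rag_pipeline` above.

```python
from typing import Any, Dict


def build_rag_inputs(question: str) -> Dict[str, Dict[str, Any]]:
    """Map one question onto every component input that needs it."""
    return {
        "query_embedder": {"text": question},
        "prompt_builder": {"question": question},
        "answer_builder": {"query": question},
    }


example_inputs = build_rag_inputs("Does dexmedetomidine reduce atrial fibrillation after lung cancer surgery?")
print(sorted(example_inputs))  # ['answer_builder', 'prompt_builder', 'query_embedder']

# With the pipeline defined above, this would be used as:
# response = rag_pipeline.run(example_inputs)
```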


## Evaluate the Pipeline

For this tutorial, let's evaluate the pipeline with the following metrics:

- [Document Mean Reciprocal Rank](https://docs.haystack.deepset.ai/docs/documentmrrevaluator): Evaluates retrieved documents using ground truth labels. It checks at what rank ground truth documents appear in the list of retrieved documents.
- [Semantic Answer Similarity](https://docs.haystack.deepset.ai/docs/sasevaluator): Evaluates predicted answers using ground truth labels. It checks the semantic similarity of a predicted answer and the ground truth answer using a fine-tuned language model.
- [Faithfulness](https://docs.haystack.deepset.ai/docs/faithfulnessevaluator): Uses an LLM to evaluate whether a generated answer can be inferred from the provided contexts. Does not require ground truth labels.
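
To make the first metric concrete, here is a minimal sketch of Mean Reciprocal Rank in plain Python. It is an illustration rather than the `DocumentMRREvaluator` source: documents are represented by plain strings, matching is assumed to be exact equality, and `reciprocal_rank` and `mean_reciprocal_rank` are names chosen for this example.

```python
from typing import List


def reciprocal_rank(ground_truth: List[str], retrieved: List[str]) -> float:
    """1 / rank of the first retrieved item that is a ground truth item, or 0.0 if none is."""
    for rank, item in enumerate(retrieved, start=1):
        if item in ground_truth:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(all_ground_truth: List[List[str]], all_retrieved: List[List[str]]) -> float:
    """Average the reciprocal rank over all questions."""
    scores = [reciprocal_rank(gt, ret) for gt, ret in zip(all_ground_truth, all_retrieved)]
    return sum(scores) / len(scores)


# One list of ground truth documents per question; a question may have several.
toy_ground_truth = [["doc_a"], ["doc_b"], ["doc_c", "doc_d"]]
toy_retrieved = [
    ["doc_a", "doc_x", "doc_y"],  # hit at rank 1 -> 1.0
    ["doc_x", "doc_b", "doc_y"],  # hit at rank 2 -> 0.5
    ["doc_x", "doc_y", "doc_z"],  # no hit        -> 0.0
]

print(mean_reciprocal_rank(toy_ground_truth, toy_retrieved))  # 0.5
```

A score of 1.0 therefore means that a ground truth document was ranked first for every question.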


First, let's actually run our RAG pipeline with a set of questions, and make sure we have the ground truth labels (both answers and documents) for these questions. Let's start with 25 random questions and labels 👇

> 📝 **Some Notes:**
> 1. For a full list of available metrics, check out the [Haystack Evaluators](https://docs.haystack.deepset.ai/docs/evaluators).
> 2. In our dataset, each example question has 1 ground truth document as its label. However, in some scenarios more than 1 ground truth document may be provided per question. This is why we provide a list of `ground_truth_documents` for each question.

In [10]:
import random

questions, ground_truth_answers, ground_truth_docs = zip(
    *random.sample(list(zip(all_questions, all_ground_truth_answers, all_documents)), 25)
)
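
The cell above zips the three lists into rows, samples whole rows, and then unzips them again, so each sampled question stays paired with its own answer and document. The sketch below shows the same pattern on toy data. The `*_pool` lists and the fixed seed are assumptions added for illustration; the tutorial cell itself does not set a seed, so it draws different questions on every run.

```python
import random

questions_pool = ["q1", "q2", "q3", "q4", "q5"]
answers_pool = ["a1", "a2", "a3", "a4", "a5"]
docs_pool = ["d1", "d2", "d3", "d4", "d5"]

rng = random.Random(42)  # fixed seed so the same rows are drawn on every run
rows = list(zip(questions_pool, answers_pool, docs_pool))
sampled_rows = rng.sample(rows, 3)

# zip(*rows) transposes the sampled rows back into three aligned tuples.
sample_questions, sample_answers, sample_docs = zip(*sampled_rows)

# Each position still refers to the same original row.
for q, a, d in zip(sample_questions, sample_answers, sample_docs):
    assert q[1:] == a[1:] == d[1:]
```

Note that `zip` returns tuples, which is why later cells wrap the results in `list(...)` before passing them to the evaluators.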

Next, let's run our pipeline and make sure to track what our pipeline returns as answers, and which documents it retrieves:

In [11]:
rag_answers = []
retrieved_docs = []

for question in list(questions):
    response = rag_pipeline.run(
        {
            "query_embedder": {"text": question},
            "prompt_builder": {"question": question},
            "answer_builder": {"query": question},
        }
    )
    print(f"Question: {question}")
    print("Answer from pipeline:")
    print(response["answer_builder"]["answers"][0].data)
    print("\n-----------------------------------\n")

    rag_answers.append(response["answer_builder"]["answers"][0].data)
    retrieved_docs.append(response["answer_builder"]["answers"][0].documents)



Question: Is higher fibrinogen level independently linked with the presence and severity of new-onset coronary atherosclerosis among Han Chinese population?
Answer from pipeline:
Yes, higher fibrinogen level is independently linked with the presence and severity of new-onset coronary atherosclerosis among the Han Chinese population. The study demonstrated that plasma fibrinogen level was independently associated with high Gensini score (GS) and the presence of coronary atherosclerosis after adjusting for potential confounders.

-----------------------------------





Question: Are successional changes in the chicken cecal microbiome during 42 days of growth independent of organic acid feed additives?
Answer from pipeline:
Yes, successional changes in the chicken cecal microbiome during the 42 days of growth are independent of organic acid feed additives. The study found that treatment effects on specific pathogens and the cecal microbiome as a whole were generally non-significant, while temporal changes in the cecal microbiome were dramatic, highly significant, and consistent across treatments, indicating that these changes occur regardless of the presence of the organic acids.

-----------------------------------





Question: Does [ ITF increase the transcriptional activity of ITF promoter via the JAK-STAT3 signal transduction pathway ]?
Answer from pipeline:
Yes, ITF increases the transcriptional activity of the ITF promoter via the JAK-STAT3 signal transduction pathway. The study showed that the activity of the ITF promoter was significantly increased in the presence of ITF. Furthermore, blockage of the JAK-STAT3 pathway with a specific inhibitor reduced the ITF promoter activity, indicating that the pathway is involved in mediating the effect of ITF on the promoter's transcriptional activity.

-----------------------------------





Question: Is peer-instructed seminar attendance associated with improved preparation , deeper learning and higher exam scores : a survey study?
Answer from pipeline:
Yes, peer-instructed seminar attendance is associated with improved preparation, deeper learning, and higher exam scores. The study indicated that perceived preparation of peers was positively associated with the perceived quality of seminars, and seminar attendance was positively associated with exam scores. Additionally, students expressed that discussing with peers and applying knowledge in pathophysiology cases enhanced their learning experience.

-----------------------------------





Question: Are serum TARC levels strongly correlated with blood eosinophil count in patients with drug eruptions?
Answer from pipeline:
Yes, serum TARC levels are strongly correlated with blood eosinophil count in patients with drug eruptions, with a correlation coefficient of 0.53.

-----------------------------------





Question: Does dexmedetomidine reduce atrial fibrillation after lung cancer surgery?
Answer from pipeline:
No, the study found that the incidence of postoperative atrial fibrillation (POAF) was comparable between patients treated with dexmedetomidine (DEX) and those who were not, indicating that DEX does not reduce the incidence of POAF after lung cancer surgery.

-----------------------------------





Question: Does lidocaine potentiate the deleterious effects of triamcinolone acetonide on tenocytes?
Answer from pipeline:
Yes, lidocaine synergistically increases the deleterious effects of triamcinolone acetonide (TA) on tenocytes.

-----------------------------------





Question: Does multicenter immunohistochemical ALK-testing of non-small-cell lung cancer show high concordance after harmonization of techniques and interpretation criteria?
Answer from pipeline:
Yes, multicenter immunohistochemical ALK-testing of non-small-cell lung cancer shows high concordance after harmonization of techniques and interpretation criteria. All 16 participants scored the two ALK positive-"borderline" samples as unequivocally positive according to their protein expression, and concordant IHC interpretation was observed in four of six unequivocal ALK break positive cases.

-----------------------------------





Question: Does generalizability of trial result to elderly Medicare patients with advanced solid tumors ( Alliance 70802 )?
Answer from pipeline:
The generalizability of trial results to elderly Medicare patients with advanced solid tumors is limited. The study demonstrated that patients enrolled in chemotherapy trials are often younger, more functional, and have less comorbidity compared to the general population of elderly Medicare patients. Specifically, trial patients were on average 9.5 years younger and had better survival rates than Medicare patients who were 75 years or older. This suggests that the results from clinical trials may not fully reflect the outcomes that might be expected in an unselected elderly Medicare population with advanced solid tumors.

-----------------------------------





Question: Does vascular reconstruction play an important role in the treatment of pancreatic adenocarcinoma?
Answer from pipeline:
Yes, vascular reconstruction plays an important role in the treatment of pancreatic adenocarcinoma, particularly during surgical procedures such as pancreaticoduodenectomy (Whipple operation) or total pancreatectomy. The study highlighted that vascular reconstructions were performed in a significant percentage of operations (32.8%), which included both venous and arterial reconstructions. The survival rates for patients undergoing these reconstructions were comparable to those without vascular reconstruction, indicating that such procedures can be integral in managing complex cases of pancreatic cancer, especially when vascular invasion is present. Additionally, the detailed examination of outcomes and complications underscores the importance of these reconstructive methods in improving patient survival following surgery.

----------------------------------



Question: Is transient receptor potential ankyrin 1 ( TRPA1 ) functionally expressed in primary human osteoarthritic chondrocytes?
Answer from pipeline:
Yes, transient receptor potential ankyrin 1 (TRPA1) is functionally expressed in primary human osteoarthritic chondrocytes, as it was shown to be expressed and functional with increased Ca(2+) influx upon stimulation with the TRPA1 agonist AITC.

-----------------------------------





Question: Is incidence of Type 1 Diabetes Increasing in a Population-Based Cohort in Olmsted County , Minnesota , USA?
Answer from pipeline:
No, there was no significant increase in the incidence of Type 1 Diabetes (T1D) over time in the population-based cohort in Olmsted County, Minnesota (P=.45). Although there was an initial increasing trend, it was followed by a plateau, indicating overall stability in the annual incidence.

-----------------------------------





Question: Does transjugular intrahepatic portosystemic shunt placement increase feasibility of colorectal surgery in cirrhotic patients with severe portal hypertension?
Answer from pipeline:
Yes, transjugular intrahepatic portosystemic shunt (TIPS) placement appears to increase the feasibility of colorectal surgery in cirrhotic patients with severe portal hypertension. In the study, TIPS placement was successful in all patients and significantly decreased the mean hepatic venous pressure gradient, which is associated with a reduction in the risks of surgery. Despite a notable postoperative morbidity rate, the successful TIPS placement suggests that it can help manage portal hypertension, potentially improving surgical outcomes.

-----------------------------------





Question: Does plasmid pPCP1-derived sRNA HmsA promote biofilm formation of Yersinia pestis?
Answer from pipeline:
Yes, plasmid pPCP1-derived sRNA HmsA promotes biofilm formation of Yersinia pestis.

-----------------------------------





Question: Does trichostatin A inhibit Retinal Pigmented Epithelium Activation in an In Vitro Model of Proliferative Vitreoretinopathy?
Answer from pipeline:
Yes, trichostatin A (TSA) inhibits Retinal Pigmented Epithelium (RPE) activation in an in vitro model of Proliferative Vitreoretinopathy (PVR). The study found that TSA (0.1 μM) reduced the TGFβ2-mediated RPE cell contraction and migration, indicating that TSA inhibits RPE activation.

-----------------------------------





Question: Is endo first appropriate in some patients with critical limb ischemia because `` bridges are burned ''?
Answer from pipeline:
The phrase "bridges are burned" typically implies that once a certain action is taken, the option to return to a previous state is lost. In the context of critical limb ischemia (CLI) treatment, it refers to the scenario where patients may have undergone failed endovascular treatment (EV) and therefore may not have the same options available for future interventions, such as open surgical (OS) bypass. 

Based on the context provided, it appears that patients who had prior failed EV had poorer outcomes in subsequent OS bypass procedures compared to those who underwent primary OS without prior EV. Specifically, secondary patency and limb salvage rates were significantly better in the group that did not undergo prior EV treatment. This suggests that for some patients with CLI, starting with endovascular treatment may lead to complications or reduced effe



Question: Do the effects of ifenprodil on the activity of antidepressant drugs in the forced swim test in mice?
Answer from pipeline:
Yes, ifenprodil at a non-active dose (10mg/kg) potentiated the antidepressant-like effect of imipramine (15mg/kg) and fluoxetine (5mg/kg) in the forced swim test in mice, but it did not reduce the immobility time of animals receiving reboxetine (2.5mg/kg) or tianeptine (15mg/kg).

-----------------------------------





Question: Does improving disease incidence estimate in primary care surveillance systems?
Answer from pipeline:
Yes, improving disease incidence estimates in primary care surveillance systems can be achieved through the use of consultation-weighted estimates. The application of these weighted incidence estimators led to bias reduction in the estimates of disease incidence. Specifically, this method resulted in observed relative changes in national-level incidence estimates for various diseases, highlighting that it can lead to more accurate and reliable data for public health monitoring. Additionally, using bias-reduced weights decreased variation in incidence between regions and increased spatial autocorrelation, further indicating improvements in the quality of disease incidence estimates.

-----------------------------------





Question: Does unit support protect against sexual harassment and assault among national guard soldiers?
Answer from pipeline:
Yes, greater unit support is associated with decreased odds of sexual harassment and assault among Ohio Army National Guard service members during deployment.

-----------------------------------





Question: Does treatment with anti-C5a antibody improve the outcome of H7N9 virus infection in African green monkeys?
Answer from pipeline:
Yes, treatment with the anti-C5a antibody (IFX-1) improves the outcome of H7N9 virus infection in African green monkeys. It substantially attenuated acute lung injury (ALI), reduced lung histopathological injury, decreased infiltration of macrophages and neutrophils in the lungs, and decreased the intensity of systemic inflammatory response syndrome (SIRS). Additionally, it significantly decreased the virus titers in the infected lungs.

-----------------------------------





Question: Is impaired renal function associated with recurrence after cryoballoon catheter ablation for paroxysmal atrial fibrillation : A potential effect of non-pulmonary vein foci?
Answer from pipeline:
Yes, impaired renal function is associated with recurrence after cryoballoon catheter ablation for paroxysmal atrial fibrillation. The study found that estimated glomerular filtration rate (eGFR) was an independent predictor of recurrence, with a hazard ratio indicating that lower eGFR is linked to a higher risk of recurrence. Additionally, non-pulmonary vein ectopic beats were also identified as significant predictors of recurrence, suggesting that both impaired renal function and the presence of non-pulmonary vein foci may contribute to the likelihood of AF recurrence post-ablation.

-----------------------------------





Question: Does melatonin prevent radiation-induced oxidative stress and periodontal tissue breakdown in irradiated rats with experimental periodontitis?
Answer from pipeline:
Yes, melatonin prevents radiation-induced oxidative stress and periodontal tissue breakdown in irradiated rats with experimental periodontitis. The study found that the oxidative stress index and levels of specific markers were significantly higher in the group with experimental periodontitis and radiation exposure (Ped-Rt) compared to the group with only experimental periodontitis (Ped). In contrast, the group that received both radiation therapy and protective melatonin administration (Ped-Rt-Mel) exhibited lower levels of oxidative stress and reduced alveolar bone destruction compared to the Ped-Rt group. This indicates that melatonin has a protective effect against the adverse impacts of radiation on periodontal tissues.

-----------------------------------





Question: Is the ADAMTS13-von Willebrand factor axis involved in the pathophysiology of kidney ischemia-reperfusion injury?
Answer from pipeline:
Yes, the ADAMTS13-von Willebrand factor axis is involved in the pathophysiology of kidney ischemia-reperfusion injury, as indicated by the significant changes in ADAMTS13 and vWF levels and the associated kidney damage observed in the study.

-----------------------------------





Question: Is repeat endoscopic ultrasound fine needle aspiration after a first negative procedure useful in pancreatic lesions?
Answer from pipeline:
Yes, repeat endoscopic ultrasound fine needle aspiration (EUS-FNA) after a first negative procedure can be useful in pancreatic lesions. In the study, the diagnostic yield of the second EUS-FNA was 58.8%, which contributed to an overall diagnostic yield increase to 86.3% for all patients. The likelihood of a positive diagnosis with the second EUS-FNA was particularly higher in patients who had an "atypical" histological result from the first EUS-FNA compared to those who had a "normal" result. This suggests that repeating the procedure may help to obtain a definitive diagnosis in cases that were inconclusive initially.

-----------------------------------





Question: Does tRAIL receptor deletion in mice suppress the inflammation of nutrient excess?
Answer from pipeline:
Yes, TRAIL receptor deletion in mice suppresses the inflammation associated with nutrient excess. In the study, TR knockout (TR(-/-)) mice showed reduced weight gain, adiposity, and insulin resistance when fed a high-fat, cholesterol, and fructose (FFC) diet. Additionally, these mice exhibited suppressed steatohepatitis and diminished accumulation and activation of inflammatory macrophages in the liver and adipose tissue, indicating a significant reduction in inflammation related to nutrient excess.

-----------------------------------



While each evaluator is a component that can be run individually in Haystack, they can also be added into a pipeline. This way, we can construct an `eval_pipeline` that includes all evaluators for the metrics we want to evaluate our pipeline on.

In [12]:
from haystack.components.evaluators.document_mrr import DocumentMRREvaluator
from haystack.components.evaluators.faithfulness import FaithfulnessEvaluator
from haystack.components.evaluators.sas_evaluator import SASEvaluator

# One pipeline that holds all three evaluators
eval_pipeline = Pipeline()
eval_pipeline.add_component("doc_mrr_evaluator", DocumentMRREvaluator())
eval_pipeline.add_component("faithfulness", FaithfulnessEvaluator())
eval_pipeline.add_component("sas_evaluator", SASEvaluator(model="sentence-transformers/all-MiniLM-L6-v2"))

# Each evaluator receives its own inputs, keyed by component name
results = eval_pipeline.run(
    {
        "doc_mrr_evaluator": {
            # One single-item list of ground-truth documents per question
            "ground_truth_documents": [[d] for d in ground_truth_docs],
            "retrieved_documents": retrieved_docs,
        },
        "faithfulness": {
            "questions": list(questions),
            # Contexts are the ground-truth document contents, one list per question
            "contexts": [[d.content] for d in ground_truth_docs],
            "predicted_answers": rag_answers,
        },
        "sas_evaluator": {"predicted_answers": rag_answers, "ground_truth_answers": list(ground_truth_answers)},
    }
)

100%|██████████| 25/25 [00:46<00:00,  1.87s/it]


### Constructing an Evaluation Report

Once we've run our evaluation pipeline, we can also create a full evaluation report. Haystack provides an `EvaluationRunResult` which we can use to display a `score_report` 👇

In [13]:
from haystack.evaluation.eval_run_result import EvaluationRunResult

inputs = {
    "question": list(questions),
    "contexts": list([d.content] for d in ground_truth_docs),
    "answer": list(ground_truth_answers),
    "predicted_answer": rag_answers,
}

evaluation_result = EvaluationRunResult(run_name="pubmed_rag_pipeline", inputs=inputs, results=results)
evaluation_result.score_report()

| | metrics | score |
|---|---|---|
| 0 | doc_mrr_evaluator | 1.0 |
| 1 | faithfulness | 0.985333 |
| 2 | sas_evaluator | 0.711842 |
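
To make the `doc_mrr_evaluator` score more concrete, here is a minimal sketch of mean reciprocal rank (MRR) in plain Python. It is not Haystack's implementation: the helper names and the document IDs are hypothetical, and documents are matched by simple equality purely for illustration.

```python
from typing import List, Sequence, Set, Tuple


def reciprocal_rank(relevant: Set[str], retrieved: Sequence[str]) -> float:
    """Return 1/rank of the first relevant item in `retrieved`, or 0.0 if there is none."""
    for rank, doc in enumerate(retrieved, start=1):
        if doc in relevant:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(
    ground_truth: Sequence[Sequence[str]], retrieved: Sequence[Sequence[str]]
) -> Tuple[float, List[float]]:
    """Average the per-question reciprocal ranks; also return the individual scores."""
    scores = [reciprocal_rank(set(gt), ret) for gt, ret in zip(ground_truth, retrieved)]
    return sum(scores) / len(scores), scores


# Hypothetical document IDs, for illustration only.
ground_truth = [["doc_a"], ["doc_b"], ["doc_c"]]
retrieved = [
    ["doc_a", "doc_x", "doc_y"],  # relevant document at rank 1 -> 1.0
    ["doc_x", "doc_b", "doc_y"],  # relevant document at rank 2 -> 0.5
    ["doc_x", "doc_y", "doc_z"],  # relevant document not retrieved -> 0.0
]

score, individual_scores = mean_reciprocal_rank(ground_truth, retrieved)
print(individual_scores)  # [1.0, 0.5, 0.0]
print(score)  # 0.5
```

Read this way, an MRR of 1.0 like the one in the report above means the ground-truth document was ranked first for every question.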


#### Extra: Convert the Report into a Pandas DataFrame

In addition, you can display your evaluation results as a pandas DataFrame 👇

In [14]:
results_df = evaluation_result.to_pandas()
results_df

| | question | contexts | answer | predicted_answer | doc_mrr_evaluator | faithfulness | sas_evaluator |
|---|---|---|---|---|---|---|---|
| 0 | Is higher fibrinogen level independently linke... | [Fibrinogen is a coagulation/inflammatory biom... | Higher fibrinogen level is independently linke... | Yes, higher fibrinogen level is independently ... | 1.0 | 1.0 | 0.892444 |
| 1 | Are successional changes in the chicken cecal ... | [Poultry remains a major source of foodborne b... | Over the 42 d experiment, the cecal bacterial ... | Yes, successional changes in the chicken cecal... | 1.0 | 1.0 | 0.635755 |
| 2 | Does [ ITF increase the transcriptional activi... | [To investigate the eff ect of intestinal tref... | ITF increases the transcriptional activity of ... | Yes, ITF increases the transcriptional activit... | 1.0 | 1.0 | 0.898495 |
| 3 | Is peer-instructed seminar attendance associat... | [Active engagement in education improves learn... | Discussion with well-prepared peers during sem... | Yes, peer-instructed seminar attendance is ass... | 1.0 | 1.0 | 0.881314 |
| 4 | Are serum TARC levels strongly correlated with... | [This study aims to evaluate the relationship ... | Serum TARC levels are well correlated with blo... | Yes, serum TARC levels are strongly correlated... | 1.0 | 1.0 | 0.823395 |
| 5 | Does dexmedetomidine reduce atrial fibrillatio... | [To evaluate whether the use of intraoperative... | These results were similar to those published ... | No, the study found that the incidence of post... | 1.0 | 1.0 | 0.772307 |
| 6 | Does lidocaine potentiate the deleterious effe... | [Local anesthetics are commonly used for the t... | Our data provide evidence of the detrimental e... | Yes, lidocaine synergistically increases the d... | 1.0 | 1.0 | 0.579498 |
| 7 | Does multicenter immunohistochemical ALK-testi... | [Detection of anaplastic lymphoma kinase (ALK)... | This so-called "ALK-Harmonization-Study" shows... | Yes, multicenter immunohistochemical ALK-testi... | 1.0 | 1.0 | 0.587502 |
| 8 | Does generalizability of trial result to elder... | [In the United States, patients who enroll in ... | Results of clinical trials for advanced pancre... | The generalizability of trial results to elder... | 1.0 | 1.0 | 0.736484 |
| 9 | Does vascular reconstruction play an important... | [Previous studies have proved the feasibility ... | An aggressive approach for stage II pancreatic... | Yes, vascular reconstruction plays an importan... | 1.0 | 1.0 | 0.647735 |


Having our evaluation results as a DataFrame can be quite useful. For example, below we use the pandas DataFrame to select the 3 highest and the 3 lowest scores for semantic answer similarity (`sas_evaluator`) 👇


In [15]:
import pandas as pd

top_3 = results_df.nlargest(3, "sas_evaluator")
bottom_3 = results_df.nsmallest(3, "sas_evaluator")
pd.concat([top_3, bottom_3])

| | question | contexts | answer | predicted_answer | doc_mrr_evaluator | faithfulness | sas_evaluator |
|---|---|---|---|---|---|---|---|
| 21 | Does melatonin prevent radiation-induced oxida... | [The aim of this study was to analyze the bioc... | It was found that radiotherapy increased oxida... | Yes, melatonin prevents radiation-induced oxid... | 1.0 | 1.0 | 0.944898 |
| 2 | Does [ ITF increase the transcriptional activi... | [To investigate the eff ect of intestinal tref... | ITF increases the transcriptional activity of ... | Yes, ITF increases the transcriptional activit... | 1.0 | 1.0 | 0.898495 |
| 0 | Is higher fibrinogen level independently linke... | [Fibrinogen is a coagulation/inflammatory biom... | Higher fibrinogen level is independently linke... | Yes, higher fibrinogen level is independently ... | 1.0 | 1.0 | 0.892444 |
| 14 | Does trichostatin A inhibit Retinal Pigmented ... | [Proliferative vitreoretinopathy (PVR) is a bl... | Our findings indicate a role of acetylation in... | Yes, trichostatin A (TSA) inhibits Retinal Pig... | 1.0 | 1.0 | 0.492036 |
| 16 | Do the effects of ifenprodil on the activity o... | [According to reports in the literature, more ... | The concomitant administration of certain comm... | Yes, ifenprodil at a non-active dose (10mg/kg)... | 1.0 | 1.0 | 0.532008 |
| 15 | Is endo first appropriate in some patients wit... | [The aims of this study were to determine the ... | Previous failed EV should be predictive of poo... | The phrase "bridges are burned" typically impl... | 1.0 | 0.8 | 0.552325 |
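
The same kind of filtering works for the other metric columns. As a minimal sketch, the snippet below keeps only the rows whose faithfulness score is below 1.0, which are often the first answers worth inspecting by hand. It uses a small stand-in DataFrame with made-up values (`example_df` and its contents are assumptions for illustration) so that it runs on its own; in the notebook you would apply the same filter to `results_df`.

```python
import pandas as pd

# Stand-in for `results_df`; the questions and scores are made up for illustration.
example_df = pd.DataFrame(
    {
        "question": ["q1", "q2", "q3", "q4"],
        "faithfulness": [1.0, 0.8, 1.0, 0.5],
        "sas_evaluator": [0.91, 0.55, 0.62, 0.47],
    }
)

# Keep rows where the answer was not fully supported by the context,
# then list the least faithful answers first.
needs_review = example_df[example_df["faithfulness"] < 1.0].sort_values("faithfulness")
print(needs_review[["question", "faithfulness"]])
```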


## What's next

🎉 Congratulations! You've learned how to evaluate a RAG pipeline with both model-based and statistical evaluators in Haystack.

If you liked this tutorial, you may also enjoy:
- [Serializing Haystack Pipelines](https://haystack.deepset.ai/tutorials/29_serializing_pipelines)
- [Creating Your First QA Pipeline with Retrieval-Augmentation](https://haystack.deepset.ai/tutorials/27_first_rag_pipeline)

To stay up to date on the latest Haystack developments, you can [sign up for our newsletter](https://landing.deepset.ai/haystack-community-updates). Thanks for reading!