We provide two new benchmarks for the evaluation of KE methods: MQuAKE-2002 and MQuAKE-hard. Both are built on MQuAKE (Zhong et al., 2023), a recent KE dataset designed to evaluate how well KE methods help LLMs answer multi-hop questions given new knowledge. Zhong et al. (2023) suggest evaluating on a randomly sampled subset of 3,000 MQuAKE instances to reduce experimental cost; this benchmark is termed MQuAKE-3k.
Annotation Mistakes in One-Third of MQuAKE-3k Instances
There are two issues with using MQuAKE-3k to evaluate KE methods. The first is that the new knowledge from different instances of MQuAKE-3k can conflict with and corrupt the ground-truth answers. In other words, the ground-truth answer of instance A can be altered by the new knowledge from another instance B. We show an example of such a knowledge conflict in MQuAKE-3k in the Figure below. These conflicts make the ground-truth answers incorrect, because inference on every instance retrieves the new knowledge from all instances, including the conflicting edits. We observe that the ground-truth answers of 998 instances are broken by the new knowledge from other instances.
New Benchmark MQuAKE-2002 for More Precise Evaluation.
To address the annotation mistakes in MQuAKE-3k, we provide a new subset of MQuAKE-3k that contains no knowledge conflicts across instances. This subset includes 2,002 instances, so we term it MQuAKE-2002. To produce it, we filter out the instances whose ground-truth answers are broken by the new knowledge from other instances. Compared with MQuAKE-3k, MQuAKE-2002 provides a more precise evaluation of knowledge editing methods, since it removes the annotation mistakes caused by cross-instance knowledge conflicts. The data statistics of MQuAKE-2002 are provided in the Table below.
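As an illustration, the conflict filter can be sketched as below. The field names (`edits`, `post_edit_chain`, storing an instance's edited triples and its post-edit reasoning chain) are hypothetical stand-ins for the corresponding MQuAKE annotations, so this is a minimal sketch rather than our exact preprocessing script.

```python
# Minimal sketch of the cross-instance conflict filter behind MQuAKE-2002.
# `edits` and `post_edit_chain` are hypothetical field names standing in for
# the corresponding MQuAKE annotations.

def build_conflict_free_subset(instances):
    # Pool every edited fact from every instance, keyed by (subject, relation).
    edit_pool = {}
    for inst in instances:
        for subj, rel, new_obj in inst["edits"]:
            edit_pool.setdefault((subj, rel), set()).add(new_obj)

    kept = []
    for inst in instances:
        broken = False
        # The instance's ground-truth chain already reflects its own edits.
        for subj, rel, obj in inst["post_edit_chain"]:
            # If any other instance rewrites the same (subject, relation) pair
            # to a different object, this hop (and hence the answer) breaks.
            if edit_pool.get((subj, rel), set()) - {obj}:
                broken = True
                break
        if not broken:
            kept.append(inst)
    return kept
```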
New Benchmark MQuAKE-hard for More Challenging Evaluation.
The second issue with MQuAKE-3k is that more than 66% of its instances contain at most two edited facts that influence the answer, which is not challenging enough to evaluate knowledge editing methods on handling multiple edited facts that alter the ground-truth reasoning. We therefore construct a more challenging subset of MQuAKE by selecting the instances that contain the highest number of edited facts per instance. We term this challenging set MQuAKE-hard; it includes 429 instances, each containing four edited facts, and has no overlap with MQuAKE-3k. The data statistics of MQuAKE-hard are also provided in the Table below.
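A minimal sketch of this selection, under the same hypothetical field names as above:

```python
# Minimal sketch of assembling MQuAKE-hard: keep the instances with the maximum
# number of edited facts (four) and exclude anything sampled into MQuAKE-3k.
# `edits` and `id` are hypothetical field names, as in the sketch above.

def build_hard_subset(all_instances, mquake_3k_ids):
    max_edits = max(len(inst["edits"]) for inst in all_instances)
    return [
        inst for inst in all_instances
        if len(inst["edits"]) == max_edits       # hardest: four edited facts
        and inst["id"] not in mquake_3k_ids      # no overlap with MQuAKE-3k
    ]
```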
Benchmark | # Instances | Avg. # Hops per Instance | Avg. # Edited Facts per Instance | # Conflicted Instances |
---|---|---|---|---|
MQuAKE-3k (Zhong et al., 2023) | 3,000 | 3.0 | 2.0 | 998 |
MQuAKE-2002 (Ours) | 2,002 | 2.7 | 2.2 | 0 |
MQuAKE-hard (Ours) | 429 | 4.0 | 4.0 | 0 |
We propose DeepEdit, a new knowledge editing method that can be applied to any black-box large language model (LLM) without retraining. DeepEdit conducts knowledge editing by applying semantic constraints to the decoding of LLMs. Two examples of LLMs' text generation with our DeepEdit and the baseline method MeLLo are shown below.
We propose a new perspective of knowledge editing (KE) for large language models (LLMs) that treats it as a constrained decoding problem. We design decoding constraints to regulate LLMs, ensuring coherence between reasoning steps when incorporating new knowledge. To enforce these constraints, we utilize a depth-first search to adaptively substitute new knowledge for the LLMs' original reasoning steps, greedily seeking the optimal path of multi-hop reasoning with new knowledge. From this vantage, we propose DEEPEDIT: Depth-first Search-based Decoding for Knowledge Editing. DEEPEDIT improves the KE of LLMs by enhancing the conciseness, coherence, pertinence, and receptiveness of reasoning with new knowledge. DEEPEDIT is flexibly applicable to any black-box LLM without requiring access to model parameters or token-wise distributions. In addition to DEEPEDIT, we propose two new KE benchmarks: MQuAKE-2002 and MQuAKE-hard, which are designed to provide more precise and challenging assessments of KE approaches. Qualitatively, DEEPEDIT enables LLMs to produce more succinct reasoning outputs in accordance with new knowledge. Quantitatively, it yields significant improvements on multiple KE benchmarks.
The central idea is to view knowledge editing as a problem of decoding with constraints.
We propose to impose semantic constraints on the LLMs' outputs to represent the desired properties that benefit knowledge editing.
Specifically, we propose the following constraints: Conciseness 📍 prevents the redundant loops of reasoning; Coherence 🐾 guarantees the coherence of adjacent steps; Receptiveness 💡 guarantees the LLMs' awareness of new knowledge; Pertinence 🙌 improves the relevance of reasoning to the input question.
These constraints improve LLMs' reasoning with new knowledge from complementary perspectives.
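As a minimal sketch, the four constraints could be exposed as boolean checks over a candidate reasoning step; the `verifier` object and its methods below are hypothetical placeholders for the verification procedures described in Sections 3.3 and 3.4.

```python
# Minimal sketch of the four decoding constraints as boolean checks.
# The `verifier` methods are hypothetical placeholders, not our exact verifiers.

def satisfies_constraints(candidate, question, history, new_facts, verifier):
    # Conciseness: the candidate must not repeat an earlier step (no reasoning loops).
    if any(verifier.is_duplicate(candidate, step) for step in history):
        return False
    # Coherence: the candidate must follow coherently from the most recent step.
    if history and not verifier.is_coherent_with(candidate, history[-1]):
        return False
    # Receptiveness: the candidate must not contradict any piece of new knowledge.
    if verifier.contradicts_new_knowledge(candidate, new_facts):
        return False
    # Pertinence: the candidate must stay relevant to the input question.
    return verifier.is_relevant_to(candidate, question)
```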
In this work, we change the granularity of decoding from the token level to the reasoning step level. Instead of taking tokens from a fixed, predefined vocabulary as candidates, we take reasoning steps as candidates. At every iteration, we let the LLM produce one reasoning step and, in parallel, retrieve a fixed number of new facts as the step candidates. The new-fact retrieval is based on nearest neighbor search over sentence-level embeddings given by BERT. The motivation for this design is that new facts that are semantically close to the generated step are more likely to satisfy the constraints.
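Below is a minimal sketch of this step-level candidate generation. The `llm.generate_step` call is a hypothetical stand-in for the prompt that elicits one reasoning step, and sentence-transformers is used as one possible BERT-style sentence-embedding backend.

```python
# Minimal sketch of step-level candidate generation: one LLM-generated step plus
# the k nearest new facts under sentence-embedding similarity.

import numpy as np
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("all-MiniLM-L6-v2")  # one possible BERT-style encoder

def propose_step_candidates(llm, prompt, new_facts, k=3):
    generated_step = llm.generate_step(prompt)  # hypothetical LLM call

    # Embed the generated step and all new facts, then rank facts by cosine similarity.
    step_vec = encoder.encode([generated_step], normalize_embeddings=True)[0]
    fact_vecs = encoder.encode(new_facts, normalize_embeddings=True)
    scores = fact_vecs @ step_vec
    nearest_facts = [new_facts[i] for i in np.argsort(-scores)[:k]]

    # Facts semantically close to the generated step are the most likely to
    # satisfy the decoding constraints.
    return [generated_step] + nearest_facts
```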
Taking the generated step and the retrieved new facts as the step candidates, we select a valid candidate and take it as the new reasoning step. Note that more than one candidate may satisfy the constraints. We introduce how we design the decoding constraints for KE in Section 3.2, how we verify the constraints in Sections 3.3 and 3.4, and how we process multiple valid step candidates to search for the optimal path of the LLM's multi-hop reasoning in Section 3.5.
Our decoding design bypasses the requirement of accessing token-wise distributions, since all decoding actions are defined at the step level. Our decoding strategy can therefore be applied to black-box LLMs to produce satisfactory outputs in accordance with new knowledge.
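A minimal sketch of this candidate selection, reusing the `satisfies_constraints` check sketched above; note that it operates purely on generated text, so no token-wise distributions are needed.

```python
# Minimal sketch of keeping only the step candidates that pass all constraints.

def filter_valid_candidates(candidates, question, history, new_facts, verifier):
    return [
        c for c in candidates
        if satisfies_constraints(c, question, history, new_facts, verifier)
    ]
```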
When producing every reasoning step, DeepEdit searches through the parametric and new knowledge to find the knowledge items that satisfy our constraints, termed valid step candidates.
Among the valid candidates, DeepEdit prioritizes more important ones as the new reasoning step for further reasoning.
We comprehensively evaluate the importance of different candidates based on their semantic relationships and the prior that the new knowledge is more informative to LLMs than the parametric knowledge.
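One illustrative way to order the valid candidates is sketched below; the weighting is a stand-in, not our exact scoring function.

```python
# Minimal sketch of ranking valid candidates: prefer new (edited) facts over the
# LLM's parametric step, breaking ties by semantic similarity to the previous step.

def rank_candidates(valid_candidates, prev_step, new_fact_set, similarity):
    def importance(candidate):
        is_new_knowledge = candidate in new_fact_set  # prior: new facts are more informative
        relatedness = similarity(candidate, prev_step) if prev_step else 0.0
        return (1.0 if is_new_knowledge else 0.0) + relatedness
    return sorted(valid_candidates, key=importance, reverse=True)
```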
Note that the classical depth-first search has a backtracking mechanism that enables us to traverse all the reasoning chains of valid reasoning steps.
However, such backtracking is undoubtedly time-consuming and unnecessary after an answer is found.
Therefore, we design an early-stopping mechanism to stop the depth-first search when an answer is found by LLMs.
Such an early stopping significantly improves the reasoning efficiency by reducing redundant iterations, as validated by the qualitative analysis and empirical results.
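Putting the pieces together, below is a minimal sketch of the depth-first search with early stopping. It reuses the hypothetical helpers sketched above, plus a hypothetical `build_prompt` function and a simple textual convention for detecting a final answer.

```python
# Minimal sketch of DeepEdit's depth-first search with early stopping: candidates
# are expanded in order of importance, and the search returns as soon as the LLM
# emits a final answer, skipping the remaining backtracking.

def deepedit_search(llm, question, new_facts, verifier, similarity, max_depth=8):
    def dfs(history):
        if history and history[-1].startswith("Final answer:"):
            return history[-1]                       # early stopping: answer found
        if len(history) >= max_depth:
            return None
        prompt = build_prompt(question, history)     # hypothetical prompt builder
        candidates = propose_step_candidates(llm, prompt, new_facts)
        valid = filter_valid_candidates(candidates, question, history, new_facts, verifier)
        prev_step = history[-1] if history else None
        for step in rank_candidates(valid, prev_step, set(new_facts), similarity):
            answer = dfs(history + [step])
            if answer is not None:                   # stop backtracking once an answer exists
                return answer
        return None

    return dfs([])
```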
Method | MQuAKE-3k Accuracy (%) |
---|---|
text-davinci-003 w/ MeLLo | 32.2 |
text-davinci-003 w/ DEEPEDIT (Ours) | 49.1 |
gpt-3.5-turbo-instruct w/ MeLLo | 29.9 |
gpt-3.5-turbo-instruct w/ DEEPEDIT (Ours) | 47.2 |