Lexical Regression Test Prioritization

Summary

Automated testing is a vital software engineering practice: it ensures that changes to a program do not break existing functionality and that newly specified behavior is implemented correctly. Test feedback is most valuable when it arrives immediately, but long-running test suites delay it. We investigate natural-language-based techniques that arrange test suites so that the tests most relevant to a change are executed first, thereby detecting defects earlier.

Key Insights

  1. Dynamic languages like Python or JavaScript are difficult to analyze formally, which makes it hard to establish connections between tests and code. Treating programs in dynamic languages as natural language artifacts allows us to link code and tests by what they are semantically meant to implement and verify.
  2. Identifier names and other natural-language features carry no specific meaning in formal program analysis. However, their co-occurrence in code and tests strongly predicts relatedness. Not all identifiers are equally strong predictors, but their involvement in previous test failures provides an estimate of their predictive power (Mattis Prog’20).
  3. Topic models and large language models (LLMs) can better capture semantic relatedness than identifiers alone and offer competitive prioritization performance. However, some might be too resource-intensive to justify their use in test prioritization (Mattis PAI’24).
  4. The more historical change and test data are available, the more effective simple most-recently-failed strategies become (Mattis MSR’20). However, most repositories only provide version history without associated test outcomes, which prompted us to develop a data augmentation strategy that emulates faulty changes (Mattis Prog’20, Mattis PAI’24).

Approaches

Precision-based Term Weighting

Prioritizing tests by vocabulary overlap with a change can be reduced to an information retrieval problem, where the change is a query that retrieves the most relevant tests from the test suite. Algorithms like Okapi BM25 have previously been used to prioritize tests (Saha 2015). However, classical TF-IDF weights terms by their inverse document frequency, which is not a good match for source code.

In contrast, we compute the precision with which an identifier distinguishes failing from passing tests when it occurs in a change. We replace the IDF term in TF-IDF with this precision, obtaining the TF-PREC retrieval scheme, which outperforms TF-IDF but requires more data to be collected beforehand.
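The following Python sketch illustrates the idea under simplified assumptions; the helper names and the default precision for unseen terms are hypothetical, and it operates on plain bags of identifiers rather than our full change and test representations.

```python
# Minimal sketch of precision-weighted term scoring (TF-PREC style).
from collections import Counter, defaultdict


def term_precision(history):
    """Estimate, per term, how often a test containing it failed when the term
    also occurred in the triggering change.

    `history` is a list of (change_terms, test_terms, failed) records.
    """
    occurrences = defaultdict(int)
    failures = defaultdict(int)
    for change_terms, test_terms, failed in history:
        for term in set(change_terms) & set(test_terms):
            occurrences[term] += 1
            if failed:
                failures[term] += 1
    return {t: failures[t] / occurrences[t] for t in occurrences}


def tf_prec_score(change_terms, test_terms, precision, default=0.1):
    """Score a test against a change: term frequency weighted by precision.
    The default for unseen terms is an illustrative choice."""
    tf = Counter(test_terms)
    return sum(tf[t] * precision.get(t, default) for t in set(change_terms))


def prioritize(change_terms, tests, precision):
    """Return test names ordered by descending TF-PREC score.
    `tests` maps a test name to its bag of identifiers."""
    return sorted(tests, key=lambda name: -tf_prec_score(change_terms, tests[name], precision))
```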

Code-aware Topic Modeling

To address imprecision in natural language elements, especially synonymy, topic models can group semantically related words into topics. Instead of comparing identifiers directly, we prioritize tests concerned with the changed topic. Topic models like Latent Dirichlet Allocation (LDA) have been designed for natural language and only relate words based on co-occurrence in the same document.

To better address the structure of source code, we first redefine co-occurrence in terms of proximity in the abstract syntax tree instead of documents. Second, we separate abstractions (e.g., public interfaces) from implementation details (e.g., standard library usage) to better separate topics (Mattis SRC’18). These adaptations require only minor changes to the Gibbs sampling procedures typically used to infer LDA topic models but achieve higher coherence of extracted topics.
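As a rough illustration of the resulting pipeline, not our implementation, the sketch below builds one pseudo-document per function definition (a crude stand-in for AST-proximity co-occurrence), trains gensim's off-the-shelf LDA in place of the adapted Gibbs sampler, and ranks tests by the similarity of their topic distributions to that of the change.

```python
# Sketch: topic-based test prioritization with function-level pseudo-documents.
import ast
import numpy as np
from gensim import corpora, models


def identifier_documents(source):
    """One identifier bag per function definition in a Python module."""
    tree = ast.parse(source)
    docs = []
    for node in ast.walk(tree):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            names = [n.id for n in ast.walk(node) if isinstance(n, ast.Name)]
            names += [n.attr for n in ast.walk(node) if isinstance(n, ast.Attribute)]
            docs.append(names)
    return docs


def topic_vector(lda, dictionary, tokens):
    """Dense topic distribution of one bag of identifiers."""
    bow = dictionary.doc2bow(tokens)
    dense = np.zeros(lda.num_topics)
    for topic_id, prob in lda.get_document_topics(bow, minimum_probability=0.0):
        dense[topic_id] = prob
    return dense


def rank_tests(change_tokens, test_token_map, corpus_docs, num_topics=20):
    """Order tests by topic similarity to the change.
    `corpus_docs` can be built with identifier_documents() over the project."""
    dictionary = corpora.Dictionary(corpus_docs)
    corpus = [dictionary.doc2bow(doc) for doc in corpus_docs]
    lda = models.LdaModel(corpus, id2word=dictionary, num_topics=num_topics, passes=10)
    change_vec = topic_vector(lda, dictionary, change_tokens)

    def similarity(tokens):
        vec = topic_vector(lda, dictionary, tokens)
        denom = np.linalg.norm(change_vec) * np.linalg.norm(vec) + 1e-9
        return float(np.dot(change_vec, vec) / denom)

    return sorted(test_token_map, key=lambda name: -similarity(test_token_map[name]))
```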

Embedding-based Test Prioritization

Vector embeddings represent code as high-dimensional vectors so that semantic similarity is expressed in terms of proximity in that vector space.

We explored the effectiveness of comparing the embedding of a change to each test’s embedding, prioritizing by vector similarity. Since embedding models quickly average out the signal when inputs get too large, we split changes into contiguous chunks and rank each test by its similarity to the closest chunk. While highly effective even with older models like UniXCoder, this method demands a reasonably fast GPU.
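A minimal sketch of the chunked comparison, assuming the publicly available microsoft/unixcoder-base checkpoint; mean pooling and the chunk size are illustrative choices rather than our exact configuration.

```python
# Sketch: rank tests by embedding similarity to the closest change chunk.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("microsoft/unixcoder-base")
model = AutoModel.from_pretrained("microsoft/unixcoder-base").eval()


def embed(text):
    """Mean-pooled token embeddings as one vector per input text."""
    inputs = tokenizer(text, truncation=True, max_length=512, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state  # (1, tokens, dim)
    return hidden.mean(dim=1).squeeze(0)


def chunk(change_text, lines_per_chunk=20):
    """Split a change into contiguous line chunks so large inputs do not
    average out the signal."""
    lines = change_text.splitlines()
    return ["\n".join(lines[i:i + lines_per_chunk])
            for i in range(0, len(lines), lines_per_chunk)] or [""]


def prioritize(change_text, tests):
    """Order tests (name -> source) by similarity to the closest change chunk."""
    chunk_vecs = [embed(c) for c in chunk(change_text)]

    def score(test_source):
        test_vec = embed(test_source)
        return max(torch.cosine_similarity(test_vec, c, dim=0).item() for c in chunk_vecs)

    return sorted(tests, key=lambda name: -score(tests[name]))
```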

We continue to investigate fine-tuning the underlying bi-directional transformer on historical and synthetic data.

LLM-based Test Prioritization

Large language models can generate tests by repeatedly sampling tokens from their probability distribution. Instead of generating a new test, we use this probability distribution to estimate how likely the LLM would have been to generate an existing test from a change as its prompt: we look up the probabilities of the test’s tokens and compute their geometric mean.

We format changes by commenting out old and inserting new code, subsequently asking a code LLM (such as StableCode or CodeGemma) to generate a unit test for the given change.
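The scoring step can be sketched as follows; the model name is only a placeholder for any causal code LLM, the prompt formatting is simplified, and the score is the geometric mean of the test's token probabilities, i.e. exp of the mean log-probability.

```python
# Sketch: likelihood of an existing test under a code LLM, given a change prompt.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "stabilityai/stable-code-3b"  # placeholder; any causal code LLM works
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL).eval()


def test_likelihood(change_prompt, test_source):
    """Geometric mean of the probabilities the LLM assigns to the test tokens."""
    prompt_ids = tokenizer(change_prompt, return_tensors="pt").input_ids
    test_ids = tokenizer(test_source, return_tensors="pt",
                         add_special_tokens=False).input_ids
    input_ids = torch.cat([prompt_ids, test_ids], dim=1)
    with torch.no_grad():
        logits = model(input_ids).logits
    # Log-probability of each token, conditioned on everything before it.
    log_probs = torch.log_softmax(logits[:, :-1], dim=-1)
    targets = input_ids[:, 1:]
    token_log_probs = log_probs.gather(2, targets.unsqueeze(-1)).squeeze(-1)
    # Keep only the positions that predict the test's tokens.
    test_log_probs = token_log_probs[:, prompt_ids.shape[1] - 1:]
    return float(test_log_probs.mean().exp())  # geometric mean of probabilities
```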

While this approach produces good test orders, it is less effective than the embedding-based approach and currently requires more resources than simply running most of the tests, which makes it impractical for now. Moreover, LLMs are biased against tests that do not follow the coding conventions seen in their training data, regardless of whether the test is relevant to the change.

We continue to investigate optimizations through fine-tuning, better GPU utilization, and caching of transformer state.

Change-aware Mutation Testing

Ideally, we would base model training and evaluation on historical changes and associated test failures and passes. However, public datasets with comprehensive test results are scarce.

A complementary approach relies on mutation testing to run the test suite against synthetic defects and obtain correlations between changed code and test failures. However, in real software projects, testing effort follows changes: most parts of a program are rarely touched and thus only coarsely tested. Introducing defects uniformly over the code base heavily biases our datasets toward barely modified and tested code.

To generate more realistic data, we combine version history with mutation testing and, for each version of a program, run the test suite against that version with faulty changes applied. This way, we follow the historical change distribution and pretend that programmers made an error in one of their commits. Moreover, we can use entire changes as inputs to our test prioritization approach and evaluate it in a more realistic setting. In contrast, standard mutation testing would yield single-line changes unless artificially expanded.
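A simplified sketch of this augmentation loop, assuming a Git repository with a pytest suite; the single hard-coded mutation operator and the shell commands stand in for a full mutation engine.

```python
# Sketch: emulate a faulty change per historical commit and record failing tests.
import subprocess
import xml.etree.ElementTree as ET


def changed_files(commit):
    """Python files touched by the given commit."""
    out = subprocess.run(["git", "diff", "--name-only", f"{commit}^", commit],
                         capture_output=True, text=True, check=True).stdout
    return [f for f in out.splitlines() if f.endswith(".py")]


def mutate(path):
    """Crude mutation operator: flip the first '==' into '!='."""
    with open(path) as f:
        source = f.read()
    if "==" not in source:
        return False
    with open(path, "w") as f:
        f.write(source.replace("==", "!=", 1))
    return True


def failing_tests():
    """Run pytest and collect the names of failing or erroring tests."""
    subprocess.run(["pytest", "--junit-xml=report.xml", "-q"], check=False)
    tree = ET.parse("report.xml")
    return [case.get("name") for case in tree.iter("testcase")
            if case.find("failure") is not None or case.find("error") is not None]


def augment(commits):
    """For each commit, apply a mutant to one changed file and record outcomes."""
    records = []
    for commit in commits:
        subprocess.run(["git", "checkout", "--force", commit], check=True)
        for path in changed_files(commit):
            if mutate(path):
                records.append((commit, path, failing_tests()))
                subprocess.run(["git", "checkout", "--", path], check=True)
                break
    return records
```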

We continue to extend our data augmentation strategies with more realistic defects, including LLM-generated defects where a code-generating LLM is fine-tuned on human errors it should reproduce in a code base.

Test Prioritization for Program Comprehension

The role of tests extends beyond correctness: they can serve as usage examples of the tested functionality and could thus aid program comprehension if they were available immediately when inspecting a code location. Moreover, if the most relevant test were always at hand during development, programmers could easily try out and explore different implementations and get immediate feedback on their correctness.

We used decision-tree-based methods to select relevant tests in real-time based on programmers’ IDE context (cursor position and open tabs) (Meier PX’21).
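As a toy illustration (the features, data, and test names are hypothetical), such a selector can be trained with off-the-shelf decision trees on past IDE interactions.

```python
# Sketch: predict the most relevant test from a programmer's IDE context.
from sklearn.feature_extraction import DictVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.tree import DecisionTreeClassifier

# Each record: focused file and open tabs, plus the test the programmer needed.
history = [
    ({"focused": "cart.py", "tab:cart.py": 1, "tab:order.py": 1}, "test_cart_total"),
    ({"focused": "order.py", "tab:order.py": 1}, "test_order_submit"),
    ({"focused": "cart.py", "tab:cart.py": 1}, "test_cart_total"),
]

contexts, labels = zip(*history)
model = make_pipeline(DictVectorizer(sparse=False), DecisionTreeClassifier())
model.fit(list(contexts), list(labels))

# Predict the test to surface for the current IDE context.
print(model.predict([{"focused": "cart.py", "tab:cart.py": 1}])[0])
```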

We continue to explore the use of test prioritization in live programming contexts to obtain live examples that illustrate the program “by example” rather than only pass/fail judgments (Mattis PX’24).

Selected Publications