This is an overview of research and teaching topics I work on. This page is intended for informational purposes only. Note that all members of the Software Architecture Group contribute to these efforts.
General fields:
AI for Programming
Seminar: AI for Programming
HPI Master’s seminar, summer term 2024.
Participants will develop their own small AI solution to a software engineering problem. The seminar will focus on an in-depth and hands-on understanding of large language models, their fine-tuning, and the user experience of the resulting tools.
Seminar: Future of Programming
HPI Master’s seminar, winter term 2023/24.
Supervised topics:
- LLM code quality: Measure the quality of code generated by large language models and how prompts and parameters influence it
- Example recommendation: Use AI embeddings to locate useful code examples in other projects and recommend them depending on the programmers’ current task
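
The example recommendation topic can be illustrated with a minimal sketch. It assumes a toy bag-of-words vector as a stand-in for a learned AI embedding; the `embed` function, the `EXAMPLES` corpus, and the file names are hypothetical and only serve to show the ranking step by cosine similarity.

```python
import math
from collections import Counter


def embed(text):
    """Toy stand-in for an AI embedding: a bag-of-words count vector (assumption)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def recommend(task, examples, k=2):
    """Return the names of the k examples most similar to the task description."""
    query = embed(task)
    ranked = sorted(
        examples,
        key=lambda name: cosine(query, embed(examples[name])),
        reverse=True,
    )
    return ranked[:k]


# Hypothetical corpus: example file name -> short description of what it does.
EXAMPLES = {
    "read_csv.py": "open csv file and parse rows into records",
    "http_get.py": "send http request and read json response",
    "sort_users.py": "sort list of user records by name",
}

print(recommend("parse a csv file", EXAMPLES, k=1))
```

In a real tool, `embed` would be replaced by a model that maps code or text to dense vectors; the ranking step stays the same.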
Mining Repositories
- Seminar: Code Repository Mining 2020
- Seminar: Machine Learning on Code Repositories
- Seminar: Code Repository Mining 2017
- Data
- Tools and Techniques
- High-level Goals
Seminar: Code Repository Mining 2020
HPI Master’s seminar, summer term 2020.
Supervised topics:
- Language influence on code style: Measure how learning a new language influences the code style of programmers
- Vector embeddings of code: Train and evaluate vector embeddings that capture the semantics of source code
- Change prediction: Use historical data to predict future software changes and recommend items for incomplete changes
- Test-based modularity analysis: Correlate test-breaking and test-fixing changes to characterize the modularity of a software system
- Live test prioritization: Develop real-time test prioritization models trained on mutation testing data
- Technical debt at scale: Find large-scale patterns in common technical debt metrics across GitHub
- Issue complexity prediction: Train and evaluate a model that can distinguish easy from hard GitHub issues
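
As a rough illustration of the change prediction topic, the following sketch counts how often files changed together in past commits and uses those counts to suggest files that may be missing from an incomplete change. The `HISTORY` data and all function names are hypothetical; the seminar projects may have used entirely different models.

```python
from collections import Counter, defaultdict
from itertools import permutations


def build_cochange(commits):
    """For every file, count how often each other file changed in the same commit."""
    cochange = defaultdict(Counter)
    for files in commits:
        for a, b in permutations(set(files), 2):
            cochange[a][b] += 1
    return cochange


def suggest_missing(cochange, changed, k=3):
    """Rank files that historically changed together with the files in `changed`."""
    scores = Counter()
    for f in changed:
        for other, n in cochange.get(f, {}).items():
            if other not in changed:
                scores[other] += n
    return [f for f, _ in scores.most_common(k)]


# Hypothetical commit history: each entry lists the files touched by one commit.
HISTORY = [
    ["model.py", "test_model.py"],
    ["model.py", "test_model.py", "schema.sql"],
    ["view.py", "test_view.py"],
    ["model.py", "schema.sql"],
    ["model.py", "test_model.py"],
]

cochange = build_cochange(HISTORY)
print(suggest_missing(cochange, {"model.py"}))
```

Raw co-change counts favor frequently changed files; a more careful version would normalize by how often each file changes overall.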
Seminar: Machine Learning on Code Repositories
HPI Master’s seminar, summer term 2018.
Supervised topics:
- Evolutionary modularity: Measure modularity and identify architectural problems by comparing the distance between code passages with their co-change frequency
- Package recommender: Implement and evaluate a recommender system for Python package dependencies based on collaborative filtering and matrix factorization
- Code completion: Design parser/generator abstractions for efficiently auto-completing repetitive structures in source code
- Expertise tracking: Investigate how individual commits improve or deteriorate (with respect to software quality metrics) when programmers gain more experience over several years of contributions
- Paradigm transfer: Measure how programming style shifts when programmers pick up a different programming language
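
The package recommender topic can be sketched with item-based collaborative filtering. Note that this is a simpler stand-in for the matrix factorization named above, not the approach the students implemented; the `PROJECTS` data and function names are hypothetical. Two packages count as similar when many of the same projects depend on both (Jaccard similarity).

```python
from collections import defaultdict


def jaccard(a, b):
    """Jaccard similarity of two sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def recommend_packages(projects, current, k=2):
    """Suggest packages for a project that already depends on `current`."""
    # Invert the data: for each package, the set of projects that depend on it.
    users = defaultdict(set)
    for project, deps in projects.items():
        for dep in deps:
            users[dep].add(project)

    # Score each candidate by its summed similarity to the current dependencies.
    scores = {}
    for candidate in users:
        if candidate in current:
            continue
        scores[candidate] = sum(
            jaccard(users[candidate], users[dep]) for dep in current if dep in users
        )

    # Highest score first; ties are broken alphabetically for determinism.
    ranked = sorted(scores, key=lambda p: (-scores[p], p))
    return ranked[:k]


# Hypothetical dependency data: project -> set of Python packages it uses.
PROJECTS = {
    "proj_a": {"numpy", "scipy", "matplotlib"},
    "proj_b": {"numpy", "scipy", "pandas"},
    "proj_c": {"flask", "jinja2"},
    "proj_d": {"numpy", "pandas"},
}

print(recommend_packages(PROJECTS, {"numpy"}))
```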
Seminar: Code Repository Mining 2017
HPI Master’s seminar, winter term 2017/18.
Supervised topics:
- Language influence on code style: Measure which types of errors Python programmers tend to make when they switch from Java or C++ to Python and how teaching/training can be improved to better avoid them
- Effects of high-profile incidents on code: Investigate how programmers react to severe vulnerabilities (CVEs), through which information channels knowledge about CVEs and CWEs propagates, and how programmer awareness of vulnerabilities caused by programming errors can be improved
- Classifying repository language: Design a classification method to identify a project’s primary language that is more robust than simply picking the most prevalent language
- Package recommender: Implement and evaluate a recommender system for Python package dependencies based on graph search
- Cross-language syntax trees: Design data structures that can represent programs of multiple languages using the same abstractions and allow writing code analyses that run unmodified on any language
Data
The following data is provided to the students:
- Enriched GHTorrent data containing meta-data on most GitHub projects, users, and commits
- More than 10 billion file changes with meta-data and full patches
- 250,000+ cloned repositories, usually extended by students themselves
Tools and Techniques
- Postgres + SQL
- Jupyter Notebooks
- Python data analysis (NumPy/SciPy, scikit-learn)
- Source code analysis tools and techniques (parsers such as srcML, linters such as Pylint, and metrics such as OOP metrics)
- Lively4 and Squeak/Smalltalk for extending programming environments
High-level Goals
During the seminar, we teach the following practices and give regular (weekly) feedback and advice on them:
- Research
  - Learn and practice reproducible research
  - Start from practical goals (hypothetical user having a problem) and evaluate results with respect to the original goal, reflect on limitations of the chosen methods and new questions discovered on the way
  - Acquire literature related to the task at hand
  - Present research results and insights in talks
  - Document and structure source code and data artifacts for possible future use in teaching, research, or open source contributions
- Working with Data
  - Experience and manage differences in scale, e.g., what it means to process megabytes, gigabytes, or terabytes of data
  - Understand limitations of underlying hardware, algorithm complexity, SQL queries, etc.
  - Practice data cleansing, exploration, sampling, and visualization with real-world development data
- Working with Code and Repositories
  - Gather insights on how development on GitHub works, which artifacts it produces, and what they reveal about programmers, processes, and programs
  - Programmatically work with version control, such as Git, and gather data from repositories
  - Learn how to write code analyzers and measure code metrics
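
Working with Git programmatically can be sketched as follows. The sketch assumes the `git` command-line tool is installed; the `@@` commit marker and all function names are choices made for this illustration, not part of the seminar material. It reads `git log --name-only` output and counts how often each file was changed.

```python
import subprocess
from collections import Counter

# "@@" marks the start of a commit line so it can be told apart from file names.
LOG_FORMAT = "--pretty=format:@@%H|%an"


def read_git_log(repo_path):
    """Run `git log` in repo_path and return its raw text output."""
    result = subprocess.run(
        ["git", "-C", repo_path, "log", "--name-only", LOG_FORMAT],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout


def parse_git_log(text):
    """Turn raw log text into a list of (sha, author, [files]) tuples."""
    commits = []
    for line in text.splitlines():
        line = line.strip()
        if line.startswith("@@"):
            sha, author = line[2:].split("|", 1)
            commits.append((sha, author, []))
        elif line and commits:
            commits[-1][2].append(line)
    return commits


def changes_per_file(commits):
    """Count in how many commits each file was touched."""
    counts = Counter()
    for _, _, files in commits:
        counts.update(files)
    return counts


# Hand-written sample in the same shape as the git output (hypothetical data).
SAMPLE = "@@abc123|Alice\nsrc/main.py\nREADME.md\n\n@@def456|Bob\nsrc/main.py\n"
print(changes_per_file(parse_git_log(SAMPLE)).most_common())
```

For a real repository, `parse_git_log(read_git_log("path/to/repo"))` would yield the same structure, where the path is a placeholder.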
Software Testing
These topics are part of the annual undergraduate lecture Software Engineering I:
- Test automation: xUnit frameworks and how to use them
- Testing patterns: Mock objects/test doubles, regression testing, triangulation, learning tests, fixtures, exception testing, …
- Test-driven development (TDD): General process (red, green, refactor), TDD patterns, examples, preconditions, limitations, and benefits
- Behavior-driven development (BDD): Difference to TDD, ubiquitous language
- Acceptance testing: Contrast to unit tests, FIT, Cucumber/Lettuce, Selenium
- Property-based testing: QuickCheck, Hypothesis
- Test quality: Coverage, test smells, test refactorings, test performance, mutation testing
- Non-functional testing: Dependability (reliability, availability, resilience, stress testing, …), performance (throughput, response time, …), compliance (e.g. coding conventions), compatibility, …
- Scopes of testing: Unit/integration/system tests, black box/white box testing, user/acceptance vs. unit testing, …
- Testing and continuous integration
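
Several of the topics above can be shown in one small xUnit-style sketch using Python's built-in `unittest`. The `Account` class is a hypothetical example, not taken from the lecture. The test case uses a fixture (`setUp`), a mock object as a test double, exception testing, and a hand-rolled property check with seeded random inputs (a simplified stand-in for what a library such as Hypothesis automates).

```python
import random
import unittest
from unittest import mock


class Account:
    """Hypothetical class under test."""

    def __init__(self, notifier):
        self.balance = 0
        self.notifier = notifier

    def deposit(self, amount):
        if amount <= 0:
            raise ValueError("amount must be positive")
        self.balance += amount
        self.notifier.send(f"deposited {amount}")


class AccountTest(unittest.TestCase):
    def setUp(self):
        # Fixture: every test gets a fresh account with a mock notifier.
        self.notifier = mock.Mock()
        self.account = Account(self.notifier)

    def test_deposit_increases_balance(self):
        self.account.deposit(50)
        self.assertEqual(self.account.balance, 50)

    def test_deposit_notifies(self):
        # Test double: verify the interaction without a real notifier.
        self.account.deposit(50)
        self.notifier.send.assert_called_once_with("deposited 50")

    def test_rejects_non_positive_amount(self):
        # Exception testing: invalid input must raise.
        with self.assertRaises(ValueError):
            self.account.deposit(0)

    def test_balance_is_sum_of_deposits(self):
        # Property check: for any positive deposits, balance equals their sum.
        rng = random.Random(0)
        amounts = [rng.randint(1, 1000) for _ in range(100)]
        for amount in amounts:
            self.account.deposit(amount)
        self.assertEqual(self.account.balance, sum(amounts))


def run_suite():
    """Load and run the test case, returning the unittest result object."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(AccountTest)
    return unittest.TextTestRunner(verbosity=0).run(suite)


if __name__ == "__main__":
    run_suite()
```

In a TDD workflow, each test would be written first and fail (red) before `Account` is extended to make it pass (green), followed by refactoring.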