This is an overview of research and teaching topics I work on. This page is intended for informational purposes only. Note that all members of the Software Architecture Group contribute to these efforts.
- Seminar: Machine Learning on Code Repositories
- Seminar: Code Repository Mining
- Tools and Techniques
- High-level Goals
Seminar: Machine Learning on Code Repositories
HPI Master's seminar, summer term 2018.
- Evolutionary modularity: Measure modularity and identify architectural problems by comparing distance of code passages to their co-change frequency
- Package recommender: Implement and evaluate a recommender system for Python package dependencies based on collaborative filtering and matrix factorization
- Code completion: Design parser/generator abstractions for efficiently auto-completing repetitive structures in source code
- Expertise tracking: Investigate how individual commits improve or deteriorate (with respect to software quality metrics) when programmers gain more experience over several years of contributions
- Paradigm transfer: Measure how programming style shifts when programmers pick up a different programming language
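To make the evolutionary modularity topic more concrete, here is a minimal sketch of how co-change frequency could be counted, i.e. how often two files are modified in the same commit. The commit data and the name `co_change_counts` are hypothetical illustrations, not the implementation used in the seminar.

```python
from collections import Counter
from itertools import combinations


def co_change_counts(commits):
    """Count how often each pair of files changes in the same commit.

    `commits` is an iterable of collections of file paths, one per commit.
    """
    counts = Counter()
    for files in commits:
        # Sort so each unordered pair is always stored as the same tuple.
        for pair in combinations(sorted(set(files)), 2):
            counts[pair] += 1
    return counts


# Hypothetical commit history: each entry lists the files touched by one commit.
example_commits = [
    ["core/parser.py", "core/lexer.py"],
    ["core/parser.py", "core/lexer.py", "ui/view.py"],
    ["ui/view.py", "ui/widgets.py"],
]

pair_counts = co_change_counts(example_commits)
```

Pairs that change together often but are far apart in the code base are candidates for architectural problems. How to measure that distance is left open here.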
Seminar: Code Repository Mining
HPI Master's seminar, winter term 2017/18.
- Language influence on code style: Measure which types of errors Python programmers tend to make when they switch from Java or C++ to Python and how teaching/training can be improved to better avoid them
- Effects of high-profile incidents on code: Investigate how programmers react to severe vulnerabilities (CVEs), which information channels knowledge about CVEs and CWEs propagates through, and how programmer awareness of vulnerabilities caused by programming errors can be improved
- Classifying repository language: Design a classification method that identifies a project's primary language more robustly than simply picking the most prevalent language
- Package recommender: Implement and evaluate a recommender system for Python package dependencies based on graph search
- Cross-language syntax trees: Design data structures that can represent programs of multiple languages using the same abstractions and that allow writing code analyses that run unmodified on any language
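As an illustration of the repository language topic, the following sketch shows the simple baseline it mentions: picking the most prevalent language, measured here in bytes. The extension mapping, the example repository, and the function name `most_prevalent_language` are hypothetical and deliberately small.

```python
from collections import Counter
from pathlib import PurePosixPath

# Hypothetical, deliberately tiny mapping from file extension to language.
EXTENSION_TO_LANGUAGE = {
    ".py": "Python",
    ".java": "Java",
    ".cpp": "C++",
    ".js": "JavaScript",
}


def most_prevalent_language(file_sizes):
    """Return the language with the most bytes, or None if nothing matches.

    `file_sizes` maps file paths to their size in bytes.
    """
    bytes_per_language = Counter()
    for path, size in file_sizes.items():
        language = EXTENSION_TO_LANGUAGE.get(PurePosixPath(path).suffix.lower())
        if language is not None:
            bytes_per_language[language] += size
    if not bytes_per_language:
        return None
    return bytes_per_language.most_common(1)[0][0]


# Hypothetical repository: a Python project that vendors a large JavaScript file.
example_repo = {
    "app/main.py": 4_000,
    "app/models.py": 6_000,
    "static/vendor/library.min.js": 90_000,
}

baseline_guess = most_prevalent_language(example_repo)
```

For this example the baseline answers JavaScript, although the project's own code is Python. Cases like this are what a more robust classifier would need to handle.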
The following data is provided to the students:
- Enriched GHTorrent data containing meta-data on most GitHub projects, users, and commits
- More than 10 billion file changes with meta-data and full patches
- 250,000+ cloned repositories, usually extended by students themselves
Tools and Techniques
- Postgres + SQL
- Jupyter Notebooks
- Python data analysis (NumPy/SciPy, scikit-learn)
- Source code analysis tools and techniques (parsers such as srcML, linters such as Pylint, and metrics such as OOP metrics)
- Lively4 and Squeak/Smalltalk for extending programming environments
High-level Goals
Throughout the seminar, we teach the following practices and give regular (weekly) feedback and advice on them:
- Learn and practice reproducible research
- Start from practical goals (a hypothetical user having a problem), evaluate results with respect to the original goal, and reflect on the limitations of the chosen methods and on new questions discovered along the way
- Acquire literature related to the task at hand
- Present research results and insights in talks
- Document and structure source code and data artifacts for possible future use in teaching, research, or open source contributions
- Working with Data
  - Experience and manage differences in scale, e.g., what it means to process megabytes, gigabytes, or terabytes of data
  - Understand limitations of underlying hardware, algorithm complexity, SQL queries, etc.
  - Practice data cleansing, exploration, sampling, and visualization with real-world development data
- Working with Code and Repositories
  - Gather insights on how development on GitHub works, which artifacts it produces, and what they tell about programmers, processes, and programs
  - Programmatically work with version control, such as Git, and gather data from repositories
  - Learn how to write code analyzers and measure code metrics
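As a small illustration of working with Git programmatically, the sketch below collects the files changed by each commit by parsing `git log` output. The `@@` marker in the format string and both function names are assumptions made for this example. Merge commits typically list no files with these options.

```python
import subprocess


def parse_name_only_log(log_text):
    """Parse `git log --name-only --pretty=format:@@%H` output.

    Returns a dict mapping each commit hash to the list of files it touched.
    """
    commits = {}
    current = None
    for line in log_text.splitlines():
        line = line.strip()
        if line.startswith("@@"):
            # A marker line starts a new commit; the hash follows the marker.
            current = line[2:]
            commits[current] = []
        elif line and current is not None:
            commits[current].append(line)
    return commits


def changed_files_per_commit(repo_path):
    """Run Git in `repo_path` and return the changed files per commit."""
    result = subprocess.run(
        ["git", "-C", repo_path, "log", "--name-only", "--pretty=format:@@%H"],
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_name_only_log(result.stdout)


# Hypothetical log text in the shape produced by the command above.
sample_log = "@@abc123\nREADME.md\nsrc/main.py\n\n@@def456\nsrc/main.py\n"
sample_commits = parse_name_only_log(sample_log)
```

Separating the parsing from the Git call keeps the parser testable without a repository on disk.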
These topics are part of the annual undergraduate lecture Software Engineering I:
- Test automation: xUnit frameworks and how to use them
- Testing patterns: Mock objects/test doubles, regression testing, triangulation, learning tests, fixtures, exception testing, ...
- Test-driven development (TDD): General process (red, green, refactor), TDD patterns, examples, preconditions, limitations, and benefits
- Behavior-driven development (BDD): Differences from TDD, ubiquitous language
- Acceptance testing: Contrast with unit tests, FIT, Cucumber/Lettuce, Selenium
- Property-based testing: QuickCheck, Hypothesis
- Test quality: Coverage, test smells, test refactorings, test performance, mutation testing
- Non-functional testing: Dependability (reliability, availability, resilience, stress testing, ...), performance (throughput, response time, ...), compliance (e.g. coding conventions), compatibility, ...
- Scopes of testing: Unit/integration/system tests, black box/white box testing, user/acceptance vs. unit testing, ...
- Testing and continuous integration
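Several of the topics above (xUnit frameworks, fixtures, mock objects, exception testing) can be illustrated together with Python's standard `unittest` module. This is only a sketch and not material from the lecture: `fetch_star_count`, the client object, and its `get_repository` method are hypothetical names.

```python
import unittest
from unittest import mock


def fetch_star_count(client, repo_name):
    """Return the star count for a repository using the given client."""
    response = client.get_repository(repo_name)
    return response["stars"]


class FetchStarCountTest(unittest.TestCase):
    def setUp(self):
        # Fixture: a mock object standing in for a real API client.
        self.client = mock.Mock()

    def test_returns_star_count_from_client(self):
        self.client.get_repository.return_value = {"stars": 42}
        self.assertEqual(fetch_star_count(self.client, "example/repo"), 42)
        self.client.get_repository.assert_called_once_with("example/repo")

    def test_missing_field_raises_key_error(self):
        # Exception testing: a response without "stars" should fail loudly.
        self.client.get_repository.return_value = {}
        with self.assertRaises(KeyError):
            fetch_star_count(self.client, "example/repo")


def run_tests():
    """Run the test case programmatically and return the unittest result."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(FetchStarCountTest)
    return unittest.TextTestRunner(verbosity=0).run(suite)
```

Calling `run_tests()` executes both tests without touching any real service, because the mock replaces the client entirely.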