Research Interests

Explainable Time-travelling Debugging

Debugging, or fault localization, is considered one of the most time-consuming activities in software development. When a bug occurs, programmers need not only to pinpoint the root cause but also to understand it deeply enough to fix it. Automated debugging approaches typically suffer from the fundamental problem of missing specifications, i.e., no specification of the intended behavior of the code is available. To tackle this problem, our ICSE'17 paper describes a trace-travelling debugging approach, which takes programmers’ feedback as a partial code specification and interactively recommends suspicious steps on the buggy execution trace. Building on the trace collection technique from this work, we developed a regression fault localization technique that automatically compares a correct trace with a buggy trace to generate an explanation for a regression bug. The source code is available at https://github.com/llmhyy/microbat and https://github.com/llmhyy/tregression (TSE'19). In tackling the debugging problem, we also observed a limitation of dynamic slicing, a traditional approach to locating the program statements relevant to a given statement: dynamic slicing reaches a dead end when a bug is caused by missing code or by code that should have executed but did not. To address this, our ASE'18 paper proposes a data-driven approach that enhances dynamic slicing with a neural network that predicts the location where code, or the execution of code, is missing.
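The dead end of dynamic slicing described above can be illustrated with a minimal sketch. The trace format (`Step`, with `defs`, `uses`, and `control` fields) and the function `dynamic_slice` are hypothetical names chosen for this illustration. They are not the data structures of microbat or tregression, and the sketch is a textbook backward slice, not the approach of the ASE'18 paper.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Step:
    """One executed statement in a recorded trace (hypothetical format)."""
    line: int                       # source line that was executed
    defs: frozenset = frozenset()   # variables written by this step
    uses: frozenset = frozenset()   # variables read by this step
    control: Optional[int] = None   # index of the branch step guarding this one


def dynamic_slice(trace, criterion):
    """Return the indices of all steps that the step at `criterion` depends on."""
    in_slice = set()
    worklist = [criterion]
    while worklist:
        i = worklist.pop()
        if i in in_slice:
            continue
        in_slice.add(i)
        step = trace[i]
        # Data dependence: the latest earlier step that defined each used variable.
        for var in step.uses:
            for j in range(i - 1, -1, -1):
                if var in trace[j].defs:
                    worklist.append(j)
                    break
        # Control dependence: the branch step that allowed this step to run.
        if step.control is not None:
            worklist.append(step.control)
    return sorted(in_slice)


# Trace of an assumed buggy program, run with x = 5:
#   1: x = read()
#   2: y = 0
#   3: if x > 10:      (faulty condition, evaluates to False)
#   4:     y = x       (never executed, so it is absent from the trace)
#   5: print(y)        (prints a wrong value)
trace = [
    Step(line=1, defs=frozenset({"x"})),
    Step(line=2, defs=frozenset({"y"})),
    Step(line=3, uses=frozenset({"x"})),
    Step(line=5, uses=frozenset({"y"})),
]

slice_of_output = dynamic_slice(trace, criterion=3)
```

Slicing from the wrong output at line 5 yields only the steps for lines 2 and 5. The faulty branch at line 3 never enters the slice, because the assignment it should have triggered was not executed and therefore left no dependence in the trace.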

Causal Visualization for AI Models

Deep learning models are widely used for classification tasks. When a model’s performance falls short of expectations, data scientists and programmers must spend considerable effort to understand the root cause. Existing explainable AI techniques answer why a sample is predicted as a certain class, but very few can answer how the predictions are formed during training. We propose a technique, DeepVisualInsight, to visualize how the training samples and the classification boundary/landscape they form evolve during the training process (AAAI'22). DeepVisualInsight is a time-travelling (or record-and-replay) visualization for deep classifiers. Moreover, we propose spatial and temporal properties that any future time-travelling visualization needs to satisfy. Technically, we design approaches to project and inverse-project between the high-dimensional and low-dimensional spaces while preserving a set of required mathematical properties.

AI as an Explanation (in Cybersecurity Systems)

We propose an AI-powered explainable approach to phishing detection, which generates an explanation to justify each decision (USENIX Sec’21). Technically, we recast phishing detection as a computer vision problem of object detection and pattern recognition. The object detection technique locates the logo on the webpage screenshot, while the pattern recognition technique (here, a Siamese neural network) recognizes which brand the logo belongs to. In this way, we use AI techniques to extract the brand intention from the screenshot of a webpage. By comparing the identified brand with the domain of the webpage, we can decide whether a URL/webpage is phishing and explain why. Figure 6 shows our generated explanation for the phishing detection application. The explanation allows us to detect phishing webpages even more precisely. We deployed our system, Phishpedia, with a crawler and discover about 50 new zero-day phishing websites every day.
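The final step, comparing the identified brand with the webpage's domain, can be sketched as follows. This is a minimal illustration under stated assumptions, not Phishpedia's implementation: the brand is passed in as a plain string standing in for the output of the logo detector and Siamese network, and `BRAND_DOMAINS`, `explain_verdict`, and the brand `examplepay` are hypothetical.

```python
from urllib.parse import urlsplit

# Hypothetical reference list: brand name -> domains the brand legitimately uses.
BRAND_DOMAINS = {
    "examplepay": {"examplepay.com"},
}


def explain_verdict(url, recognized_brand):
    """Return (is_phishing, explanation) for a URL and the brand seen on its page.

    `recognized_brand` stands in for the vision pipeline's output; None means
    that no known logo was recognized on the screenshot.
    """
    host = (urlsplit(url).hostname or "").lower()
    if recognized_brand is None:
        return False, "no known brand logo was recognized on the screenshot"
    for domain in BRAND_DOMAINS.get(recognized_brand, set()):
        # Accept the registered domain itself and any of its subdomains.
        if host == domain or host.endswith("." + domain):
            return False, f"logo '{recognized_brand}' is consistent with domain '{host}'"
    return True, f"page shows the '{recognized_brand}' logo but is served from '{host}'"


benign = explain_verdict("https://www.examplepay.com/login", "examplepay")
phish = explain_verdict("http://examplepay.com.login-verify.test/signin", "examplepay")
```

The explanation string records the evidence behind the verdict: which brand the page claims to be and which host actually serves it.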

Search-based Software Testing

Search-based software testing (SBST) treats software testing as an optimization problem: generating test cases that maximize the code coverage of a program. Existing test generators (e.g., EvoSuite and Randoop) and fuzzers (e.g., AFL) have proved successful at covering program branches and discovering software vulnerabilities in practice. We improve SBST in two ways, improving the branch coverage of the state-of-the-art tool EvoSuite by 5-7%. First, we propose a gradient recovery technique that reshapes a discontinuous, flat search space into a continuous, monotonic one (ISSTA'20). Specifically, when we find that the search landscape is flat (e.g., the flag problem in SBST), we use interprocedural program analysis to recover the search gradients. Second, we propose a test seed synthesis technique that generates a “shortcut” solution much closer to the globally optimal solution (ESEC/FSE'21). Starting the search from a good initial seed can greatly improve the efficiency of many meta-heuristic search algorithms. Specifically, we transform the dataflow of a target program branch into a template of the object construction process. In this way, our approach can construct more legitimate objects as inputs and achieve much higher branch coverage.
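The flag problem mentioned above can be shown with a small sketch. All names here (`is_magic`, `flat_distance`, `recovered_distance`, `hill_climb`) are hypothetical. The recovered distance is written by hand for illustration, whereas the ISSTA'20 work derives the gradient through interprocedural analysis, and the simple hill climber stands in for the search algorithms used by real tools.

```python
def is_magic(x):
    # Callee whose boolean result hides the underlying numeric comparison.
    return x == 42


def program_under_test(x):
    flag = is_magic(x)
    if flag:                     # target branch
        return "covered"
    return "missed"


def flat_distance(x):
    """Branch distance computed on the flag alone: 1 everywhere except at the solution."""
    return 0 if is_magic(x) else 1


def recovered_distance(x):
    """Branch distance computed on the comparison inside the callee (hand-written here)."""
    return abs(x - 42)


def hill_climb(distance, start, budget=1000):
    """Move to the better neighbor until the distance is 0 or no neighbor improves."""
    x = start
    for _ in range(budget):
        if distance(x) == 0:
            return x
        best = min((x - 1, x + 1), key=distance)
        if distance(best) >= distance(x):
            return None          # plateau: the search has no gradient to follow
        x = best
    return None


stuck = hill_climb(flat_distance, start=0)
found = hill_climb(recovered_distance, start=0)
```

With the flat distance the search stops immediately because both neighbors look equally bad, so `stuck` is `None`. With the recovered distance every step reduces the distance by one, and `found` is an input that covers the target branch.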