zbMATH — the first resource for mathematics

Harnessing static analysis to help learn pseudo-inverses of string manipulating procedures for automatic test generation. (English) Zbl 07228507
Beyer, Dirk (ed.) et al., Verification, model checking, and abstract interpretation. 21st international conference, VMCAI 2020, New Orleans, LA, USA, January 16–21, 2020. Proceedings. Cham: Springer (ISBN 978-3-030-39321-2/pbk; 978-3-030-39322-9/ebook). Lecture Notes in Computer Science 11990, 180-201 (2020).
Summary: We present a novel approach based on supervised machine-learning for inverting String Manipulating Procedures (SMPs), i.e., given an SMP \(p:\bar{\Sigma}\rightarrow\bar{\Sigma}\), we compute a partial pseudo-inverse function \(p^{-1}\) such that given a target string \(t\in\overline{\Sigma}\), if \(p^{-1}(t)\ne \bot\) then \(p(p^{-1}(t))=t\). The motivation for addressing this problem is the difficulties faced by modern symbolic execution tools, e.g., KLEE, to find ways to execute loops inside SMPs in a way which produces specific outputs required to enter a specific branch. Thus, we find ourselves in a pleasant situation where program analysis assists machine learning to help program analysis.
Our basic attack on the problem is to train a machine learning algorithm using (output, input) pairs generated by executing \(p\) on random inputs. Unfortunately, naively applying this technique is extremely expensive due to the size of the alphabet. To remedy this situation, we present a specialized static analysis algorithm that can drastically reduce the size of the alphabet \(\Sigma\) from which examples are drawn without sacrificing the ability to cover all the behaviors of the analyzed procedure. Our key observation is that often a procedure treats many characters in a particular uniform way: it only copies them from the input to the output in an order-preserving fashion. Our static analysis finds these good characters so that our learning algorithm may consider examples coming from a reduced alphabet containing a single representative good character, thus allowing to produce smaller models while using fewer examples than had the full alphabet been used. We then utilize the learned pseudo-inverse function to invert specific desired outputs by translating a given query to and from the reduced alphabet.
We implemented our approach using two machine learning algorithms and show that indeed our string inverters can find inputs that can drive a selection of procedures taken from real-life software to produce desired outputs, whereas KLEE, a state-of-the-art symbolic execution engine, fails to find such inputs.
For the entire collection see [Zbl 1429.68006].
68Q60 Specification and verification (program logics, model checking, etc.)
Full Text: DOI