id: 05949688
dt: a
an: 05949688
au: Wartell, Richard; Zhou, Yan; Hamlen, Kevin W.; Kantarcioglu, Murat; Thuraisingham, Bhavani
ti: Differentiating code from data in x86 binaries.
so: Gunopulos, Dimitrios (ed.) et al., Machine learning and knowledge discovery in databases. European conference, ECML PKDD 2011, Athens, Greece, September 5‒9, 2011. Proceedings, Part III. Berlin: Springer (ISBN 978-3-642-23807-9/pbk). Lecture Notes in Computer Science 6913. Lecture Notes in Artificial Intelligence, 522-536 (2011).
py: 2011
pu: Berlin: Springer
la: EN
cc:
ut: statistical data compression; segmentation; classification; x86 binary disassembly
ci:
li: doi:10.1007/978-3-642-23808-6_34
ab: Summary: Robust, static disassembly is an important part of achieving high coverage for many binary code analyses, such as reverse engineering, malware analysis, reference monitor in-lining, and software fault isolation. However, one of the major difficulties current disassemblers face is differentiating code from data when they are interleaved. This paper presents a machine learning-based disassembly algorithm that segments an x86 binary into subsequences of bytes and then classifies each subsequence as code or data. The algorithm builds a language model from a set of pre-tagged binaries using a statistical data compression technique. It sequentially scans a new binary executable and sets a breaking point at each potential code-to-code and code-to-data/data-to-code transition. The classification of each segment as code or data is based on the minimum cross-entropy. Experimental results are presented to demonstrate the effectiveness of the algorithm.
rv:
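
The classification step described in the summary (assign each segment to the class whose language model yields the minimum cross-entropy) can be illustrated with a minimal Python sketch. This is not the authors' implementation. The summary does not name the compression technique, so an order-2 byte context model with add-one smoothing is used here as a simple stand-in. The names `ByteModel` and `classify`, the toy training bytes, and the assumption that segment boundaries are already known are all hypothetical and chosen for illustration only; the segmentation step is not shown.

```python
import math
from collections import defaultdict


class ByteModel:
    """Order-k byte context model with add-one smoothing.

    A stand-in for the statistical compression model mentioned in the
    summary; the real technique is not specified there.
    """

    def __init__(self, order=2):
        self.order = order
        self.counts = defaultdict(lambda: defaultdict(int))
        self.totals = defaultdict(int)

    def train(self, sample: bytes) -> None:
        # Count each byte under the (up to) `order` bytes that precede it.
        for i in range(len(sample)):
            ctx = sample[max(0, i - self.order):i]
            self.counts[ctx][sample[i]] += 1
            self.totals[ctx] += 1

    def cross_entropy(self, segment: bytes) -> float:
        """Average bits per byte needed to encode `segment` under this model."""
        if not segment:
            return float("inf")
        bits = 0.0
        for i in range(len(segment)):
            ctx = segment[max(0, i - self.order):i]
            seen = self.counts.get(ctx)
            count = seen.get(segment[i], 0) if seen else 0
            total = self.totals.get(ctx, 0)
            # Add-one smoothing over the 256 possible byte values.
            bits -= math.log2((count + 1) / (total + 256))
        return bits / len(segment)


def classify(segment: bytes, code_model: ByteModel, data_model: ByteModel) -> str:
    """Label a segment with the class whose model gives the lower cross-entropy."""
    code_bits = code_model.cross_entropy(segment)
    data_bits = data_model.cross_entropy(segment)
    return "code" if code_bits <= data_bits else "data"


# Toy "pre-tagged" training material (illustrative bytes, not a real corpus).
code_sample = bytes.fromhex("558bec83ec088b45088945fc8b45fcc9c3") * 20
data_sample = b"Hello, world!\x00Error: file not found\x00" * 20

code_model = ByteModel(order=2)
code_model.train(code_sample)
data_model = ByteModel(order=2)
data_model.train(data_sample)

print(classify(bytes.fromhex("558bec83ec08c9c3"), code_model, data_model))
print(classify(b"Error: world!", code_model, data_model))
```

In this sketch a segment that reuses byte contexts seen in the tagged code sample costs fewer bits under the code model than under the data model, and is therefore labeled as code; the reverse holds for string-like bytes. A realistic setting would train on a large corpus of tagged binaries and would need the segmentation pass to propose the candidate boundaries first.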