How can one determine whether two binary files (for example, executable images) are somehow similar? I started writing a program to perform this task. Such a program could be useful for determing whether a vendor had included GNU Public License (GPL) code in a propriatary product, violating the GPL license. After writing about 20 lines, I realized that I needed an accurate definition of similarity than the vague "the two files contain a number of identical subsequences" I had in mind.
The accurate definition turns out to be based on information theory. Consider two files containing data with entropy Ha and Hb respectively. If the two files are completely different, then the entropy of their concatenation Hab would be (Ha + Hb) / 2. We can derive an approximation of a file's entropy value by compressing the file. If the file were perfectly compressed, the information in the file Itotal would be equal to the compressed file's length. Then the file's entropy could be calculated as the ratio of the file's information Itotal over the original file's length L:#!/usr/bin/perl
#
# (C) Copyright 2004 Diomidis Spinellis
#
# Permission to use, copy, and distribute this software and its
# documentation for any purpose and without fee is hereby granted,
# provided that the above copyright notice appear in all copies and that
# both that copyright notice and this permission notice appear in
# supporting documentation.
#
# THIS SOFTWARE IS PROVIDED ``AS IS'' AND WITHOUT ANY EXPRESS OR IMPLIED
# WARRANTIES, INCLUDING, WITHOUT LIMITATION, THE IMPLIED WARRANTIES OF
# MERCHANTIBILITY AND FITNESS FOR A PARTICULAR PURPOSE.
# Return the entropy of the file passed as the argument
sub
entropy
{
my($file) = @_;
my($i, $l);
# File length
$l = `wc -c $file`;
# File information (approximation)
$i = `bzip2 -c $file | wc -c`;
print STDERR "$0: warning file size exceeds bzip2 block size\n" if ($l > 900 * 1024);
return ($i / $l);
}
# Return the entropy of the two files passed as arguments
sub
entropy2
{
my($file1, $file2) = @_;
my($oldrs) = ($/);
my($tmp) = ("/tmp/entropy.$$");
undef($/);
open(IN, $file1) || die "read from $file1: $!\n";
binmode(IN);
open(OUT, ">$tmp") || die "write to $tmp: $!\n";
print OUT <IN>;
open(IN, $file2) || die "read from $file2: $!\n";
binmode(IN);
print OUT <IN>;
close(IN);
close(OUT);
my($e) = (entropy($tmp));
unlink($tmp);
return ($e);
}
$#ARGV == 1 || die "Usage $0: file1 file2\n";
printf("%.3g - Entropy of $ARGV[0]\n", $e0 = entropy($ARGV[0]));
printf("%.3g - Entropy of $ARGV[1]\n", $e1 = entropy($ARGV[1]));
printf("%.3g - Combined predicted entropy\n", ($e0 + $e1) / 2);
printf("%.3g - Combined actual entropy\n", entropy2($ARGV[0], $ARGV[1]));
cl *.c |
awk '/: unresolved/{print $8}' |
/usr/bin/sort -u |
sed '
s/^_/void /
s/$/(){}/
' >libstub.c
0.428 - Entropy of \windows\system32\ftp.exe 0.48 - Entropy of \dds\src\port\ftp\ftp.exe 0.454 - Combined predicted entropy 0.466 - Combined actual entropy 0.391 - Entropy of \windows\system32\rcp.exe 0.523 - Entropy of \dds\src\port\rcp\rcp.exe 0.457 - Combined predicted entropy 0.48 - Combined actual entropy 0.421 - Entropy of \windows\system32\finger.exe 0.483 - Entropy of \dds\src\port\finger\finger.exe 0.452 - Combined predicted entropy 0.471 - Combined actual entropyI would expect the combined actual entropy to be a lot lower than the predicted one. I can think of two reasons for the failure:
0.428 - Entropy of \windows\system32\ftp.exe 0.391 - Entropy of \WINDOWS\system32\rcp.exe 0.409 - Combined predicted entropy 0.377 - Combined actual entropy 0.415 - Entropy of \windows\system32\rsh.exe 0.391 - Entropy of \WINDOWS\system32\rcp.exe 0.403 - Combined predicted entropy 0.328 - Combined actual entropy 0.428 - Entropy of \windows\system32\ftp.exe 0.479 - Entropy of \WINDOWS\system32\osuninst.exe 0.453 - Combined predicted entropy 0.446 - Combined actual entropyThe program also detects a large similarity between the executables of version 1.1 and version 1.19 of my outwit winclip program.
0.428 - Entropy of \dds\src\sysutil\outwit\winclip\winclip-1.1.exe 0.452 - Entropy of \dds\src\sysutil\outwit\winclip\winclip-1.19.exe 0.44 - Combined predicted entropy 0.33 - Combined actual entropyThis is impressive, if one considers that the source code differences between the two files are longer that the current version of the source code. Comments Toot! Tweet
Last modified: Friday, March 19, 2004 2:15 pm
Unless otherwise expressly stated, all original material on this page created by Diomidis Spinellis is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.