As a speaker of Greek, I’ve been fixing issues in the handling of non-ASCII characters for over 40 years, using techniques ranging from simple lookup tables to dynamic patching of in-memory images. Here’s how I debugged and fixed the handling of UTF-8 characters in the git grep command, which was broken for almost a decade.
It all started with a failing continuous integration task. The CI task used the following git grep command to identify whole-line comments longer than 80 characters. It looks for the characters used at the beginning of such comments, followed by at least 78 characters of any type.
git grep -nE '^(( \* )|(-- )|(// )|(# .)).{78,}'
This task worked fine on Linux, but failed on macOS on lines close to the limit that included UTF-8 characters. I verified this as a standalone command. Strangely, the standalone Unix egrep command worked fine. Based on this, I created a minimal, reproducible example and, armed with it, tapped into StackOverflow’s collective wisdom. As I didn’t hear anything back, I decided to investigate it on my own. How hard could it be?
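Why lines "close to the limit" are the ones affected can be sketched in a few lines. The example below uses Python only for illustration (git grep itself is written in C): a regular expression compiled over bytes stands in for a regex engine without multibyte support, and one compiled over text stands in for a UTF-8-aware engine. The function name is_too_long and the sample lines are assumptions made for this sketch.

```python
import re

# The pattern from the CI task: a comment opener followed by at least 78 characters.
PATTERN = r"^(( \* )|(-- )|(// )|(# .)).{78,}"

char_re = re.compile(PATTERN)                  # '.' consumes one character
byte_re = re.compile(PATTERN.encode("ascii"))  # '.' consumes one byte


def is_too_long(line, multibyte=True):
    """Report whether the check flags the line, counting characters or bytes."""
    if multibyte:
        return char_re.search(line) is not None
    return byte_re.search(line.encode("utf-8")) is not None


# A 75-character comment: a 3-character opener plus 72 Greek letters.
# Each Greek letter takes 2 bytes in UTF-8, so the line is 147 bytes long.
greek_line = "// " + "α" * 72
ascii_line = "// " + "a" * 72

print(is_too_long(greek_line, multibyte=True))   # counted as characters
print(is_too_long(greek_line, multibyte=False))  # counted as bytes
print(is_too_long(ascii_line, multibyte=False))  # ASCII is unaffected
```

A short line made of multibyte characters is flagged only when the engine counts bytes, which is consistent with the failures appearing on macOS but not on Linux.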
First I downloaded and compiled Git on macOS from source, to ensure this wasn’t a problem in Apple’s port. For such a large system, this proved to be surprisingly easy and fast (once I used make -j to run a parallel build). Indeed, the problem continued to manifest itself.
My first thought was a bug or a lack of UTF-8 support in Apple’s regex(3) library. The documentation was not particularly clear, especially given the behavior I was witnessing. However, thanks to reg.c, a very clear and small regexp driver I found on StackOverflow, I was able to try a few examples, and verify that Apple’s library was indeed correctly supporting UTF-8.
I then tried to look at how the library routines were called. For this I compiled git-grep with debugging support enabled (make DEBUG=1) and used lldb(1) to insert breakpoints, and examine the calls to regcomp(3) and regexec(3). I hadn’t used lldb(1) before, but the transition from gdb(1) proved quite easy, thanks to its built-in help command. The checks didn’t reveal any unexpected arguments or flags, and yet the routines were failing.
At that point I had at hand a very powerful tool: the ability to examine the difference between the working regexp driver program and the failing git grep minimal example. I compiled the driver with debugging support and looked at the routine calls. A dump of the regex_t argument revealed something very strange. The internal representation of the argument was completely different between the two programs. A search for the names of some fields used in the git grep case (using git grep of course), revealed that git grep was not using Apple’s native library, but the GNU regexp library distributed with Git under the compat/regex directory. The mystery deepened, because this library handles UTF-8 characters just fine under Linux. Why and how was this library used, and why wasn’t it supporting UTF-8?
To answer this question I built git grep again passing V=1 for verbose output to the make command. This showed the arguments passed to the compiler, and revealed that Git was getting compiled with the flag -DNO_MBSUPPORT. A quick search for this name (using git grep again) showed that this was defined in the Makefile under the NO_REGEX condition.
ifdef NO_REGEX
compat/regex/regex.sp compat/regex/regex.o: EXTRA_CPPFLAGS = \
-DGAWK -DNO_MBSUPPORT
endif
A further search for NO_REGEX in the Makefile led me to the place where this was defined for Darwin (macOS) compilations.
ifeq ($(uname_S),Darwin)
[…]NO_REGEX = YesPlease
PTHREAD_LIBS =
endif
Now I needed to know why Git wasn’t using the native Apple library.
For this I ran git blame on the Makefile. Thankfully, Git, in common with most top-tier open source projects, faithfully records and preserves its history. Thus git blame pointed to a 2013 commit associated with the NO_REGEX definition.
$ git blame Makefile | grep 'NO_REGEX = YesPlease'
29de20504e9 (David Aguilar 2013-05-11 01:22:26 -0700 1430) NO_REGEX = YesPlease
I could then run git show on the commit to see why this change had been introduced.
$ git show 29de20504e9
commit 29de20504e9790785fe1698300755323f74972aa
Author: David Aguilar <davvid@gmail.com>
Date:   Sat May 11 01:22:26 2013 -0700

    Makefile: fix default regex settings on Darwin

    t0070-fundamental.sh fails on Mac OS X 10.8:

        $ uname -a
        Darwin lustrous 12.2.0 Darwin Kernel Version 12.2.0:
        Sat Aug 25 00:48:52 PDT 2012;
        root:xnu-2050.18.24~1/RELEASE_X86_64 x86_64

        $ ./t0070-fundamental.sh -v
        fatal: regex bug confirmed: re-build git with NO_REGEX=1

    Fix it by using Git's regex library.

    Reviewed-by: Jonathan Nieder <jrnieder@gmail.com>
    Signed-off-by: David Aguilar <davvid@gmail.com>
    Signed-off-by: Junio C Hamano <gitster@pobox.com>

diff --git a/Makefile b/Makefile
index 0f931a2030..f698c1a59e 100644
--- a/Makefile
+++ b/Makefile
@@ -1054,6 +1054,7 @@ ifeq ($(uname_S),Darwin)
BASIC_LDFLAGS += -L/opt/local/lib
endif
endif
+ NO_REGEX = YesPlease
PTHREAD_LIBS =
endif
Knowing that the one-line change was associated with a once-failing test case, I could easily remove it to reintroduce the use of Apple’s native library, recompile, and run the tests again to see whether they would still fail today, or whether the associated problem in the library had been fixed in the meantime. Indeed, the newly-compiled program passed all test cases (on macOS). Frustratingly, however, it still refused to handle UTF-8 characters correctly.
Knowing that the correct operation of regular expressions on multibyte characters required a call to setlocale(3), I started searching and reading the code to verify that this was indeed getting called. It was again a mystery why git grep was working correctly on Linux, but still not on macOS. A little later I discovered (you guessed it, using git grep) that the call to setlocale(3) was taking place as a side-effect of the gettext(3) initialization. As gettext(3) isn’t configured under macOS, this initialization wasn’t taking place. I fixed this by moving the call to setlocale(3) into Git’s main routine in common-main.c.
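The dependency described above can be modeled in a few lines. This is only a sketch, in Python rather than Git's C, and the names init_gettext, adopt_environment_locale, and startup are hypothetical stand-ins, not Git functions. It assumes, as in C, that a program starts in the "C" locale and leaves it only when something calls setlocale with an empty locale name.

```python
import locale


def adopt_environment_locale():
    """Switch LC_CTYPE to whatever the environment (LANG, LC_ALL, ...) requests."""
    try:
        locale.setlocale(locale.LC_CTYPE, "")
    except locale.Error:
        pass  # the environment names a locale this system lacks; keep the current one


def init_gettext(configured):
    """Hypothetical stand-in for a gettext setup step that sets the locale as a side effect."""
    if configured:
        adopt_environment_locale()


def startup(gettext_configured, set_locale_in_main):
    """Return the LC_CTYPE locale in effect once initialization has finished."""
    locale.setlocale(locale.LC_CTYPE, "C")  # model the state a C program starts in
    if set_locale_in_main:
        adopt_environment_locale()          # the fix: no longer depends on gettext
    init_gettext(gettext_configured)
    return locale.setlocale(locale.LC_CTYPE)  # no second argument: query only


print(startup(gettext_configured=False, set_locale_in_main=False))  # the failing setup
print(startup(gettext_configured=False, set_locale_in_main=True))   # after the fix
```

With gettext not configured and no call in the main routine, the process stays in the "C" locale, so a regex library has no way of knowing that the input is UTF-8.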
In preparation for submitting a patch, I created a test case (which required its own round of code reading to understand the test system’s mechanics) and registered with GitGitGadget in order to submit the patch through a GitHub pull request. A nice feature of GitGitGadget is that it runs 40 integration tests on diverse platforms. Unfortunately, a few of these runs revealed that the new test case was failing on some platforms. The reason for the failure was that the new test case exposed an already-existing fault that the corresponding platforms didn’t address.
Given that this fault had existed for years due to issues in the platforms’ native libraries, I addressed the failures by configuring these tests to be skipped if run on platforms using the internal regexp() library compiled without multibyte regex support, or using native libraries again lacking such support. This required some more debugging to find and add a missing configuration variable in the Makefile, and ensure that the regex test helper program was also operating under the correct setlocale(3).
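The idea of skipping a test when the platform's regex engine lacks multibyte support can be sketched as a capability probe. This is an illustration in Python, not the mechanism Git's test suite uses; supports_multibyte, run_or_skip, and the two engine adapters are hypothetical names. As a stand-in for an engine built without multibyte support, the sketch matches over UTF-8 bytes.

```python
import re


def char_engine(pattern, text):
    """Adapter for an engine that treats a multibyte character as one unit."""
    return re.search(pattern, text)


def byte_engine(pattern, text):
    """Adapter standing in for an engine without multibyte support."""
    return re.search(pattern.encode("utf-8"), text.encode("utf-8"))


def supports_multibyte(search):
    """Probe: does a single '.' match one whole non-ASCII character?"""
    return bool(search("^.$", "α"))


def run_or_skip(search, test):
    """Run the test only when the probe succeeds; otherwise report a skip."""
    if not supports_multibyte(search):
        return "skipped"
    return "passed" if test(search) else "failed"


def greek_test(search):
    """A test that is only meaningful on a multibyte-aware engine."""
    return bool(search("^.{3}$", "αβγ"))


print(run_or_skip(char_engine, greek_test))
print(run_or_skip(byte_engine, greek_test))
```

Probing the capability, rather than listing platform names, keeps the skip condition tied to the actual behavior of whichever library was compiled in.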
One final hurdle to overcome was the patch submission. I started by using GitGitGadget, but after it sent a followup patch with an embarrassingly duplicated commit message, I switched to git-send-email, which offers tighter control and better visibility of the process. The process was made easy thanks to the excellent git-send-email tutorial. After a few additional tweaks to the code prompted by Git’s maintainer, Junio Hamano, and several weeks of patience, the fix was merged into Git’s master branch to be made available in Git 2.39.
To conclude, the experience of debugging and fixing git grep UTF-8 support on macOS demonstrated to me, yet again, the importance of several of the 66 recommendations I make in the Effective Debugging book.
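The examples from this investigation were: comparing a working program against the failing one (reg.c vs git grep); verbose builds (make V=1); parallel builds (make -j); version-control history (git blame); debug builds (make DEBUG=1); and data-structure dumps (regex_t values).
Last modified: Sunday, October 9, 2022 7:19 pm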
Unless otherwise expressly stated, all original material on this page created by Diomidis Spinellis is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.