Seven reasons to add Unix command line expertise to your tool chest
On Tuesday, March 17th, 2020, my free massive open online course (MOOC)
on the use of Unix command line tools
for data, software, and production engineering
goes live on the edX platform.
Already more than one thousand participants from around the world
have registered for it;
you should still be able to enroll through
this link.
In response to the course’s announcement,
seasoned researchers from around the world have commented that this is an
indispensable course
and that it is
very hard to beat the ROI of acquiring this skillset, both for academia and industry.
In an age of shiny IDEs and cool GUI tools, what are the reasons for
the enduring utility and popularity of the Unix command line tools?
Here’s my take.
Continue reading "Seven reasons to add Unix command line expertise to your tool chest"
Last modified: Monday, March 16, 2020 0:34 am
How to Perform Set Operations on Terabyte Files
The Unix sort command can efficiently handle files of arbitrary size
(think of terabytes).
It does this
by loading into main memory all the data that can fit into it (say 16GB),
sorting that data efficiently using an O(N log N) algorithm,
and then merge-sorting the chunks with a linear complexity O(N) cost.
If the number of sorted chunks is higher than the number of file descriptors
that the merge operation can simultaneously keep open
(typically more than 1000),
then sort will recursively merge-sort intermediate merged files.
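
The chunk-and-merge strategy described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the actual implementation of sort: the names external_sort, chunk_size, and max_open are hypothetical, chunk_size stands in for "all the data that fits in main memory", every input line is assumed to end with a newline, and ordering follows Python's string comparison rather than the locale rules sort may apply.

```python
import heapq
import itertools
import os
import tempfile


def _write_chunk(sorted_lines):
    """Store already-sorted lines in a temporary file and return its path."""
    fd, path = tempfile.mkstemp(text=True)
    with os.fdopen(fd, "w") as out:
        out.writelines(sorted_lines)
    return path


def _merge_files(paths):
    """Lazily merge sorted files, holding one current line per file."""
    files = [open(p) for p in paths]
    try:
        yield from heapq.merge(*files)
    finally:
        for f in files:
            f.close()


def external_sort(lines, chunk_size, max_open=1000):
    """Yield newline-terminated `lines` in sorted order.

    chunk_size: number of lines sorted in memory at once (hypothetical knob).
    max_open:   number of chunk files a single merge pass may keep open.
    """
    if max_open < 2:
        raise ValueError("max_open must be at least 2")
    paths = []
    lines = iter(lines)
    try:
        # Phase 1: sort memory-sized chunks and spill each one to disk.
        while True:
            chunk = sorted(itertools.islice(lines, chunk_size))
            if not chunk:
                break
            paths.append(_write_chunk(chunk))
        # Phase 2: with too many chunks, merge groups into intermediate files.
        while len(paths) > max_open:
            group, paths = paths[:max_open], paths[max_open:]
            paths.append(_write_chunk(_merge_files(group)))
            for p in group:
                os.remove(p)
        # Phase 3: a final merge streams out the fully sorted result.
        yield from _merge_files(paths)
    finally:
        for p in paths:
            os.remove(p)
```

Because the result is a generator, the sorted output can be written out line by line without ever holding the whole data set in memory.
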
Once you have at hand sorted files with unique elements,
you can efficiently perform set operations with them through linear
complexity O(N) operations.
Here is how to do it.
Continue reading "How to Perform Set Operations on Terabyte Files"
Last modified: Tuesday, April 3, 2018 8:44 pm
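
As a rough illustration of the linear-time set operations mentioned in this excerpt, the following Python sketch walks two sorted, duplicate-free sequences in a single pass each. It is not necessarily the approach taken in the full post; the function names are hypothetical, and None is used as an end-of-input sentinel on the assumption that the elements are text lines and therefore never None.

```python
import heapq


def intersection(a, b):
    """Yield elements present in both sorted, duplicate-free iterables."""
    a, b = iter(a), iter(b)
    x, y = next(a, None), next(b, None)
    while x is not None and y is not None:
        if x < y:
            x = next(a, None)        # x cannot appear in b: skip it
        elif y < x:
            y = next(b, None)        # y cannot appear in a: skip it
        else:
            yield x                  # common element: advance both sides
            x, y = next(a, None), next(b, None)


def difference(a, b):
    """Yield elements of `a` that are absent from `b`."""
    a, b = iter(a), iter(b)
    x, y = next(a, None), next(b, None)
    while x is not None:
        if y is None or x < y:
            yield x                  # nothing left in b can equal x
            x = next(a, None)
        elif y < x:
            y = next(b, None)
        else:
            x, y = next(a, None), next(b, None)


def union(a, b):
    """Yield every element of either input exactly once, in sorted order."""
    previous = None
    for item in heapq.merge(a, b):
        if item != previous:         # equal items arrive back to back
            yield item
        previous = item
```

Each function consumes every input element at most once, which is where the O(N) cost comes from. The inputs can be open file objects, so the files never need to fit in memory.
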
Monitor Process Progress on Unix
I often run file-processing commands that take many hours to
finish, and I therefore need a way to monitor their progress.
The Perkin-Elmer/Concurrent OS32 system I worked on for a couple
of years back in 1993 (don't ask)
had a facility that displayed for any executing
command the percentage of work that was completed.
When I first saw this facility working on the programs I maintained,
I couldn't believe my eyes, because I was sure that those rusty
Cobol programs didn't contain any functionality to monitor their progress.
Continue reading "Monitor Process Progress on Unix"
Last modified: Monday, October 27, 2008 1:34 pm
Open and Closed Source Kernels Go Head to Head
Earlier today I presented at the
30th International Conference on Software Engineering a
research paper comparing the
code quality of Linux, Windows (its
research kernel distribution),
OpenSolaris, and
FreeBSD.
For the comparison I parsed multiple configurations of these systems (more than ten million lines), and stored the results in four databases, where I could run SQL queries on them. This amounted to 8GB of data, 160 million records.
(I’ve made the databases and the SQL queries available
online.)
The areas I examined were file organization, code structure, code style, preprocessing, and data organization.
To my surprise, there was no clear winner or loser, but there were interesting differences in specific areas.
Continue reading "Open and Closed Source Kernels Go Head to Head"
Last modified: Friday, May 16, 2008 1:44 am
The Treacherous Power of Extended Regular Expressions
I wanted to filter out lines containing the word "line" or a double quote
from a 1GB file.
This can be easily specified as an extended regular expression,
but it turns out that I got more than I bargained for.
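
For readers who want to see the filter itself, here is a minimal Python sketch of the alternation described above. It only shows how such a pattern can be specified; it does not reproduce the behavior discussed in the full post. The names UNWANTED and filter_out are hypothetical, and the pattern is assumed to match "line" as a plain substring.

```python
import re

# Alternation in the style of an extended regular expression:
# match the substring "line" or a double-quote character.
UNWANTED = re.compile(r'line|"')


def filter_out(lines, pattern=UNWANTED):
    """Yield only the lines in which `pattern` finds no match."""
    return (line for line in lines if not pattern.search(line))
```

Note that a substring match also drops lines such as "outlined"; a pattern like r'\bline\b|"' would restrict the first alternative to the whole word.
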
Continue reading "The Treacherous Power of Extended Regular Expressions"
Last modified: Tuesday, August 28, 2007 10:37 am
What Can System Administrators Learn from Programmers?
Although we often hear about program bugs and techniques to get
rid of them, we seldom see a similar focus in the field of system
administration.
This is unfortunate, because increasingly the reliability of an IT system
depends as much on the software comprising the system as on the support
infrastructure hosting it.
Continue reading "What Can System Administrators Learn from Programmers?"
Last modified: Sunday, July 23, 2006 0:10 am
Code Reading Example: the Linux Kernel Load Calculation
A colleague's Linux machine was exhibiting a very high load value,
for no obvious reason.
I wanted him to point the kernel debugger at the routine calculating
the load.
It has been more than 7 years since the last time I worked on a Linux
kernel,
so I had to find my way around from first principles.
This is an annotated and slightly edited version of what I did.
Continue reading "Code Reading Example: the Linux Kernel Load Calculation"
Last modified: Thursday, November 25, 2004 9:40 am