- Linux Shell Scripting Cookbook (Third Edition)
- Clif Flynt, Sarath Lakshman, Shantanu Tushar
How it works...
The preceding code finds copies of the same file in a directory and removes all but one copy. Let's go through the code and see how it works.
ls -lS lists the details of the files in the current folder sorted by file size. The --time-style=long-iso option tells ls to print dates in the ISO format. awk reads the output of ls -lS and performs comparisons on columns and rows of the input text to find duplicate files.
The logic behind the code is as follows:
- We list the files sorted by size, so files of the same size will be adjacent. The first step in finding identical files is to find ones with the same size. Next, we calculate the checksums of those files. If the checksums match, the files are duplicates, and all but one copy of each set of duplicates is removed.
- The BEGIN{} block of awk is executed before the main processing. It reads the "total" line and initializes the variables. The bulk of the processing takes place in the {} block, while awk reads and processes the rest of the ls output. The END{} block statements are executed after all the input has been read. The output of ls -lS is as follows:
```
total 16
-rw-r--r-- 1 slynux slynux 5 2010-06-29 11:50 other
-rw-r--r-- 1 slynux slynux 6 2010-06-29 11:50 test
-rw-r--r-- 1 slynux slynux 6 2010-06-29 11:50 test_copy1
-rw-r--r-- 1 slynux slynux 6 2010-06-29 11:50 test_copy2
```
- The first line of the output reports the total disk usage, which is not useful here, so we use getline to read it and throw it away. We then need to compare each line with the following line by size. In the BEGIN block, we read the first file line and store its name and size (the eighth and fifth columns, respectively). When awk enters the {} block, the remaining lines are read one by one. This block compares the size obtained from the current line with the previously stored size in the size variable. If they are equal, the two files are duplicates by size and must be further checked with md5sum, as the sketch after this list shows.
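Putting these pieces together, the awk stage can be sketched as follows. This is a minimal reconstruction from the description above, not necessarily the recipe's exact script; in particular, the current line's name and size are saved up front here because getline from a command overwrites $0 and the columns:

```bash
ls -lS --time-style=long-iso | awk '
BEGIN {
    getline                # read and discard the "total" line
    getline                # read the first file line
    name1 = $8; size = $5  # remember its name and size
}
{
    name2 = $8; size2 = $5    # save the fields before getline overwrites them
    if (size == size2) {      # same size as the previous file: candidate pair
        cmd1 = "md5sum " name1; cmd1 | getline; csum1 = $1; close(cmd1)
        cmd2 = "md5sum " name2; cmd2 | getline; csum2 = $1; close(cmd2)
        if (csum1 == csum2) { # checksums match: confirmed duplicates
            print name1
            print name2
        }
    }
    size = size2; name1 = name2   # slide the comparison window forward
}' | sort -u > duplicate_files
```

Like any approach that parses ls output, this sketch assumes filenames without whitespace.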
We have played some tricks on the way to the solution.
The external command output can be read inside awk as follows:
"cmd"| getline
Once the line is read, the entire line is in $0 and each column is available in $1, $2, ..., $n. Here, we read the md5sum checksums of the files into the csum1 and csum2 variables. The name1 and name2 variables store the consecutive filenames. If the checksums of two files are the same, they are confirmed to be duplicates and are printed.
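As a standalone illustration of this trick, the following one-liner (the choice of date as the command is just for demonstration) runs an external program from inside awk and reads its output:

```bash
awk 'BEGIN {
    "date" | getline    # run the command and read one line of its output
    print "Today: " $0  # the whole line is in $0, columns in $1, $2, ...
    close("date")
}'
```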
We need to keep one file from each group of duplicates so that we can remove the others. We calculate the md5sum value of the duplicates and print one file from each group by finding unique lines with uniq -w 32, which compares only the first 32 characters of each line (the md5sum output consists of a 32-character hash followed by the filename). One sample from each group of duplicates is written to unique_files.
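That stage can be sketched as follows (again a sketch based on the description, reading the names collected in duplicate_files and assuming filenames without whitespace):

```bash
xargs md5sum < duplicate_files |  # one "checksum  filename" line per file
    sort |                        # identical checksums become adjacent
    uniq -w 32 |                  # keep the first line of each 32-character hash group
    awk '{ print $2 }' |          # keep only the filename column
    sort -u > unique_files
```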
Now, we need to remove the files listed in duplicate_files, excluding the files listed in unique_files. For that, we use a set difference operation (refer to the recipes on intersection, difference, and set difference): the comm command prints the lines that appear in duplicate_files but not in unique_files. Since comm only processes sorted input, sort -u is used to filter duplicate_files and unique_files.
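As a quick standalone illustration of the set difference (the file names and contents here are hypothetical):

```bash
printf 'a\nb\nc\n' > all.txt  # hypothetical: every duplicate name, sorted
printf 'b\n' > keep.txt       # hypothetical: the one name to keep
comm -2 -3 all.txt keep.txt   # -2 and -3 suppress the other columns; prints a and c
```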
The tee command is used to pass the filenames to the rm command as well as print them. tee sends its input to both stdout and a file. We can also print text to the terminal by redirecting to stderr: /dev/stderr is the device file corresponding to stderr (standard error), so text written to it is printed on the terminal as standard error.
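With that, the removal step can be sketched as a single pipeline (again assuming filenames without whitespace):

```bash
# comm emits the names to delete; tee forwards them to xargs rm and, by
# writing a copy to /dev/stderr, also shows them on the terminal.
comm -2 -3 duplicate_files unique_files | tee /dev/stderr | xargs rm
```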