Christian Filippini

Apr 23, 2021 · 5 min read

“Find the Difference” in Python

Difflib: A hidden gem in Python built-in libraries
Christopher Tao · Mar 21 · 8 min read

Recently, I have been focusing on reviewing and trying out Python built-in libraries. This self-assigned task has brought me a lot of fun. There are many interesting features in Python that provide out-of-the-box implementations and solutions to different kinds of problems.

One example is the built-in library I’m going to introduce in this article: Difflib. Because it is built into Python 3, we don’t need to download or install anything; we simply import it as follows.

import difflib as dl

Now we should begin :)

1. Find “Changed” Elements


I believe most of my readers know how to use Git. If you are one of them, you may have seen a raw file with merge conflicts like the following.

<<<<<<< HEAD:file.txt
Hello world
=======
Goodbye
>>>>>>> 77976da35a11db4580b80ae27e8d65caf5208086:file.txt

In most GUI tools for Git, such as Atlassian Sourcetree, you have probably seen a representation like this.

image courtesy: https://blog.sourcetreeapp.com/files/2013/03/hero-small.png

The minus sign indicates that a line of code has been removed in the new version, while the plus sign indicates that a line of code has been added in the new version and does not exist in the old one.

In Python, we can easily implement something very similar using the Difflib with a single line of code. The first function I’m going to show off is context_diff().

Let’s make up two lists with some string elements.

s1 = ['Python', 'Java', 'C++', 'PHP']
s2 = ['Python', 'JavaScript', 'C', 'PHP']

Then, the magic comes. We can generate a comparison “report” as follows.

dl.context_diff(s1, s2)

The context_diff() function returns a Python generator, so we can loop over it to print everything in one go.

for diff in dl.context_diff(s1, s2):
    print(diff)

The output marks the 2nd and the 3rd elements with an exclamation mark (!), indicating that “this element is different”.
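For anyone following along without the screenshot, here is a minimal, self-contained sketch of the comparison above. The name report is my own choice, and because our list elements have no trailing newlines, the sketch strips the line endings that context_diff() puts on its header lines.

```python
import difflib as dl

s1 = ['Python', 'Java', 'C++', 'PHP']
s2 = ['Python', 'JavaScript', 'C', 'PHP']

# context_diff() is lazy, so materialise the generator into a list.
# Header lines end with '\n' while our elements do not, so strip it.
report = [line.rstrip('\n') for line in dl.context_diff(s1, s2)]

for line in report:
    print(line)

# Changed elements carry the prefix "! " and unchanged context carries
# two spaces, so the report contains "! Java" and "! C++" for the old
# list, and "! JavaScript" and "! C" for the new list.
```

The report has two halves: the first lists the old sequence and the second lists the new one, each with its own markers.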

In this case, the 1st and the 4th elements are the same, so they appear only as unmarked context and only the differences are flagged. More interestingly, if we make the scenario more complex, it will show something more useful. Let’s change the two lists as follows.

s1 = ['Python', 'Java', 'C++', 'PHP']
s2 = ['Python', 'Java', 'PHP', 'Swift']

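Looping over dl.context_diff(s1, s2) again now shows “C++” prefixed with a minus sign and “Swift” prefixed with a plus sign. In other words, “C++” has been removed from the first list (original list) and “Swift” has been added to the second list (new list).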
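As a quick sketch of how one might pull those removed and added elements out of the report programmatically, the snippet below filters on the two-character prefixes. The names removed and added are hypothetical helpers for this illustration only.

```python
import difflib as dl

s1 = ['Python', 'Java', 'C++', 'PHP']
s2 = ['Python', 'Java', 'PHP', 'Swift']

report = [line.rstrip('\n') for line in dl.context_diff(s1, s2)]

# Data lines use a two-character prefix: "- " for removed, "+ " for added.
# Header lines start with "***" or "---", so they are not picked up here.
removed = [line[2:] for line in report if line.startswith('- ')]
added = [line[2:] for line in report if line.startswith('+ ')]

print(removed)  # ['C++']
print(added)    # ['Swift']
```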

What if we want to show something closer to the screenshot we got from Sourcetree? Yes, Python allows us to do that. Just use another function called unified_diff().

dl.unified_diff(s1, s2)

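The unified_diff() function “unifies” the two lists into a single listing, with removed elements prefixed by a minus sign and added elements prefixed by a plus sign, which is more readable in my opinion. Like context_diff(), it returns a generator.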
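Here is a small runnable sketch of the unified format using the same two lists. The fromfile and tofile arguments are optional labels for the header lines; 'old' and 'new' are placeholder names I picked for this example.

```python
import difflib as dl

s1 = ['Python', 'Java', 'C++', 'PHP']
s2 = ['Python', 'Java', 'PHP', 'Swift']

# Header lines end with '\n' while our elements do not, so strip it
# to keep the printed output tidy.
unified = [
    line.rstrip('\n')
    for line in dl.unified_diff(s1, s2, fromfile='old', tofile='new')
]

for line in unified:
    print(line)

# Prints:
# --- old
# +++ new
# @@ -1,4 +1,4 @@
#  Python
#  Java
# -C++
#  PHP
# +Swift
```

Note that the unified format uses a single-character prefix (space, minus or plus), whereas the context format uses two characters.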

2. Indicate the Difference Accurately


In section 1, we focused on identifying differences at the row level. Can it be more specific? For example, what if we want to compare character by character and show exactly what the difference is?

The answer is absolutely yes. We can use the ndiff() function.

Suppose we are comparing words in two lists as follows.

['tree', 'house', 'landing']
['tree', 'horse', 'lending']

The two lists look very similar: only the 2nd and the 3rd words have a single letter changed. Let’s see what the function will return to us.

It shows not only what has been changed (with minus and plus signs) and what has not changed (“tree” is not changed, so there is no indicator at all), but also which letter has been changed. The ^ indicators are placed under the letters that have been identified as different.
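The call itself can be sketched as follows. The variable names words1, words2 and result are my own; the guide lines that start with “?” are generated by ndiff() and end with a newline, so the sketch strips it before printing.

```python
import difflib as dl

words1 = ['tree', 'house', 'landing']
words2 = ['tree', 'horse', 'lending']

# ndiff() also returns a generator; collect it and normalise line endings.
result = [line.rstrip('\n') for line in dl.ndiff(words1, words2)]

for line in result:
    print(line)

# Line prefixes used by ndiff():
#   "  "  element present in both lists
#   "- "  element only in the first list
#   "+ "  element only in the second list
#   "? "  guide line pointing at the characters that differ
```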

3. Get Close Matches


Have you ever typed “teh” and had it automatically corrected to “the”? I bet you have; this happens to me all the time :)

Now, with the Difflib, you potentially can implement this feature in your Python application very easily. The key is to use the get_close_matches() function.

Suppose we have a list of candidates and an “input”. This function can help us pick the one(s) that are close to the “input”. Let’s have a look at the example below.

dl.get_close_matches('thme', ['them', 'that', 'this'])

It successfully picked up “them” when we input “thme” (a typo) rather than “this” or “that” because “them” is the closest one. However, the function is not guaranteed to return something. That is, when there is nothing close enough to the input, an empty list will be returned.
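Both behaviours can be checked with a short sketch. The input 'xyz' is a made-up string that shares no letters with any candidate, chosen only to trigger the empty result.

```python
import difflib as dl

candidates = ['them', 'that', 'this']

# A typo that is close to exactly one candidate.
close = dl.get_close_matches('thme', candidates)
print(close)       # ['them']

# An input that resembles none of the candidates.
none_close = dl.get_close_matches('xyz', candidates)
print(none_close)  # []
```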

This makes perfect sense because we are not making typos all the time, right? :)

In fact, we can control how similar a candidate must be by passing an argument to the parameter cutoff. It expects a float between 0 and 1 (the default is 0.6). A larger number is stricter, and a smaller number is more lenient.

A cutoff of 0.1 is small enough to let the function match all the candidates with the input. What if we want to limit the number of returned matches? We can use the other parameter, n (default 3), which returns the “top n” closest matches.
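A minimal sketch of both parameters is shown below. The value 0.9 for the strict case is just an illustrative threshold I chose; the variable names are hypothetical as well.

```python
import difflib as dl

candidates = ['them', 'that', 'this']

# A very low cutoff lets every candidate through, best match first.
loose = dl.get_close_matches('thme', candidates, cutoff=0.1)
print(loose)

# n caps the number of results, keeping only the closest one here.
top_one = dl.get_close_matches('thme', candidates, n=1, cutoff=0.1)
print(top_one)  # ['them']

# A very high cutoff rejects even the best candidate.
strict = dl.get_close_matches('thme', candidates, cutoff=0.9)
print(strict)   # []
```

The results are ordered by similarity score, so “them” comes first in the loose case.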

4. How to Modify a String A to B?


If you have some knowledge of Information Retrieval, the above functions may remind you of Levenshtein Distance, which tries to find the difference between two textual terms and measure the “distance” between them. Strictly speaking, Difflib does not compute Levenshtein Distance: according to the Python documentation, its matching is based on an algorithm similar to the Ratcliff/Obershelp “gestalt pattern matching” approach. The idea of scoring how far apart two strings are is closely related, though.

Typically, the Levenshtein distance is defined as the minimum number of substitutions, insertions and deletions needed to modify term A into term B. Sometimes, different modifications are assigned different weights. In this article, I will skip the algorithm part. You may go to the Wiki page of Levenshtein Distance for details if you are interested.

Levenshtein distance In information theory, linguistics, and computer science, the Levenshtein distance is a string metric for measuring the… en.wikipedia.org
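For readers who would like to see that definition in concrete form, here is a small dynamic-programming sketch. It assumes every substitution, insertion and deletion costs 1, the function name levenshtein is my own, and it is not what Difflib does internally.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character edits that turn a into b."""
    # previous[j] is the distance between the processed prefix of a and b[:j].
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]  # turning a[:i] into '' takes i deletions
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute, or keep if equal
            ))
        previous = current
    return previous[-1]


print(levenshtein('kitten', 'sitting'))  # 3
```

Turning “kitten” into “sitting” takes two substitutions and one insertion, hence the distance of 3.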

In fact, this article is not finished yet. Using Difflib, we can even list the steps needed to turn one string into another, in the spirit of an edit distance (the steps are valid, although they are not guaranteed to be the minimal edit sequence). This can be done using the SequenceMatcher class in Difflib.

Suppose we have two strings, abcde and fabdc, and we would like to know how the former can be modified into the latter. The first step is to instantiate the class.

s1 = 'abcde'
s2 = 'fabdc'
seq_matcher = dl.SequenceMatcher(None, s1, s2)

Then, we can call the get_opcodes() method on the instance to get a list of tuples, each of which indicates:

The tag of the modifying action (replace, delete, insert or equal)
The start and end positions in the source string
The start and end positions in the destination string

We can “translate” the above information into something more readable.

for tag, i1, i2, j1, j2 in seq_matcher.get_opcodes():
    print(f'{tag:7}   s1[{i1}:{i2}] --> s2[{j1}:{j2}] {s1[i1:i2]!r:>6} --> {s2[j1:j2]!r}')

Super cool :)
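To convince ourselves that these steps really turn the first string into the second, here is a sketch that replays them. The helper apply_opcodes is a hypothetical function written for this illustration; it is not part of Difflib.

```python
import difflib as dl

s1 = 'abcde'
s2 = 'fabdc'
seq_matcher = dl.SequenceMatcher(None, s1, s2)


def apply_opcodes(source, target, opcodes):
    """Rebuild the target by replaying the opcodes over the source."""
    pieces = []
    for tag, i1, i2, j1, j2 in opcodes:
        if tag == 'equal':
            pieces.append(source[i1:i2])  # keep the shared chunk
        elif tag in ('insert', 'replace'):
            pieces.append(target[j1:j2])  # take the new chunk
        # 'delete' contributes nothing to the result
    return ''.join(pieces)


opcodes = seq_matcher.get_opcodes()
for opcode in opcodes:
    print(opcode)
# ('insert', 0, 0, 0, 1)
# ('equal', 0, 2, 1, 3)
# ('insert', 2, 2, 3, 4)
# ('equal', 2, 3, 4, 5)
# ('delete', 3, 5, 5, 5)

rebuilt = apply_opcodes(s1, s2, opcodes)
print(rebuilt)  # fabdc
```

Reading the tuples in order: insert “f”, keep “ab”, insert “d”, keep “c”, then delete “de”.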

You may also have noticed that the first argument we passed to the SequenceMatcher class is None. This parameter, isjunk, accepts a function that tells the algorithm which elements should be treated as “junk”. Junk elements are not removed from the strings; they are simply ignored when the algorithm looks for matching blocks. Passing None means that nothing is treated as junk. A bit difficult to understand, I know. Let me show you an example.

seq_matcher = dl.SequenceMatcher(lambda c: c in 'abc', s1, s2)

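Here the letters “a”, “b” and “c” are treated as junk: the algorithm will not start a match on them, so with these inputs only “d” is reported as equal, and the junk letters fall into the replaced chunks as a whole.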
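The effect is easiest to see side by side. This sketch builds one matcher without junk and one with the lambda above; the names plain_ops and junk_ops are my own.

```python
import difflib as dl

s1 = 'abcde'
s2 = 'fabdc'

plain = dl.SequenceMatcher(None, s1, s2)
with_junk = dl.SequenceMatcher(lambda c: c in 'abc', s1, s2)

plain_ops = plain.get_opcodes()
junk_ops = with_junk.get_opcodes()

for tag, i1, i2, j1, j2 in junk_ops:
    print(tag, repr(s1[i1:i2]), '-->', repr(s2[j1:j2]))
# replace 'abc' --> 'fab'
# equal 'd' --> 'd'
# replace 'e' --> 'c'
```

Without junk the matcher anchors on “ab” and “c”; with junk it can only anchor on “d”, so the reported steps are quite different.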

Last but not least, the example below shows how we may use this feature in practice.
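One practical use, sketched here with the strings from the SequenceMatcher example in the Python documentation (my assumption of a realistic case, not necessarily the one originally pictured), is to treat blanks as junk when comparing two lines of code.

```python
import difflib as dl

old_line = 'private Thread currentThread;'
new_line = 'private volatile Thread currentThread;'

# Treat the space character as junk so matches are not anchored on blanks.
matcher = dl.SequenceMatcher(lambda c: c == ' ', old_line, new_line)

similarity = matcher.ratio()
print(round(similarity, 3))  # 0.866

blocks = matcher.get_matching_blocks()
for block in blocks:
    print(f'a[{block.a}] and b[{block.b}] match for {block.size} elements')
```

A ratio() above 0.6 is commonly taken to mean the two sequences are close matches, so these two lines would count as “the same line, edited”.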

Summary


In this article, I have introduced another Python built-in library called Difflib. It can generate reports that indicate the differences between two lists or two strings. It can also help us find the closest matching strings between an input and a list of candidate strings. Finally, we can use the SequenceMatcher class of this module to achieve more complex and advanced functions.

If you are interested in other Python built-in libraries, please check the posts below.