Intro to Python Sets and Using Them for Deduplication

The aim of this page is to cover Python sets. As beautifully explained in "Set Theory: the Method To Database Madness" by Vaidehi Joshi (Medium), sets are an essential concept for working with data and databases. Python has a built-in set type in both a mutable (set) and an immutable (frozenset) variant, and I use it mostly for deduplication. For example, I have hundreds of data processing jobs with the environment in the suffix of their names (foo-prod1, bar-prod1, acme-dev1, xxx-qa1), and I often need the unique environments quickly: from a list of 200 jobs I get a set of 5 environments. I am also moved to share these notes because of the following claim made on LeetCode:

If I had to choose three built in functions/methods that I wasn’t comfortable with at the start and have found them super helpful, I’d probably say enumerate, zip and set

Sum MegaPost — Python3 Solution with a detailed explanation
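The environment use case from the introduction can be sketched as follows. The job names are made up for illustration, and the sketch assumes the environment is whatever follows the last hyphen in the job name.

```python
# Hypothetical job names; the environment is assumed to be the part after the last hyphen.
jobs = ["foo-prod1", "bar-prod1", "acme-dev1", "xxx-qa1", "baz-dev1"]

# A set comprehension keeps each environment only once.
environments = {job.rsplit("-", 1)[-1] for job in jobs}

print(sorted(environments))  # ['dev1', 'prod1', 'qa1']
```

Sorting is only there to make the printed result predictable, since a set has no defined order.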

  • collection
  • unordered
  • elements are unique
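  • mutable (there is also frozenset, which is immutable)
  • each element must be immutable, or more precisely hashable (like the keys of a dictionary)
  • the literal form is similar to that of dicts: curly braces, but with single items instead of key: value pairs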
>>> s = {333,555,77,32,124}
>>> s
{32, 555, 77, 333, 124}
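A minimal sketch of the two immutability points above: a frozenset offers no way to add items, and a mutable object such as a list cannot be an element of a set. The variable names are illustrative only.

```python
frozen = frozenset([1, 2, 3])

# frozenset has no add() method, so it cannot grow after creation.
print(hasattr(frozen, "add"))  # False

# Elements must be hashable: a tuple works, a list does not.
ok = {(1, 2), (3, 4)}

try:
    bad = {[1, 2], [3, 4]}
except TypeError as err:
    print(err)  # unhashable type: 'list'
```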
  • NOTE: {} is already reserved for creating an empty dictionary, so you need the set() constructor to create an empty set
  • of the 4 main collection types (list, dict, set, tuple), set is the only one without a literal for the empty collection
>>> f = {}
>>> f
{}
>>> type(f)
<class 'dict'>
>>> g = []
>>> g
[]
>>> type(g)
<class 'list'>
>>> b = ()
>>> b
()
>>> type(b)
<class 'tuple'>
>>> e = set()
>>> e
set()
>>> type(e)
<class 'set'>
  • you can create a set from any iterable series
  • any duplicates thereof are discarded
>>> j = set([1,2,2,2,3,4,5,6,11,6,])
>>> j
{1, 2, 3, 4, 5, 6, 11}
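To illustrate "any iterable", here is a short sketch that builds sets from a string, a tuple, a range, and a generator expression. The inputs are arbitrary examples.

```python
# set() accepts any iterable, not only lists.
from_string = set("mississippi")                # characters of a string
from_tuple = set((1, 1, 2, 3, 3))               # a tuple
from_range = set(range(3))                      # a range
from_generator = set(n % 3 for n in range(10))  # a generator expression

print(sorted(from_string))           # ['i', 'm', 'p', 's']
print(from_tuple == {1, 2, 3})       # True
print(from_range == from_generator)  # True
```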
  • this is a fundamental use; note that items of a set cannot be retrieved by their position/index
  • membership is tested with the in and not in operators
>>> j
{1, 2, 3, 4, 5, 6, 11}
>>> j[0]
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: 'set' object is not subscriptable
>>> 11 in j
True
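The not in operator works the same way, and when a position is really needed, one option is to convert the set to a sorted list first. A small sketch, reusing the values of j from above:

```python
j = {1, 2, 3, 4, 5, 6, 11}

print(7 not in j)  # True

# A set has no positions, so convert it to a sorted list when an index is needed.
ordered = sorted(j)
print(ordered[0])   # 1
print(ordered[-1])  # 11

# Iterating over a set works, but the order should not be relied on.
total = 0
for item in j:
    total += item
print(total)  # 32
```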
  • the set constructor is commonly used to efficiently remove duplicate items from a series of objects
>>> l = [1,1,2,4,6,7,1,44,108,108,108]
>>> dedup = set(l)
>>> dedup
{1, 2, 4, 6, 7, 44, 108}
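Because a set is unordered, set(l) does not promise to keep the items in the order in which they first appeared. When that order matters, one common alternative (not part of the notes above, shown here only as a sketch) is dict.fromkeys, which relies on dictionaries preserving insertion order in Python 3.7 and later. The input list is an arbitrary example.

```python
l = [108, 1, 1, 2, 44, 6, 7, 1, 44, 108]

# The set removes duplicates; the resulting order is an implementation detail.
dedup = set(l)

# dict keys are unique and keep insertion order (Python 3.7+).
dedup_ordered = list(dict.fromkeys(l))

print(dedup_ordered)  # [108, 1, 2, 44, 6, 7]
```

If the order is irrelevant, as with the environment names, the plain set is the simpler choice.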

Infrastructure Support Engineer/Technical Writer (snowplow.io) with a passion for Python/writing documentation. More about me: https://pavol.kutaj.com

Pavol Kutaj
