Intro to Python Sets and Using Them for Deduplication

Pavol Kutaj
2 min read · Jan 6, 2022

The aim of this page📝 is to cover Python sets. As beautifully explained in Set Theory: the Method To Database Madness by Vaidehi Joshi (Medium), sets are an essential concept for working with data(bases). The set is a built-in data structure in Python with both a mutable (set) and an immutable (frozenset) type, and I use it mostly for deduplication. For example, I have hundreds of data processing jobs with the environment in their suffix (foo-prod1, bar-prod1, acme-dev1, xxx-qa1) and I often need to quickly get the unique environments (I get a set of 5 environments from a list of 200 jobs). I am also moved to share these notes because of the following claim made on LeetCode:

If I had to choose three built in functions/methods that I wasn’t comfortable with at the start and have found them super helpful, I’d probably say enumerate, zip and set

— Sum MegaPost — Python3 Solution with a detailed explanation

1. attributes

  • collection
  • unordered
  • elements are unique
  • mutable (there is also frozenset, which is immutable)
  • each element must be hashable, which in practice means immutable (like keys of a dictionary)
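
The attributes above can be checked in a few lines. This is a minimal sketch; the names tags, frozen and nested are made up for illustration and do not come from any real project.

```python
# Illustrative names only: tags, frozen and nested are hypothetical.
tags = {"prod", "dev", "prod"}   # the duplicate "prod" is stored only once
tags.add("qa")                   # a set is mutable: elements can be added

frozen = frozenset(tags)         # immutable counterpart of a set
try:
    frozen.add("staging")        # frozenset has no add() method
except AttributeError:
    frozen_is_immutable = True

try:
    bad = {[1, 2], [3, 4]}       # lists are unhashable, so they cannot be elements
except TypeError:
    lists_rejected = True

nested = {frozen, (1, 2)}        # hashable values (frozenset, tuple) are accepted
```

Because a frozenset is hashable, it can itself be an element of another set or a key of a dictionary, which a regular set cannot.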

2. syntax

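  • the literal form is similar to that of dicts: curly braces, but holding single values instead of key: value pairs
  • avoid naming the variable set, because that shadows the built-in set() constructor
>>> s = {333,555,77,32,124}
>>> s
{32, 555, 77, 333, 124}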

3. constructor

  • NOTE: {} is already reserved for creating an empty dictionary, so an empty set must be created with the set() constructor
  • out of the 4 main collection types (list, dict, set, tuple), set is the only one without a literal for the empty collection
>>> f = {}
>>> f
{}
>>> type(f)
<class 'dict'>
>>> g = []
>>> g
[]
>>> type(g)
<class 'list'>
>>> b = ()
>>> b
()
>>> type(b)
<class 'tuple'>
>>> e = set()
>>> e
set()
>>> type(e)
<class 'set'>
  • you can create a set from any iterable series
  • any duplicates thereof are discarded
>>> j = set([1,2,2,2,3,4,5,6,11,6,])
>>> j
{1, 2, 3, 4, 5, 6, 11}
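
The constructor accepts any iterable, not only lists. The following sketch shows a few cases; the variable names are illustrative.

```python
# Any iterable works as input; duplicates are discarded in every case.
letters = set("mississippi")            # a string yields its characters
evens = set(range(0, 10, 2))            # a range yields integers
pairs = set([(1, 2), (1, 2), (3, 4)])   # a list of (hashable) tuples
empty = set()                           # no argument gives an empty set
```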

4. membership

  • this is a fundamental use — note that items of a set cannot be retrieved by their position/index
  • tested with in and not in operators
>>> j
{1, 2, 3, 4, 5, 6, 11}


>>> j[0]
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
TypeError: 'set' object is not subscriptable

>>> 11 in j
True
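
As a short sketch of the points above: membership is tested with in and not in, and when a position is really needed, one option (an assumption about what the reader wants, not the only approach) is to convert the set to a sorted list first.

```python
j = {1, 2, 3, 4, 5, 6, 11}

has_eleven = 11 in j        # membership test
lacks_seven = 7 not in j    # negated membership test

# A set has no positions, so build an ordered sequence before indexing.
ordered = sorted(j)         # sorted() returns a new list
smallest = ordered[0]
```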

5. deduplication

  • the set() constructor is commonly used to efficiently remove duplicate items from a series of objects
  • the order of the original series is not guaranteed to be preserved
>>> l = [1,1,2,4,6,7,1,44,108,108,108]
>>> dedup = set(l)
>>> dedup
{1, 2, 4, 6, 7, 44, 108}
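
Applied to the job names from the introduction, deduplication could look like the sketch below. The jobs list is a made-up sample, and splitting on the last hyphen is an assumption about how the environment suffix is encoded. The last two lines show one way to deduplicate while keeping the first-seen order, using dict.fromkeys (dictionaries preserve insertion order in Python 3.7+).

```python
# Hypothetical sample of job names; the real list would have ~200 entries.
jobs = ["foo-prod1", "bar-prod1", "acme-dev1", "xxx-qa1", "baz-dev1"]

# Assumption: the environment is everything after the last hyphen.
environments = {job.rsplit("-", 1)[-1] for job in jobs}

# If the original order matters, deduplicate via dictionary keys instead.
l = [1, 1, 2, 4, 6, 7, 1, 44, 108, 108, 108]
dedup_ordered = list(dict.fromkeys(l))
```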

6. sources
