Intro to Python Sets and Using Them for Deduplication
The aim of this page 📝 is to cover Python sets. As beautifully explained in Set Theory: the Method To Database Madness by Vaidehi Joshi (Medium), sets are an essential concept for working with data(bases). The set is a built-in data structure in Python with both a mutable (set) and an immutable (frozenset) type, and I use it mostly for deduplication. For example, I have hundreds of data processing jobs with the environment in their suffix (foo-prod1, bar-prod1, acme-dev1, xxx-qa1) and I quickly need to get the unique environments (I get a set of 5 environments from the list of 200 jobs). I am also moved to share these notes because of the following claim made on LeetCode:
If I had to choose three built in functions/methods that I wasn’t comfortable with at the start and have found them super helpful, I’d probably say enumerate, zip and set
— Sum MegaPost — Python3 Solution with a detailed explanation
1. attributes
- collection
- unordered
- elements are unique
- mutable (there is also a frozen set which is immutable)
- each element must be hashable, which in practice means immutable (like keys of a dictionary)
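The attributes above can be checked with a minimal sketch. The variable names are illustrative only and do not come from the original notes; the point is that hashable values are accepted, a list is rejected, and a frozenset can be nested where a plain set cannot.

```python
# Hashable (immutable) values such as ints, strings and tuples are valid elements.
valid = {1, "prod", (2, 3)}

# A list is mutable and therefore unhashable, so it cannot be an element.
try:
    invalid = {[1, 2]}
except TypeError as exc:
    error_message = str(exc)  # the message mentions an unhashable type

# A set cannot contain a mutable set, but it can contain frozensets.
nested = {frozenset({1, 2}), frozenset({3})}

# A frozenset deduplicates like a set but offers no mutating methods.
frozen = frozenset([1, 2, 2, 3])
can_add = hasattr(frozen, "add")  # False
```

Because a frozenset is hashable, it can also serve as a dictionary key, which a regular set cannot.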
2. syntax
- the literal form is similar to dicts: curly braces, but with single values instead of key: value pairs
>>> s = {333,555,77,32,124}  # avoid naming a variable "set": it would shadow the built-in
>>> s
{32, 555, 77, 333, 124}
3. constructor
- NOTE: {} is already reserved for the creation of an empty dictionary, so you need to use the set() constructor to create an empty set
- out of the 4 main collection types (list, dict, set, tuple), set is the only one without a literal for the empty collection
>>> f = {}
>>> f
{}
>>> type(f)
<class 'dict'>
>>> g = []
>>> g
[]
>>> type(g)
<class 'list'>
>>> b = ()
>>> b
()
>>> type(b)
<class 'tuple'>
>>> e = set()
>>> e
set()
>>> type(e)
<class 'set'>
- you can create a set from any iterable series
- any duplicates thereof are discarded
>>> j = set([1,2,2,2,3,4,5,6,11,6,])
>>> j
{1, 2, 3, 4, 5, 6, 11}
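As a small illustration of "any iterable", the sketch below passes a string, a tuple, a range and a dict to the constructor. The variable names are made up for this example.

```python
# A string is an iterable of characters, so duplicates collapse.
letters = set("mississippi")

# Tuples and ranges work the same way as lists.
from_tuple = set((1, 1, 2))
from_range = set(range(3))

# Iterating a dict yields its keys, so only the keys end up in the set.
from_dict = set({"a": 1, "b": 2})
```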
4. membership
- this is a fundamental use; note that items of a set cannot be retrieved by their position/index
- tested with the in and not in operators
>>> j
{1, 2, 3, 4, 5, 6, 11}
>>> j[0]
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
TypeError: 'set' object is not subscriptable
>>> 11 in j
True
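The session above only shows the in operator. The sketch below adds not in and one possible workaround for the missing index access, assuming it is acceptable to convert the set to a sorted list first.

```python
j = {1, 2, 3, 4, 5, 6, 11}

has_eleven = 11 in j         # True
missing_seven = 7 not in j   # True

# Sets are iterable even though they are not subscriptable.
total = sum(j)               # 32

# If positional access is needed, convert to a list first;
# sorted() also gives a predictable order.
as_sorted_list = sorted(j)
smallest = as_sorted_list[0]  # 1
```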
5. deduplication
- the set constructor is commonly used to efficiently remove duplicate items from a series of objects (note that the original order is not preserved)
>>> l = [1,1,2,4,6,7,1,44,108,108,108]
>>> dedup = set(l)
>>> dedup
{1, 2, 4, 6, 7, 44, 108}
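The job-name use case from the introduction might look like the sketch below. The job list is a shortened, partly hypothetical sample (baz-dev1 is invented here), and it assumes the environment is always the part after the last hyphen. The second half shows one common way to deduplicate while keeping the first-seen order, which a set does not guarantee.

```python
# Sample job names; the environment is assumed to follow the last "-".
jobs = ["foo-prod1", "bar-prod1", "acme-dev1", "xxx-qa1", "baz-dev1"]

# A set comprehension collects each environment suffix exactly once.
environments = {job.rsplit("-", 1)[-1] for job in jobs}

# Sets are unordered, so sort when a stable listing is needed.
sorted_envs = sorted(environments)  # ['dev1', 'prod1', 'qa1']

# Order-preserving deduplication: dict keys are unique and keep
# insertion order (guaranteed since Python 3.7).
l = [1, 1, 2, 4, 6, 7, 1, 44, 108, 108, 108]
ordered = list(dict.fromkeys(l))    # [1, 2, 4, 6, 7, 44, 108]
```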