How to Extract All Regex Matches from a File With Python
The aim of this playbook 🏁 is to list the steps for extracting all regex matches from a file into a list
with the re.findall() function of Python’s re module.
Sep 13, 2022
- NOTE: this is not about extracting matches from capture groups.
1. RE.FINDALL()
- import the re module
- initialize a variable bound to a compiled regex pattern object with p = re.compile(<pattern>)
- open the file with a with open(...) as <alias>: statement
- assign the content of the file with f = <alias>.read()
- assign the list of matches with m = re.findall(p, f) (the pattern comes first, then the string to search)
- re.findall() returns a list of strings when the pattern has no capture group or exactly one capture group; it returns a list of tuples only when the pattern has two or more capture groups
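How the return value of re.findall() changes with the number of capture groups can be seen in a minimal sketch. The sample string and the three patterns below are made up for illustration and are not taken from the file used in this playbook:

```python
import re

# Hypothetical sample text, for illustration only.
text = "a.png b.jpg c.png"

# No capture group: each match is returned as a whole string.
no_group = re.findall(r"\w+\.png", text)

# One capture group: only the group's text is returned, still as strings.
one_group = re.findall(r"(\w+)\.png", text)

# Two capture groups: each match becomes a tuple with one item per group.
two_groups = re.findall(r"(\w+)\.(png|jpg)", text)

print(no_group)    # ['a.png', 'c.png']
print(one_group)   # ['a', 'c']
print(two_groups)  # [('a', 'png'), ('b', 'jpg'), ('c', 'png')]
```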
2. EXAMPLE
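import re

def extract_images(filename):
    # Literal "../assets/" followed by the shortest file name ending in .jpg or .png.
    # The dots are escaped so that they only match real dots.
    p = re.compile(r"\.\./assets/(.*?\.(?:jpg|png))")
    with open(filename, mode="rt", encoding="utf-8") as doc_file:
        doc = doc_file.read()
    # One capture group, so findall() returns a list of the captured strings.
    images = re.findall(p, doc)
    return ["./assets/" + img for img in images]

# later used in e.g. [os.remove(img) for img in extract_images(filename)]
# above deletes all images located in ./assets/<filename>.jpg|png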
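The os.remove() comment above runs a list comprehension only for its side effects. A minimal sketch of an explicit loop that does the same job and skips files that are already gone might look like this; remove_files is a hypothetical helper name, not part of the original playbook:

```python
from pathlib import Path


def remove_files(paths):
    """Delete every path that is an existing file; return the paths removed."""
    removed = []
    for path in paths:
        candidate = Path(path)
        # Skip anything that is missing or is not a regular file.
        if candidate.is_file():
            candidate.unlink()
            removed.append(path)
    return removed
```

It would be called as remove_files(extract_images(filename)), and the returned list shows which files were actually deleted.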
- I have a file, Test-Ignore.md, containing a table of image references (the images themselves are not shown here)
- Running the function returns
extract_images("Test-Ignore.md")
['./assets/81.03_2.png', './assets/81.03_3.png', './assets/81.03_4.png', './assets/81.03_5.png', './assets/81.03_1.png']
- These are relative paths, so I can, for example, commit and push those files as part of an automated git workflow
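To try the whole flow without an existing document, the sketch below writes a small Markdown file to a temporary directory and runs the extraction on it. The sample content is invented, and the pattern (escaped dots, one lazy capture group) is an assumption about what the match is meant to cover:

```python
import re
import tempfile
from pathlib import Path

# Assumed pattern: literal "../assets/" followed by the shortest run of
# characters that ends in ".jpg" or ".png".
IMAGE_PATTERN = re.compile(r"\.\./assets/(.*?\.(?:jpg|png))")


def extract_images(filename):
    """Return ./assets/<name> paths for every image referenced in the file."""
    with open(filename, mode="rt", encoding="utf-8") as doc_file:
        doc = doc_file.read()
    return ["./assets/" + img for img in IMAGE_PATTERN.findall(doc)]


# Invented sample document; the real Test-Ignore.md is not reproduced here.
sample = "| ![](../assets/a_1.png) | ![](../assets/b_2.jpg) |\n"

with tempfile.TemporaryDirectory() as tmp:
    md_path = Path(tmp) / "sample.md"
    md_path.write_text(sample, encoding="utf-8")
    images = extract_images(md_path)

print(images)  # ['./assets/a_1.png', './assets/b_2.jpg']
```

With a pattern written as (.*?jpg|.*?png), a .png followed by a .jpg on the same line would be swallowed into a single match, which is why the sketch puts the alternation after the lazy part.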