How to Extract All Regex Matches from a File With Python
The aim of this playbook🏁 is to list the steps for extracting all regex matches from a file
into a list using the re.findall() function from Python’s re module.
Sep 13, 2022
- NOTE: this is not about extracting matches from capture groups.
1. RE.FINDALL()
- import the re module
- initialize a variable bound to a compiled regex object with p = re.compile(<pattern>)
- open the file with a with open(...) as <alias>: statement
- assign the content of the file with f = <alias>.read()
- assign the list of matches with m = re.findall(p, f) (pattern first, then the string to search)
- re.findall() returns a list of matching strings if the pattern has no capture group or exactly one; with two or more groups it returns a list of tuples
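How the return type of re.findall() depends on the number of capture groups can be checked with a small sketch. The sample string and the patterns below are made up for illustration and are not taken from the playbook:

```python
import re

text = "a.png b.jpg c.png"

# No capture group: each item is the whole match, as a string.
no_group = re.findall(r"\w+\.(?:png|jpg)", text)

# One capture group: each item is the text of that group, still a string.
one_group = re.findall(r"(\w+)\.(?:png|jpg)", text)

# Two capture groups: each item is a tuple with one entry per group.
two_groups = re.findall(r"(\w+)\.(png|jpg)", text)

print(no_group)    # ['a.png', 'b.jpg', 'c.png']
print(one_group)   # ['a', 'b', 'c']
print(two_groups)  # [('a', 'png'), ('b', 'jpg'), ('c', 'png')]
```

Writing a group as (?:...) keeps it from counting as a capture group, which is one way to keep the result a flat list of strings.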
2. EXAMPLE
import re

def extract_images(filename):
    # one capture group, so re.findall() returns the captured file names as strings;
    # (?:jpg|png) lets the lazy .*? stop at whichever extension comes first
    p = re.compile("../assets/(.*?(?:jpg|png))")
    with open(filename, mode="rt", encoding="utf-8") as docFile:
        doc = docFile.read()
    images = re.findall(p, doc)
    return ["./assets/" + img for img in images]

# later used in e.g. [os.remove(img) for img in extract_images(filename)]
# above deletes all images located in ./assets/<filename>.jpg|png
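The deletion mentioned in the comment uses a list comprehension only for its side effect. A plain loop may be easier to read and can skip paths that no longer exist. A minimal sketch, where remove_files is a hypothetical helper name that does not appear in the original:

```python
import os

def remove_files(paths):
    """Delete every path that is an existing file; return the paths removed."""
    removed = []
    for path in paths:
        # Skip entries that are missing (or are not regular files)
        # instead of letting os.remove() raise FileNotFoundError.
        if os.path.isfile(path):
            os.remove(path)
            removed.append(path)
    return removed
```

It could then be called as remove_files(extract_images(filename)), and the returned list shows which files were actually deleted.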
- I have a file containing
(a Markdown table whose cells are embedded images, each with the alt text example_description)
- Running the function returns
extract_images("Test-Ignore.md")
['./assets/81.03_2.png', './assets/81.03_3.png', './assets/81.03_4.png', './assets/81.03_5.png', './assets/81.03_1.png']
- These are file paths, so I can, e.g., commit and push those files as part of an automated git workflow
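To try the extraction without the original file, the following sketch writes a small Markdown table to a temporary file and runs the function on it. The table content and file names are hypothetical stand-ins, not the content of Test-Ignore.md. The sketch defines its own copy of extract_images so that it runs on its own. Its pattern groups the two extensions as (?:jpg|png), so the lazy match stops at whichever extension appears first on a line:

```python
import re
import tempfile
from pathlib import Path

def extract_images(filename):
    # One capture group: re.findall() returns the captured names as strings.
    p = re.compile("../assets/(.*?(?:jpg|png))")
    with open(filename, mode="rt", encoding="utf-8") as doc_file:
        doc = doc_file.read()
    images = re.findall(p, doc)
    return ["./assets/" + img for img in images]

# Hypothetical file content: a Markdown table with three embedded images.
sample = (
    "| ![one](../assets/81.03_1.png) | ![two](../assets/81.03_2.jpg) |\n"
    "|---|---|\n"
    "| ![three](../assets/81.03_3.png) | no image here |\n"
)

with tempfile.TemporaryDirectory() as tmp:
    md_file = Path(tmp) / "sample.md"
    md_file.write_text(sample, encoding="utf-8")
    result = extract_images(md_file)

print(result)
# ['./assets/81.03_1.png', './assets/81.03_2.jpg', './assets/81.03_3.png']
```

The matches come back in the order in which they appear in the file, and the cell without an image contributes nothing to the list.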