Manipulating string content value in HTML file using BeautifulSoup
Folks,
I am new to Python and BeautifulSoup, so please bear with me. I am trying to do some HTML parsing.
I would like to remove newlines and compact whitespace in selected strings (chosen by a string search) within an HTML file.
For example, in the following HTML, I would like to find every tag whose text contains "xy" and remove the newlines and multiple spaces from that string (replacing each run with a single space).
<html> <head></head> <body> <h1>xy z</h1> <p>xy z</p> <div align="center" style="margin-left: 0%; "> <b> <font style="font-family: 'times new roman', times"> ab c </font> <font style="font-family: 'times new roman', times"> xy z </font> </b> </div> </body> </html>
The resulting HTML should look like:
<html> <head></head> <body> <h1>xy z</h1> <p>xy z</p> <div align="center" style="margin-left: 0%; "> <b> <font style="font-family: 'times new roman', times"> ab c </font> <font style="font-family: 'times new roman', times"> xy z </font> </b> </div> </body> </html>
OK - I found a way to do it. You use findAll with a text pattern to collect the matching strings, then call the replaceWith() method on each one, as shown below.
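.........
soup = BeautifulSoup(contents)
# collect every text node that contains "xy"
s = soup.findAll(text=re.compile("xy"))
for s1 in s:
    # collapse each run of whitespace (newlines included) to one space
    s1.replaceWith(re.sub(r'\s+', ' ', str(s1)))
...........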
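For anyone on the newer bs4 package, here is a minimal, self-contained sketch of the same idea. It assumes bs4 4.4 or later (for the `string=` argument) and the built-in "html.parser" backend; the function name `compact_matching_text` and the `sample` input are made up for illustration and are not part of the snippet above.

```python
import re

from bs4 import BeautifulSoup


def compact_matching_text(html, needle):
    """Collapse whitespace runs in every text node that contains `needle`.

    Hypothetical helper for illustration; assumes bs4 >= 4.4.
    """
    soup = BeautifulSoup(html, "html.parser")
    # find_all(string=...) returns the matching text nodes, not the tags.
    # The result is a list, so replacing nodes while looping is safe.
    for node in soup.find_all(string=re.compile(re.escape(needle))):
        node.replace_with(re.sub(r"\s+", " ", str(node)))
    return str(soup)


# Illustrative input: only the first paragraph contains "xy".
sample = "<p>xy\n   z</p><p>ab\n   c</p>"
result = compact_matching_text(sample, "xy")
print(result)
```

Only the text nodes that match the search are rewritten, so whitespace elsewhere in the document (the "ab c" paragraph here) is left as it was.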