C++ - Search HTML lines and remove lines that don't start with </form></td><td><a
I have an HTML file with badly formatted code from a website, and I want to extract small pieces of information from it.
I am interested in the lines that start like this:
</form></td><td><a href="http://www.mysite.com/users/user897" class="username"> <b>user897</b></a></td></tr><tr><td>housea</td><td>2</td><td class="entriestablerow-gamename">housea type12 <span class="entriestablerow-moredetails"></span></td><td>1 of 2</td><td>user123</td><td>10</td><td>
and I want to extract the following fields:
a:housea b:housea type12 c:user123 d:10
I know I've seen people recommend HTML Agility Pack or libxml2, but I don't think I need that. My app is in C/C++.
I am using getline to start reading the lines, but I'm not sure what the best way to proceed is. Thanks!
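#include <cstdio>
#include <cstring>
#include <fstream>
#include <string>

int main() {
    std::ifstream data("home.html");
    std::string line;
    int linenum = 0;  // 1-based number of the line just read
    const char* prefix = "</form></td><td>";
    while (std::getline(data, line)) {
        linenum++;
        // Report only the lines that begin with the wanted prefix.
        if (strncmp(line.c_str(), prefix, strlen(prefix)) == 0) {
            printf("found wanted line in line:%d\n", linenum);
        }
    }
}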
In the general case, an XML/HTML parser is the best way to go here, because it is robust against differing input. (Whatever you do, don't use regexps!)
Update
However, if you're targeting one specific input, as it seems you're doing, you can use sscanf (as you suggest), cin.read(), or a regexp to scan the line manually.
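A minimal sketch of the manual-scan approach, assuming every wanted line has exactly the cell layout of the sample line in the question. The names text_runs, parse_entry and Entry are hypothetical, and the field positions (1, 3, 5 and 6) are inferred from that single sample, so treat them as an assumption rather than a rule:

```cpp
#include <cassert>
#include <string>
#include <vector>

// Collect the non-empty, trimmed text runs found between tags on one line.
std::vector<std::string> text_runs(const std::string& line) {
    std::vector<std::string> runs;
    std::string::size_type pos = 0;
    while (pos < line.size()) {
        if (line[pos] == '<') {
            // Skip over the tag, attributes included.
            std::string::size_type close = line.find('>', pos);
            if (close == std::string::npos) break;  // unterminated tag: stop
            pos = close + 1;
            continue;
        }
        // Plain text runs up to the next tag or the end of the line.
        std::string::size_type next = line.find('<', pos);
        if (next == std::string::npos) next = line.size();
        const std::string text = line.substr(pos, next - pos);
        const std::string::size_type first = text.find_first_not_of(" \t\r\n");
        if (first != std::string::npos) {
            const std::string::size_type last = text.find_last_not_of(" \t\r\n");
            assert(last != std::string::npos);  // holds because 'first' was found
            runs.push_back(text.substr(first, last - first + 1));
        }
        pos = next;
    }
    return runs;
}

// The four fields asked for in the question (a, b, c, d).
struct Entry {
    std::string a, b, c, d;
};

// Returns true and fills 'out' if 'line' starts with the wanted prefix and
// has at least as many text cells as the sample line.
bool parse_entry(const std::string& line, Entry& out) {
    static const std::string prefix = "</form></td><td><a";
    if (line.compare(0, prefix.size(), prefix) != 0) return false;
    const std::vector<std::string> runs = text_runs(line);
    if (runs.size() < 7) return false;  // layout differs from the sample
    // Assumed positions: 0 = user name, 1 = a, 3 = b, 5 = c, 6 = d.
    out = Entry{runs[1], runs[3], runs[5], runs[6]};
    return true;
}
```

Inside the getline loop from the question, you would call parse_entry(line, entry) and print the fields whenever it returns true. It does not decode entities such as &amp; and it cannot cope with a tag that spans several lines.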
Just beware that the code can break the moment the HTML changes (even in whitespace).
Therefore, my/our recommendation is to use the proper tool for the job. XML/HTML is not raw text, and should not be treated as such.
How about writing a Python script instead? :)