regex - How do HTML parsers work?


I've seen the humorous threads, read the warnings, and I know you don't parse HTML with regex. Don't worry... I'm not planning on trying it.

But... that leads me to ask: how are HTML parsers coded (including the built-in functions of programming languages, DOM parsers, and PHP's strip_tags)? What mechanism do they employ to parse (sometimes malformed) markup?

I found the source of one coded in JavaScript, and it uses regex to do the job:

    // Regular expressions for parsing tags and attributes
    var starttag = /^<(\w+)((?:\s+\w+(?:\s*=\s*(?:(?:"[^"]*")|(?:'[^']*')|[^>\s]+))?)*)\s*(\/?)>/,
        endtag = /^<\/(\w+)[^>]*>/,
        attr = /(\w+)(?:\s*=\s*(?:(?:"((?:\\.|[^"])*)")|(?:'((?:\\.|[^'])*)')|([^>\s]+)))?/g;

Do they all do this? Is there a conventional, standard way to code an HTML parser?

I do not know that that style is the “normal” way to do things. It's better than most I've seen, but it's still too close to what I refer to as the “naïve” approach in this answer. For one thing, it isn't accounting for HTML comments getting in the way of things. There are also legal matters of entities that it isn't dealing with. But it's HTML comments where such approaches most often fall down.
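To make the comment problem concrete, here is a minimal sketch of a naive tag scan. The function name `naiveTagNames` and its simplified pattern are assumptions made for illustration; they are not taken from the parser quoted in the question.

```javascript
// Naive scan: collect the name of everything that looks like a start tag.
// The pattern is a simplified, hypothetical stand-in for a regex-only approach.
function naiveTagNames(html) {
  const names = [];
  const startTag = /<(\w+)[^>]*>/g;
  let m;
  while ((m = startTag.exec(html)) !== null) {
    names.push(m[1].toLowerCase());
  }
  return names;
}

// The <script> element below lives inside a comment, so it is not markup,
// yet the naive scan reports it alongside the real <p> element.
const sample = '<p>visible</p><!-- <script>hidden()</script> -->';
const found = naiveTagNames(sample);
console.log(found); // [ 'p', 'script' ]
```

The scan has no notion of "being inside a comment", which is the kind of failure described above.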

A more natural way is to use a lexer to peel off tokens, more like what is shown in this answer's script, and then assemble those meaningfully. The lexer would be able to know about HTML comments easily enough.
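A lexer along those lines might look like the sketch below. It is only an illustration: the token names, the `TOKEN_PATTERNS` table, and the `tokenize` function are hypothetical, and the tag patterns are deliberately simplified (for example, an attribute value containing `>` would confuse them).

```javascript
// Hypothetical token patterns, tried in order at the current position.
// The sticky (y) flag anchors each match at lastIndex.
const TOKEN_PATTERNS = [
  ['comment', /<!--[\s\S]*?-->/y],
  ['endTag', /<\/(\w+)[^>]*>/y],
  ['startTag', /<(\w+)[^>]*>/y],
  ['text', /[^<]+/y],
];

function tokenize(html) {
  const tokens = [];
  let pos = 0;
  while (pos < html.length) {
    let matched = false;
    for (const [type, pattern] of TOKEN_PATTERNS) {
      pattern.lastIndex = pos;
      const m = pattern.exec(html);
      if (m) {
        const name = m[1] ? m[1].toLowerCase() : undefined;
        tokens.push({ type, value: m[0], name });
        pos = pattern.lastIndex;
        matched = true;
        break;
      }
    }
    if (!matched) {
      // A stray '<' that starts no known token is kept as literal text.
      tokens.push({ type: 'text', value: html[pos], name: undefined });
      pos += 1;
    }
  }
  return tokens;
}

const tokens = tokenize('<p>visible</p><!-- <script>hidden()</script> -->');
console.log(tokens.map((t) => t.type));
// [ 'startTag', 'text', 'endTag', 'comment' ]
```

Because the comment pattern is tried first, the whole comment becomes a single token and the `<script>` inside it is never seen as a tag. A later stage can then assemble the token stream into a tree.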

You could also approach this with a full grammar, such as the one shown here for parsing an RFC 5322 mail address. That's the sort of approach I take in the second, “wizardly” solution in this answer. But even that is only a complete grammar for well-formed HTML, and I'm only interested in a few different sorts of tags. Those I define fully, but I don't define valid fields for tags I'm unconcerned with.
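As a rough illustration of the grammar-driven idea, the sketch below hand-codes a recursive-descent parser for a tiny, hypothetical grammar. It is not the grammar from the linked answers, and it is written in JavaScript to match the code in the question. It handles only well-formed input with attribute-free tags:

    nodes   := (comment | element | text)*
    element := '<' name '>' nodes '</' name '>'

```javascript
// Recursive-descent sketch for the tiny grammar above (well-formed input only).
// parseWellFormed and its node shapes are assumptions made for illustration.
function parseWellFormed(input) {
  let pos = 0;

  // Try a sticky regex at the current position; advance on success.
  function match(re) {
    re.lastIndex = pos;
    const m = re.exec(input);
    if (m) pos = re.lastIndex;
    return m;
  }

  function parseNodes() {
    const nodes = [];
    while (pos < input.length) {
      if (input.startsWith('</', pos)) break; // the caller consumes the end tag
      let m;
      if (match(/<!--[\s\S]*?-->/y)) continue; // comments are skipped
      if ((m = match(/<(\w+)\s*>/y))) {
        const name = m[1];
        const children = parseNodes();
        const close = match(/<\/(\w+)\s*>/y);
        if (!close || close[1] !== name) {
          throw new SyntaxError('expected </' + name + '> near offset ' + pos);
        }
        nodes.push({ name, children });
      } else if ((m = match(/[^<]+/y))) {
        nodes.push({ text: m[0] });
      } else {
        throw new SyntaxError("unexpected '<' at offset " + pos);
      }
    }
    return nodes;
  }

  const nodes = parseNodes();
  if (pos < input.length) {
    throw new SyntaxError('unexpected closing tag at offset ' + pos);
  }
  return nodes;
}

const tree = parseWellFormed('<ul><li>a</li><!-- x --><li>b</li></ul>');
console.log(JSON.stringify(tree));
```

Unlike a flat regex scan, each rule can call itself, so nesting is tracked and a mismatched pair such as `<b><i>x</b></i>` is rejected with a `SyntaxError`. Real-world HTML is frequently not well-formed, so a parser meant for it needs error recovery that this sketch does not attempt.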

