regex - How do HTML parsers work?


I've seen the humorous threads and read the warnings, and I know that you don't parse HTML with regex. Don't worry... I'm not planning on trying it.

But... that leads me to ask: how are HTML parsers coded (including the built-in functions of programming languages, like DOM parsers and PHP's strip_tags)? What mechanism do they employ to parse (sometimes malformed) markup?

I found the source of one coded in JavaScript, and it actually uses regex to do the job:

    // Regular expressions for parsing tags and attributes
    var startTag = /^<(\w+)((?:\s+\w+(?:\s*=\s*(?:(?:"[^"]*")|(?:'[^']*')|[^>\s]+))?)*)\s*(\/?)>/,
        endTag = /^<\/(\w+)[^>]*>/,
        attr = /(\w+)(?:\s*=\s*(?:(?:"((?:\\.|[^"])*)")|(?:'((?:\\.|[^'])*)')|([^>\s]+)))?/g;

Do they all do this? Is there a conventional, standard way to code an HTML parser?

I do not know that that style is the “normal” way of doing things. It's better than most I’ve seen, but it’s still too close to what I refer to as the “naïve” approach in this answer. For one thing, it isn’t accounting for HTML comments getting in the way of things. There are also legal matters of entities that it isn’t dealing with. But it’s the HTML comments where most such approaches fall down.
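To make the comment problem concrete, here is a minimal sketch. It assumes a naive scanner that simply searches the whole document for start tags, using the question's start-tag pattern with the `^` anchor removed and the `g` flag added. The function name `naiveTagNames` and the sample markup are made up for illustration.

```javascript
// The question's start-tag pattern, unanchored and global (an assumption for this demo).
const naiveStartTag =
  /<(\w+)((?:\s+\w+(?:\s*=\s*(?:(?:"[^"]*")|(?:'[^']*')|[^>\s]+))?)*)\s*(\/?)>/g;

// Hypothetical helper: collect the name of every start tag the pattern finds.
function naiveTagNames(html) {
  return Array.from(html.matchAll(naiveStartTag), (m) => m[1]);
}

// The <script> tag below sits inside a comment, so a browser never sees it.
const naiveSample = '<p>visible</p><!-- <script src="old.js"> --><b>bold</b>';

console.log(naiveTagNames(naiveSample)); // [ 'p', 'script', 'b' ]
```

The scanner reports `script` because nothing tells it that the text between `<!--` and `-->` is not markup.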

A more natural way is to use a lexer to peel off tokens, more like what is shown in this answer’s script, and then assemble those meaningfully. The lexer would be able to know about the HTML comments easily enough.
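As a rough illustration of the lexer idea (not the script referred to above), the sketch below tries a small set of token patterns in a fixed order at the current position, using JavaScript's sticky `y` flag. The token names, the `lex` generator, and the choice to try the comment pattern first are all assumptions of this example. Doctypes, CDATA, and the special content of `<script>` and `<style>` are ignored.

```javascript
// Token patterns, tried in order at the current position (sticky flag "y").
const TOKEN_PATTERNS = [
  ['comment', /<!--[\s\S]*?-->/y],
  ['endTag', /<\/(\w+)[^>]*>/y],
  ['startTag', /<(\w+)((?:\s+\w+(?:\s*=\s*(?:"[^"]*"|'[^']*'|[^>\s]+))?)*)\s*(\/?)>/y],
  ['text', /[^<]+/y],
];

// Yield { type, value, name } tokens; name is set only for tag tokens.
function* lex(html) {
  let pos = 0;
  while (pos < html.length) {
    let matched = false;
    for (const [type, pattern] of TOKEN_PATTERNS) {
      pattern.lastIndex = pos; // a sticky pattern matches only at this offset
      const m = pattern.exec(html);
      if (m) {
        yield { type, value: m[0], name: m[1] };
        pos = pattern.lastIndex;
        matched = true;
        break;
      }
    }
    if (!matched) {
      // A "<" that opens nothing recognisable: pass it through as text.
      yield { type: 'text', value: html[pos], name: undefined };
      pos += 1;
    }
  }
}

const lexSample = '<p>visible</p><!-- <script src="old.js"> --><b>bold</b>';
const startNames = [...lex(lexSample)]
  .filter((t) => t.type === 'startTag')
  .map((t) => t.name);
console.log(startNames); // [ 'p', 'b' ]
```

Because the comment is consumed as a single token, the commented-out `script` tag never reaches the code that assembles tokens into something meaningful.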

You could also approach this with a full grammar, such as the one shown here for parsing an RFC 5322 mail address. That’s the sort of approach I take in the second, “wizardly” solution in this answer. But even that is only a complete grammar for well-formed HTML, and I’m only interested in a few different sorts of tags. Those I define fully, but I don’t define valid fields for the tags I’m unconcerned with.
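The grammar-driven approach can be sketched on a much smaller scale as a recursive-descent parser, with one function per grammar rule. This is not the grammar from the linked solution: the rules below, the `parseHtmlSubset` name, and the decision to skip comments and reject attributes are simplifying assumptions. Like the approach described above, it only accepts well-formed input.

```javascript
// Grammar handled here (well-formed input only, no attributes):
//   nodes   := (comment | text | element)*
//   element := "<" name ">" nodes "</" name ">"
const COMMENT = /<!--[\s\S]*?-->/y;
const OPEN = /<(\w+)>/y;
const CLOSE = /<\/(\w+)>/y;
const TEXT = /[^<]+/y;

function parseHtmlSubset(input) {
  let pos = 0;

  // Try one sticky pattern at the current position; advance on success.
  function accept(pattern) {
    pattern.lastIndex = pos;
    const m = pattern.exec(input);
    if (m) pos = pattern.lastIndex;
    return m;
  }

  // Rule "nodes": repeat until nothing at this position fits.
  function parseNodes() {
    const nodes = [];
    for (;;) {
      let m;
      if (accept(COMMENT)) continue; // comments are recognised and dropped
      if ((m = accept(TEXT))) { nodes.push({ text: m[0] }); continue; }
      if ((m = accept(OPEN))) { nodes.push(parseElementRest(m[1])); continue; }
      return nodes; // a close tag, end of input, or something unsupported
    }
  }

  // Rule "element", after its start tag has already been consumed.
  function parseElementRest(name) {
    const children = parseNodes();
    const close = accept(CLOSE);
    if (!close || close[1] !== name) {
      throw new SyntaxError(`expected </${name}> near offset ${pos}`);
    }
    return { tag: name, children };
  }

  const tree = parseNodes();
  if (pos !== input.length) {
    throw new SyntaxError(`unexpected input at offset ${pos}`);
  }
  return tree;
}

console.log(JSON.stringify(parseHtmlSubset('<p>hi <b>there</b><!-- <i> --></p>')));
```

Mismatched nesting such as `<p><b>x</p>` throws a `SyntaxError` rather than being repaired, which is the main difference from the error-recovering parsers that browsers use.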

