regex - How do HTML parsers work?
I've seen the humorous threads and read the warnings, and I know you don't parse HTML with regex. Don't worry... I'm not planning on trying it.
But... that leads me to ask: how are HTML parsers coded (including the built-in functions of programming languages, like DOM parsers and PHP's strip_tags)? What mechanism do they employ to parse (sometimes malformed) markup?
I found the source of one coded in JavaScript, and it actually uses regex to do the job:
// Regular expressions for parsing tags and attributes
var starttag = /^<(\w+)((?:\s+\w+(?:\s*=\s*(?:(?:"[^"]*")|(?:'[^']*')|[^>\s]+))?)*)\s*(\/?)>/,
    endtag = /^<\/(\w+)[^>]*>/,
    attr = /(\w+)(?:\s*=\s*(?:(?:"((?:\\.|[^"])*)")|(?:'((?:\\.|[^'])*)')|([^>\s]+)))?/g;
Do they all do this? Is there a conventional, standard way to code an HTML parser?
I do not know that that style is the “normal” way of doing things. It’s better than most I’ve seen, but it’s still too close to what I refer to as the “naïve” approach in this answer. For one thing, it isn’t accounting for HTML comments getting in the way of things. There are also legal matters of entities that it isn’t dealing with. But it’s the HTML comments where most such approaches fall down.
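To make the comment problem concrete, here is a minimal sketch. It does not use the regexes from the question; `naiveStartTag`, `commentOrTag`, and the sample input are made up for illustration. A pattern that only looks for start tags also reports a tag that sits inside a comment, while a pattern that tries comments first lets the comment swallow its contents.

```javascript
// Hypothetical input: one live <p> tag and one <b> tag that is commented out.
const html = '<p>live</p><!-- <b>dead</b> -->';

// Naive approach: grab anything that looks like a start tag.
const naiveStartTag = /<(\w+)[^>]*>/g;
const naiveTags = Array.from(html.matchAll(naiveStartTag), m => m[1]);

// Comment-aware approach: try comments first so they consume their contents.
const commentOrTag = /<!--[\s\S]*?-->|<(\w+)[^>]*>/g;
const awareTags = Array.from(html.matchAll(commentOrTag))
  .filter(m => m[1] !== undefined) // comment matches leave group 1 unset
  .map(m => m[1]);

console.log(naiveTags); // [ 'p', 'b' ]
console.log(awareTags); // [ 'p' ]
```

Even the comment-aware pattern is only a demonstration: it still ignores entities, and a `>` inside a quoted attribute value would cut a tag short.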
A more natural way is to use a lexer to peel off tokens, more like what is shown in this answer’s script, and then assemble those meaningfully. The lexer would be able to know about HTML comments easily enough.
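The lexer idea can be sketched roughly as follows. This is an illustrative toy, not the script the answer refers to: the function names `lex` and `liveTagNames`, the token type names, and the simplified patterns are all assumptions. Each rule is tried at the head of the remaining input, and the comment rule comes first so commented-out markup never reaches the tag rules.

```javascript
// Minimal lexer sketch: walks the input once and emits typed tokens.
// Token names ('comment', 'endTag', 'startTag', 'text') are illustrative only.
function lex(html) {
  const rules = [
    ['comment',  /^<!--([\s\S]*?)-->/],
    ['endTag',   /^<\/(\w+)[^>]*>/],
    ['startTag', /^<(\w+)[^>]*>/],
    ['text',     /^[^<]+/],
  ];
  const tokens = [];
  let rest = html;
  while (rest.length > 0) {
    let matched = false;
    for (const [type, pattern] of rules) {
      const m = pattern.exec(rest);
      if (m) {
        // Use the captured name/body when the rule has one, else the whole match.
        tokens.push({ type, value: m[1] !== undefined ? m[1] : m[0] });
        rest = rest.slice(m[0].length);
        matched = true;
        break;
      }
    }
    if (!matched) {
      // A stray '<' that starts no known token: treat it as literal text.
      tokens.push({ type: 'text', value: rest[0] });
      rest = rest.slice(1);
    }
  }
  return tokens;
}

// "Assemble" step: keep only the names of start tags that are real markup.
function liveTagNames(html) {
  return lex(html).filter(t => t.type === 'startTag').map(t => t.value);
}

console.log(liveTagNames('<p>hi</p><!-- <b>x</b> -->')); // [ 'p' ]
```

Because the token stream already separates comments from tags, whatever consumes it can stay simple. A real lexer would also need rules for quoted attribute values, entities, and doctype or CDATA sections.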
You could approach this with a full grammar, such as the one shown here for parsing an RFC 5322 mail address. That is the sort of approach I take in the second, “wizardly” solution in this answer. But even that is only a complete grammar for well-formed HTML, and I’m only interested in a few different sorts of tags. Those I define fully, but I don’t define valid fields for tags I’m unconcerned with.
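As a rough picture of what "approaching it with a grammar" means, here is a small recursive-descent parser. It is a different technique from the regex-based grammar the answer links to, written in JavaScript to match the question. The tiny grammar in the comment, the `parse` function, and the node shapes are all assumptions, and the grammar covers only attribute-free, well-formed markup.

```javascript
// Grammar (an illustrative subset, not real HTML):
//   nodes   := (comment | element | text)*
//   element := '<' NAME '>' nodes '</' NAME '>'
//   comment := '<!--' ... '-->'
//   text    := one or more characters other than '<'
function parse(input) {
  let pos = 0;

  // Try a pattern exactly at the current position; advance on success.
  function eat(pattern) {
    const re = new RegExp(pattern.source, 'y'); // 'y' = sticky: match at lastIndex only
    re.lastIndex = pos;
    const m = re.exec(input);
    if (m) pos += m[0].length;
    return m;
  }

  function parseNodes() {
    const nodes = [];
    // A closing tag ends the current list; the caller checks that it matches.
    while (pos < input.length && !input.startsWith('</', pos)) {
      nodes.push(parseNode());
    }
    return nodes;
  }

  function parseNode() {
    let m;
    if ((m = eat(/<!--([\s\S]*?)-->/))) return { type: 'comment', value: m[1] };
    if ((m = eat(/<(\w+)>/))) {
      const name = m[1];
      const children = parseNodes();
      const at = pos;
      const close = eat(/<\/(\w+)>/);
      if (!close || close[1] !== name) {
        throw new SyntaxError('expected </' + name + '> at offset ' + at);
      }
      return { type: 'element', name, children };
    }
    if ((m = eat(/[^<]+/))) return { type: 'text', value: m[0] };
    throw new SyntaxError('unexpected input at offset ' + pos);
  }

  const nodes = parseNodes();
  if (pos < input.length) {
    throw new SyntaxError('unexpected closing tag at offset ' + pos);
  }
  return nodes;
}

const tree = parse('<p>a<b>c</b></p>');
console.log(tree[0].name, tree[0].children.length); // p 2
```

Unlike the lexer sketch, this rejects input that does not fit the grammar (for example `parse('<p><i>x</p>')` throws), which is the trade-off the answer points at: a full grammar only describes well-formed markup, so real-world tag soup needs extra recovery rules on top.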