'Don't parse markup languages with Regex' is an annoying trollpost and it should die... right?

ChubakPDP11+TakeWithGrainOfSalt@programming.dev · edit-2 1 year ago

solrize@lemmy.world · 1 year ago

It wouldn’t occur to me to use a parser generator to make an HTML parser, or (say) a Lisp reader. But I wouldn’t use regexes either. For large HTML docs I generally use a SAX parser like expat, then have the application maintain a stack of the tag nesting at the current point. For smaller docs I use a DOM parser: same idea, but it builds the tree for you. In each case the bottom level is basically a hand-coded state machine that scans the HTML input and emits tag events or text strings.
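To make the SAX-plus-stack idea concrete, here is a rough sketch using Python's built-in `xml.parsers.expat` binding. The function name `collect_text_with_paths` and the path format are made up for illustration, and it assumes the input is well-formed (XHTML-style) markup, since expat is an XML parser and will reject typical tag-soup HTML.

```python
import xml.parsers.expat


def collect_text_with_paths(document: str):
    """Return (nesting_path, text) pairs for each non-blank text run.

    Hypothetical helper: the application keeps its own stack of open
    tags, updated from the start/end events the SAX parser emits.
    """
    stack = []   # names of the currently open elements, outermost first
    events = []  # collected (path, text) results

    def on_start(name, attrs):
        stack.append(name)

    def on_end(name):
        stack.pop()

    def on_text(data):
        text = data.strip()
        if text:
            events.append(("/".join(stack), text))

    parser = xml.parsers.expat.ParserCreate()
    parser.buffer_text = True  # coalesce adjacent character data chunks
    parser.StartElementHandler = on_start
    parser.EndElementHandler = on_end
    parser.CharacterDataHandler = on_text
    parser.Parse(document, True)  # True marks this as the final chunk
    return events


if __name__ == "__main__":
    sample = "<html><body><p>Hello <b>world</b></p></body></html>"
    for path, text in collect_text_with_paths(sample):
        print(path, "->", text)
```

Nothing here builds a tree; the only state kept is the stack, which is why this style scales to large documents.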

Markdown is possibly simpler, but I haven’t had to process it so far.