Bringing back a meme from last year just to discuss it a tad more seriously is pretty much what this post is about. It seems almost comical, like a joke someone would’ve made at a high school reunion party just to sound witty.
However, when a meme carries a really serious message, there’s a better-than-average chance that you’ll remember it.
Those lurking on StackOverflow circa 2009 would immediately recognize the title of this post as a great example of a “meme that actually is a great life lesson”, and the meme keeps coming back on repeat thanks to the ever-increasing number of newcomers who do exactly what it says not to do.
Don’t parse (X)HTML with Regular Expressions (RegEx)
What started out as a normal, regular-looking answer on StackOverflow, albeit a bit subjective and rant-y, quickly devolved into textual art: a performance pulled off so smoothly that no one dared to edit it or flag it as incorrect.
I’m talking about the famous reply at StackOverflow – a reply to the question “how to parse HTML using regex?”
While the post tends toward the sarcastic and humorous side, there’s a message repeated over and over again in the post. You shouldn’t use regex to parse HTML.
I, myself, have received countless messages regarding this particular topic. It’s safe to say it’s all white noise nowadays, but I still have to repeat the same answer as if I were a voicemail.
So why shouldn’t you use RegEx to parse (X)HTML?
Sure, we have all been there. It’s just to retrieve some information and store it somewhere else; something as simple as regex can definitely handle that. I mean, what could go wrong?
**Many. Many things could go wrong.**
The danger of using regex to parse a context-free language (HTML) is as clear as day. Regex, as the name suggests, is built to match regular languages, which cannot keep track of arbitrarily deep nesting. HTML, on the other hand, is a totally different beast to handle.
It’s easy when you just want to be quick and dirty and use regex to find a pattern as simple as a URL, for example. That’s mostly because you probably know the basic forms of a URL: it either starts with “http://”, “https://”, or “www”, or it ends with a specific domain (.com, .org, etc.).
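As a rough illustration of that quick-and-dirty case, here is a minimal sketch in Python. The pattern and the `find_urls` helper are made up for this post and only cover the prefix-based forms mentioned above; it is nowhere near a complete URL matcher.

```python
import re

# Illustrative pattern only: an http://, https:// or www. prefix followed by
# any run of characters that are not whitespace, quotes or angle brackets.
URL_PATTERN = re.compile(r"""(?:https?://|www\.)[^\s"'<>]+""")


def find_urls(text):
    """Return every substring of `text` that looks like a URL."""
    return URL_PATTERN.findall(text)


sample = "Docs live at https://example.com/docs and www.example.org today."
print(find_urls(sample))
# ['https://example.com/docs', 'www.example.org']
```

For flat text like this, where you control what the input looks like, a pattern of this kind is usually good enough.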
However, when you try to be specific and say you want to match only URLs that are not commented out (<!-- like this -->), and the URL has to sit inside a very specific element with a specific class name, that’s when you should start looking for alternatives.
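To see why this gets uncomfortable, consider a small sketch. The `HREF_PATTERN` name and the sample markup are invented for illustration; the point is only that a naive pattern has no idea what a comment is.

```python
import re

# Naive pattern: grab whatever sits inside href="...".
HREF_PATTERN = re.compile(r'href="([^"]+)"')

page = """
<div class="links">
  <a href="https://example.com/live">Live</a>
  <!-- <a href="https://example.com/retired">Retired</a> -->
</div>
"""

naive_hits = HREF_PATTERN.findall(page)
print(naive_hits)
# ['https://example.com/live', 'https://example.com/retired']
```

The commented-out link comes back along with the live one. You can bolt on more patterns to strip comments first, but then the class-name requirement needs its own workaround, and so on.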
Regex simply won’t cover all the possible HTML structures either. Once you see that regex is unable to (or, more accurately, shouldn’t be asked to) parse nested tags, you should immediately conclude that you shouldn’t use regex to parse HTML.
You’d spend hours figuring out and trying to patch up every plausible pattern you can think of. To be fair, it is possible to parse HTML using regex, depending on your use case: some people know exactly what they need and can therefore narrow the scope of their regex pattern quickly.
But if you don’t know what you’ll be dealing with and you’ve already started typing that “brilliant” regex pattern into your code, just heed the prohibition the meme spells out so clearly and follow the message its author left at the end of the page:
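The nested-tag problem is easy to reproduce. The following sketch uses two made-up patterns, one lazy and one greedy, against tiny hand-written inputs; neither has any way of pairing an opening tag with its own closing tag.

```python
import re

LAZY_DIV = re.compile(r"<div>(.*?)</div>", re.DOTALL)
GREEDY_DIV = re.compile(r"<div>(.*)</div>", re.DOTALL)

# Lazy matching stops at the FIRST closing tag, cutting the outer div short.
nested = "<div>outer <div>inner</div> tail</div>"
lazy_result = LAZY_DIV.search(nested).group(1)
print(lazy_result)    # outer <div>inner

# Greedy matching runs to the LAST closing tag, gluing two siblings together.
siblings = "<div>a</div><div>b</div>"
greedy_result = GREEDY_DIV.search(siblings).group(1)
print(greedy_result)  # a</div><div>b
```

Whichever quantifier you pick, one of the two inputs breaks, because the pattern is not counting how many tags are currently open.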
Consider a premade HTML/XML parser instead
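For a sense of what that looks like in practice, here is a minimal sketch using `html.parser` from Python’s standard library. The `ClassLinkCollector` class, its depth-counting approach, and the sample markup are assumptions made for this post, not the one true way to do it, and the sketch assumes reasonably well-balanced markup.

```python
from html.parser import HTMLParser


class ClassLinkCollector(HTMLParser):
    """Collect href values of <a> tags nested inside elements with a given class."""

    # Void elements never get a closing tag, so they must not change the depth.
    VOID = {"area", "base", "br", "col", "embed", "hr", "img",
            "input", "link", "meta", "source", "track", "wbr"}

    def __init__(self, wanted_class):
        super().__init__()
        self.wanted_class = wanted_class
        self.depth = 0  # 0 means "outside any matching element"
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag in self.VOID:
            return
        attr_map = dict(attrs)
        classes = (attr_map.get("class") or "").split()
        if self.depth > 0:
            # Already inside a matching element: track nesting, collect links.
            self.depth += 1
            if tag == "a" and attr_map.get("href"):
                self.links.append(attr_map["href"])
        elif self.wanted_class in classes:
            self.depth = 1

    def handle_endtag(self, tag):
        if tag not in self.VOID and self.depth > 0:
            self.depth -= 1


page = """
<div class="links featured">
  <p><a href="https://example.com/live">Live</a></p>
  <!-- <a href="https://example.com/retired">Retired</a> -->
  <div><a href="https://example.com/nested">Nested</a></div>
</div>
<a href="https://example.com/outside">Outside</a>
"""

collector = ClassLinkCollector("links")
collector.feed(page)
print(collector.links)
# ['https://example.com/live', 'https://example.com/nested']
```

The parser hands comments to a separate `handle_comment` hook, so the commented-out link never shows up as a start tag, and the depth counter deals with nesting. For messy real-world pages, third-party parsers such as Beautiful Soup or lxml are generally more forgiving than this hand-rolled counter.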
tl;dr for busy people-
* Author’s note: This is a bit of a rant just to stop my colleagues from asking if they can parse HTML using regex. Thank you for bearing with my rant.
That said, stop parsing (X)HTML with regular expressions, please!