
Regular expression searches in HTML code

We often have to parse an .html page and return the contents between two div tags. An easy way to do this is with regular expressions. This is a handy way to load pages into a content management system: using a programming language such as PHP (what we use), you can grab the “meat and potatoes” of your content, load it in an editor (such as TinyMCE), and allow users to edit content without messing up the structure of your page.

Let’s say that we only want to grab the content inside our <div class="content01"> tag. Here is what the PHP code looks like:

[php]$regex_pattern = "/<div class=\"content01\">(.*)<\/div>/s";
preg_match($regex_pattern, $text, $matches);
// $matches[0] is the whole match, tags included; $matches[1] is the captured content
$contentVar = $matches[1];[/php]

The first line of code creates a regular expression pattern that captures everything between the opening tag of the content01 div and a closing </div>. The second line uses PHP’s preg_match function to search the $text variable for that pattern and saves the results to an array called $matches. We then set a new variable, called $contentVar, to the captured content. Pretty simple, huh?
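As a minimal sketch of what ends up in $matches, assume a small, made-up page fragment in $text (the markup and the $wholeMatch variable are illustrative, not from a real page). preg_match returns 1 when the pattern matches, so checking that first avoids reading from an empty array:

[php]// Hypothetical page fragment standing in for $text.
$text = '<body><div class="content01"><p>Hello, world!</p></div></body>';

$regex_pattern = "/<div class=\"content01\">(.*)<\/div>/s";

// preg_match returns 1 on a match, 0 on no match, and false on error.
if (preg_match($regex_pattern, $text, $matches) === 1) {
    $wholeMatch = $matches[0]; // <div class="content01"><p>Hello, world!</p></div>
    $contentVar = $matches[1]; // <p>Hello, world!</p>
} else {
    $wholeMatch = null;
    $contentVar = null;
}[/php]

Index 0 always holds the full match, tags included, while index 1 holds only what the first pair of parentheses captured.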

One caveat: regular expressions can be greedy; they will try to grab as much of the matching content as possible. Because of that, the above method works best when the tags you are matching appear only once in the page. For example, if you want to return just the links within a web page, the same approach will not work: a greedy .* would swallow everything from the first link's opening tag to the last link's closing tag. A new expression will better serve our purposes for this scenario:

[php]$regex_pattern = "/<a href=\"[^\"]+\">[^<]+<\/a>/s";[/php]

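Now the href value is limited to one or more characters other than a double quote ([^"]+), so it stops at the first closing quote, which must be followed by a >. The link text is limited in the same way: [^<]+ matches everything up to the next <, which must begin the closing </a>.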
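To see the difference in practice, here is a rough sketch that runs a greedy pattern and the link pattern side by side. The sample markup, the capture groups around the URL and link text, and the use of preg_match_all to collect every match are assumptions added for illustration. The pattern only handles anchors written exactly as <a href="...">text</a>, with no extra attributes or nested tags:

[php]// Hypothetical input with two simple links.
$text = '<p><a href="/about">About</a> and <a href="/contact">Contact</a></p>';

// Greedy version: each .* grabs as much as it can, so the two links
// collapse into a single oversized match.
$greedy_pattern = "/<a href=\"(.*)\">(.*)<\/a>/s";
preg_match_all($greedy_pattern, $text, $greedy, PREG_SET_ORDER);

// Negated character classes, with capture groups added around the
// URL and the link text.
$regex_pattern = "/<a href=\"([^\"]+)\">([^<]+)<\/a>/";
$links = array();
if (preg_match_all($regex_pattern, $text, $matches, PREG_SET_ORDER)) {
    foreach ($matches as $m) {
        $links[$m[1]] = $m[2]; // URL => link text
    }
}

echo count($greedy), "\n"; // 1
print_r($links);           // /about => About, /contact => Contact[/php]

The greedy pattern finds one match that starts at the first link and ends at the last, while the negated classes return each link on its own.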

Piece of cake, right?

Thantos PHP

May 15th, 2009