Jan 27 2008
Regular Expressions in Notepad++
Like a lot of people, I found regular expressions to be strange and untameable beasts before I got to know them. Most of the people I’ve spoken to agree (granted, they are undergraduates), but it doesn’t take long once you start getting to know regular expressions to realise that they are insanely useful.
Notepad++ comes with the capability to search as well as replace using regular expressions. I was knocking together an XML file which was part of some coursework I was working on when I found out just how useful this capability can be. The file had already grown fairly large when I realised that I wanted to change the structure of it. There were some attributes that I wanted to convert to child elements, but changing each one manually would be a long and tedious job. Of course, tracking down the way that Notepad++ handled regular expressions, including back referencing, probably took as long, but it was a whole lot more fun.
How to do it
First, here’s an example piece of XML to work from.
<?xml version="1.0"?>
<library>
	<book title="The Light Fantastic" author="Terry Pratchett"/>
	<book title="The Subtle Knife" author="Phillip Pullman"/>
	<book title="Dune" author="Frank Herbert"/>
</library>
Here is what I want the end result to be.
<?xml version="1.0"?>
<library>
	<book>
		<title>The Light Fantastic</title>
		<author>Terry Pratchett</author>
	</book>
	<book>
		<title>The Subtle Knife</title>
		<author>Phillip Pullman</author>
	</book>
	<book>
		<title>Dune</title>
		<author>Frank Herbert</author>
	</book>
</library>
Of course, this is just for illustration - the real thing might be hundreds or thousands of lines long.
The first thing to do is to bring up the Replace dialogue (Ctrl + H in Notepad++) and make sure that the “Regular expression” box is checked. The regular expression to actually find each element is relatively simple. The following can be typed into the “Find” field.
<book title="([^"]*)" author="([^"]*)"/>
([^"]*) just means a sequence of zero or more characters that are not double quote characters - so basically all the text that is the value of the attributes title and author, up to the respective closing quotes. We will want to get at these two pieces of text later when we make the string that constructs the child elements in the “Replace” field. The parentheses around each one in the above code are what actually allow this.
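To see what the two parenthesised groups capture, here is a minimal sketch using Python's `re` module as a stand-in for the Notepad++ dialogue. It is only an illustration: the variable names are made up for this example, and Python's regex engine is not the one Notepad++ uses, although this particular pattern should behave the same way in both.

```python
import re

# Same pattern as typed into the "Find" field, using plain straight quotes.
FIND = r'<book title="([^"]*)" author="([^"]*)"/>'

# One line taken from the example XML above.
line = '<book title="The Light Fantastic" author="Terry Pratchett"/>'

match = re.search(FIND, line)

# group(1) is the text matched by the first pair of parentheses,
# group(2) is the text matched by the second pair.
title = match.group(1)
author = match.group(2)

print(title)
print(author)
```

Because `[^"]*` cannot match a quote character, each group stops at the closing quote of its own attribute rather than running on to the end of the line.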
<book>
	<title>{title}</title>
	<author>{author}</author>
</book>
The string required is only slightly more complex than what is seen above. Notepad++ allows \n and \t escape characters to denote newlines and tab characters respectively in the “Replace” field, so you can even specify the code to be correctly formatted. The variables {title} and {author} are specified in the replace string with what are known in regular expression land as back references. Back referencing is very simple in Notepad++ - \1 refers to the first parenthesised piece of text, \2 refers to the second and so on.
<book>\n\t\t<title>\1</title>\n\t\t<author>\2</author>\n\t</book>
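The same find and replace can be tried outside the editor. The following is a rough sketch in Python, assuming its `re.sub` function as a substitute for the Replace dialogue; the function name `attributes_to_elements` and the constants are hypothetical names for this example. In a Python replacement string, `\n`, `\t`, `\1` and `\2` happen to carry the same meaning as described above.

```python
import re

# The "Find" and "Replace" strings from above, written as raw strings so that
# the backslashes reach the regex engine untouched.
FIND = r'<book title="([^"]*)" author="([^"]*)"/>'
REPLACE = r'<book>\n\t\t<title>\1</title>\n\t\t<author>\2</author>\n\t</book>'


def attributes_to_elements(xml_text: str) -> str:
    """Rewrite every matching <book .../> element, like pressing Replace All."""
    return re.sub(FIND, REPLACE, xml_text)


# A single tab-indented line, as it would appear inside <library>.
sample = '\t<book title="Dune" author="Frank Herbert"/>'
converted = attributes_to_elements(sample)
print(converted)
```

The leading tab in front of each `<book>` element is left alone because it sits outside the match, which is why the replace string only needs `\t\t` for the children and a single `\t` before `</book>`.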
Using regular expressions to search and replace - or even just search - is immensely useful in many situations. Reformatting XML is not the only application of this, although it’s certainly a very useful one.
Useful links
- Mastering Regular Expressions, 3rd Ed (Friedl, Jeffrey E.F.) - invaluable book on regular expressions
- Notepad++ - free, simple text editor with a nice set of features
- Regular Expressions - User guide - a decent online overview
