Web development, usability and more
Writing CSS is good fun, but analyzing an HTML document to find out how the page is structured is, at the very least, tiring. If you have ever had to write CSS for a blog or CMS template, you already know how time-consuming it is to find every ID and class in a large document. We’re going to write a simple script that takes an HTML file as input and gives us two things: a CSS file with an empty rule for every ID and class it finds, and a second HTML file that shows visually how the document is structured.
We can use any scripting language to read the HTML and act accordingly. This time we’re going to use Perl, since it’s a simple and powerful language for text processing. It’s available on most operating systems, and Windows users can download it from http://activestate.com/ .
It should be possible to write the same thing in Ruby, Python, Java or any other programming language. Take it as a guideline or, if you just want to use it, grab the code.
First of all, we need an HTML document to parse. As I’ll explain later, I used the HTML code from CSS Zen Garden to test the script.
The first thing to do is read the HTML from a file. The script starts the following way:
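For readers following along in Python rather than Perl, this first step might look roughly like the sketch below. It is not the script's own code: the function name `read_html` is hypothetical, and the choice of UTF-8 with replacement of undecodable bytes is an assumption about the input files.

```python
from pathlib import Path


def read_html(path):
    """Return the whole HTML file as a single string."""
    # utf-8 with errors="replace" is an assumption: it keeps the script
    # running on templates that contain stray bytes in another encoding.
    return Path(path).read_text(encoding="utf-8", errors="replace")
```

Calling `read_html("csszengarden.html")` would return the text that the later steps search through.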
This first part opens a file, reads it and assigns the contents to a variable. Now we need to find all the IDs present in the file. To do that we could use a full-fledged HTML parser, but since many real-world documents are not valid and our needs are quite simple, there’s an easier solution: regular expressions to the rescue.
Regular expressions find patterns in text. We can tell Perl to look for something like /\w+/ and it will match a run of word characters (letters, digits and underscores). To find named elements it should probably be enough to write:
Here “id=” means exactly that, \"? is an optional quotation mark, (\w+) captures a group of word characters, then comes another optional quote, and the final “i” tells Perl to ignore case. In this particular case we’re going to use a slightly more complicated expression, to take into account possible missing quotes and so on:
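To see the difference between the two approaches, here is a small Python sketch. `SIMPLE_ID` follows the description above piece by piece; `TOLERANT_ID` is only my guess at what a more forgiving expression could look like (an assumption, not the expression used in the script), and both names are hypothetical.

```python
import re

# The simple pattern described in the text: "id=", an optional double quote,
# a captured run of word characters, another optional quote, ignoring case.
SIMPLE_ID = re.compile(r'id="?(\w+)"?', re.IGNORECASE)

# A more tolerant variant (an assumption, not the article's expression):
# \b stops the match from starting in the middle of a word such as "valid=",
# whitespace around "=" is allowed, single or double quotes are accepted,
# and hyphens may appear in the name.
TOLERANT_ID = re.compile(r'\bid\s*=\s*["\']?([\w-]+)["\']?', re.IGNORECASE)

sample = '<div id="container"><h1 ID=pageHeader><p id = \'quick-summary\'>'

print(SIMPLE_ID.findall(sample))    # → ['container', 'pageHeader']
print(TOLERANT_ID.findall(sample))  # → ['container', 'pageHeader', 'quick-summary']
```

The simple pattern misses the third element because of the spaces around the equals sign; the tolerant one picks it up.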
We’re going to write an empty CSS file. I like to use a previously written template that includes links to my CSS-reset file and some more files I always use. Then, for every named element, we need a line:
Every time we find a class, since classes are not unique, we need to filter out the repeats by creating a hash (associative array) keyed by class name. Then we do the same thing we’ve done with the IDs. Some more code:
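The same filtering idea can be sketched in Python, where a dict plays the role of the Perl hash. Everything here is illustrative: the function names, the two regular expressions and the exact shape of the generated rules (`#name {}` and `.name {}`) are assumptions, not the output format of the real script.

```python
import re

ID_RE = re.compile(r'\bid\s*=\s*["\']?([\w-]+)', re.IGNORECASE)
# Three alternatives: double-quoted, single-quoted or unquoted class values.
CLASS_RE = re.compile(
    r'\bclass\s*=\s*(?:"([^"]*)"|\'([^\']*)\'|([\w-]+))', re.IGNORECASE
)


def unique(names):
    # Dict keys are unique and keep insertion order, like a de-duplicating hash.
    return list(dict.fromkeys(names))


def css_skeleton(html):
    """Return one empty rule per ID and per distinct class found in html."""
    ids = unique(ID_RE.findall(html))
    classes = []
    for double_quoted, single_quoted, bare in CLASS_RE.findall(html):
        # A class attribute may hold several space-separated names.
        classes.extend((double_quoted or single_quoted or bare).split())
    rules = [f"#{name} {{}}" for name in ids]
    rules += [f".{name} {{}}" for name in unique(classes)]
    return "\n".join(rules) + "\n"


page = '<div id="container" class="box wide"><p class="box">x</p><span class=note id=intro>'
print(css_skeleton(page))
```

Here the class `box` appears twice in the markup but only once in the generated CSS.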
With all that code, we already have a CSS template. You could start editing it right now to suit your needs. But we don’t know yet what the structure is. Why don’t we create a new HTML file that shows us how the document is laid out? OK, let’s write some more code.
Here is where the fun begins. I’ll probably need to write another post about this piece of code, since there are a couple of sleights of hand in it. Briefly, it removes all the stylesheets and the inline CSS rules, then takes all the elements in the DOM, finds the ones that have an ID, and adds a P element with the ID name to each one.
As you can see, there are only two functions. The first one, removeSheets(), disables every stylesheet included from the HTML and then calls the other one, findNamedElements(). This function gets all the elements in the DOM (getElementsByTagName('*')) and iterates over them, doing a couple of things:
Then we create a new stylesheet only for that class. We have to code this differently for Internet Explorer and for Firefox, Opera and other browsers because of compatibility issues.
Last, everything is started by calling removeSheets(). You can even save the code as a bookmarklet and apply it to any website. It’s truly portable.
When you run the full script, another HTML file is generated that shows how the document is laid out.
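If you would rather not depend on JavaScript, a rough static approximation of the same idea can be sketched in Python. Note the swap in technique: this works with regular expressions on the text of the file, not on the browser's DOM as the functions described above do. The function name `label_structure` and the class name `structure-label` are hypothetical, and the sketch makes no attempt to skip elements that cannot legally contain a P.

```python
import re

# Stylesheet links, <style> blocks and inline style attributes.
LINK_RE = re.compile(
    r'<link\b[^>]*\brel\s*=\s*["\']?stylesheet[^>]*>', re.IGNORECASE
)
STYLE_BLOCK_RE = re.compile(r'<style\b.*?</style\s*>', re.IGNORECASE | re.DOTALL)
INLINE_STYLE_RE = re.compile(
    r'\sstyle\s*=\s*(?:"[^"]*"|\'[^\']*\')', re.IGNORECASE
)
# An opening tag with an id attribute: group 1 is the whole tag, group 2 the id.
TAG_WITH_ID_RE = re.compile(
    r'(<[a-z][^>]*\bid\s*=\s*["\']?([\w-]+)[^>]*>)', re.IGNORECASE
)


def label_structure(html):
    """Strip the styling and put a <p> with the id inside each named element."""
    html = LINK_RE.sub("", html)
    html = STYLE_BLOCK_RE.sub("", html)
    html = INLINE_STYLE_RE.sub("", html)
    # Insert the label right after every opening tag that carries an id.
    return TAG_WITH_ID_RE.sub(r'\1<p class="structure-label">\2</p>', html)


page = (
    '<html><head><link rel="stylesheet" href="a.css">'
    '<style>p{color:red}</style></head>'
    '<body><div id="intro" style="color:red"><p>Hi</p></div></body></html>'
)
print(label_structure(page))
```

The result still needs a small stylesheet for the label class to make the boxes visible, which is left out here.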
The best example I can think of is CSS Zen Garden, a site where designers are given a default layout to skin as well as they can. I downloaded the HTML code (CC license, by-nc-sa) and ran the script on it:
$> perl stylizator.pl csszengarden.html
And I got a csszengarden.css file, with all the named elements, and a csszengarden.structure.html file with the visual structure of the site.
Download this file, unzip it in the same folder as your HTML file and run it with: perl stylizator.pl yourhtmlfile.html
It’ll create a file named yourhtmlfile.css and another one called yourhtmlfile.structure.html.
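The naming rule is easy to reproduce if you port the script. A minimal Python sketch, assuming the outputs go next to the input file (the function name `output_paths` is hypothetical):

```python
from pathlib import Path


def output_paths(html_path):
    """Derive the two output file names from the input file name."""
    source = Path(html_path)
    # source.stem is the file name without its last extension.
    css_path = source.with_name(source.stem + ".css")
    structure_path = source.with_name(source.stem + ".structure.html")
    return css_path, structure_path


css_path, structure_path = output_paths("csszengarden.html")
print(css_path.name)        # → csszengarden.css
print(structure_path.name)  # → csszengarden.structure.html
```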
There’s still a lot to be done. It should be possible to create a Firefox extension that gives us the same visual result at the press of a button, or a system that keeps only the structural CSS, discarding typography and colors. The regular expressions could be better, too, to cope with not-so-well-formed documents. If you have any further questions, drop me a line at firstname.lastname@example.org.