If you edit your HTML files by hand, you can simply avoid using the kinds of HTML that you don't want. The solution is not so simple if you use web authoring software or word processing programs to create HTML files. Such tools may inject unwanted HTML that you then have to remove by hand.
Likewise, if you publish HTML files contributed by others, you may have to edit them to remove unwanted tags before publishing them. Policing your HTML manually is tedious, time-consuming, and error-prone.
Why not let your computer do the grunt work?
This website offers a free tool, html_scrub, in the form of C source code. By supplying a small configuration file, you can tell html_scrub which tags and attributes to keep, which to discard, and which to warn you about.
html_scrub [ -f config_file ] [ input_file ]
The -f option specifies the name of a configuration file. If the
command line does not specify a configuration file, html_scrub
looks for a configuration file in several default locations, as
described below.
After specifying the configuration file, if any, the command line may specify the name of the input file. If no input file is specified, html_scrub reads standard input.
The output HTML is written to standard output. Error messages are written to standard error.
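The invocation forms described above can be sketched as two small shell functions. This is only an illustration: the function names scrub_file and scrub_stream are hypothetical, and so are any file names you pass to them. The only things taken from html_scrub itself are the -f option, the optional input file, and the use of standard input, standard output, and standard error.

```shell
# scrub_file CONFIG INPUT OUTPUT LOG   (hypothetical helper, not part of html_scrub)
# Scrub INPUT according to CONFIG. The cleaned HTML goes to OUTPUT and any
# error or warning messages go to LOG.
scrub_file() {
    html_scrub -f "$1" "$2" > "$3" 2> "$4"
}

# scrub_stream CONFIG   (hypothetical helper, not part of html_scrub)
# With no input file named, html_scrub reads standard input, so it can sit
# in the middle of a pipeline. Messages still go to standard error.
scrub_stream() {
    html_scrub -f "$1"
}
```

For example, "scrub_file my.cfg draft.html clean.html scrub.log" should leave the cleaned page in clean.html and any messages in scrub.log, while "scrub_stream my.cfg < draft.html > clean.html" does the same job through redirection.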
<HEAD> : keep
<BODY> : keep
<APPLET> : warn
# FONT tags may contain browser-specific barbarisms
<FONT>
{
attribute FACE : drop # Don't require browser to have
# specific fonts available
default : keep
}
<META>
{
attribute HTTP-EQUIV : warn # be wary of Refresh
}
<EM> : drop
<SCRIPT> : drop all # Don't want any scripts
default : drop
comment : warn
Given this configuration file, html_scrub will do the following:
The "keep" action retains the tag or attribute unchanged. The "warn" action does the same, but also issues a warning message to standard error, so that you can review the HTML manually.
The "drop" action, when applied to a tag, eliminates the start tag and the corresponding end tag. It does not affect anything between the start and end tags.
The "drop" action, when applied to an attribute, eliminates both the attribute and the associated value.
The "drop all" action applies only to tags. It eliminates the specified start tag, the corresponding end tag, and everything in between.
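To see the difference between "drop" and "drop all" for yourself, something like the following sketch may help. The function name demo_drop, the scratch file names, and the one-line sample page are all hypothetical. The configuration uses only rule forms that appear in the sample configuration file.

```shell
# demo_drop: hypothetical helper, not part of html_scrub.
# Writes a four-rule configuration and a one-line page into a scratch
# directory, runs html_scrub on them, and then cleans up.
demo_drop() {
    dir=$(mktemp -d) || return 1
    cat > "$dir/demo.cfg" <<'EOF'
<P> : keep
<EM> : drop
<SCRIPT> : drop all
default : drop
EOF
    printf '%s\n' '<P>An <EM>important</EM> note.<SCRIPT>alert(1)</SCRIPT></P>' \
        > "$dir/demo.html"
    html_scrub -f "$dir/demo.cfg" "$dir/demo.html"   # result goes to standard output
    status=$?
    rm -rf "$dir"
    return "$status"
}
```

Going by the descriptions above, the output should keep the P tags and the word "important", lose the EM start and end tags, and lose the SCRIPT element together with everything inside it. The exact spacing of the output is not something this sketch can promise.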
Under Linux or UNIX, or any other system that provides a make utility, put all the files into the current working directory and enter the make command. You may have to tinker with the Makefile a bit to change the name of the compiler, or the compiler options. You will almost certainly need root privileges to install the resulting executable into /usr/bin, /usr/local/bin, or some other directory in your path.
If your system doesn't provide a make utility, then you'll have
to do whatever it takes to compile the C files and link them.
No additional libraries are needed beyond the Standard C libraries.
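On a system with a C compiler but no make utility, a single compile-and-link command may be enough. The sketch below rests on two assumptions that the distribution does not confirm: that every .c file in the current directory belongs to html_scrub, and that the compiler can be invoked as cc (set CC to use a different one). The function name build_html_scrub is hypothetical.

```shell
# build_html_scrub: hypothetical helper for systems without make.
# Assumes every .c file in the current directory is part of html_scrub.
# The compiler defaults to "cc"; override it with, for example, CC=gcc.
build_html_scrub() {
    ${CC:-cc} -o html_scrub ./*.c
}
```

Run it from the directory that holds the extracted source files. If it succeeds, the executable html_scrub appears in that same directory.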
Bugs
If the input HTML includes a scripting language such as JavaScript,
and the script is outside of an HTML comment ("<!--...-->"),
and the script contains "</" (probably within
a comment or string literal), html_scrub will interpret these
characters as the beginning of an end tag, and get terribly confused.
This combination of characters is unlikely to occur in practice.
If you don't want to take that chance, then code
"<SCRIPT> : warn" in your configuration file so that you can
edit any scripts manually. If you encounter this problem and
cannot fix it by a trivial change to the script, then enclose the
script within an HTML comment, where html_scrub will do a better job
of interpreting the syntax. It is good practice to enclose
scripts within HTML comments anyway.
A true, proper, and correct fix will not be simple, because it will
require some parsing of the scripting language, and the rules may be
different for different languages.
Download
For Windows:
html_scrub_1_3.zip
Extract the files with WinZip or a similar utility.
For Linux or UNIX:
html_scrub_1_3.tar.gz
Extract the files with the following commands:
gunzip html_scrub_1_3.tar.gz
tar -xvf html_scrub_1_3.tar
These two archives contain identical source code, except that
the Windows version uses carriage return/line feeds to terminate
the lines, while the Linux/UNIX version uses line feeds.
hscrub.exe
This is an MS-DOS executable, compiled with an ancient creaking
copy of Borland 3.0. The name has been shortened to fit within
the old MS-DOS naming limits. If you can't compile html_scrub
for yourself, or you don't want to bother, and you trust me not to
plague you with viruses, then download it and run it from the DOS
prompt or a .bat file.
Subscribe
To be notified of updates, subscribe to html_scrub at
Freshmeat.
Scott McKellar