Last update: [15-May-1999]
Writing HTML files without a validating parser is like trying to write computer programs without a compiler: don't do it! Fortunately, help is readily available on the Internet.
James Clark <jjc@jclark.com> is developing a new implementation of a suite of SGML parser tools, called SP. These include:
nsgmls: an sgmls-compatible validating SGML parser.
spam
sgmlnorm
spent
sx
Besides being a complete redesign of the earlier successful sgmls implementation, the new programs are designed for the future: they support extended character sets, such as Unicode, and various multi-byte encodings used in East Asian languages.
For a textbook treatment of the use of the SP system, see
@String{pub-PH     = "Pren{\-}tice-Hall"}
@String{pub-PH:adr = "Englewood Cliffs, NJ 07632, USA"}

@Book{McGrath:1997:SS,
  author =          "Sean McGrath",
  title =           "ParseMe.1st: {SGML} for Software Developers",
  publisher =       pub-PH,
  address =         pub-PH:adr,
  pages =           "xxiii + 341",
  month =           jan,
  year =            "1998",
  ISBN =            "0-13-488967-3",
  LCCN =            "QA76.76.H94M388 1998",
  bibdate =         "Sat Jan 4 12:20:44 MST 1997",
  price =           "US\$33.75",
  acknowledgement = ack-nhfb,
  keywords =        "SGML (document markup language)",
  xxprice =         "US\$55.00",
}
The new code is written almost entirely in C++ (nearly 78K lines at version 1.3, or about 4 times the size of Don Knuth's TeX or Metafont), and requires template support, a relatively new feature of C++ which is not yet widely available. [An ANSI/ISO Standard for C++, ISO/IEC 14882:1998 Programming languages -- C++, was finally adopted in 1998, and by mid-1999, a few UNIX vendors claimed conformance to that Standard.]
WARNING: To build these programs, you will need about 50MB of disk space, unless you remove the default -g compiler option. Doing so reduces the executable sizes from almost 10MB each to about 1.5MB (on a Sun SPARC Solaris 2.3 system). Alternatively, you can build them, then run the UNIX strip command on the executables to remove debug symbols.
The SP programs can be compiled and built using recent releases of GNU g++ and libg++ (2.7.1 or later), or better, the newer, and more easily buildable, Cygnus egcs development releases at ftp://egcs.cygnus.com/pub/egcs/releases/. g++ itself is built as part of the GNU gcc compiler installation; although that installation takes a few hours, and requires about 120MB of disk space to be able to run the validation tests before installation, it is straightforward, and should be problem free on most current UNIX systems. The GNU compiler suite has also been built on IBM PC MS-DOS and DEC OpenVMS systems, although those versions usually lag behind.
The SP distribution site has binaries for SP version 1.3 for IBM PC DOS, and Windows 95 and Windows NT.
Binaries for older versions are available for Intel 386 Linux, Sun Solaris 2.5, and DEC Alpha OSF/1 3.2.
Just as with sgmls, lengthy command lines are needed to run these programs successfully. To facilitate their use, I've prepared simple UNIX shell scripts html-ncheck and html-spam to hide the complexity, so that only the HTML files need to be provided on the script command lines.
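The idea behind such a wrapper can be sketched as follows. This is not the actual html-ncheck script: the function names and the SGML_CATALOG variable are hypothetical, and the default catalog path is only an assumption based on where the html-check distribution keeps its catalog. The nsgmls options used are -s (suppress normal output, so that only error messages appear) and -c (name a catalog file).

```shell
#!/bin/sh
# Sketch of a wrapper in the spirit of html-ncheck (NOT the real script).
# Assumption: the catalog lives where the html-check distribution puts it;
# the hypothetical SGML_CATALOG environment variable overrides that path.
CATALOG=${SGML_CATALOG:-/usr/local/lib/html-check/lib/catalog}

# Print the nsgmls command line that would validate one HTML file.
#   -s  suppress normal output, so only error messages are shown
#   -c  use the named catalog file to resolve public identifiers
ncheck_command() {
    printf 'nsgmls -s -c %s %s\n' "$CATALOG" "$1"
}

# Validate every file given as an argument.
# Returns nonzero if nsgmls reports a failure for any of them.
ncheck_files() {
    status=0
    for f in "$@"; do
        nsgmls -s -c "$CATALOG" "$f" || status=1
    done
    return $status
}
```

With these definitions loaded, ncheck_files index.html news.html would check both files, and ncheck_command index.html shows the long command line that the wrapper hides.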
If you have installed the html-check distribution, and you want to use html-spam, you need to add these lines to the end of the HTML catalog file, /usr/local/lib/html-check/lib/catalog:
-- Added at the suggestion of James Clark <jjc@jclark.com> --
-- so that spam -p doesn't output the contents of html.decl --
SGMLDECL html.decl
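If you prefer to script the catalog change, a minimal sketch follows. The helper name add_sgmldecl is hypothetical and is not part of SP or html-check; it appends the three lines only when the catalog has no SGMLDECL entry yet, so running it twice is harmless.

```shell
#!/bin/sh
# Append the SGMLDECL entry to a catalog file unless one is already present.
# add_sgmldecl is a hypothetical helper name, not part of SP or html-check.
add_sgmldecl() {
    catalog=$1
    if grep -q '^SGMLDECL' "$catalog" 2>/dev/null; then
        return 0    # an SGMLDECL entry exists; leave the file untouched
    fi
    # Quoted 'EOF' keeps the text literal (no variable expansion).
    cat >>"$catalog" <<'EOF'
-- Added at the suggestion of James Clark <jjc@jclark.com> --
-- so that spam -p doesn't output the contents of html.decl --
SGMLDECL html.decl
EOF
}
```

A typical call, assuming the default installation path, would be add_sgmldecl /usr/local/lib/html-check/lib/catalog, run by a user with write permission on that file.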
Without this change, the contents of html.decl are copied to the output if the -p option is included in the spam invocation in html-spam; omitting -p and including html.decl doesn't help, because the <!DOCTYPE ... > line is then lost.
I have successfully built sp-1.3 with g++ (gcc 2.8.1 [2-Mar-1998], or gcc version egcs-2.91.66 from the egcs-1.1.2 source release) on these systems:
using the command
make && make check && make install
On a few of these, minor problems cropped up and were solved; they are discussed further below.
I also made unsuccessful attempts to build SP with native C++ compilers on Hewlett-Packard HP-UX 10.0.1 and Silicon Graphics IRIX 5.3, with a command line like
make CXX=CC CXXFLAGS=-O DEFINES='-DANSI_CLASS_INST $(XDEFINES)'
Numerous compiler errors quickly led to my abandoning the effort.
Compilation with native Sun Solaris 2.3 CC looked initially promising, but linking failed with errors about differing sizes of particular symbols, and with many missing functions arising from template instantiation. This linking problem is just what I found with SP 0.4 on the IBM RS/6000 AIX 3.2.5 systems too.
Mail from Michael Riedmann <Michael_Riedmann@hp.com> at Hewlett-Packard GmbH in Böblingen, Germany on 12 May 1998 reported a successful build of SP version 1.3 on HP-UX 10.20 with g++ version 2.7.2.3, after installing HP patch PHKL_8693 to fix a problem with a non-ANSI extern struct declaration in /usr/include/sys/time.h.
The function set_new_handler() came up undefined when the code was compiled with egcs-2.91.66. Running make LIBS=-L/usr/lib linked against the system version of the C++ library instead, and resolved the problem.
Because GNU/Linux systems are notorious for problems from incompatible versions of shared libraries, I prepared a separate Linux-i686-2.0.35-libs.tar.gz file containing the four libraries needed by the SP executables:
ld-linux.so.2 libc.so.6 libm.so.6 libstdc++.so.2.8
If they prove necessary on your system, you can install those you need in /usr/local/lib, or any other convenient place. If the chosen directory is not already listed in /etc/ld.so.conf, add it, then, as root, run the command ldconfig to update the run-time linker's cache of directories. If you cannot do this, then you can just add the directory to the LD_LIBRARY_PATH environment variable, which the run-time linker consults (LD_RUN_PATH takes effect only when an executable is linked).
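The check described above can be sketched in shell. The helper names dir_in_ldconf and ldconf_advice are hypothetical, and the sketch looks only for an exact matching line in the named file; it does not follow any include directives that a particular system's /etc/ld.so.conf may contain.

```shell
#!/bin/sh
# Report whether a directory is already listed in a linker configuration file.
# dir_in_ldconf is a hypothetical helper; the file defaults to /etc/ld.so.conf.
dir_in_ldconf() {
    dir=$1
    conf=${2:-/etc/ld.so.conf}
    # -F fixed string, -x whole-line match, -q no output (exit status only)
    grep -Fxq "$dir" "$conf" 2>/dev/null
}

# Print the steps still needed for a given library directory.
ldconf_advice() {
    if dir_in_ldconf "$1" "$2"; then
        echo "listed: run ldconfig as root to refresh the cache"
    else
        echo "not listed: add $1 to ${2:-/etc/ld.so.conf}, then run ldconfig as root"
    fi
}
```

For example, ldconf_advice /usr/local/lib reports whether that directory still needs to be added before ldconfig is run.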
I modified the top-level SP Makefile to set RANLIB=ranlib. The build of SP then completed successfully, and make check passed all of the validation tests.