Introduction to CGI
A Hands-on How-to (SM) from Brass Cannon Consulting
A little vague handwaving can often save hours of tedious explanation.
Sometimes the toughest thing about learning something new and useful is figuring out what other people call it. You may have wanted to do a small project like the one I'm about to describe, and you would have been able to do it long before this, if only you had known what search terms to type into Google to find the tools.
Plain old HTML is very easy to learn; you just add some tags to plain text, and a browser presents it as a nicely formatted document with links and pictures. The next step is to interact with your pages -- and that's where CGI comes in.
CGI stands for Common Gateway Interface. It's "Common," because it doesn't care what kind of webserver or browser you're using. CGI is not a programming language. It's just a protocol: an agreement that "this is how we're going to do this." Normally, a webserver is only interested in one piece of information from you -- "What page do you want me to show you?" CGI changes that by letting you give the server a little extra information, which can change the page as you get it from the server.
For instance, I have a small picture gallery. There is a single "thumbnail" image, made with an old commercial program by the Paint Shop Pro people (Image Commander? something like that.) I used a free program called MapThis! to make each thumbnail into a link to another file, the full size picture. Now, each of those links could have pointed straight at the picture file, as in <a href="picturefile">; most browsers will happily open an image file presented that way. But it would have been ugly. I really wanted a nicer presentation: center the image, set a background of some kind, perhaps add a title or some copyright info.
I could do that by creating a separate web page, one for each image; and if that seemed like too much work, I could simplify things by generating those pages from a template. But even if I used a batch process to generate all those .html files, it would still mean a lot of clutter and bother.
Why not write a single page that uses a variable in its <img src="picturefile"> tag? Hey, that sounds like a nice, clean approach -- that's what computer people mean when they say a solution is "elegant." So that's what I did. As it turns out, what I wanted was "a CGI script to accept a parameter and generate dynamic HTML."
The thing about Unix (or Linux) programs, and by extension all web browsers: they don't care whether you feed them a file that you prepared weeks in advance, or a stream of bytes you are creating 'on the fly' by running a program -- it's still just ones and zeros. That latter trick is what "dynamic HTML" is all about. A "cgi-bin script" is a computer program that creates an HTML page as its output. (Don't be intimidated by the thought of "computer programming." As programs go, these CGI scripts are pretty small and simple.) The fact that you are creating the HTML as the output of a program (instead of reading a file that was prepared in advance) doesn't matter.
Because CGI scripts are programs that run on a web server, they have the power to "do things" there. For that reason, it's a good idea to keep them separate from your plain HTML content. CGI is an agreement: we agree that we are going to put some files into a special place, or give them a special name (both ways of identifying a CGI file are valid), and the server is going to be allowed to run those files rather than simply copying them to your waiting browser.
A script can do things that plain text can't do. That's risky. You only want your CGI processes to do things that you expect and control. The first step in keeping control is to know where your scripts live. When you set up the Apache webserver, you tell it whether you want it to run any CGI programs at all, and if so, how you expect it to find them. For example, for an Apache setup where the static html content lives in /var/www/html, the CGI content might be in /var/www/cgi-bin. Things that reside in the cgi-bin directory are treated differently than things elsewhere. You also have the option of telling your server that any file that ends with a certain file type (.cgi for instance, or .pl, or .php) is to be run as a CGI script, not just "served."
If you have not made these settings in the configuration file for your Web server, it simply will not "play the CGI game" with you. When you upload a Perl script to your website and try to access it, you'll see the text of your script instead of its output.
This is about as simple as a CGI script can get:
#!/usr/bin/perl
print "Content-type: text/html\n\n";
print qq~<html><body>Hello, world!</body></html>~;
This script does not accept any input. You can save it (as "hello.pl") and run it from the command line on your own computer (by typing "perl < hello.pl"). It should put this output onto your screen:
Content-type: text/html

<html><body>Hello, world!</body></html>
A CGI is an executable script or program whose output looks like valid HTML content.
So, why all the fuss? Why didn't we just type that silly one-line web page instead of wrapping it up in all that Perl code?
Remember, things in a cgi-bin directory are treated differently. If we had put that one-line page into our cgi-bin directory as "hello.html" instead of "hello.pl", Apache would not simply display it. It would try to execute it, and the server's operating system would find it very distasteful -- it would give you the computer equivalent of "I don't understand this at all!"
When you want to put a plain static page into your cgi-bin directory, it needs some sort of executable wrapper around it. The example above is a good start. It doesn't have to be Perl -- it could be PHP, or even a shell script (not a good idea -- it's too much like leaving your car keys in the ignition). For heavy-duty CGI, you might even build or buy a compiled C program.
The important thing is that a CGI is an executable. That means it is in a format that your server operating system recognizes, whether as a compiled program or "binary," or a script -- a text file -- that is written in a language for which you have an interpreter.
A CGI:
To satisfy that last requirement, the first line of output from the CGI will almost always be a "Content-type" line. If you provide that, then the output of a CGI can be almost any valid MIME type, even a binary file; but for now we are going to limit ourselves to text in HTML format, produced by simple Perl scripts.
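As a minimal sketch of that point, the helper below puts the header line, the blank line that must follow it, and the body together in one string. The name cgi_response is made up for this illustration; it is not part of Perl or of the scripts on this page. The same pattern serves plain text as easily as HTML.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not from the original article): returns the complete
# CGI output for a given MIME type. The "Content-type" line must be followed
# by one blank line before the body starts.
sub cgi_response {
    my ($type, $body) = @_;
    return "Content-type: $type\n\n" . $body;
}

# Send a plain-text page instead of an HTML one.
print cgi_response("text/plain", "Hello, world!\n");
```

Swapping "text/plain" for "text/html" and passing an HTML string gives the same output as the hello.pl example.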
Here's my very first "real" CGI program, thumb.cgi:
#!/usr/bin/perl
@params=split(/=/,$ENV{'QUERY_STRING'});
print "Content-type: text/html\n\n";
print qq~<html>
<body bgcolor=black>
<hr>
<center>
<img src=/images/$params[0].jpg><br>
<font color=gray><h6>© 2002 Kevin Martin</h6></font>
<hr>
<a href=/cgi101.html#example><img src="/whitecannon.gif" border=0
alt="Return"></a>
</center>
</body></html>~;
Okay, what is going on here?
First of all, this program is written in the programming language Perl. Perl is one of the easiest and most widely supported ways to create a CGI program or script. A shell script is a "bad idea" because shells are very big. They tend to be "sloppy" and offer far too many opportunities to break down security barriers. Perl is much better in all these respects, and PHP is better still, in my humble opinion.
Second, we have now added the ability to accept a value as input. This is the part that makes it more than just a static page. It's the second line, where Perl accepts a value from its runtime "environment" called QUERY_STRING.
Our CGI script starts running when the Apache server processes a request that points to its URL, its Uniform Resource Locator. That URL would look something like "http://handsonhowto.com/cgi-bin/thumb.cgi" (There is a reason this is not a link, and PLEASE don't type it in until you've read the rest of this page.) Then comes a "delimiter" (a separator character), usually a question mark "?". The delimiter marks the beginning of our QUERY_STRING. So, someone could run this CGI by entering this URL:
http://handsonhowto.com/cgi-bin/thumb.cgi?circle
and the QUERY_STRING will contain the value "circle".
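If you would like to watch the QUERY_STRING arrive without involving a web server, a small test script helps. This is only a sketch: first_param is a made-up helper that mirrors the split on "=" used in thumb.cgi, and the file name query.pl in the comment is an assumption.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: mirrors the split on "=" used in thumb.cgi and
# returns the first piece of the query string ("" if there is none).
sub first_param {
    my ($query) = @_;
    return "" unless defined $query;
    my @params = split(/=/, $query);
    return defined $params[0] ? $params[0] : "";
}

# The web server sets QUERY_STRING for us. From a Unix shell you can
# set it yourself for a quick test:
#   QUERY_STRING=circle perl query.pl
my $picture = first_param($ENV{'QUERY_STRING'});
print "Content-type: text/plain\n\n";
print "QUERY_STRING asked for: $picture\n";
```

Run that way from the command line, the script sees exactly what it would see if a browser had requested the URL ending in "?circle".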
Now that we have a parameter, also known as a variable, the rest of thumb.cgi is simply a matter of presenting a stream of HTML data to the browser. The Perl "qq" operator lets us choose our own quoting character, here the tilde (~), so we can use double quotes freely inside the HTML without confusing the script. Variables such as $params[0] are still filled in inside a qq string, just as they would be between ordinary double quotes.
There is a little trick you need to know to make a CGI work successfully. We are taking responsibility for all of the stuff that the server would normally send to the user's browser, including the MIME content header, which tells the browser that it is about to receive some HTML. So, the first thing we send is "Content-type: text/html" and a couple of line breaks. Once we've taken care of that, it's just a matter of writing out exactly the same text we would use in a static, handwritten web page. We use $params[0] in our "img src" tag, and the result should be a nicely centered picture on a black background, with horizontal rules above and below, and my copyright notice centered below it.
There's one thing that makes this CGI example not quite "ready for prime time," though. I'm not checking the incoming input well enough. You'll note that I did hard-code the directory and the file extension, so only the name portion of the filename comes in as a variable. But that assumes that only my index.html page is calling this CGI. There is nothing to keep someone unfriendly from studying my source code and then going directly to the CGI. Worse, they don't have to type in a valid file name; they could type in tricky file syntax strings like "../../../.." to "traverse" up to my server's root directory, and try to list out password files or other things I'd just as soon they not see.
One thing I can do is to ruthlessly filter the QUERY_STRING to strip out those /.. characters, and any other characters that I know don't appear in my filenames. I know that all of my filenames are purely alphanumeric, so I can reject any input that does not consist purely of letters and numbers. (This is a non-trivial thing to do, though, because browsers that have to deal with international alphabets have a lot of ways of "encoding" special strings. One of the vulnerabilities of Microsoft IIS is that it would accept "Unicode" characters as an alternative to characters that the web administrator thought he had blocked, and turn them back into dangerous values.)
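The strict letters-and-numbers test described above can be sketched in a few lines. This is a standalone illustration under stated assumptions, not the code used in the gallery; clean_name is a made-up helper name.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: accept a picture name only if it is made up
# entirely of ASCII letters and digits; return undef for anything else.
sub clean_name {
    my ($raw) = @_;
    return undef unless defined $raw;
    # \A and \z anchor the match to the whole string, so a trailing
    # newline or an embedded "../" cannot slip through.
    return $raw =~ /\A([A-Za-z0-9]+)\z/ ? $1 : undef;
}

# Try a good name, a traversal attempt, an encoded string, and an empty one.
for my $candidate ("circle", "../../etc/passwd", "circle%2e%2e", "") {
    my $ok = clean_name($candidate);
    print defined $ok ? "accepted: $ok\n" : "rejected: $candidate\n";
}
```

Because the percent sign is not a letter or a digit, a test like this turns away URL-encoded input before any decoding happens, which sidesteps the encoding tricks mentioned above. The price is that it only suits sites whose filenames really are purely alphanumeric.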
So, here's a slightly more "robust" version of thumb.cgi.
#!/usr/bin/perl
@params=split(/=/,$ENV{'QUERY_STRING'});
my $p0=$params[0];
print "Content-type: text/html\n\n";
## Handle URLencodings:
$p0 =~ tr/+/ /;
$p0 =~ s/%([a-fA-F0-9][a-fA-F0-9])/pack("C",hex($1))/eg;
$p0 =~ s/,/ /g;
## Eat /.. attacks (repeat until none are left):
1 while $p0 =~ s/\/\.\.//g;
## Kill Unix shell escape characters:
$p0 =~ s/([;<>\*\|'\$!#\(\)\[\]\{\}:'"])/\\$1/g;
## Fix up for embedded spaces in filenames:
$p0 =~ s/ /\ /;
my $DD="/opt/www/images";
my $file="$DD/$p0.gif";
if (!(-f $file)) {
print ("<html><body bgcolor=red>");
print ("<h2>WARNING</h2>");
print ("<h4>Attempts to access this site through\n\r");
print ("unauthorized means are logged.</h4>");
die "Hack attempt: $params[0]";
}
print qq~<html>
<body bgcolor=black>
<hr>
<center>
<img src=/images/$p0.gif><br>
<font color=gray><h6>© 2002 Kevin Martin</h6></font>
<hr>
<a href=/cgi101.html#example><img src="/whitecannon.gif" border=0
alt="Return"></a>
</center>
</body></html>~;
The "Handle URL encodings" section covers file names that have embedded spaces or other strange punctuation -- the user's browser will automatically send these characters as three-character codes (a space becomes %20, for example). This Perl code changes those three-character codes back into single ASCII characters.
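The decoding step is easy to try on its own. The sketch below wraps the same two substitutions in a sub so they can be called with test strings; url_decode is a name made up for this illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: turn "+" back into a space, then turn each
# %XX code back into the single character it stands for.
sub url_decode {
    my ($text) = @_;
    $text =~ tr/+/ /;
    $text =~ s/%([a-fA-F0-9][a-fA-F0-9])/pack("C", hex($1))/eg;
    return $text;
}

print url_decode("my%20star+chart"), "\n";   # prints "my star chart"
print url_decode("%2E%2E%2F"), "\n";         # prints "../"
```

The second line of output shows why the decoding has to happen before any filtering: an encoded "../" looks harmless until it has been decoded.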
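Next, the script "eats" any instances of /.. -- these are attempts to climb out of the art gallery directory into where my system files live. We mask out or "escape" any characters that have special meaning to the Unix shell by putting a backslash in front of each one. The last substitution is meant to handle the slim possibility that I might have a valid file with an embedded space in its name, by changing the space to the form the Unix shell expects ("\ "). (As written, Perl reads "\ " in the replacement as a plain space, so that line leaves the name unchanged. That is harmless here, because the file test that follows never goes through a shell.)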
Finally, and most important, we stop assuming that only valid filenames from my index.html page will be submitted to the CGI. Instead, we check that the edited QUERY_STRING value points to an existing file somewhere in the art gallery directory. If it doesn't, we record the failed attempt into Apache's error log and exit the script. That's why I asked you not to type it in without a valid file name -- if you did, you would get a nasty red warning page.
If the filename is valid, of course, everything runs pretty much as it did in the original version. If you copy this code, you'll have to fiddle a bit with the locations of things. One thing that might be confusing is that the running cgi code has a different view of things than the web server does. Note that while the cgi code is running, it is using a real system path to look for files (it looks in "/opt/www/images") but once we have found a valid image, we have to change to the webserver's point of view. The webserver sees this directory as /images, meaning "the images directory under my document root." Even though the images reside in "/opt/www/images", the webserver sees that as "/images".
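The two points of view can be kept side by side in a couple of variables. The sketch below assumes the same layout as the script above ("/opt/www/images" on disk, "/images" as seen by the web server, .gif files); image_paths is a made-up helper name.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $IMAGE_DIR = "/opt/www/images";   # where the files live on disk
my $IMAGE_URL = "/images";           # how the web server exposes that directory

# Hypothetical helper: for one picture name, return the path the script
# should test on disk and the path the browser should be told to fetch.
sub image_paths {
    my ($name) = @_;
    return ("$IMAGE_DIR/$name.gif", "$IMAGE_URL/$name.gif");
}

my ($file, $url) = image_paths("circle");
print "check on disk:   $file\n";
print "send to browser: $url\n";
```

If you move the gallery, only the two variables at the top need to change.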
So we have a CGI script. How do we use it? It's just like any other hypertext link. It needs to have a valid parameter value, but that's no worse than the requirement that you spell your domain name correctly.
So, we could have a plain text link that looks like this:
<a href="http://handsonhowto.com/cgi-bin/thumb.cgi?circle">Circle</a>
This is the URL of a plain (or "static") Web page:
"http://example.com/index.html" (plain html)
This example calls up a dynamic page with two parameters:
"http://example.com/cgi-bin/index.cgi?p1=apple&p2=banana" (CGI)
The question mark is the standard CGI separator between the URL and the parameter list. Within that list, parameters are conventionally written as name=value pairs separated by the ampersand character, '&'. The Perl "split" command lets us break the list apart on that character, or on a delimiter character of our own choosing.
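Here is one way the two-parameter example could be taken apart with split. Treat it as a sketch: parse_query is a made-up helper, and it does not undo URL encoding or cope with a name that appears twice.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: break "p1=apple&p2=banana" into a hash of
# name => value pairs. A name with no "=" gets an empty value.
sub parse_query {
    my ($query) = @_;
    my %value_of;
    return %value_of unless defined $query;
    for my $pair (split /&/, $query) {
        # Limit of 2 keeps any later "=" signs inside the value.
        my ($name, $value) = split /=/, $pair, 2;
        next unless defined $name && length $name;
        $value_of{$name} = defined $value ? $value : "";
    }
    return %value_of;
}

my %param = parse_query("p1=apple&p2=banana");
print "$_ = $param{$_}\n" for sort keys %param;
```

In a real CGI script the string would come from $ENV{'QUERY_STRING'} rather than being typed into the program.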
But since you've stayed with me this far, let me reward you with a bonus project. It follows logically from our picture display CGI, but it doesn't include any CGI code.
It's pretty common now to have a block of small "thumbnail" pictures as an easy to use catalog. Have you ever wondered how that works? Let's make one of these "client side image maps" and use it to access our CGI page.
Note: If you're looking for a way to generate thumbnail pictures automatically, check out Gallery at SourceForge, or the PHP equivalent of this CGI example, which includes a helper script that uses ImageMagick.
<img src=thumb.gif border=0 usemap=#thumbnail>
<MAP NAME="thumbnail">
<!-- #$-:Image Map file created by Map THIS! -->
<AREA SHAPE=RECT COORDS="30,14,104,111" HREF="http://handsonhowto.com/cgi-bin/thumb.cgi?star" ALT="Star">
<AREA SHAPE=RECT COORDS="129,14,204,111" HREF="http://handsonhowto.com/cgi-bin/thumb.cgi?circle" ALT="Circle">
<AREA SHAPE=RECT COORDS="32,123,101,216" HREF="http://handsonhowto.com/cgi-bin/thumb.cgi?square" ALT="Square">
<AREA SHAPE=RECT COORDS="129,123,204,218" HREF="http://handsonhowto.com/cgi-bin/thumb.cgi?triangle" ALT="Triangle">
</MAP>
The "MAP" information is the result of using Todd Wilson's "Map THIS!" program for Windows to calculate the coordinates -- the top left and bottom right corners, in this case -- for each clickable area.
Try it out! You should get a black page with a (slightly) larger version of the image you picked. There will be a white Brass Cannon logo on that page; click it to come back here.
(In case you're wondering how that works, it's the "#example" link in our CGI code. The paragraph above this one, the one you're reading right now, has an anchor link that looks like this: <a name="example"> -- by referring to such an "internal link" you can jump into the middle of a page, rather than starting at the top.)
On our next page we build a skeleton for a download manager. These few examples should, I hope, help you see what part of the script is "CGI" (the parameter passing stuff) and what is "application code" (everything else). Once you understand that, you can start looking at the more interesting examples at other sites.
You are invited to discuss this article with the author in the Feedback section of the Brass Cannon webboard.