CGI 101 - What does CGI mean?
Posted on November 14, 2007
Filed Under Unix
Sometimes the toughest thing about learning something new and useful is figuring out what other people call it. You may have wanted to do a small project like the one I’m about to describe, and you would have been able to do it long before this — if only you had known what to type into Google.
Plain old HTML is very easy to learn; you just add some tags to plain text, and a browser will present it as a nicely formatted document with links and pictures. The next step is to interact with your pages — and that’s where CGI comes in.
CGI stands for Common Gateway Interface. It’s “Common” because it doesn’t care what kind of webserver or browser you’re using. CGI is not a programming language. It’s just a protocol: an agreement that “this is how we’re going to do this thing.” Normally, a webserver is only interested in one piece of information from you: “What page do you want me to show you?” CGI changes that by letting you tell the server more about what you want. That means you can tell the server to send you different pages, or even to create a completely new page.
For instance, I have a small picture gallery. It started with a single large image that was made up of icons, each of which was a “thumbnail” representation of a full-size picture. I used a free program called MapThis! to make each thumbnail into a link to its full-size picture. Now, each of those links could have pointed straight at the image file itself, as in href="picturefile.jpg"; most browsers will understand that it’s an image file (they recognize the “content type”) and will happily open it. That worked, but it was ugly. Opening an image file directly in a browser is like having your food thrown at you instead of served on a plate. I really wanted a nicer presentation: I wanted to center the image, set a background of some kind, display a title and my copyright.
I could do that by creating a separate web page, one for each image; and if that seemed like too much work, I could simplify things by generating those pages from a template. But even if I used a script to generate all those .html files, it would still mean a lot of clutter and bother.
Why not write a single page that uses a variable in its <img src="picturefile"> tag? As it turns out, what I wanted was “a CGI script to accept a parameter and generate dynamic HTML.”
The cgi-bin directory
The thing about Unix (or Linux) programs, and by extension all web browsers: they don’t care whether you feed them a file that you prepared weeks in advance, or the output of some program. It’s still just ones and zeros. A CGI script is a computer program that creates an HTML page as its output.
CGI is an agreement: We agree that we are going to put some files into a special place, or give them a special name (both of these ways of identifying a CGI file are valid), and the server is going to be allowed to run those files rather than simply copying them to a waiting browser.
Why do we make this distinction? Because running a script is risky. You only want your CGI processes to do things that you expect and control. The first step in keeping control is to know where your scripts live. When you set up the Apache webserver, you tell it whether you want it to run any CGI programs at all, and if so, how you expect it to find them. For example, for an Apache setup where the static html content lives in /var/www/html, the CGI content might be in /var/www/cgi-bin. Things that reside in the cgi-bin directory are treated differently than things elsewhere. You also have the option of telling your server that any file that ends with a certain file type (.cgi for instance, or .pl) is to be run as a CGI script, not just “served.”
If you have not made these settings in the configuration file for your Web server, it simply will not “play the CGI game” with you. When you upload a Perl script to your website and try to access it, you’ll see the text of your script instead of its output.
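For the curious, here is roughly what those two settings can look like in an Apache configuration file. Treat it as a sketch rather than a drop-in config: the paths are the example ones from above, the file you need to edit varies between installations, and it assumes Apache's CGI module is already loaded.

```apacheconf
# Option 1: a special place.
# Anything requested under /cgi-bin/ is run as a program from this directory.
ScriptAlias /cgi-bin/ "/var/www/cgi-bin/"

# Option 2: a special name.
# Files ending in .cgi or .pl under the document root are run, not served.
<Directory "/var/www/html">
    Options +ExecCGI
    AddHandler cgi-script .cgi .pl
</Directory>
```

Either approach works on its own. Many people prefer the first, since it keeps every runnable file in one place that is easy to audit.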
A simple sample
This is about as simple as a CGI script can get:
#!/usr/bin/perl
print "Content-type: text/html\n\n";
print qq~<html><body>Hello, world!</body></html>~;
This script does not accept any input. You can save it (as “hello.pl”) and run it from the command line on your own computer (by typing “perl < hello.pl”). It should put this output onto your screen:
Content-type: text/html

<html><body>Hello, world!</body></html>
A CGI is an executable script or program whose output looks like valid HTML content.
So, why all the fuss? Why didn’t we just type that silly one-line web page instead of wrapping it up in all that Perl code?
Remember, things in a cgi-bin directory are treated differently. If we had put that one-line page into our cgi-bin directory as “hello.html” instead of “hello.pl”, Apache would not simply display it. It would try to execute it, and the server’s operating system would find it very distasteful — it would give you the computer equivalent of “I don’t understand this at all!”
When you want to put a plain static page into your cgi-bin directory, it needs some sort of executable wrapper around it. The example above is a good start. The important thing is that it is in a format that your server’s operating system recognizes as a program that it can run. It may be a text file — a script — that is written in a language for which you have an interpreter. That’s how your computer runs a script written in the Perl language. If it is a very complex program, it might be created in the C programming language and compiled into a “binary” file that runs very efficiently, but the principle is still the same: A CGI is something your computer recognizes as a program to be run, whose output looks like HTML.
A CGI:
- is located in a place the webserver expects to find something that plays by the CGI rules;
- has protection settings that allow it to be run by the owner of the webserver process;
- gives a stream of output in a format that your webserver can serve and a browser can present.
To satisfy that last requirement, the first line of output from the CGI will almost always be a “Content-type” line. If you provide that, then the output of a CGI can be almost any valid MIME type, even a binary file such as an image; but for now we are going to limit ourselves to text in HTML format, produced by simple Perl scripts.
A real CGI script
Here’s my very first “real” CGI program, thumb.cgi:
#!/usr/bin/perl
@params=split(/=/,$ENV{'QUERY_STRING'});
print "Content-type: text/html\n\n";
print qq~<html>
<body bgcolor=black>
<hr>
<center>
<img src=/images/$params[0].jpg><br>
<font color=gray><h6>© 2002 Kevin Martin</h6></font>
<hr>
<a href=/cgi101.html#example><img src="/whitecannon.gif" border=0
alt="Return"></a>
</center>
</body></html>~;
Okay, what is going on here?
First of all, this program is written in the programming language Perl. Perl is one of the easiest and most widely-supported ways to create a CGI program or script. A shell script is a bad idea for CGI use, because shells tend to be “sloppy” and offer far too many ways to break down security barriers. Shells are also big and waste server resources. Perl is much better in all these respects.
Second, we have now added the ability to accept a value as input. This is the part that makes it more than just a static page. It happens on the second line, where Perl reads a value called QUERY_STRING from its runtime “environment.”
Our CGI script starts running when the Apache server processes a request that points to its URL, its Uniform Resource Locator. That URL would look something like “http://handsonhowto.com/cgi-bin/thumb.cgi” (PLEASE don’t try to hit that as a link!). Then comes a “delimiter” (a separator character), which is a question mark, “?”. The delimiter marks the beginning of our QUERY_STRING. So, someone could run this CGI by entering this URL:
http://handsonhowto.com/cgi-bin/thumb.cgi?circle
and the QUERY_STRING will contain the value “circle”.
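If you would like to watch that happen without a web server, here is a small sketch you can run from the command line. The helper name query_fields and the fallback value are my own inventions for the demonstration; the only part taken from thumb.cgi is the split on the equal sign.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Split a query string on "=", the same way thumb.cgi does.
# (query_fields is a hypothetical helper name, not part of thumb.cgi.)
sub query_fields {
    my ($query) = @_;
    $query = '' unless defined $query;
    return split /=/, $query;
}

# Outside a web server there is no QUERY_STRING, so fall back to a sample value.
my $query = defined $ENV{QUERY_STRING} ? $ENV{QUERY_STRING} : 'circle';

my @params = query_fields($query);
print "QUERY_STRING: $query\n";
print "First field:  ", (defined $params[0] ? $params[0] : '(none)'), "\n";
```

To imitate the server, set the variable yourself before running the script, for example by typing QUERY_STRING=circle perl demo.pl (demo.pl being whatever name you saved it under).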
Now that we have our parameter stored in a variable, the rest of the script is simply a matter of presenting a stream of HTML data to the browser. The Perl “qq” operator lets us choose the tilde (~) as the quoting character, so we can use double quotes freely inside the HTML without confusing the script.
There is another little trick you need to know to make a CGI script work. We are taking responsibility for all of the stuff that the server would normally send to the user’s browser, including the MIME content header, which tells the browser that it is about to receive some HTML. So, the first thing we send is “Content-type: text/html” and a couple of line breaks. Once we’ve taken care of that, it’s just a matter of writing out exactly the same text we would use in a static, handwritten web page. We use $params[0] in our “img src” tag, and the result should be a nicely centered picture on a black background, with horizontal rules above and below, and my copyright notice centered below it.
Securing the CGI script
There’s one thing that makes this CGI example not quite “ready for prime time,” though. I’m not checking the input. You’ll note that I did hard-code the directory and the file extension, so only the name portion of the filename comes in as a variable. But that assumes that only my index.html page is calling this CGI. There is nothing to keep someone unfriendly from studying my source code and then going directly to the CGI. Worse, they don’t have to type in a valid file name; they could type in tricky file syntax strings like “../../../..” to “walk” up to my server’s root directory, and try to list out password files or other things I’d just as soon they not see.
One thing I can do is to ruthlessly filter the QUERY_STRING to strip out those /.. characters, and any other characters that I know won’t appear in my filenames. I know that all of my filenames are purely alphanumeric, so I can reject any input that does not consist purely of letters and numbers. (This is a non-trivial thing to do, though, because browsers that have to deal with international alphabets have a lot of ways of “encoding” special strings. One of the vulnerabilities of Microsoft IIS was that it would accept “Unicode” characters as an alternative to characters that the web administrator thought he had blocked, and turn them back into dangerous values.)
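The “letters and numbers only” idea can be sketched in a few lines of Perl. This is a minimal illustration under my own assumptions, not code from my gallery: safe_name is a made-up helper name, and it only allows plain ASCII letters and digits.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Return the name if it is made up only of ASCII letters and digits;
# otherwise return undef. (safe_name is a hypothetical helper name.)
sub safe_name {
    my ($name) = @_;
    return undef unless defined $name;
    return $name =~ /\A([A-Za-z0-9]+)\z/ ? $1 : undef;
}

# Try a good name and a few unfriendly ones.
for my $candidate ('circle', '../../etc/passwd', 'circle%2e%2e', '') {
    my $verdict = defined safe_name($candidate) ? 'accepted' : 'rejected';
    printf "%-20s %s\n", $candidate, $verdict;
}
```

Because the percent sign is not a letter or a number, running this check on the raw QUERY_STRING also turns away anything that arrives URL-encoded, which sidesteps a good part of the decoding problem.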
So, here’s a slightly more “robust” version of the script above.
#!/usr/bin/perl
@params=split(/=/,$ENV{'QUERY_STRING'});
my $p0=$params[0];
print "Content-type: text/html\n\n";
## Handle URLencodings:
$p0 =~ tr/+/ /;
$p0 =~ s/%([a-fA-F0-9][a-fA-F0-9])/pack("C",hex($1))/eg;
$p0 =~ s/,/ /g;
## Eat /.. attacks (repeat until no /.. is left):
1 while $p0 =~ s/\/\.\.//g;
## Kill Unix shell escape characters:
$p0 =~ s/([;<>\*\|'\$!#\(\)\[\]\{\}:'"])/\\$1/g;
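## Fix up for embedded spaces in filenames:
## (in a replacement, "\ " is just a space, so this leaves the name as it is,
## which is what the -f test below needs)
$p0 =~ s/ /\ /;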
my $DD="/opt/www/images";
my $file="$DD/$p0.gif";
if (!(-f $file)) {
print ("<html><body bgcolor=red>");
print ("<h2>WARNING</h2>");
print ("<h4>Attempts to access this site through\n\r");
print ("unauthorized means are logged.</h4>");
die "Hack attempt: $params[0]";
}
print qq~<html>
<body bgcolor=black>
<hr>
<center>
<img src=/images/$p0.gif><br>
<font color=gray><h6>© 2007 Kevin Martin</h6></font>
<hr>
<a href=/cgi101.html#example><img src="/whitecannon.gif" border=0
alt="Return"></a>
</center>
</body></html>~;
The “Handle URL encodings” section covers file names that have embedded spaces or other strange punctuation — the user’s browser will automatically send these characters as three-byte codes (a space becomes %20, for example). This Perl code changes those three-byte codes back into single ASCII characters.
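Here is that decoding step pulled out on its own so you can try it from the command line. It is a sketch of the same idea, not a replacement for a proper library: url_decode is a name I made up, and it only handles single-byte codes.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Turn "+" into a space and each %XX code back into the character it stands for.
# (url_decode is a hypothetical helper name.)
sub url_decode {
    my ($text) = @_;
    $text =~ tr/+/ /;
    $text =~ s/%([0-9A-Fa-f]{2})/chr(hex($1))/eg;
    return $text;
}

print url_decode('my%20picture'), "\n";
print url_decode('red+circle'), "\n";
```

The /e flag matters here: it tells Perl to run chr(hex($1)) as code for every match rather than inserting it as literal text.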
Then we “eat” any instances of /.. because these are attempts to climb out of the art gallery directory into the place where my system files are. We mask out or “escape” any characters that have special meaning to the Unix shell. We also allow for the slim possibility that I might have a valid file with an embedded space in its name. A shell would want that space written as a backslash and a blank ("\ "), but the file test that follows takes the name literally, so the space has to stay a plain space.
Finally, and most important, we stop assuming that only valid filenames from my index.html page will be submitted to the CGI. Instead, we check that the edited QUERY_STRING value points to an existing file somewhere in the art gallery directory. If it doesn’t, we record the failed attempt into Apache’s error log and exit the script. That’s why I asked you not to type it in without a valid file name — if you did, you would get a nasty warning page.
If the filename is valid, of course, everything runs pretty much as it did in the original version. If you copy this code, you’ll have to fiddle a bit with the locations of things. One thing that might be confusing is that the running cgi code has a different view of things than the web server does. Note that while the cgi code is running, it is using a real system path to look for files (it looks in “/opt/www/images”) but once we have found a valid image, we have to change to the webserver’s point of view. The webserver sees this same directory as /images, meaning “the images directory under my document root.” Even though the images reside in “/opt/www/images”, the webserver sees that as “/images”.
Calling the CGI script
So we have a CGI script. How do we use it? It’s just like any other hypertext link. It needs to have a valid parameter value, but that’s no worse than the requirement that you spell your domain name correctly.
So, we could have a plain text link that looks like this:
<a href="http://handsonhowto.com/cgi-bin/thumb.cgi?circle">Circle</a>
[Image: thumbnail screenshot that links to thumb.cgi?circle]
If you click on that thumbnail screenshot, my “thumb.cgi” will parse out the parameter “circle” and display “circle.gif” from my images directory. Simple as that.
(The white Brass Cannon logo on that page is a link that will bring you back here.)
To summarize — this is what the URL of a plain (or “static”) Web page looks like:
“http://example.com/index.html”
This “Uniform Resource Locator” is actually pointing to a file called index.html on a remote machine.
This URL calls up a dynamic page and also provides two parameters, p1 and p2:
“http://example.com/cgi-bin/index.cgi?p1=apple&p2=banana”
The question mark separates the pointer to index.cgi from the parameter list. Whoever wrote index.cgi is using the Perl “split” function with a second delimiter, the ampersand character “&”, to separate multiple parameters. For clarity, the author does not assume that the parameters will arrive in a certain order; instead, each one is given an explicit name, and the equal sign ties the value to that name. Thus, the parameter named “p1” is associated with the value “apple” and so forth. You can do a lot with this technique!
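I don't have the source of index.cgi to show you, so here is my guess at how such a script might pull its parameters apart. The helper name parse_query is invented for this example, and the sketch leaves out URL decoding to keep it short.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Break "name=value&name=value" into a hash of names and values.
# (parse_query is a hypothetical helper name.)
sub parse_query {
    my ($query) = @_;
    my %param;
    for my $pair (split /&/, $query) {
        # Limit the split to two pieces so a value may itself contain "=".
        my ($name, $value) = split /=/, $pair, 2;
        next unless defined $name && length $name;
        $param{$name} = defined $value ? $value : '';
    }
    return %param;
}

my %param = parse_query('p1=apple&p2=banana');
print "$_ = $param{$_}\n" for sort keys %param;
```

Because the values are looked up by name, the same script works whether the URL says p1=apple&p2=banana or p2=banana&p1=apple.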