The task in step 6 is to parse the data passed as a character string from the server to a CGI program through STDIN. This involves two new ideas: reading data from STDIN and then parsing that data. Reading data from STDIN is required when METHOD=POST, whereas when METHOD=GET, data are passed through the QUERY_STRING environment variable, as seen in step 5. Regardless of whether the data are obtained from an environment variable or read through STDIN, they must be parsed before they can be used by a CGI program since they are compressed into a continuous string, without spaces, and some "special" characters are mapped to other values.
The purpose of parsing is to break the encoded string passed between client and server into attribute/value pairs, to translate the encoded special characters back into their original forms, and to store the translated attribute/value pairs in a convenient data structure, namely an associative array.
Parsing strings in CGI, regardless of whether they are passed through STDIN or through QUERY_STRING, requires four steps:
- TOKENIZE the string into a list of attribute/value strings
- SPLIT each attribute/value string into separate key and value strings
- DEPLUS the strings to translate plus signs (+) into spaces
- DECODE 3-character hex representations of special characters and translate them back into their original 1-character forms
The order in which these four steps are carried out matters. Consider the following portion of a query and its encoded form:
I vote ++; let's make an offer & try to hire her!
name=john&textarea=I+vote+%2B%2B%3B+let%27s+make+an+offer+%26+try+to+hire+her%21
Decode should be done last. For example, ampersands are used to separate attribute/value strings from one another; consequently, ampersands in the data are translated into 3-character hex values. If these hex values were translated back to ampersands before tokenizing, they would be confused with the ampersands used as delimiters. The order generally recommended is that given above. One exception is to deplus first, while the input data exist as a single string, before tokenizing and splitting; using the example, below, explain why this exception works.
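To see why the order matters, the following sketch processes the encoded example above in two different orders and counts the resulting attribute/value strings. The helper name decode and the variable names are illustrative assumptions, not part of the parse program discussed here.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Encoded example from the text above.
my $encoded = 'name=john&textarea=I+vote+%2B%2B%3B+let%27s+make+an+offer+%26+try+to+hire+her%21';

# Hypothetical helper: turn each %XX escape into the character it encodes.
sub decode {
    my ($s) = @_;
    $s =~ s/%([0-9A-Fa-f]{2})/chr(hex($1))/ge;
    return $s;
}

# Recommended order: tokenize on & first, decode each piece afterwards.
my @right = split /&/, $encoded;

# Wrong order: decoding first turns %26 into a literal &, which is then
# mistaken for a delimiter when the string is tokenized.
my @wrong = split /&/, decode($encoded);

print scalar(@right), " pairs when tokenizing first\n";
print scalar(@wrong), " pairs when decoding first\n";
```

Run as written, this should report 2 pairs in the first case and 3 in the second, because the ampersand inside the textarea value is treated as a separator once it has been decoded.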
Parsing requires five new Perl concepts and operators:
- <STDIN>
- Translate and squeeze
- Split
- Substitution
- Associative array assignment
Each will be discussed in the context of the parse program, below.
<STDIN> is actually an operator. It returns the next line of input from the file, STDIN. Consequently, it does not need to be used with another operator or verb, such as a read. The command $in_string = <STDIN>; reads the next line of input, which is the entire concatenated string of attributes and values, and places it in the scalar variable, $in_string.
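The behavior of the angle-bracket operator can be tried without a web server. The sketch below is a stand-in, not the CGI program itself: it opens a filehandle on an in-memory string (the hypothetical $fake_stdin and $body) in place of STDIN, and the sample data are invented for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Simulated POST body; in a real CGI program the server supplies this on STDIN.
my $body = "name=john&vote=yes\n";

# Open a read-only filehandle on the string so the example runs without a server.
open(my $fake_stdin, '<', \$body) or die "cannot open in-memory handle: $!";

# The angle-bracket operator returns the next line from the handle.
my $in_string = <$fake_stdin>;

# Remove the trailing newline, if any, so it does not end up in the last value.
chomp $in_string;

print "$in_string\n";
close $fake_stdin;
```

In the CGI program, <STDIN> plays the role that <$fake_stdin> plays here.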
Translate and squeeze. The next section of code translates plus signs (+), used to indicate spaces in the original data, back into spaces; it also removes multiple spaces so that only a single space exists between any two words. The Perl command used for this is tr, for translate. It takes two lists of characters, delimited by slash (/) characters, and translates instances of the first into the second. For example, it can be used to translate all uppercase characters to lowercase, or vice versa. These character lists are not regular expressions; the patterns used by split and substitution can be quite complex, and will be discussed in more detail when regular expressions are described.
In the line of code shown, the plus sign is preceded by a backslash (\) to mark it as the literal character, plus. Because tr treats a plus sign literally in any case, the backslash is harmless but not required. The s at the end of the expression removes, or "squeezes out", multiple instances of the translated character, spaces in this case. Finally, the symbol =~ is actually an operator. It identifies the variable on the left as the one to which the operator on the right, the tr, is applied. Thus, it works like an assignment statement, although it is not literally that. Had it not been used, the translate would have been applied to a predefined variable, called the default variable and denoted $_. Its value is set implicitly by certain constructs, such as a foreach loop with no named loop variable, and many operators act on it when no other variable is given; often it is the variable to which one would logically apply the next operation.
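A short sketch of translate, squeeze, and the default variable follows. The sample strings and the variable names $plain, $squeezed, and $default_result are made up for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $in_string = 'I+vote++yes';

# Plain translate: every plus sign becomes one space.
(my $plain = $in_string) =~ tr/+/ /;

# Translate and squeeze: a run of translated characters collapses to one space.
(my $squeezed = $in_string) =~ tr/+/ /s;

print "[$plain]\n";
print "[$squeezed]\n";

# Without =~, tr operates on the default variable $_.
$_ = 'a+b';
tr/+/ /;
my $default_result = $_;
print "[$default_result]\n";
```

The first line printed keeps two spaces between "vote" and "yes"; the second keeps only one.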
Split does what the name implies: it takes a pattern, shown between the slash (/) delimiters, and a character string, and returns a list of the portions of the string that precede and follow each match of the pattern. Thus, it produces a list of the portions of the string that don't match the pattern and throws away the portions that do match.
In the TOKENIZE step, the input string is split on the pattern, /&/, and the resulting list of attribute/value strings is assigned to an array, @attr_val_strings, indicated by the at-sign (@) prefix of the variable name.
In the next block of code, three important things happen. First, each string in the @attr_val_strings array is split into the portions that come before and after an equal (=) sign, with the two strings assigned to a 1x2 array, @pair. Element $pair[0] is the part that comes before the equal sign, and $pair[1] is the part that comes after.
Second, the character strings in each of the two cells of the pair array are decoded. Special characters, such as parentheses, ampersands, and the like, that might interfere with processing the data string, are coded as an escape character, %, followed by a 2-character hexadecimal value. Thus, for example, a dollar sign ($) is encoded by a browser as %24. Decoding translates these encoded representations back into their original, one-character forms. It is desirable to do this translation before assigning keys and values to the associative array to allow easy processing of keys that include special characters. Hence, decoding is done within the split and assignment loop.
The first substitution is performed on $pair[0], which will become a key element within the associative array, %attr_value. The second is done on $pair[1], which will be its corresponding value.
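The two uses of split described above can be tried in isolation. This is a minimal sketch with invented sample data; it reuses the variable names from the parse program but omits the decoding step.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $in_string = 'name=john&vote=yes';

# TOKENIZE: the & delimiters are discarded, the pieces between them are kept.
my @attr_val_strings = split /&/, $in_string;

# SPLIT: each piece is divided at the equal sign into a key and a value.
foreach my $out_str (@attr_val_strings) {
    my @pair = split /=/, $out_str;
    print "key: $pair[0], value: $pair[1]\n";
}
```

One caveat worth knowing: split drops trailing empty fields, so a field submitted with no value, such as comment=, yields a one-element list and $pair[1] is undefined.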
Substitution. The substitution operator, s, is at the center of the decode step. This is the Matterhorn. Once we get over this peak, it's all downhill from there. As with many Perl expressions, there is a great deal of magic packed into this single line of code. That's the beauty of the language, if you like it, or its downfall, if you don't. But it is one of the main characteristics that makes Perl what it is. We'll work from the outside in.
The substitution operator, like the translate operator, takes two patterns, delimited by slashes. It looks for an instance of the first pattern in the target string and substitutes an instance of the second pattern for it. The pattern that is looked for here is %(..). The percent sign is a literal and is looked for explicitly. The two periods (..) match any two characters. The parentheses around the two periods tell Perl to "remember" those two characters so that they can be referred to later, in this context through the variable, $1. Thus, the string, %28, which is the coded representation for a left parenthesis, would be matched by this pattern and the 28 would be assigned as the value to the variable, $1. When such a pattern is found, the operator substitutes what follows, delimited by the slashes, for the 3-character string.
What is substituted here is pack("c",hex($1)). pack takes two arguments, a format control string and a list of values, and creates a single string from those values. The format control string is "c", which specifies a single character, and the list of values is the single value, hex($1), the number obtained by reading the two characters in $1 as hexadecimal digits. That number is the code of the character to be restored.
Note that what is produced as a result of the substitution is a Perl operator, pack. The final e tells Perl to execute that operation and substitute the results of the operation in the place where the pattern is found. The g at the end of the expression says that the substitution should be made for all occurrences of the pattern. Finally, the =~ operator directs the substitution to the desired string.
In the third step of this block of code, the two array values, $pair[0] and $pair[1], are assigned as associated key and value components for a row in the associative array, %attr_value. However, since the expression refers to individual elements of the array, each such element is referred to using its scalar prefix. Let me paraphrase the line,
$attr_value{$pair[0]} = $pair[1];
Assign the contents of $pair[1], the part of the string that came after the equal sign, as the value element in the row of the associative array, %attr_value, that is indexed by the key, $pair[0], which is the part of the string that came before the equal sign. Since the assignment applies to only a single element in the array, the scalar prefix ($) is used, as in $attr_value{$pair[0]}.
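The decode and assignment steps can be sketched on a single pair as follows. The sample key and value are invented for illustration; the substitution and the assignment are the ones discussed above.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my %attr_value;

# A key/value pair as it might look after DEPLUS, TOKENIZE and SPLIT.
my @pair = ('textarea', 'let%27s make an offer %26 try');

# DECODE: replace each %XX with the character whose hex code is XX.
# The e modifier evaluates pack(...) as code; g repeats for every match.
$pair[0] =~ s/%(..)/pack("c",hex($1))/ge;
$pair[1] =~ s/%(..)/pack("c",hex($1))/ge;

# Store the value under its key; one element of a hash is a scalar.
$attr_value{$pair[0]} = $pair[1];

print "$pair[0] = $attr_value{$pair[0]}\n";
```

Here %27 becomes an apostrophe and %26 becomes an ampersand, so the stored value reads: let's make an offer & try.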
Associative array assignment. Just did it!
Echo STDIN Variables Program
#!/usr/local/bin/perl
#
# INPUT data
$in_string = <STDIN>;
#
# DEPLUS $in_string
$in_string =~ tr/\+/ /s;    # translate and squeeze multiple spaces
#
# TOKENIZE attr/val strings
@attr_val_strings = split (/&/, $in_string);
#
# SPLIT attr/val strings, DECODE special characters,
# and put into assoc. array
foreach $out_str (@attr_val_strings) {
    @pair = split (/=/, $out_str);
    $pair[0] =~ s/%(..)/pack("c",hex($1))/ge;
    $pair[1] =~ s/%(..)/pack("c",hex($1))/ge;
    $attr_value{$pair[0]} = $pair[1];
}
#
# OUTPUT section
# generate header lines
print "status: 200 ok\n";
print "content-type: text/html\n\n";
# GENERATE report, in HTML
print "<HTML>\n";
print "<HEAD>\n";
print "<TITLE>stdin vars.</TITLE>\n";
print "<H2>Print CGI STDIN Variables</H2>\n";
print "</HEAD>\n";
print "<BODY>\n";
print "<HR>\n";
print "<H3>STDIN Variables</H3>\n";
print "<UL>\n";
foreach $key (keys %attr_value) {
    print "<LI>$key = $attr_value{$key}\n";
}
print "</UL>\n";
print "</BODY>\n";
print "</HTML>\n";
Write and test an Echo STDIN Variables script. You can also execute the program above: Echo STDIN Variables.