CS270 Language Blog - Perl: 2020

Sunday, March 29, 2020

The Six Degrees of Kevin Bacon

What is Six Degrees of Kevin Bacon?

Six Degrees of Kevin Bacon is a parlor game based on trying to find the shortest “chain” connecting two actors and/or actresses. Each “link” in the chain is a pair of actors and/or actresses who have co-starred in a movie, and the number of links in the chain determines the length of the chain. In the seminal case of this game, one of the actors being linked is Kevin Bacon, hence the name.

To demonstrate the way in which this game works, consider the following two examples: William Shatner to Leonard Nimoy and Kevin Bacon to Natalie Portman. Shatner and Nimoy co-starred in Star Trek: The Motion Picture, so the length of the chain connecting them is 1. Bacon co-starred with Meg Gibson in Picture Perfect, and Gibson co-starred with Portman in Vox Lux, so the length of the chain connecting Bacon and Portman is 2.

The goal of this program is to find the shortest chain given the names of the two actors. Note that this program will only give the chain length as a number—it will not tell which movies and intermediate actors (if any) the chain was traced through. Throughout the development process, I will be using the Oracle of Bacon as a reference to verify my program’s results.

Loading the Raw Data into Memory

The Oracle of Bacon makes the dataset it uses to find matches publicly available for download as a compressed text file. After downloading and extracting the file, I opened it to look at how the data was formatted. Here is the first entry in the file:

{"title":"Actrius","cast":["Núria Espert","Rosa Maria Sardà","Anna Lizaran","Mercè Pons"],"directors":["Ventura Pons"],"producers":["Ventura Pons"],"companies":["Canal+ España","Els Films de la Rambla S.A.","Generalitat de Catalunya - Departament de Cultura","Televisión Española","Buena Vista International"],"year":1997}

Now knowing the data format, I was able to then write regular expressions to extract each film’s cast list. I then used the built-in split function to extract the individual actors’ and actresses’ names from the cast list and store them in an array. This portion of the code is shown below:

my @casts = ();

# decode the file as UTF-8 so that \p{L} sees accented letters as single characters
open my $input, "<:encoding(UTF-8)", "data.txt" or die $!;
while (<$input>)
{
    # skip any line that has no cast list
    next unless (/"cast":\[(.*?)\]/);
    my $castList = $1;
    # split only on commas that sit between two quoted names
    my @castMembers = split /(?<="),(?=")/, $castList;
    my @cast = ();
    for (@castMembers)
    {
        push @cast, $1 if (/"([\p{L}.,' ]+?)"/);
    }
    push @casts, \@cast;
}

Note that since the string I want to match my regex against is stored in the special variable $_, I do not need to use the =~ operator to tell Perl to check for a match: I simply declare my regex, and Perl checks $_ for a match by default. Also note that I am storing array references into the @casts array; recall that this is necessary to prevent the sub-arrays from losing their structure when stored in the larger array. Also note the \p{L} character class in the regular expression. This is a special character class that matches anything considered a letter according to the Unicode standard. The reason I use this instead of A-Za-z is to ensure that I am correctly capturing actors and actresses such as Ricardo Montalbán whose names contain accented characters. This only works once the line has been decoded into characters, which is why the file is opened with the :encoding(UTF-8) layer; read as raw bytes, a UTF-8 “á” is two separate bytes, and the second of them is not a letter.
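To see the difference between the two character classes in isolation, here is a minimal sketch. The $entry string is a made-up stand-in for one element of a cast list, and the use utf8 pragma is only there so that the literal á in the source code is treated as a single character; the binmode line assumes a UTF-8 terminal.

```perl
use strict;
use warnings;
use utf8;                              # the source below contains a literal á
binmode STDOUT, ":encoding(UTF-8)";    # assumes a UTF-8 terminal

my $entry = '"Ricardo Montalbán"';     # hypothetical cast-list element

# ASCII-only class: the á is not in A-Za-z, so no match is found
my ($asciiName) = $entry =~ /"([A-Za-z.,' ]+?)"/;
# Unicode letter class: á counts as a letter, so the full name is captured
my ($unicodeName) = $entry =~ /"([\p{L}.,' ]+?)"/;

print "ASCII class:   ", (defined $asciiName ? $asciiName : "(no match)"), "\n";
print "Unicode class: ", (defined $unicodeName ? $unicodeName : "(no match)"), "\n";
```

The ASCII-only class finds no match at all, while the \p{L} version captures the whole name.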

Having loaded into memory the cast lists of every movie in the dataset, I can now construct my adjacency list, a hash that maps each actor or actress to the actors and/or actresses with whom they have co-starred. Since I can do this simply by iterating over the arrays I created earlier, I first allow my file variable to go out of scope so that Perl will close the connection to it. The code for constructing the adjacency list is shown below:

our %LEVEL_ONE_LINKS = ();
while (@casts)
{
    my $cast = shift @casts;
    for my $a (@$cast)
    {
        for my $b (@$cast)
        {
            unless ($a eq $b)
            {
                if (exists $LEVEL_ONE_LINKS{$a})
                {
                    # \Q...\E makes characters such as . in a name match literally
                    push @{$LEVEL_ONE_LINKS{$a}}, $b unless (grep(/^\Q$b\E$/, @{$LEVEL_ONE_LINKS{$a}}));
                }
                else
                {
                    $LEVEL_ONE_LINKS{$a} = [$b];
                }
                if (exists $LEVEL_ONE_LINKS{$b})
                {
                    push @{$LEVEL_ONE_LINKS{$b}}, $a unless (grep(/^\Q$a\E$/, @{$LEVEL_ONE_LINKS{$b}}));
                }
                else
                {
                    $LEVEL_ONE_LINKS{$b} = [$a];
                }
            }
        }
    }
}

In list context, the grep function returns all elements of an array that match a given regular expression; in the boolean context of unless, it returns the number of matching elements instead. Since zero is a false value, the grep function as it is being used here essentially acts to prevent duplicate mappings being added to the adjacency list. Also note that, in the else clause, I use square brackets rather than parentheses around the sole element of my newly created array. This is because using square brackets creates an array reference, saving me the step of having to declare the array and then obtain a reference to it using a backslash as I did with the @cast array when loading the cast lists.
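For a feel of what the finished adjacency list looks like, here is a small sketch that builds one from two made-up casts. The names and the %links hash are placeholders rather than anything from the dataset, and I use the block form of grep with eq here simply as an alternative way of writing the duplicate check.

```perl
use strict;
use warnings;

# two made-up casts; the names are placeholders, not entries from the dataset
my @casts = (["Ann", "Bob", "Cy"], ["Bob", "Dee"]);
my %links = ();
for my $cast (@casts)
{
    for my $actor (@$cast)
    {
        for my $costar (@$cast)
        {
            next if ($actor eq $costar);
            if (exists $links{$actor})
            {
                # grep yields a count here; zero (false) means "not listed yet"
                push @{$links{$actor}}, $costar
                    unless (grep { $_ eq $costar } @{$links{$actor}});
            }
            else
            {
                # square brackets create a reference to a new one-element array
                $links{$actor} = [$costar];
            }
        }
    }
}
print "$_: @{$links{$_}}\n" for (sort keys %links);
```

Bob appears in both casts, so his entry ends up listing Ann, Cy, and Dee, while Dee's entry lists only Bob.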

For those of you following along at home, be forewarned that the dataset is a very large file (about 43 MB); it will take several minutes to construct the adjacency list.

Finding the Shortest Chain

This essentially boils down to a breadth-first search algorithm: given a starting node (actor/actress), check all of the adjacent nodes (actors/actresses) to see whether they match the target node (actor/actress), then check all the nodes adjacent to the adjacent nodes, and so on and so forth until the target node is found, as seen below:

sub findShortestChain($$)
{
    my ($first, $second) = @_;
    my $chainLength = 0;
    my $currentIndex = 0;
    my $chainLengthBoundary = 1;
    my @queue = ($first);
    while ($currentIndex < @queue)
    {
        return $chainLength if ($queue[$currentIndex] eq $second);
        for my $link (@{$LEVEL_ONE_LINKS{$queue[$currentIndex]}})
        {
            # \Q...\E makes characters such as . in a name match literally
            push @queue, $link unless (grep(/^\Q$link\E$/, @queue));
        }
        if (++$currentIndex == $chainLengthBoundary)
        {
            $chainLength++;
            $chainLengthBoundary = @queue;
        }
    }
    return undef;
}

Note again the use of grep to ensure that no node (actor/actress) is visited more than once; this is also the reason for keeping all of the names in the queue and simply advancing a pointer ($currentIndex) instead of removing nodes from the queue as they are visited: if the nodes were removed as they were visited, it would create the possibility of nodes being visited more than once. Also recall that string equality uses the eq operator rather than the == operator used for numerical equality. Finally, recall that an array variable used in a scalar context evaluates to the number of elements it contains (as seen in the while loop condition).
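The same breadth-first idea can be tried out on a toy graph. The sketch below is a variation rather than the code above: it assumes a hypothetical %links hash with placeholder names, and it records each node's distance in a hash instead of scanning the queue with grep, which also makes it safe to remove nodes from the queue as they are visited.

```perl
use strict;
use warnings;

# toy adjacency list with placeholder names (not from the dataset)
my %links = (
    A => ["B", "C"],
    B => ["A", "D"],
    C => ["A"],
    D => ["B"],
    E => [],
);

sub chainLength
{
    my ($first, $second) = @_;
    my %distance = ($first => 0);   # doubles as the "already visited" set
    my @queue = ($first);
    while (@queue)
    {
        my $current = shift @queue;
        return $distance{$current} if ($current eq $second);
        for my $link (@{$links{$current} || []})
        {
            next if (exists $distance{$link});
            $distance{$link} = $distance{$current} + 1;
            push @queue, $link;
        }
    }
    return;   # no chain connects the two nodes
}

for my $pair (["A", "D"], ["C", "D"], ["A", "E"])
{
    my $length = chainLength(@$pair);
    print "$pair->[0] to $pair->[1]: ", (defined $length ? $length : "no chain"), "\n";
}
```

Here A reaches D in 2 links, C reaches D in 3, and E is not connected to anything, so no chain is found for it.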

The Complete Program Code

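my @casts = ();
our %LEVEL_ONE_LINKS = ();

# decode typed names and encode printed names as UTF-8 (this assumes a UTF-8
# terminal) so that they compare equal to the decoded names read from the file
binmode STDIN, ":encoding(UTF-8)";
binmode STDOUT, ":encoding(UTF-8)";

{
    # decode the file as UTF-8 so that \p{L} sees accented letters as single characters
    open my $input, "<:encoding(UTF-8)", "data.txt" or die $!;
    while (<$input>)
    {
        # skip any line that has no cast list
        next unless (/"cast":\[(.*?)\]/);
        my $castList = $1;
        # split only on commas that sit between two quoted names
        my @castMembers = split /(?<="),(?=")/, $castList;
        my @cast = ();
        for (@castMembers)
        {
            push @cast, $1 if (/"([\p{L}.,' ]+?)"/);
        }
        push @casts, \@cast;
    }
}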

while (@casts)
{
    my $cast = shift @casts;
    for my $a (@$cast)
    {
        for my $b (@$cast)
        {
            unless ($a eq $b)
            {
                if (exists $LEVEL_ONE_LINKS{$a})
                {
                    # \Q...\E makes characters such as . in a name match literally
                    push @{$LEVEL_ONE_LINKS{$a}}, $b unless (grep(/^\Q$b\E$/, @{$LEVEL_ONE_LINKS{$a}}));
                }
                else
                {
                    $LEVEL_ONE_LINKS{$a} = [$b];
                }
                if (exists $LEVEL_ONE_LINKS{$b})
                {
                    push @{$LEVEL_ONE_LINKS{$b}}, $a unless (grep(/^\Q$a\E$/, @{$LEVEL_ONE_LINKS{$b}}));
                }
                else
                {
                    $LEVEL_ONE_LINKS{$b} = [$a];
                }
            }
        }
    }
}

# get rid of the @casts variable once it's no longer needed to free up memory
undef @casts;

sub findShortestChain($$)
{
    my ($first, $second) = @_;
    my $chainLength = 0;
    my $currentIndex = 0;
    my $chainLengthBoundary = 1;
    my @queue = ($first);
    while ($currentIndex < @queue)
    {
        return $chainLength if ($queue[$currentIndex] eq $second);
        for my $link (@{$LEVEL_ONE_LINKS{$queue[$currentIndex]}})
        {
            # \Q...\E makes characters such as . in a name match literally
            push @queue, $link unless (grep(/^\Q$link\E$/, @queue));
        }
        if (++$currentIndex == $chainLengthBoundary)
        {
            $chainLength++;
            $chainLengthBoundary = @queue;
        }
    }
    return undef;
}

sub askYesOrNo
{
    while (1)
    {
        print "Would you like to find the shortest chain between another pair of actors (Y/N)? ";
        my $answer = <STDIN>;
        $answer = uc(substr $answer, 0, 1);
        if ($answer eq 'Y')
        {
            return 1;
        }
        elsif ($answer eq 'N')
        {
            return 0;
        }
        else
        {
            print "Please enter Y or N.\n";
        }
    }
}

do {
    my ($first, $second);
    until (defined $first and exists $LEVEL_ONE_LINKS{$first})
    {
        print "I don't have any data for $first.\n" if (defined $first);
        print "Enter an actor: ";
        $first = <STDIN>;
        chomp $first;
    }
    until (defined $second and exists $LEVEL_ONE_LINKS{$second})
    {
        print "I don't have any data for $second.\n" if (defined $second);
        print "Enter another actor: ";
        $second = <STDIN>;
        chomp $second;
    }
    my $result = findShortestChain($first, $second);
    # the whole conditional must sit inside print's parentheses; written as
    # print (defined $result)? ... Perl would print only the result of defined
    print((defined $result) ? "The shortest chain from $first to $second is $result.\n" :
                              "There is no chain connecting $first and $second.\n");
} while (askYesOrNo());

Sunday, March 22, 2020

File I/O

Sometimes we want to read input from a file instead of having the user type it at the keyboard. Likewise, sometimes we want our output to be saved to a file instead of printed to the console. The way we do this is by reading from or writing to a file.

The `open` function

In order to read from or write to a file, the program must open that file. This is done by calling the open function, which uses the syntax open $var, $mode, $path. The first argument, $var, represents the variable that will be used to access the file moving forward. Often this is a newly declared variable, so make sure to prefix its name with the my keyword if so. The second argument represents the mode in which the file is to be opened and most commonly takes one of three values:

Value	Meaning
`<`	Read-only mode
`>`	Write-only mode
`>>`	Append mode

IMPORTANT NOTE: If a file that already exists is opened in write-only mode, the previous contents of that file will be overwritten (i.e. irretrievably gone). Append mode also allows information to be written to a file, but in append mode, the data written is tacked onto the end of whatever data was already in the file prior to its being opened.

The third argument contains the path to the file you want to open, which is interpreted relative to the current working directory unless you specify an absolute path (i.e. one that starts with / on Unix-like systems or with a drive letter on Windows).

As long as you store the file to a local variable (i.e. one declared using the keyword my), Perl will automatically close the file once the variable goes out of scope. This makes it especially important that you double-check to make sure you have used the keyword my in declaring your file variable—Perl will treat the file variable as a global if you forget to do so, meaning that it will not be automatically closed. Suffice to say, that would not be good.

The usual practice when invoking the open function is to use the formulation open $var, $mode, $path or die $!. The or die $! part tells Perl to exit if the file cannot be opened for whatever reason. The special variable $! is used by Perl to store the error message generated by open if the file cannot be opened. For example, suppose the following program is saved as FileIO.plx. It attempts to open a text file called DNE.txt, which, as its name might suggest, does not exist:

open my $input, "<", "DNE.txt" or die $!;

When this program is run, it produces the output No such file or directory at FileIO.plx line 1. This tells us that Perl was unable to locate a file called DNE.txt in the current directory.
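The difference between write-only mode and append mode described above can be seen in a short sketch. It assumes the core File::Temp module to create a scratch file so that nothing real gets overwritten; writeLine and readAll are helper names I made up for the illustration.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# create a scratch file so that no real file is overwritten
my (undef, $path) = tempfile(UNLINK => 1);

# open the scratch file in the given mode and write one line to it
sub writeLine
{
    my ($mode, $text) = @_;
    open my $out, $mode, $path or die $!;
    print $out "$text\n";
    close $out or die $!;
}

# return the entire contents of the scratch file as one string
sub readAll
{
    open my $in, "<", $path or die $!;
    local $/;            # slurp mode: read the whole file at once
    return scalar <$in>;
}

writeLine(">",  "first");
writeLine(">>", "second");   # tacked on after "first"
my $afterAppend = readAll();
writeLine(">",  "third");    # replaces everything written so far
my $afterWrite = readAll();

print "After appending:\n$afterAppend";
print "After reopening in write-only mode:\n$afterWrite";
```

After the append, the file holds both first and second; after it is reopened in write-only mode, only third remains.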

Reading from a File

Perl reads files one line at a time by enclosing the file variable in angle brackets < >. Note this is very similar to how we get user input from the keyboard using <STDIN>; STDIN is actually a special type of “file variable” that interfaces with the operating system to get input from the user’s keyboard. Suppose we have stored in the same directory as our program a text file called sample.txt with the following contents:

The quick brown fox jumped over the lazy dogs.
She sells seashells by the seashore.
Peter Piper picked a peck of pickled peppers.

The following program reads the file, one line at a time, and prints its contents to the console:

open my $input, "<", "sample.txt" or die $!;
print $_ while (<$input>);

Writing to a File

Perl writes to files using a variation on the print statement: print $fileVariable $stuffToPrint. Note that there is no comma between $fileVariable and $stuffToPrint. This is important: if there is a comma, Perl treats $fileVariable as just another item to print, so it prints a text representation of the file variable (something like GLOB(0x...)) followed by $stuffToPrint to the console. The following program will take our sample.txt from the previous example and copy its contents into a new file, copy.txt:

my @lines = ();
{
    open my $input, "<", "sample.txt" or die $!;
    push @lines, $_ while (<$input>);
}
{
    open my $output, ">", "copy.txt" or die $!;
    print $output $_ for (@lines);
}

Notice how I have put the read and write portions of the code in separate code blocks. It is good programming practice to only keep a file open for the minimum amount of time it absolutely has to be. As soon as I’m done using it, I let the file variable go out of scope so that Perl will automatically close the file. Likewise, it is generally considered poor practice to have multiple files open at once unless this is absolutely necessary for some reason, which is why I use the @lines array as an intermediary rather than writing directly into copy.txt from sample.txt.

As you might have suspected based on the similarity between reading from files and reading from STDIN, printing to the console also uses a special type of “file variable,” called STDOUT. Writing to a file can also be accomplished by changing the default print location to another file variable using select $fileVariable;. If you choose to go this route, make sure to change the default print location back to the console when you’re done by calling select STDOUT;, as in the below example:

my @lines = ();
{
    open my $input, "<", "sample.txt" or die $!;
    push @lines, $_ while (<$input>);
}
{
    open my $output, ">", "copy.txt" or die $!;
    select $output;
    print $_ for (@lines);
    select STDOUT;
}

Wednesday, March 11, 2020

Regular Expressions

What is a Regular Expression?

A regular expression is a construct in Perl that can be used to determine whether a string of text contains a substring that has a particular form (or pattern, as they’re called more formally). You’ve actually already seen a sneak preview of regular expressions (or regexes for short). Recall this line from the getNumberInput function in the post on subroutines, which we used to determine whether the user input was actually a number:

return $usrInput if ($usrInput =~ /^-?\d+(\.\d+)?$/);

In this post, we’ll look more closely at what that means and how it works, as well as other examples of regexes. I will refer back to this regex throughout the post as the “number example.”

The Basics

Regexes are delimited by forward slashes / /. The operator =~ is used to test whether a string contains a substring that matches the specified regex, and the operator !~ tests whether a string does not contain a substring that matches the specified regex.

The simplest regexes are literal text. For example, $string =~ /foo/ tests whether $string contains foo as a substring. Placing a lowercase i after the closing forward slash makes the search case-insensitive. So, for example, $string =~ /foo/i will be true if $string contains any of the following as substrings:

foo
Foo
fOo
foO
FOo
FoO
fOO
FOO

Like double-quoted strings, regexes are interpolated. This means that variable names appearing in a regex will be replaced with the variable’s contents. So, for example,

my $regex = "fooBar";
my $string = "fooBarBaz";
print(($string =~ /$regex/) ? "t" : "f");

would output t, since fooBar is a substring of fooBarBaz.

Note that the following fourteen characters have special meaning within a regex. In order to use any of these as literal characters, they must be preceded by a backslash \:

.
*
?
+
(
)
[
]
{
}
^
$
|
\

Anchors

What if we want to look for something to be at the very start or very end of the string? We use an anchor. The caret (^) means “match the beginning of the string,” and the dollar sign ($) means “match the end of the string” (or just before a newline at the very end of the string). So $string =~ /^foo/ matches only if the first three characters of the string are foo.

Anchoring can also be seen in the number example. Note that the first character inside the opening forward slash is ^ and the last character before the closing forward slash is $. By using both the beginning-of-string and end-of-string anchors, I tell Perl that I want to check for an exact match with the entire string, rather than checking for a substring that matches my regex.
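As a quick illustration of how the anchors narrow down a match, here is a small sketch. The anchorReport subroutine is a made-up helper that lists which of four patterns a string satisfies.

```perl
use strict;
use warnings;

# list which of the four foo patterns the given string matches
sub anchorReport
{
    my ($string) = @_;
    my @facts = ();
    push @facts, "contains" if ($string =~ /foo/);
    push @facts, "starts"   if ($string =~ /^foo/);
    push @facts, "ends"     if ($string =~ /foo$/);
    push @facts, "exact"    if ($string =~ /^foo$/);
    return join ",", @facts;
}

print "$_: ", anchorReport($_), "\n" for ("foobar", "barfoo", "foo", "bar");
```

Only the string foo itself satisfies the pattern with both anchors; foobar and barfoo each satisfy just one of the anchored patterns, and bar satisfies none.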

Character Classes

This is where the true power of regexes starts to make itself known. Instead of matching one specific character, I can specify that I want to match one of several possible characters by defining a character class. A character class is defined by placing the desired characters in square brackets [ ]. For example, $string =~ /[bcr]at/ will match if $string contains bat, cat, or rat as a substring. Character classes can also specify ranges of characters; for example, [A-Z] represents any uppercase letter. A character class can also be negated by placing a caret ^ immediately inside the opening square bracket. For example, [^A-Z] represents any character other than an uppercase letter.

Perl also provides the following predefined character classes that can be accessed via shorthand notation:

Shorthand	Represents
`\d`	A digit (0-9)
`\D`	Anything other than a digit
`\w`	A “word character”—any character that can be validly used in a Perl identifier (letter, number, or underscore `_`)
`\W`	Anything other than a word character
`\s`	A whitespace character
`\S`	Anything other than a whitespace character
`.`	Any character except a newline (add the `s` modifier after the closing slash to match newlines too)

Note the use of the \d character class in the number example above. This allows me to match any digit 0-9, which is good because I want to accept any validly formatted real number regardless of what digits it uses. Also note that, although a . appears in the number example, I am not using the “any character” class; by preceding the . with a backslash, I’ve told Perl to interpret it as a literal character. The regex will thus be looking for the actual . character in $usrInput.

Quantifiers

Quantifiers are used to specify the number of times a particular element may appear in the matching substring. There are seven different quantifiers, which are shown in the table below applied to the letter a. Note that x and y in the below examples represent non-negative integers.

Quantifier	Meaning
`a?`	Zero or one occurrences of `a`
`a*`	Zero or more occurrences of `a`
`a+`	One or more occurrences of `a`
`a{x}`	Exactly `x` occurrences of `a`
`a{x,}`	At least `x` occurrences of `a`
`a{,x}`	At most `x` occurrences of `a` (Perl 5.34 and later; on older versions, write `a{0,x}`)
`a{x,y}`	At least `x` but no more than `y` occurrences of `a`

Note that the quantifiers bind only to the immediately preceding character or character class by default. For example, ab+ will match ab, abb, abbb, and so on. To apply a quantifier to more than one character or character class, we must define the characters we want the quantifier to apply to as a group by enclosing them in parentheses ( ). So, for example, (ab)+ will match ab, abab, ababab, and so on.
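Here is a small sketch of how grouping changes what a quantifier binds to. The leadingMatch subroutine is a hypothetical helper that returns whatever the given pattern matches at the start of a string.

```perl
use strict;
use warnings;

# return the text matched at the start of $string, or undef if there is no match
sub leadingMatch
{
    my ($regex, $string) = @_;
    return $string =~ /^($regex)/ ? $1 : undef;
}

for my $string ("abbb", "ababab")
{
    my $single  = leadingMatch(qr/ab+/, $string);     # + binds to b only
    my $grouped = leadingMatch(qr/(ab)+/, $string);   # + binds to the group ab
    print "$string: ab+ matched $single, (ab)+ matched $grouped\n";
}
```

Against abbb, ab+ consumes the whole string while (ab)+ stops after ab; against ababab, the roles are reversed.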

Also note that the quantifiers are greedy by default—they will attempt to match as large a substring as they possibly can while still allowing the entire pattern to produce a match. This is of particular concern with the constructs .* and .+, which will match the largest consecutive sequence of “anything at all” that they can get their grubby paws on. Any of the quantifiers can be made reluctant by following them with a question mark ?. This will cause them to match the shortest sequence they can that will still allow the entire pattern to produce a match. For example, consider my $string = "abracadabra";. Matching it against the regex /a\w*a/, using the greedy quantifier, the initial a in the regex will match the first a in the string, the \w* will consume the bracadabr, and the last a in the regex will match the last a in the string, so the regex matches the entire string, abracadabra. Matching it against the regex /a\w*?a/, using the reluctant quantifier, the initial a in the regex will again match the first a in the string. The reluctantly quantified \w*?, however, will match only the br, stopping as soon as it gets to the second a, which gets matched to the last a in the regex. Thus, the version of the regex using the reluctant quantifier finds only abra as its match.

We see several quantifiers in the number example. Firstly, a ? quantifier is bound to the - character immediately following the start-of-string anchor. This makes the - character optional. Both occurrences of the \d character class have a + quantifier bound to them, which means there can be one or more digits; this is good since we don’t know how big of a number the user might give us. Finally, we have another ? quantifier, this time bound to the group (\.\d+). This makes the entire group optional; however, we cannot have only part of the group present. The group must either be able to match in its entirety or be completely absent.

We now know enough to describe the number example in full. The pattern matched by the number example is: the beginning of the string, optionally followed by a negative sign, followed by one or more digits, optionally followed by a decimal point and one or more additional digits, followed by the end of the string. By checking whether the user input matches this regex, we are assuring ourselves that the user has input a validly formatted real number.
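To check that description against a few inputs, here is a minimal sketch. The isNumber subroutine is a made-up wrapper around the number example, and the sample inputs are my own.

```perl
use strict;
use warnings;

# return 1 if $input is a validly formatted real number, 0 otherwise
sub isNumber
{
    my ($input) = @_;
    return ($input =~ /^-?\d+(\.\d+)?$/) ? 1 : 0;
}

printf "%-6s %s\n", $_, (isNumber($_) ? "accepted" : "rejected")
    for ("42", "-3.14", "3.", ".5", "1e5", "--2");
```

Only 42 and -3.14 are accepted. The input 3. is rejected because the optional group cannot be partly present, and .5 is rejected because at least one digit must come before the decimal point.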

Backreferences

Groups have another purpose besides just having quantifiers bound to them. They can also be used to extract a portion of the matched string to be looked at later. The extracted groups are stored in special backreference variables, which begin with $1 for the first extracted group, $2 for the second extracted group, and so on. By using groups to extract portions of our matched string into the backreference variables, we can write a program that more clearly demonstrates the difference between the greedy and reluctant quantifiers:

my $string = "abracadabra";
if ($string =~ /(a(\w*)a)/)
{
	print "The greedy quantifier consumed $2, and the entire match was $1\n";
}
if ($string =~ /(a(\w*?)a)/)
{
	print "The reluctant quantifier consumed $2, and the entire match was $1\n";
}

When run, this program produces the output

The greedy quantifier consumed bracadabr, and the entire match was abracadabra
The reluctant quantifier consumed br, and the entire match was abra

Note that the group numbering has no bearing on what groups were actually matched, only on the groups as they are specified in the regex. So, for example, in the regex /([A-Za-z]{3})?(\d+)/, the substring matched by the group (\d+) will always be in backreference variable $2, even if the optional group preceding it was not matched. If we only intend to extract some of the defined groups, and the other sets of parentheses are being used only to define a group for a quantifier to bind to, we can place the sequence ?: immediately inside the opening parenthesis of a group to tell Perl not to extract that group into a backreference variable. So, for example, using the regex /(?:[A-Za-z]{3})?(\d+)/, the optional first group will not be extracted into a backreference variable. The group we actually care about extracting, (\d+), is thus placed in backreference variable $1, since the preceding group is no longer being extracted.
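The effect of ?: on the numbering can be seen by matching the same input against both versions of the regex. In this sketch, captures is a hypothetical helper that returns the extracted groups in order, with (undef) standing in for a group that did not take part in the match.

```perl
use strict;
use warnings;

# return the extracted groups in order, marking unmatched groups as "(undef)"
sub captures
{
    my ($regex, $string) = @_;
    my @groups = $string =~ $regex;
    return map { defined $_ ? $_ : "(undef)" } @groups;
}

my @capturing    = captures(qr/([A-Za-z]{3})?(\d+)/, "42");
my @nonCapturing = captures(qr/(?:[A-Za-z]{3})?(\d+)/, "42");

print "With a capturing first group:     @capturing\n";
print "With a non-capturing first group: @nonCapturing\n";
```

With the capturing version, the digits land in the second slot and the first slot is left undefined; with the non-capturing version, the digits are the only extracted group.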

Thursday, February 27, 2020

Binary Search

A binary search is an algorithm for searching for an item in a sorted list. It can be implemented in Perl for numeric data as follows:

sub binarySearch
{
	# $data will be an array reference
	my $data = shift;
	my $target = shift;
	# // (defined-or, Perl 5.10 and later) supplies the default only when no
	# argument was passed; || would also replace a legitimate index of 0
	my $start = shift // 0;
	my $end = shift // $#$data;
	# if start and end pointers cross over, $data does not contain $target
	return -1 if ($end < $start);
	# Perl always does floating point division,
	# so we have to tell it to convert to an integer
	my $mid = $start + int(($end - $start) / 2);
	if ($data->[$mid] == $target)
	{
		return $mid;
	}
	elsif ($data->[$mid] > $target)
	{
		return binarySearch($data, $target, $start, $mid - 1);
	}
	else
	{
		return binarySearch($data, $target, $mid + 1, $end);
	}
}

This subroutine handles the retrieval of arguments in a slightly different way than the subroutines we saw in the previous post. The shift function is used to remove and return the element at index 0 of a specified array or, if no array is specified, of @_. If shift is called on an empty array, it returns undef. This behavior is taken advantage of to make the start and end arguments optional: the program first shifts a value from @_ and hands it to the defined-or operator //. If the value is defined, // returns it and it is assigned to the variable. If shift returned undef, // evaluates and returns its right-hand operand instead, thus providing default values for $start and $end that will be used if the caller does not supply such values. The logical or operator || would be a tempting alternative, but it tests for truth rather than definedness: a recursive call that passes 0 as $end would have that 0 silently replaced by the default $#$data, and the search could then recurse forever.
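One pitfall when supplying defaults for optional arguments is that a legitimate value of 0 is a false value. The sketch below contrasts the logical or operator || with the defined-or operator // (available since Perl 5.10); the two subroutines and the default of 9 are made up purely for the comparison.

```perl
use strict;
use warnings;

# default chosen with ||: any false value, including 0, is replaced
sub endWithOr
{
    my $end = shift || 9;   # 9 stands in for a "last index" default
    return $end;
}

# default chosen with //: only a missing (undefined) value is replaced
sub endWithDefinedOr
{
    my $end = shift // 9;
    return $end;
}

print "|| with 0:        ", endWithOr(0), "\n";
print "// with 0:        ", endWithDefinedOr(0), "\n";
print "|| with no value: ", endWithOr(), "\n";
print "// with no value: ", endWithDefinedOr(), "\n";
```

Both versions fall back to 9 when no argument is passed, but only the // version keeps an explicit 0.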

Also note that we can access the value at a particular index of an array to which we have a reference by using the arrow operator -> followed by the index we want to access. This is similar to how we called subroutines from a reference in the previous post.

Two additional items to note pertain to the use of this subroutine by the caller. Firstly, the input array to which $data is a reference must be sorted—this is a requirement of the binary search algorithm. Secondly, because we have not specified the number and types of the arguments we are expecting (this is because we make some of the arguments optional, which cannot be done if the number and types of arguments are specified), Perl will not automatically convert an array into an array reference before passing it as an argument to the subroutine—it is the responsibility of the caller to do so. An example of a complete program that shows how this subroutine would be called is shown below:

sub binarySearch
{
	# $data will be an array reference
	my $data = shift;
	my $target = shift;
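	# // (defined-or, Perl 5.10 and later) supplies the default only when no
	# argument was passed; || would also replace a legitimate index of 0
	my $start = shift // 0;
	my $end = shift // $#$data;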
	# if start and end pointers cross over, $data does not contain $target
	return -1 if ($end < $start);
	# Perl always does floating point division,
	# so we have to tell it to convert to an integer
	my $mid = $start + int(($end - $start) / 2);
	if ($data->[$mid] == $target)
	{
		return $mid;
	}
	elsif ($data->[$mid] > $target)
	{
		return binarySearch($data, $target, $start, $mid - 1);
	}
	else
	{
		return binarySearch($data, $target, $mid + 1, $end);
	}
}

my @rawData = ();
push @rawData, int(rand(100)) for (1..500);
# input data to a binary search must be sorted
my @sortedData = sort { $a <=> $b } @rawData;
my $target;
until (defined $target)
{
    print "Enter a number 0-99: ";
    my $usrInput = <STDIN>;
    chomp $usrInput;
    unless ($usrInput =~ /^\d{1,2}$/)
    {
        print "$usrInput is not between 0-99. Please enter a number between 0-99.\n";
        next;
    }
    $target = $usrInput;
}
my $result = binarySearch(\@sortedData, $target);
if ($result == -1)
{
	print "$target was not found in the data.\n";
}
else
{
	print "$target was found at index $result of the data.\n";
}

Notice how the @rawData array is sorted to create the @sortedData array, and it is this array that is passed to binarySearch. The { $a <=> $b } in the invocation of the sort function instructs Perl to perform a numeric sort (the default sort treats the array’s elements as strings and sorts them lexicographically). Also notice how, when binarySearch is called, it is not @sortedData itself that is passed as an argument but rather a reference to it (created by prefixing a backslash). Finally, notice how only two arguments—the data to be searched and the value to search for—are passed to binarySearch when it is called by the main program. Since no values for them have been specified by the caller, the $start and $end variables will take on the default values specified in the binarySearch subroutine—in this case, the first and last indices of the data.
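The difference between the default sort and a numeric sort is easy to see on a handful of values. This is just a sketch with made-up numbers.

```perl
use strict;
use warnings;

my @numbers = (10, 9, 100, 1);

# default sort: elements are compared as strings, so "10" sorts before "9"
my @asStrings = sort @numbers;
# numeric sort: the <=> operator compares the elements as numbers
my @asNumbers = sort { $a <=> $b } @numbers;

print "Default sort: @asStrings\n";
print "Numeric sort: @asNumbers\n";
```

The default sort produces 1 10 100 9, which would break a binary search over numbers; the numeric sort produces 1 9 10 100.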

Saturday, February 22, 2020

Subroutines

A subroutine is a block of code with a defined name that, instead of being executed immediately when it is encountered in the program, is stored for later use. The block of code can then be run (potentially multiple times) by using the name defined for it later in the program.

A Note About Terminology

Many languages refer to what Perl calls subroutines as functions. In Perl, a function is something that is built into Perl, such as chomp that we’ve used in earlier posts to strip the trailing newline from user input, whereas a subroutine is written by the user.

A Simple Subroutine with No Parameters or Return Value

Subroutines in Perl are defined using the keyword sub, followed by the name of the subroutine, followed by the block of code to be executed when the subroutine is called. For example, the following subroutine, when called, will print the digits 0-9 to the screen, each on their own line:

sub printDigits
{
    print "$_\n" for (0..9);
}

The Argument Array

Okay, technically I lied when I said that the printDigits subroutine above has no parameters. Because I haven’t explicitly told Perl what types of parameters I’m looking for, Perl will allow me to pass in as many arguments as I want of whatever types I want when I go to call this subroutine later in the program. The arguments are stored in the variable @_. For example, this subroutine, when called, will print out any arguments passed to it as a comma-separated list:

sub commaSeparated
{
    my @args = @_;
    print "$args[$_], " for (0..($#args - 1));
    print $args[$#args];
}

Notice how the subroutine starts by unpacking @_ into a local variable, @args. As with $_ in the context of loops, @_ is just an alias to the actual arguments: copying the arguments into a local variable and accessing them using that local variable instead of @_ ensures that, if we modify one of the arguments, those modifications are not reflected outside of the subroutine.
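The aliasing behavior can be demonstrated with two small subroutines, one that works on @_ directly and one that works on a copy. Both subroutine names are made up for this sketch.

```perl
use strict;
use warnings;

# works on @_ directly, so the caller's variables are modified
sub doubleInPlace
{
    $_ *= 2 for (@_);
    return;
}

# works on a copy of the arguments, so the caller's variables are untouched
sub doubleCopy
{
    my @args = @_;
    $_ *= 2 for (@args);
    return @args;
}

my ($x, $y) = (1, 2);
my @doubled = doubleCopy($x, $y);
print "After doubleCopy:    x=$x y=$y (returned @doubled)\n";
doubleInPlace($x, $y);
print "After doubleInPlace: x=$x y=$y\n";
```

After doubleCopy, $x and $y still hold 1 and 2; after doubleInPlace, they hold 2 and 4.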

Returning a Value

Subroutines can also return a value back to the caller using the keyword return followed by the value to be returned. For example, the following subroutine adds up the arguments passed to it and returns their sum:

sub add
{
    my @args = @_;
    my $result = 0;
    $result += $_ for (@args);
    return $result;
}

Specifying Parameters

Perl allows you to specify the number and type of parameters to be passed to a subroutine. This is done by placing sigils in parentheses after the subroutine’s name (a feature Perl calls a prototype). For example, the following division subroutine takes two scalar values as its arguments:

sub divide($$)
{
    my ($dividend, $divisor) = @_;
    return $dividend / $divisor;
}

Unfortunately, it is not possible to specify the type any more specifically than by its sigil (so you can’t require that the arguments be numeric values). Also note that, even when the number and type of arguments is specified, the arguments are still stored in @_: they cannot be named in the subroutine declaration, only by unpacking them into local variables as is done on the first line of the divide subroutine shown above.

Passing Arrays and Hashes to a Subroutine Using References

Arrays in Perl are automatically flattened: if an array “contains” another array, the contents of the inner array are unpacked into the outer array, meaning each element of the former inner array is treated as a single element of the outer array, and it is impossible to determine just by looking at the outer array where the start and end of the inner array were. Likewise, if an array “contains” a hash, the hash is flattened into the array, destroying the key-value associations in the process. Because the arguments of a subroutine are passed to it as an array, this means that special measures must be taken in order to pass an array or hash as an argument to a subroutine without losing its structure.
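The flattening described above can be seen in a few lines. This is only a sketch; countArgs is a hypothetical subroutine added to show what arrives in @_.

```perl
use strict;
use warnings;

my @inner = (2, 3);
my @outer = (1, @inner, 4);        # @inner is flattened into @outer
my $outerCount = scalar(@outer);   # 4 elements, not 3
print "outer has $outerCount elements: @outer\n";

my %ages = (Joe => 30);
my @mixed = ("start", %ages);      # the hash becomes the plain list ("Joe", 30)
print "mixed is: @mixed\n";

# Hypothetical subroutine: reports how many arguments it received.
sub countArgs
{
    return scalar(@_);
}

my $argCount = countArgs(@inner, @outer);   # 2 + 4 = 6 separate arguments
print "countArgs received $argCount arguments\n";
```

The subroutine has no way of telling where @inner stopped and @outer began, which is exactly the problem that references solve.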

To specify an array or hash in the parameter list, prepend a backslash \ to the appropriate sigil. When the subroutine is called, a reference to the array or hash passed as an argument is placed in @_. It can then be dereferenced into a local variable by using the appropriate sigil for an array or hash, followed by the reference as accessed from @_ enclosed in curly braces { }. For example, the following subroutine takes an array and a hash and returns an array containing the elements of the array that are keys in the hash:

sub findKeys(\@\%)
{
    my @searchingFor = @{$_[0]};
    my %hashToSearch = %{$_[1]};
    my @found = ();
    for (@searchingFor)
    {
        push @found, $_ if (exists $hashToSearch{$_});
    }
    return @found;
}

It is also possible to store the references themselves to local variables, which would carry the scalar sigil $, and work with the references directly, dereferencing them each time they are used. For example,

sub findKeys(\@\%)
{
    my ($searchingFor, $hashToSearch) = @_;
    my @found = ();
    for (@$searchingFor)
    {
        push @found, $_ if (exists $$hashToSearch{$_});
    }
    return @found;
}

Note that when the reference is stored in a scalar variable instead of at an index in an array, it is not necessary to surround it with curly braces when dereferencing it (although the curly braces can still be used if one so desires). Also note that when the references are used directly instead of being dereferenced into a local variable, any changes made to the contents of the array or hash being referenced will continue to be visible after the function has returned. So, for example, the following subroutine takes an array and a hash and removes from the hash all key-value pairs for which the key is contained in the array:

sub removeKeys(\@\%)
{
    my ($keysToRemove, $hashToPrune) = @_;
    delete $$hashToPrune{$_} for (@$keysToRemove);
}

A simple program using this subroutine is shown below:

sub removeKeys(\@\%)
{
    my ($keysToRemove, $hashToPrune) = @_;
    delete $$hashToPrune{$_} for (@$keysToRemove);
}

sub printHash(\%)
{
    my %hash = %{$_[0]};
    print "$_ => $hash{$_}\n" for (keys %hash);
}

my @searchTerms = ("foo", "bar", "baz");
my %searchingIn = (
    foo => 3,
    bar => 7,
    qux => 9
);

printHash(%searchingIn);
removeKeys(@searchTerms, %searchingIn);
print "After calling removeKeys:\n";
printHash(%searchingIn);

This program produces the following output:

bar => 7
qux => 9
foo => 3
After calling removeKeys:
qux => 9

This output also demonstrates an important fact to note about the behavior of hashes: they are unordered. When you iterate over a hash, the only thing you are guaranteed is that every key-value pair will be generated exactly once—Perl makes no guarantees about the order in which they are generated.
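When a repeatable order matters, one common approach is to sort the keys before iterating. A minimal sketch, reusing the hash contents from the program above:

```perl
use strict;
use warnings;

my %searchingIn = (foo => 3, bar => 7, qux => 9);

# sort returns the keys in ascending string order, so the output is the same on every run
my @orderedKeys = sort keys %searchingIn;
print "$_ => $searchingIn{$_}\n" for (@orderedKeys);
```

This prints the pairs in the order bar, foo, qux regardless of how Perl happens to store the hash internally.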

Subroutine References

Recall that in the post on conditional statements, we used a hash to substitute for an extended if-elsif-else chain to determine which of several strings to print. What if we have an extended if-elsif-else chain where the operations to be performed are more complicated than just printing some string? Can we still use a hash instead of an if-elsif-else chain? The answer is yes: we do it by storing references to subroutines as the hash’s values.

A reference to a subroutine is created by prepending \& to the name of the subroutine. A subroutine reference is never followed by a list of arguments—the arguments will be supplied when we dereference the subroutine and call it. Calling a subroutine from a reference is done by using the reference, followed by the arrow operator ->, followed by the argument list. Consider the following program:

sub getNumberInput
{
    while (1)
    {
        print "Enter a number: ";
        my $usrInput = <STDIN>;
        chomp $usrInput;
        return $usrInput if ($usrInput =~ /^-?\d+(\.\d+)?$/);
        # if user input was valid, subroutine will have returned on the previous line
        # and so this line will not be executed
        print "$usrInput is not a valid number. Please enter a number.\n";
    }
}

sub add($$)
{
    my ($a, $b) = @_;
    return $a + $b;
}

sub subtract($$)
{
    my ($a, $b) = @_;
    return $a - $b;
}

sub multiply($$)
{
    my ($a, $b) = @_;
    return $a * $b;
}

sub divide($$)
{
    my ($a, $b) = @_;
    return $a / $b;
}

sub mod($$)
{
    my ($a, $b) = @_;
    return $a % $b;
}

sub exp($$)
{
    my ($a, $b) = @_;
    return $a ** $b;
}

my %options = (
    addition => \&add,
    subtraction => \&subtract,
    multiplication => \&multiply,
    division => \&divide,
    modulo => \&mod,
    exponentiation => \&exp
);

my $first = getNumberInput();
my $second = getNumberInput();
my $operation;
until (exists $options{$operation})
{
    print "That operation is not supported.\n" if (defined $operation);
    print "Enter an operation: ";
    $operation = <STDIN>;
    chomp $operation;
}
my $result = $options{$operation}->($first, $second);
print "The result is $result.\n";

The second-to-last line is the one that is of interest to us. We retrieve a subroutine reference from the %options hash and use the arrow operator to simultaneously dereference it and call it with $first and $second as its arguments.
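Because the calculator above needs keyboard input, here is a smaller, non-interactive sketch of the same technique. The subroutines square and negate and the hash %transforms are hypothetical names invented for this illustration.

```perl
use strict;
use warnings;

sub square
{
    my ($n) = @_;
    return $n * $n;
}

sub negate
{
    my ($n) = @_;
    return -$n;
}

# The hash maps operation names to subroutine references.
my %transforms = (
    square => \&square,
    negate => \&negate
);

my @lines = ();
for my $name ("square", "negate", "cube")
{
    if (exists $transforms{$name})
    {
        # Look up the reference, then dereference and call it with the arrow operator
        my $value = $transforms{$name}->(4);
        push @lines, "$name of 4 is $value";
    }
    else
    {
        push @lines, "$name is not supported";
    }
}
print "$_\n" for (@lines);
```

The exists check plays the role of the final else in an if-elsif-else chain: any name without an entry in the hash falls through to the error message.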

Saturday, February 15, 2020

Do It Again – Loops

Loops are the way in which Perl allows the same code to be executed multiple times. There are three main types of loops: for, while, and do-while. In order to understand the first type of loop in Perl—the for loop—we must first discuss a new data type—arrays.

Arrays

An array is a list of values. Array variables are declared using the sigil @. The listing of the array’s contents is bounded by parentheses, and the values are separated from one another by commas. For example, we could declare my @friends = ("Joe Smith", "Frank Jones", "Linda Brown");. When accessing the values in the array, the name of the array variable is prefixed by the scalar sigil $ (since the value to be retrieved is eventually a scalar) and followed by the desired value’s index (the position in the array at which the desired value appears) enclosed in square brackets [ ]. Note that array indices in Perl begin with 0. So, for example, to get "Joe Smith" out of the array declared earlier, we would use $friends[0].

It is also possible to access a slice (sub-array) of the array by placing a range inside the square brackets. A range is declared by giving the first and last numbers desired to be included in the range separated by two dots, for example 1..5. In this case, since the value that will eventually be retrieved is also an array, we would prefix the array variable with the array sigil @. So, for example,

my @months = ("January", "February", "March", "April", "May", "June",
              "July", "August", "September", "October", "November", "December");
my @secondQuarter = @months[3..5];
print "The months in the second quarter are @secondQuarter";

produces the output The months in the second quarter are April May June. Note that a range evaluates to a list of values, so it is perfectly valid to say, for example, my @foo = 4..8;.

IMPORTANT: Ranges are inclusive at both ends. The declaration above is equivalent to my @foo = (4, 5, 6, 7, 8);, not my @foo = (4, 5, 6, 7); as a programmer coming from a language such as Java might expect.

An array’s length (the number of elements it contains) can be obtained by evaluating the array in scalar context, for example by assigning it to a scalar variable (my $length = @months;) or by passing it to the built-in scalar function (scalar(@months)). For the months array from earlier, either form evaluates to 12. Note that $months by itself does not give the length: it names a completely separate scalar variable. Prefixing $# to the name of an array variable gives the value of the last index of that array variable (i.e. one less than the length). So, for example, $#months would evaluate to 11.
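Here is a short sketch of reading an array's length and last index. It relies on the array being evaluated in scalar context; the variable names $count, $alsoCount, and $lastIndex are my own.

```perl
use strict;
use warnings;

my @months = ("January", "February", "March", "April", "May", "June",
              "July", "August", "September", "October", "November", "December");

my $count = scalar(@months);   # scalar() forces scalar context: 12
my $alsoCount = @months;       # assigning an array to a scalar also gives the length: 12
my $lastIndex = $#months;      # index of the final element: 11

print "count = $count, last index = $lastIndex, last month = $months[$lastIndex]\n";
```

This prints count = 12, last index = 11, last month = December.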

The `for` Loop

The for loop is used to iterate over the contents of some array (which could be a range). By default, the current element in the loop is aliased to the variable name $_. So, for example, we can print out the numbers 0-9 inclusive, each on their own line, with the following code:

for (0..9)
{
    print "$_\n";
}

We can also specify our own alias for the current loop element by placing it after the for and before the opening parenthesis. For example,

for my $i (0..9)
{
    print "$i\n";
}

produces exactly the same output as the version using $_.

We can also use an array variable to iterate over in a for loop. For example, we can print out the months of the year, each on their own line, using the following code:

my @months = ("January", "February", "March", "April", "May", "June",
              "July", "August", "September", "October", "November", "December");
for (@months)
{
    print "$_\n";
}

WARNING: Whether you define your own name for the loop variable or use $_, it is just an alias for the current position in the array being iterated over. You can do anything with the alias that you could do with $someArray[$whateverIndex], including modify the contents of the array. For example,

my @nums = 0..4;
for (@nums)
{
    $_++;
}
print "My nums are @nums";

produces the output My nums are 1 2 3 4 5. The modification made through the loop variable persists after the loop ends, unlike in languages such as Java, where the loop variable of a for-each loop holds a copy of the current element’s value, so assigning to it leaves the array unchanged.

As with the simple if and unless statements, when the code to be executed by a for loop is a single line, the for can be placed after that line, for example print "$_\n" for (0..9). Unlike with the “normal” for loop, you cannot define your own alias for the loop variable when using the postfix syntax; you must use $_.

The `while` and `do`-`while` Loops

Instead of taking an array and iterating over it, while and do-while loops repeat the code they contain indefinitely until a supplied condition is no longer true. For example, the following code accepts a number as input from the user and decrements that number for as long as it remains positive:

print "Enter a positive number: ";
my $selected = <STDIN>;
chomp $selected;
while ($selected > 0)
{
    print $selected--, "\n";
}

Running this program with 8 as the user input produces the following:

Enter a positive number: 8
8
7
6
5
4
3
2
1

Note that the loop condition is checked at the beginning of the loop. This means it is possible (if the user enters, say, -6 as input) that the code inside the loop will not execute at all. If we want to guarantee that the code in the loop executes at least once, we use a do-while loop, which checks the loop condition at the end of the loop. For example, rewriting the above program to use a do-while loop gives the following:

print "Enter a positive number: ";
my $selected = <STDIN>;
chomp $selected;
do
{
    print $selected--, "\n";
} while ($selected > 0);

Now, even if we were to provide as input a non-positive number such as -6, the loop would still execute at least once, even though its loop condition is initially false:

Enter a positive number: -6
-6

Just as unless (condition) is equivalent to if (not condition), until (condition) is equivalent to while (not condition). So the first version of the above program could be written as

print "Enter a positive number: ";
my $selected = <STDIN>;
chomp $selected;
until ($selected <= 0)
{
    print $selected--, "\n";
}

and the second version as

print "Enter a positive number: ";
my $selected = <STDIN>;
chomp $selected;
do
{
    print $selected--, "\n";
} until ($selected <= 0);

Exiting Early and Skipping Iterations

Suppose we have a long array of numbers, and we want to find the index of the first occurrence of some particular number in our array. As soon as we find it, we don’t need to keep looking at the rest of the array. To exit a for loop before we reach the end of the array, we use the keyword last. The last keyword can also be used to exit a while loop even while the loop condition is still true. In a similar fashion, if we want to skip a value in the array being iterated over by a for loop, or if we want to skip the rest of a while loop’s body and reevaluate its condition, we use the keyword next. An example of the usage of last and next is shown below:

my @bigData = ();
push @bigData, int(rand(100)) for (1..500);
my $loc = -1;
my $target;
until (defined $target)
{
    print "Enter a number 0-99: ";
    my $usrInput = <STDIN>;
    chomp $usrInput;
    if ($usrInput =~ /\D/ or $usrInput < 0 or $usrInput > 99)
    {
        print "$usrInput is not between 0-99. Please enter a number between 0-99.\n";
        next;
    }
    $target = $usrInput;
}
for my $i (0..$#bigData)
{
    if ($bigData[$i] == $target)
    {
        $loc = $i;
        last;
    }
}
if ($loc == -1)
{
    print "The data does not contain the value $target\n";
}
else
{
    print "The first occurrence of $target is at index $loc\n";
}

This program introduces a few new elements. The push function adds a value to the end of an array. The rand function generates a random number that is at least 0 and less than its argument, and the int function discards the fractional part of its argument, which for non-negative numbers such as these is the same as rounding down. So the combination of the two, int(rand($someNumber)), produces a random integer in the range 0..($someNumber - 1). The expression $usrInput =~ /\D/ checks whether $usrInput contains any character that is not a digit. It does this using what’s called a regular expression, which we’ll discuss in more detail in a future post. This check is necessary because of the way in which Perl implicitly converts between strings and numeric data types. When a string that does not begin with a number is used in a numeric context (such as being compared to a number using the < and > operators), it is implicitly converted to 0, which is within our range of valid inputs and thus would be accepted by the program as if the user had actually typed 0. To avoid this false positive, we have to explicitly check whether the input is non-numeric.
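The implicit conversion can be observed directly. The sketch below is my own illustration rather than part of the search program: it converts a few sample strings to numbers and validates them with the pattern /^\d+$/, a stricter alternative to the /\D/ check that also rejects empty input.

```perl
use strict;
use warnings;

my %numericValueOf;
my %verdictFor;
for my $input ("47", "foo", "12abc", "")
{
    {
        no warnings 'numeric';   # silence the "isn't numeric" warning for this demonstration
        $numericValueOf{$input} = $input + 0;
    }
    # Accept the input only if it consists entirely of one or more digits
    $verdictFor{$input} = ($input =~ /^\d+$/) ? "accepted" : "rejected";
    print "'$input' converts to $numericValueOf{$input} and is $verdictFor{$input}\n";
}
```

Here "foo" and the empty string both convert to 0, and "12abc" converts to 12, yet only "47" passes the validation.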

Now let’s look at how the next and last statements are being used. In our until loop, we ask the user for a number and check to see if it is valid. If it is not, we print an error message and invoke next. This causes the last line of code in the loop, which assigns the user input to the $target variable, to be skipped over—we don’t want to store an invalid user input as the value we’re going to try to search for. Thus the loop exit condition, that $target has a defined value, remains false, and so we prompt the user for another input.

Once we have a valid input, we start searching the @bigData array for the target value provided by the user, which we do using a for loop (remember that $#bigData represents the last index in @bigData). As soon as we find our target value, we store the index we found it at and then invoke last. This causes the loop to immediately exit, and the program continues execution with the next line of code outside the loop (in this case, the line if ($loc == -1)). We do this because, once we’ve found the first occurrence of the target value, we have the information we need—continuing to search through the rest of the array would only waste time and resources. An example run of what the user would see is shown below:

Enter a number 0-99: foo
foo is not between 0-99. Please enter a number between 0-99.
Enter a number 0-99: -6
-6 is not between 0-99. Please enter a number between 0-99.
Enter a number 0-99: 104
104 is not between 0-99. Please enter a number between 0-99.
Enter a number 0-99: 47
The first occurrence of 47 is at index 274

Saturday, February 8, 2020

Conditional Statements

Comparison Operators

Since most conditional switching is based on comparing the values of two variables, we first need to know how to do those comparisons. Perl provides two sets of comparison operators—one set for comparing numeric values, another set for comparing string values.

Comparison	Numeric Operator	String Operator
Equals	`==`	`eq`
Does Not Equal	`!=`	`ne`
Is Greater Than	`>`	`gt`
Is Greater Than or Equal To	`>=`	`ge`
Is Less Than	`<`	`lt`
Is Less Than or Equal To	`<=`	`le`

Perl does not define a boolean (true/false) data type. Instead, the comparison operators return 1 if the comparison is true and a special empty value if false. This false value is defined (it is not undef), and it is treated as 0 when used in a numeric context and "" (the empty string) when used in a string context.
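Choosing the wrong set of operators is an easy mistake to make, so here is a small sketch of how the two sets can disagree about the same values. The variable names are my own.

```perl
use strict;
use warnings;

my $x = "10";
my $y = "9";

my $numericLess = ($x < $y)  ? "yes" : "no";   # 10 < 9 is false
my $stringLess  = ($x lt $y) ? "yes" : "no";   # "10" sorts before "9" because "1" comes before "9"
print "10 less than 9? numerically: $numericLess, as strings: $stringLess\n";

my $numericEqual = ("1.0" == "1") ? "yes" : "no";   # both are the number 1
my $stringEqual  = ("1.0" eq "1") ? "yes" : "no";   # the text differs
print "1.0 equal to 1? numerically: $numericEqual, as strings: $stringEqual\n";
```

The first line reports no numerically but yes as strings; the second reports yes numerically but no as strings.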

Logical Operators

The logical operators and and or are used to combine multiple conditions. $a and $b evaluates to true if $a and $b each individually evaluate to true, and evaluates to false otherwise. $a or $b evaluates to true if at least one of $a and $b individually evaluates to true, and evaluates to false only if both individually evaluate to false. and and or can also be written symbolically as && and ||, respectively; the symbolic forms behave the same way but have higher precedence, so they bind more tightly than the assignment operator.

The logical operator not is used to reverse the value of the condition it precedes. So not $a is false if $a is true, and true if $a is false. not can also be represented symbolically as !.

A Word on Truth and Falsity

As mentioned in the discussion of comparison operators, Perl does not define explicit true and false values. For the purposes of logical operations, the following values are treated as false:

0
0.0
00 # 0 in octal
0b0 # 0 in binary
0x0 # 0 in hexadecimal
"" # the empty string
'0' # the string containing 0
() # the empty list
undef # the undefined value

All other values are treated as true.

Note that in the above, # introduces a single-line comment: everything after the # until the end of the line is ignored by the Perl interpreter. Also note that '0' evaluates to false. This is because of the implicit conversion from string to numeric values mentioned in the previous post. Only the exact string '0' gets this treatment, however: other strings that are numerically zero, such as '0.0' or '00', are treated as true.
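A quick way to check how Perl treats a particular string is simply to test it. The following sketch runs a handful of strings, chosen by me as examples, through a ternary condition:

```perl
use strict;
use warnings;

my @candidates = ("", "0", "0.0", "00", " ", "false");
my %truthOf;
for my $value (@candidates)
{
    # The value itself is used as the condition
    $truthOf{$value} = $value ? "true" : "false";
    print "'$value' is $truthOf{$value}\n";
}
```

Only the empty string and '0' come out false here; even the string 'false' is true, because it is a non-empty string other than '0'.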

Conditional Statements

Conditional Statements Generally

Conditional statements test the truth or falsity of a condition and, if the condition evaluates to the desired truth or falsity, execute the code in the following block. Code blocks in Perl begin with { and end with }. The conditional statement must always be followed by a code block, even if there is only one line of code to be executed if the condition evaluates to the desired truth value. So

if ($foo)
    print "blah blah blah";

is a syntax error: unlike in languages such as Java and C++, this code must be written as

if ($foo)
{
    print "blah blah blah";
}

even though there is only one line of code that is dependent on the conditional.

`if` and `unless`

The two simplest conditional statements are if, which executes the block of code that follows it if the given condition is true, and unless, which executes the block of code that follows it if the given condition is false (so unless ($foo) is the same as if (not $foo)). So, for example,

my $foo = 7;
if ($foo > 5)
{
    print "foo is big\n";
}
unless ($foo % 2 == 0)
{
    print "foo is odd\n";
}

produces the output

foo is big
foo is odd

In this simplest case of an if or unless that is used to execute a single statement, and does not have an attached else or elsif clause (we’ll discuss those in a moment), we can avoid having to create a code block by placing the if or unless after the statement we want executed. So the following program behaves exactly the same as the one above:

my $foo = 7;
print "foo is big\n" if ($foo > 5);
print "foo is odd\n" unless ($foo % 2 == 0);

`else` and `elsif`

An else clause can be placed after an if or unless statement, and the code in the else clause is executed if the code in the if or unless is not. So

my $foo = 7;
if ($foo < 5)
{
    print "foo is small\n";
}
else
{
    print "foo is big\n";
}

unless ($foo % 2 != 0)
{
    print "foo is even\n";
}
else
{
    print "foo is odd\n";
}

once again produces the output

foo is big
foo is odd

To chain two or more conditions together in this way (if the first condition isn’t fulfilled, check the second condition and execute its code if it is fulfilled, otherwise check the next condition, and so on and so forth until some code is executed if none of the conditions are fulfilled), the conditions after the first are stated using the keyword elsif (there is no elsunless— to get that behavior, you would have to nest an unless clause inside an else block). So, for example,

my $foo = 7;
if ($foo < 5)
{
    print "foo is small";
}
elsif ($foo < 10)
{
    print "foo is medium";
}
else
{
    print "foo is big";
}

produces the output foo is medium. Note that a final else clause is not required; if it is absent, the program will simply do nothing if none of the conditions are fulfilled. So, for example,

my $foo = 13;
if ($foo < 5)
{
    print "foo is small";
}
elsif ($foo < 10)
{
    print "foo is medium";
}

produces no output because neither of the conditions was fulfilled. Also note that the fulfilling of one condition means that the following conditions are not checked. For example, if the value of $foo in the above program were 3, the program would produce the output foo is small. It would not also print foo is medium, even though $foo < 10 is true, because as soon as one of the conditions is fulfilled, the rest of the if-elsif-else chain is bypassed.

Digression: Hashes

A hash is a built-in data type in Perl that associates keys with values. For example, a hash might be used like a contacts list to associate names with email addresses. Hash variables are declared using the sigil %. The listing of the hash’s contents is bounded by parentheses. Keys are separated from values using the so-called “fat comma” operator, =>, and key-value pairs are separated from each other by commas. For example,

my %contacts = (
	"Joe Smith" => 'jsmith@aol.com',
	"Frank Jones" => 'fjones@yahoo.com',
	"Linda Brown" => 'lbrown@hotmail.com'
);

When accessing the values in a hash, the name of the hash variable is prefixed by the sigil $ for a scalar (since the value that is eventually retrieved is a scalar), followed by the name of the key enclosed in curly braces { }. For example, using the hash declared above, print $contacts{"Joe Smith"}; produces the output jsmith@aol.com.

The builtin function exists can be used to check whether a hash contains a value for a particular key. Still using the hash declared above, exists $contacts{"Paul Williams"} would evaluate to false, since %contacts does not contain a value for the key "Paul Williams".

Using Hashes as an Alternative to Extended `if`-`elsif`-`else` Chains

Suppose we want to write a program that takes as input from the user a number between 1 and 10, inclusive, and prints out that number as a word. We could use an if-elsif-else chain: first check if the user input 1, then check 2, then 3, and so on and so forth until an error message is printed if the user’s input isn’t a number 1-10. But a chain this long can get cumbersome very quickly. Is there any way to shorten the code? Yes: we use a hash to associate the numbers with their corresponding words. The code for this program looks like this:

my %numbersSpelledOut = (
	1 => "one",
	2 => "two",
	3 => "three",
	4 => "four",
	5 => "five",
	6 => "six",
	7 => "seven",
	8 => "eight",
	9 => "nine",
	10 => "ten"
);
print "Enter a number 1-10: ";
my $userInput = <STDIN>;
chomp $userInput;
unless (exists $numbersSpelledOut{$userInput})
{
	print "I don't have the name for that number.\n";
}
else
{
	print "$numbersSpelledOut{$userInput}\n";
}

We see two new elements in this code. First, <STDIN> is an instruction to get input from the keyboard. The user types their input into the console and presses Enter to submit. Unfortunately for the programmer, the user pressing Enter to submit causes a newline character to be appended to the input string that is stored to $userInput. This is where the second new element comes in. The chomp function strips the trailing newline and stores the result back to the same variable. So, for example, if the user inputs 7, the program runs as follows:

Enter a number 1-10: 7
seven

The key-value pairs stored in the hash take care of what would have been the if and elsif blocks in the chain. What would have been the else clause is handled by the unless exists check. For example, if the user enters 13, the program runs as follows:

Enter a number 1-10: 13
I don't have the name for that number.

The program checks to see whether %numbersSpelledOut contains a value for the key 13 and, finding that it does not, prints the error message.

The Ternary Conditional Operator

One of the more common uses of conditionals is to set variables. For example, the following code sets $max to the larger of $a and $b:

my $a = 4;
my $b = 7;
my $max;
if ($a > $b)
{
    $max = $a;
}
else
{
    $max = $b;
}

The ternary conditional operator can be used to shorten this if-else construct to a single statement. It is written as $testCondition? $valueIfTrue : $valueIfFalse. So the above example could be rewritten as

my $a = 4;
my $b = 7;
my $max = ($a > $b)? $a : $b;

An unusual feature of Perl is that it allows the ternary conditional to be used on the left side of an assignment operator to determine which variable a value is to be assigned to. For example, this program assigns the larger of $a and $b to $max and the smaller of the two values to $min:

my $a = 4;
my $b = 7;
my $min;
my $max;
($a > $b)? $max : $min = $a;
(defined $max)? $min : $max = $b;

The defined function used in the last line of this code snippet checks whether a value has been assigned to the specified variable.
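To check that this technique works whichever argument is larger, here is a sketch that wraps it in a hypothetical subroutine, minMax. The extra parentheses around each ternary expression are not required, but I find they make it clearer that the assignment applies to whichever variable the ternary selects.

```perl
use strict;
use warnings;

# Hypothetical helper: returns (smaller, larger) using the ternary
# conditional as the target of an assignment.
sub minMax
{
    my ($x, $y) = @_;
    my ($min, $max);
    (($x > $y) ? $max : $min) = $x;        # $x goes into $max if it is larger, else into $min
    ((defined $max) ? $min : $max) = $y;   # $y goes into whichever variable is still unassigned
    return ($min, $max);
}

my ($min1, $max1) = minMax(4, 7);
my ($min2, $max2) = minMax(9, 2);
print "min = $min1, max = $max1\n";
print "min = $min2, max = $max2\n";
```

The first call yields min = 4, max = 7 and the second yields min = 2, max = 9.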

Saturday, February 1, 2020

Constants and Variables and Operators, Oh My!

Declaring a Variable

Sigils

Every variable name in Perl is prefixed by what is known as a sigil, which denotes in very broad terms the category of data being stored by the variable. Note that I use category rather than type. The most commonly encountered sigil, $, is used for any of the scalar data types, i.e. data types that store only a single value, such as strings, integers, floating-point numbers, etc. The sigil @ is used for arrays, and the sigil % for hashes, but we won’t be working with those just yet.

Identifiers

The identifier is the part of the variable that comes after the sigil—the variable’s actual name. The identifier can consist of any combination of uppercase letters, lowercase letters, digits, and the underscore (_) character, with the sole restriction that the first character of the identifier cannot be a digit.

Scope

The first time a variable name is used, its scope must be specified (Perl enforces this when the use strict pragma is enabled, which is the recommended practice). This is done by placing one of two keywords in front of the sigil and identifier. Local variables (those that persist only in the current code block) are declared using the keyword my. For example, to declare a local variable named foo and assign it the value 7, we would write my $foo = 7;. On future uses of this variable within the same code block, we would simply use the sigil and identifier by themselves, for example $foo = 5;. As soon as we exit the current code block, $foo disappears.

Global variables do not disappear at the end of the current code block, but instead persist throughout the entire program. Global variables are declared using the keyword our. So to declare a global variable named foo and assign it the value 7, we would write our $foo = 7;. As with local variables, we would use only the sigil and identifier on future uses of this variable. Because they persist throughout the program, global variables can create potential problems and should therefore be used sparingly.
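The difference between the two kinds of scope can be sketched as follows. The subroutine addToTotal and the variable $total are hypothetical names used only for this example, and the bare { } pair simply creates a new code block.

```perl
use strict;
use warnings;

our $total = 0;          # global: visible inside the subroutine below

sub addToTotal
{
    my ($amount) = @_;   # local to this subroutine
    $total += $amount;
}

my $foo = 7;
{
    my $foo = 5;         # a separate variable that exists only inside this block
    addToTotal($foo);    # adds 5
}
addToTotal($foo);        # the outer $foo is still 7, so this adds 7

print "foo = $foo, total = $total\n";
```

This prints foo = 7, total = 12: the inner $foo vanished at the closing brace, while $total kept accumulating across both calls.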

Constants

Perl has no keyword for declaring constant variables, that is, variables whose values cannot be changed once they have been declared. (The core constant pragma, as in use constant CM_PER_INCH => 2.54;, does define true constants, but they are written without a sigil and so do not behave like ordinary variables.) Variables whose values are intended not to change once they have been declared are conventionally given identifiers consisting of uppercase letters, with underscores where necessary to separate words. For example, the number of centimeters in one inch might be declared as our $CM_PER_INCH = 2.54;. While Perl won’t prevent us from changing this value later, the fact that its identifier is in all caps tells us we probably shouldn’t.

As an aside, declaring global constants is not nearly as problematic as declaring global non-constant variables, since the value associated with a constant is expected to remain the same throughout the entire program.
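For cases where a naming convention alone feels too weak, the constant pragma that ships with Perl is one option. A minimal sketch, assuming the same centimeters-per-inch value used above:

```perl
use strict;
use warnings;
use constant CM_PER_INCH => 2.54;   # defined once, referred to without a sigil

my $inches = 10;
my $centimeters = $inches * CM_PER_INCH;
print "$inches inches is $centimeters cm\n";
```

This prints 10 inches is 25.4 cm. One trade-off is that, lacking a sigil, such a constant is not interpolated inside double-quoted strings, which is why the result is stored in an ordinary variable before printing.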

Data Types

Numeric Data Types

Perl treats integers and floating-point numbers equally and will freely convert between them. So, from Perl’s point of view, there is no difference between 7 (integer) and 7.0 (floating-point). Perl allows the use of the underscore as a thousands separator in numeric literals: for example, my $foo = 1_000.0; is the same as my $foo = 1000.0;. Additionally, integers can be declared in hexadecimal by prefixing 0x, in octal by prefixing 0, and in binary by prefixing 0b. So, for example, 0x20, 040, and 0b100000 all evaluate to 32.

Strings

Strings are how Perl stores text. Strings can be delimited either by single quotes (') or double quotes ("), and the two types of strings work slightly differently.

Single-Quoted Strings

Strings delimited by single quotes are treated as raw text. Escape sequences (such as \n for newline) within a single-quoted string are not processed. For example, print 'Hello, World!\n'; produces the output Hello, World!\n. The \n is printed to the screen as-is rather than being converted to a newline character. The only exceptions are the escape sequences \' and \\, which allow the single-quote character and the backslash itself to appear within a string delimited by single quotes.

Double-Quoted Strings

Escape sequences appearing in strings delimited by double quotes are converted into the special characters they represent. So print "Hello, World!\nMore text on the next line"; produces the output

Hello, World!
More text on the next line

Additionally, double-quoted strings allow what is known as interpolation. When a variable name (complete with sigil) appears inside a string, the value of the variable is substituted into the string. So the code

my $foo = 7;
print "The value of foo is $foo";

produces the output The value of foo is 7. Note that the substitution occurs immediately when the string is evaluated and will not reflect subsequent changes to the value of the variable. So the code

my $foo = 7;
my $output = "The value of foo is $foo";
$foo = 12;
print $output;

also produces the output The value of foo is 7, because $foo was 7 at the time the string was evaluated. The subsequent change to the value of $foo does not cause the string to be updated to reflect the new value of $foo.
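The two quoting styles can be compared side by side. In this sketch the variable names $single and $double are my own, and length is the built-in function that returns the number of characters in a string.

```perl
use strict;
use warnings;

my $name = "World";
my $single = 'Hello, $name!\n';   # kept exactly as typed: no interpolation, no newline
my $double = "Hello, $name!\n";   # $name is substituted and \n becomes a newline

print $single, "\n";
print $double;
print 'It\'s a backslash: \\', "\n";   # \' and \\ are still processed in single quotes

my $singleLength = length($single);   # 15 characters, including the literal \ and n
my $doubleLength = length($double);   # 14 characters, including the one newline character
print "$singleLength vs $doubleLength\n";
```

The first line of output shows $name and \n literally, while the second shows Hello, World! followed by a real line break.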

Operators

Arithmetic Operators

As one might expect, +, -, *, and / are used to add, subtract, multiply, and divide numbers. Because Perl treats integers and floating-point numbers equally, the / operator always performs floating-point division, unlike in many languages where the result of dividing two integers is always an integer. Perl also includes the exponentiation operator, **, which raises the first number to the power of the second, and the modulo operator, %, which gives the remainder when the first number is divided by the second. So the code

print 24 + 7, "\n";
print 24 - 7, "\n";
print 24 * 7, "\n";
print 24 / 7, "\n";
print 24 ** 7, "\n";
print 24 % 7, "\n";

produces the output

31
17
168
3.42857142857143
4586471424
3

All of the arithmetic operators work equally well with variables as they do with numeric literals. So, for example, the code

my $foo = 24;
my $bar = 7;
print $foo + $bar;

produces the output 31.
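Since / yields a floating-point result, one way to get a whole-number quotient is to pass the result to the built-in int function, which discards the fractional part. The sketch below is only an illustration, and the variable names are made up for it.

```perl
use strict;
use warnings;

my $quotient  = int(24 / 7);   # 3.42857... truncated to 3
my $remainder = 24 % 7;        # what is left over after dividing by 7

# Recombining quotient and remainder gives back the original number.
my $check = $quotient * 7 + $remainder;

print "$quotient remainder $remainder (check: $check)\n";
```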

Compound Assignment Operators

Any of the six arithmetic operators listed above can be prefixed to the assignment operator, =, to produce what is known as a compound assignment operator. A compound assignment operator always takes a variable as its left-hand operand. Its right-hand operand can be a literal or another variable. The effect of a compound assignment operator is to apply the specified arithmetic operation and then store the result back into the variable on the left. In other words, $foo **= 2; is shorthand for $foo = $foo ** 2;. So the code

my $foo = 7;
$foo **= 2;
print $foo;

produces the output 49.
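For completeness, here is a small sketch that applies all six compound forms in sequence to a single variable. The variable $n and the starting value are arbitrary choices for illustration.

```perl
use strict;
use warnings;

my $n = 10;
$n += 5;    # 10 + 5  = 15
$n -= 3;    # 15 - 3  = 12
$n *= 2;    # 12 * 2  = 24
$n /= 4;    # 24 / 4  = 6
$n **= 2;   # 6 ** 2  = 36
$n %= 7;    # 36 % 7  = 1

print "$n\n";
```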

Increment and Decrement Operators

One of the more common cases of modifying a variable’s value and storing the result back to the same variable is adding or subtracting 1 from the variable’s value. Perl provides special operators for doing this, known as the increment (++) and decrement (--) operators. Unlike the operators discussed above, the increment and decrement operators are unary—they work on only a single value. So, for example, the code

my $foo = 24;
my $bar = 7;
$foo++;
$bar--;
print "$foo\n$bar\n";

produces the output

25
6

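The increment and decrement operators can also be placed before the variable (including sigil), e.g. ++$foo;. In the most common case, where the increment or decrement is a statement by itself, there is no difference between the two usages. The two forms differ only when the result is used inside a larger expression: the prefix form yields the updated value, while the postfix form yields the original value.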
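To illustrate how the prefix and postfix forms behave when they are part of a larger expression, consider this minimal sketch. The names $post and $pre are hypothetical.

```perl
use strict;
use warnings;

my $foo = 24;

my $post = $foo++;   # $post receives the old value (24); $foo becomes 25
my $pre  = ++$foo;   # $foo becomes 26 first; $pre receives the new value (26)

print "$post $pre $foo\n";
```

Because of this, it is usually clearest to keep increments and decrements as standalone statements unless the distinction is actually needed.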

String Operators

Perl also provides two operators whose use is specific to strings. The string concatenation operator, which joins two strings end to end, is the period (.). Note that this differs from many languages, which use + for both concatenation and addition. Perl needs a separate operator because a string whose contents can be interpreted as a numeric literal can be implicitly converted to a number, so + applied to two such strings would be ambiguous. So, for example, the code

my $foo = "6.5";
my $bar = "27";
print $foo + $bar, "\n";
print $foo . $bar, "\n";

produces the output

33.5
6.527

Notice how, when the + operator was used, the two strings were implicitly converted to numbers and added, whereas the . operator joined them as text.
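The conversion also works in the other direction: when the . operator is given numbers, it treats them as strings. The sketch below is an illustration only, with variable names invented for the purpose.

```perl
use strict;
use warnings;

my $left  = 10;   # plain numbers this time, not strings
my $right = 5;

my $sum    = $left + $right;   # numeric addition
my $joined = $left . $right;   # numbers converted to strings and joined
my $mixed  = "10" + 5;         # a numeric-looking string mixed with a number

print "$sum\n$joined\n$mixed\n";
```

In each case it is the operator, not the way the value was written, that decides whether the operands are treated as numbers or as strings.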

The second string operator is the repetition operator, x. It takes a string as its first argument and a number as its second argument, and it repeats the given string the specified number of times (if the number given is not an integer, it is rounded down; if the number given is negative, it is treated as 0). So, for example, the code print "foo"x3; produces the output foofoofoo.
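Here is a short sketch of the edge cases mentioned above. The variable names are hypothetical, and the no warnings 'numeric' line is there only because newer versions of Perl warn about a negative repeat count when warnings are enabled.

```perl
use strict;
use warnings;

my $rule  = "-" x 10;       # a common use: drawing a separator line
my $trunc = "foo" x 2.7;    # the fractional part of the count is dropped

my $none;
{
    no warnings 'numeric';  # silence the negative repeat count warning
    $none = "foo" x -1;     # a negative count gives an empty string
}

print "$rule\n$trunc\n[$none]\n";
```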

Saturday, January 18, 2020

Installation and "Hello, World!" Program

Installation

Begin by going to the website for the Padre IDE (padre.perlide.org/). Click on the “Download” link in the top right corner of the page.

Now click on the appropriate download for your operating system. I am using Windows, so the steps that follow will be for installation on Windows. Click on the first link in the “Windows” section of the download page.

This will take you to a page where you can select which version to download. Select the third file link on the page (the first two are for Linux operating systems, and the last one is an older version).

Wait for the download to complete, then run the executable. Follow the steps in the installation wizard to install the Padre IDE. Once the install is complete, open the IDE by clicking on its icon in the Start menu.

The “Hello, World!” Program

Once you have installed Perl (which is included in the Padre IDE’s installation package), it is a good idea to create and run a simple program to confirm that the execution environment has been installed successfully and is working properly. The program traditionally used for this purpose is known as the “Hello, World!” program after the text it prints to the screen.

Padre should have automatically opened a new blank file when you started it. If not, you can open a new file by pressing Ctrl-N. On the first line of the file, type print "Hello, world!\n"; (the \n represents a newline character). Save the file as HelloWorld.plx, and then run it by pressing the F5 key. You should see a black window pop up that contains the text

Hello, world!
Press any key to continue . . .

The contents of HelloWorld.plx together with the output window are shown below:

Background Information and Resources

Creation and History

Perl was the brainchild of linguist and NASA sysadmin Larry Wall, who developed it in the late 1980s as a language designed to make report processing easier. Since that time, it has grown beyond this original role to take on such tasks as automated system administration and connecting different computer systems together. It has also become one of the most popular languages for writing Common Gateway Interface (CGI) scripts on the Web.

Resources

For the purposes of this blog, I will be using Beginning Perl (available free online at perl.org/books/beginning-perl/) as my primary resource for learning the syntax and conventions of the Perl language. I will be creating and executing my Perl scripts through the Padre IDE (padre.perlide.org/).
