CS270 Language Blog - Perl: March 2020

Sunday, March 29, 2020

The Six Degrees of Kevin Bacon

What is Six Degrees of Kevin Bacon?

Six Degrees of Kevin Bacon is a parlor game based on trying to find the shortest “chain” connecting two actors and/or actresses. Each “link” in the chain is a pair of actors and/or actresses who have co-starred in a movie, and the number of links in the chain determines the length of the chain. In the seminal case of this game, one of the actors being linked is Kevin Bacon, hence the name.

To demonstrate the way in which this game works, consider the following two examples: William Shatner to Leonard Nimoy and Kevin Bacon to Natalie Portman. Shatner and Nimoy co-starred in Star Trek: The Motion Picture, so the length of the chain connecting them is 1. Bacon co-starred with Meg Gibson in Picture Perfect, and Gibson co-starred with Portman in Vox Lux, so the length of the chain connecting Bacon and Portman is 2.

The goal of this program is to find the shortest chain given the names of the two actors. Note that this program will only give the chain length as a number—it will not tell which movies and intermediate actors (if any) the chain was traced through. Throughout the development process, I will be using the Oracle of Bacon as a reference to verify my program’s results.

Loading the Raw Data into Memory

The Oracle of Bacon makes the dataset it uses to find matches publicly available for download as a compressed text file. After downloading and extracting the file, I opened it to look at how the data was formatted. Here is the first entry in the file:

{"title":"Actrius","cast":["Núria Espert","Rosa Maria Sardà","Anna Lizaran","Mercè Pons"],"directors":["Ventura Pons"],"producers":["Ventura Pons"],"companies":["Canal+ España","Els Films de la Rambla S.A.","Generalitat de Catalunya - Departament de Cultura","Televisión Española","Buena Vista International"],"year":1997}

Now knowing the data format, I was able to then write regular expressions to extract each film’s cast list. I then used the built-in split function to extract the individual actors’ and actresses’ names from the cast list and store them in an array. This portion of the code is shown below:

my @casts = ();

open my $input, "<", "data.txt" or die $!;
while (<$input>)
{
    my $castList = $1 if (/"cast":\[(.+?)\]/);
    my @castMembers = split /,/, $castList;
    my @cast = ();
    for (@castMembers)
    {
        push @cast, $1 if (/"([\p{L}.,' ]+?)"/);
    }
    push @casts, \@cast;
}

Note that since the string I want to match my regex against is stored in the special variable $_, I do not need to use the =~ operator to tell Perl to check for a match—I simply declare my regex, and Perl checks $_ for a match by default. Also note that I am storing array references into the @casts array; recall that this is necessary to prevent the sub-arrays from losing their structure when stored in the larger array. Also note the \p{L} character class in the regular expression. This is a special character class that matches anything considered a letter according to the Unicode standard. The reason I use this instead of A-Za-z is to ensure that I am correctly capturing actors and actresses such as Ricardo Montalbán whose names contain accented characters.

Having loaded into memory the cast lists of every movie in the datset, I can now construct my adjacency list—a hash that maps each actor or actress to the actors and/or actresses with whom they have co-starred. Since I can do this simply by iterating over the arrays I created earlier, I first allow my file variable to go out of scope so that Perl will close the connection to it. The code for constructing the adjacency list is shown below:

our %LEVEL_ONE_LINKS = ();
while (@casts)
{
    $cast = shift @casts;
    for my $a (@$cast)
    {
 for my $b (@$cast)
 {
     unless ($a eq $b)
     {
  if (exists $LEVEL_ONE_LINKS{$a})
  {
      push @{$LEVEL_ONE_LINKS{$a}}, $b unless (grep(/^$b$/, @{$LEVEL_ONE_LINKS{$a}}));
         }
         else
         {
      $LEVEL_ONE_LINKS{$a} = [$b];
         }
         if (exists $LEVEL_ONE_LINKS{$b})
  {
      push @{$LEVEL_ONE_LINKS{$b}}, $a unless (grep(/^$a$/, @{$LEVEL_ONE_LINKS{$b}}));
         }
         else
         {
      $LEVEL_ONE_LINKS{$b} = [$a];
         }
     }
 }
    }
}

The grep function returns a list of all elements in an array that match a given regular expression. Recall that an empty list is treated as a false value in boolean context; hence, the grep function as it is being used here essentially acts to prevent duplicate mappings being added to the adjacency list. Also note that, in the else clause, I use square brackets rather than parentheses around the sole element of my newly created array. This is because using square brackets creates an array reference, saving me the step of having to declare the array and then obtain a reference to it using a backslash as I did with the @cast array in creating the first hash.

For those of you following along at home, be forewarned that the dataset is a very large file (about 43 MB); it will take several minutes to construct the adjacency list.

Finding the Shortest Chain

This essentially boils down to a breadth-first search algorithm: given a starting node (actor/actress), check all of the adjacent nodes (actors/actresses) to see whether they match the target node (actor/actress), then check all the nodes adjacent to the adjacent nodes, and so on and so forth until the target node is found, as seen below:

sub findShortestChain($$)
{
    my ($first, $second) = @_;
    my $chainLength = 0;
    my $currentIndex = 0;
    my $chainLengthBoundary = 1;
    my @queue = ($first);
    while ($currentIndex < @queue)
    {
 return $chainLength if ($queue[$currentIndex] eq $second);
 for my $link (@{$LEVEL_ONE_LINKS{$queue[$currentIndex]}})
 {
     push @queue, $link unless (grep(/^$link$/, @queue));
 }
 if (++$currentIndex == $chainLengthBoundary)
 {
     $chainLength++;
     $chainLengthBoundary = @queue;
        }
    }
    return undef;
}

Note again the use of grep to ensure that no node (actor/actress) is visited more than once; this is also the reason for keeping all of the names in the queue and simply advancing a pointer ($currentIndex) instead of removing nodes from the queue as they are visited: if the nodes were removed as they were visited, it would create the possibility of nodes being visited more than once. Also recall that string equality uses the eq operator rather than the == operator used for numerical equality. Finally, recall that an array variable used in a scalar context evaluates to the number of elements it contains (as seen in the while loop condition).

The Complete Program Code

my @casts = ();
our %LEVEL_ONE_LINKS = ();

{
    open my $input, "<", "data.txt" or die $!;
    while (<$input>)
    {
        my $castList = $1 if (/"cast":\[(.+?)\]/);
        my @castMembers = split /,/, $castList;
        my @cast = ();
        for (@castMembers)
        {
            push @cast, $1 if (/"([\p{L}.,' ]+?)"/);
        }
        push @casts, \@cast;
    }
}

while (@casts)
{
    $cast = shift @casts;
    for my $a (@$cast)
    {
 for my $b (@$cast)
 {
     unless ($a eq $b)
     {
  if (exists $LEVEL_ONE_LINKS{$a})
  {
      push @{$LEVEL_ONE_LINKS{$a}}, $b unless (grep(/^$b$/, @{$LEVEL_ONE_LINKS{$a}}));
         }
         else
         {
      $LEVEL_ONE_LINKS{$a} = [$b];
         }
         if (exists $LEVEL_ONE_LINKS{$b})
  {
      push @{$LEVEL_ONE_LINKS{$b}}, $a unless (grep(/^$a$/, @{$LEVEL_ONE_LINKS{$b}}));
         }
         else
         {
      $LEVEL_ONE_LINKS{$b} = [$a];
         }
     }
 }
    }
}

# get rid of the @casts variable once it's no longer needed to free up memory
undef @casts;

sub findShortestChain($$)
{
    my ($first, $second) = @_;
    my $chainLength = 0;
    my $currentIndex = 0;
    my $chainLengthBoundary = 1;
    my @queue = ($first);
    while ($currentIndex < @queue)
    {
 return $chainLength if ($queue[$currentIndex] eq $second);
 for my $link (@{$LEVEL_ONE_LINKS{$queue[$currentIndex]}})
 {
     push @queue, $link unless (grep(/^$link$/, @queue));
 }
 if (++$currentIndex == $chainLengthBoundary)
 {
     $chainLength++;
     $chainLengthBoundary = @queue;
        }
    }
    return undef;
}

sub askYesOrNo
{
    while (1)
    {
 print "Would you like to find the shortest chain between another pair of actors (Y/N)? ";
 my $answer = <STDIN>;
 $answer = uc(substr $answer, 0, 1);
 if ($answer eq 'Y')
 {
     return 1;
 }
 elsif ($answer eq 'N')
 {
     return 0;
        }
        else
        {
     print "Please enter Y or N.\n";
        }
    }
}

do {
    my ($first, $second);
    until (exists $LEVEL_ONE_LINKS{$first})
    {
 print "I don't have any data for $first.\n" if (defined $first);
 print "Enter an actor: ";
 $first = <STDIN>;
 chomp $first;
    }
    until (exists $LEVEL_ONE_LINKS{$second})
    {
 print "I don't have any data for $second.\n" if (defined $second);
 print "Enter another actor: ";
 $second = <STDIN>;
 chomp $second;
    }
    $result = findShortestChain($first, $second);
    print (defined $result)? "The shortest chain from $first to $second is $result.\n" :
                    "There is no chain connecting $first and $second.\n";
} while (askYesOrNo());

Sunday, March 22, 2020

File I/O

Sometimes we want to read input from a file instead of having the user type it at the keyboard. Likewise, sometimes we want our output to be saved to a file instead of printed to the console. The way we do this is by reading from or writing to a file.

The `open` function

In order to read from or write to a file, the program must open that file. This is done by calling the open function, which uses the syntax open $var, $mode, $path. The first argument, $var, represents the variable that will be used to access the file moving forward. Often this is a newly declared variable, so make sure to prefix its name with the my keyword if so. The second argument represents the mode in which the file is to be opened and can take one of three values:

Value	Meaning
`<`	Read-only mode
`>`	Write-only mode
`>>`	Append mode

IMPORTANT NOTE: If a file that already exists is opened in write-only mode, the previous contents of that file will be overwritten (i.e. irretrievably gone). Append mode also allows information to be written to a file, but in append mode, the data written is tacked onto the end of whatever data was already in the file prior to its being opened.

The third argument contains the path to the file you want to open, which is relative by default unless you specify an absolute path (i.e. one that starts with a drive letter).

As long as you store the file to a local variable (i.e. one declared using the keyword my), Perl will automatically close the file once the variable goes out of scope. This makes it especially important that you double-check to make sure you have used the keyword my in declaring your file variable—Perl will treat the file variable as a global if you forget to do so, meaning that it will not be automatically closed. Suffice to say, that would not be good.

The usual practice when invoking the open function is to use the formulation open $var, $mode, $path or die $!. The or die $! part tells Perl to exit if the file cannot be opened for whatever reason. The special variable $! is used by Perl to store the error message generated by open if the file cannot be opened. For example, suppose the following program is saved as FileIO.plx. It attempts to open a text file called DNE.txt, which, as its name might suggest, does not exist:

open my $input, "<", "DNE.txt" or die $!;

When this program is run, it produces the output No such file or directory at FileIO.plx line 1. This tells us that Perl was unable to locate a file called DNE.txt in the current directory.

Reading from a File

Perl reads files one line at a time by enclosing the file variable in angle brackets < >. Note this is very similar to how we get user input from the keyboard using <STDIN>; STDIN is actually a special type of “file variable” that interfaces with the operating system to get input from the user’s keyboard. Suppose we have stored in the same directory as our program a text file called sample.txt with the following contents:

The quick brown fox jumped over the lazy dogs.
She sells seashells by the seashore.
Peter Piper picked a peck of pickled peppers.

The following program reads the file, one line at a time, and prints its contents to the console:

open my $input, "<", "sample.txt" or die $!;
print $_ while (<$input>);

Writing to a File

Perl writes to files using a variation on the print statement: print $fileVariable $stuffToPrint. Note that there is no comma between $fileVariable and $stuffToPrint. This is important: if there is a comma, Perl will try to concatenate the two and print the result to the console. The following program will take our sample.txt from the previous example and copy its contents into a new file, copy.txt:

my @lines = ();
{
    open my $input, "<", "sample.txt" or die $!;
    push @lines, $_ while (<$input>);
}
{
    open my $output, ">", "copy.txt" or die $!;
    print $output $_ for (@lines);
}

Notice how I have put the read and write portions of the code in separate code blocks. It is good programming practice to only keep a file open for the minimum amount of time it absolutely has to be. As soon as I’m done using it, I let it the file variable go out of scope so that Perl will automatically close the file. Likewise, it is generally considered poor practice to have multiple files open at once unless this is absolutely necessary for some reason, hence why I use the @lines array as an intermediary rather than writing directly into copy.txt from sample.txt.

As you might have suspected based on the similarity between reading from files and reading from STDIN, printing to the console also uses a special type of “file variable,” called STDOUT. Writing to a file can also be accomplished by changing the default print location to another file variable using select $fileVariable;. If you choose to go this route, make sure to change the default print location back to the console when you’re done by calling select STDOUT;, as in the below example:

my @lines = ();
{
    open my $input, "<", "sample.txt" or die $!;
    push @lines, $_ while (<$input>);
}
{
    open my $output, ">", "copy.txt" or die $!;
    select $output;
    print $_ for (@lines);
    select STDOUT;
}

Wednesday, March 11, 2020

Regular Expressions

What is a Regular Expression?

A regular expression is a construct in Perl that can be used to determine whether a string of text contains a substring that has a particular form (or pattern, as they’re called more formally). You’ve actually already seen a sneak preview of regular expressions (or regexes for short). Recall this line from the getNumberInput function in the post on subroutines, which we used to determine whether the user input was actually a number:

return $usrInput if ($usrInput =~ /^-?\d+(\.\d+)?$/);

In this post, we’ll look more closely at what that means and how it works, as well as other examples of regexes. I will refer back to this regex throughout the post as the “number example.”

The Basics

Regexes are delimited by forward slashes / /. The operator =~ is used to test whether a string contains a substring that matches the specified regex, and the operator !~ tests whether a string does not contain a substring that matches the specified regex.

The simplest regexes are literal text. For example, $string =~ /foo/ tests whether $string contains foo as a substring. Placing a lowercase i after the closing forward slash makes the search case-insensitive. So, for example, $string =~ /foo/i will be true if $string contains any of the following as substrings:

foo
Foo
fOo
foO
FOo
FoO
fOO
FOO

Like double-quoted strings, regexes are interpolated. This means that variable names appearing in a regex will be replaced with the variable’s contents. So, for example,

my $regex = "fooBar";
my $string = "fooBarBaz";
print ($string =~ /$regex/)? "t" : "f";

would output t, since fooBar is a substring of fooBarBaz.

Note that the following fourteen characters have special meaning within a regex. In order to use any of these as literal characters, they must be preceded by a backslash \:

.
*
?
+
(
)
[
]
{
}
^
$
|
\

Anchors

What if we want to look for something to be at the very start or very end of the string? We use an anchor. The carat (^) means “match the beginning of the string,” and the dollar sign ($) means “match the end of the string.” So $string =~ /^foo/ matches only if the first three characters of the string are foo.

Anchoring can also be seen in the number example. Note that the first character inside the opening forward slash is ^ and the last character before the closing forward slash is $. By using both the beginning-of-string and end-of-string anchors, I tell Perl that I want to check for an exact match with the entire string, rather than checking for a substring that matches my regex.

Character Classes

This is where the true power of regexes starts to make itself known. Instead of matching one specific character, I can specify that I want to match one of several possible characters by defining a character class. A character class is defined by placing the desired characters in square brackets [ ]. For example, $string =~ /[bcr]at/ will match if $string contains bat, cat, or rat as a substring. Character classes can also specify ranges of characters; for example, [A-Z] represents any uppercase letter. A character class can also be negated by placing a carat ^ immediately inside the opening square bracket. For example, [^A-Z] represents any character other than an uppercase letter.

Perl also provides the following predefined character classes that can be accessed via shorthand notation:

Shorthand	Represents
`\d`	A digit (0-9)
`\D`	Anything other than a digit
`\w`	A “word character”—any character that can be validly used in a Perl identifier (letter, number, or underscore `_`)
`\W`	Anything other than a word character
`\s`	A whitespace character
`\S`	Anything other than a whitespace character
`.`	Any character (literally anything at all)

Note the use of the \d character class in the number example above. This allows me to match any digit 0-9, which is good because I want to accept any validly formatted real number regardless of what digits it uses. Also note, that, although a . appears in the number example, I am not using the “any character” class; by preceding the . with a backslash, I’ve told Perl to interpret it as a literal character. The regex will thus be looking for the actual . character in $usrInput.

Quantifiers

Quantifiers are used to specify the number of times a particular element may appear in the matching substring. There are seven different quantifiers, which are shown in the table below applied to the letter a. Note that x and y in the below examples represent positive integers.

Quantifier	Meaning
`a?`	Zero or one occurrences of `a`
`a*`	Zero or more occurrences of `a`
`a+`	One or more occurrences of `a`
`a{x}`	Exactly `x` occurrences of `a`
`a{x,}`	At least `x` occurrences of `a`
`a{,x}`	At most `x` occurrences of `a`
`a{x,y}`	At least `x` but no more than `y` occurrences of `a`

Note that the quantifiers bind only to the immediately preceding character or character class by default. For example, ab+ will match ab, abb, abbb, and so on. To apply a quantifier to more than one character or character class, we must define the characters we want the quantifier to apply to as a group by enclosing them in parentheses ( ). So, for example, (ab)+ will match ab, abab, ababab, and so on.

Also note that the quantifiers are greedy by default—they will attempt to match as large a substring as they possibly can while still allowing the entire pattern to produce a match. This is of particular concern with the constructs .* and .+, which will match the largest consecutive sequence of “anything at all” that they can get their grubby paws on. Any of the quantifiers can be made reluctant by following them with a question mark ?. This will cause them to match the shortest sequence they can that will still allow the entire pattern to produce a match. For example, consider my $string = "abracadabra";. Matching it against the regex /a\w*a/, using the greedy quantifier, the initial a in the regex will match the first a in the string, the \w* will consume the bracadabr, and the last a in the regex will match the last a in the string, so the regex matches the entire string, abracadabra. Matching it against the regex /a\w*?a/, using the reluctant quantifier, the initial a in the regex will again match the first a in the string. The reluctantly quantified \w*?, however, will match only the br, stopping as soon as it gets to the second a, which gets matched to the last a in the regex. Thus, the version of the regex using the reluctant quantifier finds only abra as its match.

We see several quantifiers in the number example. Firstly, a ? quantifier is bound to the - character immediately following the start-of-string anchor. This makes the - character optional. Both occurrences of the \d character class have a + quantifier bound to them—this means there can be one or more digits, which is good since we don&rdsquo;t know how big of a number the user might give us. Finally, we have another ? quantifier, this time bound to the group (\.\d+). This makes the entire group optional; however, we cannot have only part of the group present. The group must either be able to match in its entirety or be completely absent.

We now know enough to describe the number example in full. The pattern matched by the number example is: the beginning of the string, optionally followed by a negative sign, followed by one or more digits, optionally followed by a decimal point and one or more additional digits, followed by the end of the string. By checking whether the user input matches this regex, we are assuring ourselves that the user has input a validly formatted real number.

Backreferences

Groups have another purpose besides just having quantifiers bound to them. They can also be used to extract a portion of the matched string to be looked at later. The extracted groups are stored in special backreference variables, which begin with $1 for the first extracted group, $2 for the second extracted group, and so on. By using groups to extract portions of our matched string into the backreference variables, we can write a program that more clearly demonstrates the difference between the greedy and reluctant quantifiers:

my $string = "abracadabra";
if ($string =~ /(a(\w*)a)/)
{
	print "The greedy quantifier consumed $2, and the entire match was $1\n";
}
if ($string =~ /(a(\w*?)a)/)
{
	print "The reluctant quantifier consumed $2, and the entire match was $1\n";
}

When run, this program produces the output

The greedy quantifier consumed bracadabr, and the entire match was abracadabra
The reluctant quantifier consumed br, and the entire match was abra

Note that the group numbering has no bearing on what groups were actually matched, only on the groups as they are specified in the regex. So, for example, in the regex /([A-Za-z]{3})?(\d+)/, the substring matched by the group (\d+) will always be in backreference variable $2, even if the optional group preceding it was not matched. If we only intend to extract some of the defined groups, and the other sets of parentheses are being used only to define a group for a modifier to bind to, we can place the sequence ?: immediately inside the opening parenthesis of a group to tell Perl not to extract that group into a backreference variable. So, for example, using the regex /(?:[A-Za-z]{3})?(\d+)/, the optional first group will not be extracted into a backreference variable. The group we actually care about extracting, (\d+), is thus placed in backreference variable $1, since the preceding group is no longer being extracted.