CS 124
Fall 2023

Problem Set 13

Preliminaries

In your work on this assignment, make sure to abide by the collaboration policies of the course.

Don’t forget to use docstrings and to take whatever other steps are needed to create readable code.

If you have questions while working on this assignment, please come to TA help hours or post them on Piazza.

Make sure to submit your work on Gradescope, following the procedures found at the end.


All problems are due by 10 p.m. EDT on Friday October 20, 2023.

Suggested self-deadline of Wednesday October 18, 2023.

Important note regarding test cases and Gradescope:

  • You must test each function after you write it. Here are two ways to do so:

    • Run your file after you finish a given function in the interactive shell where you can call the function using different inputs and check to see that you obtain the correct outputs.
    • Add test calls to the bottom of your file, inside the if __name__ == '__main__' control structure. For example:

      if __name__ == '__main__':
      
          print("mystery(6,7) returned", mystery(6,7))
      

      These tests will be called every time that you run the file, which will save you from having to enter the tests yourself. We have given you an example of one such test in the starter file.

  • You must not leave any print statements in the global scope. This will cause an error with the Gradescope autograder. Make sure all of your print statements are inside a function scope or inside the if __name__ == '__main__' control structure.


Problem 1: Another Date client: using the dict object

40 points; pair-optional or group-of-3-optional

In this problem, you will continue to create client code using Date objects.

Getting started
Begin by downloading the file ps13pr1.py and opening it in VS Code. At the top of the file but after your header comment, add an import statement that imports the Date class from your ps12pr2.py file.

Put your ps12pr2.py file in the same directory as the ps13pr1.py file, and you should be able to access your Date class.

Important

If you add test code to your ps12pr2.py file, please put it in if __name__ == '__main__' section, which will only execute when you run your file (but not when Gradescope imports it). However, you should not have any test code in the global scope (i.e., outside of a function).

Your tasks

  1. Write a function nye_counts(start, end) that counts how many times New Year’s Eve (December 31st) falls on each day of the week between the years start and end, inclusive of both endpoints, and that returns a dictionary containing those counts.

    For example, in the three years from 2014 to 2016, New Year’s Eve falls once on Wednesday, once on Thursday, and once on Saturday, but does not fall on any other day of the week. Thus, your function would return a dictionary with the following contents:

    {'Saturday': 1, 'Thursday': 1, 'Monday': 0, 'Sunday': 0, 'Tuesday': 0, 'Wednesday': 1, 'Friday': 0}
    

    Note: The order of the key-value pairs in a dictionary is not guaranteed, so your dictionary may show the days of the week in a different order. That is not a problem!

    In ps13pr1.py, we have given you code at the start of the function that initializes the dictionary for you. You need to complete the rest of the funtion.

    You should use a for loop to iterate over the years in the range given by the parameters start and end. Your loop header should look something like this:

    for year in _____________:
    

    where you fill in the blank with an appropriate expression.

    The body of the loop should:

    • Create a Date object to represent New Year’s Eve for the current value of year.
    • Call the day_name method for that Date object to determine the day of the week on which New Year’s Eve falls in that year.
    • Update the appropriate element of the dictionary.

    Finally, once all of the years in the range have been considered, the dictionary should be returned. Make sure that you return it and do not print it.

    Here’s another example you can use for testing:

    >>> counts = nye_counts(2014, 2113)
    >>> counts
    {'Monday': 14, 'Tuesday': 13, 'Friday': 14, 'Wednesday': 15, 'Thursday': 15, 'Sunday': 14, 'Saturday': 15}
    >>> counts['Wednesday']
    15
    

Problem 2: Markov text generation

60 points; pair-optional or group-of-three-optional

This is the only problem of the assignment that you may complete with partners. See the rules for working with a partner on pair-optional problems for details about how this type of collaboration must be structured.

In this problem, you will write a program that is capable of generating meaningful text all by itself! You will accomplish this by implementing what is known as a Markov text-generation algorithm.

Background

English is a language with a lot of structure. Words have a tendency (indeed, an obligation) to appear only in certain sequences. Grammatical rules specify legal combinations of different parts of speech. For example, the phrase “The cat climbs the stairs” obeys a legal word sequence. “Stairs the the climbs cat” does not. Additionally, semantics (the meaning of a word or sentence) further limits possible word combinations. “The stairs climbs the cat” is a perfectly legal sentence, but it doesn’t make much sense and you are very unlikely to encounter this word ordering in practice.

Even without knowing the formal rules of English or the meaning of English words, we can get an idea of which word combinations are legal simply by looking at well-formed English text and noting the combinations of words that tend to occur in practice. Then, based on our observations, we can generate new sentences by randomly selecting words according to commonly occurring sequences of these words. For example, consider the following text:

I love roses and carnations. I hope I get roses for my birthday.

If we start by selecting the word “I”, we notice that “I” may be followed by “love,” “hope,” and “get” with equal probability in this text. We randomly select one of these words to add to our sentence:“I get.” We can repeat this process with the word “get,” necessarily selecting the word “roses” as the next word. Continuing this process could yield the phrase:

I get roses and carnations.

Note that this is a valid English sentence, but not one that we have seen before. Other novel sentences we might have generated include “I love roses for my birthday.” and “I get roses for my birthday.”

More formally, the process used to generate these sentences is called a first-order Markov process. A first-order Markov process is a process in which the state at time t + 1 (i.e., the next word) depends only on the state at time t (i.e., the previous word). In a second-order Markov process, the next word would depend on the two previous words, and so on. Our example above was a first-order process because the choice of the next word depended only on the current word.

Your tasks

Implementing a first-order Markov text generator will involve writing two functions: one to process a file and create a dictionary of legal word transitions, and another to actually generate the new text.

We will consider words to be different even if they only differ by capitalization or punctuation. For example, 'spam', 'Spam', and 'spam!' will all be considered distinct words.

To start, open a new file in VS Code and save it as ps13pr2.py. Put all of the code that you write for this problem in that file. Don’t forget to include appropriate comments at the top of the file, and a docstring for each function.

Important

If you add test code to your ps13pr2.py file, please put it in if __name__ == '__main__' section, which will only execute when you run your file (but not when Gradescope imports it). However, you should not have any test code in the global scope (i.e., outside of a function).

  1. Write a function create_dictionary(filename) that takes a string representing the name of a text file, and that returns a dictionary of key-value pairs in which:

    • each key is a word encountered in the text file

    • the corresponding value is a list of words that follow the key word in the text file.

    For example, the dictionary produced for the text “I love roses and carnations. I hope I get roses for my birthday.” would include the following key-value pairs, among others:

    'I': ['love', 'hope', 'get']
    'love': ['roses']
    'roses': ['and', 'for']
    'my': ['birthday.']
    # as well as others!
    

    Guidelines:

    • You should not try to remove the punctuation from the words of the text file.

    • The keys of the dictionary should include every word in the file except the sentence-ending words. A sentence-ending word is defined to be any word whose last character is a period ('.'), a question mark ('?'), or an exclamation point ('!'). A sentence-ending word should be included in the lists associated with the words that it follows (i.e., in the value parts of the appropriate key-value pairs), but it not appear as its own key.

    • If a word w1 is followed by another word w2 multiple times in the text file, then w2 should appear multiple times in the list of words associated with w1. This will allow you to capture the frequency with which word combinations appear.

    • In addition to the words in the file, the dictionary should include the string $ as a special key referred to as the sentence-start symbol. This symbol will be used when choosing the first word in a sentence. In the dictionary, the list of words associated with the key '$' should include:

      • the first word in the file
      • every word in the file that follows a sentence-ending word.

      Doing this will ensure that the list of words associated with '$' includes all of the words that start a sentence. For example, the dictionary for the text “I scream. You scream. We all scream for ice cream.” would include the following entry for the sentence-start symbol:

      '$': ['I', 'You', 'We']
      
    • You may find it helpful to consult the word_frequencies function from lecture. We will also discuss some additional strategies for create_dictionary in lecture.

    Examples:

    To test your code, download the sample.txt file into the same directory that contains ps13pr2.py. This sample text file contains the following contents:

    A B A. A B C. B A C. C C C.
    

    Once this file is in place, run your ps13pr2.py in VS Code and test your function from the Shell:

    >>> word_dict = create_dictionary('sample.txt')
    >>> word_dict
    {'A': ['B', 'B', 'C.'], 'C': ['C', 'C.'],
     'B': ['A.', 'C.', 'A'], '$': ['A', 'A', 'B', 'C']}
    

    The order of the keys–or of the elements within a given key’s list of values–may not be the same as what you see above, but the elements of the lists should appear in the quantities shown above for each of the four keys 'A', 'B', 'C', and '$'.

    Here are some additional files you can use for testing:

    Here again, the ordering that you obtain for the keys and list elements in the dictionaries may be different. In addition, we have edited the formatting of the dictionaries to make them easier to read.

  2. Write a function generate_text(word_dict, num_words) that takes as parameters a dictionary of word transitions (generated by the create_dictionary function) named word_dict and a positive integer named num_words. The function should use word_dict to generate and print num_words words. The function should print the words that it generates. It should not return a value.

    Guidelines:

    • The first word should be chosen randomly from the words associated with the sentence-start symbol, '$'. The second word should be chosen randomly from the list of words associated with the first word, etc. When the current word ends in a period ('.'), question mark ('?'), or exclamation point ('!'), the function should detect this and start a new sentence by again choosing a random word from among those associated with '$'.

    • Do not include '$' in the output text. It should only be used as an internal marker for your function.

    • You can use the random.choice function to choose from a list of possible words. Don’t forget to include import random at the top of your file. Then, if wordlist is the list of possible words at a given point in the generated text, you can do something like the following:

      next_word = random.choice(wordlist)
      
    • Here again, you shouldn’t try to remove or change the punctuation associated with the words, and you don’t need to worry if the generated text doesn’t end with appropriate punctuation. The generated text won’t be perfect, but most of the time it will at least be meaningful!

    • If your function encounters a word that doesn’t have any words associated with it in the dictionary, the function should start a new sentence. This situation can occur if the last word in the file used to create the dictionary was unique and did not end with punctuation.

    • Here again, we will discuss some strategies for generate_words in lecture.

    Example:

    Here are two examples using the same text file as above. Your output may differ because of the randomness involved in the generation.

    >>> word_dict = create_dictionary('sample.txt')
    >>> generate_text(word_dict, 20)
    B C. C C C. C C C C C C C C C C C. C C C. A
    
    >>> generate_text(word_dict, 20)
    A B A. C C C. B A B C. A C. B A. C C C C C C.
    

    Try some other examples using longer documents containing English words, such as the works of William Shakespeare. In particular, we are providing a text file containing the first act of Romeo and Juliet, along with the files that we provided above in the examples for create_dictionary.

Submitting Your Work

You should use Gradesope to submit the following files:

Warnings

  • Make sure to use these exact file names, or Gradescope will not accept your files. If Gradescope reports that a file does not have the correct name, you should rename the file using the name listed in the assignment or on the Gradescope upload page.

  • If you make any last-minute changes to one of your Python files (e.g., adding additional comments), you should run the file in VS Code after you make the changes to ensure that it still runs correctly. Even seemingly minor changes can cause your code to become unrunnable.

  • If you submit an unrunnable file, Gradescope will accept your file, but it will not be able to auto-grade it. If time permits, you are strongly encouraged to fix your file and resubmit. Otherwise, your code will fail most if not all of our tests.