Computer Science 111, Assignment 3
Tutorial on characters, strings, and debugging


  1. Preliminaries
  2. The primitive data type char
  3. Classifying characters
  4. The get function of cin
  5. Character literals vs. string literals and numeric literals
  6. Comparison of strings
  7. Runtime debugging


  1. Preliminaries.   First, see the preliminaries in the other tutorial.

    Hopefully you won't need to get rid of your old homework source code files as well as your old a.out files. But if you do need to remove them, you should first back them up (via FTP) onto more than one diskette.


  2. The primitive data type char.   So far, we have dealt primarily with primitive data types representing numbers (integers and floating-point). As we have seen, there is also another primitive data type, char, used to represent text characters. We have also dealt briefly with text in the form of strings of characters, either C-strings or objects of the C++ string class. In either kind of string, the individual characters within the string are of type char.

    Like anything else in a computer's memory, a character is stored as a sequence of bits (binary digits, i.e. 0's and 1's). On both Windows/DOS and Unix systems, the most commonly-used characters are each represented by a unique sequence of 7 bits known as the character's ASCII code. They are traditionally stored as bytes (8 bits), i.e. the 7-bit ASCII code with an extra zero at the beginning.

    In your web browser, go to the course website and click on "Links to other relevant sites" in the index on the main page, then click on "ASCII, Unicode, and other character encodings," and look briefly at a few of the listed sites with ASCII charts. Some of the sites list decimal (base 10) representations of the ASCII values; others list hexadecimal (base 16) representations; others list both.

    Why base 16? Because it is very easy to convert back and forth between hexadecimal and binary, as follows:

       Hex    Binary
    
        0      0000
        1      0001
        2      0010
        3      0011
        4      0100
        5      0101
        6      0110
        7      0111
        8      1000
        9      1001
        A      1010
        B      1011
        C      1100
        D      1101
        E      1110
        F      1111
    

    Note that each hexadecimal digit corresponds to 4 bits (binary digits); hence 4 hexadecimal digits correspond to 16 bits.

    The ASCII characters include not only letters, digits, and punctuation marks, but also some things we don't ordinarily think of as characters:   the control characters, such as the backspace, the tab key, the carriage return, and the Escape (ESC) key. The control characters are also known as nonprintable characters. The characters whose ASCII codes have numeric values in the range 0 to 31 (base 10) are all control characters. Although you're probably not accustomed to thinking of the control characters as characters, the computer stores them the same way it stores printable characters, and thus they are values of type char.

    In the ASCII charts, observe that the capital letters and the lower-case letters are distinct characters. Observe also that the capital letters are grouped together in alphabetical order. Thus, consecutive capital letters have consecutive ASCII codes. The lower-case letters are similarly grouped together in alphabetical order. Likewise, the digit characters are grouped together in numerical order. Note that the ASCII value of a digit character is not the same as its intended numeric value. For example, the digit '0' has an ASCII value of 48, not 0.


  3. Classifying characters.   In your hw03 folder, compile classifyChar1.cpp and run it. Enter a character, as prompted. Your character may be a letter, digit, or punctuation mark. Also, try entering some non-ASCII characters. (To enter a non-ASCII character, hold down the [Alt] key while typing a number between 128 and 255 on the numeric keypad -- NOT the row of digit keys at the top of the main alphanumeric keypad.)

    The program will tell you the numeric value of the binary code that is used to represent your character in the computer's memory. The program will then categorize the character, stating that it is an ASCII letter, an ASCII digit, an ASCII punctuation mark, a non-ASCII character, etc.

    Look now at the source code. The characters are classified based on the numeric values of their binary codes. First, the program checks whether a character is within the ASCII range.

    Recall that an ASCII code is 7 bits. The values that can fit within 7 bits are 0 to 127. An ASCII code is stored in a byte (8 bits) with a leading bit of 0. Thus, if the first bit in a byte is a 1, rather than a 0, it contains the numeric code of a non-ASCII character.

    When the byte is interpreted as a numeric value rather than as the represented character, a leading 1-bit indicates a negative number (assuming that char is a signed type, as it is with many compilers, though not all). So, we can detect non-ASCII characters, i.e. characters with a leading 1-bit, by testing whether the numeric value is negative, i.e. less than zero:

       if ( entry < 0 )
          cout << "a non-ASCII character." << endl;
    

    The program then classifies those characters that are within the ASCII range. First, it checks to see whether a character is a space:

       else if ( entry == ' ' )
          cout << "an ASCII space." << endl;
    

    The next category is ASCII digits:

       else if ( entry >= '0' && entry <= '9' )
          cout << "an ASCII digit." << endl;
    

    The double ampersand (&&) means "AND." Recall that the ASCII values of the digit characters are consecutive and in numerical order. A character is an ASCII digit if the numeric value of its binary code is greater than or equal to the ASCII value of the digit '0' and less than or equal to the ASCII value of the digit '9'.

    The next category is ASCII letters:

       else if ( (entry >= 'A' && entry <= 'Z')
                || (entry >= 'a' && entry <= 'z') )
          cout << "an ASCII letter." << endl;
    

    A character is a letter if it is either a capital letter or a lower-case letter. A character is a capital letter if its ASCII value is between the ASCII value of 'A' and the ASCII value of 'Z', inclusive. Likewise, a character is a lower-case letter if its ASCII value is between the ASCII value of 'a' and the ASCII value of 'z', inclusive.

    The above lines of code use both the double ampersand (&&), meaning "AND," and the double vertical bar (||), meaning "OR." The && and || are logical (boolean) operators. Just as the arithmetic operators (+, -, *, /, %) perform operations on values of numeric data types such as int and float, so too the logical operators perform operations on expressions of type bool.

    In our program, the next category is ASCII control characters. Recall that control characters are not ordinarily considered "characters" by non-programmers. They are things like the tab, the backspace, the carriage return, and the newline. But they too are of type char, with ASCII values. The ASCII control characters have ASCII values in the range 0 to 31, plus one more control character with ASCII value 127.

       else if ( entry < 32 || entry == 127 )
          cout << "an ASCII control character." << endl;
    

    Finally, the last category, ASCII punctuation marks, consists of all the characters left over after all the other categories have been covered. Verify this by looking at one of the ASCII charts linked to on the course website (under "Links to other relevant sites").

       else
          cout << "an ASCII punctuation mark." << endl;
    


  4. The get function of cin.   As we have seen, in classifyChar1.cpp, the second branch of the nested if/else is supposed to be able to test whether the entered character is a space (' '). Try entering a space character by pressing the space bar and then pressing the Enter key. Likewise, try entering a tab character, one of the control characters, by pressing the Tab key and then pressing Enter. Observe that the program does not respond at all. If you then enter a letter, digit, or punctuation mark, the program finally does respond.

    Now compile classifyChar2.cpp and run it. Try entering spaces and tabs. This program does respond to a space or a tab just like any other character. Also, try pressing the Enter key without typing another character first. The program will then accept, as input, the newline character that was generated by pressing the Enter key itself.

    The two programs classifyChar1.cpp and classifyChar2.cpp are identical except for the way they input the character. In classifyChar1.cpp, we use cin's extraction operator (">>") to input the character:

       cin >> entry;
    

    On the other hand, in classifyChar2.cpp, we use the get function of cin:

       cin.get(entry);
    

    Recall that cin is an object of class istream, and recall that what distinguishes classes from other structured data types is that an object (a variable whose data type is a class) does not only store data, but also has behaviors defined for it. These behaviors can be defined either as operators (symbols such as ">>") or as functions.

    When an object of class istream does input using the extraction operator (">>"), it skips over all whitespace characters:   the space, the tab, the carriage return, and the newline character. (You'll learn more about the carriage return and the newline character later in this tutorial.) On the other hand, the get function can be used to input any character, including a whitespace character.

    Try inputting other control characters using both programs. The first 26 control characters can be generated by pressing [Ctrl]-A, [Ctrl]-B, [Ctrl]-C, etc. ([Ctrl]-A means press the A key while holding down the Ctrl key.)


  5. Character literals vs. string literals and numeric literals.   In various branches of the nested if/else in classifyChar1.cpp and classifyChar2.cpp, observe the use of single quote marks (') to indicate specific characters. For example:

       else if ( (entry >= 'A' && entry <= 'Z')
                || (entry >= 'a' && entry <= 'z') )
          cout << "an ASCII letter." << endl;
    

    Single quote marks (') denote the beginning and end of a character literal, whereas double quote marks (") denote the beginning and end of a string literal. Recall that a string stores a sequence of characters, whereas a value of type char is just a single character.

    As data types, characters and strings are NOT interchangeable, even for single-character strings. For example, the string literal "A" is NOT equivalent to the character literal 'A'. The string literal "A" represents a string which contains the character 'A', whereas the character literal 'A' denotes just the character 'A' itself.

    A one-character string takes up more room in memory than just the character itself. In the case of C-strings, the string "A" actually contains not only the character 'A' but also a null character immediately after the 'A' in memory. All C-strings use a null character (ASCII value zero) to indicate the end of the string, because a C-string does not otherwise know its own length.

    Observe also that there is a distinction between digit characters and the intended numeric values of the digits. In the code segment below, the character literal '0' does NOT mean the same thing as the numeric literal 0, nor does the '9' mean the same thing as the numeric literal 9.

       else if ( entry >= '0' && entry <= '9' )
          cout << "an ASCII digit." << endl;
    

    The numeric literals 0 and 9 are of type int and represent values that are stored in memory as straightforward binary representations of the numbers 0 and 9, respectively. On the other hand, the character literals '0' and '9' represent digit characters that are displayed as 0 and 9. Note that the ASCII codes for the digit characters are NOT equal to the intended numeric values of those characters. Look again at one of the online ASCII charts and observe, for example, that the ASCII value of the digit character '0' is 48, not 0, and that the ASCII value of the digit character '9' is 57, not 9.


  6. Comparison of strings.   Compile stringCompareDemo.cpp and run it. Try entering various pairs of words. This program compares two words and tells you which word precedes the other in a lexicographical ordering.

    A lexicographical ordering is similar to an alphabetical ordering, except that the ordering is based on values of the numeric codes of the characters. A lexicographical ordering is the same as an alphabetical ordering if both words are all-uppercase or both words are all-lowercase. So, for example, "ant" precedes "zebra", and "ANT" precedes "ZEBRA". However, the ASCII values of all the uppercase letters precede the ASCII values of all the lowercase letters. Thus "ZEBRA" precedes "ant". Similarly, all the digits precede all the letters.

    A lexicographical comparison first compares the leftmost characters of the two strings, i.e. the characters at position 0. If those characters are different, the strings are ordered based on the ASCII values of the characters at position 0. If the characters at position 0 are the same, then the characters at position 1 are compared. If the characters at position 1 are different, the strings are ordered based on the ASCII values of the characters at position 1. If the characters at position 0 are alike and the characters at position 1 are also alike, then the characters at position 2 are compared. And so on.

    Let's now consider a lexicographical comparison of strings representing non-negative integers. Does a lexicographical ordering of two such strings correspond to a numerical ordering of the two represented numbers? Yes, IF the two numbers have the same number of digits. Otherwise, not necessarily. For example, the number 432 precedes the number 1234 in a numerical ordering, but the string "1234" precedes the string "432" in a lexicographical ordering. To understand why, do an actual character-by-character lexicographical comparison of the two strings yourself, as described above.

    Comparison of strings for lexicographical ordering can be done using the same operators that are used for comparison of numeric values:

                           Meaning with               Meaning with
         Operator         numeric values                strings
    
            <              is less than                 precedes
            >             is greater than               succeeds
            <=         is less than or equals      precedes or equals
            >=       is greater than or equals     succeeds or equals
            ==                equals                    equals
            !=            does not equal             does not equal
    

    In the C++ language itself, the above comparison operators are defined only for primitive data type values, i.e. the built-in simple types which hold only a single number or character. The comparison operators for strings are defined not as part of the C++ language itself, but as part of the definition of the string class in the C++ library.

    As we have seen, an object (a variable whose data type is a class) not only stores data, but also has behaviors defined for it, in the form of functions and operators, as part of the definition of the object's class. In Assignment 1, we saw that string class objects have a length function defined for them. The length function tells us the number of characters in a string. And, as we have just now seen, string class objects also have operators defined for them, such as the comparison operators.

    As we also saw in Assignment 1, there are two kinds of strings in C++: (1) string class objects and (2) C-strings, which are arrays of characters, i.e. they just store a sequence of characters and do NOT have functions or operators defined for them as part of a data type definition.

    Because the comparison operators for strings are defined as part of the definition of the string class, they can be used only if the string on at least one side of the operator is a string class object. The comparison operators can be used to compare two string class objects, or they can be used to compare a string class object to a C-string, or they can be used to compare a C-string to a string class object, but they CANNOT be used to compare two C-strings. If you try to use a comparison operator to compare two C-strings, the compiler won't necessarily complain, but the results will be unpredictable, because the operator compares the memory addresses of the two arrays rather than the characters stored in them. It may seem to work for some pairs of strings, but will not work reliably.


  7. Runtime debugging.   There are two major kinds of programming errors: (1) syntax errors, which prevent your program from compiling, and (2) runtime errors, in which your successfully-compiled machine code file does something other than what you wanted it to do. We will now discuss some techniques to help you debug runtime errors.

    Look again at one of the Assignment 3 example files, compare3.cpp, which had a runtime error. In the Assignment 3 tutorial, we explained how to find the error using a handwritten code trace.

    Another debugging technique is to insert output statements into your program to output the values of the variables at various different points in the program. Look now at compare3debug.cpp, a version of compare3.cpp with debugging output statements added. Whereas the original program compare3.cpp might have the following output, for user inputs of 5, 5, and 3:

    This program will find the largest of 3 floating-point numbers.
    Enter 3 numbers separated by spaces:>5 5 3
    The largest of 5, 5, and 3 is 3.
    

    compare3debug.cpp has the following output for the same inputs:

    This program will find the largest of 3 floating-point numbers.
    Enter 3 numbers separated by spaces:>5 5 3
    1. number1=5  number2=5  number3=3.
    2. number1=5  number2=5  number3=3  largest=0.
    3. number1=5  number2=5  number3=3  largest=3.
    The largest of 5, 5, and 3 is 3.
    

    Below is a portion of compare3debug.cpp. The debugging output statements are the ones marked with numbered comments:

       cout << "Enter " << NUMBER_OF_NUMBERS
                        << " numbers separated by spaces:>";
       float number1;
       float number2;
       float number3;
       cin >> number1 >> number2 >> number3;
    
       cout << "1. number1=" << number1         // ***** 1 ******
               << "  number2=" << number2
               << "  number3=" << number3 << endl;
    
       // Find the largest of the first two numbers:
       float largest;
       if ( number1 > number2 )
          largest = number1;
       if ( number1 < number2 )
          largest = number2;
    
       cout << "2. number1=" << number1         // ***** 2 ******
               << "  number2=" << number2
               << "  number3=" << number3
               << "  largest=" << largest << endl;
    
       // Find the largest of all three numbers:
       if ( number3 > largest )
          largest = number3;
    
       cout << "3. number1=" << number1         // ***** 3 ******
               << "  number2=" << number2
               << "  number3=" << number3
               << "  largest=" << largest << endl;
    

    The comments with the numbers and asterisks serve two purposes: (1) to make it easy to match, at a glance, the debugging output statements with the lines they generated in the actual output, and (2) to make the debugging output statements stand out so that they can be easily removed later, after the errors have been found and fixed.

    To debug the program, compare the actual outputs of each of the debugging statements with what you think the outputs should be. If the output of a debugging statement differs from what you think it should be, you know that there is an error before that point. If the output of one debugging statement is the same as what you think it should be, but then the next debugging statement has an unexpected output, then you can conclude that your error is probably somewhere in between those two points in the program. The debugging output statements will not tell you WHAT your error is, but they will help you find WHERE your error is, a vital first step in tracking it down.

    When debugging a runtime error, if the error is not immediately obvious to you, then you should focus first on determining WHERE in your program the error is, before you even begin to try to figure out WHAT the error is. Your first goal should be to determine, exactly, which statement in your program contains the error. You'll find the error much faster if, at first, you focus exclusively on determining WHERE the error is, and ignore the question of WHAT the error is until you have determined exactly WHERE it is.

    How do you zero in on one particular faulty statement? When debugging a program longer than compare3.cpp, don't put lots and lots of debugging output statements in the program all at once. Instead, begin by putting in just a few debugging output statements in strategic places. Then look for a pair of consecutive debugging statements such that one of them has the output you expect and the next one doesn't. You now know that the error is somewhere between those two statements. Delete all other output statements besides those two, and then put more debugging statements between those two, to find a smaller portion of the program containing the error. Again, delete the no-longer-relevant output statements and insert more output statements within the program segment enclosed by the currently relevant pair. Keep doing this for smaller and smaller segments of the program until you've isolated the exact statement in which the error occurs.
