Computer Science 111, Assignment 2
Tutorial on characters and strings - brief intro
- Preliminaries.
Some of the examples below, and the homework problem, will require more than the usual amount of disk space. So, please be sure to get rid of all a.out, l.out, and core files before you begin. To find where they are, type the following at the "[username@venus ~]$" prompt, in your home directory:
find . | grep a.out
find . | grep l.out
find . | grep core
Then go to the indicated directories and remove the files.
- Characters and strings.
So far, we've seen the use of variables (memory locations) to store numbers. A computer's memory can also be used to store other kinds of data, such as text.
As mentioned in an earlier tutorial, everything in the computer's memory is stored in the form of sequences of 1's and 0's, also known as bits (binary digits). Text is stored in the form of characters, where each character is represented, in memory, by a binary numeric code. On most computers, English-language characters are represented using a system of numeric encoding known as ASCII (American Standard Code for Information Interchange).
A character can be stored using a variable of type char. For an example using type char, compile demoChar.cpp and run it. This program prompts you to enter a character, and then outputs both the character itself and the numeric value of the ASCII code which represents the character in memory. The numeric value is obtained by casting the char value to type int.
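The cast described above can be sketched as follows. This is not the source of demoChar.cpp, which is not reproduced in this tutorial; the function names asciiCode and showChar are made up for illustration, and the values assume an ASCII machine.

```cpp
#include <iostream>
using namespace std;

// Hypothetical helper (not taken from demoChar.cpp): returns the numeric
// code that represents the character c in memory.
int asciiCode(char c) {
    return static_cast<int>(c);  // cast the char value to type int
}

// Prints a character followed by its numeric code.
void showChar(char c) {
    cout << "The character " << c << " is stored as the number "
         << asciiCode(c) << endl;
}
```

Calling showChar('A') from main would print the letter together with its code. The cast does not change what is stored; it only tells cout to display the value as a number instead of as a character.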
However, we normally don't want to store just one character. Text normally consists of sequences of characters. A sequence of characters is known as a string. We have used plenty of strings before. So far, we have not used string variables, but we have used string literals. For example, in the statement:
cout << "Enter a character:>";
the expression which begins and ends with the quote marks is a string literal.
In C++, there are two different data types that can be used for strings. One of these types is called a C-string, because it was used in the older programming language C. The other type is the C++ string class, a newer data type defined in the C++ library.
As mentioned earlier, data types in C++ can be classified as (1) simple data types, variables of which can store just one single data item, such as a single number or character, and (2) structured data types, variables of which can store more than one number or character. Types int, float, and char are all simple data types. (A variable of type char stores a single character.) Both C-strings and the string class are structured data types, because they can store more than one character.
There are different kinds of structured data types. A record, also known as a struct, can store multiple data items of different types, whereas an array stores multiple data items that are all of the same type. Recall that a string is a sequence of characters. A C-string is an array which stores multiple data items of type char.
The fanciest structured data types are classes, such as the C++ string class. A variable whose data type is a class is known as an object. What makes objects different from records and arrays is that objects can do more than just store data. We can also tell them to do things, by using functions and operators that are defined for the object's class. On the other hand, arrays and records do nothing but store data.
Because objects can be told to do things via operators and functions, they have more capabilities than an array or record has. Thus, objects of the C++ string class have more capabilities than C-strings.
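The three kinds of structured data types described above might be compared with a small sketch like the one below. The Student record and all of the function names are invented for this illustration and do not come from the course programs.

```cpp
#include <cstring>
#include <string>
using namespace std;

// A record (struct) groups data items of different types.
// Student is a made-up example type.
struct Student {
    string name;
    int age;
    float gpa;
};

// An array only stores data, so an outside library function (strlen)
// has to count the characters in a C-string.
int lengthOfCString() {
    char word[11] = "hello";
    return static_cast<int>(strlen(word));
}

// A string object can be asked for its own length.
int lengthOfStringObject() {
    string word = "hello";
    return static_cast<int>(word.length());
}

// Builds a sample record and returns one of its fields.
int sampleAge() {
    Student s;
    s.name = "Ann";
    s.age = 19;
    s.gpa = 3.5f;
    return s.age;
}
```

Note the difference in how the length is obtained: strlen(word) passes the array to a separate function, whereas word.length() tells the object itself to do the work.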
A variable whose data type is either of the two string types (C-string or C++ string class) refers to an area in the computer's memory where a sequence of characters is stored. The sequence of characters is said to be a "value" for the string variable, although it is not a number. (Actually, the characters in the string ARE stored as numbers. As explained earlier, absolutely all values in the computer, including both numeric values and characters, are stored in the computer's memory as binary numbers.)
For examples of the use of C-strings (arrays of characters) and string class objects, compile and run the programs demoCString.cpp and demoString.cpp. Warning: Compiling demoString.cpp will result in a huge executable file. Make sure you have enough room for it by deleting any previous a.out. Otherwise, attempting to compile demoString.cpp may result in a core dump. If you get a core dump, use the ls command to verify the presence of a file named core, and then remove that file by typing:
rm core
You may also need to remove a file l.out, by typing:
rm l.out
The first of the above two programs, demoCString.cpp, uses a C-string, declared as an array of characters:
char word[11]; // to be input from user
Here, the array is declared to hold up to 11 data items of type char. One of those items must be the special character that marks the end of the string, so the array can store a word which is up to 10 characters long, not counting the special end-of-string character.
Let's see what happens if you enter a word which is more than 10 characters long. The result is unpredictable. The program might appear to work fine, or you might get a core dump.
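One common way to avoid the unpredictable behavior is to tell the input stream how much room the array has. The sketch below is not part of demoCString.cpp; it assumes the setw manipulator from <iomanip>, and the function names are hypothetical. The second function feeds text through an istringstream so the behavior can be tried without typing at the keyboard.

```cpp
#include <iomanip>
#include <iostream>
#include <sstream>
#include <string>
using namespace std;

// Hypothetical helper: reads one word into an 11-element C-string.
// setw(11) makes >> store at most 10 characters plus the
// end-of-string character, so the array cannot overflow.
string readShortWord(istream& in) {
    char word[11] = "";
    in >> setw(11) >> word;
    return string(word);
}

// Simulates the user typing the given text.
string readShortWordFromText(const string& typed) {
    istringstream in(typed);
    return readShortWord(in);
}
```

With this approach, a word that is too long is cut off after its tenth character instead of overwriting memory beyond the array. The characters that did not fit stay in the input stream and would be picked up by the next input operation.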
Unlike a C-string, an object of the C++ string class is smart enough to adjust its length. Compile and run demoString.cpp, and observe that it does NOT require any maximum length. Furthermore, an object of the C++ string class knows its own length. We can find out the number of characters in a string class object by calling its length function, as follows:
cout << "Your word is " << word.length() << " characters long." << endl;
where word is a variable of type string.
With both C-strings and string class objects, you can access individual characters in the string. For example, both our example programs print out the first three characters in word, as follows:
cout << "The first three characters in your word are: " << word[0] << word[1] << word[2] << endl;
The characters in a string are in numbered positions, where the first position is numbered 0, the second is numbered 1, and so on, up to the last character, which is at a position whose number is one less than the length of the string.
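The position numbering can be tried out with a sketch like the following, which uses string class objects. The helper names are hypothetical. Unlike the statement in the example programs, firstThree checks the length first, so it is safe to use on a word with fewer than three characters.

```cpp
#include <string>
using namespace std;

// Hypothetical helper: returns the first three characters of word,
// or the whole word if it has fewer than three characters.
string firstThree(const string& word) {
    string result;
    for (string::size_type i = 0; i < word.length() && i < 3; i++) {
        result += word[i];  // positions are numbered starting at 0
    }
    return result;
}

// Returns the last character, which is at position length() - 1.
// Returns the null character if the word is empty.
char lastChar(const string& word) {
    if (word.length() == 0) {
        return '\0';
    }
    return word[word.length() - 1];
}
```

For the word "hello", whose length is 5, the last character is at position 4.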
In order to use the C++ string class, you must include the header file in which class string is declared:
#include <string>
When you compile a program which includes the header file <string>, the preprocessor inserts the contents of <string> at the top of your program before it is compiled. Then, after your program is compiled, the g++ linker finds the library files which define the functions, etc. which were declared in the header files, and puts them all together into one big program. In the case of <string>, the corresponding library file containing the definitions is huge, resulting in a huge executable file.
Besides string class objects, other objects you've seen so far include cin and cout, which are variables of types istream and ostream. We haven't had to declare cin and cout because they are already declared for us in the header file whose contents the preprocessor will insert at the top of your program, in response to this directive:
#include <iostream>
Classes istream and ostream define input and output behaviors, associating them with the operators >> and <<, respectively. Thus we can tell cin and cout to input and output particular items using those operators.
- Escape sequences.
We have seen that quote marks denote the beginning and end of a string literal. For example, the following output:
Hello, world!
can be generated by the following program statement:
cout << "Hello, world!" << endl;
Here, "Hello, world!" is a string literal whose beginning and end are denoted by quote marks which, as we can see by looking at the output, are NOT part of the string itself.
So then, how could we write a program statement to output the following string?
Alice says "hello" to Bob.
We CANNOT write it as follows:
cout << "Alice says "hello" to Bob." << endl;
If we do, the compiler will complain. On venus, g++ would give us an error message like this:
program.cpp: In function `int main()':
program.cpp:12: parse error before string constant
The problem is that the compiler can't tell the difference between quote marks which are intended to be part of the string itself, i.e. the quote marks surrounding "hello", and the quote marks which denote the beginning and end of the entire string and are thus NOT part of the string itself. The compiler sees the following two separate string literals, which it recognizes as string constants (strings that won't be changed):
"Alice says "
and
" to Bob."
The compiler sees the word hello NOT as a string literal or as part of a string literal, but rather as a mysterious extra piece of source code between the above two string literals. The compiler then gives up trying to figure out what hello is supposed to be, and simply complains "parse error," which is the compiler's most general way of saying that it doesn't understand what you're trying to do.
So, if we want to put a quote mark inside a string literal, we must somehow distinguish between a quote mark which is intended to be part of the string literal and the quote marks which indicate the beginning and end of the string literal. To make this distinction, we can put a backslash ("\") in front of a quote mark to indicate that it is intended to be part of our string literal:
cout << "Alice says \"hello\" to Bob." << endl;
Besides string literals, there are also character literals, which indicate a specific individual character. For example, in the following statement, 'A' is a character literal:
char x = 'A';
Note that character literals are enclosed by single quote marks (apostrophes), whereas string literals are enclosed by double quote marks. What if we want a character literal representing the apostrophe itself? For that, we use an escape sequence consisting of a backslash followed by the apostrophe:
char y = '\'';
In general, when a backslash appears inside a string literal or character literal, the compiler interprets it as meaning that the character or number immediately following the backslash is to be interpreted in some special way. The compiler does not interpret the backslash itself as part of the string represented by a string literal or as the character represented by a character literal. A backslash together with the character or number following it is called an escape sequence.
For example, as we have seen, the escape sequence '\"' represents a literal quote mark, indicating that the quote mark is to be interpreted as a quote mark, not as the beginning or end of a string literal.
Another escape sequence, '\\', represents the backslash character itself. It is what we use when we want to put a backslash character in a string literal, rather than just using the backslash as the beginning of some other escape sequence. The double backslash translates to a single backslash.
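A short sketch can confirm that each escape sequence counts as a single character in the resulting string. The function names and the sample text "C:\\temp" are invented for this illustration.

```cpp
#include <string>
using namespace std;

// Each \" in the literal becomes one quote-mark character in the string.
string quotedSentence() {
    return "Alice says \"hello\" to Bob.";
}

// Each \\ in the literal becomes one backslash character in the string.
string sampleWithBackslash() {
    return "C:\\temp";
}

// A character literal holding the apostrophe itself.
char apostrophe() {
    return '\'';
}
```

The literal in sampleWithBackslash is written with 8 characters between the quote marks, but the string it represents contains only 7, because the double backslash translates to a single backslash.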
The ASCII characters include not only letters, digits, and punctuation marks, but also some things we don't ordinarily think of as characters: the control characters, such as the backspace, the tab, the carriage return, and the escape (ESC) character. The control characters are also known as nonprintable characters. Although you're probably not accustomed to thinking of the control characters as characters, the computer stores them the same way it stores printable characters, and thus they are values of type char.
There are escape sequences representing various control characters. For example, '\t' represents a tab, and '\b' represents a backspace.
Another escape sequence, '\7', represents the control character whose ASCII value is 7. It is supposed to make the computer beep. When I tried it, it did not work via remote login to venus. However, it would probably work if you were running the program on a local machine.
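Since control characters are ordinary values of type char, their numeric codes can be inspected just like those of printable characters. The sketch below uses made-up function names, assumes an ASCII machine, and relies on the standard library function iscntrl from <cctype>, which this tutorial has not otherwise covered.

```cpp
#include <cctype>
using namespace std;

// Returns the numeric code that represents the character c.
int codeOf(char c) {
    return static_cast<int>(c);
}

// Reports whether c is a control (nonprintable) character.
bool isControlChar(char c) {
    return iscntrl(static_cast<unsigned char>(c)) != 0;
}
```

For example, codeOf('\t') gives the code for a tab and codeOf('\7') gives 7, the code of the character that is supposed to make the computer beep.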
Compile escapeSequenceDemo.cpp and run it. It displays a table of some of the more commonly-used escape sequences and their meanings, followed by examples of strings using some of them.
Escape sequence    meaning
\"                 "
\'                 '
\\                 \
\n                 newline
\r                 carriage return
\t                 tab
\b                 backspace
\7                 beep
\0                 null (end of C-string)

Some examples:
Alice says "hello" to Bob.
1.
2.
3.
4.
beep!

Most of the listed escape sequences represent ASCII control characters.
Look now at the source code of escapeSequenceDemo.cpp and compare all the string literals in the source code with the program's actual output.
- End-of-line markers.
The escape sequence '\n' represents a newline character (ASCII value 10). On a Unix system, the newline character ends a line, so that the next character will be printed at the beginning of the next line. Putting the newline escape sequence in a string literal ends the line in the same way as using the endl manipulator outside a string literal, which is how we've ended lines in all our program examples so far. (The one difference is that endl also flushes the output stream, forcing any buffered output to appear immediately.)
However, the newline character, by itself, is NOT how lines are ended on a Windows/DOS system. On a Windows/DOS system, the end-of-line marker consists of two control characters, not just one: a carriage return (ASCII value 13) followed by a newline (ASCII value 10). You do not need to write "\r\n" yourself, though. When a program writes to a stream in text mode (the default, and the mode used by cout), the C++ library on a Windows/DOS system converts each newline character into the two-character marker. This conversion applies equally to '\n' and to endl, since endl simply inserts a newline character and then flushes the stream. Therefore, if you want to write a program that can run on both a Unix system and a Windows/DOS system with as few changes to your program as possible, use '\n' or endl, and do not hard-code "\r\n".
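The two end-of-line markers can be compared character by character with a sketch like this one. The function names are hypothetical, and the strings are built in memory, so no conversion by the operating system is involved.

```cpp
#include <string>
using namespace std;

// Appends the Unix end-of-line marker: a single newline character.
string unixLine(const string& text) {
    return text + "\n";
}

// Appends the Windows/DOS end-of-line marker: carriage return, then newline.
string dosLine(const string& text) {
    return text + "\r\n";
}

// Returns the numeric code of the character at position pos in s.
int codeAt(const string& s, int pos) {
    return static_cast<int>(s[pos]);
}
```

For the text "abc", the Unix version is 4 characters long and the Windows/DOS version is 5, with the extra character being the carriage return at position 3.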
To end a line of output on a Unix system such as venus, you should use either '\n' or endl. You should NOT simply press the Enter key inside a string literal. In a program, a string literal should NOT span more than one line, as in the following:
cout << "1.
2.
3.
4." << endl;
Instead, you should use either newlines or endl, preferably endl. The code segments below are all acceptable to the compiler and are all equivalent in what they output:
cout << "1.\n2.\n3.\n4." << endl;
or
cout << "1.\n2.\n3.\n4.\n";
or
cout << "1." << endl << "2." << endl << "3." << endl << "4." << endl;
or
cout << "1." << endl;
cout << "2." << endl;
cout << "3." << endl;
cout << "4." << endl;
The above four code segments all have the following output:
1.
2.
3.
4.
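One way to check that the code segments really are equivalent is to send the output to an ostringstream instead of cout and compare the resulting strings. This is only a sketch: the function names are made up, and ostringstream (from <sstream>) has not been covered in this tutorial. It accepts the << operator in the same way cout does, but collects the characters in memory.

```cpp
#include <sstream>
#include <string>
using namespace std;

// Newlines inside the literal, endl at the end.
string versionA() {
    ostringstream out;
    out << "1.\n2.\n3.\n4." << endl;
    return out.str();
}

// Newlines only, no endl.
string versionB() {
    ostringstream out;
    out << "1.\n2.\n3.\n4.\n";
    return out.str();
}

// endl after every item.
string versionC() {
    ostringstream out;
    out << "1." << endl << "2." << endl << "3." << endl << "4." << endl;
    return out.str();
}
```

All three functions return the same sequence of characters, which shows that endl inserts a newline character just as '\n' does.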