Strings, Characters and Regular Expressions

  • In this chapter you will learn:

    • To create and manipulate immutable character string objects of class string.
    • To create and manipulate mutable character string objects of class StringBuilder.
    • To manipulate character objects of struct Char.
    • To use regular expressions in conjunction with classes Regex and Match.
  • Characters are the fundamental building blocks of C# source code. Every program is composed of characters that, when grouped together meaningfully, create a sequence that the compiler interprets as instructions describing how to accomplish a task. In addition to normal characters, a program can also contain character constants. A character constant is a character that is represented as an integer value, called a character code. For example, the integer value 122 corresponds to the character constant 'z', and the integer value 10 corresponds to the newline character '\n'. Character constants are established according to the Unicode character set, an international character set that contains many more symbols and letters than the ASCII (American Standard Code for Information Interchange) character set.
  • On occasion, a string will contain multiple backslash characters (this often occurs in the name of a file). To avoid excessive backslash characters, it is possible to exclude escape sequences and interpret all the characters in a string literally, using the @ character. Backslashes within the double quotation marks following the @ character are not considered escape sequences, but rather regular backslash characters. Often this simplifies programming and makes the code easier to read
  • This approach also has the advantage of allowing strings to span multiple lines by preserving all newlines, spaces and tabs.
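
A minimal sketch of the @ (verbatim) syntax described above. The file path and the class name VerbatimStringDemo are made up for illustration; only the literal syntax itself comes from the text.

```csharp
using System;

public static class VerbatimStringDemo
{
    // Regular literal: every backslash has to be escaped.
    public const string EscapedPath = "C:\\MyFolder\\MySubFolder\\MyFile.txt";

    // Verbatim literal: backslashes are taken literally (hypothetical path).
    public const string VerbatimPath = @"C:\MyFolder\MySubFolder\MyFile.txt";

    // A verbatim literal may span lines; the line break is part of the string.
    public const string TwoLines = @"first line
second line";

    public static void Main()
    {
        Console.WriteLine(EscapedPath == VerbatimPath); // True
        Console.WriteLine(TwoLines);
    }
}
```

Both literals produce the same string; the verbatim form is simply easier to read when many backslashes are involved.
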
  • Class string provides eight constructors for initializing strings in various ways
  • In most cases, it is not necessary to make a copy of an existing string. All strings are immutable: their character contents cannot be changed after they are created. Also, if there are one or more references to a string (or any object for that matter), the object cannot be reclaimed by the garbage collector
  • Attempting to access a character that is outside a string’s bounds (i.e., an index less than 0 or an index greater than or equal to the string’s length) results in an IndexOutOfRangeException
  • Computers can order characters alphabetically because the characters are represented internally as Unicode numeric codes. An ordinal comparison of two strings (such as the one performed by method Equals or the == operator) simply compares the numeric codes of the characters in the strings. Methods such as CompareTo apply culture-sensitive rules by default
  • Method Equals uses a lexicographical comparison: the integer Unicode values that represent each character in each string are compared. A comparison of the string "hello" with the string "HELLO" would return false, because the numeric representations of lowercase letters are different from the numeric representations of corresponding uppercase letters
  • Method StartsWith determines whether a string instance starts with the string text passed to it as an argument. Method EndsWith determines whether a string instance ends with the string text passed to it as an argument
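
The comparison methods above can be sketched as follows. The sample strings and the class name are illustrative assumptions; passing an explicit StringComparison value is a design choice made here so that the result does not depend on the current culture.

```csharp
using System;

public static class StringComparisonDemo
{
    public static void Main()
    {
        string lower = "hello";
        string upper = "HELLO";

        // Equals and == compare character codes, so case matters.
        Console.WriteLine(lower.Equals(upper)); // False
        Console.WriteLine(lower == "hello");    // True

        // An explicit StringComparison value states the intent.
        Console.WriteLine(lower.Equals(upper, StringComparison.OrdinalIgnoreCase)); // True

        // CompareOrdinal returns a negative, zero or positive number.
        Console.WriteLine(string.CompareOrdinal("apple", "banana") < 0); // True

        Console.WriteLine("started".StartsWith("st", StringComparison.Ordinal)); // True
        Console.WriteLine("started".EndsWith("ed", StringComparison.Ordinal));   // True
    }
}
```
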
  • There are three versions of LastIndexOf that search for a character. The first version takes one argument: the character for which to search. The second version takes two arguments: the character for which to search and the highest index from which to begin searching backward for the character. The third version takes three arguments: the character for which to search, the index from which to start searching backward and the number of characters (the portion of the string) to search
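
A short sketch of the three LastIndexOf versions. The sample text (the letters a to m written twice) and the class name are assumptions chosen so that the indices are easy to check by hand.

```csharp
using System;

public static class LastIndexOfDemo
{
    // Hypothetical sample text: "a".."m" at indices 0..12 and again at 13..25.
    public const string Letters = "abcdefghijklmabcdefghijklm";

    public static void Main()
    {
        // One argument: search the whole string, from the end backward.
        Console.WriteLine(Letters.LastIndexOf('c'));         // 15

        // Two arguments: start at index 1 and search backward.
        Console.WriteLine(Letters.LastIndexOf('a', 1));      // 0

        // Three arguments: start at index 20 and examine 10 characters
        // (indices 20 down to 11).
        Console.WriteLine(Letters.LastIndexOf('c', 20, 10)); // 15

        // 'c' is not among indices 25 down to 21, so the result is -1.
        Console.WriteLine(Letters.LastIndexOf('c', 25, 5));  // -1
    }
}
```
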
  • Class string provides two Substring methods, which are used to create a new string by copying part of an existing string. Each method returns a new string
  • The first version of method Substring takes one int argument. The argument specifies the starting index from which the method copies characters in the original string. The substring returned contains a copy of the characters from the starting index to the end of the string. If the index specified in the argument is outside the bounds of the string, the program throws an ArgumentOutOfRangeException
  • The second version of method Substring takes two int arguments. The first argument specifies the starting index from which the method copies characters from the original string. The second argument specifies the length of the substring to be copied. The substring returned contains a copy of the specified characters from the original string
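
Both Substring versions, sketched on a made-up sample string. TrySubstring is a hypothetical helper, not part of class string; it only exists here to show the exception without ending the program.

```csharp
using System;

public static class SubstringDemo
{
    // Hypothetical helper: returns null instead of throwing
    // when the starting index lies outside the string.
    public static string TrySubstring(string text, int start)
    {
        try
        {
            return text.Substring(start);
        }
        catch (ArgumentOutOfRangeException)
        {
            return null;
        }
    }

    public static void Main()
    {
        string text = "Hello, world";

        Console.WriteLine(text.Substring(7));    // world
        Console.WriteLine(text.Substring(0, 5)); // Hello

        // Index 50 is outside the 12-character string.
        Console.WriteLine(TrySubstring(text, 50) == null); // True
    }
}
```
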
  • String concatenation: the static method Concat (like the + operator) joins two strings and returns a new string containing the characters of both. The original strings are not modified
  • Class string provides several methods that return modified copies of strings, such as Replace, ToLower, ToUpper and Trim
  • Replace is case sensitive: it distinguishes between uppercase and lowercase letters
  • Method Replace takes two arguments: a string for which to search and another string with which to replace all matching occurrences of the first argument. The original string remains unchanged. If there are no occurrences of the first argument in the string, the method returns the original string
  • Method Trim removes all whitespace characters that appear at the beginning and end of a string. Without otherwise altering the original string, the method returns a new string that contains the string, but omits leading and trailing whitespace characters. Another version of method Trim takes a character array and returns a string from which any of those characters have been removed at the beginning and end
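
The copy-returning methods above can be tried out with a sketch like this one. The sample strings are assumptions picked so the results do not depend on the current culture.

```csharp
using System;

public static class ModifiedCopiesDemo
{
    public static void Main()
    {
        string word = "cheers!";

        // Concat and + both build a new string.
        Console.WriteLine(string.Concat("Happy ", "Birthday")); // Happy Birthday

        // Replace is case sensitive and leaves the original untouched.
        Console.WriteLine(word.Replace("e", "E")); // chEErs!
        Console.WriteLine(word.Replace("E", "x")); // cheers!
        Console.WriteLine(word);                   // cheers!

        Console.WriteLine(word.ToUpper());         // CHEERS!
        Console.WriteLine("CHEERS!".ToLower());    // cheers!

        // Trim only touches the two ends of the string.
        Console.WriteLine("[" + "  spaces  ".Trim() + "]"); // [spaces]
        Console.WriteLine("xxhixhixx".Trim('x'));           // hixhi
    }
}
```
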
  • The string class provides many capabilities for processing strings. However, a string's contents can never change. Operations that seem to concatenate strings are in fact assigning string references to newly created strings (e.g., the += operator creates a new string and assigns the initial string reference to the newly created string).
  • Class StringBuilder (namespace System.Text) is used to create and manipulate dynamic string information, i.e., mutable strings. Every StringBuilder can store a certain number of characters that is specified by its capacity. Exceeding the capacity of a StringBuilder causes the capacity to expand to accommodate the additional characters. Members of class StringBuilder, such as methods Append and AppendFormat, can be used for concatenation like the operators + and += for class string
  • Objects of class string are immutable (i.e., constant strings), whereas objects of class StringBuilder are mutable. C# can perform certain optimizations involving strings (such as the sharing of one string among multiple references), because it knows these objects will not change
  • The no-parameter StringBuilder constructor creates a StringBuilder that contains no characters and has a default initial capacity of 16 characters
  • Class StringBuilder provides the Length and Capacity properties to return the number of characters currently in a StringBuilder and the number of characters that a StringBuilder can store without allocating more memory, respectively. These properties also can increase or decrease the length or the capacity of the StringBuilder
  • Method EnsureCapacity allows you to reduce the number of times that a StringBuilder's capacity must be increased. It guarantees that the capacity is at least the requested value. In the implementation described here, the method doubles the StringBuilder instance's current capacity. If this doubled value is greater than the value that the programmer wishes to ensure, the doubled value becomes the new capacity. Otherwise, EnsureCapacity alters the capacity to make it equal to the requested number. The exact growth policy may differ between framework versions
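
A small sketch of Length, Capacity and EnsureCapacity. The sample sentence is an assumption, and the code deliberately checks only that the capacity is at least the requested value, since the exact number chosen by the runtime may vary.

```csharp
using System;
using System.Text;

public static class CapacityDemo
{
    public static void Main()
    {
        var empty = new StringBuilder();
        Console.WriteLine(empty.Length);   // 0
        Console.WriteLine(empty.Capacity); // the default initial capacity

        var sb = new StringBuilder("Hello, how are you?");
        Console.WriteLine(sb.Length);      // 19

        // Ask for room for at least 75 characters.
        sb.EnsureCapacity(75);
        Console.WriteLine(sb.Capacity >= 75); // True

        // Shrinking Length truncates the contents.
        sb.Length = 5;
        Console.WriteLine(sb.ToString());  // Hello
    }
}
```
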
  • Assigning null to a string reference can lead to logic errors if you attempt to compare null to an empty string. The keyword null is a value that represents a null reference (i.e., a reference that does not refer to an object), not an empty string (which is a string object that is of length 0 and contains no characters)
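
The difference between null and an empty string, as a minimal sketch. The variable names are illustrative; string.IsNullOrEmpty is one common way to treat both cases alike.

```csharp
using System;

public static class NullVersusEmptyDemo
{
    public static void Main()
    {
        string missing = null; // refers to no object at all
        string empty = "";     // a real string object of length 0

        Console.WriteLine(missing == empty);              // False
        Console.WriteLine(string.IsNullOrEmpty(missing)); // True
        Console.WriteLine(string.IsNullOrEmpty(empty));   // True
        Console.WriteLine(empty.Length);                  // 0

        // missing.Length would throw a NullReferenceException,
        // because there is no object to ask for its length.
    }
}
```
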
  • Class StringBuilder provides 19 overloaded Append methods that allow various types of values to be added to the end of a StringBuilder. The FCL provides versions for each of the simple types and for character arrays, strings and objects. (Remember that method ToString produces a string representation of any object.) Each of the methods takes an argument, converts it to a string and appends it to the StringBuilder
  • Class StringBuilder also provides method AppendFormat, which converts a string to a specified format, then appends it to the StringBuilder
  • The information enclosed in braces specifies how to format a specific piece of data. Formats have the form {X[,Y][:FormatString]}, where X is the number of the argument to be formatted, counting from zero. Y is an optional argument, which can be positive or negative, indicating how many characters should be in the result. If the resulting string is shorter than Y characters, it is padded with spaces to make up the difference. A positive integer aligns the string to the right; a negative integer aligns it to the left. The optional FormatString applies a particular format to the argument: currency, decimal or scientific, among others. For example, "{0}" means the first argument will be printed out, and "{1:C}" specifies that the second argument will be formatted as a currency value
  • The format "{0:d3}" specifies that the first argument will be formatted as a three-digit decimal, meaning any number that has fewer than three digits will have leading zeros placed in front to make up the difference
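
A sketch of Append and AppendFormat with the format items described above. The values and the class name are assumptions. The invariant culture is passed explicitly so the padded and zero-filled results are predictable; the currency line uses the current culture, so its output is intentionally not shown.

```csharp
using System;
using System.Globalization;
using System.Text;

public static class AppendFormatDemo
{
    public static string Build()
    {
        var sb = new StringBuilder();
        CultureInfo inv = CultureInfo.InvariantCulture;

        // Append converts each argument to a string and adds it at the end.
        sb.Append("Count: ").Append(7).Append(' ').Append(true).Append('|');

        sb.AppendFormat(inv, "[{0,5}]", 42);  // right-aligned in 5 characters
        sb.AppendFormat(inv, "[{0,-5}]", 42); // left-aligned in 5 characters
        sb.AppendFormat(inv, "[{0:d3}]", 5);  // padded with leading zeros
        return sb.ToString();
    }

    public static void Main()
    {
        Console.WriteLine(Build()); // Count: 7 True|[   42][42   ][005]

        // "{1:C}" formats the second argument as currency;
        // the symbol and separators depend on the current culture.
        Console.WriteLine(string.Format("{0} costs {1:C}", "Widget", 4.5m));
    }
}
```
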
  • Class StringBuilder provides 18 overloaded Insert methods to allow various types of data to be inserted at any position in a StringBuilder. The class provides versions for each of the simple types and for character arrays, strings and objects. Each method takes its second argument, converts it to a string and inserts the string into the StringBuilder in front of the character in the position specified by the first argument. The index specified by the first argument must be greater than or equal to 0 and no greater than the length of the StringBuilder (an index equal to the length inserts at the end); otherwise, the program throws an ArgumentOutOfRangeException
  • Class StringBuilder also provides method Remove for deleting any portion of a StringBuilder. Method Remove takes two arguments: the index at which to begin deletion and the number of characters to delete. The sum of the starting index and the number of characters to be deleted must not exceed the length of the StringBuilder; otherwise, the program throws an ArgumentOutOfRangeException
  • Another useful method included with StringBuilder is Replace. Replace searches for a specified string or character and substitutes another string or character in its place
  • One overload of Replace takes four parameters, the first two of which are characters and the second two of which are ints. The method replaces all instances of the first character with the second character, beginning at the index specified by the first int and continuing for a count specified by the second int.
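
Insert, Remove and Replace on a StringBuilder, sketched step by step. The starting text and the helper names Edit and RemovePastEndThrows are assumptions made for illustration.

```csharp
using System;
using System.Text;

public static class StringBuilderEditDemo
{
    public static string Edit()
    {
        var sb = new StringBuilder("good morning");

        sb.Insert(0, "Very ");            // Very good morning
        sb.Insert(sb.Length, "!");        // index == Length adds at the end
        sb.Remove(0, 5);                  // good morning!
        sb.Replace("morning", "evening"); // good evening!

        // Four-parameter overload: only the first 4 characters are examined.
        sb.Replace('o', '0', 0, 4);       // g00d evening!
        return sb.ToString();
    }

    // Removing past the end of the contents throws.
    public static bool RemovePastEndThrows()
    {
        var sb = new StringBuilder("abc");
        try
        {
            sb.Remove(1, 3); // 1 + 3 is greater than the length, 3
            return false;
        }
        catch (ArgumentOutOfRangeException)
        {
            return true;
        }
    }

    public static void Main()
    {
        Console.WriteLine(Edit());                // g00d evening!
        Console.WriteLine(RemovePastEndThrows()); // True
    }
}
```
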
  • C# provides a type called a struct (short for structure) that is similar to a class. Although structs and classes are comparable in many ways, structs represent value types. Like classes, structs can have methods and properties, and can use the access modifiers public and private. Also, struct members are accessed via the member access operator (.)
  • The simple types are actually aliases for struct types. For instance, an int is defined by struct System.Int32, a long by System.Int64 and so on. All struct types derive from class ValueType, which in turn derives from object. Also, all struct types are implicitly sealed, so they do not support virtual or abstract methods, and their members cannot be declared protected or protected internal
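
The alias relationship and a few static methods of struct Char can be checked with a sketch like the following. The particular characters tested are arbitrary choices.

```csharp
using System;

public static class CharStructDemo
{
    public static void Main()
    {
        // Simple types are aliases for struct types.
        Console.WriteLine(typeof(int) == typeof(Int32));              // True
        Console.WriteLine(typeof(long) == typeof(Int64));             // True
        Console.WriteLine(typeof(int).BaseType == typeof(ValueType)); // True
        Console.WriteLine(typeof(int).IsSealed);                      // True

        // Static methods of struct Char classify and convert characters.
        Console.WriteLine(char.IsDigit('7'));  // True
        Console.WriteLine(char.IsLetter('7')); // False
        Console.WriteLine(char.ToUpper('a'));  // A

        // A character and its character code are interchangeable via casts.
        Console.WriteLine((int)'z');         // 122
        Console.WriteLine((char)10 == '\n'); // True
    }
}
```
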
  • Regular Expressions and Class Regex:

    • Regular expressions are specially formatted strings used to find patterns in text. They can be useful during information validation, to ensure that data is in a particular format. For example, a ZIP code must consist of five digits, and a last name must start with a capital letter. Compilers use regular expressions to validate the syntax of programs. If the program code does not match the regular expression, the compiler indicates that there is a syntax error
    • The .NET Framework provides several classes to help developers recognize and manipulate regular expressions. Class Regex (of the System.Text.RegularExpressions namespace) represents an immutable regular expression. Regex method Match returns an object of class Match that represents a single regular expression match. Regex also provides method Matches, which finds all matches of a regular expression in an arbitrary string and returns a MatchCollection object containing all the matches. A collection is a data structure similar to an array, and it can be used with a foreach statement to iterate through its elements.
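
A minimal sketch of Match and Matches. The pattern, the sample sentence and the helper names FirstNumber and AllNumbers are assumptions for illustration only.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class RegexMatchDemo
{
    // One or more digits in a row.
    private static readonly Regex Digits = new Regex(@"\d+");

    // Match returns a single Match object; Success tells whether it matched.
    public static string FirstNumber(string text)
    {
        Match match = Digits.Match(text);
        return match.Success ? match.Value : null;
    }

    // Matches returns a MatchCollection that foreach can iterate over.
    public static List<string> AllNumbers(string text)
    {
        var found = new List<string>();
        foreach (Match match in Digits.Matches(text))
        {
            found.Add(match.Value);
        }
        return found;
    }

    public static void Main()
    {
        string text = "Order 66 shipped 3 items on day 12";
        Console.WriteLine(FirstNumber(text));                   // 66
        Console.WriteLine(string.Join(", ", AllNumbers(text))); // 66, 3, 12
    }
}
```
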
  • Regular Expression Character Classes:

    • The following table specifies some character classes that can be used with regular expressions. Please do not confuse a character class with a C# class declaration. A character class is simply an escape sequence that represents a group of characters that might appear in a string

Character Class    Matches
\d                 Any digit
\w                 Any word character
\s                 Any whitespace
\D                 Any non-digit
\W                 Any non-word character
\S                 Any non-whitespace
.                  Any single character except a newline

  • A word character is any alphanumeric character or underscore. A whitespace character is a space, a tab, a carriage return, a newline or a form feed. A digit is any numeric character. Regular expressions are not limited to the character classes in the previous table. As you will see in our first example, "RegexMatches", regular expressions can use other notations to search for complex patterns in strings
  • See Article Project: RegexMatches Project
  • Talking about the project:

    • We precede the string with @. Recall that backslashes within the double quotation marks following the @ character are regular backslash characters, not the beginning of escape sequences. To define the regular expression without prefixing @ to the string, you would need to escape every backslash character, as in "J.*\\d[0-35-9]-\\d\\d-\\d\\d", which makes the regular expression more difficult to read.
    • The first character in the regular expression, "J", is a literal character. Any string matching this regular expression is required to start with "J". In a regular expression, the dot character "." matches any single character except a newline character. When the dot character is followed by an asterisk, as in ".*", the regular expression matches any number of unspecified characters except newlines. In general, when the operator "*" is applied to a pattern, the pattern will match zero or more occurrences. By contrast, applying the operator "+" to a pattern causes the pattern to match one or more occurrences. For example, both "A*" and "A+" will match "A", but only "A*" will match an empty string
    • "\d" matches any numeric digit. To specify sets of characters other than those that belong to a predefined character class, characters can be listed in square brackets, []. For example, the pattern "[aeiou]" matches any vowel. Ranges of characters are represented by placing a dash (-) between two characters. In the example, "[0-35-9]" matches only digits in the ranges specified by the pattern, i.e., any digit between 0 and 3 or between 5 and 9; therefore, it matches any digit except 4. You can also specify that a pattern should match anything other than the characters in the brackets. To do so, place ^ as the first character in the brackets. It is important to note that "[^4]" is not the same as "[0-35-9]": "[^4]" matches any single character other than 4, including non-digits.
    • Although the "-" character indicates a range when it is enclosed in square brackets, instances of the "-" character outside square brackets are treated as literal characters. Thus, the regular expression in line 12 searches for a string that starts with the letter "J", followed by any number of characters, followed by a two-digit number (of which the second digit cannot be 4), followed by a dash, another two-digit number, a dash and another two-digit number
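
The pattern discussed above can be exercised with a sketch like this. The pattern itself comes from the text; the sample data, the class name and the helper FindMatches are hypothetical and are not taken from the RegexMatches project.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class BirthdayRegexDemo
{
    // The verbatim form of the pattern from the text.
    public const string Pattern = @"J.*\d[0-35-9]-\d\d-\d\d";

    public static List<string> FindMatches(string text)
    {
        var found = new List<string>();
        foreach (Match match in Regex.Matches(text, Pattern))
        {
            found.Add(match.Value);
        }
        return found;
    }

    public static void Main()
    {
        // Hypothetical sample data, one entry per line.
        string text = "Jane's Birthday is 05-12-75\n" +
                      "Dave's Birthday is 11-04-68\n" +
                      "John's Birthday is 04-28-73\n" +
                      "Joe's Birthday is 12-17-77";

        foreach (string match in FindMatches(text))
        {
            Console.WriteLine(match);
        }
        // Jane's Birthday is 05-12-75
        // Joe's Birthday is 12-17-77

        // The escaped, non-verbatim literal is the same string.
        Console.WriteLine(Pattern == "J.*\\d[0-35-9]-\\d\\d-\\d\\d"); // True
    }
}
```

Dave's line is skipped because it contains no "J", and John's line is skipped because the second digit of its first number is 4.
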
  • Quantifiers:

    • The asterisk (*) is more formally called a quantifier. The following table lists various quantifiers that you can place after a pattern in a regular expression and the purpose of each quantifier
    • All of the quantifiers are greedy: they will match as many occurrences of the pattern as possible until the pattern fails to make a match. If a quantifier is followed by a question mark (?), the quantifier becomes lazy and will match as few occurrences as possible as long as there is a successful match

Quantifier    Matches
*             Matches zero or more occurrences of the preceding pattern
+             Matches one or more occurrences of the preceding pattern
?             Matches zero or one occurrences of the preceding pattern
{n}           Matches exactly n occurrences of the preceding pattern
{n,}          Matches at least n occurrences of the preceding pattern
{n,m}         Matches between n and m (inclusive) occurrences of the preceding pattern
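
A sketch of the quantifiers, including the greedy and lazy behaviour described above. All input strings and patterns here are made-up examples.

```csharp
using System;
using System.Text.RegularExpressions;

public static class QuantifierDemo
{
    public static void Main()
    {
        string html = "<b>one</b><b>two</b>";

        // Greedy: .* grabs as much as it can.
        Console.WriteLine(Regex.Match(html, "<b>.*</b>").Value);  // <b>one</b><b>two</b>

        // Lazy: .*? stops at the first place the rest of the pattern matches.
        Console.WriteLine(Regex.Match(html, "<b>.*?</b>").Value); // <b>one</b>

        Console.WriteLine(Regex.Match("aaaa", "a{2}").Value);   // aa
        Console.WriteLine(Regex.Match("aaaa", "a{2,}").Value);  // aaaa
        Console.WriteLine(Regex.Match("aaaa", "a{2,3}").Value); // aaa

        // ? makes the preceding pattern optional.
        Console.WriteLine(Regex.Match("color colour", "colou?r").Value); // color

        // * accepts zero occurrences, + needs at least one.
        Console.WriteLine(Regex.Match("", "A*").Success); // True
        Console.WriteLine(Regex.Match("", "A+").Success); // False
    }
}
```
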

  • The Windows application "Validating user information using regular expressions" presents a more involved example that uses regular expressions to validate name, address and telephone number information input by a user.
  • See Article Projects: Validating user information using regular expressions:

    • In the Zip Code Regular Expression: Note that without the "^" and "$" characters, the regular expression would match any five consecutive digits in the string. By including the "^" and "$" characters, we ensure that only five-digit zip codes are allowed
  • In a regular expression that begins with a "^" character and ends with a "$" character, the characters "^" and "$" represent the beginning and end of a string, respectively. These characters force a regular expression to return a match only if the entire string being processed matches the regular expression

  • The character "|" matches the expression to its left or the expression to its right. For example, Hi (John|Jane) matches both Hi John and Hi Jane. In line 55, we use the character "|" to indicate that the address can contain a word of one or more characters or a word of one or more characters followed by a space and another word of one or more characters. Note the use of parentheses to group parts of the regular expression. Quantifiers may be applied to patterns enclosed in parentheses to create more complex regular expressions.
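
The anchors and the "|" operator can be sketched as follows. The ZIP pattern follows the description in the text; the address pattern is only an assumption about what such an expression might look like (a house number followed by one or two words), not the expression from the article's project.

```csharp
using System;
using System.Text.RegularExpressions;

public static class ValidationDemo
{
    // Anchored: the whole string has to be exactly five digits.
    public static bool IsZip(string s) => Regex.IsMatch(s, @"^\d{5}$");

    // Not anchored: five digits anywhere in the string are enough.
    public static bool ContainsFiveDigits(string s) => Regex.IsMatch(s, @"\d{5}");

    public static bool IsGreeting(string s) => Regex.IsMatch(s, "^Hi (John|Jane)$");

    // Hypothetical address pattern: a number, then one word or two words.
    public static bool IsAddress(string s) =>
        Regex.IsMatch(s, @"^[0-9]+\s+([a-zA-Z]+|[a-zA-Z]+\s[a-zA-Z]+)$");

    public static void Main()
    {
        Console.WriteLine(IsZip("12345"));               // True
        Console.WriteLine(IsZip("123456"));              // False
        Console.WriteLine(ContainsFiveDigits("123456")); // True

        Console.WriteLine(IsGreeting("Hi Jane"));        // True
        Console.WriteLine(IsGreeting("Hi Jack"));        // False

        Console.WriteLine(IsAddress("10 Broadway"));     // True
        Console.WriteLine(IsAddress("10 Main Street"));  // True
        Console.WriteLine(IsAddress("Broadway"));        // False
    }
}
```
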
