Collections in Rust - Strings

In the last post, we learned about collections in Rust, focusing on vectors and how they let us store a variable amount of data, unlike arrays, whose length must be known at compile time. Today, we will learn about strings and how they store collections of text data.

String

First, let's clarify what we mean when we refer to a string. Rust has only one string type in the core language: the string slice str, which is usually seen in its borrowed form &str. String slices are references to some UTF-8 encoded string data stored elsewhere. String literals, for instance, are stored in the program's binary and are therefore string slices.

The String type is a growable, mutable, owned, UTF-8 encoded string type that is available in the Rust standard library rather than being incorporated into the language's core. Rustaceans typically refer to the String and the string slice &str types, not just one of those types, when they talk about "strings" in Rust. Although the focus of this section is on String, the standard library for Rust makes extensive use of both types, and both String and string slices are UTF-8 encoded.
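To make the distinction concrete, here is a minimal sketch that puts a string literal (&str) next to an owned String and passes both to the same function. The helper name describe is hypothetical and exists only for this illustration.

```rust
// Hypothetical helper: borrows any string data as a string slice.
fn describe(text: &str) -> String {
    format!("{} ({} bytes)", text, text.len())
}

fn main() {
    // A string literal is a &str that points into the program's binary.
    let literal: &str = "hello";

    // A String owns its data on the heap and can grow.
    let mut owned: String = String::from("hello");
    owned.push_str(", world");

    println!("{}", describe(literal));
    // A &String is accepted wherever a &str is expected.
    println!("{}", describe(&owned));
}
```

Taking &str as the parameter type is a common choice because it lets callers pass either kind of string without giving up ownership.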

The Rust standard library also includes several other string types, such as OsString, OsStr, CString, and CStr, and library crates offer even more options for storing string data. Notice that those names all end in String or Str: they refer to owned and borrowed variants, just like String and &str. These string types can, for instance, store text in different encodings or be represented in memory in a different way. They are not covered in this post; see their API documentation for more about how to use them and when each is appropriate.

Creating a New String

Many of the same operations available with Vec<T> are available with String as well, starting with the new function to create a string:

let mut s = String::new();

This line creates a new, empty string called s, into which we can later load data. Often, we will have some initial data that we want to start the string with. For that, we can use the to_string method, which is available on any type that implements the Display trait, as string literals do.

let data = "Hello World";
let s = data.to_string();
// the method also works on a literal directly:
let s = "Hello World".to_string();

This code creates a string containing Hello World.
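Because to_string comes from the Display trait, it is not limited to string literals. The sketch below assumes a hypothetical Point type, invented here for illustration, and shows that both a built-in integer and a custom type gain to_string once Display is implemented.

```rust
use std::fmt;

// Hypothetical type used only for this illustration.
struct Point {
    x: i32,
    y: i32,
}

// Implementing Display is what makes to_string available on Point.
impl fmt::Display for Point {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(f, "({}, {})", self.x, self.y)
    }
}

fn main() {
    let number = 42.to_string();
    let point = Point { x: 1, y: 2 }.to_string();
    println!("{} {}", number, point);
}
```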

In order to build a String from a string literal, we may also utilize the String::from function.

let s = String::from("Hello World");

Because strings are used for so many different things, we have a wide variety of generic APIs to choose from. Some of them can seem redundant, but they all have their place. In this case, String::from and to_string do the same thing, so which one you choose is a matter of style.

Keep in mind that since strings are UTF-8 encoded, any properly encoded data can be included in them.
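As a small illustration, the sketch below stores greetings written in several scripts in String values. The function name greetings is hypothetical; the point is only that each of these literals is valid UTF-8 and therefore a valid String.

```rust
// Hypothetical helper returning greetings in several scripts.
fn greetings() -> Vec<String> {
    vec![
        String::from("Hello"),
        String::from("Здравствуйте"),
        String::from("नमस्ते"),
        String::from("こんにちは"),
    ]
}

fn main() {
    for greeting in greetings() {
        // len reports the size in bytes, not the number of letters.
        println!("{} ({} bytes)", greeting, greeting.len());
    }
}
```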

Updating a String

By inserting more data into it, a String can expand in size and have its contents alter, exactly like a Vec<T>. Additionally, we can easily concatenate String values by using the + operator or the format! macro.

Using push_str and push to add to a String

We can grow a String by using the push_str method to append a string slice.

let mut s = String::from("Test");
s.push_str("ing");

After these two lines, s will contain Testing. The push_str method takes a string slice because we don't necessarily want to take ownership of the parameter. For instance, the code below shows that it would be unfortunate if we couldn't use s2 after appending its contents to s1:

let mut s1 = String::from("test");
let s2 = "ing";
s1.push_str(s2);
println!("s2 is {}", s2);

If the push_str method took ownership of s2, we wouldn't be able to print its value on the last line. However, this code works as we'd expect!

The push method adds a single character to the String by accepting it as an argument. Code for adding the character e to a String using the push method is shown below:

let mut s = String::from("cod");
s.push('e');

This code will cause s to contain code.
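The two methods are often used together when a string is built piece by piece. Here is a minimal sketch, using a hypothetical helper named join_with_commas, that pushes a single char as a separator and a string slice for each word.

```rust
// Hypothetical helper: joins words with commas using push and push_str.
fn join_with_commas(words: &[&str]) -> String {
    let mut out = String::new();
    for (i, word) in words.iter().enumerate() {
        if i > 0 {
            out.push(','); // push appends a single char
        }
        out.push_str(word); // push_str appends a string slice
    }
    out
}

fn main() {
    let joined = join_with_commas(&["tic", "tac", "toe"]);
    println!("{}", joined);
}
```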

Concatenation with the + Operator or the format! Macro

We'll frequently want to combine two already-existing strings. Using the + operator is one approach.

let s1 = String::from("Hello, ");
let s2 = String::from("world!");
let s3 = s1 + &s2; // Note that s1 has been moved here and can no longer be used

As a result of this code, the string s3 will contain Hello, world!. The reason s1 is no longer valid after the addition, and the reason we used a reference to s2, has to do with the signature of the method that is called when we use the + operator. The + operator uses the add method, whose signature looks something like this:

fn add(self, s: &str) -> String {

This isn't the exact signature found in the standard library, where add is defined using generics. Here, we are looking at the signature of add with concrete types substituted for the generic ones, which is what happens when we call this method with String values. This signature gives us the clues we need to understand the tricky parts of the + operator.

First, s2 has an &, meaning that we are adding a reference to the second string to the first string. Because of the s parameter in the add method, we can only add a &str to a String; we cannot add two String values together. But wait: the type of &s2 is &String, not &str, as specified in the second parameter to add. So why does the code above compile?

The compiler's ability to convert the &String argument into a &str allows us to utilize &s2 in the call to add. Rust uses a deref coercion when we call the add method, which in this case converts &s2 into &s2[..]. The result of this operation will still leave s2 as a valid String because add does not claim ownership of the s argument.

Second, we can see in the signature that add takes ownership of self, because self does not have an &. This means s1 will be moved into the add call and will no longer be valid after that. So although let s3 = s1 + &s2; looks like it will copy both strings and create a new one, this statement actually takes ownership of s1, appends a copy of the contents of s2, and then returns ownership of the result. In other words, it looks like it is making a lot of copies, but it isn't; the implementation is more efficient than copying.
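If the left-hand string is still needed afterwards, one option is to clone it before the addition. The sketch below uses a hypothetical helper named append that mirrors the simplified add signature; it is an illustration of the ownership rules, not the standard library's implementation.

```rust
// Hypothetical helper with the same shape as the simplified add signature:
// it takes ownership of `left` and only borrows `right`.
fn append(left: String, right: &str) -> String {
    left + right
}

fn main() {
    let s1 = String::from("Hello, ");
    let s2 = String::from("world!");

    // Cloning s1 keeps the original usable, at the cost of one extra copy.
    let s3 = append(s1.clone(), &s2);

    println!("{}", s3);
    println!("{}{}", s1, s2); // both are still valid here
}
```

Whether the extra clone is worth it depends on whether the original value is really needed later; when it is not, moving it avoids the copy.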

The behaviour of + becomes awkward if multiple strings need to be concatenated:

let s1 = String::from("tic");
let s2 = String::from("tac");
let s3 = String::from("toe");
let s = s1 + "-" + &s2 + "-" + &s3;

s will now be tic-tac-toe. It's challenging to understand what's happening with all the + and " characters. We can utilize the format! macro to combine strings in more intricate ways:

let s1 = String::from("tic");
let s2 = String::from("tac");
let s3 = String::from("toe");
let s = format!("{}-{}-{}", s1, s2, s3);

This code also sets s to tic-tac-toe. The format! macro works in the same way as println!, but instead of printing the output to the screen, it returns a String with the contents. The version of the code using format! is much easier to read and doesn't take ownership of any of its parameters.
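A short sketch can confirm that format! only borrows its arguments. The helper name dashed is hypothetical; note that all three input strings remain usable after the call.

```rust
// Hypothetical helper: joins three pieces with dashes via format!.
fn dashed(a: &str, b: &str, c: &str) -> String {
    format!("{}-{}-{}", a, b, c)
}

fn main() {
    let s1 = String::from("tic");
    let s2 = String::from("tac");
    let s3 = String::from("toe");

    let s = dashed(&s1, &s2, &s3);
    println!("{}", s);

    // None of the inputs were moved, so they can still be used.
    println!("{} {} {}", s1, s2, s3);
}
```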

Indexing into Strings

Accessing specific characters in a string by index is a legal and frequent operation in several other programming languages. However, using Rust's indexing syntax to retrieve specific portions of a String will result in an error.

let s1 = String::from("hello");
let h = s1[0];

This code will result in the following error:

error[E0277]: the trait bound `std::string::String: std::ops::Index<{integer}>` is not satisfied
 -->
  |
3 |     let h = s1[0];
  |             ^^^^^ the type `std::string::String` cannot be indexed by `{integer}`
  |
  = help: the trait `std::ops::Index<{integer}>` is not implemented for `std::string::String`

The error and the note tell the story: Rust strings don't support indexing. But why not? To answer that question, we need to discuss how Rust stores strings in memory.

Internal Representation

A String is a wrapper over a Vec<u8>. Let's look at some properly encoded UTF-8 example strings. Here's the first:

let len = String::from("Hola").len();

In this case, len will be four, which means the Vec storing the string "Hola" is four bytes long. Each of these letters takes one byte when encoded in UTF-8. But what about the following line?

let len = String::from("Здравствуйте").len();

Note that this string begins with the capital Cyrillic letter Ze, not the Arabic numeral 3. Asked how long the string is, you might say 12. However, Rust's answer is 24: that is the number of bytes it takes to encode "Здравствуйте" in UTF-8, because each Unicode scalar value in that string takes two bytes of storage. Therefore, an index into the string's bytes will not always correlate to a valid Unicode scalar value. To demonstrate, consider this invalid Rust code:

let hello = "Здравствуйте";
let answer = &hello[0];

What should the value of answer be? Should it be З, the first letter? When encoded in UTF-8, the first byte of З is 208 and the second is 151, so answer would in fact be 208, but 208 is not a valid character on its own. Returning 208 is likely not what a user would want if they asked for the first letter of this string; however, that is the only data Rust has at byte index 0. Returning the byte value is probably not what users want even if the string contains only Latin letters: if &"hello"[0] were valid code that returned the byte value, it would return 104, not h. To avoid returning an unexpected value and causing bugs that might not be discovered immediately, Rust doesn't compile this code at all, preventing misunderstandings early in the development process.
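When one of these values really is what you need, you can ask for it explicitly. The sketch below uses two hypothetical helpers, first_byte and first_char, to show the difference between the first byte and the first Unicode scalar value of a string.

```rust
// Hypothetical helper: the first raw byte, if the string is not empty.
fn first_byte(s: &str) -> Option<u8> {
    s.as_bytes().first().copied()
}

// Hypothetical helper: the first Unicode scalar value, if any.
fn first_char(s: &str) -> Option<char> {
    s.chars().next()
}

fn main() {
    let hello = "Здравствуйте";
    println!("{:?}", first_byte(hello)); // the first byte of the encoding
    println!("{:?}", first_char(hello)); // the first letter
}
```

Both helpers return an Option so that the empty string is handled without panicking.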

Grapheme clusters, scalar values, and bytes

Another thing to note about UTF-8 is that, from the viewpoint of Rust, there are actually three significant ways to view strings: as bytes, scalar values, and grapheme clusters (the closest thing to what we would call letters).

If we look at the Hindi word "नमस्ते" written in the Devanagari script, it is ultimately stored as a Vec of u8 values that looks like this:

[224, 164, 168, 224, 164, 174, 224, 164, 184, 224, 165, 141, 224, 164, 164, 224, 165, 135]

That's 18 bytes, and it is how computers ultimately store this data. If we look at them as Unicode scalar values, which are what Rust's char type is, those bytes look like this:

['न', 'म', 'स', '्', 'त', 'े']

There are six char values here, but the fourth and sixth are not letters: they’re diacritics that don’t make sense on their own. Finally, if we look at them as grapheme clusters, we’d get what a person would call the four letters that make up the Hindi word:

["न", "म", "स्", "ते"]

No matter what human language the data is in, Rust offers various ways of interpreting the raw string data that computers store so that each application can select the interpretation it requires.
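Two of these three views are available directly from the standard library. The following sketch, built around a hypothetical helper named as_bytes_and_chars, collects the byte view and the scalar-value view of the same word; grapheme clusters are left out because the standard library does not provide them.

```rust
// Hypothetical helper: returns the byte view and the char view of a string.
fn as_bytes_and_chars(s: &str) -> (Vec<u8>, Vec<char>) {
    (s.bytes().collect(), s.chars().collect())
}

fn main() {
    let (bytes, chars) = as_bytes_and_chars("नमस्ते");
    println!("{} bytes: {:?}", bytes.len(), bytes);
    println!("{} chars: {:?}", chars.len(), chars);
}
```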

A final reason Rust doesn't allow us to index into a String to get a character is that indexing operations are expected to always take constant time (O(1)). It isn't possible to guarantee that performance with a String, because Rust would have to walk through the contents from the beginning to the index to determine how many valid characters there were.
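If you do need the character at a given position, the cost of that walk can be made explicit with an iterator. A minimal sketch, assuming a hypothetical helper named nth_char:

```rust
// Hypothetical helper: returns the n-th Unicode scalar value (zero-based).
// This walks the string from the start, so it takes time proportional to n.
fn nth_char(s: &str, n: usize) -> Option<char> {
    s.chars().nth(n)
}

fn main() {
    let hello = "Здравствуйте";
    println!("{:?}", nth_char(hello, 0));
    println!("{:?}", nth_char(hello, 100)); // out of range, so None
}
```

Returning an Option keeps an out-of-range position from becoming a runtime panic.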

Slicing Strings

Indexing into a string is often a bad idea because it's not clear what the return type of the string indexing operation should be: a byte value, a character, a grapheme cluster, or a string slice. Therefore, Rust asks you to be more specific if you really need to use indices to create string slices. Rather than indexing using [] with a single number, you can use [] with a range to create a string slice containing particular bytes:

let hello = "Здравствуйте";
let s = &hello[0..4];

Here, s will be a &str that contains the first four bytes of the string. As mentioned earlier, each of these characters takes two bytes, so s will be Зд.

What would happen if we used &hello[0..1]? The answer: Rust will panic at runtime, in the same way as when an invalid index is accessed in a vector:

thread 'main' panicked at 'byte index 1 is not a char boundary; it is inside 'З' (bytes 0..2) of `Здравствуйте`', src/libcore/str/mod.rs:2188:4

You should use ranges to create string slices with caution, because doing so can crash your program.
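One way to slice without risking a panic is the get method on str, which returns None when a range does not fall on character boundaries or is out of bounds. The sketch below wraps it in a hypothetical helper named byte_prefix and also shows is_char_boundary.

```rust
// Hypothetical helper: the first `end` bytes as a slice, or None if the
// range is out of bounds or would split a character.
fn byte_prefix(s: &str, end: usize) -> Option<&str> {
    s.get(0..end)
}

fn main() {
    let hello = "Здравствуйте";

    match byte_prefix(hello, 4) {
        Some(prefix) => println!("prefix: {}", prefix),
        None => println!("4 is not a valid slice end"),
    }

    // Byte 1 sits in the middle of the first letter.
    println!("boundary at 1? {}", hello.is_char_boundary(1));
    println!("prefix of 1 byte: {:?}", byte_prefix(hello, 1));
}
```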

Methods for Iterating Over Strings

Fortunately, there are alternative ways to retrieve a string's elements.

If we need to perform operations on individual Unicode scalar values, the best way to do so is to use the chars method. Calling chars on "नमस्ते" separates out and returns six values of type char, and we can iterate over the result to access each element:

for c in "नमस्ते".chars() {
    println!("{}", c);
}

This code will print the following:

न
म
स
्
त
े

The bytes method, which may be suitable for your domain, returns each raw byte:

for b in "नमस्ते".bytes() {
    println!("{}", b);
}

The 18 bytes that make up this String will be printed by this code, beginning with:

224
164
168
224
// ... etc

But be sure to remember that valid Unicode scalar values may be made up of more than one byte.
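When both the character and its byte position are needed, the char_indices method yields them together. A minimal sketch, with a hypothetical helper named char_offsets:

```rust
// Hypothetical helper: pairs each char with the byte offset where it starts.
fn char_offsets(s: &str) -> Vec<(usize, char)> {
    s.char_indices().collect()
}

fn main() {
    for (offset, c) in char_offsets("Здравствуйте") {
        println!("byte {}: {}", offset, c);
    }
}
```

The offsets it returns are always character boundaries, so they are safe to use as the ends of a slicing range.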

Getting grapheme clusters from strings is complex, so this functionality is not provided by the standard library. Crates are available on crates.io if this is the functionality you need.

Strings Are Not So Simple

To summarize, strings are complicated. Different programming languages make different choices about how to present this complexity to the programmer. Rust has chosen to make the correct handling of String data the default behavior for all Rust programs, which means programmers have to put more thought into handling UTF-8 data up front. This trade-off exposes more of the complexity of strings than other programming languages do, but it saves you from having to handle errors involving non-ASCII characters later in your development life cycle.

I sincerely hope that the majority of you find the approach covered here to be helpful. Thank you for reading, and please feel free to leave any comments or questions in the comments section below.
