We’re going to explore three levels of parsing in Rust: Beginner, Intermediate, and Advanced using the nom parsing crate.
The beginner level might be written by someone who hasn’t had a lot of Rust experience yet.
and we’ll use that as our base to refactor into the intermediate level, covering the changes in mental model that can apply to your other projects.
Finally we’ll show the “advanced” way to handle the parsing. Usually its hard for people to get to this type of code right away, even with a lot of experience.
The Parsing Problem
In software there are soft line breaks and hard line breaks.
Soft line breaks are a way of visually representing wrapping text without affecting the source.
This is often what happens on websites or if you’re writing markdown since the content doesn’t change meaning if the content is presented on multiple lines
Other formats need or prefer hard line breaks. Code in source files is often broken up by many hard breaks, or literal returns.
You can also use hard line breaks in formats like Markdown, which then goes on to ignore your line breaks when parsing.
So a single paragraph can be written in Markdown as a single line of text, and soft-wrapped in your editor for easier reading, or hard-wrapped with literal presses of the Enter key which is often preferred by people who work in older terminals.
Why hard line breaks?
But why do we care about hard line breaks?
Well, they give us the opportunity to write Rust code in a series of different levels.
Here’s an example. We have two paragraphs with hard line breaks at column 80.
Now we'll take the almighty fan brush. Let's have a happy little tree in here.
We don't have to be concerned about it. We just have to let it fall where it
will. Life is too short to be alone, too precious. Share it with a friend.
Zip. That easy. There it is. I can't think of anything more rewarding than being
able to express yourself to others through painting. A tree needs to be your
friend if you're going to paint him. Once you learn the technique, ohhh! Turn
you loose on the world; you become a tiger. Fluff that up.
You can contrast that with what the same text looks like if its soft-wrapped in an editor.
When we render out this text we want to render each paragraph as its own single cohesive unit, which means we need to remove the hard line breaks at the end of each line.
So overall we need to parse each individual paragraph as well as each line.
Beginner’s Code
A beginner’s code might look like this.
fn hard_break_paragraphs(input: &str) -> Vec<String> {
let mut paragraphs: Vec<String> = vec![];
let mut index = 0;
for line in input.trim().lines() {
if line == "" {
index += 1;
} else if let Some(s) = paragraphs.get_mut(index) {
s.push_str(" ");
s.push_str(line);
} else {
paragraphs.push(line.to_string());
}
}
paragraphs
}
This code works and wouldn’t be out of place in someone’s early Rust program.
We start by setting up mutable variables for a Vec of Strings and an index.
Then we use a for loop to iterate over each line of the input using .lines()
, which is a function on string slices. This is basically like splitting the input on newlines.
Inside of that for loop we have three potential operations.
If the line is an empty string, then we must be at the double-newline section between paragraphs, so we increment our index to keep track of where we are.
Otherwise, if there is a String that already exists at our index, we get mutable access to it, and push a space and the content of the line onto the already-existing string.
Finally, if we haven’t done anything else, we push the contents of the line onto the vec as a new String.
After all of that, our vec contains our paragraphs and we can return that from the function.
But this approach has a number of different improvements we can make to it.
Especially reliance on indices, and reliance on mutability in a larger scope than we need it.
The first change we make will be small. Instead of testing to see if the line is equal to empty string, we can use is_empty()
.
is_empty()
better conveys our intent here using human readable words, and removes the ability to make a mistake by adding any extra characters to the string we’re testing against.
fn hard_break_paragraphs(input: &str) -> Vec<String> {
let mut paragraphs: Vec<String> = vec![];
let mut index = 0;
for line in input.trim().lines() {
if line.is_empty() {
index += 1;
} else if let Some(s) = paragraphs.get_mut(index) {
s.push_str(" ");
s.push_str(line);
} else {
paragraphs.push(line.to_string());
}
}
paragraphs
}
We can also get rid of the index-based vec access entirely.
If there is any empty line, that must mean we’re starting a new paragraph, so push a String into the paragraphs vec.
Otherwise, we can use last_mut
to always access the last item in the vec. There are a number of ways to take advantage of the optional return value, but in this case we’ll keep largely the same control flow and test to see if the last string is empty or not.
If its not an empty string, then add a space before inserting the line itself.
Finally, we can push the line onto the vec if neither of the other tests pass.
fn hard_break_paragraphs(input: &str) -> Vec<String> {
let mut paragraphs: Vec<String> = vec![];
for line in input.trim().lines() {
if line.is_empty() {
paragraphs.push("".to_string());
} else if let Some(s) = paragraphs.last_mut() {
if !s.is_empty() {
s.push_str(" ");
}
s.push_str(line);
} else {
paragraphs.push(line.to_string());
}
}
paragraphs
}
Its interesting to note that we can remove the final branch entirely in favor of starting our iteration with a String already in the vec.
fn hard_break_paragraphs(input: &str) -> Vec<String> {
let mut paragraphs: Vec<String> = vec![String::from("")];
for line in input.trim().lines() {
if line.is_empty() {
paragraphs.push("".to_string());
} else if let Some(s) = paragraphs.last_mut() {
if !s.is_empty() {
s.push_str(" ");
}
s.push_str(line);
}
}
paragraphs
}
We still have some free mutation where we don’t need it, and while this is fine we’ve effectively started to write a fold anyway, so we’ll continue down that path.
Intermediate
This is when we start getting into the intermediate level code.
Our input can be trimmed and turned into lines() like before, but instead of using a for loop we can take advantage of Rust’s iterators and use fold.
You might know fold as reduce in other languages. We use .fold on our iterator, starting with the same initial value we had earlier: a vec with a String.
That initial value is used as the first argument to our closure, which accepts the accumulated data we’re building up and the next item to process.
This logic is very much like our for loop, except we’ve removed the external mutable variable entirely, in favor of the more localized accumulator variable.
We take advantage of .last and .last_mut instead of index access, and is_empty like we discussed earlier.
This fold will return our Vec of Strings, so we can use that as the return value for our function.
fn hard_break_paragraphs(input: &str) -> Vec<String> {
input
.trim()
.lines()
.fold(vec![String::from("")], |mut acc, item| {
if item.is_empty() {
acc.push("".to_string());
} else if acc.last().unwrap().is_empty() {
acc.last_mut().unwrap().push_str(item);
} else {
acc.last_mut().unwrap().push_str(" ");
acc.last_mut().unwrap().push_str(item);
}
acc
})
}
there’s still more we can do here.
the .last function returns an optional type, which we’re testing to see if its empty.
Using .unwrap() is a really common way for beginners to get around dealing with data inside of enums, but we can use a different function here.
If we go to the Option docs, we can find a function called is_some_and
, which tests to see if the Option is a Some variant and that the value inside also passes a test.
Since we already know that acc.last() succeeded, we can unwrap last_mut inside of that branch without worrying about unwrap() panic-ing.
Finally we leave the unwrap()s in the last branch because we already set up in a way that there will always be a last item. So if there isn’t a last item, then our assumptions about how our program should behave are incorrect, and panic’ing is appropriate.
fn hard_break_paragraphs(input: &str) -> Vec<String> {
input
.trim()
.lines()
.fold(vec![String::from("")], |mut acc, item| {
if item.is_empty() {
acc.push("".to_string());
} else if acc.last().is_some_and(|s| s.is_empty()) {
acc.last_mut().unwrap().push_str(item);
} else {
acc.last_mut().unwrap().push_str(" ");
acc.last_mut().unwrap().push_str(item);
}
acc
})
}
However, we do build up Strings gradually by pushing onto them in this solution. This could trigger our string to constantly expand the amount of space it takes up which may or may not matter to us.
We can use nom to build up a parser instead.
Nom will cause our function signature to change. It is a parser-combinator library, which means that we can build up complicated parsers from smaller functions. To do this, the input and the return type should be more regular.
We’ll start by using nom to handle creating our lines. Our input is effectively a separated_list separated by newlines.
This gives us the ability to continue using our previous logic to take those lines and create paragraphs from them.
fn hard_break_paragraphs(input: &str) -> IResult<&str, Vec<String>> {
let (input, lines) = separated_list0(newline, not_line_ending)(input.trim())?;
let output = lines.iter().fold(vec![String::from("")], |mut acc, item| {
if item.is_empty() {
acc.push("".to_string());
} else if acc.last().is_some_and(|s| s.is_empty()) {
acc.last_mut().unwrap().push_str(item);
} else {
acc.last_mut().unwrap().push_str(" ");
acc.last_mut().unwrap().push_str(item);
}
acc
});
Ok((input, output))
}
but fold is just the most general function for reducing many items into a single data structure. That doesn’t make it the best option.
In the itertools crate there is a function called group_by. group_by allows us to group consecutive elements in our iterator according to a test.
In this case, we have a sequence of content, followed by an empty string separating the two paragraphs, followed by more lines of content for the second paragraph.
If we test the line for is_empty()
, then we’ll get the content grouped together, separated by groups of the the empty lines.
We can then iterate over these groups, and join each of them together to create a String.
This creates the allocation for each String once when joining rather than potentially many times like we were doing before when pushing onto a String.
We can then filter out all of the empty lines, and we have our paragraphs.
fn hard_break_paragraphs(input: &str) -> IResult<&str, Vec<String>> {
let (input, lines) = separated_list0(newline, not_line_ending)(input.trim())?;
let output = lines
.into_iter()
.group_by(|line| line.is_empty())
.into_iter()
.map(|(_key, mut group)| group.join(" "))
.filter(|s| !s.is_empty())
.collect();
Ok((input, output))
}
Similarly, there are many ways to iterate to create this output. Another possible function from the itertools crate is coalesce, which determines whether or not to merge two elements.
Can you figure out why we need to .map over the strings to call to_string() on them?
fn hard_break_paragraphs(input: &str) -> IResult<&str, Vec<String>> {
let (input, lines) = separated_list0(newline, not_line_ending)(input.trim())?;
let output = lines
.into_iter()
.map(|s| s.to_string())
.coalesce(|x, y|
if x.len() != 0 && y.len() != 0 {
Ok(format!("{x} {y}"))
} else {
Err((x, y))
}
)
.filter(|s| !s.is_empty())
.collect();
Ok((input, output))
}
batching
is yet another option! In this approach we take advantage of take_while, .not(), and .then_some!
fn hard_break_paragraphs(input: &str) -> IResult<&str, Vec<String>> {
let (input, lines) = separated_list0(newline, not_line_ending)(input.trim())?;
let output = lines
.iter()
.batching(|it| {
let joined = it.take_while(|x| !x.is_empty()).join(" ");
joined.is_empty().not().then_some(joined)
})
.collect();
Ok((input, output))
}
All of these are fine, but they don’t use nom to its fullest potential.
Advanced
We said before that nom is a parser combinator library. This means that we can build up complicated parsers from smaller parsers.
The key insight for our hard_break_paragraphs
parser is that we effectively have a list of lines to define the paragraphs, and a list of paragraphs.
The paragraph lines are separated by a single newline, while the paragraphs themselves are separated by two newlines.
We’ll take special care to make sure our parser doesn’t overlap these two different separators.
Our first parser is easy enough. The set of paragraphs is a separated list of paragraphs, separated by two newlines.
The second parser is a bit tougher. We need to parse a paragraph, but also stop if we hit two newlines in a row so that we can pass control back to the hard_break_paragraphs parser.
If we wrote a separated list like this, it would work for a single paragraph, but it wouldn’t stop at the first paragraph because the newline can just keep eating when it reaches two newlines!
separated_list0(newline, not_line_ending)(input)
So instead, we need our separator to have a concept of when it should fail. Our separator is a newline, but crucially not if another newline follows it, which we can achieve by using not(newline)
, chained together with a tuple combinator.
use itertools::Itertools;
use nom::{
character::complete::{line_ending, newline, not_line_ending},
combinator::{iterator, not, opt},
multi::separated_list0,
sequence::{terminated, tuple},
IResult,
};
use std::ops::Not;
fn paragraph(input: &str) -> IResult<&str, Vec<&str>> {
separated_list0(tuple((newline, not(newline))), not_line_ending)(input)
}
fn hard_break_paragraphs(input: &str) -> IResult<&str, Vec<String>> {
let (input, paragraphs) = separated_list0(tuple((newline, newline)), paragraph)(input.trim())?;
let output = paragraphs.into_iter().map(|lines| lines.join(" ")).collect();
Ok((input, output))
}
So our parser starts at hard_break_paragraphs and tries to parse its first paragraph due to our separated list.
That first paragraph parser run tries to parse a string that is as long as possible but doesn’t include a line ending.
Then tries to parse the separator, which is strictly a single newline before parsing another line.
If the separator parser hits a double newline, the paragraph parser stops and returns what it has, at which point the hard_break_paragraphs parser parses its separator, the two newlines, before trying to parse another paragraph.
All of this gives us a Vec<Vec<&str>>, which we can then iterator over to join the paragraphs together with a space, giving us our final Vec
But its not over yet!
Advanced nom users will come to find out that you can map over parsers, which themselves implement the Parser trait.
That means that we don’t need to reach into the datastructure we create, we can instead map over the result of the paragraph parser and join the lines we get back together to fit our needs.
fn paragraph(input: &str) -> IResult<&str, Vec<&str>> {
separated_list0(tuple((newline, not(newline))), not_line_ending)(input)
}
fn hard_break_paragraphs(input: &str) -> IResult<&str, Vec<String>> {
let (input, output) = separated_list0(
tuple((newline, newline)),
paragraph.map(|lines| lines.join(" ")),
)(input.trim())?;
Ok((input, output))
}
and at long last, we get to the final nom parsers. Two separated lists with very little extra code that only allocates a String when it needs to and only does it once for each string.
fn paragraph(input: &str) -> IResult<&str, Vec<&str>> {
separated_list0(tuple((newline, not(newline))), not_line_ending)(input)
}
fn hard_break_paragraphs(input: &str) -> IResult<&str, Vec<String>> {
separated_list0(
tuple((newline, newline)),
paragraph.map(|lines| lines.join(" ")),
)(input.trim())
}
All of the code in this video works. All of it would probably be fine in your application. Sometimes the differences matter. Most of the time they don’t.
If you need a certain kind of performance, don’t forget to measure before you make changes!
Its up to you how far you want to go.
Consider how you can remove error-prone code in your own programs.