In this lesson we'll move from using StringRecords
to defining our own types. Each row in the csv will turn into a single Rust struct.
To turn csv rows into our own structs, we can add serde
to our package with the derive
feature.
cargo add -p upload-pokemon-data serde -F derive
Your dependencies in Cargo.toml
will now look like this:
[dependencies]
csv = "1.2.2"
serde = { version = "1.0.188", features = ["derive"] }
Serde is a library that is widely used in the Rust ecosystem for serializing Rust data types into various formats and deserializing them back again. Some of those formats include JSON, TOML, YAML, and MessagePack. Taking JSON as an example, serialization takes a Rust struct and turns it into JSON, while deserialization takes JSON and turns it into a Rust struct.
We will be using the Deserialize
derive macro to derive
a deserializer for our PokemonCsv
type, which is why we need to enable the derive
feature.
Cargo features are powerful ways to turn different pieces of code on and off depending on what a consumer needs to use, ensuring that our projects don't include code they don't need in the compilation process.
Next we'll create a new Rust sub-module to hold the PokemonCsv
struct that represents what's in each CSV row. In src/main.rs
:
mod pokemon_csv;
use pokemon_csv::*;
Rust modules don't necessarily reflect the filesystem, but in this case ours will. We'll use mod pokemon_csv
to define a new submodule that will exist in src/pokemon_csv.rs
, and we'll use pokemon_csv::*
to pull all of the public items from pokemon_csv.rs
into scope in main.rs
.
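To see that modules and files are separate ideas, here is a minimal sketch of a module declared inline, with its body in the same file. The module name shapes and the function square_area are hypothetical and only there for illustration.

```rust
// A module declared inline: its body lives here rather than in src/shapes.rs.
mod shapes {
    // `pub` makes this item visible outside the module.
    pub fn square_area(side: u32) -> u32 {
        side * side
    }

    // Without `pub`, this function stays private to `shapes`,
    // so the glob import below does not bring it into scope.
    #[allow(dead_code)]
    fn private_helper() {}
}

// The glob import pulls every public item from `shapes` into scope.
use shapes::*;

fn main() {
    println!("{}", square_area(4));
}
```

Writing mod shapes; with a semicolon instead of a body is what tells Rust to look for the module's contents in a file.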
In src/pokemon_csv.rs
we'll define a new public struct PokemonCsv
. This struct will include all of the fields we care about from the csv file. We also label all of the fields as pub
so that they're accessible if we want them wherever we create a PokemonCsv
.
use serde::Deserialize;
#[derive(Debug, Deserialize)]
pub struct PokemonCsv {
pub name: String,
pub pokedex_id: u16,
pub abilities: String,
pub typing: String,
pub hp: u8,
pub attack: u8,
pub defense: u8,
pub special_attack: u8,
pub special_defense: u8,
pub speed: u8,
pub height: u16,
pub weight: u16,
pub generation: u8,
pub female_rate: Option<f32>,
pub genderless: bool,
#[serde(rename(deserialize = "legendary/mythical"))]
pub is_legendary_or_mythical: bool,
pub is_default: bool,
pub forms_switchable: bool,
pub base_experience: u16,
pub capture_rate: u8,
pub egg_groups: String,
pub base_happiness: u8,
pub evolves_from: Option<String>,
pub primary_color: String,
pub number_pokemon_with_typing: f32,
pub normal_attack_effectiveness: f32,
pub fire_attack_effectiveness: f32,
pub water_attack_effectiveness: f32,
pub electric_attack_effectiveness: f32,
pub grass_attack_effectiveness: f32,
pub ice_attack_effectiveness: f32,
pub fighting_attack_effectiveness: f32,
pub poison_attack_effectiveness: f32,
pub ground_attack_effectiveness: f32,
pub fly_attack_effectiveness: f32,
pub psychic_attack_effectiveness: f32,
pub bug_attack_effectiveness: f32,
pub rock_attack_effectiveness: f32,
pub ghost_attack_effectiveness: f32,
pub dragon_attack_effectiveness: f32,
pub dark_attack_effectiveness: f32,
pub steel_attack_effectiveness: f32,
pub fairy_attack_effectiveness: f32,
}
At the top we've included two derive macros for Debug
and Deserialize
. Debug
we've already talked about. We're deriving it for convenience if we want to use the Debug
formatter ("{:?}"
) or the dbg!
macro with a value of type PokemonCsv
.
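As a quick reminder of what deriving Debug gives us, here is a small sketch using a hypothetical Point struct rather than the full PokemonCsv.

```rust
// A small hypothetical struct, used only to show the derived Debug output.
#[derive(Debug)]
struct Point {
    x: u8,
    y: u8,
}

// Format a Point using the Debug formatter.
fn debug_string(p: &Point) -> String {
    format!("{:?}", p)
}

fn main() {
    let p = Point { x: 1, y: 2 };
    println!("x = {}, y = {}", p.x, p.y);
    println!("{}", debug_string(&p));
    // dbg! writes the file, line, expression, and Debug output to stderr.
    dbg!(&p);
}
```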
Deserialize
, on the other hand, is from serde, and does a pretty good job of automatically handling all of the fields with the types we've given. If we didn't derive Deserialize
we'd have to manually implement it for our type, and thus every field... but all of these types already have implementations so we let serde write that code for us.
That is what derive macros do: write boring repetitive code for us.
We use a few different number types for the data in our csv: u8
, u16
, and f32
.
- A u8 is an unsigned (unsigned means: not negative) integer from 0-255, just like a color in CSS.
- A u16 is bigger than a u8 and can hold values from 0-65535.
Why are there different integer types? Well, because they're different sizes in memory. Storing a u8 takes 8 bits (one byte), while storing a u64 takes 64 bits (8 bytes). So if we can appropriately size the number we use, we can store more numbers in less memory.
Using the right type can also help us understand what values are valid. The number 300 is not a valid u8
, for example.
An f32
is a 32 bit float. Floats store numbers with decimal places, like 2.5
.
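The sizes and ranges above can be checked with the standard library alone. This is a small sketch; parse_u8 is a hypothetical helper name, not part of our program.

```rust
use std::mem::size_of;

// Try to read text as a u8. Returns None when the text is not a valid u8,
// for example because the number is larger than 255.
fn parse_u8(text: &str) -> Option<u8> {
    text.parse::<u8>().ok()
}

fn main() {
    // Sizes in bytes of the number types used in PokemonCsv.
    println!("u8:  {} byte(s)", size_of::<u8>());
    println!("u16: {} byte(s)", size_of::<u16>());
    println!("f32: {} byte(s)", size_of::<f32>());

    // The smallest and largest values each integer type can hold.
    println!("u8 range:  {} to {}", u8::MIN, u8::MAX);
    println!("u16 range: {} to {}", u16::MIN, u16::MAX);

    // 255 fits in a u8, 300 does not.
    println!("{:?}", parse_u8("255"));
    println!("{:?}", parse_u8("300"));
}
```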
The other types we're using are Option
, bool
, and String
.
- bool is true or false.
- Option is an enum that represents a value that can exist or not. The variants we can build are Some(value) if there is a value or None if there isn't.
- String is an owned string. That is, it's mostly what you think of when you think of strings in languages like JavaScript. We can add more to the string and otherwise do whatever we want with it.
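Here is a short sketch of these three types in use. The describe function is hypothetical. It mirrors how a field like evolves_from can be either Some or None.

```rust
// Describe an optional value, the way evolves_from may or may not be present.
fn describe(evolves_from: &Option<String>) -> String {
    match evolves_from {
        Some(name) => format!("evolves from {}", name),
        None => String::from("does not evolve from anything"),
    }
}

fn main() {
    // A String is owned, so we can grow it in place.
    let mut name = String::from("Ivy");
    name.push_str("saur");

    // A bool is either true or false.
    let is_default: bool = true;

    let has_parent = Some(String::from("Bulbasaur"));
    let no_parent: Option<String> = None;

    println!("{} (default form: {}): {}", name, is_default, describe(&has_parent));
    println!("Bulbasaur: {}", describe(&no_parent));
}
```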
We match all of these types to the types in the CSV. I've chosen to match the integer types as tightly as possible, even though I don't know if more pokemon with bigger values will be added in future generations. This is because I'm trying to match what's in the CSV now, not what could be added to the database in the future. Using different types also gives us a reason to talk about those types.
The only thing left to talk about is the use of a field-level attribute macro. Serde offers us the power to rename fields when we're deserializing, so we'll take advantage of that to remove the /
out of legendary/mythical
and transform it into is_legendary_or_mythical
.
#[serde(rename(deserialize = "legendary/mythical"))]
pub is_legendary_or_mythical: bool,
In our for loop that iterates over the CSV reader, we can change the function used from records
to deserialize
. The deserialize
function needs a type parameter to tell it what type to deserialize into. We can get Rust to infer that type if we label the record
as a PokemonCsv
type, because Rust is capable of knowing that this type will propagate back up to the deserialize function and it is the only possible value for that type parameter.
for result in rdr.deserialize() {
let record: PokemonCsv = result?;
dbg!(record);
}
If you have Rust Analyzer with the type inlays on, you will see that Rust Analyzer correctly shows the type of result
as Result<PokemonCsv, csv::Error>
.
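The same kind of inference shows up in the standard library. As a sketch that does not involve the csv crate at all, str::parse is also generic over its output type, and annotating the binding is enough for Rust to fill in the type parameter. The parse_id function is a hypothetical name.

```rust
use std::num::ParseIntError;

fn parse_id(text: &str) -> Result<u16, ParseIntError> {
    // parse is generic over its output type. The annotation on `id`
    // is all the compiler needs to pick u16 for that type parameter.
    let id: u16 = text.parse()?;
    Ok(id)
}

fn main() {
    println!("{:?}", parse_id("25"));
    // 70000 is too large for a u16, so this one is an Err.
    println!("{:?}", parse_id("70000"));
}
```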
Running the program results in a DeserializeError
that reports a ParseBool
error, along with the byte, line, and record of the csv where it happened.
❯ cargo run --bin upload-pokemon-data
Error: Error(Deserialize { pos: Some(Position { byte: 781, line: 1, record: 1 }), err: DeserializeError { field: Some(14), kind: ParseBool(ParseBoolError) } })
If we look at the csv values, we can see that this is because the true/false values are capital T True
and capital F False
, which don't parse into Rust's true
and false
.
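We can reproduce the same failure with the standard library's bool parsing. This sketch assumes the csv crate relies on that same parsing, which the ParseBoolError in the output suggests. The parses_as_bool helper is hypothetical.

```rust
// Returns true when the standard library can parse the text as a bool.
fn parses_as_bool(text: &str) -> bool {
    text.parse::<bool>().is_ok()
}

fn main() {
    // Rust's bool parser accepts only the exact strings "true" and "false".
    for text in ["true", "false", "True", "False"] {
        println!("{:?} parses as bool: {}", text, parses_as_bool(text));
    }
}
```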
from_capital_bool
We can create a new function, just for these fields, to deserialize True
and False
into bools.
We first need to use serde's field-level attribute macro to tell it that when we deserialize, we're going to use a function called from_capital_bool
. Notice that we can also add it alongside other usage, such as the rename
.
#[serde(deserialize_with = "from_capital_bool")]
pub genderless: bool,
#[serde(
rename(deserialize = "legendary/mythical"),
deserialize_with = "from_capital_bool"
)]
pub is_legendary_or_mythical: bool,
#[serde(deserialize_with = "from_capital_bool")]
pub is_default: bool,
#[serde(deserialize_with = "from_capital_bool")]
pub forms_switchable: bool,
The from_capital_bool
function signature is already defined for us by serde and is shown in the docs. The only part we get to choose is the type inside the returned Result, which is bool
here because that is the value we'll be parsing out.
Here's the whole function. You'll also need to add de
to the serde use item at the top of the file.
use serde::{de, Deserialize};
fn from_capital_bool<'de, D>(
deserializer: D,
) -> Result<bool, D::Error>
where
D: de::Deserializer<'de>,
{
let s: &str =
de::Deserialize::deserialize(deserializer)?;
match s {
"True" => Ok(true),
"False" => Ok(false),
_ => Err(de::Error::custom("not a boolean!")),
}
}
The function signature from the docs is
fn<'de, D>(D) -> Result<T, D::Error> where D: Deserializer<'de>
The function signature we use reads as:
The function from_capital_bool
, which makes use of a lifetime named 'de
and some type D
, accepts an argument named deserializer
that is of type D
, and returns a Result
where a successful deserialization ends up being a bool
type and a failure is the associated Error
type that the D
type defines.
Additionally, D
must implement the Deserializer
trait, which makes use of the same 'de
lifetime we talked about earlier.
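If the shape of that signature is new, it may help to see the same pattern with a simpler trait. The sketch below uses the standard library's FromStr instead of serde's Deserializer, so it is an analogy rather than serde's API. The function name parse_field is hypothetical.

```rust
use std::str::FromStr;

// Reads much like from_capital_bool: some type T, an argument, and a Result
// whose error is the associated Err type that T's FromStr implementation defines.
fn parse_field<T>(text: &str) -> Result<T, T::Err>
where
    T: FromStr,
{
    text.trim().parse::<T>()
}

fn main() {
    // The annotations tell Rust which T to use for each call.
    let hp: Result<u8, _> = parse_field(" 45 ");
    let flag: Result<bool, _> = parse_field("true");
    println!("{:?} {:?}", hp, flag);
}
```

Just as D::Error depends on which deserializer we are handed, T::Err here depends on which type we ask for.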
As it happens, serde has an entire page explaining why the 'de
lifetime is like this, and what the D: de::Deserializer<'de>
trait bound is useful for.
The short version is that this new (to us) usage of lifetimes and generics is responsible for safely ensuring the ability to create zero-copy deserializations, which is some advanced Rust. We haven't done that in our PokemonCsv
struct, but we could.
The usage of the 'de
lifetime means that the input string that we're deserializing from needs to live at least as long as any data the resulting struct borrows from it.
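To get a feel for what zero-copy means without involving serde, here is a sketch of a struct that borrows from its input instead of copying it. The NameAndColor struct and parse_line function are hypothetical, and the lifetime 'a plays a role similar to 'de.

```rust
// A hypothetical struct that points into its input rather than owning Strings.
#[derive(Debug, PartialEq)]
struct NameAndColor<'a> {
    name: &'a str,
    primary_color: &'a str,
}

// The lifetime 'a ties the returned struct to the input line:
// the struct cannot outlive the string it points into.
fn parse_line<'a>(line: &'a str) -> Option<NameAndColor<'a>> {
    let mut parts = line.split(',');
    let name = parts.next()?;
    let primary_color = parts.next()?;
    Some(NameAndColor { name, primary_color })
}

fn main() {
    let line = String::from("Bulbasaur,green");
    // `parsed` borrows from `line`, so no string data is copied.
    let parsed = parse_line(&line);
    println!("{:?}", parsed);
}
```

Our PokemonCsv uses owned String fields, so it copies the data and does not depend on the input staying alive.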
Overall, as it turns out, we're doing this so that we can take advantage of the csv crate's implementation of Deserializer
to deserialize the string "True" or "False" from the csv's values.
Then we can directly match on that string value and turn "True"
into true
and "False"
into false
. If for some reason we've annotated the wrong field with this function and we get something that isn't one of those two strings, we fail with a custom error message.
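The matching step on its own can be tried out without serde. This sketch uses a hypothetical parse_capital_bool function and a plain String error in place of serde's custom error.

```rust
// The same match as in from_capital_bool, minus the serde plumbing.
fn parse_capital_bool(s: &str) -> Result<bool, String> {
    match s {
        "True" => Ok(true),
        "False" => Ok(false),
        // Anything else means this field did not hold a capitalized boolean.
        other => Err(format!("not a boolean: {}", other)),
    }
}

fn main() {
    println!("{:?}", parse_capital_bool("True"));
    println!("{:?}", parse_capital_bool("False"));
    println!("{:?}", parse_capital_bool("Grass, Poison"));
}
```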
Keep in mind that the previous explanation of lifetimes and generics is something we could have avoided entirely if we wanted to. We could have mapped over the StringRecord
s and manually constructed the PokemonCsv
s ourselves, never having touched serde.
We could have also cleaned up the csv data before attempting to parse it at all, manually switching out True
for true
and False
for false
. I've chosen to present you this deserialize_with
approach specifically because it brings up new concepts and that's what this course is all about: learning more about Rust little by little.
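For comparison, here is a rough sketch of what the manual approach could look like. It uses a slice of string slices as a stand-in for a StringRecord and a hypothetical three-field MiniPokemon struct, so it illustrates the idea rather than being a drop-in replacement.

```rust
// A trimmed-down, hypothetical version of PokemonCsv with three fields.
#[derive(Debug, PartialEq)]
struct MiniPokemon {
    name: String,
    pokedex_id: u16,
    genderless: bool,
}

// `fields` stands in for one row of string values, such as a StringRecord.
fn from_fields(fields: &[&str]) -> Result<MiniPokemon, String> {
    if fields.len() != 3 {
        return Err(format!("expected 3 fields, got {}", fields.len()));
    }
    // Every conversion that serde would derive has to be written by hand.
    let pokedex_id = fields[1]
        .parse::<u16>()
        .map_err(|e| format!("bad pokedex_id: {}", e))?;
    let genderless = match fields[2] {
        "True" => true,
        "False" => false,
        other => return Err(format!("not a boolean: {}", other)),
    };
    Ok(MiniPokemon {
        name: fields[0].to_string(),
        pokedex_id,
        genderless,
    })
}

fn main() {
    println!("{:?}", from_fields(&["Bulbasaur", "1", "False"]));
}
```

With more than forty fields in the real struct, this gets repetitive quickly, which is the work the derive macro saves us.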
We're left with the output being a PokemonCsv
now.
PokemonCsv {
name: "Bulbasaur",
pokedex_id: 1,
abilities: "Overgrow, Chlorophyll",
typing: "Grass, Poison",
hp: 45,
attack: 49,
defense: 49,
special_attack: 65,
special_defense: 65,
speed: 45,
height: 7,
weight: 69,
generation: 1,
female_rate: Some(0.125),
genderless: false,
is_legendary_or_mythical: false,
is_default: true,
forms_switchable: false,
base_experience: 64,
capture_rate: 45,
egg_groups: "Monster, Plant",
base_happiness: 70,
evolves_from: None,
primary_color: "green",
number_pokemon_with_typing: 15.0,
normal_attack_effectiveness: 1.0,
fire_attack_effectiveness: 2.0,
water_attack_effectiveness: 0.5,
electric_attack_effectiveness: 0.5,
grass_attack_effectiveness: 0.25,
ice_attack_effectiveness: 2.0,
fighting_attack_effectiveness: 0.5,
poison_attack_effectiveness: 1.0,
ground_attack_effectiveness: 1.0,
fly_attack_effectiveness: 2.0,
psychic_attack_effectiveness: 2.0,
bug_attack_effectiveness: 1.0,
rock_attack_effectiveness: 1.0,
ghost_attack_effectiveness: 1.0,
dragon_attack_effectiveness: 1.0,
dark_attack_effectiveness: 1.0,
steel_attack_effectiveness: 1.0,
fairy_attack_effectiveness: 0.5,
};
Dealing with Multiple values
Finally, we can see that a few of these fields actually hold multiple values. For example, abilities
is the string "Overgrow, Chlorophyll"
, which is two abilities.
We can take the same approach we just did for the capital booleans to turn these array-strings into Vecs. Instead of returning bool
we'll return a Vec<String>
from our new from_comma_separated
function.
We can use split
to turn the string values into an Iterator over string slices (&str
), which are views into the original string. Then we can filter out any potentially empty strings using filter
and .is_empty()
and finally map over those views to turn them into owned String
s, and collect
into a Vec
.
.collect()
infers that its type should be Vec<String>
from the function signature, so we don't need to additionally specify it.
fn from_comma_separated<'de, D>(
deserializer: D,
) -> Result<Vec<String>, D::Error>
where
D: de::Deserializer<'de>,
{
let s: &str =
de::Deserialize::deserialize(deserializer)?;
Ok(s.split(", ")
.filter(|v| !v.is_empty())
.map(|v| v.to_string())
.collect())
}
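The iterator chain can be tried on its own, outside of serde. In this sketch split_comma_separated is a hypothetical name for the same split, filter, map, and collect steps.

```rust
// The same iterator chain as from_comma_separated, minus the serde plumbing.
fn split_comma_separated(s: &str) -> Vec<String> {
    s.split(", ")
        // Splitting an empty string yields one empty piece, so drop empties.
        .filter(|v| !v.is_empty())
        // Turn each borrowed &str view into an owned String.
        .map(|v| v.to_string())
        // The return type tells collect to build a Vec<String>.
        .collect()
}

fn main() {
    println!("{:?}", split_comma_separated("Overgrow, Chlorophyll"));
    println!("{:?}", split_comma_separated("Monster"));
    println!("{:?}", split_comma_separated(""));
}
```

The filter matters for empty fields: without it, an empty string would become a Vec holding one empty String instead of an empty Vec.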
With our new from_comma_separated
function set up, we can put the deserialize_with
on any types we want to deserialize into a Vec<String>
.
#[serde(deserialize_with = "from_comma_separated")]
pub abilities: Vec<String>,
#[serde(deserialize_with = "from_comma_separated")]
pub typing: Vec<String>,
// ...
// egg_groups is further down in the struct
#[serde(deserialize_with = "from_comma_separated")]
pub egg_groups: Vec<String>,
And now we have a fully deserialized PokemonCsv
struct for every row in the csv.
cargo run --bin upload-pokemon-data