We discuss parsers, how to build them in Elm, and how to try to make your error messages as nice as Elm's.
May 25, 2020

What is a parser?

  • yacc/lex
  • AST (Abstract Syntax Tree) vs. CST (Concrete Syntax Tree)
  • JSON decoding vs. parsing
  • JSON decoding is validating a data structure that has already been parsed. Assumes a valid structure.
  • elm/parser
  • Haskell Parsec library - initially used for the Elm compiler, which now uses a custom parser

What is a parser?

  • One character at a time
  • Takes input string, turns it into structured data (or error)
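
As a rough illustration of the two bullets above, here is a minimal sketch using Parser.run and Parser.int from elm/parser. The module name and the names good and bad are made up for this example.

```elm
module RunExample exposing (bad, good)

import Parser exposing (DeadEnd)


-- Parser.run takes a parser and an input string and returns a Result:
-- structured data on success, a list of dead ends on failure.
good : Result (List DeadEnd) Int
good =
    Parser.run Parser.int "42"


-- The input does not start with a digit, so this is an Err.
bad : Result (List DeadEnd) Int
bad =
    Parser.run Parser.int "forty-two"
```

Here good is Ok 42, while bad is an Err carrying the position and kind of problem.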



Elm regex vs elm parser

Indications that you might be better off with parser

  • Lots of regex capture groups
  • Want very precise error messages

Getting source code locations


  • Loop docs in elm/parser
  • Looping allows you to track state and parse groups of expressions
  • Loop over repeated expression type, tell it termination condition with Step type (Loop and Done)
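
The looping bullets above could look something like the following sketch, which parses whitespace-separated integers with Parser.loop. The names intList and intListHelp are hypothetical; the overall shape follows the pattern shown in the elm/parser documentation for loop.

```elm
module LoopExample exposing (intList)

import Parser exposing ((|.), (|=), Parser, Step(..), int, loop, oneOf, spaces, succeed)


-- Parse zero or more integers separated by whitespace, e.g. "1 2 3".
intList : Parser (List Int)
intList =
    loop [] intListHelp


-- The loop state is the list of integers seen so far, in reverse order.
-- Loop means "go around again with this new state"; Done means "stop with this result".
intListHelp : List Int -> Parser (Step (List Int) (List Int))
intListHelp reversed =
    oneOf
        [ succeed (\n -> Loop (n :: reversed))
            |= int
            |. spaces
        , succeed (Done (List.reverse reversed))
        ]
```

The first branch always consumes at least one digit before looping, so the loop cannot spin forever on the same position.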

Error Messages

Getting Started with a Parser Project

There's likely a specification doc if you're parsing a language or formal syntax

Look at examples of parser projects

Look at elm/parser docs and resources


Hello, Jeroen.
Hello, Dillon.
How are you doing today?
I'm doing pretty well. How are you?
I'm good. And I am excited to chat with you about Elm Parsers.
And I understand you've done some weekend hacking using Elm Parser for the first time.
So I'm excited to hear about your experience with that.
Yeah, I tried it. Before I tried it, I ran a Twitter poll asking people,
Do you think I will have fun or do you think I will have a lot of pain?
And I was kind of surprised that only four people out of 25 replied that I will have pain.
So most people would think I was going to have fun.
And I kind of heard the opposite. So I was pretty surprised.
I found it to be quite easy. But then you get into pitfalls and then it's hard to figure things out.
Yeah, I guess there's the question of how hard is writing a parser in Elm compared to doing other things in Elm?
And then there's the question of how hard is writing a parser in Elm compared to writing a parser in another language?
I would say for the latter, Elm makes it really nice to write parsers.
For the question of how hard is writing a parser in Elm compared to doing other things?
I mean, I've got my take. What was your initial experience with that? Was it pretty intuitive?
Did you run into a lot of things that were confusing and surprising?
Well, I didn't get too far. I just basically tried to do an Elm code parser because that's where my interests lie.
Just for fun. I don't want to replace elm-syntax.
Right. But you depend heavily on an Elm parser project for elm-review, so you've spent a lot of time with the result of a parser.
Yeah. And you've thought a lot about the syntax tree that that gives you.
And now you're looking at the other side of how you traverse the raw source code to build up that data structure.
Yeah, I wrote a compiler back when I was a student. So I do have some experience with parsing.
I think I used Yacc or Lex. I don't know. All the things.
I think you use both. I think that Lex is a lexer and Yacc is a parser, and you have to, yeah, find the tokens.
Yeah, exactly. And I did the exact same thing in college.
And I had a lot of fun doing that.
Me too. And that was like using C or C++ and it was still fun.
Yeah. Yeah, I had a lot of fun doing that project too. Writing parsers and writing languages were always something that I quite liked.
I still think I will write a language parser at some point just for fun, just for kicks.
It is satisfying.
It is very satisfying, I think.
I mean, when you're working with Elm and you can parse things and then having parsed into a nice data structure, you can then use that data structure in Elm and do case statements on this well defined data type.
That's really satisfying. So, OK, before we get too far into this, talking about our experiences with parsers and all of that, let's get a definition.
What is a parser and what is Elm parser, which we're talking about today?
Yeah. So the way I understand it is a parser is something that takes a raw string and then decodes it into something else.
A data structure, an abstract syntax tree, concrete syntax tree.
Yeah. And then you do whatever you want with it.
So you compile a language or you just extract some information like Richard Feldman's ISO 8601.
Yes, you got it.
I trained so hard before this podcast.
That package extracts date information from a string, and elm-syntax extracts the abstract syntax tree of Elm code.
Right. The input is like a string, like your Elm.
If you have an Elm source file, an Elm module, and you feed it to stil4m/elm-syntax, this parser project, you give it a string and then it takes that string and it either fails to parse or it gives you nicely structured data, which represents the abstract syntax tree of Elm.
And OK, so we should probably define an abstract syntax tree.
Yeah. So an abstract syntax tree. I don't know what it is. I've never played with it.
It's a little abstract. Yeah. You haven't spent any time with abstract syntax trees, have you?
Yeah. So little. Only five years of my life, something like that.
When you've got your code, your Elm code, for instance, it is one giant string.
But you have keywords and expressions, A plus B, and they all mean something. And the meaning is represented often as an AST, an abstract syntax tree, where we have removed all the unnecessary information like spacing or the limitation of elements where that doesn't matter.
What you get is some kind of representation, often as a tree, that represents what the code means.
And then you try to do whatever you want with it.
Right. So it's still you don't have things like how much white space something had, if something was on a new line.
But if something is defined in a let or as a top level value, that's part of the syntax tree that here's a let. It has these bindings to these expressions.
That's part of the abstract syntax tree.
Exactly. And if you want the white space information, then what you're dealing with is a concrete syntax tree, and there is none for Elm at the moment, as far as I know.
Right. And a concrete syntax tree might be useful if you're building editor tooling that needs to be able to recreate your exact source code.
Whereas an abstract syntax tree, you lose information about how it was written, but you preserve all the information about what the code means in order to execute or compile it.
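
To make the abstract syntax tree idea concrete, here is a small sketch of what such a data type could look like in Elm. The Expr type and its constructor names are invented for illustration; they are not taken from elm-syntax.

```elm
module Ast exposing (Expr(..), aPlusB, variables)


-- A tiny, hypothetical expression tree. Whitespace and layout are not recorded.
type Expr
    = IntLiteral Int
    | Variable String
    | Plus Expr Expr


-- Both "a+b" and "a   +   b" would be represented by this same value.
aPlusB : Expr
aPlusB =
    Plus (Variable "a") (Variable "b")


-- Once the source is a data structure, you can use a case expression on it.
variables : Expr -> List String
variables expr =
    case expr of
        IntLiteral _ ->
            []

        Variable name ->
            [ name ]

        Plus left right ->
            variables left ++ variables right
```

A concrete syntax tree would additionally have to keep spacing, comments, and similar details so the original text could be reproduced.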
So these terms are helpful and they're useful concepts if you're building a parser.
But, you know, the ultimate point is you're turning this source code or some sort of string into some structured data, much like you would do with a JSON decoder.
Except that the data is different in JSON, even if it's stringified, you get things by name, usually for when it's a record or JSON object.
But with parsing, it's always about ordering. So you get this, then you get that, then you get this. It's what I'm getting.
Right. You expect some kind of order, some kind of syntax.
Yes. And if those expectations are not met, then you have a parsing failure.
That is a really great point. And that's a great way to frame the distinction.
So, okay, so JSON decoder, you have something which has already been parsed, actually.
I mean, you could do like Json.Decode.decodeString and you could give it malformed JSON.
And Elm is going to say, I couldn't parse this. So technically it does the parsing step somewhere in there.
But it's basically checking, I mean, under the hood, I'm guessing it does JSON.parse.
Right. Okay. So it does JSON.parse. So like the browser is saying, hey, I'm going to take this string and I'm going to parse it for you into well formed JSON.
So you have this structure that's already been pieced together. So now you have this structure that has JSON data.
It has fields, it has values, and the types are well defined, matching this sort of JSON specification.
So you know things about the structure because you've parsed it successfully as JSON.
So in a JSON decoder, as you're saying, you can sort of reach in and say, I want this field.
And so parsers are very different because parsers, you're going through one character at a time and eating the symbols to define what the structure of the content is.
And with JSON, you're sort of dealing with this data type that's already been parsed into a sort of structure and you're making assertions about the shape of that data.
Yeah, you're basically already hitting a dictionary with JSON.parse, for records at least.
Yes, exactly. Yeah. So you're doing some sort of validations on it.
And in a sense, writing a parser, you're doing validations because your parser could fail or succeed. And if it succeeds, then you're going to end up with data of a certain type.
But the similarity sort of ends there. A parser is a different category because it's processing things in a way where it's stepping through each character and building up some structure.
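
The contrast between decoding by name and parsing by order can be sketched as follows. The input strings and the names fromJson, fromText, and xParser are assumptions made up for this example.

```elm
module DecodeVsParse exposing (fromJson, fromText)

import Json.Decode as Decode
import Parser exposing ((|.), (|=), DeadEnd, Parser, int, spaces, succeed, symbol)


-- Decoding: the JSON is parsed first, then we reach in by field name.
-- The position of "x" inside the object does not matter.
fromJson : Result Decode.Error Int
fromJson =
    Decode.decodeString (Decode.field "x" Decode.int) "{\"y\": 2, \"x\": 1}"


-- Parsing: we walk the raw text from left to right, so order is everything.
xParser : Parser Int
xParser =
    succeed identity
        |. symbol "x"
        |. spaces
        |. symbol "="
        |. spaces
        |= int


fromText : Result (List DeadEnd) Int
fromText =
    Parser.run xParser "x = 1"
```

Both values come out as Ok 1, but the decoder looked a field up while the parser consumed characters in sequence.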
So I think that's a pretty good introduction to the general concepts we're working with. What is a parser? What's an abstract syntax tree? The distinction between JSON decoders and parsers.
So maybe let's dive into the building blocks a little bit that you use to actually define these things and how they work in a way that's going to be familiar if you've done JSON decoders and in a way that's going to feel new if you've done JSON decoders.
Yeah. So what do you mean by Elm Parser? Specifically the elm/parser library. Yes. Which is the official and, I think, default parser for Elm?
Oh, yeah. I mean, it's a really lovely library. It's a unique take on parsing. And Evan built this library. He has a lot of experience working with parsers, having spent a lot of time maintaining the Elm parser itself in Haskell.
Yeah. And I believe he took this Haskell parser library, Parsec, and initially built the Elm parser using that and then found he wanted to do things a little differently, both for performance reasons and for maintainability and kind of built his own tool on top of that.
Or maybe just from scratch and learned some lessons and applied those lessons to the Elm parser library.
Yeah, you really get the feeling that this is made to be very performant. Right. Kind of like painfully so sometimes. But when things work, things are performant right out of the box.
Unless you use some construct that was advised against, I guess. Right. And we'll get into some of those topics like backtracking.
So like, what was your initial experience? You dived in, you tried writing your first parser. Did you get something to work initially in an intuitive way or did you have to try things out for a while?
No, the first few things were very easy, very intuitive. So basically what I tried to do was A equals one with some spacing.
Basically, you already had examples doing that. So I was looking for a variable, then some spaces, a symbol, the equal sign, some spaces again, potentially you can ignore those.
And then some values. In this case, it was just an integer. So I don't remember what it was for the string, though. But the other things were very simple.
Right. OK, so you were able to pretty intuitively get that functioning and get it successfully parsing A equals one.
Yeah, I don't think I'd even hit an error at that point. So yeah, quite intuitive.
Yeah. And I guess there's like a helper that lets you define like an identifier that's.
Yeah, I think that's what I used.
Which I mean, really, it's not that hard to define yourself, but there are certain rules like you can have numbers in an identifier name, but they can't be the first letter.
So you can have A123 equals something, but you can't have 123A equals something because an identifier must start with A through Z, lowercase A through Z in the case of Elm.
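
The identifier rule described here could be written with Parser.variable roughly like this. The name identifier and the particular reserved words are choices made for this sketch.

```elm
module Identifier exposing (identifier)

import Parser exposing (Parser, variable)
import Set


-- An identifier starts with a lowercase letter and continues with
-- letters, digits, or underscores. Reserved words are rejected.
identifier : Parser String
identifier =
    variable
        { start = Char.isLower
        , inner = \c -> Char.isAlphaNum c || c == '_'
        , reserved = Set.fromList [ "let", "in", "case", "of" ]
        }
```

With this definition, "a123" parses to Ok "a123", while "123a" fails because a digit is not an allowed first character.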
Yeah, you kind of list the steps of what you're expecting. So that's the order that matters. Kind of like a decoding pipeline where you say Decode.succeed.
You do Parser.succeed and then the function that takes the extracted information.
And then you do pipelines where you say the first thing that I expect is an identifier, then expect spaces, then expect an equal sign, et cetera.
So in that sense, it really feels familiar to decoding when you're used to the decoding pipeline or using Decode.map2, map3, et cetera.
Right. When all goes well, when you're on that happy path and you're finding the tokens that you expect to define sequentially, you say, I expect zero or more spaces.
I expect an equal symbol here. Then that all works as you'd expect. So let's stay on that happy path for a little bit before we veer off of it.
Yeah. OK, so when you're on that happy path. So first of all, you mentioned starting with parsing, was it like variable? Is it Parser.variable? Is that the helper for that?
I can't remember. There's some sort of. Yeah. It's really not the most important detail.
But the point is that the library happens to give you a pretty small helper function that defines something for parsing variables, which you could very easily build yourself.
But so it is Parser.variable. OK, great. So we have Parser.variable. And then now we have these, I think of them as keep and discard.
Yeah, I was thinking there's like a nicer four letter word for both. But keep and discard is good.
But there's this pipe equals, the vertical bar equals operator (|=), and there's vertical bar dot (|.). So these are operators that you chain, just like you would in a JSON decoder pipeline style.
There you do pipe greater than (|>) and pipe a bunch of things through. But here you do pipe equals or pipe dot.
If you do pipe equals, it's going to capture the result of that parser and include it when you put the results together. And if you do pipe dot, it's going to say.
So, for example, if you're doing whitespace, you want to discard it. That's going to be a pipe dot because you say, well, I want to get past this whitespace or it's fine if there's no whitespace.
I expect some potentially, but I don't care about it.
I don't want to use that raw input for something. But if it's like a variable name, you want to get that value and you're going to put that in some data structure that says this is an assignment expression or in a I guess it's a statement in Elm, isn't it?
When it's a binding declaration, a declaration.
And you need the name of that variable so you can have that in your data structure. So you could have like a let binding where it's you have some string that's the variable name and then some expression that's that it's bound to.
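
Putting the pipeline and the keep (|=) and discard (|.) operators together, a parser for something like a = 1 might be sketched as below. The Assignment record and the name assignment are hypothetical.

```elm
module Assignment exposing (Assignment, assignment)

import Parser exposing ((|.), (|=), Parser, int, spaces, succeed, symbol, variable)
import Set


type alias Assignment =
    { name : String
    , value : Int
    }


-- (|=) keeps the parsed value and feeds it to Assignment;
-- (|.) runs the parser but throws its result away.
assignment : Parser Assignment
assignment =
    succeed Assignment
        |= variable { start = Char.isLower, inner = Char.isAlphaNum, reserved = Set.empty }
        |. spaces
        |. symbol "="
        |. spaces
        |= int
```

Running Parser.run assignment "a = 1" gives Ok { name = "a", value = 1 }. The whitespace and the equals sign are checked but never reach the record.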
Yeah, let's talk about building blocks afterwards.
So the next thing I tried was to have other kinds of expressions. So A equals one. So one is an integer, but potentially it could be string.
So I tried making a data structure that could accommodate both integers and strings or floats or Booleans, whatever.
So I extracted the integer parser, which was just Parser.int, to a new function for parsing expressions, or a different parser, I guess, because it's not really a function as it is declared.
That's right. Just like a JSON decoder is a decoder. It's like a value of type decoder.
Yeah, it is probably a function under the hood, but maybe.
Right. As far as you know, it could be some magical value that just does the right thing, although it's a sort of hint that it must be a function somewhere because you can do map.
And if you can do map, it's got to store your function somewhere.
So that is a good point. Very good. But we'll never know. Just.
That's right.
I extracted that to a parser expression, expression parser, sorry, where I say Parser.oneOf, so the expression that I tried to decode is either an integer or something else, a float or string, whatever.
I tried a string and that worked fine.
Right. Right. And this is going to feel very familiar to people coming from some experience with JSON decoders and other similar techniques in Elm, where you just do oneOf and you can combine these things together and they'll try something until it succeeds.
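
An expression parser along these lines could be sketched with Parser.oneOf as follows. The Expr type and the simplified stringLiteral (no escape sequences) are assumptions for illustration only.

```elm
module Expression exposing (Expr(..), expression)

import Parser exposing ((|.), (|=), Parser, chompWhile, getChompedString, int, map, oneOf, succeed, symbol)


type Expr
    = IntLiteral Int
    | StringLiteral String


-- Try each alternative in order until one succeeds.
expression : Parser Expr
expression =
    oneOf
        [ map IntLiteral int
        , map StringLiteral stringLiteral
        ]


-- Simplified string literal: everything between two double quotes.
-- Escapes such as \" are deliberately not handled in this sketch.
stringLiteral : Parser String
stringLiteral =
    succeed identity
        |. symbol "\""
        |= getChompedString (chompWhile (\c -> c /= '"'))
        |. symbol "\""
```

This works because the two alternatives start with different characters, so the first one fails before consuming anything when the input is a string.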
So at least when you're on the happy path, it's going to feel like a very familiar concept.
And by the way, what you're talking about of sort of defining something that deals with this one part of the parsing as a separate top level parser.
I spent a lot of time writing elm-markdown because it turns out Markdown is a very, very large specification that has a lot of different cases.
So like two days worth of work or something.
Times some number. Yeah.
Huge number. Yeah, I'd say so.
And I've found that to be extremely helpful to extract to give yourself the building blocks for your specific parsing domain, because that's the thing is, in a way, in some cases, parsing feels high level.
You know, you do Parser.oneOf, Parser.map.
Those things feel very high level, but then there are certain things which we're about to get into, which are very low level where you have to go one character at a time.
And the thing is, you can define what you need. You say, I wish I had something that could parse in this way.
Give yourself that tool. So you keep talking about staying on the happy path.
Are we going to walk off the happy path? I'm feeling so ominous.
There's a dark cloud off of the happy path. Can you can you hear the thunder?
What is the name of the unhappy path? Well, are you afraid of commitment, Jeroen?
I'll tell my girlfriend I'm not. OK, OK.
Well, your experience with parsing may change that, because after having spent a lot of time writing parsers, it makes me fear commitment, because when you write a parser,
as soon as you chomp a value, as soon as you eat a character, you've committed down that path.
And so for me, at least when I first started with parsing, it took a while to get used to. And it almost felt like, wait a minute, this is like there's some state here that feels very unfamiliar because you're writing a JSON decoder and you know,
you do one of and you just throw a bunch of things at it. And as long as you have the right thing at the top of your one of.
It kind of works as you'd expect. If you put a succeed for a default case in the one of at the top, then it's not going to hit the other ones.
That's pretty intuitive, but it doesn't feel like there's this state that it's holding on to.
But as soon as you commit down a path with parsing, you've committed down that path.
So, yeah, that's what I noticed. That's where things got tricky too.
When I tried to parse a float, I was first trying to parse an integer. The integer failed and then it tried the float, but it had already gone past the initial numbers, is what I'm understanding.
So you are afraid of commitment.
Yeah. So you mentioned chomping. So that's a new word for me. And from what I'm getting is I'm imagining Pacman and I don't know if that's the right mental model.
So you chomp, you eat.
Waka, waka, waka, waka, waka.
Yeah, those little balls and then you'd never see them again because you can backtrack.
You can go back, but that ball has disappeared forever.
That's right. Exactly.
Well, in Elm Parser, the backtracking, you get it back, the ball.
Right. So perhaps if the default for Elm Parser were backtracking, it would feel more intuitive.
Yeah, it feels easier.
Yeah, when you're first starting, the behavior might match what you're expecting at first a little bit better.
But for performance reasons, it's not a good idea to make everything backtrackable. So by default, you know, there are these helpers like chompIf and chompWhile.
So if you say chompIf, it takes a function that is given a single character and returns True or False.
If you're parsing a float, maybe you say chompWhile it's one, two, three, four, five, six, seven, eight or nine or zero.
Perhaps you start with chompIf one through nine and you don't allow it to start with zero, or who knows what your syntax rules are.
But the point is, once you've started to chomp, once you chomp something and it succeeds.
So if you say chomp if and it gets false as the first chomp statement, then it's going to go down a different path.
Yeah, because you're decoding what? Because you're decoding A for instance, which is not expected.
Right. Exactly. Exactly. So if you're trying to decode either a float or an int or a string.
Or it could be a variable and it's just A, like you said, you start trying to parse a float.
You're expecting some sort of number first. Oh, hey, it's not a number. It's the letter A or it's a double quote.
Well, I'm not going to go down that path anymore. And that was your first step on that path, your first chomp.
And so you're good. You just don't even take a single step down that path.
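
The chomping helpers mentioned here can be combined roughly as in this sketch, which reads one or more digits and returns them as a string. The name digits is made up for the example.

```elm
module Digits exposing (digits)

import Parser exposing ((|.), Parser, chompIf, chompWhile, getChompedString)


-- chompIf requires exactly one character matching the test and fails otherwise.
-- chompWhile then consumes zero or more further matching characters.
-- getChompedString returns the slice of input those two steps consumed.
digits : Parser String
digits =
    getChompedString
        (chompIf Char.isDigit
            |. chompWhile Char.isDigit
        )
```

On "123abc" this yields Ok "123". On "abc" the very first chompIf fails without consuming anything, which is the "not even one step down that path" case.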
But as you were saying, if you're going to first see if I can parse this into a float and then if that fails,
then you're going to see if you can parse it as an integer.
Then you first have to chomp onto those integers at the beginning of the float.
So you start chomping, you chomp. If it's one, two, three, four, five, you chomp one, you chomp two, you chomp three.
And you say, OK, now chomp another numeric character or chomp a dot. And it chomps a dot and you're good.
But if it's not one, two, three, four, five, if it's just one, two, three and you're trying to chomp a float,
now you do chomp one, chomp two, chomp three. And then you reach a new line and it says, wait a minute,
I was expecting either another number or a dot. This isn't a float. And that, OK, that's OK.
It fails, but it's not going to go in your one of where you say, try doing a float.
If the float fails, try parsing into an integer. Not going to happen.
Because you've already taken a step down the float parser path. You're committed. That's committing.
Yeah. So when you use Decode.oneOf, then it's really try this.
And if at any point it fails, it doesn't matter. We just go to the next one. And that is not the case with Parser.
Exactly. So that's what's unintuitive because it puts it in a particular state.
As soon as any chomping occurs, it could fail or it could succeed.
But if it partially succeeds, it's not going to hit the other cases in the one of because you've committed down that path.
And then the whole parser fails. And that's for performance reasons. So what do you do instead?
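
The commitment problem can be reproduced with a small sketch. The float and integer parsers below are hand-written from chompIf and chompWhile purely for illustration (elm/parser also ships its own int and float); all names here are hypothetical.

```elm
module Commitment exposing (Number(..), naiveNumber)

import Parser exposing ((|.), Parser, andThen, chompIf, chompWhile, getChompedString, oneOf, problem, succeed, symbol)


type Number
    = IntNum Int
    | FloatNum Float


digits : Parser ()
digits =
    chompIf Char.isDigit
        |. chompWhile Char.isDigit


-- Digits, a dot, then more digits.
floatLiteral : Parser Number
floatLiteral =
    getChompedString (digits |. symbol "." |. digits)
        |> andThen (\text -> fromMaybe (Maybe.map FloatNum (String.toFloat text)))


intLiteral : Parser Number
intLiteral =
    getChompedString digits
        |> andThen (\text -> fromMaybe (Maybe.map IntNum (String.toInt text)))


fromMaybe : Maybe Number -> Parser Number
fromMaybe maybeNumber =
    case maybeNumber of
        Just number ->
            succeed number

        Nothing ->
            problem "not a number"


-- Looks reasonable, but floatLiteral chomps "123" before failing on the
-- missing dot, so oneOf never gets to try intLiteral.
naiveNumber : Parser Number
naiveNumber =
    oneOf [ floatLiteral, intLiteral ]
```

With this sketch, "123.45" parses as FloatNum 123.45, but "123" produces an Err even though intLiteral on its own would accept it.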
There are two ways to approach this. One is just throw backtrackable on it.
You do parser dot backtrackable with your float parser. And what's it going to do?
It's going to eat the one, eat the two, eat the three. It finds an unexpected character, a new line.
Instead of a dot or more characters or more numerical characters. And it says, oh, I failed to decode this, but it's backtrackable.
So now it unwinds its commitment. It can go back on the path that it's already taken a few steps down because it's backtrackable.
And now it can try your integer parser. Yeah, but that is less efficient.
But that's less efficient because now it's stepped through the one, the two, the three, however many characters it's needed to in order to check if it's a float.
Now it has to revisit those characters to check if it's an integer.
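
A sketch of the backtrackable approach, again with hand-written and hypothetically named float and integer parsers, might look like this:

```elm
module Backtracking exposing (Number(..), number)

import Parser exposing ((|.), Parser, andThen, backtrackable, chompIf, chompWhile, getChompedString, oneOf, problem, succeed, symbol)


type Number
    = IntNum Int
    | FloatNum Float


digits : Parser ()
digits =
    chompIf Char.isDigit
        |. chompWhile Char.isDigit


floatLiteral : Parser Number
floatLiteral =
    getChompedString (digits |. symbol "." |. digits)
        |> andThen (\text -> fromMaybe (Maybe.map FloatNum (String.toFloat text)))


intLiteral : Parser Number
intLiteral =
    getChompedString digits
        |> andThen (\text -> fromMaybe (Maybe.map IntNum (String.toInt text)))


fromMaybe : Maybe Number -> Parser Number
fromMaybe maybeNumber =
    case maybeNumber of
        Just value ->
            succeed value

        Nothing ->
            problem "not a number"


-- backtrackable lets oneOf move on to intLiteral even though
-- floatLiteral already chomped some digits before failing.
-- The cost: those digits are scanned a second time.
number : Parser Number
number =
    oneOf [ backtrackable floatLiteral, intLiteral ]
```

Now "123" parses as IntNum 123, at the price of reading the same characters twice.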
Yeah. OK, so it's very important for performance reasons to avoid backtracking.
So how would we write that same parser? We could solve your problem and get your parser working with backtrackable, right? That would work fine.
But it's not optimal for performance. So how do we solve that problem without using backtrackable?
I'm imagining from the examples that I saw is that you would try to parse an integer, then expect potentially a dot and then some numbers.
And if you find the dots and some numbers, then it's a float and otherwise it's an integer.
So in both cases, you try to do the integers. And then if it's at least an integer, then try to go the extra path of finding a dot.
And if that doesn't work, then it's an integer. Exactly. That's exactly it. Yeah, you got it.
So conceptually, you know, it's not so complex to actually write the code for that.
It takes a little practice. But if you can wrap your head around what you need to do conceptually to avoid backtracking, then you're half of the way there, you know.
And the concept is exactly what you described. You have a parser where, instead of oneOf for your parsing, where you say, I expect this expression to be either a float or an integer, and you're doing oneOf for each of those cases,
you know, a oneOf list that contains each of those parsers. Instead of that, you're going to say try parsing it as an integer or float.
You have an integer or float parser. And what it's going to do is it's going to have a common parser that captures as many characters as it can until it finds some sort of signal
that tells it it's done and it's an integer. Like a new line would tell it, OK, this part of the source code that's an integer is done.
You can move on to parsing the next thing and we're all done here. Here's the value.
So basically, you try to group together the things that start with the same symbols.
Exactly, exactly. And then you do a continuation.
So you branch off. So you have a single parser that starts out capturing all of the integers.
It has that input and then it's going to continue either saying I'm done. It was an integer.
We're good. Or, oh, I see a dot. Now I know that it's a float.
I have the part that comes before the dot. Now I'm going to continue parsing the part that comes after the dot.
And now you commit down the float path, but you started by going down a path that connects to another fork in the road.
So, you know, it's very much like just taking a walk in the park.
You take a walk in the park and you see a sign that says, oh, there's either a float or an integer down that path.
But then the path splits. So you go down that path. You know it takes you to both of those places.
You know you want to go to one of those places, but you don't have to decide until you get to the next fork in the road.
So you follow the sign that says integers and floats this way. Oh, great. I know I want one of those because I'm looking at a numeric character.
You follow that path. Now you hit another fork in the road. That's when you hit either a new line or a dot.
Now you have to commit to either integer or float. You've started down that path and you know it's going to be a number.
And if you, you know, if you have one, two, three, a, now there's a problem and you've committed down that path.
And that's actually the desired behavior. Now you're able to give a message that says, hmm, I was expecting this to be an integer or a float.
But because I was going down the integer or float path, but then I saw this character a which doesn't really fit here.
So that's exactly what you want in terms of giving feedback to the user that there's some syntax error in their code.
And so now you've done it in a performant way because you've combined those two paths into a single path.
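
The shared-path idea, chomp the leading digits once and only then decide between integer and float, could be sketched like this. All names are hypothetical, and the grammar is deliberately minimal (no signs, no exponents).

```elm
module NumberParser exposing (Number(..), number)

import Parser exposing ((|.), (|=), Parser, andThen, chompIf, chompWhile, getChompedString, oneOf, problem, succeed, symbol)


type Number
    = IntNum Int
    | FloatNum Float


digits : Parser String
digits =
    getChompedString (chompIf Char.isDigit |. chompWhile Char.isDigit)


-- Common path first: read the digits that both kinds of number start with.
-- The fork comes afterwards: a dot means float, anything else means integer.
number : Parser Number
number =
    digits
        |> andThen
            (\whole ->
                oneOf
                    [ succeed (\fraction -> whole ++ "." ++ fraction)
                        |. symbol "."
                        |= digits
                        |> andThen toFloatNum
                    , toIntNum whole
                    ]
            )


toFloatNum : String -> Parser Number
toFloatNum text =
    case String.toFloat text of
        Just value ->
            succeed (FloatNum value)

        Nothing ->
            problem ("not a float: " ++ text)


toIntNum : String -> Parser Number
toIntNum text =
    case String.toInt text of
        Just value ->
            succeed (IntNum value)

        Nothing ->
            problem ("not an integer: " ++ text)
```

No character is visited twice here. An input such as "123." fails after the dot, which is the committed, well-located error the conversation describes.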
Yeah, I feel like that's a pretty good mental model and that does make it feel pretty approachable.
When I learned of all the backtracking and performance issues, I was thinking I will probably have to do some backtracking at some point.
But with this specific way of branching things, it all feels pretty doable in my mind.
So do you use backtrackable in elm-markdown?
Yes, we do use backtrackable in a few cases and we're trying to remove those.
But there are some cases where it's a lot of work to fold things into a common path.
You know, in the case we were talking about of this is a float or an integer if it starts with a numeric character.
That's a pretty straightforward one. But you can imagine as you're folding eight potential paths into one, it becomes more difficult.
And so there are a few cases where we're trying to remove those as much as possible.
And Folkert has been doing some really awesome work to improve the performance and making a bunch of pull requests, which has been so, so nice.
And he's been doing some benchmarking there, too, which, by the way, benchmarking.
If you're trying to build a parser project and it's like a non trivial parser, you know, it's not like parse a phone number and we use it in the UI once.
Right. Like, OK, you're not going to have any performance bottlenecks, even if you use backtrackable. If that gets the job done, it's not going to matter.
Yeah, but if you're building something that parses Elm syntax for Elm projects, you're going to notice a performance bottleneck if you're not doing these micro optimizations.
And so in cases like that, where performance is critical and where it's like a sort of community asset to have this parsing project,
I would recommend benchmark first before you make assumptions about what's going to improve performance.
After you've written it, you mean.
So, yes, that's a good question. Step one is definitely write tests, right?
Because if you are doing performance tuning before you have benchmarks, that's not good.
But what's even worse is to do performance tuning before you have tests. That's a nightmare.
I can't even imagine. And in general, with parsers, you want lots of tests, and testing parsers in Elm is really nice.
Like if you're just testing, I have this source code and I run it through this parser and I expect it to fail in this way here.
And I expect it to parse into this data structure here. It's really so easy. It's so much easier than manually testing it with parser projects.
There's no reason not to do lots and lots of tests because they're really fast to run. They don't have any side effects.
They're just very simple. Well, not simple, but very straightforward.
Yeah, they're functional, right?
The thing about testing is testing is inherently functional. And when you're testing in, you know, languages that are more, you know, imperative, that have side effects that have environment, you know, objects that have a bunch of state,
then you have to like mock things out and stub things and try to capture the side effects that have happened.
And so you use all these things that are very messy and brittle and they make you less confident about your test because you're inherently trying to take these things which are nonfunctional.
Like side effects and like global state and environment. And you're trying to make them functional as in with this input, I get this output.
Well, with Elm, that's all you have. So if you're writing a parser, that's all it is. You give it this input, this source code, which is just a string.
You get this data type. It's so easy. It's way easier than doing it without tests.
So I cannot recommend highly enough, whether it's a very complicated parser or a very simple parser, just write lots of tests.
And certainly before you do any performance tuning, write tests. And before you do performance tuning, even if you've written your tests, benchmark it to figure out where the bottlenecks are.
Yeah. Another good thing is that you work with building blocks. So you parse statements and inside of those you use a parser for expressions and you can unit test the expression parser and you can unit test the statement parser.
And then you can unit test the whole parser, but you can do it at the level that you need to.
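
A parser test of the kind described might look like the sketch below. It assumes the elm-explorations/test package is installed and tests the built-in Parser.int as a stand-in for your own building blocks; the module and suite names are arbitrary.

```elm
module ParserTests exposing (suite)

import Expect
import Parser
import Test exposing (Test, describe, test)


-- Each test is just: this input string should give this Result.
suite : Test
suite =
    describe "Parser.int"
        [ test "parses a run of digits into an Int" <|
            \_ ->
                Parser.run Parser.int "42"
                    |> Expect.equal (Ok 42)
        , test "fails on input that does not start with a digit" <|
            \_ ->
                Parser.run Parser.int "abc"
                    |> Expect.err
        ]
```

The same pattern applies at every level: swap Parser.int for an expression parser, a statement parser, or the whole-file parser.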
That's a really good point. Yeah. And one thing I like to do, we kind of demonstrated it on this live stream that I did with a couple of people who have done a bunch of great contributions to the Elm Markdown project.
We did a live stream where we implemented at least a lot of the functionality for the Markdown table parsing for the GitHub flavored Markdown spec.
And one of the techniques we use there, which I find makes it a lot easier to do this process, is to just like, I mean, this, I hope that people get sick of me saying this because if they do, then I've accomplished my goal, which is to drill it into people's heads.
Start with a hard coded success, get your tests passing as fast as possible. I mean, this is the basic sort of TDD concept of, you know, fake it till you make it where you make it dumb, then make it smarter.
Exactly. Exactly. Do the stupidest thing you could possibly think of the first thing that comes to mind. Use all the dirty tricks you can to get it green and then refactor.
But now you've got a starting point that you know works. It's all wired through. You have a test that's telling you if you have the expected result or not. Right.
So we did that in the live stream where we said, well, I expect if I had like a table like this, like what's the most basic case of a GitHub flavored Markdown table?
OK, that's our test case. And I would expect it to decode into, to parse into this data structure.
So we write that test. It's not compiling. And then what do we do? We write the data structure inline.
You hard code it. Exactly. We hard code it. And what what function do we use to hard code the result? Succeed? Succeed.
Yeah, succeed. Succeed is the key to success. I'm going to make that a T shirt.
I want one. Succeed in in monospace font. Succeed is the key to success. That's that's good.
I'm going to tweet that. I'm going to tweet that right after this. Give people a sneak peek of this episode.
But I really think it's a great tool because it lets you take a small step that, you know, you've got something to work towards and to iterate towards.
But you know that the types can all line up. You know that like here's the result I'm looking for and you can break it down into smaller steps.
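A minimal sketch of that first step, assuming a made-up `Table` type (the real elm-markdown data structure may well look different). The parser ignores its input and always succeeds with the value the first test expects:

```elm
module TableParser exposing (Table, tableParser)

import Parser exposing (Parser)


-- Hypothetical result type, invented for this sketch.
type alias Table =
    { headers : List String
    , rows : List (List String)
    }


-- Step one: hard-code the expected result with `succeed`.
-- The input is not inspected at all yet.
tableParser : Parser Table
tableParser =
    Parser.succeed
        { headers = [ "abc", "def" ]
        , rows = [ [ "foo", "bar" ] ]
        }
```

With this in place the test compiles and goes green, and each later step replaces one hard-coded piece with real parsing while the test keeps checking the result.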
So one thing I think we should talk about is when should you use a parser and when should you use something else like probably a RegEx?
Absolutely. So yes, I'm thinking when it gets very complicated. But that's I don't know any specific data points where you should say, oh, this is definitely parser material.
So when it gets complicated, use a parser. When it's not complicated, use RegEx. So try RegEx and then if it fails, if it doesn't work, try a parser.
Yeah, this is this is a great question. And I think there are a few things that come to mind for maybe some some code smells that might point you in the direction of parser.
So I like the idea that if you can throw together a RegEx in three minutes that does what you need and it's not super complicated, then great.
But let's say that you're working with a RegEx that you got from Stack Overflow, and it's starting to do really complicated things.
And perhaps you're capturing a lot of pieces, you have a lot of capture groups. So first of all, the API for dealing with capture groups is, you know, you don't get these nice sort of types where the Elm compiler says, oh, because of the way that you wrote this, you're going to get these types like you do with JSON decoders and things like that.
Yeah, it's just like, OK, maybe there are some strings here or maybe not. Like maybe there's a list of things you have to check.
And then you always get it as a string, which is not always what you want.
Yeah, exactly. So you have to come back around later and check, does this match this RegEx that it's a string or do some other checks on it?
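To make the capture group point concrete, here is a small sketch with elm/regex. The `key=value` pattern and the `captures` helper are made up for illustration. What comes back is a list of optional strings, so any further structure or typing has to be checked by hand afterwards.

```elm
module RegexCaptures exposing (captures)

import Regex


-- Capture groups of the first match of a hypothetical `key=value` pattern.
-- Every group comes back as a `Maybe String`, whatever it is meant to hold.
captures : String -> List (Maybe String)
captures input =
    let
        pattern =
            Regex.fromString "(\\w+)=(\\d+)?"
                |> Maybe.withDefault Regex.never
    in
    Regex.find pattern input
        |> List.head
        |> Maybe.map .submatches
        |> Maybe.withDefault []
```

Turning the second group into an `Int`, or noticing that it is missing, is left to the caller. A parser would return a typed value directly.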
If you're doing a lot of that, that's probably a smell that a parser might be a good fit. If you want to give very precise error messages.
That's definitely a sign that Elm parser is a good fit, I think. I mean, I think it's safe to say that Elm and Evan's work have been very influential in the broader software development community and set an example of they've kind of set the bar for good error messages.
And, you know, a lot of it's in our work too.
We have ruthlessly stolen the formatting and inspiration from Elm error messages. And so have we. We were inspired, and we were inspired so much that it made us steal ideas.
It's the ultimate sign of flattery, right? Yeah.
No, I mean, I think that Evan has really been influential in what good error messages can look like.
And definitely. He built Elm parser with that in mind, right? So Elm parser gives you some tools to give some really precise, high quality error messages.
So there's something else I've been wondering, because in Elm syntax, you get to play with expressions, statements or declarations, or both.
And you get to know the location of each element. So you know where this number expression appeared in, where the type signature happens to be.
How do you get that information? Is that something complex? Is that something that you do with parser advanced or?
It's actually ridiculously easy. That's one task that, yeah, so all you do is if you're trying to capture, like, let's take your example where you're saying A equals one, two, three.
When you have your parser that's either going to try an integer or float, then what you can actually do, you can say, actually, so this is my expression parser.
So it's going to be one of integer or float, which is that one parser we defined, right?
So you reach a fork in the road and that fork is all expressions this way.
So this is your starting point. You're standing at this big fork in the road that has all these paths that branch off for all the expression types.
Because you know you need an expression here. And so you have, OK, if you go down this path, it's going to be a float or integer.
If you go down this path, it's going to be a string. If you go down this path, it's going to be a variable.
Well, you can do something called get offset. Yeah. And get offset, you can just capture the value of get offset, and that's just going to give you the current offset into the source string.
I think there's also get row and get col. But they're actually equivalent.
You can derive it from that. But the point is that you just chain it on.
So we talked about these pipelines that you build where you say pipe equals and pipe dot.
If you say pipe equals, it's always going to succeed and just give you, you say pipe equals get row, pipe equals get column.
It's always going to succeed and give you the current row and column as integers.
And it's not going to chomp either. Exactly. It's just getting a hard coded value based on the state of the parser.
Exactly. It will never cause the parser to fail. It doesn't change the state of the parser at all. It just includes that value there.
So what you can do is, you know, you're taking a walk in the park, you get to this expression fork where you say, OK, I need to parse some sort of expression.
And when you're standing there at that point, you say, oh, let me grab the row and column number.
And now you just have that data and you include it as part of your expression. And then when you reach the end of that, one of those paths, you do the same thing.
So what you would do is you would say, OK, I'm doing like a let binding parser or I'm doing like, you know, a top level.
What's it called? A top level value parser, right? A top level declaration.
So I parse some sort of identifier like A and then I parse white space, then I parse equals, then I parse white space, then I parse an expression.
But before you parse that expression, like you could just include it in your expression parser where your expression parser is.
Get the current row and column. Run the expression parser, then get the current row and column, and then you just include that with your data.
So now your data is start offset, end offset, plus whatever expression data structure you had.
Yeah, that sounds pretty simple.
Quite straightforward. So that's that's a really nice feature.
Plus, I guess that you can just write a helper function that just says get location over a parser.
OK, that sounds pretty nice.
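A sketch of such a helper, assuming a made-up `Located` record. It uses `Parser.getPosition`, which yields the current row and column as a pair and never consumes input; `getRow`, `getCol` and `getOffset` could be used the same way.

```elm
module Located exposing (Located, located)

import Parser exposing ((|=), Parser)


-- Hypothetical wrapper type: a parsed value plus where it started and ended.
-- Positions are ( row, column ) pairs.
type alias Located a =
    { start : ( Int, Int )
    , value : a
    , end : ( Int, Int )
    }


-- Wrap any parser so its result carries source positions.
located : Parser a -> Parser (Located a)
located parser =
    Parser.succeed Located
        |= Parser.getPosition
        |= parser
        |= Parser.getPosition
```

For example, `Parser.run (located Parser.int) "123"` should report a start of `( 1, 1 )` and an end of `( 1, 4 )`, since rows and columns are counted from 1.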
Yeah, maybe one last building block to touch on before we move on to some other topics.
There's one more thing called parser dot loop. Did you encounter that at all in your...
I encountered it, but I haven't played with it.
Yeah, it's an important tool. I don't think we need to cover it in depth here.
I think people can look at examples and get a sense of it.
But just suffice it to say it exists.
It is a tool that you can use to solve certain problems where you where you need to keep track of context.
So like if you're doing a regular expression where you wanted to count the number of times that a certain character appears or where like a certain condition is met, regular expression isn't really going to do that for you.
It can't help you track state. It just executes, right?
But if you wanted to write a parser where you do that, then what you can do is you can use this helper called parser dot loop.
And what it does is it's effectively like a while loop, which feels weird to do in Elm, but it's an abstraction that feels similar to that.
And what you're doing is you're just calling this parser and you have a parser that either returns loop that says keep running the parser or the parser will parse into done.
If it parses into done, then the parser will stop. If it parses into loop, then it will continue and it maintains state.
So I think of it kind of like a fold expression in Elm where you can do like list dot fold L, compared to list dot map. With list dot map,
you're just going over every item and you don't have a context that you carry with you.
But parser dot loop allows you to retain context as you go through that parsing.
So basically you use parser dot loop when you have things that can be duplicated, like you can have several statements in an Elm code file.
And when you parse lists, then you have a certain number of elements that is undefined at the beginning.
So you loop through those and at every step of the way, you have a parser that says stop here because I found a closing bracket or continue because I found a comma or something.
Right. Yeah, yeah, you can you can do that and you can track state as you do that.
And that's I think the sort of significant thing about looping is it allows you to maintain that state.
And as you say, you you yeah, you can tell it when to terminate running a parser repeatedly until it finds some end condition.
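Here is a minimal sketch of `Parser.loop` for a bracketed list of integers. The state is the list of items collected so far, kept in reverse. The grammar is deliberately loose (commas are optional and a trailing comma is accepted) to keep the example short.

```elm
module IntList exposing (intList)

import Parser exposing ((|.), (|=), Parser, Step(..))


intList : Parser (List Int)
intList =
    Parser.succeed identity
        |. Parser.symbol "["
        |. Parser.spaces
        |= Parser.loop [] step


-- One iteration: either stop at the closing bracket (Done)
-- or parse one more item and keep going (Loop).
step : List Int -> Parser (Step (List Int) (List Int))
step revItems =
    Parser.oneOf
        [ Parser.succeed (Done (List.reverse revItems))
            |. Parser.symbol "]"
        , Parser.succeed (\n -> Loop (n :: revItems))
            |= Parser.int
            |. Parser.spaces
            |. Parser.oneOf [ Parser.symbol ",", Parser.succeed () ]
            |. Parser.spaces
        ]
```

The `Done` branch is the termination condition, and the value passed to `Loop` is the state handed to the next iteration.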
What kind of information would you gather, for instance, like if you're parsing a tuple, would you count the number of elements to see if there are more than three or something?
Yeah, yeah, I think you would. I think you would do that.
If you want a tuple with four elements to be a syntax error, for instance, then you could. OK.
Yes, exactly. Yeah, you could do that. And you could you can do that with parser dot problem.
But that's exactly right. So like if you yeah, if you just run a parser and you don't have this context from loop, you don't know how many times you've gone through it because you don't have any state.
So, yeah, exactly as you say, you say, I am going to parse a tuple, but I need to know how many items have I seen because if it's greater than three, then I'll fail.
So that's that's exactly how you would do that. You would do. Yes.
So you would do that, just like what you said. But if you couldn't do that, then you would not have a syntax error.
You would have a check. Yeah.
Because, hey, I have a tuple here. Is it bigger than three? OK, then I have a different problem.
Exactly. Exactly. So you'd have to do two passes: parse it into the raw syntax and then check for syntax errors in this thing that you parsed in a second step rather than.
Yeah, exactly as you say, you you can as you're parsing that tuple, you have the context of how many elements you found in that tuple and then you can fail.
And the way you fail is, you know, much like we have Json.Decode.fail.
We have parser dot problem. And that allows us to just say if you went down a path that led you here, give this error right now.
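The tuple example could be sketched like this. The loop state carries a count next to the collected items, and `Parser.problem` fails the parse once the count goes past three. The error text and the integer-only elements are assumptions made for the sketch.

```elm
module Tuple3 exposing (tuple)

import Parser exposing ((|.), (|=), Parser, Step(..))


tuple : Parser (List Int)
tuple =
    Parser.succeed identity
        |. Parser.symbol "("
        |. Parser.spaces
        |= Parser.loop ( 0, [] ) tupleStep


-- State: how many elements we have seen, plus the elements in reverse.
tupleStep : ( Int, List Int ) -> Parser (Step ( Int, List Int ) (List Int))
tupleStep ( count, revItems ) =
    if count > 3 then
        -- We only know this because the loop tracks the count.
        Parser.problem "Tuples can have at most three elements"

    else
        Parser.oneOf
            [ Parser.succeed (Done (List.reverse revItems))
                |. Parser.symbol ")"
            , Parser.succeed (\n -> Loop ( count + 1, n :: revItems ))
                |= Parser.int
                |. Parser.spaces
                |. Parser.oneOf [ Parser.symbol ",", Parser.succeed () ]
                |. Parser.spaces
            ]
```

Without the count in the loop state, the same check would have to happen in a second pass over the parsed list.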
Is it when the Elm compiler says, hey, I got something very confusing. Is that what it uses under the hood, like parser dot problem?
Because I'm expecting when you do parser dot one of this, it says, oh, I was expecting a this or that or this.
Right. Probably. So I'm not sure how similar because the Elm compiler's parser is written in Haskell.
I'm not sure how similar Evan's API that he built to do parsing in Haskell for the Elm compiler is to the Elm parser library.
I would imagine pretty similar. But yeah, that's that's the idea. You can you can do one of and then you can do parser dot problem as one of those.
So you so if you say it's either you say I am going to parse a value that's either a number or a string.
So try parsing a number. And of course, as we talked about, if you take a single step down any of those paths, you're committed.
So you you say, OK, try parsing a number. If that doesn't work out, try parsing a string.
If that doesn't work out, here's the problem. Hey, I expected to see either a number or a string.
And you put that as the last entry in the one of: one of parse number, parse string, parser dot problem.
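As a sketch, with a made-up `Value` type and a deliberately simple `quoted` string parser (no escape handling):

```elm
module NumberOrString exposing (Value(..), value)

import Parser exposing ((|.), (|=), Parser)


type Value
    = Number Int
    | Text String


value : Parser Value
value =
    Parser.oneOf
        [ Parser.map Number Parser.int
        , Parser.map Text quoted
        , Parser.problem "I expected to see either a number or a string"
        ]


-- A double-quoted string without escape sequences (simplified).
quoted : Parser String
quoted =
    Parser.succeed identity
        |. Parser.symbol "\""
        |= Parser.getChompedString (Parser.chompWhile (\c -> c /= '"'))
        |. Parser.symbol "\""
```

As far as I can tell from the elm/parser implementation, when no branch consumes input `Parser.run` returns one dead end per failed alternative, so the custom problem shows up alongside the built-in ones rather than replacing them.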
I expected a number or a string. Yeah. So what about syntax error messages?
So you know, parser dot problem, you can say whatever you want.
You can try and make it as helpful as possible with as much knowledge that you have.
How do you get other kind of error messages like when you do parser dot one of does it give you a nice error message like I just said before?
Or you have to write them yourself. Parser dot one of is basically going to give you whatever error it encounters first.
So in Elm parser terminology, that's a dead end.
I believe it's just going to hit a dead end and then stop.
And then I was expecting a colon or something. Yes.
So that's not really useful. Yeah.
So the way that you write very precise, expressive error messages with Elm parser to try to get the type of quality of error messages that you see in the Elm compiler,
the tools that were given to do that in Elm parser are in parser dot advanced.
So this module, parser dot advanced, as I think you saw, it pretty much mirrors the regular parser module.
But it's got some a couple of changes and a couple of extra functions.
That's what it's for. That was going to be one of my questions. Like, when do you use parser dot advanced?
Basically, it's because of the error messages. Is that it?
Exactly. So basically, when you do the regular parser module, when you run a parser, there's actually a hard coded list of problems.
There's like a problem type in the parser module.
So the type parser dot problem is a hard coded type that says I was expecting this token, I was expecting this type of value, or it could just be a string that says this was the problem.
If you say parser dot problem, it's going to be your custom problem string.
It just gives you a place to provide a string that gives an error. Right.
And if you use parser dot advanced now, that problem type is your own custom type.
So a problem could be a very expressive custom type that you define and you define how to build up that type as you build up your parsers.
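A small sketch of what that looks like with `Parser.Advanced`. The `Problem` constructors are invented for the example, and `Never` is used for the context type to say there is no context yet:

```elm
module CustomProblem exposing (Problem(..), number)

import Parser.Advanced as Advanced


-- Our own problem type instead of the built-in Parser.Problem.
type Problem
    = ExpectingNumber
    | InvalidNumber


-- No context yet, so the context type is Never.
type alias Parser a =
    Advanced.Parser Never Problem a


-- Advanced.int takes the problem to report when no number is found
-- and the problem to report when the number is malformed.
number : Parser Int
number =
    Advanced.int ExpectingNumber InvalidNumber
```

Each primitive in `Parser.Advanced` asks for the problem value to use on failure, which is how the custom type gets built up as the parser is assembled.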
But to to give good error messages, you need to know in what state you are.
So that's the other thing that parser advanced gives you.
Couldn't have segued better myself. So it allows you.
So if you look at the parser type that's defined in the parser module and that's defined in the parser advanced module,
you'll notice that there are some extra type variables in the advanced type for the context and the type of problem.
So in the regular parser module, the parser type has, as I said, a hard coded "this is the type of problem that you could have", and if you have a custom problem, it's a string.
In the parser advanced parser type, you have a custom type for your problems and you...
Isn't that convenient? A custom type for your problems? What I've needed my whole life.
Just simplify it to a custom type. I'd prefer it to be an empty tuple.
That would be nice, wouldn't it? There's another T-shirt idea.
We'll work on that. And then the other extra type variable that you have for the parser advanced parser type is for the context.
So you can have special context in your parser.
And so you can you can do something in context. So you could say in the context of parsing a let statement.
And so now you have this sort of stack of context that says, OK, I was I was in a let statement and within that let statement, I was in another let statement.
And then within that, I was trying to parse a list and then I encountered this error.
And so that's basically what the Elm compiler uses, a similar technique to provide you with more context that tells you exactly where the problem is coming from.
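A sketch of the context stack using `Parser.Advanced.inContext`. The `Context` and `Problem` types and the one-element list grammar are made up to keep it short:

```elm
module WithContext exposing (Context(..), Problem(..), singletonList)

import Parser.Advanced as Advanced exposing ((|.), (|=), Token(..))


type Context
    = InList


type Problem
    = ExpectingOpenBracket
    | ExpectingCloseBracket
    | ExpectingInt
    | InvalidInt


type alias Parser a =
    Advanced.Parser Context Problem a


-- Parses something like "[42]". Any failure inside is tagged with InList.
singletonList : Parser Int
singletonList =
    Advanced.inContext InList
        (Advanced.succeed identity
            |. Advanced.symbol (Token "[" ExpectingOpenBracket)
            |= Advanced.int ExpectingInt InvalidInt
            |. Advanced.symbol (Token "]" ExpectingCloseBracket)
        )
```

Running this on `"[x]"` should produce a dead end whose `problem` is `ExpectingInt` and whose `contextStack` contains `InList` together with the position where the list started. Nesting `inContext` calls grows that stack.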
OK, so you would probably use the regular normal parser module to do simple things like parsing a phone number.
If you try to parse a language, then you will probably want to use parser advanced.
Yes, exactly. Exactly. Because there's a lot more context and nesting of different types of expressions and that sort of thing.
So, yeah, exactly. I think that's a good rule of thumb. And actually, well, it's maybe a bit of a tangent, but I'm actually starting to wonder for my Elm Markdown parser,
whether I should just use the regular parser module, because Markdown is unique in that it's not supposed to fail.
There's no invalid Markdown. If you have some sort of like closing token that you forgot,
like you forgot the closing parenthesis for a link tag in your Markdown. Well, it's just a valid string literal instead of being an actual link block.
Yeah, you always have a fallback that is just regular string. Exactly. That's the smallest tangent that you've ever done.
My whole life is a tangent, Jeroen. This whole podcast is a tangent.
Yeah. OK, well, I think I know how to write an Elm syntax parser now. Just like The Matrix. You know, "I know Kung Fu."
I know Kung Fu. I know Elm parser. I know Elm parser.
If we can give people that feeling with some of our podcast episodes, then I will be happy. Yeah. Let us know.
Just tweet at us and say, I know Elm parser and we'll understand.
Yeah. You can also say I succeeded at parsing. The key to success is succeed.
Succeed is the key to success. Yeah, it's good. I think the monospace font is what what makes it.
So it's going to be a better T shirt. Yeah.
I think we've covered the basic building blocks pretty well.
And of course, there's there's always more to explore. You're always going to find more.
But my biggest advice of anything is please, please, please write unit tests.
If you're building a project, you will thank me later. It's very worth it. With that in mind,
I mean, if you learn one thing, write tests for your parser if you take one thing away.
But if you take two things away, maybe we should talk a little bit about some some things to keep in mind when you're starting a project.
I think one one thing that is really valuable is if you're writing a parser, there's a good chance that you're working with some sort of specification.
And if if there's a specification, there may be a formal specification document for it. Those are very helpful.
Like for for dates, there's some specification. What is it?
That number again? ISO 8601.
That's right. That's right. Oh, yes. That's right. That's the formal specification for that.
Yeah. So like for Markdown, there's something called the common mark specification.
GitHub flavored Markdown is an extension of that that builds off of that.
It's been very handy to be able to look through a formal description of it, and it's it's actually very thoughtfully put together.
So it's a very useful resource. Actually, for my own Markdown parser, I was able to steal the test suite from the Marked.js project,
which what they do is they take all of the Markdown specs and they actually run them as tests.
So they say, OK, the Markdown spec gives us all these examples of this Markdown input should give this HTML output.
I run all of those thousands of tests on dillonkearns/elm-markdown, and it is excellent.
It is so nice. It's like saved me so much time.
So use those resources if you can find them. And chances are, if you're writing a parser for something, there are probably good resources for that.
That's a great place to start.
I think another thing that's very useful is just looking at other people's Elm parser code.
There are starting to be more and more examples of this out there.
So you can take a look at dillonkearns/elm-markdown and we've done a live stream on that, too.
So that's another resource.
So where do you go to ask for help, or do you have any resources to troubleshoot your problems?
For sure. If you ask in the Elm Slack, there will be someone to help.
If I see someone ask a parser question, I will help them.
But I just write lots of tests. It's not easy, but I just write lots of tests and then I keep trying things until the tests pass.
I'm not a smart man, Jeroen, but I am good at writing tests.
Yeah, you're good at hacking is what I heard, too.
Well, it's not hacking if you have tests. Then it's very professional and refined.
It's performance. That's what it is. Performance.
Yes, it makes me happy. It makes me happy to have tests because hacking is just no fun if you don't have tests.
But just trying out a bunch of random things until something works with some tests telling you if it actually works.
Oh, bliss. Love it. Highly recommended.
Yeah, that's really been what I've done is just written a lot of tests and figured it out over time.
There are some helpful resources in the Elm Parser repository that kind of explain, give you like a conceptual overview of a few things.
So that's a good thing to look at. It kind of talks about backtrackable and those types of things.
Yeah, you can feel that Evan gave it quite a bit of love.
He did. It's really well written.
I would even say that it's one of the best things about Elm actually is the Elm Parser project.
So if you love Elm, try the parser.
Yeah, it opens up some really cool possibilities.
And I think that there's been, as with many of the really wonderful things about Elm, you see this feature in Elm.
And then there's this vibrant innovation going on in the ecosystem.
And I see Elm Parser as being the same thing that just creates this space for innovation where we see people doing some really cool things.
Yeah, you couldn't have Elm pages without the Elm Markdown Parser.
And I couldn't have Elm Review without the Elm Syntax Parser.
The Elm Syntax Parser. Exactly.
That it's opened up some very cool things.
Elm pages would still, I would still find it useful even if it was just using the Elm Explorations Markdown.
But you're right that I built my Markdown Parser because I wanted to do certain things in the context of like a static site where I wanted to render highly custom views in my Markdown.
So, yeah, look at examples.
Martin Janiczek has his elm-in-elm compiler, which is not fully completed, still a work in progress, but that's something to check out and you can look at his talk on that at Elm Europe.
Matt Griffith has a really cool project called Elm Markup, which is it's very different from Markdown in that Markdown is designed to never fail.
Elm Markup is designed to give you a well defined syntax that will fail in specific cases with nice error messages.
So Matt has actually done some really cool stuff with the parser there to both give you very nice error messages.
That's actually a great repository to look at if you want to learn how to do precise, expressive error messages.
But he also recovers from those errors gracefully so you can have partially rendered views so it can recover from errors and still present you with something when you're in like dev mode.
So you got a parser and errors.
Yeah, he gives you a parser, nice errors and fault tolerance.
So it's a fault tolerant parser with nice errors, which is very cool.
I recommend taking a look.
Tereza has a cool parser project where she does a YAML parser and she gave a talk at Elm Conf a couple of years back, three years back.
I don't think we're at this time.
But we'll link to that.
And it's a very nice introduction to some of the core concepts that we've talked about.
She's got like lots of great code examples in her slides, and I definitely recommend watching that.
Yeah, and I've got an example in an Ellie somewhere about a equals one.
Oh, yeah, we should link to that.
Yeah, definitely.
All right. Well, I think with that, let's free the people to go play around with Elm parser and build some cool stuff.
Maybe we'll see some cool innovations popping up to continue pushing the boundaries with what you can do with Elm.
Yeah. Good luck.
And have fun, especially have fun as people have told me.
Parsers are fun.
Parsing Elm is even more fun.