elm-test v2 with Martin Janiczek

Martin Janiczek joins us to talk about fuzz testing, why it matters, and how the upcoming elm-test v2 changes the way you write fuzzers in Elm.
August 1, 2022


Hello Jeroen. Hello Dillon.
Did you feel like something was missing last episode?
Yeah, like, it was a nice episode, but it felt like we didn't have like a, what do you call it? A Martin.
A Martin! Exactly! Hey, wouldn't it be nice? Actually, we've got another Martin here today. Martin Janiczek back with us again. Thanks so much for coming back on, Martin.
Hey there, hello. Nice to be back here.
So soon. Ready for another Martin episode. Let's do it.
Yeah, well, thanks for joining us. And so you've been hard at work for a while on this pretty ambitious project, getting a new fuzzing approach in the Elm testing library.
So first of all, congratulations on completing the project and getting it merged in. That is quite ambitious and impressive.
Yeah, thank you. It is quite a good feeling to finish a project because I have so many unfinished projects.
Yeah, and that's a lot of coordination to like actually convince people that this is like a good official approach to switch over to.
And, you know, I mean, so this fuzzing approach, like there's like a lot of interesting theory behind it that you had to really, I think, do a lot of reading about to compare these different approaches.
And but yeah, it's a really fascinating idea. So, yeah, great, great job on that.
Well, we did talk about fuzzing in the episode dedicated to elm-test, but could you explain what fuzzing is again?
I think what you're trying to say, Jeroen, is that your memory is a little fuzzy on that.
There it is. Yeah, sure. I can definitely try to describe fuzzing and why use it and so on.
So I think of fuzz tests, or as they're better known in the functional programming world, property-based tests,
I think of them as, like, a continuation of the line from unit tests through parameterized tests to infinity.
So let me explain that. So unit tests usually have just like one example, right?
You hard coded one value and you are trying to assert something about it.
Either you are trying to check the result of calling a function with like equals, right?
So adding one to three is four or you are trying to learn something else about it.
Let's say, I don't know, like multiplying by two makes it even or something like that.
So it doesn't have to be an output value that you want to check against.
But you usually have one specific input example that you are trying to test.
Then the next step is parameterized tests.
So you could think of them as a list of tuples, let's say.
So list of inputs and their corresponding outputs.
And so that's usually helpful. In my experience, it's helpful with parsers because it's hard to stay sane and like have everything nailed down when you are creating a parser for some language.
So I think those tests definitely speak the most to me about, like, why tests are good.
Because I just want to make sure that I didn't break something when trying to implement this new feature.
So parameterized tests are a great way to just check the same property or check the function in the same way with lots of different examples, lots of different hardcoded inputs.
And then just to clarify, Elm test does not have support for parameterized tests, but you can emulate it using List.map.
Yeah. So you would have a list of these inputs and outputs and you would just pass it to List.map.
You would create tests out of that and you would just give them to Test.describe or Test.concat or something.
Describe. Describe.
Yeah, Test.describe.
Yeah, you can still do it in Elm even though it doesn't have this kind of like annotation style that let's say Java JUnit has and so on.
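That List.map emulation might look something like this (a sketch using the elm-test API; `double` is a hypothetical function under test, not anything from the episode):

```elm
import Expect
import Test exposing (Test)


double : Int -> Int
double n =
    n * 2


-- A list of (input, expected output) pairs, mapped into individual tests.
doubleTests : Test
doubleTests =
    [ ( 0, 0 ), ( 1, 2 ), ( -3, -6 ) ]
        |> List.map
            (\( input, expected ) ->
                Test.test ("double " ++ String.fromInt input) <|
                    \() -> double input |> Expect.equal expected
            )
        |> Test.describe "double"
```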
What's nice about parameterized tests is that when you specify more inputs, so if your function has, let's say, three arguments and you have tables for each of these, you get all the combinations, right?
So it's a nice way to cover a lot of area quickly.
But all of these, like unit tests and parameterized tests, they have the problem of only checking what you can think about, right?
So you know about edge cases like zero and minus one and so on, but perhaps you didn't think about what happens with max integer or something.
So that's what property-based tests, or fuzz tests as Elm calls them,
that's what those are really good at, because they generate the inputs randomly.
And so that's nice, but it has its own issues, right?
Because if you can't hard code the inputs, suddenly you can't hard code the outputs either.
So it can't be just a simple input transform, expect equals some output because it's random.
So you can't really do that.
And so you need to change your thinking a little bit about it.
You need to change your mindset and you need to go more general.
So instead of just checking for equals, let's say, you have to think about what invariants hold for your function.
Which properties hold for it?
Yeah, which properties?
What is always true, right?
No matter what input is going to be there.
And so suddenly, let's say, this is really, like, an insultingly stupid example, but for the add function, you would want to check that a plus b is the same as b plus a.
You would want to check that zero plus anything equals that anything, right?
The value that you gave it.
And so there are these like laws, mathematical laws that you have for addition that you can check with random integers.
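Those laws could be written as fuzz tests roughly like this (a sketch using elm-test's Fuzz and Expect modules):

```elm
import Expect
import Fuzz
import Test exposing (Test)


additionLaws : Test
additionLaws =
    Test.describe "addition"
        [ -- commutativity: a + b == b + a for any random pair
          Test.fuzz2 Fuzz.int Fuzz.int "commutativity" <|
            \a b -> (a + b) |> Expect.equal (b + a)
        , -- zero is the identity element of addition
          Test.fuzz Fuzz.int "zero is the identity" <|
            \a -> (0 + a) |> Expect.equal a
        ]
```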
And in a similar vein, there are different other properties that you can check for.
Let's say JSON encoders and decoders, they are each other's inverses.
So you can do that kind of round trip where you generate the value, encode it, decode it and check that is the same, right?
That you didn't lose any kind of data.
And so there's a lot of these properties that you can check, but it's definitely a shift in your mindset.
So it may not come to you as easily as just writing unit tests.
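The JSON round-trip property mentioned above might be sketched like this (the record shape, encoder, and decoder here are made up for illustration):

```elm
import Expect
import Fuzz
import Json.Decode as Decode
import Json.Encode as Encode
import Test exposing (Test)


type alias User =
    { name : String, age : Int }


encodeUser : User -> Encode.Value
encodeUser user =
    Encode.object
        [ ( "name", Encode.string user.name )
        , ( "age", Encode.int user.age )
        ]


userDecoder : Decode.Decoder User
userDecoder =
    Decode.map2 User
        (Decode.field "name" Decode.string)
        (Decode.field "age" Decode.int)


-- Generate a random user, encode it, decode it back,
-- and check that no data was lost along the way.
roundTrip : Test
roundTrip =
    Test.fuzz2 Fuzz.string Fuzz.int "encode >> decode == identity" <|
        \name age ->
            let
                user =
                    { name = name, age = age }
            in
            encodeUser user
                |> Decode.decodeValue userDecoder
                |> Expect.equal (Ok user)
```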
Yeah, definitely.
It definitely seems like a different workflow because this test driven development approach is so based on like fake it till you make it, hard coding something.
You know, you say I expect, you know, this function to combine parts of a URL path.
And when I have this input, this is the output and then you hard code the function to return that output.
And now it's green.
And then you go case by case with specific, concrete, hard-coded cases, hard-coding pieces of it into the implementation.
That's like the TDD, the classic TDD approach.
So how does it change the workflow if you're doing like TDD with fuzzing?
Or is it just a separate workflow?
I'm not really sure myself if it's still TDD because in your example, as you said, you hard code the output into the implementation because that's the easiest way to get the test green.
And then you have three of those examples and it just becomes like if this then that, if this and that else is this third answer.
And it could go infinitely long.
Right. You could have, like, 20 inputs in your test, and then you make, like, a switch with 20 cases in the implementation.
And that's not what you want.
Right. So there needs to be a moment where you switch and say, OK, it's actually easier to do the right thing.
Right. And do the actual implementation.
And I feel like when you add the fuzz tests into the mix, if you say, let's say, a plus b equals b plus a.
And let's say the associativity law, like (a plus b) plus c equals a plus (b plus c), and so on.
If you add all of these laws, you kind of define how the implementation should behave.
And suddenly, because the tests are random, you have no way of keeping that switch statement, or the case-of statement in the case of Elm.
Right. You have no way of keeping that implementation because the values, the inputs are going to be random each time.
And so either you just, you know, you have to give up on that lazy implementation mindset in a way.
And so I think it kind of forces you to say, OK, let's do the actual implementation step.
If you add fuzz tests in there, you just have to do the right thing.
Do you typically do it in your workflow as like, you know, as you're building the implementation, driving the tests in this TDD style?
Or do you typically do fuzz testing after as a like, let's make sure we covered all the cases?
Yeah, I usually do my tests after, but it might be just because I am not a very principled TDD practitioner.
So what about for vanilla unit tests?
Do you do those first? You do streams where you do them first.
Yeah, but it's that's mostly for like Advent of Code and nice, simple examples like that.
Yeah, yeah. It's honestly it's not how I do stuff at work.
Yeah, for Advent of Code it's really nice because they give you the test cases right away.
Right. So it's nice to just put them in and check the correctness of your solution early.
But usually I am not so principled about it and I just write my implementation and write tests afterwards.
I realize that my code might be much better and cleaner and like more testable if I start with tests.
So that's something I try to do. But even for unit tests, I usually do them after.
OK, I see. I wonder, because I have to say I do not do very much fuzz testing.
But my instinct is that I would probably do fuzz testing after, whereas I like to do
regular unit testing, driving the implementation.
But yeah, and it seems to be the sweet spot, I think.
Yeah, that's where I would like to get.
Yeah, I think I would do the same. It also kind of feels like when you fake it until you make it,
you hard code everything. Well, the unit test is still a hardcoded version.
Right. And then you end up with the fuzz test. That is the real one.
Oh, interesting. I see. So you're saying you could like generalize your unit test,
start with like the unit test being the hard coded one.
And then the fuzz version would be generalizing that.
Yeah, that's a cool idea. I like that.
In a way, you could say you're not done until you have done fuzz testing.
But I mean, fuzz testing is not what you always will end up with. So that's not entirely true.
There are applications and there are places where unit tests are more appropriate.
But it's, you know, you can have both. And unit tests are great for understanding.
Let's say you have a new colleague and they look at the test and suddenly they see
how you are using your functions and so on.
And property based tests are more for getting the confidence that your functions are really correct.
Right. It approaches a proof, right? It approaches a proof as your sample size approaches infinity,
since it's random.
Yeah, I think it was Dijkstra who said that tests can only prove that there's a bug,
but they cannot prove that there are no bugs. Right.
I'm not sure of the correct wording, but that's the gist of it.
And yeah, so you can almost infinitely increase your confidence, but you can never get there.
That's where like formal methods come in and proofs and so on.
Type checkers, static analysis.
It definitely feels like it's more in the category of proofs.
Than just tests.
I will say like when I've coached people on these incremental approaches to writing code,
which I think hopefully people think of me as a broken record talking about incremental things.
If so, then I've achieved my mission, which is to...
You might be broken, but at least you're immutable.
No record updates for me.
But when I'm coaching people and like doing a TDD example, like one of the things I notice is that
people want to jump to the generalized implementation first.
And I want to untrain that habit of trying to go straight to the generalized solution
and instead say, let's talk about one case at a time, because I think our brains are actually
more effective at thinking about one case and building code for one case.
Otherwise we overengineer and our brain gets overwhelmed.
It's like too large of a problem to bite off at once.
So it gives us these manageable incremental chunks to work on.
So I definitely think there's like value to this TDD approach.
And I definitely like the idea of doing fuzzing after.
And I think that's a great way to do it.
And I think it's good for people to keep in mind that like balancing that incremental
approach with the fuzzing, which will pull you more in the generalized implementation direction.
Also, now that you talk about it, I think fuzzing is great for when you can't hold
the whole state in your head when there's too many options.
So this is more for, let's say application state and messages.
So, you want to make sure that no matter what messages you get,
something will always be true because you can't really think through it.
One example that I am now thinking of is the text editor, like Elm editor that I tried to make,
which was totally from scratch, no text area, no custom JavaScript.
So it was just like each character was its own thing.
Each character was its own div and all the cursor manipulation and selection.
It was all done from scratch.
And so there were a lot of situations where I thought the implementation is correct because
the way that I tested the application, everything seemed to be fine.
So if I am on the last line and I click the down arrow, it should move me to the end of
the line, right?
It shouldn't move it down because there's no other line.
This was the last one.
And all these cases, I could think of some, but I had no idea if that's all of them or
if there is some behavior that I am forgetting.
And so what property based tests were really good about was saying no matter what happens,
no matter what keys I press, no matter what selection is currently active, no matter where
a cursor is, if I do this, this will happen, right?
And so it was great letting the computer just throw random messages at it and tell me,
hey, no, if you do, I don't know, up arrow, down arrow, left arrow, this happens.
And you said it shouldn't, right?
And I wouldn't have thought of that example.
So I think fuzz tests are really great after the fact, after you implement your program.
It's really nice that you write these unit tests and the fuzz tests almost the same way,
especially in the same tool, because you can switch from one to the other much more easily than
if you had to write an end-to-end test, for instance, where you go for Cypress or Puppeteer.
Like, okay, now you have to write everything in a different language and using a different tool.
But here it's just, like, a few characters, a few lines of change, and you're good.
I need to dust off this branch I have.
But I've got a branch in my elm-markdown parser. In Markdown, there's no such
thing as invalid Markdown.
It's just Markdown that won't have the effect you expected.
An incomplete link in Markdown is just plain text, but it's not invalid Markdown.
So I have a fuzz test that you throw any random string at it and there should be nothing that
has an Elm parser error.
That's a pretty straightforward, just throw a lot of input at it.
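A sketch of what such a "never a parser error" fuzz test could look like (`parse` here is a stand-in for the real elm-markdown entry point, not its actual API):

```elm
import Expect
import Fuzz
import Test exposing (Test)


-- Assumed placeholder signature for the parser under test:
-- parse : String -> Result String (List Block)
neverCrashes : Test
neverCrashes =
    Test.fuzz Fuzz.string "no input produces a parser error" <|
        \input ->
            case parse input of
                Ok _ ->
                    Expect.pass

                Err message ->
                    Expect.fail ("parser error on valid-by-definition Markdown: " ++ message)
```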
You can imagine sort of doing it in a little bit of a more sophisticated way where you
try to construct more meaningful inputs that might be better at flushing out bugs.
You can increase the chance of something weird happening.
If you know what to look for.
If you know, right, exactly.
But at the same time, you could also bias it towards the things you're already looking
for, which you're trying to avoid.
So, I mean, I guess you could do both.
You could say, here's just a totally random string.
And then you could have another fuzz test that says, here is a set of known characters
that could cause problems in unfinished HTML tag, unfinished link tag.
And we have tools for that, right?
We do have this oneOf function and its derivatives like frequency and weighted
and however those are called, where you can say, OK, 90% of the time give me this random
string and 10% of the time give me this, like, carefully constructed, tricky string.
And you can change the weights and find a balance where it's
doing useful work, but you can still be sure that it will try all the crazy random things
that you didn't think of with the totally random, uniformly random string.
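A weighted fuzzer along those lines might be sketched with Fuzz.frequency (the weights and the tricky strings are made up for illustration):

```elm
import Fuzz exposing (Fuzzer)


markdownString : Fuzzer String
markdownString =
    Fuzz.frequency
        [ -- 90% of the time: a totally random string
          ( 90, Fuzz.string )

        -- 10% of the time: a carefully constructed, tricky string
        , ( 10
          , Fuzz.oneOf
                [ Fuzz.constant "![unfinished image"
                , Fuzz.constant "[link](without-closing-paren"
                , Fuzz.constant "<div>unclosed tag"
                ]
          )
        ]
```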
How much confidence do you have that fuzz test will discover something if there is something
to be found?
Because I think by default, they run like 100 different inputs.
So for instance, for finding errors in a markdown parser, like, sure, it will find something
sometimes, but a hundred might be low.
So you might not trust it on the first run at least.
100 might be a little bit too low for that markdown example, because strings are like
collections, and at least with a totally random string generator, 100 might be too little.
But if you had this kind of like, let's create my string from: this is a normal word.
This is a link.
This is, you know, one with missing parentheses, and all these different things where these
chunks are bigger, then you might cover the space better.
And so you might need fewer tests, but definitely depending on what you are testing, you can
say, okay, let's run these fuzz tests.
You know, let's run a million of them instead of 100.
And some people, this is now going into like testing and trying to break C programs, libraries
like PNG libraries and so on.
People do keep running them nonstop, right?
They run them 24 seven and every now and then, you know, it finds something.
So this is possibly something we could do with elm-test, like let it run for a specified amount
of time or let it run indefinitely instead of just saying, here's, you know, try 100
times and then stop.
So you can choose how many monkeys you want and what these monkeys can write.
Yeah, exactly.
And you're like, oh no, they wrote Harry Potter again.
We have to throw it out.
I think this is a very interesting topic of how you produce inputs that essentially meaningfully
represent the cases that you want to be testing.
Because I mean, in the case of the markdown input, you're actually probably going to have
an overrepresentation of, for example, cases without matching closing tags, right?
Because the odds of getting a valid matching closing tag is very low.
That's going to produce a certain class of errors more.
It's going to represent that class of errors more.
But maybe there's a class of errors where you have a valid matching tag, but within
that there's an invalid one.
Or maybe there's a, you know, maybe there's something meaningful that happens when you
do nesting.
Maybe there's something meaningful that happens when you have something within a block quote,
which you can have all sorts of nested, these nested markdown block structures that can
have other markdown within them, which can have other markdown within them.
You're not going to represent those cases, which means you are looking at more of a monkey
typing Shakespeare situation where sure, theoretically, you're representing that input
because you're representing an infinite space that's not constrained.
But realistically, statistically, you're not representing that well enough to trust that
you're getting coverage of that case.
Yeah, I think what might work really well in that case is, well, not going with a totally
random string, not going with a valid AST that you would convert to a string and then test,
but going in the middle, like with the tokens.
So from implementing the parser, you know that, let's say left brace or the left square
bracket, those are special or indentation is special or the exclamation mark is special.
And so you can use those weights in the oneOf function.
You can tweak that and you can say, OK, let's try what happens if I have an image tag and
inside is something else.
So you can say with probability this and that, generate a string that is those tokens.
So like exclamation mark, square bracket, something in them, then closing bracket, then
open parenthesis and try to run the whole thing again inside.
So you can try to create these cases.
You can try to go based on the tokens.
So the tokens are a little bit like bigger unit than just a character.
But suddenly you are trying the interesting stuff more often than just, let's say, alphabetical
characters, because those don't trigger anything special.
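A token-based fuzzer like that could be sketched as follows (the token set is illustrative, not taken from any real markdown fuzzer):

```elm
import Fuzz exposing (Fuzzer)


-- Tokens that are "special" to a markdown parser, plus some filler.
token : Fuzzer String
token =
    Fuzz.oneOf
        [ Fuzz.constant "!"
        , Fuzz.constant "["
        , Fuzz.constant "]"
        , Fuzz.constant "("
        , Fuzz.constant ")"
        , Fuzz.constant "    " -- indentation
        , Fuzz.constant "word "
        ]


-- Glue a random sequence of tokens into one input string,
-- so the interesting characters show up far more often than
-- in a uniformly random string.
tokenSoup : Fuzzer String
tokenSoup =
    Fuzz.list token
        |> Fuzz.map String.concat
```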
Yeah, it seems like there are a lot of approaches you could use.
Like another one that comes to mind is you could, you know, if you construct the input
from an AST, then you know you're constructing valid input and you turn that into a string.
I mean, obviously, you know, you could do the reversible approach.
Reversible is always an interesting thing to try with fuzzing if there is such a concept
in the problem you're looking at.
But beyond, like, reversible, you can build a random AST.
You can weight certain things, you know, to make sure there's an even distribution
of things like nesting, for example.
And then you could go and change random characters or you could change, you could look for specific
characters that have meaning and either add or remove them.
That's another thing that might bring out some interesting cases.
And of course, you could just go and build your own totally custom generator that's going
to build up inputs that represent these different cases.
Yeah, exactly.
All right.
I feel like we've talked about fuzz testing in general.
I think people know what they are.
Should we talk about what you've done, Martin?
Like what is new in your implementation?
What is different?
All right.
So my changes to elm-test, and what's going to come out in elm-test v2, is a reimplementation
of the shrinking process.
So just to quickly summarize it, elm test will generate random inputs to your function.
And if it finds an input that fails the test, it will try to simplify it.
So if it's 512, it might turn it into zero, or it might divide it by two.
It might subtract one from it.
It just tries to make it smaller somehow or simpler.
And there...
If your sorting function fails on a list with a thousand items, but there's also an input
with two items that fails, you don't want your error message to just say, hey, I found a failing example.
You want it to show you the simplest example it can find that fails.
It's not very helpful if it gives you something that takes two screens just to display the value.
So definitely...
Thank you, elm test.
Like, I guess this is wrong.
But it's really like when you see a failure from a property based test and it is shrunk
down to its minimal failing value, a lot of the times the bug is obvious, right?
Because you see, oh, like that's, that's that.
Because it is small and you can kind of like guess as to what happens.
But if really, if it's like, let's say for my editor, if it's a list of 500 messages,
I have no idea which of these are just like fluff and which of these are really important.
And similarly for the markdown parser, you could have a huge string, a huge document, and somewhere
in there is going to be the failing part.
Like let's say the image tag that is not closed or something.
And if the shrinker can remove everything else and just give you that image tag, that's
going to be much clearer for you.
And so we do this shrinking process, and there are two ways, at least in the currently
known open source world, there are two ways to do shrinking, and we are using one of them.
And it has some problems.
When you say we.
Do you mean?
Elm test.
The current Elm test.
Yeah, right.
The current Elm test, let's say Elm test v1 is using value based shrinking.
So let's, let's name those two approaches.
Value based shrinking kind of knows how to shrink a value, but it doesn't know any kind
of context about it.
And so you could think about it as a function from the value, let's say from integer, to
many possible ways to make it smaller.
So from integer to a list of integers.
So giving that shrink-integer function five would give you a list with four, zero, two,
things like that.
And when you say that it needs to return a list of integers, those are the things
to try out, the list of possible simplifications.
Because you can simplify things in different ways.
You want it to represent, like, zeroness is an interesting thing to shrink to, and evenness
is an interesting thing to shrink to.
So two and zero are meaningful and smaller numbers.
So they would be interesting to shrink to.
I wouldn't say it's about evenness, but it's just trying to make the values simpler
in some way.
So for integers, it might, you know, you might try to shrink to zero.
You might try to shrink to some negative number for strings.
You might change the character from B to A, or you might also just make the string shorter.
And for different types, the way to shrink the type will be different.
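A conceptual sketch of such a value-based integer shrinker (the real elm-test v1 Shrink module uses lazy lists, but the idea is similar):

```elm
-- Given a failing value, propose simpler candidates to try.
shrinkInt : Int -> List Int
shrinkInt n =
    if n == 0 then
        -- zero is as simple as it gets
        []

    else
        -- a few candidate simplifications: zero, halfway to zero,
        -- one step smaller (a real shrinker would also handle
        -- negative numbers more carefully)
        [ 0, n // 2, n - 1 ]
            |> List.filter (\candidate -> candidate /= n)
```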
But it starts with random, right?
This, this value based or type based shrinking approach, it says, here's a random value I'm
starting with.
Now, let me, once I find a failing case, let me see if I can create another failing case
by shrinking that failing case.
So it will kind of follow the process of: here's a value that for some reason is interesting
because it fails the test.
I will give you smaller values or simpler values.
And you tell me if any of those fails the test also.
And so it kind of follows this process and it shrinks.
If it finds a smaller value that still fails the test, it will run the candidates
function again.
It will try to see, okay, how can I shrink this one?
And so it will follow that process until it can't shrink anymore according to the test
because, you know, values from one test wouldn't necessarily be failing some other test.
So it needs to shrink with regards to the current failing test.
Yeah, absolutely.
And so that's, that's the value based shrinking.
It has no context of how the value was generated, whether it was mapped somehow, right?
It doesn't know anything about a generator and that creates certain problems.
Let's say if you had the andThen function, which, you know, with random generation, is
often useful.
Or if you had, let's say a filter function where you would say generate integer, but
I don't want it to be even, you would shrink the failing value, but you have no way of
running those filters on it again.
So it loses some of those invariants that you generated with.
And so that's the issue.
That's the issue with shrinking purely based on the value because you have no context about
the generation.
I've definitely run into this problem before and been confused as to why before I read
about the changes you were working on, where, you know, I basically want to say, you know,
as we talked about when you're, when you're building up a fuzz test, you can't check against
concrete expected outputs.
You have to check properties.
But so often you want to prune down your input list to say, well, these properties should
hold for this subset of inputs.
You know, for even numbers or and so you have to get clever.
You can't really filter it down.
But what I recall is you can map so you can get even numbers by doing an int fuzzer and
then map and then times two, which works very neatly for that case.
But you can't really say give me random input and then filter out invalid values.
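That mapping trick might look like this:

```elm
import Fuzz exposing (Fuzzer)


-- Every generated integer is doubled, so the fuzzer can only
-- ever produce even numbers -- no filtering needed.
evenInt : Fuzzer Int
evenInt =
    Fuzz.map (\n -> n * 2) Fuzz.int
```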
Yeah, and so there's this history of Elm test, like way back in the 0.18 days.
It actually had, like, andThen and a way to fail the test and so on.
But we removed it because it has these issues with shrinking.
But you know, you do want to have these functions.
Sometimes they really are useful.
And sometimes you can't really just like map arbitrary integer to be even.
That's, you know, that's a motivating example of why you don't want to use filter
and why you want to, like, construct the right values rather than throw them out.
But sometimes it would be really nice to have the andThen function and to
have the filter function.
And the other approach, which is the way it will work in elm-test v2,
doesn't have those issues, because we are shrinking something other than the value itself.
We are shrinking the PRNG history.
So PRNG is the random number generation algorithm.
And so we are... Is the P pseudo? Pseudo, pseudo-random?
Yeah, exactly.
So in V2 Elm test will remember the context of the generation.
And it will try to shrink that context, that, let's say, list of dice rolls.
So, you know, the PRNG dice gave you five, gave you three, gave you two.
And from that, you somehow generated a value.
That's the role of the fuzzer library, to kind of transform those integers to your
values, custom types, whatever.
So you're talking about the dice rolls being the seeds that we started with
to generate our input values.
Yeah, maybe not exactly seeds, because seed does have, like, a meaning for the random
number generation.
But the values that the PRNG algorithm produced, we do remember those.
We know that if we generate a value from those, we will get that failing value.
And now, instead of having a shrink function for the resulting value, we are shrinking
the PRNG history.
And suddenly, not only do you as the user not have to care about shrinkers at
all, because previously you sometimes needed to write your own shrinkers.
OK, and so you as the user do not need to use shrinkers anymore.
These are internal to the library, and the library has many ways to try and shrink that list
of dice rolls.
So it can zero some of them.
It can delete chunks.
It can try to sort values in a chunk.
It has all these different strategies.
And it is fine tuned to the combinators that the fuzzer library gives you.
So the shrinkers know that the list fuzzer is written in such a way that the shrinking
strategy will work really well with that.
And this also gives you the benefit of generating the values after shrinking.
So suddenly you do filter after you have shrunk the PRNG history.
You are running the andThen function after you have shrunk.
And so all these invariants now hold because you are running those functions on the new
alternative history.
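One way to picture this (a simplified mental model, not the actual elm-test v2 internals):

```elm
-- The recorded dice rolls from the PRNG.
type alias RandomRun =
    List Int


-- Conceptually, a fuzzer consumes part of the recorded history and
-- produces a value (or fails, e.g. when a filter rejects the value).
type alias Fuzzer a =
    RandomRun -> Maybe ( a, RandomRun )


-- An integer fuzzer just takes the next recorded roll.
intFromRun : Fuzzer Int
intFromRun run =
    case run of
        [] ->
            Nothing

        die :: rest ->
            Just ( die, rest )


-- Shrinking edits the RandomRun (zeroing rolls, deleting chunks,
-- sorting a chunk) and then re-runs the fuzzer on the edited history,
-- so any map, andThen, or filter logic still holds for the shrunk value.
```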
So you generate a list of integers.
That's your basis, from a seed.
You generate a lot of values.
And for each value you derive those.
Well, one value or multiple values.
You derive those into a value that you want to test.
And I'm guessing that in some cases you could shrink it to the same value.
So for instance, if my generator generates a number between zero and 256, like that's
how this library works.
But all I want is a coin flip.
So it could generate 232, and it could then simplify to 231.
But those would both result in heads or tails.
And that would not be very interesting.
So is it?
So sometimes you can get the same value from different PRNG histories and that's fine.
We mostly care about simplifying the PRNG history as much as we can.
So making it shorter and making it smaller.
And there's this convention between the shrinking process and between the generators that if
the PRNG history is smaller, the resulting value of your type that you want to test will
be simpler.
So the shrinker just doesn't care if it was 230 or 228 that generated the zero.
It will just pick the smaller one.
And if the generator has done its job well and maps smaller histories to
simpler values, then you will shrink towards simpler values.
And so it might go from one to zero in your final value.
It might go from zero to zero as long as we have shrunk the history a little bit.
It counts.
And also from a user's point of view, it doesn't really matter all that much whether it does
a few cycles extra.
It's not like you're generating duplicate use cases.
Yeah, and with Elm being pure, you don't really care about running functions again and again.
Maybe a markdown parser, but...
So, OK, you mentioned that in the Elm 0.18 version of elm-test, there were these andThen and filter functions for fuzzing.
What would have happened if you encountered a failing test from using filter?
I'm not sure if generating a set of even numbers would be a good example of that, you know, generating ints and then filtering out only the even ones.
But what would happen with the shrinking if you were to use those with the older version?
So the test would fail with a generated three, let's say, and then it would try to shrink.
But the shrinker wouldn't know about this predicate, right, the one that you were filtering on.
And so it would shrink to a value where the predicate doesn't hold.
So, right, even if the generator would always give you one, three, five, seven, after shrinking you could get a two.
And that's not very helpful, right?
Because you only cared about the filtered class of values.
That makes sense.
I think I recall trying some problem like this with Elm 0.19, encountering something where I had to be careful not to create an unsatisfiable fuzzer or something, where I could potentially run out of things to try and then just, you know, hang or crash.
Yeah, that's very much still like an issue that you can run into with the Elm test V2
because that's just like, you know, that's just how filter works.
If you generate even values and then filter those to only keep odd values, you are never
going to see a value, right?
And so it could endlessly spin.
I think the implementation that got merged has like 15 tries, and after 15 tries it just says it's impossible to generate a value where this filter would hold.
So we just give up.
And so, yeah, we are adding runtime failures into the fuzzers' world.
I don't know how to say it, but with elm-test v1, you couldn't filter.
And so you couldn't really have this class of errors, right?
But right now you can.
And that's, I think, a good enough compromise to have these errors if we get the andThen function in return.
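The retry cap described here can be sketched in Python. The names and the exact cap come from the conversation, not from elm-test's source, so treat this as an illustration of the idea:

```python
import random

MAX_FILTER_TRIES = 15  # the merged implementation reportedly caps retries around here

class FilterGaveUp(Exception):
    """Raised when the predicate rejects every candidate we try."""

def filtered(generate, predicate, rng):
    """Draw values until the predicate holds, up to MAX_FILTER_TRIES.

    If the predicate can never hold (e.g. keeping only odd values from an
    all-even generator), fail loudly instead of spinning forever.
    """
    for _ in range(MAX_FILTER_TRIES):
        value = generate(rng)
        if predicate(value):
            return value
    raise FilterGaveUp("impossible to generate a value satisfying the filter")

rng = random.Random(0)
even = lambda r: r.randrange(0, 100) * 2

# A satisfiable filter succeeds:
assert filtered(even, lambda n: n % 2 == 0, rng) % 2 == 0

# Filtering evens down to odds can never succeed, so we give up at runtime:
try:
    filtered(even, lambda n: n % 2 == 1, rng)
except FilterGaveUp:
    pass  # this is the new class of runtime errors being discussed
```

This is the trade-off mentioned above: filter introduces a possible runtime failure, but in exchange the library can offer andThen.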
So the type based shrinking, it's looking at the actual values before it shrinks the result, which would mean, like, it's shrinking those ints.
And then if you're mapping the values, it shrank them before it gave them to you to map.
Whereas with the...
Well, no, in the value based shrinking approach, it shrinks after everything is generated and
mapped and filtered.
I see.
So if you map it, it shrinks based on the actual mapped value.
Yeah, it shrinks right at the end of everything.
What if you map it to a different type?
How does it know how to shrink that?
To a different type? Is it, though?
It's always going to be the same type.
Well, no, you can definitely map values to be a different type.
You can create custom type based on integer or something.
And then it's still the same type.
Well, then it's a custom type.
If you do a map, and take an int and pass it to a custom type constructor?
Yeah, then you have a different variant, but you still have the same type.
But different from the original type of the fuzz generator.
Oh, yeah.
So how does it know how to reduce your custom type?
So this is something that's been called integrated shrinking.
And that's why you are not using random generators.
That's why you are using fuzzers.
And so each of those fuzzer combinators has two parts.
One is about generating.
So presumably it's doing like this random map with the function that you gave it.
And then the second part is the shrinking information.
And I'm not really sure about the details, but I believe it kind of knows that it was
made from an integer.
And so it tries to shrink those integers.
And there's the mapping between the integer and your custom type.
And so it still kind of knows how to shrink because it was created from integer.
But it doesn't have any information about your type, right?
You would have to use something like Fuzz.custom, where you give it a generator for your type and a shrinker for your type.
So compared to that, what is v2 doing?
So V2 is different in that the fuzzers are mostly just concerned about generating.
And there's no tree of possible ways to shrink the whole thing down.
There's no shrink-this-value-to-a-different-value function.
It basically just goes one layer deeper.
And the shrinking is concerned with the inputs to the PRNG, right?
And so it tries to generate a different PRNG history and then tries to generate a value
from that.
It doesn't have to succeed all the time, because let's say you have a list of three items; because of how the list fuzzer is written, it might need, let's say, three or four values, like three or four dice rolls.
And if you shrink that history down to two dice rolls, then you can never generate that list of three items out of it.
There's just too little information, right?
And so it doesn't need to succeed every time.
But if it doesn't succeed, it just throws that shrinking attempt away.
And it keeps those that are somehow simpler, but still generate a value and that value
still fails.
And so you are running your generator all the time.
That's the difference.
So in the v1 approach, you generate a value and then you run the shrinker and the test
a lot of times.
But in the v2 approach, you are generating a lot of times; when you find a failing value, you run the shrinker.
And after every shrink, you run the generator again.
And so all the invariants are still kept.
So you've written an upgrade guide for running your fuzzers or your fuzz tests.
And in the v1 API, I see a lot of uses of random generators.
And in the v2, you use a lot of fuzzers instead.
So this feels like it's a simplification to the API.
It's a lot more consistent.
And I'm guessing you also get a few benefits out of it.
So for you as the user, I believe it's nice that you can stay in one module and just use
one module fully and express everything you want just with fuzzers.
And from the perspective of the library, it's needed because we need to track the PRNG history.
We do give you a function to create a fuzzer from a generator, but it's an escape hatch and it will not shrink well at all.
So please don't use it.
But if you don't use what?
Don't use what?
Don't use Fuzz.fromGenerator.
Okay, I think it's the equivalent of the current Fuzz.custom with no shrinking.
Basically, we need to...
The fuzz library is built on top of random generator for integers and nothing else.
That's a simplification, but basically that's it.
And we build the whole like how to do floats out of integers, how to do strings out of
integers, how to do everything else from integers.
We do that ourselves because we need to be able to hold that list of generated integers
to shrink it.
And so if we give you a function where you can provide a generator of your custom type,
we have no way of running it other than just generating a random integer, using that as
a seed and remembering that seed.
But it's just like a meaningless number, right?
That's the point of seeds.
Yeah, you can try and simplify it.
Yeah, there's no structure that we could try to somehow shrink down.
It's just one big integer and yeah, for the library it's a black box.
Yeah, you could try different inputs, but you wouldn't know if they were simpler or not.
So what's the point?
So I guess that's the price of it.
That now you don't need to use the random library, but we really strongly urge you not
to use it, not to use the escape hatch.
So sometimes, if for example your library is exposing generators, it might now be a good idea to expose fuzzers also.
And then you have no other choice than to write them from scratch using the fuzzer API.
But we have tried to make it really simple to do so.
And I know of no cases where you could do something with the random module, but couldn't
do it with the fuzzer module.
So you just added this escape hatch just in case, basically.
Just in case somebody wants to use third party generator, right?
You don't want to clone elm-geometry and see how the generator is created just so that you can create a fuzzer based on that.
You could just use the escape hatch, but you are not going to get nice shrinking because
it's a black box.
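The black-box problem can be sketched like this (hypothetical Python names; the point is only that a single remembered seed carries no structure the shrinker can exploit):

```python
import random

def from_generator(generate):
    """Escape-hatch sketch: run an opaque third-party generator by
    drawing one big integer, recording it as the only history entry,
    and seeding the generator with it.  (Illustrative names only.)"""
    def run(rng, history):
        seed = rng.randrange(2 ** 32)
        history.append(seed)
        return generate(random.Random(seed))
    return run

# Some generator whose internals the library cannot see:
opaque = from_generator(lambda r: r.randrange(1000))

rng = random.Random(5)
history = []
value = opaque(rng, history)
assert len(history) == 1  # the whole "history" is one opaque seed
assert 0 <= value < 1000

# "Shrinking" that seed yields an unrelated value, not a simpler one:
smaller_seed = history[0] // 2
other = random.Random(smaller_seed).randrange(1000)
assert 0 <= other < 1000  # still in range, but no simplicity relation holds
```

Compare this with the built-in fuzzers, where every dice roll lands in the history individually and shrinking the history shrinks the value.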
It makes sense that it somehow loses semantic information when you just have a generator, because there are bits of state and semantics that the fuzzer keeps track of.
I mean, I could imagine a world where the Elm random generator implementation is actually just a fuzzer under the hood.
So you define it using the same thing.
And now you say fuzzer from generator and hey, what do you know?
It actually is a fuzzer.
It's just that we hid that from you or, you know.
You can go that way.
So we could create a function, fuzzer to generator, and that's fine, because we just throw away the history that we needed for shrinking.
You don't need it if you just want to generate the value.
So we could give you the function.
And if that was the underlying representation of the elm/random library, we could have a nice, very well shrinking fromGenerator function.
But it would need, you know, it would need changes to the Elm random package.
So could you shed a little light on how you take these basic generators, which are conceptually generating ints as the input, and derive these, you know, baseline fuzzers for other types of values in a meaningful way, like the list fuzzer and bool fuzzers and things like that?
How do you ensure that the values they produce reduce down to simpler primitive values with these core fuzzers?
So, integers are easy.
Integers just map to integers.
There's a little bit more complication around that where we prefer some ranges of integers
and then, you know, with smaller probability, we give you larger and larger integers.
There are also some things where we just take, let's say, the most significant bit of the integer to mean the sign bit, so that we prefer positive values to negative values.
So if both minus two and two would fail the test, we are going to give you the two, because it's just nicer.
The world is more complicated than that, but integers are pretty easy.
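The sign-bit trick might look like the following Python sketch. It is an illustration of the idea only, with a made-up 8-bit word size:

```python
BITS = 8  # a small word size, just for illustration

def to_signed(n):
    """Read the top bit of an unsigned draw as the sign bit.  Small raw
    values have that bit unset, so shrinking the raw integer toward zero
    automatically prefers positive results such as 2 over -2."""
    sign_bit = 1 << (BITS - 1)
    magnitude = n & (sign_bit - 1)
    return -magnitude if n & sign_bit else magnitude

assert to_signed(2) == 2                  # small raw value: positive
assert to_signed((1 << 7) | 2) == -2      # sign bit set: negative
# The raw integer encoding -2 is larger than the one encoding 2, so the
# "pick the smaller raw integer" rule prefers the positive value:
assert ((1 << 7) | 2) > 2
```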
And so what can we build from that?
You could build characters.
So you have the function Char.fromCode, which uses, it's not ASCII, right?
It's UTF-16 or something.
You just build a character from a number.
So you have characters.
We can have lists, or just generally collections, by flipping a coin.
So this is the approach we chose to generate lists: you flip a coin.
If it's zero, then you don't create any more elements.
If you get one, you try to generate again.
So that's where andThen comes in.
So you have this recursive algorithm which is building the list.
And that leaves a trail in the PRNG history, right?
There's going to be a one, and then some value that maps to the first element; then a one again, and the history for the second value; then there's going to be a zero, and that's it, because that's when we stop generating the list.
And so it always reduces to something smaller.
If you delete the first two values from the PRNG history, it's going to leave you with a valid list.
So if one-something, one-something, zero shrinks down to one-something, zero, that's going to be fine, because the list fuzzer accepts that.
And so this shrinks better than if we tried to generate a number for the count of how
many values we want and then just generating those in line because you would have to decrement
the number and delete some other chunk after it.
It's just kind of weird, you know; with that internal representation, there's no very simple way to shrink those lists.
So we opted for the nicely chunkable flip a coin approach.
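The flip-a-coin encoding can be sketched in Python. `fuzz_list` and `fuzz_int` are illustrative stand-ins that record every draw into a shared history list:

```python
import random

def fuzz_int(rng, history):
    """Draw one element and record the raw draw in the history."""
    n = rng.randrange(256)
    history.append(n)
    return n

def fuzz_list(fuzz_element, rng, history):
    """Build a list by flipping a coin before each element: 1 means
    "generate one more", 0 means stop.  Deleting one (coin, element)
    chunk from the recorded history still decodes to a valid, shorter
    list, which is what makes this encoding shrink so nicely."""
    out = []
    while True:
        coin = rng.randrange(2)
        history.append(coin)
        if coin == 0:
            return out
        out.append(fuzz_element(rng, history))

rng = random.Random(42)
history = []
xs = fuzz_list(fuzz_int, rng, history)
# The history interleaves coin flips and element draws, ending in the 0 flip:
assert history[-1] == 0
assert len(history) == 2 * len(xs) + 1
```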
And then you have multiple histories for each list element, right?
Is that what you said?
Oh, one history for each slot.
One history.
Well, one history gives you one final value.
So one PRNG history will give you, let's say, a list of characters or something, but each of the characters in there might not be composed of just one dice roll.
It might be composed of multiple, right?
So each fuzzer can draw arbitrarily many random values.
And so you could think about it as a tree, or some kind of nested lists, where a given sub-branch is created by a specific fuzzer, but in the end it's all flattened out, and the shrinking library doesn't really know which dice roll is related to which value, right?
It is kind of blind to that, and it just tries different stuff, zeroing, decrementing, removing, and just throws it at the wall and sees what sticks.
And that might result in a few things that don't make sense and which may fail, but it
will find something which will work out.
So does it create a new branch every time you call andThen?
Is that it?
So, andThen just runs another fuzzer.
And so fuzzers have state inside them, right?
They do have the current history and you can think of those like base fuzzers, like integer
or flip a coin or whatever.
You can think of them as generating value and giving, letting you do something with
that, but also appending to the state.
And so that's kind of like this monadic dance of keeping a state around, where you only expose the value, but basically there are no concurrent different histories.
There's just one history.
But as you go through the function and call these functions recursively, they are going to append to that state, and in the end you end up with both the created value of your type and the PRNG history.
So Fuzzer is a monad.
It has an andThen.
So yes.
That's the definition, right?
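That monadic state-threading can be sketched in Python, with a fuzzer modeled as a function from (rng, history) to a value that appends its draws to the one shared history. The names are illustrative, not elm-test's internals:

```python
import random

def int_fuzzer(lo, hi):
    """A base fuzzer: draw an int in [lo, hi) and record the raw draw."""
    def run(rng, history):
        n = rng.randrange(lo, hi)
        history.append(n)
        return n
    return run

def and_then(f, fuzzer):
    """Run `fuzzer`, then run the fuzzer produced by f(value); both
    append to the same single history, so there are no parallel histories."""
    def run(rng, history):
        value = fuzzer(rng, history)
        return f(value)(rng, history)
    return run

def list_of(n, element):
    def run(rng, history):
        return [element(rng, history) for _ in range(n)]
    return run

# A list whose length is itself fuzzed: andThen threads the state through.
sized_list = and_then(lambda n: list_of(n, int_fuzzer(0, 10)), int_fuzzer(0, 5))

rng = random.Random(7)
history = []
xs = sized_list(rng, history)
assert history[0] == len(xs)        # the first draw decided the length
assert len(history) == len(xs) + 1  # then one draw per element
```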
And so, right, we have integers, characters, lists, and all these things like map, map2, andThen; those can be done without the specific values.
Those are just ways to combine them.
And now that you have lists, you can do all kinds of stuff.
You can do strings, which conceptually are just lists of characters.
You can do sets and dictionaries with the filter function.
Well, I want to say with the filter function, but it's different; it's defined differently.
You can do lists of certain length and so on.
That's actually pretty smart.
We have this flip-a-coin fuzzer that gives you a one with a certain probability.
So you give it a float.
And so sometimes these lists-of-certain-length fuzzers, if they see they have enough stuff, they will say, okay, the next flip of the coin will be with probability zero, right?
And so it will not generate stuff in advance and then just throw it out.
It will just stop, like, I'm giving you this biased coin.
And so there are tricks like that, but it works out pretty well and it's all pretty optimized.
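The biased-coin trick for fixed-length lists might look like this sketch, which reuses the same chunkable (coin, element) history layout as the plain list fuzzer; again, an illustration rather than elm-test's code:

```python
import random

def weighted_coin(p, rng, history):
    """Flip a coin that lands 1 with probability p; the 0-or-1 result is
    what gets recorded in the PRNG history."""
    flip = 1 if rng.random() < p else 0
    history.append(flip)
    return flip

def list_of_length(n, fuzz_element, rng, history):
    """Keep the coin-flip list shape, but bias the coin: probability 1 of
    continuing until we have n elements, then probability 0.  No draws
    are generated in advance and thrown away, and the history keeps the
    same chunkable (coin, element) layout as the plain list fuzzer."""
    out = []
    while True:
        p = 1.0 if len(out) < n else 0.0
        if weighted_coin(p, rng, history) == 0:
            return out
        out.append(fuzz_element(rng, history))

def fuzz_int(rng, history):
    x = rng.randrange(100)
    history.append(x)
    return x

rng = random.Random(3)
history = []
xs = list_of_length(4, fuzz_int, rng, history)
assert len(xs) == 4
assert len(history) == 2 * 4 + 1  # four (coin, value) chunks plus the final 0
```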
There's one fuzzer that is really, really, really, really complex.
And that's for floats because we can't really use a float generator, right?
We can only use generators for integers.
And we are inspired by the library Hypothesis, which is this property-based testing library for Python, which started this, let's say, internal shrinking approach.
And they gave a lot of care to shrinking floats nicely.
So, you know, with the current elm-test, you could have a test that fails for some kind of float.
And of course it will generate some random number in the like 10 millions and so on with
all these different like digits.
And it will shrink down to some really unreadable decimal next to zero, right?
So 0.00008135, whatever.
And this is not very nice because if the test would fail with zero or with one, why not
give you that, right?
It would be just much nicer for the user.
And so we could do some kind of trick where we take two integers, do some bit magic, and just divide things by the maximal float to give you floats in the uniform range.
That would be a really easy way to generate floats, but it doesn't shrink well, right?
And so what Hypothesis did, and what elm-test now does, is we reorder the bits in the IEEE 754 float representation to kind of follow the rule from the shrinkers, right?
So if the random integer is smaller, it will result in a simpler float.
And so there's a lot of bitwise masking and ORing and shifting, and it's really crazy.
And that's the part that I was most stressed about.
But the result is floats will shrink beautifully: it will prefer positive numbers, it will prefer smaller numbers, and it will prefer integers over decimals, right?
So it will prefer numbers like one over 1.5, but also it will prefer smaller fractions
over huge fractions.
So it will give you 1.5 instead of 1.76543215, right?
And so I'm pretty happy about that.
And I'm curious to see if this different distribution of floats will find some bugs easier than the v1 version of elm-test.
I'm not really sure about that, but I'm curious to see.
And we are also adding the NaNs in there.
We are adding the infinities in there, which is probably going to be what finds the most
bugs, but also, you know, sometimes your library or application doesn't really care about those.
And you are just like, okay, if you give me garbage, I will give you garbage, no matter what.
So next to Fuzz.float, we also have Fuzz.niceFloat, which is just numbers.
So, you know, if suddenly you have like 30 failing tests because of infinities that will
not happen in your code, who knows?
Will they not?
That you hope won't happen.
But you can pick the easy way out and you can just switch from floats to nice floats.
And those tests, those test failures will go away, but perhaps keep them in.
Perhaps somehow guard against these values in your application code, right?
It's also an option you can take.
I have to say, I still haven't encountered a NaN in Elm code, like I have in JavaScript.
I have never encountered one.
So I think I'm just going to go with nice floats always.
I mean, maybe, maybe I'll try it.
And yeah, I did encounter them at work recently.
I can't remember what specific scenario that was, but I was just getting NaNs all around
and I was like, why?
These are like, there's no reason.
It's always something like, by dividing by zero you get infinities, and then you do something with the infinities and suddenly it's a NaN.
So yeah, it's, yeah.
All you need to do is write a fuzz test to make sure that no code returns a NaN.
That's it.
Do it everywhere and you're good to go.
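Such a never-NaN property test might look like this Python sketch; the `ratio` function is a made-up example of code that lets NaNs sneak through via infinities:

```python
import math
import random

def ratio(a, b):
    """A deliberately naive made-up function: division by zero lets
    infinities in, and inf / inf then produces a NaN."""
    if b == 0.0:
        return math.inf if a >= 0 else -math.inf
    return a / b

def never_nan(fn, inputs):
    """The property: for every generated input pair, the result is not NaN."""
    for a, b in inputs:
        assert not math.isnan(fn(a, b)), f"NaN from inputs {a}, {b}"

rng = random.Random(0)
nice = [(rng.uniform(-100, 100), rng.uniform(1, 100)) for _ in range(100)]
never_nan(ratio, nice)  # passes on "nice" floats

# Feed in the nasty values a full float fuzzer also produces:
try:
    never_nan(ratio, [(math.inf, math.inf)])
    caught = False
except AssertionError:
    caught = True
assert caught  # the property correctly flags the NaN
```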
Perhaps that's also something Elm review could do.
Like, Hey, what are you doing with that zero?
Put it down.
You could check for division by zero at least.
I mean, with fuzzers, like why not just do a fuzz test?
Like, why not just throw more inputs at it, if you can get meaningful test results?
It's really cool how much thought you put into,
you know, all of these little details for, I mean, essentially this is kind of the user
experience, the usability of fuzzers and their errors and how meaningful they are.
Because if you write a fuzz test that fails, then you have a failing test, but it's a question of how useful the reduction is, whether it's able to give you a simplified value.
I think this is like the common thread going through the whole release, like making stuff nicer
for the user and like giving you more power, but also giving you better results.
And making it map more intuitively and yeah, this definitely is giving me a lot of motivation
to like throw some more fuzzers in my projects.
I definitely have some, but I feel that I'm not doing a good enough job identifying those
opportunities and keeping my eye out for where I need to test a property and where it would
be helpful to have a lot of inputs exercised, which there are really a lot of cases.
Most of my tests are Elm review tests, so I would have to generate some Elm code and
then I would have to make sure that it reports an error in this kind of case.
Like, yeah, I feel like fuzzers are very appropriate there, but I could learn something.
I could imagine fuzz testing with Elm review where you could say something like, you know,
with import as, and then you use that import as to generate your strings.
So you use a module a certain way, because often something I'll miss in Elm review fixes or rules is I won't check for a hard-coded module name, and maybe I'll do fixes with that hard-coded module without checking the way it's imported.
So if you fuzz that, you know, that would be the challenge is now, you know, you have
to write the tests and consume that instead of just writing a hard coded string.
So there's certainly a trade off there.
Also Elm review asks you to write the expected fixed code.
So you would also have to generate that based on the same input.
So basically you're redoing the implementation.
Yeah, that's always the issue of like, how do you stop from just writing the implementation
in the test?
And that's kind of really the one tricky thing about property based tests, thinking of the properties.
And there's an awesome blog post; I can't recall the author, but it's about property based tests in F#.
I think F# for Fun and Profit is the name of the blog.
Oh, Scott Wlaschin.
Oh yeah.
So he has some great blog posts about what types of properties are there, right?
So there are these mathematical laws.
You can check that like appending an empty string to anything will be that string and
so on.
But there are also these round trips, and oracle testing, where let's say you are implementing your own optimized sort function.
You can always check it against the standard library List.sort, right?
And so there are these things where you know kind of what the expected output should be
and you can test against like the reference implementation and so on.
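An oracle property like that sort example might look like this sketch, with Python's built-in `sorted` playing the role of List.sort as the trusted reference:

```python
import random

def my_sort(xs):
    """A hand-rolled insertion sort, standing in for "your own optimized
    sort function" from the example."""
    out = []
    for x in xs:
        i = 0
        while i < len(out) and out[i] <= x:
            i += 1
        out.insert(i, x)
    return out

# Oracle property: on every fuzzed input, agree with the trusted
# reference implementation.
rng = random.Random(1)
for _ in range(200):
    xs = [rng.randrange(-50, 50) for _ in range(rng.randrange(0, 20))]
    assert my_sort(xs) == sorted(xs)
```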
There's all these different types of properties that you can use and he's really great about
like showing examples and kind of walking you through it, building intuition.
And yeah, so that's, I think we could put that into show notes.
That's really good intro into how to think about property based tests.
That sounds amazing.
Yeah, that sounds very helpful.
I would love to have like a, you know, just a set of categories to think about of different
types of properties to check for in my brain when I'm writing code.
That seems like that would help me motivate me to actually reach for that tool more often.
Yeah, as you say, it's a problem of identifying when to use it.
So those categories seem like a great trick for that.
And that's basically what the blog post is about, those types of categories.
So go for it.
That's great.
I see there's a video associated with it too.
That's perfect.
Also, the amount of care that you put into shrinking to nicer values and making that intuitive has really changed the way I think about fuzzing, and made me think, like,
wow, you know, both think more deeply about like, how do I want to generate fuzz inputs
to make sure I'm meaningfully exercising the relevant cases?
And then how do I reduce those down in a meaningful way?
Like, I was thinking, you know, I'm sure you have to think carefully about the semantics
if you want to produce a fuzzer that really reduces values down meaningfully.
Now, again, if you get a fuzz test that fails, it's failing no matter how well you've reduced it.
So you don't necessarily need to worry about that for just getting an error message.
And it's more, how nice do we want to make the fuzzer's output?
Yeah, it's all about the UX.
Right, right.
But for the input, that's a different story because you do want to make sure that you're
exercising the important cases.
You want to be sure that you are kind of spread over the input space properly so that given
enough time, you will find the failing value.
It's like a question of, you know, with unit tests in TDD, I always try to encourage people that the error message is part of the failure and part of the test-driven process.
So you don't just go make it green.
Like, see the error message, improve the error message if it doesn't tell you what to do
and then do the thing it tells you to do.
But so it would be the equivalent of like, yeah, you have a failing test, your CI is
going to break, but now you don't know why because you don't have meaningful information.
So those things are both important.
It's probably more important that it just fails in general.
But if you can make it nice output, that's great, too.
So this has not been released yet.
This is still in a beta version of the elm-test CLI.
Right, but it has a command to allow you to try it out.
So this is still in a testing period.
So how can people try it out and what do you expect to hear from people?
Yeah, we have posted some Discourse posts about both the new version of the elm-test CLI that allows you to try it, and also a call for testing, kind of like, please help us find if there's anything horribly wrong before we make the 2.0.0 version.
So, yeah, in the new CLI, I believe starting from revision 8, you have this command.
Let's see what it was, install-unstable-test-master, something like that.
You can see it if you do elm-test help.
Yeah, you got it right.
Yeah, yeah.
We also linked to it in the show notes anyway.
So yes.
Thank you.
So it basically just pretends that your 1.2.2 version, which is the current elm-explorations/test version, contains the code from 2.0.
So it rewrites your Elm home, which is usually in like home slash dot elm slash something.
And so we are essentially rewriting your cache, and on the next elm-test run, it will pick up the new library.
And you can undo that either by just removing the cache, so removing the dot-elm directory, or you can use the uninstall-unstable-test-master command in elm-test.
And yeah, so after you install that, after it kind of tampers with your cache, you can run elm-test, or elm-test-rs I believe, and it will use the new library.
And so there are some API changes, right?
So you can expect to see some failures from test code not compiling, but that's going to be, I believe, mostly the Expect.true and Expect.false functions being gone from the Expect module, and also the tuple functions changing to pair and triple.
Basically, tuple fuzzers got a little bit of, again, user experience improvement, where you don't have to structure the inner fuzzers into a tuple.
Yeah, right.
Yeah, you don't have to pass in a tuple of fuzzers.
You pass in two arguments for the tuple, and it's not called Fuzz.tuple.
It's called Fuzz.pair.
It's not called Fuzz.tuple3.
It's called Fuzz.triple, which I much prefer, especially with Elm syntax actually only having those two kinds of tuples now.
And so there is a document about basically API changes.
And so I have tried to do a good job of like listing them all out.
I actually used the elm diff command to get that, and saying what you can do instead.
And it should be quite non problematic.
It should be really just those tuple changes and Expect.true and false being gone,
where again, there's the explanation of what you can do instead.
It's one to one, so all cases should be covered.
So you can test it out.
You can change your test suite and see whether there is some problem that we didn't think
about that we should fix before releasing 2.0.
And you can also tell us if it caught any new bugs, or if you are happy about the float fuzzer; I would be very happy to hear that.
Do you expect that it could find fewer problems than before as well?
It's possible.
I think the distribution definitely changed.
Right, so Elm test one and Elm test two will try different points with different probability.
But we didn't really change all that much.
We will still cover the whole space as before.
It might be just the probabilities that change.
So again, if you run your tests enough, you should see the same errors, and hopefully you will see them sooner, because we are preferring small inputs over larger inputs and so on.
It is not totally uniform.
All right, great.
So there are some really nice videos you put together kind of walking through some of the
design, so people should definitely check those out.
We'll link to those in the show notes.
Are there any other resources we should point people to?
I think we should mention that for feedback, you should go to the testing channel on Slack
or maybe open a GitHub issue.
Yeah, you can do that.
We are, or at least I am monitoring the testing channel, so I will definitely be there.
You can post in the Discourse post or create your own.
You can definitely raise an issue on GitHub.
As for the resources, I'm not sure about any fundamental resources.
I definitely take a lot of inspiration from how the library Hypothesis does things in
the Python world and they have a blog with a lot of these like why do that and how to
think about it, and, let's say, testing stateful programs, which we don't really have; we don't have side effects, but we still do have those update functions and messages and so on.
So you can glean a little bit from that.
It's not always going to be applicable because we just don't have certain kinds of problems,
but yeah, I like their approach of putting the user experience or like the developer
experience as a priority.
And so we are actually in talks with Jakub Hampl, gampleman on Slack.
We are kind of toying with the idea of, let's say a failure database.
So if the fuzzer finds a bug, it will remember it.
And the next time you run the test, it will try it first, right?
Just so that you don't need to randomly find it again; it will try it straight away.
And forever from now on or something.
I guess until you clear the cache or whatever.
But yeah.
So I think we can still steal a lot of nice ideas from the ecosystems in different languages.
And so hopefully the testing story in Elm is going to be nicer and nicer.
Which is already so nice just by using a language like Elm, with pure functions and no implicit state in these things.
So testing is such a fun thing.
Add fuzzing to the mix and you get potentially some flakiness back; but maybe with the failure database thing, it's a good kind of flakiness.
Hopefully your inputs are well distributed enough to not be flaky though.
All right.
Well, Martin, thank you so much for coming back on.
It was a pleasure.
And again, congratulations on getting this merged in.
Thank you.
Thanks for having me.
It was a pleasure.
And, Jeroen, until next time.
Until next time.