It’s fair for public data to be excluded on an opt-out basis, rather than included on an opt-in one [...]
No, no it’s not. This is a critical thing about ownership and copyright in the world. We own what we make the moment we make it. Publishing text or images on the web does not make it fair game to train AI on. The “public” in “public web” means free to access; it does not mean it’s free to use.
Besides that, I’d also add something I’ve seen no one else mention so far: people post content on the web that they don’t own all the time. No one has to prove ownership to post anything.
Someone who publishes my work as their own (theft) or republishes my work (like quoting or linking back) doesn’t have the right to make the choice for me to let my content be used for training AI. This is where I struggle the most with the “opt-out” style of AI training on the web.
Whether reposting my content elsewhere is in good faith or not, it is now up to someone other than me to decide whether or not to disallow AI training web crawlers in their robots.txt file. To add insult to injury, that person may not have the knowledge, or even the power, to do so if they’re posting content they don’t own on a site they also don’t own, like social media.
I can play whac-a-mole with those bots on servers I control—which I don’t like doing, for the record—but I have none of that control anywhere else.
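For anyone who hasn’t seen what that opt-out looks like in practice, here is a minimal sketch using Python’s standard `urllib.robotparser`. The robots.txt contents, the `allowed` helper, and the example.com URL are all made up for illustration; GPTBot and CCBot are just two example crawler names, not a complete list.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt for a site I control. The crawler names are
# examples only; a real list would be longer and keeps changing.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: *
Allow: /
"""


def allowed(user_agent: str, url: str, robots_txt: str = ROBOTS_TXT) -> bool:
    """Return True if these robots.txt rules permit user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    # A crawler that honors the file would skip the page.
    print(allowed("GPTBot", "https://example.com/my-post"))
    # Everyone else falls through to the wildcard rule.
    print(allowed("Mozilla/5.0", "https://example.com/my-post"))
```

Two things stand out. Every new crawler needs its own entry, which is the whac-a-mole part. And the file only covers the site it lives on, and only for crawlers that choose to honor it, so it does nothing for a copy of my work sitting on someone else’s server.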