Child safety org launches AI model trained on real child sex abuse images

Stopthatgirl7@lemmy.world · 3 months ago


FourPacketsOfPeanuts@lemmy.world · edited 3 months ago

Wow I had no idea

It sounds like a much needed improvement then!

Any idea if PhotoDNA needs training sets to the same extent AI does? It still feels like training current LLM models, at least how I think they work, requires vast amounts of “examples”.

It still feels like that amounts to putting huge amounts of csam just “out there” with tech companies. If it saves a bunch of human moderators the toil of having to review quite so much, then that’s definitely a great help. But can you say anything about the comparative scale of the content involved? My impression is that previous versions of something like photoDNA would need a set of something for testing purposes. But the quantity needed to train AI is going to be vastly bigger (and therefore a possible leak vastly worse?)

schizo@forum.uncomfortable.business · 3 months ago

> comparative scale of the content involved

PhotoDNA is based on image hashes, as well as some magic that works on partial hashes: resizing the image, or changing the focus point, or fiddling with the color depth or whatever won’t break a PhotoDNA identification.

But, of course, that means for PhotoDNA to be useful, the training set is literally ‘every CSAM image in existence’, so it’s not really like you’re training on a lot less data than an AI model would want or need.

The big safeguard, such as it is, is that you basically only query an API with an image and it tells you if PhotoDNA has it in the database, so there’s no chance of the training data being shared.

Of course, there’s also no reason you can’t do that with an AI model, either, and I’d be shocked if that’s not exactly how they’ve configured it.
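To make the ‘resizing or recoloring won’t break the match’ idea concrete, here’s a toy sketch. To be clear, this is not PhotoDNA’s algorithm (as far as I know its details aren’t public); it’s a generic ‘average hash’ over a made-up grayscale gradient, and every name in it (`average_hash`, `hamming`, `is_match`, the `max_distance` threshold) is just something I picked for illustration.

```python
def average_hash(pixels, grid=4):
    """Toy perceptual hash (NOT PhotoDNA): block-average a grayscale image
    down to grid x grid cells, then emit one bit per cell that is brighter
    than the mean of all cells."""
    h, w = len(pixels), len(pixels[0])
    cells = []
    for gy in range(grid):
        for gx in range(grid):
            ys = range(gy * h // grid, (gy + 1) * h // grid)
            xs = range(gx * w // grid, (gx + 1) * w // grid)
            block = [pixels[y][x] for y in ys for x in xs]
            cells.append(sum(block) / len(block))
    mean = sum(cells) / len(cells)
    return [1 if c > mean else 0 for c in cells]


def hamming(a, b):
    """Number of bit positions where two hashes differ."""
    return sum(x != y for x, y in zip(a, b))


def is_match(a, b, max_distance=2):
    """Assumed matching rule: hashes within a small distance count as a hit."""
    return hamming(a, b) <= max_distance


def gradient(size):
    """Synthetic test image: a left-to-right brightness ramp."""
    return [[(x * 255) // (size - 1) for x in range(size)] for _ in range(size)]


original = gradient(16)
brighter = [[min(255, p + 20) for p in row] for row in original]  # brightness tweak
upscaled = gradient(32)                                           # same picture, 2x size
mirrored = [row[::-1] for row in original]                        # a different picture

base = average_hash(original)
print(hamming(base, average_hash(brighter)))  # 0
print(hamming(base, average_hash(upscaled)))  # 0
print(hamming(base, average_hash(mirrored)))  # 16
```

The point is just that the hash only keeps a coarse ‘which regions are brighter than average’ fingerprint, so small edits land on the same (or a nearby) hash, and you can’t rebuild the image from 16 bits.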

FourPacketsOfPeanuts@lemmy.world · 3 months ago

Ok. I mean I have no idea how government agencies organise this. If these are exceptional circumstances where a system needs exposing to “every csam image ever” then I would reasonably assume that justifies the one off cost of making the circumstances exceptionally secure. It’s not like they’re doing that every day.

You raise a separate important point. How is this technology actually used in practice? It sounds like photoDNA, being based on hashes, is a strictly one-way thing: information is destroyed in producing the hashes, and the result can only be used to score the likelihood that a new image is csam. But AI is not like that. The input images are used to train a model, and while the original images don’t exist in there, the model distills the ‘essence’ of those photos into an encoded form. As such, an AI model can be used both for detection and generation.

All this to say, perhaps there are ways for photoDNA to be embedded in systems safely, so that suspect csam doesn’t have to be transmitted elsewhere. But I don’t think an AI model of that type is safe to deploy anywhere. It feels like it would be too easy for the unscrupulous to engineer the AI to generate csam instead of detecting it. So I would guess the AI solution would mean hosting the model in one secure place and suspect images having to be sent to it. But is that really scalable? It’s a huge amount of suspect images from all sorts of messaging platforms we’re talking about.

schizo@forum.uncomfortable.business · 3 months ago

> AI model of that type is safe to deploy anywhere

Yeah, I think you’ve made a mistake in thinking that this is going to be usable as generative AI.

I’d bet $5 this is just a fancy machine learning algorithm that takes a submitted image, does machine learning nonsense with it, and returns a ‘there is a high probability this is an illicit image of a child’, and not something you could use to actually generate CSAM with.

You want something that’s capable of assessing the similarities between a submitted image and a group of known bad images, but that doesn’t mean the dataset is in any way usable for anything other than that one specific task - AI/ML in use cases like this is super broad and has been a thing for decades before the whole ‘AI == generative AI’ thing became what everyone is thinking.

But, in any case: the PhotoDNA database is in one place and access to it is scaled by the merit of uh, lots of money?

And of course, any ‘unscrupulous engineer’ with plans to misuse this is probably not a complete idiot, even if a pedo: whoever hosts this is going to have shockingly good access controls and logging and, well, if you’re in the US and the dude takes this database and generates a couple of CSAM images using it, the penalty is, for most people, spending the rest of their life in prison.

Feds don’t fuck around with creation or distribution charges.

FourPacketsOfPeanuts@lemmy.world · 3 months ago

Yes true

barsoap@lemm.ee · 3 months ago

> Yeah, I think you’ve made a mistake in thinking that this is going to be usable as generative AI.

Possibly not on its own, but that’s not really the issue: once you have a classifier, you can use its judgements to train a generator. PhotoDNA faces the same issue, which is why it’s not available to the general public.
