If you don’t mind, I’d be interested to see the images you used. The broad validation tests I’ve done suggest 80–90% accuracy overall, but there are some specific categories (anime, for example) where it performs pretty poorly. If your test samples have something in common, it would be good to know so I can work on a fix.
I find it very funny that people are so concerned about false positives. Models like these should really only be used as a screening tool: catch things and flag them for human review. In that context, a false positive just costs a reviewer a little time, while a false negative slips through unexamined, so false positives seem less bad than false negatives (although people seem to demand zero error in either direction, and that’s just silly).
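To make the screening idea concrete, here’s a minimal sketch of how you’d tune a flagging threshold to favor recall over precision, i.e. deliberately accept extra false positives so fewer true cases slip past human review. The function name, scores, and labels are all made up for illustration, not part of the actual model:

```python
def pick_screening_threshold(scores, labels, min_recall=0.95):
    """Return the highest score threshold whose recall on the
    positive class is at least min_recall.

    scores: model confidence per sample (higher = more likely positive)
    labels: 1 for a true positive sample, 0 otherwise
    A lower threshold flags more samples for review: recall goes up,
    false positives go up too -- the screening trade-off.
    """
    # Walk candidate thresholds from strictest to loosest; the first
    # one that meets the recall target is the highest usable one.
    for t in sorted(set(scores), reverse=True):
        flagged = [s >= t for s in scores]
        tp = sum(f and y for f, y in zip(flagged, labels))
        fn = sum((not f) and y for f, y in zip(flagged, labels))
        recall = tp / (tp + fn) if (tp + fn) else 1.0
        if recall >= min_recall:
            return t
    return 0.0  # no threshold meets the target: flag everything


# Toy data: three positives, three negatives.
scores = [0.95, 0.80, 0.60, 0.40, 0.30, 0.10]
labels = [1, 1, 1, 0, 0, 0]

# Demanding perfect recall forces the threshold down to 0.60,
# even though that would also flag anything scoring just below it.
print(pick_screening_threshold(scores, labels, min_recall=1.0))
```

The point of the sketch is just that "zero false negatives" is a knob you turn, and turning it necessarily buys you more false positives for the reviewers to clear.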