Harassment Filter

Moderation

Definition

The harassment filter is an optional Reddit safety setting that automatically routes comments likely to be harassing into a community's mod queue for review. Reddit describes it as powered by a large language model trained on moderator actions and on content removed by the platform's internal safety and enforcement teams, so it tries to predict which comments human moderators would treat as abusive.

The filter matters as one of the clearer examples of AI being deployed directly inside Reddit's moderation pipeline. Instead of relying solely on user reports after harm is visible, it attempts to catch likely harassment before it gains traction, easing the emotional and time burden on volunteer moderators. It also raises the questions that follow any automated content classifier: the model can misjudge tone, sarcasm or reclaimed language, and it filters for human review rather than removing content automatically, an acknowledgement that judgement about harassment still needs a person in the loop.

Sources

01
Harassment filter — Reddit Help
Official / Reddit2024
02
Safety Filters — Reddit Help
Official / Reddit2024