Analyzing the underlying mechanism of suffix-based LLM jailbreaks, we find it relies on aggressively hijacking the model context 🥷, which intensifies with the suffix’s universality. Exploiting this, we enhance and mitigate existing attacks.
Through introducing a strong, new SEO attack ⛽💡, we extensively evaluate widely-used embedding-based retrievers’ susceptibility to SEO attacks via corpus poisoning, linking it to key properties in embedding space.
We propose an efficient attack against neural tabular classifiers for automatic robustness evaluation, addressing attacker objectives such as feasibility (via incorporation of database constraints) and cost-efficiency.