A Quick Stop at the HostileShop
HostileShop creates a simulated web shopping environment where an attacker agent LLM attempts to manipulate a target shopping agent LLM into performing unauthorized actions. Crucially, HostileShop does not use an LLM to judge attack success. Instead, success is determined automatically and immediately by the framework, which reduces costs and enables rapid continual learning by the attacker LLM.
HostileShop is best at discovering prompt injections that induce LLM Agents to make improper "tool calls". In other words, HostileShop finds the magic spells that make LLM Agents call functions that they have available to them, often with the specific input of your choice.
HostileShop is also capable of enhancement and mutation of "universal" jailbreaks. This allows cross-LLM adaptation of universal jailbreaks that are powerful enough to make the target LLM become fully under your control, for arbitrary actions. This also enables public jailbreaks that have been partially blocked to work again, until they are more comprehensively addressed.
I created HostileShop as an experiment, but continue to maintain it to let me know if/when LLM agents finally become secure enough for use in privacy preserving systems, without the need to rely on oppressive levels of surveillance.