Prototyping AI-driven systems has become much easier: with a basic understanding of programming and a few hours of work, it is possible to create a chatbot for taking notes, an editor that generates images from text, or a tool for summarising customer comments. Machine learning (ML) systems, however, can encode societal biases and raise safety concerns, which can be discovered and validated through behavioral evaluation, or testing. Many popular behavioral evaluation tools do not support the models, data, or behaviors that real-world practitioners typically deal with. Model evaluation is also difficult in itself, because aggregate metrics give only a rough estimate of performance and can hide failures on specific kinds of input. As a result, practitioners often have to manually test hand-picked cases gathered from users and stakeholders in order to evaluate models properly and select the best version to deploy.
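To illustrate why an aggregate metric is only a rough estimate, here is a minimal sketch that compares overall accuracy with accuracy computed per data slice. Everything in it is an assumption made for illustration: the toy records, the `lang` field used for slicing, and the helper names `accuracy` and `accuracy_by_slice` are hypothetical and do not come from any particular evaluation tool.

```python
from collections import defaultdict


def accuracy(records):
    """Fraction of records whose prediction matches the label."""
    return sum(r["label"] == r["pred"] for r in records) / len(records)


def accuracy_by_slice(records, key):
    """Group records by the value of `key` and compute accuracy per group."""
    groups = defaultdict(list)
    for r in records:
        groups[r[key]].append(r)
    return {name: accuracy(rs) for name, rs in groups.items()}


# Hypothetical toy data: sentiment predictions on customer comments,
# tagged with the language of the comment.
records = (
    [{"lang": "en", "label": 1, "pred": 1}] * 8    # English, correct
    + [{"lang": "en", "label": 0, "pred": 0}] * 8  # English, correct
    + [{"lang": "es", "label": 1, "pred": 0}] * 3  # Spanish, wrong
    + [{"lang": "es", "label": 1, "pred": 1}] * 1  # Spanish, correct
)

overall = accuracy(records)
per_slice = accuracy_by_slice(records, "lang")

print(f"overall: {overall:.2f}")
for name, acc in sorted(per_slice.items()):
    print(f"{name}: {acc:.2f}")
```

On this toy data the overall accuracy looks acceptable while the much smaller Spanish slice performs far worse, which is the kind of gap that testing on specific, hand-picked cases is meant to surface.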
