ORES Explorer

An Interactive Visual Playground for Wikipedia's AI Models

Project Background

Wikipedia, the largest online encyclopedia, is maintained by a community of volunteer editors. The open-collaboration environment allows people around the world to contribute, but it also exposes the platform to risks such as vandalism. ORES is a web service and API developed by the Wikimedia Scoring Platform team that automatically evaluates the quality and intent of edits using machine learning (ML).

ORES is used by volunteer tool developers around the world to build tools that support anti-vandalism and edit review work on Wikipedia. However, some tool developers have had difficulty understanding and effectively using the ORES models. Due to inappropriate configuration, some tools have caused serious problems in the Wikipedia ecosystem, such as reverting large numbers of good edits.

Project Scope

I worked with Dr. Haiyi Zhu's research team to develop an educational visualization website for the ORES ML system. The project goal is to help tool developers learn ML concepts in context, recognize inherent value trade-offs, and make decisions aligned with their goals when using ORES.

My Role

This is my BHCI capstone project, and I worked as both a UX Researcher and Designer. Later, I also leveraged my web development experience to prototype the visualizations using real data.

Project Time: Jan 2020 - May 2020 (BHCI capstone), August 2020 - Now (Independent Study)

Team: Dr. Haiyi Zhu Research Team

Final Solution Preview

ORES Explorer

An educational visual playground for the ORES API

Target Users

Tool developers who use the ORES API to build vandalism detection and edit review tools on Wikipedia

User Journey

introduction

threshold-explorer

feature injection


Impact

Understanding Of Trade-offs

ORES Explorer significantly improved users' understanding of the value trade-offs (e.g. false positive rate & false negative rate, accuracy & fairness) in the ORES system.

“Slide it more to the left, but at the same time, you're also going to flag more good edits with that as well.”

Informed Decision Making

Users were able to make more informed decisions on model selection and threshold configuration based on the information gathered.

“It's better to have a false positive than a false negative … a lot of the times false negatives will sneak through and they'll sit in an article for weeks or months before somebody goes, "Hey, this doesn't sound right", and actually goes about changing it.”

Trust In AI

Seeing information about model trade-offs and fairness helped users develop more trust in ORES.

“Now that I can actually visualize what is actually happening with this data set, I could just tell that it actually does work. And it's not just voodoo magic occurring behind the API.”

Process - Research

Overview

Research Goal: Understand the current experiences of people who design, use, and are affected by ORES-based tools.

Secondary Research

Goal

Develop a better understanding of the model and how it is used in the development of anti-vandalism tools on Wikipedia.

Methods

Literature review, tool analysis, and data analysis of the model's training and testing data

Insight 01 - How does ORES work?

For each edit, ORES outputs a damaging score (the likelihood that the edit causes damage) and a goodfaith score (the likelihood that the edit was saved in good-faith).
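As a rough illustration of working with these two scores, the sketch below pulls them out of a response-shaped dictionary. The dictionary is hand-written, not real ORES output: its nesting is an assumption modeled on the ORES v3 scores response, and the revision ID, the numbers, and the helper name `extract_scores` are hypothetical.

```python
# Hand-written sample, NOT real ORES output. The nesting is an assumption
# modeled on the ORES v3 scores response; the revision ID and numbers are made up.
SAMPLE_RESPONSE = {
    "enwiki": {
        "scores": {
            "123456": {
                "damaging": {
                    "score": {
                        "prediction": False,
                        "probability": {"false": 0.92, "true": 0.08},
                    }
                },
                "goodfaith": {
                    "score": {
                        "prediction": True,
                        "probability": {"false": 0.03, "true": 0.97},
                    }
                },
            }
        }
    }
}


def extract_scores(response, wiki, rev_id):
    """Return (damaging, goodfaith) probabilities for one revision.

    Hypothetical helper: it assumes the nesting used in SAMPLE_RESPONSE.
    """
    models = response[wiki]["scores"][str(rev_id)]
    damaging = models["damaging"]["score"]["probability"]["true"]
    goodfaith = models["goodfaith"]["score"]["probability"]["true"]
    return damaging, goodfaith


damaging, goodfaith = extract_scores(SAMPLE_RESPONSE, "enwiki", 123456)
print(f"damaging={damaging:.2f}, goodfaith={goodfaith:.2f}")
```

In this made-up sample, the edit is unlikely to be damaging and very likely to have been saved in good faith.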

Insight 02 - How to use ORES?

When using ORES, developers have an opportunity to tweak the model by setting the threshold (which determines the prediction). For example, for the damaging model, if the threshold is set to 0.5, any edits with scores above 0.5 will be predicted as damaging.
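The thresholding rule above can be sketched in a few lines. This is a minimal illustration, not ORES code: the function name `predict_damaging` and the sample scores are made up.

```python
def predict_damaging(score: float, threshold: float = 0.5) -> bool:
    """Predict an edit as damaging when its score is above the threshold."""
    return score > threshold


# Made-up damaging scores for four edits.
scores = [0.12, 0.50, 0.51, 0.93]

# With the 0.5 threshold from the example above.
print([predict_damaging(s) for s in scores])

# A stricter threshold flags fewer edits.
print([predict_damaging(s, threshold=0.9) for s in scores])
```

Raising the threshold means fewer edits are predicted as damaging; lowering it means more are.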

Insight 03 - What are some current ORES-based tools?

Generally, there are two types of tools:
- Fully automated bots that automatically revert damaging edits.
- Semi-automated review tools that surface damaging / bad-faith edits to facilitate human reviewers.

Insight 04 - Any known issues / challenges?

Developers usually have two goals when developing an anti-vandalism tool: 1) reducing the number of good edits misclassified as damaging (low False Positive Rate), and 2) catching as many damaging edits as possible (low False Negative Rate).
However, research has shown that there is a natural trade-off between these two metrics: optimizing one often leads to poorer performance on the other. Thus, developers need to decide what to prioritize based on the type of tool that they are developing.
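To make the trade-off concrete, here is a small sketch that computes both rates at three thresholds. The ten scored edits are invented for illustration and are not drawn from ORES data; `error_rates` is a hypothetical helper.

```python
def error_rates(scored_edits, threshold):
    """Return (false_positive_rate, false_negative_rate) at a threshold.

    scored_edits is a list of (damaging_score, is_actually_damaging) pairs.
    An edit is flagged as damaging when its score is above the threshold.
    """
    good = [score for score, is_damaging in scored_edits if not is_damaging]
    bad = [score for score, is_damaging in scored_edits if is_damaging]
    false_positives = sum(1 for score in good if score > threshold)
    false_negatives = sum(1 for score in bad if score <= threshold)
    return false_positives / len(good), false_negatives / len(bad)


# Invented toy data: five good edits and five damaging edits.
EDITS = [
    (0.05, False), (0.20, False), (0.35, False), (0.55, False), (0.70, False),
    (0.30, True), (0.60, True), (0.80, True), (0.90, True), (0.95, True),
]

for threshold in (0.25, 0.5, 0.75):
    fpr, fnr = error_rates(EDITS, threshold)
    print(f"threshold={threshold}: FPR={fpr:.1f}, FNR={fnr:.1f}")
```

On this toy data, moving the threshold from 0.25 to 0.75 takes the False Positive Rate from 0.6 down to 0.0 while the False Negative Rate climbs from 0.0 to 0.4.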
The ORES model tends to be more aggressive toward edits made by newcomers and anonymous editors. Such disparity might cause problems such as low newcomer retention.

Primary Research

Goal

Understand the context and problems from the perspectives of ORES users, Wikipedia reviewers, and editors.

Methods

User Interviews

Consolidated Insights

We used an affinity diagram to synthesize our insights from interviews.


Understanding & Using ORES

It can be challenging for developers without a machine learning background to work with ORES.
Developers want more transparent information on ORES models, feature definitions, and the data collection process.

Developing ORES-based tools

Developers adopt different design strategies when developing tools.
Many developers found internal problems with the ORES system in the tool development process. A place to collect their feedback is needed.
Developers found it useful but challenging to work with different models and thresholds.

Using ORES-based tools

Most tools prioritized edit quality and review efficiency.
Newcomers’ experience is often overlooked, even though they are the ones who need the most help.
Features such as filtering and collaboration support are commonly requested by the editors.

Project Goal & Design Objectives

Find intuitive ways to explain the affordances and trade-offs of ORES models so that Wikipedia tool developers can build better tools with ORES. This includes, but is not limited to:

Teach digestible ML concepts in the context of Wikipedia.
Facilitate understanding of trade-offs in threshold setting.
Help tool builders choose the right threshold for their tools.
Surface group performance disparity (bias) in the model's predictions.

Usability - Feasibility Matrix

We mapped all the ideas on the Usability & Feasibility Matrix to evaluate their costs and impacts. We were particularly interested in ideas in the top-right corner (high usability and feasibility).


Storyboard & Speed dating

We then explored how to construct a complete user experience from the ideas we came up with. We created storyboards for each idea and tested them using speed dating. Our three main ideas were:

Interactive Quiz
Interactive Guideline
Chatbot

Interactive Guideline received the most positive feedback of the three ideas, so we chose it as the foundation for the product's user flow.

Process - Prototyping

We then created low-, mid-, and high-fidelity prototypes of the design, each followed by a round of user testing.

Low-fidelity prototyping

User Flow:

Read general information about ORES
Choose a development goal (bot or semi-automated review tool)
Explore threshold trade-offs and model performance disparity using interactive visualizations
Get tool development suggestions


Specific visualizations


Mid-fidelity Prototyping

Mid-fidelity Prototyping Evaluation

Methods

Think Aloud Protocol
Co-Design

Results

Main Insights

Users without ORES development experience or an ML background take longer to understand the data and thus need more guidance.
Users think that the language should progress from colloquial to formal.
Users would prefer direct suggestions on the model threshold based on the type of tool that they are building.
Users think it would be helpful to point out the bias (performance disparity) explicitly as well as include information on how to mitigate the bias.
Learning that the model is biased didn’t discourage users from using the model. Quote: “All models are known to be biased.”

Notes

Threshold Visualization

Threshold Calculator

Bias Mitigation


Final Product - ORES Explorer

01 About ORES

A landing page that provides a basic overview of the ORES system and introduces some essential ML concepts in the context of Wikipedia

02 Threshold Visualization

An interactive visualization that allows users to explore the model performance under different threshold settings and perceive trade-offs between different evaluation metrics (accuracy, false positive rate, false negative rate)

Interactive prototype (built with D3.js, React.js)

03 Threshold Recommender

The Threshold Recommender directly recommends a threshold based on the type of tool that users are building (e.g., a higher threshold for bots, a lower threshold for human review tools).
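One way such a recommendation could work is sketched below. This is not how the ORES Explorer recommender is implemented: the selection rule, the 5% error cap, the tool type names, and the toy data are all assumptions made for illustration. A bot gets the lowest threshold that keeps the False Positive Rate under the cap, and a review tool gets the highest threshold that keeps the False Negative Rate under it.

```python
def error_rates(scored_edits, threshold):
    """Return (false_positive_rate, false_negative_rate) at a threshold."""
    good = [score for score, is_damaging in scored_edits if not is_damaging]
    bad = [score for score, is_damaging in scored_edits if is_damaging]
    fpr = sum(1 for score in good if score > threshold) / len(good)
    fnr = sum(1 for score in bad if score <= threshold) / len(bad)
    return fpr, fnr


def recommend_threshold(tool_type, scored_edits, candidates, max_error=0.05):
    """Hypothetical rule: cap the error type that hurts this tool the most.

    Returns None when no candidate threshold satisfies the cap.
    """
    if tool_type == "bot":
        # Bots revert automatically, so wrongly flagged good edits are costly.
        ok = [t for t in candidates if error_rates(scored_edits, t)[0] <= max_error]
        return min(ok) if ok else None
    if tool_type == "review":
        # Humans double-check flagged edits, so missed vandalism is costlier.
        ok = [t for t in candidates if error_rates(scored_edits, t)[1] <= max_error]
        return max(ok) if ok else None
    raise ValueError(f"unknown tool type: {tool_type!r}")


# Invented toy data: (damaging_score, is_actually_damaging).
EDITS = [
    (0.05, False), (0.22, False), (0.35, False), (0.55, False), (0.68, False),
    (0.32, True), (0.62, True), (0.81, True), (0.90, True), (0.95, True),
]
CANDIDATES = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]

print("bot:", recommend_threshold("bot", EDITS, CANDIDATES))
print("review:", recommend_threshold("review", EDITS, CANDIDATES))
```

On this toy data, the bot ends up with a higher threshold than the review tool, matching the direction of the recommendation described above.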

04 Group Disparity Visualizer

Our research shows that there is a disparity in the model's performance between edits from different editor groups, e.g., newcomers and experienced editors, or logged-in editors and anonymous editors. The Group Disparity Visualizer helps surface this issue by allowing users to see and compare the model’s performance on different groups of edits using a sample dataset.
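The kind of comparison the visualizer surfaces can be sketched as a per-group False Positive Rate. The data below is invented to illustrate the computation and does not reflect measured ORES performance; the group labels and the helper name are hypothetical.

```python
from collections import defaultdict


def false_positive_rate_by_group(edits, threshold):
    """Return {group: share of good edits flagged as damaging}.

    edits is a list of (damaging_score, is_actually_damaging, group) tuples.
    """
    flagged = defaultdict(int)
    total = defaultdict(int)
    for score, is_damaging, group in edits:
        if is_damaging:
            continue  # only good edits can become false positives
        total[group] += 1
        if score > threshold:
            flagged[group] += 1
    return {group: flagged[group] / total[group] for group in total}


# Invented toy data: (damaging_score, is_actually_damaging, editor_group).
EDITS = [
    (0.30, False, "newcomer"), (0.55, False, "newcomer"),
    (0.65, False, "newcomer"), (0.40, False, "newcomer"),
    (0.85, True, "newcomer"),
    (0.10, False, "experienced"), (0.20, False, "experienced"),
    (0.45, False, "experienced"), (0.52, False, "experienced"),
    (0.75, True, "experienced"),
]

for group, fpr in false_positive_rate_by_group(EDITS, threshold=0.5).items():
    print(f"{group}: FPR={fpr:.2f}")
```

In this made-up sample, good edits from newcomers are flagged twice as often as good edits from experienced editors at the same threshold.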

Interactive prototype (built with D3.js, React.js)

05 Feature Injection

To help users better understand how ORES makes different decisions based on the experience of editors (which results in a bias against newcomers and anonymous editors), feature injection allows users to select an edit and see how the prediction changes if it is made by a different editor.
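The idea can be sketched with a stand-in model. Everything below is hypothetical: `toy_damaging_score` is a made-up logistic function, not the ORES model, and the feature names and weights are invented. The point is only to show how holding the edit fixed and swapping the editor-related features changes the score.

```python
import math


def toy_damaging_score(features):
    """Made-up scoring function standing in for a real model (NOT ORES)."""
    z = (
        1.5 * features["chars_removed_ratio"]
        + 1.0 * (1 if features["is_anonymous"] else 0)
        - 0.8 * math.log10(1 + features["editor_account_age_days"])
    )
    return 1 / (1 + math.exp(-z))  # squash to a 0..1 score


def inject(features, **overrides):
    """Return a copy of the features with some values replaced."""
    return {**features, **overrides}


# One edit, as made by an anonymous editor with no account history.
edit = {
    "chars_removed_ratio": 0.4,
    "is_anonymous": True,
    "editor_account_age_days": 0,
}

# The same edit, with the editor features swapped for an experienced editor.
as_experienced = inject(edit, is_anonymous=False, editor_account_age_days=3650)

print(f"original editor:    {toy_damaging_score(edit):.2f}")
print(f"experienced editor: {toy_damaging_score(as_experienced):.2f}")
```

Because the content of the edit is unchanged, any difference between the two scores comes from the editor features alone.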

Interactive Prototype

Evaluation

Method

(done during Summer 2020 by the other team members)

To evaluate the effectiveness of our website, we designed a think-aloud interview protocol incorporating a design task. We recruited participants from both inside and outside the Wikipedia community. In the design task, participants first chose an ORES-based tool they would like to build in a real-world environment (two options: a review tool that flags edits for a human reviewer, or a bot that automatically reverts edits). Participants then interacted with the visualizations, made decisions on the model threshold, and voiced their thoughts throughout the process.