Worthy of notice/Draft

Introduction

Wikipedia is an online encyclopedia written and edited entirely by volunteers. Alternately lauded and criticized for its unusual structure, the site is nonetheless widely read, frequently appearing first in Google search results. The present research concerns Wikipedia's future. Recent data indicate that the number of contributors to Wikipedia sharply declined in 2009 while its readership continued to rise (Angwin & Fowler, 2009). This growing imbalance suggests several editorial problems for the world's most accessible encyclopedia: factual and mechanical errors will linger, the range of topics will be limited, and biases are likely to remain unchallenged.

This paper marks the beginning of a multi-stage investigation of Wikipedia's changing editorial population. These early stages involve bounding the field of study, identifying units of analysis, generating productive research questions, and prototyping a new tool for conducting software-assisted research. The overarching goal is to produce a flexible data set compatible with both quantitative and qualitative analyses. Along the way, I will also consider some of the ethical questions that arise in the course of studying implicitly public online events.

Methodology

The norms, discourse, and practices that make up the Wikipedia phenomenon are inextricably bound up in software. Accordingly, software should play a central role in the methodological framework of the present research.[1] First, I need to understand the day-to-day technical practices of a typical Wikipedia editor (if such a person can be imagined to exist). Second, I will propose a semi-autonomous research tool to assist in archiving and reconstructing Wikipedia events. Finally, I plan to undertake the initial development of this tool myself as an intimate ethnographic exploration of the code and protocols on which Wikipedia relies.

This project relies on a significant amount of preliminary fieldwork to identify bounded units of analysis and establish a set of actionable research questions. Guimarães describes this essentially grounded process as a purposeful "hanging around" in potential sites of study (146). By making technical mistakes, committing (mild) social faux pas, and humbly asking for help, the researcher gradually brings to light the boundaries, norms, and unresolved conflicts within the field. For example, though Wikipedia is subject to constant revision, deletion procedures can last several weeks and unfold in obscure corners of the site. Without regularly "hanging around" during the last few months, I might have missed this critical phenomenon altogether.

Software development as an ethnographic process

Before discussing the details of my preliminary fieldwork, I want to take a moment to consider the ethnographic value of building research tools. Although social scientists have long used software to assist with a variety of methodologies, they have tended to rely either on general-purpose tools such as ATLAS.ti or on collaborators from the fields of mathematics and computer science. For studies that concern online phenomena such as Wikipedia, however, the experience of developing specifically tailored assistive software provides a potentially powerful point of entry to the field.

Just as anthropologists, historians, and critical theorists must learn the spoken and written languages of their chosen fields, so should the ethnographer engage materially with the various technologies she encounters. It is not necessary to achieve the mythological standard of "near-native competency" to begin to benefit from writing and speaking about code (Sandvig, 2009, 147). In this sense, the tenderfoot programmer has an advantage over the intermediate. By blundering around, making beginner mistakes, and seeking assistance, the newbie can learn the norms and vocabulary of a given technological phenomenon without having to demonstrate any practical expertise.

Wikipedia, like many contemporary social phenomena, relies on an open source software platform deployed in a semi-public environment. This trend provides a particularly rich learning opportunity for the curious social scientist. Researchers can download, install, and play with their own instances of Wikipedia's underlying software at little to no upfront cost. Furthermore, the platform exposes considerable data to the public through an Application Programming Interface (API). By learning the relatively simple syntax with which the API communicates, researchers can speculate about the underlying data models that structure and govern the social interactions on Wikipedia.[2]

Preliminary Fieldwork

I conducted preliminary fieldwork on the English-language Wikipedia from February to May of 2010. This research took several forms including contributing text and photos to various Wikipedia articles, learning to customize my "Userpage" [3], tracing the history of Wikipedia's policy documents, silently observing the deletion of several articles, writing a small prototype tool for gathering data from the MediaWiki API, and conducting a telephone interview with a Wikipedia critic. Although this grounded work prepared me to design a more narrowly-tailored future study, I plan to continue my open-ended engagement with the site. At times, I found myself overwhelmed by the diversity of practices and seemingly contradictory norms I encountered. In order to speak with any confidence about the characteristics of the editorial population, I need to remain an active editor in touch with the day-to-day demands of the site.

Who writes Wikipedia?

The event that sparked the current research was a report in the Wall Street Journal that Wikipedia was "losing editors" (Angwin & Fowler, 2009). To understand this claim that the editorial population is undergoing a change, I inquired first after previous research about Wikipedia's contributors. The literature suggests three dominant theories about the character of the editorial population. The first two theories, established between 2002 and 2006, arose out of a lively disagreement about the best method for quantifying an individual's contribution. The third theory, proposed this year, considers the editorial influence of autonomous and semi-autonomous software agents.

The Wikipedians write Wikipedia

Early coverage of Wikipedia in the news media imagined thousands and thousands of anonymous users contributing one or two sentences each. Countless people add two or three random sentences and somehow, the fascinated reporters mused, an encyclopedia emerges! Jimbo Wales, Wikipedia's founder, believed instead that a committed core group of a few hundred "Wikipedians" writes the majority of Wikipedia content. This group of tireless volunteers is supported by many small contributions from the swarm of anonymous and infrequent editors. To test this belief, he counted the number of edits made by each registered user on Wikipedia. Consistent with his hypothesis, Wales found that over 70% of the edits were made by just 2% of the users (Swartz, 2007).[4]
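Because Wales did not publish his method, the following is only a minimal sketch of what an edit-count measure of this kind might look like. The function name, the 2% threshold parameter, and the sample log are illustrative assumptions, not a reconstruction of his analysis.

```python
from collections import Counter

def share_of_edits_by_top_users(edit_log, top_fraction=0.02):
    """Return the fraction of all edits made by the most active users.

    edit_log is an iterable of usernames, one entry per edit.
    """
    counts = Counter(edit_log)
    if not counts:
        return 0.0
    # Always include at least one user in the "top" group.
    top_n = max(1, int(len(counts) * top_fraction))
    top_edits = sum(n for _, n in counts.most_common(top_n))
    return top_edits / sum(counts.values())

# Hypothetical log: one prolific editor and 49 one-time contributors.
sample_log = ["Prolific"] * 151 + [f"User{i}" for i in range(49)]
print(share_of_edits_by_top_users(sample_log))
```

A measure like this counts every edit equally, whether it adds a paragraph or fixes a comma, which is the limitation Swartz later addressed.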

Despite a lack of methodological transparency, Wales' findings were widely repeated both among users and in such venues as the New Yorker (Schiff, 2006). One possible reason for the uncritical acceptance of this data is that it describes Wikipedia in terms familiar to the rhetoric of the American frontier. According to Wales' narrative, Wikipedia is not the aggregate result of thousands of small anonymous contributions but rather the product of a few hundred volunteers engaged in humble, thankless toil. For anyone threatened by the notion that the internet might enable radically decentralized popular productivity, professional journalists included, Wales' data is a relief. Wikipedia is an oligarchy just like everything else; the numbers prove it.

The Wikipedians edit Wikipedia

In response to Wales' study, Swartz developed a new method for measuring contributions to Wikipedia (2006).[5] Rather than count the overall number of edits made by a user, Swartz' study began with a sample of 200 articles. He analyzed each article's history and quantified the impact of each editor by measuring the number of characters they personally contributed that remained in the article's latest revision. For his admittedly small sample, N=200, Swartz found that the most substantive contributions were made by non-habitual users. Rather than contribute the core content of an article, as Wales suggested, registered users with thousands of edits (the Wikipedians) tended to address issues of overall polish and consistency: spelling, formatting, and tone.
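One way to approximate a "surviving characters" measure is sketched below. It is not Swartz' own code: the use of Python's difflib for character-level comparison, the function name, and the sample history are all assumptions made for illustration.

```python
from collections import Counter
from difflib import SequenceMatcher

def surviving_characters(revisions):
    """Count characters in the latest revision attributable to each editor.

    revisions is a chronological list of (editor, text) pairs.
    """
    prev_text, authors = "", []  # authors[i] is credited with prev_text[i]
    for editor, text in revisions:
        matcher = SequenceMatcher(None, prev_text, text, autojunk=False)
        new_authors = []
        for tag, i1, i2, j1, j2 in matcher.get_opcodes():
            if tag == "equal":
                # Unchanged characters keep their original author.
                new_authors.extend(authors[i1:i2])
            else:
                # Inserted or replaced characters go to the current editor;
                # deletions contribute nothing because j2 - j1 is zero.
                new_authors.extend([editor] * (j2 - j1))
        prev_text, authors = text, new_authors
    return Counter(authors)

# Hypothetical two-revision history.
history = [("Anonymous", "hello world"), ("Wikipedian", "hello world!!")]
print(surviving_characters(history))
```

Under this approach an editor who makes many small corrections accumulates few surviving characters, while a single substantial contribution by an infrequent user scores highly.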

Swartz' discussion of his findings foreshadows the present research. Calling the status quo "dangerous", he worried that Wikipedia's governance structure failed to include the voices of infrequent but substantial contributors (2006). Over time, he cautioned, this arrangement would alienate occasional contributors and might even lead to policies that discourage non-habitual participation.

The Wikipedians became cyborgs

In 2010, Geiger & Ribes noted a significant shift in the makeup of Wikipedia's editorial population since Swartz' 2006 study. They found that fully autonomous editorial "bots" and software-assisted human editors now account for over 25% of all edits. These assistive software tools vastly improved the efficiency and efficacy of Wikipedia's custodial participants. In particular, Geiger & Ribes found that recently developed software tools enabled a relatively small number of volunteers to considerably reduce the volume and shorten the lifespan of malicious edits by anonymous vandals.

The new cyborg editorial force is not without its drawbacks, however. Geiger & Ribes warn that bots empowered to "revert" or undo edits made by human contributors effectively serve as a powerful epistemological police force deciding what is or is not appropriate for inclusion in Wikipedia. Additionally, bots tend to be deployed primarily by the kinds of Wikipedia users who are drawn to frequent custodial tasks rather than occasional significant contributions. As a result, the epistemology enforced by autonomous editorial bots may not be shared by Wikipedia's most substantive contributors.

Who governs Wikipedia?

Wikipedia's anarchic outward appearance belies a highly structured project governed by an ever-changing matrix of user-generated policies, procedures, and social norms. Groups of users enact competing editorial philosophies by constantly proposing, ratifying, enforcing, and revising new policies and guidelines. For example, "NPOV" refers to "Neutral Point of View"[6], a guideline for the composition and editing of Wikipedia articles. Although neutrality is one of the project’s oldest tenets, no mechanism exists to stabilize its interpretation or enforcement. When neutrality is contested, a separate body of policy is activated through which consensus may be reached and a resolution found.[7] Although books like Broughton's Manual attempt to provide guidance to new editors, the full corpus of Wikipedia policy and procedure is not available in any single location as it is under constant review (2008).

Wikipedia thus represents the unlikely combination of being highly policied and policed as well as radically open to newcomers. In a more cynical moment, one might see this arrangement as a trap. Newcomers are welcome to start editing and making changes without so much as creating a username. Yet, as soon as they invest enough in the project to make a significant contribution, they are subject to its arcane policies and automated disciplinary procedures.

Inclusionist, deletionist

One of the common reasons given by Wikipedia users for the departure of a large number of contributors is the emergence of a "deletionist" philosophy among some of the project's more senior editors and administrators (Stolfi, 2010). Beginning in 2006, Wikipedia editors have been increasingly divided on questions of inclusivity. "Deletionists" prefer that Wikipedia include fewer, more polished articles while "inclusionists" accept articles that are malformed, incomplete, or of niche interest in the hope that future volunteers will make improvements (Baker, 2008). As is true across the Wikipedia project, neither group is univocal and there is considerable variation in how each individual editor puts his or her philosophy into practice. That said, "deletionist" editors tend to come into contact most often with first-time contributors who are not yet familiar with the norms and standards of the project.

The Deletion function is available only to Wikipedia "Admins", a special class of editors empowered to delete articles and block other users. Any user may request adminship; candidates are vetted through a loose consensus procedure overseen by a senior editor with the Bureaucrat designation. As of February 2010, there are approximately 1,700 active Admins on the English-language Wikipedia.[8] These admins do not have blanket permission to delete articles from Wikipedia. Articles must first be nominated for deletion according to the Deletion Policy.[9] After a week of public deliberation to which any user may contribute, the article is either deleted or the nomination is rescinded.[10]

Notability

One of the policies most commonly cited to justify deletion is the Wikipedia Notability guideline[11]. Critics of Wikipedia's policies assert that the operational understanding of "notability" is subject to change according to the unique biases of individual Admins (Scott, 2004). The notability guideline is cited particularly often to justify the removal of articles related to pop- and sub-cultural interests (Baker, 2008). For example, consider this statement from Wikipedia user Stifle,

"[My] Myspace test states that if the subject of an article, be it a band, a new religion, a person, or anything else, currently uses a Myspace page as one of its main online homes, it does not warrant a page on Wikipedia. This can be extended to other free webspace providers as necessary; most notable concepts, people, bands, etc. at least own their own domain." -- http://en.wikipedia.org/wiki/User:Stifle/Myspace_test

Stifle's "MySpace test" highlights the normativizing process enabled by a vague notability guideline. With the help of assistive software, the prejudices of a minority of users can constrain or enable the categories of knowledge included on the site.

Theorizing the Article

Wikipedia is not a single, bounded totality. There are many Wikipedias. Visitors to http://www.wikipedia.org/ see the most obvious multiplicity as they are prompted to select from dozens of Wikipedias in different languages. Wikipedia is known and knowable only through the unique experiences of its readers and editors.

Likewise, it is difficult to easily identify the bounds of a single Wikipedia "article." Unlike the heading, paragraph, and occasional illustration that make up a conventional encyclopedia article, a Wikipedia article is a data object inclusive of,

  • the latest revision and all of its content (images, illustrations, static, and dynamic texts),
  • all previous revisions,
  • the revision history,
  • metadata for each revision (username or IP address of the editor, time, date, and a short comment)
  • a list of categories to which it belongs, and
  • a "Talk" page with its own revision history and metadata.

For visitors to the Wikipedia web site, this information is made publicly available through "Page", "History", and "Discussion" links rendered near the top of the page.
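The components listed above can be pictured as a simple data structure. The sketch below is one possible representation; the class and field names are assumptions chosen for illustration and do not mirror MediaWiki's internal schema.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Revision:
    editor: str      # username or IP address
    timestamp: str   # date and time of the edit
    comment: str     # short edit summary
    content: str     # full text of the page at this revision

@dataclass
class Article:
    title: str
    revisions: List[Revision] = field(default_factory=list)  # oldest first
    categories: List[str] = field(default_factory=list)
    talk_page: Optional["Article"] = None  # has its own history and metadata

    @property
    def latest(self) -> Optional[Revision]:
        """The revision a reader currently sees, if any exists."""
        return self.revisions[-1] if self.revisions else None

# Hypothetical example with a single revision.
example = Article(
    title="Example",
    revisions=[Revision("192.0.2.1", "2010-05-04T12:00:00Z", "created", "Stub.")],
    categories=["Proposed deletion"],
    talk_page=Article(title="Talk:Example"),
)
print(example.latest.editor)
```

Treating the Talk page as a nested Article reflects the point that it carries a revision history and metadata of its own.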

Each Wikipedia article is a discursive event embedded in history, shot through with culture, and subject to Foucault's archival method, eventalization. In this sense, the whole of Wikipedia is an unbounded, evolving, interrelated collection of discourses. There are too many articles for any one user to experience them all. Thus, each user knows a different Wikipedia according to whichever articles she reads and the specific times and dates on which she requests them. This slippery arrangement of discourses makes it very difficult to confidently make claims about the nature of Wikipedia as a unified whole. Instead, the Article is the appropriate unit of analysis for this project as it balances a generalizable set of functions (constrained by the MediaWiki software platform) with the unique sociality that arises among the editors drawn to a particular topic.

Figure 1. An example of the Article as a unit of analysis.

For the purposes of research, I should also consider the Article from the point of view of the practices that accompany and construct it. Depending on the status of the article and user, the following verbs are available when one encounters an article on Wikipedia:

  • Read
  • Edit (Revise, expand, append, erase)
  • Revert
  • Discuss
  • Link to
  • Copy, quote from[12]
  • Nominate for deletion
  • Delete[13]

Likewise, the present research is also concerned with the activities of three users:

  • The "Creator" is the user who made the first edit that established the Article in the database,
  • The "Primary" Contributor is the user who, according to Swartz' method, made the most substantive contributions to the article, and
  • The "Deleter" is the user who nominated the article for deletion.
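As a rough sketch, the three roles could be derived from an archived article as follows. The function and argument names are hypothetical, and the "Primary" role assumes that a per-editor count of surviving characters has already been computed by some means.

```python
from collections import Counter

def identify_roles(revision_editors, surviving_chars, nominator):
    """Name the three users of interest for one Article.

    revision_editors: editors of each revision, oldest first.
    surviving_chars: Counter mapping editor -> characters still present
        in the latest revision.
    nominator: the user who nominated the article for deletion.
    """
    creator = revision_editors[0] if revision_editors else None
    primary = surviving_chars.most_common(1)[0][0] if surviving_chars else None
    return {"Creator": creator, "Primary": primary, "Deleter": nominator}

# Hypothetical article history.
roles = identify_roles(
    ["Alice", "Bob", "Alice"],
    Counter({"Bob": 420, "Alice": 35}),
    "Carol",
)
print(roles)
```

The same user may occupy more than one role, which is itself a detail worth recording.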

Next steps

Building an archive of traces

To proceed from my open-ended preliminary fieldwork, I intend to implement a research tool to automatically collect data from the field. For each of the materials, practices, and people considered in the Article above, the MediaWiki platform generates some trace artifact legible to my proposed software assistant. In particular, I will be focusing on articles nominated for deletion based on a claim that the article's topic is non-notable. To capture this information, a piece of customized data collection software will be run each day at noontime PST. The software will gather a list of articles nominated for deletion during the previous day.[14] The software will next mine each article and create a local copy of all of the information outlined in the unit of analysis described in the previous section.
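A minimal sketch of the first step, building the daily request, might look like the following. It assumes that dated categories follow the naming pattern visible in the Appendix (Category:Proposed_deletion_as_of_4_May_2010); the function names are illustrative, and no network call is made here.

```python
from datetime import date, timedelta
from urllib.parse import urlencode

API_ENDPOINT = "http://en.wikipedia.org/w/api.php"
MONTHS = ("January", "February", "March", "April", "May", "June", "July",
          "August", "September", "October", "November", "December")

def proposed_deletion_category(day):
    """Build the dated category title for a given day (assumed pattern)."""
    return f"Category:Proposed_deletion_as_of_{day.day}_{MONTHS[day.month - 1]}_{day.year}"

def previous_day_request(today):
    """Return the API URL listing pages proposed for deletion yesterday."""
    yesterday = today - timedelta(days=1)
    params = {
        "action": "query",
        "list": "categorymembers",
        "cmtitle": proposed_deletion_category(yesterday),
    }
    # Keep the colon unescaped so the URL matches the hand-built form.
    return API_ENDPOINT + "?" + urlencode(params, safe=":")

print(previous_day_request(date(2010, 5, 5)))
```

A scheduler such as cron could invoke a script like this once a day; fetching and storing the response are left out of the sketch.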

Articles proposed for deletion are debated for seven days before they are removed from Wikipedia. Alongside the data collection software, a different program will cycle through the previous six days of articles in the local archive and check Wikipedia to see if any of them have either been deleted or restored. The results of this investigation will also be saved to the local archive.
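The follow-up pass could be sketched as below. The archive layout (a mapping from nomination date to titles) and the page_exists callback are assumptions; in practice the callback would wrap a query to Wikipedia, and distinguishing a rescinded nomination from a pending one would need more information than page existence alone.

```python
from datetime import date, timedelta

def check_recent_nominations(archive, today, page_exists, window_days=6):
    """Report whether recently nominated articles are still on the site.

    archive maps a nomination date to the list of titles archived that day.
    page_exists is a caller-supplied function (hypothetical here) that
    returns True if the title currently resolves to a page.
    """
    outcomes = {}
    for offset in range(1, window_days + 1):
        day = today - timedelta(days=offset)
        for title in archive.get(day, []):
            outcomes[title] = "still present" if page_exists(title) else "deleted"
    return outcomes

# Hypothetical archive and a stand-in for the live lookup.
sample_archive = {
    date(2010, 5, 4): ["Blood Dance", "Bbag"],
    date(2010, 4, 20): ["Too Old To Check"],
}
still_live = {"Blood Dance"}
print(check_recent_nominations(sample_archive, date(2010, 5, 6), still_live.__contains__))
```

Passing the lookup in as a function keeps the logic testable without contacting Wikipedia's servers.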

Reconstructing the scene

Several programs will also be developed to reconstruct various scenarios from the information stored in the local archive. The goal of these tools is to analyze the archive in search of interesting details for further qualitative inquiry. For example, an article that is repeatedly nominated and saved from deletion ought to be more closely examined. Similarly, a user who regularly nominates articles for deletion might be a good candidate to approach for an interview.
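Both examples reduce to counting over the archive. The sketch below assumes nominations are stored as (article title, nominating user) pairs and uses an arbitrary threshold of two; both choices are illustrative.

```python
from collections import Counter

def flag_for_follow_up(nominations, min_count=2):
    """Find articles and users that recur in the nomination record.

    nominations is a list of (article_title, nominating_user) pairs.
    Returns (repeat_articles, frequent_users), each sorted alphabetically.
    """
    by_article = Counter(title for title, _ in nominations)
    by_user = Counter(user for _, user in nominations)
    repeat_articles = sorted(t for t, n in by_article.items() if n >= min_count)
    frequent_users = sorted(u for u, n in by_user.items() if n >= min_count)
    return repeat_articles, frequent_users

# Hypothetical nomination record.
record = [
    ("Blood Dance", "UserA"),
    ("Bbag", "UserA"),
    ("Blood Dance", "UserB"),
]
print(flag_for_follow_up(record))
```

The output is only a list of leads; deciding which of them merit closer reading or an interview remains a qualitative judgment.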

Future research

My working hypothesis is that the vagueness of the notability policy coupled with the tone-deaf policing of editorial "bots" makes Wikipedia an unfriendly place for newcomers and infrequent contributors. Preliminary fieldwork suggests that further research is needed to understand the relationships among Wikipedia's governance procedures, the notability policy, fully- and semi-automated editing software, and the changing editorial population. The systematic collection and analysis of articles nominated for deletion should reveal details that I overlooked in my research to date. Below, I have included some of the questions that will guide the next round of research.

Wikipedia governance

Despite each Article in Wikipedia being its own discourse, the project is governed as a totality. I have detailed some of the major efforts to identify Wikipedia's editorial population. Swartz implies that the frequent custodial contributors are also the primary policymakers. Has anyone done a systematic analysis of the governing population on Wikipedia? Who contributes? Who votes? Where do the governing and editing populations overlap and diverge? Furthermore, what might draw a user to participate in the legislative side of Wikipedia?

Wikipedia readership

Perhaps the most crucial population at stake in any study of Wikipedia, the readership is also the most difficult to quantify. How might I apprehend the reader in my data collection procedures? If there is a way to make the readership visible, what is helpful to know about them? Is the readership of Wikipedia represented in its governance? Who feels greater "ownership" over Wikipedia? Who feels greater trust in the accuracy of Wikipedia, its editors or its readers?

Notability

There is no technical reason that any article should be excluded from Wikipedia. I suggest the notability guideline for closer analysis because my preliminary fieldwork indicates that it is the most broadly interpreted and applied of all Wikipedia policies. My first hypothesis is that "deletionist" editors who frequently invoke the notability guideline are highly committed to the goal of developing a free encyclopedia. To what extent do these editors adhere to or depart from traditional notions of what is "encyclopedic" and how is this expressed through their administrative practices? Why not maintain a more expansive notion? And, perhaps most coarsely, why do the editors delete anything?

In the past, Wikipedia has been criticized in the news media for errors in the biographies of living persons. Did the notability guideline arise in response to libel claims by celebrities? If so, is there another safeguard that might be in place to limit the liability of Wikipedia in such situations?

Pop culture and the notability guidelines

Anecdotal accounts suggest that "deletionist" editors tend to target articles of pop cultural interest. Is there a relationship between the systematic removal of pop culture content on the part of "deletionist" administrators and the alleged decline in the number of new Wikipedia editors? Is pop culture a "gateway" to greater participation in Wikipedia?

It may be possible to subject the dataset generated in the next round of research to a content analysis to identify pop culture articles. Once properly coded, it may be possible to begin testing the questions above and to develop more nuanced questions to explore the circulation of pop culture content on Wikipedia.

Autonomous bots and the notability guideline

It is clear that sophisticated bots are used to identify and respond autonomously to vandalism and malicious behavior on Wikipedia. Have bots also been deployed that implement the notability guideline? What kinds of algorithms have been implemented or proposed for the automatic evaluation of notability? Is this technically feasible? Do deletionist editors value their subjective reasoning or are they eager to use assistive software?

Known unknowns

I began this project with a set of very specific research questions. The more that I "hung around" during my preliminary research, the more that I found evidence to complicate my initial assumptions. As I move into the second stage of this research, I again find myself with specific questions. The research strategy I outlined in this paper is designed to reflect this oscillation between specificity and openness. By continually returning to a humble state of ignorance, I hope to keep in sight the complexity and instability of this rapidly shifting field.

Endnotes

  1. The methodological framework used in this paper is elaborated further in a paper for COMM552 titled "Mining vBulletin: Software-assisted ethnography."
  2. For a more detailed exploration of the MediaWiki API, see the Appendix.
  3. http://en.wikipedia.org/wiki/User:Driscoll
  4. Wales did not publish this data nor the details of his methodology and analysis. The numbers quoted here were cited by Aaron Swartz from a speech given by Wales in 2002.
  5. Note that Swartz' study was never formally published in a peer-reviewed journal. However, Kittur et al. later presented a study that employed a methodology similar to Swartz' and found similar results (2007).
  6. NPOV is the second of Wikipedia's "Five Pillars", http://en.wikipedia.org/wiki/Five_pillars_of_Wikipedia
  7. Dispute Resolution procedures are described in further detail at http://en.wikipedia.org/wiki/Wikipedia:Dispute_resolution
  8. For more detail on the hierarchy of editors, see: http://en.wikipedia.org/wiki/Wikipedia:Administrators
  9. http://en.wikipedia.org/wiki/Wikipedia:Deletion_policy
  10. In certain cases, Admins may delete an article without going through the process of nomination and deliberation. For a complete review of the "speedy deletion" procedures, see: http://en.wikipedia.org/wiki/Wikipedia:SD
  11. http://en.wikipedia.org/wiki/Wikipedia:N
  12. Free licensing explicitly permits this within certain limits. http://en.wikipedia.org/wiki/Wikipedia:Text_of_Creative_Commons_Attribution-ShareAlike_3.0_Unported_License
  13. Erasing an article, also called "blanking", is to edit the Page, highlight all of the text, press backspace, and save the change. Any user, even Anonymous users, can erase a Page. To delete an Article, however, involves eradicating the entire discourse from Wikipedia. The Article, its history, and accompanying discussion are all permanently removed from the site. Unlike blanking a page, only Administrators can initiate deletion proceedings. There is no mechanism for "reverting" or undo-ing a deletion.
  14. A chronological list of articles proposed for deletion is stored at http://en.wikipedia.org/wiki/Category:Proposed_deletion along with short reasons for their proposal. The software will parse this list for several colloquial terms and acronyms used to invoke the notability criterion.
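The term matching described in endnote 14 might be sketched as follows. The vocabulary shown is a hypothetical placeholder; the actual list of colloquial terms and acronyms would have to be compiled from fieldwork.

```python
import re

# Hypothetical vocabulary for invoking the notability criterion.
NOTABILITY_TERMS = ("notability", "notable", "non-notable", "WP:N")

# Match a term only when it is not embedded in a longer word or shortcut,
# so that "WP:N" does not also match "WP:NPOV".
_PATTERN = re.compile(
    r"(?<![\w:])(?:" + "|".join(re.escape(t) for t in NOTABILITY_TERMS) + r")(?![\w:])",
    re.IGNORECASE,
)

def invokes_notability(reason):
    """Return True if a deletion rationale mentions a notability term."""
    return _PATTERN.search(reason) is not None

print(invokes_notability("Non-notable garage band"))
```

Keyword matching of this kind will miss rationales that argue non-notability without using the expected vocabulary, so its results would need spot checks by hand.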

References

Appendix: Writing with the MediaWiki API

The MediaWiki API makes a considerable amount of data and functionality available to programmers who want an alternative point of entry into Wikipedia. The same affordances that enable the construction of autonomous vandal-fighting bots might be used by social scientists to develop project-specific research tools. The basic transaction between a third party and an API is made through an HTTP request. Although this is most useful when it is embedded within a piece of software, it can be constructed manually using the address bar in a web browser. The example in Figure 2 is a request for the first ten titles of pages proposed for deletion on May 4, 2010. Figure 3 shows the data as returned by the Wikipedia servers. Although this data is structured in XML, it remains quite human-readable.

http://en.wikipedia.org/w/api.php?action=query&list=categorymembers&cmtitle=Category:Proposed_deletion_as_of_4_May_2010

Figure 2. Manually constructed request for data from the MediaWiki API.

<?xml version="1.0"?>
<api>
  <query>
    <categorymembers>
      <cm pageid="26864001" ns="0" title="Al-Hamd Educational Society" />
      <cm pageid="14383726" ns="0" title="Albany Medical Clinic" />
      <cm pageid="27175616" ns="0" title="Ashton Michaels" />
      <cm pageid="27228259" ns="0" title="Baganga, the game" />
      <cm pageid="27215397" ns="0" title="Bbag" />
      <cm pageid="26000210" ns="0" title="Andhy Blake" />
      <cm pageid="26862698" ns="0" title="Blood Dance" />
      <cm pageid="27225943" ns="0" title="Bob Barrett (Minnesota politician)" />
      <cm pageid="26866372" ns="0" title="Chadwick Boseman" />
      <cm pageid="6056912" ns="0" title="David Bowe" />
    </categorymembers>
  </query>
  <query-continue>
    <categorymembers cmcontinue="Cheesey Beans|" />
  </query-continue>
</api>

Figure 3. Data returned by the MediaWiki API in machine-readable XML format.
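A response like the one in Figure 3 can be read with a standard XML parser. The sketch below uses Python's built-in ElementTree on an abbreviated copy of that response; the function name is an illustrative assumption.

```python
import xml.etree.ElementTree as ET

def parse_category_members(xml_text):
    """Extract page titles and the continuation token from an API response."""
    root = ET.fromstring(xml_text)
    titles = [cm.get("title") for cm in root.iter("cm")]
    cont = root.find("./query-continue/categorymembers")
    # The token is absent when there are no further results.
    next_token = cont.get("cmcontinue") if cont is not None else None
    return titles, next_token

# Abbreviated copy of the response shown in Figure 3.
SAMPLE_RESPONSE = """<?xml version="1.0"?>
<api>
  <query>
    <categorymembers>
      <cm pageid="26864001" ns="0" title="Al-Hamd Educational Society" />
      <cm pageid="14383726" ns="0" title="Albany Medical Clinic" />
    </categorymembers>
  </query>
  <query-continue>
    <categorymembers cmcontinue="Cheesey Beans|" />
  </query-continue>
</api>"""

titles, next_token = parse_category_members(SAMPLE_RESPONSE)
print(titles, next_token)
```

The cmcontinue value marks where the next batch of results begins, so a collection tool would repeat the request with that token until no query-continue element is returned.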
