VBulletin/Draft

From Driscollwiki


Introduction: Why vBulletin?

During the past year, several colleagues and I have been developing case studies about fan communities engaging in civic or political activities.[1] As we began to share the cases with each other, one of the curious similarities that emerged was the ubiquity of vBulletin, a web-based messageboard platform.[2] The communities represented by our initial eleven cases varied considerably in terms of age, ethnicity, interest, and goals, yet we found four were using vBulletin as their primary online infrastructure. Cursory Google searches turned up evidence of conversations relating to all of the cases on vBulletin messageboards across the web.[3] In spite of this apparent ubiquity, vBulletin is almost totally absent from social science literature.

vBulletin will likely continue to appear in future research concerning online popular cultures. With this in mind, it is worth taking some time to learn more about the platform itself. What are the unique affordances that draw in such a diverse userbase? How has it become so widely used without attracting the popular attention of comparatively smaller services such as Twitter? Do the structural features of vBulletin tacitly encourage certain kinds of discourse while discouraging others? To address these questions, we begin with an exploration of vBulletin as a technology.

This paper proposes the development of a software tool to assist scholars who encounter vBulletin in the course of their research. First, it suggests a methodological foundation based on previous work in computer-assisted qualitative research, online ethnography, and related ethical considerations. Next, it explores technological antecedents for mining and mapping internet discourse. Finally, it outlines specific functionality that could help answer questions raised in our case studies. Along the way, I will also make a case for the ethnographic value of software development by researchers.

Developing a methodological framework

Specialized software tools like SPSS, R, Atlas/ti, and NUDIST have long been used to assist both quantitative and qualitative research methodologies in the social sciences. Sometimes they are deployed primarily as amplifiers that can "automate" or "speed up" human work (Barry, 1998), but as researchers achieve greater comfort and familiarity with their tools, assistive software has the potential to enable new ways of reading and manipulating data (Burroughs-Lange & Lange, 1993). Studies of online phenomena are especially well-suited to the productive integration of assistive research tools. Whereas the data gathered in an off-line ethnography will require a translation process to be manipulable by software, the computer-mediated field presents opportunities for software accompaniment at each stage of the research design: collection, arrangement, analysis, and visualization.

Of course, the seasoned ethnographer need not worry that he or she will be put out of a job by a piece of software. The rich potential for using software in online research is accompanied by a set of tricky technical and ethical problems that require a researcher's subjective judgement. As a very preliminary starting point, we should proceed under the assumption that any automated (or semi-automated) software processes will necessarily be paired with participant observation, interviews, and/or focus groups so as to maintain the "context and situatedness" of online phenomena (Schneider & Foot, 2005, 165). In a best-case scenario, this oscillation between software and human research agents enables the reconstruction of scenes that would otherwise be very difficult to apprehend by either approach alone.

Preliminary fieldwork

Ethnography is the basic material from which we will build a software-assisted methodology. For phenomena unfolding primarily online, it can be difficult to know what constitutes "the field" (Beaulieu, 184). Furthermore, if one is interested in a given "community", the usual markers of shared geographic place are absent. Neither the server nor the client software is an appropriate substitute for the shared air of a laboratory, classroom, coffee house, or cock-fight. Although a site like Facebook is often casually discussed as though it is a knowable totality, the experience of any given user will depend on his or her unique constellation of friends, interfaces, apps, and experiences. To know the boundaries and elements of a given field requires a foundational familiarity with the everyday practices of the community or phenomenon under investigation (Schneider & Foot, 160).

With this standard in mind, Guimarães suggests beginning the design of a research study by first observing the practices of an interesting group of people (146). After "hanging around" with users in a virtual world for a few weeks, he found that they frequently extended their relationships into other online platforms, notably e-mail and instant messaging (149). Although the preliminary fieldwork was time-consuming and tedious at times, it enabled Guimarães to develop a research design that was more closely aligned to the structures and values of the community he planned to study.

Units of analysis

A second challenge to apprehending the field of study is determining the units of analysis. The interface suggests one typology: "members" craft "posts" that are organized into "threads" within "forums" and "sub-forums" on a site. But how do we understand the relationship of members to their posts? How do we account for anonymous readers or users who post under two or more pseudonyms? Can we design a research strategy that is flexible enough to allow for the creation of a new sub-forum or the migration of users to a new server?

Schneider & Foot suggest the term "web objects" for discussing the various entities under observation in an online study (157). This terminology allows the object to serve multiple purposes: as "inscriptions of web practices" and as "structures for online action" (157). On one hand, a vBulletin "thread" is the product of dozens of individual actions on the part of users and software. On the other, it is itself a piece of software with numerous links and scripts that might be activated by the reader. The "web object" is as much a tool as the software from which it arises.

Beaulieu's work on hyperlinks highlights the flexibility of web objects as understood by social science. Within the same body of research, she considers hyperlinks as objects that "trigger events", "give meaning", and structure "action" (2005, 193). Most importantly, the hyperlink as unit of analysis enables Beaulieu to (re)construct a site from dozens of seemingly disparate trace data points.

Traces

Before we can confidently select the possible "objects" and appropriate units of analysis in a study involving vBulletin, we need to identify the unique data made available by the platform. In their study of Wikipedia vandals, Geiger & Ribes relied on publicly-available "traces" generated as products and by-products of user activity on the site (2010). These traces included timestamps, comments, usernames, and other metadata accompanying the daily activities of Wikipedia editors. Beaulieu referred to such traces as "signs left behind" and suggested that they might prove valuable as "shared objects" among quantitative and qualitative methods (183).

"Born digital" textual data such as bids on an eBay auction or comments on a YouTube video are particularly well-suited to automated collection by assistive software tools (Schneider & Foot, 166). Although software for parsing audio and video data exists, it is computationally demanding and conceptually complex. Text parsing, however, is accessible to researchers with little to no formal training in computation. Dates, times, prices, and URLs are all examples of rigidly structured that can be defined by a researcher for autonomous discovery and automatic archiving.

In some cases, researchers may wish to begin collecting traces before units of analysis have been identified. Once stored in a structured database, the traces are available to be browsed, filtered, sorted, and visualized in ways that may make object relationships easier to identify. As a note of caution, researchers who choose to take up a software-assisted "grounded theory" approach to studies of online activity must be especially critical of the tools they choose. The "neat and tidy" data structures that undergird computer software rarely map simply to the complex social relations they are used to mediate (Burroughs-Lange & Lange, 1993).

Archives

As the software collects data from the field, it also generates its own traces that may be valuable to the research. Common archival metadata includes the time and date at which a given trace was collected and stored. Depending on the questions driving the research, the software might also perform some initial categorization, tagging, and coding such as identifying the title, subject, and author of a post on vBulletin. More sophisticated software might also parse text for keywords or identify other semantic information.
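
The pairing of field traces with archival metadata can be sketched with Python's built-in sqlite3 module. The table layout, column names, and sample values below are assumptions made for illustration, not a proposed final schema. The point is only that each stored trace carries both the data observed in the field (author, title, posting time) and the trace left by the tool itself (the moment of collection).

```python
import sqlite3
from datetime import datetime, timezone


def open_archive(path=":memory:"):
    """Create (or open) a trace archive. The schema is an illustrative assumption."""
    db = sqlite3.connect(path)
    # posted_at is the timestamp displayed by the forum, if any;
    # collected_at records when our own software stored the trace.
    db.execute(
        """CREATE TABLE IF NOT EXISTS traces (
               id INTEGER PRIMARY KEY,
               source_url TEXT NOT NULL,
               author TEXT,
               title TEXT,
               body TEXT,
               posted_at TEXT,
               collected_at TEXT NOT NULL
           )"""
    )
    return db


def store_trace(db, source_url, author, title, body, posted_at=None):
    """Insert one trace and stamp it with the current UTC collection time."""
    collected_at = datetime.now(timezone.utc).isoformat()
    cursor = db.execute(
        "INSERT INTO traces (source_url, author, title, body, posted_at, collected_at) "
        "VALUES (?, ?, ?, ?, ?, ?)",
        (source_url, author, title, body, posted_at, collected_at),
    )
    db.commit()
    return cursor.lastrowid


# Invented example values.
db = open_archive()
row_id = store_trace(
    db,
    "http://forum.example/showthread.php?t=1",
    "user_a",
    "Brake repair",
    "Has anyone replaced the rear pads?",
    posted_at="2010-04-27T21:15:00",
)
rows = db.execute("SELECT author, title FROM traces ORDER BY collected_at").fetchall()
```

Once traces are stored this way, ordinary SQL queries provide the browsing, filtering, and sorting described earlier.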

For this metadata to be useful, it must be situated in context. By pairing the archive with qualitative detail garnered from participant observation, interviews, or focus groups, the researcher may be able to reassemble the field at a particular moment in history. The result is not an exact replica but a life-like diorama of the field. Researchers must be careful not to be overbroad in their analyses of these resurrected scenes. They are a powerful tool that affords time for reflection and contemplation but they ought not be wholly substituted for first-person ethnographic inquiry.

Reconstruction

Acknowledging that the reconstruction of a field from archival data will always fall somewhat short of perfect replication frees the researcher using assistive software to productively imagine alternative arrangements. We can thus generally understand "reconstruction" as the systematic reassembly of markers, logs, and traces left by human and non-human actors (Geiger & Ribes, 2010, 3). This expanded definition allows us to reassemble scenes strategically to meet the needs of our research questions. For example, if we are interested in the interactions among users from different geographic locales, we might develop multiple reconstructions from a single vBulletin forum in which only threads and posts made during specific times of day or by specific users are represented.
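
A strategic reconstruction of this kind amounts to filtering the archive. The sketch below assumes that posts have already been mined into simple records; the Post fields, the reconstruct function, and the sample data are hypothetical. It also assumes that all timestamps share one time zone, which would need to be verified for any real forum.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class Post:
    author: str
    thread: str
    posted_at: datetime  # assumed to be in a single, known time zone
    body: str


def reconstruct(posts, start_hour=0, end_hour=24, authors=None):
    """Keep posts made within [start_hour, end_hour) and, optionally,
    only those written by the given set of authors."""
    selected = [
        post for post in posts
        if start_hour <= post.posted_at.hour < end_hour
        and (authors is None or post.author in authors)
    ]
    return sorted(selected, key=lambda post: post.posted_at)


# Invented sample archive.
archive = [
    Post("user_a", "Brake repair", datetime(2010, 4, 27, 3, 10), "..."),
    Post("user_b", "Brake repair", datetime(2010, 4, 27, 14, 5), "..."),
    Post("user_a", "Wheel sizes", datetime(2010, 4, 28, 2, 45), "..."),
]

# One reconstruction: only the activity between midnight and 6 a.m.
overnight = reconstruct(archive, start_hour=0, end_hour=6)
```

Several reconstructions can be produced from the same archive by varying the arguments, leaving the underlying traces untouched.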

In 2005, Schneider & Foot coined the term "web sphere" to concisely describe a field or community traversing multiple software platforms and servers (158). Although this concept is not a perfect fit for the present project, it offers a useful vocabulary and helpful analytic structure for understanding the products of reconstruction. Rather than emphasize the technical features of a field, thinking through "sphere" focuses our attention on thematic and temporal characteristics. In a hypothetical study of baseball fandom, researchers could "hang around" Red Sox fans gathered on the TalkSox messageboard but the fans' web sphere likely includes the Yahoo fantasy sports service, the Major League Baseball play-by-play portal, and a variety of blogs by Boston sportswriters. In order to fully archive this field, the research design needs to consider each of these services.

Conversely, the sphere does not necessarily include all of the activity on a single server. Many installations of vBulletin are aimed at very broad topic areas. For these sites, regular users of one subforum may never visit any of the others. PreludeZone is a vBulletin installation for fans and owners of Honda's Prelude, a car model that was produced from 1982-2001 in five different "generations." The site breaks down accordingly and features a separate subforum for each generation. As posters to the "1st Gen" subforum do not interact regularly with users of the "5th Gen" subforum, we might assume that the norms, culture, and history of the two subforums are also quite distinct.

Thinking through the "sphere" challenges assumptions about proximity in online spaces. The web sphere inclusive of the "1st Gen" forum may extend off of the PreludeZone server (perhaps to a YouTube video about repairing the brakes on the first generation Prelude) but not include any of the activity on the "5th Gen" forum. Likewise, users are not expected to read every single thread and post in order to maintain community membership. Depending on the diversity and structure of a particular forum, researchers might determine that there are distinct web spheres within a subforum based on recurring topics among a subset of its total population.

The key to conceptualizing the sphere for a given community is preliminary fieldwork ("hanging around") and sensitivity to changes in the field once the research is underway. Guimarães suggests looking for clues about the "proper" behavior expected by users (147). Participant observation and active engagement with the community are two possible avenues for encountering these discourses. As the researcher is learning to use the platform, he or she will inevitably become confused or make mistakes. It is at this point that more experienced users will correct, reprimand, guide, ridicule, or help. Collective and individual responses to beginner mistakes can be valuable clues for apprehending how a group of people understand the boundaries of their own sphere (Guimarães, 147; Sandvig, 150).

Ephemeral / persistent

Paradoxically, materials that interest us online may be both ephemeral and persistent (Schneider & Foot, 166). It is often impossible to predict the reliability of any data we encounter. The traces produced by real-time interactions (such as video chat) may appear ephemeral, though they can be recorded, while asynchronous artifacts (such as a hypertext essay) suggest permanence, though they may be under constant revision. The process of archiving traces from the field mitigates some of the challenges of this instability but introduces thorny technical and ethical issues that must be considered in advance.

Just as preliminary fieldwork is required to begin learning the "sphere" of services, software, practices, and people, it is also important to identify the temporal characteristics of a field (Schneider & Foot, 159). Various understandings of the history, present, and future of a phenomenon will affect people's expectations regarding ephemerality and persistence within it. Before the researcher can assess the rate of attrition, for example, he or she must "hang around" long enough to either see people coming and going from a space or to encounter a discussion of community norms. Depending on the field of study, these "cycles" can vary widely: from anthropology's mythical one-year "complete cycle" to the multi-year churn of knowledge in a technical discipline to the moment-to-moment interactions on Chatroulette (Sandvig, 150; Guimarães, 152).

Events play a special role in the temporal character of a field. In the earlier hypothetical study of baseball fans, the field coheres around a series of planned events: the baseball season. Imagine that a player is injured mid-way through the season. A new, related field might arise in response to this unexpected event. The research design should account for such unpredictable changes to the field. At what interval will the archiving tools collect traces? For how long will archiving proceed? Will the scope of the archive be updated to accommodate changes to the sphere? If the archive tools are deployed in a "set-and-forget" fashion, it does not mean that the researcher will ignore the volatility of the field. Rather, he or she may take note of unexpected events and incorporate this knowledge into the analysis of the archived traces.

Technically, the web is highly persistent. Files on servers are accessible unless someone deletes them. In practice, of course, files move, URLs expire, and researchers cannot rely on the stability of remote servers. Although the local archive enables researchers to return to old materials, it also obscures the important direct and contextual effects that instability might have on the norms of the field. To account for this, scholars should foreground the temporal characteristics of a phenomenon when using an archive of traces to "reconstruct" historical moments.

Public data

In Beaulieu's study of the functional Magnetic Resonance Imaging Data Center (fMRIDC), a large volume of scientific data was made available over the internet for use by "lay experts" (183). A very similar public data discourse accompanies the growing number of government transparency projects such as Watchdog.net and OpenCongress.org. In addition, web services increasingly make their internal data and functionality available to the public through RSS/XML-encoded feeds and Application Programming Interfaces (APIs). This trend toward data portability enabled the rise of "mashup" projects that merge data from two or more public sources to address the needs of a specific population.[4]

These data may constitute meaningful traces for a field under observation. If so, hurrah! The data is presented explicitly for public consumption. In most cases, however, the traces we want to collect circulate in a grey area in which their "public"-ness is less clear (Bruckman, 227). To collect this data automatically, a "scraping" tool rapidly parses large volumes of web data in search of pre-determined patterns of information (for example, it might collect whatever string of characters follows the phrase "Subject:"). From the point of view of the remote webserver, the "scraper" might appear identical to a human user. The ethical implications of this arrangement will be elaborated later in this paper.
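
The "Subject:" example can be sketched in a few lines of Python. The fragment below assumes that the page text has already been retrieved and stored locally; it performs no network requests. The function name extract_subjects and the sample text are hypothetical.

```python
import re

# Capture whatever follows "Subject:" up to the end of that line.
# [ \t]* skips spaces and tabs without crossing onto the next line.
SUBJECT = re.compile(r"Subject:[ \t]*(.+)")


def extract_subjects(page_text):
    """Return the text following each 'Subject:' label found in page_text."""
    return [match.group(1).strip() for match in SUBJECT.finditer(page_text)]


# Invented sample of already-retrieved text.
sample_page = (
    "From: user_a\n"
    "Subject: Brake repair on a 1st gen\n"
    "\n"
    "Body text goes here.\n"
    "Subject:Wheel sizes\n"
)
subjects = extract_subjects(sample_page)
```

A pattern this simple will also capture false positives, such as a post body that happens to contain the word "Subject:", which is one reason the results call for human review.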

What is not represented by data?

The sheer volume of data that can be collected and sorted with assistive software can overwhelm the critical absences that persist within a field (Beaulieu, 188). Rather than simply rely on the automatically generated archive, the researcher must pay closer attention to the field in search of data and data patterns that evade the software (Barry, 1998). The most uncollectable data is that which is never generated in the first place. The lack of certain content or features from a web site or piece of software may reflect a strategic exclusion on the part of the administrator or developer (Schneider & Foot, 163). In a moment of abundance, absence is often more meaningful than presence.

In many scenarios that involve vBulletin, the most important participant is the user/reader/viewer. He or she may never contribute a comment or ask a question but the significance of the phenomenon rests on their presence. If these individuals never create accounts or sign in to vBulletin, they leave very few collectable traces. Web metrics, server logs, and "view counts" obliquely suggest readership but are hardly reliable quantities. Beaulieu found that the public databases she observed presented "sophisticated diagrams" about their visitors but could say very little about "who the users were, or about their use of the database" (185). Assistive software can automate many rote tasks but it is the responsibility of the human researcher to account for the invisible, the excluded, the obscured, and the uncollectable (Burroughs-Lange & Lange, 1993; Barry, 1998).

User experience

One last element to be considered in the reconstruction of a scene is the role of interface. Knowing the tools and contexts in which users interact with a sphere is critical to making sense of the traces in an archive. The affordances and constraints of different access schema affect how, when, and why practices develop as they do. Even a seemingly simple activity like visiting a static webpage may vary considerably depending on whether the page is viewed from a mobile phone or a laptop. Depending on the goals of the study, researchers might consider bandwidth, client software, operating system, screen resolution, and other technical details in their reconstruction of a scene.

Notes on ethics

The software-assisted methodology outlined above raises some tricky ethical questions that are worth considering again before digging deeper into the technological details. First, the impact of automated data acquisition on a community may be less immediately apparent than with more traditional methods because the researcher is not present at the moment of collection. An important goal of preliminary fieldwork is to determine the appropriate degree of disclosure. The negative effects of brash or socially tone-deaf researchers can have ramifications beyond the scope of the present study. Researchers working in the field are responsible not only to the community with whom they hope to work but also to their fellow scholars (Bruckman, 217).

Efficiency should not be the only criterion guiding the design of an assistive research tool. The available technology may enable a wide range of practices that are inappropriate to the norms and culture of the field or community under observation. In their work with World of Warcraft players, Williams & Xiong found it more culturally appropriate to approach potential subjects using avatars "in-game" rather than post to player forums (Williams & Xiong, 135). Although this process was more labor-intensive, they found that players were more willing to participate and less skeptical of the research. Not only was the strategy more ethically and culturally appropriate, it yielded more reliable data.

Software-assisted qualitative research must take extra precautions with regard to participant privacy (Sack, 2001, 89). As noted above, the "public"-ness of data on the web is rarely defined clearly. This ambiguity is further complicated by the ease with which seemingly ephemeral traces can be collected, archived, and recalled. Bruckman uses the term "semi-published" to describe this kind of material (227).

Semi-published materials should not be (re)published without first being subject to "rigorous close reading" by human researchers (Sack, 2001, 88). Assistive software tools can collect and represent data in ways that participants may have never expected. Seemingly innocent data may cause harm when they are visualized in a new fashion. For example, the timestamps that accompany every post on a vBulletin messageboard are publicly visible but to aggregate the timestamps of a user's post history may be embarrassing or have a chilling effect on the participation of others.

Anonymizing data can help mitigate the negative impact on individual user privacy but it is not a fool-proof solution. Bruckman reminds us that in a tight-knit community, some of the most compelling data will be identifiable regardless of sincere anonymization efforts. Dibbell anonymized the individuals discussed in his article "A Rape in Cyberspace" but several people in the LambdaMOO community made a game out of figuring out which users he was describing. Even when the subject matter is sensitive, unraveling the veiled information will be a riddle that is "too much fun to resist" for some users (Bruckman, 219-220).
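
One common first step, offered here only as a sketch and not as a technique drawn from the literature cited above, is to replace usernames with keyed pseudonyms before any data leave the archive. The function name, key, and output format below are assumptions. As the preceding paragraph makes clear, this protects only the username field: quoted text, writing style, and posting times may still identify a participant.

```python
import hashlib
import hmac


def pseudonym(username, secret_key, length=8):
    """Derive a stable pseudonym from a username using a keyed hash (HMAC-SHA256).

    The same username and key always give the same pseudonym, so a user's
    posts stay linked to one another. Without the key, the mapping cannot
    be rebuilt by simply hashing a list of known usernames.
    """
    digest = hmac.new(secret_key, username.encode("utf-8"), hashlib.sha256).hexdigest()
    return "user_" + digest[:length]


# Hypothetical key; in practice it would be generated randomly and stored
# separately from the archive.
key = b"replace-with-a-randomly-generated-key"
alias_a = pseudonym("prelude_fan_82", key)
alias_b = pseudonym("fifth_gen_owner", key)
```

Shortening the digest keeps the pseudonyms readable but raises the chance that two users receive the same alias in a very large community, so the length should be chosen with the size of the forum in mind.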

vBulletin is used by an enormous variety of organizations, each with its own history and norms. There is no simple privacy benchmark that can be programmed to create a piece of ethical software. In any study involving living people, the human researcher must employ a variety of methods to ensure that the research tools have minimal negative impact.

On the other hand, no software presently exists for exploring the social characteristics of vBulletin. The tool proposed below may actually be of use to some vBulletin users. Just as social scientists adapt tools from other disciplines to the needs of research, the software developed to assist research should be made available to the public. The innovations that accompany non-academic uses may be productively integrated into future research.

Previous technological research

There are three primary areas where a semi-autonomous software tool can assist qualitative inquiry involving vBulletin: mirroring, mining, and analysis. Once preliminary research has defined the units of analysis and temporal boundaries of the field, a piece of software can begin to collect artifacts into a local archive. A second piece of software will mine the archive for desired traces. Finally, a third piece of software can accompany and enable the analysis and visualization of the collected archive. Each of these areas has a rich history within and outside of academia.

Mirroring

Automatically archiving web content helps researchers overcome the instability of web content. The most basic form of archiving is to "mirror" whole webpages. Wget is a command-line tool that uses HTTP, the same protocol implemented by web browsers, to request data from the web. It can recreate the entire "tree" of a given website on a local hard drive that can be explored and manipulated locally, even without a network connection. By scheduling wget to periodically "mirror" a given URL, the researcher will be able to return to view a site on specific dates for comparative and historical inquiry. Furthermore, wget (and similar tools) can be set up to follow links off of the current site and into sites one or more degrees removed from the seed. For research projects that do not have clearly delineated, multi-site boundaries, this can be a good starting point for assessing the spread of a given phenomenon across servers.
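
A dated mirroring job of this kind might be assembled as follows. This Python sketch only builds the wget command line and does not run it; the function name, directory layout, two-second delay, and example URL are assumptions, while the flags themselves (--mirror, --convert-links, --page-requisites, --wait, --directory-prefix, --span-hosts, --level) are standard GNU wget options.

```python
from datetime import date


def wget_mirror_command(url, archive_root, snapshot_date, span_hosts_depth=0):
    """Build a wget command that mirrors url into a directory named for the date.

    If span_hosts_depth is greater than zero, wget is also allowed to follow
    links to other hosts, with the recursion depth capped at that value.
    """
    target = f"{archive_root}/{snapshot_date.isoformat()}"
    command = [
        "wget",
        "--mirror",            # recursive download with timestamp checking
        "--convert-links",     # rewrite links so the copy works offline
        "--page-requisites",   # also fetch images and stylesheets
        "--wait=2",            # pause between requests to spare the server
        f"--directory-prefix={target}",
    ]
    if span_hosts_depth > 0:
        # Following off-site links can balloon quickly, so cap the depth.
        command += ["--span-hosts", f"--level={span_hosts_depth}"]
    command.append(url)
    return command


# Hypothetical forum URL and archive location.
command = wget_mirror_command("http://forum.example/", "/data/archive", date(2010, 4, 27))
# To actually run it: subprocess.run(command, check=True)
```

Running a script like this from a scheduler such as cron, once a day or once a week, would yield the series of dated snapshots described above.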

The full mirroring approach has drawbacks. It can quickly produce an enormous volume of redundant data that will require considerable effort to sort. The tool may also be unable to access dynamic data, data that requires a password, or data stored inside of Adobe Flash applications or Java applets. To address the first of these problems, the tool should be configured to selectively mirror data according to logical criteria informed by participant observation. More sophisticated software may need to be developed to acquire data from within binary formats. A short-term solution is to automatically store this material for later coding and analysis by hand.

Mining

Mirroring software excels at accurately duplicating web resources but does not take advantage of the computer's ability to parse, sort, and store data according to social or semantic guidelines. A second layer of assistive software should be deployed alongside the mirroring tools to mine the local archive for interesting traces. "Web mining" is a multi-disciplinary area of research that draws on several computer science traditions: database management, information retrieval, artificial intelligence, machine learning, and natural language processing. Mining software tends to combine information retrieval - building the archive - with information extraction (Kosala & Blockeel, 2000). For the purposes of our project, wget might prove an adequate tool for information retrieval but the rest of the mining tasks must be customized according to the specific features of the vBulletin platform.

Information extraction refers to the process by which useful information is parsed out of a larger collection of data. In the case of vBulletin, the mining process is complicated by the vague, inconsistent semantic structure of HTML. The markup tags in web documents may provide preliminary metadata to simplify the mining task but the software cannot assume that they will always be implemented correctly. In addition to content, mining software can also be written to identify the structural characteristics among multiple documents. Rather than examining the content of a single page, the structural approach produces a topology of links within and without a set of documents (Morales et al., 2009).
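
A very small version of the structural approach can be sketched with Python's standard library. The parser below tolerates the kind of sloppy markup described above (unclosed tags, mixed-case tag names) and records which hosts each archived page links to. The class and function names and the sample page are hypothetical, and the result is a simple page-to-host map, not the richer topologies found in the web mining literature.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkCollector(HTMLParser):
    """Collect the absolute target of every <a href=...> on one page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names for us.
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if href:
            # Resolve relative links against the page's own URL.
            self.links.append(urljoin(self.base_url, href))


def link_topology(pages):
    """Map each page URL to the sorted list of hosts that it links to."""
    topology = {}
    for url, html in pages.items():
        collector = LinkCollector(url)
        collector.feed(html)
        hosts = {urlparse(link).netloc for link in collector.links}
        hosts.discard("")  # drop links without a host, such as mailto:
        topology[url] = sorted(hosts)
    return topology


# Invented sample page with an unclosed <p> and an upper-case tag.
pages = {
    "http://forum.example/showthread.php?t=1": (
        '<p><a href="showthread.php?t=2">next thread</a> '
        '<A HREF="http://video.example/watch?v=abc">brake video</a>'
        "<a>anchor without a target</a>"
    ),
}
topology = link_topology(pages)
```

Links that stay on the forum's own host and links that leave it appear side by side in the output, which offers one rough way of seeing where a web sphere extends beyond a single server.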

Both content-based and structural mining studies are traditionally organized around somewhat different goals than the ones that motivate the online ethnography. Analyses that combine content and structure are used to quantify the popularity, relevance, and similarity of a group of web resources. When considering vBulletin, we will be mining content in search of the traces relevant to the units of analysis identified by preliminary fieldwork. The structures of interest will tend to be discursive clusters such as the group of posts that constitute a thread and the group of threads that constitute a subforum. As the project develops, future researchers may wish to implement additional analyses that take advantage of the more sophisticated approaches found in computer science and artificial intelligence.

Analysis

The earliest work in qualitative online research relied on "hand" analyses of online conversations performed by linguists, sociologists, media studies researchers, and others (Sack, 2001, 86). Even when logging and mirroring software was used to collect data, some researchers were hesitant to use software for qualitative analysis. They worried that the numerical characteristics of a database would draw analysts away from rigorous qualitative analyses to perform less rigorous, less useful quantitative methods (Burroughs-Lange & Lange, 1993). In addition, they feared that software would introduce a counter-productive amount of "distance" between the analyst and the data (Barry, 1998). These concerns were not unfounded and this project proceeds on the belief that a multi-method research design can safeguard against the risks that they rightly identified.

Atlas/ti[5] and NUDIST/NVivo[6] are multi-purpose tools designed to assist researchers performing common qualitative tasks (Barry, 1998). They are particularly well-suited to coding in content analysis and transcription from interviews and focus groups. Neither tool is particularly well-suited to the open-ended fieldwork associated with participant observation and online ethnography. For this task, a somewhat more specialized tool is required to address the unique architectures of the field under observation.

Conversation Map

[Figure: Conversation Map (image file 4557833939_770594f8a5_o.gif)]

Conversation Map is a tool for analyzing the "very large-scale conversations" that take place on USENET newsgroups (Sack, 2000; 2001; 2002). Sack designed the software to be available to social scientists and newsgroup participants alike. Using techniques derived from the social and semantic network literature, Conversation Map parses and re-presents a newsgroup archive by highlighting recurring themes and social relationships (see fig. TODO). This alternative interface enables researchers to quickly locate interesting moments in the history of a newsgroup to zoom in on for closer analysis.

Conversation Map closely matches the proposed tool for assisting analyses of vBulletin but differs in two crucial dimensions. First, Conversation Map assumes the existence of a plaintext archive, an affordance of the USENET messaging platform. Although vBulletin has similar functionality, the two systems fundamentally differ in how they store and transmit messages. vBulletin data is stored in a single database while USENET employs a redundant "store and forward" design in which messages are distributed across many federated servers. To begin aggregating messages from USENET, a researcher can run a USENET server without modification. In most cases, the single vBulletin database will not be directly accessible and researchers will instead need to draw on the information as presented to users on the web.

The second primary difference between Conversation Map and the proposed vBulletin tool is in the former's attention to semantic relationships. Conversation Map presents users with a "content-based" browser interface in which common themes and terms provide the primary points of entry into the archive. Unfortunately, this functionality is beyond the scope of the present project. Because the vBulletin architecture introduces significant constraints absent in USENET, the immediate priority is to establish a reliable mechanism for creating and maintaining a productive archive.

Why should we build our own tools?

This is not a proposal to hire developers. The process of building assistive tools is itself a qualitatively productive activity. Guimarães urges us to pursue research techniques that are socially and culturally appropriate to our fields of study (141). To learn more about vBulletin as a platform, we must engage with it according to its own discourse and logic: code and protocol. The knowledge that arises alongside this engagement may then benefit future studies that encounter vBulletin through its communities of users.

The tools that we build do not need to be perfect and the process of their construction need not be maximally efficient. Rather, by maintaining a sense of self-reflexive dual purpose throughout their development, the same missteps, confusion, and failure that frustrate the technician will enrich the ethnographer. Sandvig champions the productive potential of "happy ignorance" with regard to the technological subjects he explores (150). Asking novice questions and seeking help may put the researcher in touch with the mythology and training that guide a discipline or technological regime. There is great value in learning how practitioners grow and maintain their expertise.

Finally, a prototype tool provides a powerful starting point for future interdisciplinary collaboration. The semi-functional prototype, however buggy or incomplete, will provide a discursive crossroads for engineers, researchers, subjects, designers, and theorists to meet. If the code is open, researchers from other departments and institutions can make a copy of their own to try out. In the ideal scenario, this playful experience transforms weaknesses into topics for discussion; bugs into invitations. Once shared, the research tool is not simply a means to an end but an "object to think with" (Turkle, 2007).

Research-driven software design

To outline the specifications of a vBulletin research tool, we should start from the questions that we want to be able to ask. One advantage that Atlas/ti and NUDIST have over other tools is that they were designed from the beginning to address the specific needs of social science research methodology. For the present project, we will focus on one specific scenario that arose in a case study about the Pricescope community.[7]

Pricescope is a website for diamond consumers, fans, and industry stakeholders to socialize, share resources, and do business. Their popular messageboard is divided into several subforums including a variety of topics related and unrelated to gemstones.[8] Swartz observed that a relatively obscure subforum titled "Around the World" came alive in response to the 2008 U.S. presidential election campaigns. During a particularly heated debate, she observed one user write to another, "I hate your politics but I love your diamonds." Swartz highlights this moment as evidence that the Pricescope users were able to "maintain civility" because of a shared foundation of respect constructed through their interactions in other areas of the site.

When Swartz observed this interaction between users, several questions likely leapt immediately to mind. One line of thinking traces the interconnections between their posting histories:

  • In which other threads have both users posted?
  • Have they ever quoted or replied to one another?
  • When was the last time the two posted in the same thread?
  • Is there a third person with whom they both regularly interact?
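
As a minimal sketch of how a local archive might answer these questions, the following Python treats each archived post as a simple record. The `Post` fields, the helper names, and the sample data are all assumptions made for illustration; they do not describe any existing tool or the actual Pricescope data.

```python
from collections import namedtuple
from datetime import datetime

# Hypothetical record for one archived post; the field names are assumptions.
Post = namedtuple("Post", "thread_id author posted_at quoted_author")

def shared_threads(posts, user_a, user_b):
    """Return the ids of threads in which both users have posted."""
    threads_a = {p.thread_id for p in posts if p.author == user_a}
    threads_b = {p.thread_id for p in posts if p.author == user_b}
    return threads_a & threads_b

def ever_quoted(posts, user_a, user_b):
    """Return True if either user has quoted the other at least once."""
    return any(
        (p.author == user_a and p.quoted_author == user_b)
        or (p.author == user_b and p.quoted_author == user_a)
        for p in posts
    )

def last_shared_activity(posts, user_a, user_b):
    """Return the time of the most recent post by either user in a shared thread.

    This is only one possible reading of "the last time the two posted
    in the same thread".
    """
    shared = shared_threads(posts, user_a, user_b)
    times = [p.posted_at for p in posts
             if p.thread_id in shared and p.author in (user_a, user_b)]
    return max(times) if times else None

def common_contacts(posts, user_a, user_b):
    """Return other users who share at least one thread with each of the two."""
    def co_posters(user):
        threads = {p.thread_id for p in posts if p.author == user}
        return {p.author for p in posts if p.thread_id in threads}
    return (co_posters(user_a) & co_posters(user_b)) - {user_a, user_b}

# Invented sample data, for illustration only.
posts = [
    Post(1, "alice", datetime(2008, 10, 1, 9, 0), None),
    Post(1, "bob", datetime(2008, 10, 1, 9, 30), "alice"),
    Post(2, "alice", datetime(2008, 10, 2, 20, 0), None),
    Post(2, "carol", datetime(2008, 10, 2, 21, 0), None),
    Post(3, "bob", datetime(2008, 10, 3, 8, 0), None),
    Post(3, "carol", datetime(2008, 10, 3, 8, 15), None),
]

print(shared_threads(posts, "alice", "bob"))
print(common_contacts(posts, "alice", "bob"))
```

Sharing a thread is a crude proxy for "regular interaction"; a fuller tool would probably weight quotes and replies more heavily than mere co-presence.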

Another group of questions concerns their user profiles:

  • Which user joined the board first?
  • Which user posts more frequently?
  • Do the two users typically post during the same times of the day?
  • Are there other pairs of users with a similar relationship?
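
The first three of these profile questions could be approached along the following lines. This is a sketch under stated assumptions: join dates are taken to be visible on profile pages and already archived, posts are reduced to invented (author, timestamp) pairs, and the overlap measure is one simple choice among many.

```python
from collections import Counter
from datetime import datetime

# Invented (author, timestamp) pairs standing in for archived posts.
posts = [
    ("alice", datetime(2008, 10, 1, 9, 5)),
    ("alice", datetime(2008, 10, 2, 9, 40)),
    ("alice", datetime(2008, 10, 3, 21, 10)),
    ("bob", datetime(2008, 10, 1, 9, 30)),
    ("bob", datetime(2008, 10, 4, 14, 0)),
]

# Hypothetical join dates as they might be read from profile pages.
joined = {"alice": datetime(2006, 3, 1), "bob": datetime(2007, 7, 15)}

def earliest_member(joined, *users):
    """Return whichever of the given users joined the board first."""
    return min(users, key=lambda u: joined[u])

def posts_per_day(posts, user):
    """Average posts per day between a user's first and last archived post."""
    times = [t for author, t in posts if author == user]
    if not times:
        return 0.0
    span_days = (max(times).date() - min(times).date()).days + 1
    return len(times) / span_days

def hour_overlap(posts, user_a, user_b):
    """Share of activity falling in the same hours of the day (0.0 to 1.0)."""
    def hourly_share(user):
        hours = Counter(t.hour for author, t in posts if author == user)
        total = sum(hours.values())
        return {hour: n / total for hour, n in hours.items()}
    share_a, share_b = hourly_share(user_a), hourly_share(user_b)
    return sum(min(share_a[h], share_b[h])
               for h in share_a.keys() & share_b.keys())

print(earliest_member(joined, "alice", "bob"))
print(posts_per_day(posts, "alice"), posts_per_day(posts, "bob"))
print(hour_overlap(posts, "alice", "bob"))
```

The fourth question, finding other pairs with a similar relationship, would require first deciding what "similar" means, which is an analytic judgment rather than a programming task.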

And a third category inquires after the structures of the site:

  • When was the "Around the World" subforum created?
  • Are there stickies[9], administrators, or moderators on this subforum?
  • How does "Around the World" compare with other subforums in terms of activity and diversity?
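
A rough comparison of subforums might start from counts like those below. The sketch assumes that "activity" can be approximated by the number of archived posts and "diversity" by the number of distinct authors. The sample rows and the second subforum name are invented.

```python
from collections import defaultdict
from datetime import date

# Invented (subforum, author, date) rows standing in for archived posts.
posts = [
    ("Around the World", "alice", date(2008, 9, 1)),
    ("Around the World", "bob", date(2008, 9, 2)),
    ("Around the World", "carol", date(2008, 9, 2)),
    ("Other Subforum", "alice", date(2007, 1, 5)),
    ("Other Subforum", "alice", date(2007, 1, 6)),
]

def summarize_subforums(posts):
    """Return per-subforum post counts, author counts, and earliest post date."""
    grouped = defaultdict(list)
    for subforum, author, day in posts:
        grouped[subforum].append((author, day))
    summary = {}
    for subforum, rows in grouped.items():
        summary[subforum] = {
            "posts": len(rows),                           # proxy for activity
            "authors": len({author for author, _ in rows}),  # proxy for diversity
            "earliest_post": min(day for _, day in rows),
        }
    return summary

for name, stats in summarize_subforums(posts).items():
    print(name, stats)
```

The earliest archived post only indicates that a subforum existed by that date; the actual creation date may not be recoverable from the public interface. Stickies and moderator lists are not covered by this sketch.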

The questions raised by the Pricescope example suggest three units of analysis for this case: the user, the topic (or "thread"), and the subforum. Each of these "web objects" is represented both in the "backend" data structures on the Pricescope server and in the presentation of that data within the forum's web interface (Schneider & Foot, 2005). Because we do not have access to the Pricescope backend, the data must be first extracted from the interface and stored locally.
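
One way to picture the local store is as a small relational database with a table for each unit of analysis. The schema below is a sketch using Python's built-in sqlite3 module. The table and column names are assumptions and do not mirror the actual vBulletin or Ideal BB backend, which we cannot see.

```python
import sqlite3

# Hypothetical schema: one table per unit of analysis, plus individual posts.
SCHEMA = """
CREATE TABLE IF NOT EXISTS subforum (
    id INTEGER PRIMARY KEY,
    title TEXT NOT NULL
);
CREATE TABLE IF NOT EXISTS member (
    id INTEGER PRIMARY KEY,
    name TEXT NOT NULL UNIQUE
);
CREATE TABLE IF NOT EXISTS thread (
    id INTEGER PRIMARY KEY,
    title TEXT NOT NULL,
    subforum_id INTEGER NOT NULL REFERENCES subforum(id)
);
CREATE TABLE IF NOT EXISTS post (
    id INTEGER PRIMARY KEY,
    thread_id INTEGER NOT NULL REFERENCES thread(id),
    member_id INTEGER NOT NULL REFERENCES member(id),
    posted_at TEXT NOT NULL,
    body TEXT NOT NULL
);
"""

def open_archive(path=":memory:"):
    """Open (or create) a local archive and make sure the tables exist."""
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    return conn

# Invented rows, for illustration only.
conn = open_archive()
conn.execute("INSERT INTO subforum VALUES (1, 'Around the World')")
conn.execute("INSERT INTO member VALUES (1, 'user_a')")
conn.execute("INSERT INTO thread VALUES (1, 'Example thread', 1)")
conn.execute(
    "INSERT INTO post VALUES (1, 1, 1, '2008-10-01T09:00:00', 'Example text')"
)
conn.commit()

# Count archived posts per subforum.
rows = conn.execute(
    """
    SELECT subforum.title, COUNT(post.id)
    FROM post
    JOIN thread ON post.thread_id = thread.id
    JOIN subforum ON thread.subforum_id = subforum.id
    GROUP BY subforum.id, subforum.title
    """
).fetchall()
print(rows)
```

Keeping users, threads, and subforums in separate tables means that each of the three groups of questions can be posed as a query against the same archive.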

The vBulletin interface envelops user data within a combination of technologies: semantic markup (HTML), style guidelines (CSS), digital images (PNG, JPEG, GIF), and client-side scripts (JavaScript). The process of retrieving this data is performed by "web scraping" software. Although it has a history in the computer science literature, "scraper" software is often linked in popular discourse to the nefarious activities of spammers, "harvesters", and "phishers" (King, 2008; Smith, 2008). Scrapers have been used in the past to rapidly "harvest" email addresses from publicly accessible websites. The unlucky owners of these addresses are then deluged by spam messages.
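
To make the extraction step concrete, the sketch below pulls author names and message text out of a simplified page using Python's standard html.parser module. The markup in `SAMPLE_PAGE` (a `div` with class `post` and a `data-author` attribute) is invented for illustration. Real messageboard templates vary by version and theme, so the matching rules would need to be adapted to each site.

```python
from html.parser import HTMLParser

class PostExtractor(HTMLParser):
    """Collect (author, text) pairs from a simplified, hypothetical layout."""

    def __init__(self):
        super().__init__()
        self.posts = []
        self._author = None   # author of the post currently being read
        self._chunks = []     # text fragments seen inside that post
        self._depth = 0       # nesting depth of <div> tags inside the post

    def handle_starttag(self, tag, attrs):
        if tag != "div":
            return
        attrs = dict(attrs)
        if self._author is None and attrs.get("class") == "post":
            # A new post begins.
            self._author = attrs.get("data-author", "")
            self._chunks = []
            self._depth = 1
        elif self._author is not None:
            # A nested <div> inside the current post.
            self._depth += 1

    def handle_endtag(self, tag):
        if tag == "div" and self._author is not None:
            self._depth -= 1
            if self._depth == 0:
                # The post's outer <div> has closed; normalize whitespace.
                text = " ".join(" ".join(self._chunks).split())
                self.posts.append((self._author, text))
                self._author = None

    def handle_data(self, data):
        if self._author is not None:
            self._chunks.append(data)

# Invented markup standing in for a saved thread page.
SAMPLE_PAGE = """
<html><body>
<div class="post" data-author="user_a"><p>First message</p>
  <div class="sig">signature</div></div>
<div class="post" data-author="user_b"><p>Second message</p></div>
</body></html>
"""

extractor = PostExtractor()
extractor.feed(SAMPLE_PAGE)
print(extractor.posts)
```

A parser of this kind sees only the markup it is given. Anything rendered by client-side scripts or embedded in images would be missing from the resulting archive.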

Software tools are not ideologically inert objects (Dean, 2009; Galloway, 2006). The discursive connection between scraping and anti-social activities requires that researchers proceed with extra caution and care for the communities they study. Though the information on Pricescope is plainly accessible to the public, we might decide to contact an administrator before beginning to automatically collect data. The ethical use of web scraping tools will be unique to each field of study but all researchers should remain conscious of the negative discourses that their tools may invoke.
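
One modest technical expression of this caution is to honor a site's robots.txt file, a widely used convention by which administrators signal which paths automated agents should avoid. The sketch below uses Python's standard urllib.robotparser module on an invented robots.txt. The agent name, the host `forum.example`, and the fallback delay are assumptions. Checking this file complements, but does not replace, contacting an administrator.

```python
from urllib.robotparser import RobotFileParser

# Invented robots.txt content, for illustration only.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 10
"""

AGENT = "research-archiver"  # hypothetical name for the collection tool

def build_policy(robots_txt):
    """Parse robots.txt text into a policy object that can be queried."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser

def may_collect(policy, url, agent=AGENT):
    """Return True if the site's stated policy permits fetching this URL."""
    return policy.can_fetch(agent, url)

policy = build_policy(SAMPLE_ROBOTS_TXT)

# Seconds to wait between requests; fall back to an assumed default of 1
# when the site does not state a delay.
delay = policy.crawl_delay(AGENT) or 1

print(may_collect(policy, "https://forum.example/showthread.php?t=1"))
print(may_collect(policy, "https://forum.example/private/members"))
print(delay)
```

Pausing between requests keeps the tool's footprint closer to that of an ordinary reader than to the rapid harvesting associated with spammers.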

Conclusion

This paper proposed the development of a research tool to support scholars who encounter instances of the vBulletin messageboard platform in the course of their studies. The methodology that drives this project is a mixed-method approach that pairs participant observation with software-assisted data collection and qualitative analysis. Preliminary fieldwork is required to determine the units of analysis and temporal characteristics that bound the field of study. Semi-automated data collection software is next deployed to build a local archive of traces generated from activities occurring in the field. Finally, the tool assists in the data analysis by visualizing archived traces in ways not possible within the traditional vBulletin interface.

The discussion raised some important technical and ethical considerations that ought to accompany any software-assisted online ethnography. Researchers must be careful not to rely too heavily on their local archive as critical information may be obscured, excluded, or unknowable to the data collection software. In some cases, research software might violate norms regarding ephemerality so researchers must also pay close attention to what constitutes "public" data in their chosen field.

Finally, I wish to reiterate the ethnographic value of writing code for the social scientist interested in technologically-mediated phenomena. Even seemingly unsuccessful attempts to develop custom research tools will put the researcher in touch with new discourses and may open up new possibilities for interdisciplinary collaboration. The process of developing the tool described in this paper will not only benefit colleagues working with communities of vBulletin users but will contribute to a more nuanced understanding of the affordances and constraints of the vBulletin platform itself as a compelling object of study.

Endnotes

  1. These cases will be published in a forthcoming white paper titled From Participatory Culture to Public Participation.
  2. vBulletin is representative of several messageboard platforms including phpBB and Ideal BB. The present discussion is relevant to those platforms as well.
  3. For an easy test of vBulletin's quiet pervasiveness, add the phrase "Powered by vBulletin" to any Google search.
  4. The canonical example is HousingMaps, a site that represents the rental housing available on Craigslist as a customized instance of Google Maps.
  5. http://www.atlasti.com/
  6. http://www.qsrinternational.com/
  7. All of the information about Pricescope was collected by Lana Swartz as part of the forthcoming white paper From Participatory Culture to Public Participation.
  8. The Pricescope forums use Ideal BB, an older messageboard platform that is no longer in production. For the purposes of this discussion, it is nearly identical to vBulletin.
  9. On the typical vBulletin installation, forum topics are organized in reverse chronological order starting at the top of the page. Stickies are topics that stay permanently at the top of the forum. They are often used for FAQs, forum rules, or updates from the forum administrators.

References

  • Barry, C. A. (1998). Choosing qualitative data analysis software: Atlas/ti and Nudist compared. Sociological Research Online, 3(3). Retrieved from: http://www.socresonline.org.uk/3/3/4.html
  • Beaulieu, A. (2005). Sociable hyperlinks: An ethnographic approach to connectivity. In C. Hine (Ed.), Virtual methods: Issues in social research on the internet (pp. 183-197). Oxford: Berg Publishers. Retrieved from: http://site.ebrary.com/lib/uscisd/Doc?id=10233369&ppg=197
  • Burroughs-Lange, S. G., & Lange, J. (1993). Denuded data! Grounded theory using the NUDIST computer analysis program: In research the challenge to teacher self-efficacy posed by students with learning disabilities in Australian education. Paper presented at the Annual Meeting of the American Educational Research Association, Atlanta, GA, April 12-16.
  • Bruckman, A. (2002). Studying the amateur artist: A perspective on disguising data collected in human subjects research on the internet. Ethics and Information Technology, 4, 217-231.
  • Dean, J. (2009) Democracy and other neoliberal fantasies. Durham: Duke University Press.
  • Galloway, A. R. (2006). "Protocol". Theory, Culture & Society, 23, p. 317. doi: 10.1177/026327640602300241
  • Geiger, R. S. & Ribes, D. (2010). The work of sustaining order in Wikipedia: The banning of a vandal. In Proceedings of the 2010 ACM Conference on Computer Supported Cooperative Work (CSCW). New York: ACM.
  • Guimarães Jr., M. J. L. (2005). Doing anthropology in cyberspace: Fieldwork boundaries and social environments. In C. Hine (Ed.), Virtual methods: Issues in social research on the internet (pp. 141-156). Oxford: Berg Publishers. Retrieved from: http://site.ebrary.com/lib/uscisd/Doc?id=10233369&ppg=155
  • King, C. (2008) "When web scraping can help - or hurt - your business". Cynosure: Dishing on the digital universe, August 28. Retrieved from: http://cynosure.crystalking.com/?p=162
  • Kosala, R. & Blockeel, H. (2000). Web mining research: a survey. ACM SIGKDD Explorations Newsletter, 2(1), June. Retrieved from: http://doi.acm.org/10.1145/360402.360406
  • Morales, S. P. B., Fandiño, H. A. B., & Rodríguez, J. R. (2009). Hypertext classification to filtrate information on the web. Proceedings of the 2009 Euro American Conference on Telematics and Information Systems: New Opportunities to increase Digital Citizenship, Prague, Czech Republic. Retrieved from: http://doi.acm.org/10.1145/1551722.1551723
  • Sack, W. (2000) "Conversation map: A content-based usenet newsgroup browser". Proceedings from IUI, New Orleans, pp. 233-240.
  • Sack, W. (2001) "Conversation map: An interface to very large-scale conversations". Journal of Management Information Systems, Winter, 17(3), pp. 73-92.
  • Sack, W. (2002) "What does a very large-scale conversation look like? Artificial dialectics and the graphical summarization of large volumes of e-mail". Leonardo, 35(4), pp. 417-426.
  • Schneider, S. M. & Foot, K. A. (2005). Web sphere analysis: An approach to studying online action. In C. Hine (Ed.), Virtual methods: Issues in social research on the internet (pp. 157-170). Oxford: Berg Publishers. Retrieved from: http://site.ebrary.com/lib/uscisd/Doc?id=10233369&ppg=184
  • Smith, R. (2008). "The ethics of website-data extraction". Associated Content, September 2. Retrieved from: http://www.associatedcontent.com/article/932829/screenscraping_ethics.html
  • Turkle, S. (Ed.) (2007). Evocative objects: Things we think with. Cambridge: MIT Press.
  • Williams, D. & Xiong, L. (2009) Herding cats online: Real studies of virtual communities. In Hargittai, E. (Ed.) Research Confidential, pp. 122-140. Ann Arbor: University of Michigan Press.