This web site provides access to the Tobacco Documents Corpus and many of the linguistic and statistical tools we developed for investigating its content. To contact a project staff member, send an email to tobacco@uga.edu. Please select one of the following links:
Introduction to the Tobacco Documents Corpus: This page provides a quick introduction to the Tobacco Documents Corpus (TDC), suggestions for better viewing, and definitions for many of the terms associated with the TDC. You can access this page from any TDC page by clicking on the information page link. Also available on this page are links to information on improving this server's performance, searching for known documents, and topics known to have significant variation in the document set.
Create a Sub-Corpus from the TDC: The Create a Sub-Corpus tool allows you to select specific text types from the entire Tobacco Documents Corpus by categories such as (1) the decade in which the documents were written, (2) the tobacco company that was the source of the document, (3) whether the target audience was internal or external to the industry, (4) whether the target audience was a specific person or group or a generalized audience, and (5) the component of the corpus in which the document was sampled. Once you have selected a subsample of documents to work with, this tool can also display different text elements such as marginalia, emphasis, cross-outs, or headings.
View Terms in Context: This tool allows you to search for specific words in document sets you select. The output displays the key word(s) in red surrounded by their contexts. You can also display the entire document in which a particular word-in-context appears by clicking the corresponding hyperlink.
Plot Differences Across Document Groupings: This tool allows you to determine whether particular words or phrases are dominant in one group of documents compared to the document corpus as a whole. You can select groups according to decade, industry source, and target audience. Output includes graphs of raw frequencies and relative frequencies of the terms you select, as well as z-scores to assist in determining if the difference from the norm is statistically significant.
Plot Peak Usage: This tool allows you to produce a line graph depicting changes in word or phrase frequency over time (by year). The method uses a rolling average to minimize artifacts of small-group sampling (which can cause wild variation) and gives a clear indication of how term usage changes from year to year.
View Rhetorical Case Data: This link allows you to view groupings of documents selected to investigate rhetorical aspects of the corpus. In addition to the rhetorical case documents themselves, you can also view graphs pertaining to some of their linguistic characteristics. Two types of rhetorical cases are compiled. (1) Audience cases juxtapose a document targeted to an industry-internal audience with one on the same topic addressed to an industry-external audience. (2) Multiple draft cases assemble a sequence of drafts of the same document so that you can see how the final version evolved.
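For readers who want a feel for the statistics behind the Plot Differences Across Document Groupings and Plot Peak Usage tools, the following is a minimal Python sketch, not the project's actual implementation. It assumes the z-score compares a group's relative frequency for a term against the whole-corpus rate using a binomial standard error, and that the rolling average is a centered moving average over consecutive years. The function names, the window size of 3, and the edge handling are all illustrative assumptions.

```python
import math


def relative_frequency(term_count, total_tokens):
    """Occurrences of a term per token; 0.0 for an empty group."""
    return term_count / total_tokens if total_tokens else 0.0


def z_score(group_count, group_tokens, corpus_count, corpus_tokens):
    """How far a group's term rate departs from the whole-corpus rate.

    Assumption: the corpus rate is treated as the expected proportion,
    and the standard error is the binomial one for a sample the size
    of the group. The tools on this site may use a different formula.
    """
    p_corpus = relative_frequency(corpus_count, corpus_tokens)
    p_group = relative_frequency(group_count, group_tokens)
    standard_error = math.sqrt(p_corpus * (1 - p_corpus) / group_tokens)
    return (p_group - p_corpus) / standard_error


def rolling_average(yearly_values, window=3):
    """Centered moving average over a list of per-year frequencies.

    Assumption: the window simply shrinks at the start and end of the
    series, so the output has the same length as the input.
    """
    half = window // 2
    smoothed = []
    for i in range(len(yearly_values)):
        lo = max(0, i - half)
        hi = min(len(yearly_values), i + half + 1)
        chunk = yearly_values[lo:hi]
        smoothed.append(sum(chunk) / len(chunk))
    return smoothed
```

Under these assumptions, a term that appears 20 times in a 1,000-word group but only 100 times in a 10,000-word corpus would receive a z-score of a little over 3, which would conventionally be read as a notable departure from the corpus norm. A wider window in rolling_average smooths year-to-year noise more heavily, at the cost of blurring short-lived peaks.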

The Tobacco Documents Project and the resultant data available here on the server were made possible by a grant from the National Cancer Institute (1 RO1 CA87490-01, July 1, 2001 to June 30, 2004). Please visit the UGA Tobacco Documents website for further information.

NIH-NCI Tobacco-Documents Project at the University of Georgia (Grant # 1 RO1 CA87490-01).