UGA Tobacco Documents Project


Introduction and Glossary of Site Terms

  • Introduction to the University of Georgia Tobacco Documents Corpus - The unprecedented release of several million previously internal documents by tobacco manufacturers has created a resource that will be of great interest to scholars and policy analysts from many disciplines. In addition to the fields of tobacco control and public health, these documents may yield valuable insights to users in business studies, rhetoric and communication, and text linguistics.

    To facilitate investigations of the tobacco industry documents, the Linguistic Analysis of Tobacco Documents Project at the University of Georgia has assembled a carefully designed corpus of these texts. We have also devised a workbench of tools enabling you to conduct your own inquiries regarding these documents. This webpage gives you entrée to our corpus and to those tools.

    The documents available through this interface were selected through a scientifically designed stratified randomization process (for further information, see Kretzschmar, W. A., Darwin, C., Brown, C., Rubin, D. L., & Biber, D. (2004). Looking for the smoking gun: Principled sampling in creating the tobacco industry documents corpus. Journal of English Linguistics, 32, 31-47). As of June 2004, about 1200 documents are in the corpus. Each document was hand-keyed and tagged so that specific elements of the texts, such as cross-outs and handwritten marginalia, can be isolated and fed into computer-assisted analyses, should you so wish. The interface offers five different tools to search the database. You may create your own sub-corpus, use a keyword-in-context concordance, plot significant differences in the use of words or phrases, plot peak usage of words across time, or access a set of rhetorical document cases.

  • Helps and Hints for Using the TDC Web Server - Below are various tips and technical information that will help you make better use of the Tobacco Documents Server, including tips on viewing, improving performance, and searching for known documents, plus a list of topics known to have significant variation in the document set.

    • Viewing Tips - This site was designed for use with Microsoft Internet Explorer. As of this version, it has not been thoroughly tested with Netscape products, and there have been some reported problems with graph images loading. Text size should be set to MEDIUM or smaller. In Internet Explorer, the sequence for changing text size is VIEW - TEXT SIZE - SMALLER. Wherever you find yourself on the server, if you have questions, look for a blue hyperlink or the information button. This will link you to associated items on this glossary and information page. See the Technical Tips for information on how to run this server from a local hard disk (it's faster if you want to take the time to set it up).
    • Technical Tips - The main reason to discuss technical aspects of this server is performance. That is, you want to speed things up a bit. As it comes, the server runs completely from the CD where all data and programs are stored. Consequently, the speed at which the tools run is dependent less on the speed of the computer processor, and more on the speed of the CD drive. This is because there is a lot of data retrieval. With a fast CD (32x or more), there is really no reason to do anything but use the CD. However, with a slow CD, or if you are using the graphing tools a lot, it will be much faster to run from your local hard drive.

      This is quite simple to do. First, close the server. Next, copy the folder "ugaserver" and all sub-folders from the CD drive to your hard disk. It doesn't matter where you put it, as long as you remember where it is. You may need to use the Explore option to do this: from "My Computer", right-click the CD drive and choose "Explore", then drag and drop the "ugaserver" folder to where you want it. You will need about 800 MB of space on your hard disk. Once the folder is on your hard disk, open it and double-click "start.exe" to begin. You could also right-click "start.exe" and use the "Send To" function to create a shortcut on your desktop. Either way, "start.exe" gets you going.

      There are two other bits of information you might want to know. First, if you would like to see a log of all the activity on this server, run "c-start.exe" instead of "start.exe". This will give you a console view that displays all server activity. Second, the server assigns and uses a random port number (between 1000 and 7999) each time it runs. You will be able to see this in your browser address bar. This is for security reasons, but it means that bookmarking pages is useless: the address changes each time the server starts, so you will not be able to find them again.

    • Document List (Search) - If you are simply interested in knowing if a particular document is included among the 1200 documents comprising the Tobacco Documents Corpus, or if you want to view a known document, this is a good place to start. This page is a sorted listing of all the documents in the Tobacco Documents Corpus. This includes 808 documents from the Stratified Random Sample (Quota Sample), 100 from the Supplemental Sample, and over 200 from the Rhetorical Case Studies. You can view the PDF image, text, or xml coding by clicking on the links to the right of the document identification number. In most cases, the document identification number is the starting Bates number of the document. However, some Rhetorical cases are identified by case number. Click here to view.
    • Hot Topics - As you begin to use this resource, you will run into many dead ends. That is, many terms never show up, or are so infrequent that statistical measures are unreliable. This is normal. Not everything occurs at reliably measurable rates. In fact, statistically speaking, most things don't, and that is why those that do are interesting. However, if you feel like you are shooting in the dark, we have prepared a list of hot terms for you. These are terms that we know to deviate significantly from the norm and to have reliable z-scores. This is a good place to start if you are new to tobacco document study. Click here to view.
  • Acknowledgements -

    • NCI - This research was supported by a grant entitled Linguistic Analyses of Tobacco Industry Documents (RO1 CA 87490) from the National Cancer Institute, National Institutes of Health, U.S. Department of Health and Human Services. The views expressed, however, are solely those of the project team members.
    • This server and the CGI scripts which produce the data output are written in the Python programming language. Our thanks to the folks at www.python.org and others who have worked so hard to create these programming resources.
    • We would also like to thank the folks at Advanced Software Engineering for their generosity in allowing us to distribute their product on our CD. We use their ChartDirector© program to generate our graphical data displays, and we recommend it highly.
  • Select Document Type - This option allows you to select documents based on the ways individual documents are indexed in the Tobacco Documents Corpus. For example, if you leave the boxes checked (a check mark appears in each clicked box) for (1) Stratified Random Sample Collection, (2) Undated Documents, (3) Philip Morris, and (4) Named Industry-Internal Audience, you will receive only those documents in the corpus that conform to all four criteria. If you click only the boxes for Industry-External Audience and Supplemental Collection, and you leave all other boxes unclicked, you will obtain all the documents within the supplemental sample. Hint: You have to select something. No boxes checked = no documents selected.
  • Stratified Random Sample - This sample, which is the principal sample of the Tobacco Documents Corpus, contains 808 documents randomly selected from the set of tobacco documents known as the "snapshot," namely all documents up to June 1999. The sample is stratified by decade to ensure representation across time. The only exclusions are documents that were not in English and those that had fewer than 50 words of analyzable text (i.e., documents that are primarily forms and/or images). Nearly all of the documents in this collection are addressed to industry-internal audiences.
  • External Audience Supplemental Sample - Since 96% of the documents in the Stratified Random Sample are addressed to industry-internal audiences, for clear comparisons it was necessary to deliberately search out documents that were addressed to industry-external audiences, such as press releases and letters to consumers and regulatory agencies. This collection contains a stratified, random sample of 100 such documents. The procedure was identical to that used for the larger Stratified Random Sample, with the added specification of industry-external audience.
  • Rhetorical Cases - This corpus was assembled to allow you to view groupings of documents selected for their rhetorical interest (nothing random here). Two types of rhetorical cases are compiled: (1) Audience cases juxtapose a document targeted to an industry-internal audience with another document on the same topic addressed to an industry-external audience; and (2) Multiple-draft cases assemble a sequence of drafts of the same document so that you can see how it evolved from first draft to final version. In addition to viewing the rhetorical case documents themselves, you can also view graphs representing some of their linguistic characteristics.
  • Document Class - For purposes of sampling, each document in the Tobacco Documents Corpus was classified as to whether the addressee/audience was named or un-named, and industry-internal or industry-external. The four document classifications are therefore (1) Named Internal Audience, (2) Named External Audience, (3) Un-named Internal Audience, and (4) Un-named External Audience.
  • Named Internal Audience - Each document in this class has a named recipient who is internal to the tobacco industry. Named recipients are any individual(s) designated by name. The distinction between named and unnamed audiences is one way of pointing to different genres; most documents with named audiences will be letters or memoranda. A memorandum with a TO: list of names could be a member of this class, but a memorandum addressed just to Philip Morris North America Employees would not. Internal audiences are comprised of any individual(s) who was (were) employed by a tobacco manufacturer, trade organization, subsidiary, or who have had documented financial relationships (e.g., paid consultant) with any such organization.
  • Named External Audience - Each document in this class has a named recipient who is external to the tobacco industry. Named recipients are any individual(s) designated by name. The distinction between named and unnamed audiences is one way of pointing to different genres; most documents with named audiences will be letters or memoranda. A memorandum with a TO: list of names could be a member of this class, but a memorandum addressed just to Philip Morris North America Employees would not. External audiences are comprised of any individual(s) who was (were) never employed by a tobacco manufacturer, trade organization, subsidiary, and who never had financial relationships (e.g., paid consultant) with any such organization.
  • Un-Named Internal Audience - Each document in this class has an unnamed target audience who is internal to the tobacco industry. Unnamed audiences are never individuated or designated by name. The distinction between named and unnamed audiences is one way of pointing to different genres; most documents with unnamed audiences will be reports. A memorandum addressed just to Philip Morris North America Employees would count as having an unnamed audience as well. Internal audiences are comprised of any individual(s) who were employed by a tobacco manufacturer, trade organization, subsidiary, or who have had documented financial relationships (e.g., paid consultant) with any such organization.
  • Un-Named External Audience - Each document in this class has an unnamed target audience who is external to the tobacco industry. Unnamed audiences are never individuated or designated by name. The distinction between named and unnamed audiences is one way of pointing to different genres; most documents with unnamed audiences will be reports. A memorandum addressed just to Philip Morris North America Employees would also count as a member of this class. External audiences are comprised of any individual(s) who was (were) never employed by a tobacco manufacturer, trade organization, subsidiary, and who never had a financial relationship (e.g., paid consultant) with any such organization.
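
    The four document classes above follow from two yes/no judgments about the audience. The sketch below is only an illustration of that two-by-two scheme; the function name and its boolean parameters are assumptions and are not part of the server's own scripts.

```python
def document_class(named: bool, internal: bool) -> str:
    """Combine the two audience judgments into one of the four class labels."""
    naming = "Named" if named else "Un-named"      # is a recipient designated by name?
    side = "Internal" if internal else "External"  # is the audience inside the industry?
    return f"{naming} {side} Audience"


if __name__ == "__main__":
    # A press release with no named recipient, aimed outside the industry.
    print(document_class(named=False, internal=False))
```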
  • Decade - With the exception of the Rhetorical Cases Collection, the Tobacco Documents Corpus is stratified by decade, which makes decade a key sampling variable.
  • 1900-1959 - Documents generated by tobacco manufacturers and industry organizations between 1900 and 1959. These five decades were grouped together because each of these early decades by itself contained relatively few documents.
  • 1960-1969 - Documents generated by tobacco manufacturers and industry organizations between 1960 and 1969.
  • 1970-1979 - Documents generated by tobacco manufacturers and industry organizations between 1970 and 1979.
  • 1980-1989 - Documents generated by tobacco manufacturers and industry organizations between 1980 and 1989.
  • 1990-1999 - Documents generated by tobacco manufacturers and industry organizations between 1990 and late 1999, when the tobacco Master Settlement Agreement went into effect.
  • Undated Documents (19xx) - Documents generated by tobacco manufacturers and industry organizations that lack specific dates both in the attorneys' indices and on the document images. Also referred to as 19xx.
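
    The decade groupings listed above can be read as a simple mapping from a document's year to a stratum label. This is a minimal sketch under stated assumptions: the function name is hypothetical, and using None (or 0) to stand for an unknown year is a convention chosen here for illustration, not something the server is known to do.

```python
from typing import Optional


def decade_stratum(year: Optional[int]) -> str:
    """Map a document year to the decade stratum labels used in this glossary."""
    if year is None or year == 0:
        return "19xx"                      # undated documents
    if 1900 <= year <= 1959:
        return "1900-1959"                 # early decades are grouped together
    if 1960 <= year <= 1999:
        start = year - year % 10           # e.g. 1978 -> 1970
        return f"{start}-{start + 9}"
    raise ValueError(f"year {year} falls outside the corpus strata")


if __name__ == "__main__":
    for y in (1934, 1978, None):
        print(y, "->", decade_stratum(y))
```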
  • Bliley Set Documents - Approximately 37,000 documents originally ruled inadmissible in the Minnesota Blue Cross lawsuit due to attorney/client privilege. Most of these documents were released in April 1998 by the US House Commerce Committee, of which Representative Thomas Bliley was Chair. As a group, this set of documents is especially incriminating.
  • Industry Source - The parties to the 1998 Master Settlement Agreement that were compelled to produce previously internal documents include the following industry sources: American Tobacco Company, Brown and Williamson, Council for Tobacco Research, Lorillard Tobacco, Philip Morris Company, R. J. Reynolds, and Tobacco Institute.
  • American Tobacco Company - One of the five manufacturers (American Tobacco, Brown and Williamson, Lorillard, Philip Morris, and R. J. Reynolds) that were part of the 1998 tobacco Master Settlement Agreement. As a result of that settlement, each company is required to make available to the public its internal documents through June 2010. American Tobacco was purchased by Brown and Williamson.
  • Brown and Williamson - One of the five manufacturers (American Tobacco, Brown and Williamson, Lorillard, Philip Morris, and R. J. Reynolds) that were part of the 1998 tobacco Master Settlement Agreement. As a result of that settlement, each company is required to make available to the public its internal documents through June 2010.
  • Council for Tobacco Research - The Council for Tobacco Research (CTR), spun off from the earlier Tobacco Industry Research Council, was funded and formed by a cartel of tobacco manufacturers in 1964. Its purpose was to manage a program of research that generally minimized the links between smoking and disease. CTR was closed in 1998 as part of the Master Settlement Agreement.
  • Lorillard - One of the five manufacturers (American Tobacco, Brown and Williamson, Lorillard, Philip Morris, and R. J. Reynolds) that were part of the 1998 tobacco Master Settlement Agreement. As a result of that settlement, each company is required to make available to the public its internal documents through June 2010.
  • Philip Morris - One of the five manufacturers (American Tobacco, Brown and Williamson, Lorillard, Philip Morris, and R. J. Reynolds) that were part of the 1998 tobacco Master Settlement Agreement. As a result of that settlement, each company is required to make available to the public its internal documents through June 2010.
  • R. J. Reynolds - One of the five manufacturers (American Tobacco, Brown and Williamson, Lorillard, Philip Morris, and R. J. Reynolds) that were part of the 1998 tobacco Master Settlement Agreement. As a result of that settlement, each company is required to make available to the public its internal documents through June 2010.
  • Tobacco Institute - The Tobacco Institute (TI) spun off from the earlier Tobacco Industry Research Council. It was funded and formed by a cartel of tobacco manufacturers in 1958. Its purpose was to manage a program of public relations and lobbying that would deflect efforts of anti-smoking activists. TI was closed in 1998 as part of the Master Settlement Agreement.
  • Collection - Denotes the collection from which a document came. The Tobacco Documents Corpus is comprised of three components: (1) Stratified Random Sample Collection (also called the Quota Collection), (2) Industry-External Supplemental Collection, and (3) Rhetorical Cases Collection. Each utilizes a different sampling plan.
  • Display Metadata - Metadata are information about the document, but they do not include any actual document text. Metadata elements associated with each document include the Bates document identification numbers, page and word counts, document dates, industry sources, and certain Tobacco Documents Corpus project filing notes. Metadata for some documents also include the attorneys' index entry, which provides a document title, names the addressee and the author, and lists any tobacco products mentioned. Generally you will want to display at least the start Bates number for each document.
  • Start Bates Number - As in many complex lawsuits with a large amount of documentary evidence, each page of a document is stamped with a unique multi-digit (and sometimes alphanumeric) identifier known as a Bates number. The tobacco industry attorneys decided what pages belonged together to constitute a given document. Sometimes a document might include a cover letter, plus a report, plus appendices. In other cases, that very same cover letter might constitute a different document. The Start Bates Number is the Bates number for the first page of a document set.
  • End Bates Number - As in many complex lawsuits with a large amount of documentary evidence, each page of a document is stamped with a unique multi-digit (and sometimes alphanumeric) identifier known as a Bates number. The tobacco industry attorneys decided what pages belonged together to constitute a given document. Sometimes a document might include a cover letter, plus a report, plus appendices. In other cases, that very same cover letter might constitute a different document. The End Bates Number is the Bates number of the last page in a designated document.
  • Document Date - The date of origin for each document, formatted as YYYYMMDD (e.g., May 14, 1978 would be formatted 19780514). In cases where a component of the date is unknown, that portion has been filled with zeros. For instance, if a document was written on an unknown date in August 1967, the date would be formatted 19670800. Documents without discernible dates are labeled 00000000 (and belong to the 19xx decade).
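
    If you download metadata and want to work with these date stamps in your own scripts, the zero-filled convention described above can be decoded as in the following sketch. The names DocDate and parse_doc_date are hypothetical, and returning None for an unknown component is a choice made here for illustration.

```python
from typing import NamedTuple, Optional


class DocDate(NamedTuple):
    year: Optional[int]
    month: Optional[int]
    day: Optional[int]


def parse_doc_date(stamp: str) -> DocDate:
    """Split a YYYYMMDD stamp; zero-filled components become None (unknown)."""
    if len(stamp) != 8 or not stamp.isdecimal():
        raise ValueError(f"expected eight digits, got {stamp!r}")
    year, month, day = int(stamp[:4]), int(stamp[4:6]), int(stamp[6:])
    # int("0000") and int("00") are 0, which `or None` turns into None.
    return DocDate(year or None, month or None, day or None)


if __name__ == "__main__":
    for stamp in ("19780514", "19670800", "00000000"):
        print(stamp, "->", parse_doc_date(stamp))
```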
  • Number of Pages - Check this box to display the number of pages in a document. The tobacco industry attorneys decided what pages belonged together to constitute a given document. Sometimes a document might include a cover letter, plus a report, plus appendices. In other cases, that very same cover letter might constitute a different document. The number of pages present in the document thus sometimes depends on what the trial attorneys happened to decide belonged together.
  • Number of Words - Check this box to display the number of words in the document. A word is usually defined as one or more letters or digits situated between two white spaces and within the Main Text portions of the document.
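
    To get a feel for the word definition above, here is one rough way to approximate it on plain text output. This is a sketch only: it assumes that a whitespace-delimited token counts as a word if it contains at least one letter or digit, which may differ from how the project's own counts treat punctuation, and it does not restrict itself to Main Text portions.

```python
def count_words(text: str) -> int:
    """Count whitespace-delimited tokens that contain a letter or digit."""
    return sum(
        1
        for token in text.split()            # split() breaks on any run of white space
        if any(ch.isalnum() for ch in token)
    )


if __name__ == "__main__":
    sample = "Memo to file -- 14 May 1978"
    print(count_words(sample))  # the bare "--" token is not counted
```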
  • Tobacco Documents Project Notes - Project document encoders occasionally included information they thought might prove useful in making the encoded document clearer. These may include notations about the state of the document, any inconsistencies in the document, or remarks about any peculiarities of the document.
  • Attorneys' Index Information - Standard attorneys' indexing information including a document title, subject, date, author, addressees, product mentions was provided for most of the Tobacco Document Corpus entries by the staff of the Legacy Foundation online document archive at University of California, San Francisco.
  • Primary Divisions: Document Data - Refers to the data contained in the major divisions of the document itself (not the metadata). When searching, you almost always want to leave the "maindoc data" section checked. Please refer to the separate information tags to determine if you need to retrieve information from them.
  • Predoc Component - Any given document -- the beginning and end of which were defined by defendants' attorneys and designated by a continuous Bates number set -- might have embedded within it a number of separate, additional documents. A predoc component refers to a separate document that precedes the main or most substantive component within a Bates number set. Predocs are most often cover letters or some set of instructions for distributing the main document.
  • Maindoc Component - Any given document -- the beginning and end of which were defined by defendants' attorneys and designated by a continuous Bates number set -- might have embedded within it a number of separate, additional documents. At the very least, a document must include a maindoc component. This is the primary document within the Bates number set, expressing the core topic of the document as a whole, and excluding foregoing or following material. The maindoc should almost always remain selected.
  • Postdoc Component - Any given document -- defined by defendants' attorneys and designated by a continuous Bates number set -- might have embedded within it a number of separate, additional documents. The Postdoc Component refers to any full document following the main document. A postdoc cannot bear the explicit label appendix, however. Postdoc components are generally full reports or earlier correspondence referred to within a main document.
  • Appendices - Any given document -- defined by defendants' attorneys and designated by a continuous Bates number set -- might have embedded within it a number of separate, additional documents. An appendix is defined first of all by an explicit heading indicating appendix. Appendices are often compilations of sales or other technical data, often mainly in tabular form. If the document as a whole is long (over 2000 words), the appendix may be briefly described in a project note, but the text not actually encoded.
  • Xdoc Data - An xdoc is a document entity not otherwise defined in the document protocol.
  • Secondary - Secondary Document Divisions refer to the different parts of any Division within the Primary Document Divisions. For instance, any Predoc may contain Pre Text, Main Text and Post Text, just as any appendices may contain Pre Text, Main Text and Post Text. Please see the individual definitions of each Division to determine if you would like to include them in your search.
  • Display Pretext - This function allows you to view the pretext elements of any document. Pretext refers to all analyzable text that precedes the major prose segment of the document. It generally consists of salutations, distribution lists, and the date. Pretext does not include titles or headers preceding the initial block of analyzable text, since those are classified as headers. Any document component (e.g., predoc as well as maindoc) may contain pretext.
  • Display Text - This function should almost always be selected. It allows you to view the primary text of a document. Text refers to the more or less continuous prose, the heart of the document. This is the part of the document which is most likely to be subject to grammatical analysis, for example. The text does include any headers, even headers which precede the text; but text excludes pretext and excludes page break information.
  • Display Posttext - This function allows you to view the posttext elements of any document. Posttext refers to all analyzable text that follows the major prose segment of the document. It generally consists of signatures. Any document component (e.g., predoc as well as maindoc) may contain posttext.
  • Display Normative Text - This function should almost always be selected. It permits you to view the normative, continuously running part of the text that has not been subject to any special emphasis or editing marks. If you leave this box unchecked, you will not see the bulk of the document.
  • Display Emphasized Text - Check the box for this function to include text which has been emphasized typographically, usually through bold or italic fonts. (The emphasized text will appear in red font in the HTML display of the documents.) If you leave this box unchecked, you will not see any emphasized text.
  • Display Marked Text - Check the box to include text that has been circled, underlined, highlighted, etc. by hand after the completion of a typed version (the marked text will appear in red font in the HTML display of the documents). If you leave this box unchecked, you will not see marked text.
  • Display Lined-out Text - Check the box for this function to include text which has been crossed out by hand during editing (the lined-out text will appear in red font in the HTML display of the documents). Making lined out text visible is a good way to see an earlier draft of the document at hand. Do not check this box if you want to see the later, edited version of the text.
  • Display Inserted text - Check the box for this function to include text which has been inserted by hand during editing (the inserted text will appear in red font in the HTML display of the documents). Making inserted text visible is a good way to see the latest, edited draft of the document at hand. Do not check this box if you want to see the earlier version of the text.
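
    The interplay between the lined-out and inserted check boxes can be pictured as filtering tagged text segments. The sketch below assumes a document has already been reduced to (kind, text) pairs; the kind labels "normative", "lined_out", and "inserted" are hypothetical stand-ins, not the corpus's actual tag names, and the sample sentence is invented.

```python
from typing import Iterable, Tuple

Segment = Tuple[str, str]  # (kind, text); kind labels are hypothetical


def render(segments: Iterable[Segment],
           include_lined_out: bool,
           include_inserted: bool) -> str:
    """Join the text of every segment whose kind has not been switched off."""
    hidden = set()
    if not include_lined_out:
        hidden.add("lined_out")
    if not include_inserted:
        hidden.add("inserted")
    return " ".join(text for kind, text in segments if kind not in hidden)


if __name__ == "__main__":
    segments = [
        ("normative", "The meeting is scheduled for"),
        ("lined_out", "Monday"),    # crossed out by hand during editing
        ("inserted", "Tuesday"),    # written in by hand during editing
    ]
    print(render(segments, include_lined_out=True, include_inserted=False))
    print(render(segments, include_lined_out=False, include_inserted=True))
```

    The first call approximates the earlier draft (lined-out text shown, insertions hidden); the second approximates the later, edited version.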
  • Display Titles/Headers - Check the box for this function to include the title of any document and any headings that may be used to show topical boundaries between sections of a document (the titles and headers will appear in red font in the HTML display of the documents). Titles and headers are not usually sentences integrated into the text, and therefore you may not want to include headers or titles if you are doing grammatical analyses. (Note: Titles and headers are not part of pretext; they are part of the main text.) If you leave this box unchecked, you will not see titles and headers.
  • Display Quotations - Check the box for this function to include any text that is quoted from another source or person besides the main author (the quoted text will appear in red font in the HTML display of the documents). Quotations may include excerpts from focus group responses or segments from government documents that are being attacked by tobacco spokespersons (e.g., The Surgeon General's Report). Since quotations are not language composed by tobacco industry sources, you may want to exclude them from analyses of tobacco industry language. If you leave this box unchecked, you will not see quotes.
  • Display Marginalia - Check the box to include the text of hand-written or typed marginalia and interlinear comments added by a reader after the completion of the original document (the marginalia will appear in red font in the HTML display of the documents). Marginalia may include stamps like For Immediate Release or Confidential. Sometimes the marginalia will reveal a reader's initial gut response to a text.
  • Display Illegible Text - Check the box to include encoders' annotations of text that could not be reliably deciphered. If the encoder had a strong conjecture as to what the text should say, he/she may have typed in a reading of the text and enclosed it in illegible tags (The encoder's rendering of the illegible text will appear in red font in the HTML display of the documents).
  • Display Image Text - Check the box for this function to include any text that might appear inside an image (the text from within the images will appear in red font in the HTML display of the documents). Only text segments longer than 50 words have been encoded. Image text is usually comprised of words that are part of an advertisement.
  • Display Form Text - Check the box to include any text that might appear inside a form. (The text from within the forms will appear in red font in the HTML display of the documents.) Only text segments longer than 50 words have been encoded. Form text is usually comprised of words inserted into cells in a preprinted form. Forms differ from tables in that tables are generated specifically for a particular report, whereas forms are usually preprinted.
  • Display Table Text - Check the box to include any text that might appear inside a table (the text from within the tables will appear in red font in the HTML display of the documents). Only text segments longer than 50 words have been encoded. Tables differ from forms in that tables are usually generated specifically for a particular report, whereas forms are usually preprinted.
  • Display Image/Form Descriptions - Check the box for this function to include the document encoders' descriptions of images and forms (the descriptions will appear in red font in the HTML display of the documents).
  • Display Image or Form Captions - Check the box for this function to include captions that might have labeled images or forms. Captions refer to text outside of the actual image or form (i.e., above, to the side, or below) while Image Text and Form Text refer to text within the Image or Form proper.
  • Display Symbols - Check the box for this function to include encoders' descriptions of symbols that encoders have attempted to capture in words. An example of a symbol would be a typeset arrow. (Descriptions of the symbols will appear in red font in the HTML display of the documents.)
  • Display Page Break Data - Check the box to include data that occur between pages (the page break data will appear in green font in the HTML display of the documents). Page break data usually includes page numbers, fax marking, perhaps footer phrases or mottos. Page break data is not usually necessary to the overall evaluation of the document.
  • Display Footnotes - Check the box to include authors' footnotes or end notes. (The footnotes will appear in red font in the HTML display of the documents.) The text in a footnote tag is not in line with the rest of the text and generally has an anchor of some sort to alert the reader to find the footnote and further explication/clarification of the foregoing point.
  • Display Footnote Anchor - Check the box to include footnote anchors. (The footnote anchors will appear in red font in the HTML display of the documents.) Footnote Anchor tags enclose the letter, number, or symbol used to alert a reader that the text has a note. Please see the Footnote tag for further information on Footnotes.
  • Xitem - An xitem is an entity not otherwise defined in the document protocol.
  • Select Output Format - Allows you to select the way in which you want to view and/or download the document(s). Formats include HTML, which includes color and graphics but no visible tags; XML, which includes all the text tags; text, which shows only ASCII characters with no formatting; and reduced XML, which shows ASCII text with only the most basic XML tagging.
  • HTML format - This output format (HTML, Hypertext markup language) is the easiest to read and preserves all formatting.
  • ASCII (plain text) Format - Plain Text output format displays the text with no formatting. It is preferable for those who want to input the document into a text analysis tool such as Wordsmith.
  • Plain Text with Basic XML format - Documents are produced in plain text with basic XML codes and tags. This means you will retrieve the document in plain ANSI and ASCII codes with items such as "Unmarked Text," "Titles and headers," "Marginalia," etc. included.
  • Full XML format - Document produced with all XML tags, allowing you to see all coding. This format is the most flexible for those familiar with XSLT and XML management.
  • Quick View - Click this box for a short snapshot version of the output you have selected. This function allows you to quickly view the first 10 documents that match your search criteria. In this way you may check and make sure it is in the format you desire.
  • View Online - Click this box to view on your screen all of the documents that match your search criteria. This option does not actually download the files to your computer. Be warned that this can take a while. You may be creating an HTML file of over 6 MB, and it takes a while for this to load.
  • Download to File - Click this option to save the documents you have selected to your own computer or disk. This will cause a window to open, giving you the chance to save the document and to specify a location for the downloaded file. You will need to change the ending of the document name from .py to .txt, .xml, or .html (depending on the output format you have selected). Be warned that this can be a long process, as some of the options will generate a file of more than 5 MB.
  • Search Parameters - Determines search method. See the next four entries.
  • All Terms Search - Searches for documents that contain all of the words you have typed into the search field box. For instance, if you enter Philip and Morris, you will only receive documents that include both words. You will not receive documents that include the word Philip alone.
  • Any Term Search - Searches for documents that contain any of the words you have typed into the search field box. For instance, if you enter Philip Morris, you will receive the set of documents containing either Philip, or Morris, or both.
  • Clean search - Searches for the exact words you entered into the search field box. Typing in the word cigarette will not return documents containing the word cigarettes.
  • Fuzzy Search - Searches for variations of the word that you entered into the search field box. For instance, if you enter smok, you will receive documents containing smoke, smoking, smoker, nonsmoker, etc.
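
    As a rough illustration of how the four search options above differ, here is a minimal Python sketch. It is not the server's actual code. The function names tokenize and matches are hypothetical, and the sketch assumes that the All/Any and Clean/Fuzzy choices can be combined freely and that a fuzzy match means the typed letters appear anywhere inside a word (which is consistent with smok finding nonsmoker).

```python
import re


def tokenize(text):
    """Lowercase the text and split it into words (letters, digits, apostrophes)."""
    return re.findall(r"[a-z0-9']+", text.lower())


def matches(document, query, require_all=True, fuzzy=False):
    """Return True if the document satisfies the query (hypothetical helper).

    require_all=True  -> All Terms search; False -> Any Term search.
    fuzzy=False       -> Clean search: a query word must equal a whole word.
    fuzzy=True        -> Fuzzy search: a query word may occur inside a longer word
                         (an assumption about what "variations" means).
    """
    words = tokenize(document)

    def hit(query_word):
        if fuzzy:
            return any(query_word in word for word in words)
        return query_word in words

    hits = [hit(q) for q in tokenize(query)]
    if not hits:  # an empty query matches nothing
        return False
    return all(hits) if require_all else any(hits)
```

    Under these assumptions, searching "two cigarettes" for cigarette fails as a clean search, while searching "the nonsmoker left" for smok succeeds as a fuzzy search.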
  • File - The name file or files denotes 1) a computer file/document that contains an occurrence of a given term or set of terms, or 2) a constituent file/document of a sub-group of the Tobacco Documents Corpus.
  • Term - The name term denotes either a word, or collocation, or both. See below.
  • Word - The name word refers to any single word, or in technical terms, any sequence of lowercase alphanumeric characters and/or apostrophes (') not interrupted by white space. More specifically, any sequence using any of the ASCII character codes 39, 48-57, and 97-122. All analyses are based primarily on words and/or word combinations (collocations), and all input is converted to lowercase prior to processing.
  • Collocation - Any two words (see above definition) found in running text within three places of each other. Collocations are useful in that they further specify the meaning of any single word by adding context. For example, the word mass has a distribution similar to the word market (strong peak around 1985) and could be mistaken for a marketing term. However, the collocations mass-market, mass-markets, and mass-marketing never occur in the data. The word mass is most often a collocation of the word spectrometry.

    You must specify collocations by putting a dash between the two words which form it. Thus, designating young-smokers causes the program to count young smokers, young female smokers, and young and impressionable smokers as instances of the same collocation.
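
    The definitions of word and collocation above can be sketched in Python as follows. This is only an illustration under stated assumptions, not the project's implementation: the names tokenize and count_collocation are hypothetical, punctuation is treated as a word boundary, and the first word of the collocation is assumed to come before the second, with "within three places" read as at most three positions later.

```python
import re


def tokenize(text):
    """Lowercase the text, then keep runs of a-z, 0-9, and apostrophes
    (ASCII codes 97-122, 48-57, and 39)."""
    return re.findall(r"[a-z0-9']+", text.lower())


def count_collocation(text, collocation, window=3):
    """Count occurrences of a dash-joined collocation such as 'young-smokers'.

    An occurrence is counted each time the first word is followed by the
    second word within `window` positions (ordering is an assumption).
    """
    first, second = collocation.lower().split("-")
    words = tokenize(text)
    count = 0
    for i, word in enumerate(words):
        if word == first and second in words[i + 1 : i + 1 + window]:
            count += 1
    return count
```

    With this sketch, the three phrases young smokers, young female smokers, and young and impressionable smokers each count once toward young-smokers, while young adult male heavy smokers does not, because smokers falls four places after young.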

  • Term Scores - In our data displays we present data for terms (words and collocations) in three ways: raw frequencies, raw proportions, and z-scores. Each score presents the data in a unique way and works toward a more complete understanding by lowering the likelihood of the others being misinterpreted.

    Data are also presented for terms and files. Term data represents the occurrence of a term (or set of terms) in a given sub-group of the larger corpus. Files data represents the number of files in a given sub-group which contain an occurrence of the given term (or set of terms).

    See each description below.

  • Raw Frequency - Raw frequency is the actual count of terms (or files with terms) found in any sub-grouping of the Tobacco Documents Corpus or the corpus as a whole. Although informative when compared to other measures, raw frequencies by themselves can be deceptive because the sub-groups vary widely in total number of words and number of files. For example, a raw count of 10 terms in one group may represent 0.01% of the total words, but in a group with half the number of words, 10 represents 0.02% of the total (twice the value).
  • Proportion - 1) The count of a given set of terms per 100 total terms, or 2) the count of files/documents containing a given term per 100 total files/documents. Essentially, a percentage. Although informative when comparing the distribution of a specific term across sub-groups, proportions do not allow comparison between terms. This is because a given proportion may be deviant for one term but not for another.
  • Z-score - A z-score is a standardized method of indicating the extent to which a given score deviates from an expected value (usually a mean) and whether the deviation is large enough to be considered significant. That is, some deviation is always expected, but only when it becomes extreme do we take notice. In general, a z-score is calculated by dividing the difference between the score and the mean by the standard deviation. For our data we use a proportions test to calculate z-scores. See below.

    In the humanities and social sciences, the general convention is that a score is significant only if it is so deviant that it is expected to happen in fewer than 1 in 20 trials, that is, less than 5% of the time (p < 0.05). For the z-scores we present, this is represented by scores outside the range -1.96 to 1.96 (represented by a gray area on our displays). So, if you see a score inside this range, 1.35 or -0.69 for example, these are normal deviations, nothing to get excited about. However, a score outside this range indicates something unexpected or abnormal. This is where you want to look.

    Z-scores have the advantage of allowing comparison between sets of terms; however, they should not be interpreted apart from the other measures. The reason for this is that how much scores deviate from a mean gives no indication of how prevalent they are (or even if they exist).

  • Proportions Test - The primary question we ask about our data is the following: Given a set of terms, does the proportion of the given terms to the total terms in a sub-group of the corpus differ significantly from the proportion found in the corpus as a whole? That is, is the distribution different somewhere, and is it so different that we should care?

    To answer this we use a standard proportions test (for two proportions). The norm or expected proportion is the count of the given terms in the corpus to the total number of terms, and the experimental value is the proportion found in the sub-group of the corpus. The end result is a z-score which allows comparison to other term data. All of this is thoroughly described in L. Davis' book Statistics in Dialectology (1990) as well as Moore and McCabe's Introduction to the Practice of Statistics (1999). Please see these texts for additional information.

    The advantage of this type of test compared to a more standard means test is that it addresses our question, the idea of proportions, more directly. The disadvantage is that it sums multiple cases (documents) together to derive a mean. This prevents one from knowing if the count is taken from a single document or if it is distributed evenly over the sum of documents. We deal with this problem by also examining the proportions of files/documents which contain occurrences of the terms. That is, we look not only at the number of terms, but also the distribution of terms in the file set. The strongest indication of deviation is when a term count has a z-score with an absolute value greater than 1.96 and has a file distribution z-score with an absolute value greater than 1.96 (assuming both scores are reliable).
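
    For readers who want to see the arithmetic, the following Python sketch shows the textbook two-proportion z-test with a combined (pooled) proportion. It is an assumption that this matches the exact computation behind our displays; the function names two_proportion_z and is_significant are hypothetical, and the counts in the note below are invented for illustration.

```python
import math


def two_proportion_z(sub_count, sub_total, corpus_count, corpus_total):
    """z-score of the sub-group proportion against the whole-corpus proportion."""
    p_sub = sub_count / sub_total            # experimental proportion
    p_corpus = corpus_count / corpus_total   # expected proportion (the norm)
    # Combined (pooled) proportion of the two groups being compared.
    pooled = (sub_count + corpus_count) / (sub_total + corpus_total)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / sub_total + 1 / corpus_total))
    if std_err == 0:
        raise ValueError("z-score is undefined when the term never or always occurs")
    return (p_sub - p_corpus) / std_err


def is_significant(z, critical=1.96):
    """True when z falls outside the range -critical to +critical."""
    return abs(z) > critical
```

    As an invented example, a term that occurs 30 times in a 10,000-word sub-group and 100 times in a 100,000-word corpus gives a z-score of roughly 5.5, well outside the range -1.96 to 1.96. The same function can be applied to file counts by passing the number of files containing the term and the total number of files.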

  • Reliability - As with most statistical tests, the proportions test we use has limits. Most notably, it is not reliable at the extremes, meaning when proportions approach 0.0 or 1.0. To check for this, Moore and McCabe in Introduction to the Practice of Statistics (1999) suggest that, given the combined proportion (p) of the two being compared and the total counts of terms in the corpus (n1) and in the sub-group (n2), then n1*p, n1*(1-p), n2*p, and n2*(1-p) should all be greater than 5. We call this the extremes test.

    In terms of reliability, each data point is marked with ok or low. An ok rating means that the score passes the extremes test for both term count and file count. A low rating means that it does not for one or both.
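
    A minimal Python sketch of the extremes test and the ok/low rating follows. It uses the usual textbook form of the rule, in which the expected counts n*p and n*(1-p) for both the corpus and the sub-group must exceed 5. The function names passes_extremes_test and reliability are hypothetical, and the sketch is an illustration rather than the code behind our displays.

```python
def passes_extremes_test(corpus_count, corpus_total, sub_count, sub_total, minimum=5):
    """Check that the proportions are not too close to 0.0 or 1.0."""
    # Combined proportion (p) of the two groups being compared.
    p = (corpus_count + sub_count) / (corpus_total + sub_total)
    expected = [
        corpus_total * p,
        corpus_total * (1 - p),
        sub_total * p,
        sub_total * (1 - p),
    ]
    return all(value > minimum for value in expected)


def reliability(term_test_passed, file_test_passed):
    """Rate a data point 'ok' only if both term and file counts pass."""
    return "ok" if term_test_passed and file_test_passed else "low"
```

    For instance, a term found once in a 1,000-word sub-group and 100 times in a 100,000-word corpus fails the test, because the expected count in the sub-group is only about 1.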

  • Total - The total of either all words or collocations found in the corpus or sub-corpus. These counts are used in deriving proportion and z-scores. Because of the nature of the statistics used, words and collocations are never examined together.
  • Industry-Internal vs. Industry External Audiences - This function allows you to view the relative frequencies and z-scores of internal audience documents (documents intended only for those within the tobacco industry) versus external audience documents (documents intended for public use). (For further information, see also "Named Industry Internal Audience.")
  • Named Audiences vs. Unnamed Audiences - This function allows you to view the relative frequencies and z-scores of Named Audience documents (documents with specific addressees) versus Unnamed Audience documents (documents without a specific addressee). (For further information, see also "Named Industry Internal Audience.")
  • Half Decades Grouping - Selecting this function will allow you to view documents in groups of 5 year increments. Documents will be grouped 1970-1975, 1975-1980, etc.

  • Shifted Decades Grouping - Selecting this function will allow you to view documents in groups where the decades are "shifted." This means that instead of viewing documents by decade (1980-1989), you will view documents shifted 5 years (1985-1994, 1955-1964, etc.)
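
    The two date groupings can be sketched in Python as below. The function names half_decade and shifted_decade are hypothetical. The shifted decades follow the 1985-1994 pattern given above; for half decades, the sketch assumes non-overlapping five-year groups that start on years ending in 0 or 5, which is an assumption about how boundary years such as 1975 are assigned.

```python
def half_decade(year):
    """Return (first_year, last_year) of the five-year group containing year."""
    start = year - year % 5
    return (start, start + 4)


def shifted_decade(year):
    """Return (first_year, last_year) of the shifted decade containing year.

    Shifted decades begin on years ending in 5, for example 1985-1994.
    """
    start = (year - 5) // 10 * 10 + 5
    return (start, start + 9)
```

    Under these assumptions, a document dated 1984 falls in the half decade 1980-1984 and in the shifted decade 1975-1984.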

  • Keyness - Keyness is an index of whether or not a word has a higher or lower frequency of occurrence in a document than one would expect. Frequency within the document is compared to frequency within the corpus overall. A positive keyness score means that the occurrence within the document is more frequent than would be expected, based on the overall corpus.
  • eXtensible Markup Language (XML) - A markup protocol and file type used by TDC employees to capture the physical and rhetorical structure of the tobacco documents. It is viewable in any text editor or web browser.
  • eXtensible Stylesheet Language Transformations (XSLT) - A programming protocol used to transform XML documents into other forms.
  • Portable Document Format (PDF) - Our "original" document images are stored as PDF files. This allows you to see what the actual document looked like. You will need Adobe Reader, which can be downloaded free at www.adobe.com.

NIH-NCI Tobacco-Documents Project at the University of Georgia (Grant # 1 RO1 CA87490-01). Please send all comments and suggestions to tobacco@uga.edu.