Word-level comparative text analysis

Many research questions in the humanities that relate to specific text resources can be reduced to an analysis of vocabulary. Comparing vocabularies is of particular interest: this may mean comparing two of your own text resources, or comparing one text resource with a reference corpus. CLARIN makes such comparative analyses easy to perform with the resources and web tools it provides. The following guide demonstrates this with a simple example. The showcase covers the discovery and selection of resources, their processing, and finally their analysis. The aim is to show scholars how to answer their own research questions with the help of comparative text analysis within CLARIN.

Especially interesting for

All scholars who compare texts or vocabulary, including:

  • Scholars from the historical sciences
  • Scholars from the political sciences
  • Scholars from all philologies

Requirements:

At least two texts are available.

Aim:

To compare the vocabulary used in the texts and identify fundamental differences.

Solution:

Using the CLARIN-D infrastructure, a comparative analysis of the vocabulary of the texts can easily be conducted.

Related CLARIN-D projects:

A Short Guide to comparative analysis of vocabulary

Search for Text Resources and Processing:

Select text resources for analysis

  1. Entry point for resource search: VLO, CLARIN's search engine for language resources.

  2. Search for: "English Newspaper" in the search field
  3. Refinement: as "Resource Type" select "Written Corpus".

    VLO search

  4. Example selection: English Newspaper corpus from 2012 containing 3 million sentences

    VLO results

Thematic restriction of the resource

  1. Browse the content of the text resource: Click on "Plain text search via Federated Content Search".

  2. Example search: "Europe"
  3. Example selection: Display 250 hits

    FCS-search

Processing of the text resource

  1. Processing of the output by applying WebLicht to the search results:

    1. Click on "View"
    2. Click on "Use WebLicht"
    3. Click on "Send To WebLicht"

     

  2. Log in to WebLicht
    • A list of European academic research institutions appears.
    • Search for your research institution. If you have no AAI-enabled account at your own institution, select 'clarin.eu website account'.
    • You will see the login page of your research institution.
    • Log in with your credentials; this is usually your university account.
    • You will see the WebLicht interface.
  3. Check the data to be used: In the "Upload" section you'll see the file name of the data provided. Click "OK"
  4. Analysis of the vocabulary of the texts by double-clicking the Web Services:
    • Tokenization: IMS:Tokenizer (Stuttgart)
    • POS-Tagging: SfS: POS Tagger - OpenNLP (Tübingen)
    • Click on "Run Tools"
    • Save the results via "Save Result":
      • Go to the last web service in the list at the bottom
      • You will see four icons below the line on the right side
      • Click on the arrow pointing down to download the result

    WebLicht processing chain
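
To check what the downloaded result contains before uploading it elsewhere, the token and part-of-speech layers can be read with a few lines of Python. The following is a minimal sketch, not part of WebLicht itself. It assumes the file is TCF 0.4 XML using the namespace http://www.dspin.de/data/textcorpus, with "token" elements carrying an "ID" attribute and "tag" elements referring to them through "tokenIDs"; verify these names against your own file. The embedded sample document and the function name read_tokens_with_pos are invented for illustration.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Assumed namespace of the TCF TextCorpus layer; compare with your own file.
TC_NS = {"tc": "http://www.dspin.de/data/textcorpus"}

# Hand-written stand-in for a WebLicht result (illustrative data only).
SAMPLE_TCF = """<D-Spin xmlns="http://www.dspin.de/data" version="0.4">
  <TextCorpus xmlns="http://www.dspin.de/data/textcorpus" lang="en">
    <tokens>
      <token ID="t1">Europe</token>
      <token ID="t2">faces</token>
      <token ID="t3">debt</token>
      <token ID="t4">crisis</token>
    </tokens>
    <POStags tagset="penn">
      <tag tokenIDs="t1">NNP</tag>
      <tag tokenIDs="t2">VBZ</tag>
      <tag tokenIDs="t3">NN</tag>
      <tag tokenIDs="t4">NN</tag>
    </POStags>
  </TextCorpus>
</D-Spin>"""


def read_tokens_with_pos(tcf_xml):
    """Return (word, POS tag) pairs in the order of the POS tag layer."""
    root = ET.fromstring(tcf_xml)
    # Map token IDs to their surface forms.
    tokens = {t.get("ID"): t.text for t in root.iterfind(".//tc:token", TC_NS)}
    pairs = []
    for tag in root.iterfind(".//tc:POStags/tc:tag", TC_NS):
        # A tag may refer to one or more token IDs, separated by spaces.
        for token_id in tag.get("tokenIDs", "").split():
            pairs.append((tokens[token_id], tag.text))
    return pairs


pairs = read_tokens_with_pos(SAMPLE_TCF)
tag_counts = Counter(pos for _, pos in pairs)
print(pairs)
print(tag_counts.most_common())
```

For a real result, read the downloaded file into a string (or parse it with ET.parse) instead of using SAMPLE_TCF.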

Comparative analysis

The actual analysis of the data is conducted with the help of the web application CorpusDiff. The application allows for the comparison of the vocabulary of two or more text resources.

Import the resource

  1. Click on: "Upload Own Corpus"

  2. Load the file that was previously created by WebLicht.
    • Click on "Select a file on your computer or drop it here"
    • Select the previously created file for upload
    • Alternative: Depending on the browser and operating system, you can drag the file from your file manager into this area
    • Click on "Upload"
  3. Make sure that "File Type" is set to TCF (Text Corpus Format), the output format of WebLicht.
  4. If you want to skip the preprocessing and use a sample file for the comparative analysis, you can use this file

    Upload of a file

Conduct your own analysis

  1. "Configuration"

    • Select the imported file
    • Choose a reference corpus (for example, a news corpus or a Wikipedia corpus of the same year)
    • Optionally, additional or different corpora can be selected; all selected corpora are then compared pairwise.
  2. Enter a "job title" for your analysis
  3. Press the "Compute" button.

    Configuration
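
To make the idea of a pairwise comparison concrete, here is a small sketch in Python. The guide does not state which similarity measure CorpusDiff applies, so the sketch assumes cosine similarity over raw word frequencies, which for non-negative counts always lies between 0 and 1. Whitespace tokenization, the function names, and the three miniature corpora are likewise assumptions made for illustration.

```python
import math
from collections import Counter
from itertools import combinations


def cosine_similarity(freq_a, freq_b):
    """Cosine of the angle between two word-frequency vectors."""
    shared = set(freq_a) & set(freq_b)
    dot = sum(freq_a[w] * freq_b[w] for w in shared)
    norm_a = math.sqrt(sum(c * c for c in freq_a.values()))
    norm_b = math.sqrt(sum(c * c for c in freq_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty corpus shares nothing with any other corpus
    return dot / (norm_a * norm_b)


def pairwise_similarities(corpora):
    """Score every unordered pair of corpora exactly once."""
    # Assumption: lowercase the text and split on whitespace.
    freqs = {name: Counter(text.lower().split()) for name, text in corpora.items()}
    return {
        (a, b): cosine_similarity(freqs[a], freqs[b])
        for a, b in combinations(sorted(freqs), 2)
    }


# Invented miniature corpora standing in for real text resources.
corpora = {
    "news": "europe debt crisis europe fears",
    "news_copy": "europe debt crisis europe fears",
    "wiki": "river mountain lake",
}
similarities = pairwise_similarities(corpora)
for pair, value in similarities.items():
    print(pair, round(value, 3))
```

With n selected corpora, this produces n * (n - 1) / 2 values, one for each cell above the diagonal of a similarity matrix.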

Evaluation

  1. Under "Job Selection" click on the completed analysis

  2. The matrix shows the pairwise similarities between text resources with values between 0 (dissimilar corpora) and 1 (identical corpora)

    Result of the comparison

  3. Click a field of the matrix to see further results that describe the different uses of vocabulary in the texts.

    • Lists of words that occur (relatively) much more frequently in one of the two text resources
    • Lists of vocabulary that occurs in only one of the texts
    • Restrict the displayed results to individual parts of speech, for example nouns or proper nouns
    • Example: When comparing the Europe-related texts with the contents of Wikipedia, words such as crisis, debt or fears are prominent, which reveals the thematic focus of the text resource generated earlier.

    table with results

  4. At this point, other analyses are also possible. One possibility would be to compare texts by different authors, texts from different sources (news and Wikipedia), or corpora of individual years in order to identify typical vocabulary or topics.
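
The two kinds of word lists described in the evaluation can be approximated offline with a short Python sketch. It is a simplified illustration, not CorpusDiff's actual computation: it assumes that "much more frequently" means a relative-frequency ratio of at least min_ratio (a hypothetical threshold, set to 2.0 here), and the token lists are invented examples.

```python
from collections import Counter


def relative_frequencies(tokens):
    """Share of each word in the total token count."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}


def compare_vocabulary(tokens_a, tokens_b, min_ratio=2.0):
    """Find words exclusive to one text and words overused in text A."""
    rel_a = relative_frequencies(tokens_a)
    rel_b = relative_frequencies(tokens_b)
    only_a = sorted(set(rel_a) - set(rel_b))
    only_b = sorted(set(rel_b) - set(rel_a))
    # Words present in both texts whose relative frequency in A is
    # at least min_ratio times their relative frequency in B.
    overused_in_a = sorted(
        (w for w in rel_a if w in rel_b and rel_a[w] >= min_ratio * rel_b[w]),
        key=lambda w: rel_a[w] / rel_b[w],
        reverse=True,
    )
    return {"only_a": only_a, "only_b": only_b, "overused_in_a": overused_in_a}


# Invented token lists standing in for two processed text resources.
europe_texts = "europe europe europe crisis debt the".split()
reference = "the the the the europe article".split()

result = compare_vocabulary(europe_texts, reference)
print(result)
```

Restricting the comparison to one part of speech amounts to filtering the token lists by POS tag before calling compare_vocabulary.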