Cross-corpus search and download of recordings from the BAS CLARIN repository

 

Searching across several corpora is usually only possible by downloading each corpus, normalizing its structure, and importing it into a query tool. For some resources, downloading whole corpora and combining them may not be possible due to licensing restrictions.

Large collections of speech recordings and annotations contain sub-corpora that are especially relevant in particular research contexts, so access to these sub-corpora is of special interest. Within CLARIN, many data sets are available for academic research; access requires authentication as a member of the academic community. Users can define criteria that the collected data should meet, so they do not have to look through the complete data set, which contains many files that are irrelevant to their question.

Especially relevant for

  • Humanities scholars interested in empirical speech data
  • Developers in speech technology

Starting point:

We know that the BAS CLARIN repository allows cross-corpus searches and downloads.

Task:

Download all recordings of German dialogs in which at least one speaker is a native speaker of Russian.

Solution:

Find the repository, authenticate, run a cross-corpus search, and download the results for further investigation.
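
The selection criterion in the task can be written down as a small predicate, which makes the "at least one speaker" condition explicit. This is only a sketch: the `Session` record, its field names, and the example sessions are invented for illustration and do not reflect the repository's actual metadata schema or contents.

```python
from dataclasses import dataclass, field


@dataclass
class Session:
    """Hypothetical, simplified metadata record for one recording session."""
    name: str
    language: str
    conversation_type: str
    # One entry per speaker (assumed representation, not the repository's schema).
    mother_tongues: list[str] = field(default_factory=list)


def matches_task(session: Session) -> bool:
    """True for a German dialogue with at least one native speaker of Russian."""
    return (
        session.language == "German"
        and session.conversation_type == "dialogue"
        and any(tongue == "Russian" for tongue in session.mother_tongues)
    )


# Invented example records, not real repository data.
sessions = [
    Session("s1", "German", "dialogue", ["German", "Russian"]),
    Session("s2", "German", "dialogue", ["German", "German"]),
    Session("s3", "German", "monologue", ["Russian"]),
    Session("s4", "English", "dialogue", ["Russian", "English"]),
]

selected = [s.name for s in sessions if matches_task(s)]
print(selected)  # ['s1']
```

Only the first record satisfies all three conditions; the others fail on the speakers' mother tongues, the conversation type, or the language, respectively.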

Related CLARIN-D tools and services

Short guide on how to do cross-corpus searches

  1. Start Chrome or Firefox.
  2. Search Google for "BAS CLARIN repository".
  3. Follow the link "BAS CLARIN repository"
  4. The main landing page of the BAS CLARIN repository appears.
  5. To download data from the BAS CLARIN repository, you must authenticate yourself as an academic.
    • Click on the link 'Login via your institution', just below the CLARIN logo.
    • A selection page of European academic institutions appears.
    • Find your home institution or, if you do not have an AAI account at your home institution, select 'clarin.eu website account'.
    • A login page of your home institution should appear.
    • Log in with your university account.
    • The BAS CLARIN repository page should appear again.
  6. In the line below the BAS logo, you should read 'You are authentified to have full access to the BAS repository'.
  7. Click on 'Search' in the left menu.
  8. A search mask of the BAS CLARIN repository appears
  9. In the category 'Language' select 'German'
  10. In the category 'Conversation Type' select 'dialogue'
  11. In the category 'Actor's mother tongue' select 'Russian'
  12. Un-mark the radio button 'Exact match'
  13. Click on 'Submit'
  14. A list of recordings that fit your selection appears; each is summarized by its most prominent metadata. Click on the name link of an individual recording session to see more metadata and the links to the signals and annotations it contains.
  15. Scroll down to the 'Download' section. If it is not already filled in, enter a valid email address to which the repository can send the download information; acknowledge the terms of usage, select 'annotation files only' (otherwise the package is quite large: 4.6 GB), and click on 'create and download .tar archive'.
  16. After a few seconds the repository should acknowledge the download request with the message 'An email containing the download link will be sent to: (your email address)'. The download package is composed in the background, and the download link will be sent to you as soon as the package is available. The email looks something like:
    "The requested tar archive has been created on 2015-12-14T14:54:13.000Z.
    Please follow this download link:
    [....]
    The archive will be available for 24 hours from now on."
  17. Click on the link or copy and paste the URL [....] into a web browser. The downloaded *.tgz archive should contain a subdirectory with the same name as the speech corpus. This subdirectory contains the documentation and a separate subdirectory for every recording session.
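
Once the archive has been downloaded, its layout can be checked without unpacking it. The following is a minimal sketch using Python's standard `tarfile` module; it assumes the layout described in the last step (one top-level directory per corpus, with one subdirectory per recording session), and the function name `list_sessions` is made up for this example.

```python
import tarfile
from pathlib import PurePosixPath


def list_sessions(archive_path: str) -> dict[str, list[str]]:
    """Map each top-level corpus directory to the names of its subdirectories."""
    sessions: dict[str, set[str]] = {}
    # Mode "r:*" lets tarfile detect the compression (plain .tar, .tgz, ...).
    with tarfile.open(archive_path, "r:*") as archive:
        for member in archive.getmembers():
            parts = PurePosixPath(member.name).parts
            # Assumed layout: <corpus>/<session>/<file>. Files directly below
            # <corpus> (e.g. documentation files) are skipped; a documentation
            # subdirectory, if one exists, would be listed like a session.
            if len(parts) >= 3 or (len(parts) == 2 and member.isdir()):
                sessions.setdefault(parts[0], set()).add(parts[1])
    return {corpus: sorted(names) for corpus, names in sessions.items()}
```

Calling `list_sessions("download.tgz")`, where `download.tgz` stands for whatever file name the browser saved, returns a dictionary from corpus name to a sorted list of session directory names. This gives a quick check that the package contains the sessions shown in the search results.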