Shape Corpus Workshop 2017

This workshop addresses processes involved in creating usable forms of text from early sources. Examples include: making structured dictionaries out of paper-based books; conversion of digital files into usable formats; textual versions of early manuscript records; useful ways to organise and transc…

CoEDL Australia


    • May 8, 2017 LATEST EPISODE
    • infrequent NEW EPISODES
    • 19m AVG DURATION
    • 10 EPISODES


    Latest episodes from Shape Corpus Workshop 2017

    The dictionary of the Tahitian Academy: from the Word file to the digital database

    May 8, 2017; duration 19:52


    The Tahitian-French dictionary of the Tahitian Academy, published in hardcopy in 1999, compiles two centuries of study of Tahitian vocabulary, from the first lexicographical works of the missionary John Davies up to contemporary word formation, with more than 12,000 lemmas. Until recently, the source files of the paper dictionary were MS Word documents. In 2013, with the help of Nick Thieberger, we had the opportunity to convert the Word files to a text file with MDF field markers (Coward & Grimes, 2000). The original files were automatically converted to a structured format based on the formatting of the original work (using the online service OxGarage), and this file was then converted to the structured format provided by MDF. Not all formatting was consistent, so some content ended up in the wrong fields. The list was then imported into an online PostgreSQL database by Hugues Talfer, who added ad hoc functions to facilitate the human correction process. We also needed to handle mismatches between the structure of some entries in the Tahitian dictionary and the standard MDF structure, and had to create new fields. The work is still in progress, but the first outcomes are already visible on the new Tahitian Academy website (http://www.farevanaa.pf/v2/dictionnaire.php). It offers the possibility of searching Tahitian to French and French to Tahitian thanks to the reverse entry field (re) provided by the database. This was not previously possible. Such a project has immediate social and pedagogical value for the whole Tahitian community and for lexicographical research on Tahitian.

    Académie tahitienne. (1999). Dictionnaire tahitien-français. Papeete: STP Multipress.
    Coward, D., & Grimes, C. (2000). Making Dictionaries. A guide to lexicography and the Multi-Dictionary Formatter. Waxhaw: SIL International.
    Davies, J. (1851). A Tahitian and English Dictionary. Tahiti: London Missionary Society's Press.
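
    The reverse lookup described in this abstract can be illustrated with a minimal sketch. It is not the Academy's implementation: it assumes a plain-text file in which every field starts with a backslash marker, each entry begins with \lx, and reversal keys sit in \re. The function names and the two sample entries are invented for illustration and are not taken from the dictionary files.

```python
from collections import defaultdict

# Illustrative sample only: two hand-written entries, not the Academy's data.
SAMPLE = r"""
\lx fare
\ps n
\ge maison
\re maison

\lx vahine
\ps n
\ge femme
\re femme
"""

def parse_mdf(text):
    """Split backslash-marked text into entries; each \\lx line starts a new entry."""
    entries = []
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line.startswith("\\"):
            continue  # this sketch ignores blank lines and unmarked continuation lines
        marker, _, value = line[1:].partition(" ")
        if marker == "lx":
            current = defaultdict(list)
            entries.append(current)
        if current is not None and value.strip():
            current[marker].append(value.strip())
    return entries

def reverse_index(entries, field="re"):
    """Map each reversal key (assumed to be in `field`) to the headwords that carry it."""
    index = defaultdict(list)
    for entry in entries:
        headword = entry.get("lx", [""])[0]
        if not headword:
            continue  # skip entries that lost their headword during conversion
        for key in entry.get(field, []):
            index[key.lower()].append(headword)
    return index

entries = parse_mdf(SAMPLE)
french_to_tahitian = reverse_index(entries)
```

    Keeping every field as a list means a repeated marker within one entry is preserved rather than overwritten, which helps when inconsistent source formatting has pushed content into unexpected fields and a human needs to review it.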

    Making the signs fit: From archive to ELAN and beyond

    May 8, 2017; duration 20:14


    Adam Kendon’s in-depth analysis of Australian Indigenous sign languages remains the most broad-reaching to date (Kendon, 1988), even as steps are being taken to build on the foundations he laid (Adone & Maypilama, 2013; Carew & Green, 2015; Green & Wilkins, 2014). Kendon called these sign languages ‘alternate’, as they are not generally the primary mode of communication but rather are used instead of speech in particular cultural circumstances. Kendon’s fieldwork in the late 1970s and the 1980s in Central Australia generated valuable records of sign used in Warlpiri, Kaytetye, Warumungu, Warlmanpa, Jingulu, Mudburra and Anmatyerr speaking communities. The original 16 mm film and VHS video recordings, housed at AIATSIS, comprise more than 50 hours of archival material. The collection includes metadata with various fields, including spoken language, semantic domain, language sign glosses with English translations, and a phonetic transcription in a unique font that Kendon devised especially for the purpose. There is also a time-code that points to locations in the film media. I discuss some of the steps that can be taken to get the most out of this metadata, link it to the media it refers to, and make this unique collection searchable. This is a first step in forming a comparative corpus of Indigenous sign that combines old and new sources. The format and structure of archival deposits and their delivery to users lead to some steps forward…and some backwards. The lessons learnt also have implications for the ways that we structure our contemporary archival collections. The presentation will end with some suggestions for further uses of this material and a bid for collaboration.

    Adone, D., & Maypilama, E. (2013). A Grammar Sketch of Yolŋu Sign Language. Darwin: Charles Darwin University.
    Carew, M., & Green, J. (2015). Making an online dictionary for Central Australian sign languages. Learning Communities - International Journal of Learning in Social Contexts. Special Issue: Indigenous Sign Languages, 16, 40–55.
    Green, J., & Wilkins, D. P. (2014). With or Without Speech: Arandic Sign Language from Central Australia. Australian Journal of Linguistics, 34(2), 234–261. https://doi.org/10.1080/07268602.2014.887407
    Kendon, A. (1988). Sign languages of Aboriginal Australia: Cultural, semiotic and communicative perspectives. Cambridge University Press.
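
    One way to link a metadata time-code to a position in the digitised media is to convert it to milliseconds, the unit ELAN uses for annotation times. The sketch below is hypothetical: the abstract does not describe the format of Kendon's time-codes, so the HH:MM:SS:FF layout, the 25 frames-per-second default, the field names and the placeholder glosses are all assumptions.

```python
def timecode_to_ms(timecode: str, fps: int = 25) -> int:
    """Convert an assumed HH:MM:SS:FF time-code to milliseconds from the start of the media."""
    hours, minutes, seconds, frames = (int(part) for part in timecode.split(":"))
    if not 0 <= frames < fps:
        raise ValueError(f"frame count {frames} is out of range for {fps} fps")
    whole_seconds = hours * 3600 + minutes * 60 + seconds
    return whole_seconds * 1000 + round(frames * 1000 / fps)

def link_rows(rows, fps: int = 25):
    """Add a start time in milliseconds to each metadata row and sort the rows by it."""
    linked = [{**row, "start_ms": timecode_to_ms(row["timecode"], fps)} for row in rows]
    return sorted(linked, key=lambda row: row["start_ms"])

# Placeholder rows: the glosses are dummy labels, not data from the collection.
rows = [
    {"gloss": "SIGN-B", "timecode": "00:01:30:12"},
    {"gloss": "SIGN-A", "timecode": "00:00:05:00"},
]
linked = link_rows(rows)
```

    Sorted rows with a start time can then be written out as time-aligned annotations, which is what makes the glosses searchable against the video.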

    Ken’s Kaytetye

    May 8, 2017; duration 19:33


    In 1959 Ken Hale made what we believe were the first Kaytetye audio recordings. He recorded six speakers, eliciting words, phrases and one story. Since then the language has changed significantly and the community would like to access the recordings and the beloved narrative. We chose a web-based interface to present the collection, consisting of:

    • Biography and photo of speakers
    • Audio
    • 1959 fieldnotes
    • Written text in standard orthography
    • Morphological and word gloss
    • Free translations

    We discuss the issues encountered, current solutions and their relative successes. For example, in consultation with Kaytetye collaborator Alison Ross, we found some subject matter was inappropriate for public display, so we developed a way to 'hide' such material that has so far been unproblematic. Obtaining formal permissions from institutions was also unproblematic. In terms of workflow we chose FLEX to create consistent glossing; however, this has been time-consuming and requires complex set-up, import and export. We also identified a bug in FLEX that caused timecode information to be lost. Our master copy is in ELAN, so every time we make a change in FLEX we have to go back and make the same change there. Another issue was the online interlinear audio and text presentation. Trials with a fully transcribed story showed that the system wasn't ideal for our situation, so we built a custom system that could deal with the particular attributes of the data: the audio file length and the number of annotations.

    Engaging a crowd for a common goal … Aboriginal and Torres Strait Islander language transcription activities @ SLNSW

    May 8, 2017; duration 13:48


    The State Library of NSW (SLNSW) is dedicated to making its collections more accessible through digital experiences and initiatives. Handwritten documents are sometimes difficult to read, and text in digitised images is not easily searchable, which makes these invaluable documents virtually invisible. The production of transcripts of these original manuscripts vastly improves access to these historical documents for researchers, historians and members of the public. The Rediscovering Indigenous Languages project website features historic word lists, records and other documents relating to Indigenous Australian languages from the State Library’s collections. Making these records available for language revival purposes is of paramount importance to SLNSW, and the transcription of these unique documents is vital. This important transcription work falls on the shoulders of a crowd of dedicated volunteers. Crowdsourcing, or “engaging a crowd for a common goal”, is an inclusive, cost-effective method to share these unrivalled historical manuscripts with the world. This paper will discuss the why, how and who of the transcription initiatives within the SLNSW. It will also discuss some of the outputs that have been created from transcribed materials by curating datasets and making them available for experimentation.

    Biography: Melissa Jackson is of Bundjalung descent with family links to the Baryulgil area near Grafton, New South Wales. Melissa has worked in various NSW government departments, including the Department of Housing and the Attorney General's Department, before starting work at the State Library of New South Wales in 1991. She has a background in teaching, having graduated from the University of Western Sydney, and also holds librarianship qualifications from the University of Technology Sydney and a Master in Indigenous Language Education from the University of Sydney.

    Automatic alignment of mis-matched video and audio

    May 8, 2017; duration 13:36


    The Gurindji ACLA corpus (2003-2007) consists of 178 sessions (60 hr) of child-adult interaction which were simultaneously video- and audio-recorded, but not manually synchronised at the start of each recording (e.g. with a hand clap). The corpus was linked to the audio recordings rather than the video because of ethical restrictions on archiving. The video has great value for studies of sign and interaction, but it has not been used for studies of this kind because manually matching the video to the existing audio timestamps is prohibitively time-consuming. This demonstration will introduce a method of automatically correcting this difference, using a Python script, av_align.py. This script finds the most likely point at which the video and audio overlap, and extends the file with blank sound or image accordingly. The timestamps of an associated CLAN or ELAN file are also automatically adjusted if necessary.
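
    The idea of finding the most likely overlap point can be illustrated with a brute-force cross-correlation over two sample sequences. This is a sketch of the general technique only; it is not the code of av_align.py, whose internals the abstract does not describe. The function names, the max_lag parameter and the synthetic signal are assumptions made for illustration.

```python
import random

def best_lag(reference, other, max_lag):
    """Return the lag (in samples) at which `other` best matches `reference`.

    A positive lag means the same event occurs later in `other` than in `reference`.
    """
    def score(lag):
        total = 0.0
        for i, value in enumerate(reference):
            j = i + lag
            if 0 <= j < len(other):
                total += value * other[j]
        return total
    return max(range(-max_lag, max_lag + 1), key=score)

def pad_to_align(reference, other, lag):
    """Prepend silence (zeros) to whichever track starts later so that events line up."""
    if lag > 0:
        reference = [0.0] * lag + list(reference)
    elif lag < 0:
        other = [0.0] * (-lag) + list(other)
    return list(reference), list(other)

def shift_timestamps(times_ms, lag, sample_rate):
    """Shift annotation times that were linked to `reference` after it has been padded."""
    offset_ms = round(max(lag, 0) * 1000 / sample_rate)
    return [t + offset_ms for t in times_ms]

# Synthetic check: the second track is the first one delayed by 50 samples.
rng = random.Random(0)
audio = [rng.uniform(-1.0, 1.0) for _ in range(500)]
camera_audio = [0.0] * 50 + audio
lag = best_lag(audio, camera_audio, max_lag=100)
aligned_audio, aligned_camera = pad_to_align(audio, camera_audio, lag)
```

    The exhaustive search keeps the example dependency-free but scales poorly; for hour-long recordings an FFT-based correlation on downsampled audio would be the more practical choice.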

    Ngarda-ngarli thabi: building a database for a regional Aboriginal public song tradition of the Pilbara

    May 8, 2017; duration 20:43


    Thabi is a public genre of individually composed song indigenous to the west Pilbara region. The songs are sung by one or two singers without dance accompaniment, and are held by members of the Ngarluma, Yindjibarndi, Palyku, Martuthunira, Kurrama, Nyiyaparli, Banyjima, Yinhawangka, Kariyarra, Nyamal and Ngarla language groups, collectively referred to as Ngarda-ngarli. While performances thrived from at least the 1930s to the 1960s and were recorded up until the 1980s, today only a small number of elders hold knowledge of the songs and perform them. In this workshop we examine the process of building a song database that draws on legacy recordings of thabi, text sources and photos, for use by contemporary and future nyinirri (singers). In the first half, Ngarluma PhD student Andrew Dowding discusses how the project came about, starting over ten years ago when he first came across recordings of his maternal grandfather, the prolific thabi composer Robert Churnside, among hundreds of thabi songs in the AIATSIS archives, and how initial playback sessions conducted by Dowding, Treloyn and Brown with thabi custodians and elders informed the concept of an online digital archive to support revitalisation. In the second half, Jared Kuvent and Reuben Brown will discuss the workflow for digitising text sources, splitting audio files and consolidating audio metadata. They will also demonstrate the project’s database in FileMaker Pro, both as a rapid prototype for a song database and as a step toward future migration of the project’s data to public content management systems such as Mukurtu, for community use and ownership beyond the life of the research project.

    Historical sources for language revival in Victoria

    May 8, 2017; duration 18:02


    Colonisation hit languages in Victoria particularly harshly. Because of a number of factors, including the wealth of natural resources in this part of the continent, colonisers arrived early, in great numbers, and rapidly claimed most of the land, thereby pushing Aboriginal people off their traditional Countries, often into missions. As well as the unprecedented mixing of people from different language groups in the one mission, mission rules commonly forbade the practice of language and culture as primitive or heathen. In the present day, the most obvious and immediate consequence of this for language is that the work of language revival is heavily reliant on historical sources. While there are individual exceptions, in most language groups there is little community knowledge of language beyond a few dozen words, and perhaps a song or two, usually from mission times. In addition, because the loss of language was severe so early, there are almost no sound recordings of language until the intensive collections made by Luise Hercus in the 1960s. These recordings are very highly valued by communities, but it has to be said that all the speakers on those tapes were at least bilingual, and most spoke English as their primary language. The use of these historical sources is full of difficulties, however, and especially so when the researcher is not an academic. In the first place, the quality of the collections themselves is wildly variable, ranging from scribbled notes by people with little skill in hearing a language with sounds different to English, to considered collection of grammar patterns by people who had taken the time to get to know a group of people. It is not uncommon to find a single word spelt in 15-20 different ways. Meanings of words, too, are very much subject to the filter of English translations, so that kin terms, for example, are often squashed into the kinship system of a foreign culture.

    Even before we can work with this level of difficulty, there is a prior problem of accessibility. Limitations on access to historical sources include their physical location, which may be interstate or overseas, or in local institutions which nonetheless present considerable barriers to ease of use by many Aboriginal people. The sources themselves are often in poor condition, and the copies which are commonly available are even worse. Added to this is the fact that even good quality originals are of course in the handwriting conventions of the 19th century, which are not the same as now. Most sources are yet to be transcribed, and those transcriptions that have been done are, again, more readily accessible to academics than to community people. Given our reliance on the historical sources for language revival, one of the primary tasks of the Victorian Aboriginal Corporation for Languages (VACL) is to address these difficulties. We do this through:

    • training of community researchers to understand the interpretation of historical documents
    • linguistic support and co-research
    • provision of a community-friendly library which includes copies of many manuscript sources
    • transcription of manuscripts
    • collation of key materials for each language in a 'one stop shop' folder
    • a digitisation program targeted at making those folders available to specific communities in a password-protected online environment.

    Compilations and Copyright

    May 8, 2017; duration 30:31


    Working with Aboriginal and Torres Strait Islander language legacy materials may involve complex legal, ethical and policy issues. Some issues are more easily resolved than others and all involve risk. To appreciate this risk and make informed decisions, linguists, publishers, and the communities they work with must have some understanding of copyright law, contract law, ethical guidelines and current and future legal developments. This presentation will consider these factors in both theory and practice. This presentation is tailored to assist those working with legacy materials and planning to transform, update and adapt them for new publications and community outputs. Thomas Allen is a lawyer with more than ten years’ experience working with Aboriginal and Torres Strait Islander clients. He is currently the Rights Manager at the Australian Institute of Aboriginal and Torres Strait Islander Studies and works with historic and contemporary archival material.

    Creating a corpus from the Living Archive of Aboriginal Languages

    May 8, 2017; duration 16:45


    The Living Archive of Aboriginal Languages is a digital collection of endangered literature in Indigenous languages of the Northern Territory (www.cdu.edu.au/laal). The goal of the project is to collect, digitise, and preserve the vast set of written texts produced in NT schools and community activities since the 1970s, and to make them available to new audiences on the internet. The Archive now includes over 3000 items in PDF and text formats in 50 languages, with more items awaiting permission to be put online. Most of the materials in the collection are small booklets of around 10-25 pages, with anything from a single word to dense text on each page, and illustrations using line drawings, photos, and other styles. They include a range of genres, with stories of traditional and contemporary life, creation stories, cautionary tales, ethnobiology, instructional and literacy materials. In becoming the custodians of the digital versions of such materials, the project team faced many challenges, such as incomplete or inconsistent metadata, difficulties in OCR of minority languages to create text versions, and striving to create an online interface that works for different user audiences, from Indigenous users in remote communities to academic researchers to the general public. A lack of mark-up in the text files means the Archive is not very accommodating to corpus analysis. A key challenge has been negotiating the appropriate permissions to make these materials available online, navigating both Australian copyright law and Indigenous cultural and intellectual property perspectives, and their inherent incommensurability.

    Setting up the Howitt-Fison Archival Corpus

    May 8, 2017; duration 26:43


    The ARC Linkage project LP160100192, Howitt and Fison’s anthropology, has just got underway, led by Helen Gardner of Deakin University, with the participation of four universities and four partner institutions: Museum Victoria, State Library of Victoria, Native Title Services Victoria and the Victorian Aboriginal Corporation for Languages. It aims to analyse nineteenth century anthropologists Lorimer Fison and A.W. Howitt’s accounts of Indigenous kinship, social organisation, and local languages. This project will assemble Fison and Howitt’s and their correspondents’ records into best-practice digital formats, with widely accessible interactive data presentation, and bring these extraordinary records to the broadest possible community. Currently the research team is in the process of planning how this will be done, and we would like to engage in dialogue with others deriving analyzable corpora from old written data (without any associated sound recordings). Some of the important issues are:

    - Transcription of the text of fieldnotes, letters etc., which includes words and longer passages in languages of South-East Australia which have not been well described, as well as English. The spelling of the Aboriginal languages is not regular, but can be deciphered with careful study. We have discussed the possibility of automatic transcription with an expert-system approach.
    - What is the optimum system for entering the transcription? A corpus approach might lead us to a multi-tiered tool, e.g. FLEX.
    - We also need to deal with other formats, such as genealogies and various kinds of diagram, which are important and interact with text in this corpus.
    - Decoding of handwriting, which complicates the possibility of automatic transcription.
    - Because of the latter, and in order to draw a wider interested audience into the materials, we want to use crowdsourcing for at least some of the transcription, so we cannot make the task of transcription too onerous or technical.
