Summarizing historical text can help people gather, organize, and share knowledge, but cultural and linguistic changes and the sheer volume of archives can make interpreting historical text challenging even for experts. Researchers at the University of Sheffield, Beihang University, and the Open University in the U.K. recently attempted to tackle this problem using AI and machine learning techniques. They say their approach, which can summarize historical documents written in German and Chinese, provides a strong baseline for future studies.
The researchers chose to focus on German and Chinese for their “rich textual heritages” and “accessible” resources for historical and modern forms. Both languages serve as “outstanding” representatives of two distinct writing systems (German for alphabetic and Chinese for ideographic), and investigating them could lead to generalizable insights for a wide range of other languages, according to the researchers. Moreover, linguistic experts in both languages are abundant, making it easy to obtain modern-language summaries of German and Chinese text for evaluating machine learning summarization systems.
To build a historical German language training dataset, the researchers picked newspapers from the years 1650 to 1800, randomly selecting 100 out of 383 available stories for annotation. And for Chinese, they chose a collection of stories from the Wanli period of the Ming Dynasty, searching over 200 related academic papers and retrieving 100 news texts. To generate summaries in the modern languages for the historical stories, the coauthors recruited two experts with degrees in Germanistik and Ancient Chinese Literature, respectively. They produced a corpus of 100 news stories and summaries in each language that were then examined by six other experts for quality control.
The researchers note that they only had summarization training data for modern German and Chinese and very limited corpora for historical forms of the languages. To get around these limitations, they used a transfer learning-based approach they say could be bootstrapped even without cross-lingual training — i.e., training across historical and modern forms of the languages.
“Historical text summarization posits some unique challenges … Historical texts cannot be handled by traditional cross-lingual summarizers, which require cross-lingual [training] or at least large summarization datasets in both languages,” the researchers wrote. “Further, language use evolves over time, including vocabulary and word spellings and meanings, and historical collections can span hundreds of years. Writing styles also change over time. For instance, while it is common for today’s news stories to present important information in the first few sentences, a pattern exploited by modern news summarizers, this was not the norm in older times.”
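To illustrate the "first few sentences" pattern the researchers mention, here is a minimal sketch of a lead-k extractive baseline, which simply returns the opening sentences of a story. It is not the researchers' method; the function name `lead_k_summary`, the default `k`, and the naive punctuation-based sentence splitter are assumptions made for illustration.

```python
import re


def lead_k_summary(text: str, k: int = 3) -> str:
    """Return the first k sentences of `text` (a lead-k extractive baseline)."""
    # Naive split after Latin or Chinese sentence-ending punctuation.
    # Assumption: this is good enough for a demo; it will mis-split
    # abbreviations and decimals such as "3.5".
    pieces = re.split(r"(?<=[.!?。！？])\s*", text)
    sentences = [p.strip() for p in pieces if p.strip()]
    return " ".join(sentences[:k])


if __name__ == "__main__":
    story = "The fleet arrived. Prices rose sharply! Who would pay? Nobody knew."
    print(lead_k_summary(story, k=2))
```

A baseline like this works on modern news precisely because the key facts come first; on older texts that do not follow that convention, the opening sentences may miss the main point.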
The researchers say that in their experiments, both automatic and human evaluations showed their method outperforming state-of-the-art baselines. In the future, they plan to extend their models to further languages and to enlarge the training dataset used for each language.
“This paper introduced the new task of summarizing historical documents in modern languages, a previously unexplored but important application of cross-lingual summarization that can support historians and digital humanities researchers,” the researchers wrote. “This paper is the first study of automated historical text summarization.”
Author: Kyle Wiggers
Source: VentureBeat