The English-language Wikipedia contains over 6 million articles, and the editions in all other languages combined contain over 28 billion words across 52 million articles in 309 languages. It’s an incomparably valuable resource for knowledge seekers, but one that requires constant pruning by its more than 132,000 registered active monthly editors.
In search of an autonomous solution, researchers at MIT developed an AI and machine learning system that addresses inconsistencies in Wikipedia articles. Thanks to a family of algorithms, it’s able to identify such errors and update the articles as needed, using the latest information from around the web to produce up-to-date sentences.
The algorithms in question were trained on a data set containing pairs of sentences, in which one sentence is a claim and the other is a relevant Wikipedia sentence. Each pair is labeled in one of three ways: “agree,” meaning the sentences contain matching factual information; “disagree,” meaning the two contain contradictory information; or “neutral,” where there’s not enough information for either label.
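As a rough illustration of what one record in such a data set might look like, here is a minimal Python sketch. The class and field names (`SentencePair`, `claim`, `wiki_sentence`) and the example sentences are invented for illustration; they are not taken from the researchers' data or code.

```python
from dataclasses import dataclass
from enum import Enum


class Label(Enum):
    AGREE = "agree"        # matching factual information
    DISAGREE = "disagree"  # contradictory information
    NEUTRAL = "neutral"    # not enough information for either label


@dataclass(frozen=True)
class SentencePair:
    claim: str
    wiki_sentence: str
    label: Label


# Toy, invented examples; not drawn from the actual training data.
pairs = [
    SentencePair("The bridge opened in 1932.",
                 "The bridge opened to traffic in 1932.", Label.AGREE),
    SentencePair("The bridge opened in 1935.",
                 "The bridge opened to traffic in 1932.", Label.DISAGREE),
    SentencePair("The bridge is painted red.",
                 "The bridge opened to traffic in 1932.", Label.NEUTRAL),
]


def count_by_label(pairs):
    """Return how many pairs carry each of the three labels."""
    counts = {label: 0 for label in Label}
    for pair in pairs:
        counts[pair.label] += 1
    return counts
```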
The system takes as input an outdated sentence from an article plus a “claim” sentence that contains updated and conflicting information. Two components do the heavy lifting. The first is a fact-checking classifier that’s pretrained to label each sentence pair in a data set as “agree,” “disagree,” or “neutral.” The second, a custom “neutrality masker” module, identifies which words in the outdated sentence contradict the claim and removes the minimal number of words required for the pair to be labeled as neutral, after which it creates a binary “mask” over the outdated sentence.
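The masking idea can be sketched in a few lines of Python. This is only a toy under stated assumptions: `toy_classifier` is a hypothetical stand-in that compares numeric tokens, whereas the real classifier is a pretrained model, and the brute-force search below is not the researchers' learned masker. It simply tries ever-larger sets of words to delete until the pair is labeled neutral.

```python
import re
from itertools import combinations


def toy_classifier(claim, sentence_tokens):
    """Hypothetical stand-in for the pretrained classifier.

    It only compares numbers: no numbers on either side means "neutral",
    identical numbers mean "agree", anything else means "disagree".
    """
    claim_nums = set(re.findall(r"\d+", claim))
    sent_nums = {n for tok in sentence_tokens for n in re.findall(r"\d+", tok)}
    if not claim_nums or not sent_nums:
        return "neutral"
    return "agree" if claim_nums == sent_nums else "disagree"


def neutrality_mask(claim, outdated, classify=toy_classifier):
    """Return a binary mask over the outdated sentence (1 = word removed).

    Tries removal sets of size 0, 1, 2, ... so the first hit removes the
    fewest words. Cost is exponential; fine only for short toy sentences.
    """
    tokens = outdated.split()
    for k in range(len(tokens) + 1):
        for removed in combinations(range(len(tokens)), k):
            kept = [t for i, t in enumerate(tokens) if i not in removed]
            if classify(claim, kept) == "neutral":
                return [1 if i in removed else 0 for i in range(len(tokens))]
    return [1] * len(tokens)  # fallback: mask everything


mask = neutrality_mask("The city has 250 residents.",
                       "The city has 200 residents.")
print(mask)  # [0, 0, 0, 1, 0]
```

Here only “200” is masked, because deleting that single word leaves nothing in the outdated sentence for the toy classifier to contradict.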
After masking, a framework of two encoder-decoders generates the final output sentence. The model learns compressed representations of the claim and the outdated sentence, and the two encoder-decoders, working in conjunction, then fuse the dissimilar words from the claim into the sentence by sliding them into the spots left vacant by the deleted words.
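To make the “slide words into the vacant spots” step concrete, here is a deliberately simple, rule-based Python sketch. It is not the learned encoder-decoder model described above: the hypothetical `fuse` function just fills each masked position with the next claim word that is missing from the kept part of the outdated sentence. The mask convention (1 = deleted word) is also an assumption.

```python
def fuse(claim, outdated, mask):
    """Fill masked slots in `outdated` with words that only the claim has.

    Rule-based stand-in for the learned fusion step; `mask` has one entry
    per whitespace-separated token of `outdated` (1 = deleted word).
    """
    tokens = outdated.split()
    kept = {t.strip(".,").lower() for t, m in zip(tokens, mask) if not m}
    # Claim words not already present in the surviving part of the sentence.
    novel = iter(w.strip(".,") for w in claim.split()
                 if w.strip(".,").lower() not in kept)
    out = []
    for token, masked in zip(tokens, mask):
        if masked:
            replacement = next(novel, None)
            if replacement is not None:  # drop the slot if nothing is left
                out.append(replacement)
        else:
            out.append(token)
    return " ".join(out)


updated = fuse("The city has 250 residents.",
               "The city has 200 residents.",
               [0, 0, 0, 1, 0])
print(updated)  # The city has 250 residents.
```

A real system has to handle word order, grammar, and punctuation, which is why the researchers rely on learned representations rather than a positional swap like this one.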
The researchers say the system can also be used to augment corpora to minimize bias when training fake news detectors. Some of these detectors train on data sets of sentence pairs to learn to verify a claim by matching it to given evidence. In these pairs, the claim either matches information in a supporting “evidence” sentence from Wikipedia or is modified to include information that contradicts the evidence sentence. The models are trained to flag claims with refuting evidence as false, which can help identify fake news.
In a test, the team used the deletion and fusion techniques from the Wikipedia task to balance pairs in a data set and help mitigate bias. For some pairs, a modified sentence’s false information was used to regenerate fake evidence supporting a sentence. Some of the key phrases then existed in both the agree and disagree sentences, which forced the models to analyze more features.
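The balancing effect can be illustrated with a small Python sketch. Everything in it is a toy assumption: the sentences are invented, and the choice of “not” as the giveaway word is for illustration only. It counts which labels co-occur with that word in the claim, before and after adding regenerated supporting evidence.

```python
from collections import Counter


def label_counts_for_word(pairs, word):
    """Count labels among pairs whose claim contains `word` as a token."""
    return Counter(label for claim, _evidence, label in pairs
                   if word in claim.lower().split())


# Invented (claim, evidence, label) triples; the word "not" always
# co-occurs with "disagree", so a model could learn that shortcut.
original = [
    ("The film did not win an award.",
     "The film won an award in 2001.", "disagree"),
    ("The river does not cross the border.",
     "The river crosses the border.", "disagree"),
]

# Hypothetical augmentation: evidence regenerated to support each claim.
augmented = original + [
    ("The film did not win an award.",
     "The film won no awards.", "agree"),
    ("The river does not cross the border.",
     "The river stays within one country.", "agree"),
]

before = label_counts_for_word(original, "not")
after = label_counts_for_word(augmented, "not")
print(before)
print(after)
```

After augmentation the word appears equally often under both labels, so it no longer predicts the label on its own and a detector has to look at the evidence.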
The researchers report that their augmented data set reduced the error rate of a popular fake news detector by 13%. They also say that in the Wikipedia experiment, the system was more accurate in making factual updates and its output more closely resembled human writing.
Author: Kyle Wiggers.