Computation and Language preprint round-up
Our AI summarises and provides extracts from the latest Computation and Language preprints published on arXiv
The individual reports below - including each headline - were generated automatically by our machine-reading software from an RSS feed of cs.CL preprints posted to arXiv on 18 March 2021.
Two strong black-box adversarial attacks for multilingual models that push their ability to handle code-mixed sentences to the limit.
In ‘Code-Mixing on Sesame Street’, Samson Tan and Shafiq Joty (2021) noted that the past year has seen incredible breakthroughs in cross-lingual generalization with the advent of massive multilingual models. They show that training on code-mixed data synthesized via word alignment improves both clean and robust accuracy. The adversarial training method could be used to improve machine understanding of code-mixers by making multilingual representations more language-invariant. Increasing the number of embedded languages POLYGLOSS can draw upon results in greater drops in model performance. Of the models attacked, XLM-R and Unicoder were trained on monolingual CommonCrawl data, while mBERT was trained on multilingual Wikipedia.
The researchers claim their findings reinforce previous studies in this area: “If the models did not rely on lexical overlap but performed comparisons at the semantic level, such perturbations should not have severely impacted their performance. Our results on QA also corroborate Lee et al.'s finding that models trained on SQuAD-style datasets exploit lexical overlap,” Tan said. Discussing potential shortcomings, they acknowledge: “We acknowledge that our methods do not fully model real code-mixing. It is impossible to guarantee the semantic preservation of a sentence generated by BUMBLEBEE due to the word aligner's statistical nature.” Tan and Joty note that while CAT improves robustness, there remains a significant gap between the robust and clean accuracies; future work lies in investigating how the choice of matrix language affects model robustness.
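The attacks work at the word level, swapping tokens for translations drawn from embedded languages and keeping the substitutions that hurt the model most. Below is a minimal illustrative sketch of that substitution loop; the translation dictionary and the score() classifier are toy stand-ins, not the authors' actual attack components:

```python
# Illustrative sketch of a POLYGLOSS-style word-level code-mixing attack.
# TRANSLATIONS and score() are hypothetical stand-ins for the authors'
# bilingual dictionaries and the victim multilingual classifier.

# Hypothetical word-level translations into embedded languages (es, fr).
TRANSLATIONS = {
    "movie": ["película", "film"],
    "great": ["genial", "formidable"],
    "was": ["era", "était"],
}

def score(tokens):
    """Stand-in for a multilingual classifier's confidence in the gold label."""
    # Penalise non-English tokens so the demo behaves deterministically.
    return 1.0 - 0.2 * sum(t not in ("the", "movie", "was", "great") for t in tokens)

def polygloss_attack(tokens):
    """Greedily replace one word at a time with an embedded-language
    translation, keeping the substitution that hurts the model most."""
    best = list(tokens)
    for i, tok in enumerate(tokens):
        candidates = [best[:i] + [alt] + best[i + 1:]
                      for alt in TRANSLATIONS.get(tok, [])]
        if candidates:
            worst = min(candidates, key=score)
            if score(worst) < score(best):
                best = worst
    return best

print(polygloss_attack("the movie was great".split()))
# -> ['the', 'película', 'era', 'genial'], a code-mixed adversarial input
```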
Aspect category detection (ACD) is one of the challenging tasks in the Aspect-Based Sentiment Analysis (ABSA) problem, which is used to analyse the sentiment of online reviews.
A team led by Dang Van Thin of the Multimedia Communications Laboratory (MMLab) (2021) investigated the effectiveness of monolingual versus multilingual pre-trained BERT models for the Vietnamese aspect category detection problem. They observe that most of the models achieve higher performance than those trained only on the Vietnamese training set. For a low-resource language such as Vietnamese, there are few studies on the aspect category detection task. The researchers aim to use these datasets combined with the Vietnamese training dataset based on multilingual pre-trained BERT models. Empirical results demonstrate the effectiveness of PhoBERT compared with several models, including XLM-R and mBERT.
In this paper, they used two Vietnamese benchmark datasets and the SemEval-2016 task 5 datasets in the restaurant and hotel domains at the sentence level in their experiments. They discarded sentences that did not contain any aspect categories. Details of the multilingual BERT model used can be found at: https://github.com/google-research/bert/blob/master/multilingual.md.
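For readers who want to experiment, the sketch below shows one common way to set up a BERT-style model for multi-label aspect category detection with the Hugging Face transformers library; the aspect labels and example sentence are illustrative, and the paper's exact data and training setup are not reproduced:

```python
# A minimal sketch of multi-label aspect category detection with PhoBERT.
# The ASPECTS list is an illustrative stand-in for the paper's label set.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

ASPECTS = ["FOOD#QUALITY", "SERVICE#GENERAL", "PRICE"]  # illustrative labels

tokenizer = AutoTokenizer.from_pretrained("vinai/phobert-base")
model = AutoModelForSequenceClassification.from_pretrained(
    "vinai/phobert-base",
    num_labels=len(ASPECTS),
    problem_type="multi_label_classification",  # sigmoid outputs + BCE loss
)

# Note: in practice PhoBERT expects word-segmented Vietnamese input.
inputs = tokenizer("Đồ ăn ngon nhưng phục vụ chậm.", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits

# Each aspect category is predicted independently via a sigmoid threshold.
predicted = [a for a, p in zip(ASPECTS, torch.sigmoid(logits[0])) if p > 0.5]
print(predicted)
```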
New way of using a language model's perplexity in a few-shot setting, which can be applied to fact-check online content.
In ‘Towards Few-Shot Fact-Checking via Perplexity’, Nayeon Lee et al. (2021) reported that few-shot learning is being actively explored to overcome the heavy dependence on large-scale labeled data. Brown et al. illustrated the impressive potential of language models as strong zero-shot and few-shot learners across translation, commonsense reasoning and natural language inference. The group propose a novel way of leveraging the perplexity score from language models (LMs) for the few-shot fact-checking task. Their GPT2-XL-based perplexity model is the best performing compared with the other, smaller models on Covid-Scientific and FEVER. They believe the method can be combined with other existing approaches, for instance by leveraging the perplexity score in the final step of the FEVER fact-checkers as additional input.
59,000 scholarly articles were involved in the research. The authors hope the proposed approach encourages future research to continue developing LM-based methodologies; by doing so, the community can move towards a data-efficient approach that is not constrained by the requirement of a large labeled dataset.
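The core idea can be sketched in a few lines: compute the perplexity a language model assigns to an evidence-plus-claim string, then classify the claim with a threshold tuned on the few labelled shots. The snippet below simplifies Lee et al.'s setup by scoring the plain concatenation with off-the-shelf GPT-2; the threshold value is purely illustrative:

```python
# A minimal sketch of perplexity-based fact-checking in the spirit of
# Lee et al.; THRESHOLD is a hypothetical value that would be tuned on
# the handful of labelled examples in a real few-shot setting.
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

def perplexity(text: str) -> float:
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        loss = model(ids, labels=ids).loss  # mean token cross-entropy
    return torch.exp(loss).item()

evidence = "The Eiffel Tower is located in Paris, France."
claim = "The Eiffel Tower is in Paris."
ppl = perplexity(evidence + " " + claim)

THRESHOLD = 50.0  # illustrative; lower perplexity suggests a supported claim
print("SUPPORTED" if ppl < THRESHOLD else "UNSUPPORTED", ppl)
```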
A new transformer that considers hard lexical constraints has been proposed.
Lee-Hsun Hsieh et al. (2021) described ENCONTER, an entity-constrained insertion transformer. The field of Natural Language Generation (NLG) has seen significant improvements in recent years across many applications. While POINTER shows promising results, it does not consider hard constraints involving entities that must appear in the generated sequence. ENCONTER supports hard entity constraints and encourages more meaningful tokens to be generated in the early stages of generation, reducing the cold-start problem. Constrained text generation is an important task for many real-world applications, and the team focus on hard entity constraints and the challenges associated with enforcing them.
The analysis involved 1,393 news articles. The team claim that their two models outperform the strong baselines, POINTER-E and GPT2, in recall, quality and failure rate. For future research, it will be interesting to consider more diverse constraints and user interaction in the generation process.
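Insertion-based generation with hard entity constraints can be pictured as starting from the required entities and repeatedly inserting tokens between them, so the constraints are satisfied by construction. The toy below illustrates only that mechanism; pick_insertion() is a hard-coded stand-in for ENCONTER's learned insertion policy:

```python
# A toy illustration of insertion-based generation with hard entity
# constraints: the sequence is initialised with the required entities and
# tokens are inserted between them. pick_insertion() is a scripted
# stand-in for a learned insertion transformer.
def pick_insertion(sequence):
    """Stand-in policy: returns (slot_index, token) or None when done."""
    script = {("Alice", "Acme"): (1, "joined"),
              ("Alice", "joined", "Acme"): (3, "in"),
              ("Alice", "joined", "Acme", "in"): (4, "2021")}
    return script.get(tuple(sequence))

def generate(entities):
    sequence = list(entities)  # hard constraints are present from step 0
    while (step := pick_insertion(sequence)) is not None:
        slot, token = step
        sequence.insert(slot, token)
    return " ".join(sequence)

print(generate(["Alice", "Acme"]))  # -> "Alice joined Acme in 2021"
```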
The performance of natural language generation systems has improved substantially with modern neural networks, but there is still room for improvement.
James Hargreaves et al. (2021) reported on incremental beam manipulation for natural language generation. In natural language generation (NLG), the goal is to generate text representing structured information that is both fluent and conveys the right information. Rerankers are commonly used to increase the performance of NLG systems decoded by beam search. The team proposed incremental beam manipulation, which modifies the ranking of partial hypotheses within the beam at intermediate steps of decoding.
547 cases were included in the study. The researchers report that “The results showed that applying beam manipulation, instead of a reranker, was able to increase the BLEU score by 1.04 on the E2E and WebNLG challenges.” In future work, the group intend to refine the method further by conditioning the reranker on how far through the beam search it is. Data and code to reproduce the analyses can be found at: https://github.com/tuetschek.
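The idea is to apply a reranker not only to completed outputs but to partial hypotheses during decoding. The toy sketch below shows a beam search where, at selected steps, the beam is re-ordered by an external scorer before pruning; the language model and reranker here are deliberately trivial stand-ins:

```python
# A minimal sketch of incremental beam manipulation: at chosen steps of
# beam search, partial hypotheses are re-ranked by an external scorer
# before pruning. VOCAB, lm_score and reranker are toy stand-ins.
import math

VOCAB = {"a": 0.5, "b": 0.3, "<eos>": 0.2}  # toy, context-independent LM

def lm_score(hyp):
    return sum(math.log(VOCAB[t]) for t in hyp)

def reranker(hyp):
    """Hypothetical quality estimator; here it simply favours hypotheses
    containing the token 'b'."""
    return hyp.count("b")

def beam_search(beam_size=2, max_len=3, manipulate_at=(1, 2)):
    beams = [[]]
    for step in range(max_len):
        expanded = [h + [t] for h in beams for t in VOCAB]
        # Re-order the beam with the reranker at manipulation steps,
        # otherwise prune by the model's own log-probabilities.
        key = reranker if step in manipulate_at else lm_score
        beams = sorted(expanded, key=key, reverse=True)[:beam_size]
    return beams

print(beam_search())
```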
Novel method for automatically generating contrast sets for the visual question answering task GQA, used to assess the performance of supervised models.
Yonatan Bitton et al. (2021) propose a method for automatic generation of large contrast sets for visual question answering (VQA). Bitton and colleagues evaluate two leading models, LXMERT and MAC, and find a 13–17% reduction in performance compared to the original validation set. They show that the automatic method for contrast set construction can be used to improve performance by employing it during training. NLP benchmarks typically evaluate in-distribution generalization, where test sets are drawn i.i.d. from a distribution similar to the training set. Bitton and colleagues first identify a large subset of questions requiring specific reasoning skills.
This process may yield an arbitrarily large number of contrasting samples per question, since there are many candidates for replacing the objects participating in a question. They report experiments with 1, 3 and 5 contrasting samples per question. The findings potentially support previous studies in this field: “Our method computes the answer of perturbed questions, thus vastly reducing annotation cost. We demonstrate the effectiveness of our approach on the popular GQA dataset and its semantic scene graph image representation,” Bitton claimed. The team suggest that the automatic method for creating contrast sets allows such questions to be asked at scale, and that future work on better training mechanisms could help in making more robust models. The authors have provided data and code at: https://github.com/yonatanbitton.
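Because GQA images come with semantic scene graphs, the answer to a perturbed question can be read off the graph rather than re-annotated by hand. The toy sketch below illustrates that principle with a hypothetical two-object scene graph and question template:

```python
# An illustrative sketch of automatic contrast-set generation for
# GQA-style questions: an object mentioned in the question is swapped for
# another object from the image's scene graph, and the new answer is
# computed from the graph itself, so no human annotation is needed.
# The scene graph and question template are toy examples.
scene_graph = {
    "apple": {"color": "red"},
    "cup": {"color": "blue"},
}

def answer(question_obj, attribute):
    return scene_graph[question_obj][attribute]

original_obj = "apple"
question = f"What color is the {original_obj}?"
print(question, "->", answer(original_obj, "color"))        # red

# Contrasting samples: replace the object, derive the answer from the graph.
for new_obj in scene_graph:
    if new_obj != original_obj:
        contrast_q = f"What color is the {new_obj}?"
        print(contrast_q, "->", answer(new_obj, "color"))   # blue
```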
The role of text in the detection of fake news on social media is comparatively well studied, but the role of images in the task is not as well understood.
Gullal Cheema et al. (2021) report that social media platforms have become an integral part of everyday life. They investigated the role of images and tweet text for two problems related to fake news: claim detection and conspiracy detection. The authors experimented with the recently proposed multimodal co-attention transformer ViLBERT and observed promising performance when using both image and text. The team selected four publicly available Twitter datasets with high-quality annotations.
The number of tweets in the original datasets is four to fifteen times larger, as they were mined for text-based fake news detection. However, “To address the limitation of visual models, we will consider models that can deal with text and graphs in images and extract suitable features,” concede the authors. They note that they observed promising performance using both image and text even with relatively small datasets, and plan to investigate multimodal transformers in more detail and analyse whether performance scales with more data in similar tasks. The researchers have provided data and code at: https://github.com/cleopatra-itn/image_text_claim_detection.
Text-mining and machine learning techniques to speed up the investigation of fraudsters by detecting code words hidden in normal emails.
Van der Zee and colleagues (2021) reported on code word detection in fraud investigations using a deep-learning approach. The ability to model the context of a text is vital in fraud investigations, especially for code-word detection, and this ability has greatly advanced in recent years thanks to deep-learning algorithms. To date, no work has been done on detecting code words using large pre-trained deep learning models such as BERT. The BERT-based method will pick up any kind of unexpected language use.
Fong et al. follow a similar approach by randomly sampling sentences from the Enron data set and replacing the first noun in each sentence with a different noun. The authors achieve detection ratios of 83% and 90% on two other synthetic data sets. They propose that further research be done to identify more relevant evidence items that have a discriminative relation to a fraud scenario and that can be obtained by an appropriate AI method; in the proposed model, these evidence items are all formatted as answers to one or more variants of the six golden investigation questions. The authors have provided data and code at: https://github.com/pushshift/api.
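One natural way to realise BERT-based code-word detection is to mask each token in turn and flag the tokens the masked language model finds highly improbable in context. The sketch below follows that idea; the probability threshold and example sentence are illustrative choices, not the authors' settings:

```python
# A minimal sketch of masked-LM code-word detection: a code word is one
# the model finds very unlikely in its context. The 1e-3 threshold is an
# illustrative choice.
import torch
from transformers import BertTokenizer, BertForMaskedLM

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForMaskedLM.from_pretrained("bert-base-uncased").eval()

def flag_code_words(sentence, threshold=1e-3):
    tokens = tokenizer(sentence, return_tensors="pt")
    ids = tokens.input_ids[0]
    flagged = []
    for i in range(1, len(ids) - 1):          # skip [CLS] and [SEP]
        masked = tokens.input_ids.clone()
        masked[0, i] = tokenizer.mask_token_id
        with torch.no_grad():
            logits = model(masked).logits[0, i]
        prob = torch.softmax(logits, dim=-1)[ids[i]].item()
        if prob < threshold:                  # very surprising in context
            flagged.append((tokenizer.decode(int(ids[i])), prob))
    return flagged

# "pineapples" in a shipping context may stand in for something else.
print(flag_code_words("Please send the pineapples to the usual warehouse."))
```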
New architecture, siamese multilingual transformer (SML), to efficiently align multilingual embeddings for Natural Language Inference.
Javier Huertas-Tato et al. (2021) studied SML. Natural Language Inference has been a recurring challenge in Natural Language Processing. Architectures such as XLM-R achieve excellent results in cross-lingual representation learning. By pre-training these bidirectional architectures on unlabeled data and taking advantage of the near context, it is possible to build base models. The siamese multilingual transformer topology presented in this research consists of an architecture appended to frozen transformer models to understand entailment relationships between cross-lingual sentence pairs. While reducing trainable parameters by 92%, they achieve 82.5% accuracy on SICK.
The study involved 570,000 pairs. The investigators concede that “We find that frozen transformers and a linear head are not as good as expected. We suspect that fine tuning compromises multilingual capabilities for cross-lingual sentences. Comparing English to Spanish results, it is clear that the native language of these models is English.” The team argue that the sheer size of some architectures, and their poor performance when applied to different domains, are issues that evidence the need for further research.
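A frozen-encoder siamese setup can be sketched as follows: both sentences pass through the same frozen multilingual transformer, are mean-pooled, and a small trainable head classifies the pair. The [u, v, |u-v|] feature combination below is a common siamese-network choice and may differ from the authors' exact topology:

```python
# A minimal sketch of a siamese NLI head on top of a frozen multilingual
# transformer; only the head's parameters would be trained.
import torch
import torch.nn as nn
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")
encoder = AutoModel.from_pretrained("xlm-roberta-base")
for p in encoder.parameters():
    p.requires_grad = False  # the transformer stays frozen

# Trainable head over [u, v, |u - v|]; layer sizes are illustrative.
head = nn.Sequential(nn.Linear(3 * 768, 256), nn.ReLU(), nn.Linear(256, 3))

def embed(sentence):
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = encoder(**inputs).last_hidden_state
    return hidden.mean(dim=1)  # mean pooling over tokens

u = embed("A man is playing a guitar.")      # premise (any language)
v = embed("Un hombre toca la guitarra.")     # hypothesis (any language)
logits = head(torch.cat([u, v, (u - v).abs()], dim=-1))
print(logits)  # entailment / neutral / contradiction scores (untrained head)
```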
Researchers have published a data paper on the Darija Open Dataset, an open-source translation project for the Moroccan dialect.
Aissam Outchakoucht and Hamza Es-Samaali (2021) reported in ‘Moroccan Dialect -Darija- Open Dataset’ that a significant amount (77.63%) of Darija vocabulary is borrowed from Modern Standard Arabic (MSA), 11.72% from French, and less than 1% from Spanish and the Tamazight languages. This open project aims to be a standard resource for researchers, students, and anyone interested in the Moroccan dialect. The Darija Open Dataset (DODa) is the largest open-source collaborative project for Darija, built with NLP applications in mind. The authors hope for contributions from the Moroccan IT community in order to build a pedestal for any future application of Natural Language Processing. DODa contains more than 10,000 entries covering verbs, nouns, adjectives, verb-to-noun and singular-to-plural correspondences, conjugations, etc. To the best of their knowledge, DODa is the largest Darija-English translation dataset.
Data and code to reproduce the analyses can be found at: https://github.com/darija-open-dataset/dataset.
Novel approach to compressing large-scale language models, in which the student model is encouraged to mimic the teacher's behavior on interpolated training examples.
In ‘MixKD’, Kevin Liang and colleagues (2020) noted that recent language models (LMs) pre-trained on large-scale unlabeled text corpora in a self-supervised manner have significantly advanced the state of the art across a wide variety of natural language processing (NLP) tasks. While these models have yielded impressive results, they typically have millions, if not billions, of parameters. With a knowledge distillation set-up, the authors aim to reduce the loss in performance that comes with smaller models. They introduce MixKD, a method that uses mixup-style data augmentation to significantly increase the value of knowledge distillation.
MixKD significantly outperforms knowledge distillation and other previous methods for compressing large-scale language models, and is shown to be especially effective when the number of available task data samples is small. The researchers believe that the MixKD framework can further reduce the gap between student and teacher models with the incorporation of more recent mixup and knowledge distillation techniques.
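In outline, MixKD interpolates pairs of training examples (mixup) and trains the student to match the teacher's predictions on the interpolated inputs. The sketch below demonstrates that training loop with toy linear models over pre-computed embeddings rather than real language models:

```python
# A minimal sketch of mixup-style knowledge distillation as in MixKD:
# examples are interpolated in embedding space and the student is trained
# to match the teacher's outputs on the interpolated inputs. The teacher
# and student here are toy linear models, not real language models.
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
teacher = nn.Linear(16, 2)   # stand-in for a large fine-tuned LM
student = nn.Linear(16, 2)   # much smaller model being distilled
optim = torch.optim.Adam(student.parameters(), lr=1e-2)

x_a, x_b = torch.randn(8, 16), torch.randn(8, 16)  # two batches of embeddings

for _ in range(100):
    lam = torch.distributions.Beta(0.4, 0.4).sample()  # mixup coefficient
    x_mix = lam * x_a + (1 - lam) * x_b                # interpolated inputs
    with torch.no_grad():
        t_probs = F.softmax(teacher(x_mix), dim=-1)    # teacher soft labels
    # Student matches the teacher's distribution on the mixed inputs.
    loss = F.kl_div(F.log_softmax(student(x_mix), dim=-1),
                    t_probs, reduction="batchmean")
    optim.zero_grad()
    loss.backward()
    optim.step()

print(f"final distillation loss: {loss.item():.4f}")
```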