e-ISSN 2231-8526
ISSN 0128-7680
Tien-Ping Tan, Chai Kim Lim and Wan Rose Eliza Abdul Rahman
Pertanika Journal of Science & Technology, Volume 30, Issue 1, January 2022
DOI: https://doi.org/10.47836/pjst.30.1.06
Keywords: Attention, CNN, LSTM, parallel text, sentence alignment
Published on: 10 January 2022
A parallel text corpus is an important resource for building a machine translation (MT) system. Existing resources such as translated documents, bilingual dictionaries, and translated subtitles are excellent sources for constructing a parallel text corpus. Because manual sentence alignment is resource-intensive, a sentence alignment algorithm is used to align source and target sentences automatically. Over the years, sentence alignment approaches have progressed from sentence-length heuristics to statistical lexical models and then to deep neural networks. Framing alignment as a classification problem is attractive because classification is at the core of machine learning. This paper proposes a parallel long short-term memory network with attention and a convolutional neural network (parallel LSTM+Attention+CNN) for classifying a pair of sentences as parallel or non-parallel. A sliding window approach is also proposed that uses the classifier to align sentences in the source and target languages. The proposed approach was compared with three classifiers, namely a feedforward neural network, a CNN, and a bi-directional LSTM, and with the BleuAlign sentence alignment system. The classification accuracy of these models was evaluated on a Malay-English parallel text corpus and the UN French-English parallel text corpus. Malay-English sentence alignment performance was then evaluated on research documents and a very challenging Classical Malay-English document. The proposed classifier obtained more than 80% accuracy in categorizing parallel/non-parallel sentences with a model built using only five thousand parallel training sentences. It also achieved higher sentence alignment accuracy than the baseline systems.
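The abstract does not specify how the sliding window is combined with the classifier, so the following is only a minimal sketch of the general idea, not the authors' procedure. It assumes a greedy, monotonic, one-to-one alignment in which each source sentence is compared against a small window of target sentences around its expected position. The names `sliding_window_align`, `toy_score`, and `TOY_LEXICON`, as well as the `window` and `threshold` parameters, are hypothetical; `toy_score` is a lexicon-overlap stand-in for a trained parallel/non-parallel classifier.

```python
from typing import Callable, List, Tuple

# Hypothetical toy Malay-English lexicon, used only to make the sketch runnable.
TOY_LEXICON = {
    "saya": "i", "makan": "eat", "nasi": "rice",
    "dia": "he", "minum": "drinks", "air": "water",
}


def toy_score(src: str, tgt: str) -> float:
    """Stand-in for a trained classifier: the fraction of source words whose
    lexicon translation appears in the target sentence."""
    src_words = src.lower().split()
    tgt_words = set(tgt.lower().split())
    if not src_words:
        return 0.0
    hits = sum(1 for w in src_words if TOY_LEXICON.get(w) in tgt_words)
    return hits / len(src_words)


def sliding_window_align(
    source: List[str],
    target: List[str],
    score: Callable[[str, str], float],
    window: int = 2,
    threshold: float = 0.5,
) -> List[Tuple[int, int]]:
    """Greedy one-to-one alignment (an assumption of this sketch).

    For each source sentence, score the target sentences inside a window
    centred on the expected position and keep the best pair if its score
    reaches the threshold.
    """
    pairs: List[Tuple[int, int]] = []
    next_free = 0  # first target index not yet aligned; keeps pairs monotonic
    for i, src in enumerate(source):
        # Expected target position, scaled by the document length ratio.
        centre = i * len(target) // max(len(source), 1)
        lo = max(next_free, centre - window)
        hi = min(len(target), centre + window + 1)
        best_j, best_score = -1, float("-inf")
        for j in range(lo, hi):
            s = score(src, target[j])
            if s > best_score:
                best_j, best_score = j, s
        if best_j >= 0 and best_score >= threshold:
            pairs.append((i, best_j))
            next_free = best_j + 1
    return pairs


if __name__ == "__main__":
    source = ["Saya makan nasi", "Dia minum air"]
    target = ["I eat rice", "Hello there", "He drinks water"]
    print(sliding_window_align(source, target, toy_score, window=1))
```

In practice the `score` argument would be the probability output of a trained sentence-pair classifier, and the window size would trade off search cost against tolerance for inserted or deleted sentences.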