Recurrent Neural Network in Natural Language Processing

Mindy Wu
8 min read · Dec 14, 2022

For the past two weeks, the most talked-about topic in deep learning and artificial intelligence has undoubtedly been ChatGPT. Today, we unveil the technology behind the ChatGPT hype: what is actually implemented under the hood, and why is it such a breakthrough from both an applied and a mathematical perspective?

Essentially, ChatGPT is built on a language model trained to generate human-like sentences by predicting the next word in a sequence of words. It combines natural language processing with deep neural networks to generate text that reads as coherently as human writing. Techniques such as word embeddings and recurrent neural networks are key algorithms that make this kind of application possible. A word embedding is a numerical representation of a word that the model can learn from. A recurrent neural network (RNN) is a model that considers the context of previous elements (in this case, words) when generating the next one. In short, ChatGPT can be thought of as a scaled-up recurrent neural network that works with sequential data such as text. An RNN differs from conventional neural networks in that it consumes the input sequence one element at a time while maintaining an internal state that captures information about all past elements of the sequence. This lets the RNN incorporate context from the input when making predictions or, in ChatGPT's case, generating a cohesive sentence.
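As a rough illustration of those two ingredients, here is a minimal sketch (assuming PyTorch; the vocabulary size, dimensions, and token IDs are made up for illustration) of an embedding table feeding a recurrent cell that updates its hidden state one word at a time:

```python
import torch
import torch.nn as nn

vocab_size, embed_dim, hidden_dim = 10_000, 100, 256  # illustrative sizes

embedding = nn.Embedding(vocab_size, embed_dim)   # word id -> learnable vector
rnn_cell = nn.RNNCell(embed_dim, hidden_dim)      # updates the internal state

token_ids = torch.tensor([42, 7, 1375, 9])        # a toy "sentence" of word IDs
hidden = torch.zeros(1, hidden_dim)               # internal state starts empty

# The RNN reads one element at a time; the hidden state carries context
# from all previous words forward to the next step.
for t in token_ids:
    word_vector = embedding(t).unsqueeze(0)       # (1, embed_dim)
    hidden = rnn_cell(word_vector, hidden)        # (1, hidden_dim)

# 'hidden' now summarizes the whole sequence and could feed a softmax
# over the vocabulary to predict the next word.
```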

With the understanding that RNNs are the advanced neural networks that make NLP useful in different contexts, let's break down what natural language processing is from a more technical perspective. NLP is a field of artificial intelligence that develops models and algorithms so computers can understand and process human language. It sits at the junction of computer science, linguistics, and cognitive psychology, and it enables computers to understand and in turn generate natural language. A typical NLP workflow has four steps. Given a text dataset, the first step is text preprocessing, where the words are cleaned and prepared: the machine performs tokenization, which breaks text into individual words and phrases, then sentence segmentation, which splits the text into individual sentences, and finally removes punctuation and special characters so that the raw text is ready for further processing. After preprocessing comes feature extraction: put simply, the machine selects meaningful words from the text, known as keywords, and identifies each word's role in the sentence (whether it is a subject, an object, a verb, and so on). The third step is modeling. The features extracted in the previous step become the input to a model that processes them and generates predictions from the text data; given a labeled dataset as input, the expected output might be the sentiment label of a piece of text containing certain keywords. The last step of a complete NLP pipeline is inference: once the model has been trained in the third step, it makes predictions on new, unseen text data. This is called inference because it involves feeding new text into the trained model to make predictions based on the features extracted from that text.
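Here is a minimal sketch of the preprocessing step using only the Python standard library; the splitting rules are deliberately naive stand-ins for real tokenizers and sentence segmenters:

```python
import re
import string

text = "ChatGPT writes fluent text. It was trained on huge corpora!"

# Sentence segmentation: naive split on sentence-ending punctuation.
sentences = re.split(r"(?<=[.!?])\s+", text.strip())

# Tokenization + cleaning: lowercase, strip punctuation and special characters.
tokens = [
    word.strip(string.punctuation).lower()
    for sentence in sentences
    for word in sentence.split()
    if word.strip(string.punctuation)
]

print(sentences)  # ['ChatGPT writes fluent text.', 'It was trained on huge corpora!']
print(tokens)     # ['chatgpt', 'writes', 'fluent', 'text', 'it', 'was', ...]
```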

Common NLP tasks include text classification, machine translation, sentiment analysis, named entity recognition, and part-of-speech tagging. Text classification is the ability to categorize or topic-model a given piece of text. Machine translation is the translation of one language into another. Sentiment analysis identifies the positive, negative, or neutral emotional tone of a given text. Named entity recognition extracts people, organizations, places, and other entities from text. Part-of-speech tagging identifies whether a word acts as a noun, verb, adjective, or another grammatical role. With these tasks in mind, common applications of NLP include translation, text summarization, sentiment analysis, and chatbots.

Now that we have briefly introduced NLP applications, let's delve into the technical contributions RNNs have made to natural language processing on the way to something like ChatGPT. We noted that RNNs improve language modeling by learning the dependencies between words in a sentence in order to generate coherent text. Beyond that, RNNs improve machine translation by learning to encode the meaning of a sentence in one language and decode it into another with greater accuracy. RNNs also let question-answering systems use more of the context of a question to generate more relevant answers. Lastly, RNNs enable sentiment analysis to draw on more of the surrounding text and so capture its contextual information more accurately.

Figure 1: illustration of the RNN Encoder-Decoder

In this article, we reference a published paper whose lead author is Professor Kyunghyun Cho from CDS, written with several researchers across the data science community. The paper, Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation, proposes a neural network model called the RNN Encoder-Decoder, composed of two recurrent neural networks (RNNs): the first RNN encodes a sequence of symbols into a fixed-length vector representation, and the second RNN decodes that representation back into a sequence of symbols. The two RNNs are jointly trained to maximize the conditional probability of the target sequence given the source sequence. Put simply, the Encoder-Decoder learns a semantically and syntactically meaningful representation of linguistic phrases.

As Figure 1 shows, the architecture encodes a variable-length sequence into a fixed-length vector representation and decodes it back into a variable-length sequence, learning the conditional distribution p(y_1, …, y_T' | x_1, …, x_T) over one variable-length sequence conditioned on another. The encoder reads each symbol of the input sequence x in turn, and the hidden state of the RNN changes according to h⟨t⟩ = f(h⟨t−1⟩, x_t), where f is a non-linear activation function that can range from a logistic sigmoid to a long short-term memory (LSTM) unit. The decoder's hidden state at time t is additionally conditioned on the previously generated symbol y_{t−1} and on the summary c of the input sequence, so each step of the recurrence is conditioned on what came before (right-hand side of the figure).

Equation 1: h⟨t⟩ = f(h⟨t−1⟩, x_t)
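As a minimal sketch of this recurrence (assuming PyTorch, and substituting standard GRU cells for the paper's exact hidden unit; all dimensions and names are illustrative), the encoder below folds a variable-length input into a fixed-length summary c, and each decoder step is conditioned on the previous symbol, the previous hidden state, and c:

```python
import torch
import torch.nn as nn

embed_dim, hidden_dim = 100, 1000

enc_cell = nn.GRUCell(embed_dim, hidden_dim)
# Decoder input at each step: previous target embedding concatenated with c.
dec_cell = nn.GRUCell(embed_dim + hidden_dim, hidden_dim)

def encode(source_vectors):
    """h_t = f(h_{t-1}, x_t); the final hidden state is the summary c."""
    h = torch.zeros(1, hidden_dim)
    for x_t in source_vectors:
        h = enc_cell(x_t.unsqueeze(0), h)
    return h  # fixed-length representation c

def decode_step(prev_target_vector, h, c):
    """h_t = f(h_{t-1}, y_{t-1}, c): each step sees the previous symbol and c."""
    return dec_cell(torch.cat([prev_target_vector.unsqueeze(0), c], dim=1), h)

# Toy usage with random "embedded" source words.
c = encode([torch.randn(embed_dim) for _ in range(5)])
h = c.clone()                       # initialize the decoder from the summary
y_prev = torch.zeros(embed_dim)     # embedding of a start-of-sequence symbol
h = decode_step(y_prev, h, c)
```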

To elaborate on the hidden state, the paper also proposes a novel hidden unit that adaptively remembers and forgets. The activation of the reset gate r_j of this hidden unit is computed by equation (5), and the update gate z_j is computed by equation (6).

Equation 5: r_j = σ([W_r x]_j + [U_r h⟨t−1⟩]_j)
Equation 6: z_j = σ([W_z x]_j + [U_z h⟨t−1⟩]_j)

The actual activation of the proposed unit h_j is then computed by equation (7), in which the candidate activation h̃_j is given by equation (8).

Equation 7: h_j⟨t⟩ = z_j h_j⟨t−1⟩ + (1 − z_j) h̃_j⟨t⟩
Equation 8: h̃_j⟨t⟩ = φ([W x]_j + [U (r ⊙ h⟨t−1⟩)]_j)
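As a minimal NumPy sketch following the form of equations (5)–(8) (the weight shapes, the logistic sigmoid, and the tanh nonlinearity are assumptions made for illustration):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def gated_hidden_step(x, h_prev, W_r, U_r, W_z, U_z, W, U):
    """One step of the proposed hidden unit with reset and update gates."""
    r = sigmoid(W_r @ x + U_r @ h_prev)            # reset gate, eq. (5)
    z = sigmoid(W_z @ x + U_z @ h_prev)            # update gate, eq. (6)
    h_tilde = np.tanh(W @ x + U @ (r * h_prev))    # candidate activation, eq. (8)
    return z * h_prev + (1.0 - z) * h_tilde        # new hidden state, eq. (7)

# Toy usage with random weights.
d_in, d_h = 4, 8
rng = np.random.default_rng(0)
W_r, W_z, W = (rng.standard_normal((d_h, d_in)) for _ in range(3))
U_r, U_z, U = (rng.standard_normal((d_h, d_h)) for _ in range(3))
h = np.zeros(d_h)
h = gated_hidden_step(rng.standard_normal(d_in), h, W_r, U_r, W_z, U_z, W, U)
```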

Once the RNN Encoder-Decoder has been trained, the model can be used in two ways. The first is to generate a target sequence given an input sequence. The second is to score a given pair of input and output sequences with the probability p_θ(y | x).
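A sketch of the second use, scoring a candidate pair by summing per-step log-probabilities; the model object and its encode/decode_step interface here are hypothetical stand-ins, not the paper's code:

```python
import math

def score_pair(model, source_tokens, target_tokens):
    """Return log p_theta(y | x) for a candidate (source, target) pair."""
    c = model.encode(source_tokens)                 # fixed-length summary of x
    h, y_prev = c, model.start_symbol               # decoder starts from the summary
    log_prob = 0.0
    for y_t in target_tokens:
        h, probs = model.decode_step(y_prev, h, c)  # distribution over the next symbol
        log_prob += math.log(probs[y_t])            # log-probability of the true symbol
        y_prev = y_t
    return log_prob
```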

In a statistical machine translation (SMT) system, the goal of the decoder is to find the translation of a given sentence that maximizes the probability under the translation model. The phrase-based SMT framework was introduced by Koehn et al. (2003) and Marcu and Wong (2002). In it, the translation model is factorized into the translation probabilities of matching phrases in the source and target sentences, and these probabilities are treated as additional features in a log-linear model whose weights are tuned to maximize the BLEU score. Equation (9) captures this idea, where Z(e) is a normalization constant that does not depend on the weights, and the weights are optimized to maximize the BLEU score on a development set.

Equation 9: log p(f | e) = Σ_n w_n f_n(f, e) + log Z(e)
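A minimal sketch of that log-linear scoring in the spirit of equation (9); the feature values and weights below are placeholders, and log Z(e) is dropped because it does not change the ranking of candidate translations for a fixed source sentence:

```python
def log_linear_score(features, weights):
    """Unnormalized log p(f | e) = sum_n w_n * f_n(f, e)."""
    return sum(w * f for w, f in zip(weights, features))

# Features for one candidate translation: e.g. phrase translation probability,
# language-model score, and the RNN Encoder-Decoder score added by the paper.
candidate_features = [-2.3, -5.1, -1.7]
weights = [1.0, 0.6, 0.4]   # tuned to maximize BLEU on a development set
print(log_linear_score(candidate_features, weights))
```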

The second method, scoring phrase pairs with the RNN Encoder-Decoder, essentially uses those scores as an additional feature in the log-linear model when tuning the SMT decoder. When training the RNN Encoder-Decoder, the authors ignore the normalized frequencies of phrase pairs in the original corpora. This both reduces the computational expense of randomly selecting phrase pairs from a large phrase table according to their frequencies and ensures that the RNN does not simply learn to rank phrases by how often they occur. In effect, the scoring lets the new scores enter the existing tuning algorithm with minimal additional computational overhead.

The paper's experiment evaluates the model on an English-to-French translation task. The RNN Encoder-Decoder used has 1000 hidden units with the proposed gates in both the encoder and the decoder. The weight matrix between the input symbols and the hidden units is approximated by the product of two lower-rank matrices, and the output matrix is approximated similarly; the experiment uses rank-100 matrices, which is equivalent to learning a 100-dimensional embedding for each word. The activation function φ used for h̃ in equation (8) is the hyperbolic tangent. The RNN Encoder-Decoder is trained with Adadelta and stochastic gradient descent, with hyperparameters ε = 10⁻⁶ and ρ = 0.95. Each update uses 64 randomly selected phrase pairs from a phrase table of 348M words, and training ran for roughly three days.
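A sketch of that low-rank approximation (assuming PyTorch; the vocabulary size is made up): instead of a single large weight matrix between the input symbols and the 1000 hidden units, the product of two rank-100 factors is learned, which amounts to a 100-dimensional embedding per word:

```python
import torch
import torch.nn as nn

vocab_size, rank, hidden_dim = 30_000, 100, 1000   # vocabulary size is illustrative

# Full matrix: vocab_size x hidden_dim parameters.
# Low-rank version: (vocab_size x 100) + (100 x hidden_dim) parameters,
# where the first factor doubles as a 100-dimensional word embedding.
input_projection = nn.Sequential(
    nn.Embedding(vocab_size, rank),            # word id -> 100-dim embedding
    nn.Linear(rank, hidden_dim, bias=False),   # project up to the 1000 hidden units
)

token_ids = torch.tensor([3, 1401, 25])
print(input_projection(token_ids).shape)       # torch.Size([3, 1000])
```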

For the quantitative analysis, the experiment compares four configurations: 1. the baseline configuration, 2. baseline + RNN, 3. baseline + CSLM + RNN, and 4. baseline + CSLM + RNN + word penalty. As Table 1 shows, adding the features computed by the neural networks consistently improves performance over the baseline, which is what the study expected.

Table 1: Experiment results for the four configurations

From these results, the paper proposes a new neural network architecture, referred to as the RNN Encoder-Decoder, that can learn a mapping from a sequence of arbitrary length to another sequence. The model can both score a pair of sequences in terms of a conditional probability and generate a target sequence given a source sequence. Beyond the architecture itself, the other innovation is the novel hidden unit with reset and update gates that adaptively control how much each hidden unit remembers or forgets while reading or generating a sequence. The proposed model is evaluated on the task of statistical machine translation from English to French. Qualitatively, the experiment shows that the new model captures linguistic regularities in phrase pairs well and that the RNN Encoder-Decoder proposes well-formed target phrases. Quantitatively, its scores further improve the BLEU performance of the SMT system.

In conclusion, the results of this translation experiment and its scores are a good example of how recurrent neural networks are used in natural language processing applications. The RNN is able to maintain a hidden state and to encode and decode text, which is exactly what the language-translation application of NLP requires.

Final results from the experiment

References

Cho et al. (2014), Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation

Rise of AI at 2022 landscape toward Data Science

NLP: Basics to RNN and LSTM

GPT3 Design Hype prototypr.io

Deep Learning Approach in Predicting Next Words

Recurrent RNN towards Data Science


Mindy Wu

An undergraduate student studying Computer Science and Data Science at New York University