Journal of Machine Learning Research 12 (2011) 2493-2537

Submitted 1/10; Revised 11/10; Published 8/11

Natural Language Processing (Almost) from Scratch
Ronan Collobert∗                                    RONAN@COLLOBERT.COM
Jason Weston†                                       JWESTON@GOOGLE.COM
Léon Bottou‡                                        LEON@BOTTOU.ORG
Michael Karlen                                      MICHAEL.KARLEN@GMAIL.COM
Koray Kavukcuoglu§                                  KORAY@CS.NYU.EDU
Pavel Kuksa¶                                        PKUKSA@CS.RUTGERS.EDU

NEC Laboratories America
4 Independence Way
Princeton, NJ 08540

Editor: Michael Collins

Abstract
We propose a unified neural network architecture and learning algorithm that can be applied to various natural language processing tasks including part-of-speech tagging, chunking, named entity recognition, and semantic role labeling. This versatility is achieved by trying to avoid task-specific engineering and therefore disregarding a lot of prior knowledge. Instead of exploiting man-made input features carefully optimized for each task, our system learns internal representations on the basis of vast amounts of mostly unlabeled training data. This work is then used as a basis for building a freely available tagging system with good performance and minimal computational requirements.
Keywords: natural language processing, neural networks

1. Introduction
Will a computer program ever be able to convert a piece of English text into a programmer friendly data structure that describes the meaning of the natural language text? Unfortunately, no consensus has emerged about the form or the existence of such a data structure. Until such fundamental
Artificial Intelligence problems are resolved, computer scientists must settle for the reduced objective of extracting simpler representations that describe limited aspects of the textual information.
These simpler representations are often motivated by specific applications (for instance, bag-of-words variants for information retrieval), or by our belief that they capture something more general about natural language. They can describe syntactic information (e.g., part-of-speech tagging, chunking, and parsing) or semantic information (e.g., word-sense disambiguation, semantic role labeling, named entity extraction, and anaphora resolution). Text corpora have been manually annotated with such data structures in order to compare the performance of various systems. The availability of standard benchmarks has stimulated research in Natural Language Processing (NLP)
∗. Ronan Collobert is now with the Idiap Research Institute, Switzerland.
†. Jason Weston is now with Google, New York, NY.
‡. Léon Bottou is now with Microsoft, Redmond, WA.
§. Koray Kavukcuoglu is also with New York University, New York, NY.
¶. Pavel Kuksa is also with Rutgers University, New Brunswick, NJ.

©2011 Ronan Collobert, Jason Weston, Léon Bottou, Michael Karlen, Koray Kavukcuoglu and Pavel Kuksa.

and effective systems have been designed for all these tasks. Such systems are often viewed as software components for constructing real-world NLP solutions.
The overwhelming majority of these state-of-the-art systems address their single benchmark task by applying linear statistical models to ad-hoc features. In other words, the researchers themselves discover intermediate representations by engineering task-specific features. These features are often derived from the output of preexisting systems, leading to complex runtime dependencies.
This approach is effective because researchers leverage a large body of linguistic knowledge. On the other hand, there is a great temptation to optimize the performance of a system for a specific benchmark. Although such performance improvements can be very useful in practice, they teach us little about the means to progress toward the broader goals of natural language understanding and the elusive goals of Artificial Intelligence.
In this contribution, we try to excel on multiple benchmarks while avoiding task-specific engineering. Instead we use a single learning system able to discover adequate internal representations.
In fact we view the benchmarks as indirect measurements of the relevance of the internal representations discovered by the learning procedure, and we posit that these intermediate representations are more general than any of the benchmarks. Our desire to avoid task-specific engineered features prevented us from using a large body of linguistic knowledge. Instead we reach good performance levels in most of the tasks by transferring intermediate representations discovered on large unlabeled data sets. We call this approach “almost from scratch” to emphasize the reduced (but still important) reliance on a priori NLP knowledge.
The paper is organized as follows. Section 2 describes the benchmark tasks of interest. Section 3 describes the unified model and reports benchmark results obtained with supervised training.
Section 4 leverages large unlabeled data sets (∼ 852 million words) to train the model on a language modeling task. Performance improvements are then demonstrated by transferring the unsupervised internal representations into the supervised benchmark models. Section 5 investigates multitask supervised training. Section 6 then evaluates how much further improvement can be achieved by incorporating standard NLP task-specific engineering into our systems. Drifting away from our initial goals gives us the opportunity to construct an all-purpose tagger that is simultaneously accurate, practical, and fast. We then conclude with a short discussion section.

2. The Benchmark Tasks
In this section, we briefly introduce four standard NLP tasks on which we will benchmark our architectures within this paper: Part-Of-Speech tagging (POS), chunking (CHUNK), Named Entity
Recognition (NER) and Semantic Role Labeling (SRL). For each of them, we consider a standard experimental setup and give an overview of state-of-the-art systems on this setup. The experimental setups are summarized in Table 1, while state-of-the-art systems are reported in Table 2.
2.1 Part-Of-Speech Tagging
POS aims at labeling each word with a unique tag that indicates its syntactic role, for example, plural noun, adverb, . . . A standard benchmark setup is described in detail by Toutanova et al. (2003).
Sections 0–18 of Wall Street Journal (WSJ) data are used for training, while sections 19–21 are for validation and sections 22–24 for testing.
The best POS classifiers are based on classifiers trained on windows of text, which are then fed to a bidirectional decoding algorithm during inference. Features include preceding and following

Task       Benchmark                  Data set   Training set (#tokens)       Test set (#tokens)                       (#tags)
POS        Toutanova et al. (2003)    WSJ        sections 0–18 (912,344)      sections 22–24 (129,654)                 45
Chunking   CoNLL 2000                 WSJ        sections 15–18 (211,727)     section 20 (47,377)                      42 (IOBES)
NER        CoNLL 2003                 Reuters    “eng.train” (203,621)        “eng.testb” (46,435)                     17 (IOBES)
SRL        CoNLL 2005                 WSJ        sections 2–21 (950,028)      section 23 + 3 Brown sections (63,843)   186 (IOBES)
Table 1: Experimental setup: for each task, we report the standard benchmark we used, the data set it relates to, as well as training and test information.

(a) POS
System                          Accuracy
Shen et al. (2007)              97.33%
Toutanova et al. (2003) *       97.24%
Giménez and Màrquez (2004)      97.16%

(b) CHUNK
System                          F1
Shen and Sarkar (2005)          95.23%
Sha and Pereira (2003) *        94.29%
Kudo and Matsumoto (2001)       93.91%

(c) NER
System                          F1
Ando and Zhang (2005) *         89.31%
Florian et al. (2003)           88.76%
Chieu (2003)                    88.31%

(d) SRL
System                          F1
Koomen et al. (2005) *          77.92%
Pradhan et al. (2005)           77.30%
Haghighi et al. (2005)          77.04%

Table 2: State-of-the-art systems on four NLP tasks. Performance is reported in per-word accuracy for POS, and F1 score for CHUNK, NER and SRL. Systems marked with an asterisk (*) will be referred to as benchmark systems in the rest of the paper (see Section 2.6).

tag context as well as multiple words (bigrams, trigrams, ...) context, and handcrafted features to deal with unknown words. Toutanova et al. (2003), who use maximum entropy classifiers and inference in a bidirectional dependency network (Heckerman et al., 2001), reach 97.24% per-word accuracy. Giménez and Màrquez (2004) proposed an SVM approach also trained on text windows, with bidirectional inference achieved with two Viterbi decoders (left-to-right and right-to-left). They obtained 97.16% per-word accuracy. More recently, Shen et al. (2007) pushed the state-of-the-art up to 97.33%, with a new learning algorithm they call guided learning, also for bidirectional sequence classification.


2.2 Chunking
Also called shallow parsing, chunking aims at labeling segments of a sentence with syntactic constituents such as noun or verb phrases (NP or VP). Each word is assigned only one unique tag, often encoded as a begin-chunk (e.g., B-NP) or inside-chunk tag (e.g., I-NP). Chunking is often evaluated using the CoNLL 2000 shared task.1 Sections 15–18 of WSJ data are used for training and section
20 for testing. Validation is achieved by splitting the training set.
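As an illustration (this example is ours, not taken from the benchmark data), the sentence “The cat sat on the mat” could be chunk-tagged as “The/B-NP cat/I-NP sat/B-VP on/B-PP the/B-NP mat/I-NP”, grouping the words into noun, verb and prepositional phrases.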
Kudoh and Matsumoto (2000) won the CoNLL 2000 challenge on chunking with a F1-score of 93.48%. Their system was based on Support Vector Machines (SVMs). Each SVM was trained in a pairwise classification manner, and fed with a window around the word of interest containing
POS and words as features, as well as surrounding tags. They perform dynamic programming at test time. Later, they improved their results up to 93.91% (Kudo and Matsumoto, 2001) using an ensemble of classifiers trained with different tagging conventions (see Section 3.3.3).
Since then, a certain number of systems based on second-order random fields were reported
(Sha and Pereira, 2003; McDonald et al., 2005; Sun et al., 2008), all reporting around 94.3% F1 score. These systems use features composed of words, POS tags, and tags.
More recently, Shen and Sarkar (2005) obtained 95.23% using a voting classifier scheme, where each classifier is trained on different tag representations2 (IOB, IOE, ...). They use POS features coming from an external tagger, as well as carefully hand-crafted specialization features which again change the data representation by concatenating some (carefully chosen) chunk tags or some words with their POS representation. They then build trigrams over these features, which are finally passed through a Viterbi decoder at test time.
2.3 Named Entity Recognition
NER labels atomic elements in the sentence into categories such as “PERSON” or “LOCATION”.
As in the chunking task, each word is assigned a tag prefixed by an indicator of the beginning or the inside of an entity. The CoNLL 2003 setup3 is a NER benchmark data set based on Reuters data.
The contest provides training, validation and testing sets.
Florian et al. (2003) presented the best system at the NER CoNLL 2003 challenge, with 88.76%
F1 score. They used a combination of various machine-learning classifiers. Features they picked included words, POS tags, CHUNK tags, prefixes and suffixes, a large gazetteer (not provided by the challenge), as well as the output of two other NER classifiers trained on richer data sets. Chieu
(2003), the second best performer of CoNLL 2003 (88.31% F1), also used an external gazetteer
(their performance goes down to 86.84% with no gazetteer) and several hand-chosen features.
Later, Ando and Zhang (2005) reached 89.31% F1 with a semi-supervised approach. They jointly trained a linear model on NER with a linear model on two auxiliary unsupervised tasks.
They also performed Viterbi decoding at test time. The unlabeled corpus was 27M words taken from Reuters. Features included words, POS tags, suffixes and prefixes or CHUNK tags, but overall were less specialized than CoNLL 2003 challengers.
1. See http://www.cnts.ua.ac.be/conll2000/chunking.
2. See Table 3 for tagging scheme details.
3. See http://www.cnts.ua.ac.be/conll2003/ner.


2.4 Semantic Role Labeling
SRL aims at giving a semantic role to a syntactic constituent of a sentence. In the PropBank
(Palmer et al., 2005) formalism one assigns roles ARG0-5 to words that are arguments of a verb
(or more technically, a predicate) in the sentence, for example, the following sentence might be tagged “[John]ARG0 [ate]REL [the apple]ARG1 ”, where “ate” is the predicate. The precise arguments depend on a verb’s frame and if there are multiple verbs in a sentence some words might have multiple tags. In addition to the ARG0-5 tags, there are several modifier tags such as ARGM-LOC
(locational) and ARGM-TMP (temporal) that operate in a similar way for all verbs. We picked
CoNLL 20054 as our SRL benchmark. It takes sections 2–21 of WSJ data as training set, and section 24 as validation set. A test set composed of section 23 of WSJ concatenated with 3 sections from the Brown corpus is also provided by the challenge.
State-of-the-art SRL systems consist of several stages: producing a parse tree, identifying which parse tree nodes represent the arguments of a given verb, and finally classifying these nodes to compute the corresponding SRL tags. This entails extracting numerous base features from the parse tree and feeding them into statistical models. Feature categories commonly used by these systems include (Gildea and Jurafsky, 2002; Pradhan et al., 2004):
• the parts of speech and syntactic labels of words and nodes in the tree;
• the node’s position (left or right) in relation to the verb;
• the syntactic path to the verb in the parse tree;
• whether a node in the parse tree is part of a noun or verb phrase;
• the voice of the sentence: active or passive;
• the node’s head word; and
• the verb sub-categorization.
Pradhan et al. (2004) take these base features and define additional features, notably the part-of-speech tag of the head word, the predicted named entity class of the argument, and features providing word sense disambiguation for the verb (they add 25 variants of 12 new feature types overall). This system is close to the state-of-the-art in performance. Pradhan et al. (2005) obtain 77.30% F1 with a system based on SVM classifiers and simultaneously using the two parse trees provided for the SRL task. In the same spirit, Haghighi et al. (2005) use log-linear models on each tree node, re-ranked globally with a dynamic algorithm. Their system reaches 77.04% using the five top Charniak parse trees. Koomen et al. (2005) hold the state-of-the-art with Winnow-like (Littlestone, 1988) classifiers, followed by a decoding stage based on an integer program that enforces specific constraints on SRL tags. They reach 77.92% F1 on CoNLL 2005, thanks to the five top parse trees produced by the
Charniak (2000) parser (only the first one was provided by the contest) as well as the Collins (1999) parse tree.
4. See http://www.lsi.upc.edu/˜srlconll.


2.5 Evaluation
In our experiments, we strictly followed the standard evaluation procedure of each CoNLL challenge for NER, CHUNK and SRL. In particular, we chose the hyper-parameters of our model according to a simple validation procedure (see Remark 8 later in Section 3.5), performed over the validation set available for each task (see Section 2). All three tasks are evaluated by computing the F1 scores over chunks produced by our models. The POS task is evaluated by computing the per-word accuracy, as is the case for the standard benchmark we refer to (Toutanova et al., 2003). We used the conlleval script5 for evaluating POS,6 NER and CHUNK. For SRL, we used the srl-eval.pl script included in the srlconll package.7
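For reference, the chunk-level F1 score used throughout is the standard harmonic mean of precision and recall computed over exactly matching chunks (this formula is standard and not specific to our systems):
$$F_1 = \frac{2 \cdot \text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}},$$
where precision is the fraction of predicted chunks that are correct and recall is the fraction of gold chunks that are retrieved.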
2.6 Discussion
When participating in an (open) challenge, it is legitimate to increase generalization by all means.
It is thus not surprising to see many top CoNLL systems using external labeled data, like additional
NER classifiers for the NER architecture of Florian et al. (2003) or additional parse trees for SRL systems (Koomen et al., 2005). Combining multiple systems or carefully tweaking features is also a common approach, as in the top chunking system (Shen and Sarkar, 2005).
However, when comparing systems, we do not learn anything about the quality of each system if they were trained with different labeled data. For that reason, we will refer to benchmark systems, that is, top existing systems which avoid the use of external labeled data and are well established in the NLP field: Toutanova et al. (2003) for POS and Sha and Pereira (2003) for chunking. For NER we consider Ando and Zhang (2005) as they used additional unlabeled data only. We picked
Koomen et al. (2005) for SRL, keeping in mind they use 4 additional parse trees not provided by the challenge. These benchmark systems will serve as baseline references in our experiments. We marked them in bold in Table 2.
We note that, for the four tasks considered in this work, the more complex tasks (with correspondingly lower accuracies) have best systems with more engineered features than the best systems for the simpler tasks. That is, the POS task is one of the simplest of our four tasks, and only has relatively few engineered features, whereas SRL is the most complex, and many kinds of features have been designed for it. This clearly has implications for as yet unsolved NLP tasks requiring more sophisticated semantic understanding than the ones considered here.

3. The Networks
All the NLP tasks above can be seen as tasks assigning labels to words. The traditional NLP approach is: extract from the sentence a rich set of hand-designed features which are then fed to a standard classification algorithm, for example, a Support Vector Machine (SVM), often with a linear kernel. The choice of features is a completely empirical process, mainly based first on linguistic intuition, and then trial and error, and the feature selection is task dependent, implying additional research for each new NLP task. Complex tasks like SRL then require a large number of possibly complex features (e.g., extracted from a parse tree), which can impact the computational cost, an important consideration for large-scale applications or applications requiring real-time response.
5. Available at http://www.cnts.ua.ac.be/conll2000/chunking/conlleval.txt.
6. We used the “-r” option of the conlleval script to get the per-word accuracy, for POS only.
7. Available at http://www.lsi.upc.es/˜srlconll/srlconll-1.1.tgz.


[Figure 1: Window approach network. A window of text around the word of interest (e.g., “cat sat on the mat”), described by features 1 through K per word, is passed through lookup tables LT_{W^1}, ..., LT_{W^K}; the resulting feature vectors are concatenated and fed to a linear layer (M^1, n^1_{hu} hidden units), a HardTanh non-linearity, and a final linear layer (M^2) with n^2_{hu} = #tags output units.]
Instead, we advocate a radically different approach: as input we will try to pre-process our features as little as possible and then use a multilayer neural network (NN) architecture, trained in an end-to-end fashion. The architecture takes the input sentence and learns several layers of feature extraction that process the inputs. The features computed by the deep layers of the network are automatically trained by backpropagation to be relevant to the task. We describe in this section a general multilayer architecture suitable for all our NLP tasks, which is generalizable to other NLP tasks as well.
Our architecture is summarized in Figure 1 and Figure 2. The first layer extracts features for each word. The second layer extracts features from a window of words or from the whole sentence, treating it as a sequence with local and global structure (i.e., it is not treated like a bag of words).
The following layers are standard NN layers.
3.1 Notations
We consider a neural network $f_\theta(\cdot)$, with parameters $\theta$. Any feed-forward neural network with $L$ layers can be seen as a composition of functions $f_\theta^l(\cdot)$, corresponding to each layer $l$:
$$f_\theta(\cdot) = f_\theta^L\big(f_\theta^{L-1}(\dots f_\theta^1(\cdot)\dots)\big).$$


[Figure 2: Sentence approach network. The whole input sentence (with padding), described by features 1 through K per word, is passed through lookup tables LT_{W^1}, ..., LT_{W^K}, then a convolution layer (M^1, n^1_{hu} hidden units), a max-over-time pooling layer, a linear layer (M^2, n^2_{hu} hidden units), a HardTanh non-linearity, and a final linear layer (M^3) with n^3_{hu} = #tags output units.]

In the following, we will describe each layer we use in our networks shown in Figure 1 and Figure 2.
We adopt a few notations. Given a matrix $A$, we denote $[A]_{i,j}$ the coefficient at row $i$ and column $j$ in the matrix. We also denote $\langle A \rangle_i^{d_{win}}$ the vector obtained by concatenating the $d_{win}$ column vectors around the $i$th column vector of matrix $A \in \mathbb{R}^{d_1 \times d_2}$:
$$\left(\langle A \rangle_i^{d_{win}}\right)^T = \left( [A]_{1,\,i-d_{win}/2} \dots [A]_{d_1,\,i-d_{win}/2}, \;\dots,\; [A]_{1,\,i+d_{win}/2} \dots [A]_{d_1,\,i+d_{win}/2} \right).$$

As a special case, $\langle A \rangle_i^1$ represents the $i$th column of matrix $A$. For a vector $v$, we denote $[v]_i$ the scalar at index $i$ in the vector. Finally, a sequence of elements $\{x_1, x_2, \dots, x_T\}$ is written $[x]_1^T$. The $i$th element of the sequence is $[x]_i$.
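As a small illustration of this notation, the following NumPy sketch (our own, with 0-based indexing and no handling of sentence boundaries, which the network handles later with special padding words) extracts $\langle A \rangle_i^{d_{win}}$ by concatenating the $d_{win}$ columns around column $i$:

```python
import numpy as np

def window(A, i, d_win):
    """<A>_i^{d_win}: concatenate the d_win columns of A centered on column i
    (0-based indexing; no boundary handling in this sketch)."""
    half = d_win // 2
    return A[:, i - half:i + half + 1].T.reshape(-1)  # column-by-column concatenation

A = np.arange(12).reshape(3, 4)   # d1 = 3, d2 = 4
print(window(A, 1, 3))            # columns 0, 1, 2 stacked into one length-9 vector
```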
3.2 Transforming Words into Feature Vectors
One of the key points of our architecture is its ability to perform well with the use of (almost⁸) raw words. The ability of our method to learn good word representations is thus crucial to our approach. For efficiency, words are fed to our architecture as indices taken from a finite dictionary $\mathcal{D}$. Obviously, a simple index does not carry much useful information about the word. However, the first layer of our network maps each of these word indices into a feature vector, by a lookup table operation. Given a task of interest, a relevant representation of each word is then given by the corresponding lookup table feature vector, which is trained by backpropagation, starting from a random initialization.⁹ We will see in Section 4 that we can learn very good word representations from unlabeled corpora. Our architecture allows us to take advantage of better trained word representations, by simply initializing the word lookup table with these representations (instead of randomly). More formally, for each word $w \in \mathcal{D}$, an internal $d_{wrd}$-dimensional feature vector representation is given by the lookup table layer $LT_W(\cdot)$:
$$LT_W(w) = \langle W \rangle_w^1,$$
where $W \in \mathbb{R}^{d_{wrd} \times |\mathcal{D}|}$ is a matrix of parameters to be learned, $\langle W \rangle_w^1 \in \mathbb{R}^{d_{wrd}}$ is the $w$th column of $W$ and $d_{wrd}$ is the word vector size (a hyper-parameter to be chosen by the user). Given a sentence or any sequence of $T$ words $[w]_1^T$ in $\mathcal{D}$, the lookup table layer applies the same operation for each word in the sequence, producing the following output matrix:
$$LT_W([w]_1^T) = \left( \langle W \rangle_{[w]_1}^1 \;\; \langle W \rangle_{[w]_2}^1 \;\; \dots \;\; \langle W \rangle_{[w]_T}^1 \right). \qquad (1)$$

This matrix can then be fed to further neural network layers, as we will see below.
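To make the lookup-table operation concrete, here is a minimal NumPy sketch of Equation (1); the dictionary, the dimension $d_{wrd}$ and the random initialization are illustrative placeholders, not the settings used in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative dictionary D: word -> index (the real dictionary is much larger).
D = {"the": 0, "cat": 1, "sat": 2, "on": 3, "mat": 4}

d_wrd = 50                                   # word vector size (hyper-parameter)
W = rng.normal(0.0, 0.01, (d_wrd, len(D)))   # lookup-table parameters, one column per word

def lookup(word_indices):
    """LT_W([w]_1^T): return the d_wrd x T matrix of word feature vectors (Equation 1)."""
    return W[:, word_indices]

sentence = ["the", "cat", "sat", "on", "the", "mat"]
X = lookup([D[w] for w in sentence])
print(X.shape)  # (50, 6): one column of features per word
```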
3.2.1 Extending to Any Discrete Features
One might want to provide features other than words if one suspects that these features are helpful for the task of interest. For example, for the NER task, one could provide a feature which says if a word is in a gazetteer or not. Another common practice is to introduce some basic pre-processing, such as word-stemming or dealing with upper and lower case. In this latter option, the word would then be represented by three discrete features: its lower case stemmed root, its lower case ending, and a capitalization feature.
Generally speaking, we can consider a word as represented by $K$ discrete features $w \in \mathcal{D}^1 \times \dots \times \mathcal{D}^K$, where $\mathcal{D}^k$ is the dictionary for the $k$th feature. We associate to each feature a lookup table $LT_{W^k}(\cdot)$, with parameters $W^k \in \mathbb{R}^{d^k_{wrd} \times |\mathcal{D}^k|}$, where $d^k_{wrd} \in \mathbb{N}$ is a user-specified vector size.
8. We did some pre-processing, namely lowercasing and encoding capitalization as another feature. With enough (unlabeled) training data, presumably we could learn a model without this processing. Ideally, an even more raw input would be to learn from letter sequences rather than words; however, we felt that this was beyond the scope of this work.
9. As any other neural network layer.


Given a word $w$, a feature vector of dimension $d_{wrd} = \sum_k d^k_{wrd}$ is then obtained by concatenating all lookup table outputs:
$$LT_{W^1,\dots,W^K}(w) = \begin{pmatrix} LT_{W^1}(w^1) \\ \vdots \\ LT_{W^K}(w^K) \end{pmatrix} = \begin{pmatrix} \langle W^1 \rangle^1_{w^1} \\ \vdots \\ \langle W^K \rangle^1_{w^K} \end{pmatrix}.$$

The matrix output of the lookup table layer for a sequence of words $[w]_1^T$ is then similar to (1), but where extra rows have been added for each discrete feature:
$$LT_{W^1,\dots,W^K}([w]_1^T) = \begin{pmatrix} \langle W^1 \rangle^1_{[w^1]_1} & \dots & \langle W^1 \rangle^1_{[w^1]_T} \\ \vdots & & \vdots \\ \langle W^K \rangle^1_{[w^K]_1} & \dots & \langle W^K \rangle^1_{[w^K]_T} \end{pmatrix}. \qquad (2)$$

These vector features in the lookup table effectively learn features for words in the dictionary. Now, we want to use these trainable features as input to further layers of trainable feature extractors, that can represent groups of words and then finally sentences.
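As a sketch of Equation (2), the following hypothetical example stacks two lookup tables, one for the lower-cased word and one for a capitalization feature; the dictionaries and vector sizes are again illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two illustrative discrete features: lower-cased word and a capitalization flag.
D_word = {"the": 0, "cat": 1, "sat": 2}
D_caps = {"lower": 0, "first_upper": 1, "all_upper": 2}

d_word, d_caps = 50, 5                       # per-feature vector sizes d^k_wrd
W_word = rng.normal(0.0, 0.01, (d_word, len(D_word)))
W_caps = rng.normal(0.0, 0.01, (d_caps, len(D_caps)))

def lookup_all(word_idx, caps_idx):
    """Stack the K lookup-table outputs row-wise, as in Equation (2)."""
    return np.vstack([W_word[:, word_idx], W_caps[:, caps_idx]])

# "The cat sat" -> indices of the lower-cased words, plus capitalization indices.
X = lookup_all([0, 1, 2], [1, 0, 0])
print(X.shape)  # (55, 3): d_wrd = sum_k d^k_wrd rows, one column per word
```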
3.3 Extracting Higher Level Features from Word Feature Vectors

Feature vectors produced by the lookup table layer need to be combined in subsequent layers of the neural network to produce a tag decision for each word in the sentence. Producing tags for each element in variable-length sequences (here, a sentence is a sequence of words) is a standard problem in machine learning. We consider two common approaches which tag one word at a time: a window approach, and a (convolutional) sentence approach.
3.3.1 W INDOW A PPROACH
A window approach assumes the tag of a word depends mainly on its neighboring words. Given a word to tag, we consider a fixed size $k_{sz}$ (a hyper-parameter) window of words around this word. Each word in the window is first passed through the lookup table layer (1) or (2), producing a matrix of word features of fixed size $d_{wrd} \times k_{sz}$. This matrix can be viewed as a $d_{wrd}\,k_{sz}$-dimensional vector by concatenating each column vector, which can be fed to further neural network layers. More formally, the word feature window given by the first network layer can be written as:
$$f_\theta^1 = \langle LT_W([w]_1^T) \rangle_t^{d_{win}} = \begin{pmatrix} \langle W \rangle^1_{[w]_{t-d_{win}/2}} \\ \vdots \\ \langle W \rangle^1_{[w]_t} \\ \vdots \\ \langle W \rangle^1_{[w]_{t+d_{win}/2}} \end{pmatrix}. \qquad (3)$$
Linear Layer. The fixed size vector $f_\theta^1$ can be fed to one or several standard neural network layers which perform affine transformations over their inputs:
$$f_\theta^l = W^l f_\theta^{l-1} + b^l, \qquad (4)$$
where $W^l \in \mathbb{R}^{n^l_{hu} \times n^{l-1}_{hu}}$ and $b^l \in \mathbb{R}^{n^l_{hu}}$ are the parameters to be trained. The hyper-parameter $n^l_{hu}$ is usually called the number of hidden units of the $l$th layer.

HardTanh Layer. Several linear layers are often stacked, interleaved with a non-linearity function, to extract highly non-linear features. If no non-linearity is introduced, our network would be a simple linear model. We chose a “hard” version of the hyperbolic tangent as non-linearity. It has the advantage of being slightly cheaper to compute (compared to the exact hyperbolic tangent), while leaving the generalization performance unchanged (Collobert, 2004). The corresponding layer $l$ applies a HardTanh over its input vector:
$$[f_\theta^l]_i = \mathrm{HardTanh}\!\left([f_\theta^{l-1}]_i\right),$$
where
$$\mathrm{HardTanh}(x) = \begin{cases} -1 & \text{if } x < -1 \\ x & \text{if } -1 \le x \le 1 \\ 1 & \text{if } x > 1. \end{cases}$$
 −1 if x < −1 x if − 1

Similar Documents