
Transformer visualization via dictionary learning: contextualized embedding as a linear superposition of transformer factors

Zeyu Yun$^{*2}$, Yubei Chen$^{*1,2}$, Bruno A Olshausen$^{2,4}$, Yann LeCun$^{1,3}$

$^{1}$ Facebook AI Research, $^{2}$ Berkeley AI Research (BAIR), UC Berkeley, $^{3}$ New York University, $^{4}$ Redwood Center for Theoretical Neuroscience, UC Berkeley

Abstract

Transformer networks have revolutionized NLP representation learning since they were introduced. Though a great effort has been made to explain the representation in transformers, it is widely recognized that our understanding is not sufficient. One important reason is the lack of visualization tools for detailed analysis. In this paper, we propose to use dictionary learning to open up these 'black boxes' as linear superpositions of transformer factors. Through visualization, we demonstrate the hierarchical semantic structures captured by the transformer factors, e.g., word-level polysemy disambiguation, sentence-level pattern formation, and long-range dependency. While some of these patterns confirm conventional prior linguistic knowledge, the rest are relatively unexpected, which may provide new insights. We hope this visualization tool can bring further knowledge and a better understanding of how transformer networks work. The code is available at https://github.com/zeyuyun1/TransformerVis.

Introduction

∗ Equal contribution. Correspondence to: Zeyu Yun <chobitstian@berkeley.edu>, Yubei Chen <yubeic@{fb.com, berkeley.edu}>

Though transformer networks (Vaswani et al., 2017; Devlin et al., 2018) have achieved great success, our understanding of how they work is still fairly limited. This has triggered increasing efforts to visualize and analyze these 'black boxes'. Besides a direct visualization of the attention weights, most of the current efforts to interpret transformer models involve 'probing tasks'. These are carried out by attaching a lightweight auxiliary classifier to the output of the target transformer layer. Then only the auxiliary classifier is trained for well-known NLP tasks like part-of-speech (POS) tagging, named-entity recognition (NER) tagging, syntactic dependency, etc. Tenney et al. (2019) and Liu et al. (2019) show that transformer models have excellent performance in those probing tasks. These results indicate that transformer models have learned the language representation related to the probing tasks. Though the probing tasks are great tools for interpreting language models, their limitations are explained in Rogers et al. (2020). We summarize the limitations in three major points:

· Most probing tasks, like POS and NER tagging, are too simple. A model that performs well in those probing tasks does not reflect the model's true capacity.
· Probing tasks can only verify whether a certain prior structure is learned in a language model. They cannot reveal the structures beyond our prior knowledge.
· It's hard to locate where exactly the related linguistic representation is learned in the transformer.

Efforts have been made to remove those limitations and make probing tasks more diverse. For instance, Hewitt and Manning (2019) propose the 'structural probe', which is a much more intricate probing task. Jiang et al. (2020) propose to generate specific probing tasks automatically. Non-probing methods are also explored to relieve the last two limitations. For example, Reif et al. (2019) visualize embeddings from BERT using UMAP and show that the embeddings of the same word under different contexts are separated into different clusters. Ethayarajh (2019) analyzes the similarity between embeddings of the same word in different contexts. Both of these works show that transformers provide a context-specific representation.

Faruqui et al. (2015); Arora et al. (2018); Zhang et al. (2019) demonstrate how to use dictionary learning to explain, improve, and visualize the uncontextualized word embedding representations. In this work, we propose to use dictionary learning to alleviate the limitations of the other transformer interpretation techniques. Our results show that dictionary learning provides a powerful visualization tool, leading to some surprising new knowledge.

Method

Hypothesis: contextualized word embedding as a sparse linear superposition of transformer factors. It is shown that word embedding vectors can be factorized into a sparse linear combination of word factors (Arora et al., 2018; Zhang et al., 2019), which correspond to elementary semantic meanings. An example is:

We view the latent representation of words in a transformer as contextualized word embedding. Similarly, we hypothesize that a contextualized word embedding vector can also be factorized as a sparse linear superposition of a set of elementary elements, which we call transformer factors. The exact definition will be presented later in this section.

Figure 1: Building block (layer) of transformer

Due to the skip connections in each of the transformer blocks, we hypothesize that the representation in any layer would be a superposition of the hierarchical representations in all of the lower layers. As a result, the output of a particular transformer block would be the sum of all of the modifications along the way. Indeed, we verify this intuition with the experiments. Based on the above observation, we propose to learn a single dictionary for the contextualized word vectors from different layers' output.

The Details of the Non-negative Sparse Coding Optimization

Given a set of tokenized text sequences, we collect the contextualized embedding of every word using a transformer model. We define the set of all word embedding vectors from the $l$-th layer of the transformer model as $X^{(l)}$. Furthermore, we collect the embeddings across all layers into a single set $X = X^{(1)} \cup X^{(2)} \cup \cdots \cup X^{(L)}$.

By our hypothesis, we assume each embedding vector $x \in X$ is a sparse linear superposition of transformer factors:

$$x = \Phi \alpha + \epsilon, \quad \text{s.t.} \;\; \alpha \succeq 0,$$

where $\Phi \in \mathbb{R}^{d \times m}$ is a dictionary matrix with columns $\Phi_{:,c}$, $\alpha \in \mathbb{R}^{m}$ is a sparse vector of coefficients to be inferred, and $\epsilon$ is a vector containing independent Gaussian noise samples, which are assumed to be small relative to $x$. Typically $m > d$, so that the representation is overcomplete. This inverse problem can be efficiently solved by the FISTA algorithm (Beck and Teboulle, 2009). The dictionary matrix $\Phi$ can be learned in an iterative fashion by using non-negative sparse coding, which we leave to appendix section C. Each column $\Phi_{:,c}$ of $\Phi$ is a transformer factor, and its corresponding sparse coefficient $\alpha_c$ is its activation level.
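The inference step can be illustrated with a minimal sketch of non-negative sparse coding solved by FISTA. This is not the authors' implementation: the function name `nonneg_fista`, the sparsity weight `lam`, and the iteration count are assumptions chosen for illustration, and the dictionary is taken as given rather than learned.

```python
import numpy as np

def nonneg_fista(x, Phi, lam=0.1, n_iter=200):
    """Infer a non-negative sparse code alpha with x ≈ Phi @ alpha.

    Minimizes 0.5 * ||x - Phi @ alpha||^2 + lam * sum(alpha) subject to alpha >= 0.
    lam and n_iter are illustrative assumptions, not values from the paper.
    """
    m = Phi.shape[1]
    # Step size 1/L, where L is the Lipschitz constant of the gradient
    # (the squared spectral norm of Phi).
    L = np.linalg.norm(Phi, 2) ** 2
    alpha = np.zeros(m)
    y = alpha.copy()
    t = 1.0
    for _ in range(n_iter):
        grad = Phi.T @ (Phi @ y - x)
        # Proximal step: shrink by lam / L and clip at zero.
        alpha_next = np.maximum(y - (grad + lam) / L, 0.0)
        # Momentum update of FISTA.
        t_next = (1.0 + np.sqrt(1.0 + 4.0 * t * t)) / 2.0
        y = alpha_next + ((t - 1.0) / t_next) * (alpha_next - alpha)
        alpha, t = alpha_next, t_next
    return alpha

# Toy usage with a random overcomplete dictionary (m > d) and a 3-sparse code.
rng = np.random.default_rng(0)
Phi = rng.normal(size=(20, 50))
Phi /= np.linalg.norm(Phi, axis=0)      # unit-norm columns
alpha_true = np.zeros(50)
alpha_true[[3, 17, 41]] = 1.0
x = Phi @ alpha_true
alpha_hat = nonneg_fista(x, Phi, lam=0.01, n_iter=500)
```

In this sketch, each entry of `alpha_hat` plays the role of the activation level $\alpha_c$ of one dictionary column.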

Visualization by top activation and LIME interpretation. An important empirical method to visualize a feature in deep learning is to use the input samples that trigger the top activation of the feature (Zeiler and Fergus, 2014). We adopt this convention. As a starting point, we try to visualize each of the dimensions of a particular layer, $X^{(l)}$. Unfortunately, the hidden dimensions of transformers are not semantically meaningful, which is similar to the uncontextualized word embeddings (Zhang et al., 2019).

Instead, we can try to visualize the transformer factors. For a transformer factor $\Phi_{:,c}$ and for a layer $l$, we denote the 1000 contextualized word vectors with the largest sparse coefficients $\alpha^{(l)}_c$ as $X^{(l)}_c \subset X^{(l)}$, which correspond to 1000 different sequences. For example, Figure 3 shows the top 5 words that activated transformer factor-17, $\Phi_{:,17}$, at layer 0, layer 2, and layer 6 respectively. Since a contextualized word vector is generally affected by many tokens in the sequence, we can use LIME (Ribeiro et al., 2016) to assign a weight to each token in the sequence to identify its relative importance to $\alpha_c$. The detailed method is left to Section 3.

To determine low-, mid-, and high-level transformer factors with importance score. As we build a single dictionary for all of the transformer layers, the semantic meaning of the transformer factors has different levels. While some of the factors appear in lower layers and continue to be used in the later stages, the rest of the factors may only be activated in the higher layers of the transformer network. A central question in representation learning is: 'where does the network learn certain information?' To answer this question, we can compute an 'importance score' $I^{(l)}_c$ for each transformer factor $\Phi_{:,c}$ at layer $l$. $I^{(l)}_c$ is the average of the largest 1000 sparse coefficients $\alpha^{(l)}_c$, which correspond to $X^{(l)}_c$. We plot the importance scores of each transformer factor as a curve, as shown in Figure 2. We then use these importance score (IS) curves to identify the layer at which a transformer factor emerges. Figure 2a shows an IS curve that peaks in an earlier layer. The corresponding transformer factor emerges in the earlier stage and may capture lower-level semantic meanings. In contrast, Figure 2b shows a peak in the higher layers, which indicates that the transformer factor emerges much later and may correspond to mid- or high-level semantic structures. More subtleties are involved when distinguishing between mid-level and high-level factors, which will be discussed later.
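The importance score and the top-activation selection can be sketched as follows. This is an illustration under assumptions: the helper names `importance_scores` and `top_activating_tokens` are hypothetical, the sparse codes of each layer are assumed to be stored as an array of shape (number of tokens, number of factors), and the sketch does not enforce that the selected tokens come from different sequences.

```python
import numpy as np

def importance_scores(alpha_by_layer, k=1000):
    """Mean of the k largest coefficients of each factor in each layer.

    alpha_by_layer: list with one array per layer, each of shape (n_tokens, m).
    Returns an array of shape (n_layers, m); column c is the IS curve of factor c.
    """
    scores = []
    for A in alpha_by_layer:
        kk = min(k, A.shape[0])
        # Sort coefficients over tokens (per factor) and keep the kk largest.
        top = np.sort(A, axis=0)[-kk:, :]
        scores.append(top.mean(axis=0))
    return np.stack(scores)

def top_activating_tokens(A, c, k=1000):
    """Indices of the k tokens with the largest coefficient for factor c, largest first."""
    kk = min(k, A.shape[0])
    return np.argsort(A[:, c])[::-1][:kk]

# Toy usage: one layer, three tokens, two factors.
A0 = np.array([[1.0, 0.0],
               [3.0, 0.0],
               [2.0, 5.0]])
scores = importance_scores([A0], k=2)
top_idx = top_activating_tokens(A0, c=0, k=2)
```

Plotting one column of the returned array against the layer index gives a curve of the kind described above.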

An important characteristic is that the IS curve for each transformer factor is relatively smooth. This indicates that if a vital feature is learned in the beginning layers, it won't disappear in later stages. Instead, it will be carried all the way to the end with gradually decaying weight, since many more features join along the way. Similarly, abstract information learned in higher layers is slowly developed from the early layers. Figures 3 and 5 confirm this idea, as explained in the next section.

Experiments and Discoveries

We use a 12-layer pre-trained BERT model (Pretrained BERT base model; Devlin et al., 2018) and freeze the weights. Since we learn a single dictionary of transformer factors for all of the layers in the transformer, we show that these transformer factors correspond to different levels of semantic or syntactic patterns. The patterns can be roughly divided into three categories: word-level disambiguation, sentence-level pattern formation, and long-range dependency. In the following, we provide detailed visualization for each pattern category. Due to the space limit, only a small number of the factors are demonstrated in the paper. To alleviate the 'cherry-picking' bias, we also build a website for interested readers to play with these results.

Figure 2: Importance score (IS) across all layers for two different transformer factors. (a) A typical IS curve of a transformer factor corresponding to low-level information. (b) A typical IS curve of a transformer factor corresponding to mid-level information.

Low-level: word-level polysemy disambiguation. While the input embedding of a token contains polysemy, we find transformer factors with early IS curve peaks usually correspond to a specific word-level meaning. By visualizing the top activation sequences, we can see how word-level disambiguation is gradually developed in a transformer.

We show how the disambiguation effect develops progressively through each layer in Figure 3, which lists the top 5 activated words and their contexts for transformer factor $\Phi_{:,30}$ in different layers. The top activated words in layer 0 contain the word 'left' with varying senses, which is mostly, albeit not completely, disambiguated in layer 2. In layer 4, the word 'left' is fully disambiguated, since the top-activated words contain only 'left' with the word sense 'leaving, exiting.' We also show more examples of this type of transformer factor in Table 1: for each transformer factor, we list the top 3 activated words and their contexts in layer 4. As shown in the table, nearly all top-activated words are disambiguated into a single sense.

(a) layer 0

  • he <unk> shortly to return to italy where he left his family
  • banks who bet big in assisting gulf to defeat mesa only to be left broke when gulf backed out.
  • with huggins at the helm, roundly defeated the rump left wing of the reform party to begin 28 years of unint
  • level; the robe is held at the right thigh by the left hand, and the legs are shapeless.
  • end), howard felver( quarterback), caley( left halfback), gustave ferbert( right halfback

(b) layer 2

  • in from oklahoma and northeast texas, dissipating what was left of this tropical depression by september 2.
  • learning to fly and had completed nearly 250 hours by the time he left america.
  • years later, in 1967, lefty is incorrectly told that that year' s detroit riots were
  • sang told reporters from nature that about a dozen of the fossils had left china illegally.
  • allegiance; the ten who refused were taken to newgate prison and left to starve.

(c) layer 4

  • in getting the naval officers into his house, and the mob eventually left.
  • all of the federal troops had left at this point, except totten who had stayed behind to
  • saying that he has left the outsiders, kovu asks simba to let him join
  • eventually, all boycott' s employees left, forcing him to run the estate without help.
  • story concerned the attempts of a scientist to photograph the soul as it left the body.

Figure 3: Visualization of a low-level transformer factor, $\Phi_{:,30}$, at different layers. (a), (b) and (c) are the top-activated words and contexts for $\Phi_{:,30}$ in layer 0, 2 and 4 respectively. We can see that at layer 0, this transformer factor corresponds to word vectors that encode the word 'left' with different senses. In layer 2, a majority of the top activated words 'left' correspond to a single sense, 'leaving, exiting.' In layer 4, all of the top-activated words 'left' correspond to the same sense, 'leaving, exiting.' Due to space limitations, we invite the readers to use our website to see more of those disambiguation effects.

Table 1: Several examples of low-level transformer factors. Their top-activated words in layer 4 are marked blue, and the corresponding contexts are shown as examples for each transformer factor. As shown in the table, nearly all of the top-activated words are disambiguated into a single sense. Please note that the last example of $\Phi_{:,33}$ is a rare exception; the reader may check the appendix to see a more complete list. More examples, top-activated words and contexts are provided in the Appendix.

Figure 4: (a) Average activation of $\Phi_{:,30}$ for word vector 'left' across different layers. (b) Instead of averaging, we plot the activation of all 'left' with different contexts in layer 0, 2, and 4. Random noise is added to the y-axis to prevent overplotting. The activations of $\Phi_{:,30}$ for two different word senses of 'left' are blended together in layer 0. They disentangle to a great extent in layer 2 and are nearly separable in layer 4 by this single dimension.

Table 2: Evaluation of binary POS tagging task: predict whether or not 'left' in a given context is a verb.

  • wrote that" stankonia reeks of artful ambition rendered with impeccable skil
  • stated similar pros, describing the soundtrack as" suspenseful, dynamic and always adrenaline charged."
  • with guest performances from vocalists frances maya and susan calloway, among others. the concert premiered several
  • it was an attitude of mind which tempered the sternness of his approach with an engaging humour and all very effectively done, creepily atmospheric and splendidly gruesome"
  • a film that’ s stylish, breezily entertaining, and surprisingly sweet. on metacritic
  • consensus reads' charming, audacious, and timely, wall@-@ e's light
  • .@ 1 out of 10, as being" beautiful, charismatic, engaging and one of the most
  • that sinatra is simply superb, comical, pitiful, childishly brave,
  • everything madonna has been denounced for being — meticulous, calculated, domineering and artificial.
  • new york times called it a" zany, lively, uninhibited, sexual odyssey
  • she stated that the show was" breezy and entertaining and reasonably clever, at least when its sherlock
  • of five stars, called it" a dark and delicious delight[ and] a must@-@
  • episodes, saying, moments so shocking that you © t's smart, entertaining, and has
  • full of exhilarating, ecstatic, thrilling, fun and

Figure 5: Visualization of a mid-level transformer factor. (a), (b), (c) are the top 5 activated words and contexts for this transformer factor in layer 4, 6, and 8 respectively. Again, the position of the word vector is marked blue. Please notice that sometimes only a part of a word is marked blue. This is because BERT uses a word-piece tokenizer instead of a whole-word tokenizer. This transformer factor corresponds to the pattern of 'consecutive adjective'. As shown in the figure, this feature starts to develop at layer 4 and fully develops at layer 8.

Further, we can quantify the quality of the disambiguation ability of the transformer model. In the example above, since the top 1000 activated words and contexts are 'left' with only the word sense 'leaving, exiting', we can assume that 'left', when used as a verb, triggers a higher activation of $\Phi_{:,30}$ than 'left' used as another part of speech. We can verify this hypothesis using a human-annotated corpus, the Brown corpus (Francis and Kucera, 1979). In this corpus, each word is annotated with its corresponding part of speech. We collect all the sentences that contain the word 'left' annotated as a verb in one set, and the sentences that contain 'left' annotated as another part of speech in a second set. As shown in Figure 4a, in layer 0, the average activation of $\Phi_{:,30}$ for the word 'left' marked as a verb is no different from that of 'left' in other senses. However, at layer 2, 'left' marked as a verb triggers a higher activation of $\Phi_{:,30}$. In layer 4, this difference further increases, indicating that disambiguation develops progressively across layers. In fact, we plot the activation of 'left' marked as a verb and the activation of other 'left' in Figure 4b. In layer 4, they are nearly linearly separable by this single feature. Since each word 'left' corresponds to an activation value, we can perform a logistic regression classification to differentiate those two types of 'left'. From the result shown in Table 2, it is pretty fascinating to see that the disambiguation ability of just $\Phi_{:,30}$ is better than that of the other two classifiers trained with supervised data. This result confirms that disambiguation is indeed done in the early part of the pre-trained transformer model and that we are able to detect it via dictionary learning.
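The single-feature classification described above can be sketched as a one-dimensional logistic regression. The activations below are synthetic stand-ins drawn from two Gaussians, not values measured on the Brown corpus, and the function names, learning rate, and step count are assumptions for illustration only.

```python
import numpy as np

def fit_logistic_1d(a, y, lr=0.5, n_steps=2000):
    """Fit p(y = 1 | a) = sigmoid(w * a + b) by gradient descent on the mean log loss."""
    w, b = 0.0, 0.0
    for _ in range(n_steps):
        p = 1.0 / (1.0 + np.exp(-(w * a + b)))
        w -= lr * np.mean((p - y) * a)
        b -= lr * np.mean(p - y)
    return w, b

def predict(a, w, b):
    """Predict 1 (verb) when the logit is positive, else 0."""
    return (w * a + b > 0).astype(int)

# Synthetic stand-in for the activation of one factor on each occurrence of 'left'.
# These numbers are made up for illustration and are NOT results from the paper.
rng = np.random.default_rng(0)
act_verb = rng.normal(3.0, 0.5, size=200)    # 'left' annotated as a verb
act_other = rng.normal(0.5, 0.5, size=200)   # 'left' with another part of speech
a = np.concatenate([act_verb, act_other])
y = np.concatenate([np.ones(200), np.zeros(200)])

w, b = fit_logistic_1d(a, y)
accuracy = np.mean(predict(a, w, b) == y)
```

With real data, `a` would hold the layer-4 coefficient of the factor for each annotated occurrence, and the accuracy would be measured on held-out sentences rather than on the training set as done here.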

Mid-level: sentence-level pattern formation. We find that most of the transformer factors with an IS curve peak after layer 6 capture mid-level or high-level semantic meanings. In particular, the mid-level ones correspond to semantic patterns like phrase and sentence patterns.

We first show two detailed examples of mid-level transformer factors. Figure 5 shows a transformer factor that detects the pattern of consecutive usage of adjectives. This pattern starts to emerge at layer 4, develops at layer 6, and becomes quite reliable at layer 8. Figure 6 shows a transformer factor that corresponds to a pretty unexpected pattern: 'unit exchange', e.g., 56 inches (140 cm). Although this exact pattern only starts to appear at layer 8, the sub-structures that make up this pattern, e.g., parentheses and numbers, appear to trigger this factor in layers 4 and 6. Thus this transformer factor is also gradually developed through several layers.

layer 4

  • — 1 at home to end york' s three@-@ match run without a win, with all the team' s goals coming in the first half from carson, fletcher and brobbel( 2).
  • football league second division( 2): 1932 — 33, 1967 — 63
  • the journal of modern history 33( 2): 148 — 156@.
  • football league first division runner@-@ up( 1): 1955 — 56
  • the journal of modern history 40( 2): 155 — 165@.

layer 6

  • the spy next door( larry( lucas till))
  • hannah montana: the movie( travis brody( lucas
  • percy jackson& the olympians: the lightning thief( percy jackson( logan lerman))
  • charlie and the chocolate factory( willy wonka( johnny depp))
  • harry potter film series( percy weasley( chris rankin))

layer 8

  • @-@ 16 hand( 56 to 64 inches( 140 to 160 cm)) war horse is that it was a matter of pride
  • moving above 83 ° f( 28 ° c) sea surface temperatures, kathleen quickly strengthened.
  • at prudhoe bay was more than 120 ° f( 49 ° c) degrees, and there was a danger that if it
  • @ 3@-@ inch( 160 mm) calibre steel barrel.
  • outdoor seating area( 4@,@ 300 square feet( 400 m2)) and a 2@,@ 500@-@

Figure 6: Another example of a mid-level transformer factor visualized at layer 4, 6, and 8. The pattern that corresponds to this transformer factor is 'unit exchange'. Such a pattern is somewhat unexpected based on linguistic prior knowledge.

Table 3: A list of typical mid-level transformer factors. The top-activation words and their context sequences for each transformer factor at layer 8 are shown in the second column. We summarize the patterns of each transformer factor in the third column. The last 4 columns are the percentage of the top 200 activated words and sequences that contain the summarized patterns in layer 4, 6, 8, and 10 respectively.

While some mid-level transformer factors verify common semantic or syntactic patterns, there are also many surprising mid-level transformer factors. We list a few in Table 3 with quantitative analysis.

For each listed transformer factor, we analyze the top 200 activating words and their contexts in each layer. We record the percentage of those words and contexts that correspond to the factor's semantic pattern in Table 3. From the table, we see that large percentages of top-activated words and contexts do correspond to the pattern we describe. It also shows that most of these mid-level patterns start to develop at layer 4 or 6. More detailed examples are provided in appendix section F. Though it is still mysterious why the transformer network develops representations for these surprising patterns, we believe such a direct visualization can provide additional insights, which complement the 'probing tasks'.

Table 4: We construct adversarial texts similar to, but different from, the pattern 'consecutive adjective'. The last column shows the activation of $\Phi_{:,35}$, or $\alpha^{(8)}_{35}$, w.r.t. the blue-marked word in layer 8.

To further confirm that a transformer factor does correspond to a specific pattern, we can use constructed example words and contexts to probe its activation. In Table 4, we construct several text sequences that are similar to the patterns corresponding to a particular transformer factor but with subtle differences. The result confirms that a context that strictly follows the pattern represented by that transformer factor triggers a high activation. Moreover, the closer an adversarial example is to this pattern, the higher the activation it receives at this transformer factor.

High-level: long-range dependency. High-level transformer factors correspond to those linguistic patterns that span an extended range in the text. Since the IS curves of mid-level and high-level transformer factors are similar, it is difficult to distinguish those transformer factors based on their IS curves. Thus, we have to manually examine the top-activation words and contexts for each transformer factor to differentiate between mid-level and high-level transformer factors. To ease the process, we choose to use the black-box interpretation algorithm LIME (Ribeiro et al., 2016) to identify the contribution of each token in a sequence. There also exist interpretation tools that specifically leverage the transformer architecture (Chefer et al., 2021, 2020). In the future, one could adapt those interpretation tools, which may potentially provide better visualization.

Given a sequence $s \in S$, we can treat $\alpha^{(l)}_{c,i}$, the activation of $\Phi_{:,c}$ in layer $l$ at location $i$, as a scalar function of $s$, $f^{(l)}_{c,i}(s)$. Assume a sequence $s$ triggers a high activation $\alpha^{(l)}_{c,i}$, i.e., $f^{(l)}_{c,i}(s)$ is large. We want to know how much each token (or equivalently each position) in $s$ contributes to $f^{(l)}_{c,i}(s)$. To do so, we generate a sequence set $S(s)$, where each $s' \in S(s)$ is the same as $s$ except that several random positions in $s'$ are masked by ['UNK'] (the unknown token). Then we learn a linear model $g_w(s')$ with weights $w \in \mathbb{R}^{T}$ to approximate $f(s')$, where $T$ is the length of sentence $s$. This can be solved as a ridge regression:

The learned weights $w$ can serve as a saliency map that reflects the 'contribution' of each token in the sequence $s$. As in Figure 7, the color reflects the weights $w$ at each position. Red means the given position has a positive weight and green means a negative weight. The magnitude of the weight is represented by the intensity. The redder a token is, the more it contributes to the activation of the transformer factor. We leave more implementation and mathematical formulation details of the LIME algorithm to the appendix.
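The masking-and-regression procedure can be sketched as follows. This is a simplified illustration and not the LIME library or the authors' code: the function name `lime_token_weights`, the masking probability, the number of samples, and the ridge strength are assumptions, and the sketch omits the proximity weighting that LIME normally applies to perturbed samples. The black-box function `f` is a toy stand-in for the activation of a transformer factor.

```python
import numpy as np

def lime_token_weights(f, tokens, n_samples=500, mask_prob=0.3, ridge=1.0,
                       unk="[UNK]", seed=0):
    """Estimate the contribution of each token to a scalar black-box function f(tokens)."""
    rng = np.random.default_rng(seed)
    T = len(tokens)
    # Z[j, t] = 1 if token t is kept in perturbed sample j, 0 if it is masked.
    Z = (rng.random((n_samples, T)) > mask_prob).astype(float)
    y = np.array([
        f([tok if keep else unk for tok, keep in zip(tokens, row)])
        for row in Z
    ])
    # Ridge regression with an unpenalized intercept: center Z and y, then solve
    # (Zc^T Zc + ridge * I) w = Zc^T yc in closed form.
    Zc = Z - Z.mean(axis=0)
    yc = y - y.mean()
    return np.linalg.solve(Zc.T @ Zc + ridge * np.eye(T), Zc.T @ yc)

# Toy black box: the "activation" rises when token 'b' is present
# and falls when token 'd' is present.
def toy_activation(toks):
    return 2.0 * (toks[1] == "b") - 1.0 * (toks[3] == "d")

weights = lime_token_weights(toy_activation, ["a", "b", "c", "d", "e"])
```

In this toy case the recovered weights are close to 2 for 'b', close to -1 for 'd', and close to 0 elsewhere, which is the kind of per-token saliency map that the colors in the figures encode.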

We provide detailed visualization for two different transformer factors that show long-range dependency in Figures 7 and 8. Since visualization of high-level information requires more extended context, we only offer the top two activated words and their contexts for each such transformer factor. Many more are provided in appendix section G.

We name the pattern for transformer factor $\Phi_{:,297}$ in Figure 7 the 'repetitive pattern detector'. All top activated contexts for $\Phi_{:,297}$ contain an obvious repetitive structure. Specifically, the text snippet 'can't get you out of my head' appears twice in the first example, and the text snippet 'xxx class passenger, star alliance' appears three times in the second example. Compared to the patterns we found in the mid-level [6], high-level patterns like the 'repetitive pattern detector' are much more abstract. In some sense, the transformer detects whether there are two (or multiple) almost identical embedding vectors at layer 10 without caring what they are. Such behavior might be highly related to the concept proposed in capsule networks (Sabour et al., 2017; Hinton, 2021). Further understanding this behavior and studying how the self-attention mechanism helps model the relationships between the features is an interesting future research direction.

Figure 8 shows another high-level factor, which detects text snippets related to 'the beginning of a biography'. The necessary components (date of birth given as a month and a four-digit year, first name and last name, familial relation, and career) are all mid-level information. In Figure 8, we see that all the information related to biography has a high weight in the saliency map. Thus, they are all combined to detect the high-level pattern.


Figure 7: Two examples of the highly activated words and their contexts for transformer factor $\Phi_{:,297}$. We also provide the saliency map of the tokens generated using LIME. This transformer factor corresponds to the concept 'repetitive pattern detector'. In other words, repetitive text sequences will trigger high activation of $\Phi_{:,297}$.


Figure 8: Visualization of $\Phi_{:,322}$. This transformer factor corresponds to the concept 'some born in some year' in biography. All of the high-activation contexts contain the beginning of a biography. As shown in the figure, the attributes of a person (name, age, career, and familial relation) all have high saliency weights.

Discussion

Dictionary learning has been successfully used to visualize classical word embeddings (Arora et al., 2018; Zhang et al., 2019). In this paper, we propose to use this simple method to visualize the representation learned in transformer networks to supplement the implicit 'probing-task' methods. Our results show that the learned transformer factors are relatively reliable and can even provide many surprising insights into the linguistic structures. This simple tool can open up transformer networks and show the hierarchical semantic or syntactic representation learned at different stages. In short, we find word-level disambiguation, sentence-level pattern formation, and long-range dependency. The idea that a neural network learns low-level features in early layers and abstract concepts in later stages is very similar to the visualization in CNNs (Zeiler and Fergus, 2014). Dictionary learning can be a convenient tool to help visualize a broad category of neural networks with skip connections, like ResNet (He et al., 2016), ViT models (Dosovitskiy et al., 2020), etc. For interested readers, we provide an interactive website to gain further insights.

Acknowledgements

We thank our reviewers for their detailed and insightful comments. We also thank Yuhao Zhang for his suggestions during the preparation of this paper.

References

Pretrained BERT base model (12 layers). https://huggingface.co/bert-base-uncased.

Juexiao Zhang, Yubei Chen, Brian Cheung, and Bruno A Olshausen. 2019. Word embedding visualization via dictionary learning. arXiv preprint arXiv:1910.03833.

Supplementary Materials

Importance Score (IS) Curves

Figure 9: (a) Importance scores of 16 transformer factors corresponding to low-level information. (b) Importance scores of 16 transformer factors corresponding to mid-level information.

The shape of an importance score curve corresponds strongly to the category of its transformer factor. Based on the location of the peak of an IS curve, we can classify a transformer factor as low-level, mid-level, or high-level. The importance score of a low-level transformer factor peaks in the early layers and slowly decreases across the rest of the layers. In contrast, the importance score of a mid-level or high-level transformer factor slowly increases and peaks at higher layers. In Figure 9, we show two sets of examples to demonstrate the clear distinction between those two types of IS curves.

Taking a step back, we can also plot the IS curve for each dimension of the word vectors (without sparse coding) at different layers. These curves do not show any specific pattern, as shown in Figure 10. This makes intuitive sense, since, as mentioned earlier, the individual entries of a contextualized word embedding do not correspond to any clear semantic meaning.


LIME: Local Interpretable Model-Agnostic Explanations

After training the dictionary $\Phi$ through non-negative sparse coding, the inference of the sparse code of a given input is

For a given sentence and index pair $(s, i)$, the embedding of the word $w = s[i]$ at layer $l$ of the transformer is $x^{(l)}(s, i)$. We can then abstract the inference of a specific entry of the sparse code of this word vector as a black-box scalar-valued function $f$:
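The displayed formulas for these two steps did not survive extraction. A form consistent with the surrounding definitions is sketched below. It assumes the standard non-negative sparse coding objective with an $\ell_1$ penalty weight $\lambda$ (the $\lambda$ tuned in the optimization section); the factor $\tfrac{1}{2}$ and the exact notation are assumptions, not taken from the original display.

```latex
\begin{align}
\alpha(x) &= \arg\min_{\alpha \succeq 0} \; \tfrac{1}{2}\lVert x - \Phi\alpha \rVert_2^2 + \lambda \lVert \alpha \rVert_1, \\
f^{(l)}_{c,i}(s) &= \Big[\alpha\big(x^{(l)}(s, i)\big)\Big]_c .
\end{align}
```

Under this reading, $f^{(l)}_{c,i}(s)$ is simply the $c$-th entry of the sparse code inferred for the word at position $i$ of sentence $s$ at layer $l$.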

Let RandomMask denote the operation that generates a perturbed version of the sentence $s$ by masking words at random locations with '[UNK]' (unknown) tokens. For example, a masked sentence could be

[Today, is, a, ['UNK'], day]

Let $h$ denote an encoder for perturbed sentences relative to the unperturbed sentence $s$, such that

The LIME algorithm we use to generate a saliency map for each sentence is the following:


where $\mathrm{Ridge}_w$ is a weighted ridge regression defined as:

$d(\cdot, \cdot)$ can be any metric that measures how much a perturbed sentence differs from the original sentence. If a sentence is perturbed such that every token is masked, then $d(h(s'), \vec{1})$ should be 0; if a sentence is not perturbed at all, then $d(h(s'), \vec{1})$ should be 1. We choose $d(\cdot, \cdot)$ to be cosine similarity in our implementation.

In practice, we also use feature selection, which is done by running LIME twice. After we obtain the regression weights $w_1$ from the first run, we use them to find the $k$ indices corresponding to the entries of $w_1$ with the highest absolute values. We then treat those $k$ indices as locations in the sentence and apply LIME a second time, perturbing only those selected indices.

Overall, the regression weight $w$ can be regarded as a saliency map. The higher the weight $w_k$, the more important the word $s[k]$ is in the sentence, since it contributes more to the activation of a specific transformer factor.

The weights in $w$ can also be negative. In general, negative weights are hard to interpret in the context of transformer factors. The activation increases if the words corresponding to negative weights are removed. Since a transformer factor corresponds to a specific pattern, words with negative weights are those that behave 'opposite' to this pattern in the given context.
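The procedure described in this section can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `lime_saliency`, the masking probability, the ridge strength, and the absence of an intercept are all hypothetical choices, and `f` stands in for the black-box activation function (here a toy function rather than BERT plus sparse coding). The sample weight uses the cosine similarity between the binary keep-mask and the all-ones vector, which is 0 when every token is masked and 1 when none is.

```python
import numpy as np

def lime_saliency(f, tokens, n_samples=500, mask_prob=0.3, ridge=1e-3, seed=0):
    """Estimate per-token saliency weights for a black-box scalar function f.

    f takes a list of tokens and returns a scalar activation (assumed interface).
    """
    rng = np.random.default_rng(seed)
    T = len(tokens)
    # Binary keep-masks: 1.0 = token kept, 0.0 = token replaced by "[UNK]".
    Z = (rng.random((n_samples, T)) >= mask_prob).astype(float)
    Z[0] = 1.0  # always include the unperturbed sentence
    y = np.array([
        f([tok if keep else "[UNK]" for tok, keep in zip(tokens, z)])
        for z in Z
    ])
    # Cosine similarity between a binary mask z and the all-ones vector
    # simplifies to sqrt(sum(z) / T).
    pi = np.sqrt(Z.sum(axis=1) / T)
    # Weighted ridge regression in closed form:
    # minimize sum_j pi_j * (y_j - z_j . w)^2 + ridge * ||w||^2
    A = Z.T @ (pi[:, None] * Z) + ridge * np.eye(T)
    b = Z.T @ (pi * y)
    return np.linalg.solve(A, b)

# Toy stand-in for the factor activation: responds to two specific tokens.
tokens = ["he", "was", "born", "in", "1892"]
toy_f = lambda toks: 2.0 * ("born" in toks) + 1.0 * ("1892" in toks)
weights = lime_saliency(toy_f, tokens)
```

The two-pass feature selection described above would correspond to calling such a routine once, keeping the `k` positions with the largest `abs(weights)`, and rerunning it with masking restricted to those positions.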

The Details of the Non-negative Sparse Coding Optimization

Let $S$ be the set of all sequences. Recall how we defined word embeddings using the hidden states of the transformer in the main section: $X^{(l)} = \{ x^{(l)}(s, i) \mid s \in S, i \in [0, \mathrm{len}(s)] \}$ is the set of all word embeddings at layer $l$. The set of word embeddings across all layers is then defined as $X = X^{(1)} \cup X^{(2)} \cup \cdots \cup X^{(L)}$.

In practice, we use the BERT base model as our transformer model, and each word embedding vector (a hidden state of BERT) has dimension 768. To learn the transformer factors, we concatenate all word vectors $x \in X$ into a data matrix $A$. We also define $f(x)$ to be the frequency of the token that is embedded in the word vector $x$. For example, if $x$ is the embedding of the word 'the', then $f(x)$ is high.

Using $f(x)$, we define the Inverse Frequency Matrix $\Omega$: $\Omega$ is a diagonal matrix where each entry on the diagonal is the square inverse frequency of each word, i.e.

Then we use a typical iterative optimization procedure to learn the dictionary Φ described in the main section:

Both of these optimizations are convex, and we solve them iteratively to learn the transformer factors. In practice, we use minibatches containing 200 word vectors as $X$. The motivation for applying the Inverse Frequency Matrix $\Omega$ is to make sure that all words in our vocabulary contribute equally. When we sample a minibatch from $A$, frequent words like 'the' and 'a' are much more likely to appear, so they should receive a lower weight during the update.

Optimization 2 converges within 1000 steps using the FISTA algorithm$^{2}$. We experimented with $\lambda$ values from 0.03 to 3 and chose $\lambda = 0.27$ for the results presented in this paper. Once the sparse coefficients have been inferred, we update the dictionary $\Phi$ based on Optimization 3 by one step of an approximate second-order method, where the Hessian is approximated by its diagonal to achieve an efficient inverse (Duchi et al., 2011). The second-order parameter update usually leads to much faster convergence. Empirically, we train for 200k steps, which takes roughly 2 days on an Nvidia 1080 Ti GPU.
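The sparse-code inference step can be illustrated with a short FISTA loop for a non-negative $\ell_1$-penalized least-squares problem. This is a sketch under assumptions, not the released code: the objective $\tfrac{1}{2}\lVert x - \Phi\alpha\rVert_2^2 + \lambda\lVert\alpha\rVert_1$ with $\alpha \succeq 0$ is the standard form assumed here, the function name `fista_nonneg` is hypothetical, and the inverse-frequency weighting and dictionary update are omitted.

```python
import numpy as np

def fista_nonneg(x, Phi, lam=0.27, n_steps=1000):
    """Minimize 0.5 * ||x - Phi @ a||^2 + lam * sum(a) subject to a >= 0."""
    # Lipschitz constant of the gradient of the quadratic term.
    L = np.linalg.norm(Phi, 2) ** 2
    a = np.zeros(Phi.shape[1])
    y, t = a.copy(), 1.0
    for _ in range(n_steps):
        grad = Phi.T @ (Phi @ y - x)
        # Proximal step: soft threshold by lam / L, then clip at zero.
        a_next = np.maximum(y - (grad + lam) / L, 0.0)
        # Nesterov momentum update.
        t_next = (1.0 + np.sqrt(1.0 + 4.0 * t * t)) / 2.0
        y = a_next + ((t - 1.0) / t_next) * (a_next - a)
        a, t = a_next, t_next
    return a

# Small synthetic example: an overcomplete dictionary with unit-norm columns.
rng = np.random.default_rng(0)
Phi = rng.standard_normal((20, 40))
Phi /= np.linalg.norm(Phi, axis=0)
a_true = np.zeros(40)
a_true[3], a_true[17] = 1.0, 0.5
x = Phi @ a_true
a_hat = fista_nonneg(x, Phi, lam=0.01)
```

Clipping at zero inside the proximal step is what enforces non-negativity of the coefficients; the rest is the usual accelerated proximal gradient scheme.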

In the following three sections, we provide visualizations of more example transformer factors at the low, mid, and high levels.

$^{2}$ The FISTA algorithm usually converges within 300 steps; we nevertheless use 1000 steps to avoid any potential numerical issues.

Low-Level Transformer Factors


Explanation: Mind: noun, the element of a person that enables them to be aware of the world and their experiences.

Transformer factor 16 in layer 4 Explanation: Park: noun, 'park' as the name

Transformer factor 30 in layer 4 Explanation: left: verb, leaving, exiting

Transformer factor 33 in layer 4 Explanation: light: noun, the natural agent that stimulates sight and makes things visible

Transformer factor 47 in layer 4 Explanation: plants: noun, vegetation

tallest tree in 2011. · or colourless enamel, as in the ground areas, rocks and trees. · produced from 16 to 139 weeks after a forest fire


Mid-Level Transformer Factors

Transformer factor 13 in layer 10 Explanation: Unit exchange with parentheses, e.g. 10 m (1000 cm)

Transformer factor 24 in layer 10 Explanation: Male name

Transformer factor 25 in layer 10 Explanation: Attributive clauses

income of $34,795.

Transformer factor 42 in layer 10 Explanation: Some kind of disaster, something unfortunate happened

Transformer factor 50 in layer 10 Explanation: Doing something again, or making something new again

redeveloping the north side's former rail yard and the area

Transformer factor 51 in layer 10 Explanation: apostrophe s, possessive

Transformer factor 86 in layer 10 Explanation: Pattern: consecutive years, the convention for naming football/rugby seasons

Transformer factor 99 in layer 10 Explanation: past tense

Transformer factor 102 in layer 10 Explanation: African name

muzorewa's united african national council (uanc).

Transformer factor 125 in layer 10 Explanation: Describing someone in a paraphrasing style. Name, Career

Transformer factor 134 in layer 10 Explanation: Transition sentence

Transformer factor 152 in layer 10 Explanation: in some locations


High-Level Transformer Factors


Though transformer networks (Vaswani et al., 2017; Devlin et al., 2018) have achieved great success, our understanding of how they work is still fairly limited. This has triggered increasing efforts to visualize and analyze these “black boxes”. Besides direct visualization of the attention weights, most current efforts to interpret transformer models involve “probing tasks”. These attach a lightweight auxiliary classifier to the output of the target transformer layer. Only the auxiliary classifier is then trained on well-known NLP tasks such as part-of-speech (POS) tagging, named-entity recognition (NER) tagging, and syntactic dependency. Tenney et al. (2019) and Liu et al. (2019) show that transformer models perform excellently on those probing tasks. These results indicate that transformer models have learned the language representations related to the probing tasks. Though probing tasks are great tools for interpreting language models, they have limitations, as explained in Rogers et al. (2020). We summarize these limitations in three major points:

Most probing tasks, like POS and NER tagging, are too simple. A model that performs well in those probing tasks does not reflect the model’s true capacity.

Probing tasks can only verify whether a certain prior structure is learned in a language model. They can not reveal the structures beyond our prior knowledge.

It’s hard to locate where exactly the related linguistic representation is learned in the transformer.

Efforts have been made to remove those limitations and make probing tasks more diverse. For instance, Hewitt and Manning (2019) propose the “structural probe”, which is a much more intricate probing task. Jiang et al. (2020) propose to generate specific probing tasks automatically. Non-probing methods have also been explored to relieve the last two limitations. For example, Reif et al. (2019) visualize embeddings from BERT using UMAP and show that the embeddings of the same word under different contexts are separated into different clusters. Ethayarajh (2019) analyzes the similarity between embeddings of the same word in different contexts. Both of these works show that transformers provide a context-specific representation.

Faruqui et al. (2015); Arora et al. (2018); Zhang et al. (2019) demonstrate how to use dictionary learning to explain, improve, and visualize the uncontextualized word embedding representations. In this work, we propose to use dictionary learning to alleviate the limitations of the other transformer interpretation techniques. Our results show that dictionary learning provides a powerful visualization tool, leading to some surprising new knowledge.

Hypothesis: contextualized word embedding as a sparse linear superposition of transformer factors. It has been shown that word embedding vectors can be factorized into a sparse linear combination of word factors (Arora et al., 2018; Zhang et al., 2019), which correspond to elementary semantic meanings. An example is:

We view the latent representation of words in a transformer as contextualized word embedding. Similarly, we hypothesize that a contextualized word embedding vector can also be factorized as a sparse linear superposition of a set of elementary elements, which we call transformer factors. The exact definition will be presented later in this section.

Due to the skip connections in each of the transformer blocks, we hypothesize that the representation in any layer would be a superposition of the hierarchical representations in all of the lower layers. As a result, the output of a particular transformer block would be the sum of all of the modifications along the way. Indeed, we verify this intuition with the experiments. Based on the above observation, we propose to learn a single dictionary for the contextualized word vectors from different layers’ output.

To learn a dictionary of transformer factors with non-negative sparse coding. Given a set of tokenized text sequences, we collect the contextualized embedding of every word using a transformer model. We define the set of all word embedding vectors from the $l$-th layer of the transformer model as $X^{(l)}$. Furthermore, we collect the embeddings across all layers into a single set $X = X^{(1)} \cup X^{(2)} \cup \cdots \cup X^{(L)}$.

By our hypothesis, we assume each embedding vector $x \in X$ is a sparse linear superposition of transformer factors:

$$x = \Phi\alpha + \epsilon, \quad \text{s.t. } \alpha \succeq 0,$$

where $\Phi \in \mathbb{R}^{d \times m}$ is a dictionary matrix with columns $\Phi_{:,c}$, $\alpha \in \mathbb{R}^{m}$ is a sparse vector of coefficients to be inferred, and $\epsilon$ is a vector containing independent Gaussian noise samples, which are assumed to be small relative to $x$. Typically $m > d$, so that the representation is overcomplete. This inverse problem can be efficiently solved by the FISTA algorithm (Beck and Teboulle, 2009). The dictionary matrix $\Phi$ can be learned in an iterative fashion by using non-negative sparse coding, which we leave to appendix section C. Each column $\Phi_{:,c}$ of $\Phi$ is a transformer factor, and its corresponding sparse coefficient $\alpha_c$ is its activation level.

Visualization by top activation and LIME interpretation. An important empirical method for visualizing a feature in deep learning is to use the input samples that trigger the top activations of the feature (Zeiler and Fergus, 2014). We adopt this convention. As a starting point, we try to visualize each of the dimensions of a particular layer, $X^{(l)}$. Unfortunately, the hidden dimensions of transformers are not semantically meaningful, which is similar to the uncontextualized word embeddings (Zhang et al., 2019).

Instead, we can try to visualize the transformer factors. For a transformer factor $\Phi_{:,c}$ and a layer $l$, we denote the 1000 contextualized word vectors with the largest sparse coefficients $\alpha^{(l)}_{c}$ as $X^{(l)}_{c} \subset X^{(l)}$, which correspond to 1000 different sequences. For example, Figure 3 shows the top 5 words that activated transformer factor 17, $\Phi_{:,17}$, at layer 0, layer 2, and layer 6, respectively. Since a contextualized word vector is generally affected by many tokens in the sequence, we can use LIME (Ribeiro et al., 2016) to assign a weight to each token in the sequence to identify their relative importance to $\alpha_c$. The detailed method is left to Section 3.

To determine low-, mid-, and high-level transformer factors with importance score. As we build a single dictionary for all of the transformer layers, the semantic meaning of the transformer factors has different levels. While some of the factors appear in lower layers and continue to be used in the later stages, the rest of the factors may only be activated in the higher layers of the transformer network. A central question in representation learning is: “where does the network learn certain information?” To answer this question, we can compute an “importance score” $I^{(l)}_{c}$ for each transformer factor $\Phi_{:,c}$ at layer $l$. $I^{(l)}_{c}$ is the average of the largest 1000 sparse coefficients $\alpha^{(l)}_{c}$, which correspond to $X^{(l)}_{c}$. We plot the importance scores for each transformer factor as a curve, as shown in Figure 2. We then use these importance score (IS) curves to identify the layer at which a transformer factor emerges. Figure 2a shows an IS curve that peaks in an earlier layer. The corresponding transformer factor emerges in the earlier stage and may capture lower-level semantic meanings. In contrast, Figure 2b shows a peak in the higher layers, which indicates that the transformer factor emerges much later and may correspond to mid- or high-level semantic structures. More subtleties are involved when distinguishing between mid-level and high-level factors, which will be discussed later.

An important characteristic is that the IS curve for each transformer factor is relatively smooth. This indicates that if a vital feature is learned in the beginning layers, it won’t disappear in later stages. Instead, it will be carried all the way to the end with gradually decaying weight, since many more features join along the way. Similarly, abstract information learned in higher layers is slowly developed from the early layers. Figures 3 and 5 confirm this idea, which will be explained in the next section.
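The importance score defined above is straightforward to compute once the sparse codes are available. The following is a minimal sketch, assuming the sparse codes of each layer have already been inferred and stacked into one array per layer; the function name `importance_scores` and the array layout are hypothetical, and a tiny `top_k` is used in the example instead of 1000.

```python
import numpy as np

def importance_scores(alpha_by_layer, top_k=1000):
    """Return an (n_layers, m) array of importance scores.

    alpha_by_layer: list with one array per layer, each of shape
    (n_word_vectors, m), holding the sparse codes of that layer.
    Entry [l, c] is the mean of the top_k largest coefficients of factor c
    at layer l.
    """
    scores = []
    for codes in alpha_by_layer:
        k = min(top_k, codes.shape[0])
        # Sort each factor's coefficients and keep the k largest.
        top = np.sort(codes, axis=0)[-k:]
        scores.append(top.mean(axis=0))
    return np.stack(scores)

# Toy example with 2 layers, 3 word vectors per layer, and 2 factors.
layer0 = np.array([[1.0, 0.0], [3.0, 0.0], [5.0, 1.0]])
layer1 = np.array([[0.0, 2.0], [0.0, 4.0], [1.0, 6.0]])
scores = importance_scores([layer0, layer1], top_k=2)
peak_layer = scores.argmax(axis=0)  # layer at which each factor's IS curve peaks
```

Each column of `scores` is one IS curve; the position of its maximum is what separates early-peaking (low-level) factors from late-peaking (mid- or high-level) ones.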

We use a 12-layer pre-trained BERT model (Pretrained BERT base model; Devlin et al., 2018) and freeze the weights. Since we learn a single dictionary of transformer factors for all of the layers in the transformer, we show that these transformer factors correspond to different levels of semantic or syntactic patterns. The patterns can be roughly divided into three categories: word-level disambiguation, sentence-level pattern formation, and long-range dependency. In the following, we provide detailed visualization for each pattern category. Due to the space limit, only a small number of the factors are demonstrated in the paper. To alleviate the “cherry-picking” bias, we also build a website for interested readers to play with these results.

Low-level: word-level polysemy disambiguation. While the input embedding of a token contains polysemy, we find transformer factors with early IS curve peaks usually correspond to a specific word-level meaning. By visualizing the top activation sequences, we can see how word-level disambiguation is gradually developed in a transformer.

We show how the disambiguation effect develops progressively through each layer in Figure 3, which lists the top 5 activated words and their contexts for transformer factor $\Phi_{:,30}$ in different layers. The top activated words in layer 0 contain the word “left” in varying senses, which are mostly, albeit not completely, disambiguated in layer 2. In layer 4, the word “left” is fully disambiguated, since the top-activated words contain only “left” with the word sense “leaving, exiting.” We also show more examples of those types of transformer factors in Table 1: for each transformer factor, we list the top 3 activated words and their contexts in layer 4. As shown in the table, nearly all top-activated words are disambiguated into a single sense.

Further, we can quantify the quality of the disambiguation ability of the transformer model. In the example above, since the top 1000 activated words and contexts are “left” with only the word sense “leaving, exiting”, we can assume that “left”, when used as a verb, triggers a higher activation in $\Phi_{:,30}$ than “left” used as another part of speech. We can verify this hypothesis using a human-annotated corpus, the Brown corpus (Francis and Kucera, 1979). In this corpus, each word is annotated with its corresponding part of speech. We collect all the sentences containing the word “left” annotated as a verb in one set, and the sentences containing “left” annotated as another part of speech in a second set. As shown in Figure 4a, in layer 0, the average activation of $\Phi_{:,30}$ for the word “left” marked as a verb is no different from that for “left” in other senses. However, at layer 2, “left” marked as a verb triggers a higher activation of $\Phi_{:,30}$. In layer 4, this difference increases further, indicating that disambiguation develops progressively across layers. We also plot the activations of “left” marked as a verb and the activations of the other “left” in Figure 4b. In layer 4, they are nearly linearly separable by this single feature. Since each word “left” corresponds to an activation value, we can perform a logistic regression classification to differentiate those two types of “left”. From the result shown in Figure 4a, it is pretty fascinating to see that the disambiguation ability of just $\Phi_{:,30}$ is better than that of the other two classifiers trained with supervised data. This result confirms that disambiguation is indeed done in the early part of the pre-trained transformer model and that we are able to detect it via dictionary learning.
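The single-feature classification described above can be illustrated as follows. This is only a sketch, not the authors' experiment: the activation values are synthetic stand-ins for the $\Phi_{:,30}$ activations of “left” (no BERT model or Brown corpus is loaded), and the plain gradient-descent logistic regression with the hypothetical helper names `fit_logistic_1d` and `accuracy` is an assumed implementation.

```python
import numpy as np

def fit_logistic_1d(a, y, lr=0.5, n_steps=2000):
    """Fit p(y=1 | a) = sigmoid(w * a + b) by gradient descent on the mean log loss."""
    w, b = 0.0, 0.0
    for _ in range(n_steps):
        p = 1.0 / (1.0 + np.exp(-(w * a + b)))
        w -= lr * np.mean((p - y) * a)
        b -= lr * np.mean(p - y)
    return w, b

def accuracy(a, y, w, b):
    """Fraction of examples whose predicted class matches the label."""
    return float(np.mean(((w * a + b) > 0) == (y == 1)))

# Synthetic activations (assumption): verb uses of the word activate the
# factor strongly, other uses only weakly.
act_verb = np.linspace(1.5, 2.5, 20)
act_other = np.linspace(0.0, 0.5, 20)
a = np.concatenate([act_verb, act_other])
y = np.concatenate([np.ones(20), np.zeros(20)])

w, b = fit_logistic_1d(a, y)
acc = accuracy(a, y, w, b)
```

When the two groups of activations are well separated, as reported for layer 4, a classifier on this one scalar feature already suffices, which is the point the comparison in Figure 4a makes.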

Mid-level: sentence-level pattern formation. We find that most of the transformer factors with an IS curve peak after layer 6 capture mid-level or high-level semantic meanings. In particular, the mid-level ones correspond to semantic patterns such as phrase and sentence patterns.

We first show two detailed examples of mid-level transformer factors. Figure 5 shows a transformer factor that detects the pattern of consecutive usage of adjectives. This pattern starts to emerge at layer 4, develops at layer 6, and becomes quite reliable at layer 8. Figure 6 shows a transformer factor, which corresponds to a pretty unexpected pattern: “unit exchange”, e.g., 56 inches (140 cm). Although this exact pattern only starts to appear at layer 8, the sub-structures that make this pattern, e.g., parenthesis and numbers, appear to trigger this factor in layers 4 and 6. Thus this transformer factor is also gradually developed through several layers.

While some mid-level transformer factors verify common semantic or syntactic patterns, there are also many surprising mid-level transformer factors. We list a few in Table 3 with quantitative analysis. For each listed transformer factor, we analyze the top 200 activating words and their contexts in each layer. We record the percentage of those words and contexts that correspond to the factor’s semantic pattern in Table 3. From the table, we see that large percentages of the top-activated words and contexts do correspond to the pattern we describe. It also shows that most of these mid-level patterns start to develop at layer 4 or 6. More detailed examples are provided in appendix section F. Though it’s still mysterious why the transformer network develops representations for these surprising patterns, we believe such direct visualization can provide additional insights, which complement the “probing tasks”.

To further confirm that a transformer factor does correspond to a specific pattern, we can use constructed example words and contexts to probe its activation. In Table 4, we construct several text sequences that are similar to the pattern corresponding to a particular transformer factor but with subtle differences. The result confirms that a context that strictly follows the pattern represented by that transformer factor triggers a high activation. Moreover, the closer an adversarial example is to this pattern, the higher the activation it receives at this transformer factor.

High-level: long-range dependency. High-level transformer factors correspond to linguistic patterns that span an extended range in the text. Since the IS curves of mid-level and high-level transformer factors are similar, it is difficult to distinguish those transformer factors based on their IS curves. Thus, we have to manually examine the top-activation words and contexts for each transformer factor to differentiate between mid-level and high-level transformer factors. To ease the process, we use the black-box interpretation algorithm LIME (Ribeiro et al., 2016) to identify the contribution of each token in a sequence. There also exist interpretation tools that specifically leverage the transformer architecture (Chefer et al., 2021, 2020). In the future, one could adapt those interpretation tools, which may potentially provide better visualization.

Given a sequence $s \in S$, we can treat $\alpha^{(l)}_{c,i}$, the activation of $\Phi_{:,c}$ in layer $l$ at location $i$, as a scalar function of $s$, $f^{(l)}_{c,i}(s)$. Assume a sequence $s$ triggers a high activation $\alpha^{(l)}_{c,i}$, i.e. $f^{(l)}_{c,i}(s)$ is large. We want to know how much each token (or equivalently each position) in $s$ contributes to $f^{(l)}_{c,i}(s)$. To do so, we generate a sequence set $\mathcal{S}(s)$, where each $s' \in \mathcal{S}(s)$ is the same as $s$ except that several random positions in $s'$ are masked by [‘UNK’] (the unknown token). Then we learn a linear model $g_w(s')$ with weights $w \in \mathbb{R}^{T}$ to approximate $f(s')$, where $T$ is the length of the sentence $s$. This can be solved as a ridge regression:

The learned weights $w$ can serve as a saliency map that reflects the “contribution” of each token in the sequence $s$. As in Figure 7, the color reflects the weight $w$ at each position. Red means the given position has a positive weight, and green means a negative weight. The magnitude of the weight is represented by the intensity. The redder a token is, the more it contributes to the activation of the transformer factor. We leave more implementation and mathematical formulation details of the LIME algorithm to the appendix.

We provide detailed visualization for two different transformer factors that show long-range dependency in Figures 7 and 8. Since visualization of high-level information requires more extended context, we only show the top two activated words and their contexts for each such transformer factor. Many more are provided in appendix section G.

We name the pattern for transformer factor $\Phi_{:,297}$ in Figure 7 the “repetitive pattern detector”. All top-activated contexts for $\Phi_{:,297}$ contain an obvious repetitive structure. Specifically, the text snippet “can’t get you out of my head” appears twice in the first example, and the text snippet “xxx class passenger, star alliance” appears three times in the second example. Compared to the patterns we found in the mid-level [6], high-level patterns like the “repetitive pattern detector” are much more abstract. In some sense, the transformer detects whether there are two (or multiple) almost identical embedding vectors at layer 10 without caring what they are. Such behavior might be highly related to the concept proposed in capsule networks (Sabour et al., 2017; Hinton, 2021). Further understanding this behavior and studying how the self-attention mechanism helps model the relationships between the features is an interesting direction for future research.

Figure 8 shows another high-level factor, which detects text snippets related to “the beginning of a biography”. The necessary components (date of birth as a month and four-digit year, first name and last name, familial relation, and career) are all mid-level information. In Figure 8, we see that all the information related to the biography has a high weight in the saliency map. Thus, these components are combined to detect the high-level pattern.

Dictionary learning has been successfully used to visualize the classical word embeddings Arora et al. (2018); Zhang et al. (2019). In this paper, we propose to use this simple method to visualize the representation learned in transformer networks to supplement the implicit “probing-tasks” methods. Our results show that the learned transformer factors are relatively reliable and can even provide many surprising insights into the linguistic structures. This simple tool can open up the transformer networks and show the hierarchical semantic or syntactic representation learned at different stages. In short, we find word-level disambiguation, sentence-level pattern formation, and long-range dependency. The idea of a neural network learns low-level features in early layers, and abstract concepts in the later stages are very similar to the visualization in CNN Zeiler and Fergus (2014). Dictionary learning can be a convenient tool to help visualize a broad category of neural networks with skip connections, like ResNet He et al. (2016), ViT models Dosovitskiy et al. (2020), etc. For more interested readers, we provide an interactive website111https://transformervis.github.io/transformervis/ for the readers to gain some further insights.

We thank our reviewers for their detailed and insightful comments. We also thank Yuhao Zhang for his suggestions during the preparation of this paper.

The importance score curve’s characteristic has a strong correspondence to a transformer factor’s categorization. Based on the location of the peak of an IS curve, we can classify a transformer factor as low-level, mid-level or high-level. The importance score for low-level transformer factors peak in early layers and slowly decrease across the rest of the layers. On the other hand, the importance score for mid-level and high-level transformers slowly increases and peaks at higher layers. In Figure 9, we show two sets of the examples to demonstrate the clear distinction between those two types of IS curves.

Taking a step back, we can also plot IS curve for each dimension of word vector (without sparse coding) at different layers. They do not show any specific patterns, as shown in Figure 10. This makes intuitive sense since we mentioned that each of the entries of a contextualized word embedding does not correspond to any clear semantic meaning.

For a given sentence and index pair (s,i)𝑠𝑖(s,i), the embedding of word w=s​[i]𝑤𝑠delimited-[]𝑖w=s[i] by layer l𝑙l of transformer is x(l)​(x,i)superscript𝑥𝑙𝑥𝑖x^{(l)}(x,i). Then we can abstract the inference of a specific entry of sparse code of the word vector as a black-box scalar-value function f𝑓f:

Let R​a​n​d​o​m​M​a​s​k𝑅𝑎𝑛𝑑𝑜𝑚𝑀𝑎𝑠𝑘RandomMask denotes the operation that generates perturbed version of our sentence s𝑠s by masking word at random location with “[UNK]” (unkown) tokens. For example, a masked sentence could be

[Today is a [‘UNK’],day]

Let hℎh denote a encoder for perturbed sentences compared to the unperturbed sentence s𝑠s, such that

The LIME algorithm we used to generated saliency map for each sentences is the following:

Where R​i​d​g​ew𝑅𝑖𝑑𝑔subscript𝑒𝑤Ridge_{w} is a weighted ridge regression defined as:

d​(⋅,⋅)𝑑⋅⋅d(\cdot,\cdot) can be any metric that measures how much a perturbed sentence is different from the original sentence. If a sentence is perturbed such that every token is being masked, then the distance h​(h​(s′),1→)ℎℎsuperscript𝑠′→1h(h(s^{\prime}),\vec{1}) should be 0, if a sentence is not perturbed at all, then h​(h​(s′),1→)ℎℎsuperscript𝑠′→1h(h(s^{\prime}),\vec{1}) should be 1. We choose d​(⋅,⋅)𝑑⋅⋅d(\cdot,\cdot) to be cosine similarity in our implementation.

In practice, we also use feature selection. This is done by running LIME twice. After we obtain the regression weights $w_{1}$ from the first run, we use them to find the $k$ indices corresponding to the entries of $w_{1}$ with the highest absolute values. We then treat those $k$ indices as locations in the sentence and apply LIME a second time with only the indices selected in the first step.
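The two-pass procedure can be sketched as follows. This is an illustration under assumptions, not the authors' code: the helper `lime_on_positions` is a hypothetical no-intercept weighted ridge fit in which only the listed positions may be masked, and the sample count, masking probability, and ridge strength are arbitrary.

```python
import numpy as np

def lime_on_positions(f, tokens, positions, n_samples=500,
                      mask_prob=0.3, ridge=1e-3, seed=0):
    """Weighted ridge LIME in which only `positions` may be masked."""
    rng = np.random.default_rng(seed)
    positions = np.asarray(positions)
    k = len(positions)
    Z = (rng.random((n_samples, k)) > mask_prob).astype(float)  # 1 = kept
    y = np.empty(n_samples)
    for row, z in enumerate(Z):
        perturbed = list(tokens)
        for pos, keep in zip(positions, z):
            if not keep:
                perturbed[pos] = "[UNK]"
        y[row] = f(perturbed)
    sim = np.sqrt(Z.sum(axis=1) / k)  # cosine similarity to the all-ones vector
    ZtD = Z.T * sim
    return np.linalg.solve(ZtD @ Z + ridge * np.eye(k), ZtD @ y)

def two_pass_lime(f, tokens, k):
    """Run LIME on all positions, keep the top-k by |weight|, then refit."""
    w1 = lime_on_positions(f, tokens, np.arange(len(tokens)))
    top = np.sort(np.argsort(-np.abs(w1))[:k])
    w2 = lime_on_positions(f, tokens, top)
    saliency = np.zeros(len(tokens))
    saliency[top] = w2  # unselected positions keep zero saliency
    return top, saliency

# Toy black box standing in for one sparse-code entry (hypothetical).
def toy_f(toks):
    return 2.0 * ("bank" in toks) + 0.5 * ("river" in toks)

tokens = ["the", "boat", "reached", "the", "river", "bank"]
top, saliency = two_pass_lime(toy_f, tokens, k=2)
print(top, np.round(saliency, 2))
```

Restricting the second pass to $k$ positions means every perturbed sample varies only the candidate words, so the same sampling budget yields a better-conditioned regression.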

Overall, the regression weight $w$ can be regarded as a saliency map. The higher the weight $w_{k}$, the more important the word $s[k]$ is in the sentence, since it contributes more to the activation of a specific transformer factor.

We could also have negative weights in $w$. In general, negative weights are hard to interpret in the context of transformer factors. The activation will increase if the words corresponding to negative weights are removed. Since a transformer factor corresponds to a specific pattern, words with negative weights are those words in a context that behave “opposite” to this pattern.

Let $S$ be the set of all sequences. Recall how we defined word embeddings using the hidden states of the transformer in the main section: $X^{(l)}=\{x^{(l)}(s,i)\,|\,s\in S,\ i\in[0,len(s)]\}$ is the set of all word embeddings at layer $l$. The set of word embeddings across all layers is then defined as

In practice, we use the BERT base model as our transformer model; each word embedding vector (hidden state of BERT) has dimension 768. To learn the transformer factors, we concatenate all word vectors $x\in X$ into a data matrix $A$. We also define $f(x)$ to be the frequency of the token that is embedded in word vector $x$. For example, if $x$ is the embedding of the word “the”, its frequency $f(x)$ will be high.

Using $f(x)$, we define the Inverse Frequency Matrix $\Omega$: $\Omega$ is a diagonal matrix where each entry on the diagonal is the square inverse frequency of each word, i.e.

Then we use a typical iterative optimization procedure to learn the dictionary $\Phi$ described in the main section:

These two optimizations are both convex, and we solve them iteratively to learn the transformer factors. In practice, we use minibatches containing 200 word vectors as $X$. The motivation for applying the Inverse Frequency Matrix $\Omega$ is that we want to make sure all words in our vocabulary have the same contribution. When we sample our minibatch from $A$, frequent words like “the” and “a” are much more likely to appear, so they should receive lower weight during the update.

Optimization 2 can converge in 1000 steps using the FISTA algorithm (the FISTA algorithm can usually converge within 300 steps; we use 1000 steps nevertheless to avoid any potential numerical issue). We experimented with different $\lambda$ values from 0.03 to 3 and chose $\lambda=0.27$ to produce the results presented in this paper. Once the sparse coefficients have been inferred, we update our dictionary $\Phi$ based on Optimization 3 by one step using an approximate second-order method, where the Hessian is approximated by its diagonal to achieve an efficient inverse (Duchi et al., 2011). The second-order parameter update method usually leads to much faster convergence. Empirically, we train for 200k steps, which takes roughly 2 days on an Nvidia 1080 Ti GPU.
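The sparse inference step can be illustrated with a minimal FISTA sketch. It assumes the per-minibatch objective is $\frac{1}{2}\|X-\Phi A\|_F^2+\lambda\sum_i\|\alpha_i\|_1$ with non-negative codes, which should be checked against the formulation in the main section; the inverse-frequency weighting and the dictionary update are left out, and the function names, toy dimensions, and `lam=0.05` are hypothetical.

```python
import numpy as np

def fista_nonneg(Phi, X, lam, n_steps=1000):
    """FISTA for min_A 0.5*||X - Phi A||_F^2 + lam*sum(A), subject to A >= 0.

    Phi: (d, m) dictionary, one factor per column.
    X:   (d, b) minibatch, one word vector per column.
    """
    L = np.linalg.norm(Phi, 2) ** 2  # Lipschitz constant of the gradient
    A = np.zeros((Phi.shape[1], X.shape[1]))
    Y, t = A.copy(), 1.0
    for _ in range(n_steps):
        grad = Phi.T @ (Phi @ Y - X)
        # Proximal step for the l1 penalty combined with non-negativity.
        A_next = np.maximum(Y - (grad + lam) / L, 0.0)
        t_next = (1.0 + np.sqrt(1.0 + 4.0 * t * t)) / 2.0
        Y = A_next + ((t - 1.0) / t_next) * (A_next - A)  # momentum step
        A, t = A_next, t_next
    return A

def objective(Phi, X, A, lam):
    return 0.5 * np.sum((X - Phi @ A) ** 2) + lam * np.sum(np.abs(A))

# Toy problem: 20-dimensional signals, 50 unit-norm factors, 5 signals.
rng = np.random.default_rng(0)
Phi = rng.standard_normal((20, 50))
Phi /= np.linalg.norm(Phi, axis=0)
A_true = np.zeros((50, 5))
for col in range(5):
    A_true[rng.choice(50, size=3, replace=False), col] = rng.uniform(0.5, 1.5, size=3)
X = Phi @ A_true

A_hat = fista_nonneg(Phi, X, lam=0.05)
print(objective(Phi, X, A_hat, 0.05), objective(Phi, X, np.zeros_like(A_hat), 0.05))
```

The step size $1/L$ uses the squared spectral norm of $\Phi$, so no line search is needed; in the actual setting $\Phi$ would have 768 rows and the minibatch 200 columns.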

In the following three sections, we provide visualizations of more example transformer factors at the low, mid, and high levels. Here is a table of contents with hyperlinks to each level:

Low-Level: E

Transformer factor 2 in layer 4
Explanation: Mind: noun, the element of a person that enables them to be aware of the world and their experiences.
• that snare shot sounded like somebody’ d kicked open the door to your mind".
• i became very frustrated with that and finally made up my mind to start getting back into things."
• when evita asked for more time so she could make up her mind, the crowd demanded," ¡ ahora, evita,<
• song and watch it evolve in front of us… almost as a memory in your head.
• was to be objective and to let the viewer make up his or her own mind."
• managed to give me goosebumps, and those moments have remained on my mind for weeks afterward."
• rests the tir’ d mind, and waking loves to dream
•, tracks like’ halftime’ and the laid back’ one time 4 your mind’ demonstrated a[ high] level of technical precision and rhetorical dexter
• so i went to bed with that on my mind".
•ment to a seed of doubt that had been playing on mulder’ s mind for the entire season".
• my poor friend smart shewed the disturbance of his mind, by falling upon his knees, and saying his prayers in the street
• donoghue complained that lessing has not made up her mind on whether her characters are" the salt of the earth or its sc
• release of the new lanois@-@ produced album, time out of mind.
• sympathetic man to illegally" ghost@-@ hack" his wife’ s mind to find his daughter.
• this album veered into" the corridors" of flying lotus’" own mind", interpreting his guest vocalists as" disembodied phantom

Transformer factor 16 in layer 4
Explanation: Park: noun, ’park’ as the name
• allmusic writer william ruhlmann said that" linkin park sounds like a johnny@-@ come@-@ lately to an
•nington joined the five members xero and the band was renamed to linkin park.
• times about his feelings about gordon, and the price family even sat away from park’ s supporters during the trial itself.
• on 25 january 2010, the morning of park’ s 66th birthday, he was found hanged and unconscious in his
• was her, and knew who had done it", expressing his conviction of park’ s guilt.
• jeremy park wrote to the north@-@ west evening mail to confirm that he
• vanessa fisher, park’ s adoptive daughter, appeared as a witness for the prosecution at the
• they played at< unk> for years before joining oldham athletic at boundary park until 2010 when they moved to oldham borough’ s previous ground,<
• theme park guests may use the hogwarts express to travel between hogsmead
• s strength in both singing and rapping while comparing the sound to linkin park.
• in a statement shortly after park’ s guilty verdict, he said he had" no doubt" that
• june 2013, which saw the band travel to rock am ring and rock im park as headline act, the song was moved to the middle of the set
• after spending the first decade of her life at the central park zoo, pattycake moved permanently to the bronx zoo in 1982.
• south park spoofed the show and its hosts in the episode" south park is gay!"
• harrison" sounds like he’ s recorded his vocal track in one of the park’ s legendary caves".

Transformer factor 30 in layer 4
Explanation: left: verb, leaving, exiting
• did succeed in getting the naval officers into his house, and the mob eventually left.
• all of the federal troops had left at this point, except totten who had stayed behind to listen to
• saying that he has left the outsiders, kovu asks simba to let him join his pride
• eventually, all boycott’ s employees left, forcing him to run the estate without help.
• the story concerned the attempts of a scientist to photograph the soul as it left the body.
• in time and will slowly improve until he returns to the point at which he left.
• peggy’ s exit was a" non event", as" peggy just left, nonsensically and at complete odds with everything we’ ve
• over the course of the group’ s existence, several hundred people joined and left.
• no profit was made in six years, and the church left, losing their investment.
• on 7 november he left, missing the bolshevik revolution, which began on that day.
• he had not re@-@ written his will and when produced still left everything to his son lunalilo.
• they continued filming as normal, and when lynch yelled cut, the townspeople had left.
• with land of black gold( 1950), a story that he had previously left unfinished, instead.
• he was infuriated that the government had left thousands unemployed by closing down casinos and brothels.
• an impending marriage between her and albert interfered with their studies, the two brothers left on 28 august 1837 at the close of the term to travel around europe

Transformer factor 33 in layer 4
Explanation: light: noun, the natural agent that stimulates sight and makes things visible:
• forced to visit the sarajevo television station at night and to film with as little light as possible to avoid the attention of snipers and bombers.
• by the modest, cream@-@ colored attire in the airy, light@-@ filled clip.
• the man asked her to help him carry the case to his car, a light@-@ brown volkswagen beetle.
• they are portrayed in a particularly sympathetic light when they are killed during the ending.
• caught up" was directed by mr. x, who was behind the laser light treatment of usher’ s 2004 video" yeah!"
• piracy in the indian ocean, and the script depicted the pirates in a sympathetic light.
• without the benefit of moon light, the light horsemen had fired at the flashes of the enemy’ s
• second innings, voce repeated the tactic late in the day, in fading light against woodfull and bill brown.
•, and the workers were transferred on 7 july to another facility belonging to early light, 30 km away in< unk> town.
• unk> brooklyn avenue ne near the university of washington campus in a small light@-@ industrial building leased from the university.
• factory where the incident took place is the< unk>(" early light") toy factory(< unk>), owned by hong
•, a 1934 comedy in which samuel was portrayed in an unflattering light, and mrs beeton, a 1937 documentary,< unk>
• stage effects and blue@-@ red light transitions give the video a surreal feel, while a stoic crowd make
• set against the backdrop of mumbai’ s red@-@ light districts, it follows the travails of its personnel and principal,
• themselves on the chinese flank in the foothills, before scaling the position at first light.

Transformer factor 47 in layer 4
Explanation: plants: noun, vegetation
• the distinct feature of the main campus is the mall, which is a large tree – laden grassy area where many students go to relax.
• each school in the london borough of hillingdon was invited to plant a tree, and the station commander of raf northolt, group captain tim o
• its diet in summer contains a high proportion of insects, while more plant items are eaten in autumn.
• large fruitings of the fungus are often associated with damage to the host tree, such as that which occurs with burning.
• she nests on the ground under the cover of plants or in cavities such as hollow tree trunks.
• orchards, heaths and hedgerows, especially where there are some old trees.
• the scent of plants such as yarrow acts as an olfactory attractant to females.
• of its grasshopper host, causing it to climb to the top of a plant and cling to the stem as it dies.
• well@-@ drained or sandy soil, often in the partial shade of trees.
• food is taken from the ground, low@-@ growing plants and from inside grass tussocks; the crake may search leaf
• into his thought that the power of gravity( which brought an apple from a tree to the ground) was not limited to a certain distance from earth,
• they eat both seeds and green plant parts and consume a variety of animals, including insects, crustaceans
• fyne, argyll in the 1870s was named as the uk ’ s tallest tree in 2011.
•", or colourless enamel, as in the ground areas, rocks and trees.
• produced from 16 to 139 weeks after a forest fire in areas with coniferous trees.

This is the end of the visualization of low-level transformer factors. Click [D] to go back.

Transformer factor 13 in layer 10
Explanation: Unit exchange with parentheses: e.g. 10 m (1000cm)
• 14@-@ 16 hand( 56 to 64 inches( 140 to 160 cm)) war horse is that it was a matter of pride to a
•, behind many successful developments, defaulted on the$ 214 million($ 47 billion) in bonds held by 60@,@ 000 investors; the van
• straus, behind many successful developments, defaulted on the$ 214 million($ 47 billion) in bonds held by 60@,@ 000 investors;
• the niche is 4 m( 13 ft) wide and 3@.
• with a top speed of nearly 21 knots( 39 km/ h; 24 mph).
•@ 4 billion( us$ 21 million) — india’ s highest@-@ earning film of the year
•) at deep load as built, with a length of 310 ft( 94 m), a beam of 73 feet 7 inches( 22@.
•@ 3@-@ inch( 160 mm) calibre steel barrel.
• and gave a maximum speed of 23 knots( 43 km/ h; 26 mph).
• 2 km) in length, with a depth around 790 yards( 720 m), and in places only a few yards separated the two sides.
• hull provided a combined thickness of between 24 and 28 inches( 60 – 70 cm), increasing to around 48 inches( 1@.
• switzerland, austria and germany; and his mother, lynette federer( born durand), from kempton park, gauteng, is
•@ 2 in( 361 mm) thick sides.
•) and a top speed of 30 knots( 56 km/ h; 35 mph).
•, an outdoor seating area( 4@,@ 300 square feet( 400 m2)) and a 2@,@ 500@-@ square@

Transformer factor 24 in layer 10
Explanation: Male name
• divorcing doqui in 1978, michelle married robert h. tucker, jr. the following year, changed her name to gee tucker, moved back
• divorced doqui in 1978 and married new orleans politician robert h. tucker, jr. the following year; she changed her name to gee tucker and became
• including isabel sanford, when chuck and new orleans politician robert h. tucker, jr. visited michelle at her hotel.
• of 32 basidiomycete mushrooms showed that mutinus elegans was the only species to show antibiotic( both antibacterial
• amphicoelias, it is probably synonymous with camarasaurus grandis rather than c. supremus because it was found lower in the
•[ her] for warmth and virtue" and mehul s. thakkar of the deccan chronicle wrote that she was successful in" deliver[
• em( queen latifah) and uncle henry( david alan grier) own a diner, to which dorothy works for room and board.
• in melbourne on 10 august 1895, presented by dion boucicault, jr. and robert brough, and the play was an immediate success.
• in the early 1980s, james r. tindall, sr. purchased the building, the construction of which his father had originally financed
• in 1937, when chakravarthi rajagopalachari became the chief minister of madras presidency, he introduced hindi as a compulsory
• in 1905 william lewis moody, jr. and isaac h. kempner, members of two of galveston’
• also, walter b. jones, jr. of north carolina sent a letter to the republican conference chairwoman cathy
• empire’ s leading generals, nikephoros bryennios the elder, the doux of dyrrhachium in the western balkans
• in bengali as< unk>: the warrior by raj chakraborty with dev and mimi chakraborty portraying the lead roles.
• on 1 june 1989, erik g. braathen, son of bjørn g., took over as ceo

Transformer factor 25 in layer 10
Explanation: Attributive Clauses
• which allows japan to mount an assault on the us; or kill him, which lets the us discover japan’ s role in rigging american elections —
• certain stages of development, and constitutive heterochromatin that consists of chromosome structural components such as telomeres and centromeres
• to the mouth of the nueces river, and oso bay, which extends south to the mouth of oso creek.
•@,@ 082 metric tons, and argentina, which ranks 17th, with 326@,@ 900 metric tons.
• of$ 42@,@ 693 and females had a median income of$ 34@,@ 795.
• ultimately scored 14 points with 70 per cent shooting, and crispin, who scored twelve points with 67 per cent shooting.
• and is operated by danish air transport, and one jetstream 32, which seats 19 and is operated by helitrans.
• acute stage, which occurs shortly after an initial infection, and a chronic stage that develops over many years.
•, earl of warwick and then william of lancaster, and ada de warenne who married henry, earl of huntingdon.
• who ultimately scored 14 points with 70 per cent shooting, and crispin, who scored twelve points with 67 per cent shooting.
• in america, while" halo/ walking on sunshine" charted at number 4 in ireland, 9 in the uk, 10 in australia, 28 in canada
• five events, heptathlon consisting of seven events, and decathlon consisting of ten< unk> every multi event, athletes participate in a
•@-@ life of 154@,@ 000 years, and 235np with a half@-@ life of 396@.
• comfort, and intended to function as the prison, and the second floor was better finished, with a hall and a chamber, and probably operated as the
•b, which serves the quonset freeway, and exit 7a, which serves route 402( frenchtown road), another spur route connecting the

Transformer factor 42 in layer 10
Explanation: Some kind of disaster, something unfortunate happened
• after the first five games, all losses, jeff carter suffered a broken foot that kept him out of the line@-@ up for
• allingham died of natural causes in his sleep at 3: 10 am on 18 july 2009 at his
• upon reaching corfu, thousands of serb troops began showing symptoms of typhus and had to be quarantined on the island of< un
• than a year after the senate general election, the september 11, 2001 terrorist attacks took place, with giuliani still mayor.
• the starting job because fourth@-@ year junior grady was under suspension related to driving while intoxicated charges.
• his majesty, but as soon as they were on board ship, they died of melancholy, having refused to eat or drink.
• on 16 september 1918, before she had even gone into action, she suffered a large fire in one of her 6@-@ inch magazines, and
• orange goalkeeper for long@-@ time starter john galloway who was sick with the flu.
• in 1666 his andover home was destroyed by fire, supposedly because of" the carelessness of the maid".
• the government, on 8 february, admitted that the outbreak may have been caused by semi@-@ processed turkey meat imported directly
•ikromo came under investigation by the justice office of the dutch east indies for publishing several further anti@-@ dutch editorials.
• that he could attend to the duties of his office, but fell ill with a fever in august 1823 and died in office on september 1.
•@ 2 billion initiative to combat cholera and the construction of a$ 17 million teaching hospital in< unk
• he would not hear from his daughter until she was convicted of stealing from playwright george axelrod in 1968, by which time rosaleen
• relatively hidden location and proximity to piccadilly circus, the street suffers from crime, which has led to westminster city council gating off the man in

Transformer factor 50 in layer 10
Explanation: Doing something again, or making something new again
• 2007 saw the show undergo a revamp, which included a switch to recording in hdtv, the introduction
• during the ship’ s 1930 reconstruction; the maximum elevation of the main guns was increased to+ 43 degrees, increasing their maximum range from 25@,
• hurricane pack 1 was a revamped version of story mode; team ninja tweaked the
• she was fitted with new engines and more powerful water@-@ tube boilers rated at 6@
• from 1988 to 2000, the two western towers were substantially overhauled with a viewing platform provided at the top of the north tower.
• latest missoula downtown master plan in 2009, increased emphasis was directed toward redeveloping the north side’ s former rail yard and the area
• 1896: the ribbon of the army version medal of honor was redesigned with all stripes being vertical.
• the new badge includes a star to represent the european cup win in 1982, and
• missoula downtown master plan in 2009, increased emphasis was directed toward redeveloping the north side’ s former rail yard and the area just
• also assisted in comprehensive infrastructure renovations, restored a dependable supply of electricity, revamped the baggage handling facilities as well as the arrival and departure lounge
• hurricane pack 1 was a revamped version of story mode; team ninja tweaked the encounters
• 1896: the ribbon of the army version medal of honor was redesigned with all stripes being vertical.
• from 1988 to 2000, the two western towers were substantially overhauled with a viewing platform provided at the top of the north tower
• assisted in comprehensive infrastructure renovations, restored a dependable supply of electricity, revamped the baggage handling facilities as well as the arrival and departure lounges
• bond series and the fourth to star roger moore as bond; the plot was significantly changed from the novel to include excursions into space.

Transformer factor 51 in layer 10
Explanation: apostrophe s, possessive
• the irish times was critical of the book’ s text but wrote positively of the included photographs.
• if it survived long enough to become old@-@ fashioned it was likely to be
• you by phil spector as his inspirations, which resulted to the album’ s wall of sound resonance.
• the irish times was critical of the book’ s text but wrote positively of the included photographs.
• album to the wu tang clan and nine inch nails, particularly comparing the album’ s production( which was done by various producers with executive producer don gilmore
• to the wu tang clan and nine inch nails, particularly comparing the album’ s production( which was done by various producers with executive producer don gilmore)
• toward the commoners and interested in easing their burden but suspicious about the letter’ s true purpose, reluctantly signed the document under intense pressure from the french
• the novel’ s reception was even warmer than that of its predecessor; waugh was
• first song selected for inclusion after her mother’ s recommendation and the song’ s melancholic lyrics.
• it divided critics at the time; although they praised the game’ s writing and scale of choice, they criticized its technical flaws.
• mgm executive al lewin said that several years after the film’ s release stroheim asked him for the cut footage.
• the game’ s production was turbulent, as the design’ s scope exceeded the available resources
• nicki escudero from the phoenix new times noted the song’ s superficial themes which included lyrics about" sex, money and cheating"
• mgm executive al lewin said that several years after the film’ s release stroheim asked him for the cut footage.
• labrie said that there was" a lot of discussion" about the song’ s wording and how direct it should be.

Transformer factor 86 in layer 10
Explanation: Pattern: Consecutive years, the convention for naming football/rugby game seasons
• with york the previous season, signed a contract until the end of 2013 – 14 and sheffield united midfielder elliott whitehouse signed on a one@-@
• as of the end of the 2014 – 15 season, aston villa have spent 104 seasons in the top tier of english
• won 13 and drew two of their opening 15 league matches of the 1985 – 86 campaign, and seemed destined to win the first division title.
• mcallister, still without a goal in 2009 – 10, couldn’ t get on the scoresheet in the three games
• john bentley led united to a fourth@-@ place finish in 1912 – 13.
• he made 46 appearances, scoring three goals, in the 2001 – 02 season before spending the close season with the kalamazoo kingdom in the
• he moved to basingstoke town towards the end of 2001 – 02, making his debut in march 2002.
• 7[ note 1] was the worst record in the nhl for 2011 – 12 and the first time in franchise history they finished in last place.
• side, who withdrew from the football league at the end of the 1893 – 94 season after finishing bottom of the second division.
• spent a year as a physics instructor at the university of minnesota in 1916 – 17, then two years as a research engineer with the westinghouse lamp
• defeat was a 7 – 2 loss to witton albion in the 2001 – 02 season.
• york achieved three successive wins for the first time in 2013 – 14 after beating northampton 2 – 0 away, with bowman and fletcher scoring in
• he started to develop more of an offensive game, finishing off the 2001 – 02 season with 58 points in the 47 games he played in seattle.
• suart limited matthews to 19 league appearances in 1958 – 59.
• jaw warriors of the western hockey league( whl) during the 2000 – 01 season.

Transformer factor 99 in layer 10
Explanation: past tense
• r. in their review of rihanna’ s top 20 songs, time out ranked" man down" as their tenth best track, writing that it is
• rolling stone ranked" imagine" number three on its list of" the 500 greatest songs
• japan’ s computer entertainment rating organization( cero) rated ninja gaiden and black, on their release, as 18+
• ultimate classic rock ranked" lola" as the kinks’ third best song, saying"
• adrien begrand of popmatters described" south of heaven" as" an unorthodox set opener
• columbia records released it as the album’ s fourth and final single on june 14,
• rolling stone ranked it the best song of 2009 and the 36th@-@ best song
• indielondon’ s jack foley noted" wind it up" as a highlight of the sweet escape and called
• premiere magazine listed frank booth, played by dennis hopper, as the fifty@-@
• columbia records released" crazy in love" on may 18, 2003, as the lead
• the times considered the production the best since the original, and praised it for its fidelity
• the good food guide ranked hibiscus as the eighth@-@ best restaurant in the uk
• viz media later began releasing the manga as simply" ral grad" in february 2008.
• entertainment weekly magazine ranked" crazy in love" forty@-@ seven in its list of
• the japanese publisher nihon bungeisha released the series in collected volumes from january 2000 to september 2009.

Transformer factor 102 in layer 10
Explanation: African name
• s 1966 to 1971 live performances in paris, prepared to press the album once mwanga provided the label with the record< unk>.
• of america" with the nhk symphony orchestra, but cancelled both deals upon mwanga’ s return from japan.
• 1966 to 1971 live performances in paris, prepared to press the album once mwanga provided the label with the record< unk>.
• and langston hughes, and by modern african poets and folk artists such as kwesi brew and efua sutherland, which also influenced her auto
• america" with the nhk symphony orchestra, but cancelled both deals upon mwanga’ s return from japan.
• du bois was buried in accra near his home, which is now the du bois memorial centre.
• du bois returned to africa in late 1960 to attend the inauguration of nnamdi azikiwe as the first african governor of nigeria.
• david mcgurk, lanre oyebanjo, danny parslow, tom platt and chris smith signed new
• and moderate nationalist parties, the most prominent of which was bishop abel muzorewa’ s united african national council( uanc).
• a few weeks after of human feelings was recorded, mwanga went to japan to negotiate a deal with trio records to have the
• and was part of two large campaigns, one to witu and another to mwele.
• returned to africa in late 1960 to attend the inauguration of nnamdi azikiwe as the first african governor of nigeria.
• in april, mwanga arranged another session at cbs studios in new york city, and coleman
• the government and moderate nationalist parties, the most prominent of which was bishop abel muzorewa’ s united african national council( uanc).
• ralambo’ s father, andriamanelo, had established rules of succession by which ralambo’

Transformer factor 125 in layer 10
Explanation: Describing someone in a paraphrasing style. Name, Career
• journalist tim judah suggests that the move may have been motivated by a desire to control a
• the historian nora berend says that the latter measure" may have adversely affected
• from the pyx that were not assayed, and numismatic historian roger burdette speculates that ashbrook, generally well@-
• the pyx that were not assayed, and numismatic historian roger burdette speculates that ashbrook, generally well@-@
• the cricket historian derek birley notes that many of these bowlers imitated the methods of
• critic roberta reeder notes that the early poems always attracted large numbers of admirers
•; the figures for the last two years are not available, but sf historian mike ashley estimates that fantastic paid circulation may have been as low as 13@
• aesthetically, ign’ s tal blevins noted that the game had" a very distinct 40s
• sf historian everett bleiler notes that hersey did not mention the venture in his
• similarly, duke university professor, mark anthony neal, writes, “ nas was at the forefront of a renaissance
• club reviewer erik adams wrote that the episode was a perfect mix, between the more subtle
• commenting on the album and its use of samples, pitchfork’ s jeff weiss claims that both nas and his producers found inspiration for the album’
• the historian stanley karnow said of ky and thi:" both fl
• that were not assayed, and numismatic historian roger burdette speculates that ashbrook, generally well@-@ treated by the
• irataba was described as an eloquent speaker, and linguist leanne hinton suggests that he was among the first mohave people to

Transformer factor 134 in layer 10
Explanation: Transition sentence
• fanny workman have tended to slight or belittle her achievements, but contemporaries, unaware of the far greater accomplishments to come, held the workmans
• scheduled to air in its regular half@-@ hour time slot, but nbc later announced it would be expanded to fill an hour time slot beginning a
• wine and savoy cabbage with a red wine and smoked chocolate sauce, but he otherwise felt that the food was" over@-@ worked" and the
• lap melee when he was hit by romain grosjean; webber was forced to pit straight away, while grosjean was given a ten@
• ra. one was initially scheduled to release on 3 june 2011, but delays due to a lengthy post@-@ production process and escalating
• yamina nomads who were centered at tuttul, and the rebels were supported by yamhad’ s king sumu@-@
• the item was intended simply as a piece of news, but telegraph lines quickly spread the news throughout the state, fueling procession sentiment.
• both twc and comcast began trials of services based on the system; turner broadcasting was an early supporter of the system, providing access to tbs and
•k> have claimed that he proposed a dictatorship for robespierre, but nonetheless some of them considered him to be redeemable, or at least
•@ 2 style with superfiring pairs of turrets fore and aft; the middle turrets were not superfiring, and had a funnel between them.
• romani being ordered to move out with supplies for the advancing troops, but 150 men, most of whom were past the end of their contracts and entitled to
•’ s boats to enter the creek into which the schooner had fled, the small craft entering the waterway in the hope of storming and capturing the vessel
•-@ person shooter elements and a unique on rails control scheme, but the core adventure@-@ style gameplay has been compared to myst and snatch
• stanza 6; movement 4 incorporates ideas from stanzas 7 – 14, and movement 5 relies on stanzas 15 and< unk> movement 2,
• as corps troops that were usually allocated at a rate of one per division; several of the militia units were also later designated australian imperial force units, after

Transformer factor 152 in layer 10
Explanation: in some locations
• while most breeding stallions and racehorses of the era had stable companions, waxy reportedly was fond of rabbits in his later years and
• planet and the helter skelter music bookshop have also been based on the street.
• the central bank of somalia, the national monetary authority, also has its headquarters in mogadishu.
• some allotropes of the other actinides also exhibit similar behaviour, though to a lesser degree.
•; fortune 1000 technology company< unk>, for instance, is headquartered in the area.
• musical@-@ comedy television series maid marian and her merry men were filmed in cleeve abbey.
• ireland,< unk> and donegal bay in particular, have popular surfing beaches, being fully exposed to the atlantic ocean.
•lstoy’ s war and peace and chekhov’ s peasants both feature scenes in which wolves are hunted with hounds and< unk>.
• while most breeding stallions and racehorses of the era had stable companions, waxy reportedly was fond of rabbits in his later years and"
• the lancashire and england test cricketer paul allott was born in altrincham.
•asura, the demon devotee of shiva, are both credited with building temples or cut caves to live.
• forbidden planet and the helter skelter music bookshop have also been based on the street.
•thopedic shriners hospitals in the u. s. is also located in spokane.
• dykes to watch out for and fun home, was born in lock haven in 1960.
•< unk>, and alessandra ambrosio have each worn two fantasy bras.

Transformer factor 297 in layer 10 with saliency map
Explanation: repetitive structure detector
frontier works, and an original soundtrack by avex group were created based on the game. drama cd: tales of graces 1 to 4 are side stories that take place during the game’ s plot. they were released between may 26, 2010 and august 25, 2010. anthology drama cd: tales of graces f 2010 winter, anthology drama cd: tales of graces f 2011 summer, anthology drama cd: tales of graces f 2012 winter, anthology drama cd: tales of graces f 2012 summer, anthology drama cd: tales of graces f 2013 winter, and anthology drama cd: tales of graces f 2013 cobrand platinum cardholders, and citibank eva air cobrand world card) the infinity( infinity mileagelands diamond, royal laurel/ premium laurel class passengers, star alliance first/ business class passengers, american express centurion/ eva air cobrand platinum cardholders, and citibank eva air cobrand world cardholders) the star( infinity mileagelands diamond/ gold, royal laurel/ premium laurel class passengers, star alliance first/ business class passengers, star alliance gold members, american express centurion/ eva air cobrand platinum cardholders, citibank eva air cobrand world cardholders, business customers, quickly set online< unk> alight"." can’ t get you out of my head" was chosen as the lead single from minogue’ s eighth studio album fever, and it was released on 8 september 2001 by parlophone in australia, while in the united kingdom and other european countries it was released on 17 september." can’ t get you out of my head" was written and produced by cathy dennis and rob davis, who had been put together by british artist manager simon fuller, who wanted the duo to come up with a song for british pop group s club 7. the song was recorded using cuba typhoon status with two@-@ minute sustained winds estimated at 125 km/ h( 78 mph). around 1700 utc on may 31, the storm tracked approximately 65 km( 40 mi) west of iwo jima. 
roughly five hours later, it moved within 15 km( 10 mi) of chichi@-@ jima where a pressure of 992 mb( hpa; 29@.@ 30 inhg) was measured. sustained winds on chichi@-@ jima reached 95 km/ h( 60 mph); however, these were determined to be unrepresentative of lucille’ s actual intensity due first book in vocal music. the modern music series. book 1. new york, new york: silver burdette and company. smith, eleanor( 1901). a second book in vocal music. the modern music series. book 2. new york, new york: silver burdette and company. smith, eleanor( 1901). a third book in vocal music. the modern music series. book 3. new york, new york: silver burdette and company. smith, eleanor( 1905). a fourth book in vocal music. the modern music series. book 4. new york, new york: silver burde @ breaking eight weeks at number one on the airplay chart of the country and became the first to garner 3000 radio plays in a single week. subsequently, it became the most@-@ played song of 2001 in the region." can’ t get you out of my head" was certified platinum by the british phonographic industry for shipments of 600@,@ 000 units in 2001. the certification was upgraded to double@-@ platinum in 2015, denoting shipments of 1@,@ 200@,@ 000 units. in the united states," can’ t get you out of my head" peaked at number seven on the chart. in mid@-@ august 2015," la mordidita" earned martin his twenty@-@ sixth top ten hit on hot latin songs. he became the fourth artist with the most top tens in the 29@-@ year history of the chart. in late august 2015, martin earned with" la mordidita" his fifteenth number@-@ one on the latin airplay chart( up 58 percent, to 11@.@ 8 million audience impressions). eventually," la mordidita" peaked at number six on the us hot latin songs chart, number one on latin airplay and , was delivered to sukhoi’ s experimental workshop to be outfitted with exclusive systems. built by knaapo, its structure has increased carbon@-@ fibre and al@-@ li content. 
installed was the 2d thrust@-@ vectoring lyulka al@-@ 31fp, an interim measure pending the availability of the al@-@ 37fu(< unk>< unk>," afterburner@-@ controlled"). the 3d thrust@-@ vectoring lyulka al@-@ 37fu was still in development. the al@-@ 31fp, in ke’ s former band, though escape the fate only charted at number 25, seven spots lower than the drug in me is you, despite equal sales. in its second week on sales, the drug in me is you dropped about 70% in the united states, selling 5@,@ 870 copies. this dropped the album 60 spots to number 79 on the billboard 200, and brought total us sales for the album to around 24@,@ 000 copies. on the billboard charts, the drug in me is you charted at number two on the top hard rock albums chart, number three on the top alternative albums and top rock albums charts, no, no, no", reached number one on the billboard hot r& b/ hip@-@ hop singles& tracks and number three on the billboard hot 100. its follow@-@ up single," with me part 1" failed to reproduce the success of" no, no, no". meanwhile, the group featured on a song from the soundtrack album of the romantic drama why do fools fall in love and" get on the bus" had a limited release in europe and other markets. in 1998, destiny’ s child garnered three soul train lady of soul awards including best new artist for" no, no, no oistic warmongers. alexander krivenko( jonathan adams) finally, introduced in trivial games and paranoid pursuits, is russian alexander krivenko, the commander of the moonbase where the ispf have their headquarters. a winner of the nobel prize for medicine, it is krivenko’ s research into bone damage that has contributed to enabling humanity to access space easily. although the star cops are independent, spring’ s relationship with krivenko is often deferential and he frequently seems to capitulate to krivenko’ s wishes.== production history===== origins= that build faith: from the life and ministry of thomas s. 
monson, salt lake city, utah: deseret book, isbn 978@-@ 0@-@ 87579@-@ 901@-@ 8 — —( 1996), faith rewarded: a personal account of prophetic promises to the east german saints, salt lake city, utah: deseret book, isbn 978@-@ 1@-@ 57345@-@ 186@-@ 4 — —( 1997), invitation to exaltation, salt lake city, utah: deseret book, isbn 978@- dell, tom( 2015). gunnerkrigg court volume 5:< unk>. gunnerkrigg court. archaia studios press. isbn 978@-@< unk>.=== side comics=== siddell, tom( 2013). annie in the forest part one. beyond the walls. robot voice comics. siddell, tom( 2013). annie in the forest part two. beyond the walls. robot voice comics. siddell, tom( 2015). traveller. beyond the walls. robot voice comics.=== explanatory footnotes====== 95@.@ 4 kn) f402@-@ rr@-@< unk> engine, while later examples were fitted with the 23@,@ 000 lbf( 105@.@ 8 kn) f402@-@ rr@-@ 408a. in the early 2000s, 17 tav@-@ 8bs were upgraded to include a night@-@ attack capability, the f402@-@ rr@-@ 408 engine, and software and structural changes.< unk> in 1991, the night attack harrier was the first upgrade of the av@-@ 8 , aitrus’ meeting with ti’ ana, and the birth of their son gehn. the book also explains the destruction of the d’ ni civilization. two d’ ni, veovis and a’ gaeris, plot to destroy their civilization, which they believe has been corrupted. veovis and a’ gaeris create a plague which wipes out many of the d’ ni and follows them through the ages. veovis is murdered by a’ gaeris for refusing to write an age where the two of them would have been worshipped as gods, and aitrus sacrifices himself in order to inants". a gbrmpa briefing stated the company had" threatened a compensation claim of$< unk> should the gbrmpa intend to exert authority over the company’ s operations". 
in response to the< unk> of the dumping incidents, the gbrmpa stated: we have strongly encouraged the company to investigate options that don’ t entail releasing the material to the environment and to develop a management plan to eliminate this potential hazard; however, gbrmpa does not have legislative control over how the< unk> tailings dam is managed.===< unk>=== following a of warped tour. following this, a lesson in romantics was released on july 10 through fearless records. in august, the band went on tour with olympia and sound the alarm. the music video for" when i get home, you’ re so dead", directed by marco de la torre, was filmed in september. in late september 2007, the band supported paramore in japan and australia. the band went on a co@-@ headlining tour with madina lake in october and november. the" when i get home, you’ re so dead" music video was released on november 14, and the single was released on of the english football league including promotion and relegation. the player’ s team begins with a low rating in an 8@-@ team league. by winning games, the player earns credits, which can be used to purchase the contracts of free agents. by finishing high in the 8@-@ team league, the player’ s team advances to a 16@-@ team league and eventually a 32@-@ team league. the player improves their team by periodically signing free agents, as the competition is tougher in each league. the player wins the mode after winning a playoff tournament in the 32@-@ team league oda ministra kultury i sztuki ii< unk>) 1972 – member of commission" poland 2000" of the polish academy of sciences 1973 prize of the minister of foreign affairs for popularization of polish culture abroad(< unk> ministra< unk>< unk> za< unk>< unk> kultury za< unk>) literary prize of the minister of culture and art(< unk>< unk> ministra kultury i sztuki) and honorary member of science fiction writers of america 1976 – state prize 1st level in the area of to power the antarctic outpost. 
above earth, ba’ al’ s armada arrives. to the displeasure of his subordinates, the other system lords, ba’ al announces that he will treat the tau< unk> leniently. suspicious about ba’ al’ s thorough knowledge of earth, qetesh betrays him and forces him to tell her everything. she orders the destruction of mcmurdo and the ancient outpost in ba’ al’ s name, but she kills ba’ al when teal’ c discovers what she is doing. as teal’ c escapes to an al< unk>, qetesh ( 156+ kn) each fuel capacity: 18@,@ 000 lb( 8@,@ 200 kg) internally, or 26@,@ 000 lb( 12@,@ 000 kg) with two external fuel tanks performance maximum speed: at altitude: mach 2@.@ 25( 1@,@ 500 mph, 2@,@ 410 km/ h)[ estimated] supercruise: mach 1@.@ 82( 1@,@ 220 mph, 1@,@ 960 km/ h) range:> 1@,@ 600 nmi( 1@,@ 840 mi, 2@,@ 960

Transformer factor 322 in layer 10 with saliency map
Explanation: biography, someone born in some year…
. only three pitchers threw more complete games in major league careers shorter than getzein’ s nine@-@ year career. getzein had his most extensive playing time with the detroit wolverines, compiling records of 30@-@ 11 and 29@-@ 13 in 1886 and 1887. in the 1887 world series( which detroit won, 10 games to 5), getzein pitched six complete games and compiled a 4@-@ 2 record with a 2@.@ 48 era. he also won 23 games for the boston beaneaters in 1890.== early years== getzein was born in 1864 and telegraph lines and networks. the west construction company, based in chattanooga, tennessee, was a general contracting and construction firm also involved in the operation and maintenance of railway, telephone, and telegraph lines.== personal life===== marriage and children=== on april 10, 1875, in hampshire county, flournoy married frances" fannie" ann armstrong white( april 10, 1844 – february 25, 1922), the daughter of hampshire county clerk of court john baker white and his wife frances ann streit white. 
frances white’ s brother, robert white, served as west virginia attorney general, and her buffalo, new york businessman who made his fortune in five@-@ and@-@ dime stores. he merged his more than 100 stores with those of his first cousins, frank winfield woolworth and charles woolworth, to form the f. w. woolworth company. he went on to hold prominent positions in the merged company as well as marine trust co. he was the father of seymour h. knox ii and grandfather of seymour h. knox iii and northrup knox, the co@-@ founders of the buffalo sabres in the national hockey league.== biography== he was born in april 1861 in russell, saint lawrence stars for eighteen years. the american film institute( afi) ranked cooper eleventh on its list of the twenty five greatest male stars of classic hollywood cinema.== early life== frank james cooper was born on may 7, 1901, at 730 eleventh avenue in helena, montana to english immigrants alice( nee brazier, 1873 – 1967) and charles henry cooper( 1865 – 1946). his father emigrated from houghton regis, bedfordshire and became a prominent lawyer, rancher, and eventually a montana supreme court justice. his mother emigrated from gillingham, kent and married charles in montana. in 1906, charles purchased the 600@-@ acre orange( 1971), which kubrick pulled from circulation in the uk following a mass media frenzy — most of his films were nominated for oscars, golden globes, or bafta awards. his last film, eyes wide shut, was completed shortly before his death in 1999.== early life== stanley kubrick was born on july 26, 1928, in the lying@-@ in hospital at 307 second avenue in manhattan, new york city. he was the first of two children of jacob leonard kubrick( may 21, 1902 – october 19, 1985), known as jack or jacques, and his wife sadie gertrude kubrick managed with a catch and release regulation. trophy trout and wild brook trout enhancement regulations apply to the remainder. 
a total of 31 class a wild trout waters have been designated as wilderness trout streams. fishing in class a wild trout waters is permitted year@-@ round, although the killing of fish is forbidden from labor day to the beginning of the following year’ s trout season.== gallery=== henry bell gilkeson= henry bell gilkeson( june 6, 1850 – september 29, 1921) was an american lawyer, politician, school administrator, and banker in west virginia. gilkeson was born in moorefield, movement, there have been few more remarkable figures than marjory stoneman douglas."== early life== marjory stoneman was born on april 7, 1890, in minneapolis, minnesota, the only child of frank bryant stoneman( 1857 – 1941) and lillian trefethen( 1859 – 1912), a concert violinist. one of her earliest memories was her father reading to her the song of hiawatha, at which she burst into sobs upon hearing that the tree had to give its life in order to provide hiawatha the wood for a canoe. she was an early and voracious reader amazon. com.=== dvd release==== johann mickl= johann mickl( 18 april 1893 – 10 april 1945) was an austrian@-@ born generalleutnant and division commander in the german army during world war ii, and was one of only 882 recipients of the knight’ s cross of the iron cross with oak leaves. he was commissioned shortly before the outbreak of world war i, and served with austro@-@ hungarian forces on the eastern and italian fronts as company commander in the imperial@-@ royal mountain troops. during world war i he was decorated several times for bravery and leadership, and very unusual properties, such as a quantum critical point behavior, exotic superconductivity, and high@-@ temperature ferromagnetism.= babe ruth= george herman ruth jr.( february 6, 1895 – august 16, 1948), better known as babe ruth, was an american professional baseball player whose career in major league baseball( mlb) spanned 22 seasons, from 1914 through 1935. 
nicknamed" the bambino" and" the sultan of swat", he began his mlb career as a stellar left@-@ handed pitcher for the boston red sox, but achieved his greatest fame as a slugging outfielder for the air in regular scheduled services. it includes the city, country, airport and the period in which the airline served the airport. hubs are denoted with a dagger().= william s. taylor= william sylvester taylor( october 10, 1853 – august 2, 1928) was the 33rd governor of kentucky. he was initially declared the winner of the disputed gubernatorial election of 1899, but the kentucky general assembly, dominated by the democrats, reversed the election results, giving the victory to his democratic party( united states) opponent, william goebel. taylor served only 50 days as governor. a poorly educated but politically astute lawyer, taylor woods hole, massachusetts, where he studied marine bioluminescence. he also worked at the woods hole oceanographic institution.== early life== george thomas reynolds was born in trenton, new jersey on may 27, 1917, the son of george w. reynolds, a< unk> for the pennsylvania railroad, and his wife laura, a secretary with the new jersey department of geology. he attended franklin junior high school in highland park, new jersey, until year 10, and then new brunswick high school. he received a bachelor’ s degree in physics from rutgers university in 1939. he then entered princeton university, where was awarded == shaughnessy was born on march 6, 1892 in st. cloud, minnesota, the second son of lucy ann( foster) and edward shaughnessy. he attended north st. paul high school, and prior to college, had no athletic experience. when he attended the university of minnesota, however, he played college football under head coach henry l. williams and alongside halfback bernie bierman. shaughnessy considered williams to be football’ s greatest teacher, and williams considered him to be the best passer from the midwest. 
shaughnessy handled both the passing and kicking duties for the team. he played on s gregoras likewise avoids negative comments, as do most modern historians.= george nicol( baseball)= george edward nicol( october 17, 1870 – august 4, 1924) was an american baseball pitcher and outfielder who played three seasons in major league baseball( mlb). he played for the st. louis browns, chicago colts, pittsburgh pirates and louisville colonels from 1890 to 1894. possessing the rare combination of batting right@-@ handed and throwing left@-@ handed, he served primarily as a right fielder when he did not pitch. signed by the browns without having previously played any minor league baseball, nicol made his dispatched powell and major benjamin mcculloch to utah to ease tensions with brigham young and the mormons. powell assumed his senate seat on his return from utah, just prior to the election of abraham lincoln as president. powell became an outspoken critic of lincoln’ s administration, so much so that the kentucky general assembly asked for his resignation and some of his fellow senators tried to have him expelled from the body. both groups later renounced their actions. powell died at his home near henderson, kentucky shortly following a failed bid to return to the senate in 1867.== early life== powell was born on october 6, 1812 near henderson, the army in 1948. he was promoted to lieutenant general just before his retirement on 29 february 1948 in recognition of his leadership of the bomb program. by a special act of congress, his date of rank was backdated to 16 july 1945, the date of the trinity nuclear test. groves went on to become a vice@-@ president at sperry rand.== early life== leslie richard groves jr. was born in albany, new york, on 17 august 1896, the third son of four children of a pastor, leslie richard groves sr., and his wife gwen nee griffith. 
a descendant of french huguenots who , burns died on november 11, 1928 in brooklyn, new york.== biography== thomas p. burns was born on september 6, 1864, in philadelphia. his parents, patrick and mary burns, were both irish immigrants. in 1883, burns began his professional baseball career as a pitcher with harrisburg of the minor@-@ league interstate association. on the year, burns posted an earned run average( era) of 2@.@ 30 over 20 games pitched, 15 of which were starts. when he wasn’ t pitching, burns played second and third base. burns began the 1884 season playing for the wilmington quicksteps, @ beats".== credits and personnel== lady gaga – vocals, songwriter and producer redone – songwriter, producer, vocal editing, vocal arrangement, audio engineering, instrumentation, programming, and recording at tour bus in europe trevor muzzy – recording, vocal editing, audio engineering, and audio mixing at larrabee, north hollywood, los angeles, california gene grimaldi – audio mastering at oasis mastering, burbank, california credits adapted from born this way album liner notes.== charts=== travis jackson= travis calvin jackson( november 2, 1903 – july 27, 1987) was an american baseball shortstop. = monson was born on august 21, 1927, in salt lake city, utah to g. spencer monson( 1901 – 1979) and gladys< unk> monson( 1902 – 1973). the second of six children, he grew up in a" tight@-@ knit" family — many of his mother’ s relatives living on the same street and the extended family frequently going on trips together. the family’ s neighborhood included several residents of mexican descent, an environment in which he says he developed a love for the mexican people and culture. monson often spent weekends with relatives on their farms in granger( it. anderson was a professional accordion player and wrote poetry for various american pagan magazines. 
in 1970, he published his first book of poetry, thorns of the blood rose, which contained devotional religious poetry dedicated to the goddess; it won the clover international poetry competition award in 1975. anderson continued to promote the feri tradition until his death, at which point april niino was appointed as the new grandmaster of the tradition.== early life===== childhood: 1917 – 1931=== anderson was born on may 21, 1917 at the buffalo horn ranch in clayton, new mexico. his parents were hilbart alexander anderson was elsewhere. he had recently become engaged and bought his first house in hillsborough. franklin and benjamin pierce were among the prominent citizens who welcomed president jackson to the state on his visit in mid@-@ 1833.=== marriage and children=== on november 19, 1834, pierce married jane means appleton( march 12, 1806 – december 2, 1863), the daughter of jesse appleton, a congregational minister and former president of bowdoin college, and elizabeth means. the appletons were prominent whigs, in contrast with the pierces’ democratic affiliation. jane was shy, devoutly religious, and pro@-@ temperance which took delivery of its eight and last globemaster in november 2015; no. 38 squadron, operating king airs; and the australian army’ s 68 ground liaison section. all units are based at amberley, with the exception of no. 38 squadron, located at townsville.= clark shaughnessy= clark daniel shaughnessy( originally o’ shaughnessy)( march 6, 1892 – may 15, 1970) was an american football coach and innovator. 
he is sometimes called the" father of the t formation" and the original founder of the forward pass, although that system had previously been used as early as the 1880s

Transformer factor 386 in layer 10 with saliency map
Explanation: topic: war
he was awarded a companion of the order of st michael and st george for his command of the 4th machine gun battalion, the recommendation of which particularly citing his success during attacks on the hindenburg line. murray’ s final honour came on 11 july 1919, when he was mentioned in despatches for the fourth time, having received his third mention on 31 december 1918. from june to september 1919, murray — along with fellow australian victoria cross recipient william donovan joynt — led parties of aif members on a tour of the farming districts of britain and denmark to study agricultural methods under the education schemes. after touring through france and belgium, from large@-@ calibre shells; one of them, allegedly a 14@-@ inch( 356 mm) round, blew a large hole in her quarterdeck and wrecked the wardroom and the gunroom. she also took several hits by light shells that day, and, although she suffered damage to her superstructure, her fighting and steaming capabilities were not seriously impaired. the ship also participated in the main attack on the dardanelles forts on 18 march. this time a 6@-@ inch( 152 mm) howitzer battery opened fire on agamemnon and hit her 12 times in 25 minutes; five of the . lt. riefkohl, who was also the first puerto rican to graduate from the united states naval academy, served as a rear admiral in world war ii. frederick l. riefkohl’ s brother, rudolph william riefkohl also served. riefkohl was commissioned a second lieutenant and assigned to the 63rd heavy artillery regiment in france where he actively participated in the meuse@-@ argonne offensive. 
according to the united states war department, after the war he served as captain of coastal artillery at the letterman army medical center in presidio of san francisco, in california( 1918). washington times@-@ herald, which ran the headline" hardy wild@-@ eyed aussies called world’ s finest troops". an article in the chicago daily news told its readers that australians" in their realistic attitude towards power politics, prefer to send their boys to fight far overseas rather than fighting a battle in the suburbs of sydney". during the battle, wavell had received a cable from general sir john dill stressing the political importance of such victories in the united states, where president franklin d. roosevelt was attempting to get the lend@-@ lease act passed. it was finally enacted in march 1941. mackay wrote . he also showed respect for occupied populations and never tolerated pillaging nor violence from his men. as a sign a gratitude, he was offered gifts several times but he was often seen refusing and sending them back. while on campaign in tyrol, he was recorded to have accepted a large sum of money but he immediately distributed it to the local hospitals. further evidence of his humanity was the care that he displayed for the lives and well@-@ being of his men, whom he was always reluctant to sacrifice for the sake of glory. overall as a heavy cavalry commander, nansouty was one of the best men available during the napoleonic @ 000 troops on 11 february. in march 1919, princess matoika and rijndam raced each other from saint@-@ nazaire to newport news in a friendly competition that received national press coverage in the united states. rijndam, the slower ship, was just able to edge out the princess — and cut two days from her previous fastest crossing time — by appealing to the honor of the soldiers of the 133rd field artillery( returning home aboard the former holland america liner) and employing them as extra stokers for her boilers. 
on her next trip, the veteran transport loaded troops at saint@-@ nazaire @ july, met his wife in new york, and together they traveled to columbus, georgia by way of washington, d. c. and atlanta.== military schools== for the ten years following world war i, troy middleton would be either an instructor or a student in the succession of military schools that army officers attend during their careers. middleton arrived in columbus, georgia with strong praise from his superiors, and would soon get his efficiency report, in which brigadier general benjamin poore of the 4th division wrote of him," the best all@-@ around officer i have yet seen.< unk> by his rapid promotion from coal and 700 long tons( 710 t) of fuel oil and that provided her a range of 3@,@ 500 nautical miles( 6@,@ 500 km) at a speed of 10 knots( 19 km/ h). her main armament consisted of a dozen obukhovskii 12@-@ inch( 305 mm) pattern 1907 52@-@ calibre guns mounted in four triple turrets distributed the length of the ship. the russians did not believe that super firing turrets offered any advantage as they discounted the value of axial fire and believed that super firing turrets could not fire while over the lower turret because of ’ ll still be playing from 2007" and awarded it" playstation 3@-@ exclusive game of the year".= 11th battalion( australia)= the 11th battalion was an australian army battalion that was among the first infantry units raised during world war i for the first australian imperial force. it was the first battalion recruited in western australia, and following a brief training period in perth, the battalion sailed to egypt where it undertook four months of intensive training. in april 1915 it took part in the invasion of the gallipoli peninsula, landing at anzac cove. in august 1915 the battalion was in action in the battle of lone pine. following was transferred to western australia, being attached to the 6th brigade, which was based around geraldton. 
in september 1942, as part of an army@-@ wide reduction that came about because of over@-@ mobilisation, the battalion was amalgamated with the 14th battalion to become the 14th/ 32nd battalion( prahran/ footscray regiment). in early 1943, the 14th/ 32nd battalion carried out amphibious warfare training in queensland before being deployed to the buna – gona area in new guinea in july. the battalion would remain in mainland new guinea and new britain for the next two years, under the command of lieutenant in an allied air raid on 10 december 1941, mickl was appointed to temporarily command the division. during december, mickl was wounded in the head and hand, but remained at his post. rommel recommended mickl for the knight’ s cross of the iron cross, for his leadership at sidi rezegh, and it was duly awarded on 13 december 1941. the harsh conditions of desert warfare had begun to affect mickl’ s health, so at the end of december he was sent home on convalescent leave.=== eastern front======= 12th rifle brigade==== on 25 on to bijeljina which was taken against light partisan resistance late on 16 march. the 27th regiment then consolidated its position in bijeljina while the 28th regiment and the divisional reconnaissance battalion( german:< unk>) bore the brunt of the fighting as they advanced through< unk>, celic and koraj at the foot of the majevica mountains. sauberzweig later recorded that the 2nd battalion of the 28th regiment( ii/ 28)" at celic stormed the partisan defenses with( new) battalion commander hans hanke at the point" and that enemy forces withdrew after of matthews, the company 2ic, who had taken command almost immediately after the company commander was wounded. under his command, each of the platoons assaulted a different cluster of buildings to which they had been assigned during training on the replica village at hastings. 
the west side boys’ ammunition store was found and secured and, once the rest of the buildings had been cleared, the paras took up defensive positions to block any potential counter@-@ attack and patrols went into the immediate jungle in search of any west side boys hiding in the bushes. the village was completely secure by 08: 00 and the paras secured the approaches with claymore ), increased her metacentric height to 6@.@ 3 feet( 1@.@ 9 m) at deep load, and all of the changes to her equipment increased her crew to a total of 1@,@ 188. despite the bulges she was able to reach a speed of 21@.@ 75 knots( 40@.@ 28 km/ h; 25@.@ 03 mph). a brief refit in early 1927 saw the addition of two more four@-@ inch aa guns and the removal of the six@-@ inch guns from the shelter deck. about 1931, a high@-@ angle control became enraged at him, slapping him across the face. he began yelling:" your nerves, hell, you are just a goddamned coward. shut up that goddamned crying. i won’ t have these brave men who have been shot at seeing this yellow bastard sitting here crying." patton then reportedly slapped bennett again, knocking his helmet liner off, and ordered the receiving officer, major charles b. etter, not to admit him. patton then threatened bennett," you’ re going back to the front lines and you may get shot and killed, but you’ re going to fight. if you don’ t, i’ secondary guns, two of which were disabled. the ammunition stores for these two guns were set on fire and the magazines had to be flooded to prevent an explosion. the ship nevertheless remained combat effective, as her primary battery remained in operation, as did most of her secondary guns; konig could also steam at close to her maximum speed. other areas of the ship had to be counter@-@ flooded to maintain stability; 1@,@ 600 tons of water entered the ship, either as a result of battle damage or counter@-@ flooding efforts. 
the flooding rendered the battleship sufficiently low in the water to prevent the ship from being able in 1924 and rice institute, houston, texas in 1928. he dropped out of graduate school after one year and decided to hitchhike to san francisco. the lack of work meant hunger, so he chose to join the united states army’ s 11th cavalry regiment as a private on july 30, 1930, serving in monterey, california. after a year in the horse cavalry, parrish became an aviation cadet in june 1931 and subsequently qua lified as an enlisted pilot. he completed flight training in 1932 and was assigned to the 13th attack squadron at fort crockett, near galveston, texas. one year later in september 1933 parrish during the battle, murray was awarded the victoria cross. soon after his victoria cross action, he was promoted to major and earned a bar to his distinguished service order during an attack on the hindenburg line near bullecourt. promoted to lieutenant colonel in early 1918, he assumed command of the 4th machine gun battalion, where he would remain until the end of the war. returning to australia in 1920, murray eventually settled in queensland, where he purchased the grazing farm that would be his home for the remainder of his life. re@-@ enlisting for service in the second world war, he was appo inted as commanding officer 10 officers and 315 enlisted men, plus an additional four officers and 19 enlisted men if serving as a flotilla flagship.== construction and career== the ship was ordered on 7 july 1934 and laid down at deutsche werke, kiel, on 2 january 1935 as yard number< unk>. she was launched on 30 november 1935 and completed on 8 april 1937. she was named after max schultz who commanded the torpedo boat< unk> and was killed in action in january 1917. korvettenkapitan martin< unk> was appointed as her first captain. max schultz was assigned to the 1st destroyer division on 26 the command of otto von diederichs. 
the squadron participated in the fall maneuvers in 1894, which simulated a two@-@ front war against france and russia; deutschland’ s squadron acted as the russian fleet during the exercises. between 1894 and 1897, deutschland was rebuilt in the imperial dockyard in wilhelmshaven. the ship was converted into an armored cruiser; her heavy guns were removed and replaced with lighter weapons, including eight 15 cm( 5@.@ 9 in) and eight 8@.@ 8 cm( 3@.@ 5 in) guns. her entire rigging equipment was removed and two heavy military masts were installed called on many times to maintain order in times of disaster and to keep peace during periods of political unrest. oklahoma governor john c. walton used division troops to prevent the state legislature from meeting when they were preparing to impeach him in 1923. governor william h. murray called out the guard several times during the depression to close banks, distribute food and once to force the state of texas to keep open a free bridge over the red river which texas intended to collect tolls for, even after federal courts ordered the bridge not be opened. the division would go on to see combat in world war ii as one of four national guard divisions active during

Transformer factor 170 in layer 10 with saliency map. Explanation: topic: music production

2nd street tunnel and part of downtown los angeles spread out over a 48@-@ hour period. kesha explained the idea behind the video as well as the experience during an interview with mtv news; she said that the video was different from her other videos, noting that it was going to show a sexier side of herself. the music video for" we r who we r" is presented as an underground party. the video starts off with futuristic flashing lights. 
kesha, seen in a ponytail wearing gray and black makeup, chains, ripped stockings, and a sparkly one@-@ piece leotard made of shards of broken and several european territories), her" endless love" duet with luther vandross( number@-@ one in new zealand) and" against all odds" featuring westlife( number@-@ one in the united kingdom)." thank god i found you" was also omitted from the japanese track listing, and replaced with" all i want for christmas is you". for the album artwork, carey launched a social media campaign on april 12, 2015, whereby fans had to share a link to her website in order to reveal the cover which was concealed by a curtain. using the hashtag"< unk>", single," we belong together". he contained to add" but still, if mimi ’ s going to mine from her own extensive back catalog of ballads, those are the primo melodies to go for." a reviewer for dj booth thought that minaj" ruined" the song.=== music video=== the accompanying music video for the remix of" up out my f ace" was directed by carey’ s husband, nick cannon. minaj spoke about filming a video with carey and how she did not believe that the video would ever be released:" i didn ’ t even tell anyone i shot a video with the producer, few days after he had finished the composition, madonna completed writing the lyrics of" i don’ t give a". solveig understood that the lyrics were probable references towards madonna’ s life and thus received coverage in the press. however, he was not aware of the inner meaning behind the lyrics. with billboard magazine, the producer further explained: at first i thought we were going to work on one song; that was the original plan. let’ s try to work on one song and take it from there– not spend too much time thinking about the l egend, and do something that just makes sense. provided an additional and assistant engineering. all the instruments were provided by eriksen and hermansen while dean sang the background vocals. 
in may 2011, in the mix review, an analyzing commercial productions, mike senior of sound on sound revisited the original mixing of the song. according to him, before he started the mix, senior played the song a couple of times before releasing what thing about it" bugged" him. working it out, he noted that the harmony of the mix is undermined by the kick drum." what’ s my name?" contains basic harmonies that are a bar of f minor, a bar of a major practiced in their backyards and at< unk> salon, owned by knowles’ s mother, tina. the group would test routines in the salon, when it was on montrose boulevard in houston, and sometimes would collect tips from the customers. their try out would be critiqued by the people inside. during their school days, girl’ s tyme performed at local gigs. when summer came, mathew knowles established a" boot camp" to train them in dance and vocal lessons. after rigorous training, they began performing as opening acts for established r& b groups of that time such as swv, dru hill and immature. tina day reception at the greek embassy. upon return to greece, she was greeted at the airport by fans along with the music video of" my number one" playing on the video monitors. while in greece, she attended the opening ceremony of the european final four for the volleyball champions league in< unk>, where her song was played as she appeared on stage with cheerleaders. on march 29, paparizou arrived in valletta, malta where she signed autographs, appeared on television stations, and gave interviews to the local media. following malta, she traveled to serbia and montenegro where she gave additional interviews before moving on to and and her low hip@-@ grind during’ rude boy’ were the smash hits of her body language." deborah linton of city life wrote that rihanna" even manages to make a psychiatric couch look sexy". linton called the show’ s stage sets impressive and imaginative. 
rick massimo of the providence journal wrote that rihanna" looked like a neon@-@ sign rendition of herself during’ rehab’, rarely addressed the audience, and didn’ t rise above flat cliche in that until the very end of the show"." rehab" and rihanna’ s 2009 single" russian roulette" were excluded from the set only a few hours. he said:" there were a lot of tracks, but i just enjoyed it, to be honest. i knew how i wanted it to sound, and it was pretty much the last song we cut; a lot of the mixing was nailed in the production as well, which helped. dream did a great job producing this track." the bar one guitar track of" schoolin’ life" was entirely programmed. similarly, the live drum section in the hook was actually done with programmed drums. once the mixing was over, swivel’ s impression were as follows:[’ schoolin’ life] absolutely tour began on march 1, 2000 at the house of blues in los angeles, while other venues included paris olympia, trump taj mahal, brixton academy, the montreux jazz festival, and the essence jazz festival in new orleans. by july, the tour’ s first half had sold out in each city. the tour lasted nearly eight months, whil e performances went for up to three hours a night. the voodoo tour was taken internationally, with one of the most notable performances being the free jazz festival in brazil. the music video for" untitled( how does it feel)" portrayed d’ angelo as a sex symbol hobson noted that rihanna" rejects the victim stance" in the video for" man down", and elucidated that she played the role of a rape survivor who shot her attacker. she attributed the location of shooting the video in jamaica as significant, due to how the image of a gun proliferated during 1990s jamaican dance hall’ s to" express female rage". 
the prologue depicts rihanna as a" dark@-@ hooded" femme fatale whereby the narrative explains her motives for murder and provokes the spectator to sympathize with her because she danced in a provocative manner with a man in a club, which had a deep impact on delonge in that he spent a night up crying for him when he wrote the track." a little’ s enough" was inspired by a religious concept in which a god came to bring positive change on earth when it faces terrorism, war or famine." the war", an anthem about the iraq war and its death toll, is succeeded by" it hurts", a track about a friend of delonge with a cheating girlfriend." it’ s a terrible situation where my friend is being crushed from the inside out by all the manipulative stuff she’ s doing and this song’ s just took that dress out of the storage – it has a 27@-@ foot train and it was just all hand@-@ beaded and stuff and so i figured we might as well get a use out of it.’=== synopsis=== the video features carey readying for her wedding, and follows her to the altar, as well as her escape from the reception. many of the actors featured in carey’ s" it’ s like that" video were in that of" we belong together", which was shot as a continuation from the" it’ s like that" video. it begins with 3 in dutch@-@ speaking flanders and number 2 in french@-@ speaking wallonia. it was certified gold by the belgian entertainment association( bea) for selling more than 15@,@ 000 copies. although the song spent only 1 week on the italian singles chart( at number 8), it was certified platinum by the federazione industria musicale italiana( fimi) in 2014 for selling more than 30@,@ 000 copies.== music video===== background and synopsis=== anthony mandler directed the music video for" man down" in april 2011 on a beach in at numbers 18 and 43 in the united states, and experienced moderate success worldwide. 
unlike her previous records, spears did not heavily promote blackout; her only televised appearance for blackout was a universally@-@ panned performance of" gimme more" at the 2007 mtv video music awards.== background and development== in november 2003, while promoting her fourth studio album in the zone, spears told entertainment weekly that she was already writing songs for her next album and was also hoping to start her own record label in 2004. henrik jonback confirmed that he had written songs with her during the european leg of the onyx hotel tour," of albums also had increased sales due to discounting and publicity generated by the single and her performance. billboard estimated that her top@-@ 10 digital sales collectively increased over 1@,@ 700 percent. madonna’ s bestselling album was the 2009 greatest@-@ hits collection, celebration, which sold 16@,@ 000 copies( up 1@,@ 341 percent) and reentered the billboard 200 album chart. the following week celebration fell 105 spots on the chart to number 157, with sales falling to 4@,@ 000 copies." give me all your luvin’" fell to number 39 on the hot opened the performance with" yeah 3x" and was dressed in a white formal suit, accompanied by" full@-@ skirted dancers". brown was eventually joined onstage by tuxedo@-@ clad dancers and began dancing to the 1993 wu@-@ tang clan single" protect ya neck". his dance routine then moved into 1991, where he danced to nirvana’ s" smells like teen spirit". brown’ s performance then came back to the future, where he began to sing" beautiful people". while performing the song, he was suspended in the air, and then lowered to another stage where he continued to register that she didn’ t know she had." from the moment she was signed in the film, madonna had expressed interest in recording a dance version of" don’ t cry for me argentina". 
according to her publicist liz rosenberg," since she didn’ t write the music and lyrics, she wanted her signature on that song… i think on her mind, the best way to do it was go in the studio and work up a remix". for this, in august 1996, while still mixing the film’ s soundtrack, madonna hired remixers pablo flores and javier garza. according to flores, the singer d accumulated until then but that was instead an ideal marriage of production and performance." instead, the red lights on the stage played up the" ominous" tone of the song as it gradually increased its tempo to the point whereby the end of the song was on the verge of sounding like an incantation. for the diamonds world tour, rihanna performed" man down" in a caribbean@-@ theme section of the show, which also included" you da one"," no love allowed"," what’ s my name?" and" rude boy". james lachno of the telegraph highlight the caribbean@-@ themed edge of several realities: the film, the dream it inspires, the waking world it illuminates". the music in" i just can’ t stop loving you", a duet with siedah garrett, consisted mainly of finger snaps and timpani." just good friends", a duet with stevie wonder, was viewed by critics as sounding good at the beginning of the song, ending with a" chin@-@ bobbing cheerfulness"." the way you make me feel"’ s music consisted of blues harmonies. the lyrics of" another part of me" deal with being united, as" we not manufactured. no one paid these kids."=== live performances=== one direction performed" what makes you beautiful" on red or black? on 10 september 2011. the performance started with hosts ant& dec announcing that the band was supposedly running late for their appearance, and cut to a video of one direction boarding a london tube carriage full of fans, as the studio version of the song began playing. each fan on the tube was given a numbered ticket. 
the band and fans disembarked the tube and made their way to the television studio, where the remainder of the song was sung live. after the song,

This is the end of visualization of high-level transformer factor.

Table: S2.T1: Several examples of low-level transformer factors. Their top-activated words in layer 4 are marked blue, and the corresponding contexts are shown as examples for each transformer factor. As shown in the table, nearly all of the top-activated words are disambiguated into a single sense. Note that the last example for $\Phi_{:,33}$ is a rare exception; see the appendix for a more complete list. More examples, top-activated words, and contexts are provided in the Appendix.

| Factor | Top 3 activated words and their contexts | Explanation |
|---|---|---|
| $\Phi_{:,2}$ | • that snare shot sounded like somebody’ d kicked open the door to your mind". • i became very frustrated with that and finally made up my mind to start getting back into things." • when evita asked for more time so she could make up her mind, the crowd demanded," ¡ ahora, evita,< | • Word “mind” • Noun • Definition: the element of a person that enables them to be aware of the world and their experiences. |
| $\Phi_{:,16}$ | • nington joined the five members xero and the band was renamed to linkin park. • times about his feelings about gordon, and the price family even sat away from park’ s supporters during the trial itself. • on 25 january 2010, the morning of park’ s 66th birthday, he was found hanged and unconscious in his | • Word “park” • Noun • Definition: a common first and last name |
| $\Phi_{:,30}$ | • saying that he has left the outsiders, kovu asks simba to let him join his pride • eventually, all boycott’ s employees left, forcing him to run the estate without help. • the story concerned the attempts of a scientist to photograph the soul as it left the body. | • Word “left” • Verb • Definition: leaving, exiting |
| $\Phi_{:,33}$ | • forced to visit the sarajevo television station at night and to film with as little light as possible to avoid the attention of snipers and bombers. • by the modest, cream@-@ colored attire in the airy, light@-@ filled clip. • the man asked her to help him carry the case to his car, a light@-@ brown volkswagen beetle. | • Word “light” • Noun • Definition: the natural agent that stimulates sight and makes things visible |

Table: S2.T2: Evaluation of binary POS tagging task: predict whether or not “left” in a given context is a verb.

| Method | Precision (%) | Recall (%) | F1 score (%) |
|---|---|---|---|
| Average perceptron POS tagger | 92.7 | 95.5 | 94.1 |
| Finetuned BERT base model for POS task | 97.5 | 95.2 | 96.3 |
| Logistic regression classifier with activation of $\Phi_{:,30}$ at layer 4 | 97.2 | 95.8 | 96.5 |
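
The last row of the table above uses a single scalar feature, the activation of $\Phi_{:,30}$ at layer 4, as input to a logistic regression classifier. The following is a minimal sketch of that kind of setup, not the authors' evaluation code: the function names `fit_logistic_1d` and `precision_recall_f1`, the plain gradient-descent fit, and the learning-rate and iteration defaults are all assumptions made for illustration.

```python
import numpy as np

def precision_recall_f1(y_true, y_pred):
    """Precision, recall and F1 for binary labels (1 = "left" is a verb)."""
    y_true = np.asarray(y_true, dtype=bool)
    y_pred = np.asarray(y_pred, dtype=bool)
    tp = np.sum(y_true & y_pred)    # true positives
    fp = np.sum(~y_true & y_pred)   # false positives
    fn = np.sum(y_true & ~y_pred)   # false negatives
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return float(precision), float(recall), float(f1)

def fit_logistic_1d(a, y, lr=0.5, n_iter=2000):
    """Fit p(y=1 | a) = sigmoid(w * a + b) by gradient descent on the mean log loss.

    a: 1-D array of activations of one transformer factor (hypothetical input).
    y: 1-D array of 0/1 labels.
    """
    a = np.asarray(a, dtype=float)
    y = np.asarray(y, dtype=float)
    w, b = 0.0, 0.0
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-(w * a + b)))
        w -= lr * np.mean((p - y) * a)  # gradient of the log loss w.r.t. w
        b -= lr * np.mean(p - y)        # gradient of the log loss w.r.t. b
    return w, b
```

A prediction is then `w * a + b > 0`. Because the feature is one-dimensional, the classifier amounts to a learned threshold on the factor's activation.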

Table: S3.T3: A list of typical mid-level transformer factors. The top-activation words and their context sequences for each transformer factor at layer 8 are shown in the second column. We summarize the pattern of each transformer factor in the third column. The last 4 columns are the percentages of the top 200 activated words and sequences that contain the summarized pattern in layers 4, 6, 8, and 10 respectively.

| Factor | 2 example words and their contexts with high activation | Patterns | L4 (%) | L6 (%) | L8 (%) | L10 (%) |
|---|---|---|---|---|---|---|
| $\Phi_{:,13}$ | • the steel pipeline was about 20 ° f(- 7 ° c) degrees. • hand( 56 to 64 inches( 140 to 160 cm)) war horse is that it was a | Unit conversion in parentheses | 0 | 0 | 64.5 | 95.5 |
| $\Phi_{:,42}$ | • he died at the hospice of lancaster county from heart ailments on 23 june 2007. • holly’ s drummer carl bunch suffered frostbite to his toes( while aboard the | Something unfortunate happened | 94.0 | 100 | 100 | 100 |
| $\Phi_{:,50}$ | • hurricane pack 1 was a revamped version of story mode; • in 1998, the categories were retitled best short form music video, and best | Doing something again, or making something new again | 74.5 | 100 | 100 | 100 |
| $\Phi_{:,86}$ | • he finished the 2005 – 06 season with 21 appearances and seven goals. • of an offensive game, finishing off the 2001 – 02 season with 58 points in the 47 games | Consecutive years, used in football season naming | 0 | 100 | 85.0 | 95.5 |
| $\Phi_{:,102}$ | • the most prominent of which was bishop abel muzorewa’ s united african national council • ralambo’ s father, andriamanelo, had established rules of succession by | African names | 99.0 | 100 | 100 | 100 |
| $\Phi_{:,125}$ | • music writer jeff weiss of pitchfork describes the" enduring image" • club reviewer erik adams wrote that the episode was a perfect mix | Describing someone in a paraphrasing style (name, career) | 15.5 | 99.0 | 100 | 98.5 |
| $\Phi_{:,184}$ | • the world wide fund for nature( wwf) announced in 2010 that a biodiversity study from • fm) was halted by the federal communications commission( fcc) due to a complaint that the company buying | Institution with abbreviation | 0 | 15.5 | 39.0 | 63.0 |
| $\Phi_{:,193}$ | • 74, 22@,@ 500 vietnamese during 1979 – 92, over 2@,@ 500 bosnian •, the russo@-@ turkish war of 1877 – 88 and the first balkan war in 1913. | Time span in years | 97.0 | 95.5 | 96.5 | 95.5 |
| $\Phi_{:,195}$ | • s, hares, badgers, foxes, weasels, ground squirrels, mice, hamsters •-@ watching, boxing, chess, cycling, drama, languages, geography, jazz and other music | Consecutive nouns (enumeration) | 8.0 | 98.5 | 100 | 100 |
| $\Phi_{:,225}$ | • technologist at the united states marine hospital in key west, florida who developed a morbid obsession for • 00°,11”, w, near smith valley, nevada. | Places in the US, following the convention “city, state” | 51.5 | 91.5 | 91.0 | 77.5 |

Table: S3.T4: We construct adversarial texts that are similar to, but different from, the pattern “Consecutive adjective”. The last column shows the activation of $\Phi_{:,35}$, or $\alpha^{(8)}_{35}$, w.r.t. the blue-marked word in layer 8.

| | Adversarial text | Explanation | $\alpha_{35}$ |
|---|---|---|---|
| (o) | album as "full of exhilarating, ecstatic, thrilling, fun and sometimes downright silly songs" | The original top-activated word and its context sentence for transformer factor $\Phi_{:,35}$ (not an adversarial text) | 9.5 |
| (a) | album as "full of delightful, lively, exciting, interesting and sometimes downright silly songs" | Replace the adjectives in sentence (o) with different adjectives. | 9.2 |
| (b) | album as "full of unfortunate, heartbroken, annoying, boring and sometimes downright silly songs" | Replace the adjectives in sentence (o) with negative adjectives. | 8.2 |
| (c) | album as "full of [UNK], [UNK], thrilling, [UNK] and sometimes downright silly songs" | Mask the adjectives in sentence (o) with unknown tokens. | 5.3 |
| (d) | album as "full of thrilling and sometimes downright silly songs" | Remove the first three adjectives in sentence (o). | 7.8 |
| (e) | album as "full of natural, smooth, rock, electronic and sometimes downright silly songs" | Replace the adjectives in sentence (o) with neutral adjectives. | 6.2 |
| (f) | each participant starts the battle with one balloon. these can be re@-@ inflated up to four | Use a random sentence. | 0.0 |
| (g) | The book is described as "innovative, beautiful and brilliant". It receive the highest opinion from James Wood | A sentence we created that contains the consecutive-adjective pattern. | 7.9 |

Figure: Building block (layer) of transformer.

Figure panel: (a) layer 0.

Figure: Two examples of the highly activated words and their contexts for transformer factor $\Phi_{:,297}$. We also provide the saliency map of the tokens generated using LIME. This transformer factor corresponds to the concept: “repetitive pattern detector”. In other words, repetitive text sequences will trigger high activation of $\Phi_{:,297}$.

Figure: Visualization of $\Phi_{:,322}$. This transformer factor corresponds to the concept: “someone born in some year” in a biography. All of the high-activation contexts contain the beginning of a biography. As shown in the figure, the attributes of a person (name, age, career, and familial relation) all have high saliency weights.

$$ x = \Phi \alpha + \epsilon, \quad \text{s.t.}\ \alpha \succeq 0 $$

$$ \min_{w\in\mathbb{R}^{T}}\mathcal{L}(f,w,\mathcal{S}(s))+\sigma\|w\|_{2}^{2} $$

$$ \alpha(x)=\arg\min_{\alpha\in\mathbb{R}^{m}}\|x-\Phi\alpha\|_{2}^{2}+\lambda\|\alpha\|_{1} $$

$$ f((s,i))=\alpha(x^{(l)}(s,i)) $$

$$ h(s)_{t}=\begin{cases}0 & \text{if } s[t]=\text{[UNK]}\\ 1 & \text{otherwise}\end{cases} $$

$$ \Omega=\operatorname{diag}\bigl(d(h(\mathcal{S}_{1}),\vec{1}),\,d(h(\mathcal{S}_{2}),\vec{1}),\,\cdots,\,d(h(\mathcal{S}_{n}),\vec{1})\bigr) $$

$$ X=X^{(1)}\cup X^{(2)}\cup\cdots\cup X^{(L)} $$

$$ \min_{\Phi} \tfrac{1}{2} \| X - \Phi \Omega A\|_{F}^{2},\quad \|\Phi_{:,j}\|_{2} \leq 1 $$

$$ \min_{A} \tfrac{1}{2} \| X - \Phi A\|_{F}^{2} + \lambda\sum_{i} \|\alpha_i\|_{1},\quad \text{s.t.}\ \alpha_i \succeq 0 $$

$$ \text{apple} = 0.09\,\text{``dessert''} + 0.11\,\text{``organism''} + 0.16\,\text{``fruit''} + 0.22\,\text{``mobile\&IT''} + 0.42\,\text{``other''} $$


\DeclareMathOperator{\EMD}{EMD} \usepackage{array} \usepackage{dcolumn} \usepackage{tabularx} \newcolumntype{P}[1]{>{\centering\arraybackslash}m{#1}}

\usepackage[a4paper]{geometry} \usepackage{tabularx} \usepackage{lipsum} \begin{document}

\maketitle \begin{abstract} Transformer networks have revolutionized NLP representation learning since they were introduced. Though a great effort has been made to explain the representation in transformers, it is widely recognized that our understanding is not sufficient. One important reason is the lack of visualization tools for detailed analysis. In this paper, we propose to use dictionary learning to open up these `black boxes' as linear superpositions of transformer factors. Through visualization, we demonstrate the hierarchical semantic structures captured by the transformer factors, e.g., word-level polysemy disambiguation, sentence-level pattern formation, and long-range dependency. While some of these patterns confirm conventional prior linguistic knowledge, the rest are relatively unexpected, which may provide new insights. We hope this visualization tool can bring further knowledge and a better understanding of how transformer networks work. The code is available at \url{https://github.com/zeyuyun1/TransformerVis}. \end{abstract}

\section{Introduction} Though the transformer networks \cite{vaswani2017attention, devlin2018BERT} have achieved great success, our understanding of how they work is still fairly limited. This has triggered increasing efforts to visualize and analyze these ``black boxes''. Besides a direct visualization of the attention weights, most of the current efforts to interpret transformer models involve ``probing tasks''. A probing task attaches a lightweight auxiliary classifier to the output of the target transformer layer. Only the auxiliary classifier is then trained on a well-known NLP task such as part-of-speech (POS) tagging, named-entity recognition (NER) tagging, or syntactic dependency parsing. \citet{tenney2019you} and \citet{liu-etal-2019-linguistic} show that transformer models perform very well on those probing tasks. These results indicate that transformer models have learned the language representation related to the probing tasks. Though probing tasks are great tools for interpreting language models, their limitations are explained in \citet{Anna2020Bertology}. We summarize the limitations in three major points:
\begin{itemize}
\item Most probing tasks, like POS and NER tagging, are too simple. Good performance on those probing tasks does not reflect the model's true capacity.
\item Probing tasks can only verify whether a certain prior structure is learned in a language model. They cannot reveal structures beyond our prior knowledge.
\item It is hard to locate where exactly the related linguistic representation is learned in the transformer.
\end{itemize}
Efforts have been made to remove those limitations and make probing tasks more diverse. For instance, \citet{hewitt-manning-2019-structural} propose the ``structural probe'', which is a much more intricate probing task. \citet{Zhengbao2020Know} propose to generate specific probing tasks automatically. Non-probing methods have also been explored to relieve the last two limitations.
For example, \citet{emily2019VisBert} visualize embeddings from BERT using UMAP and show that the embeddings of the same word under different contexts are separated into different clusters. \citet{Kawin2019Contextual} analyzes the similarity between embeddings of the same word in different contexts. Both of these works show that transformers provide a context-specific representation.

\citet{faruqui-etal-2015-sparse, arora2018linear, zhang2019word} demonstrate how to use dictionary learning to explain, improve, and visualize the uncontextualized word embedding representations. In this work, we propose to use dictionary learning to alleviate the limitations of the other transformer interpretation techniques. Our results show that dictionary learning provides a powerful visualization tool, leading to some surprising new knowledge.

\section{Method} {\noindent \bf Hypothesis: contextualized word embedding as a sparse linear superposition of transformer factors.} It has been shown that word embedding vectors can be factorized into a sparse linear combination of word factors \cite{arora2018linear, zhang2019word}, which correspond to elementary semantic meanings. An example is: \begin{align*} \text{apple} =\ & 0.09\,\text{``dessert''} + 0.11\,\text{``organism''} + 0.16\,\text{``fruit''} \\ & + 0.22\,\text{``mobile\&IT''} + 0.42\,\text{``other''}. \end{align*} We view the latent representation of words in a transformer as contextualized word embedding. Similarly, we hypothesize that a contextualized word embedding vector can also be factorized as a sparse linear superposition of a set of elementary elements, which we call \textit{transformer factors}. The exact definition will be presented later in this section.

\begin{figure}[h] \centering \includegraphics[width=0.3\textwidth]{trans_7.png} \caption{Building block (layer) of transformer} \end{figure} Due to the skip connections in each of the transformer blocks, we hypothesize that the representation in any layer would be a superposition of the hierarchical representations in all of the lower layers. As a result, the output of a particular transformer block would be the sum of all of the modifications along the way. Indeed, we verify this intuition with the experiments. Based on the above observation, we propose to learn a single dictionary for the contextualized word vectors from different layers' output.

\vspace{0.1in} {\noindent \bf To learn a dictionary of transformer factors with non-negative sparse coding.}

Given a set of tokenized text sequences, we collect the contextualized embedding of every word using a transformer model. We define the set of all word embedding vectors from the $l$-th layer of the transformer model as $X^{(l)}$. Furthermore, we collect the embeddings across all layers into a single set $X = X^{(1)} \cup X^{(2)} \cup \cdots \cup X^{(L)}$.
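
As a rough illustration of this pooling step, the sketch below stacks per-layer embedding matrices into one matrix $X$ while remembering which layer each row came from, so that $X^{(l)}$ can be recovered later. The function name `collect_embeddings` and the input layout (one `(num_tokens, d)` array per layer) are assumptions for illustration; how the hidden states are extracted from the transformer is left out.

```python
import numpy as np

def collect_embeddings(hidden_states):
    """Pool contextualized embeddings from all layers into one matrix.

    hidden_states: list of L arrays, each of shape (num_tokens, d),
                   where entry l-1 holds X^{(l)} (assumed layout).
    Returns:
        X:         array of shape (L * num_tokens, d), the pooled set X.
        layer_ids: integer array giving the layer index (1..L) of each row.
    """
    X = np.concatenate(hidden_states, axis=0)
    layer_ids = np.concatenate([
        np.full(h.shape[0], l, dtype=int)
        for l, h in enumerate(hidden_states, start=1)
    ])
    return X, layer_ids
```

With this layout, `X[layer_ids == l]` gives back the rows of $X^{(l)}$, which is what the per-layer analyses later in this section need.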

By our hypothesis, we assume each embedding vector $x \in X$ is a sparse linear superposition of \textit{transformer factors}:

\begin{equation}\label{sparse} x = \Phi \alpha + \epsilon, \ s.t. \ \alpha \succeq 0, \end{equation} where $\Phi\in{\rm I\!R}^{d\times m}$ is a dictionary matrix with columns $\Phi_{:,c}\ $, $\bm{\alpha}\in{\rm I\!R}^m$ is a sparse vector of coefficients to be inferred and $\bm{\epsilon}$ is a vector containing independent Gaussian noise samples, which are assumed to be small relative to $\bm{x}$. Typically $m>d$ so that the representation is {\em overcomplete}. Given $\Phi$, the inverse problem of inferring $\bm{\alpha}$ can be efficiently solved by the FISTA algorithm \cite{beck2009fast}. The dictionary matrix $\Phi$ can be learned in an iterative fashion by using non-negative sparse coding, which we leave to the appendix section \ref{sec:optimization}. Each column $\Phi_{:,c}\ $ of $\Phi$ is a {\it transformer factor} and its corresponding sparse coefficient $\bm{\alpha}_c$ is its activation level.
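
To make the inference step concrete, here is a minimal sketch of FISTA applied to the non-negative sparse coding objective $\tfrac{1}{2}\|x-\Phi\alpha\|_2^2+\lambda\|\alpha\|_1$ subject to $\alpha \succeq 0$. It is not taken from the authors' repository: the function name `nonneg_fista`, the fixed step size $1/L$, and the defaults for `lam` and `n_iter` are assumptions chosen for illustration.

```python
import numpy as np

def nonneg_fista(x, Phi, lam=0.1, n_iter=200):
    """Approximately solve  min_a 0.5*||x - Phi a||_2^2 + lam*||a||_1  s.t. a >= 0.

    x:   embedding vector of shape (d,)
    Phi: dictionary of shape (d, m), columns are the factors
    """
    m = Phi.shape[1]
    # Lipschitz constant of the gradient of the quadratic term:
    # the squared spectral norm of Phi.
    L = np.linalg.norm(Phi, 2) ** 2
    a = np.zeros(m)   # current iterate
    y = a.copy()      # extrapolated point used for the gradient step
    t = 1.0           # momentum parameter
    for _ in range(n_iter):
        grad = Phi.T @ (Phi @ y - x)
        # Proximal step for lam*||.||_1 combined with the constraint a >= 0:
        # a one-sided soft threshold.
        a_next = np.maximum(y - (grad + lam) / L, 0.0)
        t_next = (1.0 + np.sqrt(1.0 + 4.0 * t * t)) / 2.0
        y = a_next + ((t - 1.0) / t_next) * (a_next - a)
        a, t = a_next, t_next
    return a
```

The only difference from the unconstrained case is the proximal step: the usual two-sided soft threshold becomes `max(v - lam / L, 0)`, which enforces sparsity and non-negativity at once.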

\vspace{0.1in} {\noindent \bf Visualization by top activation and LIME interpretation.} An important empirical method to visualize a feature in deep learning is to use the input samples that trigger the top activations of the feature \cite{zeiler2014visualizing}. We adopt this convention. As a starting point, we try to visualize each of the dimensions of a particular layer, $X^{(l)}$. Unfortunately, the hidden dimensions of transformers are not semantically meaningful, which is similar to the case of uncontextualized word embeddings \cite{zhang2019word}.

Instead, we can try to visualize the transformer factors. For a transformer factor $\Phi_{:,c}$ and a layer $l$, we denote the 1000 contextualized word vectors with the largest sparse coefficients $\alpha^{(l)}_c$ as $X^{(l)}_c \subset X^{(l)}$, which correspond to 1000 different sequences. For example, Figure~\ref{CWF 17} shows the top 5 words that activate transformer factor-30, $\Phi_{:,30}$, at layer-$0$, layer-$2$, and layer-$4$ respectively. Since a contextualized word vector is generally affected by many tokens in the sequence, we can use LIME \cite{DBLP:journals/corr/RibeiroSG16} to assign a weight to each token in the sequence to identify its relative importance to $\alpha_c$. The detailed method is left to Section~\ref{sec:experiments}.

\vspace{0.1in} {\noindent \bf Determining low-, mid-, and high-level transformer factors with importance scores.} As we build a single dictionary for all of the transformer layers, the semantic meaning of the transformer factors has different levels. While some of the factors appear in lower layers and continue to be used in the later stages, the rest of the factors may only be activated in the higher layers of the transformer network. A central question in representation learning is: ``where does the network learn certain information?'' To answer this question, we can compute an ``importance score'' $I^{(l)}_c$ for each transformer factor $\Phi_{:,c}$ at layer-$l$. $I^{(l)}_c$ is the average of the largest 1000 sparse coefficients $\alpha^{(l)}_c$, which correspond to $X^{(l)}_c$. We plot the importance scores of each transformer factor as a curve, as shown in Figure~\ref{importance score}. We then use these importance score (IS) curves to identify the layer at which a transformer factor emerges. Figure~\ref{low} shows an IS curve that peaks in an earlier layer. The corresponding transformer factor emerges at an earlier stage and may capture lower-level semantic meanings. In contrast, Figure~\ref{mid} shows a peak in the higher layers, which indicates that the transformer factor emerges much later and may correspond to mid- or high-level semantic structures. More subtleties are involved when distinguishing between mid-level and high-level factors, which will be discussed later.

An important characteristic is that the IS curve for each transformer factor is relatively smooth. This indicates that if a vital feature is learned in the beginning layers, it does not disappear in later stages. Instead, it is carried all the way to the end with gradually decaying weight, since many more features join along the way. Similarly, abstract information learned in higher layers is slowly developed from the early layers. Figures~\ref{CWF 17} and \ref{CWF 35} confirm this idea, which will be explained in the next section.
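The importance score can be sketched in a few lines of Python. The snippet below is only an illustration under stated assumptions: the function name \texttt{importance\_scores}, the layout of the coefficients as one array of shape (tokens, factors) per layer, and the tiny hand-made arrays are hypothetical, and $k=2$ is used instead of 1000 so that the result can be checked by hand.

```python
import numpy as np

def importance_scores(alpha_by_layer, k=1000):
    """Mean of the k largest coefficients of every factor at every layer.

    alpha_by_layer: list over layers of arrays with shape (n_tokens, n_factors).
    Returns an array of shape (n_layers, n_factors), i.e. one IS curve per column.
    """
    scores = []
    for A in alpha_by_layer:
        n = A.shape[0]
        k_eff = min(k, n)
        # After partitioning, the last k_eff rows of each column hold its
        # k_eff largest values (in arbitrary order).
        top = np.partition(A, n - k_eff, axis=0)[n - k_eff:]
        scores.append(top.mean(axis=0))
    return np.stack(scores)

# Toy coefficients (synthetic): 3 layers, 5 tokens, 2 factors.
alpha_by_layer = [
    np.array([[1, 0], [3, 0], [2, 0], [0, 1], [0, 0]], dtype=float),
    np.array([[0, 2], [1, 4], [0, 0], [1, 0], [0, 0]], dtype=float),
    np.array([[0, 6], [0, 2], [1, 0], [0, 0], [0, 0]], dtype=float),
]
IS = importance_scores(alpha_by_layer, k=2)
peak_layer = IS.argmax(axis=0)  # layer index at which each IS curve peaks
print(IS)
print(peak_layer)
```

In this toy example the first factor peaks at the first layer and the second factor at the last one, which mirrors the distinction between an early-peaking and a late-peaking IS curve.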

\begin{figure}%
\centering
\subfloat[\centering]{{\includegraphics[width=0.5\linewidth]{images/low.png}\label{low} }}%
\subfloat[\centering]{{\includegraphics[width=0.5\linewidth]{mid} \label{mid}}}%
\caption{Importance score (IS) across all layers for two different transformer factors. (a) A typical IS curve of a transformer factor corresponding to low-level information. (b) A typical IS curve of a transformer factor corresponding to mid-level information.}%
\label{importance score}%
\end{figure}

\begin{figure*}[ht]%
\centering
\subfloat[\centering layer 0 \label{CWF 17 layer 0} ]{{\includegraphics[width=0.30\linewidth]{images/30_0.PNG} }}%
\quad
\subfloat[\centering layer 2]{{\includegraphics[width=0.30\linewidth]{images/30_2.PNG} }}%
\quad
\subfloat[\centering layer 4]{{\includegraphics[width=0.33\linewidth]{images/30_4.PNG} }}%

\caption{Visualization of a low-level transformer factor, $\Phi_{:,30}$, at different layers.
(a), (b) and (c) are the top-activated words and contexts for $\Phi_{:,30}$ in layer-$0$, $2$ and $4$ respectively. We can see that at layer-$0$, this transformer factor corresponds to word vectors that encode the word ``left'' with different senses. In layer-$2$, a majority of the top-activated words ``left'' correspond to a single sense, ``leaving, exiting.'' In layer-$4$, all of the top-activated words ``left'' correspond to the same sense, ``leaving, exiting.'' Due to space limitations, we invite the readers to use our \href{https://transformervis.github.io/transformervis/}{website} to see more of those disambiguation effects.
}%
\label{CWF 17}%

\end{figure*}

\begin{table*}[!h]
\small
\centering
\begin{tabular}{|m{0.05\linewidth} | m{0.6\linewidth} | m{0.25\linewidth} |}
\hline
& Top 3 activated words and their contexts & Explanation \\
\hline
$\Phi_{:,2}$ & • that snare shot sounded like somebody' d kicked open the door to your \textcolor{blue}{mind}".\newline• i became very frustrated with that and finally made up my \textcolor{blue}{mind} to start getting back into things."\newline• when evita asked for more time so she could make up her \textcolor{blue}{mind}, the crowd demanded," ¡ ahora, evita,< & • Word ``mind''\newline • Noun \newline • Definition: the element of a person that enables them to be aware of the world and their experiences.\\
\hline
$\Phi_{:,16}$ &•nington joined the five members xero and the band was renamed to linkin \textcolor{blue}{park}.\newline• times about his feelings about gordon, and the price family even sat away from \textcolor{blue}{park}' s supporters during the trial itself.\newline• on 25 january 2010, the morning of \textcolor{blue}{park}' s 66th birthday, he was found hanged and unconscious in his & • Word ``park'' \newline • Noun \newline • Definition: a common first and last name \\
\hline
$\Phi_{:,30}$ & • saying that he has \textcolor{blue}{left} the outsiders, kovu asks simba to let him join his pride\newline• eventually, all boycott' s employees \textcolor{blue}{left}, forcing him to run the estate without help.\newline• the story concerned the attempts of a scientist to photograph the soul as it \textcolor{blue}{left} the body.
& • Word ``left'' \newline • Verb \newline • Definition: leaving, exiting\\
\hline
$\Phi_{:,33}$ &• forced to visit the sarajevo television station at night and to film with as little \textcolor{blue}{light} as possible to avoid the attention of snipers and bombers.\newline• by the modest, cream@-@ colored attire in the airy, \textcolor{blue}{light}@-@ filled clip.\newline• the man asked her to help him carry the case to his car, a \textcolor{blue}{light}@-@ brown volkswagen beetle. & • Word ``light'' \newline • Noun \newline • Definition: the natural agent that stimulates sight and makes things visible\\
\hline


\end{tabular}
\caption{Several examples of low-level transformer factors. Their top-activated words in layer 4 are marked \textcolor{blue}{blue}, and the corresponding contexts are shown as examples for each transformer factor. As shown in the table, nearly all of the top-activated words are disambiguated into a single sense. Please note that the last example of $\Phi_{:,33}$ is a rare exception; the reader may check the appendix for a more complete list. More examples of top-activated words and contexts are provided in the appendix. }
\label{low level table}

\end{table*}

\begin{figure*}[!h]%
\centering
\subfloat[\centering \label{compare} ]{{\includegraphics[width=0.45\linewidth]{images/left_old.png} }}%
\subfloat[\centering \label{linear}]{{\includegraphics[width=0.5\linewidth]{images/linear_mul.png} }}%

\caption{(a) Average activation of $\Phi_{:,30}$ for the word vector ``left'' across different layers. (b) Instead of averaging, we plot the activation of all occurrences of ``left'' with different contexts in layer-$0$, $2$, and $4$. Random noise is added to the y-axis to prevent overplotting. The activations of $\Phi_{:,30}$ for the two different word senses of ``left'' are blended together in layer-$0$. They disentangle to a great extent in layer-$2$ and are nearly separable in layer-$4$ by this single dimension.}%
\label{linear mul}%

\end{figure*}

\begin{table}[ht]
\small
\centering
\begin{tabular}{|p{0.4\linewidth} | P{0.13\linewidth} |P{0.10\linewidth} |P{0.15\linewidth} |}
\hline
& Precision (\%) & Recall (\%) & F1 score (\%) \\
\hline
Averaged perceptron POS tagger & \vspace{0.1in} 92.7 & \vspace{0.1in} 95.5 & \vspace{0.1in} 94.1 \\
\hline
Finetuned BERT base model for POS task & \vspace{0.1in} 97.5 & \vspace{0.1in} 95.2 & \vspace{0.1in} 96.3 \\
\hline
Logistic regression classifier with activation of $\Phi_{:,30}$ at layer 4 & \vspace{0.18in} 97.2 & \vspace{0.18in} {\bf 95.8} & \vspace{0.18in} {\bf 96.5} \\
\hline
\end{tabular}
\caption{Evaluation of the binary POS tagging task: predict whether or not ``left'' in a given context is a verb.}
\label{evaluation}
\vspace{-0.1in}
\end{table}

\begin{figure*}%
\centering

\subfloat[\centering layer 4]{{\includegraphics[width=0.30\linewidth]{images/35_4.JPG} }}%
\subfloat[\centering layer 6]{{\includegraphics[width=0.33\linewidth]{images/35_6.JPG} }}%
\subfloat[\centering layer 8]{{\includegraphics[width=0.33\linewidth]{images/35_8.JPG} }}%

\caption{Visualization of a mid-level transformer factor. (a), (b), (c) are the top 5 activated words and contexts for this transformer factor in layer-$4$, $6$, and $8$ respectively. Again, the position of the word vector is marked \textcolor{blue}{blue}. Please note that sometimes only a part of a word is marked blue. This is because BERT uses a word-piece tokenizer rather than a whole-word tokenizer. This transformer factor corresponds to the pattern of ``consecutive adjective''. As shown in the figure, this feature starts to develop at layer-$4$ and is fully developed at layer-$8$. }%
\label{CWF 35}%

\end{figure*}

\begin{figure*}%
\centering

\subfloat[\centering layer 4]{{\includegraphics[width=0.33\linewidth]{images/13_4.JPG} }}%
\subfloat[\centering layer 6]{{\includegraphics[width=0.31\linewidth]{images/13_6.JPG} }}%
\subfloat[\centering layer 8]{{\includegraphics[width=0.33\linewidth]{images/13_8.JPG} }}%

\caption{Another example of a mid-level transformer factor visualized at layer-$4$, $6$, and $8$. The pattern that corresponds to this transformer factor is ``unit exchange''. Such a pattern is somewhat unexpected based on linguistic prior knowledge. }%
\label{CWF 13}%

\end{figure*}

\section{Experiments and Discoveries} \label{sec:experiments}

We use a 12-layer pre-trained BERT model \cite{PretrainedBERT,devlin2018BERT} and freeze the weights. Since we learn a single dictionary of transformer factors for all of the layers in the transformer, we show that these transformer factors correspond to different levels of semantic or syntactic patterns. The patterns can be roughly divided into three categories: word-level disambiguation, sentence-level pattern formation, and long-range dependency. In the following, we provide detailed visualization for each pattern category. Due to the space limit, only a small number of the factors are shown in the paper. To alleviate the ``cherry-picking'' bias, we also build a \href{https://transformervis.github.io/transformervis/}{website} for interested readers to explore these results.

\vspace{0.1in} {\noindent \bf Low-level: word-level polysemy disambiguation.} While the input embedding of a token contains polysemy, we find transformer factors with early IS curve peaks usually correspond to a specific word-level meaning. By visualizing the top activation sequences, we can see how word-level disambiguation is gradually developed in a transformer.

We show how the disambiguation effect develops progressively through each layer in Figure~\ref{CWF 17}, which lists the top 5 activated words and their contexts for transformer factor $\Phi_{:,30}$ in different layers. The top-activated words in layer 0 contain the word ``left'' with varying senses, which is mostly, albeit not completely, disambiguated in layer 2. In layer 4, the word ``left'' is fully disambiguated, since the top-activated words contain only ``left'' with the word sense ``leaving, exiting.'' We also show more examples of this type of transformer factor in Table~\ref{low level table}: for each transformer factor, we list the top 3 activated words and their contexts in layer 4. As shown in the table, nearly all top-activated words are disambiguated into a single sense.

Further, we can quantify the quality of the disambiguation ability of the transformer model. In the example above, since the top 1000 activated words and contexts are ``left'' with only the word sense ``leaving, exiting'', we can hypothesize that ``left'', when used as a verb, triggers a higher activation of $\Phi_{:,30}$ than ``left'' used as another part of speech. We can verify this hypothesis using a human-annotated corpus: the Brown corpus \cite{francis79browncorpus}. In this corpus, each word is annotated with its corresponding part of speech. We collect all the sentences that contain the word ``left'' annotated as a verb in one set, and the sentences that contain ``left'' annotated as another part of speech in a second set. As shown in Figure~\ref{compare}, in layer 0, the average activation of $\Phi_{:,30}$ for the word ``left'' marked as a verb is no different from that of ``left'' in other senses. However, at layer 2, ``left'' marked as a verb triggers a higher activation of $\Phi_{:,30}$. In layer 4, this difference further increases, indicating that disambiguation develops progressively across layers. In fact, we plot the activation of ``left'' marked as a verb and the activation of the other occurrences of ``left'' in Figure~\ref{linear}. In layer 4, they are nearly linearly separable by this single feature. Since each occurrence of ``left'' corresponds to an activation value, we can fit a logistic regression classifier to differentiate those two types of ``left''. From the results shown in Table~\ref{evaluation}, it is striking that a classifier using only the activation of $\Phi_{:,30}$ attains a higher recall and F1 score than the other two classifiers, which are trained with supervised data, at a comparable precision. This result supports the view that disambiguation is indeed done in the early part of the pre-trained transformer model and that we are able to detect it via dictionary learning.
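To make the single-feature classification concrete, the following Python sketch fits a one-dimensional logistic regression and reports precision, recall, and F1. It does not reproduce the experiment: the activations are synthetic draws from two Gaussians chosen for illustration, the helper names \texttt{fit\_logistic\_1d} and \texttt{precision\_recall\_f1} are hypothetical, and the scores are computed on the same samples used for fitting.

```python
import numpy as np

def fit_logistic_1d(z, y, lr=0.5, n_iter=2000):
    """Fit p(y=1 | z) = sigmoid(w * z + b) by gradient descent on the mean log-loss."""
    w, b = 0.0, 0.0
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-(w * z + b)))
        w -= lr * np.mean((p - y) * z)
        b -= lr * np.mean(p - y)
    return w, b

def precision_recall_f1(y_true, y_pred):
    """Binary precision, recall and F1 for the positive class (label 1)."""
    tp = np.sum((y_pred == 1) & (y_true == 1))
    fp = np.sum((y_pred == 1) & (y_true == 0))
    fn = np.sum((y_pred == 0) & (y_true == 1))
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return precision, recall, 2 * precision * recall / (precision + recall)

# Synthetic activations (assumption): verb uses of a word activate the factor
# strongly, other uses weakly. Activations are clipped at 0 to stay non-negative.
rng = np.random.default_rng(0)
act_verb = np.clip(rng.normal(6.0, 1.0, 200), 0.0, None)
act_other = np.clip(rng.normal(1.0, 1.0, 200), 0.0, None)
a = np.concatenate([act_verb, act_other])
y = np.concatenate([np.ones(200), np.zeros(200)])

z = (a - a.mean()) / a.std()          # standardize the single feature
w, b = fit_logistic_1d(z, y)
y_pred = (w * z + b > 0).astype(int)  # predict "verb" when p > 0.5
precision, recall, f1 = precision_recall_f1(y, y_pred)
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```

With a single scalar feature, the fitted classifier amounts to a threshold on the activation, which is why near-linear separability of the two groups translates directly into high precision and recall.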

\begin{table*}[!h]
\small
\centering
\begin{tabular}{|P{0.05\linewidth} | m{0.45\linewidth} | m{0.2\linewidth} | P{0.03\linewidth} | P{0.03\linewidth} | P{0.03\linewidth} | P{0.03\linewidth} | }
\hline
& 2 example words and their contexts with high activation & Patterns & L4 (\%) & L6 (\%)& L8 (\%)& L10 (\%)\\
\hline
$\Phi_{:,13}$ & • the steel pipeline was about 20 ° f(- 7 \textcolor{blue}{°} c) degrees.

   • hand( 56 to 64 inches( 140 to 160 \textcolor{blue}{cm})) war horse is that it was a & Unit exchange with parentheses &0 & 0 &64.5&95.5\\
\hline
$\Phi_{:,42}$ & • he died at the hospice of lancaster county from heart \textcolor{blue}{ai}lments on 23 june 2007.

• holly' s drummer carl bunch suffered \textcolor{blue}{frost}bite to his toes( while aboard the & Something unfortunate happened &94.0&100&100&100\\
\hline
$\Phi_{:,50}$ & • hurricane pack 1 was a re\textcolor{blue}{vam}ped version of story mode;

• in 1998, the categories were \textcolor{blue}{re}titled best short form music video, and best & Doing something again, or making something new again &74.5&100&100&100 \\
\hline
$\Phi_{:,86}$ & • he finished the 2005 – \textcolor{blue}{06} season with 21 appearances and seven goals.

• of an offensive game, finishing off the 2001 – \textcolor{blue}{02} season with 58 points in the 47 games
& Consecutive years, used in football season naming &0&100&85.0&95.5\\
\hline
$\Phi_{:,102}$ & • the most prominent of which was bishop abel \textcolor{blue}{mu}zorewa' s united african national council

• ralambo' s father, and\textcolor{blue}{riam}anelo, had established rules of succession by
& African names &99.0&100&100&100\\
\hline
$\Phi_{:,125}$ & • music writer \textcolor{blue}{jeff} weiss of pitchfork describes the" enduring image"

• club reviewer \textcolor{blue}{erik} adams wrote that the episode was a perfect mix & Describing someone in a paraphrasing style. Name, Career &15.5&99.0&100&98.5
\\
\hline
$\Phi_{:,184}$ & • the world wide fund for nature( \textcolor{blue}{wwf}) announced in 2010 that a biodiversity study from\newline• fm) was halted by the federal communications commission( \textcolor{blue}{fcc}) due to a complaint that the company buying & Institution with abbreviation &0&15.5&39.0&63.0 \\
\hline
$\Phi_{:,193}$ &• 74, 22@,@ 500 vietnamese during 1979 \textcolor{blue}{–} 92, over 2@,@ 500 bosnian\newline •, the russo@-@ turkish war of 1877 \textcolor{blue}{–} 88 and the first balkan war in 1913.& Time span in years &97.0&95.5&96.5&95.5 \\
\hline
$\Phi_{:,195}$ & •s, hares, badgers, foxes, \textcolor{blue}{weasel}s, ground squirrels, mice, hamsters\newline•-@ watching, boxing, chess, cycling, \textcolor{blue}{drama}, languages, geography, jazz and other music& Consecutive nouns (enumeration) &8.0&98.5&100&100 \\
\hline
$\Phi_{:,225}$ & • technologist at the united states marine hospital in key \textcolor{blue}{west}, florida who developed a morbid obsession for\newline• 00°,11'', w, near smith \textcolor{blue}{valley}, nevada. & Places in the US, following the convention ``city, state''&51.5&91.5&91.0&77.5\\
\hline
\end{tabular}
\caption{A list of typical mid-level transformer factors. The top-activation words and their context sequences for each transformer factor at layer-$8$ are shown in the second column. We summarize the patterns of each transformer factor in the third column. The last 4 columns are the percentage of the top 200 activated words and sequences that contain the summarized patterns in layer-$4$,$6$,$8$, and $10$ respectively.}
\label{Mid Unexpected}
\vspace{-0.1in}

\end{table*}

\vspace{0.1in} \begin{table*}[!h]

\small
\centering
\begin{tabular}{|P{0.05\linewidth} | m{0.45\linewidth} | m{0.30\linewidth} | P{0.05\linewidth} |}
\hline
& Adversarial Text & Explanation & $\alpha_{35}$\\
\hline
(o) & album as "full of exhilarating, ecstatic, \textcolor{blue}{thrilling}, fun and sometimes downright silly songs" & The original top-activated word and its context sentence for transformer factor $\Phi_{:,35}$ (not an adversarial text) & 9.5\\
\hline
(a) & album as "full of delightful, lively, \textcolor{blue}{exciting}, interesting and sometimes downright silly songs" & Replace the adjectives in sentence (o) with different adjectives. & 9.2 \\
\hline
(b) & album as "full of unfortunate, heartbroken, \textcolor{blue}{annoying}, boring and sometimes downright silly songs" & Replace the adjectives in sentence (o) with negative adjectives. & 8.2\\
\hline
(c) & album as "full of [UNK], [UNK], \textcolor{blue}{thrilling}, [UNK] and sometimes downright silly songs" & Mask the adjectives in sentence (o) with unknown tokens. & 5.3\\
\hline
(d) & album as "full of \textcolor{blue}{thrilling} and sometimes downright silly songs" & Remove the first three adjectives in sentence (o). & 7.8 \\
\hline
(e) & album as "full of \textcolor{blue}{natural}, smooth, rock, electronic and sometimes downright silly songs" & Replace the adjectives in sentence (o) with neutral adjectives. & 6.2 \\
\hline
(f) & each participant starts the battle \textcolor{blue}{with} one balloon. these can be re@-@ inflated up to four & Use a random sentence. & 0.0\\
\hline
(g) & The book is described as "innovative, \textcolor{blue}{beautiful} and brilliant". It receive the highest opinion from James Wood & We construct this sentence so that it contains the pattern of consecutive adjectives.& 7.9 \\
\hline
\end{tabular}
\caption{We construct adversarial texts that are similar to, but different from, the pattern ``consecutive adjective''. The last column shows the activation of $\Phi_{:,35}$, or $\alpha^{(8)}_{35}$, w.r.t. the blue-marked word in layer 8.}
\label{Adversarial Text}

\end{table*} {\noindent \bf Mid-level: sentence-level pattern formation.} We find that most of the transformer factors with an IS curve peak after layer 6 capture mid-level or high-level semantic meanings. In particular, the mid-level ones correspond to semantic patterns at the phrase and sentence level.

We first show two detailed examples of mid-level transformer factors. Figure~\ref{CWF 35} shows a transformer factor that detects the pattern of consecutive usage of adjectives. This pattern starts to emerge at layer 4, develops at layer 6, and becomes quite reliable at layer 8. Figure~\ref{CWF 13} shows a transformer factor that corresponds to a rather unexpected pattern: ``unit exchange'', e.g., 56 inches (140 cm). Although this exact pattern only starts to appear at layer 8, the sub-structures that make up this pattern, e.g., parentheses and numbers, appear to trigger this factor in layers 4 and 6. Thus this transformer factor is also gradually developed through several layers.

While some mid-level transformer factors verify common semantic or syntactic patterns, there are also many surprising mid-level transformer factors. We list a few in Table~\ref{Mid Unexpected} with quantitative analysis. For each listed transformer factor, we analyze the top 200 activating words and their contexts in each layer. We record the percentage of those words and contexts that correspond to the factor's semantic pattern in Table~\ref{Mid Unexpected}. From the table, we see that large percentages of top-activated words and contexts do correspond to the pattern we describe. It also shows that most of these mid-level patterns start to develop at layer 4 or 6. More detailed examples are provided in the appendix section \ref{sec:mid}. Though it is still mysterious why the transformer network develops representations for these surprising patterns, we believe such a direct visualization can provide additional insights that complement the ``probing tasks''.

To further confirm that a transformer factor does correspond to a specific pattern, we can use constructed example words and contexts to probe its activation. In Table~\ref{Adversarial Text}, we construct several text sequences that are similar to the pattern corresponding to a particular transformer factor but with subtle differences. The result confirms that a context that strictly follows the pattern represented by that transformer factor triggers a high activation. Moreover, the closer an adversarial example is to this pattern, the higher the activation it receives for this transformer factor.

\vspace{0.1in} {\noindent \bf High-level: long-range dependency.} High-level transformer factors correspond to linguistic patterns that span an extended range in the text. Since the IS curves of mid-level and high-level transformer factors are similar, it is difficult to distinguish those transformer factors based on their IS curves. Thus, we have to manually examine the top-activation words and contexts for each transformer factor to differentiate between mid-level and high-level transformer factors. To ease the process, we choose to use the black-box interpretation algorithm \emph{LIME} \cite{DBLP:journals/corr/RibeiroSG16} to identify the contribution of each token in a sequence. There also exist interpretation tools that specifically leverage the transformer architecture \citep{hila2021explainability,hila2021interpretability}. In the future, one could adapt those interpretation tools, which may potentially provide better visualization.

Given a sequence $s \in S$, we can treat $\alpha^{(l)}_{c,i}$, the activation of $\Phi_{:,c}$ in layer-$l$ at location $i$, as a scalar function of $s$, $f^{(l)}_{c,i}(s)$. Assume a sequence $s$ triggers a high activation $\alpha^{(l)}_{c,i}$, i.e., $f^{(l)}_{c,i}(s)$ is large. We want to know how much each token (or equivalently each position) in $s$ contributes to $f^{(l)}_{c,i}(s)$. To do so, we generate a sequence set $\mathcal{S}(s)$, where each $s'\in \mathcal{S}(s)$ is the same as $s$ except that several random positions in $s'$ are masked by [`UNK'] (the unknown token). Then we learn a linear model $g_{w}(s')$ with weights $w \in \mathbb{R}^{T}$ to approximate $f(s')$, where $T$ is the length of the sequence $s$. This can be solved as a ridge regression: $$\min_{w \in \mathbb{R}^T} \mathcal{L} (f,w,\mathcal{S}(s)) + \sigma \|w\|_2^2.$$
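One way to make the ridge objective above concrete is sketched below. It rests on assumptions that are not stated in the text: the loss $\mathcal{L}$ is taken to be an unweighted squared error, the linear model has no intercept, and it acts on the binary mask encoding $h(s') \in \{0,1\}^T$ defined in the appendix (1 for a kept token, 0 for a masked one).
\begin{equation*}
g_w(s') = w^\top h(s'), \qquad
\mathcal{L}(f,w,\mathcal{S}(s)) = \sum_{s' \in \mathcal{S}(s)} \big( f(s') - w^\top h(s') \big)^2 .
\end{equation*}
Stacking the encodings $h(s')$ as the rows of a matrix $H$ and the values $f(s')$ into a vector $y$, the minimizer has the usual ridge closed form
\begin{equation*}
w^\star = \big( H^\top H + \sigma I \big)^{-1} H^\top y ,
\end{equation*}
so that $w^\star_t$ estimates how much the activation drops when position $t$ is masked.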

The learned weights $w$ can serve as a saliency map that reflects the ``contribution'' of each token in the sequence $s$. As in Figure~\ref{297}, the color reflects the weight $w$ at each position. Red means the given position has a positive weight and green means a negative weight. The magnitude of the weight is represented by the intensity. The redder a token is, the more it contributes to the activation of the transformer factor. We leave more implementation and mathematical formulation details of the LIME algorithm to the appendix.

We provide detailed visualization for two different transformer factors that show long-range dependency in Figures~\ref{297} and \ref{322}. Since visualization of high-level information requires more extended context, we only show the top two activated words and their contexts for each such transformer factor. Many more are provided in the appendix section \ref{sec:high}.

We name the pattern for transformer factor $\Phi_{:,297}$ in Figure~\ref{297} the ``repetitive pattern detector''. All top-activated contexts for $\Phi_{:,297}$ contain an obvious repetitive structure. Specifically, the text snippet ``can't get you out of my head'' appears twice in the first example, and the text snippet ``xxx class passenger, star alliance'' appears three times in the second example. Compared to the mid-level patterns we found (e.g., Figure~\ref{CWF 13}), high-level patterns like the ``repetitive pattern detector'' are much more abstract. In some sense, the transformer detects whether there are two (or more) almost identical embedding vectors at layer-$10$ without caring what they are. Such behavior might be closely related to the concept proposed in capsule networks \cite{sabour2017dynamic,hinton2021represent}. Further understanding this behavior and studying how the self-attention mechanism helps model the relationships between the features is an interesting direction for future research.

Figure~\ref{322} shows another high-level factor, which detects text snippets related to ``the beginning of a biography''. The necessary components, namely the date of birth given as a month and a four-digit year, first name and last name, familial relation, and career, are all mid-level information. In Figure~\ref{322}, we see that all the information related to a biography has a high weight in the saliency map. Thus, these components are combined to detect the high-level pattern.

\begin{figure}[!h] \centering \includegraphics[width=0.49\textwidth]{images/278_f.png} \caption{Two examples of the highly activated words and their contexts for transformer factor $\Phi_{:,297}$. We also provide the saliency map of the tokens generated using LIME. This transformer factor corresponds to the concept: ``repetitive pattern detector''. In other words, repetitive text sequences trigger a high activation of $\Phi_{:,297}$.} \label{297} \end{figure}

\begin{figure}[!h] \centering \includegraphics[width=0.48\textwidth]{images/322.png} \caption{Visualization of $\Phi_{:,322}$. This transformer factor corresponds to the concept: ``someone born in some year'' in a biography. All of the high-activation contexts contain the beginning of a biography. As shown in the figure, the attributes of a person, i.e., name, age, career, and familial relation, all have high saliency weights.} \vspace{-0.1in} \label{322} \end{figure}

\section{Discussion} Dictionary learning has been successfully used to visualize the classical word embeddings \cite{arora2018linear, zhang2019word}. In this paper, we propose to use this simple method to visualize the representation learned in transformer networks to supplement the implicit ``probing-tasks'' methods. Our results show that the learned transformer factors are relatively reliable and can even provide many surprising insights into the linguistic structures. This simple tool can open up the transformer networks and show the hierarchical semantic or syntactic representation learned at different stages. In short, we find word-level disambiguation, sentence-level pattern formation, and long-range dependency. The idea that a neural network learns low-level features in early layers and abstract concepts in later stages is very similar to what has been observed in the visualization of CNNs \cite{zeiler2014visualizing}. Dictionary learning can be a convenient tool to help visualize a broad category of neural networks with skip connections, like ResNet \cite{he2016deep}, ViT models \cite{dosovitskiy2020image}, etc. For interested readers, we provide an interactive \href{https://transformervis.github.io/transformervis/}{website}\footnote{https://transformervis.github.io/transformervis/} to gain further insights.

\section*{Acknowledgements} We thank our reviewers for their detailed and insightful comments. We also thank Yuhao Zhang for his suggestions during the preparation of this paper.

\bibliography{naacl2021} \bibliographystyle{acl_natbib}

\clearpage
\appendix
\section*{Supplementary Materials}
\renewcommand{\thesubsection}{\Alph{subsection}}
\subsection{Importance Score (IS) Curves} \label{sec 1}
\begin{figure}[h]%
\centering
\subfloat[\centering]{{\includegraphics[width=0.45\linewidth]{images/low_level.png}\label{low more} }}%
\quad
\subfloat[\centering]{{\includegraphics[width=0.44\linewidth]{images/high_level.png} \label{high more}}}%
\caption{(a) Importance scores of 16 transformer factors corresponding to low-level information. (b) Importance scores of 16 transformer factors corresponding to mid-level information.}%
\label{importance score more}%
\end{figure}

The shape of an importance score curve corresponds closely to the category of its transformer factor. Based on the location of the peak of an IS curve, we can classify a transformer factor as low-level, mid-level or high-level. The importance score for low-level transformer factors peaks in early layers and slowly decreases across the rest of the layers. On the other hand, the importance score for mid-level and high-level transformer factors slowly increases and peaks at higher layers. In Figure~\ref{importance score more}, we show two sets of examples to demonstrate the clear distinction between those two types of IS curves.

Taking a step back, we can also plot the IS curve for each dimension of the word vectors (without sparse coding) at different layers. These curves do not show any specific patterns, as shown in Figure~\ref{sparse or no sparse}. This makes intuitive sense, since, as mentioned earlier, the individual entries of a contextualized word embedding do not correspond to any clear semantic meaning.

\begin{figure}[h]%
\centering
\subfloat[\centering]{{\includegraphics[width=0.45\linewidth]{images/no_sparse.JPG}\label{no sparse} }}%
\quad
\subfloat[\centering]{{\includegraphics[width=0.45\linewidth]{images/sparse.JPG} \label{sparse}}}%
\caption{(a) Importance score calculated using certain dimensions of word vectors without sparse coding. (b) Importance score calculated using sparse coding of word vectors. }%
\label{sparse or no sparse}%
\end{figure}

\subsection{LIME: Local Interpretable Model-Agnostic Explanations} \label{sec 2}

After training the dictionary $\Phi$ through non-negative sparse coding, the sparse code of a given input is inferred as $$\alpha(x) = \arg\min_{\alpha \in \mathbb{R}^m,\ \alpha \succeq 0} \| x - \Phi \alpha \|_2^2 + \lambda \|\alpha\|_1 .$$

For a given sentence and index pair $(s,i)$, the embedding of the word $s[i]$ at layer $l$ of the transformer is $x^{(l)}(s,i)$. We can then abstract the inference of a specific entry $c$ of the sparse code of this word vector as a black-box scalar-valued function $f$:

$$f((s,i)) = \alpha_c(x^{(l)}(s,i))$$

Let $RandomMask$ denote the operation that generates a perturbed version of our sentence $s$ by masking words at random locations with [`UNK'] (unknown) tokens. For example, a masked sentence could be

[Today is a [`UNK'],day]

Let $h$ denote an encoder for perturbed sentences relative to the unperturbed sentence $s$, such that

$$ h(s)_t= \left\{ \begin{array}{ll} 0 & \text{if } s[t] = \text{[`UNK']} \\ 1 & \text{otherwise} \end{array} \right. $$

\begin{algorithm}
\caption{Explaining Sparse Coding Activation using the LIME Algorithm}
\label{CHalgorithm}
\begin{algorithmic}[1]

\State $\mathcal{S} = \{h(s)\}$
\State $Y = \{f(s)\}$
\For{each $i$ in $\{1,2, ..., N\}$ }
\State $s_i' \leftarrow RandomMask(s)$
\State $\mathcal{S} \leftarrow \mathcal{S} \cup \{h(s_i')\} $
\State $Y \leftarrow Y \cup \{f(s_i')\} $
\EndFor
\State $w \leftarrow Ridge_w(\mathcal{S},Y)$
\end{algorithmic}
\end{algorithm}
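The procedure above can be sketched in Python as follows. This is a simplified illustration under stated assumptions: the black-box `f` is passed in as a callable on a token list (in the paper it would wrap the transformer and the sparse-code inference), the masking probability `mask_prob`, the sample count `n_samples`, and the ridge strength `ridge` are hypothetical defaults, and the ridge regression is solved in closed form without an intercept.

```python
import numpy as np

UNK = "[UNK]"

def random_mask(tokens, rng, mask_prob=0.3):
    """Replace each token with [UNK] independently with probability mask_prob."""
    return [UNK if rng.random() < mask_prob else t for t in tokens]

def encode(tokens):
    """Binary encoder h: 0 where the token is masked, 1 otherwise."""
    return np.array([0.0 if t == UNK else 1.0 for t in tokens])

def lime_weights(tokens, f, n_samples=200, ridge=1e-3, seed=0):
    """Fit a linear surrogate w so that f(s') is approximated by w . h(s')."""
    rng = np.random.default_rng(seed)
    # Start from the unperturbed sentence, then add N perturbed copies.
    samples = [list(tokens)] + [random_mask(tokens, rng) for _ in range(n_samples)]
    S = np.stack([encode(s) for s in samples])
    Y = np.array([f(s) for s in samples], dtype=float)
    # Closed-form ridge regression: w = (S^T S + ridge * I)^{-1} S^T Y.
    d = S.shape[1]
    return np.linalg.solve(S.T @ S + ridge * np.eye(d), S.T @ Y)

# Toy black box (stand-in for a sparse-code entry): fires only if "sunny" is visible.
sentence = ["today", "is", "a", "sunny", "day"]
toy_f = lambda toks: 2.0 if toks[3] != UNK else 0.0
w = lime_weights(sentence, toy_f)
print(w)
```

Each entry of `w` estimates how much the corresponding word contributes to the activation; in this toy case the weight should concentrate on the fourth word.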

Low-Level Transformer Factors

Explanation: repetitive structure detector

frontier works, and an original soundtrack by avex group were created based on the game. drama cd : tales of grace s 1 to 4 are side stories that take place during the game' s plot. they were released between may 26, 2010 and august 25, 2010 . anthology drama cd: tales of graces f 2010 winter , anthology drama cd : tales of graces f 2011 summer, anthology drama cd : tales of graces f 2012 winter, anthology drama cd : tales of graces f 2012 summer , anthology drama cd : tales of grace s f 2013 win te r, and anthology drama cd : tales of graces f 2013

cobra nd platinum cardholders, and citibank eva air cobra nd world card) the infinity( infinity mileagelands diamond, royal laurel / premium laurel class passengers , star alliance first / business class passengers, american express centurion/ eva air cobrand platinum cardholders , and ci ti bank eva air cobrand world card holders ) the star( infinity mileagelands diamond/ gold, royal laurel/ premium laurel class passengers , star alliance first / business class passengers , star alliance gold members , american express centurion/ eva air cobrand platinum cardholders , citibank eva air cobrand world cardholders, business customers , quickly set online< unk> alight" . " can ' t get you out of my head " was chosen as the lead single from minogue' s eighth studio album fever, and it was released on 8 september 2001 by par lophone in australia, while in the united kingdom and other european countries it was released on 17 september . " can ' t get you out of my head " was w ritten and produced by cathy dennis and rob davis, who had been put together by british artist manager simon fuller , who wanted the duo to come up with a song for british pop group s club 7. the song was recorded using cuba

typhoon status

with two @ - @


minute sustained winds estimated at 125 km/ h( 78 mph). around 1700 utc on may 31, the storm tracked approximately 65 km( 40 mi) west of iwo jim a . roughly five hours later , it moved within 15 km( 10 mi) of chi chi @ - @ jim a where a pressure of 992 mb( hpa; 29@.@ 30 inhg) was measured . sustained winds on chi chi @ - @ jim a reached 95 km/ h( 60 mph); however , these were determined to be unrepresentative of lucille' s actual intensity due

first book in vocal music. the modern music series. book 1. new york, new york: silver burdette and company. smith, eleanor( 1901). a second book in vocal music. the modern music series. book 2. new york, new york: silver burdette and company. smith, eleanor( 1901 ) . a third book in vocal music. the modern music series . book 3. new york, new york: silver burdette and company. smith, eleanor ( 1905) . a fourth book in vocal music . the modern music series . book 4 . new york , new york: silver burde

@ breaking eight weeks at number one on the airplay chart of the country and became the first to garner 3000 radio plays in a single week. subsequently , it became the most@-@ played song of 2001 in the region . " can ' t get you out of my head " was certified platinum by the british phonographic industry for shipments of 600@,@ 000 units in 2001. the certification was upgraded to double@-@ platinum in 2015, denoting shipments of 1@,@ 200@,@ 000 units. in the united states, " can ' t get you out of my head " peaked at number seven on the chart. in mid@-@ august 2015, " la mor di dit a" earned martin his twenty@@sixth top ten hit on hot latin songs. he became the fourth artist with the most top tens in the 29@-@ year history of the chart . in late august 2015 , martin earned with " la mor di dit a " his fifteenth number@-@ one on the latin airplay chart ( up 58 percent, to 11@.@ 8 million audience impressions) . eventually ," la mor di dit a" peaked at number six on the us hot latin songs chart, number

one on latin airpla y and

, was delivered to sukhoi' s experimental workshop to be outfitted with exclusive systems. built by knaapo, its structure has increased carbon@-@ fibre and al @-@ li content. installed was the 2d thrust@-@ vector ing l yu lka al @-@ 31fp , an interim measure pending the availability of the al @-@ 37fu(< unk>< unk>, " after burn er@-@ controlled" ) . the 3d thrust@ - @ vector ing l yu lka al @ - @ 37 fu was still in development . the al @-@ 31fp , in ke' s former band, though escape the fate only charted at number 25, seven spots lower than the drug in me is you , despite equal sales . in its second week on sales , the drug in me is you dropped about 70% in the united states , selling 5 @, @ 870 copies. this dropped the album 60 spots to number 79 on the billboard 200, and brought total us sales for the album to around 24 @ ,@ 000 copies. on the billboard charts, the drug in me is you charted at number two on the top hard rock albums chart, number three on the top alternative albums and top rock albums charts,

no , no , no ", reached number one on the billboard hot r & b/ hip@-@ hop singles& tracks and number three on the billboard hot 100. its follow@-@ up single ," with me part 1" failed to reproduce the success of " no , no , no " . meanwhile , the group featured on a song from the soundtrack album of the romantic drama why do fools fall in love and" get on the bus " had a limited release in europe and other markets . in 1998, destiny' s child garnered three soul train lady of soul awards including best new artist for " no , no , no oistic warm ong ers. alexander k rivenko( jonathan adams) finally , introduced in trivial games and paranoid pursuits, is russian alexander k ri venko, the commander of the moonbase where the ispf have their headquarters. a winner of the nobel prize for medicine, it is k ri ven ko ' s research into bone damage that has contributed

to enabling humanity to access space easily. although the star cops are independent, spring ' s relationship with k rivenko is often def ere ntial and he frequently seems to cap it ulate to k ri ven ko ' s wishes . == production history=== = = origins=

that build faith: from the life and ministry of thomas s. monson, salt lake city , utah : des ere t book , isbn 978 @ -@ 0 @-@ 87579@-@ 901@-@ 8 -( 1996), faith rewarded: a personal account of prophetic promises to the east german saints, salt lake city, utah : des ere t book , isbn 978 @ -@ 1@-@ 57345@-@ 186@-@ 4 - -( 1997 ) , invitation to exal tation , salt lake city , utah : des ere t book , isbn 978@- dell , tom( 2015). gunnerkrigg court volume 5:< unk> . gunnerkrigg court. arch aia studios press. isbn 978@-@< unk > .=== side comics=== siddell, tom( 2013 ) . annie in the forest part one . beyond the walls . robot voice comics . siddell, tom ( 2013). annie in the forest part two. beyond the walls. robot voice comics . siddell, tom( 2015). traveller . beyond the walls . robot voice comics . = = = ex pl anatory footnotes======

95 @.@ 4 kn ) f 40 2@-@ rr @ - @ < unk> engine, while later examples were fitted with the 23@,@ 000 lbf( 105@.@ 8 kn ) f 40 2 @ - @ rr @- @ 40 8 a . in the early 2000s , 17 tav@-@ 8 bs were upgraded to include a night@-@ attack capability , the f 402@-@ rr @ - @ 408 engine , and software and structural changes.< unk> in 1991 , the night attack harrier was the first upgrade of the av @-@ 8

, aitrus' meeting with ti' ana, and the birth of their son gehn . the book also explains the destruction of the d ' ni civilization. two d ' ni , veovis and a' gaeris, plot to destroy their civilization, which they believe has been corrupted . veov is and a' gaeris create a plague which wipes out many of the d ' ni and follows them through the ages . veovis is murdered by a ' gaeris for refusing to write an age where the two of them would have been worshipped as gods, and aitrus sacrifices himself in order to

inants" . a gb rm pa briefing stated the company had" threatened a compensation claim of$< unk> should the gb rm pa intend to ex ert authority over the company' s operations". in response to the< unk> of the dumping incidents, the gb rm pa stated: we have strongly encouraged the company to investigate options that don ' t en tail releasing the material to the environment and to develop a management plan to eliminate this potential hazard; however , gb rm pa does not have legislative control over how the< un k > tailings dam is managed. = ==< unk>=== following a of warped tour . following this, a lesson in romantics was released on july 10 through fearless records. in august, the band went on tour with olympia and sound the alarm. the music video for " when i get home , you ' re so dead " , directed by marco de la torre, was filmed in september. in late september 2007, the band supported paramore in japan an d australia . the band went on a co@ - @ headlining tour with madina lake in october and november. the" when i get home , you' re so dead " music video was released on november 14, and the single was released on

of the english football league including promo tion and relegation. the player' s team begins with a low rating in an 8 @ - @ team league . by winning games, the player earns credits, which can be used to purchase the contracts of free agents . by finishing high in the 8 @ - @ team league , the player' s team advances to a 16@-@ team league and eventually a 32@-@ team league . the player improves their team by periodically signing free agents , as the competition is tougher in each league. the player wins the mode after winning a playoff tournament in the 32@-@ team league o da ministra kultury i s z tu ki ii< unk > ) 1972 - member of commission" poland

2000" of the polish academy of sciences 1973 prize of the minister of foreign affairs for popular ization of polish culture abroad (< unk> ministra< unk>< unk> za< unk>< unk> kultury za< un k> ) literary prize of the minister of culture and art(< un k >< unk> mini stra ku lt ury i s z tu ki ) and honorary member of science fiction writers of america 1976 -state prize 1st level in the area of to power the antarctic outpost. above earth, ba' al ' s armada arrives. to the displeasure of his subordinates, the other system lords , ba ' al announces that he will treat the tau < un k> leniently. suspicious about ba ' al ' s thorough knowledge of earth, qetesh betrays him and forces him to tell her everything. she orders the destruction of mcmurdo and the ancient outpost in ba ' al ' s name , but she kills ba ' al when tea l' c discovers what she is doing. as teal' c escapes to an al < unk>, q ete sh

( 156+ kn) each fuel capacity: 18@,@ 000 lb( 8@,@ 200 kg) internally, or 26@ , @ 000 lb( 12@,@ 000 kg) with two external fuel tanks performance maximum speed: at altitude: mach 2@ . @ 25( 1 @ , @ 500 mph , 2@ , @ 410 km / h ) [ estimated ] supercruise: mach 1@ . @ 82 ( 1 @ , @ 220 mph , 1 @ , @ 960 km / h ) range : > 1@,@ 600 nmi( 1@, @ 840 mi, 2@,@ 960

Transformer factor 322 in layer 10 with saliency map. Explanation: biography, someone born in some year...

. only three pitchers threw more complete games in major league careers shorter than getzein' s nine@-@ year career. getzein had his most extensive playing time with the detroit wolverines, compiling records of 30@-@ 11 and 29@-@ 13 in 1886 and 1887 . in the 1887 world series( which detroit won, 10 games to 5), getzein pitched six complete games and compiled a 4 @ -@ 2 record with a 2@.@ 48 era . he also won 23 games for the boston bean ea ters in

1890 . = = early years = = get ze in was born in 1864

and telegraph lines and networks. the west construction company, based in chattanooga, tennessee, was a general contracting and construction firm also involved in the operation and maintenance of railway, telephone, and telegraph lines.== personal life===== marriage and children === on april 10 , 1875 , in hampshire county, flournoy married frances " fannie" ann armstrong white ( april 10 , 1844 -february 25 , 1922 ) , the daughter of hampshire county clerk of court john baker white and his wife frances ann streit white . frances white' s brother, robert white, served as west virginia attorney general, and her buffalo , new york businessman who made his fortune in five @-@ and @-@ dime stores. he merged his more than 100 stores with those of his first cousins, frank winfield woolworth and charles woolworth, to form the f. w . wool worth company. he went on to hold prominent positions in the merged company as well as marine trust co. he was the father of seymour h. knox ii and grandfather of seymour h . knox iii and northrup knox, the co@-@ founders of the buffalo sabres in the national hockey league . = = biography = = he was born in april 1861 in russell , saint lawrence

stars for eighteen years. the american film institute( afi) ranked cooper eleventh on its list of the twenty five greatest male stars of classic hollywood cinema .== early life== frank james cooper was born on may 7, 1901 , at 730 eleventh avenue in helena, montana to english immigrants alice( nee brazier, 1873 -1967 ) and charles henry cooper ( 1865 -1946 ) . his father emigrated from houghton regis , bedfordshire and became a prominent lawyer , rancher, and eventually a montana supreme court justice . his mother emigrated from gillingham, kent and married charles in montana . in 1906, charles purchased the 600@-@ acre orange ( 1971 ), which ku brick pulled from

circulation in the uk following a mass media frenzy - most of his films were nominated for oscars, golden globes, or bafta awards. his last film, eyes wide shut, was completed shortly before his death in 1999. = = early life = = stanley kubrick was born on july 26, 1928, in the lying @-@ in hospital at 307 second avenue in manhattan, new york city. he was the first of two children of jacob leonard kubrick ( may 21 , 1902 -october 19 , 1985 ) , known as jack or jacques , and his wife sadie gertrude kubrick managed with a catch and release regulation. trophy trout and wild brook trout enhancement regulations apply to the remainder. a total of 31 class a wild trout waters have been designated as wilderness trout streams. fishing in class a wild trout waters is permitted year@-@ round, although the killing of fish is forbidden from labor day to the beginning of the following year' s trout season . == gallery=== henry bell gil kes on = henry bell gilkes on ( june 6 , 1850 -september 29 , 1921 ) was an american lawyer , politician , school administrator, and banker in west virginia . gilkeson was born in moorefield,

movement, there have been few more remarkable figures than mar jory stoneman douglas."== early life== marjory stone man was born on april 7, 189 0, in minneapolis, minnesota , the only child of frank bryant stoneman ( 1857 -1941 ) and lillian trefethen ( 1859 -1912 ) , a concert violinist . one of her earliest memories was her father reading to her the song of hiawatha, at which she burst into sobs upon hearing that the tree had to give its life in order to provide hiawatha the wood for a canoe. she was an early and voracious reader amazon . com.=== dvd release==== johann mickl= johann mick l ( 18 april 1893 -10 april 1945 ) was an austrian @-@ born general le utnant and division commander in the german army during world war ii , and was one of only 88 2 recipients of the knight' s cross of the iron cross with oak leaves . he was commissioned shortly before the outbreak of world

war i, and served with austro@-@ hungarian forces on the eastern and italian fronts as company commander in the imperial@-@ royal mountain troops . during world war i he was decorated several times for bravery and leadership, and very unusual properties, such as a quantum critical point behavior, exotic supercondu ct ivity, and high@-@ temperature ferromagnetism . = babe ruth= george herman ruth jr . ( february 6 , 1895 -august 16 , 1948 ) , better known as babe ruth , was an american professional baseball player whose career in major league baseball( mlb) spanned 22 seasons, from 1914 through 1935 . nicknamed" the bambino" and" the sultan of swat", he began his mlb career as a stellar left@-@ handed pitcher for the boston red sox, but achieved his greatest fame as a slugging outfielder for the

air in regular scheduled services. it includes the city, country, airport and the period in which the airline served the airport. hubs are denoted with a dagger() . = william s. taylor= william sylvester taylor ( october 10 , 1853 -august 2 , 1928 ) was the 33rd governor of kentucky . he was initially declared the winner of the disputed gubernatorial election of 1899 , but the kentucky general assembly, dominated by the democrats , reversed the election results, giving the victory to his democratic party( united states) opponent , william goebel. taylor served only 50 days as governor. a poorly educated but politically as tute lawyer, taylor woods hole, massachusetts, where he studied marine bioluminescence. he also worked at the woods hole oceanographic institution . = = early life = = george thomas reynolds was born in trenton , new jersey on may 27 , 1917 , the son of george w. reynolds , a< unk> for the pennsylvania railroad , and his wife laura, a secretary with the new jersey department of geology . he attended franklin junior high school in highland park, new jersey, until year 10, and then new brunswick high school. he received a bachelor' s degree in physics from rutgers university in 1939 . he then

entered princeton university, where was awarded

= = shaughnessy was born on march 6, 1892 in st . cloud , minnesota , the second son of lucy ann ( foster) and edward shaugh nessy . he attended north st. paul high school, and prior to college, had no athletic experience . when he attended the university of minnesota , however, he p layed college football under head coach henry l. williams and alongside halfback bernie bierman . shaughnessy considered williams to be football ' s greatest teacher, and williams considered him to be the best passer from the midwest . shaughnessy handled both the passing and kicking duties for the team . he played on s gregoras likewise avoids negative comments, as do most modern historians . = george nico l ( baseball ) = george edward nico l ( october 17 , 1870 -august 4 , 1924 ) was an american baseball pitcher and outfielder who played three seasons in major league baseball( mlb) . he played for the st. louis browns, chicago colts, pittsburgh pirates and louisville colonels from 1890 to 1894 . possessing the rare combination of batting right@-@ handed and throwing left@-@ handed, he served primarily as a right fielder when he did not pitch . signed by the browns without having previously played any minor league baseball, nico l made his

dispatched powell and major benjamin mcculloch to utah to ease tensions with brigham young and the mormons. powell assumed his senate seat on his return from utah, just prior to the election of abraham lincoln as president. powell became an outspoken critic of lincoln ' s administration, so much so that the kentucky general assembly asked for his resignation and some of his fellow senators tried to have him expelled from the body . both groups later renounced their actions . powell died at his home near henderson , kentucky shortly following a failed bid to return to the senate in 1867 . = = early life == powell was born on october 6 , 1812 near henderson , the army in 1948. he was promoted to

lieutenant general just before his retirement on 29 february 1948 in recognition of his leadership of the bomb program. by a special act of congress, his date of rank was backdated to 16 july 1945, the date of the trinity nuclear test . groves went on to become a vice@-@ president at sperry rand.== early life== leslie richard groves jr . was born in albany, new york , on 17 august 1896 , the third son of four children of a pastor , leslie richard groves sr . , and his wife gwen nee griffith . a descendant of french huguenots who

, burns died on november 11 , 1928 in brooklyn, new york . == biography== thomas p . burns was born on september 6 , 1864 , in philadelphia . his parents , patrick and mary burns, were both irish immigrants. in 1883 , burns began his professional baseball career as a pitcher with harrisburg of the minor@-@ league interstate association. on the year, burns posted an earned run average( era) of 2@ . @ 30 over 20 games pitched, 15 of which were starts . when he wasn' t pitching, burns played second and third base. burns began the 1884 season playing for the wilmington quicksteps,

@ beats".== credits and personnel== lady gaga - vocals, songwriter and producer redone -songwriter, producer , vocal editing, vocal arrangement, audio engineering, instrumentation, programming, and recording at tour bus in europe trevor mu zzy -recording, vocal editing, audio engineering, and audio mixing at larrabee, north holly wood, los angeles, california gene grimaldi - audio mastering at oasis mastering, burbank, california credits adapted from born this way album liner notes . == charts == = travis jackson = travis calvin jackson ( november 2 , 1903 -july 27 , 1987 ) was an american baseball shortstop .

= mons on was born on august 21, 1927 , in salt lake city , utah to g . spencer mons on ( 1901 -1979 ) and gladys < unk> monson( 1902 - 1973) . the second of six children , he grew up in a" tight@-@ knit" family - many of his mother' s relatives living on the same street and the extended family frequently going on trips together. the family' s neighborhood included several residents of mexican descent, an environment in which he says he developed a love for the mexican people and culture. monso n often spent weekends with relatives on their farms in granger(

it . anderson was a professional accordion player and wrote poetry for various american pagan magazines . in 1970, he published his first book of poetry, thorns of the blood rose, which contained devotional religious poetry dedicated to the goddess; it won the clover international poetry competition award in 1975. anderson continued to promote the feri tradition until his death, at which point april ni ino was appointed as the new grandmaster of the tradition . == early life===== childhood: 1917 -1931=== anderson was born on may 21 , 1917 at the buffalo horn ranch in clayton , new mexico . his parents were hi lb art alexander anderson was elsewhere. he had recently become engaged and bought his first house in hillsborough. franklin and benjamin pierce were among the prominent citizens who welcomed president jackson to the state on his visit in mid@-@ 1833.=== marriage and children=== on november 19 , 1834 , pierce married jane means appleton ( march 12 , 1806 -december 2 , 1863 ) , the daughter of jesse appleton, a congregational minister and former president of bowdo in college , and elizabeth means . the appletons were prominent whigs, in contrast with the pierces' democratic affiliation. jane was shy, devoutly religious, and pro@-@ temperance

which took delivery of its eight and last globemaster in november 2015; no. 38 squadron, operating king airs; and the australian army' s 68 ground liaison section . all units are based at amberley , with the exception of no. 38 squadron, located at townsville . = clark shaughnessy= clark daniel sha ugh nessy ( originally o' shaughnessy ) ( march 6 , 1892 -may 15 , 1970 ) was an american football coach and inn ova tor . he is sometimes called the" father of the t formation" and the original founder of the forward pass, although that system had previously been used as early as the 1880s

Transformer factor 386 in layer 10 with saliency map. Explanation: topic: war

he was awarded a companion of the order of st michael and st george for his command of the 4th machine gun battalion , the recommendation of which particularly citing his success during attacks on the hindenburg line . murray' s final honour came on 11 july 1919 , when he was mentioned in despa tch es for the fourth time, having received his third mention on 31 december 1918. from june to september 1919 , murray -along with fellow australian victoria cross recipient william donovan joynt - led parties of aif members on a tour of the farming districts of britain and denmark to study agricultural methods under the education schemes. after touring through france and belgium , from large@-@ calibre shells ; one of them, allegedly a 14@- @ inch( 356 mm) round, blew a large hole in her quarterdeck and wrecked the wardroom and the gunroom . she also took several hits by light shells that day, and, although she suffered damage to her superstructure, her fighting and steaming capabilities were not seriously impaired. the ship also participated in the main attack on the dardan elles forts on 18 march . this time a 6@ - @ inch ( 152 mm ) how itzer battery opened fire on agamemnon and hit her 12 times in 25 minutes ; five of the

. lt. riefkohl, who was also the first puerto rican to graduate from the united states naval academy , served as a rear admiral in world war ii. frederick l. riefkohl' s brother , rudolph william riefkohl also served . riefkohl was commissioned a second lieutenant and assigned to the 63 rd heavy artillery regiment in france where he actively partici- pated in the meuse @ -@ argonne offensive . according to the united states war department , after the war he served as captain of coastal artillery at the letterman army medical center in presidio of san francisco, in california( 1918).

washington times@-@ herald, which ran the headline" hardy wild@ - @ eyed aus sies called world' s finest troops " . an article in the chicago daily news told its readers that australians " in their realistic attitude towards power politics, prefer to send their boys to fight far overseas rather than fighting a battle in the suburbs of sydney" . during the battle , wave ll had received a cable from general sir john dil l stress ing the political importance of such victories in the united states, where president franklin d. roosevelt was attempting to get the lend@-@ lease act passed. it was finally enacted in march 1941 . mackay wrote

. he also showed respect for occupied populations and never tolerated pillaging nor violence from his men. as a sign a gratitude, he was offered gifts several times but he was often seen refusing and sending them back. while on campaign in tyrol , he was recorded to have accepted a large sum of money but he immediately distributed it to the local hospitals. further evidence of his humanity was the ca re that he displayed for the lives and well @-@ being of his men, whom he was always reluctant to sacrifice for the sake of glory . overall as a heavy cavalry commander , nanso ut y was one of the best men available during the napoleonic

@ 000 troops on 11 february. in march 1919 , princess matoika and rijndam raced each other from saint@-@ na zaire to newport news in a friendly competition that received national press coverage in the united states. rijndam, the slower ship , was just able to edge out the princess and cut two days from her previous fastest crossing time - by appealing to the honor of the soldiers of the 133 rd field artillery ( returning home aboard the former holland america liner ) and employing them as extra stokers for her boilers . on her next trip, the veteran transport loaded troops at saint @ -@ nazaire

@july, met his wife in new york , and together they traveled to columbus , georgia by way of washington, d. c . and atlanta . = = military schools== for the ten years following world war i , troy middleton would be either an instructor or a student in the succession of military schools that army officers attend during their careers. middleton arrived in columbus, georgia with strong praise from his superiors, and would soon get his efficiency report, in which brigadier general benjamin poor e of the 4th division wrote of him , " the best all@-@ around officer i have yet seen . < unk> by his rapid promotion from coal and 700 long tons ( 710 t) of fuel oil and that provided her a range of 3@ , @ 500 nautical miles( 6@,@ 500 km) at a speed of 10 knots( 19 km/ h) . her main armament consisted of a dozen obukhovskii 12@-@ inch ( 305 mm ) pattern 1907 52 @- @ cal ib re guns mounted in four triple turrets distributed the length of the ship . the russians did not believe that super firing turrets offered any advantage as they discounted the value of axial fire and believed that super firing turrets could not fire while over the lower turret because of

' ll still be playing from 2007" and awarded it" playstation 3@-@ exclusive game of the year". = 11th battalion( australia)= the 11th battalion was an australian army battalion that was among the first infantry units raised during world war i for the first australian imperial force. it was the first battalion recruited in western australia, and following a brief training period in perth , the battalion sailed to egypt where it undertook four months of intensive training . in april 1915 it took part in the invasion of the gallip oli peninsula , landing at anzac cove . in august 1915 the battalion was in action in the battle of lone pine . following was transferred to western australia , being attached to the 6th brigade, which was based around geraldton . in september 1942, as part of an army@- @ wide reduction that came about

because of over@-@ mobilisation , the battalion was amalgamated with the 14th battalion to become the 14th/ 32nd battalion( pr ahran/ footscray regiment) . in early 1943 , the 14th / 32nd battalion carried out amphibious warfare training in queensland before being deployed to the bun a - gona area in new guinea in july . the battalion would remain in mainland new guinea and new britain for the next two years , under the command of lieutenant in an allied air raid on 10 december 1941 , mickl was appointed to temporarily command the division . during december, mickl was wounded in the head and hand , but remained at his post . rom mel recommended mick l for the knight' s cross of the iron cross, for his leadership at sidi rezegh, and it was duly awarded on 13 december 1941 . the harsh conditions of desert warfare had begun to affect mickl' s health, so at the end of december he was sent home on convalescent leave.= = = eastern front = = = = === 12th rifle brigade === = on 25

on to bijeljina which was taken against light partisan resistance late on 16 march . the 27th regiment then consolidated its position in bi je ljina while the 28th regiment and the divisional reconnaissance battalion( german:< un k >) bore the brunt of the fighting as they advanced through< unk>, celic and koraj at the foot of the majevica mountains . sauberzweig later recorded that the 2nd battalion of the 28th regiment ( ii / 28 ) " at ce lic stormed the partisan defenses with( new) battalion commander hans hanke at the poin t" and that enemy forces withdrew after of matthews , the company 2ic , who had taken command almost immediately after the company commander was wounded . under his command , each of the platoons assaulted a different cluster of buildings to which they had been assigned during training on the replica village at hastings. the west side boys' ammunition store was found and secured and, once the rest of the buildings had been cleared , the paras took

up defensive positions to block any potential counter@-@ attack and patrols went into the immediate jungle in search of any west side boys hiding in the bushes . the village was completely secure by 08 : 00 and the paras secured the approaches with clay more

) , increased her metacentric height to 6@.@ 3 feet( 1 @ .@ 9 m) at deep load, and all of the changes to her equipment increased her crew to a total of 1@,@ 188 . despite the bulge s she was able to reach a speed of 21@.@ 75 knots ( 40@.@ 28 km / h; 25@.@ 03 mph). a brief refit in early 1927 saw the addition of two more four @-@ inch aa guns and the removal of the six@-@ inch guns from the shelter deck . about 1931 , a high@-@ angle control became enraged at him, slapping him across the face. he began yelling:" your nerves, hell, you are just a goddamned coward . shut up that goddamned crying . i won' t have these brave men who have been shot at seeing this yellow bastard sitting here crying." patton then reportedly slapped bennett again, knocking his helmet liner off, and ordered the receiving officer , major charles b . et ter , not to admit him . patton then threatened bennett, " you' re going back to the front lines and you may get shot and killed, but you' re going to fight. if you don ' t, i'

secondary guns, two of which were disabled. the ammunition stores for these two guns were set on fire and the magazines had to be flooded to prevent an explosion. the ship nevertheless remained combat effective , as her primary battery remained in operation , as did most of her secondary guns; ko nig could also steam at close to her maximum speed . other areas of the ship had to be counter@-@ flooded to maintain stability; 1@,@ 600 tons of water entered the ship , either as a result of battle damage or counter @ -@ flooding efforts . the flooding rendered the battleship sufficiently low in the water to prevent the ship from being able in 1924 and rice institute, houston, texas in 1928 . he dropped out of graduate school after one year and decided to hitchhike to san francisco. the lack of work meant hunger, so he chose to join the united states army' s 11th cavalry regiment as a private on july 30, 1930, serving in monterey, california. after a year in the horse cavalry, par rish became an aviation cadet in june 1931 and subsequently qua lified as an enlisted pilot . he completed flight training in 1932 and was assigned to the 13th attack squadron at fort cr ock ett , near galveston , texas . one year later in september 1933 parrish

during the battle , murray was awarded the victoria cross. soon after his victoria cross action, he was promoted to major and earned a bar to his distinguished service order during an attack on the hindenburg line near bullec our t. promoted to lieutenant colonel in early 1918 , he assumed command of the 4th machine gun battalion , where he would remain until the end of the war . returning to australia in 1920, murray eventually settled in queensland , where he purchased the grazing farm that would be his home for the remainder of his life . re@-@ enlist ing for service in the second world war, he was appo inted as commanding officer

10 officers and 315 enlisted men, plus an additional four officers and 19 enlisted men if serving as a flotilla flagship . == construction and career= = the ship was ordered on 7 july 1934 and laid down at deutsche werke, kiel, on 2 january 1935 as yard number< unk>. she was launched on 30 november 1935 and completed on 8 april 1937 . she was named after max schultz who commanded the torpedo boat< unk > and was killed in action in january 1917 . ko r vet ten ka pitan martin < unk> was appointed as her first captain. max schultz was assigned to the 1st destroyer division on 26

the command of otto von diederichs. the squadron participated in the fall maneuvers in 1894 , which simulated a two@ - @ front war against france and russia ; deutschland' s squadron acted as the russian fleet during the exercises. between 1894 and 1897, deutschland was rebuilt in the imperial dockyard in wilhelmshaven. the ship was converted into an armored cruiser ; her heavy guns were removed and replaced with lighter weapons , including eight 15 cm( 5 @.@ 9 in ) and eight 8@. @ 8 cm ( 3@ . @ 5 in ) guns . her entire rigging equipment was removed and two heavy military mast s were installed

called on many times to maintain order in times of disaster and to keep peace during periods of political unrest . oklahoma governor john c. walton used division troops to prevent the state legislature from meeting when they were preparing to impeach him in 1923 . governor william h. murray called out the guard several times during the depression to close banks, distribute food and once to force the state of texas to keep open a free bridge over the red river which texas intended to collect toll s for, even after federal courts ordered the bridge not be opened . the division would go on to see combat in world war ii as one of four national guard divisions active during

Transformer factor 170 in layer 10 with saliency map. Explanation: topic: music production

2nd street tunnel and part of downtown los angeles spread out over a 48 @ - @ hour period. kesha explained the idea behind the video as well as the experience during an interview with mtv news ; she said that the video was different from her other videos, noting that it was going to show a sexier side of herself. the music video for " we r who we r " is presented as an underground party . the video starts off with futuristic flashing lights. kesha, seen in a ponytail wearing gray and black makeup, chains , ripped stockings, and a sparkly one@-@ piece leotard made of shards of broken and several european territories), her" endless love" duet with luther van dro ss (

number@-@ one in new zealand) and " against all odds " featuring west life ( number @@ one in the united kingdom ) . " thank god i found you" was also omitted from the japanese track listing, and replaced with" all i want for christmas is you " . for the album artwork, carey launched a social media campaign on april 12, 2015 , whereby fans had to share a link to her website in order to reveal the cover which was concealed by a curtain . using the hashtag"< unk>", single ," we belong together". he contained to add" but still , if mimi ' s going to mine from her own extensive back catalog of ballads , those are the primo melodies to go for." a reviewer for dj booth thought that minaj" ruined" the song . = == music video=== the accompanying music video for the remix of " up out my f ace " was directed by carey ' s husband, nick cannon. minaj spoke about filming a video with carey and how she did not believe that the video would ever be released:" i didn ' t even tell anyone i shot a video with

the producer, few days after he had finished the composition , madonna completed writing the lyrics of " i don ' t give a " . solve ig understood that the lyrics were probable references towards madonna' s life and thus received coverage in the press. however , he was not aware of the inner meaning behind the lyrics. with billboard magazine , the producer further explained: at first i thought we were going to work on one song; that was the original plan . let' s try to work on one song and take it from there- not spend too much time thinking about the l egend, and do something that just makes sense .

provided an additional and assistant engineering . all the instruments were provided by eriksen and hermansen while dean sang the background vocals. in may 2011 , in the mix review, an analyzing commercial productions, mike senior of sound on sound revisited the original mixing of the song . according to him, before he started the mix, senior played the song a couple of times before releasing what thing about it" bugged" him. working it out, he noted that the harmony of the mix is undermine d by the kick drum . " what ' s my name ? " contains basic harmonies that are a bar of f minor , a bar of a major

practiced in their backyards and at < un k > salon, owned by knowles ' s mother, tina . the group would test routines in the salon, when it was on montrose boulevard in houston, and sometimes would collect tips from the customers. their try out would be critiqued by the people inside . during their school days, girl' s ty me performed at local gigs. when summer came, mathew knowles established a" boot camp" to train them in dance and vocal lessons . after rigorous training, they began performing as opening acts for established r & b groups of that time such as swv, dr u hill and immature. tina day reception at the greek embassy. upon return to greece, she was greeted at the airport by fans along with the music video of " my number one " playing on the video monitors. while in greece , she attended the opening ceremony of the european final four for the volleyball champions league in< unk>, where her song was played as she appeared on stage with cheerleaders . on march 29, paparizou arrived in valletta, malta where she signed autographs, appeared on television stations, and gave interviews to the local media. following malta, she traveled to serbia and montenegro where she gave additional interviews before moving on to and

and her low hip@ - @ grind during' rude boy' were the smash hits of her body language . " deborah linton of city life wrote that rihanna " even manages to make a psychiatric couch look sexy". linton called the show' s stage sets impressive and imaginative. rick massimo of the providence journal wrote that rihanna " looked like a neon@-@ sign rendition of herself during' rehab', rarely addressed the audience, and didn ' t rise above flat cliche in that until the very end of the show " . " rehab" and rihanna ' s 2009 single " russian ro ule tte " were excluded from the set

only a few hours. he said : " there were a lot of tracks , but i just enjoyed it, to be honest. i knew how i wanted it to sound, and it was pretty much the last song we cut; a lot of the mixing was nailed in the production as well, which helped. dream did a great job producing this track . " the bar one guitar track of " school in ' life " was entirely programmed. similarly, the live drum section in the hook was actually done with programmed drums . once the mixing was over , swivel' s impression were as follows : [ ' schoolin' life] absolutely tour began on march 1, 2000 at the house of blues in los angeles , while other venues included paris olympia, trump taj mahal, brixton academy, the montreux jazz festival, and the essence jazz festival in new orleans. by july, the tour' s first half had sold out in each city. the tour lasted nearly eight months, w hil e performances went for up to three hours a night . the voodoo tour was taken internationally, with one of the most notable performances being the free jazz festival in brazil . the music video for " untitled ( how does it feel ) " portrayed d' angelo as a sex symbol

ho bson noted that rihanna " rejects the victim stance " in the video for " man down " , and elucidated that she played the role of a rape survivor who shot her attacker. she attributed the location of shooting the video in jamaica as significant, due to how the image of a gun proliferated during 1990s jamaican dance hall ' s to" express female rage". the prologue depicts rihanna as a" dark @-@ hooded" femme fatale whereby the narrative explains her motives for murder and provokes the spectator to sympathi ze with her because she danced in a provocative manner with a man in a club , which had a deep impact on delonge in that he spent a night up crying for him when he wrote the track . " a little ' s enough " was

inspired by a religious concept in which a god came to bring positive change on earth when it faces terrorism, war or famine. " the war " , an anthem about the iraq war and its death toll , is succeeded by " it hurts", a track about a friend of delonge with a cheating girlfriend." it' s a terrible situation where my friend is being crushed from the inside out by all the manipulative stuff she' s doing and this song' s just took that dress out of the storage -it has a 27@-@ foot train and it was just all hand@-@ beaded and stuff and so i figured we might as well get a use out of it.'=== synopsis= = = the video features carey ready ing for her wedding, and follows her to the altar, as well as her escape from the reception. many of the actors featured in carey ' s" it' s like that " video were in that of " we belong together " , which was shot as a continuation from the" it' s like that" video . it begins with

3 in dutch@-@ speaking flanders and number 2 in french@-@ speaking wallonia . it was certified gold by the belgian entertainment association( bea) for selling more than 15@,@ 000 copies . although the song spent only 1 week on the italian singles chart ( at number 8 ) , it was certified platinum by the federazione industria musicale italiana( fimi) in 2014 for selling more than 30@,@ 000 copies .== music video ===== background and synopsis=== anthony man dler directed the music video for " man down " in april 2011 on a beach in at numbers 18 and 43 in the united states, and experienced moderate success worldwide . unlike her previous records, spears did not heavily promote blackout ; her only televised appearance for blackout was a universally @-@ pan ned performance of " gi mme more " at the 2007 mtv video music awards.== background and development== in november 2003, while promoting her fourth studio album in the zone, spears told entertainment weekly that she was already writing songs for her next album and was also hoping to start her own record label in 2004. henrik jonback confirmed

that he had written songs with her during the european leg of the onyx hotel tour,"

of albums also had increased sales due to discount ing and publicity generated by the single and her performance. billboard estimated that her top@-@ 10 digital sales collectively increased over 1@,@ 700 percent . madonna ' s bestsell ing album was the 2009 greatest@-@ hits collection, celebration, which sold 16@,@ 000 copies ( up 1@,@ 341 percent) and reentered the billboard 200 album chart. the following week celebration fell 105 spots on the chart to number 157, with sales falling to 4 @ , @ 000 copies . " give me all your lu vin ' " fell to number 39 on the hot opened the performance with" yeah 3x" and was dressed in a white formal suit, accompanied by" full@-@ skirted dancers". brown was eventually joined onstage by tuxedo@-@ clad dancers and began dancing to the 1993 wu @ -@ tang clan single " protect ya neck". his dance routine then moved into 1991 , where he danced to nirvana ' s " smells like teen spirit " . brown ' s performance then came back to the future , where he began to sing " beautiful people " . while performing the song, he was suspended in the air, and then lowered to another stage where he continued to

register that she didn' t know she had." from the moment she was signed in the film , madonna had expressed interest in recording a dance version of " don ' t cry for me argentina " . according to her public ist liz rosenberg," since she didn' t write the music and lyrics , she wanted her signature on that song... i think on her mind, the best way to do it was go in the studio and work up a remix". for this, in august 1996 , while still mixing the film' s soundtrack, madonna hired remixers pablo flores and javier garza. according to flores , the singer d accumulated until then but that was instead an ideal marriage of production and performance . " instead, the red lights on the

stage played up the" ominous" tone of the song as it gradually increased its tempo to the point whereby the end of the song was on the verge of sounding like an inca ntation . for the diamonds world tour , rihanna performed " man down " in a caribbean @-@ theme section of the show , which also included" you da one"," no love allowed"," what' s my name? " and " rude boy" . james lachno of the telegraph highlight the caribbean@-@ themed edge of several realities: the film , the dream it inspires, the waking world it illuminates". the music in" i just can' t stop loving you", a duet with si eda h garrett, consisted mainly of finger snaps and timpani." just good friends ", a duet with stevie wonder, was viewed by critics as sounding good at the beginning of the song, ending with a" chin @ -@ bobbing cheerfulness " . " the way you make me feel"' s music consisted of blues harmonies. the lyrics of " another part of me " deal with being united , as" we

not manufactured . no one paid these kids. " === live performances === one direction performed " what makes you beautiful " on red or black? on 10 september 2011 . the performance started with hosts ant & dec announcing that the band was supposedly running late for their appearance, and cut to a video of one direction boarding a london tube carriage full of fans, as the studio version of the song began playing . each fan on the tube was given a numbered ticket. the band and fans disembarked the tube and made their way to the television studio, where the remainder of the song was sung live . after the song,

High-Level Transformer Factors

| Transformer factor | Top 3 activated words and their contexts | Explanation |
|---|---|---|
| $\Phi_{:,2}$ | • that snare shot sounded like somebody' d kicked open the door to your mind". • i became very frustrated with that and finally made up my mind to start getting back into things." • when evita asked for more time so she could make up her mind, the crowd demanded," ¡ ahora, evita,< | Word 'mind'; noun; definition: the element of a person that enables them to be aware of the world and their experiences. |
| $\Phi_{:,16}$ | •nington joined the five members xero and the band was renamed to linkin park. • times about his feelings about gordon, and the price family even sat away from park' s supporters during the trial itself. • on 25 january 2010, the morning of park' s 66th birthday, he was found hanged and unconscious in his | Word 'park'; noun; definition: a common first and last name |
| $\Phi_{:,30}$ | • saying that he has left the outsiders, kovu asks simba to let him join his pride • eventually, all boycott' s employees left, forcing him to run the estate without help. • the story concerned the attempts of a scientist to photograph the soul as it left the body. | Word 'left'; verb; definition: leaving, exiting |
| $\Phi_{:,33}$ | • forced to visit the sarajevo television station at night and to film with as little light as possible to avoid the attention of snipers and bombers. • by the modest, cream@-@ colored attire in the airy, light@-@ filled clip. • the man asked her to help him carry the case to his car, a light@-@ brown volkswagen beetle. | Word 'light'; noun; definition: the natural agent that stimulates sight and makes things visible |
| Classifier | Precision (%) | Recall (%) | F1 score (%) |
|---|---|---|---|
| Average perceptron POS tagger | 92.7 | 95.5 | 94.1 |
| Finetuned BERT base model for POS task | 97.5 | 95.2 | 96.3 |
| Logistic regression classifier with activation of $\Phi_{:,30}$ at layer 4 | 97.2 | 95.8 | 96.5 |
| Transformer factor | 2 example words and their contexts with high activation | Pattern | L4 (%) | L6 (%) | L8 (%) | L10 (%) |
|---|---|---|---|---|---|---|
| $\Phi_{:,13}$ | • the steel pipeline was about 20 ° f(- 7 ° c) degrees. • hand( 56 to 64 inches( 140 to 160 cm)) war horse is that it was a | Unit exchange with parentheses | 0 | 0 | 64.5 | 95.5 |
| $\Phi_{:,42}$ | • he died at the hospice of lancaster county from heart ailments on 23 june 2007. • holly' s drummer carl bunch suffered frostbite to his toes( while aboard the | Something unfortunate happened | 94 | 100 | 100 | 100 |
| $\Phi_{:,50}$ | • hurricane pack 1 was a revamped version of story mode; • in 1998, the categories were retitled best short form music video, and best | Doing something again, or making something new again | 74.5 | 100 | 100 | 100 |
| $\Phi_{:,86}$ | • he finished the 2005 - 06 season with 21 appearances and seven goals. • of an offensive game, finishing off the 2001 - 02 season with 58 points in the 47 games | Consecutive years, used in football season naming | 0 | 100 | 85 | 95.5 |
| $\Phi_{:,102}$ | • the most prominent of which was bishop abel muzorewa' s united african national council • ralambo' s father, andriamanelo, had established rules of succession by | African names | 99 | 100 | 100 | 100 |
| $\Phi_{:,125}$ | • music writer jeff weiss of pitchfork describes the" enduring image" • club reviewer erik adams wrote that the episode was a perfect mix | Describing someone in a paraphrasing style: name, career | 15.5 | 99 | 100 | 98.5 |
| $\Phi_{:,184}$ | • the world wide fund for nature( wwf) announced in 2010 that a biodiversity study from • fm) was halted by the federal communications commission( fcc) due to a complaint that the company buying | Institution with abbreviation | 0 | 15.5 | 39 | 63 |
| $\Phi_{:,193}$ | • 74, 22@,@ 500 vietnamese during 1979 - 92, over 2@,@ 500 bosnian •, the russo@-@ turkish war of 1877 - 88 and the first balkan war in 1913. | Time span in years | 97 | 95.5 | 96.5 | 95.5 |
| $\Phi_{:,195}$ | •s, hares, badgers, foxes, weasels, ground squirrels, mice, hamsters •-@ watching, boxing, chess, cycling, drama, languages, geography, jazz and other music | Consecutive nouns (enumeration) | 8 | 98.5 | 100 | 100 |
| $\Phi_{:,225}$ | • technologist at the united states marine hospital in key west, florida who developed a morbid obsession for • 00°,11', w, near smith valley, nevada. | Places in the US, following the convention 'city, state' | 51.5 | 91.5 | 91 | 77.5 |
| | Adversarial text | Explanation | $\alpha_{35}$ |
|---|---|---|---|
| (o) | album as "full of exhilarating, ecstatic, thrilling, fun and sometimes downright silly songs" | The original top-activated word and its context sentence for transformer factor $\Phi_{:,35}$ (not an adversarial text). | 9.5 |
| (a) | album as "full of delightful, lively, exciting, interesting and sometimes downright silly songs" | Replace the adjectives in sentence (o) with different adjectives. | 9.2 |
| (b) | album as "full of unfortunate, heartbroken, annoying, boring and sometimes downright silly songs" | Replace the adjectives in sentence (o) with negative adjectives. | 8.2 |
| (c) | album as "full of [UNK], [UNK], thrilling, [UNK] and sometimes downright silly songs" | Mask the adjectives in sentence (o) with unknown tokens. | 5.3 |
| (d) | album as "full of thrilling and sometimes downright silly songs" | Remove the first three adjectives in sentence (o). | 7.8 |
| (e) | album as "full of natural, smooth, rock, electronic and sometimes downright silly songs" | Replace the adjectives in sentence (o) with neutral adjectives. | 6.2 |
| (f) | each participant starts the battle with one balloon. these can be re@-@ inflated up to four | Use a random sentence. | 0 |
| (g) | The book is described as "innovative, beautiful and brilliant". It receive the highest opinion from James Wood | We create this sentence, which contains the pattern of consecutive adjectives. | 7.9 |


Although transformer networks (Vaswani et al., 2017; Devlin et al., 2018) have achieved great success, our understanding of how they work is still fairly limited. This has triggered increasing efforts to visualize and analyze these “black boxes”. Besides direct visualization of the attention weights, most current efforts to interpret transformer models involve “probing tasks”. A probing task attaches a lightweight auxiliary classifier to the output of the target transformer layer; only the auxiliary classifier is then trained on a well-known NLP task such as part-of-speech (POS) tagging, named-entity recognition (NER) tagging, or syntactic dependency parsing. Tenney et al. (2019) and Liu et al. (2019) show that transformer models perform very well on these probing tasks, which indicates that the models have learned the language representations the tasks rely on. Although probing tasks are useful tools for interpreting language models, they have limitations, as explained in Rogers et al. (2020). We summarize these limitations in three major points:

Most probing tasks, like POS and NER tagging, are too simple. Good performance on these tasks does not reflect the model’s true capacity.

Probing tasks can only verify whether a certain prior structure is learned in a language model. They cannot reveal structures beyond our prior knowledge.

It’s hard to locate where exactly the related linguistic representation is learned in the transformer.

Efforts have been made to remove these limitations and make probing tasks more diverse. For instance, Hewitt and Manning (2019) propose the “structural probe”, a much more intricate probing task, and Jiang et al. (2020) propose generating specific probing tasks automatically. Non-probing methods have also been explored to relieve the last two limitations. For example, Reif et al. (2019) visualize BERT embeddings using UMAP and show that the embeddings of the same word in different contexts separate into different clusters. Ethayarajh (2019) analyzes the similarity between embeddings of the same word in different contexts. Both works show that transformers provide context-specific representations.

Faruqui et al. (2015), Arora et al. (2018), and Zhang et al. (2019) demonstrate how to use dictionary learning to explain, improve, and visualize uncontextualized word embedding representations. In this work, we propose to use dictionary learning to alleviate the limitations of the other transformer interpretation techniques. Our results show that dictionary learning provides a powerful visualization tool, leading to some surprising new findings.

Hypothesis: contextualized word embedding as a sparse linear superposition of transformer factors. Word embedding vectors have been shown to factorize into a sparse linear combination of word factors (Arora et al., 2018; Zhang et al., 2019), which correspond to elementary semantic meanings. An example is:

We view the latent representation of a word in a transformer as a contextualized word embedding. Similarly, we hypothesize that a contextualized word embedding vector can also be factorized as a sparse linear superposition of a set of elementary elements, which we call transformer factors. The exact definition is presented later in this section.

Due to the skip connections in each of the transformer blocks, we hypothesize that the representation in any layer is a superposition of the hierarchical representations in all of the lower layers. As a result, the output of a particular transformer block is the sum of all of the modifications made along the way. We verify this intuition in our experiments. Based on this observation, we propose to learn a single dictionary for the contextualized word vectors output by the different layers.

To learn a dictionary of transformer factors with non-negative sparse coding.

Given a set of tokenized text sequences, we collect the contextualized embedding of every word using a transformer model. We define the set of all word embedding vectors from the $l$-th layer of the transformer model as $X^{(l)}$. Furthermore, we collect the embeddings across all layers into a single set $X = X^{(1)} \cup X^{(2)} \cup \cdots \cup X^{(L)}$.

By our hypothesis, we assume each embedding vector $\bm{x} \in X$ is a sparse linear superposition of transformer factors:

$$\bm{x} = \Phi \bm{\alpha} + \bm{\epsilon}, \quad \text{s.t. } \bm{\alpha} \succeq 0,$$

where $\Phi \in \mathbb{R}^{d \times m}$ is a dictionary matrix with columns $\Phi_{:,c}$, $\bm{\alpha} \in \mathbb{R}^{m}$ is a sparse vector of coefficients to be inferred, and $\bm{\epsilon}$ is a vector containing independent Gaussian noise samples, which are assumed to be small relative to $\bm{x}$. Typically $m > d$, so that the representation is overcomplete. This inverse problem can be solved efficiently by the FISTA algorithm (Beck and Teboulle, 2009). The dictionary matrix $\Phi$ can be learned iteratively using non-negative sparse coding, which we describe in Appendix C. Each column $\Phi_{:,c}$ of $\Phi$ is a transformer factor, and its corresponding sparse coefficient $\bm{\alpha}_{c}$ is its activation level.
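The inference step can be illustrated with a minimal sketch of non-negative sparse coding solved by FISTA. This is not the authors' implementation: the objective $\frac{1}{2}\|\bm{x} - \Phi\bm{\alpha}\|_2^2 + \lambda \sum_c \bm{\alpha}_c$ with $\bm{\alpha} \succeq 0$, the function name `fista_nonneg`, the penalty `lam`, the iteration count, and the random toy dictionary are all assumptions made for illustration.

```python
import numpy as np

def fista_nonneg(x, Phi, lam=0.05, n_iter=200):
    """Approximately minimize 0.5*||x - Phi a||^2 + lam*sum(a) subject to a >= 0.

    Hypothetical helper for illustration; names and defaults are assumptions.
    """
    m = Phi.shape[1]
    # Lipschitz constant of the gradient: squared largest singular value of Phi.
    L = np.linalg.norm(Phi, 2) ** 2
    a = np.zeros(m)   # current estimate of the sparse code
    y = a.copy()      # extrapolated point used by the momentum step
    t = 1.0
    for _ in range(n_iter):
        grad = Phi.T @ (Phi @ y - x)
        # Proximal step: shift by lam/L, then project onto the non-negative orthant.
        a_next = np.maximum(y - grad / L - lam / L, 0.0)
        t_next = (1.0 + np.sqrt(1.0 + 4.0 * t * t)) / 2.0
        y = a_next + ((t - 1.0) / t_next) * (a_next - a)
        a, t = a_next, t_next
    return a

# Toy usage: an overcomplete random dictionary (m > d) with unit-norm columns.
rng = np.random.default_rng(0)
d, m = 16, 32
Phi_demo = rng.standard_normal((d, m))
Phi_demo /= np.linalg.norm(Phi_demo, axis=0)
alpha_true = np.zeros(m)
alpha_true[[3, 17]] = [1.0, 0.5]
x_demo = Phi_demo @ alpha_true
alpha_hat = fista_nonneg(x_demo, Phi_demo)
print("non-zero coefficients:", int(np.count_nonzero(alpha_hat > 1e-6)))
```

In a full dictionary-learning loop, a step like this would alternate with an update of $\Phi$; that outer loop is omitted here.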

Visualization by top activation and LIME interpretation. An important empirical method for visualizing a feature in deep learning is to show the input samples that trigger the top activations of the feature (Zeiler and Fergus, 2014). We adopt this convention. As a starting point, we try to visualize each of the dimensions of a particular layer, $X^{(l)}$. Unfortunately, the hidden dimensions of transformers are not semantically meaningful, as is also the case for uncontextualized word embeddings (Zhang et al., 2019).

Instead, we can try to visualize the transformer factors. For a transformer factor $\Phi_{:,c}$ and a layer $l$, we denote the 1000 contextualized word vectors with the largest sparse coefficients $\alpha^{(l)}_{c}$ as $X^{(l)}_{c} \subset X^{(l)}$, which correspond to 1000 different sequences. For example, Figure 3 shows the top 5 words that activate transformer factor 17, $\Phi_{:,17}$, at layer 0, layer 2, and layer 6, respectively. Since a contextualized word vector is generally affected by many tokens in the sequence, we can use LIME (Ribeiro et al., 2016) to assign a weight to each token in the sequence and so identify its relative importance to $\alpha_{c}$. The detailed method is left to Section 3.

To determine low-, mid-, and high-level transformer factors with importance score. As we build a single dictionary for all of the transformer layers, the transformer factors carry semantic meaning at different levels. While some factors appear in lower layers and continue to be used in later stages, others are only activated in the higher layers of the transformer network. A central question in representation learning is: “where does the network learn certain information?” To answer this question, we compute an “importance score” $I^{(l)}_{c}$ for each transformer factor $\Phi_{:,c}$ at layer $l$. $I^{(l)}_{c}$ is the average of the 1000 largest sparse coefficients $\alpha^{(l)}_{c}$, which correspond to $X^{(l)}_{c}$. We plot the importance scores of each transformer factor across layers as a curve, as shown in Figure 2. We then use these importance score (IS) curves to identify the layer in which a transformer factor emerges. Figure 2a shows an IS curve that peaks in an earlier layer. The corresponding transformer factor emerges at an earlier stage and may capture lower-level semantic meanings. In contrast, Figure 2b shows a peak in the higher layers, which indicates that the transformer factor emerges much later and may correspond to mid- or high-level semantic structures. Distinguishing between mid-level and high-level factors involves more subtleties, which are discussed later.
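The importance score defined above can be sketched in a few lines. The helper names (`importance_score`, `is_curve`, `peak_layer`) and the toy coefficients are hypothetical; the only part taken from the text is the definition itself, the mean of the largest $k$ sparse coefficients of a factor at a given layer, with $k = 1000$ in the paper.

```python
import heapq

def importance_score(coeffs, k=1000):
    """Mean of the k largest coefficients (of all of them if fewer than k exist)."""
    top = heapq.nlargest(k, coeffs)
    return sum(top) / len(top) if top else 0.0

def is_curve(coeffs_by_layer, k=1000):
    """Importance score of one factor at every layer, in layer order."""
    return [importance_score(layer_coeffs, k) for layer_coeffs in coeffs_by_layer]

def peak_layer(curve):
    """Index of the layer where the importance score is largest."""
    return max(range(len(curve)), key=curve.__getitem__)

# Toy usage with made-up activations of a single factor at three layers (k=2).
toy_coeffs = [
    [0.0, 0.1, 0.2],  # layer 0
    [0.5, 1.5, 1.0],  # layer 1
    [0.4, 0.2, 0.9],  # layer 2
]
toy_curve = is_curve(toy_coeffs, k=2)
print("IS curve:", toy_curve, "peak layer:", peak_layer(toy_curve))
```

Reading off the peak layer of each curve is one simple way to sort factors into early-emerging and late-emerging groups; the paper's distinction between mid-level and high-level factors needs more than this.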

An important characteristic is that the IS curve of each transformer factor is relatively smooth. This indicates that if a vital feature is learned in the early layers, it does not disappear in later stages. Instead, it is carried all the way to the end with gradually decaying weight, since many more features join along the way. Similarly, abstract information learned in higher layers develops slowly from the early layers. Figures 3 and 5 confirm this idea, as explained in the next section.

We use a 12-layer pre-trained BERT model Pre; Devlin et al. (2018) and freeze the weights. Since we learn a single dictionary of transformer factors for all of the layers in the transformer, we show that these transformer factors correspond to different levels of semantic or syntactic patterns. The patterns can be roughly divided into three categories: word-level disambiguation, sentence-level pattern formation, and long-range dependency. In the following, we provide detailed visualizations for each pattern category. Due to the space limit, only a small number of the factors are shown in the paper. To alleviate “cherry-picking” bias, we also built a website where interested readers can explore these results.

Low-level: word-level polysemy disambiguation. While the input embedding of a token contains polysemy, we find transformer factors with early IS curve peaks usually correspond to a specific word-level meaning. By visualizing the top activation sequences, we can see how word-level disambiguation is gradually developed in a transformer.

We show how the disambiguation effect develops progressively through each layer in Figure 3, which lists the top 5 activated words and their contexts for transformer factor $\Phi_{:,30}$ in different layers. The top-activated words in layer 0 contain the word “left” in varying senses; it is mostly, though not completely, disambiguated in layer 2. In layer 4, the word “left” is fully disambiguated, since the top-activated words contain only “left” with the word sense “leaving, exiting.” We also show more examples of this type of transformer factor in Table 1: for each transformer factor, we list the top 3 activated words and their contexts in layer 4. As shown in the table, nearly all top-activated words are disambiguated into a single sense.

Further, we can quantify the disambiguation ability of the transformer model. In the example above, since the top 1000 activated words and contexts are “left” with only the word sense “leaving, exiting”, we hypothesize that “left” used as a verb triggers a higher activation of $\Phi_{:,30}$ than “left” used as another part of speech. We can verify this hypothesis using a human-annotated corpus, the Brown corpus Francis and Kucera (1979), in which each word is annotated with its part of speech. We collect all the sentences that contain the word “left” annotated as a verb in one set, and the sentences that contain “left” annotated as another part of speech in a second set. As shown in Figure 4a, in layer 0, the average activation of $\Phi_{:,30}$ for the word “left” marked as a verb is no different from that for “left” in other senses. However, at layer 2, “left” marked as a verb triggers a higher activation of $\Phi_{:,30}$. In layer 4, this difference increases further, indicating that disambiguation develops progressively across layers. We plot the activations of “left” marked as a verb and the activations of the other occurrences of “left” in Figure 4b. In layer 4, they are nearly linearly separable by this single feature. Since each occurrence of “left” corresponds to an activation value, we can use logistic regression to differentiate these two types of “left”. From the result shown in Figure 4a, it is striking that the disambiguation ability of $\Phi_{:,30}$ alone is better than that of the other two classifiers trained with supervised data. This result confirms that disambiguation is indeed done in the early part of the pre-trained transformer model and that we are able to detect it via dictionary learning.
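To make the single-feature classification concrete, the following sketch fits a one-dimensional logistic regression to activation values. It is an illustration only: the activations are synthetic stand-ins drawn from two Gaussians (not Brown corpus measurements), and the helper `fit_logistic_1d` and its plain gradient-descent solver are assumptions rather than the authors' implementation.

```python
import numpy as np

def fit_logistic_1d(x, y, lr=0.5, steps=2000):
    """Fit p(y=1 | x) = sigmoid(w * x + b) by gradient descent on the mean log-loss."""
    w, b = 0.0, 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(w * x + b)))
        w -= lr * np.mean((p - y) * x)  # gradient of the log-loss w.r.t. w
        b -= lr * np.mean(p - y)        # gradient of the log-loss w.r.t. b
    return w, b

rng = np.random.default_rng(0)
# Synthetic activations of one factor (hypothetical values):
# high when "left" is a verb, low otherwise.
act_verb = rng.normal(loc=3.0, scale=0.5, size=200)
act_other = rng.normal(loc=0.5, scale=0.5, size=200)

x = np.concatenate([act_verb, act_other])
y = np.concatenate([np.ones(200), np.zeros(200)])

w, b = fit_logistic_1d(x, y)
pred = (w * x + b) > 0              # predict "verb" when the logit is positive
accuracy = np.mean(pred == (y == 1))
```

With a single scalar feature, the classifier reduces to a threshold on the activation, so high accuracy here is just another way of saying the two groups are nearly linearly separable.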

Mid-level: sentence-level pattern formation. We find that most of the transformer factors with an IS curve peak after layer 6 capture mid-level or high-level semantic meanings. In particular, the mid-level ones correspond to semantic patterns such as phrase and sentence patterns.

We first show two detailed examples of mid-level transformer factors. Figure 5 shows a transformer factor that detects the pattern of consecutive usage of adjectives. This pattern starts to emerge at layer 4, develops at layer 6, and becomes quite reliable at layer 8. Figure 6 shows a transformer factor that corresponds to a rather unexpected pattern: “unit exchange”, e.g., 56 inches (140 cm). Although this exact pattern only starts to appear at layer 8, the sub-structures that make up this pattern, e.g., parentheses and numbers, appear to trigger this factor in layers 4 and 6. Thus this transformer factor is also gradually developed through several layers.

While some mid-level transformer factors verify common semantic or syntactic patterns, there are also many surprising mid-level transformer factors. We list a few in Table 3 with quantitative analysis. For each listed transformer factor, we analyze the top 200 activating words and their contexts in each layer. We record the percentage of those words and contexts that correspond to the factor’s semantic pattern in Table 3. From the table, we see that a large percentage of the top-activated words and contexts do correspond to the pattern we describe. It also shows that most of these mid-level patterns start to develop at layer 4 or 6. More detailed examples are provided in appendix section F. Though it is still mysterious why the transformer network develops representations for these surprising patterns, we believe such a direct visualization can provide additional insights that complement the “probing tasks”.

To further confirm that a transformer factor does correspond to a specific pattern, we can use constructed example words and contexts to probe its activation. In Table 4, we construct several text sequences that are similar to the pattern corresponding to a particular transformer factor but with subtle differences. The result confirms that a context that strictly follows the pattern represented by that transformer factor triggers a high activation. Moreover, the closer an adversarial example is to this pattern, the higher the activation it receives at this transformer factor.

High-level: long-range dependency. High-level transformer factors correspond to linguistic patterns that span an extended range in the text. Since the IS curves of mid-level and high-level transformer factors are similar, it is difficult to distinguish those transformer factors based on their IS curves. Thus, we have to manually examine the top-activated words and contexts for each transformer factor to differentiate between mid-level and high-level transformer factors. To ease the process, we use the black-box interpretation algorithm LIME Ribeiro et al. (2016) to identify the contribution of each token in a sequence. There also exist interpretation tools that specifically leverage the transformer architecture (Chefer et al., 2021, 2020). In the future, one could adapt those interpretation tools, which may provide better visualization.

Given a sequence $s\in S$, we can treat $\alpha^{(l)}_{c,i}$, the activation of $\Phi_{:,c}$ in layer $l$ at location $i$, as a scalar function of $s$, $f^{(l)}_{c,i}(s)$. Assume a sequence $s$ triggers a high activation $\alpha^{(l)}_{c,i}$, i.e., $f^{(l)}_{c,i}(s)$ is large. We want to know how much each token (or equivalently each position) in $s$ contributes to $f^{(l)}_{c,i}(s)$. To do so, we generate a sequence set $\mathcal{S}(s)$, where each $s'\in\mathcal{S}(s)$ is the same as $s$ except that several random positions in $s'$ are masked by [‘UNK’] (the unknown token). Then we learn a linear model $g_{w}(s')$ with weights $w\in\mathbb{R}^{T}$ to approximate $f(s')$, where $T$ is the length of the sentence $s$. This can be solved as a ridge regression:

The learned weights $w$ can serve as a saliency map that reflects the “contribution” of each token in the sequence $s$. As in Figure 7, the color reflects the weight $w$ at each position. Red means the given position has a positive weight and green means a negative weight. The magnitude of the weight is represented by the intensity. The redder a token is, the more it contributes to the activation of the transformer factor. We leave further implementation and mathematical details of the LIME algorithm to the appendix.
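The masking-and-regression procedure can be sketched as follows. This is a minimal illustration, not the authors' code: the function name `lime_saliency`, the masking probability, the regularization strength, and the toy scoring function `toy_f` (a stand-in for the factor activation) are all assumptions, and the regression here is unweighted.

```python
import numpy as np

def lime_saliency(tokens, f, num_samples=500, mask_prob=0.3, ridge=1.0, seed=0):
    """Fit a linear model to a black-box score f on randomly masked copies of tokens.

    tokens: list of str. f: callable mapping a token list to a scalar
    (a hypothetical stand-in for the activation of one transformer factor).
    Returns one weight per position, usable as a saliency map.
    """
    rng = np.random.default_rng(seed)
    T = len(tokens)
    # z[n, t] = 1 if token t is kept in sample n, 0 if it is replaced by [UNK].
    z = (rng.random((num_samples, T)) > mask_prob).astype(float)
    y = np.array([
        f([tok if keep else "[UNK]" for tok, keep in zip(tokens, row)])
        for row in z
    ])
    # Ridge regression with an unpenalized intercept: center, then solve
    # (Zc^T Zc + ridge * I) w = Zc^T yc.
    zc = z - z.mean(axis=0)
    yc = y - y.mean()
    return np.linalg.solve(zc.T @ zc + ridge * np.eye(T), zc.T @ yc)

def toy_f(seq):
    # Hypothetical activation: "left" raises it, "right" lowers it.
    return 2.0 * ("left" in seq) - 1.0 * ("right" in seq)

tokens = ["he", "left", "the", "room", "and", "turned", "right"]
weights = lime_saliency(tokens, toy_f)
```

In this toy case the largest positive weight lands on “left” and the most negative on “right”, mirroring the red and green coloring described above.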

We provide detailed visualizations for two different transformer factors that show long-range dependency in Figures 7 and 8. Since visualizing high-level information requires more extended context, we only show the top two activated words and their contexts for each such transformer factor. Many more are provided in appendix section G.

We name the pattern for transformer factor $\Phi_{:,297}$ in Figure 7 the “repetitive pattern detector”. All top-activated contexts for $\Phi_{:,297}$ contain an obvious repetitive structure. Specifically, the text snippet “can’t get you out of my head” appears twice in the first example, and the text snippet “xxx class passenger, star alliance” appears three times in the second example. Compared to the patterns we found in the mid-level [6], high-level patterns like the “repetitive pattern detector” are much more abstract. In some sense, the transformer detects whether there are two (or multiple) almost identical embedding vectors at layer 10 without caring what they are. Such behavior might be closely related to the concept proposed in capsule networks Sabour et al. (2017); Hinton (2021). Further understanding this behavior and studying how the self-attention mechanism helps model the relationships between the features is an interesting future research direction.

Figure 8 shows another high-level factor, which detects text snippets related to “the beginning of a biography”. The necessary components, day of birth as a month and a four-digit year, first name and last name, familial relation, and career, are all mid-level information. In Figure 8, we see that all the information related to biography has a high weight in the saliency map. Thus, these components are combined to detect the high-level pattern.

Dictionary learning has been successfully used to visualize classical word embeddings Arora et al. (2018); Zhang et al. (2019). In this paper, we propose to use this simple method to visualize the representation learned in transformer networks, supplementing the implicit “probing-tasks” methods. Our results show that the learned transformer factors are relatively reliable and can even provide many surprising insights into linguistic structures. This simple tool can open up transformer networks and show the hierarchical semantic or syntactic representation learned at different stages. In short, we find word-level disambiguation, sentence-level pattern formation, and long-range dependency. The idea that a neural network learns low-level features in early layers and abstract concepts in later stages is very similar to the visualization of CNNs Zeiler and Fergus (2014). Dictionary learning can be a convenient tool to help visualize a broad category of neural networks with skip connections, like ResNet He et al. (2016), ViT models Dosovitskiy et al. (2020), etc. For interested readers, we provide an interactive website (https://transformervis.github.io/transformervis/) for further insights.

We thank our reviewers for their detailed and insightful comments. We also thank Yuhao Zhang for his suggestions during the preparation of this paper.

The characteristics of the importance score curve correspond strongly to a transformer factor’s categorization. Based on the location of the peak of an IS curve, we can classify a transformer factor as low-level, mid-level, or high-level. The importance scores of low-level transformer factors peak in early layers and slowly decrease across the rest of the layers. On the other hand, the importance scores of mid-level and high-level transformer factors slowly increase and peak at higher layers. In Figure 9, we show two sets of examples to demonstrate the clear distinction between those two types of IS curves.

Taking a step back, we can also plot an IS curve for each dimension of the word vector (without sparse coding) at different layers. These curves do not show any specific patterns, as shown in Figure 10. This makes intuitive sense since, as we mentioned, the individual entries of a contextualized word embedding do not correspond to any clear semantic meaning.

For a given sentence and index pair $(s,i)$, the embedding of word $w=s[i]$ by layer $l$ of the transformer is $x^{(l)}(s,i)$. Then we can abstract the inference of a specific entry of the sparse code of the word vector as a black-box scalar-valued function $f$:

Let $RandomMask$ denote the operation that generates a perturbed version of our sentence $s$ by masking words at random locations with “[UNK]” (unknown) tokens. For example, a masked sentence could be

[Today, is, a, [‘UNK’], day]

Let $h$ denote an encoder for perturbed sentences relative to the unperturbed sentence $s$, such that

The LIME algorithm we use to generate a saliency map for each sentence is the following:

Here, $Ridge_{w}$ is a weighted ridge regression defined as:

$d(\cdot,\cdot)$ can be any metric that measures how much a perturbed sentence differs from the original sentence. If a sentence is perturbed such that every token is masked, then $d(h(s'),\vec{1})$ should be 0; if a sentence is not perturbed at all, then $d(h(s'),\vec{1})$ should be 1. We choose $d(\cdot,\cdot)$ to be cosine similarity in our implementation.

In practice, we also use feature selection. This is done by running LIME twice. After we obtain the regression weights $w_{1}$ from the first run, we use them to find the $k$ indices corresponding to the entries of $w_{1}$ with the highest absolute values. We then treat those $k$ indices as locations in the sentence and apply LIME a second time, using only the indices selected in the first run.
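The weighted regression and the two-pass feature selection can be sketched together as below. This is a hedged illustration: the helper names (`weighted_ridge`, `cosine_to_ones`, `lime_two_pass`), the sampling parameters, and the toy scoring function `toy_f` are hypothetical. The binary keep/mask vector plays the role of the encoder output $h(s')$, and each sample is weighted by its cosine similarity to the all-ones vector, which is 0 when everything is masked and 1 when nothing is.

```python
import numpy as np

def weighted_ridge(z, y, sample_w, ridge=1.0):
    """Minimize sum_n sample_w[n] * (y[n] - b - z[n] @ w)^2 + ridge * ||w||^2 over w, b."""
    sw = sample_w / sample_w.sum()
    zc = z - sw @ z  # center with weighted means (handles the intercept b)
    yc = y - sw @ y
    A = zc.T @ (sample_w[:, None] * zc) + ridge * np.eye(z.shape[1])
    return np.linalg.solve(A, zc.T @ (sample_w * yc))

def cosine_to_ones(z):
    """Cosine similarity between each binary mask row and the all-ones vector."""
    norms = np.linalg.norm(z, axis=1)
    return z.sum(axis=1) / (np.maximum(norms, 1e-12) * np.sqrt(z.shape[1]))

def lime_two_pass(tokens, f, k=3, num_samples=500, mask_prob=0.3, seed=0):
    rng = np.random.default_rng(seed)
    T = len(tokens)

    def run(positions):
        # Only the given positions may be masked; all others are always kept.
        z = np.ones((num_samples, T))
        z[:, positions] = rng.random((num_samples, len(positions))) > mask_prob
        y = np.array([
            f([t if keep else "[UNK]" for t, keep in zip(tokens, row)])
            for row in z
        ])
        w = np.zeros(T)
        w[positions] = weighted_ridge(z[:, positions], y, cosine_to_ones(z))
        return w

    w1 = run(np.arange(T))                           # first pass: all positions
    selected = np.sort(np.argsort(-np.abs(w1))[:k])  # k largest |weights|
    return run(selected)                             # second pass: selected only

def toy_f(seq):
    # Hypothetical activation: "left" raises it, "right" lowers it.
    return 2.0 * ("left" in seq) - 1.0 * ("right" in seq)

tokens = ["he", "left", "the", "room", "and", "turned", "right"]
w2 = lime_two_pass(tokens, toy_f, k=2)
```

Positions dropped by the first pass receive weight zero in the final map, so the second pass spends all of its samples on the tokens that matter most.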

Overall, the regression weight $w$ can be regarded as a saliency map. The higher the weight $w_{k}$, the more important the word $s[k]$ is in the sentence, since it contributes more to the activation of a specific transformer factor.

The vector $w$ can also contain negative weights. In general, negative weights are hard to interpret in the context of transformer factors. The activation will increase if the words corresponding to negative weights are removed. Since a transformer factor corresponds to a specific pattern, words with negative weights are those in a context that behave “opposite” to this pattern.

Let $S$ be the set of all sequences. Recall how we defined word embeddings using the hidden states of the transformer in the main section: $X^{(l)}=\{x^{(l)}(s,i)\,|\,s\in S,\ i\in[0,len(s)]\}$ is the set of all word embeddings at layer $l$. The set of word embeddings across all layers is then defined as

In practice, we use the BERT base model as our transformer model; each word embedding vector (hidden state of BERT) has dimension 768. To learn the transformer factors, we concatenate all word vectors $x\in X$ into a data matrix $A$. We also define $f(x)$ to be the frequency of the token that is embedded in word vector $x$. For example, if $x$ is the embedding of the word “the”, it will have a much larger frequency, i.e., $f(x)$ is high.

Using $f(x)$, we define the Inverse Frequency Matrix $\Omega$: $\Omega$ is a diagonal matrix where each entry on the diagonal is the square inverse frequency of each word, i.e.

Then we use a typical iterative optimization procedure to learn the dictionary $\Phi$ described in the main section:

These two optimizations are both convex, and we solve them iteratively to learn the transformer factors. In practice, we use minibatches containing 200 word vectors as $X$. The motivation for applying the Inverse Frequency Matrix $\Omega$ is that we want all words in our vocabulary to contribute equally. When we sample a minibatch from $A$, frequent words like “the” and “a” are much more likely to appear, so they should receive lower weight during the update.

Optimization 2 can converge in 1000 steps using the FISTA algorithm (FISTA can usually converge within 300 steps; we nevertheless use 1000 steps to avoid any potential numerical issue). We experimented with different $\lambda$ values from 0.03 to 3 and chose $\lambda=0.27$ for the results presented in this paper. Once the sparse coefficients have been inferred, we update our dictionary $\Phi$ based on Optimization 3 by one step using an approximate second-order method, where the Hessian is approximated by its diagonal to achieve an efficient inverse Duchi et al. (2011). The second-order parameter update method usually leads to much faster convergence. Empirically, we train for 200k steps, which takes roughly 2 days on an Nvidia 1080 Ti GPU.
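For readers unfamiliar with FISTA, the sparse-coefficient inference step can be sketched as follows. This is an illustration under assumptions: it uses the common lasso form $\frac{1}{2}\|x-\Phi\alpha\|_2^2+\lambda\|\alpha\|_1$ without any additional constraint or frequency weighting, and the dictionary, signal, and $\lambda$ in the toy example are synthetic rather than taken from the trained model.

```python
import numpy as np

def soft_threshold(v, t):
    """Proximal operator of t * ||.||_1."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def fista(x, Phi, lam, steps=1000):
    """Minimize 0.5 * ||x - Phi @ a||^2 + lam * ||a||_1 over a using FISTA."""
    L = np.linalg.norm(Phi, 2) ** 2  # Lipschitz constant of the smooth part's gradient
    a = np.zeros(Phi.shape[1])
    y, t = a.copy(), 1.0
    for _ in range(steps):
        grad = Phi.T @ (Phi @ y - x)
        a_next = soft_threshold(y - grad / L, lam / L)          # proximal gradient step
        t_next = (1.0 + np.sqrt(1.0 + 4.0 * t * t)) / 2.0
        y = a_next + ((t - 1.0) / t_next) * (a_next - a)        # momentum extrapolation
        a, t = a_next, t_next
    return a

def objective(x, Phi, a, lam):
    return 0.5 * np.sum((x - Phi @ a) ** 2) + lam * np.sum(np.abs(a))

# Synthetic example: a 20-dimensional signal built from 3 of 50 unit-norm atoms.
rng = np.random.default_rng(0)
Phi = rng.normal(size=(20, 50))
Phi /= np.linalg.norm(Phi, axis=0)
a_true = np.zeros(50)
a_true[[3, 17, 41]] = [2.0, -1.5, 1.0]
x = Phi @ a_true

lam = 0.05
a_hat = fista(x, Phi, lam)
```

The soft-thresholding step sets small coefficients exactly to zero, which is what makes the inferred code sparse; larger $\lambda$ values give sparser codes at the cost of reconstruction accuracy.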

In the following three sections, we provide visualizations of more example transformer factors at the low, mid, and high levels. Here is a table of contents with hyperlinks to each level:

Low-Level: E

Transformer factor 2 in layer 4
Explanation: Mind: noun, the element of a person that enables them to be aware of the world and their experiences.
• that snare shot sounded like somebody’ d kicked open the door to your mind".
• i became very frustrated with that and finally made up my mind to start getting back into things."
• when evita asked for more time so she could make up her mind, the crowd demanded," ¡ ahora, evita,<
• song and watch it evolve in front of us… almost as a memory in your head.
• was to be objective and to let the viewer make up his or her own mind."
• managed to give me goosebumps, and those moments have remained on my mind for weeks afterward."
• rests the tir’ d mind, and waking loves to dream
• , tracks like’ halftime’ and the laid back’ one time 4 your mind’ demonstrated a[ high] level of technical precision and rhetorical dexter
• so i went to bed with that on my mind".
• ment to a seed of doubt that had been playing on mulder’ s mind for the entire season".
• my poor friend smart shewed the disturbance of his mind, by falling upon his knees, and saying his prayers in the street
• donoghue complained that lessing has not made up her mind on whether her characters are" the salt of the earth or its sc
• release of the new lanois@-@ produced album, time out of mind.
• sympathetic man to illegally" ghost@-@ hack" his wife’ s mind to find his daughter.
• this album veered into" the corridors" of flying lotus’" own mind", interpreting his guest vocalists as" disembodied phantom

Transformer factor 16 in layer 4
Explanation: Park: noun, ’park’ as the name
• allmusic writer william ruhlmann said that" linkin park sounds like a johnny@-@ come@-@ lately to an
• nington joined the five members xero and the band was renamed to linkin park.
• times about his feelings about gordon, and the price family even sat away from park’ s supporters during the trial itself.
• on 25 january 2010, the morning of park’ s 66th birthday, he was found hanged and unconscious in his
• was her, and knew who had done it", expressing his conviction of park’ s guilt.
• jeremy park wrote to the north@-@ west evening mail to confirm that he
• vanessa fisher, park’ s adoptive daughter, appeared as a witness for the prosecution at the
• they played at< unk> for years before joining oldham athletic at boundary park until 2010 when they moved to oldham borough’ s previous ground,<
• theme park guests may use the hogwarts express to travel between hogsmead
• s strength in both singing and rapping while comparing the sound to linkin park.
• in a statement shortly after park’ s guilty verdict, he said he had" no doubt" that
• june 2013, which saw the band travel to rock am ring and rock im park as headline act, the song was moved to the middle of the set
• after spending the first decade of her life at the central park zoo, pattycake moved permanently to the bronx zoo in 1982.
• south park spoofed the show and its hosts in the episode" south park is gay!"
• harrison" sounds like he’ s recorded his vocal track in one of the park’ s legendary caves".

Transformer factor 30 in layer 4
Explanation: left: verb, leaving, exiting
• did succeed in getting the naval officers into his house, and the mob eventually left.
• all of the federal troops had left at this point, except totten who had stayed behind to listen to
• saying that he has left the outsiders, kovu asks simba to let him join his pride
• eventually, all boycott’ s employees left, forcing him to run the estate without help.
• the story concerned the attempts of a scientist to photograph the soul as it left the body.
• in time and will slowly improve until he returns to the point at which he left.
• peggy’ s exit was a" non event", as" peggy just left, nonsensically and at complete odds with everything we’ ve
• over the course of the group’ s existence, several hundred people joined and left.
• no profit was made in six years, and the church left, losing their investment.
• on 7 november he left, missing the bolshevik revolution, which began on that day.
• he had not re@-@ written his will and when produced still left everything to his son lunalilo.
• they continued filming as normal, and when lynch yelled cut, the townspeople had left.
• with land of black gold( 1950), a story that he had previously left unfinished, instead.
• he was infuriated that the government had left thousands unemployed by closing down casinos and brothels.
• an impending marriage between her and albert interfered with their studies, the two brothers left on 28 august 1837 at the close of the term to travel around europe

Transformer factor 33 in layer 4
Explanation: light: noun, the natural agent that stimulates sight and makes things visible:
• forced to visit the sarajevo television station at night and to film with as little light as possible to avoid the attention of snipers and bombers.
• by the modest, cream@-@ colored attire in the airy, light@-@ filled clip.
• the man asked her to help him carry the case to his car, a light@-@ brown volkswagen beetle.
• they are portrayed in a particularly sympathetic light when they are killed during the ending.
• caught up" was directed by mr. x, who was behind the laser light treatment of usher’ s 2004 video" yeah!"
• piracy in the indian ocean, and the script depicted the pirates in a sympathetic light.
• without the benefit of moon light, the light horsemen had fired at the flashes of the enemy’ s
• second innings, voce repeated the tactic late in the day, in fading light against woodfull and bill brown.
• , and the workers were transferred on 7 july to another facility belonging to early light, 30 km away in< unk> town.
• unk> brooklyn avenue ne near the university of washington campus in a small light@-@ industrial building leased from the university.
• factory where the incident took place is the< unk>(" early light") toy factory(< unk>), owned by hong
• , a 1934 comedy in which samuel was portrayed in an unflattering light, and mrs beeton, a 1937 documentary,< unk>
• stage effects and blue@-@ red light transitions give the video a surreal feel, while a stoic crowd make
• set against the backdrop of mumbai’ s red@-@ light districts, it follows the travails of its personnel and principal,
• themselves on the chinese flank in the foothills, before scaling the position at first light.

Transformer factor 47 in layer 4
Explanation: plants: noun, vegetation
• the distinct feature of the main campus is the mall, which is a large tree – laden grassy area where many students go to relax.
• each school in the london borough of hillingdon was invited to plant a tree, and the station commander of raf northolt, group captain tim o
• its diet in summer contains a high proportion of insects, while more plant items are eaten in autumn.
• large fruitings of the fungus are often associated with damage to the host tree, such as that which occurs with burning.
• she nests on the ground under the cover of plants or in cavities such as hollow tree trunks.
• orchards, heaths and hedgerows, especially where there are some old trees.
• the scent of plants such as yarrow acts as an olfactory attractant to females.
• of its grasshopper host, causing it to climb to the top of a plant and cling to the stem as it dies.
• well@-@ drained or sandy soil, often in the partial shade of trees.
• food is taken from the ground, low@-@ growing plants and from inside grass tussocks; the crake may search leaf
• into his thought that the power of gravity( which brought an apple from a tree to the ground) was not limited to a certain distance from earth,
• they eat both seeds and green plant parts and consume a variety of animals, including insects, crustaceans
• fyne, argyll in the 1870s was named as the uk ’ s tallest tree in 2011.
• ", or colourless enamel, as in the ground areas, rocks and trees.
• produced from 16 to 139 weeks after a forest fire in areas with coniferous trees.

This is the end of the visualization of low-level transformer factors. Click [D] to go back.

Transformer factor 13 in layer 10 Explaination: Unit exchange with parentheses: e.g. 10 m (1000cm) • 14@-@ 16 hand( 56 to 64 inches( 140 to 160 cm)) war horse is that it was a matter of pride to a •, behind many successful developments, defaulted on the$ 214 million($ 47 billion) in bonds held by 60@,@ 000 investors; the van • straus, behind many successful developments, defaulted on the$ 214 million($ 47 billion) in bonds held by 60@,@ 000 investors; • the niche is 4 m( 13 ft) wide and 3@. • with a top speed of nearly 21 knots( 39 km/ h; 24 mph). •@ 4 billion( us$ 21 million) — india’ s highest@-@ earning film of the year •) at deep load as built, with a length of 310 ft( 94 m), a beam of 73 feet 7 inches( 22@. •@ 3@-@ inch( 160 mm) calibre steel barrel. • and gave a maximum speed of 23 knots( 43 km/ h; 26 mph). • 2 km) in length, with a depth around 790 yards( 720 m), and in places only a few yards separated the two sides. • hull provided a combined thickness of between 24 and 28 inches( 60 – 70 cm), increasing to around 48 inches( 1@. • switzerland, austria and germany; and his mother, lynette federer( born durand), from kempton park, gauteng, is •@ 2 in( 361 mm) thick sides. •) and a top speed of 30 knots( 56 km/ h; 35 mph). •, an outdoor seating area( 4@,@ 300 square feet( 400 m2)) and a 2@,@ 500@-@ square@ Transformer factor 24 in layer 10 Explaination: Male name • divorcing doqui in 1978, michelle married robert h. tucker, jr. the following year, changed her name to gee tucker, moved back • divorced doqui in 1978 and married new orleans politician robert h. tucker, jr. the following year; she changed her name to gee tucker and became • including isabel sanford, when chuck and new orleans politician robert h. tucker, jr. visited michelle at her hotel. • of 32 basidiomycete mushrooms showed that mutinus elegans was the only species to show antibiotic( both antibacterial • amphicoelias, it is probably synonymous with camarasaurus grandis rather than c. 
supremus because it was found lower in the •[ her] for warmth and virtue" and mehul s. thakkar of the deccan chronicle wrote that she was successful in" deliver[ • em( queen latifah) and uncle henry( david alan grier) own a diner, to which dorothy works for room and board. • in melbourne on 10 august 1895, presented by dion boucicault, jr. and robert brough, and the play was an immediate success. • in the early 1980s, james r. tindall, sr. purchased the building, the construction of which his father had originally financed • in 1937, when chakravarthi rajagopalachari became the chief minister of madras presidency, he introduced hindi as a compulsory • in 1905 william lewis moody, jr. and isaac h. kempner, members of two of galveston’ • also, walter b. jones, jr. of north carolina sent a letter to the republican conference chairwoman cathy • empire’ s leading generals, nikephoros bryennios the elder, the doux of dyrrhachium in the western balkans • in bengali as< unk>: the warrior by raj chakraborty with dev and mimi chakraborty portraying the lead roles. • on 1 june 1989, erik g. braathen, son of bjørn g., took over as ceo Transformer factor 25 in layer 10 Explaination: Attributive Clauses • which allows japan to mount an assault on the us; or kill him, which lets the us discover japan’ s role in rigging american elections — • certain stages of development, and constitutive heterochromatin that consists of chromosome structural components such as telomeres and centromeres • to the mouth of the nueces river, and oso bay, which extends south to the mouth of oso creek. •@,@ 082 metric tons, and argentina, which ranks 17th, with 326@,@ 900 metric tons. • of$ 42@,@ 693 and females had a median income of$ 34@,@ 795. • ultimately scored 14 points with 70 per cent shooting, and crispin, who scored twelve points with 67 per cent shooting. • and is operated by danish air transport, and one jetstream 32, which seats 19 and is operated by helitrans. 
• acute stage, which occurs shortly after an initial infection, and a chronic stage that develops over many years. •, earl of warwick and then william of lancaster, and ada de warenne who married henry, earl of huntingdon. • who ultimately scored 14 points with 70 per cent shooting, and crispin, who scored twelve points with 67 per cent shooting. • in america, while" halo/ walking on sunshine" charted at number 4 in ireland, 9 in the uk, 10 in australia, 28 in canada • five events, heptathlon consisting of seven events, and decathlon consisting of ten< unk> every multi event, athletes participate in a •@-@ life of 154@,@ 000 years, and 235np with a half@-@ life of 396@. • comfort, and intended to function as the prison, and the second floor was better finished, with a hall and a chamber, and probably operated as the •b, which serves the quonset freeway, and exit 7a, which serves route 402( frenchtown road), another spur route connecting the Transformer factor 42 in layer 10 Explaination: Some kind of disaster, something unfortunate happened • after the first five games, all losses, jeff carter suffered a broken foot that kept him out of the line@-@ up for • allingham died of natural causes in his sleep at 3: 10 am on 18 july 2009 at his • upon reaching corfu, thousands of serb troops began showing symptoms of typhus and had to be quarantined on the island of< un • than a year after the senate general election, the september 11, 2001 terrorist attacks took place, with giuliani still mayor. • the starting job because fourth@-@ year junior grady was under suspension related to driving while intoxicated charges. • his majesty, but as soon as they were on board ship, they died of melancholy, having refused to eat or drink. • on 16 september 1918, before she had even gone into action, she suffered a large fire in one of her 6@-@ inch magazines, and • orange goalkeeper for long@-@ time starter john galloway who was sick with the flu. 
• in 1666 his andover home was destroyed by fire, supposedly because of" the carelessness of the maid". • the government, on 8 february, admitted that the outbreak may have been caused by semi@-@ processed turkey meat imported directly •ikromo came under investigation by the justice office of the dutch east indies for publishing several further anti@-@ dutch editorials. • that he could attend to the duties of his office, but fell ill with a fever in august 1823 and died in office on september 1. •@ 2 billion initiative to combat cholera and the construction of a$ 17 million teaching hospital in< unk • he would not hear from his daughter until she was convicted of stealing from playwright george axelrod in 1968, by which time rosaleen • relatively hidden location and proximity to piccadilly circus, the street suffers from crime, which has led to westminster city council gating off the man in Transformer factor 50 in layer 10 Explaination: Doing something again, or making something new again • 2007 saw the show undergo a revamp, which included a switch to recording in hdtv, the introduction • during the ship’ s 1930 reconstruction; the maximum elevation of the main guns was increased to+ 43 degrees, increasing their maximum range from 25@, • hurricane pack 1 was a revamped version of story mode; team ninja tweaked the • she was fitted with new engines and more powerful water@-@ tube boilers rated at 6@ • from 1988 to 2000, the two western towers were substantially overhauled with a viewing platform provided at the top of the north tower. • latest missoula downtown master plan in 2009, increased emphasis was directed toward redeveloping the north side’ s former rail yard and the area • 1896: the ribbon of the army version medal of honor was redesigned with all stripes being vertical. 
• the new badge includes a star to represent the european cup win in 1982, and
• missoula downtown master plan in 2009, increased emphasis was directed toward redeveloping the north side’ s former rail yard and the area just
• also assisted in comprehensive infrastructure renovations, restored a dependable supply of electricity, revamped the baggage handling facilities as well as the arrival and departure lounge
• hurricane pack 1 was a revamped version of story mode; team ninja tweaked the encounters
• 1896: the ribbon of the army version medal of honor was redesigned with all stripes being vertical.
• from 1988 to 2000, the two western towers were substantially overhauled with a viewing platform provided at the top of the north tower
• assisted in comprehensive infrastructure renovations, restored a dependable supply of electricity, revamped the baggage handling facilities as well as the arrival and departure lounges
• bond series and the fourth to star roger moore as bond; the plot was significantly changed from the novel to include excursions into space.

Transformer factor 51 in layer 10
Explanation: Apostrophe s (possessive)
• the irish times was critical of the book’ s text but wrote positively of the included photographs.
• if it survived long enough to become old@-@ fashioned it was likely to be
• you by phil spector as his inspirations, which resulted to the album’ s wall of sound resonance.
• the irish times was critical of the book’ s text but wrote positively of the included photographs.
• album to the wu tang clan and nine inch nails, particularly comparing the album’ s production( which was done by various producers with executive producer don gilmore
• to the wu tang clan and nine inch nails, particularly comparing the album’ s production( which was done by various producers with executive producer don gilmore)
• toward the commoners and interested in easing their burden but suspicious about the letter’ s true purpose, reluctantly signed the document under intense pressure from the french
• the novel’ s reception was even warmer than that of its predecessor; waugh was
• first song selected for inclusion after her mother’ s recommendation and the song’ s melancholic lyrics.
• it divided critics at the time; although they praised the game’ s writing and scale of choice, they criticized its technical flaws.
• mgm executive al lewin said that several years after the film’ s release stroheim asked him for the cut footage.
• the game’ s production was turbulent, as the design’ s scope exceeded the available resources
• nicki escudero from the phoenix new times noted the song’ s superficial themes which included lyrics about" sex, money and cheating"
• mgm executive al lewin said that several years after the film’ s release stroheim asked him for the cut footage.
• labrie said that there was" a lot of discussion" about the song’ s wording and how direct it should be.

Transformer factor 86 in layer 10
Explanation: Pattern: consecutive years, the convention for naming a football/rugby season
• with york the previous season, signed a contract until the end of 2013 – 14 and sheffield united midfielder elliott whitehouse signed on a one@-@
• as of the end of the 2014 – 15 season, aston villa have spent 104 seasons in the top tier of english
• won 13 and drew two of their opening 15 league matches of the 1985 – 86 campaign, and seemed destined to win the first division title.
• mcallister, still without a goal in 2009 – 10, couldn’ t get on the scoresheet in the three games
• john bentley led united to a fourth@-@ place finish in 1912 – 13.
• he made 46 appearances, scoring three goals, in the 2001 – 02 season before spending the close season with the kalamazoo kingdom in the
• he moved to basingstoke town towards the end of 2001 – 02, making his debut in march 2002.
• 7[ note 1] was the worst record in the nhl for 2011 – 12 and the first time in franchise history they finished in last place.
• side, who withdrew from the football league at the end of the 1893 – 94 season after finishing bottom of the second division.
• spent a year as a physics instructor at the university of minnesota in 1916 – 17, then two years as a research engineer with the westinghouse lamp
• defeat was a 7 – 2 loss to witton albion in the 2001 – 02 season.
• york achieved three successive wins for the first time in 2013 – 14 after beating northampton 2 – 0 away, with bowman and fletcher scoring in
• he started to develop more of an offensive game, finishing off the 2001 – 02 season with 58 points in the 47 games he played in seattle.
• suart limited matthews to 19 league appearances in 1958 – 59.
• jaw warriors of the western hockey league( whl) during the 2000 – 01 season.

Transformer factor 99 in layer 10
Explanation: Past tense
• r. in their review of rihanna’ s top 20 songs, time out ranked" man down" as their tenth best track, writing that it is
• rolling stone ranked" imagine" number three on its list of" the 500 greatest songs
• japan’ s computer entertainment rating organization( cero) rated ninja gaiden and black, on their release, as 18+
• ultimate classic rock ranked" lola" as the kinks’ third best song, saying"
• adrien begrand of popmatters described" south of heaven" as" an unorthodox set opener
• columbia records released it as the album’ s fourth and final single on june 14,
• rolling stone ranked it the best song of 2009 and the 36th@-@ best song
• indielondon’ s jack foley noted" wind it up" as a highlight of the sweet escape and called
• premiere magazine listed frank booth, played by dennis hopper, as the fifty@-@
• columbia records released" crazy in love" on may 18, 2003, as the lead
• the times considered the production the best since the original, and praised it for its fidelity
• the good food guide ranked hibiscus as the eighth@-@ best restaurant in the uk
• viz media later began releasing the manga as simply" ral grad" in february 2008.
• entertainment weekly magazine ranked" crazy in love" forty@-@ seven in its list of
• the japanese publisher nihon bungeisha released the series in collected volumes from january 2000 to september 2009.

Transformer factor 102 in layer 10
Explanation: African name
• s 1966 to 1971 live performances in paris, prepared to press the album once mwanga provided the label with the record< unk>.
• of america" with the nhk symphony orchestra, but cancelled both deals upon mwanga’ s return from japan.
• 1966 to 1971 live performances in paris, prepared to press the album once mwanga provided the label with the record< unk>.
• and langston hughes, and by modern african poets and folk artists such as kwesi brew and efua sutherland, which also influenced her auto
• america" with the nhk symphony orchestra, but cancelled both deals upon mwanga’ s return from japan.
• du bois was buried in accra near his home, which is now the du bois memorial centre.
• du bois returned to africa in late 1960 to attend the inauguration of nnamdi azikiwe as the first african governor of nigeria.
• david mcgurk, lanre oyebanjo, danny parslow, tom platt and chris smith signed new
• and moderate nationalist parties, the most prominent of which was bishop abel muzorewa’ s united african national council( uanc).
• a few weeks after of human feelings was recorded, mwanga went to japan to negotiate a deal with trio records to have the
• and was part of two large campaigns, one to witu and another to mwele.
• returned to africa in late 1960 to attend the inauguration of nnamdi azikiwe as the first african governor of nigeria.
• in april, mwanga arranged another session at cbs studios in new york city, and coleman
• the government and moderate nationalist parties, the most prominent of which was bishop abel muzorewa’ s united african national council( uanc).
• ralambo’ s father, andriamanelo, had established rules of succession by which ralambo’

Transformer factor 125 in layer 10
Explanation: Describing someone in a paraphrasing style: name, career
• journalist tim judah suggests that the move may have been motivated by a desire to control a
• the historian nora berend says that the latter measure" may have adversely affected
• from the pyx that were not assayed, and numismatic historian roger burdette speculates that ashbrook, generally well@-
• the pyx that were not assayed, and numismatic historian roger burdette speculates that ashbrook, generally well@-@
• the cricket historian derek birley notes that many of these bowlers imitated the methods of
• critic roberta reeder notes that the early poems always attracted large numbers of admirers
•; the figures for the last two years are not available, but sf historian mike ashley estimates that fantastic paid circulation may have been as low as 13@
• aesthetically, ign’ s tal blevins noted that the game had" a very distinct 40s
• sf historian everett bleiler notes that hersey did not mention the venture in his
• similarly, duke university professor, mark anthony neal, writes, “ nas was at the forefront of a renaissance
• club reviewer erik adams wrote that the episode was a perfect mix, between the more subtle
• commenting on the album and its use of samples, pitchfork’ s jeff weiss claims that both nas and his producers found inspiration for the album’
• the historian stanley karnow said of ky and thi:" both fl
• that were not assayed, and numismatic historian roger burdette speculates that ashbrook, generally well@-@ treated by the
• irataba was described as an eloquent speaker, and linguist leanne hinton suggests that he was among the first mohave people to

Transformer factor 134 in layer 10
Explanation: Transition sentence
• fanny workman have tended to slight or belittle her achievements, but contemporaries, unaware of the far greater accomplishments to come, held the workmans
• scheduled to air in its regular half@-@ hour time slot, but nbc later announced it would be expanded to fill an hour time slot beginning a
• wine and savoy cabbage with a red wine and smoked chocolate sauce, but he otherwise felt that the food was" over@-@ worked" and the
• lap melee when he was hit by romain grosjean; webber was forced to pit straight away, while grosjean was given a ten@
• ra. one was initially scheduled to release on 3 june 2011, but delays due to a lengthy post@-@ production process and escalating
• yamina nomads who were centered at tuttul, and the rebels were supported by yamhad’ s king sumu@-@
• the item was intended simply as a piece of news, but telegraph lines quickly spread the news throughout the state, fueling procession sentiment.
• both twc and comcast began trials of services based on the system; turner broadcasting was an early supporter of the system, providing access to tbs and
•k> have claimed that he proposed a dictatorship for robespierre, but nonetheless some of them considered him to be redeemable, or at least
•@ 2 style with superfiring pairs of turrets fore and aft; the middle turrets were not superfiring, and had a funnel between them.
• romani being ordered to move out with supplies for the advancing troops, but 150 men, most of whom were past the end of their contracts and entitled to
•’ s boats to enter the creek into which the schooner had fled, the small craft entering the waterway in the hope of storming and capturing the vessel
•-@ person shooter elements and a unique on rails control scheme, but the core adventure@-@ style gameplay has been compared to myst and snatch
• stanza 6; movement 4 incorporates ideas from stanzas 7 – 14, and movement 5 relies on stanzas 15 and< unk> movement 2,
• as corps troops that were usually allocated at a rate of one per division; several of the militia units were also later designated australian imperial force units, after

Transformer factor 152 in layer 10
Explanation: In some locations
• while most breeding stallions and racehorses of the era had stable companions, waxy reportedly was fond of rabbits in his later years and
• planet and the helter skelter music bookshop have also been based on the street.
• the central bank of somalia, the national monetary authority, also has its headquarters in mogadishu.
• some allotropes of the other actinides also exhibit similar behaviour, though to a lesser degree.
•; fortune 1000 technology company< unk>, for instance, is headquartered in the area.
• musical@-@ comedy television series maid marian and her merry men were filmed in cleeve abbey.
• ireland,< unk> and donegal bay in particular, have popular surfing beaches, being fully exposed to the atlantic ocean.
•lstoy’ s war and peace and chekhov’ s peasants both feature scenes in which wolves are hunted with hounds and< unk>.
• while most breeding stallions and racehorses of the era had stable companions, waxy reportedly was fond of rabbits in his later years and"
• the lancashire and england test cricketer paul allott was born in altrincham.
•asura, the demon devotee of shiva, are both credited with building temples or cut caves to live.
• forbidden planet and the helter skelter music bookshop have also been based on the street.
•thopedic shriners hospitals in the u. s. is also located in spokane.
• dykes to watch out for and fun home, was born in lock haven in 1960.
•< unk>, and alessandra ambrosio have each worn two fantasy bras.

Transformer factor 297 in layer 10 with saliency map
Explanation: Repetitive structure detector
frontier works, and an original soundtrack by avex group were created based on the game. drama cd: tales of graces 1 to 4 are side stories that take place during the game’ s plot. they were released between may 26, 2010 and august 25, 2010. anthology drama cd: tales of graces f 2010 winter, anthology drama cd: tales of graces f 2011 summer, anthology drama cd: tales of graces f 2012 winter, anthology drama cd: tales of graces f 2012 summer, anthology drama cd: tales of graces f 2013 winter, and anthology drama cd: tales of graces f 2013 cobrand platinum cardholders, and citibank eva air cobrand world card) the infinity( infinity mileagelands diamond, royal laurel/ premium laurel class passengers, star alliance first/ business class passengers, american express centurion/ eva air cobrand platinum cardholders, and citibank eva air cobrand world cardholders) the star( infinity mileagelands diamond/ gold, royal laurel/ premium laurel class passengers, star alliance first/ business class passengers, star alliance gold members, american express centurion/ eva air cobrand platinum cardholders, citibank eva air cobrand world cardholders, business customers, quickly set online< unk> alight"." can’ t get you out of my head" was chosen as the lead single from minogue’ s eighth studio album fever, and it was released on 8 september 2001 by parlophone in australia, while in the united kingdom and other european countries it was released on 17 september." can’ t get you out of my head" was written and produced by cathy dennis and rob davis, who had been put together by british artist manager simon fuller, who wanted the duo to come up with a song for british pop group s club 7. the song was recorded using cuba typhoon status with two@-@ minute sustained winds estimated at 125 km/ h( 78 mph). around 1700 utc on may 31, the storm tracked approximately 65 km( 40 mi) west of iwo jima.
roughly five hours later, it moved within 15 km( 10 mi) of chichi@-@ jima where a pressure of 992 mb( hpa; 29@.@ 30 inhg) was measured. sustained winds on chichi@-@ jima reached 95 km/ h( 60 mph); however, these were determined to be unrepresentative of lucille’ s actual intensity due first book in vocal music. the modern music series. book 1. new york, new york: silver burdette and company. smith, eleanor( 1901). a second book in vocal music. the modern music series. book 2. new york, new york: silver burdette and company. smith, eleanor( 1901). a third book in vocal music. the modern music series. book 3. new york, new york: silver burdette and company. smith, eleanor( 1905). a fourth book in vocal music. the modern music series. book 4. new york, new york: silver burde @ breaking eight weeks at number one on the airplay chart of the country and became the first to garner 3000 radio plays in a single week. subsequently, it became the most@-@ played song of 2001 in the region." can’ t get you out of my head" was certified platinum by the british phonographic industry for shipments of 600@,@ 000 units in 2001. the certification was upgraded to double@-@ platinum in 2015, denoting shipments of 1@,@ 200@,@ 000 units. in the united states," can’ t get you out of my head" peaked at number seven on the chart. in mid@-@ august 2015," la mordidita" earned martin his twenty@-@ sixth top ten hit on hot latin songs. he became the fourth artist with the most top tens in the 29@-@ year history of the chart. in late august 2015, martin earned with" la mordidita" his fifteenth number@-@ one on the latin airplay chart( up 58 percent, to 11@.@ 8 million audience impressions). eventually," la mordidita" peaked at number six on the us hot latin songs chart, number one on latin airplay and , was delivered to sukhoi’ s experimental workshop to be outfitted with exclusive systems. built by knaapo, its structure has increased carbon@-@ fibre and al@-@ li content.
installed was the 2d thrust@-@ vectoring lyulka al@-@ 31fp, an interim measure pending the availability of the al@-@ 37fu(< unk>< unk>," afterburner@-@ controlled"). the 3d thrust@-@ vectoring lyulka al@-@ 37fu was still in development. the al@-@ 31fp, in ke’ s former band, though escape the fate only charted at number 25, seven spots lower than the drug in me is you, despite equal sales. in its second week on sales, the drug in me is you dropped about 70% in the united states, selling 5@,@ 870 copies. this dropped the album 60 spots to number 79 on the billboard 200, and brought total us sales for the album to around 24@,@ 000 copies. on the billboard charts, the drug in me is you charted at number two on the top hard rock albums chart, number three on the top alternative albums and top rock albums charts, no, no, no", reached number one on the billboard hot r& b/ hip@-@ hop singles& tracks and number three on the billboard hot 100. its follow@-@ up single," with me part 1" failed to reproduce the success of" no, no, no". meanwhile, the group featured on a song from the soundtrack album of the romantic drama why do fools fall in love and" get on the bus" had a limited release in europe and other markets. in 1998, destiny’ s child garnered three soul train lady of soul awards including best new artist for" no, no, no oistic warmongers. alexander krivenko( jonathan adams) finally, introduced in trivial games and paranoid pursuits, is russian alexander krivenko, the commander of the moonbase where the ispf have their headquarters. a winner of the nobel prize for medicine, it is krivenko’ s research into bone damage that has contributed to enabling humanity to access space easily. although the star cops are independent, spring’ s relationship with krivenko is often deferential and he frequently seems to capitulate to krivenko’ s wishes.== production history===== origins= that build faith: from the life and ministry of thomas s. 
monson, salt lake city, utah: deseret book, isbn 978@-@ 0@-@ 87579@-@ 901@-@ 8 — —( 1996), faith rewarded: a personal account of prophetic promises to the east german saints, salt lake city, utah: deseret book, isbn 978@-@ 1@-@ 57345@-@ 186@-@ 4 — —( 1997), invitation to exaltation, salt lake city, utah: deseret book, isbn 978@- dell, tom( 2015). gunnerkrigg court volume 5:< unk>. gunnerkrigg court. archaia studios press. isbn 978@-@< unk>.=== side comics=== siddell, tom( 2013). annie in the forest part one. beyond the walls. robot voice comics. siddell, tom( 2013). annie in the forest part two. beyond the walls. robot voice comics. siddell, tom( 2015). traveller. beyond the walls. robot voice comics.=== explanatory footnotes====== 95@.@ 4 kn) f402@-@ rr@-@< unk> engine, while later examples were fitted with the 23@,@ 000 lbf( 105@.@ 8 kn) f402@-@ rr@-@ 408a. in the early 2000s, 17 tav@-@ 8bs were upgraded to include a night@-@ attack capability, the f402@-@ rr@-@ 408 engine, and software and structural changes.< unk> in 1991, the night attack harrier was the first upgrade of the av@-@ 8 , aitrus’ meeting with ti’ ana, and the birth of their son gehn. the book also explains the destruction of the d’ ni civilization. two d’ ni, veovis and a’ gaeris, plot to destroy their civilization, which they believe has been corrupted. veovis and a’ gaeris create a plague which wipes out many of the d’ ni and follows them through the ages. veovis is murdered by a’ gaeris for refusing to write an age where the two of them would have been worshipped as gods, and aitrus sacrifices himself in order to inants". a gbrmpa briefing stated the company had" threatened a compensation claim of$< unk> should the gbrmpa intend to exert authority over the company’ s operations". 
in response to the< unk> of the dumping incidents, the gbrmpa stated: we have strongly encouraged the company to investigate options that don’ t entail releasing the material to the environment and to develop a management plan to eliminate this potential hazard; however, gbrmpa does not have legislative control over how the< unk> tailings dam is managed.===< unk>=== following a of warped tour. following this, a lesson in romantics was released on july 10 through fearless records. in august, the band went on tour with olympia and sound the alarm. the music video for" when i get home, you’ re so dead", directed by marco de la torre, was filmed in september. in late september 2007, the band supported paramore in japan and australia. the band went on a co@-@ headlining tour with madina lake in october and november. the" when i get home, you’ re so dead" music video was released on november 14, and the single was released on of the english football league including promotion and relegation. the player’ s team begins with a low rating in an 8@-@ team league. by winning games, the player earns credits, which can be used to purchase the contracts of free agents. by finishing high in the 8@-@ team league, the player’ s team advances to a 16@-@ team league and eventually a 32@-@ team league. the player improves their team by periodically signing free agents, as the competition is tougher in each league. the player wins the mode after winning a playoff tournament in the 32@-@ team league oda ministra kultury i sztuki ii< unk>) 1972 – member of commission" poland 2000" of the polish academy of sciences 1973 prize of the minister of foreign affairs for popularization of polish culture abroad(< unk> ministra< unk>< unk> za< unk>< unk> kultury za< unk>) literary prize of the minister of culture and art(< unk>< unk> ministra kultury i sztuki) and honorary member of science fiction writers of america 1976 – state prize 1st level in the area of to power the antarctic outpost.
above earth, ba’ al’ s armada arrives. to the displeasure of his subordinates, the other system lords, ba’ al announces that he will treat the tau< unk> leniently. suspicious about ba’ al’ s thorough knowledge of earth, qetesh betrays him and forces him to tell her everything. she orders the destruction of mcmurdo and the ancient outpost in ba’ al’ s name, but she kills ba’ al when teal’ c discovers what she is doing. as teal’ c escapes to an al< unk>, qetesh ( 156+ kn) each fuel capacity: 18@,@ 000 lb( 8@,@ 200 kg) internally, or 26@,@ 000 lb( 12@,@ 000 kg) with two external fuel tanks performance maximum speed: at altitude: mach 2@.@ 25( 1@,@ 500 mph, 2@,@ 410 km/ h)[ estimated] supercruise: mach 1@.@ 82( 1@,@ 220 mph, 1@,@ 960 km/ h) range:> 1@,@ 600 nmi( 1@,@ 840 mi, 2@,@ 960

Transformer factor 322 in layer 10 with saliency map
Explanation: Biography, someone born in some year…
. only three pitchers threw more complete games in major league careers shorter than getzein’ s nine@-@ year career. getzein had his most extensive playing time with the detroit wolverines, compiling records of 30@-@ 11 and 29@-@ 13 in 1886 and 1887. in the 1887 world series( which detroit won, 10 games to 5), getzein pitched six complete games and compiled a 4@-@ 2 record with a 2@.@ 48 era. he also won 23 games for the boston beaneaters in 1890.== early years== getzein was born in 1864 and telegraph lines and networks. the west construction company, based in chattanooga, tennessee, was a general contracting and construction firm also involved in the operation and maintenance of railway, telephone, and telegraph lines.== personal life===== marriage and children=== on april 10, 1875, in hampshire county, flournoy married frances" fannie" ann armstrong white( april 10, 1844 – february 25, 1922), the daughter of hampshire county clerk of court john baker white and his wife frances ann streit white.
frances white’ s brother, robert white, served as west virginia attorney general, and her buffalo, new york businessman who made his fortune in five@-@ and@-@ dime stores. he merged his more than 100 stores with those of his first cousins, frank winfield woolworth and charles woolworth, to form the f. w. woolworth company. he went on to hold prominent positions in the merged company as well as marine trust co. he was the father of seymour h. knox ii and grandfather of seymour h. knox iii and northrup knox, the co@-@ founders of the buffalo sabres in the national hockey league.== biography== he was born in april 1861 in russell, saint lawrence stars for eighteen years. the american film institute( afi) ranked cooper eleventh on its list of the twenty five greatest male stars of classic hollywood cinema.== early life== frank james cooper was born on may 7, 1901, at 730 eleventh avenue in helena, montana to english immigrants alice( nee brazier, 1873 – 1967) and charles henry cooper( 1865 – 1946). his father emigrated from houghton regis, bedfordshire and became a prominent lawyer, rancher, and eventually a montana supreme court justice. his mother emigrated from gillingham, kent and married charles in montana. in 1906, charles purchased the 600@-@ acre orange( 1971), which kubrick pulled from circulation in the uk following a mass media frenzy — most of his films were nominated for oscars, golden globes, or bafta awards. his last film, eyes wide shut, was completed shortly before his death in 1999.== early life== stanley kubrick was born on july 26, 1928, in the lying@-@ in hospital at 307 second avenue in manhattan, new york city. he was the first of two children of jacob leonard kubrick( may 21, 1902 – october 19, 1985), known as jack or jacques, and his wife sadie gertrude kubrick managed with a catch and release regulation. trophy trout and wild brook trout enhancement regulations apply to the remainder. 
a total of 31 class a wild trout waters have been designated as wilderness trout streams. fishing in class a wild trout waters is permitted year@-@ round, although the killing of fish is forbidden from labor day to the beginning of the following year’ s trout season.== gallery=== henry bell gilkeson= henry bell gilkeson( june 6, 1850 – september 29, 1921) was an american lawyer, politician, school administrator, and banker in west virginia. gilkeson was born in moorefield, movement, there have been few more remarkable figures than marjory stoneman douglas."== early life== marjory stoneman was born on april 7, 1890, in minneapolis, minnesota, the only child of frank bryant stoneman( 1857 – 1941) and lillian trefethen( 1859 – 1912), a concert violinist. one of her earliest memories was her father reading to her the song of hiawatha, at which she burst into sobs upon hearing that the tree had to give its life in order to provide hiawatha the wood for a canoe. she was an early and voracious reader amazon. com.=== dvd release==== johann mickl= johann mickl( 18 april 1893 – 10 april 1945) was an austrian@-@ born generalleutnant and division commander in the german army during world war ii, and was one of only 882 recipients of the knight’ s cross of the iron cross with oak leaves. he was commissioned shortly before the outbreak of world war i, and served with austro@-@ hungarian forces on the eastern and italian fronts as company commander in the imperial@-@ royal mountain troops. during world war i he was decorated several times for bravery and leadership, and very unusual properties, such as a quantum critical point behavior, exotic superconductivity, and high@-@ temperature ferromagnetism.= babe ruth= george herman ruth jr.( february 6, 1895 – august 16, 1948), better known as babe ruth, was an american professional baseball player whose career in major league baseball( mlb) spanned 22 seasons, from 1914 through 1935.
nicknamed" the bambino" and" the sultan of swat", he began his mlb career as a stellar left@-@ handed pitcher for the boston red sox, but achieved his greatest fame as a slugging outfielder for the air in regular scheduled services. it includes the city, country, airport and the period in which the airline served the airport. hubs are denoted with a dagger().= william s. taylor= william sylvester taylor( october 10, 1853 – august 2, 1928) was the 33rd governor of kentucky. he was initially declared the winner of the disputed gubernatorial election of 1899, but the kentucky general assembly, dominated by the democrats, reversed the election results, giving the victory to his democratic party( united states) opponent, william goebel. taylor served only 50 days as governor. a poorly educated but politically astute lawyer, taylor woods hole, massachusetts, where he studied marine bioluminescence. he also worked at the woods hole oceanographic institution.== early life== george thomas reynolds was born in trenton, new jersey on may 27, 1917, the son of george w. reynolds, a< unk> for the pennsylvania railroad, and his wife laura, a secretary with the new jersey department of geology. he attended franklin junior high school in highland park, new jersey, until year 10, and then new brunswick high school. he received a bachelor’ s degree in physics from rutgers university in 1939. he then entered princeton university, where was awarded == shaughnessy was born on march 6, 1892 in st. cloud, minnesota, the second son of lucy ann( foster) and edward shaughnessy. he attended north st. paul high school, and prior to college, had no athletic experience. when he attended the university of minnesota, however, he played college football under head coach henry l. williams and alongside halfback bernie bierman. shaughnessy considered williams to be football’ s greatest teacher, and williams considered him to be the best passer from the midwest.
shaughnessy handled both the passing and kicking duties for the team. he played on s gregoras likewise avoids negative comments, as do most modern historians.= george nicol( baseball)= george edward nicol( october 17, 1870 – august 4, 1924) was an american baseball pitcher and outfielder who played three seasons in major league baseball( mlb). he played for the st. louis browns, chicago colts, pittsburgh pirates and louisville colonels from 1890 to 1894. possessing the rare combination of batting right@-@ handed and throwing left@-@ handed, he served primarily as a right fielder when he did not pitch. signed by the browns without having previously played any minor league baseball, nicol made his dispatched powell and major benjamin mcculloch to utah to ease tensions with brigham young and the mormons. powell assumed his senate seat on his return from utah, just prior to the election of abraham lincoln as president. powell became an outspoken critic of lincoln’ s administration, so much so that the kentucky general assembly asked for his resignation and some of his fellow senators tried to have him expelled from the body. both groups later renounced their actions. powell died at his home near henderson, kentucky shortly following a failed bid to return to the senate in 1867.== early life== powell was born on october 6, 1812 near henderson, the army in 1948. he was promoted to lieutenant general just before his retirement on 29 february 1948 in recognition of his leadership of the bomb program. by a special act of congress, his date of rank was backdated to 16 july 1945, the date of the trinity nuclear test. groves went on to become a vice@-@ president at sperry rand.== early life== leslie richard groves jr. was born in albany, new york, on 17 august 1896, the third son of four children of a pastor, leslie richard groves sr., and his wife gwen nee griffith. 
a descendant of french huguenots who , burns died on november 11, 1928 in brooklyn, new york.== biography== thomas p. burns was born on september 6, 1864, in philadelphia. his parents, patrick and mary burns, were both irish immigrants. in 1883, burns began his professional baseball career as a pitcher with harrisburg of the minor@-@ league interstate association. on the year, burns posted an earned run average( era) of 2@.@ 30 over 20 games pitched, 15 of which were starts. when he wasn’ t pitching, burns played second and third base. burns began the 1884 season playing for the wilmington quicksteps, @ beats".== credits and personnel== lady gaga – vocals, songwriter and producer redone – songwriter, producer, vocal editing, vocal arrangement, audio engineering, instrumentation, programming, and recording at tour bus in europe trevor muzzy – recording, vocal editing, audio engineering, and audio mixing at larrabee, north hollywood, los angeles, california gene grimaldi – audio mastering at oasis mastering, burbank, california credits adapted from born this way album liner notes.== charts=== travis jackson= travis calvin jackson( november 2, 1903 – july 27, 1987) was an american baseball shortstop. = monson was born on august 21, 1927, in salt lake city, utah to g. spencer monson( 1901 – 1979) and gladys< unk> monson( 1902 – 1973). the second of six children, he grew up in a" tight@-@ knit" family — many of his mother’ s relatives living on the same street and the extended family frequently going on trips together. the family’ s neighborhood included several residents of mexican descent, an environment in which he says he developed a love for the mexican people and culture. monson often spent weekends with relatives on their farms in granger( it. anderson was a professional accordion player and wrote poetry for various american pagan magazines.
in 1970, he published his first book of poetry, thorns of the blood rose, which contained devotional religious poetry dedicated to the goddess; it won the clover international poetry competition award in 1975. anderson continued to promote the feri tradition until his death, at which point april niino was appointed as the new grandmaster of the tradition.== early life===== childhood: 1917 – 1931=== anderson was born on may 21, 1917 at the buffalo horn ranch in clayton, new mexico. his parents were hilbart alexander anderson was elsewhere. he had recently become engaged and bought his first house in hillsborough. franklin and benjamin pierce were among the prominent citizens who welcomed president jackson to the state on his visit in mid@-@ 1833.=== marriage and children=== on november 19, 1834, pierce married jane means appleton( march 12, 1806 – december 2, 1863), the daughter of jesse appleton, a congregational minister and former president of bowdoin college, and elizabeth means. the appletons were prominent whigs, in contrast with the pierces’ democratic affiliation. jane was shy, devoutly religious, and pro@-@ temperance which took delivery of its eight and last globemaster in november 2015; no. 38 squadron, operating king airs; and the australian army’ s 68 ground liaison section. all units are based at amberley, with the exception of no. 38 squadron, located at townsville.= clark shaughnessy= clark daniel shaughnessy( originally o’ shaughnessy)( march 6, 1892 – may 15, 1970) was an american football coach and innovator. 
he is sometimes called the" father of the t formation" and the original founder of the forward pass, although that system had previously been used as early as the 1880s

Transformer factor 386 in layer 10 with saliency map
Explanation: topic: war

he was awarded a companion of the order of st michael and st george for his command of the 4th machine gun battalion, the recommendation of which particularly citing his success during attacks on the hindenburg line. murray’ s final honour came on 11 july 1919, when he was mentioned in despatches for the fourth time, having received his third mention on 31 december 1918. from june to september 1919, murray — along with fellow australian victoria cross recipient william donovan joynt — led parties of aif members on a tour of the farming districts of britain and denmark to study agricultural methods under the education schemes. after touring through france and belgium, from large@-@ calibre shells; one of them, allegedly a 14@-@ inch( 356 mm) round, blew a large hole in her quarterdeck and wrecked the wardroom and the gunroom. she also took several hits by light shells that day, and, although she suffered damage to her superstructure, her fighting and steaming capabilities were not seriously impaired. the ship also participated in the main attack on the dardanelles forts on 18 march. this time a 6@-@ inch( 152 mm) howitzer battery opened fire on agamemnon and hit her 12 times in 25 minutes; five of the . lt. riefkohl, who was also the first puerto rican to graduate from the united states naval academy, served as a rear admiral in world war ii. frederick l. riefkohl’ s brother, rudolph william riefkohl also served. riefkohl was commissioned a second lieutenant and assigned to the 63rd heavy artillery regiment in france where he actively participated in the meuse@-@ argonne offensive.
according to the united states war department, after the war he served as captain of coastal artillery at the letterman army medical center in presidio of san francisco, in california( 1918). washington times@-@ herald, which ran the headline" hardy wild@-@ eyed aussies called world’ s finest troops". an article in the chicago daily news told its readers that australians" in their realistic attitude towards power politics, prefer to send their boys to fight far overseas rather than fighting a battle in the suburbs of sydney". during the battle, wavell had received a cable from general sir john dill stressing the political importance of such victories in the united states, where president franklin d. roosevelt was attempting to get the lend@-@ lease act passed. it was finally enacted in march 1941. mackay wrote . he also showed respect for occupied populations and never tolerated pillaging nor violence from his men. as a sign a gratitude, he was offered gifts several times but he was often seen refusing and sending them back. while on campaign in tyrol, he was recorded to have accepted a large sum of money but he immediately distributed it to the local hospitals. further evidence of his humanity was the care that he displayed for the lives and well@-@ being of his men, whom he was always reluctant to sacrifice for the sake of glory. overall as a heavy cavalry commander, nansouty was one of the best men available during the napoleonic @ 000 troops on 11 february. in march 1919, princess matoika and rijndam raced each other from saint@-@ nazaire to newport news in a friendly competition that received national press coverage in the united states. rijndam, the slower ship, was just able to edge out the princess — and cut two days from her previous fastest crossing time — by appealing to the honor of the soldiers of the 133rd field artillery( returning home aboard the former holland america liner) and employing them as extra stokers for her boilers.
on her next trip, the veteran transport loaded troops at saint@-@ nazaire @ july, met his wife in new york, and together they traveled to columbus, georgia by way of washington, d. c. and atlanta.== military schools== for the ten years following world war i, troy middleton would be either an instructor or a student in the succession of military schools that army officers attend during their careers. middleton arrived in columbus, georgia with strong praise from his superiors, and would soon get his efficiency report, in which brigadier general benjamin poore of the 4th division wrote of him," the best all@-@ around officer i have yet seen.< unk> by his rapid promotion from coal and 700 long tons( 710 t) of fuel oil and that provided her a range of 3@,@ 500 nautical miles( 6@,@ 500 km) at a speed of 10 knots( 19 km/ h). her main armament consisted of a dozen obukhovskii 12@-@ inch( 305 mm) pattern 1907 52@-@ calibre guns mounted in four triple turrets distributed the length of the ship. the russians did not believe that super firing turrets offered any advantage as they discounted the value of axial fire and believed that super firing turrets could not fire while over the lower turret because of ’ ll still be playing from 2007" and awarded it" playstation 3@-@ exclusive game of the year".= 11th battalion( australia)= the 11th battalion was an australian army battalion that was among the first infantry units raised during world war i for the first australian imperial force. it was the first battalion recruited in western australia, and following a brief training period in perth, the battalion sailed to egypt where it undertook four months of intensive training. in april 1915 it took part in the invasion of the gallipoli peninsula, landing at anzac cove. in august 1915 the battalion was in action in the battle of lone pine. following was transferred to western australia, being attached to the 6th brigade, which was based around geraldton. 
in september 1942, as part of an army@-@ wide reduction that came about because of over@-@ mobilisation, the battalion was amalgamated with the 14th battalion to become the 14th/ 32nd battalion( prahran/ footscray regiment). in early 1943, the 14th/ 32nd battalion carried out amphibious warfare training in queensland before being deployed to the buna – gona area in new guinea in july. the battalion would remain in mainland new guinea and new britain for the next two years, under the command of lieutenant in an allied air raid on 10 december 1941, mickl was appointed to temporarily command the division. during december, mickl was wounded in the head and hand, but remained at his post. rommel recommended mickl for the knight’ s cross of the iron cross, for his leadership at sidi rezegh, and it was duly awarded on 13 december 1941. the harsh conditions of desert warfare had begun to affect mickl’ s health, so at the end of december he was sent home on convalescent leave.=== eastern front======= 12th rifle brigade==== on 25 on to bijeljina which was taken against light partisan resistance late on 16 march. the 27th regiment then consolidated its position in bijeljina while the 28th regiment and the divisional reconnaissance battalion( german:< unk>) bore the brunt of the fighting as they advanced through< unk>, celic and koraj at the foot of the majevica mountains. sauberzweig later recorded that the 2nd battalion of the 28th regiment( ii/ 28)" at celic stormed the partisan defenses with( new) battalion commander hans hanke at the point" and that enemy forces withdrew after of matthews, the company 2ic, who had taken command almost immediately after the company commander was wounded. under his command, each of the platoons assaulted a different cluster of buildings to which they had been assigned during training on the replica village at hastings.
the west side boys’ ammunition store was found and secured and, once the rest of the buildings had been cleared, the paras took up defensive positions to block any potential counter@-@ attack and patrols went into the immediate jungle in search of any west side boys hiding in the bushes. the village was completely secure by 08: 00 and the paras secured the approaches with claymore ), increased her metacentric height to 6@.@ 3 feet( 1@.@ 9 m) at deep load, and all of the changes to her equipment increased her crew to a total of 1@,@ 188. despite the bulges she was able to reach a speed of 21@.@ 75 knots( 40@.@ 28 km/ h; 25@.@ 03 mph). a brief refit in early 1927 saw the addition of two more four@-@ inch aa guns and the removal of the six@-@ inch guns from the shelter deck. about 1931, a high@-@ angle control became enraged at him, slapping him across the face. he began yelling:" your nerves, hell, you are just a goddamned coward. shut up that goddamned crying. i won’ t have these brave men who have been shot at seeing this yellow bastard sitting here crying." patton then reportedly slapped bennett again, knocking his helmet liner off, and ordered the receiving officer, major charles b. etter, not to admit him. patton then threatened bennett," you’ re going back to the front lines and you may get shot and killed, but you’ re going to fight. if you don’ t, i’ secondary guns, two of which were disabled. the ammunition stores for these two guns were set on fire and the magazines had to be flooded to prevent an explosion. the ship nevertheless remained combat effective, as her primary battery remained in operation, as did most of her secondary guns; konig could also steam at close to her maximum speed. other areas of the ship had to be counter@-@ flooded to maintain stability; 1@,@ 600 tons of water entered the ship, either as a result of battle damage or counter@-@ flooding efforts. 
the flooding rendered the battleship sufficiently low in the water to prevent the ship from being able in 1924 and rice institute, houston, texas in 1928. he dropped out of graduate school after one year and decided to hitchhike to san francisco. the lack of work meant hunger, so he chose to join the united states army’ s 11th cavalry regiment as a private on july 30, 1930, serving in monterey, california. after a year in the horse cavalry, parrish became an aviation cadet in june 1931 and subsequently qualified as an enlisted pilot. he completed flight training in 1932 and was assigned to the 13th attack squadron at fort crockett, near galveston, texas. one year later in september 1933 parrish during the battle, murray was awarded the victoria cross. soon after his victoria cross action, he was promoted to major and earned a bar to his distinguished service order during an attack on the hindenburg line near bullecourt. promoted to lieutenant colonel in early 1918, he assumed command of the 4th machine gun battalion, where he would remain until the end of the war. returning to australia in 1920, murray eventually settled in queensland, where he purchased the grazing farm that would be his home for the remainder of his life. re@-@ enlisting for service in the second world war, he was appointed as commanding officer 10 officers and 315 enlisted men, plus an additional four officers and 19 enlisted men if serving as a flotilla flagship.== construction and career== the ship was ordered on 7 july 1934 and laid down at deutsche werke, kiel, on 2 january 1935 as yard number< unk>. she was launched on 30 november 1935 and completed on 8 april 1937. she was named after max schultz who commanded the torpedo boat< unk> and was killed in action in january 1917. korvettenkapitan martin< unk> was appointed as her first captain. max schultz was assigned to the 1st destroyer division on 26 the command of otto von diederichs.
the squadron participated in the fall maneuvers in 1894, which simulated a two@-@ front war against france and russia; deutschland’ s squadron acted as the russian fleet during the exercises. between 1894 and 1897, deutschland was rebuilt in the imperial dockyard in wilhelmshaven. the ship was converted into an armored cruiser; her heavy guns were removed and replaced with lighter weapons, including eight 15 cm( 5@.@ 9 in) and eight 8@.@ 8 cm( 3@.@ 5 in) guns. her entire rigging equipment was removed and two heavy military masts were installed called on many times to maintain order in times of disaster and to keep peace during periods of political unrest. oklahoma governor john c. walton used division troops to prevent the state legislature from meeting when they were preparing to impeach him in 1923. governor william h. murray called out the guard several times during the depression to close banks, distribute food and once to force the state of texas to keep open a free bridge over the red river which texas intended to collect tolls for, even after federal courts ordered the bridge not be opened. the division would go on to see combat in world war ii as one of four national guard divisions active during

Transformer factor 170 in layer 10 with saliency map
Explanation: topic: music production

2nd street tunnel and part of downtown los angeles spread out over a 48@-@ hour period. kesha explained the idea behind the video as well as the experience during an interview with mtv news; she said that the video was different from her other videos, noting that it was going to show a sexier side of herself. the music video for" we r who we r" is presented as an underground party. the video starts off with futuristic flashing lights.
kesha, seen in a ponytail wearing gray and black makeup, chains, ripped stockings, and a sparkly one@-@ piece leotard made of shards of broken and several european territories), her" endless love" duet with luther vandross( number@-@ one in new zealand) and" against all odds" featuring westlife( number@-@ one in the united kingdom)." thank god i found you" was also omitted from the japanese track listing, and replaced with" all i want for christmas is you". for the album artwork, carey launched a social media campaign on april 12, 2015, whereby fans had to share a link to her website in order to reveal the cover which was concealed by a curtain. using the hashtag"< unk>", single," we belong together". he contained to add" but still, if mimi ’ s going to mine from her own extensive back catalog of ballads, those are the primo melodies to go for." a reviewer for dj booth thought that minaj" ruined" the song.=== music video=== the accompanying music video for the remix of" up out my face" was directed by carey’ s husband, nick cannon. minaj spoke about filming a video with carey and how she did not believe that the video would ever be released:" i didn ’ t even tell anyone i shot a video with the producer, few days after he had finished the composition, madonna completed writing the lyrics of" i don’ t give a". solveig understood that the lyrics were probable references towards madonna’ s life and thus received coverage in the press. however, he was not aware of the inner meaning behind the lyrics. with billboard magazine, the producer further explained: at first i thought we were going to work on one song; that was the original plan. let’ s try to work on one song and take it from there– not spend too much time thinking about the legend, and do something that just makes sense. provided an additional and assistant engineering. all the instruments were provided by eriksen and hermansen while dean sang the background vocals.
in may 2011, in the mix review, an analyzing commercial productions, mike senior of sound on sound revisited the original mixing of the song. according to him, before he started the mix, senior played the song a couple of times before releasing what thing about it" bugged" him. working it out, he noted that the harmony of the mix is undermined by the kick drum." what’ s my name?" contains basic harmonies that are a bar of f minor, a bar of a major practiced in their backyards and at< unk> salon, owned by knowles’ s mother, tina. the group would test routines in the salon, when it was on montrose boulevard in houston, and sometimes would collect tips from the customers. their try out would be critiqued by the people inside. during their school days, girl’ s tyme performed at local gigs. when summer came, mathew knowles established a" boot camp" to train them in dance and vocal lessons. after rigorous training, they began performing as opening acts for established r& b groups of that time such as swv, dru hill and immature. tina day reception at the greek embassy. upon return to greece, she was greeted at the airport by fans along with the music video of" my number one" playing on the video monitors. while in greece, she attended the opening ceremony of the european final four for the volleyball champions league in< unk>, where her song was played as she appeared on stage with cheerleaders. on march 29, paparizou arrived in valletta, malta where she signed autographs, appeared on television stations, and gave interviews to the local media. following malta, she traveled to serbia and montenegro where she gave additional interviews before moving on to and and her low hip@-@ grind during’ rude boy’ were the smash hits of her body language." deborah linton of city life wrote that rihanna" even manages to make a psychiatric couch look sexy". linton called the show’ s stage sets impressive and imaginative. 
rick massimo of the providence journal wrote that rihanna" looked like a neon@-@ sign rendition of herself during’ rehab’, rarely addressed the audience, and didn’ t rise above flat cliche in that until the very end of the show"." rehab" and rihanna’ s 2009 single" russian roulette" were excluded from the set only a few hours. he said:" there were a lot of tracks, but i just enjoyed it, to be honest. i knew how i wanted it to sound, and it was pretty much the last song we cut; a lot of the mixing was nailed in the production as well, which helped. dream did a great job producing this track." the bar one guitar track of" schoolin’ life" was entirely programmed. similarly, the live drum section in the hook was actually done with programmed drums. once the mixing was over, swivel’ s impression were as follows:[’ schoolin’ life] absolutely tour began on march 1, 2000 at the house of blues in los angeles, while other venues included paris olympia, trump taj mahal, brixton academy, the montreux jazz festival, and the essence jazz festival in new orleans. by july, the tour’ s first half had sold out in each city. the tour lasted nearly eight months, while performances went for up to three hours a night. the voodoo tour was taken internationally, with one of the most notable performances being the free jazz festival in brazil. the music video for" untitled( how does it feel)" portrayed d’ angelo as a sex symbol hobson noted that rihanna" rejects the victim stance" in the video for" man down", and elucidated that she played the role of a rape survivor who shot her attacker. she attributed the location of shooting the video in jamaica as significant, due to how the image of a gun proliferated during 1990s jamaican dance hall’ s to" express female rage".
the prologue depicts rihanna as a" dark@-@ hooded" femme fatale whereby the narrative explains her motives for murder and provokes the spectator to sympathize with her because she danced in a provocative manner with a man in a club, which had a deep impact on delonge in that he spent a night up crying for him when he wrote the track." a little’ s enough" was inspired by a religious concept in which a god came to bring positive change on earth when it faces terrorism, war or famine." the war", an anthem about the iraq war and its death toll, is succeeded by" it hurts", a track about a friend of delonge with a cheating girlfriend." it’ s a terrible situation where my friend is being crushed from the inside out by all the manipulative stuff she’ s doing and this song’ s just took that dress out of the storage – it has a 27@-@ foot train and it was just all hand@-@ beaded and stuff and so i figured we might as well get a use out of it.’=== synopsis=== the video features carey readying for her wedding, and follows her to the altar, as well as her escape from the reception. many of the actors featured in carey’ s" it’ s like that" video were in that of" we belong together", which was shot as a continuation from the" it’ s like that" video. it begins with 3 in dutch@-@ speaking flanders and number 2 in french@-@ speaking wallonia. it was certified gold by the belgian entertainment association( bea) for selling more than 15@,@ 000 copies. although the song spent only 1 week on the italian singles chart( at number 8), it was certified platinum by the federazione industria musicale italiana( fimi) in 2014 for selling more than 30@,@ 000 copies.== music video===== background and synopsis=== anthony mandler directed the music video for" man down" in april 2011 on a beach in at numbers 18 and 43 in the united states, and experienced moderate success worldwide. 
unlike her previous records, spears did not heavily promote blackout; her only televised appearance for blackout was a universally@-@ panned performance of" gimme more" at the 2007 mtv video music awards.== background and development== in november 2003, while promoting her fourth studio album in the zone, spears told entertainment weekly that she was already writing songs for her next album and was also hoping to start her own record label in 2004. henrik jonback confirmed that he had written songs with her during the european leg of the onyx hotel tour," of albums also had increased sales due to discounting and publicity generated by the single and her performance. billboard estimated that her top@-@ 10 digital sales collectively increased over 1@,@ 700 percent. madonna’ s bestselling album was the 2009 greatest@-@ hits collection, celebration, which sold 16@,@ 000 copies( up 1@,@ 341 percent) and reentered the billboard 200 album chart. the following week celebration fell 105 spots on the chart to number 157, with sales falling to 4@,@ 000 copies." give me all your luvin’" fell to number 39 on the hot opened the performance with" yeah 3x" and was dressed in a white formal suit, accompanied by" full@-@ skirted dancers". brown was eventually joined onstage by tuxedo@-@ clad dancers and began dancing to the 1993 wu@-@ tang clan single" protect ya neck". his dance routine then moved into 1991, where he danced to nirvana’ s" smells like teen spirit". brown’ s performance then came back to the future, where he began to sing" beautiful people". while performing the song, he was suspended in the air, and then lowered to another stage where he continued to register that she didn’ t know she had." from the moment she was signed in the film, madonna had expressed interest in recording a dance version of" don’ t cry for me argentina". 
according to her publicist liz rosenberg," since she didn’ t write the music and lyrics, she wanted her signature on that song… i think on her mind, the best way to do it was go in the studio and work up a remix". for this, in august 1996, while still mixing the film’ s soundtrack, madonna hired remixers pablo flores and javier garza. according to flores, the singer d accumulated until then but that was instead an ideal marriage of production and performance." instead, the red lights on the stage played up the" ominous" tone of the song as it gradually increased its tempo to the point whereby the end of the song was on the verge of sounding like an incantation. for the diamonds world tour, rihanna performed" man down" in a caribbean@-@ theme section of the show, which also included" you da one"," no love allowed"," what’ s my name?" and" rude boy". james lachno of the telegraph highlight the caribbean@-@ themed edge of several realities: the film, the dream it inspires, the waking world it illuminates". the music in" i just can’ t stop loving you", a duet with siedah garrett, consisted mainly of finger snaps and timpani." just good friends", a duet with stevie wonder, was viewed by critics as sounding good at the beginning of the song, ending with a" chin@-@ bobbing cheerfulness"." the way you make me feel"’ s music consisted of blues harmonies. the lyrics of" another part of me" deal with being united, as" we not manufactured. no one paid these kids."=== live performances=== one direction performed" what makes you beautiful" on red or black? on 10 september 2011. the performance started with hosts ant& dec announcing that the band was supposedly running late for their appearance, and cut to a video of one direction boarding a london tube carriage full of fans, as the studio version of the song began playing. each fan on the tube was given a numbered ticket. 
the band and fans disembarked the tube and made their way to the television studio, where the remainder of the song was sung live. after the song,

This is the end of visualization of high-level transformer factor.

Table: S2.T1: Several examples of low-level transformer factors. Their top-activated words in layer 4 are marked blue, and the corresponding contexts are shown as examples for each transformer factor. As shown in the table, nearly all of the top-activated words are disambiguated into a single sense. Note that the last example of $\Phi_{:,33}$ is a rare exception; the appendix gives a more complete list, along with further examples, top-activated words, and contexts.

| Factor | Top 3 activated words and their contexts | Explanation |
|---|---|---|
| $\Phi_{:,2}$ | • that snare shot sounded like somebody’ d kicked open the door to your mind". • i became very frustrated with that and finally made up my mind to start getting back into things." • when evita asked for more time so she could make up her mind, the crowd demanded," ¡ ahora, evita,< | • Word “mind” • Noun • Definition: the element of a person that enables them to be aware of the world and their experiences. |
| $\Phi_{:,16}$ | • nington joined the five members xero and the band was renamed to linkin park. • times about his feelings about gordon, and the price family even sat away from park’ s supporters during the trial itself. • on 25 january 2010, the morning of park’ s 66th birthday, he was found hanged and unconscious in his | • Word “park” • Noun • Definition: a common first and last name |
| $\Phi_{:,30}$ | • saying that he has left the outsiders, kovu asks simba to let him join his pride • eventually, all boycott’ s employees left, forcing him to run the estate without help. • the story concerned the attempts of a scientist to photograph the soul as it left the body. | • Word “left” • Verb • Definition: leaving, exiting |
| $\Phi_{:,33}$ | • forced to visit the sarajevo television station at night and to film with as little light as possible to avoid the attention of snipers and bombers. • by the modest, cream@-@ colored attire in the airy, light@-@ filled clip. • the man asked her to help him carry the case to his car, a light@-@ brown volkswagen beetle. | • Word “light” • Noun • Definition: the natural agent that stimulates sight and makes things visible |
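
As a rough illustration of how a table like this could be assembled, the sketch below ranks word occurrences by their activation on one transformer factor and keeps the top few. The record layout (word, context, and a dictionary of sparse coefficients keyed by factor index), the function name `top_activated`, and all numeric values are assumptions made for this example; they are not taken from the released code.

```python
import heapq

def top_activated(records, factor_index, k=3):
    """Return the k records with the largest activation on one factor.

    records: iterable of (word, context, coeffs), where coeffs maps a
    factor index to its sparse coefficient (hypothetical layout).
    Factors missing from coeffs are treated as inactive (0.0).
    """
    scored = (
        (coeffs.get(factor_index, 0.0), word, context)
        for word, context, coeffs in records
    )
    # nlargest keeps only k items in memory and sorts them descending.
    return heapq.nlargest(k, scored, key=lambda item: item[0])

# Toy records with made-up activation values.
records = [
    ("left", "the soul as it left the body", {30: 7.1, 2: 0.3}),
    ("left", "on the left side of the road", {30: 0.4}),
    ("mind", "made up my mind", {2: 6.5}),
    ("left", "all employees left", {30: 6.8}),
]

top = top_activated(records, factor_index=30, k=2)
for activation, word, context in top:
    print(f"{activation:4.1f}  {word:5s}  {context}")
```

With real data, `records` would hold one entry per token occurrence in the corpus, and the contexts returned for a factor are what a reader inspects to name its pattern.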

Table: S2.T2: Evaluation of binary POS tagging task: predict whether or not “left” in a given context is a verb.

| Method | Precision (%) | Recall (%) | F1 score (%) |
|---|---|---|---|
| Average perceptron POS tagger | 92.7 | 95.5 | 94.1 |
| Finetuned BERT base model for POS task | 97.5 | 95.2 | 96.3 |
| Logistic regression classifier with activation of $\Phi_{:,30}$ at layer 4 | 97.2 | 95.8 | 96.5 |
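
The metrics in this table follow the standard definitions, and the F1 column can be recomputed from the precision and recall columns. The sketch below shows those definitions on toy data. It uses a fixed threshold on a single activation value as a simplified stand-in for the logistic regression classifier (with one input feature, a logistic regression decision at probability 0.5 also reduces to a threshold on that feature). The activation values, labels, and threshold are invented for illustration.

```python
def precision_recall_f1(y_true, y_pred):
    """Compute precision, recall and F1 for binary labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t and p)
    fp = sum(1 for t, p in pairs if not t and p)
    fn = sum(1 for t, p in pairs if t and not p)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1

def f1_from(precision, recall):
    """Harmonic mean of precision and recall (same units as the inputs)."""
    return 2 * precision * recall / (precision + recall)

# Toy data: activation of one factor on six occurrences of "left",
# and whether each occurrence is really a verb (made-up values).
activations = [7.1, 6.8, 0.4, 0.0, 5.9, 0.2]
is_verb = [True, True, False, False, True, True]

threshold = 3.0  # hypothetical decision threshold
predicted = [a > threshold for a in activations]
p, r, f1 = precision_recall_f1(is_verb, predicted)
print(p, r, round(f1, 3))

# Consistency check of the reported rows: F1 recomputed from P and R.
for name, prec, rec in [
    ("perceptron", 92.7, 95.5),
    ("bert", 97.5, 95.2),
    ("factor 30", 97.2, 95.8),
]:
    print(name, round(f1_from(prec, rec), 1))
```

Recomputing F1 this way reproduces the three values in the last column, which is a quick check that the columns were separated correctly.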

Table: S3.T3: A list of typical mid-level transformer factors. The top-activated words and their context sequences for each transformer factor at layer 8 are shown in the second column. We summarize the pattern of each transformer factor in the third column. The last four columns give the percentage of the top 200 activated words and sequences that contain the summarized pattern at layers 4, 6, 8, and 10, respectively.

| Factor | 2 example words and their contexts with high activation | Patterns | L4 (%) | L6 (%) | L8 (%) | L10 (%) |
|---|---|---|---|---|---|---|
| $\Phi_{:,13}$ | • the steel pipeline was about 20 ° f(- 7 ° c) degrees. • hand( 56 to 64 inches( 140 to 160 cm)) war horse is that it was a | Unit exchange with parentheses | 0 | 0 | 64.5 | 95.5 |
| $\Phi_{:,42}$ | • he died at the hospice of lancaster county from heart ailments on 23 june 2007. • holly’ s drummer carl bunch suffered frostbite to his toes( while aboard the | Something unfortunate happened | 94.0 | 100 | 100 | 100 |
| $\Phi_{:,50}$ | • hurricane pack 1 was a revamped version of story mode; • in 1998, the categories were retitled best short form music video, and best | Doing something again, or making something new again | 74.5 | 100 | 100 | 100 |
| $\Phi_{:,86}$ | • he finished the 2005 – 06 season with 21 appearances and seven goals. • of an offensive game, finishing off the 2001 – 02 season with 58 points in the 47 games | Consecutive years, used in football season naming | 0 | 100 | 85.0 | 95.5 |
| $\Phi_{:,102}$ | • the most prominent of which was bishop abel muzorewa’ s united african national council • ralambo’ s father, andriamanelo, had established rules of succession by | African names | 99.0 | 100 | 100 | 100 |
| $\Phi_{:,125}$ | • music writer jeff weiss of pitchfork describes the" enduring image" • club reviewer erik adams wrote that the episode was a perfect mix | Describing someone in a paraphrasing style. Name, Career | 15.5 | 99.0 | 100 | 98.5 |
| $\Phi_{:,184}$ | • the world wide fund for nature( wwf) announced in 2010 that a biodiversity study from • fm) was halted by the federal communications commission( fcc) due to a complaint that the company buying | Institution with abbreviation | 0 | 15.5 | 39.0 | 63.0 |
| $\Phi_{:,193}$ | • 74, 22@,@ 500 vietnamese during 1979 – 92, over 2@,@ 500 bosnian •, the russo@-@ turkish war of 1877 – 88 and the first balkan war in 1913. | Time span in years | 97.0 | 95.5 | 96.5 | 95.5 |
| $\Phi_{:,195}$ | • s, hares, badgers, foxes, weasels, ground squirrels, mice, hamsters • -@ watching, boxing, chess, cycling, drama, languages, geography, jazz and other music | Consecutive nouns (enumeration) | 8.0 | 98.5 | 100 | 100 |
| $\Phi_{:,225}$ | • technologist at the united states marine hospital in key west, florida who developed a morbid obsession for • 00°,11”, w, near smith valley, nevada. | Places in US, following the convention “city, state” | 51.5 | 91.5 | 91.0 | 77.5 |

Table: S3.T4: We construct adversarial texts that are similar to, but different from, the pattern “Consecutive adjective”. The last column shows the activation of $\Phi_{:,35}$, or $\alpha^{(8)}_{35}$, w.r.t. the blue-marked word in layer 8.

| | Adversarial Text | Explanation | $\alpha_{35}$ |
|---|---|---|---|
| (o) | album as "full of exhilarating, ecstatic, thrilling, fun and sometimes downright silly songs" | The original top-activated word and its context sentence for transformer factor $\Phi_{:,35}$ (not an adversarial text) | 9.5 |
| (a) | album as "full of delightful, lively, exciting, interesting and sometimes downright silly songs" | Replace the adjectives in sentence (o) with different adjectives. | 9.2 |
| (b) | album as "full of unfortunate, heartbroken, annoying, boring and sometimes downright silly songs" | Replace the adjectives in sentence (o) with negative adjectives. | 8.2 |
| (c) | album as "full of [UNK], [UNK], thrilling, [UNK] and sometimes downright silly songs" | Mask the adjectives in sentence (o) with unknown tokens. | 5.3 |
| (d) | album as "full of thrilling and sometimes downright silly songs" | Remove the first three adjectives in sentence (o). | 7.8 |
| (e) | album as "full of natural, smooth, rock, electronic and sometimes downright silly songs" | Replace the adjectives in sentence (o) with neutral adjectives. | 6.2 |
| (f) | each participant starts the battle with one balloon. these can be re@-@ inflated up to four | Use a random sentence. | 0.0 |
| (g) | The book is described as "innovative, beautiful and brilliant". It receive the highest opinion from James Wood | We create this sentence, which contains the pattern of consecutive adjectives. | 7.9 |

Figure: Building block (layer) of transformer


Figure: Two examples of the highly activated words and their contexts for transformer factor $\Phi_{:,297}$. We also provide the saliency map of the tokens generated using LIME. This transformer factor corresponds to the concept: “repetitive pattern detector”. In other words, repetitive text sequences will trigger high activation of $\Phi_{:,297}$.

Figure: Visualization of $\Phi_{:,322}$. This transformer factor corresponds to the concept: “someone born in some year” in a biography. All of the high-activation contexts contain the beginning of a biography. As shown in the figure, the attributes of a person (name, age, career, and familial relation) all have high saliency weights.

$$ x = \Phi \alpha + \epsilon, \quad \text{s.t.}\ \alpha \succeq 0 $$ \tag{sparse}

$$ \min_{w\in\mathbb{R}^{T}}\mathcal{L}(f,w,\mathcal{S}(s))+\sigma\|w\|_{2}^{2} $$ \tag{S3.Ex3}

$$ \alpha(x)=\arg\min_{\alpha\in\mathbb{R}^{m}}\|x-\Phi\alpha\|_{2}^{2}+\lambda\|\alpha\|_{1} $$ \tag{Ax1.Ex4}

$$ f((s,i))=\alpha(x^{(l)}(s,i)) $$ \tag{Ax1.Ex5}

$$ h(s)_{t}=\left\{\begin{array}{ll}0&\text{if }s[t]=\text{[UNK]}\\ 1&\text{otherwise}\end{array}\right. $$ \tag{Ax1.Ex6}

$$ \Omega=\operatorname{diag}\left(d(h(\mathcal{S}_{1}),\vec{1}),d(h(\mathcal{S}_{2}),\vec{1}),\cdots,d(h(\mathcal{S}_{n}),\vec{1})\right) $$ \tag{Ax1.Ex8}

$$ X=X^{(1)}\cup X^{(2)}\cup\cdots\cup X^{(L)} $$ \tag{Ax1.Ex9}

$$ \min_{\Phi} \tfrac{1}{2} \| X - \Phi \Omega A\|_{F}^{2},\quad \|\Phi_{:,j}\|_2 \leq 1 $$ \tag{appequ:dictionary_update}

$$ \min_{A} \tfrac{1}{2} \| X - \Phi A\|_{F}^{2} + \lambda\sum_{i}\|\alpha_i\|_{1},\quad \text{s.t.}\ \alpha_i \succeq 0 $$ \tag{appequ:sparse_coding}

$$ \text{apple} = 0.09\,\text{``dessert''} + 0.11\,\text{``organism''} + 0.16\,\text{``fruit''} + 0.22\,\text{``mobile\&IT''} + 0.42\,\text{``other''} $$


\DeclareMathOperator{\EMD}{EMD} \usepackage{array} \usepackage{dcolumn} \usepackage{tabularx} \newcolumntype{P}[1]{>{\centering\arraybackslash}m{#1}}

\usepackage[a4paper]{geometry} \usepackage{tabularx} \usepackage{lipsum} \begin{document}

\maketitle \begin{abstract} Transformer networks have revolutionized NLP representation learning since they were introduced. Though a great effort has been made to explain the representation in transformers, it is widely recognized that our understanding is not sufficient. One important reason is the lack of visualization tools for detailed analysis. In this paper, we propose to use dictionary learning to open up these `black boxes' as linear superpositions of transformer factors. Through visualization, we demonstrate the hierarchical semantic structures captured by the transformer factors, e.g., word-level polysemy disambiguation, sentence-level pattern formation, and long-range dependency. While some of these patterns confirm the conventional prior linguistic knowledge, the rest are relatively unexpected, which may provide new insights. We hope this visualization tool can bring further knowledge and a better understanding of how transformer networks work. The code is available at \url{https://github.com/zeyuyun1/TransformerVis}. \end{abstract}

\section{Introduction} Though the transformer networks \cite{vaswani2017attention, devlin2018BERT} have achieved great success, our understanding of how they work is still fairly limited. This has triggered increasing efforts to visualize and analyze these ``black boxes''. Besides a direct visualization of the attention weights, most of the current efforts to interpret transformer models involve ``probing tasks''. They are achieved by attaching a lightweight auxiliary classifier at the output of the target transformer layer. Then only the auxiliary classifier is trained for well-known NLP tasks like part-of-speech (POS) tagging, named-entity recognition (NER) tagging, syntactic dependency, etc. \citet{tenney2019you} and \citet{liu-etal-2019-linguistic} show transformer models have excellent performance in those probing tasks. These results indicate that transformer models have learned the language representation related to the probing tasks. Though the probing tasks are great tools for interpreting language models, their limitations are explained in \citet{Anna2020Bertology}. We summarize the limitations into three major points: \begin{itemize} \item Most probing tasks, like POS and NER tagging, are too simple. Good performance on those probing tasks does not reflect the model's true capacity. \item Probing tasks can only verify whether a certain prior structure is learned in a language model. They cannot reveal the structures beyond our prior knowledge. \item It is hard to locate where exactly the related linguistic representation is learned in the transformer. \end{itemize} Efforts are made to remove those limitations and make probing tasks more diverse. For instance, \citet{hewitt-manning-2019-structural} proposes the ``structural probe'', which is a much more intricate probing task. \citet{Zhengbao2020Know} proposes to generate specific probing tasks automatically. Non-probing methods are also explored to relieve the last two limitations.
For example, \citet{emily2019VisBert} visualizes embeddings from BERT using UMAP and shows that the embeddings of the same word under different contexts are separated into different clusters. \citet{Kawin2019Contextual} analyzes the similarity between embeddings of the same word in different contexts. Both of these works show that transformers provide a context-specific representation.

\citet{faruqui-etal-2015-sparse, arora2018linear, zhang2019word} demonstrate how to use dictionary learning to explain, improve, and visualize the uncontextualized word embedding representations. In this work, we propose to use dictionary learning to alleviate the limitations of the other transformer interpretation techniques. Our results show that dictionary learning provides a powerful visualization tool, leading to some surprising new knowledge.

\section{Method} {\noindent \bf Hypothesis: contextualized word embedding as a sparse linear superposition of transformer factors.} It has been shown that word embedding vectors can be factorized into a sparse linear combination of word factors \cite{arora2018linear, zhang2019word}, which correspond to elementary semantic meanings. An example is: \begin{align*} \text{apple} =&\ 0.09\,\text{``dessert''} + 0.11\,\text{``organism''} + 0.16\,\text{``fruit''} \\ &+ 0.22\,\text{``mobile\&IT''} + 0.42\,\text{``other''}. \end{align*} We view the latent representation of words in a transformer as contextualized word embedding. Similarly, we hypothesize that a contextualized word embedding vector can also be factorized as a sparse linear superposition of a set of elementary elements, which we call \textit{transformer factors}. The exact definition will be presented later in this section.

\begin{figure}[h] \centering \includegraphics[width=0.3\textwidth]{trans_7.png} \caption{Building block (layer) of transformer} \end{figure} Due to the skip connections in each of the transformer blocks, we hypothesize that the representation in any layer would be a superposition of the hierarchical representations in all of the lower layers. As a result, the output of a particular transformer block would be the sum of all of the modifications along the way. Indeed, we verify this intuition with the experiments. Based on the above observation, we propose to learn a single dictionary for the contextualized word vectors from different layers' output.

\vspace{0.1in} {\noindent \bf To learn a dictionary of transformer factors with non-negative sparse coding.}

Given a set of tokenized text sequences, we collect the contextualized embedding of every word using a transformer model. We define the set of all word embedding vectors from the $l$th layer of the transformer model as $X^{(l)}$. Furthermore, we collect the embeddings across all layers into a single set $X = X^{(1)} \cup X^{(2)} \cup \cdots \cup X^{(L)}$.

By our hypothesis, we assume each embedding vector $x \in X$ is a sparse linear superposition of \textit{transformer factors}:

\begin{equation}\label{sparse} x = \Phi \alpha + \epsilon, \ \text{s.t.} \ \alpha \succeq 0, \end{equation} where $\Phi\in{\rm I\!R}^{d\times m}$ is a dictionary matrix with columns $\Phi_{:,c}$, $\bm{\alpha}\in{\rm I\!R}^m$ is a sparse vector of coefficients to be inferred, and $\bm{\epsilon}$ is a vector containing independent Gaussian noise samples, which are assumed to be small relative to $\bm{x}$. Typically $m>d$ so that the representation is {\em overcomplete}. This inverse problem can be efficiently solved by the FISTA algorithm \cite{beck2009fast}. The dictionary matrix $\Phi$ can be learned in an iterative fashion by using non-negative sparse coding, which we leave to the appendix section \ref{sec:optimization}. Each column $\Phi_{:,c}$ of $\Phi$ is a {\it transformer factor} and its corresponding sparse coefficient $\bm{\alpha}_c$ is its activation level.
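The inference step for a single embedding vector can be sketched in a few lines. The snippet below is only an illustration, not the released implementation: it assumes the objective $\tfrac{1}{2}\|x-\Phi\alpha\|_2^2+\lambda\|\alpha\|_1$ with $\alpha \succeq 0$, uses plain Python lists instead of a tensor library, and the function name \texttt{nonneg\_fista}, the step-size choice, and the toy dictionary are all hypothetical.

```python
import math


def nonneg_fista(x, Phi, lam, n_iter=200):
    """Approximately minimise 0.5*||x - Phi a||^2 + lam*||a||_1 subject to a >= 0.

    x is a list of d floats; Phi is a d-by-m dictionary given as a list of rows.
    """
    d, m = len(Phi), len(Phi[0])
    # The squared Frobenius norm upper-bounds the largest eigenvalue of
    # Phi^T Phi, so 1/lipschitz is a safe (if conservative) step size.
    lipschitz = sum(v * v for row in Phi for v in row)
    alpha = [0.0] * m
    y = list(alpha)  # extrapolated point used by the momentum step
    t = 1.0
    for _ in range(n_iter):
        residual = [sum(Phi[i][j] * y[j] for j in range(m)) - x[i] for i in range(d)]
        grad = [sum(Phi[i][j] * residual[i] for i in range(d)) for j in range(m)]
        # Proximal step: shrink by lam/lipschitz, then clip at zero.
        new_alpha = [max(0.0, y[j] - (grad[j] + lam) / lipschitz) for j in range(m)]
        new_t = (1.0 + math.sqrt(1.0 + 4.0 * t * t)) / 2.0
        y = [new_alpha[j] + ((t - 1.0) / new_t) * (new_alpha[j] - alpha[j]) for j in range(m)]
        alpha, t = new_alpha, new_t
    return alpha


# Toy usage with a hypothetical 2-by-3 dictionary (not learned from BERT).
Phi = [[1.0, 0.0, 0.6],
       [0.0, 1.0, 0.8]]
x = [0.0, 2.0]
alpha = nonneg_fista(x, Phi, lam=0.1)
```

The clipping at zero is what enforces $\alpha \succeq 0$; without it the update reduces to the usual soft-thresholding step of FISTA for the unconstrained lasso.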

\vspace{0.1in} {\noindent \bf Visualization by top activation and LIME interpretation.} An important empirical method to visualize a feature in deep learning is to use the input samples, which trigger the top activation of the feature \cite{zeiler2014visualizing}. We adopt this convention. As a starting point, we try to visualize each of the dimensions of a particular layer, $X^{(l)}$. Unfortunately, the hidden dimensions of transformers are not semantically meaningful, which is similar to the uncontextualized word embeddings \cite{zhang2019word}.

Instead, we can try to visualize the transformer factors. For a transformer factor $\Phi_{:,c}$ and for a layer-$l$, we denote the 1000 contextualized word vectors with the largest sparse coefficients $\alpha^{(l)}_c$ as $X^{(l)}_c \subset X^{(l)}$, which correspond to 1000 different sequences. For example, Figure~\ref{CWF 17} shows the top 5 words that activated transformer factor $\Phi_{:,30}$ at layer-$0$, layer-$2$, and layer-$4$ respectively. Since a contextualized word vector is generally affected by many tokens in the sequence, we can use LIME \cite{DBLP:journals/corr/RibeiroSG16} to assign a weight to each token in the sequence to identify their relative importance to $\alpha_c$. The detailed method is left to Section~\ref{sec:experiments}.
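Selecting the top-activated words for one factor at one layer can be sketched as follows. This is a minimal illustration under two assumptions that are not stated in the text: the activations are available as (sequence id, token index, activation) tuples, and ``1000 different sequences'' is read as keeping at most one token per sequence. The function name \texttt{top\_activations} is hypothetical.

```python
def top_activations(records, k=1000):
    """Return the k records with the largest activation, one per sequence.

    Each record is a (sequence_id, token_index, activation) tuple holding the
    sparse coefficient of one factor for one token at one layer.
    """
    seen = set()
    top = []
    for seq_id, tok, act in sorted(records, key=lambda r: r[2], reverse=True):
        if seq_id in seen:
            continue  # assumption: keep only the best token of each sequence
        seen.add(seq_id)
        top.append((seq_id, tok, act))
        if len(top) == k:
            break
    return top


# Toy usage with made-up activations.
records = [(0, 3, 0.9), (0, 5, 0.8), (1, 2, 0.7), (2, 1, 0.1)]
best = top_activations(records, k=2)
```

The returned token positions are the ones that would be highlighted in their context sequences when a factor is visualized.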

\vspace{0.1in} {\noindent \bf To determine low-, mid-, and high-level transformer factors with importance score.} As we build a single dictionary for all of the transformer layers, the semantic meaning of the transformer factors has different levels. While some of the factors appear in lower layers and continue to be used in the later stages, the rest of the factors may only be activated in the higher layers of the transformer network. A central question in representation learning is: ``where does the network learn certain information?'' To answer this question, we can compute an ``importance score'' for each transformer factor $\Phi_{:,c}$ at layer-$l$ as $I^{(l)}_c$. $I^{(l)}_c$ is the average of the largest 1000 sparse coefficients $\alpha^{(l)}_c$'s, which correspond to $X^{(l)}_c$. We plot the importance scores of each transformer factor as a curve, as shown in Figure~\ref{importance score}. We then use these importance score (IS) curves to identify the layer at which a transformer factor emerges. Figure~\ref{low} shows an IS curve that peaks in the earlier layers. The corresponding transformer factor emerges in the earlier stage, which may capture lower-level semantic meanings. In contrast, Figure~\ref{mid} shows a peak in the higher layers, which indicates the transformer factor emerges much later and may correspond to mid- or high-level semantic structures. More subtleties are involved when distinguishing between mid-level and high-level factors, which will be discussed later.

An important characteristic is that the IS curve for each transformer factor is relatively smooth. This indicates that if a vital feature is learned in the beginning layers, it won't disappear in later stages. Instead, it will be carried all the way to the end with gradually decayed weight, since many more features join along the way. Similarly, abstract information learned in higher layers is slowly developed from the early layers. Figures~\ref{CWF 17} and \ref{CWF 35} confirm this idea, which will be explained in the next section.
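The importance score and the resulting IS curve are simple to compute once the sparse coefficients of a factor are available for each layer. The sketch below follows the definition above (the average of the largest $k$ coefficients, with $k=1000$ in the paper); the function names and the toy numbers are hypothetical, and locating the peak with a plain arg-max is an illustrative simplification.

```python
def importance_score(coefficients, k=1000):
    """Average of the k largest sparse coefficients of one factor at one layer."""
    top = sorted(coefficients, reverse=True)[:k]
    return sum(top) / len(top)


def is_curve(coefficients_by_layer, k=1000):
    """Importance score of one factor at every layer, in layer order."""
    return [importance_score(layer, k) for layer in coefficients_by_layer]


def peak_layer(curve):
    """Index of the layer where the IS curve is largest."""
    return max(range(len(curve)), key=lambda layer: curve[layer])


# Toy usage: made-up coefficients of one factor at three layers.
coefficients_by_layer = [[0.0, 1.0], [2.0, 4.0], [1.0, 1.0]]
curve = is_curve(coefficients_by_layer, k=2)
```

A factor whose curve peaks at a small layer index would be treated as low-level, and one that peaks later as mid- or high-level.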

\begin{figure}% \centering \subfloat[\centering]{{\includegraphics[width=0.5\linewidth]{images/low.png}\label{low} }}% \subfloat[\centering]{{\includegraphics[width=0.5\linewidth]{mid} \label{mid}}}% \caption{Importance score (IS) across all layers for two different transformer factors. (a) This figure shows a typical IS curve of a transformer factor corresponding to low-level information. (b) This figure shows a typical IS curve of a transformer factor corresponding to mid-level information.}% \label{importance score}% \end{figure}

\begin{figure*}[ht]% \centering \subfloat[\centering layer 0 \label{CWF 17 layer 0} ]{{\includegraphics[width=0.30\linewidth]{images/30_0.PNG} }}% \quad \subfloat[\centering layer 2]{{\includegraphics[width=0.30\linewidth]{images/30_2.PNG} }}% \quad \subfloat[\centering layer 4]{{\includegraphics[width=0.33\linewidth]{images/30_4.PNG} }}%

\caption{Visualization of a low-level transformer factor, $\Phi_{:,30}$ at different layers.
(a), (b) and (c) are the top-activated words and contexts for $\Phi_{:,30}$ in layer-$0$, $2$ and $4$ respectively. We can see that at layer-$0$, this transformer factor corresponds to word vectors that encode the word ``left'' with different senses. In layer-2, a majority of the top-activated words ``left'' correspond to a single sense, ``leaving, exiting.'' In layer 4, all of the top-activated words ``left'' correspond to the same sense, ``leaving, exiting.'' Due to space limitations, we invite the readers to use our \href{https://transformervis.github.io/transformervis/}{website} to see more of those disambiguation effects.
}%
\label{CWF 17}%

\end{figure*}

\begin{table*}[!h] \small \centering \begin{tabular}{|m{0.05\linewidth} | m{0.6\linewidth} | m{0.25\linewidth} |} \hline & Top 3 activated words and their contexts & Explanation \\ \hline $\Phi_{:,2}$ & • that snare shot sounded like somebody' d kicked open the door to your \textcolor{blue}{mind}".\newline• i became very frustrated with that and finally made up my \textcolor{blue}{mind} to start getting back into things."\newline• when evita asked for more time so she could make up her \textcolor{blue}{mind}, the crowd demanded," ¡ ahora, evita,< & • Word ``mind''\newline • Noun \newline • Definition: the element of a person that enables them to be aware of the world and their experiences.\\ \hline $\Phi_{:,16}$ &•nington joined the five members xero and the band was renamed to linkin \textcolor{blue}{park}.\newline• times about his feelings about gordon, and the price family even sat away from \textcolor{blue}{park}' s supporters during the trial itself.\newline• on 25 january 2010, the morning of \textcolor{blue}{park}' s 66th birthday, he was found hanged and unconscious in his & • Word ``park'' \newline • Noun \newline • Definition: a common first and last name \\ \hline $\Phi_{:,30}$ & • saying that he has \textcolor{blue}{left} the outsiders, kovu asks simba to let him join his pride\newline• eventually, all boycott' s employees \textcolor{blue}{left}, forcing him to run the estate without help.\newline• the story concerned the attempts of a scientist to photograph the soul as it \textcolor{blue}{left} the body.
& • Word ``left'' \newline • Verb \newline • Definition: leaving, exiting\\ \hline $\Phi_{:,33}$ &• forced to visit the sarajevo television station at night and to film with as little \textcolor{blue}{light} as possible to avoid the attention of snipers and bombers.\newline• by the modest, cream@-@ colored attire in the airy, \textcolor{blue}{light}@-@ filled clip.\newline• the man asked her to help him carry the case to his car, a \textcolor{blue}{light}@-@ brown volkswagen beetle. & • Word ``light'' \newline • Noun \newline • Definition: the natural agent that stimulates sight and makes things visible\\

  \hline


\end{tabular}
\caption{Several examples of low-level transformer factors. Their top-activated words in layer 4 are marked \textcolor{blue}{blue}, and the corresponding contexts are shown as examples for each transformer factor. As shown in the table, nearly all of the top-activated words are disambiguated into a single sense. Please note that the last example of $\Phi_{:,33}$ is a rare exception; the reader may check the appendix for a more complete list. More examples of top-activated words and contexts are provided in the appendix. }
\label{low level table}

\end{table*}

\begin{figure*}[!h]% \centering \subfloat[\centering \label{compare} ]{{\includegraphics[width=0.45\linewidth]{images/left_old.png} }}% \subfloat[\centering \label{linear}]{{\includegraphics[width=0.5\linewidth]{images/linear_mul.png} }}%

\caption{(a) Average activation of $\Phi_{:,30}$ for word vector ``left'' across different layers. (b) Instead of averaging, we plot the activation of all ``left'' with different contexts in layer-$0$, $2$, and $4$. Random noise is added to the y-axis to prevent overplotting. The activations of $\Phi_{:,30}$ for two different word senses of ``left'' are blended together in layer-$0$. They disentangle to a great extent in layer-$2$ and are nearly separable in layer-$4$ by this single dimension.}%
\label{linear mul}%

\end{figure*}

\begin{table}[ht] \small \centering \begin{tabular}{|p{0.4\linewidth} | P{0.13\linewidth} |P{0.10\linewidth} |P{0.15\linewidth} |} \hline & Precision (\%) & Recall (\%) & F1 score (\%) \\ \hline Average perceptron POS tagger & \vspace{0.1in} 92.7 & \vspace{0.1in} 95.5 & \vspace{0.1in} 94.1 \\ \hline Finetuned BERT base model for POS task & \vspace{0.1in} 97.5 & \vspace{0.1in} 95.2 & \vspace{0.1in} 96.3 \\ \hline Logistic regression classifier with activation of $\Phi_{:,30}$ at layer 4 & \vspace{0.18in} 97.2 & \vspace{0.18in} {\bf 95.8} & \vspace{0.18in} {\bf 96.5} \\ \hline \end{tabular} \caption{Evaluation of binary POS tagging task: predict whether or not ``left'' in a given context is a verb.} \label{evaluation} \vspace{-0.1in} \end{table}

\begin{figure*}% \centering

\subfloat[\centering layer 4]{{\includegraphics[width=0.30\linewidth]{images/35_4.JPG} }}%
\subfloat[\centering layer 6]{{\includegraphics[width=0.33\linewidth]{images/35_6.JPG} }}%
\subfloat[\centering layer 8]{{\includegraphics[width=0.33\linewidth]{images/35_8.JPG} }}%

\caption{Visualization of a mid-level transformer factor. (a), (b), (c) are the top 5 activated words and contexts for this transformer factor in layer-$4$, $6$, and $8$ respectively. Again, the position of the word vector is marked \textcolor{blue}{blue}. Please notice that sometimes only a part of a word is marked blue. This is because BERT uses a word-piece tokenizer instead of a whole-word tokenizer. This transformer factor corresponds to the pattern of ``consecutive adjective''. As shown in the figure, this feature starts to develop at layer-$4$ and fully develops at layer-$8$. }%
\label{CWF 35}%

\end{figure*}

\begin{figure*}% \centering

\subfloat[\centering layer 4]{{\includegraphics[width=0.33\linewidth]{images/13_4.JPG} }}%
\subfloat[\centering layer 6]{{\includegraphics[width=0.31\linewidth]{images/13_6.JPG} }}%
\subfloat[\centering layer 8]{{\includegraphics[width=0.33\linewidth]{images/13_8.JPG} }}%

\caption{Another example of a mid-level transformer factor visualized at layer-$4$, $6$, and $8$. The pattern that corresponds to this transformer factor is ``unit exchange''. Such a pattern is somewhat unexpected based on linguistic prior knowledge. }%
\label{CWF 13}%

\end{figure*}

\section{Experiments and Discoveries} \label{sec:experiments}

We use a 12-layer pre-trained BERT model \cite{PretrainedBERT,devlin2018BERT} and freeze the weights. Since we learn a single dictionary of transformer factors for all of the layers in the transformer, we show that these transformer factors correspond to different levels of semantic or syntactic patterns. The patterns can be roughly divided into three categories: word-level disambiguation, sentence-level pattern formation, and long-range dependency. In the following, we provide detailed visualization for each pattern category. Due to the space limit, only a small number of the factors are shown in the paper. To alleviate the ``cherry-picking'' bias, we also build a \href{https://transformervis.github.io/transformervis/}{website} for interested readers to explore these results.

\vspace{0.1in} {\noindent \bf Low-level: word-level polysemy disambiguation.} While the input embedding of a token contains polysemy, we find transformer factors with early IS curve peaks usually correspond to a specific word-level meaning. By visualizing the top activation sequences, we can see how word-level disambiguation is gradually developed in a transformer.

We show how the disambiguation effect develops progressively through each layer in Figure~\ref{CWF 17}, which lists the top 5 activated words and their contexts for transformer factor $\Phi_{:,30}$ in different layers. The top-activated words in layer 0 contain the word ``left'' with varying senses, which is mostly, though not completely, disambiguated in layer 2. In layer 4, the word ``left'' is fully disambiguated, since the top-activated words contain only ``left'' with the word sense ``leaving, exiting.'' We also show more examples of this type of transformer factor in Table~\ref{low level table}: for each transformer factor, we list the top 3 activated words and their contexts in layer 4. As shown in the table, nearly all top-activated words are disambiguated into a single sense.

Further, we can quantify the quality of the disambiguation ability of the transformer model. In the example above, since the top 1000 activated words and contexts are ``left'' with only the word sense ``leaving, exiting'', we can assume that ``left'', when used as a verb, triggers higher activation in $\Phi_{:,30}$ than ``left'' used as other parts of speech. We can verify this hypothesis using a human-annotated corpus: the Brown corpus \cite{francis79browncorpus}. In this corpus, each word is annotated with its corresponding part of speech. We collect all the sentences that contain the word ``left'' annotated as a verb in one set, and the sentences that contain ``left'' annotated as another part of speech in a second set. As shown in Figure~\ref{compare}, in layer 0, the average activation of $\Phi_{:,30}$ for the word ``left'' marked as a verb is no different from that of ``left'' with other senses. However, at layer 2, ``left'' marked as a verb triggers a higher activation of $\Phi_{:,30}$. In layer 4, this difference further increases, indicating that disambiguation develops progressively across layers. In fact, we plot the activation of ``left'' marked as a verb and the activation of the other ``left'' in Figure~\ref{linear}. In layer 4, they are nearly linearly separable by this single feature. Since each word ``left'' corresponds to an activation value, we can perform a logistic regression classification to differentiate those two types of ``left''. From the results shown in Table~\ref{evaluation}, it is pretty fascinating to see that a classifier using just $\Phi_{:,30}$ achieves a higher recall and F1 score than the other two classifiers, which are trained with supervised data. This result confirms that disambiguation is indeed done in the early part of the pre-trained transformer model and that we are able to detect it via dictionary learning.
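The single-feature classification experiment can be sketched as follows. This is an illustration under stated assumptions rather than the evaluation code used for the paper: the activations and labels are made-up toy values (not taken from the Brown corpus), the classifier is fitted with plain batch gradient descent, and the function names are hypothetical.

```python
import math


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def fit_logistic_1d(xs, ys, lr=0.5, n_iter=2000):
    """Fit p(y=1|x) = sigmoid(w*x + b) by batch gradient descent on the log-loss."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(n_iter):
        errs = [sigmoid(w * x + b) - y for x, y in zip(xs, ys)]
        w -= lr * sum(e * x for e, x in zip(errs, xs)) / n
        b -= lr * sum(errs) / n
    return w, b


def precision_recall_f1(y_true, y_pred):
    """Precision, recall and F1 score for the positive class (label 1)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Made-up activations: 1 = "left" as a verb, 0 = any other part of speech.
activations = [0.1, 0.2, 0.3, 2.0, 2.5, 3.0]
labels = [0, 0, 0, 1, 1, 1]
w, b = fit_logistic_1d(activations, labels)
predictions = [1 if sigmoid(w * a + b) >= 0.5 else 0 for a in activations]
scores = precision_recall_f1(labels, predictions)
```

Because the input is a single scalar per occurrence of ``left'', the fitted classifier amounts to a threshold on the activation, which is why near-linear separability in layer 4 translates directly into high precision and recall.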

\begin{table*}[!h] \small \centering \begin{tabular}{|P{0.05\linewidth} | m{0.45\linewidth} | m{0.2\linewidth} | P{0.03\linewidth} | P{0.03\linewidth} | P{0.03\linewidth} | P{0.03\linewidth} | } \hline & 2 example words and their contexts with high activation & Patterns & L4 (\%) & L6 (\%)& L8 (\%)& L10 (\%)\\ \hline $\Phi_{:,13}$ & • the steel pipeline was about 20 ° f(- 7 \textcolor{blue}{°} c) degrees.

   • hand( 56 to 64 inches( 140 to 160 \textcolor{blue}{cm})) war horse is that it was a & Unit exchange with parentheses &0 & 0 &64.5&95.5\\
\hline
$\Phi_{:,42}$ & • he died at the hospice of lancaster county from heart \textcolor{blue}{ai}lments on 23 june 2007.

• holly' s drummer carl bunch suffered \textcolor{blue}{frost}bite to his toes( while aboard the & Something unfortunate happened &94.0&100&100&100\\
\hline
$\Phi_{:,50}$ & • hurricane pack 1 was a re\textcolor{blue}{vam}ped version of story mode;

• in 1998, the categories were \textcolor{blue}{re}titled best short form music video, and best & Doing something again, or making something new again &74.5&100&100&100 \\
\hline
$\Phi_{:,86}$ & • he finished the 2005 – \textcolor{blue}{06} season with 21 appearances and seven goals.

• of an offensive game, finishing off the 2001 – \textcolor{blue}{02} season with 58 points in the 47 games
& Consecutive years, used in football season naming &0&100&85.0&95.5\\
\hline
$\Phi_{:,102}$ & • the most prominent of which was bishop abel \textcolor{blue}{mu}zorewa' s united african national council

• ralambo' s father, and\textcolor{blue}{riam}anelo, had established rules of succession by
& African names &99.0&100&100&100\\
\hline
$\Phi_{:,125}$ & • music writer \textcolor{blue}{jeff} weiss of pitchfork describes the" enduring image"

• club reviewer \textcolor{blue}{erik} adams wrote that the episode was a perfect mix & Describing someone in a paraphrasing style. Name, Career &15.5&99.0&100&98.5
\\
\hline
$\Phi_{:,184}$ & • the world wide fund for nature( \textcolor{blue}{wwf}) announced in 2010 that a biodiversity study from\newline• fm) was halted by the federal communications commission( \textcolor{blue}{fcc}) due to a complaint that the company buying & Institution with abbreviation &0&15.5&39.0&63.0 \\
\hline
$\Phi_{:,193}$ &• 74, 22@,@ 500 vietnamese during 1979 \textcolor{blue}{–} 92, over 2@,@ 500 bosnian\newline •, the russo@-@ turkish war of 1877 \textcolor{blue}{–} 88 and the first balkan war in 1913.& Time span in years &97.0&95.5&96.5&95.5 \\
\hline
$\Phi_{:,195}$ & •s, hares, badgers, foxes, \textcolor{blue}{weasel}s, ground squirrels, mice, hamsters\newline•-@ watching, boxing, chess, cycling, \textcolor{blue}{drama}, languages, geography, jazz and other music& Consecutive of noun (Enumerating) &8.0&98.5&100&100 \\
\hline
$\Phi_{:,225}$ & • technologist at the united states marine hospital in key \textcolor{blue}{west}, florida who developed a morbid obsession for\newline• 00°,11'', w, near smith \textcolor{blue}{valley}, nevada. & Places in US, following the convention ``city, state''&51.5&91.5&91.0&77.5\\
\hline
\end{tabular}
\caption{A list of typical mid-level transformer factors. The top-activation words and their context sequences for each transformer factor at layer-$8$ are shown in the second column. We summarize the patterns of each transformer factor in the third column. The last 4 columns are the percentage of the top 200 activated words and sequences that contain the summarized patterns in layer-$4$,$6$,$8$, and $10$ respectively.}
\label{Mid Unexpected}
\vspace{-0.1in}

\end{table*}

\vspace{0.1in} \begin{table*}[!h]

\small
\centering
\begin{tabular}{|P{0.05\linewidth} | m{0.45\linewidth} | m{0.30\linewidth} | P{0.05\linewidth} |}
\hline
& Adversarial Text & Explanation & $\alpha_{35}$\\
\hline
(o) & album as "full of exhilarating, ecstatic, \textcolor{blue}{thrilling}, fun and sometimes downright silly songs" & The original top-activated word and its context sentence for transformer factor $\Phi_{:,35}$ (not an adversarial text) & 9.5\\
\hline
(a) & album as "full of delightful, lively, \textcolor{blue}{exciting}, interesting and sometimes downright silly songs" & Replace the adjectives in sentence (o) with different adjectives. & 9.2 \\
\hline
(b) & album as "full of unfortunate, heartbroken, \textcolor{blue}{annoying}, boring and sometimes downright silly songs" & Replace the adjectives in sentence (o) with negative adjectives. & 8.2\\
\hline
(c) & album as "full of [UNK], [UNK], \textcolor{blue}{thrilling}, [UNK] and sometimes downright silly songs" & Mask the adjectives in sentence (o) with unknown tokens. & 5.3\\
\hline
(d) & album as "full of \textcolor{blue}{thrilling} and sometimes downright silly songs" & Remove the first three adjectives in sentence (o). & 7.8 \\
\hline
(e) & album as "full of \textcolor{blue}{natural}, smooth, rock, electronic and sometimes downright silly songs" & Replace the adjectives in sentence (o) with neutral adjectives. & 6.2 \\
\hline
(f) & each participant starts the battle \textcolor{blue}{with} one balloon. these can be re@-@ inflated up to four & Use a random sentence. & 0.0\\
\hline
(g) & The book is described as "innovative, \textcolor{blue}{beautiful} and brilliant". It receive the highest opinion from James Wood & A sentence we constructed to contain the consecutive-adjective pattern. & 7.9 \\
\hline
\end{tabular}
\caption{We construct adversarial texts that are similar to, but different from, the pattern ``Consecutive adjective''. The last column shows the activation of $\Phi_{:,35}$, i.e., $\alpha^{(8)}_{35}$, with respect to the blue-marked word in layer 8.}
\label{Adversarial Text}

\end{table*}

{\noindent \bf Mid-level: sentence-level pattern formation.} We find that most of the transformer factors with an IS curve peak after layer 6 capture mid-level or high-level semantic meanings. In particular, the mid-level ones correspond to semantic patterns such as phrase-level and sentence-level patterns.

We first show two detailed examples of mid-level transformer factors. Figure~\ref{CWF 35} shows a transformer factor that detects the consecutive usage of adjectives. This pattern starts to emerge at layer 4, develops at layer 6, and becomes quite reliable at layer 8. Figure~\ref{CWF 13} shows a transformer factor that corresponds to a rather unexpected pattern: ``unit exchange'', e.g., 56 inches (140 cm). Although this exact pattern only starts to appear at layer 8, the sub-structures that make up this pattern, e.g., parentheses and numbers, already trigger this factor in layers 4 and 6. Thus, this transformer factor also develops gradually over several layers.

While some mid-level transformer factors verify common semantic or syntactic patterns, there are also many surprising mid-level transformer factors. We list a few in Table~\ref{Mid Unexpected} with quantitative analysis. For each listed transformer factor, we analyze the top 200 activating words and their contexts in each layer. We record the percentage of those words and contexts that correspond to the factor's semantic pattern in Table~\ref{Mid Unexpected}. From the table, we see that large percentages of the top-activated words and contexts do correspond to the pattern we describe. The table also shows that most of these mid-level patterns start to develop at layer 4 or 6. More detailed examples are provided in the appendix section \ref{sec:mid}. Though it is still mysterious why the transformer network develops representations for these surprising patterns, we believe such a direct visualization can provide additional insights that complement the ``probing tasks''.

To further confirm that a transformer factor does correspond to a specific pattern, we can probe its activation with constructed example words and contexts. In Table~\ref{Adversarial Text}, we construct several text sequences that are similar to the pattern corresponding to a particular transformer factor but with subtle differences. The results confirm that a context that strictly follows the pattern represented by that transformer factor triggers a high activation. Moreover, the closer an adversarial example is to this pattern, the higher the activation it receives at this transformer factor. For example, replacing the adjectives with different adjectives (a) keeps the activation at 9.2, whereas masking them with [UNK] tokens (c) lowers it to 5.3.

\vspace{0.1in} {\noindent \bf High-level: long-range dependency.} High-level transformer factors correspond to linguistic patterns that span an extended range in the text. Since the IS curves of mid-level and high-level transformer factors are similar, it is difficult to distinguish these transformer factors based on their IS curves. Thus, we have to manually examine the top-activation words and contexts of each transformer factor to differentiate between mid-level and high-level transformer factors. To ease the process, we use the black-box interpretation algorithm \emph{LIME} \cite{DBLP:journals/corr/RibeiroSG16} to identify the contribution of each token in a sequence. There also exist interpretation tools that specifically leverage the transformer architecture \citep{hila2021explainability,hila2021interpretability}. In the future, one could adapt those tools, which may provide better visualization.

Given a sequence $s \in S$, we can treat $\alpha^{(l)}_{c,i}$, the activation of $\Phi_{:,c}$ in layer-$l$ at location $i$, as a scalar function of $s$, $f^{(l)}_{c,i}(s)$. Assume a sequence $s$ triggers a high activation $\alpha^{(l)}_{c,i}$, i.e., $f^{(l)}_{c,i}(s)$ is large. We want to know how much each token (or equivalently each position) in $s$ contributes to $f^{(l)}_{c,i}(s)$. To do so, we generate a sequence set $\mathcal{S}(s)$, where each $s'\in \mathcal{S}(s)$ is the same as $s$ except that several random positions in $s'$ are masked by [UNK] (the unknown token). Then we learn a linear model $g_{w}(s')$ with weights $w \in \mathbb{R}^{T}$ to approximate $f(s')$, where $T$ is the length of sentence $s$. This can be solved as a ridge regression: $$\min_{w \in \mathbb{R}^T} \mathcal{L} (f,w,\mathcal{S}(s)) + \sigma \|w\|_2^2.$$
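To make the ridge-regression step concrete, one possible instantiation is sketched here. It assumes a plain squared-error loss and a linear surrogate acting on a binary mask encoding $h(s') \in \{0,1\}^T$ (1 for kept positions, 0 for masked ones). The matrix $H$ and the vector $y$ are notation introduced only for this sketch, and the original LIME formulation additionally weights each sample by its proximity to $s$, which is omitted here. $$g_w(s') = w^\top h(s'), \qquad \mathcal{L}(f,w,\mathcal{S}(s)) = \sum_{s' \in \mathcal{S}(s)} \big(f(s') - w^\top h(s')\big)^2.$$ Stacking the encodings $h(s')^\top$ as rows of $H \in \{0,1\}^{N \times T}$ and the activations $f(s')$ into $y \in \mathbb{R}^{N}$, where $N = |\mathcal{S}(s)|$, the minimizer under these assumptions has the closed form $$w^\star = (H^\top H + \sigma I)^{-1} H^\top y.$$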

The learned weights $w$ can serve as a saliency map that reflects the ``contribution'' of each token in the sequence $s$. As in Figure~\ref{297}, the color reflects the weight $w$ at each position. Red means the given position has a positive weight and green means a negative weight. The magnitude of the weight is represented by the intensity. The redder a token is, the more it contributes to the activation of the transformer factor. We leave further implementation and mathematical details of the LIME algorithm to the appendix.
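The color coding described above can be sketched as a small helper. This is only an illustration, not the plotting code used for the figures: the function name `saliency_colors` and the white-to-red / white-to-green RGB ramp are assumptions made for this example.

```python
import numpy as np

def saliency_colors(weights):
    """Map saliency weights to RGB triples (hypothetical helper).

    Positive weights fade from white to red, negative weights from white
    to green; intensity is the weight magnitude relative to the largest
    magnitude in the sequence.
    """
    w = np.asarray(weights, dtype=float)
    scale = np.max(np.abs(w))
    if scale == 0:
        scale = 1.0  # all-zero weights: every token stays white
    colors = []
    for wi in w:
        intensity = abs(wi) / scale              # in [0, 1]
        fade = int(round(255 * (1.0 - intensity)))
        if wi >= 0:
            colors.append((255, fade, fade))     # white -> red
        else:
            colors.append((fade, 255, fade))     # white -> green
    return colors
```

Each returned triple can then be used as the background color of the corresponding token when rendering the sequence.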

We provide detailed visualizations for two different transformer factors that show long-range dependency in Figures~\ref{297} and \ref{322}. Since the visualization of high-level information requires a more extended context, we only show the top two activated words and their contexts for each such transformer factor. Many more are provided in the appendix section \ref{sec:high}.

We name the pattern for transformer factor $\Phi_{:,297}$ in Figure~\ref{297} the ``repetitive pattern detector''. All top-activated contexts for $\Phi_{:,297}$ contain an obvious repetitive structure. Specifically, the text snippet ``can't get you out of my head'' appears twice in the first example, and the text snippet ``xxx class passenger, star alliance'' appears three times in the second example. Compared to the mid-level patterns we found (e.g., Figure~\ref{CWF 13}), high-level patterns like the ``repetitive pattern detector'' are much more abstract. In some sense, the transformer detects whether there are two (or more) almost identical embedding vectors at layer-$10$ without caring what they are. Such behavior might be closely related to the concept proposed in capsule networks \cite{sabour2017dynamic,hinton2021represent}. Further understanding this behavior and studying how the self-attention mechanism helps model the relationships between features is an interesting direction for future research.

Figure~\ref{322} shows another high-level factor, which detects text snippets related to ``the beginning of a biography''. The necessary components, namely the date of birth as a month and a four-digit year, first name and last name, familial relation, and career, are all mid-level information. In Figure~\ref{322}, we see that all the information related to a biography has a high weight in the saliency map. Thus, these components are combined to detect the high-level pattern.

\begin{figure}[!h] \centering \includegraphics[width=0.49\textwidth]{images/278_f.png} \caption{Two examples of highly activated words and their contexts for transformer factor $\Phi_{:,297}$. We also provide the saliency map of the tokens generated using LIME. This transformer factor corresponds to the concept ``repetitive pattern detector''. In other words, repetitive text sequences trigger a high activation of $\Phi_{:,297}$.} \label{297} \end{figure}

\begin{figure}[!h] \centering \includegraphics[width=0.48\textwidth]{images/322.png} \caption{Visualization of $\Phi_{:,322}$. This transformer factor corresponds to the concept ``someone born in some year'' in a biography. All of the high-activation contexts contain the beginning of a biography. As shown in the figure, the attributes of a person (name, age, career, and familial relation) all have high saliency weights.} \vspace{-0.1in} \label{322} \end{figure}

\section{Discussion} Dictionary learning has been successfully used to visualize classical word embeddings \cite{arora2018linear, zhang2019word}. In this paper, we propose to use this simple method to visualize the representation learned in transformer networks, supplementing the implicit ``probing-tasks'' methods. Our results show that the learned transformer factors are relatively reliable and can even provide many surprising insights into linguistic structures. This simple tool can open up transformer networks and show the hierarchical semantic or syntactic representations learned at different stages. In short, we find word-level disambiguation, sentence-level pattern formation, and long-range dependency. The observation that a neural network learns low-level features in early layers and abstract concepts in later layers is very similar to what visualization has shown for CNNs \cite{zeiler2014visualizing}. Dictionary learning can be a convenient tool to help visualize a broad category of neural networks with skip connections, like ResNet \cite{he2016deep}, ViT models \cite{dosovitskiy2020image}, etc. For interested readers, we provide an interactive \href{https://transformervis.github.io/transformervis/}{website}\footnote{https://transformervis.github.io/transformervis/} to gain further insights.

\section*{Acknowledgements} We thank our reviewers for their detailed and insightful comments. We also thank Yuhao Zhang for his suggestions during the preparation of this paper.

\bibliography{naacl2021} \bibliographystyle{acl_natbib}

\clearpage
\appendix
\section*{Supplementary Materials}
\renewcommand{\thesubsection}{\Alph{subsection}}
\subsection{Importance Score (IS) Curves} \label{sec 1}
\begin{figure}[h]
\centering
\subfloat[\centering]{{\includegraphics[width=0.45\linewidth]{images/low_level.png}\label{low more} }}
\quad
\subfloat[\centering]{{\includegraphics[width=0.44\linewidth]{images/high_level.png} \label{high more}}}
\caption{(a) Importance scores of 16 transformer factors corresponding to low-level information. (b) Importance scores of 16 transformer factors corresponding to mid-level information.}
\label{importance score more}
\end{figure}

The shape of the importance score (IS) curve corresponds strongly to a transformer factor's category. Based on the location of the peak of an IS curve, we can classify a transformer factor as low-level, mid-level or high-level. The importance scores of low-level transformer factors peak in early layers and slowly decrease across the rest of the layers. In contrast, the importance scores of mid-level and high-level transformer factors slowly increase and peak at higher layers. In Figure~\ref{importance score more}, we show two sets of examples to demonstrate the clear distinction between these two types of IS curves.
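The peak-based categorization can be sketched as follows. This is an illustration only: the function name `classify_by_is_peak` and the `early_cutoff` parameter are hypothetical, and the default cutoff of layer 6 is an assumption taken from the observation in the main text that factors whose IS curve peaks after layer 6 tend to be mid-level or high-level. Separating mid-level from high-level factors still requires manual inspection.

```python
import numpy as np

def classify_by_is_peak(is_curve, layers, early_cutoff=6):
    """Coarsely label a transformer factor from its IS curve (hypothetical helper).

    is_curve: importance scores, one per probed layer.
    layers:   layer index for each entry of is_curve.
    Returns (label, peak_layer); the label is 'low-level' when the curve
    peaks at or before early_cutoff, else 'mid/high-level'.
    """
    is_curve = np.asarray(is_curve, dtype=float)
    peak_layer = layers[int(np.argmax(is_curve))]
    label = "low-level" if peak_layer <= early_cutoff else "mid/high-level"
    return label, peak_layer
```

For instance, a curve that is largest at the first probed layer is labeled low-level, while one that rises and peaks at layer 8 is labeled mid/high-level.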

Taking a step back, we can also plot the IS curve for each dimension of the word vector (without sparse coding) at different layers. These curves do not show any specific patterns, as shown in Figure~\ref{sparse or no sparse}. This makes intuitive sense since, as mentioned, the individual entries of a contextualized word embedding do not correspond to any clear semantic meaning.

\begin{figure}[h]
\centering
\subfloat[\centering]{{\includegraphics[width=0.45\linewidth]{images/no_sparse.JPG}\label{no sparse} }}
\quad
\subfloat[\centering]{{\includegraphics[width=0.45\linewidth]{images/sparse.JPG} \label{sparse}}}
\caption{(a) Importance score calculated using certain dimension of word vectors without sparse coding. (b) Importance score calculated using sparse coding of word vectors. }
\label{sparse or no sparse}
\end{figure}

\subsection{LIME: Local Interpretable Model-Agnostic Explanations} \label{sec 2}

After training the dictionary $\Phi$ through non-negative sparse coding, we infer the sparse code of a given input $x$ as $$\alpha(x) = \arg\min_{\alpha \succeq 0} \| x - \Phi \alpha \|_2^2 + \lambda \|\alpha\|_1 .$$
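As an illustration of how such an inference problem can be solved, the sketch below runs a plain projected ISTA iteration (a gradient step followed by a non-negative soft threshold). It is not claimed to be the solver used in this work; the function name `nonneg_sparse_code` and the defaults for `lam` and `n_iter` are assumptions made for the example, and an accelerated variant such as FISTA could be substituted.

```python
import numpy as np

def nonneg_sparse_code(x, Phi, lam=0.1, n_iter=500):
    """Approximately solve  min_{a >= 0} ||x - Phi a||_2^2 + lam * ||a||_1.

    x:   input vector of shape (d,)
    Phi: dictionary of shape (d, m), one factor per column
    """
    # Lipschitz constant of the gradient of the quadratic term.
    L = 2.0 * np.linalg.norm(Phi, 2) ** 2
    a = np.zeros(Phi.shape[1])
    for _ in range(n_iter):
        grad = 2.0 * Phi.T @ (Phi @ a - x)
        # Gradient step, then the proximal map of lam*||a||_1 restricted
        # to a >= 0, which is a shifted clamp at zero.
        a = np.maximum(a - (grad + lam) / L, 0.0)
    return a
```

With an identity dictionary the solution is simply each input entry shrunk by `lam / 2` and clamped at zero, which gives a quick sanity check.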

For a given sentence and index pair $(s,i)$, the embedding of word $w = s[i]$ at layer $l$ of the transformer is $x^{(l)}(s,i)$. We can then abstract the inference of a specific entry $c$ of the sparse code of the word vector as a black-box scalar-valued function $f$:

$$f((s,i)) = \alpha_c(x^{(l)}(s,i))$$

Let $RandomMask$ denote the operation that generates a perturbed version of our sentence $s$ by masking words at random locations with ``[UNK]'' (unknown) tokens. For example, a masked sentence could be

[Today, is, a, [UNK], day]

Let $h$ denote an encoder that marks which positions of a perturbed sentence differ from the unperturbed sentence $s$, such that

$$ h(s)_t = \left\{ \begin{array}{ll} 0 & \text{if } s[t] = \text{[UNK]} \\ 1 & \text{otherwise} \end{array} \right. $$

\begin{algorithm}
\caption{Explaining Sparse Coding Activation using LIME Algorithm}
\label{CHalgorithm}
\begin{algorithmic}[1]
\State $\mathcal{S} = \{h(s)\}$
\State $Y = \{f(s)\}$
\For{each $i$ in $\{1,2, ..., N\}$ }
\State $s_i' \leftarrow RandomMask(s)$
\State $\mathcal{S} \leftarrow \mathcal{S} \cup \{h(s_i')\}$
\State $Y \leftarrow Y \cup \{f(s_i')\}$
\EndFor
\State $w \leftarrow Ridge_w(\mathcal{S},Y)$
\end{algorithmic}
\end{algorithm}
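The procedure above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions rather than the released implementation: `lime_saliency`, `mask_prob`, and `seed` are hypothetical names, each position is masked independently with probability `mask_prob`, the ridge problem is solved in closed form without an intercept, and `f` is any callable that maps a token list to the scalar activation being explained.

```python
import numpy as np

def lime_saliency(tokens, f, n_samples=500, mask_prob=0.3,
                  sigma=1.0, unk="[UNK]", seed=0):
    """Fit ridge weights w so that w . h(s') approximates f(s')."""
    rng = np.random.default_rng(seed)
    T = len(tokens)
    H = [np.ones(T)]          # h(s): the unperturbed sentence keeps every token
    Y = [f(tokens)]
    for _ in range(n_samples):
        keep = rng.random(T) >= mask_prob            # True = token is kept
        masked = [t if k else unk for t, k in zip(tokens, keep)]
        H.append(keep.astype(float))                 # h(s_i')
        Y.append(f(masked))                          # f(s_i')
    H = np.vstack(H)
    Y = np.asarray(Y, dtype=float)
    # Closed-form ridge solution: (H^T H + sigma I)^{-1} H^T Y
    return np.linalg.solve(H.T @ H + sigma * np.eye(T), H.T @ Y)
```

As a toy usage example, if `f` returns a positive value only when a particular token is present, the learned weight at that token's position dominates the others.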


References

[Aho:72] Alfred V. Aho, Jeffrey D. Ullman. (1972). The Theory of Parsing, Translation and Compiling.

[APA:83] American Psychological Association. (1983). Publications Manual.

[Chandra:81] Ashok K. Chandra, Dexter C. Kozen, Larry J. Stockmeyer. (1981). Alternation. Journal of the Association for Computing Machinery. doi:10.1145/322234.322243.

[andrew2007scalable] Galen Andrew, Jianfeng Gao. (2007). Scalable training of L1-regularized log-linear models. Proceedings of the 24th International Conference on Machine Learning.

[Gusfield:97] Dan Gusfield. (1997). Algorithms on Strings, Trees and Sequences.

[rasooli-tetrault-2015] Mohammad Sadegh Rasooli, Joel R. Tetreault. (2015). Yara Parser: A Fast and Accurate Dependency Parser. Computing Research Repository.

[Ando2005] Rie Kubota Ando, Tong Zhang. (2005). A Framework for Learning Predictive Structures from Multiple Tasks and Unlabeled Data. Journal of Machine Learning Research.

[tenney2019you] Ian Tenney, Patrick Xia, Berlin Chen, Alex Wang, Adam Poliak, R. Thomas McCoy, Najoung Kim, Benjamin Van Durme, Samuel R. Bowman, Dipanjan Das, others. (2019). What do you learn from context? Probing for sentence structure in contextualized word representations. arXiv preprint arXiv:1905.06316.

[emily2019VisBert] Emily Reif, Ann Yuan, Martin Wattenberg, Fernanda B. Viégas, Andy Coenen, Adam Pearce, Been Kim. (2019). Visualizing and Measuring the Geometry of BERT. Advances in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems, (NeurIPS), pages 8592–8600.

[Kawin2019Contextual] Kawin Ethayarajh. (2019). How Contextual are Contextualized Word Representations? Comparing the Geometry of BERT, ELMo, and GPT-2 Embeddings. Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing, EMNLP-IJCNLP, pages 55–65. Association for Computational Linguistics.

[Zhengbao2020Know] Zhengbao Jiang, Frank F. Xu, Jun Araki, Graham Neubig. (2020). How Can We Know What Language Models Know. Trans. Assoc. Comput. Linguistics.

[hewitt-manning-2019-structural] John Hewitt, Christopher D. Manning. (2019). A Structural Probe for Finding Syntax in Word Representations. Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers).

[Anna2020Bertology] Anna Rogers, Olga Kovaleva, Anna Rumshisky. (2020). A Primer in BERTology: What We Know About How BERT Works. Trans. Assoc. Comput. Linguistics.

[liu-etal-2019-linguistic] Nelson F. Liu, Matt Gardner, Yonatan Belinkov, Matthew E. Peters, Noah A. Smith. (2019). Linguistic Knowledge and Transferability of Contextual Representations. Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies.

[DBLP:journals/corr/RibeiroSG16] Marco Túlio Ribeiro, Sameer Singh, Carlos Guestrin. (2016). "Why Should I Trust You?": Explaining the Predictions of Any Classifier. CoRR.

[arora2018linear] Sanjeev Arora, Yuanzhi Li, Yingyu Liang, Tengyu Ma, Andrej Risteski. (2018). Linear algebraic structure of word senses, with applications to polysemy. Transactions of the Association for Computational Linguistics, 6:483–495.

[zhang2019word] Juexiao Zhang, Yubei Chen, Brian Cheung, Bruno A. Olshausen. (2019). Word Embedding Visualization Via Dictionary Learning. arXiv preprint arXiv:1910.03833.

[beck2009fast] Amir Beck, Marc Teboulle. (2009). A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM Journal on Imaging Sciences.

[zeiler2014visualizing] Matthew D. Zeiler, Rob Fergus. (2014). Visualizing and understanding convolutional networks. European Conference on Computer Vision.

[DBLP:journals/corr/abs-1910-03771] Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, others. (2019). HuggingFace's Transformers: State-of-the-art Natural Language Processing. CoRR.

[devlin2018BERT] Jacob Devlin, Ming-Wei Chang, Kenton Lee, Kristina Toutanova. (2018). BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. CoRR, abs/1810.04805.

[PretrainedBert] Pretrained BERT base model (12 layers). https://huggingface.co/bert-base-uncased, last accessed on 03/11/2021.

[10.3115/1118108.1118117] Edward Loper, Steven Bird. (2002). NLTK: The Natural Language Toolkit. Proceedings of the ACL-02 Workshop on Effective Tools and Methodologies for Teaching Natural Language Processing and Computational Linguistics - Volume 1. doi:10.3115/1118108.1118117.

[francis79browncorpus] W. N. Francis, H. Kucera. (1979). Brown Corpus Manual. Technical report, Department of Linguistics, Brown University, Providence, Rhode Island, US.

[he2016deep] Kaiming He, Xiangyu Zhang, Shaoqing Ren, Jian Sun. (2016). Deep residual learning for image recognition. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.

[dosovitskiy2020image] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, others. (2020). An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929.

[vaswani2017attention] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, Illia Polosukhin. (2017). Attention is all you need. arXiv preprint arXiv:1706.03762.

[sabour2017dynamic] Sara Sabour, Nicholas Frosst, Geoffrey E. Hinton. (2017). Dynamic routing between capsules. arXiv preprint arXiv:1710.09829.

[hinton2021represent] Geoffrey Hinton. (2021). How to represent part-whole hierarchies in a neural network. arXiv preprint arXiv:2102.12627.

[duchi2011adaptive] John Duchi, Elad Hazan, Yoram Singer. (2011). Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research.

[faruqui-etal-2015-sparse] Manaal Faruqui, Yulia Tsvetkov, Dani Yogatama, Chris Dyer, Noah A. Smith. (2015). Sparse Overcomplete Word Vector Representations. Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers). Association for Computational Linguistics.

[hila2021explainability] Hila Chefer, Shir Gur, Lior Wolf. (2021). Generic Attention-model Explainability for Interpreting Bi-Modal and Encoder-Decoder Transformers. CoRR.

[hila2021interpretability] Hila Chefer, Shir Gur, Lior Wolf. (2020). Transformer Interpretability Beyond Attention Visualization. CoRR.