13. Attention Layers
Attention is a concept in machine learning and AI that goes back many years, especially in computer vision [BP97]. Like the term "neural network", attention was inspired by how the human brain deals with the massive amount of visual and audio input it receives [BP97]. Attention layers are deep learning layers that evoke this idea of attention. You can read more about attention in deep learning in Luong et al. [LPM15], and practical overviews are also available. Attention layers have been empirically shown to be so effective at modeling sequences, such as language, that they have become indispensable [VSP+17]. The most common place you will see attention layers is in transformer neural networks that model sequences. Attention also sometimes appears in graph neural networks.
Audience & Objectives
This chapter builds on Standard Layers and Tensors and Shapes. You should be comfortable with broadcasting, matrices, and tensor shapes. After completing this chapter, you should be able to:
- Correctly specify the shapes and the inputs/outputs of attention layers
- Implement attention layers
- Understand how attention can be applied in other layers
Attention layers are fundamentally a weighted mean reduction: they compute a mean in which each element is weighted in some way. Because it is a mean, attention decreases the rank of its input tensor by one. Attention is unusual among layers because it takes three inputs, whereas most layers take just one or perhaps two. These three inputs are called the query, the values, and the keys. The reduction occurs over the values, so if the values are rank 3, the output will be rank 2. The query should be one rank less than the keys, and the keys should be the same rank as the values. The keys and query determine how to weight the values according to the attention mechanism, which is a fancy word for an equation.
The table below summarizes these three inputs. Note that the query is often batched, in which case its rank is 2. When the query is batched, the rank of the output is 2 as well, instead of 1.
| | Rank | Shape | Purpose | Example |
|---|---|---|---|---|
| Query | 1 | (# of attention features) | input that is checked against the keys | one word represented as a feature vector |
| Keys | 2 | (sequence length, # of attention features) | used to compute attention against the query | all words in a sentence represented as feature vectors |
| Values | 2 | (sequence length, # of value features) | used to compute the output value | a vector of numbers for each word in a sentence |
| Output | 1 | (# of value features) | attention-weighted mean over the values | a single vector |
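As a rough illustration of the shapes in this table, the NumPy sketch below builds random inputs and checks the output shape for a single query and for a batched query. The sizes, the variable names, the helper `weights_from`, and the choice of a dot product followed by a softmax as the weighting are all assumptions made for illustration; the table itself does not prescribe them.

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, n_attn, n_val = 6, 3, 2  # illustrative sizes (assumed)

query = rng.normal(size=(n_attn,))  # rank 1
keys = rng.normal(size=(seq_len, n_attn))  # rank 2
values = rng.normal(size=(seq_len, n_val))  # rank 2


def weights_from(scores):
    # hypothetical helper: normalize scores along the sequence axis
    z = np.exp(scores - scores.max(axis=-1, keepdims=True))
    return z / z.sum(axis=-1, keepdims=True)


# single query: the sequence axis is reduced away, output is rank 1
output = weights_from(keys @ query) @ values
print(output.shape)

# batch of 4 queries: output is rank 2
batched_query = rng.normal(size=(4, n_attn))
batched_output = weights_from(batched_query @ keys.T) @ values
print(batched_output.shape)
```

The output shape depends only on the number of value features and, when batched, on the number of queries; the sequence axis never survives.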
13.1. Example
Attention is best conceptualized as operating on a sequence. Let's use the sentence "The sleepy child reads a book". The words in the sentence correspond to the keys. If we represent the words as embeddings, the keys will be rank 2. For example, the word "sleepy" might be represented by an embedding vector of length 3: \([2, 0, 1]\). These embeddings are either trained or taken from a standard language embedding. By convention, axis 0 of the keys is the position in the sequence and axis 1 contains the vectors. The query is often an element of the keys, like the word "book". The point of attention is to see which parts of the sentence should influence the query. "Book" should have strong attention on "child" and "reads", but probably not on "sleepy". We will soon compute this as a vector, called the attention vector \(\vec{b}\). The output of the attention layer is a reduction over the values, where each element of the values is weighted by the attention between the query and the corresponding key. Thus there should be one key and one value for each element of the sentence. The values can be identical to the keys, which is common.
Let's see how this looks mathematically. The attention layer consists of two steps: (1) computing the attention vector \(\vec{b}\) using the attention mechanism and (2) reducing over the values using the attention vector \(\vec{b}\). "Attention mechanism" is a fancy name for the attention equation. Consider the example above. We will use a 3-dimensional embedding to represent the words:
| Index | Embedding | Word |
|---|---|---|
| 0 | 0,0,0 | The |
| 1 | 2,0,1 | Sleepy |
| 2 | 1,-1,-2 | Child |
| 3 | 2,3,1 | Reads |
| 4 | -2,0,0 | A |
| 5 | 0,2,1 | Book |
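The keys are the collection of these embeddings put together into a rank 2 tensor (matrix). Note that integers are used here only to keep the example simple; word embeddings are normally floating point.
\[
\mathbf{K} = \begin{bmatrix} 0 & 0 & 0 \\ 2 & 0 & 1 \\ 1 & -1 & -2 \\ 2 & 3 & 1 \\ -2 & 0 & 0 \\ 0 & 2 & 1 \end{bmatrix}
\]
The keys have shape \((6, 3)\) because the sentence has 6 words and each word is represented by a 3-dimensional vector. Let's keep the values simple, with one value for each word. These values determine the output. Perhaps they represent the sentiment of each word: is it a positive word like "happy" or a negative word like "angry"?
The values \(\mathbf{V}\) should be the same rank as the keys, so their shape is interpreted as \((6, 1)\). Finally, the query should be one rank less than the keys. The query in this example is the word "book":
\[
\vec{q} = \left[0, 2, 1\right]
\]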
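13.2. Attention Mechanism Equation
The attention mechanism equation uses only the query and keys arguments. It outputs a tensor one rank less than the keys, giving one scalar per key that corresponds to the attention the query should have on that key. This attention vector \(\vec{b}\) should be normalized. The most common attention mechanism is a dot product followed by a softmax:
\[
\vec{b} = \mathrm{softmax}\left(\vec{q}\cdot\mathbf{K}\right) = \mathrm{softmax}\left(\sum_j q_j k_{ij}\right)
\]
where index \(i\) is the position in the sequence and \(j\) is the feature index. The softmax is defined as
\[
\mathrm{softmax}\left(\vec{x}\right) = \frac{1}{\sum_i e^{x_i}}\left[e^{x_0}, e^{x_1}, \ldots, e^{x_N}\right]
\]
and ensures that \(\vec{b}\) is normalized. Substituting the values from the example above, each entry of the softmax argument is the dot product of \(\vec{q} = [0, 2, 1]\) with one row of \(\mathbf{K}\):
\[
\vec{b} = \mathrm{softmax}\left(\left[0, 1, -4, 7, 0, 5\right]\right) \approx \left[0, 0, 0, 0.88, 0, 0.12\right]
\]
The numbers are rounded here, so the attention vector only has weight on the word itself ("book") and the verb ("reads"). This example is made up, but it gives an idea of how attention connects words to each other. It may even remind you of the neighbor concept from graph neural networks.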
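The substitution above can be checked numerically. The sketch below is a minimal NumPy version of the dot product and softmax, using the embeddings from the table in the example section; the variable names and the `softmax` helper are chosen here for illustration.

```python
import numpy as np

# embeddings of "The sleepy child reads a book" from the example table
keys = np.array(
    [[0, 0, 0], [2, 0, 1], [1, -1, -2], [2, 3, 1], [-2, 0, 0], [0, 2, 1]],
    dtype=float,
)
query = keys[5]  # the word "book"


def softmax(x):
    # subtracting the max does not change the result but avoids overflow
    z = np.exp(x - np.max(x))
    return z / z.sum()


scores = keys @ query  # one dot product per word
b = softmax(scores)
print(scores)
print(np.round(b, 2))
```

Rounded to two decimals, only the entries for "reads" and "book" are nonzero, matching the attention vector in the text.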
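13.3. Attention Reduction
The attention vector \(\vec{b}\) is then used to compute a weighted mean over the values:
\[
\vec{b}\cdot\mathbf{V} = \sum_i b_i v_i
\]
Because \(\vec{b}\) is normalized, this sum is a weighted mean. Conceptually, the example computes the attention-weighted sentiment of the query word "book" in the sentence. You can see that an attention layer does two things: it computes an attention vector with the attention mechanism and then uses it to take a weighted mean over the values.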
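To make the reduction concrete, here is a small NumPy sketch that reuses the example embeddings and the query "book". The sentiment values in `values` are assumed purely for illustration (one made-up number per word); only the keys and the query come from the example.

```python
import numpy as np

keys = np.array(
    [[0, 0, 0], [2, 0, 1], [1, -1, -2], [2, 3, 1], [-2, 0, 0], [0, 2, 1]],
    dtype=float,
)
query = keys[5]  # the word "book"

# ASSUMED sentiment value for each word, for illustration only
values = np.array([0.0, -0.2, 0.3, 0.4, 0.0, 0.1])

# attention vector: softmax of the dot products
scores = keys @ query
b = np.exp(scores - scores.max())
b = b / b.sum()

# attention reduction: weighted mean of the values
output = b @ values
print(round(output, 2))
```

With these assumed values the result is dominated by the value for "reads", because that word carries most of the attention weight.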
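13.4. Tensor-Dot
This dot product, softmax, and reduction is called tensor-dot and is the most common attention layer [LPM15]. One common modification is to divide by the square root of the key dimension (the dimension of the last axis). Remember that the keys are not normalized. If they are random numbers, the magnitude of the dot product scales with the square root of the key dimension, by the central limit theorem. Because the softmax takes \(e^{\vec{q} \cdot \mathbf{K}}\), such large magnitudes can make it behave poorly. Putting this together, the equation is:
\[
\vec{b} = \mathrm{softmax}\left(\frac{1}{\sqrt{d}}\vec{q}\cdot\mathbf{K}\right)
\]
where \(d\) is the dimension of the query vector, which equals the feature dimension of the keys.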
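The scaling argument can be checked empirically. The sketch below draws random standard-normal queries and keys (an assumption made only for this demonstration) and compares the spread of the raw dot products with the spread after dividing by \(\sqrt{d}\).

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples = 2000
spread = {}
for d in (4, 64, 1024):
    q = rng.normal(size=(n_samples, d))
    k = rng.normal(size=(n_samples, d))
    dots = np.sum(q * k, axis=1)  # one dot product per sample
    # (std of raw dot products, std after dividing by sqrt(d))
    spread[d] = (dots.std(), (dots / np.sqrt(d)).std())
    print(d, spread[d])
```

The raw standard deviation grows roughly like \(\sqrt{d}\), while the scaled one stays near 1 regardless of \(d\), which keeps the softmax inputs in a comparable range.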
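13.5. Soft, Hard, and Temperature Attention
One possible modification to attention is to replace the output of the \(\mathrm{softmax}\) with a one at the position of highest attention and zeros everywhere else. This is called hard attention. The equation for hard attention replaces the softmax with a "hardmax", defined as
\[
\mathrm{hardmax}\left(\vec{x}\right) = \lim_{T\rightarrow0}\frac{e^{\vec{x} / T}}{\sum_i e^{x_i / T}}
\]
which is a mathematical way of placing a 1 at the position of the largest element of \(\vec{x}\) and a 0 at all others. The term temperature is used for \(T\) because this equation resembles the Boltzmann distribution from statistical mechanics. You can see that \(T \rightarrow 0\) gives hard attention, \(T = 1\) gives soft attention, and \(T = \infty\) gives uniform attention. \(T\) can also be tuned to intermediate values.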
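The effect of the temperature can be seen with a few lines of NumPy. This is a sketch only: the function name `softmax_with_temperature` is made up here, and a very small and a very large \(T\) stand in for the limits \(T \rightarrow 0\) and \(T = \infty\). The scores are the dot products from the "book" example.

```python
import numpy as np


def softmax_with_temperature(x, T):
    # shift by the max before dividing by T to avoid overflow
    z = np.exp((x - np.max(x)) / T)
    return z / z.sum()


scores = np.array([0.0, 1.0, -4.0, 7.0, 0.0, 5.0])

cold = softmax_with_temperature(scores, 0.01)  # approaches hard attention
soft = softmax_with_temperature(scores, 1.0)  # ordinary softmax
hot = softmax_with_temperature(scores, 1e6)  # approaches uniform attention

print(np.round(cold, 2))
print(np.round(soft, 2))
print(np.round(hot, 2))
```

At low temperature all the weight sits on the largest score, and at high temperature every position receives about \(1/6\).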
13.6. Self-Attention
Remember that everything is batched in deep learning? The batched input to an attention layer is usually the query. So although the discussion above treated the query as a tensor one rank less than the keys (a query vector), once it is batched it has the same rank as the keys. In most cases the query is in fact equal to the keys. In our example, the query was the embedding vector for the word "book", which is one of the keys. If the query is batched so that every word in the sentence is considered, the query becomes equal to the keys. A further special case is when the query, values, and keys are all equal. This is called self-attention. It simply means that the attention mechanism uses the values directly and there is no separate set of "keys" input to the layer.
13.7. Trainable Attention
The attention layers defined so far have no trainable parameters. How can you do learning with attention? Typically, trainable parameters are not placed directly in the attention equations. Instead, the keys, values, and query are passed through a dense layer (see Standard Layers) before entering the attention layer. So when viewed as a single layer, attention has no trainable parameters; when viewed as a block of dense layers and an attention layer, it is trainable. This is shown explicitly below.
13.8. Multi-head Attention Block
Inspired by convolutions with multiple filters, there is also a block (group of layers) that splits into multiple parallel attentions. These are called "multi-head attention". If your values have shape \((L, V)\) and the query is a single vector, you get back a tensor of shape \((H, V)\), where \(H\) is the number of parallel attention layers (heads). If attention layers have no trainable parameters, what is the point of this? To make the heads differ, you must introduce weights. These are square weight matrices, because all shapes need to remain constant across the attention heads.
Consider an attention layer to be defined by \(A(\vec{q}, \mathbf{K}, \mathbf{V})\). Multi-head attention is then
\[
\left[A(\mathbf{W}_q^1\vec{q}, \mathbf{W}_k^1\mathbf{K}, \mathbf{W}_v^1\mathbf{V}), \ldots, A(\mathbf{W}_q^H\vec{q}, \mathbf{W}_k^H\mathbf{K}, \mathbf{W}_v^H\mathbf{V})\right]
\]
where each element of the output vector \(\left[\ldots\right]\) is itself the output of an attention layer. With a batched query of length \(L\), as in self-attention, this gives \(H\) tensors of shape \((L, V)\), so the whole output is an \((H, L, V)\) tensor. The best-known example of a multi-head attention block is the self-attention multi-head attention block used in transformers. These blocks are usually applied several times in succession, which requires the values passed to the next block to be a rank 2 tensor rather than the rank 3 \((H, L, V)\) tensor. Thus the output of multi-head attention is often reduced to rank 2 by a matrix product with an \((H, V, V)\) or \((H)\) weight tensor. If this sounds confusing, see the example below.
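The two ways of reducing the \((H, L, V)\) output back to rank 2 can be sketched with `np.einsum`. Everything here is illustrative: the sizes are arbitrary, `heads_out` is a random stand-in for the stacked outputs of \(H\) attention heads, and the weight tensors are random rather than trained.

```python
import numpy as np

rng = np.random.default_rng(0)
H, L, V = 3, 11, 4  # heads, sequence length, value features (assumed)

# stand-in for the outputs of H attention heads, each of shape (L, V)
heads_out = rng.normal(size=(H, L, V))

# option 1: an (H, V, V) weight tensor, contracting heads and features
w_hvv = rng.normal(size=(H, V, V))
out_hvv = np.einsum("hlv,hvw->lw", heads_out, w_hvv)

# option 2: an (H,) weight vector, a weighted sum over the heads
w_h = rng.normal(size=(H,))
out_h = np.einsum("hlv,h->lv", heads_out, w_h)

print(out_hvv.shape, out_h.shape)
```

Both options return an \((L, V)\) tensor, so the result can be fed into another attention block as its values.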
13.10. Code Examples
Let's see how attention can be implemented in code. Random values are used here for the different quantities. Variables that should be trained are prefixed with `w_` and variables that are inputs are prefixed with `i_`.
13.10.1. Tensor-Dot Mechanism
We begin by implementing the tensor-dot mechanism. As an example, we use a sequence length of 11, a key feature length of 4, and a value feature dimension of 2. Remember that the keys and query must share the same feature dimension.
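import numpy as np

def softmax(x, axis=None):
    # keepdims=True so the sum broadcasts along `axis`; subtracting the max avoids overflow
    z = np.exp(x - np.max(x, axis=axis, keepdims=True))
    return z / np.sum(z, axis=axis, keepdims=True)

def tensor_dot(q, k):
    # scaled dot product of one query with every key, then softmax
    b = softmax((k @ q) / np.sqrt(q.shape[0]))
    return b

i_query = np.random.normal(size=(4,))
i_keys = np.random.normal(size=(11, 4))
b = tensor_dot(i_query, i_keys)
print("b = ", b)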
b = [0.20700389 0.04009835 0.05307579 0.0622597 0.08612718 0.04874157
0.14210682 0.11323356 0.0255366 0.13386457 0.08795197]
As expected, we get a vector \(\vec{b}\) whose elements sum to 1.
13.10.2. General Attention
Now we can put this attention mechanism into an attention layer.
def attention_layer(q, k, v):
    # attention vector from query and keys, then weighted mean of the values
    b = tensor_dot(q, k)
    return b @ v

i_values = np.random.normal(size=(11, 2))
attention_layer(i_query, i_keys, i_values)
array([0.3080947 , 0.38364215])
We get two values, one for each feature dimension.
13.10.3. Self-Attention
The change for self-attention is that the query, keys, and values are made equal. In this setting the queries are batched, so the mechanism needs a small change and the output is rank 2.
def batched_tensor_dot(q, k):
    # a will be batch x seq (here N x N): one score per (query, key) pair
    # batched dot product in Einstein notation, scaled by sqrt(feature dim)
    a = np.einsum("ij,kj->ik", q, k) / np.sqrt(q.shape[-1])
    # now we softmax over the sequence (key) axis
    b = softmax(a, axis=1)
    return b

def self_attention(x):
    b = batched_tensor_dot(x, x)
    return b @ x

i_batched_query = np.random.normal(size=(11, 4))
self_attention(i_batched_query)
array([[ 0.11789742, -0.2934655 , -0.03479239, -0.01692023],
[ 0.31828959, -0.27241419, -0.04986509, -0.14278845],
[ 0.02310531, -0.10175113, -0.30212143, -0.17298333],
[-0.20688837, -0.99100187, -0.0773466 , 0.1965005 ],
[-0.1770745 , -0.76096894, -0.00722271, 0.10354181],
[-0.93571529, -1.73757843, -0.11719636, 1.20152768],
[-0.41593942, -0.22415518, -0.40699085, -0.2241061 ],
[-0.79776283, -1.63773601, 0.19498726, 1.06386468],
[-1.69392981, 0.18193607, -0.82259821, -0.04819894],
[-1.77157321, -0.15970198, -0.69181863, 0.17983833],
[-0.11379758, -0.92881141, -0.02131801, 0.40542272]])
The output is an \(11\times4\) matrix, which is the expected shape.
13.10.4. Adding Trainable Parameters
You can add trainable parameters to these steps by adding weight matrices. Let's do this for self-attention. Although the keys, values, and query are equal in self-attention, each can be multiplied by different weights. As a demonstration, the weights change the feature dimension of the values to 2.
# weights should be input feature_dim -> desired output feature_dim
w_q = np.random.normal(size=(4, 4))
w_k = np.random.normal(size=(4, 4))
w_v = np.random.normal(size=(4, 2))

def trainable_self_attention(x, w_q, w_k, w_v):
    # project the same input into separate queries, keys, and values
    q = x @ w_q
    k = x @ w_k
    v = x @ w_v
    b = batched_tensor_dot(q, k)
    return b @ v

trainable_self_attention(i_batched_query, w_q, w_k, w_v)
array([[ 4.23472509e-01, 9.04428270e-02],
[ 1.31111986e+00, 2.35479791e-01],
[ 1.44492004e+00, -1.58504816e-01],
[-6.92618092e+00, -5.76462397e-01],
[-1.17416733e+01, -7.88693159e-01],
[-3.25096494e+01, -3.14974036e+00],
[-1.07461959e+00, -3.13295876e-01],
[-1.53449098e+02, -1.18942119e+01],
[-1.47040433e+00, -1.46023707e-01],
[-2.95108097e+01, -2.37028194e+00],
[-3.57954944e-01, -3.92492830e-02]])
We get an \(11\times 2\) output because the weights changed the values to have feature dimension 2.
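13.10.5. Multi-head
The only change for multi-head attention is that there is one set of weights for each head and the head outputs are combined afterwards. Here a trainable weight vector of length \(H\) is used to combine them, but you could instead concatenate the outputs or use another reduction such as the mean or max.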
w_q_h1 = np.random.normal(size=(4, 4))
w_k_h1 = np.random.normal(size=(4, 4))
w_v_h1 = np.random.normal(size=(4, 2))
w_q_h2 = np.random.normal(size=(4, 4))
w_k_h2 = np.random.normal(size=(4, 4))
w_v_h2 = np.random.normal(size=(4, 2))
w_h = np.random.normal(size=2)

def multihead_attention(x, w_q_h1, w_k_h1, w_v_h1, w_q_h2, w_k_h2, w_v_h2, w_h):
    h1_out = trainable_self_attention(x, w_q_h1, w_k_h1, w_v_h1)
    h2_out = trainable_self_attention(x, w_q_h2, w_k_h2, w_v_h2)
    # join along last axis so we can use dot.
    all_h = np.stack((h1_out, h2_out), -1)
    # weighted sum over the heads: (11, 2, H) @ (H,) -> (11, 2)
    return all_h @ w_h

multihead_attention(
    i_batched_query, w_q_h1, w_k_h1, w_v_h1, w_q_h2, w_k_h2, w_v_h2, w_h
)
array([[-0.33469453, 1.23200244],
[-0.49369896, -0.24600652],
[-4.00547969, -2.08206014],
[ 3.99078926, 1.68343247],
[ 3.71601947, 1.89168072],
[ 1.31416941, 3.38837506],
[-0.21610159, 0.29774985],
[14.46310167, 45.28534033],
[-4.25262271, 1.21172501],
[-2.424586 , 3.21464851],
[ 1.16727829, 1.7187619 ]])
As expected, we get a rank 2 output of shape \(11\times 2\).
14. Attention in Graph Neural Networks
Recall that a key attribute of graph neural networks is permutation equivariance. We used reductions like sum or mean over neighbors to make graph neural network layers permutation equivariant.
Attention layers are also permutation invariant (when the query is not batched) or permutation equivariant (when it is batched). This has made attention a popular way to aggregate neighbor information. Attention layers are good at finding important neighbors, so they matter for high-degree graphs (nodes with many neighbors). High degree is rare in molecules, but you can simply define all atoms to be connected and put the distance in as an edge attribute. Remember that graph convolutional layers (GCN layers), like most GNN layers, can only propagate information across one bond per layer. So connecting all atoms and applying attention enables long-range communication without many layers. The disadvantage is that the network must now learn to pay attention to the correct bonds/atoms.
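The invariance and equivariance claims can be checked numerically. The sketch below uses a scaled dot-product attention written for this check (the function names and sizes are assumptions) and compares outputs before and after shuffling the sequence.

```python
import numpy as np

rng = np.random.default_rng(0)


def softmax(x, axis=-1):
    z = np.exp(x - np.max(x, axis=axis, keepdims=True))
    return z / np.sum(z, axis=axis, keepdims=True)


def attention(q, k, v):
    # q may be a single vector (d,) or a batch of vectors (n, d)
    return softmax(q @ k.T / np.sqrt(k.shape[-1])) @ v


x = rng.normal(size=(5, 3))  # 5 elements with 3 features each
q = rng.normal(size=(3,))  # a single, unbatched query
perm = rng.permutation(5)

# unbatched query: shuffling the keys/values leaves the output unchanged
invariant = np.allclose(attention(q, x, x), attention(q, x[perm], x[perm]))

# batched self-attention: shuffling the input shuffles the output the same way
equivariant = np.allclose(
    attention(x, x, x)[perm], attention(x[perm], x[perm], x[perm])
)
print(invariant, equivariant)
```

Both checks pass because the softmax weights travel with their keys and values when the order changes.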
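Let's see how attention fits into the Battaglia equations [BHB+18]. Recall that the Battaglia equations are general, standardized equations for defining a GNN. Attention can appear in multiple places but, as discussed above, it appears when considering neighbors. Specifically, the query is the \(i\)th node and the keys/values are some combination of neighboring node and edge features. There is no step in the Battaglia equations where this fits neatly, but the attention layer can be split up as follows. Most of the attention layer fits into the edge update equation:
\[
\vec{e}^{'}_k = \phi^e\left(\vec{e}_k, \vec{v}_{rk}, \vec{v}_{sk}, \vec{u}\right)
\]
Recall that this is a general equation and that the choice of \(\phi^e()\) defines the GNN. \(\vec{e}_k\) is the feature vector of edge \(k\), \(\vec{v}_{rk}\) is the feature vector of the receiving node of edge \(k\), \(\vec{v}_{sk}\) is the feature vector of the sending node of edge \(k\), and \(\vec{u}\) is the global graph feature vector. We use this step for the attention mechanism, where the query is the receiving node \(\vec{v}_{rk}\) and the keys/values are composed of the sender and edge vectors. To be specific, we use the approach of Zhang et al. [ZSX+18] with a tensor-dot mechanism. They only considered node features and set the keys and values to be identical to the node features. However, they used trainable parameters to transform the node features into the keys and query:
\[
\begin{aligned}
\vec{q} &= \mathbf{W}_q \vec{v}_{rk} \\
\mathbf{K} &= \mathbf{W}_k \vec{v}_{sk} \\
\mathbf{V} &= \mathbf{W}_v \vec{v}_{sk} \\
\vec{b}_k &= \mathrm{softmax}\left(\frac{1}{\sqrt{d}} \vec{q}\cdot \mathbf{K}\right) \\
\vec{e}^{'}_k &= \vec{b}_k \mathbf{V}
\end{aligned}
\]
Putting it compactly into one equation:
\[
\vec{e}^{'}_k = \mathrm{softmax}\left(\frac{1}{\sqrt{d}} \mathbf{W}_q \vec{v}_{rk}\cdot \mathbf{W}_k \vec{v}_{sk}\right) \mathbf{W}_v \vec{v}_{sk}
\]
Now we have edge feature vectors weighted by attention. Finally, these edge features are summed in the edge aggregation step:
\[
\bar{e}^{'}_i = \rho^{e\rightarrow v}\left(E_i^{'}\right) = \sum_k E^{'}_{ik}
\]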
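A minimal NumPy sketch of this edge update and aggregation for a single receiving node is shown below. It is an illustration under assumptions, not the implementation from Zhang et al.: the sizes, the random weights, and the row-vector convention (`v @ w`) are choices made here.

```python
import numpy as np

rng = np.random.default_rng(0)
n_neighbors, n_feat = 4, 3  # illustrative sizes (assumed)

v_r = rng.normal(size=(n_feat,))  # receiving node i
v_s = rng.normal(size=(n_neighbors, n_feat))  # sending (neighbor) nodes

# stand-ins for the trainable weights W_q, W_k, W_v
w_q = rng.normal(size=(n_feat, n_feat))
w_k = rng.normal(size=(n_feat, n_feat))
w_v = rng.normal(size=(n_feat, n_feat))

q = v_r @ w_q  # query from the receiving node
k = v_s @ w_k  # one key per incoming edge
v = v_s @ w_v  # one value per incoming edge

# attention over the incoming edges of node i
scores = k @ q / np.sqrt(n_feat)
b = np.exp(scores - scores.max())
b = b / b.sum()

# edge update: each edge feature is its value scaled by its attention
e_new = b[:, None] * v  # shape (n_neighbors, n_feat)

# edge aggregation: sum over the edges of node i
e_agg = e_new.sum(axis=0)  # shape (n_feat,)
print(e_agg)
```

Summing the attention-weighted edge features is the same as the attention reduction `b @ v`, so the two Battaglia steps together reproduce an ordinary attention layer over the neighbors.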
Zhang et al. [ZSX+18] also used multi-head attention. How would multi-head attention work here?
The edge feature matrix \(E_i^{'}\) now becomes an edge feature tensor, where axis 0 is the edge (\(k\)), axis 1 is the feature, and axis 2 is the head. Recall that the "head" just means which set of \(\mathbf{W}^h_q, \mathbf{W}^h_k, \mathbf{W}^h_v\) was used. To reduce the tensor back to the expected matrix, simply use another weight matrix that maps the last two axes (feature, head) down to features only.
Writing out the indices explicitly makes this clearer:
\[
\bar{e}^{'}_{il} = \rho^{e\rightarrow v}\left(E_i^{'}\right) = \sum_k \sum_j \sum_h E^{'}_{ikjh} W_{jhl}
\]
where \(j\) is the input edge feature index, \(l\) is the output edge feature index, and \(k, h, i\) are defined as before. "Transformer" is then another name for a network built on multi-head attention, so you will also see transformer graph neural networks [MDM+20].
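The index expression above maps directly onto `np.einsum`. In this sketch the edge feature tensor for one node \(i\) and the weight tensor are random placeholders, and the sizes are assumed for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n_edges, n_in, n_heads, n_out = 5, 3, 2, 4  # sizes for k, j, h, l (assumed)

# edge feature tensor of node i: axis 0 edge (k), axis 1 feature (j), axis 2 head (h)
E_i = rng.normal(size=(n_edges, n_in, n_heads))
# weights mapping (feature, head) to the output features (l)
W = rng.normal(size=(n_in, n_heads, n_out))

# sum over edges k, features j, and heads h
e_bar_i = np.einsum("kjh,jhl->l", E_i, W)
print(e_bar_i.shape)

# equivalent two-step version: sum over edges, then flatten (j, h) and multiply
flat = E_i.sum(axis=0).reshape(-1) @ W.reshape(n_in * n_heads, n_out)
```

The two-step version shows that the weight tensor behaves like an ordinary dense layer acting on the concatenated head features.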
14.1. Chapter Summary
- Attention layers are inspired by human attention, but are fundamentally a weighted mean reduction.
- The attention layer takes three inputs: the query, the values, and the keys. These inputs are often identical, where the query is one of the keys and the keys and values are equal.
- They are good at modeling sequences, such as language.
- The attention vector should be normalized, which can be achieved with a softmax function, but the attention mechanism equation is a hyperparameter.
- Attention layers compute an attention vector with the attention mechanism and then reduce over the values by computing the attention-weighted mean.
- Using hard attention (the hardmax function) returns the value with the maximum output from the attention mechanism.
- The tensor-dot followed by a softmax is the most common attention mechanism.
- Self-attention is achieved when the query, values, and keys are all equal.
- Attention layers by themselves are not trainable.
- A multi-head attention block is a group of layers that splits into multiple parallel attentions.
14.2. Cited References
- BHB+18
Peter W Battaglia, Jessica B Hamrick, Victor Bapst, Alvaro Sanchez-Gonzalez, Vinicius Zambaldi, Mateusz Malinowski, Andrea Tacchetti, David Raposo, Adam Santoro, Ryan Faulkner, and others. Relational inductive biases, deep learning, and graph networks. arXiv preprint arXiv:1806.01261, 2018.
- ZSX+18
Jiani Zhang, Xingjian Shi, Junyuan Xie, Hao Ma, Irwin King, and Dit-Yan Yeung. GaAN: gated attention networks for learning on large and spatiotemporal graphs. arXiv preprint arXiv:1803.07294, 2018.
- BP97
Shumeet Baluja and Dean A. Pomerleau. Expectation-based selective attention for visual monitoring and control of a robot vehicle. Robotics and Autonomous Systems, 22(3):329–344, 1997. Robot Learning: The New Wave. URL: http://www.sciencedirect.com/science/article/pii/S0921889097000468, doi:https://doi.org/10.1016/S0921-8890(97)00046-8.
- LPM15
Minh-Thang Luong, Hieu Pham, and Christopher D Manning. Effective approaches to attention-based neural machine translation. arXiv preprint arXiv:1508.04025, 2015.
- VSP+17
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in neural information processing systems, 5998–6008. 2017.
- MDM+20
Łukasz Maziarka, Tomasz Danel, Sławomir Mucha, Krzysztof Rataj, Jacek Tabor, and Stanisław Jastrzębski. Molecule attention transformer. arXiv preprint arXiv:2002.08264, 2020.