2. Introduction to Machine Learning
There are probably thousands of articles on the web called "introduction to machine learning." Rather than rewrite them here, this chapter introduces the major concepts with a focus on applications in chemistry. Some introductory resources are listed below; if you know of a newer or better one, please let us know.
The book by Ethem Alpaydin that the author first read in graduate school [Alp20]
Nils Nilsson's online book Introductory Machine Learning
Two reviews of machine learning in materials science [FZJS21, Bal19]
A review of machine learning in computational chemistry [GomezBAG20]
A review of machine learning in metallurgy [NDJ+18]
From these resources you should learn that machine learning is an approach to modeling data, and that the resulting models generally have predictive power. Machine learning covers many methods, but here we take up only those needed to learn deep learning. For example, random forests, support vector machines, and nearest-neighbor methods are widely used and still effective machine learning techniques, but we do not cover them here.
Audience and Objectives
This chapter is intended for machine learning beginners with some knowledge of chemistry and Python; if that does not describe you, we recommend reading through one of the introductory articles above first. The chapter assumes some familiarity with pandas (loading and selecting columns), with rdkit (drawing molecules), and with storing molecules as SMILES strings [Wei88]. After reading this chapter, you should be able to:
Define features and labels
Distinguish supervised learning from unsupervised learning
Explain what a loss function is and how it can be minimized with gradient descent
Understand what a model is, and the relationship between features and labels
Cluster data and explain what the clusters reveal about the data
2.1. Terminology
Machine learning is a field that aims to build models by fitting them to data. Let us first define some terms.
Features
    A set of \(N\) vectors \(\{\vec{x}_i\}\) of dimension \(D\). The entries can be real numbers, integers, etc.
Labels
    A set of \(N\) integers or real numbers \(\{y_i\}\). \(y_i\) is usually a scalar.
Labeled data
    A set of \(N\) tuples \(\{\left(\vec{x}_i, y_i\right)\}\).
Unlabeled data
    A set of \(N\) features \(\{\vec{x}_i\}\) whose labels \(y\) are unknown.
Model
    A function \(f(\vec{x})\) that takes a feature vector and outputs a prediction \(\hat{y}\).
Predictions
    The predicted values \(\hat{y}\) obtained from the model for a given input \(\vec{x}\).
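To make these definitions concrete, here is a minimal sketch with made-up numbers (not from our dataset) showing features, labels, and a model:

```python
import numpy as np

# N = 3 data points, each with D = 2 features (made-up numbers)
features = np.array([[1.0, 0.5],
                     [2.0, 1.5],
                     [3.0, 2.5]])
labels = np.array([0.8, 1.0, 1.2])  # one scalar label per data point


# a model: any function mapping a feature vector to a prediction y-hat
def model(x):
    return 1.2 * x[0] - x[1]


# predictions: one y-hat per feature vector
predictions = np.array([model(x) for x in features])
print(predictions)
```

Here `model` is an arbitrary hand-written function; in the rest of the chapter the model's parameters are instead fit to the data.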
2.2. Supervised Learning
Our first task is supervised learning. Supervised learning means predicting \(y\) from \(\vec{x}\) with a model trained on data. It is called supervised because the labels contained in the dataset are shown to the algorithm during training. The other approach is unsupervised learning, in which the algorithm is not shown the labels. The supervised/unsupervised distinction will become more rigorous later, but this definition is sufficient for now.
As an example, we will use AqSolDB [SKE19], a dataset of about 10,000 compounds with measured aqueous solubility (the labels). The dataset also contains molecular properties (the features) that we can use for machine learning. The measured solubility is expressed in units of log molarity.
2.3. Running This Notebook
Clicking the icon above launches this page as an interactive Google Colab notebook. See below for how to install the packages.
Tip
To install the packages, run the following code in a new cell.
!pip install dmol-book
If you run into installation problems, you can get the latest list of working packages from this link.
2.3.1. Loading the Data
We download the data and load it into a Pandas dataframe. The cells below set up the imports we need.
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
import jax.numpy as jnp
import jax
from jax.example_libraries import optimizers
import sklearn.manifold, sklearn.cluster
import rdkit, rdkit.Chem, rdkit.Chem.Draw
import dmol
# soldata = pd.read_csv('https://dataverse.harvard.edu/api/access/datafile/3407241?format=original&gbrecs=true')
# had to rehost because dataverse isn't reliable
soldata = pd.read_csv(
"https://github.com/whitead/dmol-book/raw/master/data/curated-solubility-dataset.csv"
)
soldata.head()
ID | Name | InChI | InChIKey | SMILES | Solubility | SD | Ocurrences | Group | MolWt | ... | NumRotatableBonds | NumValenceElectrons | NumAromaticRings | NumSaturatedRings | NumAliphaticRings | RingCount | TPSA | LabuteASA | BalabanJ | BertzCT | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | A-3 | N,N,N-trimethyloctadecan-1-aminium bromide | InChI=1S/C21H46N.BrH/c1-5-6-7-8-9-10-11-12-13-... | SZEMGTQCPRNXEG-UHFFFAOYSA-M | [Br-].CCCCCCCCCCCCCCCCCC[N+](C)(C)C | -3.616127 | 0.0 | 1 | G1 | 392.510 | ... | 17.0 | 142.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.00 | 158.520601 | 0.000000e+00 | 210.377334 |
1 | A-4 | Benzo[cd]indol-2(1H)-one | InChI=1S/C11H7NO/c13-11-8-5-1-3-7-4-2-6-9(12-1... | GPYLCFQEKPUWLD-UHFFFAOYSA-N | O=C1Nc2cccc3cccc1c23 | -3.254767 | 0.0 | 1 | G1 | 169.183 | ... | 0.0 | 62.0 | 2.0 | 0.0 | 1.0 | 3.0 | 29.10 | 75.183563 | 2.582996e+00 | 511.229248 |
2 | A-5 | 4-chlorobenzaldehyde | InChI=1S/C7H5ClO/c8-7-3-1-6(5-9)2-4-7/h1-5H | AVPYQKSLYISFPO-UHFFFAOYSA-N | Clc1ccc(C=O)cc1 | -2.177078 | 0.0 | 1 | G1 | 140.569 | ... | 1.0 | 46.0 | 1.0 | 0.0 | 0.0 | 1.0 | 17.07 | 58.261134 | 3.009782e+00 | 202.661065 |
3 | A-8 | zinc bis[2-hydroxy-3,5-bis(1-phenylethyl)benzo... | InChI=1S/2C23H22O3.Zn/c2*1-15(17-9-5-3-6-10-17... | XTUPUYCJWKHGSW-UHFFFAOYSA-L | [Zn++].CC(c1ccccc1)c2cc(C(C)c3ccccc3)c(O)c(c2)... | -3.924409 | 0.0 | 1 | G1 | 756.226 | ... | 10.0 | 264.0 | 6.0 | 0.0 | 0.0 | 6.0 | 120.72 | 323.755434 | 2.322963e-07 | 1964.648666 |
4 | A-9 | 4-({4-[bis(oxiran-2-ylmethyl)amino]phenyl}meth... | InChI=1S/C25H30N2O4/c1-5-20(26(10-22-14-28-22)... | FAUAZXVRLVIARB-UHFFFAOYSA-N | C1OC1CN(CC2CO2)c3ccc(Cc4ccc(cc4)N(CC5CO5)CC6CO... | -4.662065 | 0.0 | 1 | G1 | 422.525 | ... | 12.0 | 164.0 | 2.0 | 4.0 | 4.0 | 6.0 | 56.60 | 183.183268 | 1.084427e+00 | 769.899934 |
5 rows × 26 columns
2.3.2. Exploring the Data
Each molecule comes with a variety of features, such as molecular weight, number of rotatable bonds, and number of valence electrons. And, of course, there is the label of this dataset: solubility. One of the first things we should always do with a dataset is deepen our understanding of it through a process called exploratory data analysis (EDA). Let's start by examining a few specific examples to get a feel for the labels and the data.
# plot one molecule
mol = rdkit.Chem.MolFromInchi(soldata.InChI[0])
mol
This is the first molecule in the dataset, rendered with rdkit.
Now, to get a rough sense of the range of the solubility data and the molecules that make it up, let's look at the extremes. First, to see the shape of the solubility distribution and its extreme values, we make a histogram of solubility (using seaborn.distplot).
sns.distplot(soldata.Solubility)
plt.show()

The figure above overlays a histogram of solubility with a kernel density estimate. It shows that solubility ranges from about -13 to 2.5 and is not normally distributed.
# get 3 lowest and 3 highest solubilities
soldata_sorted = soldata.sort_values("Solubility")
extremes = pd.concat([soldata_sorted[:3], soldata_sorted[-3:]])
# We need to have a list of strings for legends
legend_text = [
f"{x.ID}: solubility = {x.Solubility:.2f}" for x in extremes.itertuples()
]
# now plot them on a grid
extreme_mols = [rdkit.Chem.MolFromInchi(inchi) for inchi in extremes.InChI]
rdkit.Chem.Draw.MolsToGridImage(
extreme_mols, molsPerRow=3, subImgSize=(250, 250), legends=legend_text
)
極端ãªååã®äŸã§ã¯ãé«å¡©çŽ ååç©ãæã溶解床ãäœããã€ãªã³æ§ååç©ã溶解床ãé«ãããšãããããŸããA-2918ã¯å€ãå€ãã€ãŸãééããªã®ã§ããããïŒãŸããNH\(_3\) ã¯æ¬åœã«ãããã®ææ©ååç©ã«å¹æµããã®ã§ããããïŒãã®ãããªçåã¯ãã¢ããªã³ã°ãè¡ãåã«æ€èšãã¹ãããšã§ãã
2.3.3. Feature Correlation
次ã«ãç¹åŸŽéãšæº¶è§£åºŠ(ã©ãã«)ã®çžé¢ã調ã¹ãŠã¿ãŸãããã SD
(æšæºåå·®)ãOcurrences
(ãã®ååãæ§æããããŒã¿ããŒã¹ã§äœååºçŸããã)ãGroup
(ããŒã¿ã®åºæ) ãªã©ãç¹åŸŽéã溶解床ãšã¯é¢ä¿ã®ãªãã«ã©ã ãããã€ãããããšã«æ³šæããŠãã ããã
features_start_at = list(soldata.columns).index("MolWt")
feature_names = soldata.columns[features_start_at:]
fig, axs = plt.subplots(nrows=5, ncols=4, sharey=True, figsize=(12, 8), dpi=300)
axs = axs.flatten() # so we don't have to slice by row and column
for i, n in enumerate(feature_names):
ax = axs[i]
ax.scatter(
soldata[n], soldata.Solubility, s=6, alpha=0.4, color=f"C{i}"
) # add some color
if i % 4 == 0:
ax.set_ylabel("Solubility")
ax.set_xlabel(n)
# hide empty subplots
for i in range(len(feature_names), len(axs)):
fig.delaxes(axs[i])
plt.tight_layout()
plt.show()

It is interesting that molecular weight and the number of hydrogen bonds appear to be barely correlated with solubility, if at all. MolLogP, a computed descriptor related to solubility, does correlate. We can also see that some of these features have low variance: the feature value barely changes, or does not change at all, for most of the data (e.g., NumHDonors).
2.3.4. Linear Model
We begin with one of the simplest approaches: a linear model. This is our first example of supervised learning. It is rarely used in practice, because choosing the features that explain the labels is difficult.
The linear model is defined by the following equation:

\(\hat{y} = \vec{w} \cdot \vec{x} + b\)

This equation is defined for a single data point. The shape of one feature vector \(\vec{x}\) is 17 in our case (since we have 17 features). \(\vec{w}\) is a vector of 17 adjustable parameters, and \(b\) is an adjustable scalar (called the bias).
We implement this model using a library called jax. jax is very similar to numpy, except that it can easily compute analytical gradients via autodiff.
def linear_model(x, w, b):
return jnp.dot(x, w) + b
# test it out
x = np.array([1, 0, 2.5])
w = np.array([0.2, -0.5, 0.4])
b = 4.3
linear_model(x, w, b)
WARNING:absl:No GPU/TPU found, falling back to CPU. (Set TF_CPP_MIN_LOG_LEVEL=0 and rerun for more info.)
DeviceArray(5.5, dtype=float32)
An important problem now arises: how do we find the adjustable parameters \(\vec{w}\) and \(b\)? The classic approach to linear regression is to compute the adjustable parameters directly with the pseudo-inverse: \(\vec{w} = (X^TX)^{-1}X^{T}\vec{y}\). You can read more about it here. This time, however, we will use an iterative approach, with an eye toward what is done in deep learning. This is not the correct way to compute a linear regression, but it is how things are done in deep learning, so it is a useful way to get used to iterative computation.
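For comparison, the closed-form pseudo-inverse approach can be sketched in a few lines of numpy. The data and `w_true` below are made up for illustration; `np.linalg.lstsq` solves the same normal equations in a numerically stabler way than forming the inverse explicitly.

```python
import numpy as np

rng = np.random.default_rng(0)
# synthetic labeled data (made up): y = X @ w_true + b_true, noiseless for clarity
N, D = 100, 3
X = rng.normal(size=(N, D))
w_true = np.array([1.0, -2.0, 0.5])
b_true = 0.3
y = X @ w_true + b_true

# absorb the bias by appending a column of ones to X
X1 = np.hstack([X, np.ones((N, 1))])
# normal equations w = (X^T X)^{-1} X^T y, solved stably with lstsq
wb, *_ = np.linalg.lstsq(X1, y, rcond=None)
print(wb[:D], wb[D])  # recovers w_true and b_true
```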
調æŽå¯èœãªãã©ã¡ãŒã¿ãç¹°ãè¿ãèŠã€ããããã«ãæ倱é¢æ°ãéžã³ãåŸé ãçšããŠæå°åããããšã«ããŸãããããã®éãå®çŸ©ããããã€ãã®åæå€wãšbãçšããŠæ倱ãèšç®ããŠãããŸãã
# convert data into features, labels
features = soldata.loc[:, feature_names].values
labels = soldata.Solubility.values
feature_dim = features.shape[1]
# initialize our parameters
w = np.random.normal(size=feature_dim)
b = 0.0
# define loss
def loss(y, labels):
return jnp.mean((y - labels) ** 2)
# test it out
y = linear_model(features, w, b)
loss(y, labels)
DeviceArray(3265882.2, dtype=float32)
溶解床ã-13ãã2ã§ããããšãèãããšããã®æ倱ã¯ã²ã©ããã®ã§ãããããããã®çµæã¯ãããŸã§åæãã©ã¡ãŒã¿ããäºæž¬ããã ããªã®ã§ããã®å Žåã«ãããŠã¯æ£ããã®ã§ãã
2.3.5. Gradient Descent
We will now use information about how the loss changes with respect to the adjustable parameters to reduce the loss. The loss function we use is defined as:

\(L = \frac{1}{N}\sum_i^N \left(y_i - \hat{y}_i\right)^2\)

This loss is called the mean squared error, often abbreviated MSE. We can compute the gradient of this loss with respect to the adjustable parameters:

\(\nabla L = \left(\frac{\partial L}{\partial w_0}, \frac{\partial L}{\partial w_1}, \ldots, \frac{\partial L}{\partial w_{16}}, \frac{\partial L}{\partial b}\right)\)

where \(w_i\) is the \(i\)th element of the weight vector \(\vec{w}\). We can reduce the loss by taking a small step in the direction of the negative gradient:

\(w_i' = w_i - \eta \frac{\partial L}{\partial w_i},\quad b' = b - \eta \frac{\partial L}{\partial b}\)

where \(\eta\) is the learning rate, a parameter that is adjustable but not trained (called a hyperparameter). In this example we set it to \(1\times10^{-6}\). It is typically a power of 10 and at most about 0.1; larger values are known to cause stability problems. Let's now implement this procedure, known as gradient descent.
# compute gradients
def loss_wrapper(w, b, data):
features = data[0]
labels = data[1]
y = linear_model(features, w, b)
return loss(y, labels)
loss_grad = jax.grad(loss_wrapper, (0, 1))
# test it out
loss_grad(w, b, (features, labels))
(DeviceArray([1.1313606e+06, 7.3055479e+03, 2.8128903e+05, 7.4776180e+04,
1.5807716e+04, 4.1010732e+03, 2.2745816e+04, 1.7236283e+04,
3.9615741e+05, 5.2891074e+03, 1.0585280e+03, 1.8357089e+03,
7.1248174e+03, 2.7012794e+05, 4.6752903e+05, 5.4541211e+03,
2.5596618e+06], dtype=float32),
DeviceArray(2671.3784, dtype=float32, weak_type=True))
Now that we can compute gradients, let's take a few steps of minimization.
loss_progress = []
eta = 1e-6
data = (features, labels)
for i in range(10):
grad = loss_grad(w, b, data)
w -= eta * grad[0]
b -= eta * grad[1]
loss_progress.append(loss_wrapper(w, b, data))
plt.plot(loss_progress)
plt.xlabel("Step")
plt.yscale("log")
plt.ylabel("Loss")
plt.title("Full Dataset Training Curve")
plt.show()

2.3.6. Training Curve
The figure above is called a training curve (or learning curve). It shows that the loss is decreasing, which means the model is learning. The x-axis can be the number of samples, the number of full passes over the dataset (called epochs), or some other measure of the amount of data used in training.
2.3.7. Batching
Gradient descent is making good progress, but a small change lets us speed training up: batching, which is how training is actually done in machine learning. The small change is that, instead of using all the data at once, we take only a small batch, a subset of the data. Batching has two advantages. First, it reduces the time needed to compute a parameter update; second, it makes the training process random. This randomness helps escape local minima where training might otherwise stall. Adding batching makes our gradient descent algorithm stochastic, giving the method known as stochastic gradient descent (SGD). SGD and its variants are the most common training methods in deep learning.
# initialize our parameters
# to be fair to previous method
w = np.random.normal(size=feature_dim)
b = 0.0
loss_progress = []
eta = 1e-6
batch_size = 32
N = len(labels) # number of data points
data = (features, labels)
# compute how much data fits nicely into a batch
# and drop extra data
new_N = len(labels) // batch_size * batch_size
# the -1 means that numpy will compute
# what that dimension should be
batched_features = features[:new_N].reshape((-1, batch_size, feature_dim))
batched_labels = labels[:new_N].reshape((-1, batch_size))
# to make it random, we'll iterate over the batches randomly
indices = np.arange(new_N // batch_size)
np.random.shuffle(indices)
for i in indices:
# choose a random set of
# indices to slice our data
grad = loss_grad(w, b, (batched_features[i], batched_labels[i]))
w -= eta * grad[0]
b -= eta * grad[1]
# we still compute loss on whole dataset, but not every step
if i % 10 == 0:
loss_progress.append(loss_wrapper(w, b, data))
plt.plot(np.arange(len(loss_progress)) * 10, loss_progress)
plt.xlabel("Step")
plt.yscale("log")
plt.ylabel("Loss")
plt.title("Batched Loss Curve")
plt.show()

Three things are worth noting here:

The loss is smaller than without batching.
The number of steps has increased, even though we iterated over the dataset only once instead of ten times.
The loss does not always decrease.
The loss is smaller because we can take many more steps, even though we see each data point only once. With batching, a gradient descent update happens for every batch, so a single pass over the dataset yields many updates. Concretely, with batch size \(B\), plain gradient descent gives one update per pass, while batching gives \(N / B\) updates. The loss does not always decrease because each evaluation uses a different subset of the data: some randomly chosen batches contain molecules that are harder to predict than others, and since parameters are updated from a single batch, an individual step does not necessarily minimize the true loss. Still, as long as the batches are chosen randomly, we always (on average) improve the expected value of the loss; that is, we minimize the expected loss.
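A quick arithmetic check of the update counts (the value of `N` here is an assumption, roughly the size of this dataset):

```python
# updates per pass over the dataset, with and without batching;
# N is an assumed value, roughly the size of AqSolDB
N = 9982
B = 32
full_batch_updates = 1       # plain gradient descent: one update per pass
minibatch_updates = N // B   # SGD: one update per batch
print(full_batch_updates, minibatch_updates)
```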
2.3.8. Standardizing Features
The decrease in the loss seems to stall at a certain value. Examining the gradients, we find that some are very large and some are very small. The problem is that the features have very different magnitudes. For example, molecular weight is a relatively large number, while the number of rings in a molecule is a relatively small number; neither magnitude has anything to do with how important the feature is for learning. But the magnitudes do affect training, because all features share the same learning rate \(\eta\). A learning rate that suits one feature may be too small or too large for another, and if \(\eta\) is too large, features with large gradients have an outsized effect and training can blow up. The standard solution is to standardize all the features to the same magnitude, using the standardization formula found in any statistics textbook:

\(x_{ij}' = \frac{x_{ij} - \bar{x}_j}{\sigma_{x_j}}\)

where \(\bar{x}_j\) is the column mean and \(\sigma_{x_j}\) is the column standard deviation. To avoid contaminating the training data with the test data, that is, to avoid using any information from the test data (such as its mean or standard deviation) during training, we compute the mean and standard deviation using only the training data. The test data stands in for unseen data, approximating how the model will perform on it; since in a real task we do not know what features the unseen data will have, we cannot use it for standardization during training.
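The cell that follows standardizes with the full dataset for simplicity. A sketch of the train-only version, on a made-up feature matrix, would look like this:

```python
import numpy as np

rng = np.random.default_rng(1)
# made-up feature matrix standing in for our real features
data = rng.normal(loc=5.0, scale=2.0, size=(100, 4))

# split first, then compute the statistics on the training rows only
train, test = data[:80], data[80:]
mean = train.mean(axis=0)
std = train.std(axis=0)

train_std = (train - mean) / std
test_std = (test - mean) / std  # reuse the *training* statistics

print(train_std.mean(axis=0))  # ~0 by construction
print(test_std.mean(axis=0))   # generally not exactly 0, which is fine
```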
fstd = np.std(features, axis=0)
fmean = np.mean(features, axis=0)
std_features = (features - fmean) / fstd
# initialize our parameters
# since we're changing the features
w = np.random.normal(scale=0.1, size=feature_dim)
b = 0.0
loss_progress = []
eta = 1e-2
batch_size = 32
N = len(labels) # number of data points
data = (std_features, labels)
# compute how much data fits nicely into a batch
# and drop extra data
new_N = len(labels) // batch_size * batch_size
num_epochs = 3
# the -1 means that numpy will compute
# what that dimension should be
batched_features = std_features[:new_N].reshape((-1, batch_size, feature_dim))
batched_labels = labels[:new_N].reshape((-1, batch_size))
indices = np.arange(new_N // batch_size)
# iterate through the dataset 3 times
for epoch in range(num_epochs):
# to make it random, we'll iterate over the batches randomly
np.random.shuffle(indices)
for i in indices:
# choose a random set of
# indices to slice our data
grad = loss_grad(w, b, (batched_features[i], batched_labels[i]))
w -= eta * grad[0]
b -= eta * grad[1]
# we still compute loss on whole dataset, but not every step
if i % 50 == 0:
loss_progress.append(loss_wrapper(w, b, data))
plt.plot(np.arange(len(loss_progress)) * 50, loss_progress)
plt.xlabel("Step")
plt.yscale("log")
plt.ylabel("Loss")
plt.show()

We were able to raise the learning rate to 0.01 without training becoming unstable. This is possible because all the features are now of the same order of magnitude. The model has also improved, and further training could continue.
2.3.9. Analyzing Model Performance
Assessing model performance is an important topic that we will examine in detail later. The first thing usually examined in supervised learning is a parity plot of the predictions against the labels. A nice property of this plot is that it works regardless of the dimension of the features. If the model were perfect, all the data would lie on the line \(y = \hat{y}\).
predicted_labels = linear_model(std_features, w, b)
plt.plot([-100, 100], [-100, 100])
plt.scatter(labels, predicted_labels, s=4, alpha=0.7)
plt.xlabel("Measured Solubility $y$")
plt.ylabel(r"Predicted Solubility $\hat{y}$")
plt.xlim(-13.5, 2)
plt.ylim(-13.5, 2)
plt.show()

The final model can be evaluated with the value of the loss, but other metrics are usually used as well. In regression, the correlation coefficient is often computed in addition to the loss. It is computed as follows:
# slice correlation between predict/labels
# from correlation matrix
np.corrcoef(labels, predicted_labels)[0, 1]
0.6475304402750964
A correlation coefficient of 0.65 is not bad, but not great either.
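Other common regression metrics can be computed just as directly. This sketch uses synthetic labels and predictions, not our model's output:

```python
import numpy as np

rng = np.random.default_rng(0)
y = rng.normal(size=200)                    # stand-in labels
yhat = y + rng.normal(scale=0.5, size=200)  # stand-in predictions

mse = np.mean((y - yhat) ** 2)   # the loss we trained with
mae = np.mean(np.abs(y - yhat))  # less sensitive to outliers
r = np.corrcoef(y, yhat)[0, 1]   # correlation coefficient
print(mse, mae, r)
```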
2.4. Unsupervised Learning
In unsupervised learning, the goal is to predict \(\hat{y}\) without having labels. This may sound impossible: how would we judge success? Unsupervised learning is generally divided into three categories:
Clustering
    In this category we assume the \(\{y_i\}\) are class variables and try to partition the features into classes. Clustering means simultaneously learning the definition of the classes (called clusters) and which cluster each feature vector should be assigned to.
Signal denoising
    In this task we assume \(x\) is composed of two components, noise and signal (\(y\)), and the goal is to extract the signal \(y\) from \(x\), removing the noise. This is closely related to representation learning, discussed later.
Generative models
    Generative models learn \(P(\vec{x})\) so that new values of \(\vec{x}\) can be sampled. This resembles treating \(y\) as a probability and estimating it. We discuss these in detail later.
2.4.1. Clustering
Clustering is historically one of the best-known machine learning methods, and it is still widely used. Clustering assigns class labels where none existed, so it is useful for finding patterns in data and gaining new insight from it. It is also the reason clustering has fallen out of favor in chemistry (and in most fields): there are no right or wrong answers in clustering. Two people clustering the same data independently will often arrive at different answers. Nevertheless, clustering is a tool you should know, and it can be a good exploration strategy.
Here we look at the classic clustering method k-means. Wikipedia has an excellent article on this classic algorithm, so we will not repeat that content here. To actually see the clustering results, we begin by projecting the features into two dimensions. This step is covered in detail under representation learning, so you do not need to understand it yet.
# get down to 2 dimensions for easy visuals
embedding = sklearn.manifold.Isomap(n_components=2)
# only fit to every 25th point to make it fast
embedding.fit(std_features[::25, :])
reduced_features = embedding.transform(std_features)
極端ã«é¢ããŠããå€ãå€ãããã®ã§ïŒããã¯ããã§é¢çœãã®ã§ããïŒãããŒã¿ã®çãäž99ããŒã»ã³ã¿ã€ã«ã«ã€ããŠæ³šç®ããŠãããŸãã
xlow, xhi = np.quantile(reduced_features, [0.005, 0.995], axis=0)
plt.figure(dpi=300)
plt.scatter(
reduced_features[:, 0],
reduced_features[:, 1],
s=4,
alpha=0.7,
c=labels,
edgecolors="none",
)
plt.xlim(xlow[0], xhi[0])
plt.ylim(xlow[1], xhi[1])
cb = plt.colorbar()
cb.set_label("Solubility")
plt.show()

次å åæžã«ãããç¹åŸŽéã¯ã¯ããã2次å ãšãªããŸããã溶解床ã®ã¯ã©ã¹ã«ãã£ãŠè²ãä»ããããšã§ãããã€ãã®æ§é ãèŠãããšãã§ããŸãããã®ãããªæ¬¡å åæžãè¡ã£ãçµæã®ããããã§ã¯ã軞ã¯ä»»æã§ãããããã©ãã«ãä»ããªãããšã«æ³šæããŠãã ããã
Next, we cluster. The main challenge in clustering is deciding how many clusters to use. There are various methods, but they ultimately come down to intuition. That is, as a chemist, you must use some domain knowledge beyond the data itself to intuit the number of clusters. Sounds unscientific, right? That is why clustering is hard.
# cluster - using whole features
kmeans = sklearn.cluster.KMeans(n_clusters=4, random_state=0)
kmeans.fit(std_features)
A very simple procedure. Now let's visualize the data, colored by assigned class.
plt.figure(dpi=300)
point_colors = [f"C{i}" for i in kmeans.labels_]
plt.scatter(
reduced_features[:, 0],
reduced_features[:, 1],
s=4,
alpha=0.7,
c=point_colors,
edgecolors="none",
)
# make legend
legend_elements = [
plt.matplotlib.patches.Patch(
facecolor=f"C{i}", edgecolor="none", label=f"Class {i}"
)
for i in range(4)
]
plt.legend(handles=legend_elements)
plt.xlim(xlow[0], xhi[0])
plt.ylim(xlow[1], xhi[1])
plt.show()

2.4.2. Choosing the Number of Clusters
How do we judge whether we chose the number of clusters correctly? The answer is intuition, plus a tool called an elbow plot, used much like a training curve. In k-means, the mean squared distance from each point to its cluster center can serve as a loss function. However, if we treated the number of clusters as a trainable parameter, the loss would be minimized when the number of clusters equals the number of data points (one data point per cluster), which is meaningless. Instead, there is a point beyond which the slope of this loss becomes roughly constant, and we can judge that adding clusters past it contributes no new insight. Let's plot the loss and see what happens. Note that to save time we use only a subset of the dataset; this is the same idea as batching.
# make an elbow plot
loss = []
cn = range(2, 15)
for i in cn:
kmeans = sklearn.cluster.KMeans(n_clusters=i, random_state=0)
# use every 50th point
kmeans.fit(std_features[::50])
# we get score -> opposite of loss
# so take -
loss.append(-kmeans.score(std_features[::50]))
plt.plot(cn, loss, "o-")
plt.xlabel("Cluster Number")
plt.ylabel("Loss")
plt.title("Elbow Plot")
plt.show()

So where is the elbow? Squinting at the plot, maybe 6? Or 3? 4? 7? We will choose 4, though it would be nice if this were clearer and more data-driven. Our last task is to find out what the clusters actually are. We extract the data point closest to each cluster center and treat it as that cluster's representative.
# cluster - using whole features
kmeans = sklearn.cluster.KMeans(n_clusters=4, random_state=0)
kmeans.fit(std_features)
cluster_center_idx = []
for c in kmeans.cluster_centers_:
# find point closest
i = np.argmin(np.sum((std_features - c) ** 2, axis=1))
cluster_center_idx.append(i)
cluster_centers = soldata.iloc[cluster_center_idx, :]
legend_text = [f"Class {i}" for i in range(4)]
# now plot them on a grid
cluster_mols = [rdkit.Chem.MolFromInchi(inchi) for inchi in cluster_centers.InChI]
rdkit.Chem.Draw.MolsToGridImage(
cluster_mols, molsPerRow=2, subImgSize=(400, 400), legends=legend_text
)
So what are these classes? That is unknown. We never deliberately taught the model about solubility (this is unsupervised learning), so they do not necessarily relate to solubility. The classes are a result of whatever features were chosen for the dataset. You could form and test hypotheses, such as "class 1 is all negatively charged" or "class 0 is fatty." But we cannot choose a "best" clustering: in unsupervised learning, what matters is insight and finding patterns, not building an accurate model.
The elbow-plot method is one of many approaches for choosing the number of clusters [PDN05]. I prefer it because it is explicit about relying on intuition. More sophisticated methods somewhat obscure the fact that there are no right or wrong answers in clustering.
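As one example of a more automated (and more opaque) alternative, the silhouette score can be swept over candidate cluster counts. This sketch uses synthetic blobs rather than our solubility features, so the "right" answer is known by construction:

```python
import numpy as np
import sklearn.cluster, sklearn.metrics

rng = np.random.default_rng(2)
# made-up data: three well-separated blobs in 2D
X = np.vstack([rng.normal(loc=c, scale=0.3, size=(50, 2)) for c in (0.0, 3.0, 6.0)])

# higher silhouette score = tighter, better-separated clusters
scores = {}
for k in range(2, 6):
    km = sklearn.cluster.KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    scores[k] = sklearn.metrics.silhouette_score(X, km.labels_)

best_k = max(scores, key=scores.get)
print(best_k)
```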
Note
This process does not give us a function that predicts solubility. The predicted classes might give us insight relevant to predicting solubility, but that is not the goal of clustering.
2.5. Summary
Supervised machine learning is building models that predict labels \(y\) from input features \(\vec{x}\).
Data may be labeled or unlabeled.
Models are trained by minimizing a loss with stochastic gradient descent.
Unsupervised learning is building models that discover patterns in data.
Clustering is a form of unsupervised learning in which the model assigns data points to clusters.
2.6. Exercises
2.6.1. Data Processing
Using numpy (np.amin, np.std, etc., not pandas!), compute the mean, minimum, maximum, and standard deviation of each feature across all data points. Using rdkit, draw the two molecules with the largest molecular weights. What is strange about their structures?
2.6.2. Linear Models
Prove that a nonlinear model such as \(y = \vec{w_1} \cdot \sin\left(\vec{x}\right) + \vec{w_2} \cdot \vec{x} + b\) can be expressed as a linear model.
Write out the linear model equation in Einstein notation, in batched form. Batched form means carrying explicit indices that indicate the batch; for example, the labels become \(y_{bi}\), where \(b\) indicates the batch and \(i\) the individual data point.
2.6.3. Minimizing the Loss
In this notebook we standardized the features but not the labels. Would standardizing the labels affect the choice of learning rate? Explain.
Implement a mean absolute error loss instead of the mean squared error, and compute its gradient using jax. Using the standardized features, show how batch size affects training. Use batch sizes of 1, 8, 32, 256, and 1024. Note that you must reinitialize the weights between runs. Plot the log-loss for each batch size on the same plot, and explain the results.
2.6.4. Clustering
We said that clustering is a form of unsupervised learning and that it predicts labels. What, concretely, are the labels that clustering predicts? Write out the predicted labels for a few data points.
In clustering, labels are predicted from features. Since we do have labels here, we could also treat them as features and cluster on them. Give two reasons why clustering that treats the labels as features, in order to predict new class labels, is not a good idea.
On the Isomap plot (the reduced dimensions), color the points by which group they belong to (G1, G2, etc.). Is there any relationship to the clustering?
2.7. Cited References
- Alp20
Ethem Alpaydin. Introduction to machine learning. MIT press, 2020.
- FZJS21
Victor Fung, Jiaxin Zhang, Eric Juarez, and Bobby G. Sumpter. Benchmarking graph neural networks for materials chemistry. npj Computational Materials, June 2021. URL: https://doi.org/10.1038/s41524-021-00554-0, doi:10.1038/s41524-021-00554-0.
- Bal19
Prasanna V Balachandran. Machine learning guided design of functional materials with targeted properties. Computational Materials Science, 164:82–90, 2019.
- GomezBAG20
Rafael Gómez-Bombarelli and Alán Aspuru-Guzik. Machine learning and big-data in computational chemistry. Handbook of Materials Modeling: Methods: Theory and Modeling, pages 1939–1962, 2020.
- NDJ+18
Aditya Nandy, Chenru Duan, Jon Paul Janet, Stefan Gugler, and Heather J Kulik. Strategies and software for machine learning accelerated discovery in transition metal chemistry. Industrial & Engineering Chemistry Research, 57(42):13973–13986, 2018.
- Wei88
David Weininger. SMILES, a chemical language and information system. 1. Introduction to methodology and encoding rules. Journal of Chemical Information and Computer Sciences, 28(1):31–36, 1988.
- SKE19
Murat Cihan Sorkun, Abhishek Khetan, and Süleyman Er. AqSolDB, a curated reference set of aqueous solubility and 2D descriptors for a diverse set of compounds. Sci. Data, 6(1):143, 2019. doi:10.1038/s41597-019-0151-1.
- PDN05
Duc Truong Pham, Stefan S Dimov, and Chi D Nguyen. Selection of k in k-means clustering. Proceedings of the Institution of Mechanical Engineers, Part C: Journal of Mechanical Engineering Science, 219(1):103–119, 2005.