[Paper review] Motif-based Graph Self-Supervised Learning for Molecular Property Prediction

재온 2022. 4. 17. 13:20

MGSSL (Motif-based Graph Self-Supervised Learning for Molecular Property Prediction)
In graph learning, the concept of a motif is treated as quite important. This work is meaningful in that it captures topological properties better than looking at a graph only at the node and edge level. The approach looks applicable to molecular property prediction.

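📚 Motivation

1) Lack of labeled data

Labeling is costly and time-consuming in chemistry.
⇒ A self-supervised approach is adopted to reduce over-fitting and improve generalization.
Self-supervised learning (SSL) is used in many recent studies, and the GROVER paper I reviewed earlier also trains with SSL.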
https://velog.io/@emperor___one/Paper-review-GROVER-Graph-Represention-from-self-supervised-mEssage-passing-tRansformer-%EB%85%BC%EB%AC%B8-%EB%A6%AC%EB%B7%B0

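2) Motif-level molecular property prediction

motif: a distinct, well-defined pattern of a frequently occurring subgraph
→ makes it easier to explain a molecule's properties
ex) a molecule that contains many OH subgraphs → we can predict it is likely to be water-soluble
Existing GNN models are mostly used for node- or edge-level prediction → extended to the motif level
Improves performance on downstream tasks such as molecular property prediction

Limitation of the GROVER model
Relies on traditional software for motif extraction → hard to reflect the topology information of motifs.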

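📚 Prior work

Molecular property prediction

Traditional molecular property prediction method: DFT (density functional theory)
Various machine learning models (ridge, RF, CNN, etc.)
GNN models (message passing) that capture a molecule's structural information more broadly
GROVER, which brings a transformer into the GNN model → extracts motifs with traditional software and treats them like classification labels, so the topology information of motifs is not reflected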

📚 MGSSL framework

(Motif-based Graph Self-Supervised Learning)

💡 Key ideas

Build the motif graph motif-by-motif rather than node-by-node

multi-level self-supervised pre-training tasks

[Motif layer]

chemistry-inspired molecule fragmentation
motif generation

[Atom layer]

multi-level self-supervised pre-training for GNNs

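1️⃣ chemistry-inspired molecule fragmentation

Fragments a molecule according to chemical knowledge: uses the domain knowledge encoded in BRICS (Breaking of Retrosynthetically Interesting Chemical Substructures)
dummy atom: attached where a bond was broken → marks the position where two fragments can join
Further decomposition is applied to control the size of the motif vocabulary (with too many motifs, each one appears less often) → uses domain knowledge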
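
To make the dummy-atom idea concrete, here is a minimal sketch in plain Python. It does not implement the BRICS rules: the bonds to cut are passed in by hand, and the function name `fragment` and the toy atom/bond lists are my own assumptions for illustration.

```python
from collections import defaultdict


def fragment(atoms, bonds, cut_bonds):
    """Split a molecular graph at `cut_bonds` and cap each broken end with "*".

    atoms:     list of element symbols (list index = atom id)
    bonds:     list of (i, j) atom-id pairs
    cut_bonds: bonds to break (chosen by hand here; BRICS would choose them)
    """
    atoms = list(atoms)
    cut = {frozenset(b) for b in cut_bonds}
    adj = defaultdict(set)
    for i, j in bonds:
        if frozenset((i, j)) in cut:
            # Broken bond: attach one dummy atom to each side to mark the join point.
            for end in (i, j):
                atoms.append("*")
                dummy = len(atoms) - 1
                adj[end].add(dummy)
                adj[dummy].add(end)
        else:
            adj[i].add(j)
            adj[j].add(i)

    # Each connected component of the remaining graph is one fragment.
    seen, fragments = set(), []
    for start in range(len(atoms)):
        if start in seen:
            continue
        seen.add(start)
        stack, component = [start], []
        while stack:
            node = stack.pop()
            component.append(node)
            for neighbor in adj[node]:
                if neighbor not in seen:
                    seen.add(neighbor)
                    stack.append(neighbor)
        fragments.append(sorted(atoms[k] for k in component))
    return fragments


# Toy chain C-C-O-C, cutting the C-O bond between atoms 1 and 2.
print(fragment(["C", "C", "O", "C"], [(0, 1), (1, 2), (2, 3)], [(1, 2)]))
# [['*', 'C', 'C'], ['*', 'C', 'O']]
```

Each fragment carries a `*` where the bond was broken, so two fragments can later be matched up at those positions.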

2️⃣ motif generation

molecule graph G = (V, E)
motif tree T(G) = (V, E, X )
where node set is V = {M1, ..., Mn}, edge set is E. X

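→ Generate motifs that maximize the labeling likelihood, so that the motif tree describes the atoms well

How to choose the generation order of the motif tree → breadth-first, depth-first

Start of the order: the first atom
BFS: move as wide as possible first, then move down when there is nowhere else to go
DFS: go as deep as possible first, then move sideways when there is nowhere deeper to go
topological prediction (predicts whether a child node is generated) → backtrack when there are no more child nodes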
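
The two generation orders can be sketched on a toy motif tree. This is only an illustration of BFS versus DFS traversal with backtracking, not the paper's generator: the parent → children dictionary, the motif names `M1`..`M4`, and the `trace` format are assumptions I made up.

```python
from collections import deque


def bfs_order(tree, root):
    """Visit motifs level by level (as wide as possible before going down)."""
    order, queue = [], deque([root])
    while queue:
        node = queue.popleft()
        order.append(node)
        queue.extend(tree.get(node, []))
    return order


def dfs_order(tree, root):
    """Visit motifs depth-first and record every expand/backtrack decision."""
    order, trace = [], []

    def visit(node):
        order.append(node)
        for child in tree.get(node, []):
            # Topology decision: this node still has a child to generate.
            trace.append((node, "expand", child))
            visit(child)
        # No more children: go back to the parent.
        trace.append((node, "backtrack"))

    visit(root)
    return order, trace


# Hypothetical motif tree: M1 has children M2 and M3, M2 has child M4.
toy_tree = {"M1": ["M2", "M3"], "M2": ["M4"]}
print(bfs_order(toy_tree, "M1"))     # ['M1', 'M2', 'M3', 'M4']
print(dfs_order(toy_tree, "M1")[0])  # ['M1', 'M2', 'M4', 'M3']
```

In the DFS version, each `expand` entry corresponds to a "there is a child" prediction and each `backtrack` entry to a "no more children" prediction.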

3️⃣ multi-level self-supervised pre-training for GNNs

The motif layer and the atom layer are designed side by side
Pre-trains by randomly masking nodes and edges (using a GNN)
pretext task
- topology prediction: predicts the topology, i.e. whether a child node exists and where it connects
- motif prediction: predicts the attribute of the node
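
As a rough picture of the random masking step, the sketch below hides a fraction of node labels and keeps the originals as prediction targets. The function name `mask_nodes`, the `[MASK]` token, and the 15% default rate are assumptions for illustration; the review does not state the actual rate, and the GNN that would predict the targets is left out.

```python
import random


def mask_nodes(node_labels, mask_rate=0.15, mask_token="[MASK]", seed=0):
    """Randomly hide node labels; return the masked list and the hidden targets.

    mask_rate is an assumed value, not one taken from the paper.
    """
    rng = random.Random(seed)
    n = len(node_labels)
    # Mask at least one node when possible, never more than exist.
    n_mask = min(n, max(1, int(n * mask_rate)))
    masked_idx = rng.sample(range(n), n_mask)

    masked = list(node_labels)
    targets = {}
    for i in masked_idx:
        targets[i] = masked[i]   # what the model should recover
        masked[i] = mask_token   # what the model actually sees
    return masked, targets


atoms = ["C", "C", "O", "N", "C", "C"]
masked, targets = mask_nodes(atoms, mask_rate=0.5, seed=1)
print(masked.count("[MASK]"), len(targets))  # 3 3
```

The same pattern applies to edges: replace the bond label with a mask token and keep the original as the target.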
Data
- 250k unlabeled molecules sampled from the ZINC15 database
- scaffold-split
Loss
- lambda: weight of loss
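
One way to read "lambda: weight of loss" is as a weighted sum of the motif-level and atom-level losses. The sketch below assumes exactly that; the two-term form, the names `lam_motif` and `lam_atom`, and the use of binary cross-entropy for a single topology decision are my assumptions, not details stated in this review.

```python
import math


def binary_cross_entropy(y_true, p_pred, eps=1e-12):
    """Loss for one yes/no decision, e.g. "does this motif have a child?"."""
    p_pred = min(max(p_pred, eps), 1 - eps)  # avoid log(0)
    return -(y_true * math.log(p_pred) + (1 - y_true) * math.log(1 - p_pred))


def total_loss(motif_loss, atom_loss, lam_motif=1.0, lam_atom=1.0):
    """Weighted sum of the two levels (assumed form; lambdas are the weights)."""
    return lam_motif * motif_loss + lam_atom * atom_loss


print(total_loss(2.0, 4.0, lam_motif=0.5, lam_atom=0.25))  # 2.0
```

Raising one lambda relative to the other shifts pre-training toward that level's task.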

📚 실험 및 결과

1️⃣ downstream task (molecular property prediction)

Data: 8 classification benchmark datasets contained in MoleculeNet

Downstream task
: molecular property prediction (classification)
Compared against SOTA models (Deep Graph Infomax, Attribute masking, GCC, Grover, GPT-GNN)

2️⃣ Baseline model (SOTA)

Deep Graph Infomax: maximizes the mutual information between the representations of whole graphs and the representations of their sampled subgraphs.
Attribute masking: masks node/edge features and lets GNNs predict these attributes.
GCC: designs the pretraining task as discriminating ego-networks sampled from a certain node from ego-networks sampled from other nodes.
Grover: predicts the contextual properties based on atom embeddings to encode contextual information into node embeddings.
GPT-GNN: a generative pretraining task that predicts masked edges and node attributes.

3️⃣ Result

GNN-based models perform well overall
MGSSL achieves the best performance among the compared baseline models
Evaluation metric: AUC
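
For reference, here is a small sketch of how AUC can be computed for a binary classification task, assuming the metric meant here is ROC-AUC. It uses the pairwise definition (the fraction of positive/negative pairs ranked correctly, with ties counted as half) and is written for clarity rather than speed; the function name `roc_auc` is my own.

```python
def roc_auc(labels, scores):
    """ROC-AUC as the probability that a positive outranks a negative."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative example")

    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a correct ranking
    return wins / (len(positives) * len(negatives))


print(roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))  # 0.75
```

A value of 0.5 corresponds to random ranking and 1.0 to a perfect ranking, which is why it suits the imbalanced labels common in molecular datasets.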