Quick Overview
This analysis evaluates Graph Neural Network (GNN) models for link prediction (predicting edge formation) in Twitter networks built around Albanian-language discussions of vaccines and COVID-19.
Key Results
Top AUC Scores by Dataset
| Dataset | Metric | Best Model | Score |
|---|---|---|---|
| graph_shqip | AUC | GCN + one-hot | 0.8858 |
| graph_shqip_nsl | AUC | GCN + one-hot | 0.8207 |
| graph_shqip_rt | AUC | GCN + one-hot | 0.8897 |
Content Features Performance
- LDA Topics: Competitive on accuracy (0.68-0.70) but underperform on AUC
- Network Embeddings: Best accuracy on graph_shqip (0.7068) but lower AUC
- One-Hot Encoding: Best AUC on all three datasets, although not always the best accuracy (network embeddings lead on graph_shqip)
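The split between accuracy and AUC in the list above can arise because accuracy depends on a fixed decision threshold, while AUC depends only on how candidate edges are ranked. The following is a minimal sketch with invented toy scores; `model_a`, `model_b`, and both helper functions are illustrative assumptions, not the evaluation code or the numbers from this analysis.

```python
def auc_score(labels, scores):
    """Probability that a random true edge outranks a random non-edge (ties count half)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def accuracy(labels, scores, threshold=0.5):
    """Fraction of candidate edges classified correctly at a fixed cutoff."""
    preds = [1 if s >= threshold else 0 for s in scores]
    return sum(int(p == y) for p, y in zip(preds, labels)) / len(labels)


# Toy data, invented for illustration only (1 = edge exists, 0 = no edge).
labels = [1, 1, 1, 0, 0, 0]
model_a = [0.45, 0.40, 0.35, 0.30, 0.25, 0.20]  # perfect ranking, all scores below 0.5
model_b = [0.90, 0.80, 0.30, 0.60, 0.20, 0.10]  # one ranking mistake, better at the cutoff

for name, scores in (("model_a", model_a), ("model_b", model_b)):
    print(name, round(auc_score(labels, scores), 3), round(accuracy(labels, scores), 3))
# model_a: AUC 1.0,   accuracy 0.5
# model_b: AUC 0.889, accuracy 0.667
```

In this toy case `model_b` wins on accuracy while `model_a` wins on AUC, which is the same kind of split reported for network embeddings versus one-hot encoding on graph_shqip.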
Why One-Hot Wins
Structural Dominance
The results suggest that users connect based on network proximity and structural patterns rather than content similarity. Under this reading, edge formation is driven by:
- Transitive closure (friends of friends)
- Community membership
- Network visibility and centrality
- Social proof from mutual connections
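The structural drivers listed above correspond to classic neighborhood-based link prediction scores. As a rough sketch, the snippet below ranks unconnected user pairs by common neighbors (friends of friends) and Jaccard overlap. The toy graph and all function names are hypothetical, and nothing here is claimed to be part of the models evaluated in this analysis.

```python
from itertools import combinations

# Hypothetical toy network, treated as undirected for simplicity.
edges = [("a", "b"), ("a", "c"), ("b", "c"), ("b", "d"), ("c", "d"), ("d", "e")]


def build_neighbors(edge_list):
    """Map each user to the set of users they are connected to."""
    neighbors = {}
    for u, v in edge_list:
        neighbors.setdefault(u, set()).add(v)
        neighbors.setdefault(v, set()).add(u)
    return neighbors


def common_neighbors(neighbors, u, v):
    """Number of mutual connections: the friends-of-friends signal."""
    return len(neighbors[u] & neighbors[v])


def jaccard(neighbors, u, v):
    """Mutual connections as a share of all connections of either user."""
    union = neighbors[u] | neighbors[v]
    return len(neighbors[u] & neighbors[v]) / len(union) if union else 0.0


neighbors = build_neighbors(edges)
existing = {frozenset(e) for e in edges}

# Candidate links are all pairs that are not connected yet.
candidates = [
    pair for pair in combinations(sorted(neighbors), 2)
    if frozenset(pair) not in existing
]

# Rank candidates by mutual connections first, Jaccard overlap second.
ranked = sorted(
    candidates,
    key=lambda p: (common_neighbors(neighbors, *p), jaccard(neighbors, *p)),
    reverse=True,
)

for u, v in ranked:
    print(u, v, common_neighbors(neighbors, u, v), round(jaccard(neighbors, u, v), 3))
```

The pair `("a", "d")` ranks first because it shares two neighbors, while `("a", "e")` ranks last with none. No tweet content is involved in either score.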
Content as Noise
Sophisticated content features may introduce noise for several reasons:
- Curse of dimensionality: High-dimensional vectors dilute the signal
- Temporal mismatch: Tweet content changes quickly, while network structure is comparatively stable
- Sparsity: Not all users have enough content for robust features
- Homogeneous topic space: All tweets are about vaccines/COVID-19, so content does little to distinguish one user from another
Dataset Context
- Language: Albanian only (fewer NLP tools available)
- Scale: Limited dataset size (which may amplify structural signals relative to content)
- Topic: Highly specialized (vaccine/COVID-19 discussions)
- Search Terms: moderna, pfizer, astrazeneca, vaccine, coronavirus, covid-19, covid, vaksina
Encoder Architectures
- GCN (Graph Convolutional Networks): Best AUC performance, especially with one-hot encoding
- GAT (Graph Attention Networks): Good balance across metrics and features
- GraphTransformer: Better for accuracy, leverages content features effectively
- GraphSAGE: Moderate performance across all metrics
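To make the GCN + one-hot combination concrete, the sketch below runs a single GCN layer, ReLU(Â X W), on a toy graph. It assumes that "one-hot" means one-hot node identifiers, so X is the identity matrix and X W simply selects one learnable row of W per node. The layer then mixes those rows along edges, which means the resulting embeddings carry structure only. The graph, the weights, and the dot-product decoder are illustrative assumptions, not the configuration used in this analysis.

```python
import math


def matmul(a, b):
    """Plain-Python matrix product for small dense matrices."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def normalized_adjacency(edges, n):
    """Symmetric normalization D^-1/2 (A + I) D^-1/2 of the standard GCN layer."""
    a = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]  # self-loops
    for u, v in edges:
        a[u][v] = a[v][u] = 1.0
    deg = [sum(row) for row in a]
    return [[a[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)] for i in range(n)]


n = 4
edges = [(0, 1), (1, 2), (2, 3)]  # hypothetical toy path graph
x = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]  # one-hot node IDs
w = [[0.1, 0.2], [0.3, -0.1], [-0.2, 0.4], [0.5, 0.0]]  # arbitrary weights, shape n x 2

a_hat = normalized_adjacency(edges, n)
xw = matmul(x, w)  # with identity input this equals w: one row per node
h = [[max(0.0, v) for v in row] for row in matmul(a_hat, xw)]  # ReLU(A_hat X W)


def edge_score(h, u, v):
    """Sigmoid of the embedding dot product (decoder assumed for illustration)."""
    dot = sum(p * q for p, q in zip(h[u], h[v]))
    return 1.0 / (1.0 + math.exp(-dot))


print(h)
print(edge_score(h, 0, 1))
```

In a trained model, W would be learned from observed edges. The sketch only shows that with one-hot input the layer has no content to draw on, which is consistent with the structural interpretation of the results.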
Recommendations
The Influencer Paradox
While content intuitively seems important, the results point to a counterintuitive interpretation:
- Users may not follow influencers primarily because of their content
- They may follow them because they are visible and central in the network
- Content likely drives engagement after a connection has formed
- The influencer effect would then be a consequence of network position, not content quality
In specialized networks like vaccine discussions, structural homophily dominates semantic homophily. Users cluster by network proximity more than topic overlap.
Conclusion
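Across all three datasets, a GCN with one-hot features achieves the best AUC. This underlines the importance of network position and structural factors over content quality in determining link formation on social media, at least in this specialized Albanian-language network.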