📊 Executive Summary

GNN Edge Prediction: Content vs. Network Structure in Twitter

Quick Overview

This analysis evaluates Graph Neural Network (GNN) models for predicting edge formation in Twitter networks focused on vaccine and COVID-19 discussions in Albanian.

Primary Finding: Network structure predicts edge formation far better than content features. One-hot node encoding, which carries no content information, reaches 0.82-0.89 AUC-ROC across the three datasets, while content-based features reach only 0.54-0.80 AUC-ROC.
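To put the AUC-ROC ranges above in context, the metric can be read as the probability that a randomly chosen true edge receives a higher score than a randomly chosen non-edge, so 0.5 corresponds to random ranking. The sketch below computes it by direct pairwise comparison. The scores are hypothetical illustration values, not outputs of the models evaluated here, and `auc_roc` is a helper name introduced for this example.

```python
from itertools import product


def auc_roc(pos_scores, neg_scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    wins = 0.0
    for p, n in product(pos_scores, neg_scores):
        if p > n:
            wins += 1.0
        elif p == n:
            wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))


# Hypothetical scores, for illustration only.
pos = [0.9, 0.8, 0.4]  # predicted scores for node pairs that are real edges
neg = [0.7, 0.3, 0.2]  # predicted scores for sampled non-edges

auc = auc_roc(pos, neg)  # 8 of the 9 pairs are ranked correctly
print(round(auc, 4))
```

The pairwise form is quadratic in the number of scores, so it is only meant to show what the number means; a library routine would normally be used on real data.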

Key Results

Top Performance Scores

Dataset         | Metric | Best Model    | Score
graph_shqip     | AUC    | GCN + one-hot | 0.8858
graph_shqip_nsl | AUC    | GCN + one-hot | 0.8207
graph_shqip_rt  | AUC    | GCN + one-hot | 0.8897

Feature Type Performance

  • LDA Topics: Competitive on accuracy (0.68-0.70) but weaker on AUC
  • Network Embeddings: Best accuracy on graph_shqip (0.7068) but lower AUC
  • One-Hot Encoding: Highest AUC on all three datasets

Why One-Hot Wins

Structural Dominance

The results suggest that users connect based on network proximity and structural patterns rather than content similarity. Plausible drivers of edge formation include:

  • Triadic closure (friends of friends)
  • Community membership
  • Network visibility and centrality
  • Social proof from mutual connections
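The friends-of-friends and mutual-connection ideas above can be made concrete with a common-neighbors score, a standard structural heuristic for link prediction. This is a minimal sketch on a hypothetical toy graph, assuming undirected edges for simplicity; it is not the method used in the analysis, and `common_neighbor_score` is a name introduced here.

```python
def common_neighbor_score(adj, u, v):
    """Number of neighbors shared by u and v in an undirected graph."""
    return len(adj[u] & adj[v])


# Hypothetical toy graph stored as adjacency sets (undirected).
adj = {
    "a": {"b", "c"},
    "b": {"a", "c", "d"},
    "c": {"a", "b", "d"},
    "d": {"b", "c"},
    "e": set(),
}

# "a" and "d" are not connected but share two neighbors ("b" and "c"),
# so a structural heuristic ranks this pair above ("a", "e").
score_ad = common_neighbor_score(adj, "a", "d")
score_ae = common_neighbor_score(adj, "a", "e")
print(score_ad, score_ae)
```

No content features appear anywhere in this score, which is the sense in which structure alone can rank candidate edges.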

Content as Noise

Sophisticated content features may introduce noise due to:

  • Curse of dimensionality: High-dimensional vectors spread the signal thinly across many features
  • Temporal mismatch: Tweet content changes over time; network structure is comparatively stable
  • Sparsity: Not all users have sufficient content for robust features
  • Homogeneous topic space: All tweets are about vaccines/COVID-19

Important: Content likely matters more AFTER connections form, where it drives engagement and interaction frequency. It does not appear to be the primary mechanism for edge creation.
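The homogeneous-topic point can be illustrated with cosine similarity between per-user topic mixtures. When every user writes about the same subject, all pairwise similarities sit close together and leave little room to separate likely edges from unlikely ones. The topic vectors below are hypothetical values chosen for illustration, not LDA output from this dataset, and `similarity_spread` is a helper introduced for this sketch.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def similarity_spread(topics):
    """Range (max - min) of pairwise cosine similarities across users."""
    users = sorted(topics)
    sims = [
        cosine(topics[a], topics[b])
        for i, a in enumerate(users)
        for b in users[i + 1:]
    ]
    return max(sims) - min(sims)


# Hypothetical topic mixtures (each row sums to 1).
single_topic = {"u1": [0.80, 0.15, 0.05], "u2": [0.75, 0.20, 0.05], "u3": [0.82, 0.10, 0.08]}
multi_topic = {"u1": [0.90, 0.05, 0.05], "u2": [0.05, 0.90, 0.05], "u3": [0.85, 0.10, 0.05]}

print(similarity_spread(single_topic) < similarity_spread(multi_topic))
```

In the single-topic case the similarities barely differ between pairs, so a model that relies on them has little contrast to learn from.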

Dataset Context

Critical Limitations:

  • Language: Albanian only (fewer NLP tools available)
  • Scale: Limited dataset size (may amplify structural signals)
  • Topic: Highly specialized (vaccine/COVID-19 discussions)

Search Terms: moderna, pfizer, astrazeneca, vaccine, coronavirus, covid-19, covid, vaksina

Impact: Results may not generalize to broader, multi-topic networks or other languages.

Encoder Architectures

  • GCN (Graph Convolutional Networks): Best AUC performance, especially with one-hot encoding
  • GAT (Graph Attention Networks): Good balance across metrics and features
  • GraphTransformer: Better for accuracy, leverages content features effectively
  • GraphSAGE: Moderate performance across all metrics
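To see why GCN with one-hot encoding is a purely structural model, note that with identity features X = I the first layer's product X W is just W, so each row of W acts as a learnable embedding for one node, mixed with its neighbors through the normalized adjacency matrix. The NumPy sketch below shows one such layer on a hypothetical four-node graph. The weights are random and untrained, and the dot-product decoder is an assumption made for illustration, since the summary does not state which decoder was used.

```python
import numpy as np


def normalized_adjacency(A):
    """Symmetric GCN normalization: D^-1/2 (A + I) D^-1/2."""
    A_hat = A + np.eye(A.shape[0])          # add self-loops
    d_inv_sqrt = 1.0 / np.sqrt(A_hat.sum(axis=1))
    return A_hat * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]


def edge_score(H, u, v):
    """Assumed decoder: sigmoid of the dot product of two node embeddings."""
    return 1.0 / (1.0 + np.exp(-(H[u] @ H[v])))


# Hypothetical undirected toy graph.
A = np.array(
    [[0, 1, 1, 0],
     [1, 0, 1, 0],
     [1, 1, 0, 1],
     [0, 0, 1, 0]],
    dtype=float,
)
n, hidden = A.shape[0], 3

rng = np.random.default_rng(0)
X = np.eye(n)                        # one-hot features: one indicator per node
W = rng.normal(size=(n, hidden))     # untrained weights, for illustration only

A_norm = normalized_adjacency(A)
H = np.maximum(A_norm @ X @ W, 0.0)  # one GCN layer with ReLU

# With one-hot features the layer reduces to A_norm @ W: no content enters.
assert np.allclose(A_norm @ X @ W, A_norm @ W)
print(H.shape, edge_score(H, 0, 1))
```

One design consequence is that the weight matrix grows with the number of nodes, so one-hot inputs are practical mainly for graphs of limited size such as the ones studied here.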

Recommendations

For Edge Prediction: Use GCN with one-hot encoding. It is simple, interpretable, and achieves 0.82-0.89 AUC across the three datasets (0.88+ on graph_shqip and graph_shqip_rt) at low computational cost.
For Balanced Predictions: Use GraphTransformer with LDA features to achieve 0.68-0.70 accuracy with better true positive/negative balance.
For Content-Aware Systems: Use content features for post-connection tasks (ranking, filtering, recommendation) rather than edge prediction.
For Influencer Analysis: Prioritize network position (centrality, degree) over content quality; the edge prediction results suggest it is the stronger signal for who gets followed.

The Influencer Paradox

While content intuitively seems important, the results point to a counterintuitive pattern:

  • Users don't follow influencers primarily because of their content
  • They follow them because they're visible and central in the network
  • Content drives engagement AFTER connection formation
  • Influencer effect is a consequence of network position, not content quality

In specialized networks like vaccine discussions, structural homophily dominates semantic homophily. Users cluster by network proximity more than topic overlap.

Conclusion

Main Takeaway: For edge prediction in specialized, single-topic networks with semantic homogeneity, network structure is substantially more predictive than content. However, content likely plays an important role in explaining engagement and interaction quality on existing connections.

This finding emphasizes the importance of network position and structural factors over content quality in determining link formation on social media.