GNN Edge Prediction Analysis

Content vs. Network Structure in Social Media (Twitter)

Dataset Context

Critical Dataset Information:
  • Language: Albanian tweets only
  • Scale: Limited dataset size (not large-scale)
  • Topic: Highly specialized — vaccine and COVID-19 focused
  • Search Terms: moderna, pfizer, astrazeneca, vaccine, coronavirus, covid-19, covid, vaksina

Impact on Interpretation: This context significantly refines what we can conclude from the findings. The dominance of structural features is particularly pronounced in specialized, single-topic networks and may not generalize to broader social media analysis.

Executive Summary

This analysis examines the performance of different feature representations and Graph Neural Network architectures on edge prediction tasks across three Twitter network datasets: graph_shqip, graph_shqip_nsl, and graph_shqip_rt.

A key observation emerges from the experimental results: one-hot encoding of node identities consistently outperforms or equals sophisticated content-based features in predicting edge formation. This finding has significant implications for understanding network formation mechanisms in specialized, single-topic networks with semantic homogeneity.

Important Context: The dominance of structural features is particularly pronounced in this Albanian vaccine/COVID-19 discussion network due to topic homogeneity, limited dataset size, and language-specific NLP constraints. Results may not generalize to broader, multi-topic social media networks.

Key Findings

1. One-Hot Encoding Dominance

Finding: One-hot encoding (pure node identity) combined with a GCN encoder achieves the highest AUC-ROC on all three datasets, and one-hot features rank among the top performers with most other encoder architectures.

Dataset         | AUC Metric | One-Hot Performance | Top Alternative
graph_shqip     | AUC-ROC    | 0.8858 (GCN)        | 0.8397 (ne + GraphTransformer)
graph_shqip_nsl | AUC-ROC    | 0.8207 (GCN)        | 0.7601 (one-hot + GAT)
graph_shqip_rt  | AUC-ROC    | 0.8897 (GCN)        | 0.8279 (one-hot + GraphTransformer)

2. Content Features Show Mixed Results

Observation: Content-derived features like LDA topics and network embeddings show inconsistent performance:
  • LDA (Latent Dirichlet Allocation): Competitive for accuracy but underperforms one-hot on AUC
  • Network Embeddings (ne): Best accuracy on graph_shqip (0.7068) but lower AUC scores
  • User Features: Generally underperform across all metrics

3. Accuracy vs. AUC Divergence

There is a notable divergence between accuracy and AUC metrics:

  • AUC Excellence: One-hot encoding with GCN achieves 0.82-0.89 AUC-ROC
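  • Accuracy Peak: GraphTransformer with content-derived features (ne, lda_top1k) achieves the highest accuracy on graph_shqip and graph_shqip_rt (0.69-0.71)

Accuracy is measured at a single decision threshold, whereas AUC-ROC measures how reliably the model ranks true edges above non-edges across all thresholds. The divergence therefore suggests that content features can support slightly better decisions at the chosen threshold while separating edges from non-edges less reliably than one-hot features.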
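
The two metrics can diverge because accuracy depends on one threshold while AUC-ROC depends only on ranking. The sketch below uses hypothetical toy labels and scores, not results from this study: model A ranks every edge above every non-edge but keeps all scores below 0.5, while model B makes better thresholded decisions with a weaker ranking.

```python
def auc_roc(labels, scores):
    """Probability that a random positive outscores a random negative (ties count 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def accuracy(labels, scores, threshold=0.5):
    """Fraction of correct decisions when predicting an edge for score >= threshold."""
    preds = [1 if s >= threshold else 0 for s in scores]
    return sum(p == y for p, y in zip(preds, labels)) / len(labels)


# Hypothetical toy data: 4 true edges followed by 4 non-edges.
labels = [1, 1, 1, 1, 0, 0, 0, 0]

# Model A: perfect ranking, but every score sits below the 0.5 threshold.
scores_a = [0.45, 0.40, 0.35, 0.30, 0.25, 0.20, 0.15, 0.10]

# Model B: scores straddle the threshold, with some ranking mistakes.
scores_b = [0.9, 0.8, 0.6, 0.2, 0.7, 0.4, 0.3, 0.1]

auc_a, acc_a = auc_roc(labels, scores_a), accuracy(labels, scores_a)
auc_b, acc_b = auc_roc(labels, scores_b), accuracy(labels, scores_b)

print(f"Model A: AUC={auc_a:.2f}, accuracy={acc_a:.2f}")
print(f"Model B: AUC={auc_b:.2f}, accuracy={acc_b:.2f}")
```

In this toy case model A has the higher AUC-ROC and model B the higher accuracy, which is the same qualitative pattern reported for one-hot versus content features.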

Interpretation

Why Does One-Hot Encoding Outperform Content Features?

1. Structural Dominance Over Content

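One-hot features carry no content information: each node receives only a unique identifier, so anything the model learns must come from the graph's edges. The superior performance of one-hot encoding therefore indicates that, in this dataset, network topology and structural patterns are more predictive of edge formation than tweet content. This challenges the assumption that content similarity drives social interactions.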

Interpretation: Users appear to be forming connections based on network proximity and existing relationships, not on content overlap or topic similarity.
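
To make concrete why one-hot inputs amount to purely structural learning, the sketch below assumes the standard GCN propagation rule (symmetrically normalized adjacency with self-loops); the exact layer used in these experiments is not stated in this report. With a one-hot feature matrix X = I, the product A_hat X W reduces to A_hat W, so the weight matrix acts as a free, learnable embedding per node, mixed only through graph neighborhoods. The toy graph and weights are hypothetical.

```python
def matmul(a, b):
    """Plain-Python matrix product of two lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]


def normalized_adjacency(edges, n):
    """Symmetrically normalized adjacency with self-loops: D^-1/2 (A + I) D^-1/2."""
    a = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for u, v in edges:
        a[u][v] = a[v][u] = 1.0
    deg = [sum(row) for row in a]
    return [[a[i][j] / (deg[i] * deg[j]) ** 0.5 for j in range(n)]
            for i in range(n)]


n = 4
edges = [(0, 1), (1, 2), (2, 3)]  # hypothetical toy path graph
a_hat = normalized_adjacency(edges, n)

# One-hot node features are simply the identity matrix.
x_one_hot = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]

# Hypothetical layer weights: one row (a 2-d embedding) per node.
w = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6], [0.7, 0.8]]

with_one_hot = matmul(matmul(a_hat, x_one_hot), w)  # A_hat X W
embedding_only = matmul(a_hat, w)                   # A_hat W

same = all(abs(p - q) < 1e-12
           for row_p, row_q in zip(with_one_hot, embedding_only)
           for p, q in zip(row_p, row_q))
print("A_hat X W equals A_hat W:", same)
```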

2. Homogeneous Topic Space (Context-Dependent)

Dataset-Specific Finding: All tweets revolve around vaccines/COVID-19, creating a highly homogeneous semantic space.

Implications:
  • Content features capture variations within a narrow topic, making semantic differentiation minimal
  • Structural patterns (mutual followers, community ties) become more predictive than content
  • Unlike general Twitter, there is limited content diversity to leverage for predictions
  • This finding may not generalize to broader, multi-topic networks
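
The effect of a homogeneous topic space can be sketched with a simple similarity check. The topic vectors below are hypothetical, chosen to mimic users who all write mostly about the same dominant topic; they are not taken from the dataset. When every user's topic mixture looks alike, cosine similarity is close to 1 for linked and unlinked pairs alike, so it offers little to separate them.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Hypothetical per-user topic distributions in a single-topic corpus.
topics = {
    "u1": [0.70, 0.20, 0.10],
    "u2": [0.65, 0.25, 0.10],
    "u3": [0.72, 0.18, 0.10],
    "u4": [0.68, 0.22, 0.10],
}

linked_pairs = [("u1", "u2"), ("u3", "u4")]    # hypothetical existing edges
unlinked_pairs = [("u1", "u3"), ("u2", "u4")]  # hypothetical non-edges


def mean_similarity(pairs):
    """Average cosine similarity over a list of user pairs."""
    return sum(cosine(topics[a], topics[b]) for a, b in pairs) / len(pairs)


linked_sim = mean_similarity(linked_pairs)
unlinked_sim = mean_similarity(unlinked_pairs)
gap = abs(linked_sim - unlinked_sim)

print(f"linked={linked_sim:.4f} unlinked={unlinked_sim:.4f} gap={gap:.4f}")
```

Running the same comparison on the real LDA vectors, for observed edges versus sampled non-edges, would be one way to check how much discriminative signal the content features actually carry.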

3. Language and Dataset Scale Effects

Important Limitation: The Albanian language and limited dataset size affect content feature quality:
  • NLP Model Quality: Smaller pretrained models and embeddings available for Albanian
  • LDA Robustness: Topic models may be less accurate with smaller corpus and limited vocabulary
  • Network Clustering: Limited dataset size means tighter clustering, amplifying structural patterns
  • Content Sparsity: Not all users have sufficient Albanian tweets for robust feature extraction

4. Content Features as Noise

Sophisticated content features (LDA, embeddings) may introduce noise that degrades model performance:

  • Curse of dimensionality: High-dimensional content features may scatter the signal
  • Temporal mismatch: Tweet content is dynamic while network structure is more stable
  • Sparsity: Not all users have sufficient content for robust feature extraction
  • Redundancy: Content information may already be partially captured by network structure

5. Homophily vs. Content-Based Linking

The results suggest the network is primarily driven by structural homophily (tendency to connect with users in similar structural positions) rather than semantic homophily (tendency to connect based on similar interests/content).

Examples of mechanisms:
  • Following friends of friends (triadic closure)
  • Following users in the same community
  • Following users with similar follower counts
  • Mention-based interactions within echo chambers
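
Mechanisms such as friends-of-friends linking and shared community membership correspond to classic neighborhood-overlap heuristics for link prediction. The sketch below is a minimal illustration on a hypothetical toy graph; the function names are chosen here for illustration and are not part of the original experiments.

```python
from collections import defaultdict


def build_adjacency(edges):
    """Undirected adjacency sets built from an edge list."""
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    return adj


def common_neighbors(adj, u, v):
    """Number of neighbors shared by u and v (a triadic-closure signal)."""
    return len(adj[u] & adj[v])


def jaccard(adj, u, v):
    """Shared neighbors divided by the size of the combined neighborhood."""
    union = adj[u] | adj[v]
    return len(adj[u] & adj[v]) / len(union) if union else 0.0


# Hypothetical toy graph: a, b, c, d form a dense cluster; e hangs off d.
edges = [("a", "b"), ("a", "c"), ("b", "c"), ("b", "d"), ("c", "d"), ("d", "e")]
adj = build_adjacency(edges)

cn_ad = common_neighbors(adj, "a", "d")  # a and d share neighbors b and c
cn_ae = common_neighbors(adj, "a", "e")  # a and e share no neighbors
jac_ad = jaccard(adj, "a", "d")

print(cn_ad, cn_ae, round(jac_ad, 3))
```

Neither score looks at tweet content, which is why such heuristics are a useful baseline when structure is suspected to dominate.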

6. The Role of "Influencers" and Content

Critical Insight: While content and influencers are important for information propagation, they appear to be secondary to network structure for predicting edge formation in this dataset.

This suggests that:
  • Users do not follow influencers primarily because of their content
  • Content becomes important after an edge is formed (for engagement)
  • Network position and visibility drive initial connections
  • The "influencer effect" may be a consequence of network centrality, not content quality

Encoder Architecture Analysis

Best Performing Architectures by Context

Metric   | Best Encoder     | Best Feature | Score  | Dataset
AUC-ROC  | GCN              | one-hot      | 0.8897 | graph_shqip_rt
Accuracy | GraphTransformer | ne           | 0.7068 | graph_shqip

Encoder Characteristics

  • GCN (Graph Convolutional Networks): Best for AUC with simple features, excels with one-hot encoding
  • GAT (Graph Attention Networks): Good balance across features, particularly with one-hot encoding
  • GraphTransformer: Better for accuracy, leverages complex content features effectively
  • GraphSAGE: Moderate performance across all metrics
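
Each encoder above produces node embeddings that must then be turned into an edge score. This report does not state which decoder was used, so the sketch below assumes a common choice, an inner product followed by a sigmoid, purely for illustration. The embeddings are hypothetical values.

```python
import math


def edge_probability(z_u, z_v):
    """Inner-product decoder: sigmoid of the dot product of two node embeddings."""
    logit = sum(a * b for a, b in zip(z_u, z_v))
    return 1.0 / (1.0 + math.exp(-logit))


# Hypothetical 2-d embeddings, as an encoder (GCN, GAT, ...) might output.
embeddings = {
    "u1": [1.0, 0.5],
    "u2": [0.8, 0.7],
    "u3": [-1.0, -0.4],
}

p_u1_u2 = edge_probability(embeddings["u1"], embeddings["u2"])  # aligned vectors
p_u1_u3 = edge_probability(embeddings["u1"], embeddings["u3"])  # opposed vectors

print(f"P(u1-u2)={p_u1_u2:.3f}  P(u1-u3)={p_u1_u3:.3f}")
```

Under this assumption, AUC-ROC is computed from the ranking of these probabilities, and accuracy from thresholding them.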

Dataset-Specific Analysis

Revised Interpretation with Dataset Context

Limited Dataset Size Effects

The smaller dataset size has several impacts on the findings:

  • Tighter Network Clustering: Smaller networks naturally form more cohesive communities
  • Amplified Structural Signals: Structural patterns become more pronounced relative to content
  • Content Signal Dilution: Fewer examples of diverse content reduce the effectiveness of content-based features
  • Representation Quality: Embeddings and topic models are less robust with smaller training data

Specialized Topic Impact

The vaccine/COVID-19 focus creates a fundamentally different prediction problem:
  • Content Homogeneity: Users discuss the same vaccines, side effects, and policies — minimal semantic diversity
  • Temporal Events: Network formation spikes during vaccine rollouts, policy announcements, and outbreaks
  • Echo Chambers: Users naturally segregate into pro-vaccine and vaccine-skeptical communities
  • Reduced Content Variation: Unlike general Twitter, there's less content diversity to exploit for edge prediction

Albanian Language Considerations

Language-specific factors affect content feature quality:

  • Fewer Pretrained Models: Smaller ecosystem of Albanian NLP tools compared to English
  • LDA Topic Quality: Topic modeling with smaller corpora produces less stable and interpretable topics
  • Embedding Coverage: Word embeddings may have lower coverage for Albanian vocabulary
  • Morphological Complexity: Albanian inflections may increase sparsity in content features

Dataset-Specific Observations

graph_shqip (Full Dataset)

  • AUC Leader: one-hot + GCN (0.8858)
  • Accuracy Leader: ne + GraphTransformer (0.7068)
  • Characteristic: Largest dataset with most balanced performance
  • Insight: Shows that complex features can improve accuracy without improving AUC

graph_shqip_nsl (NSL-filtered)

  • AUC Leader: one-hot + GCN (0.8207)
  • Accuracy Leader: one-hot + GAT (0.6489)
  • Characteristic: Content features perform worse here, structural signals dominate
  • Insight: Filtering by network security labels amplifies structural importance

graph_shqip_rt (Retweet-based)

  • AUC Leader: one-hot + GCN (0.8897)
  • Accuracy Leader: lda_top1k + GraphTransformer (0.6889)
  • Characteristic: Retweet networks show highest overall AUC scores
  • Insight: Retweet behavior is more predictable from structure; content matters for selective retweets

Theoretical Implications

Generalizability Caveats

Important Limitation: These findings apply primarily to specialized, single-topic networks with:
  • Limited semantic diversity (vaccine/COVID-19 focus)
  • Smaller network size
  • Non-English language content
  • Strong event-driven temporal dynamics

Different results may emerge in general Twitter networks, multi-topic discussions, larger-scale networks, or English-language content with higher-quality NLP tools.

Social Network Formation in Specialized Networks

For vaccine/COVID-19 discussion networks, the results are consistent with the following formation mechanisms:

  1. Event-Driven Clustering: Users connect during vaccine rollouts or policy announcements
  2. Temporal Proximity: Engagement at similar times creates network bonds
  3. Community Polarization: Users segregate into ideological clusters (pro/anti-vaccine)
  4. Network Visibility: Users discover others through mutual followers and trending topics

Role of Content (Context-Dependent)

In this specialized network, content appears to play a secondary role:

  • Within-Community Engagement: Content drives interaction frequency within established communities
  • Community Identity: Content signals alignment with pro or anti-vaccine positions
  • Not Primary Connector: Content does not predict edge formation directly in this homogeneous semantic space
  • Potentially More Important in Diverse Networks: Multi-topic networks may show greater content influence

Interpretation Discussion

Question: Influence of Content in Social Media Analysis

Considering that content is a reasonable influencer in social media, and we are analyzing tweets, how should we interpret the results where one-hot encoding outperforms or equals the content features?

Detailed Response & Analysis

1. Structural Dominance Over Content

The superior performance of one-hot encoding fundamentally suggests that network topology and node relationships are more predictive of edge formation than the actual tweet content. This indicates that in your network, who interacts with whom follows structural patterns rather than content similarity.

This is a significant finding because it challenges the intuitive assumption that people on Twitter follow others primarily because they like their content. Instead, the data reveals that structural factors dominate.

2. Content Features as Noise

Key Insight: Features like lda_full, lda_top1k, and ne (network embedding) derived from content might be introducing noise that hurts prediction, rather than helping. This suggests several possibilities:
  • Curse of Dimensionality: Content feature vectors are high-dimensional and may scatter the signal across too many dimensions
  • Temporal Mismatch: Tweet content changes rapidly while network structure is relatively stable. A user's old tweets may not reflect current content preferences
  • Sparsity Issues: Not all users have sufficient content for robust feature extraction, creating unreliable representations
  • Redundancy: Content information may already be partially captured implicitly by network structure

3. Homophily: Structural vs. Semantic

The results suggest the network is primarily driven by structural homophily (tendency to connect with users in similar network positions) rather than semantic homophily (tendency to connect based on similar interests/content).

Real-world mechanisms at play:
  • Triadic Closure: Friends of friends becoming friends
  • Community Effect: Users naturally join existing groups and communities
  • Preferential Attachment: Following popular/visible users regardless of content
  • Echo Chambers: Mention-based interactions within existing clusters
These mechanisms are fundamentally about network position and visibility, not content quality.
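
Preferential attachment in particular can be scored from degrees alone. The sketch below uses the usual degree-product heuristic on a hypothetical toy graph; it is an illustration of the mechanism, not a method used in the reported experiments.

```python
from collections import Counter


def degrees(edges):
    """Degree of every node in an undirected edge list."""
    deg = Counter()
    for u, v in edges:
        deg[u] += 1
        deg[v] += 1
    return deg


def preferential_attachment(deg, u, v):
    """Degree-product score: higher when both endpoints are already well connected."""
    return deg[u] * deg[v]


# Hypothetical toy graph with one visible, central account ("hub").
edges = [("hub", "a"), ("hub", "b"), ("hub", "c"), ("a", "b"), ("d", "c")]
deg = degrees(edges)

score_d_hub = preferential_attachment(deg, "d", "hub")
score_d_a = preferential_attachment(deg, "d", "a")

print(score_d_hub, score_d_a)
```

The low-degree node d scores higher with the hub than with a peripheral node, without any reference to what either account posts.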

4. The Influencer Paradox

Critical Finding: While content and influencers are intuitively important for social media, the results reveal they are secondary to network structure for edge prediction.

This suggests:
  • Users do not follow influencers primarily because of their content — they follow them because they're visible and central in the network
  • Content becomes important AFTER edge formation — it drives engagement and interaction frequency, not connection creation
  • Network position drives initial connections — visibility through network centrality, recommendations, or mentions
  • The "influencer effect" may be a consequence of network centrality, not content quality — influencers are influential because they occupy central positions, not vice versa

5. Accuracy vs. AUC: A Tale of Two Metrics

Interestingly, content features perform relatively better on accuracy than AUC:

  • AUC Excellence (One-Hot): 0.82-0.89 with one-hot encoding
  • Accuracy Peak (Content): 0.69-0.71 with GraphTransformer + content-derived features (lda_top1k on graph_shqip_rt, ne on graph_shqip)

This divergence reflects what the two metrics measure. Accuracy is computed at a single decision threshold, while AUC-ROC measures how reliably the model ranks true edges above non-edges across all thresholds. Content features therefore appear to support slightly better decisions at the chosen threshold, but they discriminate less well between edges and non-edges overall.

6. Practical Implications for Your Study

Content is not irrelevant — it plays an important role, but primarily for:
  • Explaining why certain edges are engaged with more than others
  • Predicting interaction frequency and quality on existing edges
  • Characterizing communities and identifying topic-based clusters
  • Understanding how information flows through the network
Structure is the primary predictor of edge formation because:
  • Twitter's recommendation algorithm likely emphasizes network proximity and mutual connections
  • Discovery is driven by visibility (trending, mutual followers, mentions)
  • Network effects and social proof create self-reinforcing connection patterns
  • The cost of evaluating content for millions of potential connections is prohibitive

Conclusions & Recommendations

Main Conclusion

In this Twitter network analysis, network structure is substantially more predictive of edge formation than content features. One-hot encoding of node identities outperforms or matches sophisticated content-based features in edge prediction tasks (AUC: 0.82-0.89 vs. 0.54-0.80), suggesting that users in this Albanian vaccine/COVID-19 network connect based on proximity and structural patterns rather than content similarity.

Practical Recommendations

For Edge Prediction Tasks: Use GCN with one-hot encoding. This achieves 0.82-0.89 AUC-ROC across the three datasets and is simple and computationally efficient. In this setting, complex content features appear to add noise rather than signal.
For Balanced Predictions: If accuracy is the primary metric, use GraphTransformer with content-derived features (ne or lda_top1k). This achieves 0.69-0.71 accuracy on graph_shqip and graph_shqip_rt and may provide more balanced true positive/negative rates; on graph_shqip_nsl, one-hot + GAT remains the accuracy leader (0.6489).
For Content-Aware Systems: Content features should be used for post-connection tasks (recommendation, ranking, filtering) rather than predicting connections. They help explain and optimize engagement on existing relationships.
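For Influencer Analysis: These results suggest that network position (degree, centrality) matters more than content quality for predicting who connects to whom. Influence spread itself was not measured here, so influencer detection should start from structural metrics and treat content-based scoring as a complement to be validated separately.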

Future Research Directions

  • Investigate temporal dynamics: Does content influence connection formation in newly formed networks?
  • Analyze subpopulations: Do content features perform better for niche communities?
  • Combine approaches: Ensemble methods using structure for confidence + content for explanation
  • Cross-platform analysis: Do these patterns hold on other social networks (Facebook, Instagram, TikTok)?