Dataset Context
- Language: Albanian tweets only
- Scale: Limited dataset size (not large-scale)
- Topic: Highly specialized — vaccine and COVID-19 focused
- Search Terms: moderna, pfizer, astrazeneca, vaccine, coronavirus, covid-19, covid, vaksina
Impact on Interpretation: This context significantly refines what we can conclude from the findings. The dominance of structural features is particularly pronounced in specialized, single-topic networks and may not generalize to broader social media analysis.
Executive Summary
This analysis examines the performance of different feature representations and Graph Neural Network architectures on edge prediction tasks across three Twitter network datasets: graph_shqip, graph_shqip_nsl, and graph_shqip_rt.
A key observation emerges from the experimental results: one-hot encoding of node identities consistently outperforms or equals sophisticated content-based features in predicting edge formation. This finding has significant implications for understanding network formation mechanisms in specialized, single-topic networks with semantic homogeneity.
Key Findings
1. One-Hot Encoding Dominance
| Dataset | AUC Metric | One-Hot Performance | Top Alternative |
|---|---|---|---|
| graph_shqip | AUC-ROC | 0.8858 (GCN) | 0.8397 (ne + GraphTransformer) |
| graph_shqip_nsl | AUC-ROC | 0.8207 (GCN) | 0.7601 (one-hot + GAT) |
| graph_shqip_rt | AUC-ROC | 0.8897 (GCN) | 0.8279 (one-hot + GraphTransformer) |
2. Content Features Show Mixed Results
- LDA (Latent Dirichlet Allocation): Competitive for accuracy but underperforms one-hot on AUC
- Network Embeddings (ne): Best accuracy on graph_shqip (0.7068) but lower AUC scores
- User Features: Generally underperform across all metrics
3. Accuracy vs. AUC Divergence
There is a notable divergence between accuracy and AUC metrics:
- AUC Excellence: One-hot encoding with GCN achieves 0.82-0.89 AUC-ROC
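- Accuracy Peak: GraphTransformer with content-derived features achieves the highest accuracy on graph_shqip (ne, 0.7068) and graph_shqip_rt (lda_top1k, 0.6889)
This suggests that content features may help with balanced predictions at a fixed decision threshold but rank edges above non-edges less reliably, which is what AUC measures.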
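The gap between the two metrics can be reproduced on toy data. The sketch below is illustrative only: the labels and scores are invented, the 0.5 threshold is an assumption (the report does not state which threshold was used), and `auc_roc` is a simple pairwise implementation rather than the evaluation code behind the reported numbers.

```python
def auc_roc(labels, scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count as 0.5."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def accuracy(labels, scores, threshold=0.5):
    """Share of correct predictions after thresholding the scores."""
    preds = [1 if s >= threshold else 0 for s in scores]
    return sum(p == y for p, y in zip(preds, labels)) / len(labels)


# Hypothetical edge labels: 1 = edge, 0 = non-edge.
labels = [1, 1, 1, 1, 0, 0, 0, 0]

# Model A ranks every edge above every non-edge, but all scores exceed 0.5.
scores_a = [0.90, 0.85, 0.80, 0.75, 0.70, 0.65, 0.60, 0.55]
# Model B misranks one pair, but its scores straddle the 0.5 threshold.
scores_b = [0.90, 0.80, 0.70, 0.30, 0.60, 0.20, 0.10, 0.05]

auc_a, acc_a = auc_roc(labels, scores_a), accuracy(labels, scores_a)
auc_b, acc_b = auc_roc(labels, scores_b), accuracy(labels, scores_b)

print(f"model A: AUC={auc_a:.4f} accuracy={acc_a:.4f}")
print(f"model B: AUC={auc_b:.4f} accuracy={acc_b:.4f}")
```

Model A has the better ranking (higher AUC) while Model B has the better thresholded accuracy. This is the same pattern as one-hot versus content features in the results, and it suggests checking score calibration or threshold choice before reading too much into accuracy alone.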
Interpretation
Why Does One-Hot Encoding Outperform Content Features?
1. Structural Dominance Over Content
Interpretation: Users appear to be forming connections based on network proximity and existing relationships, not on content overlap or topic similarity.
2. Homogeneous Topic Space (Context-Dependent)
Implications:
- Content features capture variations within a narrow topic, making semantic differentiation minimal
- Structural patterns (mutual followers, community ties) become more predictive than content
- Unlike general Twitter, there is limited content diversity to leverage for predictions
- This finding may not generalize to broader, multi-topic networks
3. Language and Dataset Scale Effects
- NLP Model Quality: Fewer and smaller pretrained models and embeddings are available for Albanian
- LDA Robustness: Topic models may be less accurate with smaller corpus and limited vocabulary
- Network Clustering: Limited dataset size means tighter clustering, amplifying structural patterns
- Content Sparsity: Not all users have sufficient Albanian tweets for robust feature extraction
4. Content Features as Noise
Sophisticated content features (LDA, embeddings) may introduce noise that degrades model performance:
- Curse of dimensionality: High-dimensional content features may scatter the signal
- Temporal mismatch: Tweet content is dynamic while network structure is more stable
- Sparsity: Not all users have sufficient content for robust feature extraction
- Redundancy: Content information may already be partially captured by network structure
5. Homophily vs. Content-Based Linking
Examples of mechanisms:
- Following friends of friends (triadic closure)
- Following users in the same community
- Following users with similar follower counts
- Mention-based interactions within echo chambers
6. The Role of "Influencers" and Content
The results suggest that:
- Users do not follow influencers primarily because of their content
- Content becomes important after an edge is formed (for engagement)
- Network position and visibility drive initial connections
- The "influencer effect" may be a consequence of network centrality, not content quality
Encoder Architecture Analysis
Best Performing Architectures by Context
| Metric | Best Encoder | Best Feature | Score | Dataset |
|---|---|---|---|---|
| AUC-ROC | GCN | one-hot | 0.8897 | graph_shqip_rt |
| Accuracy | GraphTransformer | ne | 0.7068 | graph_shqip |
Encoder Characteristics
- GCN (Graph Convolutional Networks): Best for AUC with simple features, excels with one-hot encoding
- GAT (Graph Attention Networks): Good balance across features, particularly with one-hot encoding
- GraphTransformer: Better for accuracy, leverages complex content features effectively
- GraphSAGE: Moderate performance across all metrics
Dataset-Specific Analysis
Revised Interpretation with Dataset Context
Limited Dataset Size Effects
The smaller dataset size has several impacts on the findings:
- Tighter Network Clustering: Smaller networks naturally form more cohesive communities
- Amplified Structural Signals: Structural patterns become more pronounced relative to content
- Content Signal Dilution: Fewer examples of diverse content reduce the effectiveness of content-based features
- Representation Quality: Embeddings and topic models are less robust with smaller training data
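The clustering claim above can be checked directly on each graph. The following is a minimal sketch of the average local clustering coefficient, assuming an undirected graph stored as a dictionary of neighbor sets. The toy graph and the function names are illustrative and not taken from the study.

```python
from itertools import combinations


def local_clustering(graph, node):
    """Fraction of a node's neighbor pairs that are themselves connected."""
    neighbors = graph[node]
    k = len(neighbors)
    if k < 2:
        return 0.0  # fewer than two neighbors: no pairs to close
    links = sum(1 for u, v in combinations(neighbors, 2) if v in graph[u])
    return 2.0 * links / (k * (k - 1))


def average_clustering(graph):
    """Mean of the local clustering coefficients over all nodes."""
    return sum(local_clustering(graph, n) for n in graph) / len(graph)


# Toy undirected graph: a triangle (a, b, c) with a pendant node d attached to c.
graph = {
    "a": {"b", "c"},
    "b": {"a", "c"},
    "c": {"a", "b", "d"},
    "d": {"c"},
}

avg = average_clustering(graph)
print(f"average clustering coefficient: {avg:.4f}")
```

Comparing this value across graph_shqip, graph_shqip_nsl, and graph_shqip_rt (and against a larger reference network, if one is available) would show whether the smaller graphs are in fact more tightly clustered.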
Specialized Topic Impact
- Content Homogeneity: Users discuss the same vaccines, side effects, and policies — minimal semantic diversity
- Temporal Events: Network formation spikes during vaccine rollouts, policy announcements, and outbreaks
- Echo Chambers: Users naturally segregate into pro-vaccine and vaccine-skeptical communities
- Reduced Content Variation: Unlike general Twitter, there's less content diversity to exploit for edge prediction
Albanian Language Considerations
Language-specific factors affect content feature quality:
- Fewer Pretrained Models: Smaller ecosystem of Albanian NLP tools compared to English
- LDA Topic Quality: Topic modeling with smaller corpora produces less stable and interpretable topics
- Embedding Coverage: Word embeddings may have lower coverage for Albanian vocabulary
- Morphological Complexity: Albanian inflections may increase sparsity in content features
Dataset-Specific Observations
graph_shqip (Full Dataset)
- AUC Leader: one-hot + GCN (0.8858)
- Accuracy Leader: ne + GraphTransformer (0.7068)
- Characteristic: Largest dataset with most balanced performance
- Insight: Shows that complex features can improve accuracy without improving AUC
graph_shqip_nsl (NSL-filtered)
- AUC Leader: one-hot + GCN (0.8207)
- Accuracy Leader: one-hot + GAT (0.6489)
- Characteristic: Content features perform worse here; structural signals dominate
- Insight: Filtering by network security labels amplifies structural importance
graph_shqip_rt (Retweet-based)
- AUC Leader: one-hot + GCN (0.8897)
- Accuracy Leader: lda_top1k + GraphTransformer (0.6889)
- Characteristic: Retweet networks show highest overall AUC scores
- Insight: Retweet behavior is more predictable from structure; content matters for selective retweets
Theoretical Implications
Generalizability Caveats
- Limited semantic diversity (vaccine/COVID-19 focus)
- Smaller network size
- Non-English language content
- Strong event-driven temporal dynamics
Social Network Formation in Specialized Networks
For vaccine/COVID-19 discussion networks, formation follows these mechanisms:
- Event-Driven Clustering: Users connect during vaccine rollouts or policy announcements
- Temporal Proximity: Engagement at similar times creates network bonds
- Community Polarization: Users segregate into ideological clusters (pro/anti-vaccine)
- Network Visibility: Users discover others through mutual followers and trending topics
Role of Content (Context-Dependent)
In this specialized network, content appears to play a secondary role:
- Within-Community Engagement: Content drives interaction frequency within established communities
- Community Identity: Content signals alignment with pro or anti-vaccine positions
- Not Primary Connector: Content does not predict edge formation directly in this homogeneous semantic space
- Potentially More Important in Diverse Networks: Multi-topic networks may show greater content influence
Interpretation Discussion
Question: Influence of Content in Social Media Analysis
Given that content plausibly influences behavior on social media, and that we are analyzing tweets, how should we interpret results in which one-hot encoding matches or outperforms the content features?
Detailed Response & Analysis
1. Structural Dominance Over Content
The superior performance of one-hot encoding suggests that network topology and node relationships are more predictive of edge formation than the actual tweet content. One-hot features carry no information about what a user writes, so any predictive power must come from the graph itself. In your network, who interacts with whom follows structural patterns rather than content similarity.
This is a significant finding because it challenges the intuitive assumption that people on Twitter follow others primarily because they like their content. Instead, the data indicate that structural factors dominate.
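To see why one-hot features amount to a purely structural representation, consider a single GCN layer. The sketch below uses NumPy and the standard symmetric normalization. The toy adjacency matrix, the embedding width of 2, and the random weights are assumptions for illustration, not the configuration used in the experiments.

```python
import numpy as np


def normalized_adjacency(adj):
    """Symmetric GCN normalization: D^{-1/2} (A + I) D^{-1/2}."""
    a_hat = adj + np.eye(adj.shape[0])  # add self-loops
    deg = a_hat.sum(axis=1)
    d_inv_sqrt = np.diag(1.0 / np.sqrt(deg))
    return d_inv_sqrt @ a_hat @ d_inv_sqrt


def gcn_layer(adj_norm, features, weight):
    """One GCN layer: ReLU(A_norm @ X @ W)."""
    return np.maximum(adj_norm @ features @ weight, 0.0)


# Toy undirected graph with 4 nodes (illustrative only).
adj = np.array(
    [[0, 1, 1, 0],
     [1, 0, 1, 0],
     [1, 1, 0, 1],
     [0, 0, 1, 0]],
    dtype=float,
)
num_nodes = adj.shape[0]

rng = np.random.default_rng(0)
weight = rng.normal(size=(num_nodes, 2))  # one learnable row per node
one_hot = np.eye(num_nodes)               # one-hot node identities

adj_norm = normalized_adjacency(adj)
out_one_hot = gcn_layer(adj_norm, one_hot, weight)

# With identity input, X @ W is just W: each node receives its own free
# embedding row, which is then averaged over its graph neighborhood.
out_lookup = np.maximum(adj_norm @ weight, 0.0)
print(np.allclose(out_one_hot, out_lookup))
```

Because the input is the identity matrix, the layer learns one free embedding per node and mixes it only through the adjacency structure. Nothing about tweet content enters the model, so its predictive power must come from the graph.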
2. Content Features as Noise
The content-derived features (lda_full, lda_top1k, and ne, the network embedding derived from content) might be introducing noise that hurts prediction rather than helping. This suggests several possibilities:
- Curse of Dimensionality: Content feature vectors are high-dimensional and may scatter the signal across too many dimensions
- Temporal Mismatch: Tweet content changes rapidly while network structure is relatively stable. A user's old tweets may not reflect current content preferences
- Sparsity Issues: Not all users have sufficient content for robust feature extraction, creating unreliable representations
- Redundancy: Content information may already be partially captured implicitly by network structure
3. Homophily: Structural vs. Semantic
Real-world mechanisms at play:
- Triadic Closure: Friends of friends becoming connected
- Community Effect: Users naturally join existing groups and communities
- Preferential Attachment: Following popular/visible users regardless of content
- Echo Chambers: Mention-based interactions within existing clusters
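The mechanisms listed above correspond to classic structure-only link prediction scores. As a rough illustration, the sketch below computes common neighbors and Jaccard similarity (proxies for triadic closure) and a preferential attachment score on a toy undirected graph. The graph and the function names are hypothetical, and these heuristics are not the GNN models evaluated in this analysis.

```python
def common_neighbors(graph, u, v):
    """Nodes adjacent to both u and v (candidates for triadic closure)."""
    return graph[u] & graph[v]


def jaccard(graph, u, v):
    """Shared neighbors divided by the combined neighborhood of u and v."""
    union = graph[u] | graph[v]
    return len(graph[u] & graph[v]) / len(union) if union else 0.0


def preferential_attachment(graph, u, v):
    """Product of degrees: high when both endpoints are already well connected."""
    return len(graph[u]) * len(graph[v])


# Build a toy undirected graph from an edge list.
edges = [("a", "b"), ("a", "c"), ("b", "c"), ("c", "d"), ("d", "e")]
graph = {}
for u, v in edges:
    graph.setdefault(u, set()).add(v)
    graph.setdefault(v, set()).add(u)

# Score two candidate (currently absent) edges.
for u, v in [("a", "d"), ("a", "e")]:
    print(
        u, v,
        sorted(common_neighbors(graph, u, v)),
        round(jaccard(graph, u, v), 3),
        preferential_attachment(graph, u, v),
    )
```

If scores like these already separate edges from non-edges well on the three datasets, that would support the reading that structure, not content, carries most of the signal.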
4. The Influencer Paradox
The results suggest:
- Users do not follow influencers primarily because of their content — they follow them because they're visible and central in the network
- Content becomes important AFTER edge formation — it drives engagement and interaction frequency, not connection creation
- Network position drives initial connections — visibility through network centrality, recommendations, or mentions
- The "influencer effect" may be a consequence of network centrality, not content quality — influencers are influential because they occupy central positions, not vice versa
5. Accuracy vs. AUC: A Tale of Two Metrics
Interestingly, content features perform relatively better on accuracy than AUC:
- AUC Excellence (One-Hot): 0.82-0.89 with one-hot encoding
- Accuracy Peak (Content): 0.68-0.70 with GraphTransformer + LDA features
This divergence indicates that content features help at a fixed decision threshold (true positives and true negatives are both predicted reasonably well) but separate edges from non-edges less cleanly across thresholds. In other words, the scores produced with content features rank edges above non-edges less reliably, which lowers AUC even when thresholded accuracy is comparable or higher.
6. Practical Implications for Your Study
Content features likely remain useful for:
- Explaining why certain edges are engaged with more than others
- Predicting interaction frequency and quality on existing edges
- Characterizing communities and identifying topic-based clusters
- Understanding how information flows through the network
Possible reasons why structure dominates edge formation:
- Twitter's recommendation algorithm likely emphasizes network proximity and mutual connections
- Discovery is driven by visibility (trending, mutual followers, mentions)
- Network effects and social proof create self-reinforcing connection patterns
- The cost of evaluating content for millions of potential connections is prohibitive
Conclusions & Recommendations
Main Conclusion
In this Twitter network analysis, network structure is substantially more predictive of edge formation than content features. One-hot encoding of node identities matches or outperforms sophisticated content-based features in edge prediction tasks (AUC-ROC: 0.82-0.89 vs. 0.54-0.84), suggesting that users connect based on proximity and structural patterns rather than content similarity.
Practical Recommendations
Future Research Directions
- Investigate temporal dynamics: Does content influence connection formation in newly formed networks?
- Analyze subpopulations: Do content features perform better for niche communities?
- Combine approaches: Ensemble methods using structure for confidence + content for explanation
- Cross-platform analysis: Do these patterns hold on other social networks (Facebook, Instagram, TikTok)?