News Sentiment Analysis: Lexicon vs ML Trade-offs

Scoring financial news as positive, negative, or neutral is the standard way to build event-driven factors. The two mainstream approaches, lexicon-based and ML/LLM-based, each come with trade-offs. This post covers when to use which.

Lexicon-based

Build a financial-domain sentiment dictionary (e.g., 32 positive + 28 negative terms), then score each news article by weighted word frequency:

sentiment = Σ(POS hit × w_pos) - Σ(NEG hit × w_neg)
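The formula above can be sketched in a few lines of Python. This is a minimal illustration, not a production scorer: the terms, weights, and the 0.5 neutral threshold are made-up assumptions, and the regex tokenizer only handles English (Chinese text would need a word segmenter first).

```python
import re

# Hypothetical mini-lexicon (term -> weight). A real dictionary would be
# larger and tuned to the target market.
POSITIVE = {"beat": 1.0, "growth": 1.0, "upgrade": 1.5}
NEGATIVE = {"loss": 1.0, "downgrade": 1.5, "lawsuit": 2.0}


def lexicon_score(text: str) -> float:
    """Return sum(POS hit * w_pos) - sum(NEG hit * w_neg) for one article."""
    tokens = re.findall(r"[a-z]+", text.lower())
    pos = sum(POSITIVE.get(t, 0.0) for t in tokens)
    neg = sum(NEGATIVE.get(t, 0.0) for t in tokens)
    return pos - neg


def label(score: float, threshold: float = 0.5) -> str:
    """Map a raw score to a label; the threshold is an assumed cut-off."""
    if score > threshold:
        return "positive"
    if score < -threshold:
        return "negative"
    return "neutral"
```

Each repeated hit counts again, so the score grows with word frequency, as in the formula.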

Pros:
- Transparent, auditable, negligible latency, no inference cost
- Easy to customize per domain (A-shares have lots of local financial vocabulary: 庄股 / 高送转 / 借壳)
- No training data needed, so cold-start is instant

Cons:
- Misses contextual reversal ("not a loss" vs "loss is severe")
- Misses metaphor, contrast, negation
- Accuracy ceiling is limited
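The contextual-reversal problem is easy to reproduce. In the sketch below, a deliberately naive bag-of-words scorer with a hypothetical one-word negative lexicon gives both example phrases the same score, even though their meanings are opposite.

```python
NEGATIVE_TERMS = {"loss"}  # hypothetical one-word lexicon for illustration


def naive_score(text: str) -> int:
    """Subtract 1 for every negative term, ignoring all surrounding context."""
    tokens = text.lower().split()
    return -sum(1 for t in tokens if t in NEGATIVE_TERMS)


reassuring = naive_score("not a loss")
alarming = naive_score("loss is severe")
# Both phrases contain exactly one lexicon hit, so the scorer cannot
# tell the reassuring phrase from the alarming one.
same_score = reassuring == alarming
```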

ML / LLM-based

Use a pretrained language model (Chinese BERT, Llama, Qwen) for direct classification, or fine-tune a domain-specific model.

Pros:
- Significantly higher accuracy, especially on complex syntax
- Recognizes context, negation, contrast
- Can do finer tasks (entity extraction, relation classification)

Cons:
- High compute cost (LLM inference takes on the order of seconds per item; batch scoring is expensive)
- Poor explainability (the model cannot readily answer "why is this negative?")
- Needs labeled data or a commercial API
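As a rough sketch of the "direct classification" route, the snippet below builds a prompt and parses a one-word reply. The `complete` argument is a hypothetical stand-in for whatever model call is available (a local model or a commercial API); the prompt wording and the fallback to "neutral" are assumptions, not a recommended recipe.

```python
from typing import Callable

LABELS = ("positive", "negative", "neutral")


def build_prompt(headline: str) -> str:
    """Assemble a one-word classification prompt (wording is illustrative)."""
    return (
        "Classify the sentiment of this financial news headline as "
        "positive, negative, or neutral. Answer with one word.\n"
        f"Headline: {headline}\nSentiment:"
    )


def classify(headline: str, complete: Callable[[str], str]) -> str:
    """Call the model via `complete` and normalize its reply to a label."""
    reply = complete(build_prompt(headline)).strip().lower()
    for candidate in LABELS:
        if reply.startswith(candidate):
            return candidate
    # Assumed fallback when the reply cannot be parsed.
    return "neutral"
```

Keeping the model behind a plain callable makes it easy to swap providers or plug in a stub for testing.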

Selection in practice

| Use case | Recommendation |
| --- | --- |
| Real-time large-volume scoring (thousands per minute) | Lexicon + tuned keyword library |
| Critical news deep parse (earnings, announcements) | LLM secondary read |
| Research summary generation | LLM (GPT/Claude/Qwen) AI summary |
| Explainable + auditable (advisor compliance) | Lexicon-dominant + LLM sampling |
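One way to combine these recommendations is a small router: score everything with the lexicon, send only critical categories to the LLM for a second read, and draw a random sample for LLM spot checks. The sketch below assumes articles are dicts with `text` and `category` keys and that both scorers are passed in as callables; those names and the category labels are hypothetical.

```python
import random
from typing import Callable

CRITICAL_CATEGORIES = {"earnings", "announcement"}  # assumed category labels


def score_article(
    article: dict,
    lexicon_fn: Callable[[str], float],
    llm_fn: Callable[[str], float],
) -> dict:
    """Lexicon score for every article; LLM second read for critical ones."""
    lex = lexicon_fn(article["text"])
    if article.get("category") in CRITICAL_CATEGORIES:
        return {"method": "llm", "score": llm_fn(article["text"]), "lexicon_score": lex}
    return {"method": "lexicon", "score": lex}


def pick_audit_sample(articles: list, rate: float, seed: int = 0) -> list:
    """Pick roughly `rate` of the articles for an LLM spot check."""
    rng = random.Random(seed)  # seeded so the audit sample is reproducible
    return [a for a in articles if rng.random() < rate]
```

The lexicon score is kept alongside the LLM score for critical items so the two can be compared during an audit.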

Data layer prerequisites

Regardless of method, the prerequisite is a clean news stream plus entity binding (which news item is about which stock or concept). ReachRich handles this layer, providing raw text, sentiment scoring, and AI summaries; the user picks which scoring to use.
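To make "entity binding" concrete, here is a deliberately simple sketch that maps alias mentions in a headline to tickers. It is not ReachRich's implementation: the alias table, the ticker suffix format, and the substring matching are all illustrative assumptions, and a real pipeline would need disambiguation and concept tags as well.

```python
# Hypothetical alias table: lowercase surface form -> ticker.
ALIASES = {
    "kweichow moutai": "600519.SH",
    "moutai": "600519.SH",
    "ping an bank": "000001.SZ",
}


def bind_entities(text: str) -> list[str]:
    """Return the sorted, de-duplicated tickers whose aliases appear in text."""
    lowered = text.lower()
    found = {ticker for alias, ticker in ALIASES.items() if alias in lowered}
    return sorted(found)
```

Once each article carries its tickers, any of the scoring methods above can be aggregated per stock.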