Scoring financial news as positive, negative, or neutral is a standard step in constructing event-driven factors. The two mainstream approaches, lexicon-based and ML/LLM-based, each come with trade-offs. This post covers when to use which.
The lexicon-based approach: build a financial-domain sentiment dictionary (e.g., 32 positive and 28 negative terms), then score each news article by weighted word frequency:
sentiment = Σ(POS hit × w_pos) - Σ(NEG hit × w_neg)
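The formula above can be sketched in a few lines of Python. This is a minimal illustration, not a production scorer: the two mini-lexicons, their weights, the polarity assigned to each term, and the function name `lexicon_sentiment` are all assumptions made for the example. It counts substring hits, which sidesteps Chinese word segmentation but can over-match English terms.

```python
from typing import Dict

# Hypothetical mini-lexicons (term -> weight). Terms, weights, and polarities
# are illustrative assumptions; a real dictionary would be larger and curated.
POS_LEXICON: Dict[str, float] = {"beat": 1.0, "surge": 1.5, "upgrade": 1.0}
NEG_LEXICON: Dict[str, float] = {"loss": 1.0, "downgrade": 1.0, "probe": 1.5, "庄股": 1.5}


def lexicon_sentiment(text: str) -> float:
    """Weighted positive hits minus weighted negative hits."""
    lowered = text.lower()
    # Substring counting: a simplification that avoids tokenization.
    pos = sum(lowered.count(term) * w for term, w in POS_LEXICON.items())
    neg = sum(lowered.count(term) * w for term, w in NEG_LEXICON.items())
    return pos - neg


print(lexicon_sentiment("Shares surge after earnings beat"))
print(lexicon_sentiment("Regulator opens probe; analysts downgrade"))
```

Because the scorer only counts hits, a headline such as "not a loss" still scores -1.0: the negation is invisible to it.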
Pros:

- Transparent, auditable, zero latency, zero cost
- Easy to customize per domain (A-shares have lots of local financial vocabulary: 庄股 / 高送转 / 借壳)
- No training data needed: instant cold-start

Cons:

- Misses contextual reversal ("not a loss" vs "loss is severe")
- Misses metaphor, contrast, negation
- Accuracy ceiling is limited

The ML/LLM approach: use a pretrained language model (Chinese BERT, Llama, Qwen) for direct classification, or fine-tune a domain-specific model.
Pros:

- Significantly higher accuracy, especially on complex syntax
- Recognizes context, negation, contrast
- Can handle finer-grained tasks (entity extraction, relation classification)

Cons:

- High compute cost (LLM inference takes on the order of seconds per article, and batch scoring is expensive)
- Poor explainability (the answer to "why is this negative?" is opaque)
- Needs labeled data or a commercial API

| Use case | Recommendation |
|---|---|
| Real-time large-volume scoring (thousands per minute) | Lexicon + tuned keyword library |
| Deep parsing of critical news (earnings, announcements) | LLM second-pass read |
| Research summary generation | LLM-generated summary (GPT/Claude/Qwen) |
| Explainable + auditable (advisor compliance) | Lexicon-dominant + LLM sampling |
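One way to combine the rows of the table is a simple router: score everything with the lexicon, and escalate only critical categories to an LLM. The sketch below assumes a hypothetical `NewsItem` type, hypothetical category labels, and pluggable scorer callables. It does not call any real model API.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Tuple


@dataclass
class NewsItem:
    text: str
    category: str  # hypothetical labels, e.g. "earnings", "announcement", "general"


# Categories treated as critical enough for an LLM second pass (assumption).
CRITICAL_CATEGORIES = {"earnings", "announcement"}

Scorer = Callable[[str], float]


def route(item: NewsItem, lexicon_score: Scorer,
          llm_score: Optional[Scorer] = None) -> Tuple[str, float]:
    """Return (method_used, score) for one news item."""
    if item.category in CRITICAL_CATEGORIES and llm_score is not None:
        # Slow, costly path reserved for the small share of critical news.
        return "llm", llm_score(item.text)
    # Fast, auditable default path for high-volume traffic.
    return "lexicon", lexicon_score(item.text)
```

Returning the method name alongside the score keeps the audit trail explicit, which matters for the compliance use case.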
Regardless of method, the prerequisite is a clean news stream plus entity binding (knowing which news item is about which stock or concept). ReachRich handles this layer, providing raw text, sentiment scoring, and AI summaries; the user picks which scoring to use.
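To make entity binding concrete, here is a minimal alias-matching sketch. It is an illustration only and says nothing about how ReachRich implements this layer: the `ALIASES` table and the `bind_entities` function are hypothetical, and real systems typically need disambiguation beyond substring matching.

```python
from typing import Dict, List

# Hypothetical alias table: ticker -> names that may appear in news text.
ALIASES: Dict[str, List[str]] = {
    "600519.SH": ["贵州茅台", "Kweichow Moutai", "Moutai"],
    "000001.SZ": ["平安银行", "Ping An Bank"],
}


def bind_entities(text: str) -> List[str]:
    """Return the sorted tickers whose aliases appear in the text."""
    lowered = text.lower()
    return sorted(
        ticker
        for ticker, names in ALIASES.items()
        if any(name.lower() in lowered for name in names)
    )
```

An empty result signals that the article could not be tied to any tracked stock, in which case a sentiment score has nothing to attach to.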