FNSPID: A Comprehensive Financial News Dataset in Time Series

https://arxiv.org/pdf/2402.06698


0. Abstract

Background in financial market prediction

  • [Traditional] Reliance on quantitative factors
  • [Recent] + Sentiment data from financial news (feat. LLM)


Limitations of Existing Approaches

  • (1) Lack of large-scale datasets
  • (2) Insufficient alignment of numerical & sentiment data


[Proposed] FNSPID

  • tldr; Financial News & Stock Price Integration Dataset
  • Large-scale & time-aligned financial data
    • [TS] 29.7M stock price
    • [Text] 15.7M financial news
  • Coverage
    • 4,775 S&P 500 companies
    • Time span: 1999–2023
    • Sources: 4 stock market news websites


Key Properties

  • Larger scale & Higher diversity
  • Explicit inclusion of sentiment information


Findings

  • a) Dataset size & quality \(\rightarrow\) Improve prediction accuracy
  • b) Sentiment scores \(\rightarrow\) Yield modest gains for transformer-based models


1. Introduction

TS regression

  • Fundamental to financial forecasting
  • For both traditional finance & AI-based financial modeling


Classical financial models

  • (e.g., Fama–French Three-Factor Model, Arbitrage Pricing Theory)
  • Rely on linear regression and historical data
  • Limitation: Struggle to anticipate market extremes & unprecedented events (e.g., financial crises)


ML/DL methods

  • Outperforming traditional models
  • Benefit from integrating stock prices with news sentiment


Modern portfolio theory

  • Emphasizes market correlations
  • Recent studies: Show strong links between sentiment data & stock trends


LLMs in finance

  • Multimodal approaches
    • Combine numerical & textual data \(\rightarrow\) improve accuracy
  • Significantly improve sentiment analysis
  • Limitation:
    • Information loss from sentiment-only representations
    • Lack of large, integrated datasets


Limitation of Existing financial news datasets

  • Lack sufficient scale
  • Lack aligned stock price data
  • Unstructured news formats hinder their use in seq2seq TS modeling


Proposal: Financial News and Stock Price Integration Dataset (FNSPID)

  • Uniquely aligns TS news & Stock prices

  • Contribution

    • (1) Integrating textual & numerical data

    • (2) Experiments using FNSPID
      • a) Larger datasets

      • b) High-quality sentiment information

        \(\rightarrow\) Enhances forecasting accuracy

    • (3) Facilitates research in …
      • Sentiment analysis
      • LLM fine-tuning …


2. Related Works

(1) Evolution of Financial Analysis Models

a) Classical factor models

  • (e.g., Fama–French Three-Factor Model, Arbitrage Pricing Theory)
  • Incorporate factors such as market risk, size, and value
  • Strength & Limitation
    • [Strength] Effective for long-term asset pricing analysis
    • [Limitation] Lack granularity for short-term forecasting (e.g., price peaks and troughs)


b) Statistical TS models

  • (e.g., ARIMA, GARCH)
  • Widely used for market trend & volatility analysis
  • [Limitation] Sensitive to investor subjectivity and behavioral bias


c) ML/DL in finance

  • Improves market entry and exit timing
  • Evolution:
    • Classical ML: Linear Regression, SVM
    • Advanced ML/DL: LSTM, RNN, Deep Q-learning
  • Reinforcement learning shows promise for trading strategies


d) Data-driven ML/DL advantages

  • Leverages heterogeneous data sources
    • Real-time news
    • Social media sentiment
    • Economic indicators
  • [Strength]
    • Capture non-linear patterns
    • Adapt to changing market conditions


e) Role of financial news and sentiment

  • Financial news: Strongly influences market movements
  • GARCH-based studies: Highlight impact during financial crises
  • Sentiment + ML integration:
    • Enhances prediction accuracy
    • Applied to volatility prediction and relational market analysis
  • Limitation of prior work:
    • Lack of real-time data
    • Limited use of detailed financial metrics
    • Neglect of individual investor behavior


f) Pre-trained language models and LLMs

  • (e.g., GPT-3.5, FinGPT)
  • Used for:
    • Generating financial news
    • Sentiment-based stock prediction
  • Outperform traditional sentiment analysis methods
  • Demonstrated strong sentiment–price correlation


g) LLMs for TS forecasting

  • Backbone LLMs show strong potential in TS prediction
  • Limitation: Scarcity of large, high-quality financial datasets
  • Indicates untapped capability of LLMs for financial market applications


(2) Existing Stock Dataset



a) Existing stock market datasets

  • DL-based sentiment analysis: Performance strongly depends on data volume and quality
  • Growing research focus on integrating news sentiment + stock prices


b) Early sentiment-focused datasets

  • Lutz dataset
    • Binary, sentence-level sentiment (positive / negative)
    • Text-only financial news
    • Limitation: No company-level financial data
  • Cortis dataset
    • Fine-grained sentiment for financial news and microblogs
    • Includes sentiment scores and lexical/semantic features
    • Limitations:
      • Very small scale (1,142 headlines)
      • Proprietary sentiment scoring method


c) Time-series–oriented datasets

  • Farimani dataset
    • Integrates latent economic concepts, news sentiment, and technical indicators
    • All data structured as time series
    • Limitation: Sentiment mainly tied to FX news, limited trading depth
  • SEntFiN 1.0 (Sinha et al.)
    • Entity-level sentiment annotations
    • Rich financial entity coverage
    • Limitations:
      • Missing timestamps
      • Headline-only text
      • Insufficient scale for robust training


d) Large-scale news datasets

  • Philippe dataset (Bloomberg & Reuters)
    • Large financial news time series
    • Limitation: No entity-level sentiment labels, raw news only
  • Finnhub dataset
    • Time-aligned stock prices and related news via API
    • Limitation:
      • Proprietary access
      • No built-in sentiment labels
      • Lacks systematic quality evaluation


e) Recent hybrid approaches

  • Combine numerical market data with social media text
  • Use deep reinforcement learning for stock prediction
  • Introduce dynamic datasets for model evaluation
  • Limitation: Often dataset-specific and not openly accessible


f) FNSPID dataset

  • Covers 1999–2023
  • Multilingual news (English & Russian)
  • Time-aligned financial news and stock prices
  • Designed for:
    • Sentiment analysis
    • Stock price prediction
  • Data quality:
    • Sourced exclusively from trusted financial platforms (e.g., NASDAQ)
    • Mitigates fake news concerns
    • Emphasizes dataset reliability and integrity


g) Open-access challenge

  • Many high-quality datasets and models (e.g., FinChat, BloombergGPT)
    • Restricted or proprietary
    • Limited availability for academic research
  • FNSPID objective
    • Provide a comprehensive, open, and accessible dataset
    • Enable broader and more inclusive research in financial modeling


3. Constructing FNSPID

Curated integration of numerical stock data + sentiment data


Construction divided into three tasks:

  • Task 1: “Full sentiment + numerical” dataset
  • Task 2: “Summarized” sentiment dataset
  • Task 3: “Quantified” sentiment dataset


a) Data sources

  • Numerical data
    • Stock prices collected via Yahoo Finance API
  • Sentiment / news data
    • Explored multiple financial news sources
      • e.g., Bloomberg, Reuters, Yahoo Finance, Forbes, CNBC
    • Due to usage restrictions, primary collection from NASDAQ



b) NASDAQ data collection pipeline

  • Two-stage process:
    • Step 1) Collect headlines and URLs for each stock using Selenium
    • Step 2) Extract full news content from URLs
  • Ensures structured textual data aligned with stock-level information
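
As a rough illustration of what stock-level, date-level alignment could look like, the sketch below attaches each day's headlines to the matching price row. The field names (`ticker`, `date`, `close`, `headline`), the `align_news_to_prices` helper, and the toy values are all assumptions made for illustration; they are not the actual FNSPID schema or pipeline.

```python
from collections import defaultdict

def align_news_to_prices(price_rows, news_rows):
    """Attach the headlines published on each trading day to that day's price row."""
    # Group headlines by (ticker, date) so each price row can look them up directly.
    news_by_key = defaultdict(list)
    for item in news_rows:
        news_by_key[(item["ticker"], item["date"])].append(item["headline"])

    aligned = []
    for row in price_rows:
        key = (row["ticker"], row["date"])
        # Days without news keep an empty list instead of being dropped.
        aligned.append({**row, "headlines": news_by_key.get(key, [])})
    return aligned

# Toy values for a made-up ticker; not real market data.
prices = [
    {"ticker": "XYZ", "date": "2023-01-03", "close": 100.0},
    {"ticker": "XYZ", "date": "2023-01-04", "close": 101.5},
]
news = [
    {"ticker": "XYZ", "date": "2023-01-03", "headline": "Example headline A"},
    {"ticker": "XYZ", "date": "2023-01-03", "headline": "Example headline B"},
]
aligned = align_news_to_prices(prices, news)
```

Keeping no-news days in the output (with an empty list) preserves the regular daily grid that a time-series model expects.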


c) Data diversity and integrity

  • To reduce source bias and improve coverage:
    • Integrated previously processed datasets from:
      • Bloomberg
      • Reuters
      • Benzinga
      • Lenta
  • Combined with NASDAQ data to form [FNSPID Task 1]


d) Data ethics and compliance

  • Strict adherence to robots.txt and website usage policies
  • Only collected content that is:
    • Freely accessible
    • Not behind paywalls or subscriptions
  • Web scraping used only when no official API was available
  • Licenses of prior datasets verified before integration


(1) Data mining and processing


a) Data & Goal

  • Raw data: Numerical prices, URLs, news headlines, full news text
  • Goal:
    • (1) Reduce text length
    • (2) Preserve sentiment-relevant information


b) News summarization

  • Motivation:
    • Address token limits
    • Improve practicality of downstream sentiment analysis
  • Applied 4 rule-based summarization methods:
    • LexRank / Luhn / Latent Semantic Analysis (LSA) / TextRank
    • via the Sumy Python package



c) Stock-aware summarization

  • Introduced a weight model \(W_f\)
    • Emphasizes content relevant to the associated stock
  • Summary length fixed to 3 sentences
    • 1/8 of original article
    • Balances conciseness and specificity
  • Benefits:
    • Significant reduction in token usage
    • Improved ChatGPT prompt stability
  • Completion of [FNSPID Task 2]
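
The notes do not spell out how \(W_f\) is defined, so the following is only a sketch of the general idea: score sentences extractively and up-weight those that mention the stock. The word-frequency scorer, the `stock_weight` multiplier, and the `summarize` helper are hypothetical stand-ins; this is neither the paper's weight model nor any of the Sumy algorithms.

```python
import re
from collections import Counter

def summarize(text, stock_terms, n_sentences=3, stock_weight=2.0):
    """Return the top-scoring sentences, boosting those that mention the stock."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    freq = Counter(re.findall(r"[a-z']+", text.lower()))
    terms = {t.lower() for t in stock_terms}

    def score(sentence):
        tokens = re.findall(r"[a-z']+", sentence.lower())
        if not tokens:
            return 0.0
        base = sum(freq[t] for t in tokens) / len(tokens)  # mean word frequency
        # Hypothetical stand-in for W_f: a fixed multiplier for stock mentions.
        boost = stock_weight if terms & set(tokens) else 1.0
        return base * boost

    ranked = sorted(range(len(sentences)), key=lambda i: score(sentences[i]), reverse=True)
    keep = sorted(ranked[:n_sentences])  # restore original article order
    return [sentences[i] for i in keep]

article = (
    "Markets were mixed on Monday. Acme shares rose after earnings. "
    "Oil prices fell slightly. Analysts raised their Acme price target. "
    "Weather was mild. Acme also announced a buyback."
)
summary = summarize(article, ["Acme"])
```

With this toy article, the three selected sentences are the ones that mention the made-up company "Acme", returned in their original order.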


d) Sentiment quantification

  • Limitation:
    • Early LLMs and TS-DL models: Struggle with raw language understanding
    • Large models (e.g., ChatGPT): Computationally expensive
  • Observation:
    • DL models can effectively utilize numerical sentiment signals


e) Sentiment labeling strategy

  • Constructed a labeled subset:
    • News from 50 major S&P 500 stocks
  • Used ChatGPT to generate sentiment labels:
    • Avoids costly manual annotation
    • Outperforms traditional sentiment algorithms
  • Input to ChatGPT
    • LSA-based summaries for concise yet informative prompts
  • Prompt configuration:
    • Up to 10 entries per prompt
    • Temperature = 0 for output stability


f) Sentiment scoring scheme

  • Adopted a 1–5 discrete scale:
    • 1: Negative
    • 2: Somewhat negative
    • 3: Neutral
    • 4: Somewhat positive
    • 5: Positive
  • Rationale:
    • More stable than alternatives (e.g., -1 to 1, 1 to 10)
    • Decimal scores lead to unstable representations
  • Sentiment distribution:
    • Approximately normal



g) Normalization and integration

  • Sentiment scores: Normalized consistently with other features

    \(\rightarrow\) Ensures sentiment contributes meaningfully (w/o dominating model training)
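
The notes do not say which normalization is used. Assuming min-max scaling to \([0, 1]\), one way to put the 1–5 sentiment score on the same footing as price features is sketched below; `min_max_scale` and the toy values are illustrative only.

```python
def min_max_scale(values, lo=None, hi=None):
    """Scale values to [0, 1]; lo/hi default to the observed min and max."""
    lo = min(values) if lo is None else lo
    hi = max(values) if hi is None else hi
    if hi == lo:
        return [0.0 for _ in values]  # constant feature: avoid division by zero
    return [(v - lo) / (hi - lo) for v in values]

# Sentiment uses the fixed scale bounds 1 and 5 rather than the observed range.
sentiment = [1, 3, 5, 4]
scaled_sentiment = min_max_scale(sentiment, lo=1, hi=5)

# Toy closing prices, scaled by their own observed range.
closes = [100.0, 110.0, 105.0]
scaled_closes = min_max_scale(closes)
```

Using the fixed bounds 1 and 5 for sentiment keeps a neutral score at 0.5 regardless of which scores happen to appear in a given window.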


h) Handling missing sentiment data

  • Issue: Days without news articles
  • Solution:
    • Exponential decay toward neutral sentiment
  • Formula:
    • \[S(t) = 3 + (S(0) - 3)\cdot e^{-\lambda t}\]
  • Configuration:
    • Neutral target = 3
    • Decay rate \(\lambda = 0.03\)
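
A minimal sketch of this fill rule, using the formula and configuration above. It assumes \(t\) counts days since the most recent news day and that days before any news default to neutral; both choices, and the helper names, are assumptions rather than details confirmed by the notes.

```python
import math

NEUTRAL = 3.0      # neutral target from the configuration above
DECAY_RATE = 0.03  # lambda from the configuration above

def decayed_sentiment(last_score, days_since_news, neutral=NEUTRAL, decay_rate=DECAY_RATE):
    """S(t) = neutral + (S(0) - neutral) * exp(-decay_rate * t)."""
    return neutral + (last_score - neutral) * math.exp(-decay_rate * days_since_news)

def fill_missing(daily_scores):
    """Replace None entries by decaying the last observed score toward neutral."""
    filled, last, gap = [], None, 0
    for score in daily_scores:
        if score is not None:
            last, gap = float(score), 0
            filled.append(last)
        elif last is None:
            filled.append(NEUTRAL)  # assumption: no earlier news means neutral
        else:
            gap += 1
            filled.append(decayed_sentiment(last, gap))
    return filled

filled = fill_missing([None, 5, None, None, 1])
```

With \(\lambda = 0.03\), a score of 5 decays to about 4.48 after ten news-free days, so the last observation fades slowly rather than snapping back to neutral.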


i) Daily aggregation

  • Multiple news articles on the same day
    • Use average sentiment score
  • Motivation:
    • Balanced representation of daily market sentiment
    • Reduces bias from extreme individual news items
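
The daily averaging step can be sketched as below, assuming each record is a `(date, score)` pair for a single stock; the record layout and the `daily_mean_sentiment` helper are hypothetical.

```python
from collections import defaultdict
from statistics import mean

def daily_mean_sentiment(records):
    """Average all sentiment scores that share the same date."""
    by_day = defaultdict(list)
    for date, score in records:
        by_day[date].append(score)
    # Sort by date so the result follows chronological order.
    return {date: mean(scores) for date, scores in sorted(by_day.items())}

records = [("2023-01-03", 5), ("2023-01-03", 2), ("2023-01-04", 3)]
daily = daily_mean_sentiment(records)
```

Here the two articles on the first day (scores 5 and 2) collapse into a single daily value of 3.5.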


4. FNSPID Property

a) Dataset overview

  • Large-scale and diverse dataset:
    • 30+ GB of processed data
  • [TS] Numerical price data (Table 2)
  • [Text] Sentiment-related data (Figure 3)
    • URLs
    • News headlines
    • Full news text
    • Sentiment scores
    • Summaries from four methods


b) Data richness and diversity

  • Multi-modal structure:
    • Numerical prices
    • Textual news (+ quantified sentiment)


c) Computational effort

  • For dataset construction…
    • ~4 TB of computing resources
    • ~45 days of processing time


d) Labeled sentiment subset

  • Focused analysis on: Top 50 influential S&P 500 stocks (2024)
  • Result: 402,546 news articles with sentiment labels


4-1. Evaluation

Evaluation of FNSPID

  • Assesses linguistic, semantic, and temporal properties



Language distribution

  • Proportions of English and Russian content
  • Highlights the multilingual nature of FNSPID


News article segmentation

  • (1) Stock-symbol–referenced news
  • (2) Non–stock-symbol news


Temporal distribution

  • Date: 1999 to 2023
  • Identifies long-term trends and fluctuations in financial news coverage



Overall assessment

  • a) Large scale
  • b) Multilingual coverage
  • c) Long temporal span

  • Well-suited for:

    • Financial sentiment analysis
    • Time-series forecasting


5. Experiment

  • Effectiveness of FNSPID in stock price prediction
  • Focuses on how the quantity of news data influences model performance


5-1. Quantity Test

Goal

  • Analyze the impact of training data size on short-term stock price prediction
  • Use FNSPID Task 3 as the experimental benchmark


Input information

  • (1) Numerical features
    • Open price
    • Close price
    • Trading volume
  • (2) Sentiment features
    • Included or excluded depending on the experiment setting


Models

  • Traditional DL methods: RNN / LSTM / GRU / CNN
  • TS–oriented models: 4-layer Vanilla Transformer / 4-layer TimesNet


Experimental setup

  • Input window: 50 days
  • Prediction horizon: 3 days
  • Training epochs: 100
  • Training dataset sizes
    • 5 stocks (n = 11,277)
    • 25 stocks (n = 43,192)
    • 50 stocks (n = 127,937)
  • Evaluation
    • Conducted on 5 stocks
    • One outlier removed
    • Final results reported as averaged values
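
The 50-day input / 3-day horizon setup implies a sliding-window construction of training samples. A minimal sketch, assuming a stride of one day and a single series per stock (neither is stated in the notes; `make_windows` is a hypothetical helper):

```python
def make_windows(series, input_len=50, horizon=3):
    """Build (input, target) pairs: `input_len` past steps, `horizon` future steps."""
    samples = []
    for start in range(len(series) - input_len - horizon + 1):
        x = series[start:start + input_len]
        y = series[start + input_len:start + input_len + horizon]
        samples.append((x, y))
    return samples

# A 60-step toy series yields 60 - 50 - 3 + 1 = 8 samples.
series = list(range(60))
samples = make_windows(series)
```

In practice each element of `series` would be a feature vector (open, close, volume, and optionally sentiment) rather than a single number.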


Test results

  • Evaluation metric: \(R^2\)

  • Experiment settings

    • A-Sen: Sentiment info (O)
    • A-Non: Sentiment info (X)
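
For reference, \(R^2\) in its standard form is \(1 - SS_{res} / SS_{tot}\). The sketch below computes it in plain Python; whether the paper averages it per stock or per horizon step is not stated in the notes, so this shows only the textbook definition.

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))  # residual sum of squares
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)          # total sum of squares
    return 1.0 - ss_res / ss_tot

perfect = r_squared([1, 2, 3], [1, 2, 3])   # predictions match exactly
baseline = r_squared([1, 2, 3], [2, 2, 2])  # always predicting the mean
```

A perfect prediction scores 1, while always predicting the mean scores 0, which is why values close to 1 leave little headroom for sentiment features to add.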


Key findings

  • Average \(R^2\) improvement of 6.29% when increasing training data from 5 to 25 stocks across all models


Observations

  • Transformer-based models > Recurrent models
  • Larger training datasets significantly enhance prediction accuracy
  • Small datasets limit performance in financial forecasting tasks


Conclusion

  • Robustness and applicability of FNSPID
  • Importance of data scale in financial TS prediction
  • Effectiveness of Transformer-based architectures for stock price forecasting


5-2. Quality Test

Goal

  • Evaluate the impact of sentiment quality on model training performance
  • Compare sentiment from FNSPID Task 3 (ChatGPT labels generated from the Task 2 summaries) with TextBlob sentiment


Experimental setup

  • Sentiment sources
    • FNSPID Task 3
      • ChatGPT-labeled sentiment
    • TextBlob sentiment
      • Based on rule-based scoring & lightweight NLP models
  • Benchmark
    • Sentiment information from FNSPID Task 3


Overall findings

  • Sentiment quality has a decisive impact on forecasting accuracy
  • Results in Table 3
    • Part A: FNSPID Task 3 sentiment improves prediction accuracy
    • Part B: TextBlob sentiment degrades model performance


Impact on model performance

  • Transformer-based comparison
    • FNSPID Task 3 sentiment
      • Accuracy improvement of +0.2% over non-sentiment input
    • TextBlob sentiment
      • Accuracy degradation of −1.16%


Sentiment effectiveness across models

  • Transformer: Consistently benefits from high-quality sentiment information
  • TimesNet: Occasionally exhibits positive gains
  • RNN, LSTM, GRU, CNN
    • Treat sentiment features as noise
    • Show no consistent performance improvement


Dataset scale effects

  • Small-scale training

    • LSTM outperforms Transformer when trained on only 5 stocks
  • Large-scale training

    • Transformer exhibits substantial accuracy gains as data volume increases

Discussion

  • Hyperparameter alignment

    • Uniform hyperparameter settings are used for fair comparison
    • This constraint may suppress the optimal performance of individual models
  • Sentiment representation

    • Sentiment scores mapped onto a 5-level scale
    • Paragraph-level sentiment compression may discard fine-grained information
    • Information loss contributes to limited sentiment effectiveness
  • Interpretation of limited gains

    • Financial news is known to influence stock prices

    • Marginal improvements arise due to

      • Already high baseline prediction accuracy
      • Delayed market reactions to news dissemination

Conclusion

  • Key takeaways from the Quality Test

    • Dataset quality and quantity jointly determine forecasting performance
    • High-quality sentiment information benefits Transformer-based models
    • Transformer-based architectures outperform traditional time-series models and recent alternatives such as TimesNet in stock price prediction
