Architecture Note: an NFL Reddit sentiment analysis pipeline that uses NLTK for text preprocessing (tokenization, non-ASCII removal, stopword removal, and lemmatization) to prepare comment data for sentiment classification models.
%%{init: {'theme': 'dark', 'themeVariables': { 'primaryColor': '#C17852', 'primaryTextColor': '#F0F6FC', 'primaryBorderColor': '#4A5E32', 'lineColor': '#E6C98F', 'secondaryColor': '#161B22', 'tertiaryColor': '#0D1117', 'background': '#0D1117', 'mainBkg': '#161B22', 'nodeBorder': '#4A5E32', 'clusterBkg': '#161B22', 'clusterBorder': '#4A5E32', 'titleColor': '#E6C98F', 'edgeLabelBackground': '#161B22'}}}%%
flowchart LR
subgraph Source["๐ฅ Data Source"]
CSV["๐ NFL_reddit_data.csv
8,353 comments"]
end
subgraph Load["โ๏ธ Pandas Loading"]
DF["๐ DataFrame Creation"]
PR["๐ Parse Columns"]
end
subgraph Extract["๐ง Data Extraction"]
PL["๐ Player Names"]
TX["๐ฌ Comment Text"]
MT["๐ Metadata
Score, Flair"]
end
subgraph Output["๐ค Raw Data"]
RAW["๐ Raw DataFrame"]
end
CSV --> DF
DF --> PR
PR --> PL
PR --> TX
PR --> MT
PL --> RAW
TX --> RAW
MT --> RAW
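The loading stage above maps to a few lines of pandas. A minimal sketch, using the file name from the diagram; the column names (`player`, `body`, `score`, `flair`) are hypothetical stand-ins for the actual CSV schema.

```python
import pandas as pd

# Load the scraped Reddit comments into a DataFrame.
df = pd.read_csv("NFL_reddit_data.csv")
print(df.shape)  # expected: (8353, n_columns)

# Pull out the fields the later stages rely on.
# Column names here are assumptions for illustration, not the confirmed schema.
players = df["player"]             # player names
comments = df["body"]              # raw comment text
metadata = df[["score", "flair"]]  # score and flair metadata
```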
%%{init: {'theme': 'dark', 'themeVariables': { 'primaryColor': '#C17852', 'primaryTextColor': '#F0F6FC', 'primaryBorderColor': '#4A5E32', 'lineColor': '#E6C98F', 'secondaryColor': '#161B22', 'tertiaryColor': '#0D1117', 'background': '#0D1117', 'mainBkg': '#161B22', 'nodeBorder': '#4A5E32', 'clusterBkg': '#161B22', 'clusterBorder': '#4A5E32', 'titleColor': '#E6C98F', 'edgeLabelBackground': '#161B22'}}}%%
flowchart LR
subgraph Input["๐ฅ Input"]
RAW["๐ฌ Raw Comment Text"]
end
subgraph NLTK["๐ง NLTK Processing"]
TOK["โ๏ธ word_tokenize()"]
SEG["๐ Sentence Segments"]
end
subgraph ASCII["๐ค Non-ASCII Removal"]
NRM["๐ unicodedata.normalize()
NFKD"]
ASC["โ
ASCII Only"]
end
subgraph Output["๐ค Output"]
TKS["๐ Token List"]
end
RAW --> TOK
TOK --> SEG
SEG --> NRM
NRM --> ASC
ASC --> TKS
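A sketch of this stage, pairing NLTK's `word_tokenize` with the NFKD decomposition named in the diagram; the `to_ascii` helper name is introduced here for illustration.

```python
import unicodedata
import nltk
from nltk.tokenize import word_tokenize

nltk.download("punkt", quiet=True)  # Punkt tokenizer models used by word_tokenize

def to_ascii(tokens):
    """NFKD-decompose each token, then drop any bytes outside ASCII."""
    cleaned = [
        unicodedata.normalize("NFKD", tok).encode("ascii", "ignore").decode("ascii")
        for tok in tokens
    ]
    return [tok for tok in cleaned if tok]  # drop tokens emptied by the filter

tokens = word_tokenize("Mahomes played great tonight 🏈")
print(to_ascii(tokens))  # ['Mahomes', 'played', 'great', 'tonight']
```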
%%{init: {'theme': 'dark', 'themeVariables': { 'primaryColor': '#C17852', 'primaryTextColor': '#F0F6FC', 'primaryBorderColor': '#4A5E32', 'lineColor': '#E6C98F', 'secondaryColor': '#161B22', 'tertiaryColor': '#0D1117', 'background': '#0D1117', 'mainBkg': '#161B22', 'nodeBorder': '#4A5E32', 'clusterBkg': '#161B22', 'clusterBorder': '#4A5E32', 'titleColor': '#E6C98F', 'edgeLabelBackground': '#161B22'}}}%%
flowchart TB
subgraph Input["๐ฅ Tokens"]
TK["๐ Token List"]
end
subgraph Lower["๐ค Lowercase"]
LC[".lower()"]
end
subgraph Punct["โ๏ธ Punctuation"]
RX["๐ Regex Filter"]
NP["๐ซ Remove Symbols"]
end
subgraph Stop["๐ซ Stopwords"]
SW["๐ NLTK Stopwords
English Corpus"]
FLT["๐ Filter Tokens"]
end
subgraph Output["๐ค Clean Tokens"]
CLN["โ
Normalized Words"]
end
TK --> LC
LC --> RX
RX --> NP
NP --> SW
SW --> FLT
FLT --> CLN
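One plausible implementation of the lowercase, punctuation, and stopword steps, assuming a simple keep-word-characters regex as the "Regex Filter"; `clean_tokens` is an illustrative name, not necessarily the project's.

```python
import re
import nltk
from nltk.corpus import stopwords

nltk.download("stopwords", quiet=True)
stop_words = set(stopwords.words("english"))  # the NLTK English stopword corpus

def clean_tokens(tokens):
    """Lowercase, strip punctuation and symbols, then filter stopwords."""
    lowered = [tok.lower() for tok in tokens]
    # One possible regex filter: keep word characters only.
    no_punct = [re.sub(r"[^\w]", "", tok) for tok in lowered]
    return [tok for tok in no_punct if tok and tok not in stop_words]

print(clean_tokens(["The", "Chiefs", "are", "UNSTOPPABLE", "!!!"]))
# ['chiefs', 'unstoppable']
```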
%%{init: {'theme': 'dark', 'themeVariables': { 'primaryColor': '#C17852', 'primaryTextColor': '#F0F6FC', 'primaryBorderColor': '#4A5E32', 'lineColor': '#E6C98F', 'secondaryColor': '#161B22', 'tertiaryColor': '#0D1117', 'background': '#0D1117', 'mainBkg': '#161B22', 'nodeBorder': '#4A5E32', 'clusterBkg': '#161B22', 'clusterBorder': '#4A5E32', 'titleColor': '#E6C98F', 'edgeLabelBackground': '#161B22'}}}%%
flowchart LR
subgraph Input["๐ฅ Clean Tokens"]
CT["๐ Normalized Words"]
end
subgraph Lemma["๐ง WordNet Lemmatizer"]
WNL["๐ WordNetLemmatizer()"]
POS["๐ท๏ธ POS='verb'"]
LEM[".lemmatize()"]
end
subgraph Examples["๐ Transformations"]
E1["running โ run"]
E2["celebrating โ celebrate"]
E3["played โ play"]
end
subgraph Output["๐ค Base Forms"]
BF["โ
Lemmatized Tokens"]
end
CT --> WNL
WNL --> POS
POS --> LEM
LEM --> E1
LEM --> E2
LEM --> E3
E1 --> BF
E2 --> BF
E3 --> BF
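NLTK's `WordNetLemmatizer` takes the part of speech as a single-letter tag, so treating every token as a verb means passing `pos='v'`. That call reproduces the transformations shown in the diagram:

```python
import nltk
from nltk.stem import WordNetLemmatizer

nltk.download("wordnet", quiet=True)
lemmatizer = WordNetLemmatizer()

# pos='v' tells WordNet to treat each token as a verb, which is what
# collapses inflected forms like "running" down to their base form.
for word in ["running", "celebrating", "played"]:
    print(word, "->", lemmatizer.lemmatize(word, pos="v"))
# running -> run
# celebrating -> celebrate
# played -> play
```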
%%{init: {'theme': 'dark', 'themeVariables': { 'primaryColor': '#C17852', 'primaryTextColor': '#F0F6FC', 'primaryBorderColor': '#4A5E32', 'lineColor': '#E6C98F', 'secondaryColor': '#161B22', 'tertiaryColor': '#0D1117', 'background': '#0D1117', 'mainBkg': '#161B22', 'nodeBorder': '#4A5E32', 'clusterBkg': '#161B22', 'clusterBorder': '#4A5E32', 'titleColor': '#E6C98F', 'edgeLabelBackground': '#161B22'}}}%%
flowchart TB
subgraph Source["๐ฅ Data Source"]
CSV["๐ NFL Reddit CSV
8,353 Comments"]
end
subgraph Load["โ๏ธ Loading"]
PD["๐ผ Pandas DataFrame"]
end
subgraph Pipeline["๐ง normalize() Function"]
T1["1๏ธโฃ Tokenize"]
T2["2๏ธโฃ Remove Non-ASCII"]
T3["3๏ธโฃ Lowercase"]
T4["4๏ธโฃ Remove Punctuation"]
T5["5๏ธโฃ Remove Stopwords"]
T6["6๏ธโฃ Lemmatize"]
end
subgraph Apply["๐ DataFrame Apply"]
AP[".astype(str).apply(normalize)"]
CL["clean_text column"]
end
subgraph Ready["๐ค ML Ready"]
FV["๐ฏ Feature Vectors"]
SA["๐ Sentiment Analysis"]
TM["๐ Topic Modeling"]
end
CSV --> PD
PD --> T1
T1 --> T2
T2 --> T3
T3 --> T4
T4 --> T5
T5 --> T6
T6 --> AP
AP --> CL
CL --> FV
CL --> SA
CL --> TM
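End to end, the `normalize()` function and the `.astype(str).apply(normalize)` step from the diagram could be assembled as below. This is a self-contained sketch: the six numbered steps match the diagram, while the `body` column name is an assumption about the CSV schema.

```python
import re
import unicodedata
import nltk
import pandas as pd
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize

for pkg in ("punkt", "stopwords", "wordnet"):
    nltk.download(pkg, quiet=True)

stop_words = set(stopwords.words("english"))
lemmatizer = WordNetLemmatizer()

def normalize(text):
    """The six-step pipeline from the diagram, applied to one comment."""
    tokens = word_tokenize(text)                                # 1. tokenize
    tokens = [                                                  # 2. remove non-ASCII
        unicodedata.normalize("NFKD", t).encode("ascii", "ignore").decode("ascii")
        for t in tokens
    ]
    tokens = [t.lower() for t in tokens]                        # 3. lowercase
    tokens = [re.sub(r"[^\w]", "", t) for t in tokens]          # 4. remove punctuation
    tokens = [t for t in tokens if t and t not in stop_words]   # 5. remove stopwords
    return [lemmatizer.lemmatize(t, pos="v") for t in tokens]   # 6. lemmatize as verbs

df = pd.read_csv("NFL_reddit_data.csv")
# 'body' is an assumed column name for the raw comment text.
df["clean_text"] = df["body"].astype(str).apply(normalize)
```

The resulting `clean_text` column of lemmatized token lists is what downstream feature extraction, sentiment analysis, and topic modeling would consume.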