
2 Squirrels AI

Technology Stack

NLP Engine: NLTK
Data Processing: Pandas + NumPy
Data Source: Reddit NFL Comments
Language: Python 3
Format: Jupyter Notebook

Architecture Note: An NFL Reddit sentiment analysis pipeline that uses NLTK for text preprocessing (tokenization, non-ASCII cleanup, stopword removal, and lemmatization) to prepare comment data for sentiment classification models.
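The NLTK components referenced throughout this pipeline depend on a few one-time corpus downloads. A minimal setup sketch; the exact resource set the notebook fetches isn't shown, and newer NLTK releases may also require punkt_tab:

import nltk

# One-time resource downloads used by the steps below.
nltk.download("punkt")      # tokenizer models for word_tokenize
nltk.download("stopwords")  # English stopword list
nltk.download("wordnet")    # lexical database behind WordNetLemmatizer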

NLP Processing Pipeline

1. 📥 Data Ingestion: CSV Loading
2. ✂️ Tokenizer: NLTK word_tokenize
3. 🧹 Cleaner: ASCII & Punctuation
4. 🚫 Stopword Filter: English Corpus
5. 🔤 Lemmatizer: WordNet Base Forms
6. 📊 Output: Clean Text Corpus

Data Ingestion Pipeline

%%{init: {'theme': 'dark', 'themeVariables': { 'primaryColor': '#C17852', 'primaryTextColor': '#F0F6FC', 'primaryBorderColor': '#4A5E32', 'lineColor': '#E6C98F', 'secondaryColor': '#161B22', 'tertiaryColor': '#0D1117', 'background': '#0D1117', 'mainBkg': '#161B22', 'nodeBorder': '#4A5E32', 'clusterBkg': '#161B22', 'clusterBorder': '#4A5E32', 'titleColor': '#E6C98F', 'edgeLabelBackground': '#161B22'}}}%%
flowchart LR
    subgraph Source["📥 Data Source"]
        CSV["📄 NFL_reddit_data.csv
8,353 comments"]
    end
    subgraph Load["⚙️ Pandas Loading"]
        DF["📊 DataFrame Creation"]
        PR["📋 Parse Columns"]
    end
    subgraph Extract["🔧 Data Extraction"]
        PL["🏈 Player Names"]
        TX["💬 Comment Text"]
        MT["📋 Metadata
Score, Flair"]
    end
    subgraph Output["📤 Raw Data"]
        RAW["📊 Raw DataFrame"]
    end
    CSV --> DF
    DF --> PR
    PR --> PL
    PR --> TX
    PR --> MT
    PL --> RAW
    TX --> RAW
    MT --> RAW
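A minimal loading sketch matching the diagram. The file name comes from the diagram itself; the column names (player, body, score, flair) are hypothetical, since the actual CSV schema isn't shown:

import pandas as pd

df = pd.read_csv("NFL_reddit_data.csv")   # 8,353 comments per the diagram

# Hypothetical column names; substitute whatever the real CSV uses.
players  = df["player"]            # player names
comments = df["body"]              # raw comment text
metadata = df[["score", "flair"]]  # score and flair metadata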

Text Tokenization

%%{init: {'theme': 'dark', 'themeVariables': { 'primaryColor': '#C17852', 'primaryTextColor': '#F0F6FC', 'primaryBorderColor': '#4A5E32', 'lineColor': '#E6C98F', 'secondaryColor': '#161B22', 'tertiaryColor': '#0D1117', 'background': '#0D1117', 'mainBkg': '#161B22', 'nodeBorder': '#4A5E32', 'clusterBkg': '#161B22', 'clusterBorder': '#4A5E32', 'titleColor': '#E6C98F', 'edgeLabelBackground': '#161B22'}}}%%
flowchart LR
    subgraph Input["📥 Input"]
        RAW["💬 Raw Comment Text"]
    end
    subgraph NLTK["🔧 NLTK Processing"]
        TOK["✂️ word_tokenize()"]
        SEG["📝 Sentence Segments"]
    end
    subgraph ASCII["🔤 Non-ASCII Removal"]
        NRM["📝 unicodedata.normalize()
NFKD"]
        ASC["✅ ASCII Only"]
    end
    subgraph Output["📤 Output"]
        TKS["📋 Token List"]
    end
    RAW --> TOK
    TOK --> SEG
    SEG --> NRM
    NRM --> ASC
    ASC --> TKS
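The first two stages sketched in Python. word_tokenize is NLTK's standard tokenizer, and the NFKD encode/decode round trip is the usual idiom for stripping accents and other non-ASCII characters; the notebook's helpers may be structured differently:

import unicodedata
from nltk.tokenize import word_tokenize

def tokenize(text):
    """Split raw comment text into word tokens."""
    return word_tokenize(text)

def remove_non_ascii(tokens):
    """Decompose each token (NFKD), then drop anything outside ASCII."""
    return [
        unicodedata.normalize("NFKD", tok).encode("ascii", "ignore").decode("ascii")
        for tok in tokens
    ]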

Text Normalization

%%{init: {'theme': 'dark', 'themeVariables': { 'primaryColor': '#C17852', 'primaryTextColor': '#F0F6FC', 'primaryBorderColor': '#4A5E32', 'lineColor': '#E6C98F', 'secondaryColor': '#161B22', 'tertiaryColor': '#0D1117', 'background': '#0D1117', 'mainBkg': '#161B22', 'nodeBorder': '#4A5E32', 'clusterBkg': '#161B22', 'clusterBorder': '#4A5E32', 'titleColor': '#E6C98F', 'edgeLabelBackground': '#161B22'}}}%%
flowchart TB
    subgraph Input["📥 Tokens"]
        TK["📋 Token List"]
    end
    subgraph Lower["🔤 Lowercase"]
        LC[".lower()"]
    end
    subgraph Punct["✂️ Punctuation"]
        RX["📝 Regex Filter"]
        NP["🚫 Remove Symbols"]
    end
    subgraph Stop["🚫 Stopwords"]
        SW["📚 NLTK Stopwords
English Corpus"]
        FLT["🔍 Filter Tokens"]
    end
    subgraph Output["📤 Clean Tokens"]
        CLN["✅ Normalized Words"]
    end
    TK --> LC
    LC --> RX
    RX --> NP
    NP --> SW
    SW --> FLT
    FLT --> CLN
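A sketch of the three normalization stages. The regex below is one reasonable reading of the "Regex Filter" node (keep word characters, drop symbol-only tokens); the notebook's exact pattern may differ:

import re
from nltk.corpus import stopwords

STOPWORDS = set(stopwords.words("english"))

def to_lowercase(tokens):
    return [tok.lower() for tok in tokens]

def remove_punctuation(tokens):
    # Strip non-word, non-space characters; drop tokens left empty.
    stripped = (re.sub(r"[^\w\s]", "", tok) for tok in tokens)
    return [tok for tok in stripped if tok]

def remove_stopwords(tokens):
    return [tok for tok in tokens if tok not in STOPWORDS]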

Lemmatization Pipeline

%%{init: {'theme': 'dark', 'themeVariables': { 'primaryColor': '#C17852', 'primaryTextColor': '#F0F6FC', 'primaryBorderColor': '#4A5E32', 'lineColor': '#E6C98F', 'secondaryColor': '#161B22', 'tertiaryColor': '#0D1117', 'background': '#0D1117', 'mainBkg': '#161B22', 'nodeBorder': '#4A5E32', 'clusterBkg': '#161B22', 'clusterBorder': '#4A5E32', 'titleColor': '#E6C98F', 'edgeLabelBackground': '#161B22'}}}%%
flowchart LR
    subgraph Input["📥 Clean Tokens"]
        CT["📋 Normalized Words"]
    end
    subgraph Lemma["🔧 WordNet Lemmatizer"]
        WNL["📚 WordNetLemmatizer()"]
        POS["🏷️ POS='verb'"]
        LEM[".lemmatize()"]
    end
    subgraph Examples["📝 Transformations"]
        E1["running → run"]
        E2["celebrating → celebrate"]
        E3["played → play"]
    end
    subgraph Output["📤 Base Forms"]
        BF["✅ Lemmatized Tokens"]
    end
    end
    CT --> WNL
    WNL --> POS
    POS --> LEM
    LEM --> E1
    LEM --> E2
    LEM --> E3
    E1 --> BF
    E2 --> BF
    E3 --> BF
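In NLTK the diagram's 'verb' setting is spelled pos="v". A short sketch reproducing the example transformations above:

from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()

def lemmatize_verbs(tokens):
    """Reduce each token to its verb base form."""
    return [lemmatizer.lemmatize(tok, pos="v") for tok in tokens]

lemmatize_verbs(["running", "celebrating", "played"])
# -> ['run', 'celebrate', 'play']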

Complete NLP Pipeline

%%{init: {'theme': 'dark', 'themeVariables': { 'primaryColor': '#C17852', 'primaryTextColor': '#F0F6FC', 'primaryBorderColor': '#4A5E32', 'lineColor': '#E6C98F', 'secondaryColor': '#161B22', 'tertiaryColor': '#0D1117', 'background': '#0D1117', 'mainBkg': '#161B22', 'nodeBorder': '#4A5E32', 'clusterBkg': '#161B22', 'clusterBorder': '#4A5E32', 'titleColor': '#E6C98F', 'edgeLabelBackground': '#161B22'}}}%%
flowchart TB
    subgraph Source["📥 Data Source"]
        CSV["📄 NFL Reddit CSV
8,353 Comments"]
    end
    subgraph Load["⚙️ Loading"]
        PD["🐼 Pandas DataFrame"]
    end
    subgraph Pipeline["🔧 normalize() Function"]
        T1["1️⃣ Tokenize"]
        T2["2️⃣ Remove Non-ASCII"]
        T3["3️⃣ Lowercase"]
        T4["4️⃣ Remove Punctuation"]
        T5["5️⃣ Remove Stopwords"]
        T6["6️⃣ Lemmatize"]
    end
    subgraph Apply["📊 DataFrame Apply"]
        AP[".astype(str).apply(normalize)"]
        CL["clean_text column"]
    end
    subgraph Ready["📤 ML Ready"]
        FV["🎯 Feature Vectors"]
        SA["📊 Sentiment Analysis"]
        TM["📝 Topic Modeling"]
    end
    CSV --> PD
    PD --> T1
    T1 --> T2
    T2 --> T3
    T3 --> T4
    T4 --> T5
    T5 --> T6
    T6 --> AP
    AP --> CL
    CL --> FV
    CL --> SA
    CL --> TM
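Chaining the helpers from the sketches above into one normalize() function and applying it column-wise, as the diagram's .astype(str).apply(normalize) step indicates. The "body" column name is the same assumption as in the loading sketch, and the notebook may keep token lists rather than joining them back into strings:

def normalize(text):
    """Run one raw comment through the full six-step chain."""
    tokens = tokenize(text)
    tokens = remove_non_ascii(tokens)
    tokens = to_lowercase(tokens)
    tokens = remove_punctuation(tokens)
    tokens = remove_stopwords(tokens)
    tokens = lemmatize_verbs(tokens)
    return " ".join(tokens)

df["clean_text"] = df["body"].astype(str).apply(normalize)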