NLTK (Natural Language Toolkit) vs LLM


NLTK (Natural Language Toolkit) is a leading platform for building Python programs that work with human language data. It offers easy-to-use interfaces to over 50 corpora and lexical resources, along with a suite of text processing libraries.

Scenarios

- Customer feedback analysis: Sentiment analysis (LLM: YES)
- News article processing: Named Entity Recognition (LLM: YES)
- Grammar checking tools: Part-of-speech tagging (LLM: YES)
- Search engine optimization: Tokenization for keyword extraction (LLM: NO)
- Travel apps: Basic language translation (LLM: YES)
- Text summarization: Extractive method (LLM: NO)
- Spam email detection: Simple rule-based filtering (LLM: NO)
- Chatbot development: Pattern matching responses (LLM: NO)
- Academic text analysis: Frequency distribution of words (LLM: NO)
- Social media trend analysis: Hashtag and mention extraction (LLM: NO)
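Several of the scenarios marked LLM: NO come down to tokenizing text and counting tokens. As a rough illustration, the sketch below combines two of them, hashtag and mention extraction and a frequency distribution, using NLTK's `TweetTokenizer` and `FreqDist`, neither of which needs a data download. The function name `extract_tags_and_mentions` and the sample posts are hypothetical and exist only for this example.

```python
from nltk import FreqDist
from nltk.tokenize import TweetTokenizer


def extract_tags_and_mentions(posts):
    """Count hashtags and @mentions across a list of post strings.

    Hypothetical helper for illustration; not part of NLTK itself.
    """
    # preserve_case=False lowercases tokens so "#NLTK" and "#nltk" count together.
    tokenizer = TweetTokenizer(preserve_case=False)
    hashtags, mentions = FreqDist(), FreqDist()
    for post in posts:
        for token in tokenizer.tokenize(post):
            if token.startswith('#') and len(token) > 1:
                hashtags[token] += 1
            elif token.startswith('@') and len(token) > 1:
                mentions[token] += 1
    return hashtags, mentions


# Made-up sample posts, used only to exercise the helper.
sample_posts = [
    "Trying out #NLTK for tokenization, thanks @nltk_org!",
    "Word counts with #nltk and #python are quick to set up.",
]
hashtags, mentions = extract_tags_and_mentions(sample_posts)
print(hashtags.most_common(2))
print(mentions.most_common(1))
```

`FreqDist` behaves like a counter, so `most_common(n)` returns the n most frequent tokens with their counts. A tweet-aware tokenizer is used here because it keeps `#tag` and `@name` as single tokens rather than splitting off the leading symbol.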

Example: Text summarization: Extractive method

import nltk
from nltk.corpus import stopwords
from nltk.tokenize import sent_tokenize, word_tokenize

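# Fetch the tokenizer models and the stopword list used below.
nltk.download('punkt')
nltk.download('punkt_tab')  # newer NLTK releases load this instead of 'punkt'
nltk.download('stopwords')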

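def summarize(text, num_sentences=3):
    """Return the num_sentences highest-scoring sentences, kept in original order."""
    sentences = sent_tokenize(text)
    # Build the stopword set once; calling stopwords.words() for every token is slow.
    stop_words = set(stopwords.words('english'))

    # Count content words only: skip stopwords and pure-punctuation tokens.
    word_freq = {}
    for word in word_tokenize(text.lower()):
        if word not in stop_words and any(ch.isalnum() for ch in word):
            word_freq[word] = word_freq.get(word, 0) + 1

    # Score each sentence by the summed frequency of the counted words it contains.
    # Scores are keyed by sentence position so duplicate sentences stay distinct.
    sentence_scores = {}
    for index, sentence in enumerate(sentences):
        tokens = word_tokenize(sentence.lower())
        sentence_scores[index] = sum(word_freq.get(word, 0) for word in tokens)

    # Pick the top-scoring sentences, then restore document order for readability.
    top = sorted(sentence_scores, key=sentence_scores.get, reverse=True)[:num_sentences]
    return ' '.join(sentences[i] for i in sorted(top))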

# Example usage
text = "NLTK is a leading platform for building Python programs to work with human language data. It provides easy-to-use interfaces to over 50 corpora and lexical resources such as WordNet, along with a suite of text processing libraries for classification, tokenization, stemming, tagging, parsing, and semantic reasoning. Thanks to a hands-on guide introducing programming fundamentals alongside topics in computational linguistics, plus comprehensive API documentation, NLTK is suitable for linguists, engineers, students, educators, researchers, and industry users alike. NLTK is available for Windows, Mac OS X, and Linux. Best of all, NLTK is a free, open source, community-driven project."

print(summarize(text))

Created on 10/4/2024