The Academy has not banned AI from the Oscars. It has defined what it means to be the author of a film.

Learn how to build an AI authorship detection system that can distinguish between human-authored and AI-generated content in scripts, using natural language processing techniques.

Introduction

The Academy has clarified that AI cannot be credited as an author or performer in films: acting roles must be performed by humans, and screenplays must be human-authored. The rule change reflects growing concern about AI's role in creative industries. In this tutorial, we'll build a small classifier that estimates whether a piece of script text was written by a human or generated by AI, using natural language processing techniques. The same approach is a starting point for tools that help verify human authorship in creative works.

Prerequisites

Python 3.7 or higher
Basic understanding of NLP concepts
Knowledge of machine learning concepts
Installed libraries: nltk, scikit-learn, pandas, numpy

Step-by-step instructions

Step 1: Setting up the Environment

Install Required Libraries

We need several Python libraries to analyze text patterns and detect human authorship. The NLTK library provides natural language processing tools, while scikit-learn offers machine learning capabilities for classification.

pip install nltk scikit-learn pandas numpy

Download NLTK Data

Before we can perform text analysis, we need to download the NLTK resources used in this tutorial: the tokenizer models, the stop word list, the part-of-speech tagger, and the VADER sentiment lexicon used in Step 5.

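import nltk

nltk.download('punkt')
nltk.download('stopwords')
nltk.download('averaged_perceptron_tagger')
nltk.download('vader_lexicon')
# Recent NLTK releases (3.9 and later) look up these resource names instead
nltk.download('punkt_tab')
nltk.download('averaged_perceptron_tagger_eng')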

Step 2: Creating the Authorship Detection Framework

Define Text Preprocessing Functions

Before analyzing text, we need to clean and prepare it for feature extraction. This includes converting to lowercase, stripping punctuation and digits, tokenizing, and removing stop words.

import re
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import sent_tokenize, word_tokenize

stop_words = set(stopwords.words('english'))

def preprocess_text(text):
    # Convert to lowercase
    text = text.lower()
    # Remove special characters and digits
    text = re.sub(r'[^a-zA-Z\s]', '', text)
    # Tokenize
    tokens = word_tokenize(text)
    # Remove stopwords
    tokens = [token for token in tokens if token not in stop_words]
    return tokens
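
Note that the regular expression in preprocess_text deletes characters rather than replacing them with spaces, so contractions are fused and alphanumeric tokens are truncated. A minimal sketch using only the standard library shows the effect before tokenization; the sample sentence is made up for illustration:

```python
import re

sample = "Don't stop at 9pm!"  # hypothetical input, not from the tutorial data

# Same cleaning steps as preprocess_text: lowercase, then drop non-letters
cleaned = re.sub(r'[^a-zA-Z\s]', '', sample.lower())
print(cleaned)  # dont stop at pm
```

Whether this matters depends on the feature set. For dialogue-heavy scripts, where contractions are common, it may be worth keeping apostrophes or substituting a space instead of an empty string.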

Extract Linguistic Features

To distinguish between human and AI-generated text, we'll extract several linguistic features that may differ between the two. These include average sentence length, lexical diversity (the share of distinct words), average word length, and part-of-speech counts.

def extract_features(text):
    tokens = preprocess_text(text)
    sentences = sent_tokenize(text)
    
    # Calculate average sentence length, measured in preprocessed (non-stopword) tokens
    avg_sentence_length = len(tokens) / len(sentences) if sentences else 0
    
    # Calculate lexical diversity (unique words / total words)
    lexical_diversity = len(set(tokens)) / len(tokens) if tokens else 0
    
    # Calculate average word length
    avg_word_length = sum(len(token) for token in tokens) / len(tokens) if tokens else 0
    
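    # Tag parts of speech. Tagging runs on lowercased, stopword-filtered
    # tokens, so the tags are approximate.
    pos_tags = nltk.pos_tag(tokens)
    
    # Count nouns, verbs, adjectives, adverbs (raw counts, not length-normalized)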
    noun_count = len([tag for tag in pos_tags if tag[1].startswith('NN')])
    verb_count = len([tag for tag in pos_tags if tag[1].startswith('VB')])
    adj_count = len([tag for tag in pos_tags if tag[1].startswith('JJ')])
    adv_count = len([tag for tag in pos_tags if tag[1].startswith('RB')])
    
    return {
        'avg_sentence_length': avg_sentence_length,
        'lexical_diversity': lexical_diversity,
        'avg_word_length': avg_word_length,
        'noun_count': noun_count,
        'verb_count': verb_count,
        'adj_count': adj_count,
        'adv_count': adv_count
    }
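
Because the part-of-speech features are raw counts, they grow with the length of the text, which can let the classifier learn length rather than style. One possible adjustment, sketched here with a hypothetical helper named pos_ratios that is not part of the original framework, is to divide each count by the number of tokens:

```python
def pos_ratios(pos_counts, total_tokens):
    """Convert raw POS counts into per-token ratios.

    pos_counts: dict such as {'noun_count': 4, 'verb_count': 2}
    total_tokens: number of tokens the counts were computed over
    Hypothetical helper for illustration only.
    """
    ratios = {}
    for name, count in pos_counts.items():
        ratio_name = name.replace('_count', '_ratio')
        # Guard against empty texts to avoid division by zero
        ratios[ratio_name] = count / total_tokens if total_tokens else 0.0
    return ratios


example = pos_ratios({'noun_count': 2, 'verb_count': 1}, total_tokens=4)
print(example)  # {'noun_ratio': 0.5, 'verb_ratio': 0.25}
```

To use it with extract_features, the token count would need to be returned alongside the counts, for example as an extra dictionary entry.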

Step 3: Building the Classification Model

Create Training Data

To train our model, we need a dataset of both human-authored and AI-generated texts. For demonstration purposes, we'll create a handful of synthetic examples with illustrative labels, but in practice you would gather a much larger set of real texts whose origin is known.

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

# Create sample dataset
sample_data = [
    {'text': 'The movie was absolutely fantastic and the acting was superb.', 'author': 'human'},
    {'text': 'This film demonstrates exceptional storytelling and remarkable cinematography.', 'author': 'human'},
    {'text': 'The plot was well developed and the characters were compelling.', 'author': 'human'},
    {'text': 'The film features advanced visual effects and innovative narrative techniques.', 'author': 'ai'},
    {'text': 'This cinematic masterpiece showcases cutting-edge production methods.', 'author': 'ai'},
    {'text': 'The storyline was engaging and the performances were outstanding.', 'author': 'human'},
    {'text': 'The movie incorporates state-of-the-art special effects and modern filmmaking.', 'author': 'ai'},
    {'text': 'The director demonstrated exceptional creative vision in this work.', 'author': 'human'}
]

# Convert to DataFrame
df = pd.DataFrame(sample_data)

# Extract features for all texts
features_list = []
for text in df['text']:
    features = extract_features(text)
    features_list.append(features)

# Create feature DataFrame
features_df = pd.DataFrame(features_list)
features_df['author'] = df['author']
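
Before training, it helps to check how the labels are distributed and what accuracy a trivial classifier would reach. The following sketch uses only the standard library; the labels list is copied by hand from sample_data, and the variable names are illustrative:

```python
import math
from collections import Counter

# Labels copied from sample_data, in order
labels = ['human', 'human', 'human', 'ai', 'ai', 'human', 'ai', 'human']

counts = Counter(labels)
majority_label, majority_count = counts.most_common(1)[0]

# Accuracy of always predicting the most frequent class
baseline_accuracy = majority_count / len(labels)

# scikit-learn rounds the test-set size up for a fractional test_size
n_test = math.ceil(0.2 * len(labels))

print(counts)             # Counter({'human': 5, 'ai': 3})
print(baseline_accuracy)  # 0.625
print(n_test)             # 2
```

With a 20% hold-out there are only two test texts, so the reported accuracy can only be 0.0, 0.5, or 1.0, and a model is useful only if it reliably beats the 0.625 majority-class baseline.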

Train the Model

Now we'll train a machine learning model to classify texts as human or AI-authored based on the extracted features.

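# Prepare training data
X = features_df.drop('author', axis=1)
y = features_df['author']

# Split data (stratify keeps both classes in the train and test sets;
# with 8 samples the test set holds only 2 texts)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)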

# Train model
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

# Make predictions
y_pred = model.predict(X_test)

# Calculate accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f'Model Accuracy: {accuracy:.2f}')
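
A single split of a small dataset gives a noisy accuracy estimate. A common alternative is stratified k-fold cross-validation, sketched below with scikit-learn. X_demo and y_demo are randomly generated placeholders, not the tutorial's features, so the scores themselves carry no meaning; the point is the evaluation pattern:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Placeholder data (assumption): 30 texts, 4 numeric features each
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(30, 4))
y_demo = np.array(['human'] * 15 + ['ai'] * 15)

# Each fold keeps the human/ai ratio of the full dataset
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
clf = RandomForestClassifier(n_estimators=50, random_state=42)

scores = cross_val_score(clf, X_demo, y_demo, cv=cv, scoring='accuracy')
print(f'Fold accuracies: {scores.round(2)}')
print(f'Mean accuracy: {scores.mean():.2f}')
```

With a real feature table, X and y from the step above would replace the placeholders, provided each class has at least as many samples as there are folds.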

Step 4: Implementing the Detection System

Create a Prediction Function

With our trained model, we can now create a function that analyzes new script text and predicts whether it's likely human-authored.

def predict_authorship(text):
    # Extract features from the input text
    features = extract_features(text)
    
    # Convert to a one-row DataFrame so column names match the training data
    input_df = pd.DataFrame([features])
    
    # Make prediction (predict_proba columns follow model.classes_ order)
    prediction = model.predict(input_df)[0]
    probability = model.predict_proba(input_df)[0]
    
    # Return results
    return {
        'prediction': prediction,
        'confidence': max(probability),
        'features': features
    }
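
The single confidence value hides how the probability is spread across the classes. scikit-learn classifiers expose the class order through the classes_ attribute, which matches the column order of predict_proba. A small sketch of pairing the two follows; label_probabilities is a hypothetical helper, and the numbers are made up rather than produced by the model:

```python
def label_probabilities(classes, probabilities):
    """Pair each class label with its predicted probability.

    classes: sequence of labels, e.g. model.classes_
    probabilities: one row of model.predict_proba output
    Hypothetical helper for illustration only.
    """
    return {str(label): float(p) for label, p in zip(classes, probabilities)}


# Made-up values standing in for model.classes_ and a predict_proba row
per_class = label_probabilities(['ai', 'human'], [0.35, 0.65])
print(per_class)  # {'ai': 0.35, 'human': 0.65}
```

Inside predict_authorship, the call would look like label_probabilities(model.classes_, probability), and the result could be returned in place of, or alongside, the confidence entry.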

Test the System

Let's test our system with some sample texts to see how it performs.

# Test with human-authored text
human_text = "The actor delivered a powerful performance that captured the essence of the character."
result = predict_authorship(human_text)
print(f'Text: {human_text}')
print(f'Prediction: {result["prediction"]} (Confidence: {result["confidence"]:.2f})')

# Test with AI-generated text
ai_text = "The cinematic narrative employs advanced algorithms to enhance viewer engagement."
result = predict_authorship(ai_text)
print(f'Text: {ai_text}')
print(f'Prediction: {result["prediction"]} (Confidence: {result["confidence"]:.2f})')

Step 5: Improving Accuracy

Feature Engineering

To improve accuracy, we can add more features, such as sentiment scores and a simple readability proxy. To use them, rebuild the feature table with enhanced_extract_features and retrain the model.

from nltk.sentiment import SentimentIntensityAnalyzer

# Initialize sentiment analyzer
sia = SentimentIntensityAnalyzer()

def enhanced_extract_features(text):
    # Get base features
    base_features = extract_features(text)
    
    # Add sentiment scores
    sentiment = sia.polarity_scores(text)
    base_features['compound_sentiment'] = sentiment['compound']
    base_features['positive_sentiment'] = sentiment['pos']
    base_features['negative_sentiment'] = sentiment['neg']
    base_features['neutral_sentiment'] = sentiment['neu']
    
    # Add a crude readability proxy: average words per sentence
    words = len(word_tokenize(text))
    sentences = len(sent_tokenize(text))
    base_features['readability'] = words / sentences if sentences else 0
    
    return base_features
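
Words per sentence is only a rough stand-in for readability. An established formula is the Flesch Reading Ease score, which also accounts for syllables per word. The sketch below uses only the standard library; the syllable counter is a crude vowel-group heuristic (an assumption, not a dictionary lookup), and both function names are illustrative:

```python
import re


def count_syllables(word):
    """Approximate syllables as runs of consecutive vowels (minimum 1)."""
    groups = re.findall(r'[aeiouy]+', word.lower())
    return max(1, len(groups))


def flesch_reading_ease(text):
    """Flesch Reading Ease: higher scores mean easier text."""
    sentences = [s for s in re.split(r'[.!?]+', text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    if not sentences or not words:
        return 0.0
    syllables = sum(count_syllables(w) for w in words)
    words_per_sentence = len(words) / len(sentences)
    syllables_per_word = syllables / len(words)
    return 206.835 - 1.015 * words_per_sentence - 84.6 * syllables_per_word


simple = flesch_reading_ease('The cat sat.')
dense = flesch_reading_ease('Cinematography demonstrates exceptional sophistication.')
print(simple > dense)  # True
```

The score could be added to the feature dictionary under its own key, for example base_features['flesch_reading_ease'], next to the existing readability entry.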

Model Optimization

Experiment with different algorithms and hyperparameters to optimize performance. Try classifiers such as SVM, Gradient Boosting, or neural networks, and compare them with cross-validation on the same data rather than on a single train/test split.
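
One way to run such a comparison is to loop over candidate estimators and score each with the same cross-validation setup. This is a minimal sketch with scikit-learn; the data comes from make_classification as a stand-in for a real feature table, and the hyperparameter values are arbitrary starting points rather than recommendations:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the extracted features (assumption for illustration)
X_demo, y_demo = make_classification(
    n_samples=120, n_features=8, n_informative=4, random_state=42
)

candidates = {
    'random_forest': RandomForestClassifier(n_estimators=100, random_state=42),
    'gradient_boosting': GradientBoostingClassifier(random_state=42),
    # SVMs are sensitive to feature scale, so standardize first
    'svm_rbf': make_pipeline(StandardScaler(), SVC(kernel='rbf', C=1.0)),
}

results = {}
for name, estimator in candidates.items():
    # An integer cv uses stratified folds for classifiers
    scores = cross_val_score(estimator, X_demo, y_demo, cv=5, scoring='accuracy')
    results[name] = scores.mean()
    print(f'{name}: {scores.mean():.2f} (+/- {scores.std():.2f})')
```

Putting the scaler inside the pipeline means it is refit on each training fold, which avoids leaking statistics from the held-out fold into training.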

Summary

This tutorial demonstrated how to build a system that classifies script text as human- or AI-authored using natural language processing techniques. We created a framework that preprocesses text, extracts linguistic features, and trains a machine learning model on those features. The system analyzes text characteristics including sentence length, lexical diversity, part-of-speech counts, and sentiment. This is a simplified demonstration trained on a handful of sentences, but it provides a foundation for more sophisticated authorship detection systems that could help check compliance with the Academy's rules. The key takeaway is that writing style can be quantified with NLP features and used as evidence, though not proof, of how a text was produced.