Introduction
The Oscars have clarified that AI cannot be credited as an author or performer in films: acting roles must be performed by humans, and screenplays must be human-authored. The rule change reflects growing concern about AI's role in creative industries. In this tutorial, we'll build a small system that uses natural language processing techniques to estimate whether a script passage is human-authored. It is a starting point for thinking about how human authorship in creative works might be verified.
Prerequisites
- Python 3.7 or higher
- Basic understanding of NLP concepts
- Knowledge of machine learning concepts
- Installed libraries: nltk, scikit-learn, pandas, numpy
Step-by-step instructions
Step 1: Setting up the Environment
Install Required Libraries
We need several Python libraries to analyze text patterns and detect human authorship. The NLTK library provides natural language processing tools, while scikit-learn offers machine learning capabilities for classification.
pip install nltk scikit-learn pandas numpy
Download NLTK Data
Before we can perform text analysis, we need to download the NLTK resources this tutorial relies on: the tokenizer models, the stop word list, the part-of-speech tagger, and the VADER sentiment lexicon used in Step 5.
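import nltk
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('averaged_perceptron_tagger')
nltk.download('vader_lexicon')
# Recent NLTK releases (3.9 and later) look for these resource names instead;
# downloading both sets keeps the tutorial working across versions.
nltk.download('punkt_tab')
nltk.download('averaged_perceptron_tagger_eng')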
Step 2: Creating the Authorship Detection Framework
Define Text Preprocessing Functions
Before analyzing text, we need to clean and prepare it for feature extraction. The function below lowercases the text, strips punctuation and digits, tokenizes it, and drops English stop words.
import re
import nltk
from nltk.corpus import stopwords
# Both tokenizers live in nltk.tokenize; there is no nltk.sent_tokenize module
from nltk.tokenize import word_tokenize, sent_tokenize

stop_words = set(stopwords.words('english'))

def preprocess_text(text):
    # Convert to lowercase
    text = text.lower()
    # Remove everything except letters and whitespace (punctuation, digits)
    text = re.sub(r'[^a-zA-Z\s]', '', text)
    # Tokenize
    tokens = word_tokenize(text)
    # Remove stopwords
    tokens = [token for token in tokens if token not in stop_words]
    return tokens
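One side effect of the cleaning step is easier to see in isolation: because the pattern deletes characters rather than replacing them with spaces, contractions and hyphenated words are fused into single tokens. The sketch below uses only the standard library and a plain `split()` in place of `word_tokenize`, which is an assumption made to keep it dependency-free. `clean_delete` and `clean_space` are hypothetical helper names, not part of the tutorial's code.

```python
import re

def clean_delete(text):
    """Mirror the tutorial's cleaning: delete every non-letter, non-space character."""
    return re.sub(r'[^a-zA-Z\s]', '', text.lower())

def clean_space(text):
    """Alternative (assumption): replace those characters with a space instead."""
    return re.sub(r'[^a-zA-Z\s]', ' ', text.lower())

line = "It's a state-of-the-art film from 2024."

# Whitespace splitting stands in for word_tokenize in this sketch.
deleted_tokens = clean_delete(line).split()
spaced_tokens = clean_space(line).split()

print(deleted_tokens)  # ['its', 'a', 'stateoftheart', 'film', 'from']
print(spaced_tokens)   # ['it', 's', 'a', 'state', 'of', 'the', 'art', 'film', 'from']
```

Which behavior is preferable depends on the features you extract afterwards: fused tokens inflate average word length, while split tokens add short fragments such as `s`.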
Extract Linguistic Features
To distinguish between human and AI-generated text, we'll extract several linguistic features that may differ between the two: average sentence length, lexical diversity (unique tokens divided by total tokens), average word length, and counts of nouns, verbs, adjectives, and adverbs.
def extract_features(text):
    tokens = preprocess_text(text)
    sentences = sent_tokenize(text)
    # Calculate average sentence length
    # (tokens exclude stop words, so this is content words per sentence)
    avg_sentence_length = len(tokens) / len(sentences) if sentences else 0
    # Calculate lexical diversity (unique words / total words)
    lexical_diversity = len(set(tokens)) / len(tokens) if tokens else 0
    # Calculate average word length
    avg_word_length = sum(len(token) for token in tokens) / len(tokens) if tokens else 0
    # Tag each token with its part of speech; pos_tags is a list of (word, tag) pairs
    pos_tags = nltk.pos_tag(tokens)
    # Count nouns, verbs, adjectives, adverbs by Penn Treebank tag prefix
    noun_count = len([tag for tag in pos_tags if tag[1].startswith('NN')])
    verb_count = len([tag for tag in pos_tags if tag[1].startswith('VB')])
    adj_count = len([tag for tag in pos_tags if tag[1].startswith('JJ')])
    adv_count = len([tag for tag in pos_tags if tag[1].startswith('RB')])
    return {
        'avg_sentence_length': avg_sentence_length,
        'lexical_diversity': lexical_diversity,
        'avg_word_length': avg_word_length,
        'noun_count': noun_count,
        'verb_count': verb_count,
        'adj_count': adj_count,
        'adv_count': adv_count
    }
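Raw part-of-speech counts grow with the length of the passage, so a long scene will show more nouns than a short one regardless of who wrote it. One possible adjustment, sketched here as an assumption rather than as part of the tutorial's pipeline, is to divide each count by the number of tokens. `to_ratios` is a hypothetical helper, and the example dictionary holds made-up values shaped like the output of `extract_features`.

```python
def to_ratios(features, total_tokens):
    """Return a copy of `features` with every *_count key replaced by a *_ratio key."""
    ratios = {}
    for name, value in features.items():
        if name.endswith('_count'):
            new_name = name[:-len('_count')] + '_ratio'
            # Guard against empty texts to avoid division by zero.
            ratios[new_name] = value / total_tokens if total_tokens else 0.0
        else:
            ratios[name] = value
    return ratios

# Made-up feature values for a passage of 10 tokens.
example = {'lexical_diversity': 0.8, 'noun_count': 4, 'verb_count': 2,
           'adj_count': 1, 'adv_count': 1}

ratios = to_ratios(example, total_tokens=10)
print(ratios)
```

Features that are already length-independent, such as `lexical_diversity`, pass through unchanged.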
Step 3: Building the Classification Model
Create Training Data
To train our model, we need a dataset of both human-authored and AI-generated texts. For demonstration purposes, we'll label a handful of hand-written sentences; in practice, you would gather a much larger set of real examples from both sources.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
# Create sample dataset
sample_data = [
    {'text': 'The movie was absolutely fantastic and the acting was superb.', 'author': 'human'},
    {'text': 'This film demonstrates exceptional storytelling and remarkable cinematography.', 'author': 'human'},
    {'text': 'The plot was well developed and the characters were compelling.', 'author': 'human'},
    {'text': 'The film features advanced visual effects and innovative narrative techniques.', 'author': 'ai'},
    {'text': 'This cinematic masterpiece showcases cutting-edge production methods.', 'author': 'ai'},
    {'text': 'The storyline was engaging and the performances were outstanding.', 'author': 'human'},
    {'text': 'The movie incorporates state-of-the-art special effects and modern filmmaking.', 'author': 'ai'},
    {'text': 'The director demonstrated exceptional creative vision in this work.', 'author': 'human'}
]
# Convert to DataFrame
df = pd.DataFrame(sample_data)
# Extract features for all texts
features_list = []
for text in df['text']:
    features = extract_features(text)
    features_list.append(features)
# Create feature DataFrame (one row per text, one column per feature)
features_df = pd.DataFrame(features_list)
features_df['author'] = df['author']
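Before training, it helps to know how the labels are distributed, since a classifier that always predicts the more common label sets the floor any real model has to beat. The sketch below recomputes that floor for the eight sample labels using only the standard library. The `labels` list is copied by hand from the sample data, and the variable names are illustrative.

```python
from collections import Counter

# Labels in the same order as the tutorial's sample_data list.
labels = ['human', 'human', 'human', 'ai', 'ai', 'human', 'ai', 'human']

counts = Counter(labels)
majority_label, majority_count = counts.most_common(1)[0]

# Accuracy of a "model" that always predicts the most common label.
baseline_accuracy = majority_count / len(labels)

print(counts)
print(f'Always predicting "{majority_label}" scores {baseline_accuracy:.3f}')
```

With five human and three AI examples, the majority baseline is 0.625, so an accuracy near that value would not indicate that the features carry any signal.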
Train the Model
Now we'll train a machine learning model to classify texts as human or AI-authored based on the extracted features.
# Prepare training data
X = features_df.drop('author', axis=1)
y = features_df['author']
# Split data; stratify keeps both classes represented in the two-text test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)
# Train model
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)
# Make predictions
y_pred = model.predict(X_test)
# Calculate accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f'Model Accuracy: {accuracy:.2f}')
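With eight texts and a 20% split, the accuracy above is computed on only two examples, so it can only be 0.0, 0.5, or 1.0. Stratified k-fold cross-validation uses every example for testing exactly once and gives a steadier estimate. The sketch below shows the mechanics on a placeholder feature matrix of random numbers, which is an assumption standing in for real extracted features; `X_demo` and `y_demo` are illustrative names.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Placeholder feature matrix: 12 texts x 3 features, invented for illustration only.
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(12, 3))
y_demo = np.array(['human'] * 6 + ['ai'] * 6)

# Each fold keeps the human/ai proportions of the full dataset.
cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=42)
clf = RandomForestClassifier(n_estimators=100, random_state=42)
scores = cross_val_score(clf, X_demo, y_demo, cv=cv, scoring='accuracy')

print('Per-fold accuracy:', scores)
print(f'Mean accuracy: {scores.mean():.2f}')
```

Because the placeholder features are random, the scores here should hover around chance; the point is the evaluation procedure, not the numbers.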
Step 4: Implementing the Detection System
Create a Prediction Function
With our trained model, we can now create a function that analyzes new script text and predicts whether it's likely human-authored.
def predict_authorship(text):
    # Extract features from the input text
    features = extract_features(text)
    # Convert to a one-row DataFrame so column names match the training data
    row = pd.DataFrame([features])
    # Make prediction
    prediction = model.predict(row)[0]
    probability = model.predict_proba(row)[0]
    # Return the label, the probability of the most likely class, and the features
    return {
        'prediction': prediction,
        'confidence': max(probability),
        'features': features
    }
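The `confidence` value above is the largest entry of `predict_proba`, which does not show how the probability is split across labels. In scikit-learn, the columns of `predict_proba` follow the order of the fitted model's `classes_` attribute, so the two can be zipped into a per-label dictionary. The sketch below demonstrates this on a toy one-feature dataset invented for illustration; `toy_model` and `per_class` are illustrative names.

```python
from sklearn.ensemble import RandomForestClassifier

# Toy one-feature training set, invented for illustration only.
X_toy = [[0.1], [0.2], [0.3], [0.7], [0.8], [0.9]]
y_toy = ['human', 'human', 'human', 'ai', 'ai', 'ai']

toy_model = RandomForestClassifier(n_estimators=50, random_state=0)
toy_model.fit(X_toy, y_toy)

proba = toy_model.predict_proba([[0.5]])[0]

# classes_ lists the labels in the same column order as predict_proba.
per_class = {str(label): float(p) for label, p in zip(toy_model.classes_, proba)}
print(per_class)
```

Returning a dictionary like `per_class` from the prediction function would make borderline cases, where the two probabilities are close, easier to spot.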
Test the System
Let's test our system with some sample texts to see how it performs.
# Test with human-authored text
human_text = "The actor delivered a powerful performance that captured the essence of the character."
result = predict_authorship(human_text)
print(f'Text: {human_text}')
print(f'Prediction: {result["prediction"]} (Confidence: {result["confidence"]:.2f})')
# Test with AI-generated text
ai_text = "The cinematic narrative employs advanced algorithms to enhance viewer engagement."
result = predict_authorship(ai_text)
print(f'Text: {ai_text}')
print(f'Prediction: {result["prediction"]} (Confidence: {result["confidence"]:.2f})')
Step 5: Improving Accuracy
Feature Engineering
To improve accuracy, we can add further features. The function below adds VADER sentiment scores and a simple readability proxy (words per sentence); other linguistic patterns can be added in the same way.
from nltk.sentiment import SentimentIntensityAnalyzer
# Initialize sentiment analyzer
sia = SentimentIntensityAnalyzer()
def enhanced_extract_features(text):
    # Get base features
    base_features = extract_features(text)
    # Add VADER sentiment scores (compound ranges from -1 to 1)
    sentiment = sia.polarity_scores(text)
    base_features['compound_sentiment'] = sentiment['compound']
    base_features['positive_sentiment'] = sentiment['pos']
    base_features['negative_sentiment'] = sentiment['neg']
    base_features['neutral_sentiment'] = sentiment['neu']
    # Add a crude readability proxy: tokens per sentence on the raw text
    # (word_tokenize keeps punctuation marks as separate tokens)
    words = len(word_tokenize(text))
    sentences = len(sent_tokenize(text))
    base_features['readability'] = words / sentences if sentences else 0
    return base_features
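Words per sentence is only a rough stand-in for readability. A standard score such as Flesch Reading Ease also accounts for syllables per word: 206.835 - 1.015 × (words / sentences) - 84.6 × (syllables / words). The sketch below implements it with the standard library only. The regex-based sentence and word splitting and the vowel-group syllable counter are simplifying assumptions, so the scores are approximate, and both function names are hypothetical.

```python
import re

def count_syllables(word):
    """Rough heuristic: count groups of consecutive vowels, at least one per word."""
    groups = re.findall(r'[aeiouy]+', word.lower())
    return max(1, len(groups))

def flesch_reading_ease(text):
    """Flesch Reading Ease; higher scores mean easier text."""
    # Simplified splitting (assumption): sentences end at ., ! or ?
    sentences = [s for s in re.split(r'[.!?]+', text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    if not sentences or not words:
        return 0.0
    syllables = sum(count_syllables(w) for w in words)
    words_per_sentence = len(words) / len(sentences)
    syllables_per_word = syllables / len(words)
    return 206.835 - 1.015 * words_per_sentence - 84.6 * syllables_per_word

simple = "The cat sat on the mat. It was a good day."
dense = "The cinematic narrative employs advanced algorithms to enhance viewer engagement."

print(round(flesch_reading_ease(simple), 1))
print(round(flesch_reading_ease(dense), 1))
```

The short, monosyllabic passage scores well above the dense one, which is the ordering the formula is designed to produce. The result could be stored as an additional key in the feature dictionary.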
Model Optimization
Experiment with different algorithms and hyperparameters to optimize performance. Try different classifiers like SVM, Gradient Boosting, or even neural networks for better accuracy.
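One way to run such an experiment is to score each candidate with the same cross-validation folds and compare mean accuracy. The sketch below does this for three scikit-learn classifiers on a placeholder matrix of random numbers, which is an assumption standing in for real extracted features. The candidate list and hyperparameters are illustrative defaults, not tuned recommendations.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Placeholder data: 20 texts x 4 features, invented for illustration only.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(20, 4))
y_demo = np.array(['human'] * 10 + ['ai'] * 10)

candidates = {
    'random_forest': RandomForestClassifier(n_estimators=100, random_state=42),
    'gradient_boosting': GradientBoostingClassifier(random_state=42),
    # SVMs are sensitive to feature scale, so standardize inside the pipeline.
    'svm_rbf': make_pipeline(StandardScaler(), SVC(kernel='rbf')),
}

# Reuse the same folds for every candidate so the comparison is fair.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
results = {}
for name, estimator in candidates.items():
    scores = cross_val_score(estimator, X_demo, y_demo, cv=cv, scoring='accuracy')
    results[name] = scores.mean()

for name, mean_score in sorted(results.items(), key=lambda item: item[1], reverse=True):
    print(f'{name}: {mean_score:.2f}')
```

Putting the scaler inside the pipeline means it is refit on each training fold, which avoids leaking test-fold statistics into training.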
Summary
This tutorial demonstrated how to build a system that estimates whether script text is human-authored using natural language processing techniques. We created a framework that preprocesses text, extracts linguistic features, and trains a machine learning model to classify content as human or AI-authored. The system analyzes text characteristics including sentence length, lexical diversity, part-of-speech counts, and sentiment scores. While this is a simplified demonstration, it provides a foundation for more sophisticated authorship detection systems that could be used to help verify compliance with the Oscars' new rules. The key takeaway is that linguistic patterns in writing can be quantified with NLP techniques and used as signals for authorship classification.



