So you’re working with NLP models, and you find data preprocessing to be a rather mundane phase, am I right? Well, I share the sentiment. Still, every NLP model comes with its own specific requirements for how data must be prepared before training. To address this, I have included a code snippet below that offers a comprehensive text-preprocessing solution. Rest assured, it is versatile and compatible with the majority of text-based NLP models.
The code snippet harnesses Regex and NLTK to handle text preprocessing through a well-structured three-stage process, which I will walk through in this article. Before diving into the details, let’s cover what needs to be installed and imported.
!pip install nltk
After successfully installing NLTK (Natural Language Toolkit), you can proceed by importing all the essential components needed to kickstart your code.
import re, string
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords, wordnet
from nltk.stem import SnowballStemmer, WordNetLemmatizer

nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
nltk.download('wordnet')
nltk.download('stopwords')
nltk.download('omw-1.4')
The code above imports the Regex and NLTK libraries and downloads the corpora and models (tokenizer, POS tagger, WordNet, and the stopword list) that the later steps rely on.
Next, we move on to the preprocessing itself, which unfolds in three distinct stages. In each stage, every parameter can be modified or omitted according to your specific needs.
Step 1: Text preprocessing — This converts all characters to lowercase, strips leading and trailing whitespace, removes HTML tags, replaces punctuation with spaces, and removes any numbers.
# Step 1: Text preprocessing
def preprocess(text):
    text = text.lower()                        # Lowercase text
    text = text.strip()                        # Remove leading/trailing whitespace
    text = re.compile('<.*?>').sub('', text)   # Remove HTML tags/markup
    text = re.sub(r'\[[0-9]*\]', ' ', text)    # Remove bracketed references such as [1]
    text = re.compile('[%s]' % re.escape(string.punctuation)).sub(' ', text)  # Replace punctuation with spaces. Careful: punctuation can sometimes be useful.
    text = re.sub(r'[^\w\s]', '', text)        # Drop any remaining non-alphanumeric symbols
    text = re.sub(r'\d', ' ', text)            # Replace digits with spaces
    text = re.sub(r'\s+', ' ', text).strip()   # Collapse runs of spaces and tabs into single spaces
    return text
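As a quick sanity check, here is what the function above produces on a small made-up string (the example input is mine, not from any particular dataset):
# Quick demo of Step 1 on a sample string
print(preprocess("<p>Hello, World! It's 2023 [1]</p>"))
# -> "hello world it s" (tags, punctuation, digits, and references are gone)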
Step 2: Stopword Removal — In NLP, “stopwords” refer to commonly used words that are often considered insignificant in the analysis of text data. These words, such as “the,” “is,” and “and,” don’t carry much meaning and can be found frequently in any text.
# Step 2: Stopword Removal
stop_words = set(stopwords.words('english'))  # Build the set once for fast lookups

def stopword(text):
    # The parameter is named 'text' to avoid shadowing the imported 'string' module
    words = [word for word in text.split() if word not in stop_words]
    return ' '.join(words)
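For example, running the function on a short sentence of my own choosing strips out the common filler words:
# Quick demo of Step 2
print(stopword("this is a simple example of stopword removal"))
# -> "simple example stopword removal"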
Step 3: Initializing Stemmer & Lemmatizer — Stemming and Lemmatizing are techniques used to reduce words to their root form, making it easier to analyze and understand text.
Stemming involves removing prefixes or suffixes from words to obtain their base or root form. For example, the words “running” and “runs” are both stemmed to the common root “run.” (Note that suffix-stripping stemmers generally cannot handle irregular forms, so “ran” stays as-is.) This helps to group similar words together and reduces the total number of unique words in a text.
After defining the stemmer, we set up the Lemmatizer, which takes the preprocessing a step further.
Lemmatizing goes a step further by considering the context and part of speech of a word before reducing it to its base form, called the lemma. For example, the word “better” could be lemmatized to “good” because they have the same meaning in the context of comparison. Lemmatization ensures that words are reduced to meaningful and valid lemmas.
# Step 3: Initializing Stemmer & Lemmatizer

# Initialize the Stemmer
snow = SnowballStemmer('english')

def stemming(text):
    stems = [snow.stem(word) for word in word_tokenize(text)]
    return ' '.join(stems)

# Initialize the Lemmatizer
wl = WordNetLemmatizer()

# Helper function to map NLTK POS tags to WordNet POS tags
def get_wordnet_pos(tag):
    if tag.startswith('J'):
        return wordnet.ADJ
    elif tag.startswith('V'):
        return wordnet.VERB
    elif tag.startswith('N'):
        return wordnet.NOUN
    elif tag.startswith('R'):
        return wordnet.ADV
    else:
        return wordnet.NOUN  # Default to noun when the tag is ambiguous

def lemmatizer(text):
    word_pos_tags = nltk.pos_tag(word_tokenize(text))  # Tokenize the sentence and get POS tags
    lemmas = [wl.lemmatize(word, get_wordnet_pos(tag)) for word, tag in word_pos_tags]  # Map each POS tag and lemmatize the word/token
    return ' '.join(lemmas)
We consolidate the entire process into a single function, allowing us to conveniently invoke it without executing each of the three stages separately. Note that this version uses the lemmatizer rather than the stemmer; if you prefer stemming, simply swap in stemming() instead.
# FINAL PREPROCESSING
def finalpreprocess(text):
    return lemmatizer(stopword(preprocess(text)))
And there you have it, your very own pre-processor. To apply this to a single string, simply utilize the code provided below.
# Preprocess a String
text = input("Enter the text that needs to be Pre-Processed: ")
print("Original Text: ", text)
print("Pre-Processed Text: ", finalpreprocess(text))
To pre-process a complete pandas DataFrame, you can use the code below.
# Preprocess a Pandas Dataframe
import pandas as pd
# Load the CSV file
df = pd.read_csv('data.csv')
# Apply preprocessing to the 'text' column
df['text'] = df['text'].apply(finalpreprocess)
# Save the preprocessed DataFrame to a new CSV file
df.to_csv('preprocessed_data.csv', index=False)
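One small variation worth considering: write the cleaned text to a new column instead of overwriting the original, so the raw data stays available for inspection. This assumes, as above, that your text lives in a column named 'text':
# Variation: keep the raw text and store the cleaned version separately
df['clean_text'] = df['text'].apply(finalpreprocess)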
In conclusion, data preprocessing is a crucial step in working with NLP models. Although it may seem tedious, it plays a vital role in preparing the data for effective training.
By following these steps, you can ensure that your text is properly processed and ready for further analysis and modeling. Remember to adjust the parameters according to your specific requirements. With these techniques at your disposal, you’ll be well equipped to train NLP models and unlock new insights from your data.