Spam Detection in Messages Using MindsDB and Hugging Face
Part of the Hashnode X MindsDB hackathon
Spam detection is a technique to identify and filter out unwanted messages such as advertisements, phishing scams, and malware.
It uses algorithms to analyze the content of incoming messages and determine whether each message is spam or not.
In this tutorial, we will use MindsDB as the database layer and a pre-trained model from Hugging Face to predict whether a message is spam.
Let us first look at MindsDB and Hugging Face:
According to the official MindsDB documentation:
Data that lives in your database is a valuable asset. MindsDB enables you to use your data and make forecasts. It speeds up the ML development process by bringing machine learning into the database.
With MindsDB, you can build, train, optimize, and deploy your ML models without the need for other platforms. And to get the forecasts, simply query your data and ML models.
You can do all of this just by writing SQL queries in the MindsDB Cloud editor. Sounds exciting!
Hugging Face provides pre-trained models that can be fine-tuned for various NLP tasks, such as text classification, question answering, and language generation.
The model used in this tutorial is from Hugging Face:
model_name = 'mrm8488/bert-tiny-finetuned-sms-spam-detection'
The dataset we will use is from Kaggle and contains spam and non-spam messages. The model is already trained, so the dataset is not used for training; it serves as input to the pre-trained model so that we can see its predictions and judge how accurate they are.
Before following the tutorial, make sure to sign up for MindsDB Cloud.
After that, open the MindsDB Cloud editor, where you can write SQL queries for your project.
The dataset can be found Here
Now let's go through the procedure to use the model and predict spam messages:
- Step-1: Upload the dataset file to MindsDB by clicking on the top-left corner, and give it a descriptive name. I have named it 'message_dataset'.
After uploading, you can view the contents of the file using:
SELECT * FROM files.message_dataset ;
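If the file is large, a short preview may be enough to confirm that the upload worked. The query below is a minimal sketch; it assumes the editor accepts a standard LIMIT clause on uploaded files, and the row count of 5 is an arbitrary choice.

```sql
-- Preview only the first five rows of the uploaded file.
-- Assumption: the file was uploaded under the name 'message_dataset'.
SELECT *
FROM files.message_dataset
LIMIT 5;
```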
- Step-2: Import the pre-trained model from Hugging Face and use it to create a model in MindsDB.
'spam_model' is the name of our model; you can use any other name of your choice.
You can name the labels as you like; I have used 'ham' for not spam and 'spam' for spam.
Create the model with:
CREATE MODEL mindsdb.spam_model
PREDICT Category
USING
engine = 'huggingface',
task = 'text-classification',
model_name = 'mrm8488/bert-tiny-finetuned-sms-spam-detection',
input_column = 'Message',
labels = ['ham', 'spam'];
- Step-3: After creating the model, you can check its status with:
SELECT *
FROM mindsdb.models
WHERE name = 'spam_model';
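The models table returns many columns, so a narrower query can be easier to read while waiting for the model to become ready. The sketch below assumes the table exposes status and error columns and that the status reads 'complete' once the model can be queried, as the MindsDB documentation describes.

```sql
-- Show only the fields needed to tell whether the model is ready.
-- Assumption: mindsdb.models exposes name, status, and error columns.
SELECT name, status, error
FROM mindsdb.models
WHERE name = 'spam_model';
```

If the status shows an error, the error column should indicate what went wrong in the CREATE MODEL statement.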
- Step-4: Use the dataset uploaded in Step-1 as input to our model and get a prediction for every message:
SELECT r.Category, r.Category_explain, t.Message AS input_text
FROM files.message_dataset AS t
JOIN mindsdb.spam_model AS r;
- The output of the above query contains one row per message. The Category column shows whether the message is spam or ham (not spam).
The Category_explain column gives the probability for each label, which supports the predicted category.
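To judge how well the model does, it can help to put the dataset's own label next to the prediction. The query below is a sketch under two assumptions: the uploaded file has a Category column holding the true label (as its use in PREDICT Category suggests), and the file is named 'message_dataset' as in Step-1. The aliases actual_category and predicted_category are illustrative names, not part of the original queries.

```sql
-- Compare the label stored in the dataset with the model's prediction.
-- Assumption: files.message_dataset has Category (true label) and Message columns.
SELECT t.Category AS actual_category,
       r.Category AS predicted_category,
       r.Category_explain,
       t.Message AS input_text
FROM files.message_dataset AS t
JOIN mindsdb.spam_model AS r
LIMIT 20;
```

Rows where the two category columns differ are the messages the model got wrong.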
OR
- Alternatively, you can pass a single custom message as input instead of the entire dataset file and get the prediction for that message:
SELECT r.Category, r.Category_explain
FROM mindsdb.spam_model AS r
WHERE Message = 'Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive entry question(std txt rate)T&Cs apply 08452810075over18s';
- The output of the above query predicts that the message is spam!
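The same query shape works for any text. As a sketch, the message below is a made-up everyday text, used here only to illustrate how you might check a message that looks like ham; no output is shown because the result depends on the model.

```sql
-- Classify a single hypothetical message (the text is invented for illustration).
SELECT r.Category, r.Category_explain
FROM mindsdb.spam_model AS r
WHERE Message = 'Are we still meeting for lunch tomorrow at noon?';
```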
- For more information about creating your model and using it for predictions, you can visit the official MindsDB and Hugging Face sites.
PS: This project is part of a hackathon conducted by Hashnode X MindsDB.
Cover image by Vectorjuice on Freepik.