Spam Detection in messages using Mindsdb and HuggingFace

Spam Detection in messages using Mindsdb and HuggingFace

part of hashnodeXmindsdb hackathon

Spam detection is a technique to identify and filter out unwanted messages like advertisements, phishing scams, malware etc.

Spam detection involves using algorithms and techniques to analyze the content of incoming messages to determine whether the message is a scam or not.

In this tutorial, we will be using Mindsdb as a database and the model to predict the spam message from HuggingFace.

So let us first go through Mindsdb and HuggingFace:

According to the official documentation from mindsdb:

Data that lives in your database is a valuable asset. MindsDB enables you to use your data and make forecasts. It speeds up the ML development process by bringing machine learning into the database.

With MindsDB, you can build, train, optimize, and deploy your ML models without the need for other platforms. And to get the forecasts, simply query your data and ML models.

And you can do all this by just writing SQL queries in the Mindsdb cloud editor. Sounds exciting!

Hugging face: It provides pre-trained models that can be fine-tuned for various NLP tasks, such as text classification, question answering, and language generation.

The model used in this tutorial is from Huggingface:

model_name = 'mrm8488/bert-tiny-finetuned-sms-spam-detection'
  • The dataset which we will be using is from Kaggle and contains spam and nonspam messages. The model is already trained, so we will use this dataset to predict the result of the model and its accuracy. The dataset will be used as input to the pre-trained model.

  • Before following the tutorial make sure to signup on to the Mindsdb Cloud .

  • After that open the mindsdb cloud editor, where you can write SQL queries for your project.

The dataset can be found Here

Now let's go through the procedure to use the model and predict spam messages:

  • Step-1: Upload the dataset file to mindsdb by clicking on the top left corner and give it a proper name, I have named it 'message_dataset'

After uploading you can watch the contents of the file using

SELECT * FROM files.message_dataset ;
  • Step-2: importing the pre-trained model from hugging face and using it to create a model in mindsdb using:

You can provide any labels, I have used ham for not spam & spam for spam.

'spam_model' is the name of our model. You can use any other name of your choice.

CREATE MODEL mindsdb.spam_model                          
PREDICT Category                          
USING
  engine = 'huggingface',              
  task = 'text-classification',        
  model_name = 'mrm8488/bert-tiny-finetuned-sms-spam-detection', 
  input_column = 'Message',        
  labels = ['ham', 'spam'];
  • Step-3: After creating the model you can watch the status of your model by
 SELECT *
FROM mindsdb.models
WHERE name = 'spam_model';
  • Step-4: Use the dataset to provide input to our model and get a prediction:

SELECT r.Category,r.Category_explain,t.Message AS input_text
FROM files.data_text AS t
JOIN mindsdb.spam_model AS r;
  • The output for the above query is, For each row there is a Category that shows whether the message is spam or ham (not spam).

Also, the category_explain column gives the probability which is confirmed by the result.

OR

  • Alternatively, you can use a custom message as input instead of the entire dataset file which will give the result of the message
SELECT r.Category,r.Category_explain
FROM mindsdb.spam_model AS r
WHERE Message='Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive entry question(std txt rate)T&Cs apply 08452810075over18s'
  • The output of the above query is, which predicts that it is spam!.

  • For more information about creating your model and using it to predict you can visit the official sites:

Mindsdb

HuggingFace

PS- This project is a part of a hackathon conducted by Hashnode X Mindsdb

#MindsDB #MindsDBHackathon

Cover image by -Vectorjuice on Freepik !

Did you find this article valuable?

Support The Dev guy by becoming a sponsor. Any amount is appreciated!