Detecting fraudulent job postings using NLP and Machine Learning
🎥 Demo video: `app.demo.mp4`
The Fake Internship Detector is an NLP-based machine learning project designed to identify fraudulent job and internship postings. With the rise of online job scams targeting students, this project aims to make the hiring space safer by analyzing the language and writing patterns of job descriptions.
The model learns to differentiate between genuine and fake postings using both textual and behavioral signals extracted from real job data.
- Detect fraudulent job/internship postings automatically
- Use NLP to analyze textual patterns in job descriptions
- Combine linguistic and behavioral features for improved accuracy
- Build a robust model capable of identifying scams with high reliability
- Source: Open-source job posting dataset (such as Kaggle – Real or Fake Job Posting Prediction)
- Total records: ~18,000
- Target variable: `fraudulent` → 0 = Genuine, 1 = Fake
- Notable challenge: class imbalance (~95% genuine, ~5% fake); see the loading sketch below
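A minimal loading sketch, assuming the Kaggle CSV is saved locally as `fake_job_postings.csv` (file name and path are assumptions; adjust to your copy):

```python
import pandas as pd

# Load the job postings dataset (adjust the path/filename to your local copy)
df = pd.read_csv("fake_job_postings.csv")

# Target distribution: roughly 95% genuine (0) vs. 5% fake (1)
print(df["fraudulent"].value_counts(normalize=True))
```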
- Combined multiple text columns (title, description, requirements, etc.) into one field
- Removed URLs, punctuation, numbers, and special characters using regex
- Converted all text to lowercase
- Applied stopword removal and lemmatization (using NLTK)
- Handled missing values and dropped irrelevant columns (a cleaning sketch follows this list)
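A sketch of the cleaning pipeline described above, using NLTK for stopword removal and lemmatization. It assumes the `df` from the loading sketch; the exact set of text columns combined is an assumption based on the Kaggle schema.

```python
import re

import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

# One-time downloads (quiet if already cached locally)
nltk.download("stopwords", quiet=True)
nltk.download("wordnet", quiet=True)

STOP_WORDS = set(stopwords.words("english"))
LEMMATIZER = WordNetLemmatizer()

def clean_text(text: str) -> str:
    """Lowercase, strip URLs/punctuation/numbers, remove stopwords, lemmatize."""
    text = str(text).lower()
    text = re.sub(r"http\S+|www\.\S+", " ", text)   # URLs
    text = re.sub(r"[^a-z\s]", " ", text)           # punctuation, digits, special chars
    tokens = [LEMMATIZER.lemmatize(tok) for tok in text.split() if tok not in STOP_WORDS]
    return " ".join(tokens)

# Combine the text columns into one field, keeping the raw version for style features
text_cols = ["title", "company_profile", "description", "requirements", "benefits"]
df["raw_text"] = df[text_cols].fillna("").agg(" ".join, axis=1)
df["text"] = df["raw_text"].apply(clean_text)
```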
- Visualized class imbalance (95% genuine vs. 5% fake)
- Generated missing value heatmaps, word clouds, and text length distributions
- Observed that fake posts were shorter, more repetitive, and used flashy terms like “money”, “urgent”, and “work from home” (see the EDA sketch below)
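A small EDA sketch along those lines, assuming `df` with the `fraudulent`, `text`, and `raw_text` columns from the snippets above (the `wordcloud` package is an extra dependency):

```python
import matplotlib.pyplot as plt
import seaborn as sns
from wordcloud import WordCloud

# Class imbalance: genuine (0) vs. fake (1)
sns.countplot(x="fraudulent", data=df)
plt.title("Class distribution")
plt.show()

# Text length distribution per class (fake posts tend to be shorter)
df["text_len"] = df["text"].str.split().str.len()
sns.histplot(data=df, x="text_len", hue="fraudulent", bins=50)
plt.title("Posting length by class")
plt.show()

# Word cloud of fake postings
fake_text = " ".join(df.loc[df["fraudulent"] == 1, "text"])
wc = WordCloud(width=800, height=400, background_color="white").generate(fake_text)
plt.imshow(wc, interpolation="bilinear")
plt.axis("off")
plt.show()
```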
- Added numerical indicators to strengthen the model:
  `num_words`, `num_unique_words`, `num_chars`, `avg_word_len`, `num_exclamations`, `num_question_marks`, `num_uppercase`

🧠 These features complement the NLP features by capturing stylistic and behavioral patterns (a computation sketch follows below).
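A sketch of how these indicators can be computed with pandas. The punctuation and uppercase counts run on the raw (uncleaned) text, since cleaning lowercases and strips punctuation; the exact formulas are assumptions.

```python
import numpy as np
import pandas as pd

def add_style_features(frame: pd.DataFrame) -> pd.DataFrame:
    """Add the numerical indicators listed above (formulas are assumptions)."""
    words = frame["text"].str.split()
    frame["num_words"] = words.str.len()
    frame["num_unique_words"] = words.apply(lambda w: len(set(w)))
    frame["num_chars"] = frame["text"].str.len()
    frame["avg_word_len"] = frame["num_chars"] / frame["num_words"].replace(0, np.nan)
    # Count punctuation/uppercase on the raw text, before lowercasing/cleaning
    frame["num_exclamations"] = frame["raw_text"].str.count("!")
    frame["num_question_marks"] = frame["raw_text"].str.count(r"\?")
    frame["num_uppercase"] = frame["raw_text"].str.count(r"[A-Z]")
    return frame

df = add_style_features(df)
```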
- Used TF-IDF Vectorizer to transform text into numerical features
- Parameters:
  - `max_features = 5000`
  - `ngram_range = (1, 2)` → captures both single words and bigrams
  - `stop_words = 'english'`
- Handled severe class imbalance using SMOTE (Synthetic Minority Oversampling Technique), as sketched below
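A combined vectorization-and-balancing sketch. Splitting before fitting TF-IDF and applying SMOTE only to the training split are best-practice assumptions; the notebook's exact order may differ.

```python
from imblearn.over_sampling import SMOTE
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split

# Hold out a test set first so oversampling never touches evaluation data
X_train_text, X_test_text, y_train, y_test = train_test_split(
    df["text"], df["fraudulent"], test_size=0.2, stratify=df["fraudulent"], random_state=42
)

# TF-IDF with the parameters listed above
tfidf = TfidfVectorizer(max_features=5000, ngram_range=(1, 2), stop_words="english")
X_train = tfidf.fit_transform(X_train_text)
X_test = tfidf.transform(X_test_text)

# SMOTE only on the training split
X_train_bal, y_train_bal = SMOTE(random_state=42).fit_resample(X_train, y_train)
```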
Trained and compared multiple models (see the training sketch below):
- Logistic Regression
- Random Forest
- SVM
Evaluation Metrics:
- Accuracy
- Precision
- Recall
- F1-Score (primary metric)
🎯 Final Model Performance:
F1-Score ≈ 80% (balanced performance between precision & recall)
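A minimal training/evaluation sketch for the three models, reusing the split from the previous snippet. `LinearSVC` stands in for the SVM and the hyperparameters are assumptions rather than the tuned values; the numeric indicators can be appended to the TF-IDF matrix with `scipy.sparse.hstack` for the hybrid setup.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, f1_score
from sklearn.svm import LinearSVC

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Random Forest": RandomForestClassifier(n_estimators=200, random_state=42),
    "SVM": LinearSVC(),
}

for name, clf in models.items():
    clf.fit(X_train_bal, y_train_bal)   # balanced training data
    preds = clf.predict(X_test)         # untouched, imbalanced test set
    print(f"{name}: F1 = {f1_score(y_test, preds):.3f}")
    print(classification_report(y_test, preds, digits=3))
```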
- Used Matplotlib, Seaborn, and Plotly for interactive and comparative insights
- Plotted feature distributions, confusion matrices, and ROC curves (see the sketch below)
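A sketch of the evaluation plots, assuming a fitted `clf` plus `X_test`/`y_test` from the snippets above:

```python
import matplotlib.pyplot as plt
from sklearn.metrics import ConfusionMatrixDisplay, RocCurveDisplay

ConfusionMatrixDisplay.from_estimator(clf, X_test, y_test)
plt.title("Confusion matrix")
plt.show()

RocCurveDisplay.from_estimator(clf, X_test, y_test)
plt.title("ROC curve")
plt.show()
```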
| Category | Tools & Libraries |
|---|---|
| Language | Python |
| Data Handling | Pandas, NumPy |
| Visualization | Matplotlib, Seaborn, Plotly |
| NLP | NLTK, Scikit-learn (TF-IDF) |
| Modeling | Logistic Regression, Random Forest, SVM |
| Balancing | SMOTE |
| Environment | Jupyter Notebook / Google Colab |
- Achieved ~80% F1-Score on test data
- Significantly improved recall (caught more fake postings)
- Successfully identified linguistic and behavioral patterns unique to fraudulent ads
```
app/
    dashboard.py
model/
    random_forest_model.joblib
    tfidf_vectorizer.joblib
notebooks/
    data_preprocessing_eda_baseline.ipynb
src/
    __pycache__/
        preprocessing.cpython-312.pyc
    model.py
    preprocessing.py
```
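A sketch of how the saved artifacts in `model/` might be loaded for inference. Whether the vectorizer expects raw or cleaned text depends on how it was fitted in the notebook, so treat the input handling as an assumption.

```python
import joblib

# Load the artifacts shipped in model/ (paths relative to the repo root)
vectorizer = joblib.load("model/tfidf_vectorizer.joblib")
model = joblib.load("model/random_forest_model.joblib")

def predict_posting(text: str) -> str:
    """Classify a single job/internship description as Genuine or Fake."""
    # If the vectorizer was fitted on cleaned text, apply the same clean_text() step first
    features = vectorizer.transform([text])
    return "Fake" if model.predict(features)[0] == 1 else "Genuine"

print(predict_posting("Urgent! Work from home and earn money fast, no experience needed!!!"))
```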
- Integrate a REST API for job posting platforms
- Experiment with transformer-based NLP models (e.g., BERT, RoBERTa)
- Expand dataset for multilingual job postings
- Combined NLP + Feature Engineering for hybrid modeling
- Learned to handle imbalanced datasets effectively
- Improved skills in text preprocessing, model tuning, and data visualization
- Created a system that can genuinely help users avoid online fraud
💼 LinkedIn
💻 GitHub
📧 [email protected]
⭐ If you liked this project, consider giving it a star on GitHub!