134 commits
2f6582f
Group norms + Repo readme file update!
MahdiaAhmadi Jun 1, 2025
33657b5
Merge branch 'MIT-Emerging-Talent:main' into main
MUSABKAYMAK Jun 3, 2025
e08b60b
Merge branch 'MIT-Emerging-Talent:main' into main
MUSABKAYMAK Jun 8, 2025
78965f3
Update .markdownlint.yml
MahdiaAhmadi Jun 8, 2025
57326ea
Update .markdownlint.yml
MahdiaAhmadi Jun 10, 2025
f1bb4da
trial
MUSABKAYMAK Jun 10, 2025
3eb4f8e
trial
MUSABKAYMAK Jun 10, 2025
a1ad6a9
change 1 to constraints
ggmeklit Jun 15, 2025
36fb2d3
commit #1 - changes to constraint doc
ggmeklit Jun 15, 2025
31cbf42
fixing linting
ggmeklit Jun 15, 2025
f5cf5a9
fixed header on retrospective file
ggmeklit Jun 15, 2025
850852f
Added meeting notes
ggmeklit Jun 15, 2025
b819f5a
Merge pull request #11 from MIT-Emerging-Talent/Cross_Cultural_Collab…
MahdiaAhmadi Jun 15, 2025
e0f810c
Merge pull request #10 from MIT-Emerging-Talent/1.Problem_identification
MahdiaAhmadi Jun 15, 2025
dd73647
Add milestone 1 deliverables
MahdiaAhmadi Jun 15, 2025
36709ac
Add milestone 1 deliverables
MahdiaAhmadi Jun 15, 2025
f8dafce
Merge branch 'main' into Problem_identification
MahdiaAhmadi Jun 15, 2025
af37fa2
Update retrospective.md
MahdiaAhmadi Jun 15, 2025
a627b38
Merge pull request #13 from MIT-Emerging-Talent/Problem_identification
MUSABKAYMAK Jun 15, 2025
24afc32
update to communication file
ggmeklit Jun 16, 2025
2c205be
Learning goals for the group
SEMIRATESFAI Jun 16, 2025
04aa186
fixed multiple items on learning goals
SEMIRATESFAI Jun 16, 2025
e5f1b94
Submitting Contributing plan for effective virtual collaboration
SEMIRATESFAI Jun 16, 2025
8c51031
fixing error on communication file
ggmeklit Jun 17, 2025
79c7da6
Merge branch 'main' into Cross_Cultural_Collaboration
ggmeklit Jun 17, 2025
98c71e2
update to constraints file
ggmeklit Jun 17, 2025
adac151
Merge branch 'Cross_Cultural_Collaboration' of https://github.com/MIT…
ggmeklit Jun 17, 2025
8aed6d2
final edit on communication
ggmeklit Jun 17, 2025
403e5b4
Merge pull request #16 from MIT-Emerging-Talent/Cross_Cultural_Collab…
SEMIRATESFAI Jun 17, 2025
a585d87
Merge pull request #17 from MIT-Emerging-Talent/learning-goals
ggmeklit Jun 17, 2025
e7e0286
removed older team availablity
ggmeklit Jun 17, 2025
831d196
Merge pull request #18 from MIT-Emerging-Talent/Cross_Cultural_Collab…
SEMIRATESFAI Jun 17, 2025
383cc21
edit and added resource link
SEMIRATESFAI Jun 17, 2025
81082fb
edited and added link
SEMIRATESFAI Jun 17, 2025
6ee2d66
Merge pull request #19 from MIT-Emerging-Talent/learning-goals
ggmeklit Jun 17, 2025
76d5277
refact: Update README and collaboration documents for clarity and str…
MahdiaAhmadi Jun 29, 2025
91fb434
Add data cleaning script and cleaned dataset
AhmadHamedDehzad Jun 30, 2025
a91838a
Update clean_data.py with header comments; add refreshed cleaned_phis…
AhmadHamedDehzad Jun 30, 2025
77059cd
Add data cleaning README with objectives and step-by-step plan
AhmadHamedDehzad Jun 30, 2025
a820f72
read_me updated
ggmeklit Jul 1, 2025
f244e8f
updated_readme
ggmeklit Jul 1, 2025
4763381
Few changes
ggmeklit Jul 1, 2025
4fac1ad
reorganized
ggmeklit Jul 1, 2025
8608972
reorganized
ggmeklit Jul 1, 2025
70a8e55
Merge pull request #26 from MIT-Emerging-Talent/data-cleaning
MUSABKAYMAK Jul 1, 2025
7835a53
created retrospective file
SEMIRATESFAI Jul 1, 2025
caeebe4
Merge pull request #27 from MIT-Emerging-Talent/feature/Retrospective
ggmeklit Jul 1, 2025
88dcc5b
updated research question
ggmeklit Jul 1, 2025
dc064a9
Merge pull request #28 from MIT-Emerging-Talent/data-cleaning
ggmeklit Jul 1, 2025
2f68c2c
Add Milestone 2 retrospective
SEMIRATESFAI Jul 1, 2025
6ed8d0d
Merge pull request #30 from MIT-Emerging-Talent/milestone-2-retrospec…
ggmeklit Jul 1, 2025
11a2cfa
Update README.md
ggmeklit Jul 1, 2025
ca9cd51
retrospective file placed correctly
ggmeklit Jul 1, 2025
6e069ce
deleted extra file
ggmeklit Jul 1, 2025
4532863
Merge pull request #31 from MIT-Emerging-Talent/milestone-2-retrospec…
SEMIRATESFAI Jul 1, 2025
f0cf5b5
Clarify dataset origins in Milestone 2 retrospective (Nazario and Naz…
SEMIRATESFAI Jul 1, 2025
21d337c
Update VSCode settings and README during Milestone 2 edits
SEMIRATESFAI Jul 1, 2025
3786af6
Merge pull request #32 from MIT-Emerging-Talent/milestone-2-retrospec…
ggmeklit Jul 4, 2025
a93d816
raw dataset added
MahdiaAhmadi Jul 21, 2025
16b4be8
doc: readme file for dataset folder update
MahdiaAhmadi Jul 21, 2025
97034fe
feat: data clearning script added
MahdiaAhmadi Jul 21, 2025
afda459
feat: cleaned data added
MahdiaAhmadi Jul 21, 2025
64af0cd
doc: non technical report added
MahdiaAhmadi Jul 21, 2025
83d3b6b
feat: data anaylsis file added
MahdiaAhmadi Jul 21, 2025
6ea0a8b
feat: plot files added
MahdiaAhmadi Jul 21, 2025
664757d
doc: requiremnets file added
MahdiaAhmadi Jul 21, 2025
91748b3
doc: technical report of dataset added
MahdiaAhmadi Jul 21, 2025
5202ebe
feat: filtered data added
MahdiaAhmadi Jul 21, 2025
783a89f
feat: safe email data extracted from rawdata added
MahdiaAhmadi Jul 21, 2025
25f3df8
doc: readme file for data preparation updata
MahdiaAhmadi Jul 21, 2025
6d86e69
feat: clean file rm
MahdiaAhmadi Jul 21, 2025
65a0945
rm: unneccary file removed
MahdiaAhmadi Jul 21, 2025
66c090f
doc: readme file updated
MahdiaAhmadi Jul 21, 2025
9e2bb6e
remove
MahdiaAhmadi Jul 21, 2025
5107772
remove
MahdiaAhmadi Jul 21, 2025
847b7a6
remove
MahdiaAhmadi Jul 21, 2025
7987689
doc: readme file update
MahdiaAhmadi Jul 21, 2025
0d58506
doc: doc file update
MahdiaAhmadi Jul 21, 2025
688e297
Fix linting and formatting errors - work in progress
MahdiaAhmadi Jul 21, 2025
b39f186
Fix dataset path in analysis script
MahdiaAhmadi Jul 21, 2025
2054812
Fix markdown formatting errors and NLTK download issue
MahdiaAhmadi Jul 21, 2025
5b3a3c1
Ensure notebook formatting compatibility
MahdiaAhmadi Jul 21, 2025
f20be28
Format notebook code to fix CI/CD formatting check
MahdiaAhmadi Jul 21, 2025
27186f4
Merge pull request #33 from MIT-Emerging-Talent/data_analysis
MahdiaAhmadi Jul 21, 2025
d3e61b6
Add Enron_cleaned.csv to branch
ggmeklit Jul 27, 2025
0e20ff9
fixed CI checks on technical report
ggmeklit Jul 27, 2025
99b03d6
CI_checks on md file
ggmeklit Jul 27, 2025
bec5867
technical report md file
ggmeklit Jul 27, 2025
2d3cd31
ruff_format
ggmeklit Jul 27, 2025
e2d2f75
Remove old notebook version of data cleaning
ggmeklit Jul 27, 2025
035289d
Merge pull request #34 from MIT-Emerging-Talent/Changes-and-CI-checks
MahdiaAhmadi Jul 27, 2025
a733d56
doc: Revise README.md to reflect project focus on phishing email ling…
MahdiaAhmadi Jul 27, 2025
473addc
doc:Add CONTRIBUTING.md with guidelines for project contributions
MahdiaAhmadi Jul 27, 2025
eebeee8
doc: Add retrospective document outlining lessons learned and strate…
MahdiaAhmadi Jul 27, 2025
879d634
doc: Add data collection retrospective outlining strategies, lessons …
MahdiaAhmadi Jul 27, 2025
64e9c0c
doc: Add comprehensive data preparation retrospective outlining strat…
MahdiaAhmadi Jul 27, 2025
fd6a2fc
doc: Add data exploration retrospective outlining strategies, lessons…
MahdiaAhmadi Jul 27, 2025
61f9bb1
doc: Add comprehensive data analysis retrospective outlining strategi…
MahdiaAhmadi Jul 27, 2025
790974b
doc: Update README.md to enhance project description and citation format
MahdiaAhmadi Jul 27, 2025
0622ea9
doc: Update CONTRIBUTING.md to add a dedicated thank you section for …
MahdiaAhmadi Jul 27, 2025
79dc14a
Merge pull request #35 from MIT-Emerging-Talent/Repo_arrangement
MahdiaAhmadi Jul 27, 2025
74ef5a0
feat: cleaning and analysis scripts seprated!
MahdiaAhmadi Jul 30, 2025
b98c319
Merge branch 'main' into new-feature-branch
MahdiaAhmadi Jul 30, 2025
3bec083
Merge pull request #37 from MIT-Emerging-Talent/new-feature-branch
MahdiaAhmadi Jul 30, 2025
e244e39
Update README.md
MahdiaAhmadi Jul 30, 2025
9a59f73
Restored previous version of technical and non technical analysis
ggmeklit Aug 6, 2025
93cc100
Restored previous version of technical and non technical analysis
ggmeklit Aug 6, 2025
56c9b89
Updated_readme_under_datasets
ggmeklit Aug 6, 2025
61c3b9f
Read_me_under_dataprep_update
ggmeklit Aug 6, 2025
0f4733b
rewrote research question in question format
ggmeklit Aug 6, 2025
8111219
technical report updated with Ahmed's version - with minor edit on th…
ggmeklit Aug 8, 2025
2ff58dd
Updated technical_report.md formatting
ggmeklit Aug 8, 2025
843095b
md formatting for technical report
ggmeklit Aug 8, 2025
086440c
Updated technical_report.md formatting
ggmeklit Aug 8, 2025
383e172
added a finding summary
ggmeklit Aug 9, 2025
d4f964d
findings_summary_file_added
ggmeklit Aug 10, 2025
e48b135
formatting
ggmeklit Aug 10, 2025
97fe921
md_formatting
ggmeklit Aug 10, 2025
6502427
minor_spacing
ggmeklit Aug 10, 2025
76495db
Merge pull request #39 from MIT-Emerging-Talent/technical_and_non_tec…
MahdiaAhmadi Aug 10, 2025
f690092
confusion matrix and robustness added
MUSABKAYMAK Aug 11, 2025
54885db
confusion matrix and robustness added
MUSABKAYMAK Aug 11, 2025
e6b010e
doc: communication strategy
MahdiaAhmadi Aug 11, 2025
de73021
doc: communication strategy markdown error fix
MahdiaAhmadi Aug 11, 2025
3e16839
feat: communication strategy
MahdiaAhmadi Aug 11, 2025
4344a73
Add files via upload
MahdiaAhmadi Aug 11, 2025
9503a39
Delete 5_communication_strategy/CDSP Pandas_Pact_Communication_Guidel…
MahdiaAhmadi Aug 11, 2025
5cfb0f8
communication strategy materials uploaded
MahdiaAhmadi Aug 11, 2025
7dbf58a
updated_communication_presentation
ggmeklit Aug 12, 2025
57eebd5
Merge branch 'confusion-matrix-and-robustness' of https://github.com/…
ggmeklit Aug 12, 2025
eef5f7c
updated_findings_summary
ggmeklit Aug 12, 2025
59d9cc5
adding_file_mapping
ggmeklit Aug 12, 2025
76bcf4e
added_mapping
ggmeklit Aug 12, 2025
1176780
md linting fixed on mapping file
ggmeklit Aug 12, 2025
4 changes: 4 additions & 0 deletions .markdownlint.yml
Original file line number Diff line number Diff line change
@@ -1,3 +1,7 @@
ignore:
- venv
- .github

MD013:
  line_length: 350

8 changes: 7 additions & 1 deletion .vscode/settings.json
@@ -122,5 +122,11 @@
"source.fixAll.ruff": "explicit",
"source.organizeImports.ruff": "explicit"
}
}
},
"cSpell.words": [
"Kerth",
"Nazario",
"NLTK",
"stopwords"
]
}
91 changes: 90 additions & 1 deletion 0_domain_study/README.md
@@ -1 +1,90 @@
# Domain Research
# 🛡️ Domain Study: Phishing and Linguistic Influence on User Behavior

Welcome to the `0_domain_study` folder! This section summarizes our team's research into phishing — specifically the linguistic features that affect user click-through behavior. Below you'll find a structured overview of our research domain, background, and actionable insights.

---

## 📌 Problem Statement (Based on Team's Personal Experiences)

**Research Question:**
_What type of linguistic features in phishing emails influence user click-through behavior?_

Phishing is a growing concern globally. Based on personal experiences:

- **Meklit** (Canada) frequently encounters smishing and phishing at work. One incident involved a fake OneDrive link flagged by IT — highlighting how rushed environments reduce our vigilance.
- **Mahdia** (Portugal) emphasized how scammers use sophisticated linguistic techniques to appear trustworthy and manipulate users.
- **Ahmad** often receives fake IT department emails at work and personal prize scams. He noticed a clear difference in language tone: urgent and professional at work vs emotional in personal life.
- **Semira** first became intrigued by fake malware after watching what her anti-virus software found. She built up her cybersecurity knowledge through research and a workshop, learning to spot psychological manipulation such as fake tax threats or UPS warnings. She now verifies and reports these attempts.
- **Musab** (USA) stressed the emotional toll of constant phishing attempts and how phishing poses both legal and financial risks.

Together, we observed that phishing strategies are becoming more **emotionally manipulative**, **context-aware**, and **linguistically advanced**, requiring in-depth study of their language patterns.

---

## 🧠 Our Understanding of the Problem Domain (Using Systems Thinking)

Phishing is a **socio-technical** problem involving three interconnected components:

1. **Phishers (Attackers):**
Skilled in social engineering. Use language to create urgency, trust, fear, or curiosity.

2. **Communication Channels:**
Mainly email, but also SMS (smishing) and voice (vishing). All channels aim to prompt the user into clicking or responding.

3. **Recipients (Targets):**
Everyday users or professionals. Often fall victim due to low awareness, poor digital hygiene, or stress.

We particularly focus on the **linguistic layer**—how language is engineered to bypass cognitive defenses and influence behavior.
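
As a rough illustration of what studying the linguistic layer could look like in code, the sketch below counts cue words for the four triggers named above (urgency, trust, fear, curiosity). The cue-word lists and the function name `tag_triggers` are hypothetical and chosen only for this example; they are not the team's lexicon or analysis code.

```python
import re

# Hypothetical cue words per trigger; a real lexicon would need validation.
TRIGGER_CUES = {
    "urgency": {"urgent", "immediately", "now", "expires"},
    "trust": {"official", "verified", "secure", "support"},
    "fear": {"suspended", "locked", "unauthorized", "penalty"},
    "curiosity": {"winner", "prize", "exclusive", "secret"},
}


def tag_triggers(text: str) -> dict[str, int]:
    """Count how many cue words from each trigger category appear in text."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return {
        trigger: sum(token in cues for token in tokens)
        for trigger, cues in TRIGGER_CUES.items()
    }


# Made-up example message, not taken from any dataset.
example = "URGENT: your account is suspended. Verify now via our secure link."
print(tag_triggers(example))
```

Exact word matching like this misses inflected forms (for example "verify" versus "verified"), so a fuller version would likely need stemming or lemmatization.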

---

## ❓ Research Question

> **"What type of linguistic features in phishing emails influence user click-through behavior?"**

We later revised the research question above to:

>**How do phishing emails differ from legitimate emails in terms of common linguistic patterns and language tactics?**

Given email’s dominant role in phishing, and the centrality of language in deceiving recipients, this research question aims to uncover patterns in wording, tone, and psychological triggers. The revised question accounts for the lack of data on user click-through behavior while still uncovering linguistic patterns
in phishing and legitimate emails.
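
One simple way to approach the revised question is to rank words by how much more often they appear in phishing emails than in legitimate ones. The sketch below is only an illustration under stated assumptions: the function names, the add-one smoothing, and the toy emails are all invented for this example and do not come from the project's datasets or scripts.

```python
import re
from collections import Counter


def word_counts(emails: list[str]) -> Counter:
    """Count lowercase word tokens across a list of email bodies."""
    counts = Counter()
    for body in emails:
        counts.update(re.findall(r"[a-z']+", body.lower()))
    return counts


def distinctive_words(phishing, legitimate, top_n=3):
    """Rank words by how much more frequent they are in phishing emails.

    Add-one smoothing keeps words that are missing from one class
    from causing a division by zero.
    """
    p_counts, l_counts = word_counts(phishing), word_counts(legitimate)
    vocab = set(p_counts) | set(l_counts)
    p_total = sum(p_counts.values()) + len(vocab)
    l_total = sum(l_counts.values()) + len(vocab)
    ratio = {
        w: ((p_counts[w] + 1) / p_total) / ((l_counts[w] + 1) / l_total)
        for w in vocab
    }
    # Highest ratio first; ties are broken alphabetically for stable output.
    return sorted(ratio, key=lambda w: (-ratio[w], w))[:top_n]


# Toy, made-up emails purely for illustration.
phishing = ["Verify your account now", "Your account is locked, verify now"]
legitimate = ["Meeting notes attached", "See you at the meeting tomorrow"]
print(distinctive_words(phishing, legitimate))
```

On real data, stopword removal and a minimum-frequency cutoff would probably be needed before the ranking says anything meaningful.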

## 📚 Background Review of the Domain

### 1. **Human Psychology and Language Triggers**

- **Emotions** like fear, urgency, or reward are widely used in phishing (Jakobsson & Myers, 2006).
- **Users acting under pressure** are less likely to evaluate messages critically (Vishwanath et al., 2011).

### 2. **Phishing Detection Tools**

- Tools like email filters, browser warnings, and ML-based classifiers can detect known phishing messages (Bergholz et al., 2010).
- However, attackers adapt quickly with new linguistic patterns to bypass these systems.
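
To make "ML-based classifier" concrete, here is a minimal sketch of a multinomial naive Bayes text classifier written with the standard library only. It is not the method used by Bergholz et al. or by any specific filter; the class name, the training sentences, and the labels are assumptions made up for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Split text into lowercase word tokens."""
    return re.findall(r"[a-z']+", text.lower())


class NaiveBayes:
    """Tiny multinomial naive Bayes with add-one smoothing."""

    def fit(self, texts, labels):
        self.word_counts = {label: Counter() for label in set(labels)}
        self.doc_counts = Counter(labels)
        for text, label in zip(texts, labels):
            self.word_counts[label].update(tokenize(text))
        self.vocab = set().union(*self.word_counts.values())
        return self

    def predict(self, text):
        total_docs = sum(self.doc_counts.values())
        scores = {}
        for label, counts in self.word_counts.items():
            denom = sum(counts.values()) + len(self.vocab)
            # Start from the log prior, then add log likelihoods per token.
            score = math.log(self.doc_counts[label] / total_docs)
            for token in tokenize(text):
                if token in self.vocab:  # words never seen in training are skipped
                    score += math.log((counts[token] + 1) / denom)
            scores[label] = score
        return max(scores, key=scores.get)


# Made-up training examples, far too few for real use.
texts = [
    "verify your account now",
    "urgent: password expires today",
    "lunch at noon tomorrow",
    "project report attached",
]
labels = ["phishing", "phishing", "legitimate", "legitimate"]
model = NaiveBayes().fit(texts, labels)
print(model.predict("please verify your password"))
```

Because words unseen in training are skipped, a message written entirely in new vocabulary is scored on class priors alone, which is one way to see why new wording can slip past a classifier trained on known phishing messages.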

### 3. **User Education**

- Training and awareness programs can be effective, but their success varies.
- **Interactive and ongoing training** is more impactful than one-off sessions (Jansson & Von Solms, 2013).

### 4. **Evolving Threat Landscape**

- **Spear phishing** and **smishing** are on the rise (Hong, 2012).
- Smartphones and social platforms open new vectors.
- Despite evolution, **email remains the most common attack method** (CISA, 2023).

Click here: [Full Background Review](https://docs.google.com/document/d/1at2nE_Ladr2_HlNFqoaHtACwAhOVvcFE6qYVRcrerbg/edit?tab=t.0)

### Conclusion

Phishing success stems largely from **manipulating language to trigger impulsive reactions**. Understanding this manipulation can help in detection and prevention.

---

## 📂 Resources & References

- **Bergholz et al. (2010)** – Email filtering via ML
- **Hong (2012)** – Evolution of phishing
- **Jakobsson & Myers (2006)** – Psychological manipulation in phishing
- **Jansson & Von Solms (2013)** – Phishing education effectiveness
- **Vishwanath et al. (2011)** – User susceptibility factors
- **CISA (2023)** – Counter-Phishing Recommendations for Federal Agencies
93 changes: 93 additions & 0 deletions 0_domain_study/retrospective.md
@@ -0,0 +1,93 @@
# Domain Study Retrospective

## Stop Doing

- Relying solely on personal experiences without broader research validation
- Working in isolation when researching complex technical concepts
- Postponing documentation until research is "complete"

## Continue Doing

- Building from team members' personal experiences with phishing
- Using systems thinking to understand the multi-faceted nature of phishing
- Collaborative research approach with diverse cultural perspectives
- Regular refinement of research questions based on data availability

## Start Doing

- Earlier validation of research feasibility with available datasets
- More structured literature review process
- Creating shared knowledge base for domain concepts
- Setting clearer milestones for domain research completion

## Lessons Learned

1. **Personal experiences are valuable starting points** - Our team's diverse encounters with phishing across different countries provided rich initial insights
2. **Research questions evolve** - We learned to adapt our focus from "user click-through behavior" to "linguistic patterns" based on data availability
3. **Domain complexity requires structured approach** - Phishing operates as a socio-technical system requiring interdisciplinary understanding
4. **Cultural diversity enhances understanding** - Different team members' geographic experiences revealed various phishing tactics and contexts

---

## Strategy vs. Board

### What parts of your plan went as expected?

- Successfully collected diverse personal experiences from team members across different countries
- Developed a comprehensive understanding of phishing as a socio-technical problem
- Created a solid foundation for understanding linguistic manipulation tactics
- Established clear problem boundaries and scope

### What parts of your plan did not work out?

- Initial research question was too ambitious given available data constraints
- Underestimated the time needed for thorough domain research
- Limited access to current phishing campaign data for contemporary analysis

### Did you need to add things that weren't in your strategy?

- Literature review of existing phishing detection research
- Technical feasibility assessment for linguistic analysis approaches
- Data availability research to inform research question refinement
- Systems thinking framework to understand phishing ecosystem

### Or remove extra steps?

- Removed user behavior survey component due to resource constraints
- Simplified focus from multi-channel phishing to email-specific analysis
- Reduced scope from real-time detection to pattern identification

---

## Individual Retrospectives

### Meklit

Contributed workplace phishing experiences from Canada, particularly around smishing and sophisticated fake OneDrive attacks. Learned about the importance of context in phishing detection and how work environments affect user vigilance.

### Mahdia

Provided insights from Portugal's phishing landscape and emphasized sophisticated linguistic manipulation techniques. Developed understanding of trust-building language patterns used by scammers.

### Ahmad

Shared experiences with workplace IT phishing and personal prize scams, highlighting the difference in linguistic tones across contexts. Contributed to understanding of professional vs. personal phishing approaches.

### Semira

Brought cybersecurity workshop knowledge and experience with fake tax/delivery scams. Developed expertise in psychological manipulation tactics and verification processes.

### Musab

Emphasized the emotional and legal aspects of phishing from a US perspective. Contributed to understanding the broader impact beyond just technical detection.

---

## Impact on Next Milestones

This domain study established a solid foundation for:

- Data collection strategy focusing on email-based phishing
- Feature engineering approach for linguistic analysis
- Understanding of which psychological manipulation tactics to detect
- Framework for interpreting results in broader phishing context